- [Darshan on PRP](#darshan-on-prp)
- [Darshan to observe an ML application](#darshan-to-observe-an-ml-application)
- [I/O Behavior of Flight Anomaly Detection](#io-behavior-of-flight-anomaly-detection)
- [Replicating Chowdhury et al IO Evaluation of BeeGFS for Deep Learning](#replicating-chowdhury-et-al-io-evaluation-of-beegfs-for-deep-learning)
- [Darshan on NAS](#darshan-on-nas-1)
- [Pseudo pipeline to observe with Darshan](#pseudo-pipeline-to-observe-with-darshan)
- [Fire detection setup](#fire-detection-setup)
- [Validating Darshan outputs](#validating-darshan-outputs)
- [Flood detection profiling](#flood-detection-profiling)
- [Darshan incomplete logs and segfault](#darshan-incomplete-logs-and-segfault)

<!-- vim-markdown-toc -->
###### 06/01

## Replicating Chowdhury et al IO Evaluation of BeeGFS for Deep Learning

I know it feels a bit late in the process to do this -- but it's *really* about time that I actually do some real research and consult more papers.

In an effort to follow https://dl.acm.org/doi/pdf/10.1145/3337821.3337902 (an I/O evaluation of BeeGFS for deep learning), I am trying to set up similar conditions on the PRP and run the same experiments.

> ###### 4af8366e93fbc8d6632858446a3d247f7204ee25
> "Fix bash alias" | HEAD -> main | 2021-06-01
## Darshan on NAS

Again. Let's figure this out.

```bash
wget -O- ...3.3.0 | tar zxf -
cd darshan-3.3.0/darshan-runtime
module load mpi-hpe
./configure --with-log-path=$HOME/darshan-logs --with-jobid-env=PBS_JOBID --prefix=$HOME/usr/local
make
make install
mkdir -p ~/darshan-logs
chmod +x darshan-mk-log-dirs.pl
./darshan-mk-log-dirs.pl
```

So far that worked without any issue. We can install `darshan-util` the same way.
```bash
cd ../darshan-util
./configure --prefix=$HOME/usr/local
make
make install
```
That worked too. Time to launch a compute node and test whether I can get it to monitor without crashing.

I *should* be able to use
```bash
env DARSHAN_ENABLE_NONMPI= LD_PRELOAD=$HOME/usr/local/lib/libdarshan.so <my_command>
```

At first I got:
```
/bin/sh: error while loading shared libraries: libmpi.so: cannot open shared object file: No such file or directory
```
but that's just complaining that I didn't run `module load mpi-hpe` first.

Then I ran it again. I was initially scared: the test script I wrote just reads and writes 4k random bytes a few times (very quick), but when I ran it with Darshan nothing showed up -- I assumed it had crashed again.

It **hadn't**. Rather, it just took ages to do each operation -- why? I thought Darshan had low overhead?

Ah... It generated a separate summary for every individual operation -- so in this case I now have 5 files in the log directory: one for the `sh` invocation (the script itself), three for `dd`, and one for `ls`. Interesting.

Good news: `darshan-parser` works on the files! So, ultimately, Darshan is working on NAS!
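For reference, this is roughly how I'm reading the logs back (the filename pattern below is a placeholder; by default the logs land in `year/month/day` subdirectories under the configured log path):

```bash
# Placeholder name -- real log names also encode the job id and a timestamp.
~/usr/local/bin/darshan-parser ~/darshan-logs/2021/6/1/paddison_dd_*.darshan | less
```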
The run took 75 seconds total, when it typically takes only a fraction of a second. Fortunately, each operation is still logged as taking only a fraction of a second, so Darshan didn't distort the operations themselves -- the extra time is purely instrumentation overhead around them.
## Pseudo pipeline to observe with Darshan

While we wait for real NASA data-intensive ML apps to become available, we can run Darshan on other ML models or on artificial pipelines made to mimic the app (e.g. using IOR).
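For context, an IOR run under Darshan looks roughly like the sketch below; the process count, transfer/block sizes, and test-file path are illustrative, not my exact parameters:

```bash
# Illustrative only: 4 MPI ranks, 1 MiB transfers into a 64 MiB block per rank,
# writing then reading back, with Darshan preloaded into each rank.
module load mpi-hpe
mpiexec -n 4 env LD_PRELOAD=$HOME/usr/local/lib/libdarshan.so \
    ior -w -r -t 1m -b 64m -o /nobackup/paddison/ior_testfile
```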
~~When using IOR, Darshan creates a separate log for every run -- one per operation per iteration. And mpiexec gets a log too. I think in order to see the whole thing I need to read all logs with the same ID.~~

There is a separate log for every MPI process invocation of IOR. There are also, somehow, three mpiexec logs.
###### 06/21

## Fire detection setup
The GOES data is on NAS. The MTBS data can be downloaded online.

A conda environment can be made with all of the requirements. Start by making an empty environment with `conda create -n geo`, then `conda install -c conda-forge gdal`, then `cartopy`, `xarray`, and `rioxarray` the same way.

- https://gdal.org/download.html#conda
- https://scitools.org.uk/cartopy/docs/latest/installing.html#conda-pre-built-binaries

The Jupyter notebook provided to me doesn't need to import `gdal` (it's only used as a command-line utility), nor the `from utils` import (it isn't used in the file).

All other imports in the notebook work except for `geopandas`, which hasn't been installed yet. There is a conflict when I try to install `geopandas`... let's wait a long time and find out why. I need geopandas just to import the MTBS shapefiles. Maybe I should have installed geopandas before cartopy and the rest...

I made a new environment with `conda create -n ... gdal geopandas cartopy` (I did not specify the Python version; I let conda figure that out) and it made one with Python 3.9.5. Then I installed `xarray`, then `-c conda-forge rioxarray`.

Thought that worked... but then importing anything which imports numpy results in
```
  File "/nobackup/paddison/.conda/envs/geofire/lib/python3.9/site-packages/numpy/__init__.py", line 148, in <module>
    from . import lib
  File "/nobackup/paddison/.conda/envs/geofire/lib/python3.9/site-packages/numpy/lib/__init__.py", line 44, in <module>
    __all__ += type_check.__all__
NameError: name 'type_check' is not defined
```
Fixed it by explicitly installing `numpy` as well.
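For future reference, a condensed sketch of the sequence that ended up working (`geofire` is the environment name from the traceback above; using conda-forge on the `create` step is my assumption, following the gdal/cartopy docs linked earlier):

```bash
# Resolve the packages that conflict most often in a single create step.
conda create -n geofire -c conda-forge gdal geopandas cartopy numpy
conda activate geofire

# Then the remaining pieces.
conda install -c conda-forge xarray rioxarray
```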
###### 06/22

I added some multiprocessing to the code. Good.
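Roughly speaking, the change fans the per-file preprocessing out over a process pool -- the sketch below is a simplified stand-in, where `preprocess` and the file glob are placeholders rather than the real pipeline:

```python
from multiprocessing import Pool
from pathlib import Path

def preprocess(path: Path) -> str:
    # Placeholder for the real per-file work (reproject, crop, write output, ...).
    return f"processed {path.name}"

if __name__ == "__main__":
    files = sorted(Path("data").glob("*.nc"))
    # One worker per core by default; each file is handled in a separate process.
    with Pool() as pool:
        for result in pool.imap_unordered(preprocess, files):
            print(result)
```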
When trying to profile it with Darshan, though, I ran into a symbol error for `llapi_layout_get_by_xattr`. Looks like https://github.com/darshan-hpc/darshan/issues/399. Oddly, I never ran into this issue on NAS before -- oh well. A fix has been merged, so I'm installing the latest Darshan from the repo to try to get things working.

It seemed to work on a test Python invocation. Let's run the preprocessing now!

Damn. I interrupted the program, but I think in doing so I kept it from terminating cleanly. Let me run it again on just the first handful of samples :/

Sheesh. It didn't work that time either. Darshan is giving me an 'objdump' log rather than a python log... probably because the program didn't fully terminate properly. I'm trying one more time, just going through the data for a single specified date. I'm going to hit the walltime before this run completes, though! Noooooooooooo, lol

Hmmmm... this time the Python ran through all of the specified data just fine, and Darshan produced both an objdump log and a python log -- but the Python log is "incomplete" when viewed with darshan-parser. That's fine, just use `--show-incomplete`. It *looks* like it worked! Later today I'll load up the results in Python and plot them.

###### 06/23
## Validating Darshan outputs

When I looked at the timings in the Darshan logs using pydarshan, it appeared that I/O contributed less than a percent of the runtime. The multiprocessing wasn't actually working correctly, so the compute time should eventually be substantially quicker, but it still seemed odd that I/O took so little time when the files are quite large and plentiful.
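For reference, the inspection is along these lines -- a minimal sketch, assuming a pydarshan version where `DarshanReport` exposes `modules` and per-module `records`; the log filename is a placeholder:

```python
import darshan

# Load everything up front so module records are available.
report = darshan.DarshanReport("example.darshan", read_all=True)

print(report.modules.keys())          # which instrumentation modules captured data
print(len(report.records["POSIX"]))   # number of POSIX file records in the log
```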
To check that Darshan is showing me the right things, I wrote a series of Python scripts that perform different read and write access patterns, such as reading a large file in 1 GB chunks and writing it to a new file, or writing random bytes.
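One of them looks roughly like this -- a simplified sketch rather than the exact script, with placeholder file names and sizes:

```python
import os

CHUNK = 1024 ** 3  # 1 GB

# Pattern 1: stream an existing large file into a copy, 1 GB at a time.
with open("big_input.bin", "rb") as src, open("copy.bin", "wb") as dst:
    while True:
        chunk = src.read(CHUNK)
        if not chunk:
            break
        dst.write(chunk)

# Pattern 2: write random bytes (os.urandom turns out to dominate the runtime).
with open("random.bin", "wb") as out:
    for _ in range(4):
        out.write(os.urandom(256 * 1024 ** 2))  # 4 x 256 MB
```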
I used cProfile to compare the timings seen by Python natively with those seen by Darshan:
```bash
python -m cProfile -o out.profile script.py
```

```python
import pstats

p = pstats.Stats('out.profile')
p.print_stats()
```
They matched up just fine. In a first test I was skeptical when Darshan showed my file-writing test as being over 50% computation, but using cProfile I figured out that my `os.urandom` generation was indeed taking about half the time!
###### 06/29

## Flood detection profiling

Over the last week I've been working with a [DELTA](https://github.com/nasa/delta) training pipeline, which will soon get data and a model specification to train a flood detection algorithm. I was able to make a `geoflood` conda environment on top of `geofire`.

(Side note: I realized that we have *flight, fire* and *flood* algorithms. That alliteration is kinda cool!)
### Darshan incomplete logs and segfault

Running the Landsat-8 example training and monitoring it with Darshan, all seemed well. However, the logs showed up as incomplete, and attempting to analyze them with pydarshan resulted in a segfault and core dump...
```
# *ERROR*: The POSIX module contains incomplete data!
# This happens when a module runs out of
# memory to store new record data.

# To avoid this error, consult the darshan-runtime
# documentation and consider setting the
# DARSHAN_EXCLUDE_DIRS environment variable to prevent
# Darshan from instrumenting unecessary files.

# You can display the (incomplete) data that is
# present in this log using the --show-incomplete
# option to darshan-parser.
```
I've got to figure this out before moving forwards.
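Per the message, the first things to try are excluding the (enormous) Python environment from instrumentation and giving the Darshan modules more memory. The path, size, and training command below are my guesses and placeholders, not something I've verified yet:

```bash
# Skip instrumentation of the conda environment's thousands of module files.
export DARSHAN_EXCLUDE_DIRS=/nobackup/paddison/.conda/envs/geoflood

# Give Darshan modules more room for records (value in MiB).
export DARSHAN_MODMEM=1024

# Placeholder training command.
env DARSHAN_ENABLE_NONMPI= LD_PRELOAD=$HOME/usr/local/lib/libdarshan.so python train.py
```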
> ###### 8bda8bc359c393ecad71e92535172a6cfa1d2c66
> "Add profiling work" | HEAD -> main | 2021-07-01