- [Understanding performance](#understanding-performance)
- [Darshan on PRP](#darshan-on-prp)
- [Darshan to observe an ML application](#darshan-to-observe-an-ml-application)
- [I/O Behavior of Flight Anomaly Detection](#io-behavior-of-flight-anomaly-detection)

<!-- vim-markdown-toc -->
Turns out, Darshan *was* working, but there are a few things to consider:

- The environment variable `DARSHAN_ENABLE_NONMPI` needs to be set (it can be empty)
- I think UTC is used for the log dates, so sometimes you need to look under the next day's log directory

```bash
env DARSHAN_ENABLE_NONMPI= LD_PRELOAD=/usr/local/lib/libdarshan.so python script.py
```
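On the UTC point, a quick way to figure out which day's directory to look under. The log root path and the `<year>/<month>/<day>` layout here are assumptions about how darshan-runtime was configured in this image, so adjust as needed:

```bash
# Log root is a placeholder -- use whatever --with-log-path was set to when
# darshan-runtime was built
LOGROOT=/darshan-logs

# Per-day directories seem to be dated in UTC; %-m/%-d drops zero padding
# (GNU date) -- tweak if your layout is padded or organized differently
ls "$LOGROOT/$(date -u +%Y/%-m/%-d)"
```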
> ###### ddb3cc2587ee8151dd88fa956d9d00a37c81a64f
> "Prompt image build" | HEAD -> main | 2021-05-24

###### 05/25
Alright, I've figured out the images: I have a deployment with PyTorch and Darshan running, and I've copied over the flight anomaly code and data. Let's run it once to make sure it does indeed run.

```bash
python main_CCLP.py -e 1 -v 1
```

Ha! It does!
Okie dokes, now time to try monitoring it with Darshan.

```bash
env DARSHAN_ENABLE_NONMPI= LD_PRELOAD=/usr/local/lib/libdarshan.so python main_CCLP.py -e 1
```

Sweet, now to examine the Darshan logs.
### I/O Behavior of Flight Anomaly Detection

I can create a human-readable text dump of the log with `darshan-parser`, but I should also have PyDarshan installed in this image, so let's try to use it! Hmm, `import darshan` complained. When installing darshan-util, I should run `./configure` with `--enable-pydarshan --enable-shared`.
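As an aside, the `darshan-parser` route is just a couple of darshan-util commands (the log filename below is a placeholder):

```bash
# Log filename is a placeholder -- grab the real one from the log directory
darshan-parser my_run.darshan > my_run_dump.txt

# Produces the standard Darshan PDF report (needs gnuplot and a LaTeX
# toolchain in the image, if I remember right)
darshan-job-summary.pl my_run.darshan
```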
Then I can read in a report in a Python shell as follows, telling it to read in all records (POSIX, MPI-IO, STDIO, etc.):

```python
import darshan

# Open the log without eagerly parsing every record...
report = darshan.DarshanReport('filename', read_all=False)

# ...then pull in the records from every captured module
report.read_all_generic_records()
```
Within the report there are multiple records. We can see what records we have with `report.info()`, access them through `report.records['API']`, and then run things like `record.info(plot=True)`.
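Spelled out, that access pattern looks roughly like this -- the `'POSIX'` key and whether `info()` hangs off the collection or an individual record may vary by PyDarshan version, so treat it as a paraphrase of the notes above rather than gospel:

```python
# See which modules/records the log actually captured
report.info()

# 'POSIX' is assumed to be one of the module keys listed by info()
posix_records = report.records['POSIX']

# As described above; plot=True is what drags in the IPython display dependency
posix_records.info(plot=True)
```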
However, this relies on an implied IPython environment, since it uses `display`. I'll try installing jupyter into this image. Seems to be working like a charm!

```bash
pip install jupyter # Pod
jupyter notebook

k port-forward podname 8888:8888 # Local
```

Oh my, I'm always reminded just how much I *absolutely love* working in Jupyter notebooks :')

Great! This is awesome. I have the data and can play around with it.
Okay, now to do a run on multiple epochs. Also it's worth noting this is just the *training* process we're monitoring -- the preprocessing stage is entirely separate.

I'd be really interested in seeing **how much of the total runtime was spent waiting for I/O**.
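I haven't found a ready-made aggregator for that yet, so here's a rough sketch of my own: sum the POSIX timing counters out of a `darshan-parser` text dump and divide by the wall time. The column layout and counter names are taken from the Darshan docs, so double-check them against an actual dump; the wall time would come from the `run time` line in the dump header.

```python
# Rough estimate of the fraction of wall time spent in I/O for a
# single-process run, from a darshan-parser text dump.
# Assumes the usual darshan-parser record layout:
#   <module> <rank> <record id> <counter> <value> <file> <mount pt> <fs type>
import sys

IO_TIME_COUNTERS = {"POSIX_F_READ_TIME", "POSIX_F_WRITE_TIME", "POSIX_F_META_TIME"}

def io_fraction(dump_path, wall_time_seconds):
    io_seconds = 0.0
    with open(dump_path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue  # skip header/comment lines
            parts = line.split()
            if len(parts) >= 5 and parts[3] in IO_TIME_COUNTERS:
                io_seconds += float(parts[4])
    return io_seconds / wall_time_seconds

if __name__ == "__main__":
    # e.g. python io_fraction.py my_run_dump.txt 142.7
    print(f"{io_fraction(sys.argv[1], float(sys.argv[2])):.1%} of runtime in I/O")
```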
Looks like I'll want to use their experimental aggregators: https://www.mcs.anl.gov/research/projects/darshan/docs/pydarshan/api/pydarshan/darshan.experimental.aggregators.html. They don't return plots (so actually I guess we don't need jupyter, but I'm still going to use it), so we'll want to write some plotting code to visualize.

```python
darshan.enable_experimental(True)

# IO Size Histogram, given the API ('module')
report.mod_agg_iohist('POSIX')

# Cumulative number of operations
report.agg_ioops()
```
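For the access-size part, a minimal matplotlib sketch like the following should do -- assuming `mod_agg_iohist('POSIX')` hands back a flat dict-like mapping of size-bin labels to counts (I haven't pinned down the exact return shape, so massage it into that form first if it's nested):

```python
import matplotlib.pyplot as plt

# Assumption: a flat mapping of access-size bin label -> operation count
hist = report.mod_agg_iohist('POSIX')

labels = list(hist.keys())
counts = [hist[label] for label in labels]

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(range(len(labels)), counts)
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=45, ha='right')
ax.set_ylabel('Number of operations')
ax.set_title('POSIX access sizes')
fig.tight_layout()
fig.savefig('posix_access_sizes.png')  # or just let Jupyter render it inline
```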
It seems like I can basically call them all using `.summarize()`, then access the results with `.summary`:

```python
report.summarize()
report.summary
```
Plotting a hist/bar of access sizes is easy enough. How about the timeline? Here are the plots I want to replicate: https://www.mcs.anl.gov/research/projects/darshan/docs/ssnyder_ior-hdf5_id3655016_9-23-29011-12333993518351519212_1.darshan.pdf
> ###### 241c061caced2a176cd50628d9c4954a267152a8
> "Fix PyDarshan installation" | HEAD -> main | 2021-05-25