|
|
|
* [Debugging in interactive mode](#debugging-in-interactive-mode)
|
|
|
|
* [Automated parameter sweeps](#automated-parameter-sweeps)
|
|
|
|
* [Parsing and graphing IOR outputs](#parsing-and-graphing-ior-outputs)
|
|
|
|
* [Some useful parameter sweeps](#some-useful-parameter-sweeps)
|
|
|
|
* [Values for initial parameter sweeps](#values-for-initial-parameter-sweeps)
|
|
|
|
* [Side note: Darshan](#side-note-darshan)
|
|
|
|
* [FIO on NAS](#fio-on-nas)
|
|
|
|
* [Darshan on NAS](#darshan-on-nas)
|
|
|
|
* [What MPI module to use](#what-mpi-module-to-use)
|
|
|
|
* [Conducting initial parameter sweeps](#conducting-initial-parameter-sweeps)
|
|
|
|
|
|
|
|
<!-- vim-markdown-toc -->
|
|
|
|
|
Now that I'm using the JSON output from IOR, everything is much more straightforward.
|
|
|
|
|
|
|
Great news: after some tweaking/bug squashing, the parameter sweep job is working like a charm. Furthermore, the output file could immediately be parsed and visualized with the functions I'd written!
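
For reference -- these aren't the actual functions from the repo, just a minimal sketch of the parsing step -- the idea boils down to loading the JSON and flattening its summary entries. This assumes IOR was run with `-O summaryFormat=JSON -O summaryFile=...` and that the summary entries carry fields like `operation`, `transferSize`, and `bwMeanMIB`; adjust the key names to whatever your IOR build actually emits.

```python
import json

import pandas as pd


def load_ior_summary(path):
    """Flatten the 'summary' section of an IOR JSON output file into a DataFrame."""
    with open(path) as f:
        data = json.load(f)
    # Each summary entry describes one operation (write/read) along with the
    # parameters used and aggregate bandwidth statistics for that operation.
    return pd.DataFrame(data["summary"])


# e.g. mean write bandwidth (MiB/s) against transfer size for a sweep:
# df = load_ior_summary("results/transfer_sweep.json")
# writes = df[df["operation"] == "write"]
# print(writes[["transferSize", "bwMeanMIB"]])
```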
|
|
|
|
|
|
|
|
## Some useful parameter sweeps
|
|
|
|
## Values for initial parameter sweeps
|
|
|
|
|
|
|
|
The current steps remaining are:
|
|
|
|
- Come up with some good parameter values to test
|
To come up with the initial values, though, I've been drawing inspiration mostly from the following papers:
|
|
|
- Mostly sequential IO (rather than random)
|
|
|
|
- Mostly writes -- really? I would have assumed that most scientific projects are more write-once, read-many...
|
|
|
|
- Transfer size varies a lot -- **"1KB to tens of MB"**
|
|
|
|
|
|
|
|
- Typical IO patterns: **one processor**, **one file per processor** (both POSIX), **MPI-IO single shared file**
|
|
|
|
- File per process can lead to *lots* of files being written, especially if there are restarts. This doesn't scale well in terms of data management!
|
|
|
|
- Small transactions and random accesses lead to poor performance... but lots of poorly designed applications do this
|
|
|
|
- Important IOR parameters:
|
|
|
|
- **API** -- POSIX, MPI-IO, HDF5, or NetCDF
|
|
|
|
- **ReadFile/WriteFile** -- whether to measure read/write operations
|
|
|
|
- **SegmentCount** (s) -- number of 'datasets' in the file
|
|
|
|
- Each dataset is composed of **NumTasks** (N) blocks of **BlockSize** (b), read/written by the processor in chunks of **TransferSize** (t)
|
|
|
|
- To avoid caching, filesize per processor (= BlockSize) should be large enough to exhaust the memory buffers on each node. **BlockSize was swept from 16MB to 8GB** to see where caching effects (for read performance) were mitigated. In their words: "where the derivative of the performance was asymptotically zero"
|
|
|
|
- ~~Curious, can't IOR's reorder option mitigate caching? We should test a block size sweep with and without reordering.~~ *This would only apply for tests on more than one node -- we're doing this so that we can trust the rest of the tests which only involve a single node.*
|
|
|
|
- **For this test, only one node was used and TransferSize was fixed at 2MB with one segment.**
|
|
|
|
- **TransferSize was swept from 1KiB to 256MiB** (in powers of 4, starting from 1KiB) to get a sense of whether the system is optimized for larger transfer sizes and how much per-transfer overhead there is. (A rough sweep-generation sketch follows this list.)
|
|
|
|
- Using the ideal parameters seen above, file-per-process versus shared file were both evaluated as **NumTasks was swept from 8 to 256/1024** (depending on how many nodes were available on each system)
|
|
|
|
- On their systems, read and write performance were very similar.
|
|
|
|
- The theoretical peak IO bandwidth of each system was calculated/known beforehand... for the Lustre system it was calculated as the number of DDN couplets times the bandwidth of each couplet
|
|
|
|
- [ ] **What is the theoretical peak IO bandwidth on Pleiades?**
|
|
|
|
- It's important to compare systems on the basis of relative performance (e.g., fraction of theoretical peak) rather than raw performance, due to differences in scale
|
|
|
|
- The paper also explains the physical topology of the systems it tested -- while looking into that, I stumbled upon [ANL's CODES project](https://wordpress.cels.anl.gov/codes/) for simulating the impact of different topologies... outside the scope of this project, but perhaps worth ==NOTE==ing
|
|
|
|
- [ ] **We should see what performance a single node is capable of -- this'll let us measure speedup (fixed work per processor, as is default with IOR) and maybe also scaleup (if we adjust parameters to fix aggregate work done)**
|
|
|
|
- Truthfully, a speedup chart would be more effective at comparing different systems than a shared plot of raw performance!
|
|
|
|
- [I/O Performance on Cray XC30](https://www.nersc.gov/assets/pubs_presos/Edison-IO-CUG2014-paper-Zhao-4.pdf) by Zhengji Zhao, Doug Petesch, David Knaak, and Tina Declerck
|
|
|
|
- Slides: https://www.nersc.gov/assets/pubs_presos/Edison-IO-CUG2014-presentation.pdf
|
|
|
|
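To make those sweep values concrete, here's a rough sketch (not this repo's actual sweep code) of how the block-size and transfer-size ranges above could translate into IOR invocations. The `ior` path, task count, the fixed block size for the transfer sweep, and the JSON output options are placeholder assumptions.

```python
KiB, MiB, GiB = 2**10, 2**20, 2**30

IOR = "~/benchmarks/ior/ior-3.3.0/src/ior"  # adjust to wherever IOR was built

# BlockSize sweep: 16 MiB ... 8 GiB per task (doubling each step),
# with TransferSize fixed at 2 MiB and a single segment.
block_sizes = [16 * MiB * 2**i for i in range(10)]
# TransferSize sweep: 1 KiB ... 256 MiB in powers of 4.
transfer_sizes = [KiB * 4**i for i in range(10)]


def ior_cmd(block, transfer, ntasks=8, segments=1, file_per_proc=False, out="ior.json"):
    # Per-task data = segments * block bytes, issued in block/transfer chunks;
    # aggregate size = ntasks * segments * block when writing a shared file.
    fpp = "-F " if file_per_proc else ""
    return (f"mpiexec -n {ntasks} {IOR} -a POSIX -w -r {fpp}"
            f"-s {segments} -b {block} -t {transfer} "
            f"-O summaryFormat=JSON -O summaryFile={out}")


if __name__ == "__main__":
    for b in block_sizes:
        print(ior_cmd(b, transfer=2 * MiB, out=f"block_{b}.json"))
    # Placeholder: whatever block size the sweep above shows defeats caching.
    block_from_sweep = 8 * GiB
    for t in transfer_sizes:
        print(ior_cmd(block_from_sweep, transfer=t, out=f"transfer_{t}.json"))
```

Inside a PBS job these commands could be executed with `subprocess` (as the current approach of running commands from within Python does) or written out to a shell script.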
|
|
|
|
|
|
|
|
### Side note: Darshan
|
|
|
|
|
The minimal job in [`readwrite.fio`](/fio/readwrite.fio) can be run with

```
path/to/fio path/to/readwrite.fio
```
|
|
|
|
|
|
|
|
|
|
|
|
Hmmmm, I came across [Gordon: design, performance, and experiences deploying and supporting a data intensive supercomputer](https://dl.acm.org/doi/pdf/10.1145/2335755.2335789) by Shawn Strande, Pietro Cicotti, et al. (and it's out of SDSC - it's a small world after all ;) ). But I also think I came across the reason why I'm not seeing HPC papers that use FIO: I'm not so sure that FIO can do single-shared-file workloads, https://github.com/axboe/fio/issues/631. So it might be really easy to set up a job script and get baseline readings for your filesystems, but not when there are multiple nodes involved.
|
|
|
|
|
|
|
|
> ###### [b2beb35](ssh///https://gitlab-ssh.nautilus.optiputer.net/30622/parkeraddison/filesystem-benchmarks/commit/b2beb35c3771b5251442b3cdd48801f1d2c57e4c)
|
|
|
|
> "FIO works on NAS" | HEAD -> main | 2021-04-09
|
|
|
|
|
|
|
|
###### 04/10
|
|
|
|
|
|
|
|
To view some documentation PDFs and to prepare for viewing plots generated by Darshan, I went through the (really easy!) process of setting up a VNC server/connection to a graphical interface. Following https://www.nas.nasa.gov/hecc/support/kb/vnc-a-faster-alternative-to-x11_257.html was straightforward, and boiled down to:
|
|
|
|
```
|
|
|
|
# On pfe
|
|
|
|
vncserver -localhost
|
|
|
|
# > "New desktop is at pfe:XX"
|
|
|
|
~C
|
|
|
|
-L 5900:localhost:59XX
|
|
|
|
# Connect to localhost:5900 with local VNC client
|
|
|
|
vncserver -kill :XX
|
|
|
|
```
|
|
|
|
|
|
|
|
###### 04/11
|
|
|
|
|
|
|
|
## Darshan on NAS
|
|
|
|
|
|
|
|
- [ ] TODO
|
|
|
|
**I should be using `mpi-sgi/mpt` (or `mpi-hpe`?)** rather than mpi-hpcx. This includes `mpicc`.
|
|
|
|
|
|
|
|
|
|
|
|
Trying to set up Darshan has proven a challenge! But, here's what I've come up with so far, trying to follow https://www.mcs.anl.gov/research/projects/darshan/docs/darshan-runtime.html:
|
|
|
|
1. Download and untar `wget -O- ftp://ftp.mcs.anl.gov/pub/darshan/releases/darshan-3.2.1.tar.gz | tar zxf -`
|
|
|
|
2. Load in mpi-hpe for mpicc `module load mpi-hpe comp-intel`
|
|
|
|
3. Configure and make the darshan-runtime `cd darshan-runtime && ./configure --with-log-path=~/profiler/darshan-logs --with-jobid-env=PBS_JOBID CC=mpicc && make`
|
|
|
|
|
|
|
|
This is where I get stuck. I can't `make install` since I don't have write permissions to `/usr/local/lib`, but I *can* do something like `make install DESTDIR=~/` to install it to my home directory... I can even add `~/usr/local/bin` to my path. But what about the `lib` and `share` directories? How do I make sure those are accessible?
|
|
|
|
|
|
|
|
The reason I ask is that when I try to run an mpiexec that is monitored by Darshan, I face an error:
|
|
|
|
```
|
|
|
|
paddison@pfe20:~> LD_PRELOAD=~/usr/local/lib/libdarshan.so mpiexec -n 2 ~/benchmarks/ior/ior-3.3.0/src/ior
|
|
|
|
mpiexec: symbol lookup error: /home6/paddison/usr/local/lib/libdarshan.so: undefined symbol: darshan_variance_reduce
|
|
|
|
```
|
|
|
|
|
|
|
|
I just tried `export LD_LIBRARY_PATH=~/usr/local/lib:$LD_LIBRARY_PATH` as well to no avail.
|
|
|
|
|
|
|
|
To be honest, I've spent some time reading about libraries and linking, but I don't truly understand how it all works and what specifically is breaking here. Perhaps I need to set some paths in `./configure`. For instance, `--prefix`.
|
|
|
|
|
|
|
|
Using `--prefix ~/usr/local` lets me run `make install` without messing with Makefile variables (whoops, shoulda just looked at the `./configure --help` to begin with!). And my hope is that it'll also let me actually run the thing!
|
|
|
|
```
|
|
|
|
paddison@pfe20:~/profiler/darshan-3.2.1/darshan-runtime> LD_PRELOAD=~/usr/local/lib/libdarshan.so mpiexec -n 2 ~/benchmarks/ior/ior-3.3.0/src/ior
|
|
|
|
Can't open proc file /proc/arsess
|
|
|
|
: function completed normally
|
|
|
|
Can't open proc file /proc/arsess
|
|
|
|
: function completed normally
|
|
|
|
asallocash failed: array services not available
|
|
|
|
Can't open proc file /proc/arsess
|
|
|
|
: array services not available
|
|
|
|
mpiexec: all_launch.c:737: newash: Assertion `new_ash != old_ash' failed.
|
|
|
|
Aborted (core dumped)
|
|
|
|
```
|
|
|
|
Hey! At least it's different than before :') Oh whoops, that might be because I tried running an mpiexec command on a front-end node rather than a compute node. Let's try it again in an interactive qsub.
|
|
|
|
|
|
|
|
Hmmmm... it hung on me. Gotta figure out how to terminate a PBS job. I tried `qsig -s INT jobid`, then plain `qsig jobid` (which should default to SIGTERM), then `qdel jobid`, but none of it worked right away :o After a while (~10 minutes or so) my qdel timed out, trying it again returned "qdel: Server could not connect to MOM ...", and after a bit more time a third attempt worked. Maybe some backend server was down temporarily or something...
|
|
|
|
|
|
|
|
As if I didn't learn my lesson, I'm going to try again.
|
|
|
|
|
|
|
|
Aw shucks, here we go again. Something about `LD_PRELOAD=~/usr/local/lib/libdarshan.so mpiexec -n 2 ~/benchmarks/ior/ior-3.3.0/src/ior` is hanging. Same exact thing happened -- `qdel` timed out after 12 minutes, then a subsequent call returned no connection to MoM, then a third call a few seconds later succeeded. Not sure what's going on.
|
|
|
|
|
|
|
|
###### 04/12
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
I'll re-examine Darshan in the future, or perhaps while waiting for some parameter sweeps to conclude. For now, it's time to use the parameter values from the paper and start running some initial tests!
|
|
|
|
|
|
|
|
## What MPI module to use
|
|
|
|
|
|
|
|
Huh, so actually when I was getting things set up to run the parameter sweeps, I realized that I can't run IOR with either mpi-hpe/mpt or mpi-sgi/mpt... only with mpi-hpcx + comp-intel, it seems... Otherwise I'm met with `error while loading shared libraries: libopen-rte.so.40: cannot open shared object file: No such file or directory`...
|
|
|
|
|
|
|
|
Maybe it's because I ran `make` with hpcx loaded? That would make sense -- `libopen-rte` is an Open MPI runtime library and HPC-X is built on Open MPI, so the binary was linked against hpcx's libraries. I've gone ahead and re-downloaded, re-configured, and re-made IOR with mpi-hpe loaded -- it works this time with mpi-hpe as the *only* required module :)
|
|
|
|
|
|
|
|
Let's try Darshan super quick? Damn. It hung again. Alright, I'll give up on Darshan for now and just move on with the parameter sweep finally.
|
|
|
|
|
|
|
|
## Conducting initial parameter sweeps
|
|
|
|
|
|
|
|
**Additional IOR documentation can be found at https://github.com/hpc/ior/blob/main/doc/USER_GUIDE**. It includes some things that aren't on the website. Based on this, I could have written Python code to generate IOR command files and then have the PBS job script run those, rather than executing commands from within Python. Oh well, maybe I will change to that in the future (a rough sketch of that approach is below).
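
For reference, here's a rough sketch of that alternative: generating an IOR command file from Python and then running it with `ior -f <file>`. The directive names follow the USER_GUIDE linked above; the values and filenames are just placeholders.

```python
from pathlib import Path


def write_ior_script(path, api="POSIX", block="1g", transfer="2m", segments=1):
    """Write a simple IOR command file, to be run as `mpiexec -n N ior -f <path>`."""
    Path(path).write_text(
        "IOR START\n"
        f"    api={api}\n"
        "    testFile=iortest\n"
        f"    blockSize={block}\n"
        f"    transferSize={transfer}\n"
        f"    segmentCount={segments}\n"
        "    writeFile=1\n"
        "    readFile=1\n"
        "RUN\n"
        "IOR STOP\n"
    )


# e.g. one command file per transfer size, looped over by the PBS job script:
# for t in ["4k", "64k", "1m", "16m"]:
#     write_ior_script(f"sweep_t{t}.ior", transfer=t)
```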
|
|
|
|
|
|
|
|
I've gone ahead and done the parameter sweeps. The results are plotted and commented on in the `1_Parameter_sweeps.ipynb` notebook. Here are some of the notable things we saw.
|
|
|
|
|
|
|
|
|
|
|
|
> ###### [a10b7d7](ssh///https://gitlab-ssh.nautilus.optiputer.net/30622/parkeraddison/filesystem-benchmarks/commit/a10b7d71f4ffb1420f45d916b72968b6174236e4)
|
|
|
|
> "Initial parameter sweeps; Configurable sweeps; Parsing/plotting" | HEAD -> main | 2021-04-14
|
|
|
|
|