- [Single node CephFS parameter sweep](#single-node-cephfs-parameter-sweep)
- [Attach VSCode to kubernetes pod](#attach-vscode-to-kubernetes-pod)
- [IOR scripts and outputs](#ior-scripts-and-outputs)
- [IO Hints](#io-hints)
- [Trying out nbp2 and memory hogging](#trying-out-nbp2-and-memory-hogging)
- [Better understanding of the filesystem hardware](#better-understanding-of-the-filesystem-hardware)
- [No-cache read performance](#no-cache-read-performance)
- [Designing stripe tests](#designing-stripe-tests)
- [Some useful things to know about IOR](#some-useful-things-to-know-about-ior)
- [The real reason for transfer size performance drops](#the-real-reason-for-transfer-size-performance-drops)
- [Working around the local memory cache](#working-around-the-local-memory-cache)
- [A quick multi-node test on PRP](#a-quick-multi-node-test-on-prp)
- [Pay attention to `-npernode`](#pay-attention-to--npernode)
- [Pay attention to cores vs threads](#pay-attention-to-cores-vs-threads)
- [Understanding performance](#understanding-performance)
- [Darshan on PRP](#darshan-on-prp)
- [Darshan to observe an ML application](#darshan-to-observe-an-ml-application)

<!-- vim-markdown-toc -->
> ###### e58ea08968662f03b607111cc865d149c4d4ea12

> "IOR working on shared cephfs filesystem with node selection" | HEAD -> main | 2021-04-26

###### 04/27

## PRP SeaweedFS

Dima mentioned there are some issues with CephFS at the moment: heavy usage is causing the OSDs to run out of memory and crash. In the meantime, he suggested I check out SeaweedFS.
On NAS, at the moment, I still need to use the Python orchestration to set the Lustre stripe sizes/counts before each run... at least until the IOR Lustre options are fixed.

- [ ] ==NOTE== **I'm curious... does my previous Lustre striping workaround still work when there are multiple repetitions? I don't recall checking...**
- [x] ~~**I'm curious... does my previous Lustre striping workaround still work when there are multiple repetitions? I don't recall checking...**~~ *Yes, it works :) And it works for multiple tests too, so I actually **don't** need to use the Python script at all as long as I remember to manually set the stripe count before running the test.*
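For reference, the manual striping workaround is just a couple of `lfs` commands on the test directory before launching IOR. A minimal sketch -- the directory path and stripe values here are placeholders, not the exact ones I used:

```sh
# Stripe new files in this directory across 4 OSTs with a 4 MiB stripe size
# (files created by IOR inherit the directory's layout).
lfs setstripe -c 4 -S 4m /nobackupp12/myuser/ior_stripe_test

# Double-check the layout before kicking off the run.
lfs getstripe /nobackupp12/myuser/ior_stripe_test
```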
Also, the **latency output (json) is measured in seconds**.

> ###### e5bc2cf2836df6c51b96bbd18bfe828595154d51

> "Run using IOR scripts; PRP ceph and seaweed" | HEAD -> main | 2021-04-28

###### 04/28

## IO Hints

I am going to show the NAS and Ceph findings so far to the NAS APP group, in an attempt to figure out what's going on with the drop in performance at the 4 MiB transfer size on NAS, and to ask about the hardware/software stack at NAS, Lustre monitoring, etc.

So I'm re-running a bunch of the parameter sweeps on NAS (and PRP) to make sure my results are consistent. At the same time, I'd like to experiment with I/O hints. These should be useful: http://www.idris.fr/media/docs/docu/idris/idris_patc_hints_proj.pdf and https://github.com/hpc/ior/blob/main/doc/USER_GUIDE#L649. I was able to use a hints file that looks like this:

```
# File: hints.ior
IOR_HINT__MPI__romio_cb_write=enable
IOR_HINT__MPI__romio_cb_read=enable
```

Coupled with `hintsFileName` (`-U`) set to the path of that file, and `showHints` (`-H`), it worked! Now let's do some parameter sweeps and see if it actually makes a difference.
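For concreteness, the invocation ends up looking roughly like this (a sketch -- the MPI-IO backend selection, sizes, and paths are illustrative, not my exact sweep parameters):

```sh
# -U points IOR at the hints file; -H prints the hints actually set on the file.
# ROMIO hints like these only apply when using the MPIIO backend (-a MPIIO).
mpiexec -np 4 ior -a MPIIO -w -r -t 1m -b 256m \
    -U hints.ior -H \
    -o /path/to/testfile
```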
==NOTE== The *collective* option in IOR causes a massive drop in performance -- bandwidth on the order of single mebibytes per second.

###### 05/03

## Trying out nbp2 and memory hogging

Hogging memory on the node (`-M`) seems to affect the blockSize performance, as Mahmoud suggested. Trying to run on /nobackupp2 with 85% memory hogging leads to an out-of-memory error at some point when testing the 1.5Gi block size... not sure why this didn't happen when testing on /nobackupp12 -- the requested compute nodes were the same.

Transfer size exhibited no drop-off when memory hogging was used. Read performance was pretty level at around 200MiB/s, and write performance was consistently greater than read.

I'd like to run a read test on an existing file that is for sure *not* in the Lustre OSS cache.
###### 05/04

## Better understanding of the filesystem hardware

These links are useful:

- https://www.nas.nasa.gov/hecc/support/kb/pleiades-lustre-filesystems_225.html
- https://www.nas.nasa.gov/hecc/support/kb/pleiades-configuration-details_77.html
- https://www.nas.nasa.gov/hecc/support/kb/sandy-bridge-processors_301.html

However, the hardware of the OSTs is not discussed -- and that's where we'd find the theoretical performance of our filesystems, by looking at the OST drive performance and networking overhead.
## No-cache read performance

I created some files the other day, and I'm now trying to do read-only IOR tests on those existing files. On my small-scale tests this seems to have worked -- I get much lower read bandwidth.

Here are the useful things to know about such a test:

- **keepFile (`-k`) absolutely must be enabled** (otherwise the data will be deleted after the test, meaning you'll need to create a new file and wait a while again -- whoops)
- It is still important to use memory hogging (`-M %`) for multiple tests, otherwise the read file will be in the local cache.
- 85% seemed to work well. I wouldn't be surprised if setting it too high risks crashing the node with an OOM, however (just like what happened on the nbp2 test earlier).
- We can read just a portion of the file just fine; only a warning will show:

```
WARNING: Expected aggregate file size = 1073741824.
WARNING: Stat() of aggregate file size = 17179869184.
WARNING: Using actual aggregate bytes moved = 1073741824.
```
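Pulling those flags together, the read-only run looks roughly like this (a sketch -- the process count, sizes, and file path are placeholders for whatever existing file is being read):

```sh
# -r: read-only; -k: keep the existing file afterwards;
# -M 85%: hog node memory so the file can't linger in the page cache.
mpiexec -np 4 ior -r -k -M 85% -t 4m -b 1g -o /nobackupp12/myuser/ior_keep/bigfile
```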
Probably a good idea to just create a handful of very large files (to support our largest multi-node tests) and keep those lying around.

## Designing stripe tests

We're interested in exploring our hypothesis that the performance drops at certain transfer sizes are related to the Lustre stripe sizes.
## Some useful things to know about IOR

- We can pass a script into IOR via stdin like so:

```sh
ior -f /dev/stdin << EOF
IOR START
# rest of script
RUN
IOR STOP
EOF
```

- Warnings are printed as they occur, so using `summaryFormat=JSON` without a corresponding `summaryFile` will produce invalid JSON on stdout whenever anything else gets logged there.
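A minimal sketch of pairing the two options on the command line so stray warnings can't corrupt the JSON (the sizes and output paths are just examples):

```sh
# Write the machine-readable summary to its own file instead of stdout.
mpiexec -np 4 ior -w -r -t 1m -b 64m \
    -O summaryFormat=JSON -O summaryFile=/shared/results/sweep_run.json \
    -o /shared/ior_testfile
```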
Snippet for plotting the write test, until I can figure out a better way:

```python
for rn in df12.ReferenceNumber.unique():
    quick_plot(df12[df12.ReferenceNumber == rn], 'transferSize', 'bandwidth')
```
## The real reason for transfer size performance drops

I suspect that the real reason for the transfer size performance drops has something to do with memory. I believe this because we're seeing the effect *while reading from a file that has been cached in local memory*. Observe that the read speeds are astronomical -- but only after the very first iteration for a file. Somehow I need to avoid local caching -- I was using memory hogging at 85%, but that wasn't enough.

### Working around the local memory cache

I'm thinking of the following possible workarounds:

- Avoid repetitions in IOR itself -- too likely to re-read from cache
- Make the sweep round-robin style, doing each parameter value for all files before moving on to the next, coupled with memory hogging to ensure only one file fits in memory
- Try to manually drop the file(s) from the memory cache
  - https://unix.stackexchange.com/questions/36907/drop-a-specific-file-from-the-linux-filesystem-cache

I tried the manual cache dropping. First I ran `free -h` to see my memory and cache usage, then read a 4Gi file with IOR and saw memory usage jump up. Sure enough, the next IOR read test had orders-of-magnitude better read bandwidth.

Some testing with memory hogging shows that it definitely lowers performance, but it by no means prevents the caching effects entirely.

Then I tried running

```sh
# Copy nothing (count=0), don't truncate the file (conv=notrunc), flush dirty
# pages first (fdatasync), and ask the kernel to drop the file's cached pages
# (oflag=nocache).
dd of=FILE_name oflag=nocache conv=notrunc,fdatasync count=0
```
Looking at `free` confirmed that memory usage went down, **and the next IOR run had similar performance to the very first run!**

This is a helpful read for understanding the caches: http://arighi.blogspot.com/2007/04/how-to-bypass-buffer-cache-in-linux.html

Some more links related to avoiding/dropping the file cache:

- https://man7.org/linux/man-pages/man2/posix_fadvise.2.html -- the functionality in the Linux kernel
- https://github.com/lamby/python-fadvise -- a Python interface to posix_fadvise
- https://unix.stackexchange.com/questions/17936/setting-proc-sys-vm-drop-caches-to-clear-cache -- if you want to clear the *entire* cache
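For completeness, the whole-cache version from that last link looks like this (requires root, and it evicts every cached file on the node, not just the test file):

```sh
# Flush dirty pages, then drop the page cache plus dentries and inodes.
sync
echo 3 > /proc/sys/vm/drop_caches
```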
###### 05/18

## A quick multi-node test on PRP

I'm editing the values.yaml used by kube-openmpi to utilize [Node Anchors and Aliases](https://yaml.org/spec/1.2/spec.html#id2785586), which let me re-use keys so I can write my desired resources and volumes once and have them shared across all workers.

Now that that's done, I've lowered the resources to within the allowed PRP settings for interactive pods -- 2 CPUs and 8GB RAM -- and I'll run a multi-node test.

Yes! I created a nice alias `omexec` which takes care of the run-as-root and hostfile considerations, and now I can run it just fine.
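The alias is nothing fancy -- something along these lines (a hypothetical sketch; the hostfile path depends on how the kube-openmpi deployment is set up, and my actual alias may differ):

```sh
# Wrap mpiexec with the flags I always need inside the openmpi master pod:
# running as root requires --allow-run-as-root, and ranks are spread across
# the worker pods listed in the generated hostfile.
alias omexec='mpiexec --allow-run-as-root --hostfile /path/to/generated/hostfile'
```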
### Pay attention to `-npernode`

I requested a limit of 2 CPUs, but that is a quota -- it does not mean the container cannot access the rest of the cores. So I can still execute mpiexec with more than 2 processes (assuming we want only 1 proc per core). Setting `-npernode` adjusts how many processes per node we want.
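As a sketch (the rank counts and IOR parameters are illustrative), spreading 8 ranks as 2 per node looks like:

```sh
# 4 worker pods x 2 ranks each = 8 ranks total, keeping each pod near its
# 2-CPU request even though more cores are technically visible.
omexec -np 8 -npernode 2 ior -w -r -t 1m -b 256m -o /shared/ior_testfile
```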
Useful resources:

- https://www.golinuxcloud.com/kubernetes-resources/#Understanding_resource_units
- https://www.golinuxcloud.com/kubernetes-resources/#How_pods_with_resource_limits_are_managed

I don't think I've run into the quota throttling yet... and monitoring the Grafana dashboard shows I'm well within limits overall. I think Dima was explaining to someone in the support room that bursty behavior is fine, it just can't consistently exceed the limits.

- https://grafana.nautilus.optiputer.net/d/85a562078cdf77779eaa1add43ccec1e/kubernetes-compute-resources-namespace-pods?orgId=1&refresh=10s&var-datasource=default&var-cluster=&var-namespace=usra-hpc
### Pay attention to cores vs threads

There are a handful of ways to look at the CPU information; `lscpu` is nice. Notice that it claims we have access to 16 CPUs, but it's actually 8 cores with 2 threads per core. On a single node, if I run IOR with 8 procs, I get really fast performance (and I notice the same caching effect in the read speeds as seen before). When I try to run it with 16 procs, the performance is significantly worse all around -- the hyper-threads aren't independent cores; they share each core's execution resources, so the extra procs end up timesharing.
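A quick way to see the breakdown on a node (the grep pattern just trims `lscpu` output down to the relevant fields):

```sh
# On these worker nodes this reports 16 logical CPUs = 8 cores x 2 threads/core.
lscpu | grep -E '^(CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket|Socket\(s\))'
```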
### Understanding performance

So... the performance is pretty bad.

## Darshan on PRP

- https://www.mcs.anl.gov/research/projects/darshan/docs/darshan-runtime.html#_conventional_installation

After downloading and untarring Darshan, I tried to ./configure the darshan-runtime, but got an error that no zlib headers could be found. This can be fixed by specifically installing `zlib1g-dev` -- the non-dev version will not do.
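On a Debian/Ubuntu-based pod image that's just the following (assuming root inside the container):

```sh
# The runtime needs the zlib *headers*, which only the -dev package provides.
apt-get update && apt-get install -y zlib1g-dev
```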
Then we can configure it. We'll need to pass `--with-log-path` and `--with-jobid-env`. The first is easy because I can set it to wherever I want to store logs. The latter I don't know. On NAS I knew that PBS was used, so I knew the environment variable. Here, I'm trying to figure it out by running `mpiexec env` and seeing what variables are populated. I'll pass `NONE` for now, but it might be `PMIX_ID` or something like that... we'll see later when I do multi-node Darshan.

```
./configure --with-log-path=/shared/darshan-logs --with-jobid-env=NONE
```

Finally, `make` and `make install` both did the trick! Then follow it up with `mkdir /shared/darshan-logs` and `darshan-mk-log-dirs.pl`, as noted in the documentation.
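All together, the post-configure steps were roughly the following (run from the darshan-runtime build directory; the log path matches the configure flag above):

```sh
make
make install
# Create the log root and the year/month/day subdirectories Darshan expects.
mkdir -p /shared/darshan-logs
darshan-mk-log-dirs.pl
```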
Now let's actually try to use it:

```
mpiexec -np 2 -x LD_PRELOAD=/usr/local/lib/libdarshan.so ior
```

since OpenMPI uses `-x` instead of `-env`.

Welp, it didn't crash like it did on NAS. However, it was unable to create the Darshan log:

```
darshan_library_warning: unable to create log file /shared/darshan-logs/2021/5/18/root_ior_id5293_5-18-81444-18062306667005854292.darshan_partial.
```

My guess is permissions? Oh... it was pointing to the wrong path. At some point I had changed the path and re-configured, but that error still came up even though `darshan-config --log-path` showed the right path. I simply created a soft link between the actual and expected paths and re-ran -- it worked! Let's peek at these logs, shall we?
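(For the record, the soft-link workaround was just something like this; both paths are placeholders for my actual vs. expected log directories:)

```sh
# Make the path baked into libdarshan resolve to the real log directory.
ln -s /shared/darshan-logs /old/expected/darshan-logs
```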
- https://www.mcs.anl.gov/research/projects/darshan/docs/darshan-util.html

I needed to install Python (odd, that wasn't listed in the requirements), and I'll need to install some other things to get graphical outputs, but for now the ./configure, make, and make install went fine, and I can get a textual description of the log by running

```
darshan-parser <path/to/file>
```

Sweet! The output is well-documented and understandable.

###### 05/20

## Darshan to observe an ML application

Trying to get a test run of Darshan observing some ML application, like image analysis.

Turns out Darshan *was* working, but there are a few things to consider:

- The environment variable `DARSHAN_ENABLE_NONMPI` needs to be set (it can be empty)
- I think UTC is used, so sometimes you need to look at the next day of log data

```
env DARSHAN_ENABLE_NONMPI= LD_PRELOAD=/usr/local/lib/libdarshan.so python script.py
```

> ###### 5cf6b72690a4b6c9b36cbed37802591106ed410b

> "A whole bunch of work on PRP; Want image registry" | HEAD -> main | 2021-05-24