- [Setup](#setup-2)
- [Running](#running-2)
- [IOR MPI job](#ior-mpi-job)
- [Shared volume](#shared-volume)
- [Node selection](#node-selection)

<!-- vim-markdown-toc -->
Here are some repos which chart fio outputs or do other helpful things with FIO.

This is probably a good search: https://github.com/search?q=fio+benchmark&type=Repositories
> ###### 957b70caaca3f180ca323dbf1045965823f349df
> "Exploration of IOR and FIO benchmarks; Noteful wiki" | HEAD -> main | 2021-03-14

###### 03/15
Things are starting to make more sense and work more consistently with IOR and FIO.

One thing I'm not fully sure about is the importance of specifying an existing file in IOR, or how to use it. For instance, if I create a file of random bytes like the one seen [here](https://gitlab.msu.edu/reyno392/good-practices-in-IO/-/blob/master/generateinput.sh), is there any point to using that as an existing file to read from? Ah... perhaps there is a point. I could create multiple small files or one very large file... this coupled with filePerProc... maybe that's the point.
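To keep the two modes straight, here's a rough sketch of how they'd look on the command line (the flag names are IOR's documented options; the process count, sizes, and paths are just placeholders):

```bash
# Shared-file mode: all 4 processes read/write offsets within one test file
mpirun -np 4 ior -a POSIX -w -r -b 16m -t 1m -o /path/to/testdir/testfile

# File-per-process mode: add -F so each rank gets its own file under the same prefix
mpirun -np 4 ior -a POSIX -w -r -F -b 16m -t 1m -o /path/to/testdir/testfile
```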
> ###### c57f6dd8feb60eedf5a0c0ae809f30e3ef91f111
> "Minimal IOR test script; Repo organization" | HEAD -> main | 2021-03-16

###### 03/21
Nice!
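For context, a hello-world PBS job of the sort this commit describes is just a shell script with `#PBS` directives submitted via `qsub` -- roughly like this sketch (the resource selection and walltime are generic placeholders, not the exact script from the repo):

```bash
#PBS -N hello-world
#PBS -l select=1:ncpus=1
#PBS -l walltime=00:05:00

# stdout lands in the job's .o file, stderr in the .e file
echo "hello from $(hostname)"
```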
> ###### 6e9d9445847ef3dba7e80e79885bbd54cadb122b
> "Hello World PBS job run on Pleiades" | HEAD -> main | 2021-03-21

###### 03/22
Hmmm, so it didn't work fully, but it didn't not work at all at least :')

> ###### ea7d869f2b97bc0607e9105c1b2dd5f853aa255c
> "Minimal IOR test almost capable of running on Pleiades. Faced segfault" | HEAD -> main | 2021-03-22

---
Looks like the abort happens due to mpirun -- but the truth of the matter is *we are getting a segfault from IOR itself*. Now we just need to figure out why!
> ###### c1a44925fbad444c7a84e054f828a0cb6e6c52a5
> "More minimal testing on Pleiades" | HEAD -> main | 2021-03-23

###### 03/27
It's working great :) It's a super minimal example to just confirm it works. I'll run a slightly larger example parameter sweep as I finish that code.
> ###### ae00408b88260dd35be000cb263c924ee09f16fa
> "Working IOR on Pleiades NFS and Lustre" | HEAD -> main | 2021-03-27

###### 03/28
Right now I have some hard-coded values in the Python script itself for the parameters.

To more easily validate and see the effect of the different parameters, it would help to have code that converts the IOR outputs into tables then graphs them!
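For reference, the sweep itself boils down to nested loops over a few IOR knobs -- something like this shell sketch (the actual script is Python, and the particular APIs, transfer sizes, and process count here are placeholders rather than the hard-coded values):

```bash
# Sketch of a parameter sweep: one IOR run per (API, transfer size) combination
for api in POSIX MPIIO; do
    for xfer in 256k 1m 4m; do
        mpirun -np 4 ior -a "$api" -w -r -F -b 64m -t "$xfer" \
            | tee "ior_${api}_t${xfer}.out"
    done
done
```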
> ###### fddf04911b8fb2af9f03f07a1a1f5c6dbd49cdc7
> "Facilitate running IOR with a parameter sweep" | HEAD -> main | 2021-03-30

###### 04/04
Turns out IOR has a few different output formats -- including JSON and CSV, which should make parsing much easier.

Now that I'm using the JSON output from IOR, everything is much more straightforward when it comes to parsing. I polished up a file to parse and plot outputs. I think it's time now to actually do some larger-scale runs so we can make sure what we're getting as a result makes sense.
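Roughly, a run that writes the JSON summary and a quick way to pull numbers back out (the `-O summaryFormat`/`summaryFile` directives are IOR options; the exact JSON field names, e.g. `bwMeanMIB`, can vary between IOR versions, so treat the `jq` filter as a sketch):

```bash
# Write a machine-readable summary alongside the usual stdout report
mpirun -np 4 ior -a POSIX -w -r -F -b 16m -t 1m \
    -O summaryFormat=JSON -O summaryFile=results.json

# Sketch: mean bandwidth (MiB/s) per operation from the summary array
jq -r '.summary[] | "\(.operation): \(.bwMeanMIB) MiB/s"' results.json
```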
> ###### 9aa37d75a8cb70b06023251e065a64795cb58569
> "Output parsing and plotting functions complete" | HEAD -> main | 2021-04-06

---
###### 04/07

> ###### 246de9feb3882fe6a91a66306661a3aeeacf3814
> "More robust parameter sweep code and pbs job script; Add sweep and visualization to readme" | HEAD -> main | 2021-04-07

###### 04/08
Hmmm, I came across [Gordon: design, performance, and experiences deploying and supporting a data intensive supercomputer](https://dl.acm.org/doi/pdf/10.1145/2335755.2335789) by Shawn Strande, Pietro Cicotti, et al. (and it's out of SDSC - it's a small world after all ;) ). But I also think I came across the reason why I'm not seeing HPC papers that use FIO: I'm not so sure that FIO can do single-shared-file workloads, https://github.com/axboe/fio/issues/631. So it might be really easy to set up a job script and get baseline readings for your filesystems, but not when there are multiple nodes involved.
> ###### b2beb35c3771b5251442b3cdd48801f1d2c57e4c
> "FIO works on NAS" | HEAD -> main | 2021-04-09

###### 04/10
It's important to figure out if that behavior is consistent, and if so, figure out why.

I ran all of the benchmarks on /nobackupp18, but supposedly that filesystem is not fully set up yet. It also has different hardware (SSDs) than /nobackupp12. I will attempt to run the same set of tests on /nobackupp12 and compare the results.
> ###### a10b7d71f4ffb1420f45d916b72968b6174236e4
> "Initial parameter sweeps; Configurable sweeps; Parsing/plotting" | HEAD -> main | 2021-04-14

###### 04/16
Looking into running an MPI job (IOR) across multiple nodes on the PRP.

I think having some tests from the PRP to compare to will be nice. I'm puzzled by a bit of the NAS results... trying to formalize some visualizations and run some more tests to get a better grasp of the I/O performance behavior that's going on...
> ###### 1aeb8957bb17f24494cbff25ebb03d371422054e
> "sync changes" | HEAD -> main | 2021-04-24

###### 04/24
I should now be able to run mpiexec via a `kubectl exec` to the master pod.

Sweet! The example command works!

```bash
kubectl exec -it $MPI_CLUSTER_NAME-master -- mpiexec --allow-run-as-root --hostfile /kube-openmpi/generated/hostfile --display-map -n 4 -npernode 1 sh -c 'echo $(hostname):hello'
```
Worth noting for the future in case I need to do some careful node selection or mess with some mpiexec options: some of my nodes (currently master and worker-0) are on nysernet and some aren't. In the JOB MAP section of the output, the ones not on nysernet show
Then, when using MPIIO as the API, IOR will not run fully.

Oh. Probably because I'm not working on a shared volume, duh. So each node can only see its own file. Well, anyway, IOR is technically working!
> ###### ad5fe7f3ca91273e0074fdb26bfd56e9c9d1136e
> "kube-openmpi running with IOR on PRP" | HEAD -> main | 2021-04-24

###### 04/25
### Shared volume

I can use the `rook-cephfs` storage class -- it uses CephFS and supports ReadWriteMany -- once Dima gives me the okay. See: https://pacificresearchplatform.org/userdocs/storage/ceph-posix/
Basically all I need to do is change my volume yaml to specify:

```yaml
spec:
  storageClassName: rook-cephfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
```
Then, I believe I can adjust `values.yaml` to:

```yaml
volumes:
  - name: shared-cephfs
    persistentVolumeClaim:
      claimName: shared-cephfs
volumeMounts:
  - mountPath: /shared
    name: shared-cephfs
```
for both `mpiMaster` and `mpiWorkers`... we'll see!

Wonderful! The shared storage was successfully mounted to all nodes. I tried running IOR with just a single process on the master node in the shared directory and it worked -- now let's go ahead and try a multi-node job.
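By a multi-node job I mean reusing the `mpiexec` invocation from above but pointing IOR at the shared mount -- roughly this (the process counts and sizes are placeholders, and it assumes the `ior` binary is on the PATH inside the containers):

```bash
kubectl exec -it $MPI_CLUSTER_NAME-master -- \
    mpiexec --allow-run-as-root --hostfile /kube-openmpi/generated/hostfile \
        -n 4 -npernode 1 \
        ior -a MPIIO -w -r -b 16m -t 1m -o /shared/ior-testfile
```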
Woohoo! It worked!

Performance was really bad (~0.5 MiB/s write), probably due to the huge separation between nodes -> high latency (0.25s for writes :o ). To be honest, I didn't even check what region the storage is assigned to. But nevertheless -- it worked :)

### Node selection
I want to make sure I'm requesting nodes that have 16 cores -- just like Sandy Bridge.

To do so, I can do a couple things in `values.yaml`:

1. Specify `resources.requests/limits`
2. ~~Specify `nodeSelector` with `nautilus.io/sockets: 2` as the required label. This will prevent being assigned to nodes with more cpus.~~

Nevermind. I just checked `k get nodes -l nautilus.io/sockets=2 -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu` and the vast majority of nodes labeled as having 2 sockets have tons of cpus. In general, it looks like most nodes on the cluster have more than 16 cpus. It doesn't make sense to try to get dedicated 16-cpu nodes.
###### 04/26

In light of the high latency, I think I'll go ahead and request specific nodes at first. My storage is US West (probably at UCSD), so I'll want to request some other nodes also at UCSD to limit communication latency between pods and to the storage.

I'll need to talk to Dima about which nodes to use, but I should be able to ask for:

- Pods of type `general` (avoid `testing`, `system`, `osg`, etc)
- `calit2.optiputer.net` nodes (looks like these should be at UCSD, whereas calit2.uci.edu are at Irvine)
- `sdsc.optiputer.net` nodes
- `ucsd.edu` nodes
- `suncave` nodes
Ah, I can use these nodes:

```sh
k get nodes -l topology.kubernetes.io/zone=ucsd
```

(with the possible exception of a .ucsb.edu node which might have been labeled by mistake)
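Combining that selector with the custom-columns output from before gives a quick way to double-check how many cpus those candidate nodes actually have:

```sh
# List the UCSD-zone nodes along with their cpu capacity
k get nodes -l topology.kubernetes.io/zone=ucsd -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu
```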
This means I can use nodeSelector in `values.yaml`. Couple it with my resource requests:

```yaml
resources:
  limits:
    cpu: 8
    memory: 8Gi
  requests:
    cpu: 8
    memory: 8Gi
nodeSelector:
  topology.kubernetes.io/zone: ucsd
```
Uhhh ohhhh.

```
Error from server: error when creating "STDIN": admission webhook "pod.nautilus.optiputer.net" denied the request: PODs without controllers are limited to 2 cores and 12 GB of RAM
```

Gotta figure that one out.

> ###### e58ea08968662f03b607111cc865d149c4d4ea12
> "IOR working on shared cephfs filesystem with node selection" | HEAD -> main | 2021-04-26