**Table of contents**

<!-- vim-markdown-toc GFM -->

- [IOR](#ior)
- [Setup](#setup)
- [Running](#running)
- [IO-500](#io-500)
- [Setup](#setup-1)
- [Running](#running-1)
- [Custom image for IO500 dependencies](#custom-image-for-io500-dependencies)
- [FIO benchmark](#fio-benchmark)
- [References, graphs, and job files](#references-graphs-and-job-files)
- [Understanding IOR output](#understanding-ior-output)
- [Pleiades PBS Hello World](#pleiades-pbs-hello-world)
- [PBS Directives](#pbs-directives)
- [Hello World](#hello-world)
- [Running IOR in a PBS job](#running-ior-in-a-pbs-job)
- [Download IOR](#download-ior)
- [Preparing IOR](#preparing-ior)
- [Dependency issues](#dependency-issues)
- [Nautilus namespace and IOR units](#nautilus-namespace-and-ior-units)
- [Re-trying IOR on Pleiades](#re-trying-ior-on-pleiades)
- [Debugging in interactive mode](#debugging-in-interactive-mode)
- [Automated parameter sweeps](#automated-parameter-sweeps)
- [Parsing and graphing IOR outputs](#parsing-and-graphing-ior-outputs)
- [Values for initial parameter sweeps](#values-for-initial-parameter-sweeps)
- [Side note: Darshan](#side-note-darshan)
- [FIO on NAS](#fio-on-nas)
- [Darshan on NAS](#darshan-on-nas)
- [What MPI module to use](#what-mpi-module-to-use)
- [Conducting initial parameter sweeps](#conducting-initial-parameter-sweeps)
- [Initial sweeps on /nobackupp12](#initial-sweeps-on-nobackupp12)
- [Lustre stripe counts](#lustre-stripe-counts)
- [IOR Lustre directives](#ior-lustre-directives)
- [IOR Lustre striping workaround](#ior-lustre-striping-workaround)
- [MPI on PRP](#mpi-on-prp)
- [Setup](#setup-2)
- [Running](#running-2)
- [IOR MPI job](#ior-mpi-job)

<!-- vim-markdown-toc -->

## MPI on PRP

Looking into running an MPI job (IOR) across multiple nodes on the PRP.

###### 04/20

I think having some tests from the PRP to compare against will be nice. I'm puzzled by some of the NAS results, so I'm trying to formalize some visualizations and run some more tests to get a better grasp of the I/O performance behavior that's going on.

> ###### [1aeb895](ssh///https://gitlab-ssh.nautilus.optiputer.net/30622/parkeraddison/filesystem-benchmarks/commit/1aeb8957bb17f24494cbff25ebb03d371422054e)
> "sync changes" | HEAD -> main | 2021-04-24

###### 04/24
### Setup

- https://pacificresearchplatform.org/userdocs/running/mpi-jobs/

After installing Helm (a package manager for Kubernetes), it's time to start following https://github.com/everpeace/kube-openmpi#quick-start.

Note that the Helm version has changed and the `--name` option is gone, so the deploy command should now be:

```bash
helm template $MPI_CLUSTER_NAME chart --namespace $KUBE_NAMESPACE ...
```

I took a peek at what this command outputs by redirecting to a file (`> OUT`) -- it produces a nice Kubernetes YAML which defines:

- A Secret containing the generated ssh key and authorized keys variable
- A ConfigMap containing a script to generate the hostfile
- A Service -- this is *"an abstraction which defines a logical set of Pods and a policy by which to access them (sometimes this pattern is called a micro-service)"*. Basically a way to group up Pods into an application with frontend and backend pods, and a way to network between them.
- A Pod containing the openmpi container with our desired image and a hostfile init container
- A StatefulSet which manages the pods -- this is like a Deployment in which all pods (including replicas) are uniquely identified, and it supports persistent storage
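
To make those pieces concrete, here's a hypothetical, heavily abbreviated skeleton of how the rendered Service/StatefulSet pair fits together -- every name and field value below is an illustrative placeholder, not the chart's actual output:

```yaml
# Hypothetical sketch only: a headless Service gives the StatefulSet's pods
# stable DNS names, which is what lets mpiexec reach workers via a hostfile.
apiVersion: v1
kind: Service
metadata:
  name: my-mpi-cluster            # placeholder name
spec:
  clusterIP: None                 # headless: no load-balancing, just DNS
  selector:
    app: my-mpi-cluster
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-mpi-cluster-worker
spec:
  serviceName: my-mpi-cluster     # ties pod DNS names to the Service above
  replicas: 3
  selector:
    matchLabels:
      app: my-mpi-cluster
  template:
    metadata:
      labels:
        app: my-mpi-cluster
    spec:
      containers:
        - name: openmpi
          image: my-openmpi-image # placeholder for the desired image
```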

Aw, attempting to create that resource led to:

```
Error from server (Forbidden): statefulsets.apps "nautilus-worker" is forbidden: User "system:serviceaccount:usra-hpc:default" cannot get resource "statefulsets" in API group "apps" in the namespace "usra-hpc"
+ cluster_size=
+ rm -f /kube-openmpi/generated/hostfile_new
stream closed
```

I'll need to ask Dima about permissions for that resource. Perhaps the API group has just shifted... or perhaps it's because I hadn't added the rolebindings yet. The rolebinding command uses the GitLab /blob/ URL instead of /raw/; after fixing that, I no longer hit the 'cannot get resource' issue! I still faced an issue though:

```
Error from server (NotFound): statefulsets.apps "nautilus-worker" not found
```

Ah, I think that was just due to me not fully tearing down my previous attempt. After deleting all the pods and re-running the resource creation -- it's working!

### Running

I should now be able to run mpiexec via a `kubectl exec` to the master pod.

Sweet! The example command works!

```bash
kubectl exec -it $MPI_CLUSTER_NAME-master -- mpiexec --allow-run-as-root --hostfile /kube-openmpi/generated/hostfile --display-map -n 4 -npernode 1 sh -c 'echo $(hostname):hello'
```

Worth noting for the future, in case I need to do some careful node selection or mess with some mpiexec options: some of my nodes (currently master and worker-0) are on nysernet and some aren't. In the JOB MAP section of the output, the ones not on nysernet show:

```
Data for node: nautilus-worker-1.nautilus Num slots: 8 Max slots: 0 Num procs: 1
Process OMPI jobid: [35664,1] App: 0 Process rank: 2 Bound: UNBOUND
```

whereas the ones on nysernet show `Bound` to a long list of sockets:

```
Data for node: nautilus-worker-0.nautilus Num slots: 96 Max slots: 0 Num procs: 1
Process OMPI jobid: [35664,1] App: 0 Process rank: 1 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]], socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt 0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket 0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]], socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]], socket 0[core 32[hwt 0-1]], socket 0[core 33[hwt 0-1]], socket 0[core 34[hwt 0-1]], socket 0[core 35[hwt 0-1]], socket 0[core 36[hwt 0-1]][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
```
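
Since I'll be comparing job maps across nodes, a small parser for those `Data for node` lines could help; this is just a sketch that assumes the line format shown in the snippets above (the helper name and regex are my own, not part of Open MPI):

```python
import re

# Pattern for Open MPI --display-map "Data for node" lines, assumed from the
# output snippets above: node name, then slot/proc counts.
NODE_RE = re.compile(
    r"Data for node:\s+(?P<node>\S+)\s+"
    r"Num slots:\s+(?P<slots>\d+)\s+"
    r"Max slots:\s+(?P<max_slots>\d+)\s+"
    r"Num procs:\s+(?P<procs>\d+)"
)

def parse_job_map(text):
    """Return {node_name: num_slots} for every 'Data for node' line."""
    return {m["node"]: int(m["slots"]) for m in NODE_RE.finditer(text)}

sample = """\
Data for node: nautilus-worker-1.nautilus Num slots: 8 Max slots: 0 Num procs: 1
Data for node: nautilus-worker-0.nautilus Num slots: 96 Max slots: 0 Num procs: 1
"""
print(parse_job_map(sample))
# {'nautilus-worker-1.nautilus': 8, 'nautilus-worker-0.nautilus': 96}
```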

Regardless, I *should* be setting up node affinities so that I get nodes with 16 cores for the closest comparison to Sandy Bridge.

#### IOR MPI job

Before we do that, though, let's get a custom image with IOR on it and do a minimal test run. Locally, this was as easy as downloading IOR and running `./configure` and `make` -- it worked fine without needing to mess with any additional dependencies :) Let's try it on the cluster.

Alright, IOR runs, but not without some issues. When running a POSIX API test, the following warnings showed up in the results section of both write and read:

```
ior WARNING: inconsistent file size by different tasks.
WARNING: Expected aggregate file size = 4194304.
WARNING: Stat() of aggregate file size = 1048576.
WARNING: Using actual aggregate bytes moved = 4194304.
```
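
The warning numbers would be self-consistent if (hypothetically) the run used 4 ranks each moving 1 MiB to what IOR believes is a single shared file, while each node actually writes its own local copy -- the rank count and transfer size here are my assumptions, not values pulled from the logs:

```python
# Sanity-check the IOR warning arithmetic under ASSUMED run parameters:
# 4 MPI ranks (matching the earlier `mpiexec -n 4`), 1 MiB moved per rank.
MIB = 1 << 20
ntasks = 4                 # assumed rank count
per_task_bytes = 1 * MIB   # assumed per-rank data size

# IOR sums every rank's contribution when predicting the shared-file size...
expected_aggregate = ntasks * per_task_bytes
# ...but stat() on any one node's local file only sees that rank's bytes.
local_file_size = per_task_bytes

print(expected_aggregate)  # 4194304 -- "Expected aggregate file size"
print(local_file_size)     # 1048576 -- "Stat() of aggregate file size"
```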

Then, when using MPIIO as the API, IOR will not run fully, as we're met with:

```
[nautilus-worker-2:00058] [3]mca_sharedfp_lockedfile_file_open: Error during file open
[nautilus-worker-0:00057] [1]mca_sharedfp_lockedfile_file_open: Error during file open
[nautilus-worker-1:00057] [2]mca_sharedfp_lockedfile_file_open: Error during file open
```

Oh. Probably because I'm not working on a shared volume, duh. So each node can only see its own file. Well, anyway, IOR is technically working!

> ###### [ad5fe7f](ssh///https://gitlab-ssh.nautilus.optiputer.net/30622/parkeraddison/filesystem-benchmarks/commit/ad5fe7f3ca91273e0074fdb26bfd56e9c9d1136e)
> "kube-openmpi running with IOR on PRP" | HEAD -> main | 2021-04-24