**Table of contents**

<!-- vim-markdown-toc GFM -->

- [IOR](#ior)
- [Setup](#setup)
- [Running](#running)
- [IO-500](#io-500)
- [Setup](#setup-1)
- [Running](#running-1)
- [Custom image for IO500 dependencies](#custom-image-for-io500-dependencies)
- [FIO benchmark](#fio-benchmark)
- [References, graphs, and job files](#references-graphs-and-job-files)
- [Understanding IOR output](#understanding-ior-output)
- [Pleiades PBS Hello World](#pleiades-pbs-hello-world)
- [PBS Directives](#pbs-directives)
- [Hello World](#hello-world)
- [Running IOR in a PBS job](#running-ior-in-a-pbs-job)
- [Download IOR](#download-ior)
- [Preparing IOR](#preparing-ior)
- [Dependency issues](#dependency-issues)
- [Nautilus namespace and IOR units](#nautilus-namespace-and-ior-units)
- [Re-trying IOR on Pleiades](#re-trying-ior-on-pleiades)
- [Debugging in interactive mode](#debugging-in-interactive-mode)
- [Automated parameter sweeps](#automated-parameter-sweeps)
- [Parsing and graphing IOR outputs](#parsing-and-graphing-ior-outputs)
- [Values for initial parameter sweeps](#values-for-initial-parameter-sweeps)
- [Side note: Darshan](#side-note-darshan)
- [FIO on NAS](#fio-on-nas)
- [Darshan on NAS](#darshan-on-nas)
- [What MPI module to use](#what-mpi-module-to-use)
- [Conducting initial parameter sweeps](#conducting-initial-parameter-sweeps)
- [Initial sweeps on /nobackupp12](#initial-sweeps-on-nobackupp12)
- [Lustre stripe counts](#lustre-stripe-counts)
- [IOR Lustre directives](#ior-lustre-directives)
- [IOR Lustre striping workaround](#ior-lustre-striping-workaround)
- [MPI on PRP](#mpi-on-prp)
- [Setup](#setup-2)
- [Running](#running-2)
- [IOR MPI job](#ior-mpi-job)

<!-- vim-markdown-toc -->

## MPI on PRP

Looking into running an MPI job (IOR) across multiple nodes on the PRP.

###### 04/20

I think having some tests from the PRP to compare against will be nice. I'm puzzled by some of the NAS results, so I'm trying to formalize some visualizations and run some more tests to get a better grasp of the I/O performance behavior that's going on.

> ###### [1aeb895](ssh///https://gitlab-ssh.nautilus.optiputer.net/30622/parkeraddison/filesystem-benchmarks/commit/1aeb8957bb17f24494cbff25ebb03d371422054e)
> "sync changes" | HEAD -> main | 2021-04-24

###### 04/24
### Setup

- https://pacificresearchplatform.org/userdocs/running/mpi-jobs/

After installing Helm (a package manager for Kubernetes), it's time to start following https://github.com/everpeace/kube-openmpi#quick-start.

Note that the Helm version has changed and the `--name` option is gone, so the deploy command should now be:

```bash
helm template $MPI_CLUSTER_NAME chart --namespace $KUBE_NAMESPACE ...
```

I took a peek at what this command outputs by redirecting to a file (`> OUT`) -- it produces a nice Kubernetes YAML which defines:

- A Secret containing the generated ssh key and authorized keys variable
- A ConfigMap containing a script to generate the hostfile
- A Service -- this is *"an abstraction which defines a logical set of Pods and a policy by which to access them (sometimes this pattern is called a micro-service)"*. Basically a way to group up Pods into an application with frontend and backend pods, and a way to network between them.
- A Pod containing the openmpi container with our desired image and a hostfile init container
- A StatefulSet which manages the pods -- this is like a Deployment in which all pods (including replicas) are uniquely identified, and it supports persistent storage
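
To make those pieces concrete, here's a hypothetical, heavily abbreviated skeleton of how the rendered Service/StatefulSet pair fits together -- every name and field value below is an illustrative placeholder, not the chart's actual output:

```yaml
# Hypothetical sketch only: a headless Service gives the StatefulSet's pods
# stable DNS names, which is what lets mpiexec reach workers via a hostfile.
apiVersion: v1
kind: Service
metadata:
  name: my-mpi-cluster            # placeholder name
spec:
  clusterIP: None                 # headless: no load-balancing, just DNS
  selector:
    app: my-mpi-cluster
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-mpi-cluster-worker
spec:
  serviceName: my-mpi-cluster     # ties pod DNS names to the Service above
  replicas: 3
  selector:
    matchLabels:
      app: my-mpi-cluster
  template:
    metadata:
      labels:
        app: my-mpi-cluster
    spec:
      containers:
        - name: openmpi
          image: my-openmpi-image # placeholder for the desired image
```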

Aw, attempting to create that resource led to:

```
Error from server (Forbidden): statefulsets.apps "nautilus-worker" is forbidden: User "system:serviceaccount:usra-hpc:default" cannot get resource "statefulsets" in API group "apps" in the namespace "usra-hpc"
+ cluster_size=
+ rm -f /kube-openmpi/generated/hostfile_new
stream closed
```

I'll need to ask Dima about permissions for that resource. Perhaps the API group has just shifted... or perhaps it's because I hadn't added the rolebindings yet. The rolebinding command uses the GitLab /blob/ URL instead of /raw/; after fixing that, I no longer hit the 'cannot get resource' issue! I still faced an issue though:

```
Error from server (NotFound): statefulsets.apps "nautilus-worker" not found
```

Ah, I think that was just due to me not fully tearing down my previous attempt. After deleting all the pods and re-running the resource creation -- it's working!

### Running

I should now be able to run mpiexec via a `kubectl exec` to the master pod.

Sweet! The example command works!

```bash
kubectl exec -it $MPI_CLUSTER_NAME-master -- mpiexec --allow-run-as-root --hostfile /kube-openmpi/generated/hostfile --display-map -n 4 -npernode 1 sh -c 'echo $(hostname):hello'
```

Worth noting for the future, in case I need to do some careful node selection or mess with some mpiexec options: some of my nodes (currently master and worker-0) are on nysernet and some aren't. In the JOB MAP section of the output, the ones not on nysernet show:

```
Data for node: nautilus-worker-1.nautilus Num slots: 8 Max slots: 0 Num procs: 1
Process OMPI jobid: [35664,1] App: 0 Process rank: 2 Bound: UNBOUND
```

whereas the ones on nysernet show `Bound` to a long list of sockets:

```
Data for node: nautilus-worker-0.nautilus Num slots: 96 Max slots: 0 Num procs: 1
Process OMPI jobid: [35664,1] App: 0 Process rank: 1 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]], socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt 0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket 0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]], socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]], socket 0[core 32[hwt 0-1]], socket 0[core 33[hwt 0-1]], socket 0[core 34[hwt 0-1]], socket 0[core 35[hwt 0-1]], socket 0[core 36[hwt 0-1]][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
```
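
Since I'll be comparing job maps across nodes, a small parser for those `Data for node` lines could help; this is just a sketch that assumes the line format shown in the snippets above (the helper name and regex are my own, not part of Open MPI):

```python
import re

# Pattern for Open MPI --display-map "Data for node" lines, assumed from the
# output snippets above: node name, then slot/proc counts.
NODE_RE = re.compile(
    r"Data for node:\s+(?P<node>\S+)\s+"
    r"Num slots:\s+(?P<slots>\d+)\s+"
    r"Max slots:\s+(?P<max_slots>\d+)\s+"
    r"Num procs:\s+(?P<procs>\d+)"
)

def parse_job_map(text):
    """Return {node_name: num_slots} for every 'Data for node' line."""
    return {m["node"]: int(m["slots"]) for m in NODE_RE.finditer(text)}

sample = """\
Data for node: nautilus-worker-1.nautilus Num slots: 8 Max slots: 0 Num procs: 1
Data for node: nautilus-worker-0.nautilus Num slots: 96 Max slots: 0 Num procs: 1
"""
print(parse_job_map(sample))
# {'nautilus-worker-1.nautilus': 8, 'nautilus-worker-0.nautilus': 96}
```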

Regardless, I *should* be setting up node affinities so that I get nodes with 16 cores for the closest comparison to Sandy Bridge.

#### IOR MPI job

Before we do that, though, let's get a custom image with IOR on it and do a minimal test run. Locally, this was as easy as downloading IOR and running `./configure` and `make` -- it worked fine without needing to mess with any additional dependencies :) Let's try it on the cluster.

Alright, IOR runs, but not without some issues. When running a POSIX API test, the following warnings showed up in the results section of both write and read:

```
ior WARNING: inconsistent file size by different tasks.
WARNING: Expected aggregate file size = 4194304.
WARNING: Stat() of aggregate file size = 1048576.
WARNING: Using actual aggregate bytes moved = 4194304.
```
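
The warning numbers would be self-consistent if (hypothetically) the run used 4 ranks each moving 1 MiB to what IOR believes is a single shared file, while each node actually writes its own local copy -- the rank count and transfer size here are my assumptions, not values pulled from the logs:

```python
# Sanity-check the IOR warning arithmetic under ASSUMED run parameters:
# 4 MPI ranks (matching the earlier `mpiexec -n 4`), 1 MiB moved per rank.
MIB = 1 << 20
ntasks = 4                 # assumed rank count
per_task_bytes = 1 * MIB   # assumed per-rank data size

# IOR sums every rank's contribution when predicting the shared-file size...
expected_aggregate = ntasks * per_task_bytes
# ...but stat() on any one node's local file only sees that rank's bytes.
local_file_size = per_task_bytes

print(expected_aggregate)  # 4194304 -- "Expected aggregate file size"
print(local_file_size)     # 1048576 -- "Stat() of aggregate file size"
```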

Then, when using MPIIO as the API, IOR will not run fully, as we're met with:

```
[nautilus-worker-2:00058] [3]mca_sharedfp_lockedfile_file_open: Error during file open
[nautilus-worker-0:00057] [1]mca_sharedfp_lockedfile_file_open: Error during file open
[nautilus-worker-1:00057] [2]mca_sharedfp_lockedfile_file_open: Error during file open
```

Oh. Probably because I'm not working on a shared volume, duh. So each node can only see its own file. Well, anyway, IOR is technically working!

> ###### [ad5fe7f](ssh///https://gitlab-ssh.nautilus.optiputer.net/30622/parkeraddison/filesystem-benchmarks/commit/ad5fe7f3ca91273e0074fdb26bfd56e9c9d1136e)
> "kube-openmpi running with IOR on PRP" | HEAD -> main | 2021-04-24