| ... | ... | @@ -420,6 +420,8 @@ Output: `10791518.pbspl1.nas.nasa.gov` |

Running `qstat -u paddison` lists the jobs I've submitted. This is a pretty quick job on a fast-turnaround queue, so it'll go by quickly. But three quick runs of that command showed the job in three different states. The fourth time running qstat the output was empty -- the job was complete.
<details>
<summary>qstat</summary>

```
paddison@pfe24:~> qstat -u paddison
Req'd Elap
| ... | ... | @@ -441,12 +443,13 @@ JobID User Queue Jobname TSK Nds wallt S wallt Eff |
--------------- -------- ----- -------- --- --- ----- - ----- ---
10791518.pbspl1 paddison devel HelloJob 1 1 02:00 E 00:00 50%
```

</details>
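
Rather than re-running `qstat` by hand, a small polling loop could wait for the job to drop out of the listing. This is only a sketch: `wait_for_job` and `POLL_SECONDS` are names made up here, and it assumes (as seen above) that a finished job simply disappears from the `qstat -u` output.

```bash
# Hypothetical helper: block until a job ID no longer shows up in qstat.
# Assumption: an empty listing for the job means it has finished. A failing
# qstat call would also end the loop, so treat the result with some care.
wait_for_job() {
    job_id=$1                 # e.g. 10791518.pbspl1.nas.nasa.gov
    user=${2:-$USER}
    # qstat prints a shortened ID, so match on the numeric part plus a dot.
    while qstat -u "$user" 2>/dev/null | grep -q "^${job_id%%.*}\."; do
        sleep "${POLL_SECONDS:-5}"
    done
    echo "Job $job_id is no longer listed"
}
```

It could then be used as `wait_for_job "$(qsub hello.sh)"`, where `hello.sh` stands in for whatever script is being submitted.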

Two files are now present in the directory where I ran `qsub`.

<details>
<summary>HelloJob.o10791518</summary>

```
Job 10791518.pbspl1.nas.nasa.gov started on Sun Mar 21 20:12:29 PDT 2021
| ... | ... | @@ -488,6 +491,7 @@ Two files are now present in the directory where I ran `qsub`. |

The `e` file was empty. Here is that file from a previous run where an invalid command was used.
<details>
<summary>HelloJob.e10791410</summary>

```
/var/spool/pbs/mom_priv/jobs/10791410.pbspl1.nas.nasa.gov.SC: line 5: print: command not found
```
| ... | ... | @@ -498,4 +502,168 @@ The job summary and output is shown in the `o` file, and it appears that stderr |

Nice!

> ###### [6e9d944](ssh///https://gitlab-ssh.nautilus.optiputer.net/30622/parkeraddison/filesystem-benchmarks/commit/6e9d9445847ef3dba7e80e79885bbd54cadb122b)
> "Hello World PBS job run on Pleiades" | HEAD -> main | 2021-03-21

###### 03/22

Let's get an IOR benchmark running as a PBS job.

## Running IOR in a PBS job

This is going to involve:
- Ensuring software dependencies exist... and learning how to load modules/packages
- Learning how to install software dependencies if need be!
- Downloading the IOR executable to /home(?) and trying to execute it in a PBS job

Software modules: https://www.nas.nasa.gov/hecc/support/kb/using-software-modules_115.html -- I'll probably need to `module load mpi...`.

Software directories: https://www.nas.nasa.gov/hecc/support/kb/software-directories_113.html -- since `/u/scicon/tools` is used by the APP group I have a feeling a handful of dependencies will be there already. These should already be in PATH.

Also good to know that the pfe nodes can load these modules *and it's fine to use them [for quick testing/debugging](https://www.nas.nasa.gov/hecc/support/kb/pleiades-front-end-(pfe)-usage-guidelines_181.html)*! ~~So I'll be able to test the minimal IOR (and work out all of the dependency, module load, etc steps) before submitting a PBS job :)~~ never mind. **mpi jobs are not permitted on the pfe nodes**. Still, I should be able to run the `./configure` script which checks all dependencies.

### Download IOR

Starting by downloading the IOR release from https://github.com/hpc/ior/releases/ to my pfe home directory.
```bash
wget -O- https://github.com/hpc/ior/releases/download/3.3.0/ior-3.3.0.tar.gz | tar zxf -
cd ior-3.3.0
```

### Preparing IOR

Now we need to make sure that the necessary dependencies are loaded by running `./configure`.

Trying to run it results in
```
checking for mpicc... no
checking for mpixlc_r... no
...
configure: error: in `/home6/paddison/ior-3.3.0':
configure: error: MPI compiler requested, but could not use MPI.
```

Which I think I can fix by running a `module load mpi...`. First let's check what mpi modules are available using `module avail mpi`. Alright, I'll try `module load mpi-sgi`.

Let's try the configure script again.

Sweet! It worked fully this time! So we know that we'll need to ==**module load mpi-sgi**==.

Now I can run `make`. Seems to have worked fine.

I cannot run `make install` at the moment because I don't have permission to install the binary to `/usr/local/bin` -- but I can change the installation path (via `--prefix`) when running `./configure`. Not necessary though, I can just run the binary from `src/` directly.

Alright. I honestly think that all other dependencies are met. I suppose it's time to run a PBS job! I've written the following `minimal-ior.sh` file:
```bash
#!/bin/sh
#PBS -q devel
#PBS -l select=1:ncpus=8:mpiprocs=8:mem=2gb:model=san
#PBS -N MinimalIOR

module load mpi-sgi

cd "$PBS_O_WORKDIR/ior-3.3.0"

# Should write and read a total of 2gb (8 procs * 16 segments * 16mb blocks)
mpirun -np 8 ./src/ior -t 1m -b 16m -s 16
```
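
As a sanity check on the sizes in that comment, the arithmetic can be done in the shell. The variable names are illustrative, and this assumes IOR's aggregate size is tasks × segments × block size, with the 1 MiB transfer size only determining how many transfers make up each block.

```bash
procs=8          # mpirun -np 8
segments=16      # ior -s 16
block_mib=16     # ior -b 16m
transfer_mib=1   # ior -t 1m

# Aggregate data written (and then read back) across all processes.
total_mib=$((procs * segments * block_mib))
# Number of 1 MiB transfers needed to fill one 16 MiB block.
transfers_per_block=$((block_mib / transfer_mib))

echo "total: ${total_mib} MiB ($((total_mib / 1024)) GiB)"
echo "transfers per block: ${transfers_per_block}"
```

That gives 2048 MiB, i.e. the 2gb mentioned in the script comment.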

Let's try it out! Huh, it complained that I didn't specify the model!? Oh. It was because I had mistyped the shebang line, so PBS probably treated it as the first executable line and didn't read any of the `#PBS` directives.

```bash
qsub minimal-ior.sh
```
Output: `10799119.pbspl1.nas.nasa.gov`, and running `qstat` shows the job move from Queued, to Running, to Exiting.
<details>
<summary>qstat</summary>

```
paddison@pfe26:~> qstat -u paddison
Req'd Elap
JobID User Queue Jobname TSK Nds wallt S wallt Eff
--------------- -------- ----- ---------- --- --- ----- - ----- ---
10799119.pbspl1 paddison devel MinimalIOR 8 1 02:00 Q 00:01 --
paddison@pfe26:~> qstat -u paddison
Req'd Elap
JobID User Queue Jobname TSK Nds wallt S wallt Eff
--------------- -------- ----- ---------- --- --- ----- - ----- ---
10799119.pbspl1 paddison devel MinimalIOR 8 1 02:00 R 00:00 4%
paddison@pfe26:~> qstat -u paddison
Req'd Elap
JobID User Queue Jobname TSK Nds wallt S wallt Eff
--------------- -------- ----- ---------- --- --- ----- - ----- ---
10799119.pbspl1 paddison devel MinimalIOR 8 1 02:00 E 00:01 4%
```

</details>

### Dependency issues

Unfortunately, the `e` file contained `/var/spool/pbs/mom_priv/jobs/10799119.pbspl1.nas.nasa.gov.SC: line 11: mpirun: command not found`. Looks like our module load didn't give us the mpirun command. Hmmmmm.

Sure enough, on the pfe I can see an mpiexec command, but no mpirun command. I seem to be able to access this command by ==**module load mpi-hpcx**==.

Let's try adding that module and run the job again.
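
Since `./configure` succeeding didn't guarantee that `mpirun` exists at run time, a job script could check for its commands up front and fail with a clearer message. A minimal sketch; `require_commands` is a hypothetical helper, not something provided by PBS or the module system.

```bash
# Hypothetical helper: report every named command that is not on PATH.
# Returns 0 if all commands were found, 1 otherwise.
require_commands() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing command: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

In the job script this could go right after the `module load` lines, e.g. `require_commands mpirun || exit 1`.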

Also, this is pretty handy: `watch qstat -u paddison`

Alright this time we got: `mpirun: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory`. A quick look at Stack Exchange shows that this is an Intel math library. There are some `comp-intel` modules available, but a `module help comp-intel` shows only libfftw files... still, I will try it.

Ah! I can test `mpirun` (without any arguments so it won't actually do anything) on pfe, that way I can check if it complains about dependencies. Sure enough it does complain about missing libimf. Fortunately, after a ==**module load comp-intel**== it no longer complains!
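
The missing-library check can also be made without running `mpirun` at all, by asking `ldd` which shared libraries fail to resolve. A sketch, assuming `ldd` is available on the pfe nodes; `missing_libs` is a made-up helper name.

```bash
# Hypothetical helper: print the shared libraries that the dynamic linker
# cannot find for a given command. Prints nothing if everything resolves.
# Returns 2 if the command itself is not on PATH.
missing_libs() {
    path=$(command -v "$1") || { echo "command not found: $1" >&2; return 2; }
    # ldd marks unresolved libraries with "not found".
    ldd "$path" 2>/dev/null | grep "not found"
}
```

Running `missing_libs mpirun` before and after `module load comp-intel` would presumably show the `libimf.so` line going away.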

Let's try this in a PBS job again.

<details>
<summary>Output</summary>

MinimalIOR.e10799410
```
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node r327i7n6 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
```

MinimalIOR.o10799410
```
Job 10799410.pbspl1.nas.nasa.gov started on Mon Mar 22 15:52:46 PDT 2021
The job requested the following resources:
mem=2gb
ncpus=8
place=scatter:excl
walltime=02:00:00

PBS set the following environment variables:
FORT_BUFFERED = 1
TZ = PST8PDT

On r327i7n6:
Current directory is /home6/paddison

____________________________________________________________________
Job Resource Usage Summary for 10799410.pbspl1.nas.nasa.gov

CPU Time Used : 00:00:04
Real Memory Used : 2280kb
Walltime Used : 00:00:04
Exit Status : 139

Memory Requested : 2gb
Number of CPUs Requested : 8
Walltime Requested : 02:00:00

Execution Queue : devel
Charged To : g1119

Job Stopped : Mon Mar 22 15:52:58 2021
____________________________________________________________________
```

</details>

Hmmm, so it didn't work fully, but it didn't not work at all at least :')
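
The `Exit Status : 139` in the `o` file is consistent with the segfault reported in the `e` file: shells conventionally report 128 plus the signal number when a command is killed by a signal, and 139 - 128 = 11. A small sketch of that decoding (`describe_exit_status` is an illustrative name, and it assumes the status follows that shell convention):

```bash
# Hypothetical helper: turn a shell-style exit status into a readable note.
describe_exit_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # kill -l maps a signal number to its name.
        echo "killed by signal $sig ($(kill -l "$sig"))"
    else
        echo "exited with code $status"
    fi
}

describe_exit_status 139   # reports signal 11
```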

> ###### [ea7d869](ssh///https://gitlab-ssh.nautilus.optiputer.net/30622/parkeraddison/filesystem-benchmarks/commit/ea7d869f2b97bc0607e9105c1b2dd5f853aa255c)
> "Minimal IOR test almost capable of running on Pleiades. Faced segfault" | HEAD -> main | 2021-03-22