| ... | ... | @@ -667,3 +667,73 @@ Hmmm, so it didn't work fully, but it didn't not work at all at least :') |
|
|
|
|
|
|
|
> ###### [ea7d869](ssh///https://gitlab-ssh.nautilus.optiputer.net/30622/parkeraddison/filesystem-benchmarks/commit/ea7d869f2b97bc0607e9105c1b2dd5f853aa255c)
|
|
|
|
> "Minimal IOR test almost capable of running on Pleiades. Faced segfault" | HEAD -> main | 2021-03-22
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
###### 03/23
|
|
|
|
|
|
|
|
## Nautilus namespace and IOR units
|
|
|
|
|
|
|
|
Just went ahead and created a `usra-hpc` namespace on Nautilus, set up a larger volume and new deployment to test out IOR over there. I checked the file sizes and sure enough they're all me**bi**bytes and whatnot. So I was correct before that a command of `mpirun -np 8 ior -t 1m -b 16m -s 16` does in fact produce an aggregate file size of 2 GiB (8 tasks × 16 segments × 16 MiB blocks) -- actually, the IOR output says this pretty nicely!
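As a quick sanity check on that arithmetic, here is a minimal sketch of the size calculation. The function name `aggregate_file_size` is made up for illustration (it is not part of IOR); it assumes each task writes one block per segment and that the transfer size (`-t`) only changes how each block is chunked, not the total.

```python
# Binary (IEC) units, matching the mebibyte sizes observed on disk.
MIB = 1024 ** 2
GIB = 1024 ** 3


def aggregate_file_size(num_tasks, block_size, segment_count):
    """Total bytes written, assuming one block per task per segment."""
    return num_tasks * block_size * segment_count


# mpirun -np 8 ior -t 1m -b 16m -s 16
total = aggregate_file_size(num_tasks=8, block_size=16 * MIB, segment_count=16)
print(total / GIB)  # 2.0
```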
|
|
|
|
|
|
|
|
Also worth noting that once `make install` is run (this is done already in the images I set up, e.g. parkeraddison/io500), `ior` can be run from any directory and writes its test file into whatever directory it is run from -- so to test a volume I merely need to navigate to its mount point (here `/storage`) and run `ior` there.
|
|
|
|
|
|
|
|
Finally (and most relevant right now), I did *not* see any segfault errors when I ran it on Nautilus. Let's try it again on Pleiades.
|
|
|
|
|
|
|
|
## Re-trying IOR on Pleiades
|
|
|
|
|
|
|
|
I'm going to modify the minimal script to simply run `ior` without any arguments -- this writes/reads only one mebibyte of data and is practically instant. It's truly minimal!
|
|
|
|
|
|
|
|
Same message as before.
|
|
|
|
|
|
|
|
Perhaps:
|
|
|
|
- https://github.com/hpc/ior/issues/47
|
|
|
|
|
|
|
|
Or... the error output says "Per user-direction, the job has been aborted" -- this sounds like maybe the PBS job was aborted because it saw a non-zero exit code. Is there some way to specify that I don't want the job aborted?
|
|
|
|
|
|
|
|
### Debugging in interactive mode
|
|
|
|
|
|
|
|
To make figuring this out easier, we can run the PBS job in interactive mode! In k8s terms, this is basically like exec'ing into a shell on a compute node. Running `qsub -I minimal-ior.sh` requests the resources specified by the script's PBS directives, then attaches the terminal, so I can run each line of the script manually.
|
|
|
|
|
|
|
|
After loading the mpi-sgi, mpi-hpcx, and comp-intel modules, here's what running ior shows:
|
|
|
|
|
|
|
|
```
PBS r301i5n3:~> cd ior-3.3.0/
PBS r301i5n3:~/ior-3.3.0> ./src/ior
[r301i5n3:20237:0:20237] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xe5)
==== backtrace ====
0 /nasa/hpcx/2.4.0_mt/ucx/install/lib/libucs.so.0(+0x1d98c) [0x2aaabb7a498c]
1 /nasa/hpcx/2.4.0_mt/ucx/install/lib/libucs.so.0(+0x1dbfb) [0x2aaabb7a4bfb]
2 /nasa/hpcx/2.4.0_mt/ompi-mt-icc/lib/libmpi.so(MPI_Comm_rank+0) [0x2aaaab668e00]
3 ./src/ior() [0x40d58c]
4 /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaab935a35]
5 ./src/ior() [0x403209]
===================
Segmentation fault (core dumped)
```
|
|
|
|
```
PBS r301i5n3:~/ior-3.3.0/src> mpirun -n 1 ./ior
[r301i5n3:20556:0:20556] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xe5)
==== backtrace ====
0 /nasa/hpcx/2.4.0_mt/ucx/install/lib/libucs.so.0(+0x1d98c) [0x2aaabb7a498c]
1 /nasa/hpcx/2.4.0_mt/ucx/install/lib/libucs.so.0(+0x1dbfb) [0x2aaabb7a4bfb]
2 /nasa/hpcx/2.4.0_mt/ompi-mt-icc/lib/libmpi.so(MPI_Comm_rank+0) [0x2aaaab668e00]
3 ./ior() [0x40d58c]
4 /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaab935a35]
5 ./ior() [0x403209]
===================
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node r301i5n3 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
```
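To make it easier to see which shared objects show up in the backtrace, here is a rough sketch that splits each frame into its object path and symbol. The regex and the `parse_frames` helper are my own illustration, written against the frame format printed above; other UCX versions may print frames differently.

```python
import re

# Frames copied from the mpirun backtrace above.
BACKTRACE = """\
0 /nasa/hpcx/2.4.0_mt/ucx/install/lib/libucs.so.0(+0x1d98c) [0x2aaabb7a498c]
1 /nasa/hpcx/2.4.0_mt/ucx/install/lib/libucs.so.0(+0x1dbfb) [0x2aaabb7a4bfb]
2 /nasa/hpcx/2.4.0_mt/ompi-mt-icc/lib/libmpi.so(MPI_Comm_rank+0) [0x2aaaab668e00]
3 ./ior() [0x40d58c]
4 /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaab935a35]
5 ./ior() [0x403209]
"""

# index, object path, "(symbol+offset)", "[address]"
FRAME = re.compile(r"^\s*(\d+)\s+(\S+?)\(([^)]*)\)\s+\[(0x[0-9a-f]+)\]$")


def parse_frames(text):
    """Return (index, object, symbol, address) for each recognised frame line."""
    frames = []
    for line in text.splitlines():
        match = FRAME.match(line)
        if match:
            index, obj, symbol, addr = match.groups()
            frames.append((int(index), obj, symbol, addr))
    return frames


frames = parse_frames(BACKTRACE)
for index, obj, symbol, _ in frames:
    print(index, obj, symbol or "?")
```

Read this way, the frames sit in the HPC-X UCX library (`libucs`), then `MPI_Comm_rank` in the HPC-X Open MPI `libmpi.so`, then `ior`.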
|
|
|
|
|
|
|
|
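Looks like the abort message comes from mpirun, not PBS -- but the truth of the matter is *we are getting a segfault from the `ior` process itself*, even when it is run without mpirun. The backtrace puts the fault beneath ior's call to `MPI_Comm_rank`. Now we just need to figure out why!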
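For telling "the process was killed by a signal" apart from "the process exited with a non-zero status" in a wrapper script, here is a small sketch using Python's `subprocess`. The child below is a stand-in that sends itself SIGSEGV (it is not `ior`), and `describe_exit` is a hypothetical helper. On POSIX, `subprocess` reports a signal death as a negative return code.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Translate a subprocess return code into a readable description (POSIX)."""
    if returncode < 0:
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


# Stand-in for `./src/ior`: a child process that sends itself SIGSEGV.
child = [sys.executable, "-c",
         "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]
result = subprocess.run(child)
print(describe_exit(result.returncode))
```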
|
|
|
|
|
|
|
|
|
|
|
|
> ###### [c1a4492](ssh///https://gitlab-ssh.nautilus.optiputer.net/30622/parkeraddison/filesystem-benchmarks/commit/c1a44925fbad444c7a84e054f828a0cb6e6c52a5)
|
|
|
|
> "More minimal testing on Pleiades" | HEAD -> main | 2021-03-23
|
|
|
|
|