Getting started with MPI on BU ENG Grid
For details about how to run MPI jobs in general, read the "FAQ: Running MPI jobs" page (http://www.open-mpi.org/faq/?category=running).
Selecting the MPI version
We use a tool called "mpi-selector" to select which version of MPI to use. If you have never set it up on your account, run "mpi-selector-menu", choose openmpi-1.4-gcc-x86_64 (option 2), and make it your "user" default (hit "u"). The next time you log in, "mpirun" will point to the correct version on any node you log into (or on any node the Grid automatically logs you into).
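To confirm that the selection took effect, something like the following can be run after logging back in. This is a minimal sketch: the function name show_mpi_selection is made up for illustration, and it assumes the installed mpi-selector supports the --list and --query options.

```shell
# Report which MPI installation the current shell will use.
# show_mpi_selection is a hypothetical helper name, not part of mpi-selector.
show_mpi_selection() {
    if command -v mpi-selector >/dev/null 2>&1; then
        mpi-selector --list      # MPI installations registered on this node
        mpi-selector --query     # current default, if one has been set
    else
        echo "mpi-selector not found on this node"
    fi
    # After a fresh login, mpirun should resolve to the selected installation.
    command -v mpirun || echo "mpirun is not on PATH yet; log out and back in"
}

show_mpi_selection
```

If the path printed for mpirun does not belong to the openmpi-1.4-gcc-x86_64 installation, the new default has probably not been picked up by the current login session yet.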
Getting Kerberos tickets
To run an MPI program on bungee.q or any other queue on the ENG-Grid, the easiest approach is to get Kerberos tickets so that all nodes hold tickets during the run and OpenMPI can use its default SSH-key-based transport method. Note that you must have an SSH key in your .ssh/authorized_keys, and the key should be protected with a passphrase -- please do not use passwordless keys! If you haven't done this before, create a key with "ssh-keygen" and then append the .pub file to your .ssh/authorized_keys file (appending rather than copying keeps any keys that are already there):
-bash-3.2$ ssh-keygen
Generating public/private rsa key pair.
-bash-3.2$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
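A slightly more defensive way to install the public key is sketched below. The function name install_pubkey is hypothetical. It appends the key only if it is not already present, so existing keys are preserved and repeated runs do not create duplicates, and it tightens the file permissions.

```shell
# install_pubkey PUBFILE [AUTHFILE]
# Append PUBFILE to AUTHFILE unless an identical line is already there.
# (install_pubkey is a name invented for this sketch.)
install_pubkey() {
    pub=$1
    auth=${2:-$HOME/.ssh/authorized_keys}
    [ -r "$pub" ] || { echo "cannot read $pub" >&2; return 1; }
    mkdir -p "$(dirname "$auth")"
    touch "$auth"
    # -x: match whole lines, -F: fixed strings, -f: take patterns from the key file
    if ! grep -qxFf "$pub" "$auth"; then
        cat "$pub" >> "$auth"
    fi
    # sshd's StrictModes check rejects key files writable by group or others
    chmod 600 "$auth"
}
```

Typical use would be "install_pubkey ~/.ssh/id_rsa.pub", which targets ~/.ssh/authorized_keys by default.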
Now, we can get tickets on all nodes in preparation for qsub'bing an MPI job.
Here, we are logged into the head node "bungee", and we copy our Kerberos tickets from that node to all of the bungee nodes, bungee01 through bungee16:
bungee:~$ /mnt/nokrb/sge/etc/gridtickets -R bungee,01,16
Enter your kerberos password:
Now, if you do not already have an SSH agent running in your shell with your key unlocked, you should start one:
bungee:~$ ssh-agent bash
bungee:~$ ssh-add
Enter passphrase for /home/username/.ssh/id_rsa:
Running the job script
Now, we can run any MPI program we choose on any queue that has the "mpi" parallel environment configured, and Grid Engine will dynamically allocate hosts to the PE. There are many ways to invoke this and many options that you can pass, but here's a simple example:
bungee:/mnt/nokrb/username/MPI$ qsub -q bungee.q mpi-example.sh
where mpi-example.sh is this:
#!/bin/bash
#$ -S /bin/sh
#$ -cwd
#$ -pe mpi 4
hostname
date
mpirun -np $NSLOTS hostname
Ethernet vs. InfiniBand
On the ENG-Grid, some queues have InfiniBand (currently only bungee.q), but others (such as budge.q) do not. On those queues, OpenMPI will default to the vmnet (virtual network) interfaces instead of falling back to Ethernet. To explicitly tell OpenMPI to use only eth0, and therefore NOT the vmnet interfaces, add the "--mca btl_tcp_if_include eth0" switch to the mpirun line in your qsub script, as below (more detail at http://www.open-mpi.org/faq/?category=tcp#tcp-selection ):
mpirun --mca btl_tcp_if_include eth0 ...
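Open MPI also reads MCA parameters from environment variables named OMPI_MCA_<parameter>, so the same restriction can be applied without editing every mpirun line. The wrapper below is a sketch of that idea; mpirun_eth0 and MPI_IFACE are names made up for illustration, and eth0 is assumed to be the Ethernet interface on the nodes, as in the text above.

```shell
# Run mpirun with Open MPI's TCP transport restricted to one interface.
# mpirun_eth0 and MPI_IFACE are hypothetical names; the default is eth0.
mpirun_eth0() {
    # Equivalent to passing "--mca btl_tcp_if_include <iface>" on the command line
    OMPI_MCA_btl_tcp_if_include="${MPI_IFACE:-eth0}" mpirun "$@"
}
```

Inside a job script, the function would be defined near the top and then called as "mpirun_eth0 -np $NSLOTS hostname".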
If you are using GPUs in your MPI job, note that the "-l gpu=#" complex is allocated slotwise, not jobwise! So if you specify "-pe mpi 8 -l gpu=1" in your job, the system will allocate one GPU per CPU slot -- so a total of 8 CPUs and 8 GPUs for the job. This makes things tricky if you wish to allocate more GPUs than CPU slots in an MPI job. A newer version of Grid Engine, to be installed soon, will allow jobwise GPU allocation.
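The slotwise arithmetic can be checked before submitting. The helper below is only an illustration of the multiplication described above; total_gpus is a made-up name, not a Grid Engine command.

```shell
# total_gpus SLOTS GPUS_PER_SLOT
# With a slotwise complex, the job receives SLOTS * GPUS_PER_SLOT GPUs in total.
total_gpus() {
    echo $(( $1 * $2 ))
}

total_gpus 8 1   # "-pe mpi 8 -l gpu=1" -> 8
total_gpus 4 1   # "-pe mpi 4 -l gpu=1" -> 4
```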
Monitoring the job
Note that we specified the parallel environment (PE) "mpi", with 4 slots. There are two useful PEs: mpi and openmpi. The mpi parallel environment uses the allocation rule 'fill_up', which fills all available slots on one host before moving on to the next host, until the requested number of slots has been allocated. If your job requires a significant amount of communication between processes, this environment may be advantageous, because it keeps more of the processes on the same host. The openmpi environment uses the 'round_robin' allocation rule, which takes one slot from each host in turn, and is ideal for loosely coupled MPI jobs (where inter-node communication is at a minimum).
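To see how a given PE actually distributed the slots, a job script can print the hostfile that Grid Engine writes for each parallel job. This sketch assumes the usual hostfile layout, in which each line starts with a host name followed by the number of slots granted on that host; slots_per_host is a hypothetical helper name.

```shell
# slots_per_host [HOSTFILE]
# Print "host slots" for each line of a Grid Engine PE hostfile.
# Defaults to $PE_HOSTFILE, which Grid Engine sets inside parallel jobs.
slots_per_host() {
    awk '{ print $1, $2 }' "${1:-$PE_HOSTFILE}"
}
```

Calling slots_per_host just before the mpirun line records the layout in the job's output file. With fill_up you would expect few hosts with many slots each, and with round_robin more hosts with fewer slots each.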
Use "qstat" (or qmon) to see the job waiting, then running, and once it's finished, you should have several files in your output directory, including a .o "output" file that looks something like this:
bungee:/mnt/nokrb/username/MPI$ more mpi-example.sh.o1886509
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
bungee03
Wed Apr 27 04:35:25 EDT 2011
bungee03
bungee01
bungee04
bungee02
Don't worry about the "no job control" warning; it is harmless. Notice that the first hostname and the date were printed by the part of our script running on the node that Grid Engine designated as the MPI master (bungee03), and the other four hostnames were printed by mpirun's invocation of "hostname" on each of the 4 slots, one of which is on the master node itself.
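Instead of re-running qstat by hand, a small polling loop can wait for the job to leave the queue. This is a sketch under one assumption that should be verified on your installation: that "qstat -j JOBID" exits with a non-zero status once Grid Engine no longer knows about the job. The name wait_for_job is made up for illustration.

```shell
# wait_for_job JOBID [INTERVAL_SECONDS]
# Poll until Grid Engine no longer reports JOBID, then print a message.
wait_for_job() {
    jobid=$1
    interval=${2:-30}
    # Assumption: "qstat -j" fails once the job is neither queued nor running.
    while qstat -j "$jobid" >/dev/null 2>&1; do
        sleep "$interval"
    done
    echo "job $jobid is no longer queued or running"
}
```

For the run shown above, this would be called as "wait_for_job 1886509", after which the .o file can be inspected.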
NOTE WELL: With the Kerberos tickets copied to the nodes, you could in principle use Kerberized directories in your grid submissions. However, you should avoid doing this, and instead still use the -cwd switch and your /mnt/nokrb home directory as usual, because if your job runs longer than your tickets last, it will lose access to those directories. Note also that if the queue is full and your job waits in the queue longer than it takes for your Kerberos tickets to expire, your job will fail once it starts running, because it can't find your SSH keys in your Kerberized home directory. There are other tricks we could use to get around this, such as RSH transport, but they shouldn't be necessary, because nothing should need to wait that long in bungee.q.
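Because expired tickets only show up as a failure later, it can help to check them at submission time. The wrapper below is a sketch, with submit_with_tickets as a hypothetical name. It relies on "klist -s", which prints nothing and reports through its exit status whether the credential cache holds valid tickets. It checks only the host you submit from, not the tickets copied to the compute nodes.

```shell
# submit_with_tickets QSUB_ARGS...
# Refuse to call qsub when the local Kerberos ticket cache is missing or expired.
submit_with_tickets() {
    # klist -s is silent; exit status 0 means valid tickets are present
    if ! klist -s; then
        echo "no valid Kerberos tickets; get tickets first" >&2
        return 1
    fi
    klist        # show expiry times so they can be compared with the expected run time
    qsub "$@"
}
```

Usage mirrors plain qsub, for example "submit_with_tickets -q bungee.q mpi-example.sh".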