Parallel processing

Running a job in parallel is a great way to utilize the power of the cluster. So what is a parallel job/workflow?

  • Loosely-coupled jobs (sometimes referred to as embarrassingly or naively parallel jobs) are processes in which multiple instances of the same program execute on multiple data files simultaneously, with each instance running independently from others on its own allocated resources (i.e. CPUs and memory). Slurm job arrays offer a simple mechanism for achieving this.
  • Multi-threaded programs that include explicit support for shared memory processing via multiple threads of execution (e.g. POSIX Threads or OpenMP) running across multiple CPU cores.
  • Distributed memory programs that include explicit support for message passing between processes (e.g. MPI). These processes execute across multiple CPU cores and/or nodes and are often referred to as tightly-coupled jobs.
  • GPU (graphics processing unit) programs that include explicit support for offloading to the device via languages like CUDA or OpenCL.

It is important to understand the capabilities and limitations of an application in order to fully leverage the parallel processing options available on the cluster. For instance, many popular scientific computing languages like Python, R, and Matlab now offer packages that allow for GPU, multi-core or multithreaded processing, especially for matrix and vector operations. If you need help with the design of your parallel workflow, send us a message in the raapoi-help Slack channel.

Job Array Example

Here is an example of running a job array to run 50 simultaneous processes:

sbatch array.sh

The contents of the array.sh batch script look like this:

  #!/bin/bash
  #SBATCH -a 1-50
  #SBATCH --cpus-per-task=1
  #SBATCH --mem-per-cpu=2G
  #SBATCH --time=00:10:00
  #SBATCH --partition=parallel
  #SBATCH --mail-type=BEGIN,END,FAIL
  #SBATCH --mail-user=me@email.com

  module load fastqc/0.11.7

  fastqc --nano -o $TMPDIR/output_dir seqfile_${SLURM_ARRAY_TASK_ID}

So what do these parameters mean?

  • -a sets this up as a parallel array job (this sets up the "loop" of task IDs, 1-50, that will be run)
  • --cpus-per-task requests the number of CPUs per array task; in this case I just want one CPU per task, and we will use 50 in total
  • --mem-per-cpu requests 2GB of RAM per CPU; for this parallel job I will request a total of 100GB of RAM (50 CPUs * 2GB RAM)
  • --time is the max run time for this job, 10 minutes in this case
  • --partition assigns this job to a partition
  • module load fastqc/0.11.7: loads software into my environment, in this case fastqc
  • fastqc --nano -o $TMPDIR/output_dir seqfile_${SLURM_ARRAY_TASK_ID}: runs fastqc on each input data file with the filenames seqfile_1, seqfile_2 ... seqfile_50 (see the sketch after this list for inputs that do not follow this naming pattern)
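
If your input files do not follow a simple numeric naming scheme, one common pattern is to list them in a text file and use the array task ID to pick a line. Below is a minimal sketch of that approach; the file list inputs.txt is a hypothetical name used here for illustration, not part of the original example:

  #!/bin/bash
  #SBATCH -a 1-50
  #SBATCH --cpus-per-task=1
  #SBATCH --mem-per-cpu=2G
  #SBATCH --time=00:10:00
  #SBATCH --partition=parallel

  module load fastqc/0.11.7

  # Pick the Nth line of inputs.txt, where N is this task's array index
  INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" inputs.txt)

  fastqc --nano -o $TMPDIR/output_dir "$INPUT"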

Submitting the array.sh script will cause the Slurm manager to spawn 50 parallel jobs.
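
You can keep an eye on the individual array tasks while they run. squeue lists each task as JOBID_TASKID, and scancel accepts the same notation; the job ID 123456 below is just a placeholder for whatever ID sbatch reports:

  # List your queued and running jobs; array tasks appear as JOBID_TASKID (e.g. 123456_7)
  squeue -u $USER

  # Cancel a single task, or a range of tasks, from the array
  scancel 123456_7
  scancel 123456_[10-20]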

Multi-threaded or Multi-processing Job Example

Multi-threaded or multi-processing programs are applications that are able to execute in parallel across multiple CPU cores within a single node using a shared memory execution model. In general, a multi-threaded application uses a single process (aka “task” in Slurm) which then spawns multiple threads of execution. By default, Slurm allocates 1 CPU core per task. In order to make use of multiple CPU cores in a multi-threaded program, one must include the --cpus-per-task option. Below is an example of a multi-threaded program requesting 12 CPU cores per task and a total of 256GB of memory. The program itself is responsible for spawning the appropriate number of threads.

  #!/bin/bash
  #SBATCH --nodes=1
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=12 # 12 threads per task
  #SBATCH --time=02:00:00 # two hours
  #SBATCH --mem=256G
  #SBATCH -p bigmem
  #SBATCH --output=threaded.out
  #SBATCH --job-name=threaded
  #SBATCH --mail-type=BEGIN,END,FAIL
  #SBATCH --mail-user=me@email.com
  # Run multi-threaded application
  module load java/1.8.0-91
  java -jar threaded-app.jar
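
Many multi-threaded programs let you set the thread count explicitly rather than relying on autodetection. Slurm exports the value of --cpus-per-task as SLURM_CPUS_PER_TASK, so for an OpenMP program a sketch of passing it through might look like this (my_openmp_app is a hypothetical stand-in for your own binary):

  # Tell the OpenMP runtime to use exactly the CPUs Slurm allocated to this task
  export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

  # Hypothetical multi-threaded application
  ./my_openmp_app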

MPI Jobs

Most users do not require MPI to run their jobs, but many do. Please read on if you want to learn more about using MPI for tightly-coupled jobs.

MPI (Message Passing Interface) code requires special attention within Slurm. Slurm allocates and launches MPI jobs differently depending on the version of MPI used (e.g. OpenMPI, MPICH2). We recommend using OpenMPI version 2.1.1 or later to compile your C code (using mpicc) and then using the mpirun command in a batch submit script to launch parallel MPI jobs.
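
As a rough sketch, the compile step might look like the following; mpiscript.c is a hypothetical source file chosen to match the binary name used below, and the exact module version to load depends on what is available on the cluster:

  module load openmpi
  mpicc -O2 -o mpiscript.o mpiscript.c

The example below runs MPI code compiled by OpenMPI 2.1.1: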

  #!/bin/bash
  #SBATCH --nodes=3
  #SBATCH --tasks-per-node=8 # 8 MPI processes per node
  #SBATCH --time=3-00:00:00
  #SBATCH --mem=4G # 4 GB RAM per node
  #SBATCH --output=mpi_job.log
  #SBATCH --partition=parallel
  #SBATCH --constraint="IB"
  #SBATCH --mail-type=BEGIN,END,FAIL
  #SBATCH --mail-user=me@email.com

  module load openmpi
  echo $SLURM_JOB_NODELIST
  mpirun -np 24 mpiscript.o

This example requests 3 nodes and 8 tasks (i.e. processes) per node, for a total of 24 tasks. I use this number to tell mpirun how many processes to start: -np 24.
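
To avoid hard-coding the process count, you can often rely on Slurm's environment instead. This is only a sketch; whether SLURM_NTASKS is populated can depend on how the task options were specified, so check it with echo before relying on it:

  # Print what Slurm reports as the total task count for this allocation
  echo $SLURM_NTASKS

  # Use it in place of a hard-coded -np value
  mpirun -np $SLURM_NTASKS mpiscript.o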

NOTE: We highly recommend adding the --constraint="IB" parameter to your MPI job, as this will ensure the job runs on nodes with an InfiniBand interconnect.

ALSO NOTE: If using Python or another language, you will also need to add the --oversubscribe parameter to mpirun, e.g.

mpirun --oversubscribe -np 24 mpiscript.py

More information about running MPI jobs within Slurm can be found here: http://slurm.schedmd.com/mpi_guide.html.