Moderating a Slurm Cluster¶
This is a list of common cluster moderator actions, provided as a reference. Users without moderator privileges might find some of it of interest, but won't be able to perform the actions that affect other users.
These commands require you to be logged in with your moderator-specific account.
Dealing with normal jobs¶
Extending jobs¶
Jobs are normally given a limit of 10 days to run. If a little longer is needed and there is no reason not to (such as upcoming maintenance), jobs can be given a bit longer to complete:
# To set a new total run time for a job of 13 days, 23 hours, 59 minutes and 59 seconds
scontrol update jobid=<jobid> TimeLimit=13-23:59:59
# Extend the job by 3 days, 23 hours, 59 minutes and 59 seconds (the + extends the time)
scontrol update jobid=<jobid> TimeLimit=+3-23:59:59
Dealing with badly behaved jobs¶
Holding jobs¶
Users will occasionally run jobs which consume an unfair amount of resources. If a single user is causing problems, you can hold their jobs. This won't stop their currently running jobs, but will prevent more from starting.
# hold some jobs
scontrol hold jobid1,jobid2,etc
# Allow the jobs back onto the queue
scontrol requeue jobid1,jobid2,etc ## previous step sets priority to zero \
## so they won't actually start now
# Release the jobs to run again
scontrol release jobid1,jobid2,etc
Alternatively, you can reduce the priority of their jobs to a low setting.
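For example, a sketch using scontrol's nice adjustment (the value here is arbitrary; a larger nice value means lower priority):
# lower the priority of a job so other work schedules ahead of it
scontrol update jobid=<jobid> nice=10000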
Cancelling jobs¶
If a user's jobs are causing too many problems, you can cancel their jobs. Note that this is drastic and can throw away many days of compute, so it's best to try to get hold of the user first and have them cancel their own jobs.
If needed though:
scancel <jobid> # be careful to get the correct job id!
# to cancel all their running jobs on parallel
squeue -p parallel -u <username> -t running --format "scancel %i" | sh
Limiting a user's resource allowance on Rāpoi¶
Set maxjobs¶
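You can cap how many jobs a user may run at once by setting MaxJobs on their association. A sketch, limiting the example user bob to 2 running jobs:
# limit bob to 2 running jobs at a time
sacctmgr modify user bob set maxjobs=2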
After a few minutes you should be able to see the results with squeue:
squeue -u bob -o "%i %r"
# returns something like
JOBID REASON
20582 AssocMaxJobsLimit
20583 Dependency
Show User CPU restriction details¶
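You can inspect the limits currently set on a user's association with sacctmgr. A sketch for the example user bob (the format fields shown are optional):
sacctmgr list assoc User=bob format=User,Account,MaxJobs,GrpTRES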
Limiting CPU resources¶
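CPU limits are applied through GrpTRES on the user's association, in the same way as the GPU example further down. A sketch limiting the example user bob to 40 CPUs in total:
sacctmgr modify user bob set GrpTRES=cpu=40
# set cpu=-1 later to remove the restriction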
Limiting Memory (RAM) resources¶
This is provided for reference, but note that it may have unintended consequences. Please consult the other moderators on Slack before proceeding.
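A sketch for the example user bob (memory values here are specified in MB):
sacctmgr modify user bob set GrpTRES=mem=256000  # about 256 GB in total; mem=-1 removes the restriction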
Limiting GPU resources¶
sacctmgr modify user bob set GrpTRES=cpu=-1,mem=-1,gres/gpu=4 # -1 means no restriction.
# check result
sacctmgr list assoc User=bob
Using reservations¶
If a research group has a genuine need and the other moderators agree, you can give them a reservation that only they can use. This is usually done for a specific time period. Creating a reservation is also one of the steps when we put the cluster into maintenance.
Create a month-long reservation on amd01n01 and amd01n02:
scontrol create reservationname=MyReservation starttime=2021-03-01T11:00:00 \
duration=30-00:00:00 user=user1,user2,user3 nodes=amd01n01,amd01n02
Users can then use the reservation by adding its name to their job submissions, for example:
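A sketch, using the reservation name from above with a hypothetical job script:
# on the command line
sbatch --reservation=MyReservation myjob.sh
# or inside the job script
#SBATCH --reservation=MyReservation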
Building software with EasyBuild¶
Use a terminal multiplexer like screen, tmux or byobu to keep your ssh session alive, and get an interactive session on a node.
Build a simple program¶
Here we will build a simple program called Velvet, a genome assembler.
srun -c 10 --mem=10G -p parallel --time=6:00:00 --pty bash
# now on the node
module purge # just in case
module load EasyBuild
# Search for the package
eb -S velvet
# Returns
* $CFGS1/v/Velvet/Velvet-1.2.10-GCC-8.3.0-mt-kmer_191.eb
* $CFGS1/v/Velvet/Velvet-1.2.10-GCC-11.2.0-mt-kmer_191.eb
* $CFGS1/v/Velvet/Velvet-1.2.10-foss-2018a-mt-kmer_191.eb
* $CFGS1/v/Velvet/Velvet-1.2.10-foss-2018b-mt-kmer_191.eb
* $CFGS1/v/Velvet/Velvet-1.2.10-intel-2017a-mt-kmer_37.eb
# We want to pick one that won't need to build a whole new toolchain if we can avoid it
# Let's have a look at what would get built with a Dry (D) run. The r is for robot to
# find all the dependencies
eb -Dr Velvet-1.2.10-intel-2017a-mt-kmer_37.eb
# Partial return
* [x] $CFGS/m/M4/M4-1.4.17.eb (module: Core | M4/1.4.17)
* [x] $CFGS/b/Bison/Bison-3.0.4.eb (module: Core | Bison/3.0.4)
* [x] $CFGS/f/flex/flex-2.6.0.eb (module: Core | flex/2.6.0)
* [x] $CFGS/z/zlib/zlib-1.2.8.eb (module: Core | zlib/1.2.8)
* [ ] $CFGS/b/binutils/binutils-2.27.eb (module: Core | binutils/2.27)
* [ ] $CFGS/g/GCCcore/GCCcore-6.3.0.eb (module: Core | GCCcore/6.3.0)
* [ ] $CFGS/z/zlib/zlib-1.2.11-GCCcore-6.3.0.eb
* [ ] $CFGS/h/help2man/help2man-1.47.4-GCCcore-6.3.0.eb
* [ ] $CFGS/m/M4/M4-1.4.18-GCCcore-6.3.0.eb
* [ ] $CFGS/b/Bison/Bison-3.0.4-GCCcore-6.3.0.eb
* [ ] $CFGS/f/flex/flex-2.6.3-GCCcore-6.3.0.eb
* [ ] $CFGS/i/icc/icc-2017.1.132-GCC-6.3.0-2.27.eb
* [ ] $CFGS/i/ifort/ifort-2017.1.132-GCC-6.3.0-2.27.eb
* [ ] $CFGS/i/iccifort/iccifort-2017.1.132-GCC-6.3.0-2.27.eb
* [ ] $CFGS/i/impi/impi-2017.1.132-iccifort-2017.1.132-GCC-6.3.0-2.27.eb
* [ ] $CFGS/i/iimpi/iimpi-2017a.eb (module: Core | iimpi/2017a)
* [ ] $CFGS/i/imkl/imkl-2017.1.132-iimpi-2017a.eb
* [ ] $CFGS/i/intel/intel-2017a.eb (module: Core | intel/2017a)
* [ ] $CFGS/v/Velvet/Velvet-1.2.10-intel-2017a-mt-kmer_37.eb
# Packages with an [x] are already built, [ ] will need to be built. This is a lot of building, including a "new" compiler (intel-2017a.eb), so let's avoid that and try another
eb -Dr Velvet-1.2.10-foss-2018b-mt-kmer_191.eb
# Partially returns, all [x] except for velvet
...
* [x] $CFGS/f/FFTW/FFTW-3.3.8-gompi-2018b.eb
* [x] $CFGS/s/ScaLAPACK/ScaLAPACK-2.0.2-gompi-2018b-OpenBLAS-0.3.1.eb
* [x] $CFGS/f/foss/foss-2018b.eb (module: Core | foss/2018b)
* [ ] $CFGS/v/Velvet/Velvet-1.2.10-foss-2018b-mt-kmer_191.eb
# Before you proceed to build, make sure appropriate folder permissions will be set
umask 0002
# To build this we would
eb -r --parallel=$SLURM_CPUS_PER_TASK Velvet-1.2.10-foss-2018b-mt-kmer_191.eb
# Afterwards, reset the default permissions for anything else you go on to do
umask 0022
Building a new toolchain¶
Below we ask for 10 CPUs, 10G of memory and 6 hours. Really long builds might need more time and/or CPU/memory.
srun -c 10 --mem=10G -p parallel --time=6:00:00 --pty bash
# now on the node
module purge # just in case
module load EasyBuild
# Search for the package
eb -S foss-2022a
# There is a long output as that toolchain gets used in many packages, but we can see:
$CFGS1/f/foss/foss-2022a.eb
# Check what will be built
# BE CAUTIOUS OF OPENMPI builds - the .eb file needs to be changed to use pmi2 rather than pmix each time!
eb -Dr foss-2022a.eb
# Trigger the build - this might take a long time, you could add more cpus or time if needed
# (making sure permissions will be set appropriately, as described above)
umask 0002
eb -r --parallel=$SLURM_CPUS_PER_TASK foss-2022a.eb
umask 0022
Rebuilding an existing package¶
This might be needed for some old packages after the move to Rocky 8.
Below we ask for 10 CPUs, 10G of memory and 6 hours. Really long rebuilds might need more time and/or CPU/memory.
srun -c 10 --mem=10G -p parallel --time=6:00:00 --pty bash
# now on the node
module purge # just in case
module load EasyBuild
# Example of finding the problem
ldd /home/software/apps/samtools/1.10/bin/samtools | grep found
libncursesw.so.5 => not found
libtinfo.so.5 => not found
# See what got built for samtools
eb -Dr /home/software/EasyBuild/ebfiles_repo/SAMtools/SAMtools-1.10-GCC-8.3.0.eb
# There are a few options
# ncurses-6.0.eb
# ncurses-6.1-GCCcore-8.3.0.eb
# let's try one (making sure permissions will be set appropriately, as described above)
umask 0002
eb -r --parallel=10 --rebuild ncurses-6.0.eb
umask 0022
# test once done
Upgrading EasyBuild with EasyBuild¶
Get an interactive session on a node, then
module load EasyBuild # will load latest version by default
eb --version # see version
eb --install-latest-eb-release # upgrade - will create new module file for new version
Building a New Version of the Schrödinger Suite¶
The Schrödinger Suite releases new versions quarterly, and it's good practice to keep up to date with the latest version of the software. To build the new version, first download the tar file from the Schrödinger website (www.schrodinger.com), then move the installation tar file to the directory /home/software/src/Schrodinger on Rāpoi.
Quick Installation¶
First, extract the tar file, change to the top-level directory of the download, and then run the installation script.
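A sketch of these steps; the exact tar file and directory names depend on the release you downloaded:
cd /home/software/src/Schrodinger
tar -xvf <downloaded Schrodinger tar file>
cd <extracted top-level directory>
./INSTALL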
Answer y/n to prompts from the INSTALL script, then all packages should be installed.
NOTE
During installation, you will be asked to confirm the installation directory; this is /home/software/apps/Schrodinger/2023-3, where '2023-3' should be replaced with the version currently being installed. The scratch directory should be /nfs/scratch/projects/tmp.
The installer checks for dependencies in the last stage; missing dependencies will be reported and will need to be installed for the Schrödinger Suite to run properly. Contact the Rāpoi admins to install any missing dependencies.
Modify the hosts file¶
Change directory to the installation folder and open the schrodinger.hosts file with vi, then modify the contents to add hostnames. The hosts and settings can be found in the schrodinger.hosts file from the installation directory of older versions, such as /home/software/apps/Schrodinger/2023-1. Add all the remote hosts to the new hosts file. For example,
Name: parallel
Host: raapoi-login
Queue: SLURM2.1
Qargs: "-p parallel --mem-per-cpu=2G --time=5-00:00 --constraint=AMD"
processors: 1608
tmpdir: /nfs/scratch/projects/tmp
Add new module file¶
Once installation is complete, add a new module file so that the new version can be loaded. Module files for existing Schrodinger versions can be found in /home/software/tools/eb_modulefiles/all/Core/Schrodinger. The module files are named with .lua extensions. Make a new module file by copying one of the older module files, for example:
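A sketch, assuming 2023-1.lua is the most recent existing module file:
cd /home/software/tools/eb_modulefiles/all/Core/Schrodinger
cp 2023-1.lua 2023-3.lua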
Then edit the new module file (in this case, 2023-3.lua) to match the new version installed. Fields that will need to be updated include the Whatis section and the root. For example:
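A minimal sketch of the relevant lines only (copy a real older module file and edit it, rather than writing one from scratch):
-- update the Whatis description for the new version
whatis("Name: Schrodinger")
whatis("Version: 2023-3")
-- point root at the new installation directory
local root = "/home/software/apps/Schrodinger/2023-3"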
You can check that the module has been properly installed with, for example:
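A quick check, assuming the new module is named Schrodinger/2023-3:
module avail Schrodinger         # the new version should be listed
module load Schrodinger/2023-3   # and should load without errors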
Installing non-EasyBuild software for other users¶
Caution:
Before you proceed, ask yourself how likely it is that another user will want to use this software, and whether, even if they do, they might want a different or newer version.
For software which is unlikely to be used by many users, and/or where different users may want to install different versions, it is probably not worth the hassle of installing it centrally (i.e. users can install local copies as needed). This is something you should only attempt if you are confident in using and modifying build scripts and associated tools.
If you decide it is worth the hassle to proceed, you should first carefully examine the entire installation documentation of the desired software.
For any required dependencies, check if modules for these already exist and pre-load them to minimise the amount of stuff you need to compile (otherwise you will need to install any remaining dependencies first, following these same broad steps).
Depending on what's available, this may help inform which compiler toolchain you will use during the installation (typically one of the foss/YEAR<a/b> toolchains).
You'll also need to modify the installation steps so that the software is ultimately installed into /home/software/apps/<software name>/<version number>/, taking care that no existing software is overwritten.
However, before installing software in this directory, it is recommended you install a copy into your home/local directory and test that it works properly first.
Lastly, to allow others to load the software you'll need to write a .lua file to be copied into /home/software/tools/modulefiles/<software name>/<version number>/.
The .lua file contains some basic module information, help information, and instructions on what modules need to be loaded and which environment variables need to be set/modified when the software is loaded.
You should have a good idea of what is required based on the installation steps, but you may also want to examine (and possibly copy and modify) the .lua file for a similar piece of software.