Python users guide

Which versions of Python are working on Rāpoi?

There are a number of versions of Python on Rāpoi, although many of these are old installations (from before an OS update and changes to the module system) and may no longer work. Generally speaking, your best bet is to try a version which appears when you search via module spider Python (noting that the capital 'P' in Python is important here). A few examples of relatively recent versions of Python which are available (as of April 2024) are Python/3.9.5, Python/3.10.8 and Python/3.11.5.

Each of these Python modules has one or more prerequisite modules that need to be loaded first (generally a specific version of the GCC compilers). To find out what you need to load first for a specific version of Python, check the output of module spider Python/x.y.z (with the appropriate values for x, y, z). One of the examples below shows how to use Python/3.9.5. In cases where your Python code needs to interact with software from another module which also requires a specific GCC module, that will dictate which version of Python to load (i.e. whichever one depends on the same GCC version). Otherwise, you are free to use any desired Python module.

The Python installations generally have a minimal number of packages/libraries installed. If you require additional packages/libraries it is recommended to create a virtual environment and install any desired packages within that environment. This is illustrated in the examples below using both virtualenv/pip and anaconda/conda.
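Before building an environment, it can help to check whether a package is already importable from the loaded Python module. The short standard-library sketch below (the helper name find_package is our own, not a cluster tool) reports where a package would be imported from:

```python
import importlib.util

def find_package(name: str):
    """Return the file a package would be imported from, or None if it is absent."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec else None

# json ships with Python, so it should always be found:
print("json:", find_package("json"))
print("webcolors:", find_package("webcolors"))  # None unless already installed
```

If a package comes back as None, that is your cue to create a virtual environment and install it there, as shown in the examples that follow.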

See also: Using Jupyter Notebooks

Simple Python program using virtualenv and pip

First we need to create a working directory and move there

mkdir python_test
cd python_test
Next we load the Python 3 module and use it to create a virtualenv. This way we can install pip packages which are not installed on the cluster.

module load GCCcore/10.3.0
module load Python/3.9.5
python3 -m venv mytest

Activate the mytest virtualenv and use pip to install the webcolors package

source mytest/bin/activate
pip install webcolors
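To confirm that python now resolves to the virtualenv (so pip installs will land there and not in your home directory), you can compare sys.prefix with sys.base_prefix. This is standard Python behaviour, not specific to Rāpoi:

```python
import sys

def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix

print("running inside a virtualenv:", in_virtualenv())
```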

Create the file test.py with the following contents using nano

import webcolors
from random import randint
from socket import gethostname

colour_list = list(webcolors.CSS3_HEX_TO_NAMES.items())
requested_colour = randint(0, len(colour_list) - 1)  # randint is inclusive at both ends
colour_name = colour_list[requested_colour][1]

print("Random colour name:", colour_name, " on host: ", gethostname())

Alternatively download it with wget:

wget https://raw.githubusercontent.com/\
    vuw-research-computing/raapoi-tools/\
    master/examples/python_venv/test.py
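As an aside, random.choice draws a valid element directly, which sidesteps the off-by-one risk of indexing with randint (randint is inclusive at both ends). A minimal sketch with a stand-in list in place of the webcolors data:

```python
from random import choice

colours = ["aliceblue", "antiquewhite", "aqua"]  # stand-in for the webcolors list

# choice() picks an element directly, so there is no index to get wrong
print("Random colour name:", choice(colours))
```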

Using nano create the submission script called python_submit.sh with the following content - change me@email.com to your email address.

#!/bin/bash
#
#SBATCH --job-name=python_test
#SBATCH -o python_test.out
#SBATCH -e python_test.err
#
#SBATCH --cpus-per-task=2 #Note: you are always allocated an even number of cpus
#SBATCH --mem=1G
#SBATCH --time=10:00
#
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=me@email.com

module load GCCcore/10.3.0
module load Python/3.9.5

source mytest/bin/activate
python test.py

Alternatively download it with wget:

wget https://raw.githubusercontent.com/\
    vuw-research-computing/raapoi-tools/\
    master/examples/python_venv/python_submit.sh

To submit your job to the Slurm scheduler

sbatch python_submit.sh

Check for your job on the queue with squeue, though it might finish very fast. The output files will appear in your working directory.

Using Anaconda/Miniconda/conda

Many users use Anaconda/Miniconda to manage software stacks. One way to do this is to use Singularity containers with the conda environment inside - this allows the conda environment to load quickly, as the many small conda files are inside a container which the file system sees as one file.

However, this adds a bit of complexity, so many users just use conda outside of Singularity. You can install your own version of Anaconda/Miniconda to your home directory or scratch. We also have packaged versions of Anaconda/Miniconda installed in our module loading system.

Anaconda has many built-in packages so we will use that in our examples, but Miniconda is also available if you prefer to start from a minimal initial setup.

module load Anaconda3/2020.11 

Let's create a new conda environment for this example in a sensible location; I used ~/examples/conda/idba

conda create --name idba-example  # press y for the Proceed prompt if it looks correct
conda activate idba-example  #activate our example environment.

Conda environments are beyond the scope of this example, but they are a good way to contain all the dependencies and programs for a particular workflow, in this case, idba.

Install idba in our conda environment. ** Note that best practice is to do the install on a compute node ** We'll just do it here on the login node for now; as a result, the installed packages may be built for the login node's CPU and run slower on the compute nodes!

conda install -c bioconda idba

idba is a genome assembler; we will use paired-end Illumina reads of E. coli. The data is available in an Amazon S3 bucket (a cloud storage location), and we can download it using wget.

mkdir data  # put our data in a sensible location
cd data
wget --content-disposition goo.gl/JDJTaz #sequence data
wget --content-disposition goo.gl/tt9fsn #sequence data
cd ..  #back to our project directory

The reads we have are paired-end fastq files but idba requires a fasta file. We can use a tool installed with idba to convert them. We'll do this on the Rāpoi login node as it is a fast task that doesn't need many resources.

fq2fa --merge --filter data/MiSeq_Ecoli_MG1655_50x_R1.fastq data/MiSeq_Ecoli_MG1655_50x_R2.fastq data/read.fa
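For intuition, fq2fa --merge essentially interleaves the two read files into a single FASTA. Here is a rough Python sketch of that idea (this is not the real tool, and it skips fq2fa's quality filtering; the function names are our own):

```python
def fastq_records(path):
    """Yield (header, sequence) pairs from a FASTQ file (4 lines per record)."""
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                return
            seq = fh.readline().rstrip()
            fh.readline()  # '+' separator line
            fh.readline()  # quality line, dropped in FASTA output
            yield header.lstrip("@"), seq

def merge_to_fasta(r1_path, r2_path, out_path):
    """Interleave paired reads from two FASTQ files into one FASTA file."""
    with open(out_path, "w") as out:
        for (h1, s1), (h2, s2) in zip(fastq_records(r1_path), fastq_records(r2_path)):
            out.write(f">{h1}\n{s1}\n>{h2}\n{s2}\n")
```

For real data, use the fq2fa tool itself as shown above.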

To create our submission script we need to know the path to our conda environment. To get this:

conda env list

You'll need to find your idba-example environment; next to it is the path you'll need for your submission script. In my case:

# conda environments:
#
base                  *  /home/andre/anaconda3
idba-example             /home/andre/anaconda3/envs/idba-example  # We need this line, it'll be different for you!
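Reading the path off the listing by eye is fine. If you ever script this step, a small sketch that parses output like the listing above (assuming the usual column layout, with '*' marking the active environment) might look like:

```python
def env_path(conda_list_output: str, env_name: str):
    """Pull the filesystem path for a named environment out of `conda env list` text."""
    for line in conda_list_output.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        parts = line.split()
        if parts[0] == env_name:
            return parts[-1]  # the path is the last column
    return None

listing = """# conda environments:
#
base                  *  /home/andre/anaconda3
idba-example             /home/andre/anaconda3/envs/idba-example
"""
print(env_path(listing, "idba-example"))  # → /home/andre/anaconda3/envs/idba-example
```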

Create our sbatch submission script. Note that this sequence doesn't need a lot of memory, so we'll use 3G. To see your usage after the job has run, use vuw-job-report <job-id>

idba_submit.sh

#!/bin/bash

#SBATCH --job-name=idba_test
#SBATCH -o _output.out
#SBATCH -e _output.err
#SBATCH --time=00:05:00
#SBATCH --partition=quicktest
#SBATCH --ntasks=12
#SBATCH --mem=3G

module load Anaconda3/2020.11
eval "$(conda shell.bash hook)" # basically inits your conda - prevents errors like: CommandNotFoundError: Your shell has not been properly configured ...
conda activate /home/andre/anaconda3/envs/idba-example  # We need to activate our conda environment on the remote node
idba_ud -r data/read.fa -o output

To submit our job

sbatch idba_submit.sh

To see our job running or queuing

squeue -u $USER
This job will take a few minutes to run, generally less than 5. When the job is done we can see the output in the output folder, along with the standard output and standard error in the files _output.out and _output.err. The quickest way to examine them is to cat the files once the run is done.

cat _output.out