Training

These are longer worked examples.
If you have domain specific training you'd like to provide for your students or peers, contact someone on the support team, or make a pull request against this repo.

GPU example with neural style in PyTorch

We'll work through a quick Python example using neural style transfer implemented in PyTorch. We will be using modules rather than conda/virtualenvs, but there is nothing stopping you from loading the modules and then creating a virtualenv/conda environment to install additional Python packages.

The code we use comes from the PyTorch examples git repo.

Clone the PyTorch examples repo

In a sensible location, clone the repo:

git clone https://github.com/pytorch/examples.git
cd examples/fast_neural_style  # change to the example we will be running.

Load the modules

We are using the new EasyBuild-based modules. To ensure we don't have conflicts with the old modules, it is best to unuse them first and then use the new system. At some point we may automatically add the new modules to your .bashrc file, but currently you'll have to do this yourself or manually unuse and use the new module system.

module unuse /home/software/tools/modulefiles/  #unuse the old module system
module use /home/software/tools/eb_modulefiles/all/Core #use the new module system
module load fosscuda/2020b
module load PyTorch/1.7.1
module load torchvision/0.8.2-PyTorch-1.7.1
module list #see all the dependencies we have loaded, in particular which version of python we're using now. Currently Python 3.8.6
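
If you'd like the new module system to be available automatically at login, one option (a sketch, assuming bash is your login shell) is to append the same unuse/use commands to your ~/.bashrc:

# Add to the end of ~/.bashrc so the new modules are set up at every login
module unuse /home/software/tools/modulefiles/
module use /home/software/tools/eb_modulefiles/all/Core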

PyTorch note:

When running jobs which use PyTorch, make sure you allocate sufficient memory for the job. If you encounter vague error messages, it is possible that you don't have enough memory allocated. Just to import torch it is recommended to have 4GB (or you may see errors such as ImportError: <library>.so: failed to map segment from shared object).
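
One way to sanity-check your memory request before a bigger run is a short interactive job that does nothing but import torch. This is just a sketch using srun and the quicktest partition, and it assumes you have already loaded the modules listed above in your current shell:

# Ask for 4G and confirm torch imports cleanly
srun --partition=quicktest --ntasks=1 --mem=4G --time=00:05:00 \
    python -c "import torch; print(torch.__version__)"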

Optional: Set up a virtualenv

python3 -m venv env  # create a virtualenv folder called env. Note! This will likely only work with the python version listed above!
source env/bin/activate # activate the virtualenv

Now that we've activated the virtual environment, we can install any additional packages we need. In this case we don't need any additional packages.
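
If you did need extra packages, you could install them into the active virtualenv with pip, for example (the package name here is purely illustrative):

pip install tqdm  # illustrative only - install whatever extra packages your own work needs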

Download some images to use as content as well as for training.

In your examples/fast_neural_style/ directory, run:

# Download an image of an octopus to images/content-images. 
## CC BY-SA 3.0 H. Zell
wget https://upload.wikimedia.org/wikipedia/commons/0/0c/Octopus_vulgaris_02.JPG -P images/content-images/ 

# Download an image of The Great Wave off Kanagawa - public domain
wget https://upload.wikimedia.org/wikipedia/commons/a/a5/Tsunami_by_hokusai_19th_century.jpg -O images/style-images/wave.jpg

Depending on the GPU we are using, we may need to resize the image to ensure it fits in memory. On an RTX6000 we would need to resize the image to 70% of its full size to fit in memory. Thankfully the GPUs on Rāpoi are A100s with 40GB of RAM, so we can skip this step.
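
If you do need to shrink the content image for a smaller GPU, a minimal sketch using ImageMagick's convert (assuming it is available where you prepare your images) would be:

# Resize the content image to 70% of its original dimensions (sketch)
convert images/content-images/Octopus_vulgaris_02.JPG -resize 70% \
    images/content-images/Octopus_vulgaris_02_small.JPG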

We will also need to download the pre-trained models for our initial inference runs.

python download_saved_models.py

Style some images - inference

We'll initially just use pretrained models to generate styled images. This is known as model inference and is much less intensive than training the model. We'll do this on both the CPU and the GPU.

submit_cpu.sh

#!/bin/bash

#SBATCH --job-name=pytorch_test
#SBATCH -o _test.out
#SBATCH -e _test.err
#SBATCH --time=00:15:00
#SBATCH --partition=parallel
#SBATCH --ntasks=12
#SBATCH --mem=6G

module unuse /home/software/tools/modulefiles/  #unuse the old module system
module use /home/software/tools/eb_modulefiles/all/Core #use the new module system
module load fosscuda/2020b
module load PyTorch/1.7.1
module load torchvision/0.8.2-PyTorch-1.7.1

#Optional
source env/bin/activate  #activate the virtualenv

# Run our job --cuda 0 means run on the CPU and we'll save the output image as test1.jpg
#
python neural_style/neural_style.py eval --content-image images/content-images/Octopus_vulgaris_02.JPG  --model saved_models/mosaic.pth --output-image ./test1.jpg --cuda 0

You can check how long the job took to run with vuw-job-history. The last lines correspond to your most recently run job; in my case:

332281        COMPLETED pytorch_t+              00:02:36 
332281.batch  COMPLETED      batch      0.15G   00:02:36 
332281.exte+  COMPLETED     extern      0.15G   00:02:36

The job took 2:36.

Let's run the inference job again on GPU to see the speedup.

submit_gpu.sh

#!/bin/bash

#SBATCH --job-name=pytorch_test
#SBATCH -o _test.out
#SBATCH -e _test.err
#SBATCH --time=00:15:00
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --ntasks=2
#SBATCH --mem=60G

module unuse /home/software/tools/modulefiles/  #unuse the old module system
module use /home/software/tools/eb_modulefiles/all/Core #use the new module system
module load fosscuda/2020b
module load PyTorch/1.7.1
module load torchvision/0.8.2-PyTorch-1.7.1

#optional
source env/bin/activate  #activate the virtualenv

# Run our job --cuda 1 means run on the GPU and we'll save the output image as test2.jpg
#
python neural_style/neural_style.py eval --content-image images/content-images/Octopus_vulgaris_02.JPG  --model saved_models/mosaic.pth --output-image ./test2.jpg --cuda 1

In this case vuw-job-history shows the job took:

692973        COMPLETED pytorch_t+              00:00:16 
692973.batch  COMPLETED      batch      0.15G   00:00:16 
692973.exte+  COMPLETED     extern      0.15G   00:00:16 

but the time varies a lot with short GPU runs: some are nearly 2 minutes and some are 16 seconds with the same data. The memory usage with PyTorch is also hard to estimate; running vuw-job-report 332320 shows:

Nodes: 1
Cores per node: 2
CPU Utilized: 00:00:07
CPU Efficiency: 43.75% of 00:00:16 core-walltime
Job Wall-clock time: 00:00:08
Memory Utilized: 1.38 MB
Memory Efficiency: 0.00% of 60.00 GB

The memory usage appears very low, but there is a very brief spike in memory at the end of the run, as the image is generated, that vuw-job-report doesn't quite capture. 60G of memory is needed to ensure this completes. A good rule of thumb is to allocate at least as much system memory as GPU memory; the A100s have 40G of RAM.

Train a new style - computationally expensive.

Training a new image style is where we will get the greatest speedup using a GPU.

We will use 13G of training images - the COCO 2014 Training images dataset. These images have already been downloaded and are accessible at /nfs/home/training/neural_style_data/train2014/. Note that training a new style takes about 1:15h on an A100 and about two and a half hours on an RTX6000.
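
Before submitting, it is worth confirming you can read the shared dataset, for example:

ls /nfs/home/training/neural_style_data/train2014/ | head -n 3  # list a few of the training images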

#!/bin/bash

#SBATCH --job-name=pytorch_test
#SBATCH -o _test.out
#SBATCH -e _test.err
#SBATCH --time=03:00:00
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --ntasks=2
#SBATCH --mem=60G

module unuse /home/software/tools/modulefiles/  #unuse the old module system
module use /home/software/tools/eb_modulefiles/all/Core #use the new module system
module load fosscuda/2020b
module load PyTorch/1.7.1
module load torchvision/0.8.2-PyTorch-1.7.1

#Optional
source env/bin/activate  #activate the virtualenv

# Run our job --cuda 1 means run on the GPU                                   
# style-weight and content-weight are just parameters adjusted to give better results
python neural_style/neural_style.py train \
        --dataset /nfs/home/training/neural_style_data/ \
        --style-image images/style-images/wave.jpg \
        --save-model-dir saved_models/style5e10_content_5e4 \
        --style-weight 5e10 \
        --content-weight 5e4 \
        --epochs 2 \
        --cuda 1

This will take a while, but should eventually complete. The A100 has enough memory to train on this image; with other GPUs you may need to scale down the style image to fit in GPU memory. Note: if you get an out-of-GPU-memory error but it seems the GPU has plenty of memory, it often means you ran out of system memory; try asking for more memory in Slurm.

Use our newly trained network

submit_gpu.sh

#!/bin/bash

#SBATCH --job-name=pytorch_test
#SBATCH -o _test.out
#SBATCH -e _test.err
#SBATCH --time=00:15:00
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --ntasks=2
#SBATCH --mem=60G

module unuse /home/software/tools/modulefiles/  #unuse the old module system
module use /home/software/tools/eb_modulefiles/all/Core #use the new module system
module load fosscuda/2020b
module load PyTorch/1.7.1
module load torchvision/0.8.2-PyTorch-1.7.1

#Optional
source env/bin/activate  #activate the virtualenv

# Run our job --cuda 1 means run on the GPU and we'll save the output image as test2.jpg
#
python neural_style/neural_style.py eval \
    --content-image images/content-images/Octopus_vulgaris_02.JPG  \
    --model saved_models/style5e10_content_5e4 \
    --output-image ./test3.jpg --cuda 1

Bonus content: use a Slurm task array to find the optimum parameters.

In the above example we used specific values for the style-weight and content-weight parameters. There are lots of possibilities for these parameters; we can use a task array and a parameter list to find good values. Note that actually running this example will consume a lot of resources and it is presented mostly to provide some information about task arrays: running it will occupy the whole GPU partition for about 12 hours.

First let's create a list of parameters to test. We could include these in the batch submission script, but I think it's clearer to separate them out: if you're version controlling your submission script, it also makes it easier to see which changes are to parameters and which are to the script itself.

In the parameter list, the first column is the style-weight parameter and the second is the content-weight parameter.

paramlist.txt

5e10 1e3
5e10 1e4
5e10 5e4
1e11 1e3
1e11 1e4
1e11 5e4
5e11 1e3
5e11 1e4
5e11 5e4
1e12 1e3
1e12 1e4
1e12 5e4

In our submission script we will parse these values with awk. Awk is a bit beyond the scope of this lesson, but it is a handy shell tool for manipulating text; DigitalOcean has a nice primer on awk.
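
As a quick illustration of what the awk calls in the script below do, running one by hand with a task ID of 3 pulls out the third row of paramlist.txt:

awk -v var=3 'NR == var {print $1, $2}' paramlist.txt  # prints: 5e10 5e4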

submit_gpu_train_array

#!/bin/bash

#SBATCH --job-name=pytorch_test
#SBATCH -o _test.out
#SBATCH -e _test.err
#SBATCH --time=10:00:00
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --ntasks=10
#SBATCH --mem=60G
#SBATCH --array=1-12

module unuse /home/software/tools/modulefiles/  #unuse the old module system
module use /home/software/tools/eb_modulefiles/all/Core #use the new module system
module load fosscuda/2020b
module load PyTorch/1.7.1
module load torchvision/0.8.2-PyTorch-1.7.1

#Optional
source env/bin/activate  #activate the virtualenv

# Run our job --cuda 1 means run on the GPU                                   
#
#awk -v var="$SLURM_ARRAY_TASK_ID" 'NR == var {print $1}' paramlist.txt 
style_weight=$(awk -v var="$SLURM_ARRAY_TASK_ID" 'NR == var {print $1}' paramlist.txt)
content_weight=$(awk -v var="$SLURM_ARRAY_TASK_ID" 'NR == var {print $2}' paramlist.txt)

echo $style_weight
echo $content_weight
python neural_style/neural_style.py train \
    --dataset /nfs/home/training/neural_style_data/ \
    --style-image images/style-images/wave.jpg \
    --save-model-dir saved_models/test_params2_epoch2/style${style_weight}_content${content_weight} \
    --style-weight $style_weight \
    --content-weight $content_weight \
    --epochs 2 \
    --cuda 1

Simple OpenMPI with Singularity using the hybrid approach.

The hybrid approach is one way of getting OpenMPI working with containers. It requires the OpenMPI version inside the container to match the OpenMPI outside the container (loaded via module loading).

First check what OpenMPI versions are available on Rāpoi. On Rāpoi, switch to our new modules and search:

module unuse /home/software/tools/modulefiles # stop using the older modules
module use /home/software/tools/eb_modulefiles/all/Core #the new module files organised by compiler
module spider OpenMPI # search for OpenMPI - there are several options
module spider OpenMPI/4.0.5  # we will use this one, which requires GCC/10.2.0

On your local machine we will create a very simple C OpenMPI program, saved as mpitest.c (the definition file below copies a file with this name). Create this in a sensible place; I used ~/projects/examples/singularity/openMPI.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char **argv) {
        int rc;
        int size;
        int myrank;

        rc = MPI_Init (&argc, &argv);
        if (rc != MPI_SUCCESS) {
                fprintf (stderr, "MPI_Init() failed");
                return EXIT_FAILURE;
        }

        rc = MPI_Comm_size (MPI_COMM_WORLD, &size);
        if (rc != MPI_SUCCESS) {
                fprintf (stderr, "MPI_Comm_size() failed");
                goto exit_with_error;
        }

        rc = MPI_Comm_rank (MPI_COMM_WORLD, &myrank);
        if (rc != MPI_SUCCESS) {
                fprintf (stderr, "MPI_Comm_rank() failed");
                goto exit_with_error;
        }

        fprintf (stdout, "Hello, I am rank %d/%d\n", myrank, size);

        MPI_Finalize();

        return EXIT_SUCCESS;

 exit_with_error:
        MPI_Finalize();
        return EXIT_FAILURE;
}

In the same location as above create a Singularity definition file, e.g. test-openmpi-4.0.5.def. Note that we choose to compile and install the same OpenMPI version as we will use on Rāpoi.

Bootstrap: docker
From: ubuntu:latest

%files
    mpitest.c /opt

%environment
    export OMPI_DIR=/opt/ompi
    export SINGULARITY_OMPI_DIR=$OMPI_DIR
    export SINGULARITYENV_APPEND_PATH=$OMPI_DIR/bin
    export SINGULARITYENV_APPEND_LD_LIBRARY_PATH=$OMPI_DIR/lib

%post
    echo "Installing required packages..."
    apt-get update && apt-get install -y wget git bash gcc gfortran g++ make file

    echo "Installing Open MPI"
    export OMPI_DIR=/opt/ompi
    export OMPI_VERSION=4.0.5  #NOTE matching version to that on Raapoi
    export OMPI_URL="https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-$OMPI_VERSION.tar.bz2"
    mkdir -p /tmp/ompi
    mkdir -p /opt
    # Download
    cd /tmp/ompi && wget -O openmpi-$OMPI_VERSION.tar.bz2 $OMPI_URL && tar -xjf openmpi-$OMPI_VERSION.tar.bz2
    # Compile and install
    cd /tmp/ompi/openmpi-$OMPI_VERSION && ./configure --prefix=$OMPI_DIR && make install
    # Set env variables so we can compile our application
    export PATH=$OMPI_DIR/bin:$PATH
    export LD_LIBRARY_PATH=$OMPI_DIR/lib:$LD_LIBRARY_PATH
    export MANPATH=$OMPI_DIR/share/man:$MANPATH

    echo "Compiling the MPI application..."
    cd /opt && mpicc -o mpitest mpitest.c

Now we build our container locally, giving it a sensible name. We need OpenMPI-4.0.5 to use this, so let's include that in the name.

sudo singularity build test-openmpi-4.0.5.sif test-openmpi-4.0.5.def
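
Before copying the image to Rāpoi you can optionally confirm that the OpenMPI inside the container is the version you expect (a quick sanity check; the path comes from OMPI_DIR in the definition file):

singularity exec test-openmpi-4.0.5.sif /opt/ompi/bin/mpirun --version  # should report 4.0.5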

Copy that file to Rāpoi somehow - FileZilla, rsync or similar. I'll just use sftp for simplicity.

sftp <username>@raapoi.vuw.ac.nz
put test-openmpi-4.0.5.sif

Now on Rāpoi, copy that file to a sensible location; I used ~/projects/examples/singularity/openMPI again.

mv test-openmpi-4.0.5.sif ~/projects/examples/singularity/openMPI/
cd ~/projects/examples/singularity/openMPI/

In that location create an sbatch file:

openmpi-test.sh

#!/bin/bash
#SBATCH --job-name=mpi_test
#SBATCH --time=00-00:02:00
#SBATCH --output=out_test.out
#SBATCH --error=out_test.err
#SBATCH --partition=parallel
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH --tasks-per-node=1
#SBATCH --mem=1GB
#SBATCH --constraint="IB,AMD"
#SBATCH --nodes=2

module use /home/software/tools/eb_modulefiles/all/Core
module unuse /home/software/tools/modulefiles # to prevent conflicts with the old modules
module load GCC/10.2.0
module load OpenMPI/4.0.5
module load Singularity/3.7.3 # Note this is a new singularity build

CONPATH=$HOME/projects/examples/singularity/openMPI
mpirun -np 2 singularity exec $CONPATH/test-openmpi-4.0.5.sif /opt/mpitest

Submit that to Slurm and see the output:

sbatch openmpi-test.sh
squeue -u $USER  # see the job
cat out_test.out # examine the output after the job is done
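
If everything worked, out_test.out should contain one line per rank, something like the following (the rank order may vary):

Hello, I am rank 0/2
Hello, I am rank 1/2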

Simple TensorFlow example (using the new module system)

In a sensible location create an example Python script - this is basically copied verbatim from the TensorFlow docs: https://www.tensorflow.org/tutorials/quickstart/beginner

example.py

import tensorflow as tf
print("TensorFlow version:", tf.__version__)


# Load and prepare the MNIST dataset. Convert the sample data from integers to floating-point numbers
mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0


# Build a machine learning model

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10)
])


# The model returns a vector of log-odds scores, one for each class
predictions = model(x_train[:1]).numpy()
predictions

# The tf.nn.softmax function converts these log odds to probabilities for each class

tf.nn.softmax(predictions).numpy()

# Define a loss function for training.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# This untrained model gives probabilities close to random 
loss_fn(y_train[:1], predictions).numpy()

# Configure and compile the model using Keras Model.compile
model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])

# Train and evaluate the model - use Model.fit to adjust parameters and minimize loss
model.fit(x_train, y_train, epochs=5)

# Check model performance
model.evaluate(x_test,  y_test, verbose=2)

# Return a probability - wrap the trained model and attach softmax
probability_model = tf.keras.Sequential([
  model,
  tf.keras.layers.Softmax()
])
probability_model(x_test[:5])

Next create a submission script submit.sh

#!/bin/bash

#SBATCH --job-name=tensorflow_test
#SBATCH -o _test.out
#SBATCH --time=00:10:00
#SBATCH --partition=gpu
#SBATCH --ntasks=6
#SBATCH --mem=50G
#SBATCH --gres=gpu:1

# Use the new module system
module use /home/software/tools/eb_modulefiles/all/Core

#to load tf 2.6.0 you'll first need the compiler set it was built with
module load foss/2021a

#load tf
module load TensorFlow/2.6.0-CUDA-11.3.1

# Run the simple tensorflow example - taken from the docs: https://www.tensorflow.org/tutorials/quickstart/beginner
python example.py

Submit your job to the queue and then check on it with squeue:

sbatch submit.sh
squeue -u <username>

Possible Errors

TensorFlow jobs on the GPU nodes can be a bit dicey:

  1. I'd suggest always choosing more memory than the GPU has (40GB); the GPU nodes have plenty of memory, so ask for at least 50GB of RAM.

  2. There is also a relationship between the CPUs allocated and the memory used, and the resulting errors are not always obvious. If you're running into issues, try increasing the requested memory or reducing the requested CPUs.

  3. Example errors due to requesting many CPUs while requesting only 50GB of RAM. Note the std::bad_alloc, which suggests a problem allocating memory:

    terminate called after throwing an instance of 'std::bad_alloc'
      what():  std::bad_alloc
    /var/lib/slurm/slurmd/job1125851/slurm_script: line 21: 46983 Aborted
    (core dumped) python example.py
    

Note this example is also in our examples git repo: https://github.com/vuw-research-computing/raapoi-examples inside the tensorflow-simple directory.

Example Gaussian Job Submission on HPC

Here is an example of submitting a Gaussian job on the HPC using Slurm. In this example, we will submit a Gaussian job using the quicktest partition, and request 1 task with 4 CPUs and 7GB of memory for a maximum run time of 1 hour. We will also load the g16 module, which is required to run Gaussian on the HPC.

First, create a new directory and navigate to it:

mkdir gaussian_example
cd gaussian_example

Get the example input file

The test0397.com file is an example input file for Gaussian. It contains instructions for Gaussian to perform a calculation on a molecule.

To run the example job using this input file, you should copy the test0397.com file from the Gaussian installation directory at /home/software/apps/gaussian/g16/tests/com/test0397.com to your working directory (gaussian_example in this case).

To do that from the gaussian_example directory:

cp /home/software/apps/gaussian/g16/tests/com/test0397.com . # copy from location to . The dot means current directory

Have a look at the first few lines of the input file to see what it does.

head test0397.com  # show the first 10 lines of test0397.com

#returns
!%nproc=4
#p rb3lyp/3-21g force test scf=novaracc

Gaussian Test Job 397:
Valinomycin force

0,1
O,-1.3754834437,-2.5956821046,3.7664927822
O,-0.3728418073,-0.530460483,3.8840401686
O,2.3301890394,0.5231526187,1.7996834334

The first line !%nproc=4 specifies the number of processors that Gaussian will use to run the calculation, in this case, 4.

We will need to make sure that the number of processors specified in this file matches the number of CPUs we request from Slurm.
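
A quick way to check the value in the input file before submitting is to grep for it:

grep -i nproc test0397.com  # shows the !%nproc=4 line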

Slurm Submission

Next, create a submission script called submit.sh (using nano or similar) and add the following contents:

#!/bin/sh
#SBATCH --job-name=g16-test

# max run time
#SBATCH --time=1:00:00
#SBATCH --partition=quicktest

#SBATCH --output=_quicktest.out
#SBATCH --error=_quicktest.err

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=7G

module load gaussian/g16

g16 test0397.com

In the submission script, we specify the following:

  • --job-name: name of the job to appear in the queue.
  • --time: maximum runtime for the job in hh:mm:ss format.
  • --partition: the partition to run the job on.
  • --output: specifies the name of the standard output file.
  • --error: specifies the name of the standard error file.
  • --ntasks: specifies the number of tasks the job will use.
  • --cpus-per-task: specifies the number of CPUs per task.
  • --mem: specifies the amount of memory to allocate for the job.

Submit the job to the queue using sbatch:

sbatch submit.sh

You can check the status of your job in the queue using squeue:

squeue -u <your_username>

Once the job is finished, you can check for the output files and see the contents of the standard output file using cat:

ls
cat _quicktest.out

The Gaussian output files (test0397.log, test0397.chk, etc.) will also be generated in the working directory. You can view the output file using the less command:

less test0397.log

Press q to quit the less program.
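
Gaussian normally prints a "Normal termination" line at the end of a successful run, so one quick sanity check (assuming that behaviour) is:

grep "Normal termination" test0397.log  # present if the job finished successfully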

That's it! You have successfully submitted and run a Gaussian job on a HPC cluster using Slurm.