Storage and quotas

Shared Storage

Currently users have 3 main storage areas shared across every node. Each node has access to this storage at all times and the data is shared. Be careful with parallel jobs trying to write to the same filename!

  • /nfs/home/USERNAME - This is your home directory; each user has a 50 GB quota limit. The data is replicated off site and backed up regularly by Digital Solutions. See Home Directory Tips below.

  • /nfs/scratch/USERNAME - This is your scratch space; each user has a 5 TB quota limit. This data is not backed up! See Scratch Tips below.

  • /nfs/scratch/noquota-volatile - This is an additional filesystem (previously referred to as BeeGFS). There is no quota enforcement here and there is 100 TB of total space. This data is not backed up! All data on this storage is periodically deleted. See BeeGFS Tips below.

Note: Home directory quotas cannot be increased; however, if you need more space in your scratch folder, let us know.

To view your current quota and usage, use the vuw-quota command, for example:

<username>@raapoi-master:~$ vuw-quota 

User Quotas

                       Storage  Usage (GB)  Quota (GB)     % Used 
            /nfs/home/<username>      18.32       50.00     36.63%

         /nfs/scratch/<username>       0.00     5000.00      0.00%

Per Node Storage

Each compute node has local storage you can use at /tmp.
This storage is not shared, so a program running on amd01n02 will not be able to see data stored on node amd01n04's /tmp storage. Additionally, you can only access /tmp on any given node via a job running on that node.

On the AMD nodes and GPU nodes the /tmp storage is very fast NVMe storage with 1.7 TB of total space.
On the Intel and highmem nodes this storage is slower and 1.7 TB is not always available.

IMPORTANT: If you use the /tmp storage it is your responsibility to copy data to /tmp and to clean it up when your job is done.
For more info see Temp Disk Tips.
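
If you want to see how much local /tmp space is free on a compute node before relying on it, one option is to run df through a short interactive job. This is just a sketch - the quicktest partition is the one used in the sbatch example later on this page, so substitute whichever partition you normally use:

# Request a brief allocation and report the free space on that node's local /tmp
srun --partition=quicktest --time=00:02:00 df -h /tmp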

Storage Performance

graph TD
    A(Home and Research Storage) --> B[Scratch]
    B --> D[local tmp on AMD nodes]
Figure 1: Storage speed hierarchy. The slowest storage is your user home directory as well as any mounted research storage. The trade-off for this is that this data is replicated off site as well as backed up by Digital Solutions. The fastest is the local tmp space on the AMD nodes - it is usually deleted shortly after you log out and is only visible to the node it's on, but it is extremely fast with excellent IO performance.

Storage tips

Home Directory Tips

Home directories have a small quota and are on fairly slow storage. The data here is backed up: it is replicated off site live as well as periodically backed up to off-site tape. In theory data here is fairly safe; even in a catastrophic event it should be recoverable eventually.

If you accidentally delete something here it can be recovered with a service desk request.

While this storage is not performant, it is quite safe and is a good place for your scripts and code to live. Your data sets can also live in your home directory if they fit and the performance doesn't cause you any problems.
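
If you are bumping up against the 50 GB home quota and want to see what is using the space, a minimal sketch using standard tools (nothing Rāpoi-specific) is:

# Summarise the size of everything in your home directory, largest items last
du -sh ~/* ~/.[!.]* 2>/dev/null | sort -h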

For bigger or faster storage see the Scratch page.

Scratch Tips

The scratch storage is on a large storage node with 2 RAID arrays of 50 TB each. Your scratch will always be available at /nfs/scratch/<username>.

Your scratch storage could be on scratch or scratch2; to find out, run vuw-quota.

Each user has a quota of 5 TB on scratch - you can ask the support team to increase it if needed. While each user has a quota of 5 TB, we don't actually have enough storage for every user to fill 5 TB! This is a shared resource and we will occasionally ask on the Slack channel for users to clean up their storage to make space for others. Individuals using a large amount of scratch space may receive an email.

To check how much space is free on the scratch storage for all users, on Rāpoi run:

df -h | grep scratch  # df -h is disk free with human units; | pipes the output to grep, which shows lines containing the word scratch

Alternatively, on Slack, in any DM or channel type:

/df-scratch

The output should only be visible to you.

This storage is not backed up at all. It is on a RAID array, so if a hard drive fails your data is safe. However, in the event of a more dramatic hardware failure, an earthquake or a fire, your data is gone forever. If you accidentally delete something, it's gone forever. If an admin misconfigures something, your data is gone (we try not to do this!).

It is your responsibility to back up your data here - a good place to do so is the Digital Solutions Research Storage (see Connecting to SoLAR).
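
As a rough sketch of what such a backup could look like, assuming your research storage is reachable at a path like /path/to/research-storage (a placeholder - see Connecting to SoLAR for the real location and mount details):

# Mirror a dataset from scratch to backed-up research storage
# -a preserves permissions and timestamps, -v prints what is copied
rsync -av /nfs/scratch/<username>/dataset/ /path/to/research-storage/dataset/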

Scratch is also not a place for your old data to live forever, please clean up datasets you're no longer using!

Temp Disk Tips

This storage is very fast on the AMD nodes and GPU nodes. It is your job to move data to the tmp space and clean it up when done.

There is very little management of this space and currently it is not visible to Slurm for fair-use scheduling - in other words, someone else might have used up most of the temp space on the node! This is generally not the case though.

A rough example of how you could use this in an sbatch script:

#!/bin/bash
#
#SBATCH --job-name=bash_test
#
#SBATCH --partition=quicktest
#
#SBATCH --cpus-per-task=2 #Note: you are always allocated an even number of cpus
#SBATCH --mem=1G
#SBATCH --time=10:00

# Do the needed module loading for your use case
module load etc

#Make a temporary directory with your username so you don't tread on others
mkdir -p /tmp/<username>

#Copy dataset from scratch to the local tmp on the node (could also use rsync)
cp -r /nfs/scratch/<username>/dataset /tmp/<username>/dataset

# Process data against /tmp/<username>/dataset
# Let's say the data is output to /tmp/<username>/dataoutput/

# Copy the output back to your scratch - I suggest not overwriting your original dataset!
mkdir -p /nfs/scratch/<username>/dataset/dataoutput/
cp -r /tmp/<username>/dataoutput/* /nfs/scratch/<username>/dataset/dataoutput/

# Delete the data you copied to and created on tmp
rm -rf /tmp/<username>  #DANGER!!
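
Assuming you saved the script above as tmp_example.sh (a filename made up for this example), you would submit it from the login node and can check on it with squeue:

sbatch tmp_example.sh     # submit the job to the scheduler
squeue -u <username>      # see whether it is queued or running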

BeeGFS Tips

The BeeGFS storage is spread across 3 nodes with SSD disks. The aggregate storage is 100TB. We don't enforce quotas here as some projects have large storage needs. However, you do need to be a good HPC citizen and respect the rights of others. Don't needlessly fill up this storage.
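
Because there is no quota enforcement, vuw-quota will not show your usage on this filesystem. A minimal way to check it yourself, assuming you keep your data in a directory named after your username (an assumption - point du at whatever directory you actually created):

# Summarise how much of the no-quota volatile storage you are using
du -sh /nfs/scratch/noquota-volatile/<username>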

We delete all data here every 3-6 months, and we only post warnings on the Slack channel. If after 3 months the storage is not full and is still performing well, we delay the wipe for another 3 months.

This storage is not backed up at all. It is on a RAID array, so if a hard drive fails your data is safe. However, in the event of a more dramatic hardware failure, an earthquake or a fire, your data is gone forever. If you accidentally delete something, it's gone forever. If an admin misconfigures something, your data is gone (we try not to do this!).

It is your responsibility to back up your data here - a good place is the Digital Solutions Research Storage.

BeeGFS should have better IO performance than the scratch storage - however, it does depend on what other users are doing. No filesystem likes small files, and BeeGFS likes small files even less than scratch. File sizes over 1 MB are best. If you have a large number of files you can improve performance by splitting them across many directories - the load balancing across the metadata and storage servers is by directory, not by file.
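
As a hedged sketch (not a required workflow), here is one way you could spread a flat directory of many small files into numbered subdirectories so BeeGFS can balance them across its metadata and storage targets. The names input and bucket_NN are made up for this example:

#!/bin/bash
# Move files from a flat ./input directory into 16 bucket directories.
# BeeGFS load balances per directory, which helps when you have many small files.
mkdir -p bucket_{00..15}

i=0
for f in input/*; do
    mv "$f" "bucket_$(printf '%02d' $((i % 16)))/"
    i=$((i + 1))
done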