Configuring your environment and running jobs on the Colosse supercomputer at Calcul Québec
This tutorial explains how to configure a basic Python environment and how to launch jobs on the Colosse supercomputer at Calcul Québec. The official Colosse wiki can be found at: https://wiki.calculquebec.ca.
To connect to Colosse, open a terminal and use the following command.
ssh username@colosse.calculquebec.ca
This will connect you to a login node. You should be greeted with a message like this:
===========================================================================
Vous êtes sur un noeud de login de colosse (Calcul Québec).
- Nous n'effectuons pas de sauvegarde de vos fichiers.
- N'utilisez pas le noeud de login pour executer votre code.
This is a Calcul Québec login node for colosse.
- There is no backup of users files.
- Do not use this node to run code.
Rapportez tout problème à / Report any problems to: colosse@calculquebec.ca
Documentation: https://wiki.calculquebec.ca/
Suivre sur Twitter/Follow on Twitter: https://twitter.com/CQ_Colosse
État des serveurs: http://serveurscq.computecanada.ca
===========================================================================
Login nodes are used to prepare/launch/monitor jobs and to move files around. Do not use them to run code.
Colosse has several file systems with different properties, which are described in the official wiki. Basically, I use SCRATCH to run experiments and RAP to store results and data that must not be lost.
In your home directory, run the following commands:
mkdir $SCRATCH/$USER
mkdir $RAP/$USER
ln -s $SCRATCH/$USER scratch
ln -s $RAP/$USER rap
This will create symbolic links to your scratch and rap folders, which have complicated paths.
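If you re-run the commands above, mkdir and ln will complain that the targets already exist. Here is a minimal re-runnable sketch, assuming SCRATCH and RAP are set by the login environment as above. The function name link_storage is only illustrative and is not something Colosse provides.

```shell
# Hypothetical helper (not part of Colosse): create the per-user folders and
# the shortcuts in the home directory. Safe to run more than once.
link_storage() {
    # ${VAR:?msg} stops with an error message if VAR is unset or empty
    mkdir -p "${SCRATCH:?SCRATCH is not set}/$USER" "${RAP:?RAP is not set}/$USER"
    # -s: symbolic link, -f: replace an existing link,
    # -n: treat an existing link to a directory as the link itself
    ln -sfn "$SCRATCH/$USER" "$HOME/scratch"
    ln -sfn "$RAP/$USER" "$HOME/rap"
}
```

After defining the function, call link_storage and check the result with ls -l ~/scratch ~/rap.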
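Colosse provides a lot of preinstalled software, which is made available through modules. The first command below lists all available modules; the second searches for modules matching a keyword.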
module avail
module spider [keyword]
For example, running module spider gcc returns:
-----------------
compilers/gcc:
-----------------
Versions:
compilers/gcc/4.5
compilers/gcc/4.6
compilers/gcc/4.8
compilers/gcc/4.8.5
compilers/gcc/4.9
compilers/gcc/5.4
To use a module, you must first load it using the following command:
module load [module name]
For example, module load compilers/gcc/4.8.5 loads version 4.8.5 of the gcc compiler.
Manually loading modules each time you log in is tedious and can be avoided by loading them from your .bashrc file. After editing the file, apply the changes to your current session with:
source ~/.bashrc
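As an illustration, a .bashrc fragment that loads a module at login could look like the sketch below. It is an assumption about what such a file might contain, not the actual .bashrc distributed with this tutorial; the only module it names is the gcc version used in the example above.

```shell
# --- Software (illustrative fragment for ~/.bashrc) ---
# Only call `module` when it is defined, so the file also works on
# machines that do not provide environment modules.
if command -v module >/dev/null 2>&1; then
    module load compilers/gcc/4.8.5   # version taken from the example above
fi

# Virtual environment activation, left commented out until the
# environment has been created:
# source ~/env/bin/activate
```

Add one module load line per module you want available in every session.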
In addition to the scratch and rap directories, I like to have a dev directory, where I keep all my code repositories. Run the following commands at the root of your home directory.
mkdir dev
mkdir dev/git
First, create a virtual environment by using the following command at the root of your home directory.
virtualenv env
Then, open up your .bashrc file and uncomment the following line in the Software section.
source ~/env/bin/activate
This will load your Python environment when you log in. Now, run source ~/.bashrc followed by which python. The last command should print the path to an executable in your virtual environment.
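The same check can be scripted. The sketch below is a hypothetical helper (python_in_env is not a standard command) that succeeds only when python resolves to a path inside the given virtual environment directory, which defaults to ~/env as created above.

```shell
# Hypothetical check: is the `python` found on PATH located inside the
# virtual environment directory? Default directory: ~/env
python_in_env() {
    env_dir="${1:-$HOME/env}"
    py_path="$(command -v python)" || return 1   # no python on PATH at all
    case "$py_path" in
        "$env_dir"/*) echo "OK: $py_path" ;;
        *) echo "python is $py_path, not inside $env_dir" >&2; return 1 ;;
    esac
}
```

Run python_in_env after sourcing your .bashrc. A non-zero exit status means the environment is not active in the current shell.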
First, run the following commands.
pip install --upgrade pip
pip install cython
Run the following command, which tells numpy where the MKL library is located.
cat > ~/.numpy-site.cfg << EOF
[mkl]
library_dirs = $MKLROOT/lib/intel64
include_dirs = $MKLROOT/include
mkl_libs = mkl_rt
lapack_libs =
EOF
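Because the EOF delimiter above is unquoted, the shell substitutes $MKLROOT at the moment the file is written, so the variable must already be set. The sketch below refuses to write the file otherwise. The function name write_numpy_site_cfg and its optional path argument are hypothetical, and it assumes MKLROOT is exported by your environment (for example by an MKL module).

```shell
# Hypothetical wrapper around the command above: write the numpy site
# configuration only if MKLROOT is defined.
write_numpy_site_cfg() {
    cfg_file="${1:-$HOME/.numpy-site.cfg}"
    if [ -z "${MKLROOT:-}" ]; then
        echo "MKLROOT is not set; refusing to write $cfg_file" >&2
        return 1
    fi
    # Unquoted EOF: $MKLROOT is expanded now, at file creation time
    cat > "$cfg_file" << EOF
[mkl]
library_dirs = $MKLROOT/lib/intel64
include_dirs = $MKLROOT/include
mkl_libs = mkl_rt
lapack_libs =
EOF
}
```

Calling write_numpy_site_cfg with no argument writes to ~/.numpy-site.cfg, the same location as the command above.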
Then, go to the ~/dev/git directory and run the following commands.
git clone https://github.com/numpy/numpy.git
cd numpy
python setup.py install
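Once the build finishes, it can be worth checking which BLAS/LAPACK libraries numpy was actually built against. The sketch below wraps numpy.show_config() in a small hypothetical helper (numpy_blas_info is an illustrative name). With the configuration above you would expect mkl_rt to appear in the output, although the exact format varies between numpy versions.

```shell
# Hypothetical helper: print the numpy version and its build configuration.
# The optional first argument selects the interpreter (default: python).
numpy_blas_info() {
    "${1:-python}" -c 'import numpy; print("numpy", numpy.__version__); numpy.show_config()'
}
```

Run numpy_blas_info from a directory other than the numpy source tree, otherwise Python may try to import the uninstalled sources instead of the installed package.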
Run the following command.
pip install ipython scipy scikit-learn h5py pandas
You can install any other package using pip.
Now, your environment is all set and you are ready to launch experiments!
First, determine what your resource allocation project is by running colosse-info. This will print a lot of information, including your various computation allocations. In my case, it prints:
RAPI nne-790-aa: 0 used cores / 30 allocated cores (recent history)
RAPI nne-790-ae: 39.2049 used cores / 180 allocated cores (recent history)
RAPI agq-973-aa: 0 used cores / 30 allocated cores (recent history)
RAPI kyk-164-aa: 0 used cores / 30 allocated cores (recent history)
but you might only have one. Pick the allocation you want to use and remember its identifier, e.g., nne-790-ae.
Now, open the example_job.msub file provided with this tutorial. The file header gives the scheduler some information about your job. For example, the header could be
#!/bin/bash
#PBS -l nodes=2:ppn=8,walltime=24:00:00
#PBS -o stdout.out
#PBS -e stderr.err
#PBS -V
#PBS -N myjob
#PBS -A nne-790-ae
In this case, the requested computing time is 24 hours. The job requires 2 nodes with 8 cores each (ppn stands for processors per node). The stdout and stderr streams are redirected to user-specified files. The -V option exports your current environment variables to the job. The name of the job is myjob. The resource allocation to use is nne-790-ae.
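To show how such a header fits into a complete script, here is a minimal sketch of a submission file. It is not the example_job.msub file distributed with this tutorial: the resource values, the allocation identifier, and the commands in the body are placeholders. PBS_O_WORKDIR is the variable in which the scheduler stores the directory the job was submitted from; the fallback to the current directory is only there so the script can also be tried outside the scheduler.

```shell
#!/bin/bash
#PBS -l nodes=1:ppn=8,walltime=00:05:00
#PBS -o stdout.out
#PBS -e stderr.err
#PBS -V
#PBS -N myjob
#PBS -A nne-790-ae

# Jobs typically start in the home directory, so move to the directory
# the job was submitted from (or stay in place when run by hand).
cd "${PBS_O_WORKDIR:-$PWD}" || exit 1

echo "Job started on $(uname -n) at $(date)"
# Placeholder: replace this line with your actual command
echo "Job finished"
```

The #PBS lines are ordinary comments to bash, so the script can be run directly on your own machine to test the body before submitting it.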
Copy the example_job.msub file to a directory called ~/scratch/example_job. Replace the resource allocation project identifier with yours. Then, submit the job using the following command.
msub example_job.msub
Our example job will run for 5 minutes, so it should not be queued for a long time.
Once the job is submitted, you can use the i command to list the jobs that are in the waiting queue, i.e., in the IDLE state. The r command shows all the jobs that are running, and the b command shows all the jobs that are blocked, i.e., that the server refuses to run for the moment.
That’s it! You can now submit jobs on Colosse.
If you don’t want to have to type your password every time you connect to Colosse, do the following:
If you don’t already have an SSH key for your computer, generate one by typing ssh-keygen in a terminal.
Copy the key to the supercomputer using the following command:
ssh-copy-id username@colosse.calculquebec.ca
To copy files to the supercomputer, you can use the scp utility. For example, you could use this command to copy a directory called mydir on your computer to the myremotedir directory on Colosse.
scp -r mydir username@colosse.calculquebec.ca:myremotedir/
You could use a similar command to get the directory from Colosse:
scp -r username@colosse.calculquebec.ca:myremotedir/mydir/ .