Windows-colosse-tutorial

This tutorial explains how to configure a basic Python environment and how to launch jobs on the Colosse supercomputer at Calcul Québec. The official Colosse wiki can be found at https://wiki.calculquebec.ca.

Table of contents:

Connecting to the supercomputer
File system
Modules
Configuring your development environment
Submitting jobs
Other useful info

Connecting to the supercomputer

To connect to Colosse, open a terminal and use the following command.

ssh username@colosse.calculquebec.ca

This will connect you to a login node. You should be greeted with a message like this:

===========================================================================
Vous êtes sur un noeud de login de colosse (Calcul Québec).
 - Nous n'effectuons pas de sauvegarde de vos fichiers.
 - N'utilisez pas le noeud de login pour executer votre code.
This is a Calcul Québec login node for colosse.
 - There is no backup of users files.
 - Do not use this node to run code.
Rapportez tout problème à / Report any problems to: colosse@calculquebec.ca
Documentation: https://wiki.calculquebec.ca/
Suivre sur Twitter/Follow on Twitter: https://twitter.com/CQ_Colosse
État des serveurs: http://serveurscq.computecanada.ca
===========================================================================

File system

Colosse has several file systems with different properties, which are described on the Calcul Québec wiki. Basically, I use SCRATCH to run experiments and RAP to store results and data that must not be lost.

Scratch: This directory is placed on a parallel filesystem, Lustre (for Colosse, Mp2, Ms2 and Cottos) or GPFS (for Briarée and Guillimin). It is generally visible from all nodes. Using it is very fast for large files, but not very efficient for many small files. This is the appropriate place to store large files that you use for a few days or weeks only. Periodically, it may be automatically cleaned (files being deleted).

In your home directory, run the following commands:

mkdir $SCRATCH/$USER
mkdir $RAP/$USER
ln -s $SCRATCH/$USER scratch
ln -s $RAP/$USER rap

This creates symbolic links, named scratch and rap, to your scratch and RAP folders, which have complicated paths.
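The mkdir and ln commands fail with an error if they are run a second time. A re-runnable variant might look like the sketch below. It assumes that SCRATCH, RAP and USER are set, as they are on Colosse; the setup_links function name is made up for this example.

```shell
# Sketch: re-runnable version of the directory setup.
# Assumes SCRATCH, RAP and USER are set (they are on Colosse).
# setup_links is a hypothetical helper name, not a Colosse command.
setup_links() {
    target_dir="${1:-$HOME}"
    if [ -z "${SCRATCH:-}" ] || [ -z "${RAP:-}" ]; then
        echo "SCRATCH or RAP is not set" >&2
        return 1
    fi
    # -p: do not fail when the directories already exist
    mkdir -p "$SCRATCH/$USER" "$RAP/$USER" || return 1
    # -s: symbolic link, -f: replace an existing link,
    # -n: do not follow an existing link that points to a directory
    ln -sfn "$SCRATCH/$USER" "$target_dir/scratch" || return 1
    ln -sfn "$RAP/$USER" "$target_dir/rap"
}
```

Calling setup_links without an argument creates the links in your home directory; passing a directory creates them there instead.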

Modules

Colosse provides a lot of preinstalled software, which is made available through modules.

Listing all available modules

module avail

Searching for a specific module

module spider [keyword]

For example, running module spider gcc returns:

-----------------
  compilers/gcc:
-----------------
     Versions:
        compilers/gcc/4.5
        compilers/gcc/4.6
        compilers/gcc/4.8
        compilers/gcc/4.8.5
        compilers/gcc/4.9
        compilers/gcc/5.4

Loading modules

To use a module, you must first load it using the following command:

module load [module name]

For example, module load compilers/gcc/4.8.5 loads version 4.8.5 of the gcc compiler.

Manually loading modules each time you log in is tedious and can be avoided by using a .bashrc file.

Copy the bashrc file provided with this tutorial to the root of your home directory as .bashrc.
Then load it by running the following command. On later logins, the file is loaded automatically.
```
source ~/.bashrc
```
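The bashrc file shipped with the tutorial is not reproduced here. As a rough idea of what such a file can contain, here is a hypothetical sketch; the section names and the choice of module are assumptions, and the real file may differ.

```shell
# ~/.bashrc (hypothetical sketch; the file provided with the tutorial may differ)

# --- Modules ---
# Only call `module` when it exists, so this file is harmless on other machines.
if command -v module >/dev/null 2>&1; then
    module load compilers/gcc/4.8.5
fi

# --- Software ---
# Uncomment once the virtual environment has been created.
# source ~/env/bin/activate
```

Guarding the module call keeps the file usable if you copy it to a machine that has no module system.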

Configuring your development environment

Directory structure

In addition to the scratch and rap directories, I like to have a dev directory, where I keep all my code repositories. Run the following commands at the root of your home directory.

mkdir dev
mkdir dev/git

Configuring Python

Creating a virtual environment

First, create a virtual environment by using the following command at the root of your home directory.

virtualenv env

Then, open up your .bashrc file and uncomment the following line in the Software section.

source ~/env/bin/activate

This will load your Python environment when you log in. Now, run source ~/.bashrc followed by which python. The last command should point to an executable in your virtual environment.
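The same check can be scripted. The sketch below relies on the VIRTUAL_ENV variable, which the activate script exports; python_in_env is a hypothetical helper name.

```shell
# Sketch: check that the python found on PATH belongs to the active
# virtual environment. VIRTUAL_ENV is exported by the activate script.
# python_in_env is a hypothetical helper name.
python_in_env() {
    [ -n "${VIRTUAL_ENV:-}" ] || return 1
    case "$(command -v python)" in
        "$VIRTUAL_ENV"/bin/*) return 0 ;;
        *) return 1 ;;
    esac
}

if python_in_env; then
    echo "python comes from $VIRTUAL_ENV"
else
    echo "python does not come from a virtual environment"
fi
```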

Installing numpy

First, run the following commands.

pip install --upgrade pip
pip install cython

Run the following command, which tells numpy where the MKL library is located.

cat > ~/.numpy-site.cfg << EOF
[mkl]
library_dirs = $MKLROOT/lib/intel64
include_dirs = $MKLROOT/include
mkl_libs = mkl_rt
lapack_libs =
EOF
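Because the EOF delimiter is unquoted, the shell substitutes $MKLROOT when the file is created. If the variable is empty, the file ends up containing /lib/intel64 and /include, which numpy will not find. A minimal guard, with a hypothetical helper name, could look like this:

```shell
# Sketch: print the MKL library directory, or fail when MKLROOT is empty.
# mkl_lib_dir is a hypothetical helper name.
mkl_lib_dir() {
    if [ -z "${MKLROOT:-}" ]; then
        echo "MKLROOT is not set" >&2
        return 1
    fi
    echo "$MKLROOT/lib/intel64"
}

mkl_lib_dir || echo "fix MKLROOT before creating ~/.numpy-site.cfg"
```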

Then, go to the ~/dev/git directory and run the following commands.

git clone https://github.com/numpy/numpy.git
cd numpy
python setup.py install

Installing other useful packages

Run the following commands.

pip install ipython scipy scikit-learn h5py pandas

You can install any other package using pip.

Now, your environment is all set and you are ready to launch experiments!

Submitting jobs

Resource allocation project

First, determine what your resource allocation project is by running colosse-info. This prints a lot of information, including your various computation allocations. In my case, it prints

RAPI nne-790-aa: 0 used cores / 30 allocated cores (recent history)
RAPI nne-790-ae: 39.2049 used cores / 180 allocated cores (recent history)
RAPI agq-973-aa: 0 used cores / 30 allocated cores (recent history)
RAPI kyk-164-aa: 0 used cores / 30 allocated cores (recent history)

but you might only have one. Pick the allocation you want to use and remember its identifier, e.g., nne-790-ae.

Submitting a job to the scheduler

Now, open the example_job.msub file provided with this tutorial. The file header gives the scheduler some information about your job. For example, the header could be

#!/bin/bash
#PBS -l nodes=2:ppn=8,walltime=24:00:00
#PBS -o stdout.out
#PBS -e stderr.err
#PBS -V
#PBS -N myjob
#PBS -A nne-790-ae

In this case, the requested computing time is 24 hours. The job requires 2 nodes, with 8 CPUs each. The stderr and stdout streams are redirected to user-specified files. The name of the job is myjob. The resource allocation to use is nne-790-ae.
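To get a feel for the size of such a request, you can multiply nodes, cores per node and walltime. The sketch below does this arithmetic; the function names are made up, and the result is only the upper bound that the header reserves, not a statement about how the scheduler accounts for usage.

```shell
# to_int FIELD: strip one leading zero so that "08" is not read as octal
to_int() {
    n="${1#0}"
    echo "${n:-0}"
}

# core_hours NODES PPN HH:MM:SS: core-hours reserved by a request
core_hours() {
    h="${3%%:*}"; rest="${3#*:}"
    m="${rest%%:*}"; s="${rest#*:}"
    seconds=$(( $(to_int "$h") * 3600 + $(to_int "$m") * 60 + $(to_int "$s") ))
    echo $(( $1 * $2 * seconds / 3600 ))
}

core_hours 2 8 24:00:00   # prints 384
```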

Copy the example_job.msub file to a directory called ~/scratch/example_job. Replace the resource allocation project identifier with yours. Then, submit the job using the following command.

msub example_job.msub

Our example job will run for 5 minutes, so it should not be queued for a long time.
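The example_job.msub file itself is not reproduced here. The following is a hypothetical sketch of a comparable job file, generated by a small helper so that the allocation identifier is filled in for you. The helper name, the 10-minute walltime and the sleep workload are assumptions; PBS_O_WORKDIR is the variable that Torque sets to the submission directory.

```shell
# write_example_job FILE RAPI: write a small job file (hypothetical helper)
write_example_job() {
    cat > "$1" << EOF
#!/bin/bash
#PBS -l nodes=1:ppn=8,walltime=00:10:00
#PBS -o stdout.out
#PBS -e stderr.err
#PBS -V
#PBS -N example_job
#PBS -A $2

# Start in the directory from which the job was submitted
cd "\$PBS_O_WORKDIR"
# Placeholder workload: wait for 5 minutes
sleep 300
EOF
}
```

For example, write_example_job example_job.msub nne-790-ae writes a file that can then be passed to msub.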

Once the job is submitted, you can use the i command to list the jobs that are in the waiting queue, i.e., the IDLE state. The r command shows all the jobs that are running and the b command shows all the jobs that are blocked, i.e., that the server refuses to run for the moment.
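The i, r and b shortcuts may not exist in every session. Assuming the scheduler is Moab (msub is a Moab command), its showq command offers comparable views. The helper names below are made up, and the exact flags should be checked against showq's documentation on the cluster.

```shell
# Hypothetical helpers wrapping Moab's showq; the function names are made up.
queue_idle()    { showq -i -u "$USER"; }   # jobs waiting in the queue
queue_running() { showq -r -u "$USER"; }   # jobs currently running
queue_blocked() { showq -b -u "$USER"; }   # jobs the scheduler will not run yet
```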

That’s it! You can now submit jobs on Colosse.

Other useful info

Logging in without a password

If you don’t want to have to type your password every time you connect to Colosse, do the following:

If you don’t already have an ssh key for your computer, generate one by typing ssh-keygen in a terminal.
Copy the key to the supercomputer using the following command:

ssh-copy-id username@colosse.calculquebec.ca

Enter your password when prompted and you’re done.

Copying files over ssh

To copy files to the supercomputer, you can use the scp utility. For example, you could use this command to copy a directory called mydir on your computer to the myremotedir directory on Colosse.

scp -r mydir username@colosse.calculquebec.ca:myremotedir/

You could use a similar command to get the directory from Colosse:

scp -r username@colosse.calculquebec.ca:myremotedir/mydir/ .