I'm struggling to understand how to run multiple processes in the same node using SLURM.
Suppose I want to run a program with 100 different input arguments. This is what I would do on my laptop for example:
for i in `seq 100`; do
    ./program ${i}
done
Now I have access to a cluster with 24-core nodes. So I want to run all 100 instances of the program at the same time, spread over 5 nodes (24 on each of 4 nodes + 4 on a 5th node).
I thought the submit script should look like this:
#!/bin/bash
#SBATCH -N 5
#SBATCH -n 100
#SBATCH --ntasks-per-node=24

for i in `seq 100`; do
    srun ./program ${i} &
done
wait
It turns out that, with this submit script, ./program is run multiple times for every i value, even though srun is called only once per loop iteration.
What is going on? What is the right way to do this?
Best Answer
By default, srun will use the full allocation it runs in, so here the full 100 tasks; that is why each srun invocation launches 100 copies of ./program. To tell it to use only a single core, you need to run
srun --exclusive --ntasks 1 ...
From the srun manpage:
This option can also be used when initiating more than one job step within an existing resource allocation, where you want separate processors to be dedicated to each job step. If sufficient processors are not available to initiate the job step, it will be deferred. This can be thought of as providing a mechanism for resource management to the job within its allocation.
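To see the deferral behavior described above, here is a hypothetical toy script (the sleep command, the task counts, and the step count are illustrative, not from the original post): with an allocation of only 4 tasks, 4 of the 8 backgrounded steps run at once and the rest wait until a core frees up.

#!/bin/bash
#SBATCH -n 4

# Toy example: 8 single-task job steps on a 4-task allocation.
# SLURM starts 4 steps immediately; the other 4 are deferred
# until a processor becomes free, as the manpage describes.
for i in `seq 8`; do
    srun --exclusive --ntasks 1 sleep 10 &
done
wait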
Adding --nodes 1 will get rid of the warnings:
#!/bin/bash
#SBATCH -N 5
#SBATCH -n 100
#SBATCH --ntasks-per-node=24

for i in `seq 100`; do
    srun --exclusive --nodes 1 --ntasks 1 ./program ${i} &
done
wait
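Assuming the script is saved as submit.sh (the file name is just an example), you would submit it and watch it run with the standard SLURM commands:

sbatch submit.sh
squeue -u $USER      # watch the job while it is running
sacct -j <jobid>     # per-step accounting after it finishes

With the corrected script, squeue should show a single job whose 100 job steps run 24 at a time per node, rather than 100 copies of the program per srun call.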