Training models in parallel

Please note that the following steps are specific to the Nikhef Stoomboot computing cluster, which runs the CentOS 7 Linux distribution at the time of writing. Other setups may require an alternative script configuration.

Training models on different replicas can be done in parallel, for example on a computing cluster. EELSFitter supports parallel training provided the user specifies a few parameters: the number of batch jobs n_batches submitted to the cluster (same value for all batches), the number of replicas n_replica trained per batch (same value for all batches), and an index n_batch_of_replica identifying a particular batch (a different value for each batch). These can, for example, be passed to the script via the command line, as in the example setup further below. If you are running the code on your own machine on a single core, set n_batches=1 and n_batch_of_replica=1.
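
The sketch below (a hypothetical helper, not part of EELSFitter) only illustrates how these three parameters relate: with n_batches batches of n_replica replicas each, a total of n_batches * n_replica replicas are trained, and batch n_batch_of_replica covers one slice of that total.

def replica_indices(n_batch_of_replica, n_replica):
    # Replicas covered by a given batch, using 1-based indexing
    start = (n_batch_of_replica - 1) * n_replica + 1
    return list(range(start, start + n_replica))

print(replica_indices(3, 10))    # batch 3 with 10 replicas per batch covers replicas 21 to 30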

An example setup for submitting code to a cluster is shown below. First, a bash script executes the commands that submit the jobs to a job scheduler.

#!/bin/bash
pbs_file=/path/to/pbs_file.pbs

path_to_image="/path/to/dm4_file.dm4"
path_to_models="/path/to/output_of_models/"

n_batches=100
for n_batch_of_replica in $(seq 1 $n_batches); do
    # Submit one job per batch; ARG, ARG2, ARG3 and ARG4 are forwarded to the .pbs script
    <cluster_specific_submission_code> ARG=$path_to_image,ARG2=$path_to_models,ARG3=$n_batch_of_replica,ARG4=$n_batches $pbs_file
done
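
The placeholder <cluster_specific_submission_code> depends on the job scheduler. On a PBS/Torque scheduler, for example, it could be qsub with its -v option, which passes a comma-separated list of environment variables to the job (shown here only as an illustration; consult your cluster's documentation):

qsub -v ARG=$path_to_image,ARG2=$path_to_models,ARG3=$n_batch_of_replica,ARG4=$n_batches $pbs_file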

A .pbs file specifies where the Python installation is located, so that the compute node can execute the code.

source /path/to/miniconda3/etc/profile.d/conda.sh
conda activate <environmentname>
python /path/to/python_file.py ${ARG} ${ARG2} ${ARG3} ${ARG4}
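
Depending on the cluster, the .pbs file may also need scheduler directives at the top, for instance to request a walltime, resources, or a location for the log files. The lines below are generic PBS examples with placeholder values, not Stoomboot-specific settings:

#PBS -l walltime=04:00:00
#PBS -l nodes=1:ppn=1
#PBS -o /path/to/logs/
#PBS -e /path/to/logs/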

Finally, the Python file contains the code you want to execute.

import sys
import EELSFitter as ef

path_to_image = sys.argv[1]
path_to_models = sys.argv[2]
n_batch_of_replica = int(sys.argv[3])
n_batches = int(sys.argv[4])

im = ef.SpectralImage.load_data(path_to_image)
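
# The training settings used below (n_clusters, seed, based_on, n_replica,
# n_epochs, shift_dE1, shift_dE2, regularisation_constant and signal_type)
# are not set by this snippet; define them here or load them from a
# configuration file (see the EELSFitter documentation for their meaning).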
im.train_zlp_models(n_clusters=n_clusters,
                    seed=seed,
                    based_on=based_on,
                    n_replica=n_replica,
                    n_epochs=n_epochs,
                    n_batch_of_replica=n_batch_of_replica,
                    n_batches=n_batches,
                    shift_de1=shift_dE1,
                    shift_de2=shift_dE2,
                    regularisation_constant=regularisation_constant,
                    path_to_models=path_to_models,
                    signal_type=signal_type)
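
As a small variation (a sketch, not part of EELSFitter), the positional sys.argv parsing above can be replaced by argparse from the Python standard library, which provides named options and a built-in --help message:

import argparse

parser = argparse.ArgumentParser(description="Train ZLP models for one batch of replicas")
parser.add_argument("path_to_image", help="path to the spectral image, e.g. a .dm4 file")
parser.add_argument("path_to_models", help="directory where the trained models are stored")
parser.add_argument("n_batch_of_replica", type=int, help="index of this batch (1 to n_batches)")
parser.add_argument("n_batches", type=int, help="total number of batches")
args = parser.parse_args()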