Example: Simulation on GPU Cluster

This section describes how to set up the environment to run APPFL simulations using either gRPC or MPI on a GPU cluster, which is useful for benchmarking the performance of different FL algorithms on various datasets. In this example, we partition the CIFAR10 dataset in a non-independent and identically distributed (non-IID) manner and train a ResNet-18 model using federated learning.

gRPC Simulation on Polaris Cluster

Note

This section is based on the Polaris supercomputer at the Argonne Leadership Computing Facility (ALCF), which uses the Portable Batch System (PBS) as its job scheduler.

Loading Modules

Most HPC clusters use modules to manage the environment, and the module configuration may vary depending on the cluster you use. On the Polaris supercomputer, load the necessary modules via the following commands:

module use /soft/modulefiles
module load conda
module save

Creating Conda Environment and Installing APPFL

Now, we can create a conda environment and install APPFL.

conda create -n appfl python=3.10 # or conda create -p /path/to/env python=3.10
conda activate appfl
git clone https://github.com/APPFL/APPFL.git
cd APPFL
pip install -e ".[examples]"
cd examples
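
(Optional) You can verify the installation by checking that the appfl package is visible to pip and importable.

pip show appfl
python -c "import appfl"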

Creating Batch Script

The Polaris supercomputer uses the PBS workload manager for job management. Below is an example batch script to run the gRPC simulation on the Polaris cluster, which launches one server and two clients. Please replace <your_project> with your project name, <your_env> with the name of your conda environment, and <your_appfl_path> with the path to the APPFL repository.

submit.sh
#!/bin/bash
#PBS -A <your_project>
#PBS -q debug
#PBS -l walltime=00:15:00
#PBS -l nodes=1:ppn=64
#PBS -l filesystems=home:eagle:grand
#PBS -m bae

# Set proxy
export HTTP_PROXY="http://proxy.alcf.anl.gov:3128"
export HTTPS_PROXY="http://proxy.alcf.anl.gov:3128"
export http_proxy="http://proxy.alcf.anl.gov:3128"
export https_proxy="http://proxy.alcf.anl.gov:3128"
export ftp_proxy="http://proxy.alcf.anl.gov:3128"
export no_proxy="admin,polaris-adminvm-01,localhost,*.cm.polaris.alcf.anl.gov,polaris-*,*.polaris.alcf.anl.gov,*.alcf.anl.gov"

# Load modules and activate conda environment
module use /soft/modulefiles
module load conda
conda activate <your_env>
cd <your_appfl_path>/APPFL/examples

# Launch the server
python ./grpc/run_server.py --config ./resources/configs/cifar10/server_fedavg.yaml &
sleep 20  # give the server time to start before launching the clients
echo "Server is ready"

# Launch the clients
python ./grpc/run_client.py --config ./resources/configs/cifar10/client_1.yaml &
python ./grpc/run_client.py --config ./resources/configs/cifar10/client_2.yaml &
wait

Note

On Polaris, it is important to set the proxy environment variables to access the internet from the cluster.

You can submit the script via the following command.

qsub submit.sh
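
You can check the status of the submitted job with the standard PBS command.

qstat -u $USER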

Two output files, submit.sh.o{job_id} and submit.sh.e{job_id}, are generated when the script starts to run. You can check the output in real time by running one of the following commands.

tail -f -n 10 submit.sh.o{job_id}
# or
tail -f -n 10 submit.sh.e{job_id}

MPI Simulation on Delta Cluster

Note

This section is based on the Delta supercomputer at the National Center for Supercomputing Applications (NCSA), which uses Slurm as its job scheduler.

Loading Modules

Most HPC clusters use modules to manage the environment, and the module configuration may vary depending on the cluster you use. On the Delta supercomputer, the following modules are loaded:

1) gcc/11.4.0   2) openmpi/4.1.6   3) cuda/11.8.0   4) cue-login-env/1.0   5) slurm-env/0.1   6) default-s11   7) anaconda3_gpu/23.9.0
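
If some of these modules are not loaded by default in your environment, you can load them explicitly (module names and versions may differ depending on your cluster's software stack).

module load gcc/11.4.0 openmpi/4.1.6 cuda/11.8.0 anaconda3_gpu/23.9.0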

You need to run module save to save the current module configuration.

module save

Creating Conda Environment and Installing APPFL

Now, we can create a conda environment and install APPFL.

conda create -n appfl python=3.10 # or conda create -p /path/to/env python=3.10
conda activate appfl
git clone --single-branch --branch main https://github.com/APPFL/APPFL.git
cd APPFL
pip install -e ".[mpi,examples]"
cd examples

Creating Batch Script

The Delta supercomputer uses the Slurm workload manager for job management. Below is an example batch script that runs the MPI simulation with six MPI processes (one server and five clients). Please replace <xxxx-delta-gpu> with your account name and <your_path_to_appfl> with the path to the APPFL repository.

submit.sh
#!/bin/bash
#SBATCH --mem=150g                              # required amount of memory
#SBATCH --nodes=1                               # number of required nodes
#SBATCH --ntasks-per-node=6                     # number of tasks per node [SHOULD BE EQUAL TO THE NUMBER OF CLIENTS+1]
#SBATCH --cpus-per-task=1                       # <- match to OMP_NUM_THREADS
#SBATCH --partition=gpuA40x4                    # <- one of: gpuA100x4 gpuA40x4 gpuA100x8 gpuMI100x8
#SBATCH --account=<xxxx-delta-gpu>              # <- replace xxxx with your project name
#SBATCH --job-name=APPFL-test                   # job name
#SBATCH --time=00:15:00                         # dd-hh:mm:ss for the job
#SBATCH --gpus-per-node=1
#SBATCH --gpu-bind=none

source ~/.bashrc
conda activate appfl
cd <your_path_to_appfl>/examples
mpiexec -np 6 python ./mpi/run_mpi.py --server_config ./resources/configs/cifar10/server_fedcompass.yaml \
        --client_config ./resources/configs/cifar10/client_1.yaml

The script can be submitted to the cluster using the following command.

sbatch submit.sh

You should see output similar to the following.

Submitted batch job {job_id}
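
While the job is pending or running, you can check its status with the standard Slurm command.

squeue -u $USER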

The output file slurm-{job_id}.out is generated when the script starts to run, and you can check the output in real time by running the following command.

tail -f -n 10 slurm-{job_id}.out

Multi-GPU Training

APPFL supports distributed data parallelism (DDP) for multi-GPU training. To enable DDP, users only need to specify the device as a list of CUDA devices in the client configuration file, for example in examples/resources/configs/cifar10/client_1_multigpus.yaml. The excerpt below sketches the relevant entry; consult the shipped configuration file for the exact key layout.
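
# client_1_multigpus.yaml (illustrative excerpt): only the device entry is shown;
# the surrounding keys follow the same layout as the other CIFAR10 client configs.
train_configs:
  device:           # a list of CUDA devices enables DDP for this client
    - "cuda:0"
    - "cuda:1"
    - "cuda:2"
    - "cuda:3"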

Note

When using multi-GPU training, please make sure that the training and validation batch sizes are divisible by the number of GPUs.

Below is an example batch script to run the multi-GPU training on the Delta cluster using MPI.

submit.sh
#!/bin/bash
#SBATCH --mem=150g                              # required amount of memory
#SBATCH --nodes=1                               # number of required nodes
#SBATCH --ntasks-per-node=6                     # number of tasks per node [SHOULD BE EQUAL TO THE NUMBER OF CLIENTS+1]
#SBATCH --cpus-per-task=1                       # <- match to OMP_NUM_THREADS
#SBATCH --partition=gpuA40x4                    # <- one of: gpuA100x4 gpuA40x4 gpuA100x8 gpuMI100x8
#SBATCH --account=<xxxx-delta-gpu>              # <- replace xxxx with your project name
#SBATCH --job-name=APPFL-test                   # job name
#SBATCH --time=00:15:00                         # dd-hh:mm:ss for the job
#SBATCH --gpus-per-node=4
#SBATCH --gpu-bind=none

# Activate conda environment
source ~/.bashrc
conda activate appfl
cd <your_path_to_appfl>/examples

# Launch the experiment
mpiexec -np 6 python ./mpi/run_mpi.py --server_config ./resources/configs/cifar10/server_fedcompass.yaml \
                --client_config ./resources/configs/cifar10/client_1_multigpus.yaml

Below is an example batch script to run the multi-GPU training on the Polaris cluster using MPI. If your job needs internet access (e.g., to download the CIFAR10 dataset), remember to set the proxy environment variables as in the gRPC example above.

submit.sh
#!/bin/bash
#PBS -A <your_project>
#PBS -q debug
#PBS -l walltime=00:15:00
#PBS -l nodes=1:ppn=64
#PBS -l filesystems=home:eagle:grand
#PBS -m bae

# Load modules and activate conda environment
module use /soft/modulefiles
module load conda
conda activate <your_env>
cd <your_appfl_path>/APPFL/examples

# Launch the experiment
mpiexec -np 6 python ./mpi/run_mpi.py --server_config ./resources/configs/cifar10/server_fedcompass.yaml \
                --client_config ./resources/configs/cifar10/client_1_multigpus.yaml