On GPU Cluster (Argonne's Swing)
================================
This describes how to set up the environment to run APPFL in GPU cluster. This tutorial is generated based on SWING GPU cluster in Argonne National Laboratory. The cluster information is avaiable at `Laboratory Computing Resource Center `_. In this tutorial, we use MNIST example to run APPFL in the cluster.
Preparing Training
--------------------------------
We assume user run the MNIST example in locally machine according to `Our first run MNIST `_. MNIST datasets will be downloaded while running the MNIST example.
We upload the data and example code from local machine to cluster.
.. code-block:: console
cd APPFL/examples
ssh [your_id]@[cluster_destination] mkdir -p workspace
scp -r * [your_id]@[cluster_destination]:workspace
Please check if the workspace folder contains "datasets", "mnist.py", "models" for this tutorial.
Loading Modules
------------------------------------
This tutorial uses `modules `_ in SWING cluster. The module configuration may vary depending on the Clusters.
.. code-block:: console
module load gcc/9.2.0-r4tyw54 cuda/11.4.0-gqbcqie openmpi/4.1.4-cuda-ucx anaconda3
Creating Conda Environment and Installing APPFL
-----------------------------------------------
Anaconda environment is used to control dependencies.
.. code-block:: console
conda create -n APPFL python=3.8
conda activate APPFL
pip install pip --upgrade
pip install "appfl[dev,examples,analytics]"
Modifying Dependencies for CUDA Support
---------------------------------------------
SWING Cluster uses CUDA 11.4 version, so we need to modify torch version to adjust to the CUDA version. CUDA version may vary depending on the Clusters. A different version of CUDA may require changing the `torch `_ versions.
.. code-block:: console
pip uninstall torch tourchvision
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
.. Note::
``pip install chardet`` may need to resolve the dependency issue from the torchvision package.
Creating Batch Script
---------------------------------------------
SWING Cluster uses Slurm workload manager for job management. The job management configuration may vary depending on the Clusters.
.. code-block:: console
:caption: test.sh
#!/bin/bash
#
#SBATCH --job-name=APPFL-test
#SBATCH --account=
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
#SBATCH --time=00:05:00
mpiexec -np 2 --mca opal_cuda_support 1 python ./mnist.py --num_clients=2
The script needs to be submitted to run.
.. code-block:: console
sbatch test.sh
You may see the output.
.. code-block:: console
Submitted batch job {job_id}
The output file is generated when the script run.
.. code-block:: console
cat slurm-{job_number}.out