Example: Run FL Experiment using Ray¶
This tutorial describes how to run federated learning experiments with APPFL using Ray on cloud platforms such as AWS, GCP, and Azure. All the code snippets needed for this tutorial are available in the examples directory of the APPFL repository.
Note
For more detailed information about Ray, please refer to the Ray documentation.
Installation¶
First, install the APPFL package on the local machine. The commands below install the APPFL package from its source code. For more information, please refer to the APPFL installation guide.
git clone --single-branch --branch main https://github.com/APPFL/APPFL.git
cd APPFL
conda create -n appfl python=3.10 -y
conda activate appfl
pip install -e ".[examples]"
Client Configurations¶
The server needs to collect certain information from the clients to run the federated learning experiment. Below is an example of a client configuration file, which is available at examples/resources/config_ray/mnist/clients.yaml in the APPFL repository.
clients:
- client_id: "Client1"
train_configs:
# Device [Optional]: default is "cpu"
device: "cpu"
# Logging [Optional]
logging_output_dirname: "./output"
logging_output_filename: "result"
# Local dataset
data_configs:
dataset_path: "./resources/dataset/mnist_dataset.py"
dataset_name: "get_mnist"
dataset_kwargs:
num_clients: 2
client_id: 0
partition_strategy: "class_noniid"
visualization: True
- client_id: "Client2"
train_configs:
# Device [Optional]: default is "cpu"
device: "cpu"
# Logging [Optional]
logging_output_dirname: "./output"
logging_output_filename: "result"
# Local dataset
data_configs:
dataset_path: "./resources/dataset/mnist_dataset.py"
dataset_name: "get_mnist"
dataset_kwargs:
num_clients: 2
client_id: 1
partition_strategy: "class_noniid"
visualization: False
It should be noted that the client configuration file actually resides on the server machine, and its contents are provided by the clients. Specifically, there are three main parts in the client configuration file:
client_id: A unique ID defined for the client machine.
train_configs: The training configurations for the client, including the device to run the training on, logging configurations, etc.
data_configs: Information about a dataloader Python file defined and shared by the clients with the server (located at dataset_path on the server machine). The dataloader file should contain a function (specified by dataset_name) that can load the client's local private dataset when it is executed on the client's machine. A minimal sketch of such a file is shown below.
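For reference, the following is a simplified, hypothetical sketch of a dataloader file. It assumes the function returns a (train, test) dataset pair; the actual examples/resources/dataset/mnist_dataset.py in the repository implements the class_noniid partitioning and visualization options shown in the configuration above and should be treated as the authoritative version.
# Hypothetical, simplified sketch of a client dataloader file.
# See mnist_dataset.py in the APPFL repository for the real implementation.
import torchvision
from torch.utils.data import Subset
from torchvision.transforms import ToTensor

def get_mnist(num_clients: int = 1, client_id: int = 0, **kwargs):
    # This function runs on the client machine, so the dataset is
    # downloaded/loaded locally and never leaves the client.
    train_data = torchvision.datasets.MNIST("./data", train=True, download=True, transform=ToTensor())
    test_data = torchvision.datasets.MNIST("./data", train=False, download=True, transform=ToTensor())
    # Naive round-robin split for illustration only; the bundled script
    # supports the "class_noniid" partition strategy instead.
    train_split = Subset(train_data, list(range(client_id, len(train_data), num_clients)))
    return train_split, test_data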
Server Configurations¶
We provide three sample server configuration files at examples/resources/config_ray in the APPFL repository. A detailed description of the server configuration file can be found in the APPFL documentation.
It should be noted that client_configs.comm_configs.ray_configs is optional and only needs to be set if the user wants client tasks to be assigned to worker nodes randomly, instead of to a particular AWS instance, by setting assign_random to True (by default it is False). The same behavior needs to be configured in ray_cluster_config.yaml as well (i.e., leaving the per-worker resources field empty).
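For illustration, enabling random assignment in the server configuration might look like the following minimal sketch, based on the field names described above (see the sample files in examples/resources/config_ray for the full configuration):
client_configs:
  comm_configs:
    ray_configs:
      assign_random: True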
To use AWS S3 for model parameter transmission, add a configuration under comm_configs as s3_configs. Set enable_s3 to True, and specify the s3_bucket field with the name of the S3 bucket that you want to use. Additionally, set s3_creds_file to the path of a CSV file containing AWS credentials in the following format:
<region>,<access_key_id>,<secret_access_key>
Note
The server can also set this information before running the experiment via the aws configure command.
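For example, the S3-related section of the server configuration might look like the following minimal sketch (the bucket name and credentials path are placeholders; see the sample server configuration files in examples/resources/config_ray for the exact nesting):
comm_configs:
  s3_configs:
    enable_s3: True
    s3_bucket: "my-appfl-bucket"                  # placeholder bucket name
    s3_creds_file: "./resources/credentials.csv"  # placeholder path to the CSV file above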
Ray Cluster Configurations¶
Below is the cluster configuration file for running the experiment in an AWS cloud environment.
# A unique identifier for the head node and workers of this cluster.
cluster_name: appfl-ray
# Cloud-provider specific configuration.
provider:
type: aws
region: us-east-1
cache_stopped_nodes: False # if set False terminates the instance when ray down is executed, True: instance stopped not terminated
security_group:
GroupName: ray_client_security_group
IpPermissions:
- FromPort: 8265
ToPort: 8265
IpProtocol: TCP
IpRanges:
# 0.0.0.0/0 allows inbound traffic from any IP address on this port; restrict it to your own IP address for better security.
- CidrIp: 0.0.0.0/0
# The maximum number of workers nodes to launch in addition to the head node.
max_workers: 2
available_node_types:
ray.head.default:
resources: { }
# Provider-specific config for this node type, e.g., instance type.
# By default Ray auto-configures unspecified fields such as SubnetId and KeyName.
# For more documentation on available fields, see
# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
node_config:
InstanceType: t3.medium
ImageId: 'ami-0dd6adfad4ad37eec' # Deep Learning Base Neuron AMI (Ubuntu 20.04) 20240216
ray.worker.worker_1:
# The minimum number of worker nodes of this type to launch.
# This number should be >= 0. For FL experiment 1 is sufficient.
min_workers: 1
# The maximum number of worker nodes of this type to launch.
# This parameter takes precedence over min_workers. For FL experiment 1 is sufficient.
max_workers: 1
# Set this to {${client_id} : 1}, client_id from examples/resources/config_ray/mnist/clients.yaml config file
# Set it to empty if client task can be assigned randomly to any worker node
resources: {Client1: 1}
node_config:
InstanceType: t3.medium
ImageId: 'ami-0dd6adfad4ad37eec' # Deep Learning Base Neuron AMI (Ubuntu 20.04) 20240216
InstanceMarketOptions:
MarketType: spot # Configure worker nodes to use Spot Instances
SpotOptions:
MaxPrice: '0.05'
ray.worker.worker_2:
min_workers: 1
max_workers: 1
resources: {Client2: 1}
node_config:
InstanceType: t3.medium
ImageId: 'ami-0dd6adfad4ad37eec' # Deep Learning Base Neuron AMI (Ubuntu 20.04) 20240216
InstanceMarketOptions:
MarketType: spot # Configure worker nodes to use Spot Instances
SpotOptions:
MaxPrice: '0.05'
file_mounts: {
"/home/ubuntu/APPFL": "../../../APPFL",
"/home/ubuntu/resources": "../resources",
"/home/ubuntu/run.py": "run.py"
}
setup_commands:
["conda config --remove channels intel",
"conda create -n APPFL python=3.10 -y ",
'conda activate APPFL && pip install ray["default"] && pip install confluent-kafka --prefer-binary && cd APPFL && pip install -e ".[examples]"',
"(stat $HOME/anaconda3/envs/APPFL/ &> /dev/null && echo 'export PATH=\"$HOME/anaconda3/envs/APPFL/bin:$PATH\"' >> ~/.bashrc) || true"]
You can set the desired AWS region under provider.region. All EC2 instance-related configuration for the head node and worker nodes goes in node_config, which includes InstanceType, ImageId (the AMI image ID), spot vs. on-demand market options, etc. For more documentation on the available fields, see the boto3 create_instances reference linked in the comments above. For descriptions of the other fields, you can follow the inline comments in examples/ray/ray_cluster_config.yaml in the APPFL repository.
Running Experiment¶
Environment setup¶
Configure AWS credentials for an IAM user/role that has the AmazonEC2FullAccess and AmazonEC2RoleforSSM policies.
Cluster Creation¶
Go inside the ray example directory:
cd examples/ray/
Run the command below, which brings up the whole cluster described in examples/ray/ray_cluster_config.yaml:
ray up ray_cluster_config.yaml
Note
To reduce cluster spin-up time, create a custom AMI by running the setup commands on the base image specified in ray_cluster_config.yaml. After creating the custom AMI, you can provide it in ray_cluster_config.yaml under the ImageId attribute of each node.
Checking cluster status¶
From Local Machine¶
You can check the cluster status by running:
ray exec ray_cluster_config.yaml 'ray status'
From Head Node¶
Attach to the head node using:
ray attach ray_cluster_config.yaml
Check the cluster status after attaching to the head node using:
ray status
The output of ray status would look like the following:
======== Autoscaler status: 2025-02-25 20:18:02.106153 ========
Node status
---------------------------------------------------------------
Active:
1 ray.worker.worker_2
1 ray.head.default
1 ray.worker.worker_1
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/6.0 CPU
0.0/1.0 Client1
0.0/1.0 Client2
0B/7.64GiB memory
0B/3.16GiB object_store_memory
Demands:
(no resource demands)
Job Submission¶
From Local Machine¶
Set up port forwarding to the Ray dashboard using:
ray dashboard ray_cluster_config.yaml
Then, in another terminal, you can submit the job using:
ray job submit --address http://localhost:8265 -- python APPFL/examples/ray/run.py
From Head Node¶
Connect to the head node:
ray attach ray_cluster_config.yaml
Run the job using:
python run.py
Stopping Cluster¶
To stop the cluster, run:
ray down ray_cluster_config.yaml