Example: Run FL Experiment using Ray¶
This tutorial describes how to run federated learning experiments with APPFL using Ray on cloud platforms such as AWS, GCP, and Azure. All the code snippets needed for this tutorial are available in the examples directory of the APPFL repository.
Note
For more detailed information about Ray, please refer to the Ray documentation.
Installation¶
First, install the APPFL package on the local machine. The commands below install the APPFL package from its source code. For more information, please refer to the APPFL installation guide.
git clone --single-branch --branch main https://github.com/APPFL/APPFL.git
cd APPFL
conda create -n appfl python=3.10 -y
conda activate appfl
pip install -e ".[examples]"
Client Configurations¶
The server needs to collect certain information from the clients to run the federated learning experiment. Below is an example of a client configuration file, which is available at examples/resources/config_ray/mnist/clients.yaml in the APPFL repository.
clients:
- client_id: "Client1"
train_configs:
# Device [Optional]: default is "cpu"
device: "cpu"
# Logging [Optional]
logging_output_dirname: "./output"
logging_output_filename: "result"
# Local dataset
data_configs:
dataset_path: "./resources/dataset/mnist_dataset.py"
dataset_name: "get_mnist"
dataset_kwargs:
num_clients: 2
client_id: 0
partition_strategy: "class_noniid"
visualization: True
- client_id: "Client2"
train_configs:
# Device [Optional]: default is "cpu"
device: "cpu"
# Logging [Optional]
logging_output_dirname: "./output"
logging_output_filename: "result"
# Local dataset
data_configs:
dataset_path: "./resources/dataset/mnist_dataset.py"
dataset_name: "get_mnist"
dataset_kwargs:
num_clients: 2
client_id: 1
partition_strategy: "class_noniid"
visualization: False
It should be noted that the client configuration file actually resides on the server machine, and its contents are provided by the clients. Specifically, there are three main parts in the client configuration file:
client_id: A unique ID defined for the client machine.
train_configs: The training configurations for the client, including the device to run the training on, logging configurations, etc.
data_configs: Information about a dataloader Python file defined and shared by the clients with the server (located at dataset_path on the server machine). The dataloader file should contain a function (specified by dataset_name) that can load the client's local private dataset when it is executed on the client's machine. A minimal sketch of such a file is shown below.
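For reference, the following is a simplified, hypothetical sketch of a dataloader file. It assumes the function returns a (train, test) dataset pair; the actual examples/resources/dataset/mnist_dataset.py in the repository implements the class_noniid partitioning and visualization options shown in the configuration above and should be treated as the authoritative version.
# Hypothetical, simplified sketch of a client dataloader file.
# See mnist_dataset.py in the APPFL repository for the real implementation.
import torchvision
from torch.utils.data import Subset
from torchvision.transforms import ToTensor

def get_mnist(num_clients: int = 1, client_id: int = 0, **kwargs):
    # This function runs on the client machine, so the dataset is
    # downloaded/loaded locally and never leaves the client.
    train_data = torchvision.datasets.MNIST("./data", train=True, download=True, transform=ToTensor())
    test_data = torchvision.datasets.MNIST("./data", train=False, download=True, transform=ToTensor())
    # Naive round-robin split for illustration only; the bundled script
    # supports the "class_noniid" partition strategy instead.
    train_split = Subset(train_data, list(range(client_id, len(train_data), num_clients)))
    return train_split, test_data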
Server Configurations¶
We provide three sample server configuration files at examples/resources/config_ray in the APPFL repository. A detailed description of the server configuration file can be found in the APPFL documentation.
It should be noted that client_configs.comm_configs.ray_configs is optional and only needs to be set if the user wants client tasks to be assigned to worker nodes randomly, instead of to a particular AWS instance, by setting assign_random to True (by default it is False). The same behavior needs to be configured in ray_cluster_config.yaml as well (i.e., leaving the per-worker resources field empty).
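For illustration, enabling random assignment in the server configuration might look like the following minimal sketch, based on the field names described above (see the sample files in examples/resources/config_ray for the full configuration):
client_configs:
  comm_configs:
    ray_configs:
      assign_random: True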
To use AWS S3 for model parameter transmission, add a configuration under comm_configs as s3_configs. Set enable_s3 to True, and specify the s3_bucket field with the name of the S3 bucket that you want to use. Additionally, set s3_creds_file to the path of a CSV file containing AWS credentials in the following format:
<region>,<access_key_id>,<secret_access_key>
Note
The server can also set this information before running the experiment via the aws configure command.
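For example, the S3-related section of the server configuration might look like the following minimal sketch (the bucket name and credentials path are placeholders; see the sample server configuration files in examples/resources/config_ray for the exact nesting):
comm_configs:
  s3_configs:
    enable_s3: True
    s3_bucket: "my-appfl-bucket"                  # placeholder bucket name
    s3_creds_file: "./resources/credentials.csv"  # placeholder path to the CSV file above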
Ray Cluster Configurations¶
Below is the cluster configuration file for running the experiment in an AWS cloud environment.
# A unique identifier for the head node and workers of this cluster.
cluster_name: appfl-ray
# Cloud-provider specific configuration.
provider:
type: aws
region: us-east-1
cache_stopped_nodes: False # if set False terminates the instance when ray down is executed, True: instance stopped not terminated
security_group:
GroupName: ray_client_security_group
IpPermissions:
- FromPort: 8265
ToPort: 8265
IpProtocol: TCP
IpRanges:
# 0.0.0.0/0 allows inbound traffic from any IP address on this port; restrict it to your own IP address for better security.
- CidrIp: 0.0.0.0/0
# The maximum number of workers nodes to launch in addition to the head node.
max_workers: 2
available_node_types:
ray.head.default:
resources: { }
# Provider-specific config for this node type, e.g., instance type.
# By default Ray auto-configures unspecified fields such as SubnetId and KeyName.
# For more documentation on available fields, see
# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
node_config:
InstanceType: t3.medium
ImageId: 'ami-0dd6adfad4ad37eec' # Deep Learning Base Neuron AMI (Ubuntu 20.04) 20240216
ray.worker.worker_1:
# The minimum number of worker nodes of this type to launch.
# This number should be >= 0. For FL experiment 1 is sufficient.
min_workers: 1
# The maximum number of worker nodes of this type to launch.
# This parameter takes precedence over min_workers. For FL experiment 1 is sufficient.
max_workers: 1
# Set this to {${client_id} : 1}, client_id from examples/resources/config_ray/mnist/clients.yaml config file
# Set it to empty if client task can be assigned randomly to any worker node
resources: {Client1: 1}
node_config:
InstanceType: t3.medium
ImageId: 'ami-0dd6adfad4ad37eec' # Deep Learning Base Neuron AMI (Ubuntu 20.04) 20240216
InstanceMarketOptions:
MarketType: spot # Configure worker nodes to use Spot Instances
SpotOptions:
MaxPrice: '0.05'
ray.worker.worker_2:
min_workers: 1
max_workers: 1
resources: {Client2: 1}
node_config:
InstanceType: t3.medium
ImageId: 'ami-0dd6adfad4ad37eec' # Deep Learning Base Neuron AMI (Ubuntu 20.04) 20240216
InstanceMarketOptions:
MarketType: spot # Configure worker nodes to use Spot Instances
SpotOptions:
MaxPrice: '0.05'
file_mounts: {
"/home/ubuntu/APPFL": "../../../APPFL",
"/home/ubuntu/resources": "../resources",
"/home/ubuntu/run.py": "run.py"
}
setup_commands:
["conda config --remove channels intel",
"conda create -n APPFL python=3.10 -y ",
'conda activate APPFL && pip install ray["default"] && pip install confluent-kafka --prefer-binary && cd APPFL && pip install -e ".[examples]"',
"(stat $HOME/anaconda3/envs/APPFL/ &> /dev/null && echo 'export PATH=\"$HOME/anaconda3/envs/APPFL/bin:$PATH\"' >> ~/.bashrc) || true"]
You can set the desired AWS region under provider.region. All EC2 instance-related configuration for the head node and worker nodes goes in node_config, which includes InstanceType, ImageId (the AMI image ID), spot vs. on-demand market options, etc. For more documentation on the available fields, see the boto3 create_instances reference linked in the comments above. For descriptions of the other fields, you can follow the inline comments in examples/ray/ray_cluster_config.yaml in the APPFL repository.
Running Experiment¶
Environment setup¶
Configure AWS credentials for an IAM user/role that has the AmazonEC2FullAccess and AmazonEC2RoleforSSM policies.
Cluster Creation¶
Go inside the ray example directory:
cd examples/ray/
Run the command below, which brings up the whole cluster described in examples/ray/ray_cluster_config.yaml:
ray up ray_cluster_config.yaml
Note
To reduce cluster spin-up time, create a custom AMI by running the setup commands on the base image specified in ray_cluster_config.yaml. After creating the custom AMI, you can provide it in ray_cluster_config.yaml under the ImageId attribute of each node.
Checking cluster status¶
From Local Machine¶
You can check the cluster status by running:
ray exec ray_cluster_config.yaml 'ray status'
From Head Node¶
Attach to the head node using:
ray attach ray_cluster_config.yaml
Check the cluster status after attaching to the head node using:
ray status
The output of ray status would look like the following:
======== Autoscaler status: 2025-02-25 20:18:02.106153 ========
Node status
---------------------------------------------------------------
Active:
1 ray.worker.worker_2
1 ray.head.default
1 ray.worker.worker_1
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/6.0 CPU
0.0/1.0 Client1
0.0/1.0 Client2
0B/7.64GiB memory
0B/3.16GiB object_store_memory
Demands:
(no resource demands)
Job Submission¶
From Local Machine¶
Set up port forwarding to the Ray dashboard using:
ray dashboard ray_cluster_config.yaml
Then, in another terminal, you can submit the job using:
ray job submit --address http://localhost:8265 -- python APPFL/examples/ray/run.py
From Head Node¶
Connect to the head node:
ray attach ray_cluster_config.yaml
Run the job using:
python run.py
Stopping Cluster¶
To stop the cluster, run:
ray down ray_cluster_config.yaml