Example: Run FL Experiment using Ray =============================================== This tutorial describes how to run federated learning experiments with APPFL using Ray on cloud platforms such as AWS, GCP, and Azure. All the code snippets needed for this tutorial is available at the ``examples`` directory of the APPFL repository at `here `_. .. note:: For more detailed information about Ray, please refer to the `Ray documentation `_. Installation ------------ First, we should install the APPFL package on the local machines. Below shows how to install the APPFL package from its source code. For more information, please refer to the `APPFL installation guide `_. .. code-block:: bash git clone --single-branch --branch main https://github.com/APPFL/APPFL.git cd APPFL conda create -n appfl python=3.10 --y conda activate appfl pip install -e ".[examples]" Client Configurations --------------------- The server needs to collect certain information from the client to run the federated learning experiment. Below is an example of a client configuration file. It is available at ``examples/resources/config_ray/mnist/clients.yaml`` at the APPFL repository at `here `_. .. code-block:: yaml clients: - client_id: "Client1" train_configs: # Device [Optional]: default is "cpu" device: "cpu" # Logging [Optional] logging_output_dirname: "./output" logging_output_filename: "result" # Local dataset data_configs: dataset_path: "./resources/dataset/mnist_dataset.py" dataset_name: "get_mnist" dataset_kwargs: num_clients: 2 client_id: 0 partition_strategy: "class_noniid" visualization: True - client_id: "Client2" train_configs: # Device [Optional]: default is "cpu" device: "cpu" # Logging [Optional] logging_output_dirname: "./output" logging_output_filename: "result" # Local dataset data_configs: dataset_path: "./resources/dataset/mnist_dataset.py" dataset_name: "get_mnist" dataset_kwargs: num_clients: 2 client_id: 1 partition_strategy: "class_noniid" visualization: False It should be noted that the client configuration file actually resides on the server machine, and the contents of the file are shared by the clients. Specifically, there are three main parts in the client configuration file: - ``client_id``: It is the unique ID defined for the client machine. - ``train_configs``: It contains the training configurations for the client, including the device to run the training, logging configurations, etc. - ``data_configs``: It contains the information of a dataloader python file defined and shared by the clients to the server (located at ``dataset_path`` on the server machine). The dataloader file should contain a function (specified by ``dataset_name``) which can load the client's local private dataset when it is executing on the client's machine. Server Configurations --------------------- We have provide three sample server configuration files available at ``examples/resources/config_ray`` at the APPFL repository at `here `_. The detailed description of the server configuration file can be found in the `APPFL documentation `_. It should be noted that ``client_configs.comm_configs.ray_configs`` is optional and only needed be set if the user wants to assign a job to randomly, instead to a particular AWS instance, by setting ``assign_random`` as ``True`` (by default it is ``False``). You need to configure the same in ``ray_cluster_config.yaml`` as well. To use AWS S3 for model parameter transmission, add a configuration under ``comm_configs`` as ``s3_configs``. Set ``enable_s3`` to True, and specify the ``s3_bucket`` field with the name of the S3 bucket that you want to use. Additionally, set ``s3_creds_file`` to the path of a CSV file containing AWS credentials. .. code-block:: csv ,, .. note:: The server can also set these information before running the experiment via the ``aws configure`` command. Ray Cluster Configurations -------------------------- Below is the cluster configuration file for running the experiment on AWS cloud environment. .. code-block:: yaml # An unique identifier for the head node and workers of this cluster. cluster_name: appfl-ray # Cloud-provider specific configuration. provider: type: aws region: us-east-1 cache_stopped_nodes: False # if set False terminates the instance when ray down is executed, True: instance stopped not terminated security_group: GroupName: ray_client_security_group IpPermissions: - FromPort: 8265 ToPort: 8265 IpProtocol: TCP IpRanges: # Allow traffic only from your local IP address. - CidrIp: 0.0.0.0/0 # The maximum number of workers nodes to launch in addition to the head node. max_workers: 2 available_node_types: ray.head.default: resources: { } # Provider-specific config for this node type, e.g., instance type. # By default Ray auto-configures unspecified fields such as SubnetId and KeyName. # For more documentation on available fields, see # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances node_config: InstanceType: t3.medium ImageId: 'ami-0dd6adfad4ad37eec' # Deep Learning Base Neuron AMI (Ubuntu 20.04) 20240216 ray.worker.worker_1: # The minimum number of worker nodes of this type to launch. # This number should be >= 0. For FL experiment 1 is sufficient. min_workers: 1 # The maximum number of worker nodes of this type to launch. # This parameter takes precedence over min_workers. For FL experiment 1 is sufficient. max_workers: 1 # Set this to {${client_id} : 1}, client_id from examples/resources/config_ray/mnist/clients.yaml config file # Set it to empty if client task can be assigned randomly to any worker node resources: {Client1: 1} node_config: InstanceType: t3.medium ImageId: 'ami-0dd6adfad4ad37eec' # Deep Learning Base Neuron AMI (Ubuntu 20.04) 20240216 InstanceMarketOptions: MarketType: spot # Configure worker nodes to use Spot Instances SpotOptions: MaxPrice: '0.05' ray.worker.worker_2: min_workers: 1 max_workers: 1 resources: {Client2: 1} node_config: InstanceType: t3.medium ImageId: 'ami-0dd6adfad4ad37eec' # Deep Learning Base Neuron AMI (Ubuntu 20.04) 20240216 InstanceMarketOptions: MarketType: spot # Configure worker nodes to use Spot Instances SpotOptions: MaxPrice: '0.05' file_mounts: { "/home/ubuntu/APPFL": "../../../APPFL", "/home/ubuntu/resources": "../resources", "/home/ubuntu/run.py": "run.py" } setup_commands: ["conda config --remove channels intel", "conda create -n APPFL python=3.10 -y ", 'conda activate APPFL && pip install ray["default"] && pip install confluent-kafka --prefer-binary && cd APPFL && pip install -e ".[examples]"', "(stat $HOME/anaconda3/envs/APPFL/ &> /dev/null && echo 'export PATH=\"$HOME/anaconda3/envs/APPFL/bin:$PATH\"' >> ~/.bashrc) || true"] You can set the desired aws region under ``provider.region`` All the EC2 instance related configuration for head node or worker nodes goes in ``node_config`` which has ``InstanceType``, ``ImageId`` (AMI image id), spot vs on demand etc. For more documentation on available fields, `see `_. For other field description you can follow inline comments in ``examples/ray/ray_cluster_config.yaml``. Further you can check it out `here `_. Running Experiment ------------------ Environment setup ~~~~~~~~~~~~~~~~~ Configure AWS credentials - IAM having ``AmazonEC2FullAccess`` and ``AmazonEC2RoleforSSM`` Cluster Creation ~~~~~~~~~~~~~~~~ Go inside ray example .. code-block:: bash cd examples/ray/ Run below command, which brings up whole cluster that is described in ``examples/ray/ray_cluster_config.yaml``. .. code-block:: bash ray up ray_cluster_config.yaml .. note:: For lower cluster spin up time create a custom AMI image by running setup command on given image id in ray_cluster_config.yaml. After creating custom AMI you can provide it in ray_cluster_config.yaml under ImageId attribute of each node Checking cluster status ~~~~~~~~~~~~~~~~~~~~~~~ **From Local machine** 1. You can check cluster status by running .. code-block:: bash ray exec ray_cluster_config.yaml 'ray status' **From Head Node** 1. Go into head node using .. code-block:: bash ray attach ray_cluster_config.yaml 2. Check cluster status after attaching to head node using .. code-block:: bash ray status Output of ray status would look like below .. code-block:: bash ======== Autoscaler status: 2025-02-25 20:18:02.106153 ======== Node status --------------------------------------------------------------- Active: 1 ray.worker.worker_2 1 ray.head.default 1 ray.worker.worker_1 Pending: (no pending nodes) Recent failures: (no failures) Resources --------------------------------------------------------------- Usage: 0.0/6.0 CPU 0.0/1.0 Client1 0.0/1.0 Client2 0B/7.64GiB memory 0B/3.16GiB object_store_memory Demands: (no resource demands) Job Submission ~~~~~~~~~~~~~~ **From Local machine** 1. Do port forwarding using .. code-block:: bash ray dashboard ray_cluster_config.yaml 2. Now on another terminal you can submit job request using: .. code-block:: bash ray job submit --address http://localhost:8265 -- python APPFL/examples/ray/run.py **From Head Node** 1. Connect to head node .. code-block:: bash ray attach ray_cluster_config.yaml 2. Run job using: .. code-block:: bash python run.py Stopping Cluster ~~~~~~~~~~~~~~~~ 1. To stop cluster run .. code-block:: bash ray down ray_cluster_config.yaml