Example: Run FL Experiment using Globus Compute¶

This tutorial describes how to run federated learning experiments on APPFL using Globus Compute as the communication backend. All the code snippets needed for this tutorial are available in the examples directory of the APPFL repository.
Note
For more detailed information about Globus Compute, please refer to the Globus Compute documentation.
Installation¶
First, both the clients and the server need to install the APPFL package on their local machines. The following commands install the APPFL package from its source code. For more information, please refer to the APPFL documentation.
git clone --single-branch --branch main https://github.com/APPFL/APPFL.git
cd APPFL
conda create -n appfl python=3.10 -y
conda activate appfl
pip install -e ".[examples]"
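To verify the installation, you can check that the appfl package is available in the new environment, for example:
pip show appfl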
Creating Globus Compute Endpoint on Client Machines¶
Users can create a Globus Compute endpoint on their local machines with the following command:
globus-compute-endpoint configure appfl-endpoint
Note
You can replace appfl-endpoint with any endpoint name you like.
Then you will be asked to configure the endpoint. If you are using your local computer, you can use the default configuration. If you are using an HPC or cloud machine, you need to modify the configuration file at ~/.globus-compute/appfl-endpoint/config.yaml. Below, we show a sample configuration file for Polaris:
engine:
  address:
    ifname: bond0
    type: address_by_interface
  max_workers_per_node: 1
  provider:
    account: <your_polaris_account> # Replace with your account
    cpus_per_node: 32
    init_blocks: 0
    max_blocks: 1
    min_blocks: 0
    nodes_per_block: 1
    queue: debug # Replace with other queue if needed
    scheduler_options: '#PBS -l filesystems=home:eagle:grand'
    select_options: ngpus=4
    type: PBSProProvider
    walltime: 00:30:00
    worker_init: module use /soft/modulefiles; module load conda; conda activate <your_conda_env>;
  strategy:
    max_idletime: 3600
    type: SimpleStrategy
  type: HighThroughputEngine
Note
It is recommended to set max_idletime (in seconds) to a large value to avoid the endpoint being shut down by the Globus Compute service when there is no task running.
After the configuration, you can start the endpoint with the following command:
globus-compute-endpoint start appfl-endpoint
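Once the endpoint is running, you can list your endpoints and check their status; the endpoint ID shown for appfl-endpoint is the endpoint_id that the server needs in the client configuration below.
globus-compute-endpoint list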
Client Configurations¶
The server needs to collect certain information from the clients to run the federated learning experiment. Below is an example of a client configuration file, which is available at examples/resources/configs_gc/clients.yaml in the APPFL repository.
clients:
  - endpoint_id: "ed4a1881-120e-4f67-88d7-876cd280feef"
    client_id: "Client1"
    train_configs:
      # Device [Optional]: default is "cpu"
      device: "cpu"
      # Logging and outputs [Optional]
      logging_output_dirname: "./output"
      logging_output_filename: "result"
    # Local dataset
    data_configs:
      dataset_path: "./resources/dataset/mnist_dataset.py"
      dataset_name: "get_mnist"
      dataset_kwargs:
        num_clients: 2
        client_id: 0
        partition_strategy: "class_noniid"
        visualization: False

  - endpoint_id: "762629a0-f3b3-44b5-9acf-2f9b0ab9310f"
    client_id: "Client2"
    train_configs:
      # Device [Optional]: default is "cpu"
      device: "cpu"
      # Logging and outputs [Optional]
      logging_output_dirname: "./output"
      logging_output_filename: "result"
    # Local dataset
    data_configs:
      dataset_path: "./resources/dataset/mnist_dataset.py"
      dataset_name: "get_mnist"
      dataset_kwargs:
        num_clients: 2
        client_id: 1
        partition_strategy: "class_noniid"
        visualization: False
It should be noted that the client configuration file actually resides on the server machine, and its contents are shared by the clients. Specifically, the client configuration file has three main parts:
endpoint_id: the Globus Compute endpoint ID of the client machine.
train_configs: the training configurations for the client, including the device to run the training on, logging configurations, etc.
data_configs: the information about a dataloader Python file defined by the clients and shared with the server (located at dataset_path on the server machine). The dataloader file should contain a function (specified by dataset_name) that loads the client's local private dataset when it is executed on the client's machine; a minimal sketch of such a function follows the note below.
Note
When the data loader function is executed on the client's machine, its default working directory is ~/.globus-compute/appfl-endpoint/tasks_working_dir.
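For reference, below is a minimal sketch of such a dataloader function. It is not the get_mnist implementation shipped with the repository; it is a hypothetical example that only illustrates the assumed interface, namely that the function receives the dataset_kwargs from the client configuration and returns the client's local training and validation datasets as PyTorch Dataset objects.
# Hypothetical dataloader sketch (not the repository's get_mnist). It assumes the
# function receives the dataset_kwargs from the client configuration and returns
# the local training and validation datasets as PyTorch Dataset objects.
import torch
from torch.utils.data import Dataset

class RandomLocalDataset(Dataset):
    """Stand-in for a client's private dataset."""
    def __init__(self, num_samples=100):
        self.x = torch.randn(num_samples, 1, 28, 28)
        self.y = torch.randint(0, 10, (num_samples,))

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

def get_dataset(num_clients=2, client_id=0, **kwargs):
    """Load the local dataset of `client_id` out of `num_clients` clients.

    This runs on the client's machine, so it may read files that only exist there.
    """
    train_dataset = RandomLocalDataset(100)
    val_dataset = RandomLocalDataset(20)
    return train_dataset, val_dataset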
Server Configurations¶
We provide three sample server configuration files at examples/resources/config_gc in the APPFL repository. A detailed description of the server configuration file can be found in the APPFL documentation.
It should be noted that client_configs.comm_configs.globus_compute_configs is optional and should be set only if the user wants to use AWS S3 for data transmission (Globus Compute limits data transmission size to 10 MB, so models larger than 10 MB should be transmitted using AWS S3). Specifically, the s3_bucket field should be set to the name of the S3 bucket that the user wants to use, and s3_creds_file is a CSV file containing the AWS credentials. The CSV file should have the following format:
<region>,<access_key_id>,<secret_access_key>
Note
The server can also set this information before running the experiment via the aws configure command.
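Put together, the optional block in the server configuration file might look like the following sketch; the bucket name and credential file path are placeholders.
client_configs:
  ...
  comm_configs:
    globus_compute_configs:
      s3_bucket: "my-appfl-bucket"  # placeholder: name of your S3 bucket
      s3_creds_file: "./resources/s3_creds.csv"  # placeholder: CSV with <region>,<access_key_id>,<secret_access_key>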
Running the Experiment¶
We provide a sample experiment-launching script at examples/globus_compute/run.py, and users can run the experiment with the following command:
python globus_compute/run.py
Users can take this script as a reference and starting point for running their own federated learning experiments using Globus Compute as the communication backend.
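The script also accepts paths to the server and client configuration files (the same flags are used for the ProxyStore example later in this tutorial), so you can point it to your own configurations, for example:
python globus_compute/run.py \
    --server_config <path_to_server_config>.yaml \
    --client_config <path_to_client_config>.yaml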
Extra: Integration with ProxyStore¶

Prepare the ProxyStore Endpoint¶
As Globus Compute limits the size of function inputs and outputs to a few megabytes, it is not suitable for transmitting large models. To address this issue, users can integrate Globus Compute with ProxyStore, which facilitates efficient data flow in distributed computing applications.
By default, a ProxyStore endpoint connects to ProxyStore’s cloud-hosted relay server, which uses Globus Auth for identity and access management. To use the provided relay server, users need to do a one-time-per-system authentication using the following command:
proxystore-globus-auth login
Users can then create an endpoint using the following command:
$ proxystore-endpoint configure my-endpoint # you can replace my-endpoint with any name you like
INFO: Configured endpoint: my-endpoint <a6c7f036-3e29-4a7a-bf90-5a5f21056e39>
INFO: Config and log file directory: ~/.local/share/proxystore/my-endpoint
INFO: Start the endpoint with:
INFO: $ proxystore-endpoint start my-endpoint
Note
Users can edit the endpoint configuration at ~/.local/share/proxystore/my-endpoint/config.toml to change the maximum object size or to use their own relay server.
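As a rough illustration only (the field names are inferred from the endpoint configuration echoed in the logs shown later in this tutorial; consult the ProxyStore documentation for the authoritative schema), the relevant entries in config.toml look roughly like:
port = 8765  # port the endpoint listens on

[relay]
address = "wss://relay.proxystore.dev"  # replace to use your own relay server

[storage]
max_object_size = 100000000  # maximum object size in bytes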
After creating the endpoint and finishing the configuration (if needed), users can start the endpoint with the following command:
proxystore-endpoint start my-endpoint
Note
For debugging the endpoint, users can refer to the official ProxyStore documentation.
Configure for Federated Learning¶
With ProxyStore endpoints installed on the clients and the server that want to use ProxyStore to transfer model parameters, users need to collect all endpoint IDs and put them in both the server and client configuration files under comm_configs.proxystore_configs. Note that you only need to specify this configuration for the sites that should use ProxyStore to transfer model parameters, although in most cases you will want to use it for all sites.
Below shows how to configure the server configuration file. A full sample configuration file is available at examples/resources/configs_gc/server_fedavg_proxystore.yaml in the APPFL repository.
client_configs:
  ... # general client configurations
server_configs:
  ...
  comm_configs:
    proxystore_configs:
      enable_proxystore: True
      connector_type: "EndpointConnector"
      connector_configs:
        endpoints: ["endpoint_id_1", "endpoint_id_2", ...] # List of all endpoint ids for server and clients
Below shows how to configure the client configuration file. A full sample configuration file is available at examples/resources/configs_gc/clients_proxystore.yaml in the APPFL repository.
clients:
  - endpoint_id: ...
    ...
    comm_configs:
      proxystore_configs:
        enable_proxystore: True
        connector_type: "EndpointConnector"
        connector_configs:
          endpoints: ["endpoint_id_1", "endpoint_id_2", ...] # List of all endpoint ids for server and clients
  - endpoint_id: ...
    ...
    comm_configs:
      proxystore_configs:
        enable_proxystore: True
        connector_type: "EndpointConnector"
        connector_configs:
          endpoints: ["endpoint_id_1", "endpoint_id_2", ...] # List of all endpoint ids for server and clients
Running the Experiment¶
After configuring the server and client configuration files, users can run the federated learning experiment using the same script as before by providing the paths to the new configuration files:
python globus_compute/run.py \
    --server_config ./resources/config_gc/mnist/server_fedavg_proxystore.yaml \
    --client_config ./resources/config_gc/mnist/clients_proxystore.yaml
Extra: Integration with ProxyStore on Polaris¶

In this section, we show how to launch a Globus Compute endpoint on ALCF's Polaris supercomputer and use a ProxyStore endpoint to transfer model parameters between the server and clients.
Prepare the ProxyStore Endpoint¶
One of the trickiest parts of Polaris is that its compute nodes do not have internet access, with the exception of HTTP, HTTPS, and FTP through a proxy server. Therefore, users have to start their ProxyStore endpoint on a login node, which has internet access. The started endpoint listens on http://<login_node_id>:<port> and acts as a proxy for data transmission traffic between the compute nodes and the ProxyStore relay server. When you start the endpoint with the command proxystore-endpoint start <endpoint_name>, the endpoint log at ~/.local/share/proxystore/<endpoint_name>/log.txt should look something like the following:
[2025-01-30 23:43:08.113] INFO (proxystore.endpoint.serve) :: Installing uvloop as default event loop
[2025-01-30 23:43:08.125] WARNING (proxystore.endpoint.serve) :: Database path not provided. Data will not be persisted
[2025-01-30 23:43:08.125] INFO (proxystore.endpoint.serve) :: Using native app Globus Auth client
[2025-01-30 23:43:08.126] INFO (globus_sdk.client) :: Creating client of type <class 'globus_sdk.services.auth.client.native_client.NativeAppAuthClient'> for service "auth"
[2025-01-30 23:43:08.127] INFO (globus_sdk.services.auth.client.base_login_client) :: Finished initializing AuthLoginClient. client_id='a3379dba-a492-459a-a8df-5e7676a0472f', type(authorizer)=<class 'globus_sdk.authorizers.base.NullAuthorizer'>
[2025-01-30 23:43:08.188] INFO (globus_sdk.authorizers.refresh_token) :: Setting up RefreshTokenAuthorizer with auth_client=[instance:139892558440592]
[2025-01-30 23:43:08.188] INFO (globus_sdk.authorizers.renewing) :: Setting up a RenewingAuthorizer. It will use an auth type of Bearer and can handle 401s.
[2025-01-30 23:43:08.188] INFO (globus_sdk.authorizers.renewing) :: RenewingAuthorizer will start by using access_token with hash "f41c966eeea9ab06d4c69aa4e0219efebe70e2f3e85fd41005ee4e954ec877fd"
[2025-01-30 23:43:08.223] INFO (proxystore.p2p.nat) :: Checking NAT type. This may take a moment...
[2025-01-30 23:43:08.249] INFO (proxystore.p2p.nat) :: NAT Type: Full-cone NAT
[2025-01-30 23:43:08.249] INFO (proxystore.p2p.nat) :: External IP: 140.221.112.14
[2025-01-30 23:43:08.249] INFO (proxystore.p2p.nat) :: External Port: 54320
[2025-01-30 23:43:08.250] INFO (proxystore.p2p.nat) :: NAT traversal for peer-to-peer methods (e.g., hole-punching) is likely to work. (NAT traversal does not work reliably across symmetric NATs or poorly behaved legacy NATs.)
[2025-01-30 23:43:08.540] INFO (proxystore.p2p.relay.client) :: Established client connection to relay server at wss://relay.proxystore.dev with client uuid=b6cfb02b-323f-4eac-8c42-20102bb0bd26 and name=my-endpoint
[2025-01-30 23:43:08.541] INFO (proxystore.endpoint.endpoint) :: Endpoint[my-endpoint(b6cfb02b)]: initialized endpoint operating in PEERING mode
[2025-01-30 23:43:08.545] INFO (proxystore.endpoint.serve) :: Serving endpoint b6cfb02b-323f-4eac-8c42-20102bb0bd26 (my-endpoint) on 10.201.0.56:8767
[2025-01-30 23:43:08.545] INFO (proxystore.endpoint.serve) :: Config: name='my-endpoint' uuid='b6cfb02b-323f-4eac-8c42-20102bb0bd26' port=8767 host='10.201.0.56' relay=EndpointRelayConfig(address='wss://relay.proxystore.dev', auth=EndpointRelayAuthConfig(method='globus', kwargs={}), peer_channels=1, verify_certificate=True) storage=EndpointStorageConfig(database_path=None, max_object_size=100000000)
[2025-01-30 23:43:08.909] INFO (uvicorn.error) :: Started server process [909609]
[2025-01-30 23:43:08.909] INFO (uvicorn.error) :: Waiting for application startup.
[2025-01-30 23:43:08.909] INFO (proxystore.p2p.manager) :: PeerManager[my-endpoint(b6cfb02b)]: listening for messages from relay server
[2025-01-30 23:43:08.909] INFO (proxystore.endpoint.endpoint) :: Endpoint[my-endpoint(b6cfb02b)]: listening for peer requests
[2025-01-30 23:43:08.910] INFO (uvicorn.error) :: Application startup complete.
[2025-01-30 23:43:08.910] INFO (uvicorn.error) :: Uvicorn running on http://10.201.0.56:8767 (Press CTRL+C to quit)
Note
It is important to make sure that the endpoint has actually started by checking its log. For example, the port your endpoint is listening on might already be in use, which causes an error like: [Errno 98] error while attempting to bind on address ('10.201.0.56', 8765): address already in use.
Prepare the Globus Compute Endpoint¶
After starting the ProxyStore endpoint on a Polaris login node, users can create a Globus Compute endpoint with the following configuration. It should be noted that, compared with the configuration above, we specifically unset the http_proxy/HTTP_PROXY environment variables so that the compute nodes can access the ProxyStore endpoint on the login node.
engine:
  address:
    ifname: bond0
    type: address_by_interface
  max_workers_per_node: 1
  provider:
    account: <your_polaris_account> # Replace with your account
    cpus_per_node: 32
    init_blocks: 0
    max_blocks: 1
    min_blocks: 0
    nodes_per_block: 1
    queue: debug # Replace with other queue if needed
    scheduler_options: '#PBS -l filesystems=home:eagle:grand'
    select_options: ngpus=4
    type: PBSProProvider
    walltime: 00:30:00
    worker_init: module use /soft/modulefiles; module load conda; conda activate <your_conda_env>; export HTTP_PROXY=""; export http_proxy="";
  strategy:
    max_idletime: 3600
    type: SimpleStrategy
  type: HighThroughputEngine
After the configuration, users can start the Globus Compute endpoint and configure the FL experiments as described in the previous sections.
Additional Debugging Tips¶
Test Local ProxyStore Endpoint:
To test whether your local ProxyStore endpoint (e.g., my-endpoint) is working, you can use the following command to check whether a random object exists in the endpoint store; it is expected to return False.
$ proxystore-endpoint test my-endpoint exists abcdef
# Expected output
INFO: Object exists: False
Test Remote ProxyStore Endpoint:
Suppose you have an endpoint running on system A with UUID aaaa0259-5a8c-454b-b17d-61f010d874d4 and name endpoint-a, and another on system B with UUID bbbbab4d-c73a-44ee-a316-58ec8857e83a and name endpoint-b. To test the peer connection between the two endpoints from system A, you can request the endpoint on system A to invoke an exists operation on the endpoint on system B via the following command:
$ proxystore-endpoint test --remote bbbbab4d-c73a-44ee-a316-58ec8857e83a endpoint-a exists abcdef
# Expected output
INFO: Object exists: False
Note
For more detailed endpoint debugging tips, we refer users to the official ProxyStore documentation.