Start Experiments

This page describes how the FL group (federation) leader can start an FL experiment via the web application.

  1. Log in to the web application by following the instructions.

  2. After signing in, you will be directed to a dashboard page. The dashboard lists your Federations and your Clients. A federation is an FL group that you created: as the group leader, you can start FL experiments and access the experiment results. A client is an FL group of which you are a member. By default, the federation leader is also a client of their own federation.

  3. Click the New Experiment button next to the federation for which you want to start an FL experiment. This will take you to the New Experiment page.

  4. Client Endpoints at the top of the page shows the status of each client's Globus Compute endpoint. Click a status icon to see the details of that endpoint's status. Only clients with active endpoints can join the FL experiment. If a client's endpoint is not active, you can contact the client by clicking the email icon.

  5. For Federation Algorithm, choose the federated learning algorithm you want to use from the list of supported algorithms.

  6. For Experiment Name, provide a name of your choice for this FL experiment.

  7. For Server Training Epochs, enter the number of global aggregation rounds for the FL experiment.

  8. For Client Training Epochs, enter the number of local training epochs each client (site) runs before sending its local model back to the server. The sketch below illustrates how the two epoch settings relate.
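
A minimal sketch, assuming a synchronous federated-averaging-style workflow, of how Server Training Epochs and Client Training Epochs relate; run_federated_training, local_update, and aggregate are hypothetical names for illustration, not functions provided by the platform.
def run_federated_training(global_model, clients, server_epochs, client_epochs):
    # One global aggregation per server training epoch.
    for _ in range(server_epochs):
        # Each client trains locally for client_epochs epochs (hypothetical helper).
        local_models = [
            local_update(global_model, client, client_epochs) for client in clients
        ]
        # The server combines the local models into a new global model (hypothetical helper).
        global_model = aggregate(local_models)
    return global_model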

  9. When Use Differential Privacy is set to True, you also need to specify the Privacy Budget, Clip Value, and Clip Norm. The sketch below illustrates how these parameters are typically used.
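
A minimal sketch of how these differential privacy parameters typically interact (clipping a model update and then adding Laplace noise); privatize_update is a hypothetical name, and the Laplace mechanism shown here is an assumption for illustration that may differ from the mechanism the platform actually uses.
import torch


def privatize_update(update, clip_value, clip_norm, privacy_budget):
    # Clip the update so that its clip_norm-norm is at most clip_value.
    norm = torch.norm(update, p=clip_norm)
    clipped = update * min(1.0, clip_value / (float(norm) + 1e-12))
    # Add Laplace noise; a smaller privacy budget (epsilon) means more noise.
    scale = clip_value / privacy_budget
    noise = torch.distributions.Laplace(0.0, scale).sample(clipped.shape)
    return clipped + noise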

  10. Upload the training model architecture by selecting Custom Model, or choose a custom model via Upload from Github. When you choose Upload from Github, a modal will pop up: first click Authorize with Github to link your Github account, then choose or search for the repository and select the branch and file to upload. For the model, you need to provide a Python script whose last function returns the model. You may define any classes or functions you need in the script, as long as the last function returns the model. Below is an example of a model architecture definition.

An example of a model architecture definition.
import math
import torch
import torch.nn as nn


class CNN(nn.Module):
    def __init__(self, num_channel, num_classes, num_pixel):
        super().__init__()
        self.conv1 = nn.Conv2d(
            num_channel, 32, kernel_size=5, padding=0, stride=1, bias=True
        )
        self.conv2 = nn.Conv2d(32, 64, kernel_size=5, padding=0, stride=1, bias=True)
        self.maxpool = nn.MaxPool2d(kernel_size=(2, 2))
        self.act = nn.ReLU(inplace=True)

        # Compute the spatial size X of the feature map after the two
        # conv (kernel 5, stride 1, no padding) + 2x2 max-pool stages,
        # so the flattened input size of fc1 is 64 * X * X.
        X = num_pixel
        X = math.floor(1 + (X + 2 * 0 - 1 * (5 - 1) - 1) / 1)
        X = X / 2
        X = math.floor(1 + (X + 2 * 0 - 1 * (5 - 1) - 1) / 1)
        X = X / 2
        X = int(X)

        self.fc1 = nn.Linear(64 * X * X, 512)
        self.fc2 = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.act(self.conv1(x))
        x = self.maxpool(x)
        x = self.act(self.conv2(x))
        x = self.maxpool(x)
        x = torch.flatten(x, 1)
        x = self.act(self.fc1(x))
        x = self.fc2(x)
        return x


def get_model():
    return CNN(1, 10, 28)

  11. For Client Training Mode, select either Epoch-wise or Step-wise. In Epoch-wise mode, the client trains the model for the specified number of epochs before sending it back to the server; in Step-wise mode, the client trains for the specified number of steps (batches) instead. The sketch below illustrates the difference.
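
An illustrative sketch (not the platform's actual code) contrasting the two client training modes; run_local_training is a hypothetical name, and train_step stands in for one optimization step on a batch.
def run_local_training(model, dataloader, train_step, mode, client_epochs, client_steps):
    if mode == "Epoch-wise":
        # Full passes over the local dataset.
        for _ in range(client_epochs):
            for batch in dataloader:
                train_step(model, batch)
    else:  # "Step-wise"
        # A fixed number of batches, cycling through the dataset if needed.
        data_iter = iter(dataloader)
        for _ in range(client_steps):
            try:
                batch = next(data_iter)
            except StopIteration:
                data_iter = iter(dataloader)
                batch = next(data_iter)
            train_step(model, batch)
    return model  # sent back to the server after local training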

  12. For Loss File, provide a Python script whose last class definition defines the loss function as a subclass of torch.nn.Module. Below is an example of a loss function definition.

An example of a loss function definition.
import torch.nn as nn


class CELoss(nn.Module):
    """Cross Entroy Loss"""

    def __init__(self):
        super().__init__()
        self.criterion = nn.CrossEntropyLoss(reduction="mean")

    def forward(self, prediction, target):
        target = target if len(target.shape) == 1 else target.squeeze(1)
        return self.criterion(prediction, target)

  13. For Metric File, provide a Python script whose last function definition defines the metric function. Below is an example of a metric function definition.

An example of a metric function definition.
import numpy as np


def accuracy(y_true, y_pred):
    """
    y_true and y_pred are both of type np.ndarray
    y_true (N, d) where N is the size of the validation set, and d is the dimension of the label
    y_pred (N, D) where N is the size of the validation set, and D is the output dimension of the ML model
    """
    if len(y_pred.shape) == 1:
        y_pred = np.round(y_pred)
    else:
        y_pred = y_pred.argmax(axis=1)
    return 100 * np.sum(y_pred == y_true) / y_pred.shape[0]

  14. For Client Optimizer, choose either SGD or Adam, and specify each client's local learning rate in Client Learning Rate. The sketch below shows how these settings typically map to optimizer construction.
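
A minimal sketch, assuming the selected Client Optimizer and Client Learning Rate map onto the corresponding torch.optim constructors on each client; make_client_optimizer is a hypothetical name for illustration.
import torch


def make_client_optimizer(model, optimizer_name, learning_rate):
    # Build the local optimizer from the experiment configuration.
    if optimizer_name == "SGD":
        return torch.optim.SGD(model.parameters(), lr=learning_rate)
    if optimizer_name == "Adam":
        return torch.optim.Adam(model.parameters(), lr=learning_rate)
    raise ValueError(f"Unsupported optimizer: {optimizer_name}")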

  15. For Client Weights, Proportional to Sample Size weights each client's local model in the global aggregation in proportion to that client's sample size, while Equal for All Clients applies the same weight to every client's local model. See the worked example below.
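
A worked example of the two weighting schemes, assuming each client reports its local sample count to the server; aggregation_weights is a hypothetical name for illustration.
def aggregation_weights(sample_sizes, proportional):
    # Proportional to Sample Size: weight each client by its share of the data.
    if proportional:
        total = sum(sample_sizes)
        return [n / total for n in sample_sizes]
    # Equal for All Clients: every client gets the same weight.
    return [1.0 / len(sample_sizes)] * len(sample_sizes)


# Three clients with 100, 300, and 600 local samples:
# proportional weights -> [0.1, 0.3, 0.6]; equal weights -> [1/3, 1/3, 1/3].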

  16. After choosing all configurations and hyperparameters for the FL experiment, click Start to begin the experiment. The web application will then launch an orchestration server that trains a federated learning model in collaboration with all active client endpoints.