
Commit 7a8433a

docs: three types of TAG and their examples (#140)
I added docs with examples for the three types of TAG (hierarchical FL, distributed training, and parallel experiments) under their respective example folders. I also fixed some typos in the current docs and added the corresponding TAG doc to the main docs folder.
1 parent fbd9969 commit 7a8433a

File tree

19 files changed: +491 −187 lines


docs/01-introduction.md

Lines changed: 1 addition & 1 deletion
@@ -81,7 +81,7 @@ The non-ochestration mode is useful in one of the following situations:
 * when the geo-distributed clusters are not under the management of one organization
 * when participants of a FL job want to have a control over when to join or leave the job

-In non-ochestration mode, the fleddge system is only responsible for managing (i.e., (de)allocation) non-data comsuming workers (e.g., model aggregating workers).
+In non-ochestration mode, the fleddge system is only responsible for managing (i.e., (de)allocation) non-data consuming workers (e.g., model aggregating workers).
 The system supports a hybrid mode where some are managed workers and others are non-managed workers.

 Note that the flame system is in active development and not all the functionalities are supported yet.

docs/02-getting-started.md

Lines changed: 0 additions & 2 deletions
@@ -23,15 +23,13 @@ pyenv version
 eval "$(pyenv init -)"
 echo -e '\nif command -v pyenv 1>/dev/null 2>&1; then\n eval "$(pyenv init -)"\nfi' >> ~/.bash_profile
 ```
-
 The following shows how to install the above packages in Ubuntu.
 ```bash
 sudo apt install golang
 sudo snap install golangci-lint
 pyenv install 3.9.6
 pyenv global 3.9.6
 pyenv version
-
 eval "$(pyenv init -)"
 echo -e '\nif command -v pyenv 1>/dev/null 2>&1; then\n eval "$(pyenv init -)"\nfi' >> ~/.bash_profile
 ```

docs/03-fiab.md

Lines changed: 5 additions & 5 deletions
@@ -7,7 +7,7 @@ The flame system consists of four components: apiserver, controller, notifier an
 It also includes mongodb as backend state store.
 This development environment is mainly tested under MacOS.
 This guideline is primarily based on MacOS.
-However, this dev environment doesn't work under latest Apple machines with M1 chip set.
+However, this dev environment doesn't work under latest Apple machines with M1 chip set because hyperkit is not yet supported for M1 Mac.
 The fiab is also tested under Archlinux. Hence, it may work on other Linux distributions such as Ubuntu.

 The `flame/fiab` folder contains several scripts to configure and set up the fiab environment.

@@ -21,15 +21,15 @@ fiab relies on `minikube`, `kubectl`, `helm`, `docker` and `jq`.

 fiab doesn't support docker driver (hence, docker desktop). fiab uses ingress and ingress-dns addons in minikube.
 When docker driver is chosen, these two addons are only supported on Linux (see [here](https://minikube.sigs.k8s.io/docs/drivers/docker/)
-and [here](https://github.com/kubernetes/minikube/issues/7332)). Note that while the issue 7332 is not closed and appears to be fixed,
-ingress and ingress-dns still doesn't work under fiab environment on MacOs.
+and [here](https://github.com/kubernetes/minikube/issues/7332)). Note that while the issue 7332 is now closed and appears to be fixed,
+ingress and ingress-dns still don't work under fiab environment on MacOs.
 In addition, note that the docker subscription service agreement has been updated for `docker desktop`.
 Hence, `docker desktop` may not be free. Please check out the [agreement](https://www.docker.com/products/docker-desktop).

 Hence, fiab uses `hyperkit` as its default vm driver. Using `hyperkit` has some drawbacks.
 First, as of May 21, 2022, `hyperkit` driver doesn't support M1 chipset.
 Second, the `hyperkit` driver doesn't work with dnscrypt-proxy or dnsmasq well.
-Thus, if dnscrypt-proxy or dnsmasq is installed in the system, see [here](#Fixing docker build error) for details and a workaround.
+Thus, if dnscrypt-proxy or dnsmasq is installed in the system, see [here](#fixing-docker-build-error) for details and a workaround.

 Note that other drivers such as `virtualbox` are not tested.

@@ -80,7 +80,7 @@ Next, `ingress` and `ingress` addons need to be installed with the following com
 minikube addons enable ingress
 minikube addons enable ingress-dns
 ```
-When `hyperkit` driver is in use, enabling `ingress` addon may fail due to the same issue shown in [here](#Fixing docker build error),
+When `hyperkit` driver is in use, enabling `ingress` addon may fail due to the same issue shown in [here](#fixing-docker-build-error),
 which explains a workaround. Once the workload is applied, come back here and rerun these commands.


docs/04-examples.md

Lines changed: 8 additions & 51 deletions
@@ -1,6 +1,6 @@
 # Examples

-This section currently presents one example: FL training for MNIST. More examples will follow in the future.
+This section currently presents one example: FL training for MNIST. More examples will follow in the future; you can find instructions in the README file of each example's folder.

 ## MNIST

@@ -82,7 +82,7 @@ flamectl get jobs
 ```

 ### Step 7: start a job
-Before staring your job, you can always use `flamectl get` to check each step is set up corretly. For more info, check
+Before starting your job, you can always use `flamectl get` to check that each step is set up correctly. For more info, check
 ```bash
 flamectl get --help
 ```

@@ -157,52 +157,9 @@ The log for a task is similar to `task-61bd2da4dcaed8024865247e.log` under `/var
 As an alternative, one can check the progress at MLflow UI in the fiab setup.
 Open a browser and go to http://mlflow.flame.test.

-## Hierarchical MNIST
-Likewise, the hierarchical FL example follows the same fashion.
-
-Navigate to `./examples/hier_mnist`
-
-### Step 1:
-```bash
-flamectl create design hier_mnist -d "hier_mnist example"
-```
-### Step 2:
-```bash
-flamectl create schema schema.json --design hier_mnist
-```
-The schema defines the topology of this FL job. For more info, please refer to [05-flame-basics](05-flame-basics.md).
-### Step 3:
-```bash
-flamectl create code hier_mnist.zip --design hier_mnist
-```
-The zip file should contain code of every code specified in the schema.
-
-### Step 4:
-```bash
-flamectl create dataset dataset_eu_germany.json
-flamectl create dataset dataset_eu_uk.json
-flamectl create dataset dataset_na_canada.json
-flamectl create dataset dataset_na_us.json
-```
-Flame will assign a trainer to each dataset. As each dataset has a `realm` specified, the middle aggreagator will be created based on the corresponding `groupby` tag. In this case, there will be one middle aggregator for Europe (eu) and one for North America (na).
-
-### Step 5:
-Put all four dataset IDs into `job.json`, and change training hyperparameters as you like.
-```json
-"fromSystem": [
-    "62439c3725fe244585396ad7",
-    "6243a10c25fe244585396af0",
-    "6243a13625fe244585396af2",
-    "6243a14525fe244585396af3"
-]
-```
-
-### Step 6:
-```bash
-flamectl create job job.json
-```
-
-### Step 7:
-```bash
-flamectl start job ${Job ID}
-```
+For other examples, please visit their particular example directories:
+- [Medical Image Multi-class Classification with PyTorch](../examples/medmnist/README.md)
+- [Binary Income Classification with Tabular Dataset](../examples/adult/README.md)
+- [Toy Example of Hierarchical FL](../examples/hier_mnist/README.md)
+- [Toy Example of Parallel Experiments](../examples/parallel_experiment/README.md)
+- [Toy Example of Distributed Training](../examples/distributed_training/README.md)

docs/05-flame-basics.md

Lines changed: 99 additions & 12 deletions
@@ -18,13 +18,13 @@ The key benefits of the abstraction are:
 Depending on the availability of different communication infrastructures and security policies,
 a workload can be easily changed from one communication technology to another.

-**High extensibility**: TAG makes it easy to support a variety of different topologies. Therefore, it can potentially support many different usecases easily.
+**High extensibility**: TAG makes it easy to support a variety of different topologies. Therefore, it can potentially support many different use cases easily.

 <p align="center"><img src="images/role_channel.png" alt="role and channel" /></p>


 Now let us describe how TAG is enabled. TAG is comprised of two basic and yet simple building blocks: *role* and *channel*.
-A *role* represents a vertex in TAG and should be associated with some hevaviors.
+A *role* represents a vertex in TAG and should be associated with some behaviors.
 To create an association between a role and its behavior, (python) code must be attached to the role.
 Once the association is done, a role is fully *defined*.


@@ -45,7 +45,7 @@ A channel also has two attributes: *groupBy* and *funcTags*.

 **groupBy**: This attribute is used to group roles of the channel based on a tag.
 Therefore, the groupBy attribute allows building a hierarchical topology (e.g., a single-rooted multi-level tree), for instance, based on geographical location tags (e.g., us, uk, fr, etc).
-Currently a string-based tag is supported. Future extensions may include more dynamic grouping based on dynamic metrics such as latency, data (dis)simiarlity, and so on.
+Currently a string-based tag is supported. Future extensions may include more dynamic grouping based on dynamic metrics such as latency, data (dis)similarity, and so on.

 **funcTags**: This attribute (discussed later in detail) contains what actions a role would take on the channel.
 As mentioned earlier, a role is associated with executable code.

@@ -54,13 +54,13 @@ We will discuss how to use funcTags correctly in the later part.

 ### TAG Example 1: Two-Tier Topology
 In flame, a topology is expressed within a concept called *schema*.
-A schema is a resuable component as a template.
+A schema is a reusable component as a template.
 The following presents a simple two-tier cross-device topology.

 ```json
 {
     "name": "A sample schema",
-    "description": "a sample schema to demostrate a TAG layout",
+    "description": "a sample schema to demonstrate a TAG layout",
     "roles": [
         {
             "name": "trainer",

@@ -102,15 +102,15 @@ When datasets are selected (more details [here (not yet updated)]()), each datas
 Therefore, in the flame system, **the number of datasets will drive the number of data-consuming workers** (e.g., trainer in this case).
 Subsequently, the number of non data-consuming workers is derived from the entries in the *groupBy* feature (more on [later]()).

-Now let's look at channels. Channels are expressed as a list. A channel consits of four key attributes: *name*, *pair*, *groupBy* and *funcTags*.
+Now let's look at channels. Channels are expressed as a list. A channel consists of four key attributes: *name*, *pair*, *groupBy* and *funcTags*.
 The *name* attribute is used to uniquely identify a channel.
 The *pair* attribute contains two roles that constitute the channel; each role takes one end of the channel.
 For correctness, roles in the pair must exist in the role list.

 The *groupBy* attribute specifies how to group or cluster workers of the two ends (or roles) in the channel. It's optional.
 If this attribute is not defined, workers belonging to the channel are grouped into a default group.

-With *pair* and *groupBy*, a channel only specifies what roles consititue a channel and how they are grouped.
+With *pair* and *groupBy*, a channel only specifies what roles constitute a channel and how they are grouped.
 But it doesn't know what actions each role takes on the channel. The *funcTags* attribute allows *dynamic* binding of functions to a channel.
 The software code attached to a role must define a set of functions that it wants to expose to users
 so that the users can specify it in the schema. Therefore, it allows more complex operations on a channel.

@@ -125,7 +125,7 @@ def get_func_tags(cls) -> list[str]:
 ```

 Note that keys used in *funcTags* (e.g., "trainer" or "aggregator") do not have a direct relation to classes
-such as Aggregtor or Trainer in the `lib/python/flame/mode/horizontal/`. Those keys are only meaningful in the schema.
+such as Aggregator or Trainer in `lib/python/flame/mode/horizontal/`. Those keys are only meaningful in the schema.
 And *funcTags* is updated at the time when code is associated with a role in the schema.

 With the above configuration, the deployed topology looks as follows.
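To make the funcTags binding concrete, here is a minimal sketch of how a role's code might expose and implement its function tags. Only `get_func_tags` mirrors the snippet above; the class name, tag names, and method bodies are hypothetical illustrations rather than flame's actual trainer API.

```python
# Hedged sketch: a role's code advertises its exposed functions via
# get_func_tags, and a schema can then bind those tags to a channel's
# funcTags entry. Names below are hypothetical, not flame's real API.


class Trainer:
    @classmethod
    def get_func_tags(cls) -> list[str]:
        # tags a schema may reference in a channel's "funcTags" section
        return ["fetch", "upload"]

    def fetch(self) -> None:
        """Receive the latest global model over the channel (illustrative stub)."""

    def upload(self) -> None:
        """Send locally trained weights back over the channel (illustrative stub)."""
```

Because the binding is declared in the schema, the same role code can take different actions on different channels simply by listing different tags.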
@@ -139,7 +139,7 @@ The hierarchical topology is very similar to the simple two-tier topology except
 ```json
 {
     "name": "A simple example schema v1.0.1",
-    "description": "a sample schema to demostrate the hierarchical FL setting",
+    "description": "a sample schema to demonstrate the hierarchical FL setting",
     "roles": [
         {
             "name": "trainer",
@@ -210,13 +210,100 @@ The above example uses "us", "europe" and "asia" as labels and is visualized as

 <p align="center"><img src="images/hierarchical_topo.png" alt="Hierarchical topology" width="600px" /></p>

-### How to move from 2-tier to Hierarchical Topology
-From 2-tier to hierarchical (e.g., 3-tier), you need to have one more role in between top aggreagator and trainer, so you add middle aggreagator into the topology (i.e., schema), which also require you to define new channels connecting between each two roles. In order for the hierarchical concept to work, the `groupBy` of upstream channel shouldn't be more specific than the downstream channel.
+#### How to move from 2-tier to hierarchical topology
+To go from 2-tier to hierarchical (e.g., 3-tier), you need one more role between the top aggregator and the trainer, so you add a middle aggregator into the topology (i.e., schema), which also requires you to define new channels connecting each pair of roles. In order for the hierarchical concept to work, the `groupBy` of the upstream channel shouldn't be more specific than that of the downstream channel.
 Likewise, when you want to expand to a 4-tier topology, you will need a new channel definition connecting the two middle aggregators.

 However, it is still unclear how workers are grouped together at run time.
 A brief answer is as follows: in the flame system, before workers are created, they are configured with an attribute called *realm*.
 This attribute is a logical hierarchical value which is similar to a directory-like structure in a file system.
 It basically dictates where workers should be created and to which path the workers belong in the logical hierarchy.
 Given this hierarchical information, users can judiciously choose grouping labels.
-Further discussion is available [here (not yet updated)]().
+
+### TAG Example 3: Parallel Experiments
+The flame system allows multiple identical TAGs to run in parallel based on the `groupBy` tag, for example allowing a 2-tier FL task to run in parallel for 3 geographical regions simultaneously (see image below).
+
+<p align="center"><img src="images/parallel_exps.png" alt="Parallel Experiments" width="600px" /></p>
+
+```json
+{
+    "name": "A sample schema",
+    "description": "a sample schema to demonstrate the parallel experiment setting",
+    "roles": [
+        {
+            "name": "trainer",
+            "description": "It consumes the data and trains local model",
+            "isDataConsumer": true
+        },
+        {
+            "name": "aggregator",
+            "description": "It aggregates the updates from trainers"
+        }
+    ],
+    "channels": [
+        {
+            "name": "param-channel",
+            "description": "Model update is sent from trainer to aggregator and vice-versa",
+            "pair": [
+                "trainer",
+                "aggregator"
+            ],
+            "groupBy": {
+                "type": "tag",
+                "value": [
+                    "default/us",
+                    "default/eu",
+                    "default/asia"
+                ]
+            },
+            "funcTags": {
+                "trainer": ["fetch", "upload"],
+                "aggregator": ["distribute", "aggregate"]
+            }
+        }
+    ]
+}
+```
+
+This topology is the same as the 2-tier one except that there are additional *value* entries in the *groupBy* tag.
+
+### TAG Example 4: Distributed Learning
+The flame system allows distributed training besides federated learning. In TAG, this is expressed by creating a self-loop (see image below) that enables channel communication between trainers, so that algorithms such as ring all-reduce can be used to train the model using multiple trainers.
+
+<p align="center"><img src="images/topologies.png" alt="Four Topologies" width="600px" /></p>
+
+```json
+{
+    "name": "A sample schema",
+    "description": "a sample schema to demonstrate the distributed training setting",
+    "roles": [
+        {
+            "name": "trainer",
+            "description": "It consumes the data and trains local model",
+            "isDataConsumer": true
+        }
+    ],
+    "channels": [
+        {
+            "name": "param-channel",
+            "description": "Model update is sent from trainer to other trainers",
+            "pair": [
+                "trainer",
+                "trainer"
+            ],
+            "groupBy": {
+                "type": "tag",
+                "value": [
+                    "default/us"
+                ]
+            },
+            "funcTags": {
+                "trainer": ["ring_allreduce"]
+            }
+        }
+    ]
+}
+```
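Since the schema above only names `ring_allreduce` as a func tag, here is a small, self-contained sketch of the ring all-reduce idea itself in plain Python (no flame APIs): each trainer's vector is split into chunks that travel around the ring, first accumulating partial sums (reduce-scatter) and then circulating the finished chunks (all-gather). This models the arithmetic only; a real implementation runs the steps concurrently across trainers.

```python
def ring_allreduce(vectors: list[list[float]]) -> None:
    """In-place simulation: every trainer ends up with the element-wise sum."""
    n, dim = len(vectors), len(vectors[0])
    # split indices [0, dim) into n contiguous chunks
    bounds = [i * dim // n for i in range(n + 1)]

    def chunk(c: int) -> range:
        c %= n
        return range(bounds[c], bounds[c + 1])

    # phase 1 (reduce-scatter): in each step, node i adds its chunk (i - step)
    # into its right-hand neighbor; after n - 1 steps each node owns one
    # fully summed chunk.
    for step in range(n - 1):
        for i in range(n):
            nxt = (i + 1) % n
            for k in chunk(i - step):
                vectors[nxt][k] += vectors[i][k]

    # phase 2 (all-gather): the finished chunks circulate around the ring,
    # overwriting stale values, until every node holds the full sum.
    for step in range(n - 1):
        for i in range(n):
            nxt = (i + 1) % n
            for k in chunk(i + 1 - step):
                vectors[nxt][k] = vectors[i][k]


weights = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
ring_allreduce(weights)
print(weights)  # each trainer now holds [12.0, 15.0, 18.0]; divide by 3 to average
```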
+
+### TAG Example 5: Hybrid Model (TODO)
309+

docs/08-flame-sdk.md

Lines changed: 63 additions & 2 deletions
@@ -1,9 +1,70 @@
 # Flame SDK

 ## Selector
-
+Users are able to implement new selectors in `lib/python/flame/selector/`, which should return a dictionary with keys corresponding to the active trainer IDs (i.e., agent IDs). After implementation, the new selector needs to be registered in both `lib/python/flame/selectors.py` and `lib/python/flame/config.py`.
 ### Currently Implemented Selectors
 1. Naive (i.e., select all)
+```json
+"selector": {
+    "sort": "default",
+    "kwargs": {}
+}
+```
 2. Random (i.e., select k out of n local trainers)
+```json
+"selector": {
+    "sort": "random",
+    "kwargs": {
+        "k": 1
+    }
+}
+```
+
+## Optimizer (i.e., aggregator of FL)
+Users can implement a new server-side optimizer (the client-side optimizer is defined in the actual ML code) in `lib/python/flame/optimizer`; it can take in hyperparameters, if any, and should return the aggregated weights in either PyTorch or TensorFlow format. After implementation, the new optimizer needs to be registered in both `lib/python/flame/optimizer.py` and `lib/python/flame/config.py`.
+
+### Currently Implemented Optimizers
+1. FedAvg (i.e., weighted average in terms of dataset size)
+```json
+"optimizer": {
+    "sort": "fedavg",
+    "kwargs": {}
+}
+```
+2. FedAdaGrad (i.e., server uses AdaGrad optimizer)
+```json
+"optimizer": {
+    "sort": "fedadagrad",
+    "kwargs": {
+        "beta_1": 0,
+        "eta": 0.1,
+        "tau": 0.01
+    }
+}
+```
+3. FedAdam (i.e., server uses Adam optimizer)
+```json
+"optimizer": {
+    "sort": "fedadam",
+    "kwargs": {
+        "beta_1": 0.9,
+        "beta_2": 0.99,
+        "eta": 0.01,
+        "tau": 0.001
+    }
+}
+```
+4. FedYogi (i.e., server uses Yogi optimizer)
+```json
+"optimizer": {
+    "sort": "fedyogi",
+    "kwargs": {
+        "beta_1": 0.9,
+        "beta_2": 0.99,
+        "eta": 0.01,
+        "tau": 0.001
+    }
+}
+```

-Users are able to implement new selectors in `lib/python/flame/selector/` which should return a dictionary with keys corresponding to the active trainer IDs (i.e., agent IDs). After implementation, the new selector needs to be registered into both `lib/python/flame/selectors.py` and `lib/python/flame/config.py`.
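As a concrete illustration of the selector contract stated above (return a dictionary keyed by the selected active trainer IDs), here is a hedged sketch of a random-k selector in plain Python; the class and method names are assumptions made for illustration, and the real base class and registration hooks live in the files named above.

```python
import random

# Hedged sketch of the selector contract described above: given candidate
# trainers (keyed by agent ID), return a dict whose keys are the selected
# active trainer IDs. Class/method names are hypothetical, not flame's API.


class RandomKSelector:
    def __init__(self, k: int = 1) -> None:
        self.k = k

    def select(self, candidates: dict) -> dict:
        chosen = random.sample(sorted(candidates), min(self.k, len(candidates)))
        return {agent_id: candidates[agent_id] for agent_id in chosen}


ends = {"agent-1": "ep1", "agent-2": "ep2", "agent-3": "ep3"}
print(RandomKSelector(k=2).select(ends))  # e.g. {'agent-1': 'ep1', 'agent-3': 'ep3'}
```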

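On the optimizer side, here is a hedged sketch of the FedAvg rule listed first above: a dataset-size weighted average of client weights. The flat-list weight format and function name are assumptions for illustration; flame's optimizers operate on PyTorch or TensorFlow weights as noted above.

```python
# Hedged FedAvg sketch: aggregate one flat weight vector per trainer,
# weighting each trainer by its dataset size. The input format is an
# assumption for illustration, not flame's actual optimizer interface.


def fedavg(client_weights: list[list[float]], dataset_sizes: list[int]) -> list[float]:
    """Return sum_i (n_i / n_total) * w_i over all trainers."""
    total = sum(dataset_sizes)
    agg = [0.0] * len(client_weights[0])
    for w, n in zip(client_weights, dataset_sizes):
        for j, wj in enumerate(w):
            agg[j] += (n / total) * wj
    return agg


# two trainers holding 100 and 300 samples respectively
print(fedavg([[1.0, 2.0], [3.0, 4.0]], [100, 300]))  # [2.5, 3.5]
```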
docs/images/parallel_exps.png

897 KB

docs/images/topologies.png

887 KB
