
Commit 7a8433a

docs: three types of TAG and their examples (#140)
I added docs with examples for the three types of TAG (hierarchical FL, distributed training, and parallel experiments) under their respective example folders. I also fixed some typos in the current docs and added the corresponding TAG doc to the main docs folder.
1 parent fbd9969 commit 7a8433a

File tree

19 files changed: +491 −187 lines


docs/01-introduction.md

Lines changed: 1 addition & 1 deletion
@@ -81,7 +81,7 @@ The non-ochestration mode is useful in one of the following situations:
 * when the geo-distributed clusters are not under the management of one organization
 * when participants of a FL job want to have a control over when to join or leave the job

-In non-ochestration mode, the fleddge system is only responsible for managing (i.e., (de)allocation) non-data comsuming workers (e.g., model aggregating workers).
+In non-ochestration mode, the fleddge system is only responsible for managing (i.e., (de)allocation) non-data consuming workers (e.g., model aggregating workers).
 The system supports a hybrid mode where some are managed workers and others are non-managed workers.

 Note that the flame system is in active development and not all the functionalities are supported yet.

docs/02-getting-started.md

Lines changed: 0 additions & 2 deletions
@@ -23,15 +23,13 @@ pyenv version
 eval "$(pyenv init -)"
 echo -e '\nif command -v pyenv 1>/dev/null 2>&1; then\n eval "$(pyenv init -)"\nfi' >> ~/.bash_profile
 ```
-
 The following shows how to install the above packages in Ubuntu.
 ```bash
 sudo apt install golang
 sudo snap install golangci-lint
 pyenv install 3.9.6
 pyenv global 3.9.6
 pyenv version
-
 eval "$(pyenv init -)"
 echo -e '\nif command -v pyenv 1>/dev/null 2>&1; then\n eval "$(pyenv init -)"\nfi' >> ~/.bash_profile
 ```

docs/03-fiab.md

Lines changed: 5 additions & 5 deletions
@@ -7,7 +7,7 @@ The flame system consists of four components: apiserver, controller, notifier an
 It also includes mongodb as backend state store.
 This development environment is mainly tested under MacOS.
 This guideline is primarily based on MacOS.
-However, this dev environment doesn't work under latest Apple machines with M1 chip set.
+However, this dev environment doesn't work under latest Apple machines with M1 chip set because hyperkit is not yet supported for M1 Mac.
 The fiab is also tested under Archlinux. Hence, it may work on other Linux distributions such as Ubuntu.

 The `flame/fiab` folder contains several scripts to configure and set up the fiab environment.

@@ -21,15 +21,15 @@ fiab relies on `minikube`, `kubectl`, `helm`, `docker` and `jq`.

 fiab doesn't support docker driver (hence, docker desktop). fiab uses ingress and ingress-dns addons in minikube.
 When docker driver is chosen, these two addons are only supported on Linux (see [here](https://minikube.sigs.k8s.io/docs/drivers/docker/)
-and [here](https://github.com/kubernetes/minikube/issues/7332)). Note that while the issue 7332 is not closed and appears to be fixed,
-ingress and ingress-dns still doesn't work under fiab environment on MacOs.
+and [here](https://github.com/kubernetes/minikube/issues/7332)). Note that while the issue 7332 is now closed and appears to be fixed,
+ingress and ingress-dns still don't work under fiab environment on MacOs.
 In addition, note that the docker subscription service agreement has been updated for `docker desktop`.
 Hence, `docker desktop` may not be free. Please check out the [agreement](https://www.docker.com/products/docker-desktop).

 Hence, fiab uses `hyperkit` as its default vm driver. Using `hyperkit` has some drawbacks.
 First, as of May 21, 2022, `hyperkit` driver doesn't support M1 chipset.
 Second, the `hyperkit` driver doesn't work with dnscrypt-proxy or dnsmasq well.
-Thus, if dnscrypt-proxy or dnsmasq is installed in the system, see [here](#Fixing docker build error) for details and a workaround.
+Thus, if dnscrypt-proxy or dnsmasq is installed in the system, see [here](#fixing-docker-build-error) for details and a workaround.

 Note that other drivers such as `virtualbox` are not tested.

@@ -80,7 +80,7 @@ Next, `ingress` and `ingress` addons need to be installed with the following com
 minikube addons enable ingress
 minikube addons enable ingress-dns
 ```
-When `hyperkit` driver is in use, enabling `ingress` addon may fail due to the same issue shown in [here](#Fixing docker build error),
+When `hyperkit` driver is in use, enabling `ingress` addon may fail due to the same issue shown in [here](#fixing-docker-build-error),
 which explains a workaround. Once the workload is applied, come back here and rerun these commands.


docs/04-examples.md

Lines changed: 8 additions & 51 deletions
@@ -1,6 +1,6 @@
 # Examples

-This section currently presents one example: FL training for MNIST. More examples will follow in the future.
+This section currently presents one example: FL training for MNIST. More examples will follow in the future; you can find instructions in the README file of each example's folder.

 ## MNIST

@@ -82,7 +82,7 @@ flamectl get jobs
 ```

 ### Step 7: start a job
-Before staring your job, you can always use `flamectl get` to check each step is set up corretly. For more info, check
+Before starting your job, you can always use `flamectl get` to check that each step is set up correctly. For more info, check
 ```bash
 flamectl get --help
 ```

@@ -157,52 +157,9 @@ The log for a task is similar to `task-61bd2da4dcaed8024865247e.log` under `/var
 As an alternative, one can check the progress at MLflow UI in the fiab setup.
 Open a browser and go to http://mlflow.flame.test.

-## Hierarchical MNIST
-Likewise, the hierarchical FL example follows the same fashion.
-
-Navigate to `./examples/hier_mnist`
-
-### Step 1:
-```bash
-flamectl create design hier_mnist -d "hier_mnist example"
-```
-### Step 2:
-```bash
-flamectl create schema schema.json --design hier_mnist
-```
-The schema defines the topology of this FL job. For more info, please refer to [05-flame-basics](05-flame-basics.md).
-### Step 3:
-```bash
-flamectl create code hier_mnist.zip --design hier_mnist
-```
-The zip file should contain code of every code specified in the schema.
-
-### Step 4:
-```bash
-flamectl create dataset dataset_eu_germany.json
-flamectl create dataset dataset_eu_uk.json
-flamectl create dataset dataset_na_canada.json
-flamectl create dataset dataset_na_us.json
-```
-Flame will assign a trainer to each dataset. As each dataset has a `realm` specified, the middle aggreagator will be created based on the corresponding `groupby` tag. In this case, there will be one middle aggregator for Europe (eu) and one for North America (na).
-
-### Step 5:
-Put all four dataset IDs into `job.json`, and change training hyperparameters as you like.
-```json
-"fromSystem": [
-    "62439c3725fe244585396ad7",
-    "6243a10c25fe244585396af0",
-    "6243a13625fe244585396af2",
-    "6243a14525fe244585396af3"
-]
-```
-
-### Step 6:
-```bash
-flamectl create job job.json
-```
-
-### Step 7:
-```bash
-flamectl start job ${Job ID}
-```
+For other examples, please visit their particular example directories:
+- [Medical Image Multi-class Classification with PyTorch](../examples/medmnist/README.md)
+- [Binary Income Classification with Tabular Dataset](../examples/adult/README.md)
+- [Toy Example of Hierarchical FL](../examples/hier_mnist/README.md)
+- [Toy Example of Parallel Experiments](../examples/parallel_experiment/README.md)
+- [Toy Example of Distributed Training](../examples/distributed_training/README.md)

docs/05-flame-basics.md

Lines changed: 99 additions & 12 deletions
@@ -18,13 +18,13 @@ The key benefits of the abstraction are:
 Depending on the availability of different communication infrastructures and security policies,
 a workload can be easily changed from one communication technology to another.

-**High extensibility**: TAG makes it easy to support a variety of different topologies. Therefore, it can potentially support many different usecases easily.
+**High extensibility**: TAG makes it easy to support a variety of different topologies. Therefore, it can potentially support many different use cases easily.

 <p align="center"><img src="images/role_channel.png" alt="role and channel" /></p>


 Now let us describe how TAG is enabled. TAG is comprised of two basic and yet simple building blocks: *role* and *channel*.
-A *role* represents a vertex in TAG and should be associated with some hevaviors.
+A *role* represents a vertex in TAG and should be associated with some behaviors.
 To create an association between a role and its behavior, (python) code must be attached to the role.
 Once the association is done, a role is fully *defined*.


@@ -45,7 +45,7 @@ A channel also has two attributes: *groupBy* and *funcTags*.

 **groupBy**: This attribute is used to group roles of the channel based on a tag.
 Therefore, the groupBy attribute allows building a hierarchical topology (e.g., a single-rooted multi-level tree), for instance, based on geographical location tags (e.g., us, uk, fr, etc).
-Currently a string-based tag is supported. Future extensions may include more dynamic grouping based on dynamic metrics such as latency, data (dis)simiarlity, and so on.
+Currently a string-based tag is supported. Future extensions may include more dynamic grouping based on dynamic metrics such as latency, data (dis)similarity, and so on.

 **funcTags**: This attribute (discussed later in detail) contains what actions a role would take on the channel.
 As mentioned earlier, a role is associated with executable code.

@@ -54,13 +54,13 @@ We will discuss how to use funcTags correctly in the later part.

 ### TAG Example 1: Two-Tier Topology
 In flame, a topology is expressed within a concept called *schema*.
-A schema is a resuable component as a template.
+A schema is a reusable component as a template.
 The following presents a simple two-tier cross-device topology.

 ```json
 {
     "name": "A sample schema",
-    "description": "a sample schema to demostrate a TAG layout",
+    "description": "a sample schema to demonstrate a TAG layout",
     "roles": [
         {
             "name": "trainer",

@@ -102,15 +102,15 @@ When datasets are selected (more details [here (not yet updated)]()), each datas
 Therefore, in the flame system, **the number of datasets will drive the number of data-consuming workers** (e.g., trainer in this case).
 Subsequently, the number of non data-consuming workers is derived from the entries in the *groupBy* feature (more on [later]()).

-Now let's look at channels. Channels are expressed as a list. A channel consits of four key attributes: *name*, *pair*, *groupBy* and *funcTags*.
+Now let's look at channels. Channels are expressed as a list. A channel consists of four key attributes: *name*, *pair*, *groupBy* and *funcTags*.
 The *name* attribute is used to uniquely identify a channel.
 The *pair* attribute contains two roles that constitute the channel; each role takes one end of the channel.
 For correctness, roles in the pair must exist in the role list.

 The *groupBy* attribute specifies how to group or cluster workers of the two ends (or roles) in the channel. It's optional.
 If this attribute is not defined, workers belonging to the channel are grouped into a default group.

-With *pair* and *groupBy*, a channel only specifies what roles consititue a channel and how they are grouped.
+With *pair* and *groupBy*, a channel only specifies what roles constitute a channel and how they are grouped.
 But it doesn't know what actions each role takes on the channel. The *funcTags* attribute allows *dynamic* binding of functions to a channel.
 The software code attached to a role must define a set of functions that it wants to expose to users
 so that the users can specify it in the schema. Therefore, it allows more complex operations on a channel.

@@ -125,7 +125,7 @@ def get_func_tags(cls) -> list[str]:
 ```

 Note that keys used in *funcTags* (e.g., "trainer" or "aggregator") do not have a direct relation to classes
-such as Aggregtor or Trainer in the `lib/python/flame/mode/horizontal/`. Those keys are only meaningful in the schema.
+such as Aggregator or Trainer in `lib/python/flame/mode/horizontal/`. Those keys are only meaningful in the schema.
 And *funcTags* is updated at the time when code is associated with a role in the schema.

 With the above configuration, the deployed topology looks as follows.
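To make the funcTags binding concrete, here is a minimal sketch of how a role's code might expose and implement its function tags. Only `get_func_tags` mirrors the snippet above; the class name, tag names, and method bodies are hypothetical illustrations rather than flame's actual trainer API.

```python
# Hedged sketch: a role's code advertises its exposed functions via
# get_func_tags, and a schema can then bind those tags to a channel's
# funcTags entry. Names below are hypothetical, not flame's real API.


class Trainer:
    @classmethod
    def get_func_tags(cls) -> list[str]:
        # tags a schema may reference in a channel's "funcTags" section
        return ["fetch", "upload"]

    def fetch(self) -> None:
        """Receive the latest global model over the channel (illustrative stub)."""

    def upload(self) -> None:
        """Send locally trained weights back over the channel (illustrative stub)."""
```

Because the binding is declared in the schema, the same role code can take different actions on different channels simply by listing different tags.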
@@ -139,7 +139,7 @@ The hierarchical topology is very similar to the simple two-tier topology except
 ```json
 {
     "name": "A simple example schema v1.0.1",
-    "description": "a sample schema to demostrate the hierarchical FL setting",
+    "description": "a sample schema to demonstrate the hierarchical FL setting",
     "roles": [
         {
             "name": "trainer",
@@ -210,13 +210,100 @@ The above example uses "us", "europe" and "asia" as labels and is visualized as

 <p align="center"><img src="images/hierarchical_topo.png" alt="Hierarchical topology" width="600px" /></p>

-### How to move from 2-tier to Hierarchical Topology
-From 2-tier to hierarchical (e.g., 3-tier), you need to have one more role in between top aggreagator and trainer, so you add middle aggreagator into the topology (i.e., schema), which also require you to define new channels connecting between each two roles. In order for the hierarchical concept to work, the `groupBy` of upstream channel shouldn't be more specific than the downstream channel.
+#### How to move from 2-tier to hierarchical topology
+To go from 2-tier to hierarchical (e.g., 3-tier), you need one more role between the top aggregator and the trainer, so you add a middle aggregator into the topology (i.e., schema), which also requires you to define new channels connecting each pair of roles. In order for the hierarchical concept to work, the `groupBy` of the upstream channel shouldn't be more specific than that of the downstream channel.
 Likewise, when you want to expand to a 4-tier topology, you will need a new channel definition connecting the two middle aggregators.

 However, it is still unclear how workers are grouped together at run time.
 A brief answer is as follows: in the flame system, before workers are created, they are configured with an attribute called *realm*.
 This attribute is a logical hierarchical value which is similar to a directory-like structure in a file system.
 It basically dictates where workers should be created and to which path the workers belong in the logical hierarchy.
 Given this hierarchical information, users can judiciously choose grouping labels.
-Further discussion is available [here (not yet updated)]().
+
+### TAG Example 3: Parallel Experiments
+The flame system allows multiple identical TAGs to run in parallel based on the `groupBy` tag, for example allowing a 2-tier FL task to run in parallel for 3 geographical regions simultaneously (see image below).
+
+<p align="center"><img src="images/parallel_exps.png" alt="Parallel Experiments" width="600px" /></p>
+
+```json
+{
+    "name": "A sample schema",
+    "description": "a sample schema to demonstrate the parallel experiment setting",
+    "roles": [
+        {
+            "name": "trainer",
+            "description": "It consumes the data and trains local model",
+            "isDataConsumer": true
+        },
+        {
+            "name": "aggregator",
+            "description": "It aggregates the updates from trainers"
+        }
+    ],
+    "channels": [
+        {
+            "name": "param-channel",
+            "description": "Model update is sent from trainer to aggregator and vice-versa",
+            "pair": [
+                "trainer",
+                "aggregator"
+            ],
+            "groupBy": {
+                "type": "tag",
+                "value": [
+                    "default/us",
+                    "default/eu",
+                    "default/asia"
+                ]
+            },
+            "funcTags": {
+                "trainer": ["fetch", "upload"],
+                "aggregator": ["distribute", "aggregate"]
+            }
+        }
+    ]
+}
+```
+
+This topology is the same as the 2-tier one except that there are additional *value* entries in the *groupBy* tag.
+
+### TAG Example 4: Distributed Learning
+The flame system allows distributed training besides federated learning. In TAG, this is expressed by creating a self-loop (see image below) that enables channel communication between trainers, so that algorithms such as ring all-reduce can be used to train the model using multiple trainers.
+
+<p align="center"><img src="images/topologies.png" alt="Four Topologies" width="600px" /></p>
+
+```json
+{
+    "name": "A sample schema",
+    "description": "a sample schema to demonstrate the distributed training setting",
+    "roles": [
+        {
+            "name": "trainer",
+            "description": "It consumes the data and trains local model",
+            "isDataConsumer": true
+        }
+    ],
+    "channels": [
+        {
+            "name": "param-channel",
+            "description": "Model update is sent from trainer to other trainers",
+            "pair": [
+                "trainer",
+                "trainer"
+            ],
+            "groupBy": {
+                "type": "tag",
+                "value": [
+                    "default/us"
+                ]
+            },
+            "funcTags": {
+                "trainer": ["ring_allreduce"]
+            }
+        }
+    ]
+}
+```
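Since the schema above only names `ring_allreduce` as a func tag, here is a small, self-contained sketch of the ring all-reduce idea itself in plain Python (no flame APIs): each trainer's vector is split into chunks that travel around the ring, first accumulating partial sums (reduce-scatter) and then circulating the finished chunks (all-gather). This models the arithmetic only; a real implementation runs the steps concurrently across trainers.

```python
def ring_allreduce(vectors: list[list[float]]) -> None:
    """In-place simulation: every trainer ends up with the element-wise sum."""
    n, dim = len(vectors), len(vectors[0])
    # split indices [0, dim) into n contiguous chunks
    bounds = [i * dim // n for i in range(n + 1)]

    def chunk(c: int) -> range:
        c %= n
        return range(bounds[c], bounds[c + 1])

    # phase 1 (reduce-scatter): in each step, node i adds its chunk (i - step)
    # into its right-hand neighbor; after n - 1 steps each node owns one
    # fully summed chunk.
    for step in range(n - 1):
        for i in range(n):
            nxt = (i + 1) % n
            for k in chunk(i - step):
                vectors[nxt][k] += vectors[i][k]

    # phase 2 (all-gather): the finished chunks circulate around the ring,
    # overwriting stale values, until every node holds the full sum.
    for step in range(n - 1):
        for i in range(n):
            nxt = (i + 1) % n
            for k in chunk(i + 1 - step):
                vectors[nxt][k] = vectors[i][k]


weights = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
ring_allreduce(weights)
print(weights)  # each trainer now holds [12.0, 15.0, 18.0]; divide by 3 to average
```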
+
+### TAG Example 5: Hybrid Model (TODO)
309+

docs/08-flame-sdk.md

Lines changed: 63 additions & 2 deletions
@@ -1,9 +1,70 @@
 # Flame SDK

 ## Selector
-
+Users are able to implement new selectors in `lib/python/flame/selector/`, which should return a dictionary with keys corresponding to the active trainer IDs (i.e., agent IDs). After implementation, the new selector needs to be registered in both `lib/python/flame/selectors.py` and `lib/python/flame/config.py`.
 ### Currently Implemented Selectors
 1. Naive (i.e., select all)
+```json
+"selector": {
+    "sort": "default",
+    "kwargs": {}
+}
+```
 2. Random (i.e., select k out of n local trainers)
+```json
+"selector": {
+    "sort": "random",
+    "kwargs": {
+        "k": 1
+    }
+}
+```
+
+## Optimizer (i.e., aggregator of FL)
+Users can implement a new server-side optimizer (the client-side optimizer is defined in the actual ML code) in `lib/python/flame/optimizer`; it can take in hyperparameters, if any, and should return the aggregated weights in either PyTorch or TensorFlow format. After implementation, the new optimizer needs to be registered in both `lib/python/flame/optimizer.py` and `lib/python/flame/config.py`.
+
+### Currently Implemented Optimizers
+1. FedAvg (i.e., weighted average in terms of dataset size)
+```json
+"optimizer": {
+    "sort": "fedavg",
+    "kwargs": {}
+}
+```
+2. FedAdaGrad (i.e., server uses AdaGrad optimizer)
+```json
+"optimizer": {
+    "sort": "fedadagrad",
+    "kwargs": {
+        "beta_1": 0,
+        "eta": 0.1,
+        "tau": 0.01
+    }
+}
+```
+3. FedAdam (i.e., server uses Adam optimizer)
+```json
+"optimizer": {
+    "sort": "fedadam",
+    "kwargs": {
+        "beta_1": 0.9,
+        "beta_2": 0.99,
+        "eta": 0.01,
+        "tau": 0.001
+    }
+}
+```
+4. FedYogi (i.e., server uses Yogi optimizer)
+```json
+"optimizer": {
+    "sort": "fedyogi",
+    "kwargs": {
+        "beta_1": 0.9,
+        "beta_2": 0.99,
+        "eta": 0.01,
+        "tau": 0.001
+    }
+}
+```

-Users are able to implement new selectors in `lib/python/flame/selector/` which should return a dictionary with keys corresponding to the active trainer IDs (i.e., agent IDs). After implementation, the new selector needs to be registered into both `lib/python/flame/selectors.py` and `lib/python/flame/config.py`.
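As a concrete illustration of the selector contract stated above (return a dictionary keyed by the selected active trainer IDs), here is a hedged sketch of a random-k selector in plain Python; the class and method names are assumptions made for illustration, and the real base class and registration hooks live in the files named above.

```python
import random

# Hedged sketch of the selector contract described above: given candidate
# trainers (keyed by agent ID), return a dict whose keys are the selected
# active trainer IDs. Class/method names are hypothetical, not flame's API.


class RandomKSelector:
    def __init__(self, k: int = 1) -> None:
        self.k = k

    def select(self, candidates: dict) -> dict:
        chosen = random.sample(sorted(candidates), min(self.k, len(candidates)))
        return {agent_id: candidates[agent_id] for agent_id in chosen}


ends = {"agent-1": "ep1", "agent-2": "ep2", "agent-3": "ep3"}
print(RandomKSelector(k=2).select(ends))  # e.g. {'agent-1': 'ep1', 'agent-3': 'ep3'}
```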

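On the optimizer side, here is a hedged sketch of the FedAvg rule listed first above: a dataset-size weighted average of client weights. The flat-list weight format and function name are assumptions for illustration; flame's optimizers operate on PyTorch or TensorFlow weights as noted above.

```python
# Hedged FedAvg sketch: aggregate one flat weight vector per trainer,
# weighting each trainer by its dataset size. The input format is an
# assumption for illustration, not flame's actual optimizer interface.


def fedavg(client_weights: list[list[float]], dataset_sizes: list[int]) -> list[float]:
    """Return sum_i (n_i / n_total) * w_i over all trainers."""
    total = sum(dataset_sizes)
    agg = [0.0] * len(client_weights[0])
    for w, n in zip(client_weights, dataset_sizes):
        for j, wj in enumerate(w):
            agg[j] += (n / total) * wj
    return agg


# two trainers holding 100 and 300 samples respectively
print(fedavg([[1.0, 2.0], [3.0, 4.0]], [100, 300]))  # [2.5, 3.5]
```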
docs/images/parallel_exps.png

897 KB

docs/images/topologies.png

887 KB
