(About 30 min)
(About 15 min)
If you are provided with an AWS IAM account & pre-built binaries:
- If you just want to review figures & raw experimental data, see cluster-config-access-results-only.
- If you also want to reproduce all results from the beginning, see cluster-config-with-ami for setting up a cluster.
If you are not provided with an AWS account or you want to build everything from scratch, see cluster-config.
After logging in to the configured cluster, change to this directory (the one containing this README) in the hoplite repo.
Here is how you run the experiments:
Baseline (2-3 min): python model_ensembling.py ${scale}
Hoplite (1-2 min): python hoplite_model_ensembling.py ${scale}
${scale} controls the cluster size: scale=1 corresponds to 8 GPU nodes and scale=2 corresponds to 16 GPU nodes in the figure.
Each script prints the mean and standard deviation of throughput (queries/s) at the end.
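The throughput runs above can be scripted. The following is a minimal sketch, not part of the repo: `GPU_NODES`, `build_commands`, and `run_throughput_experiments` are hypothetical helper names, and the sketch only accepts the two scale values this README documents. It assumes it is launched from the directory that contains the two experiment scripts.

```python
import subprocess
import sys

# GPU node count for each documented scale value (taken from this README).
GPU_NODES = {1: 8, 2: 16}


def build_commands(scale):
    """Return the baseline and Hoplite command lines for one scale value."""
    if scale not in GPU_NODES:
        raise ValueError(f"scale must be one of {sorted(GPU_NODES)}, got {scale!r}")
    scripts = ["model_ensembling.py", "hoplite_model_ensembling.py"]
    # Use the current interpreter so the same Python environment is reused.
    return [[sys.executable, script, str(scale)] for script in scripts]


def run_throughput_experiments(scales=(1, 2)):
    """Run baseline then Hoplite for each scale, stopping on the first failure."""
    for scale in scales:
        for cmd in build_commands(scale):
            print(f"[{GPU_NODES[scale]} GPU nodes] {' '.join(cmd)}")
            subprocess.run(cmd, check=True)
```

Calling `run_throughput_experiments()` runs four experiments in sequence; each one prints its own throughput summary when it finishes.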
Baseline + fault tolerance test (About 2 min): python model_ensembling_fault_tolerance.py 1
Hoplite + fault tolerance test (About 2 min): python hoplite_model_ensembling_fault_tolerance.py 1
Run python analyze_fault_tolerance.py to compare the failure detection latency (see Section 5.5 in the paper).
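The fault tolerance steps can be chained in the same way. This is a hedged sketch rather than a script shipped with the repo: `FAULT_TOLERANCE_STEPS`, `fault_tolerance_commands`, and `run_fault_tolerance` are hypothetical names, and the Hoplite script name is assumed to end in a single `.py`. It assumes the analysis script should run only after both tests have finished.

```python
import subprocess
import sys

# Steps in the order given in this README: both tests first, then the analysis.
# Assumption: the Hoplite script file name ends in a single ".py".
FAULT_TOLERANCE_STEPS = [
    ["model_ensembling_fault_tolerance.py", "1"],
    ["hoplite_model_ensembling_fault_tolerance.py", "1"],
    ["analyze_fault_tolerance.py"],
]


def fault_tolerance_commands(python=sys.executable):
    """Return the full command line for each fault tolerance step."""
    return [[python, *step] for step in FAULT_TOLERANCE_STEPS]


def run_fault_tolerance():
    """Run every step in order; check=True aborts on the first failing step."""
    for cmd in fault_tolerance_commands():
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Aborting on the first failure keeps the analysis step from running against incomplete test output.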
The first run will be very slow on AWS (about 4 min) because Python has to generate cache files and do similar one-time setup. This is normal.