This repository records my paper-reading and collection process. By documenting each paper's research methods and key points, and organizing the papers by category, it helps me stay on track as I dig deeper into my research direction. The papers focus on Kubernetes and scheduling.

Keyword abbreviations used in the tables below:
- AS: Auto Scaling
- DL: Deep Learning
- DMLCS: Distributed Machine Learning Centralized Scheduling
- DS: Distributed System
- ML: Machine Learning
- NE: Network Efficiency
- PA: Performance Analysis
- PS: Parameter Server
- PT: Parallelized Training
- RC: Resource Contention
- RM: Resource Management
- RS: Resource Scheduling
- RU: Resource Utilization
Keywords | Paper Title | Paper | Slide | Year |
---|---|---|---|---|
DL, Scheduling | Gandiva: Introspective Cluster Scheduling for Deep Learning | [pdf] | [slide] | 2018 |
DL, CPU, RS | Scheduling CPU for GPU-based Deep Learning Jobs | [pdf] | [slide] | 2018 |
DL, NE, Scheduling | DLTAP: A Network-efficient Scheduling Method for Distributed Deep Learning Workload in Containerized Cluster Environment | [pdf] | [slide] | 2018 |
DL,Training System | Project Adam: Building an Efficient and Scalable Deep Learning Training System | [pdf] | [Video] | 2014 |
DL, PS, Rack-Scale | Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training | [pdf] | [slide] | 2018 |
ML, DS, PS | Scaling Distributed Machine Learning with the Parameter Server | [pdf] | [slide] [Video] | 2014 |
ML, Infra | Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective | [pdf] | [slide] | 2018 |
RM | Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters | [pdf] | [slide] | 2018 |
Scheduling, GPU, PA, RC | Topology-Aware GPU Scheduling for Learning Workloads in Cloud Environments | [pdf] | [slide] | 2017 |
DL, RO, Job Scheduling, Autoscaling | DRAGON: A Dynamic Scheduling and Scaling Controller for Managing Distributed Deep Learning Jobs in Kubernetes Cluster | [pdf] | [slide] | 2019 |
Keywords | Paper Title | Paper | Slide | Year |
---|---|---|---|---|
DL, AS, kubernetes | Deep Learning Based Auto-Scaling Load Balancing Mechanism for Distributed Software-Defined Storage Service | [pdf] | [slide] | 2018 |
ML, benchmarking, kubernetes | Kubebench: A Benchmarking Platform for ML Workloads | [pdf] | [slide] | 2018 |
RM, DMLCS, RU, kubernetes, kubeflow | GAI: A Centralized Tree-Based Scheduler for Machine Learning Workload in Large Shared Clusters | [pdf] | [slide] | 2018 |
DL, Scheduling, Algorithm | Online Job Scheduling in Distributed Machine Learning Clusters | [pdf] | [slide] | 2018 |
Autoscaling, kubernetes | Containers Orchestration with Cost-Efficient Autoscaling in Cloud Computing Environments | [pdf] | [slide] | 2018 |
DL, PT, kubernetes | Parallelized Training of Deep NN – Comparison of Current Concepts and Frameworks | [pdf] | [slide] | 2018 |
Keywords | Paper Title | Paper | Slide | Year |
---|---|---|---|---|
DL, DS | Multi-tenant GPU Clusters for Deep Learning Workloads: Analysis and Implications | [pdf] | [slide] | 2018 |
DL, DS | GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server | [pdf] | [slide] | 2016 |
DL | Poseidon: A system architecture for efficient GPU-based deep learning on multiple machines | [pdf] | [slide] | 2015 |
Mesos, Marathon, Ceph | Toward High-Availability Container as a Service on Mesos Cluster with Distributed Shared Volumes | [pdf] | [slide] | 2015 |
DL, System | Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools | [pdf] | [slide] | 2019 |
- Comparison of Container Schedulers
- The evolution of cluster scheduler architectures [translated]
- TensorFlow: A system for large-scale machine learning
- Traditional scheduling architecture
- Distributed Machine Learning Cluster
  - Model training
  - Framework
    - Parameter Server / AllReduce (see the aggregation sketch after this list)
    - Combination of both
- Scheduler affinity (see the affinity sketch after this list)
- Scheduler Policy
- Hardware GPU topology
- Kube-batch
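
The Parameter Server / AllReduce item above contrasts the two dominant gradient-aggregation topologies in the papers listed here. Below is a minimal, framework-free Python sketch of both patterns; the worker count, gradient sizes, and helper names are illustrative assumptions, not code from any listed paper.

```python
import numpy as np

def parameter_server_step(grads, params, lr=0.1):
    """Centralized pattern: workers push gradients to one logical
    server, which averages them, applies the update, and lets the
    workers pull fresh parameters back."""
    return params - lr * np.mean(grads, axis=0)

def ring_allreduce(grads):
    """Decentralized pattern: workers pass gradient chunks around a
    ring until every worker holds the same averaged gradient, with no
    central server. Logical simulation of reduce-scatter + all-gather;
    real systems run these exchanges in parallel."""
    n = len(grads)
    chunks = [list(np.array_split(g.astype(float), n)) for g in grads]

    # reduce-scatter: after n-1 steps, worker i owns the fully
    # reduced chunk (i + 1) % n
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n  # chunk worker i forwards this step
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

    # all-gather: circulate the reduced chunks until everyone has all
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n  # reduced chunk worker i forwards
            chunks[(i + 1) % n][c] = chunks[i][c].copy()

    return [np.concatenate(c) / n for c in chunks]  # per-worker mean

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = np.zeros(8)
    grads = [rng.normal(size=8) for _ in range(4)]  # 4 simulated workers

    via_ps = parameter_server_step(grads, params)
    via_ar = params - 0.1 * ring_allreduce(grads)[0]
    assert np.allclose(via_ps, via_ar)  # same math, different topology
```

Both paths compute the same averaged update; they differ in where the aggregation happens: a central server that can become a network bottleneck, versus peer-to-peer chunk exchange whose per-worker bandwidth stays roughly constant as the cluster grows.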
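
For the scheduler-affinity item, here is a sketch of constraining Pod placement with node affinity through the official kubernetes Python client. The `gpu-model` label key, its values, and the pod/image names are hypothetical; only the client classes and the `create_namespaced_pod` call are real API.

```python
from kubernetes import client, config

def gpu_affinity_pod() -> client.V1Pod:
    """Build a Pod that the scheduler may only place on nodes
    carrying a matching GPU label (hypothetical label key)."""
    affinity = client.V1Affinity(
        node_affinity=client.V1NodeAffinity(
            required_during_scheduling_ignored_during_execution=client.V1NodeSelector(
                node_selector_terms=[
                    client.V1NodeSelectorTerm(
                        match_expressions=[
                            client.V1NodeSelectorRequirement(
                                key="gpu-model",        # hypothetical node label
                                operator="In",
                                values=["v100", "p100"],
                            )
                        ]
                    )
                ]
            )
        )
    )
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name="trainer-0"),  # hypothetical name
        spec=client.V1PodSpec(
            affinity=affinity,
            containers=[
                client.V1Container(
                    name="trainer",
                    image="tensorflow/tensorflow:latest-gpu",
                    resources=client.V1ResourceRequirements(
                        limits={"nvidia.com/gpu": "1"}  # one GPU per worker
                    ),
                )
            ],
        ),
    )

if __name__ == "__main__":
    config.load_kube_config()  # needs a reachable cluster/kubeconfig
    client.CoreV1Api().create_namespaced_pod("default", gpu_affinity_pod())
```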