Paper List for Machine Learning Systems

Paper list for broad topics in machine learning systems

NOTE: Survey papers are annotated with [Survey 🔍] prefix.

1. Data Processing

[arxiv'25] The Streaming Batch Model for Efficient and Fault-Tolerant Heterogeneous Execution
[arxiv'25] In-Network Preprocessing of Recommender Systems on Multi-Tenant SmartNICs
[VLDB'25] cedar: Composable and Optimized Machine Learning Input Data Pipelines
[HotInfra'24] Lotus: Characterize Architecture Level CPU-based Preprocessing in Machine Learning Pipelines
[arxiv'24] TensorSocket: Shared Data Loading for Deep Learning Training
[arxiv'24] Efficient Tabular Data Preprocessing of ML Pipelines
[MLSys'22] Plumber: Diagnosing and Removing Performance Bottlenecks in Machine Learning Data Pipelines
[ISCA'22] Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training
[SIGMOD'22] Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines
[VLDB'21] Analyzing and Mitigating Data Stalls in DNN Training
[VLDB'21] tf.data: A Machine Learning Data Processing Framework

[TACO'23] Fastensor: Optimise the Tensor I/O Path from SSD to GPU for Deep Learning Training
[ICPP'22] Lobster: Load Balance-Aware I/O for Distributed DNN Training
[SC'21] Clairvoyant Prefetching for Distributed Machine Learning I/O

[TPDS'23] High-Level Data Abstraction and Elastic Data Caching for Data-Intensive AI Applications on Cloud-Native Platforms
[SOSP'23] UGACHE: A Unified GPU Cache for Embedding-based Deep Learning
[ATC'23] Tectonic-Shift: A Composite Storage Fabric for Large-Scale ML Training
[EuroSys'23] SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters [also in 2.1]
[FAST'23] SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training
[HPCA'23] iCACHE: An Importance-Sampling-Informed Cache for Accelerating I/O-Bound DNN Model Training
[NeurIPS'22] A Deep Learning Dataloader with Shared Data Preparation
[CLUSTER'22] Hvac: Removing I/O Bottleneck for Large-Scale Deep Learning Applications
[ICDE'22] Fluid: Dataset Abstraction and Elastic Acceleration for Cloud-native Deep Learning Training Jobs
[ATC'21] Refurbish Your Training Data: Reusing Partially Augmented Samples for Faster Deep Neural Network Training
[FAST'20] Quiver: An Informed Storage Cache for Deep Learning
[ICPP'20] DIESEL: A Dataset-Based Distributed Storage and Caching System for Large-Scale Deep Learning Training
[arXiv'19] Faster Neural Network Training with Data Echoing
[HotCloud'19] The Case for Unifying Data Loading in Machine Learning Clusters

[ECCV'22] L3: Accelerator-Friendly Lossless Image Format for High-Resolution, High-Throughput DNN Training
[VLDB'21] Progressive compressed records: Taking a byte out of deep learning data

[CIDR'21] Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines

[VLDB'18] Snorkel: Rapid Training Data Creation with Weak Supervision

2. Training System

[ICSE'24] An Empirical Study on Low GPU Utilization of Deep Learning Jobs
[NSDI'24] Characterization of Large Language Model Development in the Datacenter
[NSDI'22] MLaaS in the wild: workload analysis and scheduling in large-scale heterogeneous GPU clusters (PAI)
[ATC'19] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (Philly)

[arxiv'25] TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms
[TACO'24] Taming Flexible Job Packing in Deep Learning Training Clusters
[SoCC'24] Kale: Elastic GPU Scheduling for Online DL Model Training
[arxiv'24] Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling
[SC'24] PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters
[OSDI'24] MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale
[ASPLOS'24] Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters
[Middleware'24] Optimal Resource Efficiency with Fairness in Heterogeneous GPU Clusters
[IPDPS'24] Hadar: Heterogeneity-Aware Optimization-Based Online Scheduling for Deep Learning Cluster
[EuroSys'24] Blox: A Modular Toolkit for Deep Learning Schedulers
[NSDI'24] Swing: Short-cutting Rings for Higher Bandwidth Allreduce
[NSDI'24] Towards Domain-Specific Network Transport for Distributed DNN Training
[NSDI'24] Vulcan: Automatic Query Planning for Live ML Analytics
[NSDI'24] CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters
[Survey 🔍] [ACM CSUR'23] Deep Learning Workload Scheduling in GPU Datacenters: A Survey
[arxiv'23] Energy-Efficient GPU Clusters Scheduling for Deep Learning
[SC'23] EasyScale: Accuracy-consistent Elastic Training for Deep Learning
[ICPP'23] CoTrain: Efficient Scheduling for Large-Model Training upon GPU and CPU in Parallel
[ICPP'23] Embracing Uncertainty for Equity in Resource Allocation in ML Training
[SOSP'23] Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling
[NSDI'23] Shockwave: Proactive, Fair, and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning
[EuroSys'23] SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters [also in 1.2]
[EuroSys'23] Lyra: Elastic Scheduling for Deep Learning Clusters
[EuroSys'23] ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning
[ASPLOS'23] Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs
[arxiv'22] Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads
[Survey 🔍] [arxiv, 2022] Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision
[SoCC'22] ESCHER: Expressive Scheduling with Ephemeral Resources
[NSDI'22] MLaaS in the wild: workload analysis and scheduling in large-scale heterogeneous GPU clusters (PAI)
[OSDI'22] Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters (Synergy)
[SIGCOMM'22] Multi-resource interleaving for deep learning training (Muri)
[MLSys'21] Wavelet: Efficient DNN Training with Tick-Tock Scheduling
[SoCC'21] Chronus: A Novel Deadline-aware Scheduler for Deep Learning Training Jobs
[SC'21] Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters (Helios)
[OSDI'21] Privacy Budget Scheduling (DPF)
[NSDI'21] Elastic Resource Sharing for Distributed Deep Learning (AFS)
[OSDI'21] Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning
[EuroSys'20] Balancing efficiency and fairness in heterogeneous GPU clusters for deep learning (GandivaFair)
[NSDI'20] Themis: Fair and Efficient GPU Cluster Scheduling
[OSDI'20] HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees
[OSDI'20] Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads (Gavel)
[EuroSys'20] AlloX: Compute Allocation in Hybrid Clusters
[MLSys'20] Resource Elasticity in Distributed Deep Learning
[NSDI'19] Tiresias: A GPU Cluster Manager for Distributed Deep Learning
[ATC'19] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (Philly)
[EuroSys'18] Optimus: an efficient dynamic resource scheduler for deep learning clusters
[OSDI'18] Gandiva: Introspective Cluster Scheduling for Deep Learning

[arxiv'25] Memory Analysis on the Training Course of DeepSeek Models
[MICRO'24] SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts
[arxiv'24] Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator
[TACO'24] ATP: Achieving Throughput Peak for DNN Training via Smart GPU Memory Management
[ICML'24] GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
[ASPLOS'24] GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching
[arxiv'23] Rethinking Memory and Communication Cost for Efficient Large Language Model Training
[arxiv'23] Quantized Distributed Training of Large Models with Convergence Guarantees (QSDP)
[arxiv'23] Does compressing activations help model parallel training?
[SoCC'23] Towards GPU Memory Efficiency for Distributed Training at Scale
[VLDB'23] PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
[SOSP'23] Efficient Memory Management for Large Language Model Serving with PagedAttention
[HPCA'23] MPress: Democratizing Billion-Scale Model Training on Multi-GPU Servers via Memory-Saving Inter-Operator Parallelism
[HPCA'23] Tensor Movement Orchestration in Multi-GPU Training Systems
[IJCAI'23] OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning
[ICLR'22] LoRA: Low-Rank Adaptation of Large Language Models
- algorithmic method for memory efficiency
[VLDB'22] Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers
[ATC'21] ZeRO-Offload: Democratizing Billion-Scale Model Training
[ICLR'21] ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training
[ICLR'21] Dynamic Tensor Rematerialization
[SC'21] ZeRO-infinity: breaking the GPU memory wall for extreme scale deep learning
[HPCA'21] Sentinel: Efficient Tensor Migration and Allocation on Heterogeneous Memory Systems for Deep Learning
[MLSys'20] Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization
[ASPLOS'20] Capuchin: Tensor-based GPU Memory Management for Deep Learning
[ASPLOS'20] SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping
[ESEC/FSE'20] Estimating GPU memory consumption of deep learning models
[SC'20] ZeRO: memory optimizations toward training trillion parameter models
[ISCA'18] Gist: Efficient Data Encoding for Deep Neural Network Training
[PPoPP'18] Superneurons: dynamic GPU memory management for training deep neural networks
[MICRO'16] vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design
[arxiv'16] Training Deep Nets with Sublinear Memory Cost

[OSDI'23] Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters
[NSDI'23] ModelKeeper: Accelerating DNN Training via Automated Training Warmup
[OSDI'20] Retiarii: A Deep Learning Exploratory-Training Framework

For comprehensive list of GNN systems papers, refer to https://github.com/chwan1016/awesome-gnn-systems.

3. Inference System

4. Mixture of Experts (MoE)

This is the list of papers about MoE training and inference (collected from 2.6 and 3).

5. LLM Long Context

6. Federated Learning

7. Privacy-Preserving ML

8. ML APIs & Application-side Optimization

9. ML (LLM) for Systems

10. GPU Kernel Scheduling & Optimization

11. Energy-efficiency for LLM (carbon-aware)

12. Retrieval-Augmented Generation (RAG)

13. Simulation

Others

References

This repository is motivated by:

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
README.md		README.md

byungsoo-oh/ml-systems-papers

Folders and files

Latest commit

History

Repository files navigation

Paper List for Machine Learning Systems

Table of Contents

1. Data Processing

1.1 Data pipeline optimization

1.1.1 General

1.1.2 Prep stalls

1.1.3 Fetch stalls (I/O)

1.1.4 Specific workloads (GNN, DLRM)

1.2 Caching and distributed storage for ML training

1.3 Data formats

1.4 Data pipeline fairness and correctness

1.5 Data labeling automation

2. Training System

2.1 Empirical study on ML Jobs

2.2 DNN job scheduling

2.3 GPU sharing

2.4 GPU memory management and optimization

2.5 Distributed training (parallelism)

2.6 DL job failures & resilient training

2.7 AutoML

2.8 Communication optimization & network infrastructure for ML

2.9 Model compression

2.10 DNN compiler

2.11 GNN training system

3. Inference System

4. Mixture of Experts (MoE)

5. LLM Long Context

6. Federated Learning

7. Privacy-Preserving ML

8. ML APIs & Application-side Optimization

9. ML (LLM) for Systems

10. GPU Kernel Scheduling & Optimization

11. Energy-efficiency for LLM (carbon-aware)

12. Retrieval-Augmented Generation (RAG)

13. Simulation

Others

References

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Packages