Welcome to the SparkDLTrigger repository!
This project builds an end-to-end machine learning pipeline for a high-energy physics particle classifier, using mainstream open-source tools and techniques for ML and data engineering.
This project is accompanied by several articles and presentations that provide further insights and details:
- Article: "Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics", Comput Softw Big Sci 4, 8 (2020)
- Blog posts and conference presentations: see the full list at the end of this README
This work builds a particle classifier that improves the accuracy of event selection in High Energy Physics. Neural networks are used to classify different event topologies of interest, improving on the state-of-the-art accuracy. The work reproduces the findings of the research article "Topology classification with deep learning to improve real-time event selection at the LHC" and implements them at scale with mainstream open-source tools and techniques for ML and data engineering, such as Apache Spark, TensorFlow/Keras, and Jupyter notebooks. The workload has been deployed on different computing resources at CERN and in the cloud, including CERN's Hadoop and Spark service clusters and GPU resources on Kubernetes.
- Authors and contacts: [email protected], [email protected], [email protected]
- Acknowledgements:
  - Original research article, raw data, and neural network models: T.Q. Nguyen et al., Comput Softw Big Sci (2019) 3: 12
  - Marco Zanetti, Thong Nguyen, Maurizio Pierini, Viktor Khristenko, CERN openlab
  - Members of the Hadoop and Spark service at CERN, the CMS Bigdata project, and the Intel team for BigDL and Analytics Zoo consultancy: Jiao (Jennie) Wang and Sajan Govindan
The project repository is organized into the following sections:
| Location | Description |
|----------|-------------|
| Download datasets | Contains the datasets required for the project. |
| Data ingestion and feature preparation | Covers the process of data ingestion and feature preparation using Apache Spark. |
| Preparation of the datasets in Parquet and TFRecord formats | Provides instructions for preparing the datasets in Parquet and TFRecord formats. |
| Hyperparameter tuning | Explains the process of hyperparameter tuning to optimize model performance. |
| HLF classifier with Keras | Demonstrates the training of a High-Level Features (HLF) classifier using a simple model and a small dataset. The notebooks also showcase several methods for feeding Parquet data to TensorFlow: in memory, via Pandas, TFRecords, and tf.data (a minimal sketch follows below the table). |
| Inclusive classifier | A classifier based on a Recurrent Neural Network; it is data-intensive and demonstrates training when the data cannot fit in memory. |
| Methods for distributed training | Discusses methods for distributed training. |
| Training_Spark_ML | Covers training of tree-based models run in parallel with Spark MLlib Random Forest, XGBoost, and LightGBM (a minimal sketch follows below the table). |
| Saved models | Contains saved models. |
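As an illustration of one of the data-feeding methods mentioned in the HLF classifier row, here is a minimal sketch of reading Parquet data with Pandas and wrapping it in a tf.data.Dataset. The file path, column names, and label column are placeholders, not the project's actual schema; see the notebooks in the HLF classifier folder for the exact code.

```python
# Minimal sketch: feed Parquet data to TensorFlow via Pandas and tf.data.
# The file path and column names below are illustrative placeholders.
import pandas as pd
import tensorflow as tf

df = pd.read_parquet("hlf_features.parquet")   # hypothetical dataset file

features = df.drop(columns=["label"]).values.astype("float32")
labels = df["label"].values.astype("int64")

dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
           .shuffle(buffer_size=10_000)
           .batch(128)
           .prefetch(tf.data.AUTOTUNE))

# Inspect one batch to verify shapes before training
for x_batch, y_batch in dataset.take(1):
    print(x_batch.shape, y_batch.shape)
```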
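Similarly, in the spirit of the Training_Spark_ML folder, the following is a minimal sketch of training a tree-based model with Spark MLlib. The input path, feature column names, and label column are assumptions for illustration only.

```python
# Minimal sketch: train a Random Forest classifier with Spark MLlib.
# Input path and column names are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("rf-training").getOrCreate()

df = spark.read.parquet("training_dataset.parquet")  # hypothetical dataset

# Assemble the (hypothetical) feature columns into a single vector column
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)
model = rf.fit(train)

# Evaluate on the held-out split
predictions = model.transform(test)
accuracy = MulticlassClassificationEvaluator(
    labelCol="label", metricName="accuracy").evaluate(predictions)
print(f"Test accuracy: {accuracy:.3f}")
```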
Additionally, you can explore the archived work in the article_2020 branch.
Note: Each section includes detailed instructions and examples to guide you through the process.
Data pipelines play a crucial role in the success of machine learning projects. They integrate various components and APIs for seamless data processing throughout the entire data chain. Implementing an efficient data pipeline can significantly accelerate and enhance productivity in the core machine learning tasks. In this project, we have developed a data pipeline consisting of the following four steps:
Data Ingestion
- Description: In this step, we read data in the ROOT format from the CERN EOS storage system into a Spark DataFrame. The resulting data is then saved as a table stored in Apache Parquet files (a minimal sketch follows).
- Objective: The data ingestion step ensures that the necessary data is accessible for further processing.
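A minimal sketch of this step is shown below. It assumes a spark-root data source is available to Spark and that an EOS path is accessible; the package name, TTree name, and paths are illustrative and may differ from what the project's notebooks actually use.

```python
# Minimal sketch: read ROOT files from EOS into a Spark DataFrame and save as Parquet.
# Assumes the spark-root data source is on the Spark classpath; names and paths
# below are illustrative placeholders, not the project's actual configuration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-ingestion").getOrCreate()

events = (spark.read
          .format("org.dianahep.sparkroot")   # spark-root data source (assumed)
          .option("tree", "Events")           # hypothetical TTree name
          .load("root://eospublic.cern.ch//eos/path/to/input/*.root"))

# Persist the ingested events as Parquet for the feature-engineering step
events.write.mode("overwrite").parquet("events.parquet")
```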
Feature Engineering and Event Selection
- Description: This step processes the Parquet files generated during the data ingestion phase. The files contain detailed event information, which is further filtered and transformed to produce datasets with new features (a minimal sketch follows).
- Objective: Feature engineering and event selection enhance the data representation, making it suitable for training machine learning models.
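The sketch below illustrates the kind of filtering and feature derivation performed in this step with the Spark DataFrame API. The column names, selection cut, and derived feature are illustrative assumptions, not the actual event selection used in the project.

```python
# Minimal sketch: event selection and feature engineering with the DataFrame API.
# Column names and cuts are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("feature-preparation").getOrCreate()

events = spark.read.parquet("events.parquet")

features = (events
            .filter(F.col("lepton_pt") > 23.0)                          # hypothetical selection cut
            .withColumn("ht_ratio", F.col("lepton_pt") / F.col("ht"))   # hypothetical derived feature
            .select("lepton_pt", "ht", "ht_ratio", "label"))

features.write.mode("overwrite").parquet("features.parquet")
```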
Parameter Tuning
- Description: In this step, we perform hyperparameter tuning to identify the best set of hyperparameters for each model architecture. This is achieved through a grid search, where different combinations of hyperparameters are tested and evaluated (a minimal sketch follows).
- Objective: Parameter tuning ensures that the models are optimized for performance and accuracy.
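As a simple illustration of the grid-search approach, the sketch below loops over a small grid of hyperparameters and keeps the best-scoring configuration. The model, grid values, and data are placeholders; the project's actual search space and tooling are documented in the Hyperparameter tuning folder.

```python
# Minimal sketch: grid search over hyperparameters for a small Keras model.
# The grid values, input dimension, and data are illustrative placeholders.
import itertools
import numpy as np
import tensorflow as tf

# Toy data standing in for the prepared training/validation datasets
x_train, y_train = np.random.rand(1000, 14), np.random.randint(0, 3, 1000)
x_val, y_val = np.random.rand(200, 14), np.random.randint(0, 3, 200)

grid = {"units": [20, 50], "learning_rate": [1e-3, 1e-2]}

best_acc, best_params = 0.0, None
for units, lr in itertools.product(grid["units"], grid["learning_rate"]):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(14,)),
        tf.keras.layers.Dense(units, activation="relu"),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(lr),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=5, batch_size=64, verbose=0)
    _, acc = model.evaluate(x_val, y_val, verbose=0)
    if acc > best_acc:
        best_acc, best_params = acc, {"units": units, "learning_rate": lr}

print("Best configuration:", best_params, "validation accuracy:", best_acc)
```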
Training
- Description: The best models identified during the parameter tuning phase are trained on the entire dataset. This step uses the selected hyperparameters and the processed data to train the models effectively (a minimal sketch follows).
- Objective: Training on the entire dataset enables the models to learn and make accurate predictions.
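To make the final step concrete, here is a minimal sketch of training a Keras classifier on the prepared dataset with hyperparameters chosen in the previous step. The architecture, dataset file, and parameter values are illustrative assumptions; the project's actual models are in the training notebooks and the Saved models folder.

```python
# Minimal sketch: train the selected model on the full prepared dataset.
# Dataset file, architecture, and hyperparameters are illustrative placeholders.
import pandas as pd
import tensorflow as tf

df = pd.read_parquet("features.parquet")                  # hypothetical prepared dataset
x = df.drop(columns=["label"]).values.astype("float32")
y = df["label"].values.astype("int64")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(x.shape[1],)),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),        # illustrative: three event classes
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])

history = model.fit(x, y, epochs=10, batch_size=128, validation_split=0.2)
model.save("hlf_classifier.keras")                         # hypothetical output name
```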
By following this data pipeline, we ensure a well-structured and efficient workflow for deep learning tasks. Each step builds upon the results of the previous one, ultimately leading to the development of high-performance machine learning models.
Figure: the machine learning data pipeline.
The training of DL models has yielded satisfactory results that align with the findings of the original research paper. The performance of the models can be evaluated through various metrics, including loss convergence, ROC curves, and AUC (Area Under the Curve) analysis.
The provided visualization demonstrates the convergence of the loss function during training and the corresponding ROC curves, which illustrate the trade-off between the true positive rate and the false positive rate for different classification thresholds. The AUC metric provides a quantitative measure of the model's performance, with higher AUC values indicating better classification accuracy.
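As an illustration of how such metrics can be computed, the following is a minimal sketch using scikit-learn, assuming the model's predicted class probabilities and the true labels are available as NumPy arrays (the variable names and toy values are placeholders).

```python
# Minimal sketch: ROC curve and AUC for one class of a multi-class classifier.
# y_true and y_scores are illustrative placeholders for the test labels and
# the model's predicted probabilities.
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 1, 2, 1, 0, 2, 1, 0])        # true class labels (toy values)
y_scores = np.random.rand(len(y_true), 3)          # predicted class probabilities
y_scores = y_scores / y_scores.sum(axis=1, keepdims=True)

# One-vs-rest ROC for class 1: treat class 1 as the positive class
fpr, tpr, thresholds = roc_curve(y_true == 1, y_scores[:, 1])
print("AUC (class 1 vs rest):", auc(fpr, tpr))
```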
By achieving results consistent with the original research paper, we validate the effectiveness of our DL models and the reliability of our implementation. These results contribute to advancing the field of high-energy physics and event classification at the LHC (Large Hadron Collider).
For more detailed insights into the experimental setup, methodology, and performance evaluation, please refer to the associated documentation and to the following articles and presentations:
- Article "Machine Learning Pipelines with Modern Big DataTools for High Energy Physics" Comput Softw Big Sci 4, 8 (2020), and arXiv.org
- Blog post "Machine Learning Pipelines for High Energy Physics Using Apache Spark with BigDL and Analytics Zoo"
- Blog post "Distributed Deep Learning for Physics with TensorFlow and Kubernetes"
- Poster at the CERN openlab technical workshop 2019
- Presentation at Spark Summit SF 2019
- Presentation at Spark Summit EU 2019
- Presentation at CERN EP-IT Data science seminar