Intel Labs' open-source data science framework for accelerating digital biology
We are in the epoch of digital biology, which is fueled by the convergence of three revolutions: 1) measurement of biological systems at high resolution, resulting in massive multi-modal, multi-scale, unstructured, distributed data; 2) novel data science (AI and data management) techniques applied to this data; and 3) widespread cloud use, enabling massive compute, public data repositories, and large collaborative projects and consortia. Digital biology will require computing and data management at unprecedented scale and speed. However, performance alone will not suffice if it significantly compromises the productivity of the biologists and data scientists who are at the forefront of this transformation.
With the goal of building a performant, cost-effective and productive platform, we are building the Open Omics Acceleration Framework: a one-click, containerized, customizable, open-source framework for accelerating digital biology research. It provides tools and pipelines for genomics, transcriptomics, proteomics, drug molecule search and de novo drug design. The framework has a modular design that reflects the different ways users may want to interact with it. As shown in the following block diagram, it consists of three layers:
- Pipeline layer: for users who are looking for a one-click solution to run standard pipelines. The pipelines can be accessed in the 'pipelines' subfolder, which provides instructions to build and run the Docker images. Currently, we support the following pipelines:
  - fq2sortedbam: Given gzipped fastq files of an individual, this workflow performs sequence mapping (BWA-MEM2) and sorting (SAMtools sort) to output the sorted BAM file.
  - DeepVariant-based germline pipeline for variant calling (fq2vcf): Given paired-end gzipped fastq files of an individual, this workflow performs sequence mapping (BWA-MEM2), sorting (SAMtools sort) and variant calling (Open Omics DeepVariant) to call the variants in the genome of the individual.
  - AlphaFold2-based protein folding: Given one or more protein sequences, this workflow performs preprocessing (database search and multiple sequence alignment using Open Omics HMMER and HH-suite) and structure prediction (Open Omics AlphaFold2) to output the structure(s) of the protein sequences. It supports both AlphaFold2 monomer and AlphaFold2 multimer.
  - Single cell RNASeq analysis: Given a cell by gene matrix, this scanpy-based workflow performs data preprocessing (filter, linear regression and normalization), dimensionality reduction (PCA) and clustering (Louvain/Leiden/kmeans) to group the cells into different cell types, and then visualizes those clusters (UMAP/t-SNE).
- Toolkit layer: for users who want to use individual tools or to create their own custom pipelines by combining various tools. The toolkit layer can be accessed in the 'applications' subfolder. For each tool, we provide instructions to build and run it. Currently, the supported tools cover: genomics (BWA-MEM, minimap2, bcftools, SAMtools, DeepVariant), transcriptomics (STAR aligner), protein folding (AlphaFold2, ESMFold), protein structure and sequence design (RFDiffusion, ProteinMPNN, LM-design, ESM2-inv, ProtGPT2, ESM2 embeddings), molecular docking (AutoDock, AutoDock-Vina), and de novo molecule generation (MoFlow).
- Building blocks layer: for tool developers, this layer consists of key building blocks -- biology specific and generic AI algorithms and data structures -- that can replace ones used in existing tools to accelerate them or can be used as ingredients to build new efficient tools. This layer can be accessed in the 'lib' subfolder.
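
As an illustration of what the fq2sortedbam pipeline above chains together, the mapping and sorting stages can be sketched with the stock `bwa-mem2` and `samtools` command lines. This is a minimal sketch, not the pipeline's actual script: the file names, the thread count, the paired-end input, the `build_cmd` helper and the `DRY_RUN` switch are all assumptions made for the example. The supported way to run the pipeline is described in the README under the 'pipelines' subfolder.

```sh
#!/usr/bin/env bash
# Sketch of the two stages behind fq2sortedbam: map with BWA-MEM2, then sort with SAMtools.
# All paths and the thread count below are hypothetical placeholders.
set -euo pipefail

REF="${REF:-ref.fasta}"              # reference genome (hypothetical path)
R1="${R1:-sample_R1.fastq.gz}"       # gzipped reads, mate 1 (hypothetical; paired-end assumed)
R2="${R2:-sample_R2.fastq.gz}"       # gzipped reads, mate 2 (hypothetical)
OUT="${OUT:-sample.sorted.bam}"      # sorted BAM to write
THREADS="${THREADS:-8}"              # threads for both tools

# Hypothetical helper: print the mapping + sorting command line for inspection.
build_cmd() {
  printf 'bwa-mem2 mem -t %s %s %s %s | samtools sort -@ %s -o %s -' \
    "$THREADS" "$REF" "$R1" "$R2" "$THREADS" "$OUT"
}

if [ "${DRY_RUN:-1}" = "1" ]; then
  # Default: only show what would be run.
  build_cmd
  echo
else
  bwa-mem2 index "$REF"              # one-time index build for the reference
  bwa-mem2 mem -t "$THREADS" "$REF" "$R1" "$R2" \
    | samtools sort -@ "$THREADS" -o "$OUT" -
fi
```

By default the sketch only prints the command. Setting `DRY_RUN=0` and overriding `REF`, `R1`, `R2` and `OUT` in the environment would execute it, provided `bwa-mem2` and `samtools` are on the `PATH`.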
With the goal of providing a one-stop platform, this framework brings the following digital biology repositories of ours under one umbrella:
- Architecture efficient versions of several popular applications as part of our toolkit layer (under 'applications' folder)
Original application | Our architecture-efficient version |
---|---|
Short read sequence mapping tool, BWA-MEM | BWA-MEM2 |
Long read sequence mapping tool, minimap2 | mm2-fast |
Deep learning based variant calling tool, DeepVariant | Open-Omics-DeepVariant |
Deep learning based tool for protein structure prediction, AlphaFold2 | Open-Omics-AlphaFold |
Tool for biological sequence analysis using profile HMMs, HMMER | IntelLabs HMMER |
Tool for HMM based sensitive protein sequence searching, HH-suite | IntelLabs HH-suite |
- Trans-Omics Acceleration Library: As part of our building blocks layer (under 'lib' folder), this is a library containing architecture-efficient versions of key algorithms and data structures used for Omics analysis.
In addition, we also use several existing AI libraries: oneDNN, oneDAL, oneCCL, Katana Graph, LIBXSMM.
# Download release
wget https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/releases/download/3.0/Source_code_with_submodules.tar.gz
tar -xzf Source_code_with_submodules.tar.gz
# Clone master
git clone --recursive https://github.com/IntelLabs/Open-Omics-Acceleration-Framework
# Go to the pipelines directory
cd pipelines
# For running a specific pipeline, follow the instructions in the respective pipeline's README file.
# Go to the directory with toolkit
cd applications
# Go to the directory with biology building blocks to access Trans-Omics Acceleration Library
cd lib/tal
- Intel Open Omics Acceleration Framework on AWS: fast, cost-efficient, and seamless. AWS HPC blog, April, 2024.
- Intel Xeon is all you need for AI inference: Performance Leadership on Real World Applications. Blog under Intel Communities/Blogs/Tech Innovation/Artificial Intelligence (AI); July, 2023.
- Intel and Mila Join Forces for Responsible AI. Intel newsroom, September, 2022.
- Accelerating Genomics Pipelines Using Intel’s Open Omics Acceleration Framework on AWS. AWS HPC blog, Aug, 2022.
- Intel Labs Accelerates Single-cell RNA-Seq Analysis. Blog under Intel Communities/Blogs/Tech Innovation/Artificial Intelligence (AI); June, 2022.
- Intel and MILA Join Forces to Put AI to Work in Medical Research. HPCwire, April, 2021.
- GenDP: A Framework of Dynamic Programming Acceleration for Genome Sequencing Analysis. Yufeng Gu, Arun Subramaniyan, Tim Dunn, Alireza Khadem, Kuan-Yu Chen, Somnath Paul, Md Vasimuddin, Sanchit Misra, David Blaauw, Satish Narayanasamy, Reetuparna Das. Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA); June, 2023. https://dl.acm.org/doi/abs/10.1145/3579371.3589060.
- Accelerating Barnes-Hut t-SNE Algorithm by Efficient Parallelization on Multi-Core CPUs. Narendra Chaudhary, Alexander Pivovar, Pavel Yakovlev, Andrey Gorshkov and Sanchit Misra. arXiv preprint arXiv:2212.11506; Dec, 2022; doi: https://doi.org/10.48550/arXiv.2212.11506.
- Accelerating Deep Learning based Identification of Chromatin Accessibility from noisy ATAC-seq Data. Narendra Chaudhary, Sanchit Misra, Dhiraj Kalamkar, Alexander Heinecke, Evangelos Georganas, Barukh Ziv, Menachem Adelman and Bharat Kaul. 21st IEEE International Workshop on High Performance Computational Biology (HiCOMB) May 30, 2022. https://ieeexplore.ieee.org/abstract/document/9835674
- Accelerating minimap2 for long-read sequencing applications on modern CPUs. Saurabh Kalikar, Chirag Jain, Md Vasimuddin, Sanchit Misra. Nature Computational Science 2 (2), 78-83, Feb, 2022. https://rdcu.be/cHVAK.
- GenomicsBench: A Benchmark Suite for Genomics. Arun Subramaniyan, Yufeng Gu, Timothy Dunn, Somnath Paul, Md Vasimuddin, Sanchit Misra, David Blaauw, Satish Narayanasamy, Reetuparna Das. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2021. https://ieeexplore.ieee.org/document/9408208.
- LISA: Learned indexes for sequence analysis. Darryl Ho, Saurabh Kalikar, Sanchit Misra, Jialin Ding, Vasimuddin Md, Nesime Tatbul, Heng Li, Tim Kraska. bioRxiv 2020.12.22.423964; doi: https://doi.org/10.1101/2020.12.22.423964.
- Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems. Vasimuddin Md, Sanchit Misra, Heng Li, Srinivas Aluru. IEEE Parallel and Distributed Processing Symposium (IPDPS), 2019. https://ieeexplore.ieee.org/document/8820962.
- Performance extraction and suitability analysis of multi- and many-core architectures for next generation sequencing secondary analysis. Sanchit Misra, Tony Pan, Kanak Mahadik, George Powley, Priya N Vaidya, Md Vasimuddin, Srinivas Aluru. International Conference on Parallel Architectures and Compilation Techniques (PACT), 2018. https://dl.acm.org/doi/abs/10.1145/3243176.3243197.
- Identification of Significant Computational Building Blocks through Comprehensive Deep Dive of NGS Secondary Analysis Methods. Md Vasimuddin, Sanchit Misra, Srinivas Aluru. BioRxiv 2018 301903. https://www.biorxiv.org/content/10.1101/301903v3.abstract.
Please note that Intel does not collect or process any personal data of end users for the Open Omics Acceleration Framework. All inputs and outputs are limited to the end user's local system, and Intel does not have access to the data. For more information, see Intel's Global Privacy Notice: https://www.intel.com/content/www/us/en/privacy/intel-privacy-notice.html.