IntelLabs
diff --git a/.gitmodules b/.gitmodules
+3
diff --git a/README.md b/README.md
+7-7
diff --git a/applications/AutoDock-Vina/Dockerfile b/applications/AutoDock-Vina/Dockerfile
+31
diff --git a/applications/AutoDock-Vina/README.md b/applications/AutoDock-Vina/README.md
+79
diff --git a/applications/AutoDock-Vina/data_download_script.sh b/applications/AutoDock-Vina/data_download_script.sh
+27
diff --git a/applications/Autodock/Dockerfile b/applications/Autodock/Dockerfile
+36
@@ -22,3 +22,6 @@
 [submodule "applications/bcftools"]
 	path = applications/bcftools
 	url = https://github.com/samtools/bcftools.git
+[submodule "applications/STAR"]
+	path = applications/STAR
+	url = https://github.com/alexdobin/STAR.git
@@ -5,17 +5,17 @@ Intel lab's open sourced data science framework for accelerating digital biology
 # Introduction
 We are in the epoch of digital biology, that is fueled by the convergence of three revolutions: 1) Measurement of biological systems at high resolution resulting in massive multi-modal, multi-scale, unstructured, distributed data, 2) Novel data science (AI and data management) techniques on this data, and 3) Wide-spread cloud use enabling massive compute and public data repositories, large collaborative projects and consortia. It will require computing and data management at unprecedented scale and speed. However, performance alone would not suffice if it significantly compromised the productivity of biologists and data scientists who are at the forefront of this transformation. 
 
-With a goal to build a performant, cost effective and productive platform, we are building **Open Omics acceleration framework**: a one-click, containerized, customizable, open-sourced framework for accelerating digital biology research. The framework is being built with a modular design that keeps in mind the different ways the users would want to interact with it. As shown in the following block diagram, it consists of three layers:
-* **Pipeline layer**: for users who are looking for one click solution to run standard pipelines. Currently, we support the following pipelines:
+With a goal to build a performant, cost effective and productive platform, we are building **Open Omics acceleration framework**: a one-click, containerized, customizable, open-sourced framework for accelerating digital biology research. It provides tools and pipelines in the fields of genomics, transcriptomics, proteomics, drug molecule search and de novo drug design. The framework is being built with a modular design that keeps in mind the different ways the users would want to interact with it. As shown in the following block diagram, it consists of three layers:
+* **Pipeline layer**: for users who are looking for one click solution to run standard pipelines. The pipelines are in the 'pipelines' subfolder, which provides instructions to build and run the Docker images. Currently, we support the following pipelines:
   * [**fq2sortedbam**](https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/fq2sortedbam): Given gzipped fastq files of an individual, this workflow performs sequence mapping ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2)) and sorting ([SAMtools](https://github.com/samtools/samtools) sort) to output the sorted BAM file.
   * [**DeepVariant based germline pipeline for variant calling (fq2vcf)**](https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/deepvariant-based-germline-variant-calling-fq2vcf): Given paired end gzipped fastq files of an individual, this workflow performs sequence mapping ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2)), sorting ([SAMtools](https://github.com/samtools/samtools) sort) and variant calling ([Open Omics DeepVariant](https://github.com/IntelLabs/open-omics-deepvariant)) to call the variants in the genome of the individual.
-  * [**AlphaFold2-based protein folding**](https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/alphafold2-based-protein-folding): Given one or more protein sequences, this workflow performs preprocessing (database search and multiple sequence alignment using Open Omics [HMMER](https://github.com/IntelLabs/hmmer) and [HH-suite](https://github.com/IntelLabs/hh-suite)) and structure prediction ([Open Omics AlphaFold2](https://github.com/IntelLabs/open-omics-alphafold)) to output the structure(s) of the protein sequences.
+  * [**AlphaFold2-based protein folding**](https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/alphafold2-based-protein-folding): Given one or more protein sequences, this workflow performs preprocessing (database search and multiple sequence alignment using Open Omics [HMMER](https://github.com/IntelLabs/hmmer) and [HH-suite](https://github.com/IntelLabs/hh-suite)) and structure prediction ([Open Omics AlphaFold2](https://github.com/IntelLabs/open-omics-alphafold)) to output the structure(s) of the protein sequences. It has support for both AlphaFold2 monomer and AlphaFold2 multimer.
   * [**Single cell RNASeq analysis**](https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/single-cell-RNA-seq-analysis): Given a cell by gene matrix, this [scanpy](https://github.com/scverse/scanpy) based workflow performs data preprocessing (filter, linear regression and normalization), dimensionality reduction (PCA), clustering (Louvain/Leiden/kmeans) to cluster the cells into different cell types and visualize those clusters (UMAP/t-SNE).
-* **Toolkit (applications) layer**: for users who want to use individual tools or to create their own custom pipelines by combining various tools.
-* **Building blocks (lib) layer**: for tool developers, this layer consists of key building blocks -- biology specific and generic AI algorithms and data structures -- that can replace ones used in existing tools to accelerate them or can be used as ingredients to build new efficient tools.
+* **Toolkit layer**: for users who want to use individual tools or to create their own custom pipelines by combining various tools. The toolkit layer can be accessed in the 'applications' subfolder. For each tool, we provide instructions to build and run it. Currently, the supported tools cover: genomics (BWA-MEM, minimap2, bcftools, SAMtools, DeepVariant), transcriptomics (STAR aligner), protein folding (AlphaFold2, ESMFold), protein structure and sequence design (RFDiffusion, ProteinMPNN, LM-design, ESM2-inv, ProtGPT2, ESM2 embeddings), molecular docking (AutoDock, AutoDock-Vina), and de novo molecule generation (MoFlow).
+* **Building blocks layer**: for tool developers, this layer consists of key building blocks -- biology specific and generic AI algorithms and data structures -- that can replace ones used in existing tools to accelerate them or can be used as ingredients to build new efficient tools. This layer can be accessed in the 'lib' subfolder.
 
 <p align="center">
-<img src="https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/blob/main/images/Open-Omics-Acceleration-Framework-v2.0.JPG" height="300"/a></br>
+<img src="https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/blob/main/images/Open-Omics-Acceleration-Framework-v3.0.jpg" height="300"/><br/>
 </p> 
 
 With a goal of providing a one-stop platform, this framework brings our following repositories for digital biology under one umbrella:
@@ -37,7 +37,7 @@ In addition, we also use several existing AI libraries: oneDNN, oneDAL, oneCCL,
 # Getting Started
 ```sh
 # Download release
-wget https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/releases/download/2.1/Source_code_with_submodules.tar.gz 
+wget https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/releases/download/3.0/Source_code_with_submodules.tar.gz 
 tar -xzf Source_code_with_submodules.tar.gz
 
 # Clone master
 
@@ -0,0 +1,31 @@
+FROM condaforge/miniforge3:4.10.2-0
+ENV DEBIAN_FRONTEND=noninteractive
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    build-essential \
+    libboost-all-dev \
+    swig \
+    vim \
+    gcc-8 \
+    g++-8 \
+    numactl \
+    time && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/*
+ENV CC=gcc-8
+ENV CXX=g++-8
+WORKDIR /opt
+RUN git clone https://github.com/ccsb-scripps/AutoDock-Vina.git
+WORKDIR /opt/AutoDock-Vina
+RUN git checkout v1.2.2
+WORKDIR /opt/AutoDock-Vina/build/linux/release
+RUN make -j$(nproc)
+ENV SERVICE_NAME="autodock-vina-service"
+RUN groupadd --gid 1001 $SERVICE_NAME && \
+    useradd -m -g $SERVICE_NAME --shell /bin/false --uid 1001 $SERVICE_NAME
+RUN chown -R $SERVICE_NAME:$SERVICE_NAME /opt
+USER $SERVICE_NAME
+ENV PATH="/opt/AutoDock-Vina/build/linux/release:$PATH"
+WORKDIR /input
+HEALTHCHECK NONE
+CMD ["vina","--help"]
+
@@ -0,0 +1,79 @@
+## Open-Omics-Autodock-Vina
+Open-Omics-Autodock-Vina is a fast, efficient molecular docking software used to predict ligand-protein binding poses and affinities. It features a refined scoring function, parallel execution on multicore CPUs and user-friendly configuration.
+
+## Docker Setup Instructions
+
+
+### 1. Build the Docker Image 
+To build the Docker image with the tag `docker_vina`, use the following commands based on your machine's proxy requirements:
+* For a machine without a proxy:
+```bash
+docker build -t docker_vina .
+```
+* For a machine behind a proxy:
+```bash
+docker build --build-arg http_proxy=<http_proxy> --build-arg https_proxy=<https_proxy> --build-arg no_proxy=<no_proxy_ip> -t docker_vina .
+```
+
+
+### 2. Choose and Download Protein Complex Data
+Select any protein complex from the dataset of **140** protein-ligand complexes published on [Zenodo](https://zenodo.org/records/4031961), which can be downloaded directly as [data.zip](https://zenodo.org/records/4031961/files/data.zip?download=1). This guide uses the **5wlo** protein as an example.
+
+1) Run the commands below to make the data download script executable, download the complete dataset, and extract the data for `5wlo`:
+
+```bash
+chmod +x data_download_script.sh
+bash data_download_script.sh 5wlo
+```
+**Note: You can replace 5wlo with any other complex name from the complete dataset, which is extracted to the `data_original/data` directory.**
+
+2) Create an output directory to store results specific to `5wlo`:
+```bash
+mkdir -p 5wlo_output
+```
+
+3) Set the environment variables for the `5wlo` protein as follows:
+```bash
+export INPUT_VINA=$PWD/5wlo
+export OUTPUT_VINA=$PWD/5wlo_output
+```
+
+4) Grant write permission on the output folder so that the container's non-root user can write results to it:
+```bash
+sudo chmod -R a+w $OUTPUT_VINA
+```
+
+### 3. Run the Docker Container
+Verify that the Docker image was built successfully by listing Docker images:
+```bash
+docker images | grep docker_vina
+```
+If the image is listed, run AutoDock Vina with the following command:
+```bash
+docker run -it -v $INPUT_VINA:/input -v $OUTPUT_VINA:/output docker_vina:latest vina --receptor protein.pdbqt --ligand rand-1.pdbqt --out /output/rand-1_out.pdbqt --center_x 16.459 --center_y -19.946 --center_z -5.850 --size_x 18 --size_y 18 --size_z 18 --seed 1234 --exhaustiveness 64
+```
+This command will process your receptor and ligand files and place the results in the specified output directory.
+### 4. Expected Output
+After running the above command, you should find the output file (`rand-1_out.pdbqt`) in the output directory, such as `5wlo_output` for this example.
+
+---
+The original README content of AutoDock-Vina follows:
+
+## AutoDock Vina: Docking and virtual screening program
+
+**AutoDock Vina** is one of the **fastest** and **most widely used** **open-source** docking engines. It is a turnkey computational docking program that is based on a simple scoring function and rapid gradient-optimization conformational search. It was originally designed and implemented by Dr. Oleg Trott in the Molecular Graphics Lab, and it is now being maintained and developed by the Forli Lab at The Scripps Research Institute.
+
+* AutoDock4.2 and Vina scoring functions
+* Support of simultaneous docking of multiple ligands and batch mode for virtual screening
+* Support of macrocycle molecules
+* Hydrated docking protocol
+* Can write and load external AutoDock maps
+* Python bindings for Python 3
+
+## Documentation
+
+The installation instructions, documentation and tutorials can be found on [readthedocs.org](https://autodock-vina.readthedocs.io/en/latest/).
+
+## Citations
+* [J. Eberhardt, D. Santos-Martins, A. F. Tillack, and S. Forli. (2021). AutoDock Vina 1.2.0: New Docking Methods, Expanded Force Field, and Python Bindings. Journal of Chemical Information and Modeling.](https://pubs.acs.org/doi/10.1021/acs.jcim.1c00203)
+* [O. Trott and A. J. Olson. (2010). AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of computational chemistry, 31(2), 455-461.](https://onlinelibrary.wiley.com/doi/10.1002/jcc.21334)
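The README above docks a single ligand (`rand-1.pdbqt`) against `protein.pdbqt`. A minimal sketch of how the same command might be repeated for every ligand in the input folder, and how the top score could be read back from a result file, is given below. The function names (`vina_cmd`, `list_ligands`, `best_affinity`) are hypothetical, and the sketch assumes that every `.pdbqt` file other than `protein.pdbqt` is a ligand and that the output file carries `REMARK VINA RESULT:` lines with the affinity in the first numeric column. The box center and size are copied from the 5wlo example and would need to change for other complexes.

```bash
#!/usr/bin/env bash
# Sketch only: dry-run helper around the README's docker command.
# Assumption: every *.pdbqt file in the input folder except protein.pdbqt is a ligand.

vina_cmd() {
    # $1 = ligand file name, relative to /input inside the container.
    # Prints the docker command instead of running it (dry run).
    local ligand="$1"
    local out="${ligand%.pdbqt}_out.pdbqt"
    echo "docker run -v $INPUT_VINA:/input -v $OUTPUT_VINA:/output docker_vina:latest" \
         "vina --receptor protein.pdbqt --ligand $ligand --out /output/$out" \
         "--center_x 16.459 --center_y -19.946 --center_z -5.850" \
         "--size_x 18 --size_y 18 --size_z 18 --seed 1234 --exhaustiveness 64"
}

list_ligands() {
    # $1 = host input directory; prints one ligand file name per line.
    local f
    for f in "$1"/*.pdbqt; do
        [ -e "$f" ] || continue                      # no match: glob stays literal
        [ "$(basename "$f")" = "protein.pdbqt" ] && continue
        basename "$f"
    done
}

best_affinity() {
    # $1 = docked output file; prints the affinity of the first reported pose.
    # Assumption: poses are reported as "REMARK VINA RESULT: <affinity> ..." lines.
    awk '/^REMARK VINA RESULT:/ { print $4; exit }' "$1"
}
```

With `INPUT_VINA` and `OUTPUT_VINA` exported as in step 2, `list_ligands "$INPUT_VINA" | while read -r l; do vina_cmd "$l"; done` prints one command per ligand so they can be reviewed before being run.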
@@ -0,0 +1,27 @@
+url="https://zenodo.org/records/4031961/files/data.zip?download=1"
+download_dir="./data_original"
+target_folder="${1:?Usage: $0 <complex_name>}"  # abort if no complex name is given
+if [ ! -d "$download_dir/data" ]; then
+    echo "Downloading data.zip..."
+    mkdir -p "$download_dir"
+    wget -O "$download_dir/data.zip" "$url"
+
+    echo "Unzipping data.zip..."
+    unzip "$download_dir/data.zip" -d "$download_dir"
+    rm -f "$download_dir/data.zip"
+
+    echo "Data downloaded and extracted to $download_dir/data"
+else
+    echo "Data already exists in $download_dir/data. Skipping download and extraction."
+fi
+if [ -d "$target_folder" ]; then
+    echo "The folder '$target_folder' already exists in the current directory. Skipping copy."
+elif [ -d "$download_dir/data/$target_folder" ]; then
+    cp -r "$download_dir/data/$target_folder" ./
+    echo "$target_folder folder successfully copied to the current directory."
+else
+    # Fail here so the success message below is never printed for a missing complex.
+    echo "$target_folder folder not found inside '$download_dir/data'." >&2
+    exit 1
+fi
+echo "'$target_folder' folder is now available in the current directory."
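Once the script has copied a complex folder, it can help to confirm that the folder holds the files the README's docking command reads before starting a container. The sketch below is one way to do that. The function name `check_complex` is hypothetical, and the file names `protein.pdbqt` and `rand-1.pdbqt` are taken from the 5wlo example, so other complexes may need a different list.

```bash
#!/usr/bin/env bash
# Sketch only: verify that an extracted complex folder has the expected inputs.
# Assumption: receptor is protein.pdbqt and ligand is rand-1.pdbqt, as in the 5wlo example.

check_complex() {
    # $1 = complex folder (e.g. ./5wlo); returns non-zero if a required file is missing.
    local dir="$1" f missing=0
    for f in protein.pdbqt rand-1.pdbqt; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A call such as `check_complex ./5wlo && echo "inputs present"` would then gate the `docker run` step.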
@@ -0,0 +1,36 @@
+FROM condaforge/miniforge3:4.10.2-0
+ENV DEBIAN_FRONTEND=noninteractive
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    vim \
+    git \
+    build-essential \
+    ocl-icd-opencl-dev \
+    clinfo && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/*
+RUN conda install -c conda-forge \
+    python=3.10 \
+    requests=2.28.2 \
+    mkl=2023.1 \
+    dpcpp_linux-64=2023.1 \
+    dpcpp-cpp-rt=2023.1 \
+    mkl-devel=2023.1 && \
+    conda clean --all -f -y
+ENV LD_LIBRARY_PATH="/opt/conda/lib:${LD_LIBRARY_PATH}"
+WORKDIR /opt
+ENV SERVICE_NAME="autodock-service"
+RUN groupadd --gid 1001 $SERVICE_NAME && \
+    useradd -m -g $SERVICE_NAME --shell /bin/false --uid 1001 $SERVICE_NAME && \
+    mkdir -p /opt/AutoDock && \
+    chown -R $SERVICE_NAME:$SERVICE_NAME /opt/AutoDock
+USER $SERVICE_NAME
+WORKDIR /opt/AutoDock
+RUN git clone https://github.com/emascarenhas/AutoDock-GPU.git . && \
+    git checkout v1.4
+RUN make DEVICE=CPU NUMWI=64 && \
+    rm -rf .git build_temp
+ENV PATH="/opt/AutoDock/bin:${PATH}"
+HEALTHCHECK NONE
+WORKDIR /input
+CMD ["autodock_cpu_64wi","--help"]
+