
DeepDriveMD: Coupling streaming AI and HPC ensembles to achieve 100-1000× faster biomolecular simulations

DeepDriveMD implemented using Colmena.

This implementation of DeepDriveMD enables ML/AI-coupled simulations using three primary components:

  • Simulation: Simulations are used to explore possible trajectories of a protein or other biomolecular system.
  • Training: Aggregated trajectories are used to train one or more ML models.
  • Inference: Trained ML models are used to identify conformations for subsequent iterations of simulations.

A Thinker process orchestrates these components to advance the workflow toward an optimization objective.


Table of Contents

  1. Installation
  2. Usage
  3. Contributing
  4. License
  5. Citations


Installation

Create a conda environment:

conda create -n deepdrivemd python=3.9 -y
conda activate deepdrivemd

To install OpenMM for simulations:

conda install -c conda-forge gcc=12.1.0 -y
conda install -c conda-forge openmm -y

To install deepdrivemd:

git clone
cd deepdrivemd
make install
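
After installation, it can help to confirm that the packages resolve in the active conda environment. The following is a minimal sketch using only the standard library. The helper name `check_packages` is hypothetical and not part of DeepDriveMD, and the module names `openmm` and `deepdrivemd` are assumptions based on the install commands above.

```python
from importlib.util import find_spec


def check_packages(names):
    """Return a mapping of top-level module name -> whether it can be found."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Module names are assumptions based on the install commands above.
    for name, found in check_packages(["openmm", "deepdrivemd"]).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Using `find_spec` only locates the modules without importing them, so the check stays fast and does not initialize any GPU libraries.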


Usage

The workflow can be tested on a workstation (a system with a few GPUs) via:

python -m deepdrivemd.workflows.openmm_cvae -c tests/apps-enabled-workstation/test.yaml

This will generate an output directory for the run with logs, results, and task-specific output folders.

Each test will write a timestamped experiment output directory to the runs/ directory.

Inside the output directory, you will find:

$ ls runs/experiment-170323-091525/
inference  params.yaml  result  run-info  runtime.log  simulation  train
  • params.yaml: the full configuration file (default parameters included)
  • runtime.log: the workflow log
  • result: a directory containing the JSON files simulation.json, train.json, and inference.json, which log task results including success or failure, potential error messages, and runtime statistics. This can be helpful for debugging application-level failures.
  • simulation, train, inference: output directories, each containing a subdirectory run-<uuid> for every submitted task. This is where your applications write the output files of your simulations, preprocessed data, model weights, etc. (it corresponds to the application workdir).
  • run-info: Parsl logs
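
The layout above lends itself to a small inspection script. The sketch below is only illustrative: the helper names are hypothetical, and it assumes that each file in result/ holds one JSON object per line with a `success` field, which should be checked against your own output before relying on it.

```python
import json
from pathlib import Path


def latest_experiment(runs_dir):
    """Return the most recently modified experiment-* directory, or None."""
    candidates = [p for p in Path(runs_dir).glob("experiment-*") if p.is_dir()]
    # Sort by modification time: the timestamp embedded in the directory
    # name may not sort chronologically as a plain string.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def count_outcomes(result_file, key="success"):
    """Count records by the value of `key`, assuming one JSON object per line."""
    counts = {}
    with open(result_file) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            value = json.loads(line).get(key)
            counts[value] = counts.get(value, 0) + 1
    return counts


if __name__ == "__main__":
    experiment = latest_experiment("runs")
    if experiment is not None:
        for task in ("simulation", "train", "inference"):
            path = experiment / "result" / f"{task}.json"
            if path.exists():
                print(task, count_outcomes(path))
```

A quick tally of successes and failures per task type is often enough to decide whether to look at runtime.log or at an individual run-<uuid> directory next.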

For example, a simulation run directory may look like:

$ ls runs/experiment-170323-091525/simulation/run-08843adb-65e1-47f0-b0f8-34821aa45923
1FME-unfolded.pdb  contact_map.npy  input.yaml  output.yaml  rmsd.npy  sim.dcd  sim.log
  • 1FME-unfolded.pdb: the PDB file used to start the simulation
  • contact_map.npy, rmsd.npy: the preprocessed data files which will be input into the train and inference tasks
  • input.yaml, output.yaml: these log the task function input and return values; they are helpful for debugging but are not strictly necessary
  • sim.dcd: the simulation trajectory file containing all the coordinate frames
  • sim.log: a simulation log detailing the energy, steps taken, ns/day, etc
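
When a train or inference task fails because an input file is missing, scanning the simulation run directories for incomplete outputs can narrow things down. This is a minimal sketch, not a DeepDriveMD utility: the function name and the `EXPECTED_FILES` tuple are hypothetical, with file names taken from the example listing above.

```python
from pathlib import Path

# File names taken from the example listing; the input PDB is left out
# because its name depends on the system being simulated.
EXPECTED_FILES = (
    "contact_map.npy",
    "rmsd.npy",
    "input.yaml",
    "output.yaml",
    "sim.dcd",
    "sim.log",
)


def missing_outputs(simulation_dir, expected=EXPECTED_FILES):
    """Map each run-* directory name to the expected files it lacks."""
    report = {}
    for run_dir in sorted(Path(simulation_dir).glob("run-*")):
        if not run_dir.is_dir():
            continue
        missing = [name for name in expected if not (run_dir / name).exists()]
        if missing:
            report[run_dir.name] = missing
    return report


if __name__ == "__main__":
    # Path reuses the experiment name from the example above.
    report = missing_outputs("runs/experiment-170323-091525/simulation")
    for run_name, missing in report.items():
        print(f"{run_name}: missing {', '.join(missing)}")
```

Runs that are still in progress will also appear in the report, so an entry here is a hint rather than proof of failure.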

By default the runs/ directory is ignored by git.

Production runs can be configured and run analogously. See examples/bba-folding-workstation/ for a detailed example of folding the 1FME protein. The YAML files document the configuration settings and explain the use case.

Software Interface

Implement a DeepDriveMD workflow with custom MD simulation engines and AI training/inference methods by inheriting from the DeepDriveMDWorkflow interface. This workflow implements the examples/bba-folding-workstation/ example:

import time
from pathlib import Path
from queue import Queue
from typing import Any

from deepdrivemd.api import DeepDriveMDWorkflow

# The task input/output types used below (CVAETrainInput, CVAETrainOutput,
# CVAEInferenceInput, CVAEInferenceOutput, MDSimulationInput,
# MDSimulationOutput) are defined by the applications; their imports are
# omitted here.


class DeepDriveMD_OpenMM_CVAE(DeepDriveMDWorkflow):
    def __init__(
        self, simulations_per_train: int, simulations_per_inference: int, **kwargs: Any
    ) -> None:
        # Forward the remaining settings to the base workflow
        super().__init__(**kwargs)
        self.simulations_per_train = simulations_per_train
        self.simulations_per_inference = simulations_per_inference

        # Make sure at least one training task has completed
        # before running inference
        self.model_weights_available: bool = False

        # For batching training/inference inputs
        self.train_input = CVAETrainInput(contact_map_paths=[], rmsd_paths=[])
        self.inference_input = CVAEInferenceInput(
            contact_map_paths=[], rmsd_paths=[], model_weight_path=Path()
        )

        # Communicate results between agents
        self.simulation_input_queue: Queue[MDSimulationInput] = Queue()

    def simulate(self) -> None:
        """Submit either a new outlier to simulate, or a starting conformer."""
        with self.simulation_govenor:
            if not self.simulation_input_queue.empty():
                # Prefer outliers selected by the latest inference task
                inputs = self.simulation_input_queue.get()
            else:
                # Otherwise fall back to the next starting conformer
                inputs = MDSimulationInput(sim_dir=next(self.simulation_input_dirs))

        self.submit_task("simulation", inputs)

    def train(self) -> None:
        """Submit a new training task."""
        self.submit_task("train", self.train_input)

    def inference(self) -> None:
        """Submit a new inference task once model weights are available."""
        # Poll until the first training task has produced model weights
        while not self.model_weights_available:
            time.sleep(1)

        self.submit_task("inference", self.inference_input)

    def handle_simulation_output(self, output: MDSimulationOutput) -> None:
        """When a simulation finishes, decide to train a new model or infer outliers."""
        # Collect simulation results
        self.train_input.append(output.contact_map_path, output.rmsd_path)
        self.inference_input.append(output.contact_map_path, output.rmsd_path)

        # Signal train/inference tasks.
        # NOTE: run_training and run_inference are assumed here to be events
        # provided by the base class; check deepdrivemd.api for the exact names.
        num_sims = len(self.train_input)
        if num_sims % self.simulations_per_train == 0:
            self.run_training.set()

        if num_sims % self.simulations_per_inference == 0:
            self.run_inference.set()

    def handle_train_output(self, output: CVAETrainOutput) -> None:
        """When training finishes, update the model weights to use for inference."""
        self.inference_input.model_weight_path = output.model_weight_path
        self.model_weights_available = True

    def handle_inference_output(self, output: CVAEInferenceOutput) -> None:
        """When inference finishes, update the simulation queue with the latest outliers."""
        with self.simulation_govenor:
            self.simulation_input_queue.queue.clear()  # Remove old outliers
            for sim_dir, sim_frame in zip(output.sim_dirs, output.sim_frames):
                self.simulation_input_queue.put(
                    MDSimulationInput(sim_dir=sim_dir, sim_frame=sim_frame)
                )
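
The batching rule in handle_simulation_output (act once every N completed simulations) can be explored in isolation. The following toy sketch uses only the standard library; `BatchTrigger` and `count_triggers` are hypothetical names for illustration and are not part of the DeepDriveMD or Colmena APIs.

```python
from threading import Event


class BatchTrigger:
    """Toy stand-in for the counting logic; not part of DeepDriveMD."""

    def __init__(self, simulations_per_train, simulations_per_inference):
        self.simulations_per_train = simulations_per_train
        self.simulations_per_inference = simulations_per_inference
        self.num_sims = 0
        self.run_training = Event()
        self.run_inference = Event()

    def on_simulation_finished(self):
        """Record one finished simulation and set the events that are due."""
        self.num_sims += 1
        if self.num_sims % self.simulations_per_train == 0:
            self.run_training.set()
        if self.num_sims % self.simulations_per_inference == 0:
            self.run_inference.set()


def count_triggers(total_sims, per_train, per_inference):
    """Replay `total_sims` completions and count how often each task fires."""
    trigger = BatchTrigger(per_train, per_inference)
    trains = inferences = 0
    for _ in range(total_sims):
        trigger.on_simulation_finished()
        if trigger.run_training.is_set():
            trains += 1
            trigger.run_training.clear()  # a consumer would clear after submitting
        if trigger.run_inference.is_set():
            inferences += 1
            trigger.run_inference.clear()
    return trains, inferences


if __name__ == "__main__":
    # 8 simulations, train every 4, infer every 2
    print(count_triggers(8, 4, 2))  # → (2, 4)
```

Replaying a planned run this way gives a quick estimate of how many training and inference tasks a given pair of settings would submit for a fixed simulation budget.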


Contributing

Please report bugs, enhancement requests, or questions through the Issue Tracker.

If you are looking to contribute, please see


License

DeepDriveMD has an MIT license, as seen in the license file.


Citations

If you use DeepDriveMD in your research, please cite this paper:

  title={Coupling streaming ai and hpc ensembles to achieve 100--1000$\times$ faster biomolecular simulations},
  author={Brace, Alexander and Yakushin, Igor and Ma, Heng and Trifan, Anda and Munson, Todd and Foster, Ian and Ramanathan, Arvind and Lee, Hyungro and Turilli, Matteo and Jha, Shantenu},
  booktitle={2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)},

