Main repository for the predOC project

This study developed a machine learning model to predict early risk of high-grade serous ovarian cancer (HGSOC) using electronic health records (EHR) from the Andalusian Health Population Database, which covers over 15 million individuals. Employing an Explainable Boosting Machine (EBM) algorithm, the model was trained on 3,088 HGSOC patients and 114,942 matched controls, achieving a sensitivity of 0.65 and a specificity of 0.85 in identifying high-risk individuals. Key predictors included age, abdominal pain, musculoskeletal and heart conditions, endocrine disorders, and specific blood parameters. The model provided early risk assessment, identifying HGSOC risk before diagnosis for a significant number of cases, with a median lead time of 101 days (Q1: 34 days, Q3: 435.5 days) before diagnosis. This research suggests the model's potential as a cost-effective, scalable pre-screening tool for early HGSOC detection using routine EHR data, potentially leading to earlier interventions and improved patient outcomes. The study was funded by AstraZeneca and grants from the Spanish Ministry of Science and Innovation and Consejeria de Salud y Consumo, Junta de Andalucía.

Setup

To make it easy to switch between machines, the project reads its settings from an environment variable definition file named .env. Place this file in the root of the repository once it is cloned. An example file is provided in the assets folder.

Clone the repository

Clone the repository, preferably working via ssh.

git clone git@github.com:babelomics/predoc.git
cd predoc

Define user-specific environment variables

First, copy the example .env file.

cp ./assets/example.env .env

Assign the path where the conda environment will be created to the ENV_FOLDER variable. Let's assume the environments should be located in, for example, ~/conda_envs.

Additionally, assign the path where the data is located to the DATA_PATH variable, for example, ~/projects/bps_dumps/predoc. Therefore, the .env file would finally look something like this:

ENV_FOLDER=~/conda_envs
DATA_PATH=~/projects/bps_dumps/predoc

# -- Variables used in predoc/omop/
# Path to remote data
REMOTE_RAW_OMOP_DATA="/folder_1/.../folder_n/data/"
# User and machine name (IP or according to .ssh/config) to connect to
REMOTE_USER="username"
REMOTE_HOST="hostname"

Note that the path assigned to ENV_FOLDER (~/conda_envs in this example) must exist and be writable.
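The requirement above can be checked before running the installation. The following is a minimal sketch, not part of the repository: the helper names parse_env_file and check_env_folder are hypothetical, and the parser only handles simple KEY=value lines like those in the example file.

```python
import os
from pathlib import Path


def parse_env_file(path):
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes, as in REMOTE_USER="username".
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def check_env_folder(values):
    """Return the expanded ENV_FOLDER path if it exists and is writable."""
    folder = Path(values["ENV_FOLDER"]).expanduser()
    if not folder.is_dir():
        raise FileNotFoundError(f"ENV_FOLDER does not exist: {folder}")
    if not os.access(folder, os.W_OK):
        raise PermissionError(f"ENV_FOLDER is not writable: {folder}")
    return folder
```

Calling check_env_folder(parse_env_file(".env")) from the repository root would raise an error early if the folder is missing or read-only, rather than partway through the conda installation.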

Installation

We will use GNU make to simplify the task of installation, generation of binaries, documentation, etc. Therefore, it will be sufficient to execute the project's makefile:

make

The makefile uses the environment.yml file, located in the root of the repository, to install the conda environment. We also provide a lock file, env_linux64_cuda.txt, so that a CUDA-capable machine can reproduce the exact conda environment in which our model was trained.

Pipeline

This repository uses DVC to manage a pipeline that will retrieve and clean the data before training the model.

Pipeline Overview

Stages short description

retrieve_raw_omop_data: Retrieves raw data from a remote location using rsync.
- Requires the following environment variables:
  - DATA_PATH: Local folder in which to place the retrieved OMOP tables. See Data.
  - REMOTE_RAW_OMOP_DATA: Remote folder where the OMOP tables are located. See Data.
  - REMOTE_USER: Username in the remote machine.
  - REMOTE_HOST: Remote machine address.
- Output:
  - DATA_PATH/omop/raw/
clean_raw_data: Removes patients from the data if they have certain conditions.
- Input:
  - DATA_PATH/omop/raw
  - Parameter ban_codes in params.yaml.
- Output:
  - DATA_PATH/omop/rare/
adapt_omop: Adapts OMOP tables data before training the model.
- Input:
  - DATA_PATH/omop/rare
  - Parameter target_diagnosis_concept_id in params.yaml.
- Output:
  - DATA_PATH/omop/done/
train_OMOP: Trains ML model.
- Input:
  - DATA_PATH/omop/done
- Output:
  - DATA_PATH/omop/train/
simulate_OMOP: Predicts outcomes.
- Input:
  - DATA_PATH/omop/done
- Output:
  - DATA_PATH/omop/simulate/
validate_OMOP: Generates relevant metrics.
- Input:
  - DATA_PATH/omop/train
  - DATA_PATH/omop/simulate
- Output:
  - DATA_PATH/omop/validate/
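To illustrate how the retrieve_raw_omop_data stage could combine the four environment variables listed above, here is a minimal sketch that assembles an rsync call. It is an assumption about the general shape of the command, not the repository's actual implementation: the function name build_rsync_command is hypothetical, and the -a (archive) and -z (compress) flags are one plausible choice.

```python
import os


def build_rsync_command(env):
    """Build an rsync call copying the remote OMOP folder into DATA_PATH/omop/raw/."""
    # Remote source in the usual user@host:path form.
    source = f"{env['REMOTE_USER']}@{env['REMOTE_HOST']}:{env['REMOTE_RAW_OMOP_DATA']}"
    # Local destination, matching the stage output listed above.
    destination = os.path.join(os.path.expanduser(env["DATA_PATH"]), "omop", "raw") + "/"
    # Assumed flags: -a preserves file attributes, -z compresses during transfer.
    return ["rsync", "-az", source, destination]
```

The returned list can be passed to subprocess.run(command, check=True). Keeping the command as a list avoids shell quoting problems if a path contains spaces.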

Data

The model was designed to work with OMOP-CDM v5.4 tables. The pipeline expects to find the OMOP tables in .parquet format within the folder defined by the environment variable REMOTE_RAW_OMOP_DATA. The contents of this directory will be copied to a local folder defined by the environment variable DATA_PATH. In our implementation, DATA_PATH should point to a data/ folder inside the repository, which is ignored by git.

Please note that, depending on the DVC version you use, folders monitored by DVC cannot be located outside the repository.

For the model to work, the required OMOP tables are:

PERSON
VISIT_OCCURRENCE
CONDITION_OCCURRENCE
MEASUREMENT
PROVIDER
CONCEPT
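Before launching the pipeline, it may help to confirm that all of the tables above are present. The sketch below is illustrative only. It assumes one .parquet file per table, named after the table (for example person.parquet); the actual file naming in your data dump may differ, and the function name missing_tables is hypothetical.

```python
from pathlib import Path

# Tables listed above as required by the model.
REQUIRED_TABLES = [
    "PERSON",
    "VISIT_OCCURRENCE",
    "CONDITION_OCCURRENCE",
    "MEASUREMENT",
    "PROVIDER",
    "CONCEPT",
]


def missing_tables(folder):
    """Return the required tables that have no matching .parquet file in folder."""
    # Assumption: file stems match table names, compared case-insensitively.
    present = {path.stem.upper() for path in Path(folder).glob("*.parquet")}
    return [name for name in REQUIRED_TABLES if name not in present]
```

An empty list means every required table was found. Any names returned are the tables still to be provided.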

Running the Pipeline

# Run the entire pipeline
dvc repro

# Run a specific stage
dvc repro train_OMOP

Please note that the current implementation of the pipeline reproduces all the data used in the paper. This involves numerous training cycles, so a full run takes time to complete.

Parameters

Parameters are stored in params.yaml. Key parameters:

ban_codes: List of condition_source_concept_id codes used to filter patients out.
- Any patient that has one of these codes will be removed from the data.
target_diagnosis_concept_id: The condition_source_concept_id that the model uses as its training target.
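The ban_codes rule can be sketched as follows. This is a simplified illustration using plain dictionaries, not the code used by the clean_raw_data stage, which may operate on dataframes instead. The function name remove_banned_patients is hypothetical; person_id and condition_source_concept_id are columns of the OMOP CONDITION_OCCURRENCE table.

```python
def remove_banned_patients(condition_rows, ban_codes):
    """Drop every row belonging to a patient who has at least one banned code."""
    banned = set(ban_codes)
    # First pass: collect the patients that carry any banned code.
    banned_patients = {
        row["person_id"]
        for row in condition_rows
        if row["condition_source_concept_id"] in banned
    }
    # Second pass: keep only rows from patients who were never flagged.
    return [row for row in condition_rows if row["person_id"] not in banned_patients]
```

Note that the whole patient is removed, not only the rows with the banned code, which is why the filter needs two passes over the data.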

Our source data have a local vocabulary that we had to map to the standard OMOP vocabulary:

2000000033 is our local OMOP code for ovarian cancer.

Personnel

Contributors

Victor Manuel de la Oliva Roque

Alberto Esteban-Medina

Laura Alejos

Isidoro Gutiérrez-Álvarez

Scientific Management

Carlos Loucera

Management

Joaquin Dopazo

Acknowledgements

This project would not have been possible without the continuous support of the Andalusian Population Health Database (BPS) and their diligent efforts in maintaining the database over the years (especially Dolores Muñoyerro-Muñiz and Román Villegas).

Funding

This study was funded by the AstraZeneca project “Retrospective observational study for the development of early predictors of high grade serous ovarian cancer” (ES-2021-3211) and is also supported by grant PID2020-117979RB-I00 from the Spanish Ministry of Science and Innovation and grant IE19_259 FPS from Consejeria de Salud y Consumo, Junta de Andalucía.
