Colorectal Cancer Prediction with Gut Microbiome Data

This project builds a machine learning model to predict colorectal cancer (CRC) based on the gut microbiome composition of patients, using 16S rRNA sequencing data collapsed at the genus level.

Project Workflow

1. Load the Data

metadata.csv: Includes SampleID and whether the patient has CRC.
seqtab_nochim.csv: Rows = samples, Columns = ASVs (amplicon sequence variants), Values = counts.
taxa.csv: Maps ASVs to their taxonomic classification (e.g., Genus, Family).
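A minimal sketch of the loading step, assuming the first column of seqtab_nochim.csv holds the sample IDs and the first column of taxa.csv holds the ASV IDs. The helper name `load_inputs` is hypothetical and not part of the project.

```python
import pandas as pd


def load_inputs(metadata_src, seqtab_src, taxa_src):
    """Read the three input tables; each argument is a path or a file-like object."""
    # metadata.csv: one row per sample, with SampleID and the CRC status.
    metadata = pd.read_csv(metadata_src)
    # Assumption: the first column of seqtab_nochim.csv is the sample ID.
    seqtab = pd.read_csv(seqtab_src, index_col=0)
    # Assumption: the first column of taxa.csv is the ASV ID.
    taxa = pd.read_csv(taxa_src, index_col=0)
    return metadata, seqtab, taxa
```

With the files in the working directory, a call might look like `metadata, seqtab, taxa = load_inputs("metadata.csv", "seqtab_nochim.csv", "taxa.csv")`.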

2. Map ASVs to Genera

Use the taxa.csv table to map each ASV to its genus.
Missing genera are labeled as Unassigned.
Rename ASV columns from IDs (e.g., ASV_1234) to genus names (e.g., Bacteroides).
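The mapping could be sketched as follows. The two small tables are made-up stand-ins for seqtab_nochim.csv and taxa.csv, and the variable names (`genus_of`, `seqtab_named`) are illustrative only.

```python
import pandas as pd

# Toy stand-ins for seqtab_nochim.csv and taxa.csv (values are made up).
seqtab = pd.DataFrame(
    {"ASV_1": [10, 0], "ASV_2": [5, 7], "ASV_3": [1, 2]},
    index=["S1", "S2"],
)
taxa = pd.DataFrame(
    {"Genus": ["Bacteroides", "Bacteroides", None]},
    index=["ASV_1", "ASV_2", "ASV_3"],
)

# Look up the genus of every ASV column; ASVs with a missing genus
# (or absent from taxa altogether) are labeled "Unassigned".
genus_of = taxa["Genus"].reindex(seqtab.columns).fillna("Unassigned")

# Rename columns from ASV IDs to genus names. Duplicate names are expected
# at this point, because several ASVs can share a genus.
seqtab_named = seqtab.copy()
seqtab_named.columns = genus_of.to_numpy()
```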

3. Collapse to Genus Level

Some ASVs belong to the same genus.
Combine all columns with the same genus name by summing their counts per sample.
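One way to sum duplicate genus columns in pandas is to transpose, group by the genus name, and transpose back. This is a sketch on made-up counts, not necessarily how the notebook does it.

```python
import pandas as pd

# Toy table whose columns were already renamed to genus names (values are made up).
seqtab_named = pd.DataFrame(
    [[10, 5, 1], [0, 7, 2]],
    index=["S1", "S2"],
    columns=["Bacteroides", "Bacteroides", "Unassigned"],
)

# Transpose so genera become the row index, sum rows that share a genus name,
# then transpose back to samples x genera.
genus_counts = seqtab_named.T.groupby(level=0).sum().T
```

After this step each genus appears exactly once, and the per-sample total read count is unchanged.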

4. Join with Metadata

Merge CRC status (from metadata.csv) with genus-level abundances.
Save the merged table as seqtab_genus.
Final table structure:
- Rows = samples (patients’ stool samples).
- Columns = bacterial genera (e.g., Bacteroides, Prevotella).
- Values = read counts per genus.
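A possible sketch of the merge, assuming the label column in metadata.csv is named `CRC` (a hypothetical name; the source only states that the file records CRC status) and that the genus table is indexed by sample ID.

```python
import pandas as pd

# Toy genus-level counts and metadata (values are made up).
genus_counts = pd.DataFrame(
    {"Bacteroides": [15, 7], "Prevotella": [1, 2]},
    index=["S1", "S2"],
)
metadata = pd.DataFrame({"SampleID": ["S2", "S1"], "CRC": [0, 1]})

# Index the labels by SampleID so they align with the rows of the count table.
labels = metadata.set_index("SampleID")["CRC"]

# Inner join keeps only samples present in both tables.
seqtab_genus = genus_counts.join(labels, how="inner")
```

The inner join silently drops samples that lack either counts or a label, so comparing `len(seqtab_genus)` with the input sizes is a cheap sanity check.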

5. Normalize the Microbiome Data

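Convert raw counts to relative abundances: divide each genus count by the sample's total read count and express the result as a percentage, so that every sample sums to 100.
Apply the normalization to the genus columns only, not to the CRC status column.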
Prepare:
- X = features (genus abundances)
- y = labels (CRC status)
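The normalization can be sketched like this on a made-up table. The label column name `CRC` is an assumption.

```python
import pandas as pd

# Toy merged table: genus counts plus the label column (values are made up).
seqtab_genus = pd.DataFrame(
    {"Bacteroides": [15, 6], "Prevotella": [5, 2], "CRC": [1, 0]},
    index=["S1", "S2"],
)

# Separate the genus counts from the label before normalizing.
counts = seqtab_genus.drop(columns="CRC")
totals = counts.sum(axis=1)

# Relative abundance in percent: each row is divided by its own total.
# Note: a sample with zero total reads would produce NaN here and should
# be dropped or handled before modeling.
X = counts.div(totals, axis=0) * 100
y = seqtab_genus["CRC"]
```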

6. Split the Dataset

Perform an 80/20 train-test split.
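A sketch of the split with scikit-learn's `train_test_split` on synthetic data. Stratifying on the label and fixing `random_state` are assumptions added here for reproducibility and class balance; the source only specifies the 80/20 ratio.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the normalized table: 20 samples, 3 genera.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.random((20, 3)),
    columns=["Bacteroides", "Prevotella", "Unassigned"],
)
y = pd.Series([0, 1] * 10, name="CRC")

# 80% train / 20% test; stratify keeps the CRC / non-CRC ratio similar
# in both parts (an assumption, not stated in the source).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```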

7. Train a Random Forest Classifier

Fit a Random Forest model to learn relationships between microbial genera and CRC status.
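A Random Forest is an ensemble of decision trees, each trained on a bootstrap sample of the training data, whose outputs are combined into a single prediction.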
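Training might look like the sketch below, using scikit-learn's `RandomForestClassifier` on synthetic data. The hyperparameters (`n_estimators=100`, `random_state=42`) are illustrative assumptions, not the project's actual settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic training data: 40 samples, 3 features, balanced labels.
rng = np.random.default_rng(0)
y_train = np.array([0, 1] * 20)
X_train = rng.random((40, 3))
X_train[:, 0] += y_train  # make the first feature carry signal in this toy data

# Hyperparameters are illustrative assumptions.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predicted class and class probabilities for a few samples.
pred = model.predict(X_train[:5])
proba = model.predict_proba(X_train[:5])
```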

8. Evaluate Model Performance

Report:
- Precision
- Recall
- F1-score
- Accuracy
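The listed metrics can be obtained from `sklearn.metrics`. The sketch below uses made-up labels and predictions and assumes the encoding 0 = non-CRC, 1 = CRC.

```python
from sklearn.metrics import accuracy_score, classification_report

# Made-up true labels and predictions (assumed encoding: 0 = non-CRC, 1 = CRC).
y_test = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Overall fraction of correct predictions.
accuracy = accuracy_score(y_test, y_pred)

# Per-class precision, recall, and F1-score as a formatted text table.
report = classification_report(y_test, y_pred, target_names=["non-CRC", "CRC"])
print(report)
```

On imbalanced cohorts, accuracy alone can be misleading, which is why the per-class precision and recall are worth reading alongside it.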

9. Identify Important Genera

Extract feature importance from the Random Forest.
Highlight genera most predictive of CRC status.
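A sketch of ranking genera by the forest's impurity-based `feature_importances_`. The data is synthetic: the column labeled `Bacteroides` is deliberately constructed to separate the classes, so the ranking shown here carries no biological meaning.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only the first column is related to the label.
rng = np.random.default_rng(0)
y = np.array([0, 1] * 30)
X = pd.DataFrame(
    {
        "Bacteroides": rng.random(60) + y,  # shifted by the label on purpose
        "Prevotella": rng.random(60),       # pure noise
        "Unassigned": rng.random(60),       # pure noise
    }
)

model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# Pair each importance with its genus name and sort from most to least important.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
top_genera = importances.head(10)
```

Impurity-based importances rank features by how much they reduce impurity across the trees; they indicate predictive usefulness within this model, not causation.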

Dependencies

Python 3.8+
pandas, numpy, scikit-learn, matplotlib/seaborn (for analysis & visualization)

Usage

Place the input files (metadata.csv, seqtab_nochim.csv, taxa.csv) in the working directory.
Run the preprocessing script to generate seqtab_genus.
Train the Random Forest model.
Evaluate results and inspect important genera.

Output

Trained model performance metrics.
List of key microbial genera predictive of CRC.

License

This project is open-source and available under the MIT License. See the LICENSE file for details.
