This project builds a machine learning model to predict colorectal cancer (CRC) based on the gut microbiome composition of patients, using 16S rRNA sequencing data collapsed at the genus level.
- metadata.csv: Includes SampleID and whether the patient has CRC.
- seqtab_nochim.csv: Rows = samples, Columns = ASVs (amplicon sequence variants), Values = counts.
- taxa.csv: Maps ASVs to their taxonomic classification (e.g., Genus, Family).
- Use the taxa.csv table to map each ASV to its genus.
- Missing genera are labeled as Unassigned.
- Rename ASV columns from IDs (e.g., ASV_1234) to genus names (e.g., Bacteroides).
- Some ASVs belong to the same genus.
- Combine all columns with the same genus name by summing their counts per sample.
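
The mapping and collapsing steps above can be sketched as follows. This is a minimal illustration on made-up toy tables, not the project's actual preprocessing script; the helper name `collapse_to_genus` is hypothetical, and it assumes taxa.csv is indexed by ASV ID and has a `Genus` column.

```python
import pandas as pd

# Toy stand-ins for seqtab_nochim.csv and taxa.csv (all values are made up).
seqtab = pd.DataFrame(
    {"ASV_1": [10, 0], "ASV_2": [5, 3], "ASV_3": [2, 7]},
    index=["S1", "S2"],
)
# Assumption: the taxonomy table is indexed by ASV ID with a "Genus" column.
taxa = pd.DataFrame(
    {"Genus": ["Bacteroides", "Bacteroides", None]},
    index=["ASV_1", "ASV_2", "ASV_3"],
)


def collapse_to_genus(seqtab, taxa):
    """Hypothetical helper: relabel ASV columns by genus and sum duplicates."""
    # Look up each ASV's genus; ASVs without a genus become "Unassigned".
    genus = taxa["Genus"].reindex(seqtab.columns).fillna("Unassigned")
    # Group the ASV columns by genus label and sum counts per sample.
    return seqtab.T.groupby(genus.to_numpy()).sum().T


genus_counts = collapse_to_genus(seqtab, taxa)
print(genus_counts)
```

Here ASV_1 and ASV_2 share a genus, so their counts are summed into a single column, while ASV_3 ends up under Unassigned.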
- Merge CRC status (from metadata.csv) with genus-level abundances.
- Save as seqtab_genus.
- Final table structure:
  - Rows = samples (patients’ stool samples).
  - Columns = bacterial genera (e.g., Bacteroides, Prevotella).
  - Values = read counts per genus.
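
One possible way to attach CRC status to the genus-level table is an index join on the sample ID. The sketch below uses toy data; the label column name `CRC` and its 0/1 encoding are assumptions, since the source only says that metadata.csv records whether the patient has CRC.

```python
import pandas as pd

# Toy genus-level counts, indexed by sample ID (values are made up).
genus_counts = pd.DataFrame(
    {"Bacteroides": [15, 3], "Prevotella": [2, 7]},
    index=pd.Index(["S1", "S2"], name="SampleID"),
)
# Assumption: the metadata label column is called "CRC" (1 = CRC, 0 = no CRC).
metadata = pd.DataFrame({"SampleID": ["S1", "S2"], "CRC": [1, 0]})

# Inner join keeps only samples that appear in both tables.
seqtab_genus = metadata.set_index("SampleID")[["CRC"]].join(genus_counts, how="inner")
print(seqtab_genus)
```

An inner join silently drops samples missing from either table, so comparing row counts before and after the merge is a cheap sanity check.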
- Convert raw counts to relative abundances (percentage per sample).
- Prepare:
  - X = features (genus abundances)
  - y = labels (CRC status)
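
As a rough sketch of the normalization and the X/y preparation, each sample's counts can be divided by that sample's total and scaled to percent. The toy table and the `CRC` column name are assumptions made for illustration.

```python
import pandas as pd

# Toy merged table; "CRC" is an assumed label column name, counts are made up.
seqtab_genus = pd.DataFrame(
    {"CRC": [1, 0], "Bacteroides": [15, 3], "Prevotella": [5, 7]},
    index=["S1", "S2"],
)

# y = labels (CRC status).
y = seqtab_genus["CRC"]

# X = features: divide each row by its total reads and scale to percent.
counts = seqtab_genus.drop(columns="CRC")
X = counts.div(counts.sum(axis=1), axis=0) * 100
print(X)
```

A sample with zero total reads would produce NaN values here, so such samples may need to be filtered out before normalizing.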
- Perform an 80/20 train-test split.
- Fit a Random Forest model (an ensemble of decision trees) to learn relationships between microbial genera and CRC status.
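
The split and the model fit might look like the sketch below, which uses scikit-learn on random placeholder data. The `stratify=y` argument, the fixed `random_state`, and `n_estimators=100` are illustrative choices, not settings confirmed by the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Random placeholder data standing in for genus abundances and CRC labels.
rng = np.random.default_rng(0)
X = rng.random((50, 4))
y = np.array([0, 1] * 25)

# 80/20 split; stratifying keeps the class balance similar in both parts
# (an assumption, the source only specifies the 80/20 ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit an ensemble of 100 decision trees on the training portion only.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
```

Fixing `random_state` in both calls makes the split and the fitted forest reproducible between runs.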
- Report:
  - Precision
  - Recall
  - F1-score
  - Accuracy
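
The four metrics listed above can be computed with scikit-learn's metrics module. The labels and predictions below are hand-written toy values, not model output; treating class 1 as the positive (CRC) class is an assumption.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    precision_recall_fscore_support,
)

# Toy true labels and predictions (assumption: 1 = CRC, 0 = no CRC).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Precision, recall and F1 for the positive class.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
accuracy = accuracy_score(y_true, y_pred)

# Per-class summary of the same metrics in one table.
print(classification_report(y_true, y_pred))
```

In this toy case there are two true positives, one false positive, and one false negative, so precision, recall, F1, and accuracy all come out to 2/3.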
- Extract feature importance from the Random Forest.
- Highlight genera most predictive of CRC status.
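
Feature importances can be read from a fitted forest's `feature_importances_` attribute and ranked. The sketch below uses synthetic data with placeholder column names (`genus_a`, `genus_b`, `genus_c`), where the label is deliberately derived from one column; it says nothing about which real genera matter.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features with placeholder names (not real genera or real results).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((60, 3)), columns=["genus_a", "genus_b", "genus_c"])
# The synthetic label depends only on genus_c, so it should rank first.
y = (X["genus_c"] > 0.5).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, one per feature, summing to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can be biased when features are correlated, as genus abundances often are, so the ranking is better read as a starting point than as a definitive list.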
- Python 3.8+
- pandas, numpy, scikit-learn, matplotlib/seaborn (for analysis & visualization)
- Place the input files (metadata.csv, seqtab_nochim.csv, taxa.csv) in the working directory.
- Run the preprocessing script to generate seqtab_genus.
- Train the Random Forest model.
- Evaluate results and inspect important genera.
- Trained model performance metrics.
- List of key microbial genera predictive of CRC.
This project is open-source and available under the MIT License. See the LICENSE file for details.