2025 Thesis:
Does phase variation play a role in the genetic diversity of the mycobacteria that cause tuberculosis in animals?
This repository provides the scripts used and the analyses performed in the study of homopolymeric tract (HT) mutations and their statistical significance in Mycobacterium bovis.
The R scripts below were run directly from the RStudio terminal.
Required packages: readxl, writexl, dplyr
- Script: genome_statistics_script.r
  - Description: Calculates the background mutation rate from genome-wide variant data. It parses mutation counts in the format x/n (successes/trials), sums across selected columns representing different isolates or lineages, and computes the overall background mutation rate.
  - Input: underhill_data.xlsx
  - Output: background_rate_summary.csv
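The x/n parsing and pooling described above can be illustrated with a minimal Python sketch. It is not the R script itself: the function names and example values are hypothetical, and it assumes the overall rate is the pooled ratio (total successes divided by total trials), which the R script may compute differently.

```python
def parse_count(cell):
    """Parse a count written as 'x/n' into (successes, trials)."""
    successes, trials = cell.strip().split("/")
    return int(successes), int(trials)


def background_rate(cells):
    """Pool x/n counts and return total successes / total trials.

    Assumption: pooling is done by summing numerators and denominators
    separately, not by averaging the per-column rates.
    """
    total_successes = 0
    total_trials = 0
    for cell in cells:
        successes, trials = parse_count(cell)
        total_successes += successes
        total_trials += trials
    if total_trials == 0:
        return float("nan")
    return total_successes / total_trials


# Hypothetical counts for three columns (not taken from underhill_data.xlsx).
example_cells = ["3/1000", "5/2000", "2/1000"]
print(background_rate(example_cells))
```

Pooling weights each column by its number of trials, so columns with few observations do not dominate the estimate.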
- Scripts:
  - public_ht_only_statistics.r
  - underhill_ht_only_statistics.r
  - Description:
    - Calculates mutation rates specifically for HT regions.
    - Applies an exact binomial test comparing observed HT mutation rates against the genome-wide background rate.
    - The underhill script includes both lineage-specific analysis (La1, La2, La3) and global analysis.
  - Inputs:
    - mummer_public_ht.xlsx
    - underhill_ht_only.xlsx
  - Outputs:
    - public_ht_binomial_results.xlsx
    - ht_binomial_results.xlsx
    - ht_rate_summary_20250914_132034.xlsx
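As an illustration of the exact binomial test, the sketch below computes a one-sided p-value in Python using only the standard library. The one-sided ("greater") alternative is an assumption made for this example; the R scripts may use a different alternative, and the counts and background rate shown are hypothetical.

```python
import math


def binom_log_pmf(k, n, p):
    """Log of P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)


def binom_upper_tail(k, n, p):
    """Exact one-sided p-value P(X >= k) for X ~ Binomial(n, p)."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    # Sum the exact probabilities of every outcome at least as large as k.
    tail = math.fsum(math.exp(binom_log_pmf(i, n, p)) for i in range(k, n + 1))
    return min(1.0, tail)


# Hypothetical HT region: 6 mutated isolates out of 200,
# tested against a hypothetical background rate of 0.0025.
p_value = binom_upper_tail(6, 200, 0.0025)
print(p_value)
```

Working in log space avoids overflow in the binomial coefficient when the number of trials is large.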
The Bash and Python scripts below were run in a Linux environment (Ubuntu on Windows).
Required modules: numpy, pandas, pathlib, argparse, matplotlib, openpyxl
- Script: ht_identifier.sh
  - Description: Parses the M. bovis genome for HT regions (polymers of length ≥ 7).
  - Input: bovis_ref_genome.fasta
  - Output: homopolymers.txt
- Requirement: MUMmer4
- Script: run_dnadiff_all.sh
  - Description: Bash script with embedded Python that detects mutations after WGS alignment between the reference and query genomes.
  - Inputs:
    - bovis_ref_genome.fasta
    - Query genome(s)
  - Output: Excel sheets with mutation results for each query (see Appendix 1 for results).
- Script: ht_lineage_analysis.py
  - Description:
    - Compares mutation rates across lineages (La1, La2, La3) for the 27 significant HT regions in the Underhill dataset.
    - Significance was determined via background mutation rate calculation (R scripts).
    - Creates a heatmap and outputs tidy CSV files.
  - Input: underhill_ht_only.xlsx
  - Outputs:
    - underhill_ht_lineage_rates_sig27.csv
    - underhill_ht_lineage_rates_sig27_pretty.csv
    - underhill_ht_lineage_heatmap_sig27.png
- Script: dotplot_underhill_script.py
  - Input: underhill_ht_only.xlsx
  - Outputs:
    - ht_plot_by_position_tidy.csv
    - ht_plot_by_position_combined.png
    - ht_plot_by_position_combined.pdf
    - ht_plot_by_position_L1.png/.pdf
    - ht_plot_by_position_L2.png/.pdf
    - ht_plot_by_position_L3.png/.pdf
- Script: dotplot_public_script.py
  - Input: dotplot_public.xlsx
  - Outputs:
    - public_ht_by_position.png
    - public_ht_by_position.pdf
    - public_ht_by_position_tidy.csv
- Script: heatmap_underhill_script.py
  - Input: underhill_sig.ht_only.xlsx
  - Output: events_matrix_proportions.csv
- Script: heatmap_public_script.py
  - Input: public_sig.ht_only.xlsx
  - Outputs:
    - weighted_matrix_proportions.csv
    - events_heatmap_proportions.png
    - weighted_heatmap_proportions.png
    - public_events_heatmap_proportions_v2.png