Deciphering the RNA-based regulation mechanism of phage-encoded AbiF system in Clostridioides difficile
This GitHub repository accompanies the preprint where we report the identification of a new AbiF-like system within a prophage of an hypervirulent Clostridioides difficile R20291 strain: "Deciphering the RNA-based regulation mechanism of the phage-encoded AbiF system in Clostridioides difficile".
- Datasets used
- Conservation of the AbiF-like system of the hypervirulent ribotype 027 strain
- MAPS analyses for the RCd22 ncRNA
- Reference of third-party software
- Citation
Descriptions of the software environment stand in the conda_environments
files and are managed with micromamba with the command line: micromamba create -f conda_environments/<environmentName>.yml
and activated before use by micromamba activate <environmentName>_ce
.
Example for the snakemake environment: micromamba create -f conda_environments/smk8.4.6_ce.yml
Example for the the R-Enhanced volcano: micromamba create -f conda_environments/R-EnhancedVolcano_ce.yml
Prepare datasets with following downloads:
- Similarity search of the AbiF-like system: save "Table S6 - Distribution AbiF" of the supplemental material to
distribution_abiF.csv
and suppress header lines. - iTOL templates: colors and simplebar
- genomic contexte of AbiF_like systems: save "Table S7 - AbiF environment" of the supplemental material in a
abiF_environment.csv
file in tsv format (values separated by tabs).
- genome and annotation of the C. difficile R20291 strain used: download both the genome (
fna
) and annotation (gff
) files from the Refseq GCF_000027105.1 ncbi assembly. After unzip, you should havencbi_dataset/data/GCF_000027105.1/GCF_000027105.1_ASM2710v1_genomic.fna
andncbi_dataset/data/GCF_000027105.1/genomic.gff
. Mapped reads are counted using the locus_tag information in the annotation file. For genetic elements of interest that did not contain it, it is added as followsawk -F "\t" '{if(($3!="riboswitch")&&($3!="binding_site")&&($3!="sequence_feature")){print}}' ncbi_dataset/data/GCF_000027105.1/genomic.gff > data/genomic.gff
. And the identified ncRNA of CD630 on the R20291 genome (data/RCd_r20.gff
) is also added to complete the annotation file:cat data/riboswith_bindinsite_sequencefeature_withLocusTag.gff data/RCd_r20.gff >> data/genomic.gff
. - MAPS experiments data: stand in two parts: RNAseq fraction (PRJEB87349) and proteic fraction R20291_RCd22_Soutourina_120723.xlsx
The AbiF-like system identified in the C. difficile hypervirulent ribotype 027 strain were searched in an in-house database of 47545 completely sequences genomes (chromosome level assembly, November 2023) both from Refseq and Genbank databases with PSI-BLAST (2.16.1 version). It was run against COG, CD and PFAM profiles from CDD database (E-value=1e-4), other parameters were default.
The resulting table distribution_abiF.csv
(see Datasets section) contains the number of valid hits by genomes.
The genomic context of the resulting hits including 5 proteins upstream and downstream from the location of the AbiF-like system where extracted and can be found in the corresponding 10 separate files (see Datasets section)
- iTOL templates
dataset_color_strip_template.txt
anddataset_simplebar_template.txt
(see Datasets section) - color selection for the figure of AbiF-like system conservation:
data/color_selection.tsv
- python3
Create the tree (Newick format) from the taxonomy present in the AbiF resulting table distribution_abiF.csv
:
awk -F "\t" '{if($4~/Bacteria/){print $4"\t"$3"\t"$1}}' distribution_abiF.csv | sed 's/:/-/;s/\t/;/g' | sort > full_lineage_bact.tsv
python3 figures_scripts/convert_lineage_to_nwk.py full_lineage_bact.tsv > full_lineage_bact.nwk
Firstly, the NCBI bacterial lineage (4th column) is completed by the strain name (3rd column) and the assembly ID (1st column), the characters ":" not allowed in iTOL are replaced by "-" and the separators ";" in the lineage are replaced by tab characters.
Next, this one-line-per-species format is converted into a tree following the Newick format using the python code convert_lineage_to_nwk.py
taken from the Mark Watson stackoverflow response and adapted for python 3.
Resulting file: full_lineage_bact.tsv
and full_lineage_bact.nwk
For ease of viewing, only 23 taxonomic levels have been manually selected for color display : data/color_selection.tsv
.
This selection was chosen according to different taxonomic levels to highlight C. difficile and Staphylococcus aureus and visualize those with counts of zero (if more than 160 genomes).
The other selected taxonomic levels constitute a balanced grouping according to the number of genomes in which the AbiF_like system was searched.
Black color represents the remaining divisions.
Into the iTOL template dataset_color_strip_template.txt
(see Datasets section), manually change:
- SEPARATOR SPACE for TAB
- DATASET_LABEL label1 to Philum
- COLOR_BRANCHES from 0 to 1
and add the color_selection.tsv
at the end:
cat dataset_color_strip_template.txt data/color_selection.tsv > full_lineage_bact_colors.txt
Resulting file: full_lineage_bact_colors.txt
Into the iTOL template dataset_simplebar_template.txt
(see Datasets section), uncomment the WIDTH,1000
line and change its value from 1000 to 200.
Complete this dataset_simplebar_template.txt
with the AbiF counts (2nd column) present in each assembly (1rst column) of the resulting AbiF table distribution_abiF.csv
:
cp dataset_simplebar_template.txt full_lineage_bact_simplebar_abiF.txt
awk -F "\t" '{if($4~/Bacteria/){print $1","$2}}' distribution_abiF.csv | sort >> full_lineage_bact_simplebar_abiF.txt
Resulting file: full_lineage_bact_simplebar_abiF.txt
On the iTOL v7 web page:
Import the species tree structure ("Upload" tab with full_lineage_bact.nwk
)
As the tree contains too many leaves, iTOL by default compresses some of the data we want to display.
To correct this, performe manually the steps below in the iTOL "Control panel" and "Basic" tab :
- place "Labels" to "Hide"
- change 350° to 359° on "Mode option" and "Arc"
- choose grey color (#5b5b5B) for "Line style"
- select "Un-collapse all" on "Advanced" tab and "Nodes options"
Add colors on the tree by "Upload annotation files" (in "Control panel" then "Datasets" tab) with full_lineage_bact_colors.txt
Add the AbiF distribution : in "Datasets" table, "Upload" tab with full_lineage_bact_simplebar_abiF.txt
Save the resulting figure.
The figure showing the genomic context of AbiF-like systems was produced with the R script figures_scripts/genomiccontext.R
after a few steps of filtering and reorganizing the genomic context data of AbiF-like systems (see Process below).
If required, the R-base version used (4.2.3) and added with the ggplot2 package (3.4.4) can be installed with the conda environment file conda_environments/env_Rbase4.2.3.yml
: conda env create -f conda_environments/env_Rbase4.2.3.yml
Input file: see "genomic contexte of AbiF_like systems" in the Datasets session
Suppress column 11 from the environment_abiF.tsv
file and convert information into several files depending on position:
awk 'BEGIN{OFS="\t";FS="\t"}{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$12,$13}' environment_abiF.tsv > environment_abiF_woc11.tsv
awk -f script/AbiF_environment_2_pos.awk
select COG/Pfam or CDD if > 100 occurencies from the 10 input files:
rm abi_COG_up100.txt ; for p in 5neg 4neg 3neg 2neg 1neg 1pos 2pos 3pos 4pos 5pos ; do
sort -t $'\t' -k9,9 Positions_genes_autours_abi-2_ou_abiF_$p.csv > Pos_abi_$p.tmp
join -t $'\t' -1 9 -2 1 -e "empty" -a 1 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10,1.11,1.12,1.13,2.2,2.3 Pos_abi_$p.tmp col9_up100_cdd_pfam_COG.txt > Pos_abi_$po.csv
done
format for graph:
echo "position"$'\t'"abundance"$'\t'"COG"$'\t'"classe" > abi_COG_up100_graph.txt ;
for p in 5neg 4neg 3neg 2neg 1neg abi 1pos 2pos 3pos 4pos 5pos ; do
awk -F "\t" '{if(length($14)>=1){print $14"\t"$15}}' Pos_abi_$p.csv | sort -t $'\t' -k1,1 | uniq -c | awk -v pos=${p} 'BEGIN{nbl=0;somme=0;OFS="\t"}{if(pos!="abi"){if($1>100){funcCl="";for(i=3;i<=NF;i++){funcCl=sprintf("%s %s",funcCl, $i)};print pos,$1,$2,funcCl;somme=somme+$1};nbl=nbl+$1}}END{if(pos=="abi"){print pos, 10000, "V", "Defense mechanisms"}else{print pos,nbl-somme,"others","Other COG, pfam or CDD < 100"}}' | sed 's/5neg/-5/;s/5pos/+5/;s/4neg/-4/;s/4pos/+4/;s/3neg/-3/;s/3pos/+3/;s/2neg/-3/;s/2pos/+2/;s/1neg/-1/;s/1pos/+1/' >> abi_COG_up100_graph.txt
done
create graph:
conda activate Rbase4.2.3
Rscript abi_context_graph.R
result: abi_context_graph.png
MAPS (MS2‐affinity purification coupled with RNA sequencing) analysis for the RCd22 ncRNA
Setup:
- if needed, create then activate the conda environement for snakemake:
conda_environments/smk8.4.6_ce.yaml
- download the
template_script_DESeq2_CL.r
from the SARTools github pages to thescript/
directory. - snakemake pipeline for RNAseq maps analysis:
script/maps.smk
- adapt file paths of the snakemake configuration file:
script/maps.yml
- genome and annotation of the C. difficile R20291 strain (see Datasets section)
- RNAseq fraction of the MAPS experiments data (see Datasets section)
MAPS data (rnaseq part): 4 fastq RCd22 : comparison of a C. difficile R20291 containing a plasmid with an MS2 tag followed (rcd22) or not (crtl) by RCd22 ncRNA (2 replicates for each condition, see Datasets section).
MAPS analysis:
The analysis workflow can be found in the snakemake file (see scripts/maps.smk
) and the parameters are set in the corresponding configuration file (scripts/maps.yml
). The workflow basically contains the following steps: quality control of reads (fastqc, fastqscreen) and correction if necessary (fastp), mapping to the genome sequence (bowtie2), count of the number of reads per gene (featurecount), differential capture analysis (SARTools used with DESeq2 mode), and creation of graph (R-enhancedVolcano).
Software versions are described in their associated conda environments (script/conda_environments/*.yml
).
Results:
- the volcano-plot of differentially captured RNA:
volcano_plot.pdf
- complete, up, and down tables of differentially captured RNA will be provided in the specified
st_dir/st_comparison/
(see the fixed values ofst_dir
andst_comparison
in thescripts/maps.yml
) repository.
Functionnal testing of the snakemake workflow:
To run a functionnal test, a small extract of the related MAPS rnaseq data stands in the data/demo/small_data/
repository.
- if needed, create then activate the conda environement for snakemake:
conda_environments/smk8.4.6_ce.yaml
- choose the
RNAreg_AbiF_CDiff
repository as the current working directory - check (and change if necessary) the indicated paths into the configdfile
script/maps.yml
- run the RNAseq analysis with:
snakemake -c1 -s script/maps.smk --configfile script/maps.yml --use-conda
Setup:
- create a conda environement with the
conda_environments/R-EnhancedVolcano_ce.yml
- download the
template_script_DESeq2_CL.r
from the SARTools github pages to thescript/
directory.
MAPS data (proteic part): The xlxs file was manually saved with a csv format (with tabulation separator). Total spectrum counts columns are broken down as follows: E464 and E464.2 (columns 10 and 11) stand for the 2 RCd22 replicates while E608 and E608.2 denote control replicates (columns 14 and 15).
From the csv file, count and ID columns for each sample were extracted as follow:
mkdir MAPSprot
awk -v col=10 'BEGIN{FS="\t";OFS="\t"}{if($col==""){$col=0};nbsplit=split($2, id , "|");if(nbsplit==3){print id[3],$col}}' R20291_RCd22_Soutourina_120723.csv | awk '{print $1"\t"$NF}' > MAPSprot/rcd22_1.txt
awk -v col=11 'BEGIN{FS="\t";OFS="\t"}{if($col==""){$col=0};nbsplit=split($2, id , "|");if(nbsplit==3){print id[3],$col}}' R20291_RCd22_Soutourina_120723.csv | awk '{print $1"\t"$NF}' > MAPSprot/rcd22_2.txt
awk -v col=14 'BEGIN{FS="\t";OFS="\t"}{if($col==""){$col=0};nbsplit=split($2, id , "|");if(nbsplit==3){print id[3],$col}}' R20291_RCd22_Soutourina_120723.csv | awk '{print $1"\t"$NF}' > MAPSprot/msctr_1.txt
awk -v col=15 'BEGIN{FS="\t";OFS="\t"}{if($col==""){$col=0};nbsplit=split($2, id , "|");if(nbsplit==3){print id[3],$col}}' R20291_RCd22_Soutourina_120723.csv | awk '{print $1"\t"$NF}' > MAPSprot/msctr_2.txt
We apply a differential expression analysis to these counts (in the activated conda environment):
conda activate Rbase4.3.1_ce
Rscript template_script_DESeq2_CL.r --projectName="RCd22-MAPS_MS" --targetFile="data/MassSpec.design4sartools" --rawDir="MAPSprot/" --varInt="group" --condRef="Crtl"
Results:
complete, up, and down tables of differentially total spectrum counts and the associated volcano-plot will be provided in the MAPSprot/tables
and MAPSprot/figures
repositories.
- conda Conda documentation. https://docs.conda.io/en/latest/
- micromamba Micromamba documentation. https://mamba.readthedocs.io/en/latest/
- Psi-blast (v2.16.1) Altschul S.F., Madden T.L., Schäffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. Sep 1;25(17):3389-402. doi
- CDD database Wang J., Chitsaz F., Derbyshire M.K., Gonzales N.R., Gwadz M., Lu S., Marchler G.H., Song J.S., Thanki N., Yamashita R.A., Yang M., Zhang D., Zheng C., Lanczycki C.J., Marchler-Bauer A. (2023) The conserved domain database in 2023. Nucleic Acids Res. Jan 6;51(D1):D384-D388. doi
- snakemake Koster J., Rahmann S. (2018) Snakemake-a scalable bioinformatics workflow engine. Bioinformatics. 34:3600. doi readthedocs
- fastqc (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
- fastqScreen Wingett SW, Andrews S. (2018) FastQ Screen: A tool for multi-genome mapping and quality control. F1000Res. Aug 24;7:1338. doi github
- fastp Chen S, Zhou Y, Chen Y, Gu J. (2018) fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. Sep 1;34(17):i884-i890. doi
- bowtie2 Langmead B., Salzberg S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods. 9:357–359. doi
- samtools Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. (2009) The sequence alignment/map format and SAMtools. Bioinformatics. 25:2078–2079. doi
- subread Liao Y., Smyth G.K., Shi W. (2014) featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 30:923–930. doi subread
- SARtools Varet H, Brillet-Guéguen L, Coppée JY, Dillies MA. (2016) SARTools: A DESeq2- and EdgeR-Based R Pipeline for Comprehensive Differential Analysis of RNA-Seq Data. PLoS One. Jun 9;11(6):e0157022. doi
- DESeq2 Love M.I., Huber W., Anders S. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15:550. doi
- R-enhancedVolcano Blighe K., Rana S., Lewis M. (2024) EnhancedVolcano: Publication-ready volcano plots with enhanced colouring and labeling. R package version 1.24.0, doi github
If you find these pages useful for your research, please cite the relevant paper
@article{saunier_abiF_2025,
title = {Deciphering the RNA-based regulation mechanism of phage-encoded AbiF system in Clostridioides difficile},
author = {Saunier, Marion and Humbert, Adeline and Kreis, Victor and Peltier, Johann and Tisba, Arianna and Auxilien, Sylvie and Blum, Marion and Caldelari, Isabelle and Lucier, Jean-François and Ueda, Joe and Gautheret, Daniel and Toffano-Nioche, Claire and Andreani, Jessica and Fortier, Louis-Charles and Soutourina, Olga},
journal = {Bio},
year = {2025},
url = {https://www.biorxiv.org/content/10.1101/2025.04.15.648962}
}