Skip to content

i2bc/RNAreg_AbiF_CDiff

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

32 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Deciphering the RNA-based regulation mechanism of phage-encoded AbiF system in Clostridioides difficile

This GitHub repository accompanies the preprint where we report the identification of a new AbiF-like system within a prophage of an hypervirulent Clostridioides difficile R20291 strain: "Deciphering the RNA-based regulation mechanism of the phage-encoded AbiF system in Clostridioides difficile".

Contents

Descriptions of the software environment stand in the conda_environments files and are managed with micromamba with the command line: micromamba create -f conda_environments/<environmentName>.yml and activated before use by micromamba activate <environmentName>_ce. Example for the snakemake environment: micromamba create -f conda_environments/smk8.4.6_ce.yml Example for the the R-Enhanced volcano: micromamba create -f conda_environments/R-EnhancedVolcano_ce.yml

Datasets

Prepare datasets with following downloads:

  • genome and annotation of the C. difficile R20291 strain used: download both the genome (fna) and annotation (gff) files from the Refseq GCF_000027105.1 ncbi assembly. After unzip, you should have ncbi_dataset/data/GCF_000027105.1/GCF_000027105.1_ASM2710v1_genomic.fna and ncbi_dataset/data/GCF_000027105.1/genomic.gff. Mapped reads are counted using the locus_tag information in the annotation file. For genetic elements of interest that did not contain it, it is added as follows awk -F "\t" '{if(($3!="riboswitch")&&($3!="binding_site")&&($3!="sequence_feature")){print}}' ncbi_dataset/data/GCF_000027105.1/genomic.gff > data/genomic.gff. And the identified ncRNA of CD630 on the R20291 genome (data/RCd_r20.gff) is also added to complete the annotation file: cat data/riboswith_bindinsite_sequencefeature_withLocusTag.gff data/RCd_r20.gff >> data/genomic.gff.
  • MAPS experiments data: stand in two parts: RNAseq fraction (PRJEB87349) and proteic fraction R20291_RCd22_Soutourina_120723.xlsx

Conservation

Similarity search of the AbiF-like system

The AbiF-like system identified in the C. difficile hypervirulent ribotype 027 strain were searched in an in-house database of 47545 completely sequences genomes (chromosome level assembly, November 2023) both from Refseq and Genbank databases with PSI-BLAST (2.16.1 version). It was run against COG, CD and PFAM profiles from CDD database (E-value=1e-4), other parameters were default. The resulting table distribution_abiF.csv (see Datasets section) contains the number of valid hits by genomes. The genomic context of the resulting hits including 5 proteins upstream and downstream from the location of the AbiF-like system where extracted and can be found in the corresponding 10 separate files (see Datasets section)

AbiF-like system representation in the bacterial species tree (figure)

Setup

  • iTOL templates dataset_color_strip_template.txt and dataset_simplebar_template.txt (see Datasets section)
  • color selection for the figure of AbiF-like system conservation: data/color_selection.tsv
  • python3

species tree

Create the tree (Newick format) from the taxonomy present in the AbiF resulting table distribution_abiF.csv:

awk -F "\t" '{if($4~/Bacteria/){print $4"\t"$3"\t"$1}}' distribution_abiF.csv | sed 's/:/-/;s/\t/;/g' | sort > full_lineage_bact.tsv
python3 figures_scripts/convert_lineage_to_nwk.py full_lineage_bact.tsv > full_lineage_bact.nwk

Firstly, the NCBI bacterial lineage (4th column) is completed by the strain name (3rd column) and the assembly ID (1st column), the characters ":" not allowed in iTOL are replaced by "-" and the separators ";" in the lineage are replaced by tab characters. Next, this one-line-per-species format is converted into a tree following the Newick format using the python code convert_lineage_to_nwk.py taken from the Mark Watson stackoverflow response and adapted for python 3.

Resulting file: full_lineage_bact.tsv and full_lineage_bact.nwk

species colors

For ease of viewing, only 23 taxonomic levels have been manually selected for color display : data/color_selection.tsv. This selection was chosen according to different taxonomic levels to highlight C. difficile and Staphylococcus aureus and visualize those with counts of zero (if more than 160 genomes). The other selected taxonomic levels constitute a balanced grouping according to the number of genomes in which the AbiF_like system was searched. Black color represents the remaining divisions.

Into the iTOL template dataset_color_strip_template.txt (see Datasets section), manually change:

  • SEPARATOR SPACE for TAB
  • DATASET_LABEL label1 to Philum
  • COLOR_BRANCHES from 0 to 1

and add the color_selection.tsv at the end:

cat dataset_color_strip_template.txt data/color_selection.tsv > full_lineage_bact_colors.txt

Resulting file: full_lineage_bact_colors.txt

AbiF distribution

Into the iTOL template dataset_simplebar_template.txt (see Datasets section), uncomment the WIDTH,1000 line and change its value from 1000 to 200. Complete this dataset_simplebar_template.txt with the AbiF counts (2nd column) present in each assembly (1rst column) of the resulting AbiF table distribution_abiF.csv:

cp dataset_simplebar_template.txt full_lineage_bact_simplebar_abiF.txt
awk -F "\t" '{if($4~/Bacteria/){print $1","$2}}' distribution_abiF.csv | sort >> full_lineage_bact_simplebar_abiF.txt

Resulting file: full_lineage_bact_simplebar_abiF.txt

Figure creation on the iTOL web page

On the iTOL v7 web page:

Import the species tree structure ("Upload" tab with full_lineage_bact.nwk) As the tree contains too many leaves, iTOL by default compresses some of the data we want to display. To correct this, performe manually the steps below in the iTOL "Control panel" and "Basic" tab :

  • place "Labels" to "Hide"
  • change 350° to 359° on "Mode option" and "Arc"
  • choose grey color (#5b5b5B) for "Line style"
  • select "Un-collapse all" on "Advanced" tab and "Nodes options"

Add colors on the tree by "Upload annotation files" (in "Control panel" then "Datasets" tab) with full_lineage_bact_colors.txt

Add the AbiF distribution : in "Datasets" table, "Upload" tab with full_lineage_bact_simplebar_abiF.txt

Save the resulting figure.

Environment of AbiF-like (figure)

Setup

The figure showing the genomic context of AbiF-like systems was produced with the R script figures_scripts/genomiccontext.R after a few steps of filtering and reorganizing the genomic context data of AbiF-like systems (see Process below). If required, the R-base version used (4.2.3) and added with the ggplot2 package (3.4.4) can be installed with the conda environment file conda_environments/env_Rbase4.2.3.yml: conda env create -f conda_environments/env_Rbase4.2.3.yml

Process

Input file: see "genomic contexte of AbiF_like systems" in the Datasets session

Suppress column 11 from the environment_abiF.tsv file and convert information into several files depending on position:

awk 'BEGIN{OFS="\t";FS="\t"}{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$12,$13}' environment_abiF.tsv > environment_abiF_woc11.tsv
awk -f script/AbiF_environment_2_pos.awk 

select COG/Pfam or CDD if > 100 occurencies from the 10 input files:

rm abi_COG_up100.txt ; for p in 5neg 4neg 3neg 2neg 1neg 1pos 2pos 3pos 4pos 5pos ; do 
   sort -t $'\t' -k9,9 Positions_genes_autours_abi-2_ou_abiF_$p.csv > Pos_abi_$p.tmp 
   join -t $'\t' -1 9 -2 1 -e "empty" -a 1 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10,1.11,1.12,1.13,2.2,2.3 Pos_abi_$p.tmp col9_up100_cdd_pfam_COG.txt > Pos_abi_$po.csv 
done

format for graph:

echo "position"$'\t'"abundance"$'\t'"COG"$'\t'"classe" > abi_COG_up100_graph.txt ; 
for p in 5neg 4neg 3neg 2neg 1neg abi 1pos 2pos 3pos 4pos 5pos ; do 
   awk -F "\t" '{if(length($14)>=1){print $14"\t"$15}}' Pos_abi_$p.csv | sort -t $'\t' -k1,1 | uniq -c | awk -v pos=${p} 'BEGIN{nbl=0;somme=0;OFS="\t"}{if(pos!="abi"){if($1>100){funcCl="";for(i=3;i<=NF;i++){funcCl=sprintf("%s %s",funcCl, $i)};print pos,$1,$2,funcCl;somme=somme+$1};nbl=nbl+$1}}END{if(pos=="abi"){print pos, 10000, "V", "Defense mechanisms"}else{print pos,nbl-somme,"others","Other COG, pfam or CDD < 100"}}' | sed 's/5neg/-5/;s/5pos/+5/;s/4neg/-4/;s/4pos/+4/;s/3neg/-3/;s/3pos/+3/;s/2neg/-3/;s/2pos/+2/;s/1neg/-1/;s/1pos/+1/' >> abi_COG_up100_graph.txt
done

create graph:

conda activate Rbase4.2.3 
Rscript abi_context_graph.R

result: abi_context_graph.png

MAPS

MAPS (MS2‐affinity purification coupled with RNA sequencing) analysis for the RCd22 ncRNA

MAPS rnaseq fraction

Setup:

  • if needed, create then activate the conda environement for snakemake: conda_environments/smk8.4.6_ce.yaml
  • download the template_script_DESeq2_CL.r from the SARTools github pages to the script/ directory.
  • snakemake pipeline for RNAseq maps analysis: script/maps.smk
  • adapt file paths of the snakemake configuration file: script/maps.yml
  • genome and annotation of the C. difficile R20291 strain (see Datasets section)
  • RNAseq fraction of the MAPS experiments data (see Datasets section)

MAPS data (rnaseq part): 4 fastq RCd22 : comparison of a C. difficile R20291 containing a plasmid with an MS2 tag followed (rcd22) or not (crtl) by RCd22 ncRNA (2 replicates for each condition, see Datasets section).

MAPS analysis: The analysis workflow can be found in the snakemake file (see scripts/maps.smk) and the parameters are set in the corresponding configuration file (scripts/maps.yml). The workflow basically contains the following steps: quality control of reads (fastqc, fastqscreen) and correction if necessary (fastp), mapping to the genome sequence (bowtie2), count of the number of reads per gene (featurecount), differential capture analysis (SARTools used with DESeq2 mode), and creation of graph (R-enhancedVolcano). Software versions are described in their associated conda environments (script/conda_environments/*.yml).

Results:

  • the volcano-plot of differentially captured RNA: volcano_plot.pdf
  • complete, up, and down tables of differentially captured RNA will be provided in the specified st_dir/st_comparison/ (see the fixed values of st_dir and st_comparison in the scripts/maps.yml) repository.

Functionnal testing of the snakemake workflow: To run a functionnal test, a small extract of the related MAPS rnaseq data stands in the data/demo/small_data/ repository.

  • if needed, create then activate the conda environement for snakemake: conda_environments/smk8.4.6_ce.yaml
  • choose the RNAreg_AbiF_CDiff repository as the current working directory
  • check (and change if necessary) the indicated paths into the configdfile script/maps.yml
  • run the RNAseq analysis with: snakemake -c1 -s script/maps.smk --configfile script/maps.yml --use-conda

MAPS proteic fraction

Setup:

  • create a conda environement with the conda_environments/R-EnhancedVolcano_ce.yml
  • download the template_script_DESeq2_CL.r from the SARTools github pages to the script/ directory.

MAPS data (proteic part): The xlxs file was manually saved with a csv format (with tabulation separator). Total spectrum counts columns are broken down as follows: E464 and E464.2 (columns 10 and 11) stand for the 2 RCd22 replicates while E608 and E608.2 denote control replicates (columns 14 and 15).

From the csv file, count and ID columns for each sample were extracted as follow:

mkdir MAPSprot
awk -v col=10 'BEGIN{FS="\t";OFS="\t"}{if($col==""){$col=0};nbsplit=split($2, id , "|");if(nbsplit==3){print id[3],$col}}' R20291_RCd22_Soutourina_120723.csv | awk '{print $1"\t"$NF}' > MAPSprot/rcd22_1.txt
awk -v col=11 'BEGIN{FS="\t";OFS="\t"}{if($col==""){$col=0};nbsplit=split($2, id , "|");if(nbsplit==3){print id[3],$col}}' R20291_RCd22_Soutourina_120723.csv | awk '{print $1"\t"$NF}' > MAPSprot/rcd22_2.txt
awk -v col=14 'BEGIN{FS="\t";OFS="\t"}{if($col==""){$col=0};nbsplit=split($2, id , "|");if(nbsplit==3){print id[3],$col}}' R20291_RCd22_Soutourina_120723.csv | awk '{print $1"\t"$NF}' > MAPSprot/msctr_1.txt
awk -v col=15 'BEGIN{FS="\t";OFS="\t"}{if($col==""){$col=0};nbsplit=split($2, id , "|");if(nbsplit==3){print id[3],$col}}' R20291_RCd22_Soutourina_120723.csv | awk '{print $1"\t"$NF}' > MAPSprot/msctr_2.txt

We apply a differential expression analysis to these counts (in the activated conda environment):

conda activate Rbase4.3.1_ce
Rscript template_script_DESeq2_CL.r --projectName="RCd22-MAPS_MS" --targetFile="data/MassSpec.design4sartools" --rawDir="MAPSprot/" --varInt="group" --condRef="Crtl" 

Results: complete, up, and down tables of differentially total spectrum counts and the associated volcano-plot will be provided in the MAPSprot/tables and MAPSprot/figures repositories.

Reference

  • conda Conda documentation. https://docs.conda.io/en/latest/
  • micromamba Micromamba documentation. https://mamba.readthedocs.io/en/latest/
  • Psi-blast (v2.16.1) Altschul S.F., Madden T.L., Schäffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. Sep 1;25(17):3389-402. doi
  • CDD database Wang J., Chitsaz F., Derbyshire M.K., Gonzales N.R., Gwadz M., Lu S., Marchler G.H., Song J.S., Thanki N., Yamashita R.A., Yang M., Zhang D., Zheng C., Lanczycki C.J., Marchler-Bauer A. (2023) The conserved domain database in 2023. Nucleic Acids Res. Jan 6;51(D1):D384-D388. doi
  • snakemake Koster J., Rahmann S. (2018) Snakemake-a scalable bioinformatics workflow engine. Bioinformatics. 34:3600. doi readthedocs
  • fastqc (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
  • fastqScreen Wingett SW, Andrews S. (2018) FastQ Screen: A tool for multi-genome mapping and quality control. F1000Res. Aug 24;7:1338. doi github
  • fastp Chen S, Zhou Y, Chen Y, Gu J. (2018) fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. Sep 1;34(17):i884-i890. doi
  • bowtie2 Langmead B., Salzberg S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods. 9:357–359. doi
  • samtools Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. (2009) The sequence alignment/map format and SAMtools. Bioinformatics. 25:2078–2079. doi
  • subread Liao Y., Smyth G.K., Shi W. (2014) featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 30:923–930. doi subread
  • SARtools Varet H, Brillet-Guéguen L, Coppée JY, Dillies MA. (2016) SARTools: A DESeq2- and EdgeR-Based R Pipeline for Comprehensive Differential Analysis of RNA-Seq Data. PLoS One. Jun 9;11(6):e0157022. doi
  • DESeq2 Love M.I., Huber W., Anders S. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15:550. doi
  • R-enhancedVolcano Blighe K., Rana S., Lewis M. (2024) EnhancedVolcano: Publication-ready volcano plots with enhanced colouring and labeling. R package version 1.24.0, doi github

Citation

If you find these pages useful for your research, please cite the relevant paper

@article{saunier_abiF_2025,
  title = {Deciphering the RNA-based regulation mechanism of phage-encoded AbiF system in Clostridioides difficile},
  author = {Saunier, Marion and Humbert, Adeline and Kreis, Victor and Peltier, Johann and Tisba, Arianna and Auxilien, Sylvie and Blum, Marion and Caldelari, Isabelle and Lucier, Jean-François and Ueda, Joe and Gautheret, Daniel and Toffano-Nioche, Claire and Andreani, Jessica and Fortier, Louis-Charles and Soutourina, Olga},
  journal = {Bio},
  year = {2025},
  url = {https://www.biorxiv.org/content/10.1101/2025.04.15.648962}
}

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •  

Languages