Scripts and results for "ConNIS and labeling instability: new statistical methods for improving the detection of essential genes in TraDIS libraries"
(by Hanke, M., Harten, T. and Foraita, R. 2025)
R scripts and performance results are based on:
- 160 synthetic data settings (for method comparison)
- 4 semi-synthetic data settings (for method comparison)
- 3 real world data settings (for method comparison)
- 3 real world and 3 synthetic data examples (for evaluation of tuning parameter selection)
Seeds have been used for all simulations and subsample drawings.
An interactive web app with all results is available at https://connis.bips.eu.
For an implementation of ConNIS and the instability approach as an R package see https://github.com/bips-hb/ConNIS.
All simulations and analyses were run on a 64 core workstation. R >= 4.3.0 and the following packages are required:
- parallel (base R)
- tidyverse (CRAN)
- MASS (CRAN)
- insdens (https://github.com/Kevin-walters/insdens)
- gmp (CRAN)
- ggpubr (CRAN)
- cowplot (CRAN)
- ggridges (CRAN)
- ggh4x (CRAN)
- readr (CRAN)
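Before running any of the scripts, it can help to check that the R version and the packages listed above are available. The following is a minimal sketch, not part of the repository; the variable names are illustrative, and the install hint in the comment assumes the `remotes` package.

```r
# Packages listed above; insdens comes from GitHub, the others from CRAN
# (parallel ships with base R).
required <- c("parallel", "tidyverse", "MASS", "insdens", "gmp",
              "ggpubr", "cowplot", "ggridges", "ggh4x", "readr")

# requireNamespace() returns FALSE for a missing package instead of failing.
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]

if (getRversion() < "4.3.0") {
  warning("R >= 4.3.0 is required, found ", as.character(getRversion()))
}

if (length(missing_pkgs) > 0) {
  # insdens could be installed with, e.g.,
  # remotes::install_github("Kevin-walters/insdens")
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are available.")
}
```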
ConNIS_results/
├──14028s_data/
├──MG1655_data/
├──bw25113_data/
├──data_for_synthetic_data_generation/
├──performance/
├──plots/
├──results/
├──simulatedData/
├──tmpData/
dataSimulation.R
example_<simu>.R
functions.R
generate_stability_tables.R
<method>Analysis.R
performanceAnalysis<Method>.R
plot_<...>.R
realworld_<strain>.R
semi_synthetic.R
simulations.R
stabilities_<real_world_or_synthetic>.R
- 14028s_data/, MG1655_data/ and bw25113_data/ contain real world data for the analyses; all data have been cloned from their publicly available original sources (see Hanke et al., 2025, for references)
- data_for_synthetic_data_generation/ contains E. coli data used as reference for the synthetic data generation in dataSimulation.R
- performance/ contains the performances of all methods for the synthetic, semi-synthetic and real world data analyses, as computed by performanceAnalysis<Method>.R
- plots/ contains plots generated with plots_<...>.R based on the performances in performance/
- results/ stores the analysis results of <method>Analysis.R; it is read by performanceAnalysis<Method>.R
- simulatedData/ stores the synthetic data generated by dataSimulation.R; it is read by <method>Analysis.R
- tmpData/ stores intermediate results of the MCMC part of the InsDens method
- dataSimulation.R generates synthetic data based on the parameters set in simulations.R
- example_<simu>.R generates results and performances for three synthetic data sets; it is needed to evaluate the performance of the (in)stability approach run by stabilities_<real_world_or_synthetic>.R
- functions.R contains all methods and functions (with the exception of InsDens)
- generate_stability_tables.R generates a table with the performance of the (in)stability approach; it uses the results of example_<simu>.R, realworld_<strain>.R and stabilities_<real_world_or_synthetic>.R
- <method>Analysis.R applies one of the following methods to the synthetic data: Binomial (binomial, implementation of TSAS 2.0, Burger, 2017), ConNIS (connis, Hanke et al., 2025), Exp. vs. Gamma (expvsgamma, implementation of Bio-TraDIS, Barquist, 2016), Geometric (geometric, Goodall, 2023), InsDens (insdens, Nlebedim, 2021) and Tn5Gaps (tn5gaps, implementation of TRANSIT, DeJesus, 2015); <method> can be set to the identifier given in parentheses
- performanceAnalysis<Method>.R gives the performances for Binomial, ConNIS, ExpVsGamma, Geometric, InsDens and Tn5Gaps based on the results of <method>Analysis.R
- plot_<...>.R generates plots for the (semi-)synthetic and real world data analyses
- realworld_<strain>.R, with <strain> being either 14028S, bw25113 or mg1655, analyzes real world data with all 6 methods; performances are saved in performance/
- semi_synthetic.R runs the full semi-synthetic data analysis on data from Goodall, 2018
- simulations.R is the main script to run the simulation study for synthetic data
- stabilities_<real_world_or_synthetic>.R runs the (in)stability approach for all 6 methods
The default values for the simulation study are described in Hanke et al., 2025. To (re-)run the simulation study, call simulations.R. It sets the parameters of the simulation study and the number of workers for the parallel computation using parLapply. It then calls three types of scripts via a loop structure over the different parameters:
- dataSimulation.R for generating the synthetic data
- <method>Analysis.R for the analysis of the data
- performanceAnalysis<Method>.R for the performance of the chosen <method>
(the heavy work is done in <method>Analysis.R by parallel computation)
❗ While the number of workers is set in simulations.R, the cluster type is set to PSOCK in all <method>Analysis.R scripts. If you want to use a fork approach for parallelization (e.g. mclapply or makeForkCluster), the scripts have to be modified individually.
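For orientation, the PSOCK setup with parLapply mentioned above can be sketched as follows. This is an illustration only and not taken from the repository scripts: `analyse_one`, `n_workers`, the toy data and the seed value are assumptions; the functions from the `parallel` package are used as documented.

```r
library(parallel)

# Hypothetical stand-in for the per-dataset work done in <method>Analysis.R.
analyse_one <- function(i) {
  x <- rbinom(100, size = 1, prob = 0.3)  # toy data, not TraDIS data
  mean(x)
}

n_workers <- 2  # assumption; choose according to your machine

# PSOCK: workers are fresh R sessions, available on all platforms. Objects
# and packages needed on the workers have to be exported explicitly.
cl <- makeCluster(n_workers, type = "PSOCK")
clusterSetRNGStream(cl, iseed = 123)  # reproducible parallel random numbers
res_psock <- parLapply(cl, 1:8, analyse_one)
stopCluster(cl)

# Fork alternative (Unix-alikes only): workers inherit the parent session.
# cl <- makeForkCluster(n_workers)
# res_fork <- parLapply(cl, 1:8, analyse_one)
# stopCluster(cl)
```

A fork cluster avoids the explicit export step but is not available on Windows, which is one reason a script may default to PSOCK.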
To (re-)run the real world data analyses, run the scripts realworld_<strain>.R. Performances will be saved in performance/.
To (re-)run the semi-synthetic data analysis, run the script semi_synthetic.R. Performances will be saved in performance/.
❗ Uses mclapply and at most detectCores()-1 workers.
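The worker setting described above roughly corresponds to the following pattern. It is a sketch with a hypothetical toy task (`toy_task`), not code from semi_synthetic.R; the Windows fallback is an addition for portability, since mclapply relies on forking.

```r
library(parallel)

# detectCores() can return NA; fall back to a single worker in that case.
n_cores <- detectCores()
n_workers <- if (is.na(n_cores)) 1L else max(1L, n_cores - 1L)

# mclapply() relies on forking, which Windows does not support:
# there, mc.cores has to be 1.
if (.Platform$OS.type == "windows") n_workers <- 1L

# Hypothetical toy task standing in for one analysis run.
toy_task <- function(i) sum(seq_len(i))

res <- mclapply(1:10, toy_task, mc.cores = n_workers)
```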
First, calculate the instability values for each parameter/weight/threshold of the six methods by running stabilities_<real_world_or_synthetic>.R (for the three real world and the three synthetic datasets). Next, run realworld_<strain>.R (if you have not done so already) and example_<simu>.R to obtain the results for the three real world datasets and the three synthetic examples. These are used to evaluate the performance of the (in)stability approach. Finally, run generate_stability_tables.R to generate a CSV file under R/ with the performances of the instability approach.
❗ Uses PSOCK for parallelization.
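To give a rough idea of what "one instability value per tuning parameter value" can look like, here is a generic subsampling sketch. Everything in it is an assumption made for illustration: the toy data, the labeling rule `label_genes`, and the summary 2p(1-p) (a common choice in subsampling-based stability methods). It is not the definition used in Hanke et al., 2025, and it does not reproduce stabilities_<real_world_or_synthetic>.R.

```r
set.seed(1)  # a seed makes the subsample drawings reproducible

# Toy data (assumption): 50 genes of 1000 bp each, 2000 random insertion sites.
n_genes <- 50
gene_len <- 1000
genome_len <- n_genes * gene_len
ins_sites <- sort(sample.int(genome_len, 2000))

# Hypothetical labeling rule: a gene is labeled "essential" if it carries at
# most `threshold` insertions. This is NOT one of the six methods.
label_genes <- function(sites, threshold) {
  gene_id <- factor(ceiling(sites / gene_len), levels = seq_len(n_genes))
  as.integer(table(gene_id)) <= threshold
}

# Instability for one threshold: mean over genes of 2 * p * (1 - p), where p
# is the fraction of subsamples in which the gene is labeled essential.
instability <- function(threshold, n_sub = 100, frac = 0.5) {
  labels <- replicate(n_sub, {
    sub <- sample(ins_sites, size = floor(frac * length(ins_sites)))
    label_genes(sub, threshold)
  })
  p <- rowMeans(labels)  # labels is an n_genes x n_sub logical matrix
  mean(2 * p * (1 - p))
}

thresholds <- c(5, 10, 15, 20, 25, 30)
inst <- vapply(thresholds, instability, numeric(1))
data.frame(threshold = thresholds, instability = inst)
```

In this toy setting, a value near 0 means that the labels barely change between subsamples, while larger values indicate that many genes switch labels.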
Simply run the script plots_for_paper.R to generate all plots under R/plots.
- Barquist, L. et al. The TraDIS toolkit: sequencing and analysis for dense transposon mutant libraries. Bioinformatics 32, 1109–1111 (2016). URL http://dx.doi.org/10.1093/bioinformatics/btw022
- Burger, B. T., Imam, S., Scarborough, M. J., Noguera, D. R. & Donohue, T. J. Combining genome-scale experimental and computational methods to identify essential genes in Rhodobacter sphaeroides. mSystems 2 (2017). URL http://dx.doi.org/10.1128/msystems.00015-17
- DeJesus, M. A., Ambadipudi, C., Baker, R., Sassetti, C. & Ioerger, T. R. TRANSIT - a software tool for Himar1 TnSeq analysis. PLOS Computational Biology 11, e1004401 (2015). URL http://dx.doi.org/10.1371/journal.pcbi.1004401
- Goodall, E. C. A. et al. The essential genome of Escherichia coli K-12. mBio 9 (2018). URL http://dx.doi.org/10.1128/mBio.02096-17
- Goodall, E. C. A. et al. A multiomic approach to defining the essential genome of the globally important pathogen Corynebacterium diphtheriae. PLOS Genetics 19, e1010737 (2023). URL http://dx.doi.org/10.1371/journal.pgen.1010737
- Nlebedim, V. U., Chaudhuri, R. R. & Walters, K. Probabilistic identification of bacterial essential genes via insertion density using TraDIS data with Tn5 libraries. Bioinformatics 37, 4343–4349 (2021). URL http://dx.doi.org/10.1093/bioinformatics/btab508