Produces an executable call_features that calls features from different experiments. Initially used for 10x flex sample barcodes, it has been extended to support different methods suitable for perturb-seq and lineage feature calls. Works with process_features which assigns and error corrects barcodes and UMIs and maps features to cell barcodes (GEMs). Open source fast alternative to 10x Cellranger software.
Documentation:
- docs/Technical.md - Statistical deep dive into algorithms and parameterization
- docs/APPLY_ALL_SUMMARY.md - Quick reference for
--apply-allflag - docs/TESTING_COMPLETE.md - Testing results and validation
make clean # remove all executables and objects
make all # serial build, single-thread runtime
# optional: enable OpenMP (multi-thread EM)
make all OPENMP=1 The build produces call_features (streaming demux with optional EM)
and flex_demux_mtx (stand-alone FLEX implementation).
| Flag | Type | Default | Description |
|---|---|---|---|
--mtx-dir DIR |
required | – | Directory with matrix.mtx, features.tsv, barcodes.tsv, allowed_features.tsv (or features.txt in --process-features mode). |
--starsolo-dir DIR |
optional | – | STARsolo Gene directory that contains raw/ and filtered/ barcode lists. If not provided, filtered barcodes are searched in MTX directory. |
--out-prefix PFX |
required | – | Prefix for all output files, or directory path (if ends with / or is existing directory). |
--threads N |
int | 1 | Number of OpenMP threads. When compiled without OpenMP this flag is ignored. |
--help, -h |
switch | – | Print brief help with all flags and exit. |
--cell-list FILE |
optional | none | Replace STARsolo's filtered list with your own whitelist. |
--tau X |
double | 0.8 | Fraction of counts in top feature required for singlet. |
--delta X |
double | 0.4 | Minimum gap f₁ − f₂ between the two top features. |
--gamma X |
double | 0.9 | Dominance (f₁+f₂)/total threshold for doublet. |
--alpha X |
double | 1e-4 | Per-test error rate (Benjamini–Hochberg when FDR is on). |
--floor N |
int | 12 | Hard floor on total feature counts considered for classification. |
--ambient-q X |
double | 0.999 | High quantile used by the empirical ambient model. |
--no-fdr |
switch | off | Use raw p-values instead of FDR-corrected q-values. |
--m-min-fixed N |
int | unset | Bypass all ambient/statistical and cell-derived rules; use M_min = N for the low-support cutoff. |
--apply-all |
switch | off | Apply learned thresholds to all barcodes in the matrix (not just filtered list). See section 2.5 below. |
| Flag | Type | Default | Description |
|---|---|---|---|
--process-features |
switch | off | Treat features.txt as the canonical list and ignore allowed_features.* – useful for antibody capture where every feature is valid. |
--simple-assign |
switch | off | Heuristic one-pass assignment (ratio, min-count) that avoids statistical tests. |
--min-count N |
int | 2 | Minimum raw counts for the top feature in simple mode. |
--min-ratio R |
double | 2.0 | Required ratio f₁/f₂ to call a singlet in simple mode. |
| Flag | Type | Default | Notes |
|---|---|---|---|
--mmin-from-cells |
switch | off | Enable data-driven determination of low-support cutoff. |
--mmin-cells-method {em-cell} |
string | ||
| Initialisation for em-cells. | |||
--mmin3-floor N |
int | --floor |
Floor passed to em-cells internal negative binomials. |
| Flag | Type | Default | Rationale |
|---|---|---|---|
--use-em |
switch | off | Activate per-guide EM. |
--min-em-counts N |
int | 10 | Skip EM for guides whose total count across all cells is < N; a background-only fit is written instead. |
--em-fixed-disp |
switch | off | Keep NB dispersion parameters fixed during EM. |
--tau-pos |
double | 0.95 | Posterior threshold to call a positive cell for a guide (primary threshold). |
--tau-pos2 |
double | tau-pos − 0.05 |
Looser posterior for the 2nd guide. Helps rescue real doublets that have slightly weaker evidence for the second feature. |
--k-min |
int | 4 | Min raw counts for a guide to be considered present (primary). |
--k-min2 |
int | max(2,k-min-1) |
Min counts for the 2nd guide. Allows doublets with modest secondary counts. |
--gamma-min |
double | 0.8 | Dominance threshold (c1+c2)/total using all features. Ensures the two top guides dominate the library for a doublet. |
--gamma-min-cand |
double | 0.85 | Same dominance threshold but denominator is candidate-only counts. Captures true doublets in heavy ambient backgrounds where total dominance may be diluted. |
--doublet-balance |
0/1 | 1 | Require 20–80 % balance between the two guides. Set to 0 to keep highly unbalanced doublets. |
--debug-amb FILE |
path | – | Write CSV of ambiguous cells and rejection reasons – invaluable when tuning the above knobs. |
--k-small |
int | 4 | Minimum number of features required for EM convergence. |
Why two dominance thresholds? In clean data (c1+c2)/total ≥ gamma-min is robust. Under heavy ambient noise the candidate-only ratio is more reliable; the algorithm considers a cell a doublet if either condition passes.
Balance check (--doublet-balance): avoids misclassifying singlets with trace counts of a second guide as doublets. Disabling it loosens the call when barcode balance is not informative (e.g. targeted panels with skewed expression).
By default, call_features trains thresholds on the filtered barcode list (from --cell-list, MTX directory, or STARsolo) and classifies only those barcodes. The --apply-all flag extends this workflow:
- Training phase: Thresholds, ambient model, and EM parameters are learned using only the filtered barcodes (unchanged).
- Application phase: The learned model is then applied to every barcode in the matrix that passes quality filters (
M_min).
Key behaviors:
- Output files (
assignments.tsv,doublets.txt, etc.) contain results for both filtered and unfiltered barcodes - Barcodes already processed in the filtered set are skipped with a warning if duplicates are detected
- Quality filters (
M_min) still apply—barcodes with insufficient counts are excluded - For FLEX mode: Computes p-values vs ambient for each barcode and applies the same decision rules (
f1 ≥ tau,(f1-f2) ≥ delta,q < alpha) used in training. Uses raw p-values (no FDR recomputation across expanded set) - For EM mode: Uses stored EM fit parameters (a_bg, a_pos, r_bg, r_pos, pi_pos) to evaluate each guide, then applies the same candidate-counting and dominance logic as the main path
- For simple-assign: Trivially extends to all barcodes (no training phase)
Memory consideration: --apply-all allocates arrays sized to the full matrix (n_cols), which can be ~1.6M for large datasets. This is acceptable for correctness-first workflows.
When to use:
- You want to "rescue" barcodes that didn't make the filtered list but have clear feature assignments
- Running universe rescue workflows where the filtered list is conservative
- Exploratory analysis to see how many additional cells can be called
Example:
call_features \
--mtx-dir HTO/matrix \
--cell-list filtered_barcodes.tsv \
--out-prefix results/rescue \
--apply-all \
--tau 0.8 --delta 0.4 --gamma 0.9| Mode | Key idea | Strengths | Guardrails / when not to use |
|---|---|---|---|
| FLEX (default) | Binomial test vs ambient + dominance rules | Simple, fast, matches Cell Ranger HTO pipeline. | High ambient or low-MOI guide data will inflate false negatives. |
| EM | Per-guide NB2 mixture with learnt ambient & positive means | Handles overdispersion, great for low-count CRISPR screens. | Slow on >50 k guides; needs ≥ 2–3 counts in positives (k-min). |
| EM-fixed disp | Same but freeze dispersion | Stabilises fits in sparse data. | Slight under-fit when guides are highly expressed. |
| Simple assign | Pure ratio (min-ratio, min-count) |
Blazing fast, good for cell hashing with distinct HTOs. | No p-values/FDR; unsafe for noisy perturb-seq. |
| Cell-derived M_min EM-cell |
NB mixture (A/S/D components) | Best at separating ambient / singlet / doublet peaks. | More CPU; fails on <10 k cells. |
Standard cell hashing with hashtag oligonucleotides. Each cell should express 1-2 HTOs clearly above ambient.
call_features \
--mtx-dir HTO/matrix --starsolo-dir Sample/Solo.out/Gene \
--out-prefix results/hto_run1 \
--tau 0.8 --delta 0.4 --gamma 0.9 --floor 12If there are more HTOs than 4 the EM option is possible and may give more coverage – see scripts/runCallEm.sh for a more extensive, tuned example.
call_features \
--mtx-dir HTO/matrix --starsolo-dir Sample/Solo.out/Gene \
--out-prefix results/hto_run1 \
--tau 0.8 --delta 0.4 --gamma 0.9 --floor 12
--use-em
--em-fixed-dispCRISPR screens where most cells are expected to have a single guide. EM handles sparse counts and ambient contamination better than binomial tests.
call_features \
--mtx-dir Guides/matrix --starsolo-dir Sample/Solo.out/Gene \
--out-prefix results/perturb_EM \
--use-em --k-min 4 --tau-pos 0.95 --tau-pos2 0.9 \
--gamma-min 0.8 --gamma-min-cand 0.85 \
--em-fixed-disp \
--mmin-from-cells --mmin-cells-method model3 --mmin-cap 150Lineage tracing with hundreds of thousands of possible barcodes. Ambient noise tends to be much lower. As a result most cells have unique barcodes or a clearly dominant barcode, so simple ratio-based assignment is sufficient. The ratio can be adjusted. The default ratio seem to be a good compromise between coverage and noise. Alternatively, EM also works very well but is a bit slower with the larger number of barcodes.
call_features \
--mtx-dir LB/matrix --starsolo-dir Sample/Solo.out/Gene \
--out-prefix results/larry \
--simple-assign --min-count 4 --min-ratio 2 \
--m-min-fixed 4 # low counts are expectedCell Ranger uses at least two different demultiplexing engines, depending on the assay. We can approximate the behavior with the following paramters.
Cell Ranger uses binomial + dominance rules (FLEX) with:
- Poisson ambient tail
- Hard floor ≈ 10
- Singlet
τ = 0.9, doublet iff₁+f₂ ≥ 0.8
call_features \
--mtx-dir FLEX --starsolo-dir Sample/Solo.out/Gene \
--out-prefix results/cellRanger_FLEX \
--tau 0.9 --gamma 0.8 --floor 10 --no-fdrCell Ranger uses EM with Gaussian mixture on log-counts, hard read cut-off = 3, dispersion fixed: can use a slightly higher cutoff e.g. 6 to approximate lower CellRanger coverage.
call_features \
--mtx-dir Guides/matrix --starsolo-dir Sample/Solo.out/Gene \
--out-prefix results/cellRanger_CRISPR_like \
--use-em \
--m-min-fixed 3 \
--k-min 3 --k-min2 2 \
--tau-pos 0.9 --tau-pos2 0.85 \
--gamma-min 0.8 --gamma-min-cand 0.85 \
--em-fixed-disp| Aspect | Cell Ranger (Gaussian, log-space) | call_features (NB2, count-space) |
|---|---|---|
| Data input | log10(count + 1) |
integer counts |
| Tail behaviour | Symmetric; under-penalises zeros | Heavy-tailed; captures over-dispersion & excess zeros |
| Interpretation | Means/σ in log space | Direct counts ⇒ easier QC |
| Advantage | Very fast | Better fit for low UMIs; no log artifacts |
In practice we observe higher recall for low-abundance guides and cleaner ambient separation with the NB mixture, while still matching Cell Ranger’s hard floor and posterior thresholds.
-
Cell barcodes
• Load MTX barcodes (columns) into a hash.
• Load STARsolo raw and filtered barcodes.
• Optionally, replace the filtered set with--cell-listfile. -
Feature allow-list
Readallowed_features.tsv; map matrix rows → 1…K indices. In--process-featuresmode, all features fromfeatures.txtare considered allowed. -
Filtered barcode resolution
Search for filtered cell barcodes in priority order:
• Priority 1:--cell-list FILE(user-provided)
• Priority 2:filtered_barcodes.tsvorfiltered_barcodes.txtin MTX directory
• Priority 3:--starsolo-dir/filtered/barcodes.tsv(STARsolo fallback)
If none found, exit with guidance. -
Single scan of
matrix.mtx
• Ambient counts: features × negatives (raw - filtered barcodes).
• Per-cell feature vectors (only for barcodes present in the filtered set). -
Ambient model & thresholds
• Trimmed Poisson pre-filter on negative totals.
• EstimateM_ambient, compare with NB model if overdispersed.
• Statistical power thresholdM_statvia binomial tail on highest ambient proportion.
• FinalM_min = max(M_ambient, M_stat, floor)unless--m-min-fixedor--mmin-from-cellsoverride. -
Per-cell tests
• FLEX mode: For each cell, compute p-value of top feature (p1) and second (p2) vs ambient.
• EM mode: Fit per-guide NB mixture models, compute posterior probabilities.
• Simple mode: Direct ratio and count thresholds.
• FDR-correct (unless--no-fdr) →q1,q2. -
Classification
• Singlet iff1 ≥ τ,f1-f2 ≥ δ,q1 < α.
• Doublet iff1+f2 ≥ γ, balance check passes, and bothq1,q2 < α.
• Else Ambiguous.
• Cells belowM_min→ Low-support.
Outputs are written as five plain-text files:
.assignments.tsv- Main results with cell classifications.doublets.txt- List of doublet barcodes.ambiguous.txt- Ambiguous assignments.unassignable.txt- Low-support cells.missing_cells.txt- Cells in whitelist but not in MTX
The --out-prefix parameter supports two modes:
Prefix mode (traditional): --out-prefix myrun creates:
myrun.assignments.tsv,myrun.doublets.txt, etc.
Directory mode: --out-prefix results/ or --out-prefix results (if results exists) creates:
results/assignments.tsv,results/doublets.txt, etc.- The directory is automatically created if it doesn't exist.
The repository includes test scripts:
# Test individual methods
./scripts/testCallPerturb.sh {flex|em|em-fixed|simple|em-cell}
# Compare all methods side-by-side
./scripts/compareAllMethods.shThe comparison script generates a detailed table showing how different algorithms perform on the same dataset, helping you choose the best approach for your experiment type.
For a full flag reference and method details, see docs/Methods.md.
- Single-pass MTX – avoids large memory; scales to millions of barcodes.
- Robust ambient estimation – mixture of empirical and model-based thresholds copes with overdispersion.
- Multiple algorithms – from simple ratio-based to sophisticated EM mixture models.
- No external dependencies/fast compile – only glibc and
<math.h>.
Licensed under the MIT license (see LICENSE file).
Contributions welcome via pull request.