H5AD Subsetter

A Python tool to subset AnnData (.h5ad) files by selecting specific observation columns and genes. Supports both single file processing and high-performance batch processing with multiprocessing.

Features

  • Single file processing: Process individual H5AD files with full control
  • Batch processing: Process thousands of files in parallel with multiprocessing
  • Pattern matching: Use glob patterns like /data/*.h5ad to process multiple files
  • File list support: Process files listed in a text file
  • Categorical preservation: Keeps categorical data types and all categories, including unused ones (important for dataset shards)
  • Output validation: Validates output files before optionally deleting originals
  • Flexible input lists: Accepts comma-separated strings or text files for column and gene lists
  • Comprehensive logging and error handling
  • Progress tracking for batch operations

Requirements

  • Python 3.8+
  • AnnData >= 0.11.0 (required for categorical preservation)

Installation

pip install -r requirements.txt

Usage

Single File Mode

Process a single H5AD file (original functionality):

python subset_anndata.py input.h5ad output.h5ad \
    --obs-columns "cell_type,batch,donor_id" \
    --genes "GENE1,GENE2,GENE3"

Batch Processing Mode

Process multiple files using pattern matching or file lists:

Process All Files Matching a Pattern

# Process all .h5ad files in a directory
python subset_anndata.py \
    --input-pattern "/data/*.h5ad" \
    --output-dir "/output" \
    --obs-columns "cell_type,batch" \
    --genes "GENE1,GENE2"

# Process files with brace expansion (like bash)
python subset_anndata.py \
    --input-pattern "/data/extract_{0..100}.h5ad" \
    --output-dir "/output" \
    --obs-columns obs_columns.txt \
    --genes genes.txt \
    --parallel 12

# Multiple brace expansions
python subset_anndata.py \
    --input-pattern "/data/shard_{001..010}_{a,b,c}.h5ad" \
    --output-dir "/output" \
    --obs-columns "cell_type,batch" \
    --genes "GENE1,GENE2"

Supported Brace Patterns:

  • Numeric ranges: {0..10} expands to 0, 1, 2, ..., 10
  • Zero-padded ranges: {001..010} expands to 001, 002, ..., 010
  • Character ranges: {a..e} expands to a, b, c, d, e
  • Comma-separated lists: {a,b,c} expands to a, b, c
  • Multiple expansions: {0..2}_{a,b} expands to 0_a, 0_b, 1_a, 1_b, 2_a, 2_b
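
A brace-expansion pass along these lines would produce the expansions above (a simplified Python sketch; the script's actual parser may differ):

import re

def expand_braces(pattern):
    # Expand the first {...} group, then recurse on the remainder
    match = re.search(r"\{([^{}]+)\}", pattern)
    if match is None:
        return [pattern]
    body = match.group(1)
    if ".." in body:
        start, end = body.split("..")
        if start.isdigit() and end.isdigit():
            # Zero-padded if the start value has leading zeros
            width = len(start) if start.startswith("0") and len(start) > 1 else 0
            values = [str(i).zfill(width) for i in range(int(start), int(end) + 1)]
        else:
            # Single-character range like {a..e}
            values = [chr(c) for c in range(ord(start), ord(end) + 1)]
    else:
        values = body.split(",")
    head, tail = pattern[:match.start()], pattern[match.end():]
    return [result for v in values for result in expand_braces(head + v + tail)]

expand_braces("/data/shard_{001..002}_{a,b}.h5ad")
# -> ['/data/shard_001_a.h5ad', '/data/shard_001_b.h5ad',
#     '/data/shard_002_a.h5ad', '/data/shard_002_b.h5ad']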

Process Files from a List

# Create a file with your input files
echo -e "/data/file1.h5ad\n/data/file2.h5ad\n/data/file3.h5ad" > input_files.txt

python subset_anndata.py \
    --input-list input_files.txt \
    --output-dir "/output" \
    --obs-columns "cell_type,batch" \
    --genes "GENE1,GENE2" \
    --parallel 8


Additional Options

Using Files for Input Lists

# Create files with your column/gene lists
echo -e "cell_type\nbatch\ndonor_id" > obs_columns.txt
echo -e "GENE1\nGENE2\nGENE3" > genes.txt

# Single file mode
python subset_anndata.py input.h5ad output.h5ad \
    --obs-columns obs_columns.txt \
    --genes genes.txt

# Batch mode
python subset_anndata.py \
    --input-pattern "/data/*.h5ad" \
    --output-dir "/output" \
    --obs-columns obs_columns.txt \
    --genes genes.txt
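
The dual interpretation of --obs-columns and --genes (comma-separated string vs. path to a file) could be resolved as follows (a sketch under the assumption that an existing file path takes precedence; the script's actual rule may differ):

import os

def parse_list_arg(value):
    # If the argument names an existing file, read one item per line;
    # otherwise treat it as a comma-separated list. Illustrative only.
    if os.path.isfile(value):
        with open(value) as f:
            return [line.strip() for line in f if line.strip()]
    return [item.strip() for item in value.split(",")]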

Delete Originals and Custom Output Suffix

# Single file mode
python subset_anndata.py input.h5ad output.h5ad \
    --obs-columns "cell_type,batch" \
    --genes "GENE1,GENE2" \
    --delete-original

# Batch mode with custom suffix
python subset_anndata.py \
    --input-pattern "/data/*.h5ad" \
    --output-dir "/output" \
    --obs-columns "cell_type,batch" \
    --genes "GENE1,GENE2" \
    --output-suffix "_filtered" \
    --delete-original

Retain Layers and Verbose Logging

python subset_anndata.py input.h5ad output.h5ad \
    --obs-columns "cell_type,batch" \
    --genes "GENE1,GENE2" \
    --retain-layers \
    --verbose

Command Line Options

Single File Mode

  • INPUT_FILE: Path to input H5AD file (required)
  • OUTPUT_FILE: Path for output H5AD file (required)

Batch Processing Mode

  • --input-pattern, -p: Glob pattern for multiple input files
  • --input-list, -l: File containing list of input files (one per line)
  • --output-dir: Output directory for processed files (required for batch mode)
  • --output-suffix, -s: Suffix for output files (default: "_subset")
  • --parallel, -j: Number of parallel processes (default: auto-detected as 75% of available CPU cores)
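
The auto-detected default for --parallel might be computed like this (a minimal sketch matching the documented "75% of cores" default, not necessarily the script's exact code):

import os

# Default worker count: 75% of available CPU cores, at least one.
# Illustrative only; the tool's internal computation may differ.
default_workers = max(1, int((os.cpu_count() or 1) * 0.75))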

Common Options

  • --obs-columns, -o: Comma-separated obs column names or path to file (required)
  • --genes, -g: Comma-separated gene names or path to file (required)
  • --delete-original, -d: Delete original file(s) after successful processing
  • --retain-layers, -r: Retain layers slot in output (default: layers are deleted)
  • --verbose, -v: Enable verbose logging

Safety Features

  1. Input validation: Checks that all specified columns and genes exist
  2. Output validation: Verifies the output file was saved correctly
  3. Categorical preservation: Ensures categorical columns keep all categories, including unused ones (see the sketch after this list)
  4. Safe deletion: Only deletes original file if validation passes
  5. Error handling: Comprehensive error messages and logging
  6. Slot cleanup: Removes unnecessary AnnData slots (layers, uns, obsm, varm, obsp, varp, raw) to reduce file size
  7. Parallel processing: Robust multiprocessing with error isolation per file
  8. Progress tracking: Real-time progress bars for batch operations
  9. Memory management: Each worker holds only one file in memory at a time, so large datasets can be processed efficiently
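
For example, the categorical preservation in item 3 can be expressed with pandas: a minimal sketch that re-applies the original category set after subsetting (the helper is illustrative, not the tool's exact code):

import pandas as pd

def restore_categories(subset_obs, original_obs):
    # Re-apply each column's full category set from the original table,
    # so categories unused in this shard survive the subset.
    # Illustrative helper; AnnData >= 0.11.0 is needed for the
    # categories to round-trip on write.
    for col in subset_obs.columns:
        orig = original_obs[col]
        if isinstance(orig.dtype, pd.CategoricalDtype):
            subset_obs[col] = pd.Categorical(
                subset_obs[col], categories=orig.cat.categories
            )
    return subset_obs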

Performance

Multiprocessing Benefits

  • Automatic core detection: Uses 75% of available CPU cores by default
  • Scalable: Processes thousands of files efficiently
  • Memory efficient: Each worker processes one file at a time
  • Error isolation: Failed files don't stop processing of other files
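
Per-file error isolation might look like this (a sketch; subset_file is a hypothetical placeholder for the real per-file routine, not the script's actual API):

from multiprocessing import Pool

def subset_file(path):
    # Placeholder for the real per-file subsetting step (hypothetical)
    ...

def process_one(path):
    # Return (path, error) instead of raising, so one bad file
    # never takes down the whole batch. Illustrative worker.
    try:
        subset_file(path)
        return path, None
    except Exception as exc:
        return path, str(exc)

def run_batch(paths, workers):
    # imap_unordered yields results as workers finish, which also
    # makes progress reporting straightforward
    with Pool(processes=workers) as pool:
        for path, error in pool.imap_unordered(process_one, paths):
            print(f"{path}: {'FAILED: ' + error if error else 'ok'}")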

Example Performance (n1-standard-16 GCP machine)

  • 4440 files: ~53 minutes with 12 parallel workers (vs 12+ hours sequential)
  • Processing time: ~10 seconds per file on average
  • Memory usage: ~1GB per worker process

Example Workflows

Large Dataset Processing

# Process a large dataset shard with parallel workers
python subset_anndata.py \
    --input-pattern "/data/shard_*.h5ad" \
    --output-dir "/processed" \
    --obs-columns "cell_type,tissue,donor_id,batch" \
    --genes genes_of_interest.txt \
    --parallel 12 \
    --delete-original \
    --verbose

Processing a Specific File List

# Create file list for specific files
find /data -name "extract_*.h5ad" | head -100 > files_to_process.txt

python subset_anndata.py \
    --input-list files_to_process.txt \
    --output-dir "/output" \
    --obs-columns obs_cols.txt \
    --genes genes.txt \
    --output-suffix "_subset" \
    --parallel 8

This will:

  1. Load and validate each input file
  2. Subset to only the specified obs columns and genes
  3. Save the processed files to the output directory
  4. Validate output file integrity (see the sketch below)
  5. Optionally delete original files if everything succeeded
  6. Provide detailed logging and progress tracking throughout the process
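
Step 4's output validation could be as simple as re-opening the written file and checking it against what was requested (a sketch; the script's actual checks may be more thorough):

import anndata as ad

def validate_output(output_path, expected_genes, expected_obs_columns):
    # Re-open the written file and confirm the subset matches the request.
    # Illustrative check, not necessarily the tool's exact validation.
    result = ad.read_h5ad(output_path)
    assert set(result.var_names) == set(expected_genes), "gene set mismatch"
    assert list(result.obs.columns) == list(expected_obs_columns), "obs columns mismatch"
    return True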
