H5AD Subsetter

A Python tool to subset AnnData (.h5ad) files by selecting specific observation columns and genes. Supports both single file processing and high-performance batch processing with multiprocessing.

Features

  • Single file processing: Process individual H5AD files with full control
  • Batch processing: Process thousands of files in parallel with multiprocessing
  • Pattern matching: Use glob patterns like /data/*.h5ad to process multiple files
  • File list support: Process files listed in a text file
  • Categorical preservation: Keeps categorical data types and all categories, including unused ones (important for dataset shards)
  • Output validation: Validates output files before optionally deleting originals
  • Flexible input lists: Accepts comma-separated strings or text files for column and gene lists
  • Comprehensive logging and error handling
  • Progress tracking for batch operations

Requirements

  • Python 3.8+
  • AnnData >= 0.11.0 (required for categorical preservation)

Installation

pip install -r requirements.txt

Usage

Single File Mode

Process a single H5AD file (original functionality):

python subset_anndata.py input.h5ad output.h5ad \
    --obs-columns "cell_type,batch,donor_id" \
    --genes "GENE1,GENE2,GENE3"

Batch Processing Mode

Process multiple files using pattern matching or file lists:

Process All Files Matching a Pattern

# Process all .h5ad files in a directory
python subset_anndata.py \
    --input-pattern "/data/*.h5ad" \
    --output-dir "/output" \
    --obs-columns "cell_type,batch" \
    --genes "GENE1,GENE2"

# Process files with brace expansion (like bash)
python subset_anndata.py \
    --input-pattern "/data/extract_{0..100}.h5ad" \
    --output-dir "/output" \
    --obs-columns obs_columns.txt \
    --genes genes.txt \
    --parallel 12

# Multiple brace expansions
python subset_anndata.py \
    --input-pattern "/data/shard_{001..010}_{a,b,c}.h5ad" \
    --output-dir "/output" \
    --obs-columns "cell_type,batch" \
    --genes "GENE1,GENE2"

Supported Brace Patterns:

  • Numeric ranges: {0..10} expands to 0, 1, 2, ..., 10
  • Zero-padded ranges: {001..010} expands to 001, 002, ..., 010
  • Character ranges: {a..e} expands to a, b, c, d, e
  • Comma-separated lists: {a,b,c} expands to a, b, c
  • Multiple expansions: {0..2}_{a,b} expands to 0_a, 0_b, 1_a, 1_b, 2_a, 2_b
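
A brace-expansion pass along these lines would produce the expansions above (a simplified Python sketch; the script's actual parser may differ):

import re

def expand_braces(pattern):
    # Expand the first {...} group, then recurse on the remainder
    match = re.search(r"\{([^{}]+)\}", pattern)
    if match is None:
        return [pattern]
    body = match.group(1)
    if ".." in body:
        start, end = body.split("..")
        if start.isdigit() and end.isdigit():
            # Zero-padded if the start value has leading zeros
            width = len(start) if start.startswith("0") and len(start) > 1 else 0
            values = [str(i).zfill(width) for i in range(int(start), int(end) + 1)]
        else:
            # Single-character range like {a..e}
            values = [chr(c) for c in range(ord(start), ord(end) + 1)]
    else:
        values = body.split(",")
    head, tail = pattern[:match.start()], pattern[match.end():]
    return [result for v in values for result in expand_braces(head + v + tail)]

expand_braces("/data/shard_{001..002}_{a,b}.h5ad")
# -> ['/data/shard_001_a.h5ad', '/data/shard_001_b.h5ad',
#     '/data/shard_002_a.h5ad', '/data/shard_002_b.h5ad']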

Process Files from a List

# Create a file with your input files
echo -e "/data/file1.h5ad\n/data/file2.h5ad\n/data/file3.h5ad" > input_files.txt

python subset_anndata.py \
    --input-list input_files.txt \
    --output-dir "/output" \
    --obs-columns "cell_type,batch" \
    --genes "GENE1,GENE2" \
    --parallel 8


Additional Options

Using Files for Input Lists

# Create files with your column/gene lists
echo -e "cell_type\nbatch\ndonor_id" > obs_columns.txt
echo -e "GENE1\nGENE2\nGENE3" > genes.txt

# Single file mode
python subset_anndata.py input.h5ad output.h5ad \
    --obs-columns obs_columns.txt \
    --genes genes.txt

# Batch mode
python subset_anndata.py \
    --input-pattern "/data/*.h5ad" \
    --output-dir "/output" \
    --obs-columns obs_columns.txt \
    --genes genes.txt
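
The dual interpretation of --obs-columns and --genes (comma-separated string vs. path to a file) could be resolved as follows (a sketch under the assumption that an existing file path takes precedence; the script's actual rule may differ):

import os

def parse_list_arg(value):
    # If the argument names an existing file, read one item per line;
    # otherwise treat it as a comma-separated list. Illustrative only.
    if os.path.isfile(value):
        with open(value) as f:
            return [line.strip() for line in f if line.strip()]
    return [item.strip() for item in value.split(",")]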

Delete Originals and Custom Output Suffix

# Single file mode
python subset_anndata.py input.h5ad output.h5ad \
    --obs-columns "cell_type,batch" \
    --genes "GENE1,GENE2" \
    --delete-original

# Batch mode with custom suffix
python subset_anndata.py \
    --input-pattern "/data/*.h5ad" \
    --output-dir "/output" \
    --obs-columns "cell_type,batch" \
    --genes "GENE1,GENE2" \
    --output-suffix "_filtered" \
    --delete-original

Retain Layers and Verbose Logging

python subset_anndata.py input.h5ad output.h5ad \
    --obs-columns "cell_type,batch" \
    --genes "GENE1,GENE2" \
    --retain-layers \
    --verbose

Command Line Options

Single File Mode

  • INPUT_FILE: Path to input H5AD file (required)
  • OUTPUT_FILE: Path for output H5AD file (required)

Batch Processing Mode

  • --input-pattern, -p: Glob pattern for multiple input files
  • --input-list, -l: File containing list of input files (one per line)
  • --output-dir: Output directory for processed files (required for batch mode)
  • --output-suffix, -s: Suffix for output files (default: "_subset")
  • --parallel, -j: Number of parallel processes (default: auto-detected as 75% of available CPU cores)
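
The auto-detected default for --parallel might be computed like this (a minimal sketch matching the documented "75% of cores" default, not necessarily the script's exact code):

import os

# Default worker count: 75% of available CPU cores, at least one.
# Illustrative only; the tool's internal computation may differ.
default_workers = max(1, int((os.cpu_count() or 1) * 0.75))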

Common Options

  • --obs-columns, -o: Comma-separated obs column names or path to file (required)
  • --genes, -g: Comma-separated gene names or path to file (required)
  • --delete-original, -d: Delete original file(s) after successful processing
  • --retain-layers, -r: Retain layers slot in output (default: layers are deleted)
  • --verbose, -v: Enable verbose logging

Safety Features

  1. Input validation: Checks that all specified columns and genes exist
  2. Output validation: Verifies the output file was saved correctly
  3. Categorical preservation: Ensures categorical columns keep all categories, including unused ones (see the sketch after this list)
  4. Safe deletion: Only deletes original file if validation passes
  5. Error handling: Comprehensive error messages and logging
  6. Slot cleanup: Removes unnecessary AnnData slots (layers, uns, obsm, varm, obsp, varp, raw) to reduce file size
  7. Parallel processing: Robust multiprocessing with error isolation per file
  8. Progress tracking: Real-time progress bars for batch operations
  9. Memory management: Each worker holds only one file in memory at a time, so large datasets can be processed efficiently
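
For example, the categorical preservation in item 3 can be expressed with pandas: a minimal sketch that re-applies the original category set after subsetting (the helper is illustrative, not the tool's exact code):

import pandas as pd

def restore_categories(subset_obs, original_obs):
    # Re-apply each column's full category set from the original table,
    # so categories unused in this shard survive the subset.
    # Illustrative helper; AnnData >= 0.11.0 is needed for the
    # categories to round-trip on write.
    for col in subset_obs.columns:
        orig = original_obs[col]
        if isinstance(orig.dtype, pd.CategoricalDtype):
            subset_obs[col] = pd.Categorical(
                subset_obs[col], categories=orig.cat.categories
            )
    return subset_obs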

Performance

Multiprocessing Benefits

  • Automatic core detection: Uses 75% of available CPU cores by default
  • Scalable: Processes thousands of files efficiently
  • Memory efficient: Each worker processes one file at a time
  • Error isolation: Failed files don't stop processing of other files
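
Per-file error isolation might look like this (a sketch; subset_file is a hypothetical placeholder for the real per-file routine, not the script's actual API):

from multiprocessing import Pool

def subset_file(path):
    # Placeholder for the real per-file subsetting step (hypothetical)
    ...

def process_one(path):
    # Return (path, error) instead of raising, so one bad file
    # never takes down the whole batch. Illustrative worker.
    try:
        subset_file(path)
        return path, None
    except Exception as exc:
        return path, str(exc)

def run_batch(paths, workers):
    # imap_unordered yields results as workers finish, which also
    # makes progress reporting straightforward
    with Pool(processes=workers) as pool:
        for path, error in pool.imap_unordered(process_one, paths):
            print(f"{path}: {'FAILED: ' + error if error else 'ok'}")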

Example Performance (n1-standard-16 GCP machine)

  • 4440 files: ~53 minutes with 12 parallel workers (vs 12+ hours sequential)
  • Processing time: ~10 seconds per file on average
  • Memory usage: ~1GB per worker process

Example Workflows

Large Dataset Processing

# Process a large dataset shard with parallel workers
python subset_anndata.py \
    --input-pattern "/data/shard_*.h5ad" \
    --output-dir "/processed" \
    --obs-columns "cell_type,tissue,donor_id,batch" \
    --genes genes_of_interest.txt \
    --parallel 12 \
    --delete-original \
    --verbose

Processing a Specific File List

# Create file list for specific files
find /data -name "extract_*.h5ad" | head -100 > files_to_process.txt

python subset_anndata.py \
    --input-list files_to_process.txt \
    --output-dir "/output" \
    --obs-columns obs_cols.txt \
    --genes genes.txt \
    --output-suffix "_subset" \
    --parallel 8

This will:

  1. Load and validate each input file
  2. Subset to only the specified obs columns and genes
  3. Save the processed files to the output directory
  4. Validate output file integrity (see the sketch below)
  5. Optionally delete original files if everything succeeded
  6. Provide detailed logging and progress tracking throughout the process
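
Step 4's output validation could be as simple as re-opening the written file and checking it against what was requested (a sketch; the script's actual checks may be more thorough):

import anndata as ad

def validate_output(output_path, expected_genes, expected_obs_columns):
    # Re-open the written file and confirm the subset matches the request.
    # Illustrative check, not necessarily the tool's exact validation.
    result = ad.read_h5ad(output_path)
    assert set(result.var_names) == set(expected_genes), "gene set mismatch"
    assert list(result.obs.columns) == list(expected_obs_columns), "obs columns mismatch"
    return True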
