A Python tool to subset AnnData (.h5ad) files by selecting specific observation columns and genes. Supports both single file processing and high-performance batch processing with multiprocessing.
- Single file processing: Process individual H5AD files with full control
- Batch processing: Process thousands of files in parallel with multiprocessing
- Pattern matching: Use glob patterns like `/data/*.h5ad` to process multiple files
- File list support: Process files listed in a text file
- Categorical preservation: Keeps categorical dtypes and their full category sets, including unused categories (important for keeping dataset shards schema-compatible)
- Safe deletion: Validates output files before optionally deleting the originals
- Flexible inputs: Accepts column/gene lists as comma-separated strings or as files
- Comprehensive logging and error handling
- Progress tracking for batch operations
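The categorical-preservation behavior can be sketched with plain pandas (the function name is illustrative, not the tool's actual API): after subsetting, each categorical column is explicitly reset to its original category list so that unused categories survive.

```python
import pandas as pd

def subset_preserving_categories(obs: pd.DataFrame, rows) -> pd.DataFrame:
    """Subset rows of an obs table while keeping full category sets.

    Remember each categorical column's original categories, subset,
    then restore the category lists so unused categories are not lost
    and shards stay schema-compatible with one another.
    """
    original_cats = {
        col: obs[col].cat.categories
        for col in obs.columns
        if isinstance(obs[col].dtype, pd.CategoricalDtype)
    }
    sub = obs.loc[rows].copy()
    for col, cats in original_cats.items():
        sub[col] = sub[col].cat.set_categories(cats)
    return sub
```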
- Python 3.8+
- AnnData >= 0.11.0 (required for categorical preservation)
```bash
pip install -r requirements.txt
```
Process a single H5AD file (original functionality):
```bash
python subset_anndata.py input.h5ad output.h5ad \
    --obs-columns "cell_type,batch,donor_id" \
    --genes "GENE1,GENE2,GENE3"
```
Process multiple files using pattern matching or file lists:
```bash
# Process all .h5ad files in a directory
python subset_anndata.py \
    --input-pattern "/data/*.h5ad" \
    --output-dir "/output" \
    --obs-columns "cell_type,batch" \
    --genes "GENE1,GENE2"

# Process files with brace expansion (like bash)
python subset_anndata.py \
    --input-pattern "/data/extract_{0..100}.h5ad" \
    --output-dir "/output" \
    --obs-columns obs_columns.txt \
    --genes genes.txt \
    --parallel 12

# Multiple brace expansions
python subset_anndata.py \
    --input-pattern "/data/shard_{001..010}_{a,b,c}.h5ad" \
    --output-dir "/output" \
    --obs-columns "cell_type,batch" \
    --genes "GENE1,GENE2"
```
Supported Brace Patterns:
- Numeric ranges: `{0..10}` expands to 0, 1, 2, ..., 10
- Zero-padded ranges: `{001..010}` expands to 001, 002, ..., 010
- Character ranges: `{a..e}` expands to a, b, c, d, e
- Comma-separated lists: `{a,b,c}` expands to a, b, c
- Multiple expansions: `{0..2}_{a,b}` expands to 0_a, 0_b, 1_a, 1_b, 2_a, 2_b
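One way such brace expansion could be implemented is a small recursive expander. This is a sketch, under the assumption that ranges are either all-numeric or single characters; `expand_braces` is an illustrative name, not necessarily the tool's internal function.

```python
import re
from typing import List

def expand_braces(pattern: str) -> List[str]:
    """Expand bash-style braces: {0..10}, {001..010}, {a..e}, {a,b,c}."""
    match = re.search(r"\{([^{}]+)\}", pattern)
    if match is None:
        return [pattern]  # no braces left: the pattern is literal
    body = match.group(1)
    range_match = re.fullmatch(r"(\w+)\.\.(\w+)", body)
    if range_match:
        start, end = range_match.groups()
        if start.isdigit() and end.isdigit():
            # Zero-padded if the start looks like "001"; plain otherwise.
            width = len(start) if start.startswith("0") and len(start) > 1 else 0
            values = [str(i).zfill(width) for i in range(int(start), int(end) + 1)]
        else:
            # Single-character range like {a..e}.
            values = [chr(c) for c in range(ord(start), ord(end) + 1)]
    else:
        values = body.split(",")
    head, tail = pattern[:match.start()], pattern[match.end():]
    # Recurse on the tail so multiple expansions combine left-to-right.
    return [head + v + rest for v in values for rest in expand_braces(tail)]
```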
```bash
# Create a file with your input files
echo -e "/data/file1.h5ad\n/data/file2.h5ad\n/data/file3.h5ad" > input_files.txt

python subset_anndata.py \
    --input-list input_files.txt \
    --output-dir "/output" \
    --obs-columns "cell_type,batch" \
    --genes "GENE1,GENE2" \
    --parallel 8
```
```bash
# Process 4440 files with 12 parallel workers
python subset_anndata.py \
    --input-pattern "/data/shard_*.h5ad" \
    --output-dir "/processed" \
    --obs-columns "cell_type,tissue,donor_id,batch" \
    --genes genes_of_interest.txt \
    --parallel 12 \
    --delete-original \
    --verbose
```
Estimated time: ~53 minutes (vs 12+ hours sequential)
```bash
# Create files with your column/gene lists
echo -e "cell_type\nbatch\ndonor_id" > obs_columns.txt
echo -e "GENE1\nGENE2\nGENE3" > genes.txt

# Single file mode
python subset_anndata.py input.h5ad output.h5ad \
    --obs-columns obs_columns.txt \
    --genes genes.txt

# Batch mode
python subset_anndata.py \
    --input-pattern "/data/*.h5ad" \
    --output-dir "/output" \
    --obs-columns obs_columns.txt \
    --genes genes.txt
```
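Supporting both comma-separated strings and list files for `--obs-columns`/`--genes` amounts to one small dispatch; a hedged sketch (`parse_list_arg` is an illustrative name):

```python
from pathlib import Path
from typing import List

def parse_list_arg(value: str) -> List[str]:
    """Interpret an --obs-columns / --genes argument.

    If the value names an existing file, read one item per line
    (ignoring blanks); otherwise treat it as a comma-separated list.
    """
    path = Path(value)
    if path.is_file():
        return [line.strip() for line in path.read_text().splitlines() if line.strip()]
    return [item.strip() for item in value.split(",") if item.strip()]
```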
```bash
# Single file mode
python subset_anndata.py input.h5ad output.h5ad \
    --obs-columns "cell_type,batch" \
    --genes "GENE1,GENE2" \
    --delete-original

# Batch mode with custom suffix
python subset_anndata.py \
    --input-pattern "/data/*.h5ad" \
    --output-dir "/output" \
    --obs-columns "cell_type,batch" \
    --genes "GENE1,GENE2" \
    --output-suffix "_filtered" \
    --delete-original
```
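How `--output-suffix` and `--output-dir` combine into an output filename can be sketched as follows (illustrative helper, assuming the batch-mode naming shown above):

```python
from pathlib import Path

def output_path(input_file: str, output_dir: str, suffix: str = "_subset") -> Path:
    """Build the batch-mode output path: keep the stem, append the suffix.

    e.g. /data/shard_001.h5ad -> <output_dir>/shard_001_subset.h5ad
    """
    src = Path(input_file)
    return Path(output_dir) / "{}{}{}".format(src.stem, suffix, src.suffix)
```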
```bash
python subset_anndata.py input.h5ad output.h5ad \
    --obs-columns "cell_type,batch" \
    --genes "GENE1,GENE2" \
    --retain-layers \
    --verbose
```
Single-file mode (positional):
- `INPUT_FILE`: Path to the input H5AD file (required)
- `OUTPUT_FILE`: Path for the output H5AD file (required)

Batch mode:
- `--input-pattern, -p`: Glob pattern for multiple input files
- `--input-list, -l`: File containing a list of input files (one per line)
- `--output-dir`: Output directory for processed files (required for batch mode)
- `--output-suffix, -s`: Suffix for output files (default: `_subset`)
- `--parallel, -j`: Number of parallel processes (default: auto-detect, 75% of cores)

Common options:
- `--obs-columns, -o`: Comma-separated obs column names, or path to a file (required)
- `--genes, -g`: Comma-separated gene names, or path to a file (required)
- `--delete-original, -d`: Delete original file(s) after successful processing
- `--retain-layers, -r`: Retain the `layers` slot in the output (by default, layers are deleted)
- `--verbose, -v`: Enable verbose logging
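An argparse skeleton matching these options might look like the following; this is a sketch, and the real script's parser may differ in details such as defaults and validation:

```python
import argparse
import os

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="Subset AnnData (.h5ad) files.")
    # Positionals are optional so batch mode can omit them.
    p.add_argument("input_file", nargs="?", help="input H5AD file (single-file mode)")
    p.add_argument("output_file", nargs="?", help="output H5AD file (single-file mode)")
    p.add_argument("--input-pattern", "-p", help="glob pattern for batch mode")
    p.add_argument("--input-list", "-l", help="file listing input paths, one per line")
    p.add_argument("--output-dir", help="output directory (batch mode)")
    p.add_argument("--output-suffix", "-s", default="_subset")
    p.add_argument("--parallel", "-j", type=int,
                   default=max(1, int((os.cpu_count() or 1) * 0.75)))
    p.add_argument("--obs-columns", "-o", required=True)
    p.add_argument("--genes", "-g", required=True)
    p.add_argument("--delete-original", "-d", action="store_true")
    p.add_argument("--retain-layers", "-r", action="store_true")
    p.add_argument("--verbose", "-v", action="store_true")
    return p
```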
- Input validation: Checks that all specified columns and genes exist
- Output validation: Verifies the output file was saved correctly
- Categorical preservation: Ensures categorical columns keep all categories, including unused ones
- Safe deletion: Only deletes original file if validation passes
- Error handling: Comprehensive error messages and logging
- Slot cleanup: Removes unnecessary AnnData slots (layers, uns, obsm, varm, obsp, varp, raw) to reduce file size
- Parallel processing: Robust multiprocessing with error isolation per file
- Progress tracking: Real-time progress bars for batch operations
- Memory management: Efficient processing to handle large datasets
- Automatic core detection: Uses 75% of available CPU cores by default
- Scalable: Processes thousands of files efficiently
- Memory efficient: Each worker processes one file at a time
- Error isolation: Failed files don't stop processing of other files
- 4440 files: ~53 minutes with 12 parallel workers (vs 12+ hours sequential)
- Processing time: ~10 seconds per file average
- Memory usage: ~1GB per worker process
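The error-isolation pattern behind the batch mode is to catch exceptions inside each worker, so one bad file becomes a failure record rather than a crashed pool. A minimal sketch (`subset_file` here is a stand-in for the real per-file work):

```python
from multiprocessing import Pool
from pathlib import Path
from typing import List, Optional, Tuple

def subset_file(path: str) -> None:
    """Stand-in for the real per-file subsetting; raises if the file is missing."""
    if not Path(path).is_file():
        raise FileNotFoundError(path)
    # ... real work: load, subset, validate, save ...

def process_one(path: str) -> Tuple[str, Optional[str]]:
    # Catching here is what isolates failures: a bad file yields an
    # error record instead of killing the whole pool.
    try:
        subset_file(path)
        return path, None
    except Exception as exc:
        return path, str(exc)

def run_batch(paths: List[str], workers: int) -> List[Tuple[str, str]]:
    """Map files over a worker pool; return (path, error) pairs for failures."""
    with Pool(processes=workers) as pool:
        failures = [(p, e) for p, e in pool.imap_unordered(process_one, paths) if e]
    return failures
```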
```bash
# Process a large dataset shard with parallel workers
python subset_anndata.py \
    --input-pattern "/data/shard_*.h5ad" \
    --output-dir "/processed" \
    --obs-columns "cell_type,tissue,donor_id,batch" \
    --genes genes_of_interest.txt \
    --parallel 12 \
    --delete-original \
    --verbose

# Create file list for specific files
find /data -name "extract_*.h5ad" | head -100 > files_to_process.txt

python subset_anndata.py \
    --input-list files_to_process.txt \
    --output-dir "/output" \
    --obs-columns obs_cols.txt \
    --genes genes.txt \
    --output-suffix "_subset" \
    --parallel 8
```
This will:
- Load and validate each input file
- Subset to only the specified obs columns and genes
- Save the processed files to the output directory
- Validate output file integrity
- Optionally delete original files if everything succeeded
- Provide detailed logging and progress tracking throughout the process
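The validate-before-delete safety gate reduces to a small pattern; a sketch with illustrative helper names (the real validation reopens the H5AD and checks the requested obs columns and genes, not just the file size):

```python
from pathlib import Path

def validate_output(path, expected_min_bytes: int = 1) -> bool:
    """Cheap integrity gate: the output exists and is non-empty.

    The real tool verifies the file's contents; this sketch only
    shows the gatekeeping shape.
    """
    p = Path(path)
    return p.is_file() and p.stat().st_size >= expected_min_bytes

def finalize(original, output, delete_original: bool = False) -> None:
    """Delete the original only after the output passes validation."""
    if not validate_output(output):
        raise RuntimeError("output failed validation: {}".format(output))
    if delete_original:
        Path(original).unlink()
```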