
PySpark Datasource for ROOT


Apache Spark 4 Python DataSource for reading files in the ROOT data format used in High-Energy Physics (HEP), powered by uproot, awkward, and PyArrow.

Author and version
[email protected] · v0.1 (Sep 2025)

Highlights

  • Reads ROOT TTrees into Spark DataFrames via uproot, awkward, and PyArrow.
  • Prune early: pass a Spark schema or the "columns" option to read only the branches you need.
  • Flexible partitioning: "step_size", "num_partitions", and "entry_start"/"entry_stop" per file.
  • Remote reads over XRootD ("root://" URLs) via fsspec.
  • Directory, glob, and recursive expansion of ROOT file paths.

Install

# From PyPI
pip install pyspark-root-datasource

# Or, from a local checkout for development
pip install -e .

Quick start

from pyspark.sql import SparkSession
from pyspark_root_datasource import register

spark = (SparkSession.builder
         .appName("Read ROOT via PySpark + uproot")
         .getOrCreate())

# Register the datasource (short name = "root")
register(spark)

# Get the example ROOT file (about 2 GB):
#   xrdcp root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root .
# or, if you don't have xrdcp installed, use wget or curl -O:
#   wget https://sparkdltrigger.web.cern.ch/sparkdltrigger/Run2012BC_DoubleMuParked_Muons.root

# Best practice: provide a schema to prune branches early
schema = "nMuon int, Muon_pt array<float>, Muon_eta array<float>, Muon_phi array<float>, Muon_mass array<float>, Muon_charge array<int>"

df = (spark.read.format("root")
      .schema(schema)
      .option("path", "/data/Run2012BC_DoubleMuParked_Muons.root")
      .option("tree", "Events")
      .option("step_size", "1000000")
      .load())

df.show(5, truncate=False)
print("Count:", df.count())

# Use schema inference
df2 = (spark.read.format("root")
       .option("path", "/data/Run2012BC_DoubleMuParked_Muons.root")
       .option("tree", "Events")
       .option("sample_rows", "1000")   # default 1000
       .load())
df2.printSchema()
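
Jagged branches arrive as ordinary Spark array columns, so standard array functions apply. A minimal sketch of downstream processing, using the example schema above:

from pyspark.sql import functions as F

# Leading-muon pT: element_at uses 1-based indexing on array columns
(df.where("nMuon >= 2")
   .select("nMuon", F.element_at("Muon_pt", 1).alias("leading_muon_pt"))
   .show(5, truncate=False))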

Examples and tests

See the examples and tests in the repository for more usage patterns.

Options

  • "path" (required) – file path, URL, comma-separated list, directory, or glob (e.g. "/data/*.root")
  • "tree" (default: "Events") – TTree name
  • "step_size" (default: "1000000") – entries per Spark partition (per file)
  • "num_partitions" (optional, per file) – overrides step_size
  • "entry_start", "entry_stop" (optional, per file) – index bounds
  • "columns" – comma-separated branch names (if not providing a Spark schema)
  • "list_to32" (default: "true") – Arrow list offset width
  • "extensionarray" (default: "false") – Arrow extension array support
  • "cast_unsigned" (default: "true") – cast uint* → signed (Spark lacks unsigned)
  • "recursive" (default: "false") – expand directories recursively
  • "ext" (default: "*.root") – filter pattern when path is a directory
  • "sample_rows" (default: "1000") – rows for schema inference
  • "arrow_max_chunksize" (default: "0") – if >0, limit rows per Arrow RecordBatch

Reading over XRootD (root://)

Reading root:// URLs requires fsspec with its XRootD plugin, plus the XRootD client libraries. Install the extras:

# fsspec plugins for xrootd
pip install fsspec fsspec-xrootd

# XRootD client libs + Python bindings
conda install -c conda-forge xrootd

Then read remote files directly:

remote_file = "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root"
df = (spark.read.format("root")
      .option("path", remote_file)
      .option("tree", "Events")
      .load())
df.show(3, truncate=False)

Reading folders, globs, recursion

# All .root files in a directory (non-recursive)
df = (spark.read.format("root")
      .option("path", "/data/myfolder")
      .load())

# Recursive directory expansion
df = (spark.read.format("root")
      .option("path", "/data/myfolder")
      .option("recursive", "true")
      .load())

# Custom extension used when 'path' is a directory
df = (spark.read.format("root")
      .option("path", "/data/myfolder")
      .option("ext", "*.parquet.root")
      .load())

# Glob
df = (spark.read.format("root")
      .option("path", "/data/*/atlas/*.root")
      .load())
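
The "path" option also accepts an explicit comma-separated list of files; the file names below are illustrative:

# Comma-separated list of files
df = (spark.read.format("root")
      .option("path", "/data/run1.root,/data/run2.root")
      .load())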

Tips and troubleshooting

  • Prefer explicit schemas to prune early and minimize I/O.
  • Tune partitioning:
    • step_size = entries per Spark partition.
    • num_partitions (per file) overrides step_size.
  • Large jagged arrays benefit from reasonable step_size (e.g., 100k–1M).
  • If necessary, use arrow_max_chunksize to keep batch sizes moderate for downstream stages.
  • cast_unsigned=true normalizes uint* to signed widths (Spark-friendly).
  • Fixed-size lists are preserved as Arrow fixed_size_list (no silent downgrade).
  • XRootD errors: install both fsspec and fsspec-xrootd, and the XRootD client libs. Conda is often the smoothest:
    pip install fsspec fsspec-xrootd
    conda install -c conda-forge xrootd
  • Tree not found: double-check .option("tree", "..."); error messages list available keys.
  • Different schemas across files: ensure compatible branch types or read by subsets, then reconcile in Spark.
  • Driver vs. executors environment mismatch: point both spark.pyspark.python and spark.pyspark.driver.python at the same Python interpreter (see the sketch after this list).
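
A minimal sketch of pinning the interpreter, assuming Spark is launched from the Python environment where this package is installed:

import sys
from pyspark.sql import SparkSession

# Point driver and executors at the same interpreter (illustrative; in cluster
# deployments these are usually set in spark-defaults.conf or via spark-submit)
spark = (SparkSession.builder
         .config("spark.pyspark.python", sys.executable)
         .config("spark.pyspark.driver.python", sys.executable)
         .getOrCreate())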

Related work & acknowledgments

This datasource builds on uproot, awkward, and PyArrow.

Notes & limitations

  • Performance — Python data sources cross the Python↔JVM boundary, which adds overhead. Using Direct Arrow Batch support substantially reduces serialization cost and improves throughput. For maximum performance, a native JVM DataSource V2 implementation typically wins (see the guide on reading ROOT with Spark).
  • Scope — This datasource is read-only at this stage.
