PySpark Apache Beam Runner

Overview

This project introduces a custom Apache Beam runner that leverages PySpark directly. Unlike the default Spark runner shipped with Beam, this runner is designed for environments where a SparkSession is available but a Spark master server is not. This makes it particularly useful for serverless environments where jobs are triggered without a long-running cluster.

Why Another Spark Runner?

Serverless Compatibility: Ideal for environments without a dedicated Spark master, supporting execution in serverless frameworks such as EMR Serverless.
Python-Centric Approach: The entire stack - compilation, optimizations, and execution planning - happens in Python, which can be advantageous for Python-focused teams and environments.
Simplified Setup: Potentially reduces the complexity of job submission, because no Spark master has to listen on a port to accept jobs.
Direct PySpark Integration: Uses a PySpark SparkSession directly, on the assumption that one is already available in the environment.

Features

Direct integration with PySpark
Serverless compatibility
Simplified setup process
Python-centric execution stack
Support for key Beam transforms (Create, ReadFromText, ParDo, Flatten, GroupByKey, CombinePerKey)
Efficient handling of side inputs and DoFn lifecycle
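
The "DoFn lifecycle" mentioned above refers to the order in which Beam calls a DoFn's hooks: setup once per instance, start_bundle and finish_bundle around each bundle of elements, process for every element, and teardown at the end. The plain-Python sketch below only models that ordering. `RecordingDoFn` and `run_bundles` are hypothetical names invented for illustration; this is not the runner's implementation, and how this runner maps bundles onto Spark partitions is not described here.

```python
from typing import Any, Iterable, Iterator, List


class RecordingDoFn:
    """Stand-in for a Beam DoFn that records which lifecycle hooks ran."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def setup(self) -> None:
        # Called once per DoFn instance, e.g. to open a connection.
        self.calls.append("setup")

    def start_bundle(self) -> None:
        self.calls.append("start_bundle")

    def process(self, element: Any) -> Iterator[Any]:
        self.calls.append(f"process({element!r})")
        yield element * 2

    def finish_bundle(self) -> None:
        self.calls.append("finish_bundle")

    def teardown(self) -> None:
        # Called once at the end, e.g. to release resources.
        self.calls.append("teardown")


def run_bundles(dofn: RecordingDoFn, bundles: Iterable[Iterable[Any]]) -> List[Any]:
    """Hypothetical driver: invoke the hooks in lifecycle order."""
    outputs: List[Any] = []
    dofn.setup()
    try:
        for bundle in bundles:
            dofn.start_bundle()
            for element in bundle:
                outputs.extend(dofn.process(element))
            dofn.finish_bundle()
    finally:
        dofn.teardown()
    return outputs


dofn = RecordingDoFn()
results = run_bundles(dofn, [[1, 2], [3]])
```

With two bundles, start_bundle and finish_bundle each run twice while setup and teardown run once, which is why expensive initialization usually belongs in setup.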

Prerequisites

Apache Spark
Apache Beam
Python 3.8 or later

Installation

To use this custom runner, simply install it using pip:

pip install beam-pyspark-runner
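
After installing, it can help to confirm that the prerequisites are importable before submitting a job. The following is a minimal sketch using only the standard library. It assumes the import names `pyspark` and `apache_beam` for the Spark and Beam Python packages, and `beam_pyspark_runner` as used in the usage example; `check_environment` is a hypothetical helper, not part of this project.

```python
import sys
from importlib.util import find_spec
from typing import Dict, Sequence

# Import names assumed for the prerequisites and for this package.
REQUIRED_MODULES = ("pyspark", "apache_beam", "beam_pyspark_runner")
MIN_PYTHON = (3, 8)


def check_environment(modules: Sequence[str] = REQUIRED_MODULES) -> Dict[str, bool]:
    """Report whether the Python version and each module look usable."""
    report = {"python>=3.8": sys.version_info >= MIN_PYTHON}
    for name in modules:
        try:
            # find_spec locates a top-level module without importing it.
            report[name] = find_spec(name) is not None
        except (ImportError, ValueError):
            report[name] = False
    return report


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print(f"{item}: {'ok' if ok else 'MISSING'}")
```

This only checks that the modules can be found; it does not verify version compatibility between Spark and Beam.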

Usage

Here's a basic example of how to use the PySpark Beam Runner:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from beam_pyspark_runner import PySparkRunner

# Define a pipeline that uses the custom runner from this project
with beam.Pipeline(runner=PySparkRunner(), options=PipelineOptions()) as p:
    # Your Beam pipeline definition goes here. For example, a word count:
    lines = p | 'ReadLines' >> beam.io.ReadFromText('input.txt')
    counts = (lines
              # FlatMap emits each word as its own element
              | 'Split' >> beam.FlatMap(lambda line: line.split())
              # CombinePerKey expects (key, value) pairs
              | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
              | 'GroupAndSum' >> beam.CombinePerKey(sum))
    counts | 'WriteOutput' >> beam.io.WriteToText('output.txt')

# The pipeline is executed using PySpark when the `with` block exits
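
To make the data flow concrete, here is a plain-Python sketch of what the word-count pipeline above is meant to compute: each line is split and the words are flattened into a single stream, each word is paired with 1, and the values are summed per key. The function names are hypothetical and chosen to mirror the step labels. Nothing here uses Beam or Spark, so it says nothing about how the runner distributes the work.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_words(lines: Iterable[str]) -> List[str]:
    """'Split': one output element per word, across all lines."""
    return [word for line in lines for word in line.split()]


def pair_with_one(words: Iterable[str]) -> List[Tuple[str, int]]:
    """'PairWithOne': turn each word into a (key, value) pair."""
    return [(word, 1) for word in words]


def combine_per_key_sum(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """'GroupAndSum': sum the values that share a key."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


sample_lines = ["to be or not", "to be"]
counts = combine_per_key_sum(pair_with_one(split_words(sample_lines)))
```

For the two sample lines, `counts` maps "to" and "be" to 2, and "or" and "not" to 1.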

Current Status and Limitations

This project is currently in development. While it supports many common Beam transforms, some advanced features may not be fully implemented. Contributions and feedback are welcome!

Contributing

We welcome contributions to the PySpark Apache Beam Runner! Here are ways you can contribute:

Report bugs or request features by opening an issue
Improve documentation
Submit pull requests with bug fixes or new features

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Acknowledgements

This project builds upon the great work done by the Apache Beam and Apache Spark communities. We're grateful for their ongoing efforts in advancing distributed data processing technologies.
