The Data Processing Framework is Python-based and enables the application of "transforms" to one or more input data files to produce one or more output data files. Various runtimes are available to execute the transforms, all using a common methodology and mechanism to configure input and output across either local or S3-based storage.
The framework allows simple 1:1 transformation of (parquet) files, but also enables more complex transformations requiring coordination among transforming nodes, such as de-duplication, merging, and splitting. The framework uses a plug-in model for the primary functions. The core transformation-specific classes/interfaces are as follows:
- AbstractTransform - a simple, easily implemented interface allowing the definition of transforms over arbitrary data types. Support is provided both for files of arbitrary data as byte arrays and for parquet/arrow tables (a minimal sketch follows this list).
- TransformConfiguration - defines the transform short name, its implementation class, and command line configuration parameters.
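To make the plug-in model concrete, below is a minimal sketch of a pass-through transform over parquet/arrow tables and its configuration. The module paths, the `AbstractTableTransform` subclass name, the `transform` signature, and the `noop` transform itself are assumptions for illustration; consult the library source for the exact imports and method contracts.

```python
from argparse import ArgumentParser, Namespace
from typing import Any

import pyarrow as pa

# Module path is an assumption; the real package layout may differ.
from data_processing.transform import AbstractTableTransform, TransformConfiguration


class NoopTransform(AbstractTableTransform):
    """A hypothetical transform that passes parquet/arrow tables through unchanged."""

    def __init__(self, config: dict[str, Any]):
        super().__init__(config)
        # Configuration values arrive as a dict built from the CLI parameters.
        self.sleep_sec = config.get("noop_sleep_sec", 0)

    def transform(self, table: pa.Table, file_name: str = None) -> tuple[list[pa.Table], dict[str, Any]]:
        # Return the output table(s) plus metadata describing what was done.
        return [table], {"nrows": table.num_rows}


class NoopTransformConfiguration(TransformConfiguration):
    """Ties together the transform short name, its implementation class,
    and its command line configuration parameters."""

    def __init__(self):
        super().__init__(name="noop", transform_class=NoopTransform)

    def add_input_params(self, parser: ArgumentParser) -> None:
        # Transform-specific command line configuration parameters.
        parser.add_argument("--noop_sleep_sec", type=int, default=0,
                            help="seconds to sleep per table (illustrative only)")

    def apply_input_params(self, args: Namespace) -> bool:
        self.params["noop_sleep_sec"] = args.noop_sleep_sec
        return True
```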
In support of running a transform over a set of input data in a runtime, the following classes/interfaces are provided:
- AbstractTransformLauncher - is the central runtime interface expected to be implemented by each runtime (python, ray, spark, etc.) to apply a transform to a set of data. It is configured with a TransformRuntimeConfiguration and a DataAccessFactory instance (see below, and the launcher sketch after this list).
- DataAccessFactory - is used to configure the input and output data files to be processed and creates the DataAccess instance (see below) according to the CLI parameters.
- TransformRuntimeConfiguration - captures the TransformConfiguration and runtime-specific configuration.
- DataAccess - is the interface defining data I/O methods and selection. Implementations for local and S3 storage are provided.
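Putting the runtime pieces together, a launcher-based entry point might look like the following sketch. The `PythonTransformLauncher` and `PythonTransformRuntimeConfiguration` names and module path are assumptions for illustration of the AbstractTransformLauncher/TransformRuntimeConfiguration pattern; each runtime provides its own equivalents.

```python
# Module paths and class names are assumptions for illustration.
from data_processing.runtime.pure_python import (
    PythonTransformLauncher,
    PythonTransformRuntimeConfiguration,
)

from noop_transform import NoopTransformConfiguration  # the configuration sketched above (hypothetical module)

if __name__ == "__main__":
    # TransformRuntimeConfiguration pairs the TransformConfiguration with any
    # runtime-specific settings; a single-process python runtime needs little extra.
    runtime_config = PythonTransformRuntimeConfiguration(
        transform_config=NoopTransformConfiguration()
    )
    # The launcher parses the CLI, uses a DataAccessFactory to create the
    # DataAccess (local or S3) per those parameters, reads each input file,
    # applies the transform, and writes the outputs.
    launcher = PythonTransformLauncher(runtime_config)
    launcher.launch()
```

Input and output locations would then be selected on the command line through the DataAccessFactory's parameters, e.g. a local input/output folder pair or S3 credentials and paths; the exact flag names are defined by the DataAccessFactory implementation.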
To learn more, consider the following: