
What's Changed in the Latest Connector?

Hello!

For the past few months, we've been hard at work rewriting the EventHubs connector for Apache Spark. There have been a number of changes, and I wanted to quickly summarize what's changed in the library. This is intended for users familiar with the old library. If you're new to the library, please see the "docs" directory within the repo itself. :)

Major Changes:

No more progress tracker.

The progress tracker served two purposes: (1) checkpointing and (2) communicating the last byte offset read by each Spark Executor back to the Spark Driver. Communicating the last byte offset from the Executors to the Driver forced the Driver to wait for all tasks within a batch to complete, and it required N reads and N writes to an HDFS-compatible file system every batch. As you can imagine, the waiting and the frequent I/O caused serious performance issues in the previous library and limited the number of concurrent jobs to 1. These limitations no longer exist: there is no progress tracker, and there is no per-batch I/O against the file system.

Switched to Sequence Number based filtering.

In the new connector, we use sequence numbers to consume events from EventHubs. This lets us determine the start and end point of each batch deterministically: reading 1000 events starting from sequence number X simply ends at X + 1000, whereas the ending byte offset of 1000 events starting from byte offset Y cannot be known a priori.
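As a quick, self-contained illustration of that arithmetic (the numbers are made up):

```scala
// Sequence numbers make the end of a batch simple arithmetic:
val startSeqNo = 5000L                      // first sequence number in the batch (example value)
val eventsInBatch = 1000L                   // desired batch size
val endSeqNo = startSeqNo + eventsInBatch   // 6000L -- known before reading a single event

// With byte offsets, events have variable sizes, so the ending offset of
// "the next 1000 events" can't be computed without actually reading them.
```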

Between these two changes, the entire codebase was rewritten. Along the way, the known issues were resolved: concurrency bugs (e.g. multiple threads writing to the same HDFS path at once), broken enqueue-time filtering, the lack of concurrent jobs, long waits between batches, and so on. Additionally, all unit and integration tests have been rewritten, and the new connector is regularly stress tested.

In addition to the codebase rewrite, the documentation has been redone from scratch. All new documentation has been written, and there are no undocumented options or parameters for the user!

In addition, we have many new improvements (a short configuration sketch follows the list):

  • Spark 2.3 and Spark 2.2 Support
  • Databricks (and Azure Databricks) support
  • Spark core support
  • Added an EventHubsSink to Structured Streaming
  • Moved to EventHubs Java Client 1.0.0
  • Thread pooling between multiple EventHubClients
  • Connection pooling between multiple EventHubClients and across batches
  • Java support in Spark Streaming
  • Allow users to manage their own offsets
  • Per-partition configuration for starting positions, ending positions, and max rates
  • Users can start their jobs from START_OF_STREAM and END_OF_STREAM
  • EventHubs receiver timeout and operation timeout are now configurable
  • README is revamped
  • Non-public and international clouds are properly supported
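
To give a feel for the configuration-related items above (per-partition starting positions, START_OF_STREAM/END_OF_STREAM, max rates, and the timeouts), here is a minimal sketch of building an EventHubsConf. The namespace, event hub name, key, partition number, and all values are placeholders, and the method names should be checked against the docs directory; treat this as an illustration rather than the authoritative API reference.

```scala
import java.time.Duration
import org.apache.spark.eventhubs.{ ConnectionStringBuilder, EventHubsConf, EventPosition, NameAndPartition }

// Placeholder connection details -- replace with your own namespace, event hub, and key.
val connectionString = ConnectionStringBuilder()
  .setNamespaceName("my-namespace")
  .setEventHubName("my-event-hub")
  .setSasKeyName("RootManageSharedAccessKey")
  .setSasKey("<sas-key>")
  .build

val ehConf = EventHubsConf(connectionString)
  // Start the job from the beginning of the stream
  // (the end of the stream would be EventPosition.fromEndOfStream)...
  .setStartingPosition(EventPosition.fromStartOfStream)
  // ...but override the starting position for partition 0 with a specific sequence number.
  .setStartingPositions(Map(NameAndPartition("my-event-hub", 0) -> EventPosition.fromSequenceNumber(4000L)))
  // Per-partition max rate: at most 1000 events per partition per batch (example value).
  .setMaxRatePerPartition(1000)
  // Receiver and operation timeouts are now configurable.
  .setReceiverTimeout(Duration.ofSeconds(60))
  .setOperationTimeout(Duration.ofSeconds(120))
```

From there, the resulting configuration can be used with Structured Streaming or Spark Streaming; see the Structured Streaming and Spark Streaming guides in the docs directory for complete, end-to-end examples.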