
What's Changed in the Latest Connector?

Hello!

For the past few months, we've been hard at work rewriting the EventHubs connector for Apache Spark. There have been a number of changes, and I wanted to quickly summarize what's changed in the library. This is intended for users familiar with the old library. If you're new to the library, please see the "docs" directory within the repo itself. :)

Major Changes:

No more progress tracker.

The progress tracker served two purposes: (1) checkpointing and (2) communicating the last byte offset read by each Spark Executor back to the Spark Driver. Communicating the last byte offset from the Executors to the Driver forced the Driver to wait for all tasks within a batch to complete, and it required N reads and N writes to an HDFS-compatible file system every batch. As you can imagine, the waiting and the frequent I/O caused serious performance issues in the previous library and limited the number of concurrent jobs to 1. These limitations no longer exist: there is no progress tracker, and there is no per-batch I/O against the file system.

Switched to Sequence Number based filtering.

In the new connector, we use sequence numbers to consume events from EventHubs. This lets us determine the start and end point of each batch deterministically: reading 1000 events starting from sequence number X simply ends at X + 1000, whereas the ending byte offset of 1000 events starting from byte offset Y cannot be known a priori.
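As a quick, self-contained illustration of that arithmetic (the numbers are made up):

```scala
// Sequence numbers make the end of a batch simple arithmetic:
val startSeqNo = 5000L                      // first sequence number in the batch (example value)
val eventsInBatch = 1000L                   // desired batch size
val endSeqNo = startSeqNo + eventsInBatch   // 6000L -- known before reading a single event

// With byte offsets, events have variable sizes, so the ending offset of
// "the next 1000 events" can't be computed without actually reading them.
```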

Between these two changes, the entire codebase was rewritten. Along the way, the known issues were resolved: concurrency bugs (e.g. multiple threads writing to the same HDFS path at once), broken enqueue-time filtering, the lack of concurrent jobs, long waits between batches, and so on. Additionally, all unit and integration tests have been rewritten, and the new connector is regularly stress tested.

In addition to the codebase rewrite, the documentation has been redone from scratch. All new documentation has been written, and there are no undocumented options or parameters for the user!

In addition, we have many new improvements (a short configuration sketch follows the list):

  • Spark 2.3 and Spark 2.2 Support
  • Databricks (and Azure Databricks) support
  • Spark core support
  • Added an EventHubsSink to Structured Streaming
  • Moved to EventHubs Java Client 1.0.0
  • Thread pooling between multiple EventHubClients
  • Connection pooling between multiple EventHubClients and across batches
  • Java support in Spark Streaming
  • Allow users to manage their own offsets
  • Per-partition configuration for starting positions, ending positions, and max rates
  • Users can start their jobs from START_OF_STREAM and END_OF_STREAM
  • EventHubs receiver timeout and operation timeout are now configurable
  • README is revamped
  • Non-public and international clouds are properly supported
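
To give a feel for the configuration-related items above (per-partition starting positions, START_OF_STREAM/END_OF_STREAM, max rates, and the timeouts), here is a minimal sketch of building an EventHubsConf. The namespace, event hub name, key, partition number, and all values are placeholders, and the method names should be checked against the docs directory; treat this as an illustration rather than the authoritative API reference.

```scala
import java.time.Duration
import org.apache.spark.eventhubs.{ ConnectionStringBuilder, EventHubsConf, EventPosition, NameAndPartition }

// Placeholder connection details -- replace with your own namespace, event hub, and key.
val connectionString = ConnectionStringBuilder()
  .setNamespaceName("my-namespace")
  .setEventHubName("my-event-hub")
  .setSasKeyName("RootManageSharedAccessKey")
  .setSasKey("<sas-key>")
  .build

val ehConf = EventHubsConf(connectionString)
  // Start the job from the beginning of the stream
  // (the end of the stream would be EventPosition.fromEndOfStream)...
  .setStartingPosition(EventPosition.fromStartOfStream)
  // ...but override the starting position for partition 0 with a specific sequence number.
  .setStartingPositions(Map(NameAndPartition("my-event-hub", 0) -> EventPosition.fromSequenceNumber(4000L)))
  // Per-partition max rate: at most 1000 events per partition per batch (example value).
  .setMaxRatePerPartition(1000)
  // Receiver and operation timeouts are now configurable.
  .setReceiverTimeout(Duration.ofSeconds(60))
  .setOperationTimeout(Duration.ofSeconds(120))
```

From there, the resulting configuration can be used with Structured Streaming or Spark Streaming; see the Structured Streaming and Spark Streaming guides in the docs directory for complete, end-to-end examples.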