Hi @tchoedak, I think this is a great idea, and very well articulated; I suggest you convert this discussion into an issue for visibility, and in case anyone wants to begin this work!
## Use Case(s)
Data pipelines in many orgs are required to interact with Kafka as both a data source and a data sink. In my experience these pipelines often tend to be batch based: they spin up, read a set of messages from a queue, perform some action, and spin down. Two common shapes (the first is sketched below):

- Batch Kafka Consume -> Snowflake Flow
- Snowflake -> Transform -> Batch Kafka Produce Flow
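To make the batch shape concrete, here is a minimal sketch of the first flow using Prefect's 0.x functional API; `consume_batch` and `load_to_snowflake` are hypothetical placeholders for the tasks proposed below, not settled names.

```python
from prefect import Flow, task

@task
def consume_batch(topic: str) -> list:
    # Placeholder for the proposed batch Kafka consume step.
    return []

@task
def load_to_snowflake(messages: list):
    # Placeholder for loading the consumed batch into Snowflake.
    pass

with Flow("kafka-to-snowflake") as flow:
    messages = consume_batch("events")
    load_to_snowflake(messages)

# flow.run()  # spin up, drain the batch, spin down
```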
## Task API
All Kafka-related tasks would live under `src/prefect/tasks/kafka/`.
### KafkaBatchConsume

- `init()`: takes the broker connection string, e.g. a single broker: `'localhost:9092'`, or multiple brokers: `'localhost:9092,localhost:9093'`
- `run()`: takes the offset reset behavior, among other arguments; `earliest`/`smallest` are the most common for data pipeline tasks, while `latest`/`largest` is common for application consumers.

### KafkaBatchProduce

- `init()`
- `run()`
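A rough sketch of what these task classes might look like; every parameter name here (`bootstrap_servers`, `group_id`, `topic`, `auto_offset_reset`, `messages`) is an assumption rather than a settled signature.

```python
from typing import List, Union

from prefect import Task


class KafkaBatchConsume(Task):
    def __init__(self, bootstrap_servers: str, group_id: str, **kwargs):
        # Single broker: 'localhost:9092'; multiple: 'localhost:9092,localhost:9093'
        self.bootstrap_servers = bootstrap_servers
        self.group_id = group_id
        super().__init__(**kwargs)

    def run(self, topic: str, auto_offset_reset: str = "earliest") -> List[str]:
        # Poll messages from `topic` until the batch is drained, then
        # return them; implementation deferred to the backend client.
        raise NotImplementedError


class KafkaBatchProduce(Task):
    def __init__(self, bootstrap_servers: str, **kwargs):
        self.bootstrap_servers = bootstrap_servers
        super().__init__(**kwargs)

    def run(self, topic: str, messages: Union[List[str], List[dict]]) -> None:
        # Produce each message to `topic` and flush before returning.
        raise NotImplementedError
```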
## Backend Client
I would recommend confluent-kafka-python as the backend client that does the hard work of actually producing and consuming messages. In my experience this lib, while not as mature as kafka-python, is maintained by Confluent, the organization backing Kafka, and is more performant.
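For reference, the batch-consume pattern with confluent-kafka-python might look like the following minimal loop; the topic name and group id are illustrative.

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "my-batch-pipeline",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])

messages = []
while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        break  # nothing left in this batch; spin down
    if msg.error():
        raise RuntimeError(msg.error())
    messages.append(msg.value().decode("utf-8"))

consumer.close()
```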
## Design Choices

`messages` can be one of two types: `List[str]` or `List[Dict]`. I would love any advice or suggestions here, as it feels like there's room for misuse or confusion. I'm currently thinking of supporting both types because a `kafka.produce()` call requires a value for the message but can also optionally accept a `key`. Using a `List[Dict]` allows us to enforce a standard for key-value based producers, where the key is grabbed from `message['key']` and the value is grabbed from `message['value']`.
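A sketch of how a producer could normalize the two message types; `produce_batch` is a hypothetical helper for illustration, not part of the proposed API.

```python
from typing import Dict, List, Union

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def produce_batch(topic: str, messages: Union[List[str], List[Dict]]) -> None:
    for message in messages:
        if isinstance(message, dict):
            # Key-value convention: key from message['key'], value from message['value'].
            producer.produce(topic, key=message["key"], value=message["value"])
        else:
            # Plain string message: value only, no key.
            producer.produce(topic, value=message)
    producer.flush()

produce_batch("events", ["hello", "world"])
produce_batch("events", [{"key": "user-1", "value": "hello"}])
```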
## Consequences

- New `extras` entry in `setup.py` for `kafka` (sketched below)
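Illustratively, the extra might be declared like this in `setup.py` (the surrounding arguments are elided):

```python
from setuptools import setup

setup(
    name="prefect",
    # ... existing setup arguments ...
    extras_require={
        "kafka": ["confluent-kafka"],
    },
)
```

Users could then opt in with `pip install "prefect[kafka]"`.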