Hi @tchoedak, I think this is a great idea, and very well articulated; I suggest you convert this discussion into an issue for visibility, and in case anyone wants to begin this work!
## Use Case(s)
Data pipelines in many orgs are required to interact with Kafka as both a data source and a data sink. In my experience these pipelines often tend to be batch based: they spin up, read a set of messages from a queue, perform some action, and spin down. Two common shapes (the first is sketched below):

- Batch Kafka Consume -> Snowflake Flow
- Snowflake -> Transform -> Batch Kafka Produce Flow
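To make the batch shape concrete, here is a minimal sketch of the first flow using Prefect's 0.x functional API; `consume_batch` and `load_to_snowflake` are hypothetical placeholders for the tasks proposed below, not settled names.

```python
from prefect import Flow, task

@task
def consume_batch(topic: str) -> list:
    # Placeholder for the proposed batch Kafka consume step.
    return []

@task
def load_to_snowflake(messages: list):
    # Placeholder for loading the consumed batch into Snowflake.
    pass

with Flow("kafka-to-snowflake") as flow:
    messages = consume_batch("events")
    load_to_snowflake(messages)

# flow.run()  # spin up, drain the batch, spin down
```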
## Task API
All Kafka-related tasks would live under `src/prefect/tasks/kafka/`.
### KafkaBatchConsume

- `init()`: takes the broker connection string, e.g. a single broker: `'localhost:9092'`, or multiple brokers: `'localhost:9092,localhost:9093'`
- `run()`: takes the offset reset behavior, among other arguments; `earliest`/`smallest` are the most common for data pipeline tasks, while `latest`/`largest` is common for application consumers.

### KafkaBatchProduce

- `init()`
- `run()`
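A rough sketch of what these task classes might look like; every parameter name here (`bootstrap_servers`, `group_id`, `topic`, `auto_offset_reset`, `messages`) is an assumption rather than a settled signature.

```python
from typing import List, Union

from prefect import Task


class KafkaBatchConsume(Task):
    def __init__(self, bootstrap_servers: str, group_id: str, **kwargs):
        # Single broker: 'localhost:9092'; multiple: 'localhost:9092,localhost:9093'
        self.bootstrap_servers = bootstrap_servers
        self.group_id = group_id
        super().__init__(**kwargs)

    def run(self, topic: str, auto_offset_reset: str = "earliest") -> List[str]:
        # Poll messages from `topic` until the batch is drained, then
        # return them; implementation deferred to the backend client.
        raise NotImplementedError


class KafkaBatchProduce(Task):
    def __init__(self, bootstrap_servers: str, **kwargs):
        self.bootstrap_servers = bootstrap_servers
        super().__init__(**kwargs)

    def run(self, topic: str, messages: Union[List[str], List[dict]]) -> None:
        # Produce each message to `topic` and flush before returning.
        raise NotImplementedError
```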
## Backend Client
I would recommend confluent-kafka-python as the backend client that does the hard work of actually producing and consuming messages. In my experience this lib, while not as mature as kafka-python, is maintained by Confluent, the organization backing Kafka, and is more performant.
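For reference, the batch-consume pattern with confluent-kafka-python might look like the following minimal loop; the topic name and group id are illustrative.

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "my-batch-pipeline",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])

messages = []
while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        break  # nothing left in this batch; spin down
    if msg.error():
        raise RuntimeError(msg.error())
    messages.append(msg.value().decode("utf-8"))

consumer.close()
```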
## Design Choices

`messages` can be one of two types: `List[str]` or `List[Dict]`. I would love any advice or suggestions here, as it feels like there's room for misuse or confusion. I'm currently thinking of supporting both types because a `kafka.produce()` call requires a value for the message but can also optionally accept a `key`. Using a `List[Dict]` allows us to enforce a standard for key-value based producers, where the key is grabbed from `message['key']` and the value is grabbed from `message['value']`.
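A sketch of how a producer could normalize the two message types; `produce_batch` is a hypothetical helper for illustration, not part of the proposed API.

```python
from typing import Dict, List, Union

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def produce_batch(topic: str, messages: Union[List[str], List[Dict]]) -> None:
    for message in messages:
        if isinstance(message, dict):
            # Key-value convention: key from message['key'], value from message['value'].
            producer.produce(topic, key=message["key"], value=message["value"])
        else:
            # Plain string message: value only, no key.
            producer.produce(topic, value=message)
    producer.flush()

produce_batch("events", ["hello", "world"])
produce_batch("events", [{"key": "user-1", "value": "hello"}])
```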
## Consequences

- New `extras` entry in `setup.py` for `kafka` (sketched below)
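Illustratively, the extra might be declared like this in `setup.py` (the surrounding arguments are elided):

```python
from setuptools import setup

setup(
    name="prefect",
    # ... existing setup arguments ...
    extras_require={
        "kafka": ["confluent-kafka"],
    },
)
```

Users could then opt in with `pip install "prefect[kafka]"`.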