Performance problems of Dory inputting data to Kafka #26
-
When using Kafka's own producer to import data into Kafka, each request is batched and spans multiple partitions, so it's very efficient. The following is the log of Kafka receiving requests: it took only 4 seconds to import 300,000 records. However, when Dory is used to import the same data into Kafka, each request carries only a small amount of data and a single partition: a few minutes have passed and the import still has not completed. After installation, I used Dory's default configuration. How can I achieve the same performance as kafka-console-producer.sh?
-
When you say "default configuration", are you referring to the one referenced here? https://github.com/dspeterson/dory-work/blob/master/doc/detailed_config.md
Try adjusting the settings so that the amount of batching Dory performs is much higher, comparable to what you are getting from Kafka's own producer. To do that, try increasing the bytes value to something much larger than 256k. In general, batching can be done based on any combination of the following:

- a time limit: send a batch once it has been accumulating messages for this long
- a message count limit: send a batch once it contains this many messages
- a byte count limit: send a batch once it contains this much message data
Setting any of those settings to "disable" means that no limit is imposed for that setting. If multiple limits are specified (for instance, if you specify a time limit of 10000 ms and a bytes limit of 256k), then Dory will send a batch once either 10 seconds has elapsed or 256k bytes of message data has accumulated, whichever comes first.

Your output shows that Kafka's producer is sending about 1,000,000 messages per produce request. The total amount of data in a batch will be roughly that number times the average message size. For instance, if messages average 100 bytes each, the size of a batch will be roughly 100 MB.

When Dory starts up, it preallocates a fixed amount of memory for buffering messages before sending them to Kafka. That value should be much larger than the batch size. You will probably need to increase it to prevent Dory from discarding messages due to the buffer capacity being reached. In the config file, see <doryConfig>/<inputConfig> and look for the following:
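As a sketch only (the maxBuffer element name and value below are assumptions; check detailed_config.md for the exact setting), the relevant part of <inputConfig> looks something like this:

```xml
<inputConfig>
  <!-- Hypothetical: total memory Dory preallocates for buffering incoming
       messages. The element name and value are assumptions; see
       detailed_config.md for the exact setting. -->
  <maxBuffer value="1024k" />
</inputConfig>
```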
Try increasing this value. To see whether messages are being discarded, you can query Dory's HTTP interface, which is documented here: https://github.com/dspeterson/dory-work/blob/master/doc/status_monitoring.md

Another place where adjustments may be needed is under <doryConfig>/<compression> (a rough sketch of this section appears at the end of this reply). You can disable or enable compression, or adjust compression-related settings if compression is enabled. If your messages compress very well and compression is relatively fast, compression may help throughput by reducing the size of each produce request, which reduces network transmission time. But if compression is slow relative to message transmission time and/or your messages compress poorly, then it's best to disable compression to avoid the extra time and computational effort it requires. Also consider that Kafka has to decompress the messages so consumers can consume them, and there is a cost associated with that.

I recommend experimenting with the settings. It will take some trial and error to fine-tune the performance. I didn't put much effort into creating the settings shown in the example config file, so the performance will likely be far from optimal if you use them unmodified.
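For reference, here is a rough sketch of what the <compression> section can look like; the element and attribute names are assumptions to verify against detailed_config.md:

```xml
<compression>
  <namedConfigs>
    <!-- Sketch: named compression configs; names and attributes are
         assumptions. -->
    <config name="noComp" type="none" />
    <config name="snappyComp" type="snappy" minSize="1024" />
  </namedConfigs>
  <!-- Only keep the compressed form if it is at most 75% of the original
       size (attribute name is an assumption). -->
  <sizeThresholdPercent value="75" />
  <!-- Compression config applied to topics without a per-topic override. -->
  <defaultTopic config="snappyComp" />
  <topicConfigs>
    <!-- Per-topic overrides, e.g. <topic name="someTopic" config="noComp" /> -->
  </topicConfigs>
</compression>
```

Under that layout, disabling compression entirely would amount to pointing defaultTopic at a config of type none.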