
SNOW-1369280 Close channels asynchronously #841

Merged

Conversation

@sfc-gh-lshcharbaty (Contributor) commented May 8, 2024

Overview

SNOW-1369280
Closing channels in parallel speeds up task rebalancing. As in #839, SFException is ignored and does not fail a connector task.
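For readers skimming the PR, here is a minimal, self-contained sketch of the pattern (the Runnable map below is a hypothetical stand-in for the connector's partition-to-channel map, not the actual connector classes):

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;

public class ParallelCloseSketch {
  public static void main(String[] args) {
    // Hypothetical stand-in for the real partition -> channel map.
    Map<String, Runnable> partitionsToChannel = Map.of(
        "topic-0", () -> System.out.println("closing channel for topic-0"),
        "topic-1", () -> System.out.println("closing channel for topic-1"));

    // Each close runs as its own CompletableFuture; a failure is logged and
    // swallowed (mirroring the "ignore SFException" behavior from #839) so it
    // does not fail the whole task.
    CompletableFuture<?>[] futures = partitionsToChannel.entrySet().stream()
        .map(entry -> CompletableFuture.runAsync(entry.getValue())
            .exceptionally(ex -> {
              System.err.println("Ignoring failure closing " + entry.getKey() + ": " + ex);
              return null;
            }))
        .toArray(CompletableFuture[]::new);

    // Wait for all closes to finish before the task shuts down.
    CompletableFuture.allOf(futures).join();
  }
}
```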

Pre-review checklist

  • This change should be part of a Behavior Change Release. See go/behavior-change.
  • This change has passed Merge gate tests
  • Snowpipe Changes
  • Snowpipe Streaming Changes
  • This change is TEST-ONLY
  • This change is README/Javadocs only
  • This change is protected by a config parameter snowflake.streaming.closeChannelsInParallel.enabled.
    • Yes - Added end-to-end and unit tests.
    • No - Suggest why it is not param protected
  • Is this change protected by parameter <PARAMETER_NAME> on the server side?
    • The parameter/feature is not yet active in production (partial rollout or PrPr, see Changes for Unreleased Features and Fixes).
    • If there is an issue, it can be safely mitigated by turning the parameter off. This is also verified by a test (See go/ppp).

Comment on lines +128 to +131
// Whether to close streaming channels in parallel.
public static final String SNOWPIPE_STREAMING_CLOSE_CHANNELS_IN_PARALLEL =
"snowflake.streaming.closeChannelsInParallel.enabled";
public static final boolean SNOWPIPE_STREAMING_CLOSE_CHANNELS_IN_PARALLEL_DEFAULT = false;
Contributor

This doesn't feel like a configuration that should be exposed to the customer. As a customer, I don't know whether I should close the channels in parallel or not. In fact, I don't really care; I just want my system to run without any failure. WDYT?

Collaborator

Agreed, but I think we want to try this out for customers who have a lot of partitions in a single task, so enabling it by default will not benefit a lot of customers.

  • Also, it's good to have param protection.
  • We should stress test this and roll it out slowly over a couple of releases.

@sfc-gh-xhuang (Collaborator) May 8, 2024

What's the risk of having it default to true and then getting rid of the parameter in a future release?

Contributor (Author)

@sfc-gh-tzhang
This is not a publicly announced parameter. It's going to behave as a knob that allows the customer to easily roll the change back in case something goes wrong. Exactly what @sfc-gh-japatel wrote.

@sfc-gh-xhuang
IMO it's better to be safe than sorry. I'm not sure how the parallelization is going to behave in customer environments, as channel closing is submitted to a shared ForkJoinPool that exists as a single instance shared across all tasks on the same host.
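For context on the shared-pool point above, a quick standalone check (plain Java, not connector code) that async work submitted without an explicit Executor lands on the single, JVM-wide common pool:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ForkJoinPool;

public class CommonPoolCheck {
  public static void main(String[] args) {
    // One shared pool per JVM, sized roughly to availableProcessors() - 1.
    System.out.println("common pool parallelism: " + ForkJoinPool.commonPool().getParallelism());

    // runAsync without an executor argument uses that same common pool.
    CompletableFuture.runAsync(
        () -> System.out.println("running on: " + Thread.currentThread().getName())).join();
    // Typically prints a thread named ForkJoinPool.commonPool-worker-N.
  }
}
```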

I suggest we approach "change rollout" parameters the following way:

  1. Protect the change with a parameter and disable it by default. If some customers experience problems that can be solved by the change, they can turn it on.
  2. Enable the change by default. If some customers experience problems caused by the change, they can turn it off and let us know.
  3. Get rid of the older implementation.

Collaborator

Understood, but I was just thinking that there is not much difference/value between steps 1 and 2. But yes, we can be a bit safer this way.

Contributor

Would this still be a behavior change after we finish step 3? Once this configuration is gone, a customer's KC won't be able to start, and they will need to get the logs to understand why. Assuming they know how to look into the logs, they will then realize that the configuration has been removed, and they will probably need to reach out to Snowflake to understand why it was removed.

It seems to me a better way is to do a beta release with the fix so customers can try it (or just provide a private jar). Other customers won't use this version since it's a beta version.

@sfc-gh-akowalczyk (Contributor) May 14, 2024

@sfc-gh-tzhang I don't see any schema validation for configuration properties. We would just stop respecting the snowflake.streaming.closeChannelsInParallel.enabled config key, but no errors should be observed. So I suppose there will be no behavior change afterwards.

Or am I missing anything?
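To illustrate the point about missing schema validation, a hedged sketch (the getBooleanWithDefault helper is hypothetical, not the connector's real config code) of why a leftover snowflake.streaming.closeChannelsInParallel.enabled entry would simply be ignored once the connector stops reading it:

```java
import java.util.Map;

public class ConfigReadSketch {
  // Hypothetical helper: read a boolean key, falling back to a default.
  static boolean getBooleanWithDefault(Map<String, String> config, String key, boolean dflt) {
    String value = config.get(key);
    return value == null ? dflt : Boolean.parseBoolean(value);
  }

  public static void main(String[] args) {
    Map<String, String> connectorConfig =
        Map.of("snowflake.streaming.closeChannelsInParallel.enabled", "true");

    // Today: the key is read and honored.
    System.out.println(getBooleanWithDefault(
        connectorConfig, "snowflake.streaming.closeChannelsInParallel.enabled", false));

    // After step 3 the connector would simply never read this key again; the
    // leftover entry in a customer's config causes no error, only silence.
  }
}
```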

Contributor

I see, then that sounds better, thanks for confirming.

@sfc-gh-japatel (Collaborator) left a comment

Added a couple of comments, but overall LGTM!

Base automatically changed from lshcharbaty/SNOW-1369271_Ignore_SFException to master May 9, 2024 09:13
@sfc-gh-lshcharbaty force-pushed the lshcharbaty/SNOW-1369280_Close_channels_in_parallel branch from b10f038 to 48922a6 May 9, 2024 09:15
@sfc-gh-dseweryn left a comment

LGTM



StreamingClientProvider.getStreamingClientProviderInstance()
.closeClient(this.connectorConfig, this.streamingIngestClient);
private void closeAllInParallel() {
Contributor

Thinking about it more, I'm not sure this approach would help a lot. If you look into the close function in the SDK, it flushes everything in the client and then calls get status on this particular channel. If we do it in parallel, the same number of flushes will be called; every flush will still run serially and block the other flushes. I thought most of the time is spent waiting for the server side to commit, which will stay the same since everything will be flushed at once. Have you verified locally that it reduces the close time dramatically with a large number of channels?
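A toy illustration of the concern above (not SDK code; the lock and sleep durations are made up): if every close has to pass through one client-wide, serialized flush, running the closes in parallel only overlaps the per-channel part, not the flush part.

```java
import java.util.concurrent.CompletableFuture;
import java.util.stream.IntStream;

public class SerializedFlushSketch {
  // Stand-in for a flush that is serialized across the whole client.
  private static final Object FLUSH_LOCK = new Object();

  static void close(int channel) {
    synchronized (FLUSH_LOCK) {
      sleep(100); // serialized "flush" portion
    }
    sleep(20); // per-channel work that can overlap
  }

  static void sleep(long ms) {
    try {
      Thread.sleep(ms);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    long start = System.nanoTime();
    CompletableFuture.allOf(
            IntStream.range(0, 10)
                .mapToObj(i -> CompletableFuture.runAsync(() -> close(i)))
                .toArray(CompletableFuture[]::new))
        .join();
    // Still roughly 10 * 100 ms, because the flush portion cannot overlap.
    System.out.println(((System.nanoTime() - start) / 1_000_000) + " ms total");
  }
}
```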

@sfc-gh-akowalczyk (Contributor) May 13, 2024

Hi @sfc-gh-tzhang, as Lex is OOO, I prepared a short test for that. I am not sure whether the IT is sufficient to give meaningful information, but I ran such a test closing 50 channels. It shows that the parallelism sped up the execution: 13513.7 ms mean sequential execution time vs 1508.0 ms parallel.

But the scenario does not cover the case where we have rows present in the net.snowflake.ingest.streaming.internal.SnowflakeStreamingIngestChannelInternal buffer (the scenario described by you) - I cannot reproduce it even though I played with the buffering parameters.

Contributor

Hmm, do you mean that closing 50 empty channels serially takes 13.5 s? That doesn't sound right; any idea where it spends most of the time?

Contributor

The log level was set to DEBUG, so it added a few additional seconds to the measured time. I changed it to INFO and attached the profiler. It takes ~11 seconds and spends most of its time in CompletableFuture.get().

@sfc-gh-tzhang (Contributor) left a comment

Approved to unblock the PR, but I think this is a risky change, since I'm not sure whether something weird would happen if we close a few hundred channels in parallel, given that every close will spin up a background flush thread. Another concern is that calling get status in parallel could potentially overload the DS node. Longer term, I think the better solution is to add a close function to the SDK that doesn't call flush or get status, so it will be very fast. Could we create a JIRA to track this? Thanks!


@ebuildy commented May 16, 2024

As this is for better performance, is there a way to monitor it, please?

This could help with debugging crash/rebalance issues.

Thank you, can't wait to test it!
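One lightweight way to make this observable, as a sketch (the Runnable stands in for the connector's closeAllInParallel(); in the connector the result would go through its existing LOGGER rather than stdout):

```java
import java.util.concurrent.TimeUnit;

public class CloseTimer {
  // Wraps any close routine and reports how long it took, even if it throws.
  public static void timeClose(Runnable closeAll) {
    long start = System.nanoTime();
    try {
      closeAll.run();
    } finally {
      long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
      System.out.println("Closed all partition channels in " + elapsedMs + " ms");
    }
  }

  public static void main(String[] args) {
    timeClose(() -> { /* closeAllInParallel() would go here */ });
  }
}
```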

})
.toArray(CompletableFuture[]::new);

CompletableFuture.allOf(futures).join();
}
@ebuildy

In my understanding, CompletableFuture uses ForkJoinPool.commonPool() as its default thread pool, with (Runtime.getRuntime().availableProcessors() - 1) threads. We can cheat by overriding the JVM CPU count, but I would rather suggest creating a dedicated thread pool like this:

import java.util.concurrent.*;

private void closeAllInParallel() {
    // CONFIG_THREAD_POOL_SIZE would come from connector configuration.
    ExecutorService executor = Executors.newFixedThreadPool(CONFIG_THREAD_POOL_SIZE);
    try {
      CompletableFuture<?>[] futures =
          partitionsToChannel.entrySet().stream()
              .map(
                  entry -> {
                    String channelKey = entry.getKey();
                    TopicPartitionChannel topicPartitionChannel = entry.getValue();

                    LOGGER.info("Closing partition channel:{}", channelKey);
                    return CompletableFuture.runAsync(
                        topicPartitionChannel::closeChannelAsync, executor);
                  })
              .toArray(CompletableFuture[]::new);

      CompletableFuture.allOf(futures).join();
    } finally {
      executor.shutdown(); // Don't forget to shut down the executor
    }
}

Contributor

Thanks, @ebuildy, for providing an example. Let me merge the PR in its current form; I'll check with the SDK team on a possible built-in solution. If the timeline is not optimistic, I'll propose this change to the rest of the team.


fantastic! Any plan to do the same for open channel operations?

@ebuildy commented May 16, 2024

> Approved to unblock the PR, but I think this is a risky change, since I'm not sure whether something weird would happen if we close a few hundred channels in parallel, given that every close will spin up a background flush thread. Another concern is that calling get status in parallel could potentially overload the DS node. Longer term, I think the better solution is to add a close function to the SDK that doesn't call flush or get status, so it will be very fast. Could we create a JIRA to track this? Thanks!

The default thread pool used by CompletableFuture, according to the doc at https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/util/concurrent/CompletableFuture.html:

> All async methods without an explicit Executor argument are performed using the [ForkJoinPool.commonPool()](https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/util/concurrent/ForkJoinPool.html#commonPool())

It would be better to create a static thread pool.
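A sketch of that idea (the class name, pool size, and daemon thread factory are illustrative, not part of this PR): one bounded, statically shared executor used for channel-close work instead of the JVM-wide common pool.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public final class ChannelCloseExecutorHolder {
  // Single bounded pool shared by all close operations in this JVM.
  // Daemon threads so a lingering pool never blocks JVM shutdown.
  private static final ExecutorService CHANNEL_CLOSE_EXECUTOR =
      Executors.newFixedThreadPool(8, runnable -> {
        Thread t = new Thread(runnable, "channel-close-worker");
        t.setDaemon(true);
        return t;
      });

  private ChannelCloseExecutorHolder() {}

  public static ExecutorService channelCloseExecutor() {
    return CHANNEL_CLOSE_EXECUTOR;
  }
}
```

Passing channelCloseExecutor() as the second argument to CompletableFuture.runAsync would then keep channel-close work off ForkJoinPool.commonPool().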

@sfc-gh-akowalczyk merged commit 22ccebd into master May 20, 2024
81 checks passed
@sfc-gh-akowalczyk deleted the lshcharbaty/SNOW-1369280_Close_channels_in_parallel branch May 20, 2024 12:38
@sfc-gh-akowalczyk (Contributor)

> Approved to unblock the PR, but I think this is a risky change, since I'm not sure whether something weird would happen if we close a few hundred channels in parallel, given that every close will spin up a background flush thread. Another concern is that calling get status in parallel could potentially overload the DS node. Longer term, I think the better solution is to add a close function to the SDK that doesn't call flush or get status, so it will be very fast. Could we create a JIRA to track this? Thanks!

Thanks, created SNOW-1437461
