It's hard to tell without looking at the actual jobs, etc. I think the azure-databricks team can support you better here.
Did you try running the same thing with Python/Scala to see if they also have a problem with scaling? I want to make sure first whether this is a .NET-related issue or not.
Hmm, if you are not executing the UDF, why do you think it's causing the bottleneck? Did you try removing the UDF since it's not being executed? Sorry, I cannot answer all the questions in this thread because of the lack of info on the internals of Databricks (e.g., the "driver daemon"), but please let me know whether the issue is related to .NET once you get some help from the Databricks side.
---
I'm starting to open a ticket with azure-databricks on this as well. I have a cluster where I'm submitting lots of concurrent jobs. With ten concurrent jobs, things work fine and my jobs finish consistently in ~30 seconds each. When I gradually increase to fifty concurrent jobs, things start misbehaving and the average job can take 3x to 6x as long to do the same work.
The cluster is beefy and I don't see resource problems in Ganglia. Below are the memory and CPU usage of the driver during my testing (respectively).
In contrast to the driver, the workers show minimal CPU activity and RAM usage.
You can see that the memory usage on the driver reaches a configured limit of 30 GB and then levels out.
The problem is that my job throughput can't be increased much past 20 concurrent jobs. As I try to submit more of them concurrently, they take longer to execute. There appears to be a bottleneck in my driver program, and I have narrowed it down to the part of the code where I define (but do not execute) some UDF operations in a Spark session.
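Roughly, the part I narrowed it down to looks like this. This is a simplified sketch with made-up names and a hypothetical path, not my actual code:

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

SparkSession spark = SparkSession.Builder().AppName("udf-definition-only").GetOrCreate();

// Define the UDF. My understanding is that wrapping it and applying it to a
// column still involves round trips over the .NET <-> JVM bridge, even though
// no action ever executes it.
Func<Column, Column> normalize = Udf<string, string>(s => s?.Trim().ToLowerInvariant());

DataFrame df = spark.Read().Parquet("/mnt/data/input"); // hypothetical path
DataFrame transformed = df.WithColumn("name_clean", normalize(df["name"]));

// No Show()/Collect()/Write() on 'transformed' here, so the UDF is never run.
```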
So I'm a bit suspicious of the IPC between the Databricks driver daemon process (a huge 39 GB process) and the spark.net programs (~130 MB each). Here is a snapshot of the processes as they are running:
As things start to slow down, I also start to see the occasional network exception (and/or thread pool exception) in the log4j logs. Here are two common examples:
There are corresponding exceptions on the .NET side, but they are fairly unhelpful:
System.Exception: JVM method execution failed: Nonstatic method 'intValue' failed for class '94' when called with no arguments
I opened the ticket with azure-databricks because I'm pretty stuck and haven't made any progress on this for about two days.
At first I thought the bottleneck was in the REST API, because they throttle job creation. But after doing a bit more profiling, it appears there is an IPC problem (or maybe I'm being affected by both).
I think one unusual part of my scenario is the short job durations (~30 sec) combined with a moderately high number of concurrent requests (over 20 and up to 50).
Another unusual aspect is the way the "driver daemon" works in Databricks. On my local Spark installation I get distinct applications when I submit to the cluster, and they appear to be independent of each other. But in Databricks they are all lumped together into one massive "driver daemon" that seems to scale pretty poorly when running lots of .NET IPC/interop.
Can someone please give some tips on how to troubleshoot this? Is there any profiling or logging I can enable to dig into the IPC/network/thread pool issues that I'm seeing?
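As a concrete example of the kind of knob I'm looking for: would bumping the JVM-side logging for the .NET bridge help? I'm only guessing at the logger name from the org.apache.spark.api.dotnet package in dotnet/spark, and assuming the cluster still uses the standard log4j 1.x properties file:

```properties
# Guess at the logger name, based on the org.apache.spark.api.dotnet package in
# dotnet/spark; assumes the cluster is still using the stock log4j 1.x config.
log4j.logger.org.apache.spark.api.dotnet=DEBUG
```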
Is there a way to get a local standalone cluster on my workstation to behave in a way that is similar to the Databricks "driver daemon"? Should I create a single application with lots of SparkSessions running concurrently?
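Something like the following is what I have in mind. It's only a rough guess at how to approximate the driver daemon (one shared session with many concurrent actions, with the app still launched through spark-submit so the master URL comes from the submit command), and I don't know whether it reproduces the same IPC behavior:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Spark.Sql;

class DriverDaemonSimulation
{
    static async Task Main()
    {
        SparkSession spark = SparkSession.Builder().AppName("driver-daemon-simulation").GetOrCreate();

        // Fire 50 short "jobs" concurrently from one driver process, roughly
        // mimicking many concurrent submissions hitting a single driver daemon.
        // I'm not sure whether each task should get its own SparkSession instead
        // of sharing one; that's part of what I'm asking.
        Task[] jobs = Enumerable.Range(0, 50).Select(i => Task.Run(() =>
        {
            DataFrame df = spark.Range(0, 1_000_000);
            long matches = df.Filter("id % 7 = 0").Count();
            Console.WriteLine($"job {i}: {matches} matching rows");
        })).ToArray();

        await Task.WhenAll(jobs);
    }
}
```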
Sorry for the long question.
I hope azure-databricks might be able to help as well, but I get the impression that they don't encounter many .NET customers yet.