It's hard to tell without looking at the actual jobs, etc. I think the azure-databricks team can support you better here.
Did you try running the same thing with Python/Scala to see if they also have a problem with scaling? I want to make sure first whether this is a .NET-related issue or not.
Hmm, if you are not executing the UDF, why do you think it's causing the bottleneck? Did you try removing the UDF since it's not being executed? Sorry, I cannot answer all the questions in this thread because of the lack of info on the internals of Databricks (e.g., the "driver daemon"), but please let me know whether the issue is related to .NET once you get some help from the Databricks side.
---
I'm starting to open a ticket with azure-databricks on this as well. I have a cluster where I'm submitting lots of concurrent jobs. With ten concurrent jobs, things work fine and my jobs finish consistently in ~30 seconds each. When I gradually increase to fifty concurrent jobs, things start misbehaving and the average job can take 3x to 6x as long to do the same work.
The cluster is beefy and I don't see resource problems in Ganglia. Below are the memory and CPU usage of the driver during my testing (respectively).
In contrast to the driver, the workers show minimal CPU activity and RAM usage.
You can see that the memory usage on the driver reaches a configured limit of 30 GB and then levels out.
The problem is that my job throughput can't be increased much past 20 concurrent jobs. As I try to submit more of them concurrently, they take longer to execute. There appears to be a bottleneck in my driver program, and I have narrowed it down to the part of the code where I define (but do not execute) some UDF operations in a Spark session.
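Roughly, the part I narrowed it down to looks like this. This is a simplified sketch with made-up names and a hypothetical path, not my actual code:

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

SparkSession spark = SparkSession.Builder().AppName("udf-definition-only").GetOrCreate();

// Define the UDF. My understanding is that wrapping it and applying it to a
// column still involves round trips over the .NET <-> JVM bridge, even though
// no action ever executes it.
Func<Column, Column> normalize = Udf<string, string>(s => s?.Trim().ToLowerInvariant());

DataFrame df = spark.Read().Parquet("/mnt/data/input"); // hypothetical path
DataFrame transformed = df.WithColumn("name_clean", normalize(df["name"]));

// No Show()/Collect()/Write() on 'transformed' here, so the UDF is never run.
```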
So I'm a bit suspicious of the IPC between the Databricks driver daemon process (a huge 39 GB process) and the spark.net programs (~130 MB each). Here is a snapshot of the processes as they are running:
As things start to slow down, I also start to see the occasional network exception (and/or thread pool exception) in the log4j logs. Here are two common examples:
There are corresponding exceptions on the .NET side, but they are fairly unhelpful:
System.Exception: JVM method execution failed: Nonstatic method 'intValue' failed for class '94' when called with no arguments
I opened the ticket with azure-databricks because I'm pretty stuck and haven't made any progress on this for about two days.
At first I thought the bottleneck was in the REST API, because they throttle job creation. But after doing a bit more profiling, it appears there is an IPC problem (or maybe I'm being affected by both).
I think one unusual part of my scenario is the short job durations (~30 sec) combined with a moderately high number of concurrent requests (over 20 and up to 50).
Another unusual aspect is the way the "driver daemon" works in Databricks. On my local Spark installation I get distinct applications when I submit to the cluster, and they appear to be independent of each other. But in Databricks they are all lumped together into one massive "driver daemon" that seems to scale pretty poorly when running lots of .NET IPC/interop.
Can someone please give some tips on how to troubleshoot this? Is there any profiling or logging I can enable to dig into the IPC/network/thread pool issues that I'm seeing?
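As a concrete example of the kind of knob I'm looking for: would bumping the JVM-side logging for the .NET bridge help? I'm only guessing at the logger name from the org.apache.spark.api.dotnet package in dotnet/spark, and assuming the cluster still uses the standard log4j 1.x properties file:

```properties
# Guess at the logger name, based on the org.apache.spark.api.dotnet package in
# dotnet/spark; assumes the cluster is still using the stock log4j 1.x config.
log4j.logger.org.apache.spark.api.dotnet=DEBUG
```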
Is there a way to get a local standalone cluster on my workstation to behave in a way that is similar to the Databricks "driver daemon"? Should I create a single application with lots of SparkSessions running concurrently?
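Something like the following is what I have in mind. It's only a rough guess at how to approximate the driver daemon (one shared session with many concurrent actions, with the app still launched through spark-submit so the master URL comes from the submit command), and I don't know whether it reproduces the same IPC behavior:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Spark.Sql;

class DriverDaemonSimulation
{
    static async Task Main()
    {
        SparkSession spark = SparkSession.Builder().AppName("driver-daemon-simulation").GetOrCreate();

        // Fire 50 short "jobs" concurrently from one driver process, roughly
        // mimicking many concurrent submissions hitting a single driver daemon.
        // I'm not sure whether each task should get its own SparkSession instead
        // of sharing one; that's part of what I'm asking.
        Task[] jobs = Enumerable.Range(0, 50).Select(i => Task.Run(() =>
        {
            DataFrame df = spark.Range(0, 1_000_000);
            long matches = df.Filter("id % 7 = 0").Count();
            Console.WriteLine($"job {i}: {matches} matching rows");
        })).ToArray();

        await Task.WhenAll(jobs);
    }
}
```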
Sorry for the long question.
I hope azure-databricks might be able to help as well, but I get the impression that they don't encounter many .NET customers yet.