[Bug] ShuffleTaskManager.commitShuffle will get stuck forever if an exception occurs during the flush process #1863

@rickyma

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the bug

[Image: screenshot attached to the original issue]

Affects Version(s)

master

Uniffle Server Log Output

jstack:

"Grpc-1788" #2073 daemon prio=5 os_prio=0 cpu=1723.11ms elapsed=88729.16s tid=0x00007f3d3c0f1000 nid=0x968 waiting for monitor entry [0x00007f3cf97fe000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.uniffle.server.ShuffleTaskManager.commitShuffle(ShuffleTaskManager.java:338)
        - waiting to lock <0x00007f4fbf708e00> (a java.lang.Object)
        at org.apache.uniffle.server.ShuffleServerGrpcService.finishShuffle(ShuffleServerGrpcService.java:468)
        at org.apache.uniffle.proto.ShuffleServerGrpc$MethodHandlers.invoke(ShuffleServerGrpc.java:1060)
        at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
        at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
        at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
        at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:356)
        at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:861)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

"Grpc-1359" #1629 daemon prio=5 os_prio=0 cpu=5536.44ms elapsed=88733.96s tid=0x00007f4380185800 nid=0x7ac waiting on condition [0x00007f41156fe000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.uniffle.server.ShuffleTaskManager.commitShuffle(ShuffleTaskManager.java:360)
        - locked <0x00007f4fbf708e00> (a java.lang.Object)
        at org.apache.uniffle.server.ShuffleServerGrpcService.finishShuffle(ShuffleServerGrpcService.java:468)
        at org.apache.uniffle.proto.ShuffleServerGrpc$MethodHandlers.invoke(ShuffleServerGrpc.java:1060)
        at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
        at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
        at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
        at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:356)
        at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:861)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)	
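
Read together, the two stacks above show `Grpc-1359` sleeping inside `commitShuffle` while it owns monitor `<0x00007f4fbf708e00>`, and `Grpc-1788` blocked trying to enter the same monitor. The following is a minimal, self-contained sketch of that pattern, not Uniffle code; the class `MonitorStatesSketch` and its `observe` helper are hypothetical names introduced only for illustration.

```java
import java.util.concurrent.CountDownLatch;

final class MonitorStatesSketch {

    // Starts a "holder" thread that sleeps while owning a monitor and a
    // "waiter" thread that tries to enter the same monitor, then returns
    // the observed states as { holderState, waiterState }.
    static Thread.State[] observe() throws InterruptedException {
        final Object lock = new Object();
        final CountDownLatch held = new CountDownLatch(1);

        Thread holder = new Thread(() -> {
            synchronized (lock) {
                held.countDown();
                try {
                    Thread.sleep(60_000); // sleeping does not release the monitor
                } catch (InterruptedException e) {
                    // interrupted by observe() so the demo can finish
                }
            }
        }, "holder");

        Thread waiter = new Thread(() -> {
            synchronized (lock) {
                // reached only after the holder leaves its synchronized block
            }
        }, "waiter");

        holder.setDaemon(true);
        waiter.setDaemon(true);
        holder.start();
        held.await();          // the holder now owns the monitor
        waiter.start();

        // Poll (for at most 5 seconds) until both threads reach the expected states.
        long deadline = System.nanoTime() + 5_000_000_000L;
        while ((holder.getState() != Thread.State.TIMED_WAITING
                || waiter.getState() != Thread.State.BLOCKED)
                && System.nanoTime() - deadline < 0) {
            Thread.sleep(10);
        }
        Thread.State[] states = { holder.getState(), waiter.getState() };

        holder.interrupt();    // let the holder exit and release the monitor
        holder.join();
        waiter.join();
        return states;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread.State[] states = observe();
        System.out.println("holder=" + states[0] + " waiter=" + states[1]);
    }
}
```

As long as the holder keeps sleeping inside the synchronized block, every other caller that needs the same monitor stays blocked, which matches the two `finishShuffle` gRPC threads in the dump.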

exception log:

[2024-07-03 08:54:32.973] [HadoopFlushEventThreadPool-1] [WARN] SingleStorageManager.write - Exception happened when write data for ShuffleDataFlushEvent: eventId=252896, appId=application_1716779728283_6825960_1719966578466, shuffleId=0, startPartition=315, endPartition=315, retryTimes=0, underStorage=HadoopStorage, isPended=false, ownedByHugePartition=false, try again
org.apache.uniffle.common.exception.RssException: java.io.IOException: All datanodes [DatanodeInfoWithStorage[127.0.0.1:9003,DS-3ad04d12-7d78-405f-ba33-d2bb706f073d,DISK]] are bad. Aborting...
        at org.apache.uniffle.storage.handler.impl.HadoopShuffleWriteHandler.write(HadoopShuffleWriteHandler.java:157)
        at org.apache.uniffle.storage.handler.impl.PooledHadoopShuffleWriteHandler.write(PooledHadoopShuffleWriteHandler.java:122)
        at org.apache.uniffle.server.storage.SingleStorageManager.write(SingleStorageManager.java:59)
        at org.apache.uniffle.server.storage.HybridStorageManager.write(HybridStorageManager.java:130)
        at org.apache.uniffle.server.ShuffleFlushManager.processFlushEvent(ShuffleFlushManager.java:165)
        at org.apache.uniffle.server.DefaultFlushEventHandler.handleEventAndUpdateMetrics(DefaultFlushEventHandler.java:97)
        at org.apache.uniffle.server.DefaultFlushEventHandler.lambda$dispatchEvent$0(DefaultFlushEventHandler.java:219)
        at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: All datanodes [DatanodeInfoWithStorage[127.0.0.1:9003,DS-3ad04d12-7d78-405f-ba33-d2bb706f073d,DISK]] are bad. Aborting...
        at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1567)
        at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1501)
        at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1487)
        at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1262)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:673)
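
One plausible reading of the log and the jstack together: the commit path polls until every expected block has been flushed, and a flush event that ultimately fails means the flushed count never catches up. The sketch below shows one way such a wait could be bounded, by failing fast on a recorded flush failure and by enforcing a deadline. It is an illustration under stated assumptions, not Uniffle's implementation: `CommitWaitSketch`, `awaitFlushed`, `onBlockFlushed`, `onFlushFailed`, and the timeout and poll parameters are all hypothetical, and this issue does not show how the real loop in `ShuffleTaskManager.commitShuffle` is written.

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

final class CommitWaitSketch {
    // Hypothetical stand-ins for server-side state; not Uniffle's real fields.
    private final Object commitLock = new Object();
    private final AtomicLong flushedBlocks = new AtomicLong();
    private final AtomicBoolean flushFailed = new AtomicBoolean(false);

    // Called by the (hypothetical) flush path after a block is written.
    void onBlockFlushed() {
        flushedBlocks.incrementAndGet();
    }

    // Called by the (hypothetical) flush path when a write fails for good.
    void onFlushFailed() {
        flushFailed.set(true);
    }

    // Waits until at least expectedBlocks are flushed. Throws instead of
    // spinning forever when a flush failure was recorded or the deadline passes.
    long awaitFlushed(long expectedBlocks, long timeoutMs, long pollMs)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        synchronized (commitLock) {
            while (flushedBlocks.get() < expectedBlocks) {
                if (flushFailed.get()) {
                    throw new IllegalStateException("flush failed; commit cannot complete");
                }
                if (System.nanoTime() - deadline >= 0) {
                    throw new TimeoutException("commit wait exceeded " + timeoutMs + " ms");
                }
                Thread.sleep(pollMs);
            }
            return flushedBlocks.get();
        }
    }

    public static void main(String[] args) throws Exception {
        CommitWaitSketch sketch = new CommitWaitSketch();
        sketch.onBlockFlushed();
        sketch.onFlushFailed(); // simulate a write error like the one in the log
        try {
            sketch.awaitFlushed(2, 1_000, 10);
        } catch (IllegalStateException e) {
            System.out.println("commit aborted: " + e.getMessage());
        }
    }
}
```

Either exit path leaves the synchronized block, so a second caller waiting on the same lock is no longer blocked indefinitely.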

Uniffle Engine Log Output

No response

Uniffle Server Configurations

No response

Uniffle Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
