Commit 196b47e
[KYUUBI #6997] Get the latest batch app info after submit process terminated to prevent batch ERROR due to engine submit timeout
### Why are the changes needed?
We encountered the following issue.
For Spark on YARN:
```
spark.yarn.submit.waitAppCompletion=false
kyuubi.engine.yarn.submit.timeout=PT10M
```
Due to a network issue, the application submission was very slow.
The application was only submitted after 15 minutes.
<img width="1430" alt="image" src="https://github.com/user-attachments/assets/a326c3d1-4d39-42da-b6aa-cad5f8e7fc4b" />
<img width="1350" alt="image" src="https://github.com/user-attachments/assets/8e20056a-bd71-4515-a5e3-f881509a34b2" />
Then the batch went from the PENDING state to the ERROR state directly, because the application state was NOT_FOUND (the submission exceeded `kyuubi.engine.yarn.submit.timeout`).
https://github.com/apache/kyuubi/blob/a54ee39ab338e310c6b9a508ad8f14c0bd82fa0f/kyuubi-server/src/main/scala/org/apache/kyuubi/engine/ApplicationOperation.scala#L99-L106
<img width="1727" alt="image" src="https://github.com/user-attachments/assets/20a2987c-675c-4136-a107-001f30b1b217" />
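The timeout behaviour described above can be sketched as follows. This is a minimal illustration, not the code in `ApplicationOperation.scala`: the object `SubmitTimeoutSketch`, the `AppState` type, and the `resolve` function are hypothetical names, and the only assumption taken from this description is that a missing application is reported as UNKNOWN before the submit timeout and as NOT_FOUND after it.

```scala
import java.time.Duration

object SubmitTimeoutSketch {
  // Hypothetical stand-ins for the application states named in this description.
  sealed trait AppState
  case object Unknown extends AppState  // no report yet, still within the submit timeout
  case object NotFound extends AppState // no report and the submit timeout has elapsed
  case object Pending extends AppState  // the resource manager knows the application

  /** State to report for a lookup made `elapsed` after the submission started.
    * `report` is the state returned by the resource manager, if it has one. */
  def resolve(report: Option[AppState], elapsed: Duration, submitTimeout: Duration): AppState =
    report.getOrElse {
      if (elapsed.compareTo(submitTimeout) > 0) NotFound else Unknown
    }
}
```

With `kyuubi.engine.yarn.submit.timeout=PT10M`, a lookup made 12 minutes into a submission that takes 15 minutes has no report yet and is already past the timeout, so in this sketch it resolves to `NotFound`.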
Here is the operation event:
<img width="1727" alt="image" src="https://github.com/user-attachments/assets/e2bab9c3-a959-4e2b-a207-813ae6489b30" />
However, according to the batch log, the current application status should be `PENDING`.
```
:2025-03-21 17:36:19.350 INFO [KyuubiSessionManager-exec-pool: Thread-176922] org.apache.kyuubi.operation.BatchJobSubmission: Batch report for bbba09c8-3704-4a87-8394-9bcbbd39cc34, Some(ApplicationInfo(application_1741747369441_2258235,6042072c-e8fa-425d-a6a3-3d5bbb4ec1e3-275732_6042072c-e8fa-425d-a6a3-3d5bbb4ec1e3-275732.e3a34b86-7fc7-43ea-b4a5-1b6f27df54b5.0_20250322002147.stm,PENDING,Some(https://apollo-rno-rm-2.vip.hadoop.ebay.com:50030/proxy/application_1741747369441_2258235/),Some()))
```
So we should retrieve the batch application info again after the submission process has terminated and before checking whether the application failed. This picks up the current application information and prevents the following corner case:
1. the application submission takes longer than `kyuubi.engine.yarn.submit.timeout`, so the app state is reported as NOT_FOUND
2. the application report cannot be fetched before the submission process terminates
3. the batch state then goes from PENDING to ERROR directly.
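The ordering change can be illustrated with a small sketch. It is written under stated assumptions and is not the patched Kyuubi code: `BatchFailureCheckSketch`, `decideBefore`, `decideAfter`, and the `fetchLatest` callback are hypothetical names that only model "decide from the stale state" versus "fetch again after the submit process exits, then decide".

```scala
object BatchFailureCheckSketch {
  // Hypothetical application and batch states, reduced to what the corner case needs.
  sealed trait AppState
  case object NotFound extends AppState
  case object Pending extends AppState

  sealed trait BatchState
  case object BatchPending extends BatchState
  case object BatchError extends BatchState

  // Old ordering: the failure check uses the last state observed while the
  // submit process was still running, which may be a stale NOT_FOUND.
  def decideBefore(lastSeen: AppState): BatchState =
    if (lastSeen == NotFound) BatchError else BatchPending

  // New ordering: once the submit process has exited, fetch the application
  // state again and run the failure check on the fresh value.
  def decideAfter(fetchLatest: () => AppState): BatchState = {
    val latest = fetchLatest()
    if (latest == NotFound) BatchError else BatchPending
  }
}
```

In the scenario from this description, the stale state is `NotFound` while a fresh lookup returns `Pending`, so `decideBefore(NotFound)` yields `BatchError` whereas `decideAfter(() => Pending)` keeps the batch in `BatchPending`.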
Conclusion:
Before this PR, the application state transition was:
UNKNOWN(before submit timeout) -> NOT_FOUND(reach submit timeout) -> processExit -> batchOpError -> PENDING(updateApplicationInfoMetadataIfNeeded) -> UNKNOWN(batchError but app not terminated)
After this PR, it should be:
UNKNOWN(before submit timeout) -> NOT_FOUND(reach submit timeout) -> processExit -> PENDING(after process terminated) -> ....
### How was this patch tested?
Existing GA.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #6997 from turboFei/app_not_found_v2.
Closes #6997
370cf49 [Wang, Fei] v2
912ec28 [Wang, Fei] nit
3c376f9 [Wang, Fei] log the op ex
d9cbdb8 [Wang, Fei] fix app not found
Authored-by: Wang, Fei <[email protected]>
Signed-off-by: Wang, Fei <[email protected]>

1 parent 2080c21 commit 196b47e
2 files changed, +44 -33 lines changed:
- kyuubi-common/src/main/scala/org/apache/kyuubi/operation
- kyuubi-server/src/main/scala/org/apache/kyuubi/operation
Lines changed: 1 addition & 0 deletions
| Original lines | New lines | Change |
|---|---|---|
| 124-129 | 124-130 | 1 line added (new line 127) |
Lines changed: 43 additions & 33 deletions
| Original lines | New lines | Change |
|---|---|---|
| 165-170 | 165-172 | 2 lines added (new lines 168-169) |
| 250-299 | 252-309 | 41 lines added, 33 lines removed |