Task '/bin/sleep 60' has failed #76

@gabrielecastellano

Hello everyone,
I managed to run Firmament using the provided Docker image.
When I start the container, it prints the following error (I don't know whether it is related to my issue):

$ docker run -p 9999:9999 -w /firmament camsas/firmament:dev /firmament/build/src/coordinator --scheduler flow --flow_scheduling_cost_model 6 --listen_uri tcp:0.0.0.0:8081 --http_ui_port 9999 --task_lib_dir=/firmament/build/src
Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/H2GR6RBYUIPHBXDMSGKPBAYWNE:/var/lib/docker/overlay2/l/NKEZN6MLXD4DGK5HNNI2K4SN7K:/var/lib/docker/overlay2/l/5H5GK4TBC5MY7NFNYEW2P7MPRP:/var/lib/docker/overlay2/l/2DVGBZKGQHVXVENMWHAW3HNEGB:/var/lib/docker/overlay2/l/DA5VWJ6IOM3MFNW3T6VLSZ4ZDR:/var/lib/docker/overlay2/l/NFSSHKRHC7XPWN7BXCLFDMXHF6:/var/lib/docker/overlay2/l/C4RYQ3MDIDZ376KHATSEPRHOOC:/var/lib/docker/overlay2/l/23CTT2D5BDVQOVVUTHAGP4SPKX:/var/lib/docker/overlay2/l/UTO3PZRTFU4CU'

Despite this, the server seems to be running correctly, and I am able to access the GUI at http://:9999/

However, when I tried to submit a job with
python scripts/job/job_submit.py 172.17.0.2 9999 /bin/sleep 60
I got the following error:
E1116 17:19:04.534961 6 task_health_checker.cc:51] Task 18085502784089753274 has failed!

Here is /tmp/coordinator.INFO:

I1116 17:16:14.514029     1 coordinator_main.cc:36] Firmament coordinator starting ...
I1116 17:16:14.531463     1 coordinator.cc:120] Using Quincy-style min cost flow-based scheduler.
I1116 17:16:14.531641     1 coordinator.cc:133] Coordinator starting on host tcp:0.0.0.0:8081, UUID 42f151f8-deef-46b8-b8a6-88ab53e5e6a7
I1116 17:16:14.531744     1 coordinator.cc:221] Detecting resource topology:
I1116 17:16:14.531754     1 topology_manager.cc:212] *** LEVEL: 0
I1116 17:16:14.531767     1 topology_manager.cc:217] Index: 0: Machine#0(7470MB)
I1116 17:16:14.531774     1 topology_manager.cc:212] *** LEVEL: 1
I1116 17:16:14.531781     1 topology_manager.cc:217] Index: 0: Socket#0
I1116 17:16:14.531786     1 topology_manager.cc:212] *** LEVEL: 2
I1116 17:16:14.531793     1 topology_manager.cc:217] Index: 0: L3(6144KB)
I1116 17:16:14.531800     1 topology_manager.cc:212] *** LEVEL: 3
I1116 17:16:14.531805     1 topology_manager.cc:217] Index: 0: L2(256KB)
I1116 17:16:14.531812     1 topology_manager.cc:217] Index: 1: L2(256KB)
I1116 17:16:14.531819     1 topology_manager.cc:217] Index: 2: L2(256KB)
I1116 17:16:14.531826     1 topology_manager.cc:217] Index: 3: L2(256KB)
I1116 17:16:14.531831     1 topology_manager.cc:212] *** LEVEL: 4
I1116 17:16:14.531838     1 topology_manager.cc:217] Index: 0: L1d(32KB)
I1116 17:16:14.531846     1 topology_manager.cc:217] Index: 1: L1d(32KB)
I1116 17:16:14.531852     1 topology_manager.cc:217] Index: 2: L1d(32KB)
I1116 17:16:14.531859     1 topology_manager.cc:217] Index: 3: L1d(32KB)
I1116 17:16:14.531864     1 topology_manager.cc:212] *** LEVEL: 5
I1116 17:16:14.531870     1 topology_manager.cc:217] Index: 0: Core#0
I1116 17:16:14.531877     1 topology_manager.cc:217] Index: 1: Core#1
I1116 17:16:14.531883     1 topology_manager.cc:217] Index: 2: Core#2
I1116 17:16:14.531889     1 topology_manager.cc:217] Index: 3: Core#3
I1116 17:16:14.531894     1 topology_manager.cc:212] *** LEVEL: 6
I1116 17:16:14.531900     1 topology_manager.cc:217] Index: 0: PU#0
I1116 17:16:14.531908     1 topology_manager.cc:217] Index: 1: PU#1
I1116 17:16:14.531913     1 topology_manager.cc:217] Index: 2: PU#2
I1116 17:16:14.531920     1 topology_manager.cc:217] Index: 3: PU#3
I1116 17:16:14.531926     1 coordinator.cc:176] Found 4 local PUs.
I1116 17:16:14.531932     1 coordinator.cc:177] Resource URI is tcp:0.0.0.0:8081
I1116 17:16:14.534741     1 coordinator_http_ui.cc:1321] Coordinator HTTP interface up!
I1116 17:16:22.949242    16 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /task/
I1116 17:16:23.151162     9 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /stats/
W1116 17:16:23.151223     9 coordinator_http_ui.cc:834] Invalid stats request!
I1116 17:17:25.160835     9 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /task/
I1116 17:17:25.308948    16 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /stats/
W1116 17:17:25.308990    16 coordinator_http_ui.cc:834] Invalid stats request!
I1116 17:18:28.951195    14 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /task/
I1116 17:18:29.114184    16 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /stats/
W1116 17:18:29.114243    16 coordinator_http_ui.cc:834] Invalid stats request!
I1116 17:18:57.184258    16 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /job/submit/
I1116 17:18:57.198359    16 coordinator.cc:865] NEW JOB: 1468db75-43d3-417e-9e26-f9843eba8c8e
I1116 17:18:57.198387    16 flow_scheduler.cc:405] START SCHEDULING (via 1468db75-43d3-417e-9e26-f9843eba8c8e)
W1116 17:18:57.198391    16 flow_scheduler.cc:406] This way of scheduling a job is slow in the flow scheduler! Consider using ScheduleAllJobs() instead.
I1116 17:18:57.198488    16 utils.cc:341] External execution of command: build/third_party/cs2/src/cs2/cs2.exe
I1116 17:18:57.475673    20 local_executor.cc:393] COMMAND LINE for task 18085502784089753274: perf stat -o /tmp/firmament-perf/aa1d8806-8de1-4c73-b634-214341eed606-18085502784089753274.perf -e cpu-clock,task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses,cache-misses,cache-references,stalled-cycles-frontend,stalled-cycles-backend,node-loads,node-load-misses -- /bin/sleep 60
I1116 17:18:57.476095    16 coordinator.cc:911] Attempted to schedule job 1468db75-43d3-417e-9e26-f9843eba8c8e, successfully scheduled 1 tasks.
E1116 17:19:04.534961     6 task_health_checker.cc:51] Task 18085502784089753274 has failed!
I1116 17:19:04.535176     6 event_driven_scheduler.cc:144] Task 18085502784089753274 has not reported heartbeats for 60s and its handler thread has exited. Declaring it FAILED!
I1116 17:19:04.535195     6 local_executor.cc:145] kill(2) for task 18085502784089753274 returned -1

And here is what I get from the GUI:
[screenshot: firmament]

When I click on the stderr link, I get:

E1116 17:18:57.757828 21 local_executor.cc:443] execvp failed for task command 'perf stat -o /tmp/firmament-perf/aa1d8806-8de1-4c73-b634-214341eed606-18085502784089753274.perf -e cpu-clock,task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses,cache-misses,cache-references,stalled-cycles-frontend,stalled-cycles-backend,node-loads,node-load-misses -- /bin/sleep 60 ': No such file or directory [2]
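
One possible reading of the stderr line above: errno 2 is ENOENT, which execvp typically returns when the program it is asked to run cannot be found. The logged command starts with perf, so it may be worth checking whether perf is resolvable inside the container. A minimal sketch, assuming Python 3 is available there; the helper name missing_executables is hypothetical and not part of Firmament:

```python
import errno
import shutil


def missing_executables(names):
    """Return the entries of `names` that cannot be resolved to an executable.

    Bare names are looked up on PATH; names containing a directory
    component (such as /bin/sleep) are checked directly.
    """
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # errno 2 is ENOENT ("No such file or directory"), the code in the stderr line.
    print("ENOENT =", errno.ENOENT)
    # `perf` is the first word of the logged command; /bin/sleep is the task itself.
    for name in missing_executables(["perf", "/bin/sleep"]):
        print("not found:", name)
```

If this reports perf as not found, the wrapper command would fail before /bin/sleep is ever started, which would be consistent with the task being declared failed a few seconds after scheduling.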

What am I missing?

Thanks!
Gabriele
