Hello everyone,
I managed to run Firmament using the provided Docker image.
When I start the container, it prints the following error (I don't know whether it is related to my issue):
$ docker run -p 9999:9999 -w /firmament camsas/firmament:dev /firmament/build/src/coordinator --scheduler flow --flow_scheduling_cost_model 6 --listen_uri tcp:0.0.0.0:8081 --http_ui_port 9999 --task_lib_dir=/firmament/build/src
Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/H2GR6RBYUIPHBXDMSGKPBAYWNE:/var/lib/docker/overlay2/l/NKEZN6MLXD4DGK5HNNI2K4SN7K:/var/lib/docker/overlay2/l/5H5GK4TBC5MY7NFNYEW2P7MPRP:/var/lib/docker/overlay2/l/2DVGBZKGQHVXVENMWHAW3HNEGB:/var/lib/docker/overlay2/l/DA5VWJ6IOM3MFNW3T6VLSZ4ZDR:/var/lib/docker/overlay2/l/NFSSHKRHC7XPWN7BXCLFDMXHF6:/var/lib/docker/overlay2/l/C4RYQ3MDIDZ376KHATSEPRHOOC:/var/lib/docker/overlay2/l/23CTT2D5BDVQOVVUTHAGP4SPKX:/var/lib/docker/overlay2/l/UTO3PZRTFU4CU'
Despite this, the server seems to be running correctly, and I am able to access the GUI at http://:9999/
However, when I tried to submit a job with:
python scripts/job/job_submit.py 172.17.0.2 9999 /bin/sleep 60
I got the following error:
E1116 17:19:04.534961 6 task_health_checker.cc:51] Task 18085502784089753274 has failed!
Here is /tmp/coordinator.INFO:
I1116 17:16:14.514029 1 coordinator_main.cc:36] Firmament coordinator starting ...
I1116 17:16:14.531463 1 coordinator.cc:120] Using Quincy-style min cost flow-based scheduler.
I1116 17:16:14.531641 1 coordinator.cc:133] Coordinator starting on host tcp:0.0.0.0:8081, UUID 42f151f8-deef-46b8-b8a6-88ab53e5e6a7
I1116 17:16:14.531744 1 coordinator.cc:221] Detecting resource topology:
I1116 17:16:14.531754 1 topology_manager.cc:212] *** LEVEL: 0
I1116 17:16:14.531767 1 topology_manager.cc:217] Index: 0: Machine#0(7470MB)
I1116 17:16:14.531774 1 topology_manager.cc:212] *** LEVEL: 1
I1116 17:16:14.531781 1 topology_manager.cc:217] Index: 0: Socket#0
I1116 17:16:14.531786 1 topology_manager.cc:212] *** LEVEL: 2
I1116 17:16:14.531793 1 topology_manager.cc:217] Index: 0: L3(6144KB)
I1116 17:16:14.531800 1 topology_manager.cc:212] *** LEVEL: 3
I1116 17:16:14.531805 1 topology_manager.cc:217] Index: 0: L2(256KB)
I1116 17:16:14.531812 1 topology_manager.cc:217] Index: 1: L2(256KB)
I1116 17:16:14.531819 1 topology_manager.cc:217] Index: 2: L2(256KB)
I1116 17:16:14.531826 1 topology_manager.cc:217] Index: 3: L2(256KB)
I1116 17:16:14.531831 1 topology_manager.cc:212] *** LEVEL: 4
I1116 17:16:14.531838 1 topology_manager.cc:217] Index: 0: L1d(32KB)
I1116 17:16:14.531846 1 topology_manager.cc:217] Index: 1: L1d(32KB)
I1116 17:16:14.531852 1 topology_manager.cc:217] Index: 2: L1d(32KB)
I1116 17:16:14.531859 1 topology_manager.cc:217] Index: 3: L1d(32KB)
I1116 17:16:14.531864 1 topology_manager.cc:212] *** LEVEL: 5
I1116 17:16:14.531870 1 topology_manager.cc:217] Index: 0: Core#0
I1116 17:16:14.531877 1 topology_manager.cc:217] Index: 1: Core#1
I1116 17:16:14.531883 1 topology_manager.cc:217] Index: 2: Core#2
I1116 17:16:14.531889 1 topology_manager.cc:217] Index: 3: Core#3
I1116 17:16:14.531894 1 topology_manager.cc:212] *** LEVEL: 6
I1116 17:16:14.531900 1 topology_manager.cc:217] Index: 0: PU#0
I1116 17:16:14.531908 1 topology_manager.cc:217] Index: 1: PU#1
I1116 17:16:14.531913 1 topology_manager.cc:217] Index: 2: PU#2
I1116 17:16:14.531920 1 topology_manager.cc:217] Index: 3: PU#3
I1116 17:16:14.531926 1 coordinator.cc:176] Found 4 local PUs.
I1116 17:16:14.531932 1 coordinator.cc:177] Resource URI is tcp:0.0.0.0:8081
I1116 17:16:14.534741 1 coordinator_http_ui.cc:1321] Coordinator HTTP interface up!
I1116 17:16:22.949242 16 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /task/
I1116 17:16:23.151162 9 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /stats/
W1116 17:16:23.151223 9 coordinator_http_ui.cc:834] Invalid stats request!
I1116 17:17:25.160835 9 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /task/
I1116 17:17:25.308948 16 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /stats/
W1116 17:17:25.308990 16 coordinator_http_ui.cc:834] Invalid stats request!
I1116 17:18:28.951195 14 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /task/
I1116 17:18:29.114184 16 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /stats/
W1116 17:18:29.114243 16 coordinator_http_ui.cc:834] Invalid stats request!
I1116 17:18:57.184258 16 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /job/submit/
I1116 17:18:57.198359 16 coordinator.cc:865] NEW JOB: 1468db75-43d3-417e-9e26-f9843eba8c8e
I1116 17:18:57.198387 16 flow_scheduler.cc:405] START SCHEDULING (via 1468db75-43d3-417e-9e26-f9843eba8c8e)
W1116 17:18:57.198391 16 flow_scheduler.cc:406] This way of scheduling a job is slow in the flow scheduler! Consider using ScheduleAllJobs() instead.
I1116 17:18:57.198488 16 utils.cc:341] External execution of command: build/third_party/cs2/src/cs2/cs2.exe
I1116 17:18:57.475673 20 local_executor.cc:393] COMMAND LINE for task 18085502784089753274: perf stat -o /tmp/firmament-perf/aa1d8806-8de1-4c73-b634-214341eed606-18085502784089753274.perf -e cpu-clock,task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses,cache-misses,cache-references,stalled-cycles-frontend,stalled-cycles-backend,node-loads,node-load-misses -- /bin/sleep 60
I1116 17:18:57.476095 16 coordinator.cc:911] Attempted to schedule job 1468db75-43d3-417e-9e26-f9843eba8c8e, successfully scheduled 1 tasks.
E1116 17:19:04.534961 6 task_health_checker.cc:51] Task 18085502784089753274 has failed!
I1116 17:19:04.535176 6 event_driven_scheduler.cc:144] Task 18085502784089753274 has not reported heartbeats for 60s and its handler thread has exited. Declaring it FAILED!
I1116 17:19:04.535195 6 local_executor.cc:145] kill(2) for task 18085502784089753274 returned -1
And here is what I get from the GUI:
When I click on the stderr link, I get:
E1116 17:18:57.757828 21 local_executor.cc:443] execvp failed for task command 'perf stat -o /tmp/firmament-perf/aa1d8806-8de1-4c73-b634-214341eed606-18085502784089753274.perf -e cpu-clock,task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses,cache-misses,cache-references,stalled-cycles-frontend,stalled-cycles-backend,node-loads,node-load-misses -- /bin/sleep 60 ': No such file or directory [2]
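My reading of this message is that execvp could not find the first word of the command, `perf`, inside the container, although I have not confirmed that. Below is a minimal sketch of how I would check it. The helper name `unresolved_executables` and the `/tmp/out.perf` path are placeholders of mine, not part of Firmament, and the sketch assumes the wrapped command follows a `--` separator as in the log above.

```python
import shlex
import shutil


def unresolved_executables(command_line, search_path=None):
    """Return the executables in command_line that cannot be resolved.

    Checks the first word of the command and, if a "--" separator is
    present, the word right after it (the wrapped command).
    search_path defaults to the current PATH when left as None.
    """
    tokens = shlex.split(command_line)
    if not tokens:
        return []
    candidates = [tokens[0]]
    if "--" in tokens:
        idx = tokens.index("--")
        if idx + 1 < len(tokens):
            candidates.append(tokens[idx + 1])
    # shutil.which returns None when the name is not an executable file
    # reachable through the search path (or at the given absolute path).
    return [c for c in candidates if shutil.which(c, path=search_path) is None]


if __name__ == "__main__":
    # Shortened form of the command from the stderr output above;
    # /tmp/out.perf is a placeholder path.
    cmd = "perf stat -o /tmp/out.perf -- /bin/sleep 60"
    print(unresolved_executables(cmd))
```

If this is run inside the container and the printed list contains `perf`, that would match the "No such file or directory" error. An empty list would suggest the cause is elsewhere.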
What am I missing?
Thanks!
Gabriele