Client restarting cores too rapidly can cause WUs to dump #362

@arisu3

The client restarts the CPU core whenever a GPU core stops, downloads, or starts (#329). These restarts can follow one another so quickly that the client wrongly concludes the core has terminated with no output, and dumps the WU. In both logs below, the restarted core is reported as killed about one second after it starts.

13:35:14:I1:TailFileToLog:WU335:Completed 2500000 out of 2500000 steps (100%)
13:35:14:I1:TailFileToLog:WU335:Average performance: 86.747 ns/day
13:35:15:I1:TailFileToLog:WU335:Checkpoint completed at step 2500000
13:35:15:I1:TailFileToLog:WU337:Completed 20000 out of 250000 steps (8%)
13:35:17:I1:TailFileToLog:WU335:Saving result file ../logfile_01.txt
13:35:17:I1:TailFileToLog:WU335:Saving result file checkpointIntegrator.xml
13:35:17:I1:TailFileToLog:WU335:Saving result file checkpointState.xml
13:35:20:I1:TailFileToLog:WU335:Saving result file positions.xtc
13:35:21:I1:TailFileToLog:WU335:Saving result file science.log
13:35:21:I1:TailFileToLog:WU335:Folding@home Core Shutdown: FINISHED_UNIT
13:35:22:I1:Unit:WU335:Core returned FINISHED_UNIT (100)
13:35:24:I1:Group:Default:Added new work unit: cpus:0 gpus:gpu:03:00:00
13:35:24:I1:Unit:WU335:Uploading WU results
13:35:24:I1:Unit:WU338:Requesting WU assignment for user Grimoire_of_Lolice team 230362
13:35:24:I1:TailFileToLog:WU337:Caught signal SIGINT(2) on PID 74402
13:35:24:I1:TailFileToLog:WU337:Exiting, please wait. . .
13:35:25:I1:Request:OUT29:> CONNECT ds03.scs.illinois.edu:443 HTTP/1.1
13:35:25:I1:Request:OUT30:> CONNECT assign4.foldingathome.org:443 HTTP/1.1
13:35:25:I1:Unit:WU335:UPLOAD 100% 79B of 79B
13:35:25:I1:Request:OUT30:> POST https://assign4.foldingathome.org/api/assign HTTP/1.1
13:35:26:I1:TailFileToLog:WU337:Folding@home Core Shutdown: INTERRUPTED
13:35:27:I1:Unit:WU337:Core returned INTERRUPTED (102)
13:35:27:I3:CoreProcess:Running FahCore: /tmp/.testing/fah/cores/fahcore-a8-lin-64bit-avx2_256-0.0.12/FahCore_a8 -dir weheaCalrTrvMpULieSCQuP5ctvmaAKHnKYXVTAQGuk -suffix 01 -version 8.4.9 -lifeline 70708 -np 45
13:35:27:I3:Unit:WU337:Started FahCore on PID 74465
13:35:27:I1:Request:OUT30:< HTTP/1.1 200 HTTP_OK
13:35:27:I1:Unit:WU338:Received WU assignment iqmRs3wpfXMi1-JFoXM-2cuWnmjjf-QuqxFGd4kkRYg
13:35:27:I1:Unit:WU338:Downloading WU
13:35:27:I1:Request:OUT31:> CONNECT highland1.seas.upenn.edu:443 HTTP/1.1
13:35:28:I1:Request:OUT31:> POST https://highland1.seas.upenn.edu/api/assign HTTP/1.1
13:35:28:E :Unit:WU337:Core was killed
13:35:28:E :Unit:WU337:Core returned FAILED_1 (0)
13:35:28:E :Unit:WU337:The folding core did not produce any log output.  This indicates that the core is not functional on your system.  Check for missing libraries or GPU drivers.  Make a post about your issue on https://foldingforum.org/ to get more help.
13:35:28:E :Unit:WU337:Run did not produce any results. Dumping WU

Another example from a different system:

04:45:05:I1:TailFileToLog:WU747:Completed 2000000 out of 2000000 steps (100%)
04:45:05:I1:TailFileToLog:WU747:Average performance: 11.6599 ns/day
04:45:09:I1:TailFileToLog:WU747:Checkpoint completed at step 2000000
04:45:13:I1:TailFileToLog:WU747:Saving result file ../logfile_01.txt
04:45:13:I1:TailFileToLog:WU747:Saving result file checkpointIntegrator.xml
04:45:13:I1:TailFileToLog:WU747:Saving result file checkpointState.xml
04:45:15:I1:TailFileToLog:WU747:Saving result file positions.xtc
04:45:16:I1:TailFileToLog:WU747:Saving result file science.log
04:45:16:I1:TailFileToLog:WU747:Folding@home Core Shutdown: FINISHED_UNIT
04:45:18:I1:Unit:WU747:Core returned FINISHED_UNIT (100)
04:45:19:I1:Group:Default:Added new work unit: cpus:0 gpus:gpu:03:00:00
04:45:19:I1:Unit:WU747:Uploading WU results
04:45:20:I1:Unit:WU752:Requesting WU assignment for user Grimoire_of_Lolice team 230362
04:45:20:I1:TailFileToLog:WU751:Caught signal SIGINT(2) on PID 1856226
04:45:20:I1:TailFileToLog:WU751:Exiting, please wait. . .
04:45:20:I1:Request:OUT7:> CONNECT ds01.scs.illinois.edu:443 HTTP/1.1
04:45:20:I1:Request:OUT8:> CONNECT assign2.foldingathome.org:443 HTTP/1.1
04:45:20:I1:Unit:WU747:UPLOAD 100% 79B of 79B
04:45:20:I1:Request:OUT8:> POST https://assign2.foldingathome.org/api/assign HTTP/1.1
04:45:21:I1:TailFileToLog:WU751:Folding@home Core Shutdown: INTERRUPTED
04:45:21:I1:Unit:WU751:Core returned INTERRUPTED (102)
04:45:21:I3:CoreProcess:Running FahCore: /var/lib/fah-client/cores/fahcore-a8-lin-64bit-avx2_256-0.0.12/FahCore_a8 -dir 3RSsAH8YXRbc4JMjYV_iWwoxN_ITr0OekvJ5qGDwaAI -suffix 01 -version 8.4.9 -lifeline 1855875 -np 8
04:45:22:I3:Unit:WU751:Started FahCore on PID 1861548
04:45:22:I1:Request:OUT8:< HTTP/1.1 200 HTTP_OK
04:45:22:I1:Unit:WU752:Received WU assignment wDG_6qrI9_hgivEYpVAGoRx2UnCaeVQhCYrinxm30io
04:45:22:I1:Unit:WU752:Downloading WU
04:45:22:I1:Request:OUT9:> CONNECT ds01.scs.illinois.edu:443 HTTP/1.1
04:45:23:E :Unit:WU751:Core was killed
04:45:23:E :Unit:WU751:Core returned FAILED_1 (0)
04:45:23:E :Unit:WU751:The folding core did not produce any log output.  This indicates that the core is not functional on your system.  Check for missing libraries or GPU drivers.  Make a post about your issue on https://foldingforum.org/ to get more help.
04:45:23:E :Unit:WU751:Run did not produce any results. Dumping WU
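
The pattern shared by both logs ("Started FahCore" followed about a second later by "Core was killed" for the same WU) can be searched for mechanically. The following is a minimal sketch that relies only on the log line layout shown above; the function name `find_quick_kills` and the 5-second `window_seconds` default are my own assumptions, not anything taken from the client.

```python
import re
from datetime import datetime

# Matches lines such as "13:35:28:E :Unit:WU337:Core was killed".
# The level field is two characters wide ("I1", "I3", "E ").
UNIT_LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}):(..):Unit:(WU\d+):(.*)$")


def find_quick_kills(log_text, window_seconds=5):
    """Return (wu, seconds) pairs where 'Core was killed' follows
    'Started FahCore' for the same WU within window_seconds.

    window_seconds is an arbitrary, hypothetical threshold.
    """
    started = {}
    hits = []
    for line in log_text.splitlines():
        match = UNIT_LINE.match(line.strip())
        if not match:
            continue
        stamp, _level, wu, message = match.groups()
        when = datetime.strptime(stamp, "%H:%M:%S")
        if message.startswith("Started FahCore on PID"):
            started[wu] = when
        elif message == "Core was killed" and wu in started:
            delta = (when - started.pop(wu)).total_seconds()
            if delta < 0:
                # The log carries no date, so assume a midnight rollover.
                delta += 24 * 60 * 60
            if delta <= window_seconds:
                hits.append((wu, int(delta)))
    return hits


sample = """\
13:35:27:I3:Unit:WU337:Started FahCore on PID 74465
13:35:28:E :Unit:WU337:Core was killed
"""
print(find_quick_kills(sample))  # [('WU337', 1)]
```

Run against a full client log, this lists every WU whose core was killed shortly after starting, which should help confirm whether a dumped WU matches this issue.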

In each case, the core terminates properly in response to the interrupt signal (nothing is out of place in md.log), but the client believes it has failed. This is another issue that stems from the fragile communication between core and client.
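
To make the distinction concrete, here is a rough sketch of how an exit could be classified if the client tracked whether it had itself asked the core to stop. This is purely illustrative and is not the client's actual code: `CoreExit`, `Outcome`, `classify`, and every field name are hypothetical. Only the return codes 100 (FINISHED_UNIT) and 102 (INTERRUPTED) come from the logs above.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    FINISHED = "finished"
    INTERRUPTED = "interrupted"  # keep the WU and restart the core later
    FAILED = "failed"            # candidate for dumping


@dataclass
class CoreExit:
    # All fields are hypothetical bookkeeping, not real client state.
    stop_requested: bool       # the client asked this core to stop
    killed_by_signal: bool     # the process ended because of a signal
    produced_log_output: bool  # the core wrote at least one log line
    return_code: int           # 100 and 102 as seen in the logs above


def classify(exit_info):
    """Classify a core exit without treating a requested stop as a failure."""
    if exit_info.return_code == 100:
        return Outcome.FINISHED
    if exit_info.return_code == 102:
        return Outcome.INTERRUPTED
    # A core killed right after starting has had no time to log anything.
    # If the client itself requested the stop, missing output says nothing
    # about whether the core works on this system.
    if exit_info.stop_requested and exit_info.killed_by_signal:
        return Outcome.INTERRUPTED
    return Outcome.FAILED


# The situation in both logs: killed one second after start, no output yet.
restart_race = CoreExit(stop_requested=True, killed_by_signal=True,
                        produced_log_output=False, return_code=0)
print(classify(restart_race))  # Outcome.INTERRUPTED
```

Under this assumption, "no log output" would only lead to a dump when the client had not requested the stop itself.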
