I believe this is related to #201.
I am running Bpipe 0.9.11 in Apptainer. @hh1985 was using Docker in #201, so this may be related to containerization, though I am not sure.
I don't know if it is related, but I am running multiple instances of Bpipe concurrently. I have tried to isolate them by temporarily setting $HOME to a unique temporary directory for each instance (since use of $HOME is hardcoded into Bpipe in at least one place, if I recall correctly).
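For reference, the per-instance isolation described above can be sketched roughly as follows. This is only an illustration of the idea, not my exact setup: the helper name `isolated_env` and the `bpipe-home-` prefix are made up for this example, and it does not model how Apptainer itself handles `$HOME`.

```python
import os
import tempfile


def isolated_env(prefix="bpipe-home-"):
    """Return a copy of the current environment with HOME pointing at a
    freshly created, unique temporary directory.

    The helper name and prefix are hypothetical; only the idea (one HOME
    per concurrent Bpipe instance) comes from the description above.
    """
    home = tempfile.mkdtemp(prefix=prefix)  # unique directory per call
    env = dict(os.environ)                  # copy, so the parent env is untouched
    env["HOME"] = home
    return env


# Each concurrent instance would then be launched with its own environment,
# for example: subprocess.run(bpipe_command, env=isolated_env())
# where bpipe_command is whatever command line starts the pipeline.
```

The parent process's environment is left unchanged, so several instances can be started from the same script without affecting each other's `$HOME`.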
It seems that $BPIPE_PID sometimes ends up being null or an empty string. I don't know whether this is just an IO error when reading the temporary PID file or whether there is another issue behind it. When it happens, every path that should point to a file named $BPIPE_PID instead points to its parent directory, which would produce the error seen in #201 (along with whatever other problems arise from the PID being null).
Below is the head of the log file of a job that suffered from this issue. Note that the log's filename is .bpipe/logs/.bpipe.log rather than the expected .bpipe/logs/$BPIPE_PID.bpipe.log, which means the PID was empty. The contents of the log support this:
bpipe.Runner [1] INFO |11:08:23 Starting
bpipe.Runner [1] INFO |11:08:24 OS: Linux (5.15.0-75-generic) Java: 11.0.23 Vendor: Debian
bpipe.Runner [1] INFO |11:08:24 Initializing plugins ...
bpipe.Config [1] INFO |11:08:24 No plugins directory found: /output/.bpipe/plugins
bpipe.Runner [1] INFO |11:08:26 =================== GUID=b35ee14be2f543845b570c1fa5de6d85742cbe76 PID= () ==============
There is no PID in the log, and the whole job ends up failing.
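A quick way to check a log for this symptom is to look at the banner line that contains `GUID=... PID=...`. The sketch below is my own helper, not part of Bpipe; the regular expression is inferred only from the banner line shown above, so it is an assumption that other Bpipe versions format the line the same way.

```python
import re

# Pattern inferred from the banner line in the log excerpt; the exact format
# in other Bpipe versions is an assumption.
BANNER = re.compile(r"GUID=(?P<guid>[0-9a-f]+)\s+PID=(?P<pid>\d*)\s*\(")


def pid_from_log(lines):
    """Return the PID string from the first banner line found.

    Returns "" when the banner is present but the PID is empty (the failure
    described here), and None when no banner line is found at all.
    """
    for line in lines:
        match = BANNER.search(line)
        if match:
            return match.group("pid")
    return None


failed_banner = (
    "bpipe.Runner [1] INFO |11:08:26 =================== "
    "GUID=b35ee14be2f543845b570c1fa5de6d85742cbe76 PID= () =============="
)
if pid_from_log([failed_banner]) == "":
    print("banner found, but the PID is empty")
```

Distinguishing an empty PID (`""`) from a missing banner (`None`) makes it possible to tell this failure apart from a log that was simply cut short.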
When the failed job is rerun, it (usually) gets a PID and proceeds as expected. Head of the log after restarting, .bpipe/logs/3648408.bpipe.log:
bpipe.Runner [1] INFO |11:10:11 Starting
bpipe.Runner [1] INFO |11:10:11 OS: Linux (5.15.0-75-generic) Java: 11.0.23 Vendor: Debian
bpipe.Runner [1] INFO |11:10:11 Initializing plugins ...
bpipe.Config [1] INFO |11:10:11 No plugins directory found: /output/.bpipe/plugins
bpipe.Runner [1] INFO |11:10:11 =================== GUID=fa5fa6ae3cbc25e0c44b5d8850817a28d18900cf PID=3648408 (3648408) ==============
This time there is a PID.
My workaround so far has been to retry each job several times.
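The retry logic amounts to something like the sketch below. It assumes the failed run exits with a non-zero status, which I have not verified for every failure mode; `run_with_retries` and its parameters are names I chose for this example.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, delay=1.0):
    """Run cmd until it exits with status 0 or the attempts are used up.

    Returns the CompletedProcess of the last attempt. Assumes a failed run
    is signalled by a non-zero exit status.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    result = None
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            break
        if attempt < attempts:
            time.sleep(delay)  # short pause before the next try
    return result
```

Here `cmd` would be the argument list that launches the pipeline. This only hides the problem: the first attempt still fails whenever the PID comes back empty.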