core.ssh_async[openssh] + slurm gives process state 'FINISHED' but it is 'PENDING' in the queue #7062

Description

@rikigigi

I am seeing random failures in the Quantum ESPRESSO PW CalcJob. My setup uses aiida-core at 239cbf9 (main as of a week ago) and SLURM 24.11.4 (which supports JSON output). Sometimes everything is fine; sometimes the following happens:

  • the daemon decides, for some reason, that the CalcJob has finished and moves on;
  • it then tries to retrieve output files that do not exist yet, because the job is still pending in the SLURM queue, and the CalcJob fails (see the sketch after this list).
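
To make the failure mode concrete, here is a minimal sketch of the polling assumption I suspect is involved (my own illustration, not aiida-core's actual code): the daemon periodically lists the active jobs, and a job ID that is absent from the listing is treated as terminated. A transport hiccup that returns an empty listing with exit code 0 would then be indistinguishable from "job finished":

```python
# Hypothetical sketch of the polling assumption, NOT aiida-core's actual code:
# a job whose ID is missing from the squeue listing is assumed to be done.
import subprocess

def active_job_ids(user: str) -> set[str]:
    """Return the job IDs that squeue currently reports for `user`."""
    result = subprocess.run(
        ['squeue', '--noheader', '-o', '%i', '-u', user],
        capture_output=True, text=True, check=True,
    )
    return {line.strip() for line in result.stdout.splitlines() if line.strip()}

def job_seems_done(job_id: str, user: str) -> bool:
    # Fragile: an empty-but-successful listing (e.g. a transport hiccup that
    # yields retval 0 and no stdout) would make *every* job look finished.
    return job_id not in active_job_ids(user)
```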

In the node metadata, the detailed job info contains:

```json
{
    "retval": 0,
    "stderr": "",
    "stdout": "Account|AdminComment|AllocCPUS|AllocNodes|AllocTRES|AssocID|AveCPU|AveCPUFreq|AveDiskRead|AveDiskWrite|AvePages|AveRSS|AveVMSize|BlockID|Cluster|Comment|Constraints|ConsumedEnergy|ConsumedEnergyRaw|Container|CPUTime|CPUTimeRAW|DBIndex|DerivedExitCode|Elapsed|ElapsedRaw|Eligible|End|ExitCode|Extra|FailedNode|Flags|GID|Group|JobID|JobIDRaw|JobName|Layout|Licenses|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|MaxPages|MaxPagesNode|MaxPagesTask|MaxRSS|MaxRSSNode|MaxRSSTask|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|McsLabel|MinCPU|MinCPUNode|MinCPUTask|NCPUS|NNodes|NodeList|NTasks|Partition|Planned|PlannedCPU|PlannedCPURAW|Priority|QOS|QOSRAW|QOSREQ|Reason|ReqCPUFreq|ReqCPUFreqGov|ReqCPUFreqMax|ReqCPUFreqMin|ReqCPUS|ReqMem|ReqNodes|ReqTRES|Reservation|ReservationId|Restarts|SLUID|Start|State|StdErr|StdIn|StdOut|Submit|SubmitLine|Suspended|SystemComment|SystemCPU|Timelimit|TimelimitRaw|TotalCPU|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutAve|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutMin|TRESUsageOutMinNode|TRESUsageOutMinTask|TRESUsageOutTot|UID|User|UserCPU|WCKey|WCKeyID|WorkDir| materys||0|0||1652|||||||||orfeo|||0|0||00:00:00|0|17827049668961343488|0:0|00:00:00|0|2025-10-10T15:03:15|Unknown|0:0|||StartReceived|1159400193|rbertossa|553227|553227|aiida-259||||||||||||||||||||||0|1|None assigned||GENOA|00:00:16|00:06:24|384|64651|normal|1||None|Unknown|Unknown|Unknown|Unknown|24|46875M|1|billing=24,cpu=24,mem=46875M,node=1|||0|sFESK773FTZ600|Unknown|PENDING|/orfeo/cephfs/scratch/materys/rbertossa/27/37/7004-1b5b-421b-913b-28ecc9ef6731/_scheduler-stderr.txt|/dev/null|/orfeo/cephfs/scratch/materys/rbertossa/27/37/7004-1b5b-421b-913b-28ecc9ef6731/_scheduler-stdout.txt|2025-10-10T15:03:15|sbatch _aiidasubmit.sh|00:00:00||00:00:00|00:20:00|20|00:00:00|||||||||||||||||1159400193|rbertossa|00:00:00||0|/orfeo/cephfs/scratch/materys/rbertossa/27/37/7004-1b5b-421b-913b-28ecc9ef6731| "
}
```

Note that the detailed job info also reports the job state as PENDING.
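
The paste above has collapsed the newline that sacct normally prints between the header row and the data row. Assuming the usual two-line layout, a throwaway parser like the following recovers the State field from the stdout captured in the detailed job info (it returns 'PENDING' for the output above):

```python
def sacct_state(stdout: str) -> str:
    """Extract the State field from pipe-delimited sacct output
    (header line followed by one data row)."""
    lines = [line for line in stdout.splitlines() if line.strip()]
    header, row = lines[0].split('|'), lines[1].split('|')
    return row[header.index('State')]
```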

Do you think this is a SLURM issue (a wrongly reported job status), an issue in the SLURM scheduler plugin, or an issue in the new ssh_async transport with the openssh backend?
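
In case it helps to narrow this down, this is the kind of loop I would run to check whether the transport occasionally returns an empty-but-successful squeue listing. It is only a sketch: 'my-cluster' is a placeholder computer label, and I am assuming the blocking exec_command_wait interface is usable on the core.ssh_async transport:

```python
from aiida import load_profile, orm

load_profile()
computer = orm.load_computer('my-cluster')  # placeholder label

with computer.get_transport() as transport:
    for attempt in range(100):
        retval, stdout, stderr = transport.exec_command_wait(
            "squeue --noheader -o '%i|%T' -u $USER"
        )
        if retval == 0 and not stdout.strip():
            print(f'attempt {attempt}: squeue succeeded but listed no jobs')
```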

Has anybody else experienced similar issues?

Thank you.
