I am seeing random failures in the Quantum ESPRESSO PW CalcJob. My setup uses aiida-core at 239cbf9 (main from about a week ago) and SLURM 24.11.4 (which supports JSON output). Sometimes everything is fine; sometimes the following happens:
- the daemon decides, for some reason, that the CalcJob has completed and moves on
- it then tries to retrieve the output files, which do not exist because the job is still pending in the SLURM queue, and the CalcJob fails (a quick cross-check sketch follows this list)
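As a quick cross-check, one can ask SLURM directly whether the job is still queued at the moment the daemon starts retrieving. This is only a diagnostic sketch, not part of my workflow; the job id is the one recorded for this particular run (553227, see the detailed job info below) and the sacct flags are standard:

```python
# Diagnostic sketch: query SLURM directly for the state of the job that
# AiiDA submitted, to confirm it is still PENDING while the daemon is
# already trying to retrieve files. The job id comes from the run below.
import subprocess

job_id = "553227"  # job id from the detailed job info of the failed CalcJob
result = subprocess.run(
    ["sacct", "-j", job_id, "--format=JobID,State,Submit,Start",
     "--parsable2", "--noheader"],
    capture_output=True, text=True, check=True,
)
for line in result.stdout.splitlines():
    job, state, submit, start = line.split("|")
    print(f"{job}: {state} (submitted {submit}, started {start})")
```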
In the node metadata, under the detailed job info, I find:
```
{
retval:0,
stderr:"",
stdout:"Account|AdminComment|AllocCPUS|AllocNodes|AllocTRES|AssocID|AveCPU|AveCPUFreq|AveDiskRead|AveDiskWrite|AvePages|AveRSS|AveVMSize|BlockID|Cluster|Comment|Constraints|ConsumedEnergy|ConsumedEnergyRaw|Container|CPUTime|CPUTimeRAW|DBIndex|DerivedExitCode|Elapsed|ElapsedRaw|Eligible|End|ExitCode|Extra|FailedNode|Flags|GID|Group|JobID|JobIDRaw|JobName|Layout|Licenses|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|MaxPages|MaxPagesNode|MaxPagesTask|MaxRSS|MaxRSSNode|MaxRSSTask|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|McsLabel|MinCPU|MinCPUNode|MinCPUTask|NCPUS|NNodes|NodeList|NTasks|Partition|Planned|PlannedCPU|PlannedCPURAW|Priority|QOS|QOSRAW|QOSREQ|Reason|ReqCPUFreq|ReqCPUFreqGov|ReqCPUFreqMax|ReqCPUFreqMin|ReqCPUS|ReqMem|ReqNodes|ReqTRES|Reservation|ReservationId|Restarts|SLUID|Start|State|StdErr|StdIn|StdOut|Submit|SubmitLine|Suspended|SystemComment|SystemCPU|Timelimit|TimelimitRaw|TotalCPU|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutAve|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutMin|TRESUsageOutMinNode|TRESUsageOutMinTask|TRESUsageOutTot|UID|User|UserCPU|WCKey|WCKeyID|WorkDir| materys||0|0||1652|||||||||orfeo|||0|0||00:00:00|0|17827049668961343488|0:0|00:00:00|0|2025-10-10T15:03:15|Unknown|0:0|||StartReceived|1159400193|rbertossa|553227|553227|aiida-259||||||||||||||||||||||0|1|None assigned||GENOA|00:00:16|00:06:24|384|64651|normal|1||None|Unknown|Unknown|Unknown|Unknown|24|46875M|1|billing=24,cpu=24,mem=46875M,node=1|||0|sFESK773FTZ600|Unknown|PENDING|/orfeo/cephfs/scratch/materys/rbertossa/27/37/7004-1b5b-421b-913b-28ecc9ef6731/_scheduler-stderr.txt|/dev/null|/orfeo/cephfs/scratch/materys/rbertossa/27/37/7004-1b5b-421b-913b-28ecc9ef6731/_scheduler-stdout.txt|2025-10-10T15:03:15|sbatch _aiidasubmit.sh|00:00:00||00:00:00|00:20:00|20|00:00:00|||||||||||||||||1159400193|rbertossa|00:00:00||0|/orfeo/cephfs/scratch/materys/rbertossa/27/37/7004-1b5b-421b-913b-28ecc9ef6731| "
}
```
Note that even in the detailed job info the job state is reported as PENDING.
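For completeness, here is one way to pull the State column back out of the stored detailed job info, assuming the CalcJobNode.get_detailed_job_info() accessor of recent aiida-core and that the sacct header and value rows sit on separate lines of the stored stdout (the pk below is hypothetical):

```python
# Minimal sketch: extract the State column from the pipe-delimited sacct
# output stored on the failed CalcJobNode. The pk is hypothetical.
from aiida import load_profile, orm

load_profile()

node = orm.load_node(1234)            # hypothetical pk of the failed CalcJob
info = node.get_detailed_job_info()   # dict with retval/stderr/stdout as above

header, values = info["stdout"].splitlines()[:2]
record = dict(zip(header.split("|"), values.split("|")))
print(record["State"])                # prints "PENDING" for the run above
```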
Do you think this is a SLURM issue (a wrongly reported job status), an issue with the SLURM plugin, or an issue with the new ssh_async transport and its openssh backend?
Has anybody else experienced similar issues?
Thank you.