Description
This is a summary of what I observed when half of my hosts went crazy over Christmas. It happened on 10 out of 22 nodes (!) in one large mesh and on 2 out of 4 in a totally separate mesh. All failed nodes stopped providing data to the archive at different hours over a period of 3 days, between 10 and 12 December. It seems that pscheduler stopped archiving. There was no obvious reason, nothing in the log, and no visible memory issues. Host metrics were scraped properly and the service status was OK; only the test results were not sent, with "archiver failed to archive" messages in the log, at least up to some point. After that there were no other messages in the log and all tasks were cancelled. That's strange, because the rest of the hosts somehow survived.
I managed to resolve it with pscheduler internal service restart in most cases; however, one of the hosts crashed completely after this command :-(
All hosts run perfsonar-testpoint 4.2.2 on Debian 12.
Failed node:

Node that somehow survived:

Here is some data I collected on one of the hosts:

psudouser@sask0:~$ pscheduler troubleshoot
Performing basic troubleshooting of sask0.
sask0:
Checking that host "sask0" resolves... 83.230.96.5
Measuring MTU... N/A (Local)
Looking for pScheduler... OK.
Fetching API level... 6
Checking clock... OK.
Exercising API... Archivers... Contexts... Tests... Tools... OK.
Fetching service status... OK.
Checking services... Ticker... Scheduler... Runner... Archiver... OK.
Checking limits... OK.
Last run scheduled... 45 minutes ago
Last run completed... in 1 day
Idle test.... 9 seconds... Pending, probably missed... Failed.
Test was scheduled but not run. Check that the [pscheduler-runner] service is running.
psudouser@sask0:~$ systemctl status pscheduler-runner
● pscheduler-runner.service - pScheduler server - runner
Loaded: loaded (/lib/systemd/system/pscheduler-runner.service; enabled; preset: enabled)
Active: active (running) since Thu 2025-09-11 07:05:03 CEST; 4 months 3 days ago
Main PID: 208349 (python3)
Tasks: 28 (limit: 9287)
Memory: 115.9M
CPU: 1w 3d 4h 32min 21.237s
CGroup: /system.slice/pscheduler-runner.service
├─ 208349 python3 /usr/lib/pscheduler/daemons/runner --dsn @/etc/pscheduler/database/database-dsn
├─ 209332 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmp9llhwemw/19bf6139-dbad-4>
├─ 209335 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmp1v26vftx/9e1b0cde-2cfd-4>
├─ 209344 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmpbg5up0z8/c6369895-baf8-4>
├─ 209373 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmpn6k72o5x/94daa834-4005-4>
├─ 209676 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmphd5_eutq/20ea35ad-6f98-4>
├─ 209821 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmpnud6kh1t/adfd6403-880a-4>
├─ 209914 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmph8r6vmvd/ece8679d-4766-4>
├─ 210019 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmpdmibzjui/f5cf050b-4bab-4>
├─ 210021 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmpgby5cbzk/1125fdcb-cfe5-4>
├─ 210045 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmpu3w9zu9u/f82787e5-7d51-4>
├─ 210097 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmpkxxvt5c9/348bb479-db93-4>
├─ 210136 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmppivf1rog/670bd97b-d71b-4>
├─ 210145 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmpcbgzdhxo/555cebea-89a7-4>
├─ 210209 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmp0xxj1i8k/b2ce9c60-66f9-4>
├─ 210224 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmpqn_syejd/9917076d-f01f-4>
├─ 210329 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmpak_4ht0m/8322580c-278e-4>
├─ 210482 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmpcv0pe8eh/b80248fc-3c3d-4>
├─ 513701 python3 /usr/lib/pscheduler/daemons/runner --dsn @/etc/pscheduler/database/database-dsn
├─2329031 python3 /usr/lib/pscheduler/daemons/runner --dsn @/etc/pscheduler/database/database-dsn
└─3976111 python3 /usr/lib/pscheduler/daemons/runner --dsn @/etc/pscheduler/database/database-dsn
psudouser@sask0:~$
$ psconfig stats pscheduler
Agent Last Run Start Time: 2026-01-14 13:40:09
Agent Last Run End Time: 2026-01-14 13:41:41
Agent Last Run Process ID (PID): 196327
Agent Last Run Log GUID: 1c817651-87ff-42ae-bf41-054930520b35
Total tasks managed by agent: 49
From remote definitions: 49
https://stats.perfsonar.pionier.net.pl/psconfig/psconfig-pionier.json: 49
psudouser@sask0:~$
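Since the affected hosts reported every service as OK while the idle test failed, a periodic check on the "Idle test" line of the troubleshoot output might catch this state earlier. The following is only a minimal sketch, assuming the output of pscheduler troubleshoot keeps the line format shown above; the function names idle_test_failed and run_troubleshoot are hypothetical and not part of pScheduler.

```python
import subprocess


def idle_test_failed(troubleshoot_output: str) -> bool:
    """Return True if the 'Idle test' line of the output ends up reporting 'Failed'.

    Assumption: the line looks like the one in the report above, e.g.
    'Idle test.... 9 seconds... Pending, probably missed... Failed.'
    If no 'Idle test' line is present, this returns False.
    """
    for line in troubleshoot_output.splitlines():
        if line.strip().startswith("Idle test"):
            return "Failed" in line
    return False


def run_troubleshoot(timeout_seconds: int = 300) -> str:
    """Run 'pscheduler troubleshoot' and return its combined stdout and stderr.

    The timeout value is an arbitrary choice for this sketch.
    """
    result = subprocess.run(
        ["pscheduler", "troubleshoot"],
        capture_output=True,
        text=True,
        timeout=timeout_seconds,
        check=False,  # the caller inspects the text rather than the exit code
    )
    return result.stdout + result.stderr
```

A cron job or monitoring probe could call idle_test_failed(run_troubleshoot()) and raise an alert when it returns True. Whether an automatic restart should follow is a separate decision, given that one host crashed after the restart command.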