pscheduler stopped providing test results in multiple hosts in meshes #1612

@szymontrocha

Description

This is a summary of what I observed when half of my hosts went crazy over Christmas. It happened on 10 out of 22 nodes (!) in one large mesh and on 2 out of 4 in a totally separate mesh. All failed nodes stopped providing data to the archive at different hours within a 3-day period between 10 and 12 December. It looks like pscheduler stopped archiving. There was no obvious reason: nothing in the log and no visible memory issues. Host metrics were scraped properly and the service status was OK; only the test results were not sent, with archiver "failed to archive" messages in the log, at least up to some point. After that there were no further messages in the log and all tasks were cancelled. That's strange, because the rest of the hosts somehow survived.
I managed to resolve it with pscheduler internal service restart in most cases; however, one of the hosts crashed completely after this command :-(
All hosts run perfsonar-testpoint 4.2.2 on Debian 12.

Failed node:
Image
Node that somehow survived:
Image

Here is some data I collected on one of the hosts:
[Images: five screenshots]
psudouser@sask0:~$ pscheduler troubleshoot
Performing basic troubleshooting of sask0.

sask0:

  Checking that host "sask0" resolves... 83.230.96.5
  Measuring MTU... N/A (Local)
  Looking for pScheduler... OK.
  Fetching API level... 6
  Checking clock... OK.
  Exercising API... Archivers... Contexts... Tests... Tools... OK.
  Fetching service status... OK.
  Checking services... Ticker... Scheduler... Runner... Archiver... OK.
  Checking limits... OK.
  Last run scheduled... 45 minutes ago
  Last run completed... in 1 day
  Idle test.... 9 seconds... Pending, probably missed... Failed.

Test was scheduled but not run. Check that the [pscheduler-runner] service is running.
psudouser@sask0:~$ systemctl status pscheduler-runner
● pscheduler-runner.service - pScheduler server - runner
     Loaded: loaded (/lib/systemd/system/pscheduler-runner.service; enabled; preset: enabled)
     Active: active (running) since Thu 2025-09-11 07:05:03 CEST; 4 months 3 days ago
   Main PID: 208349 (python3)
      Tasks: 28 (limit: 9287)
     Memory: 115.9M
        CPU: 1w 3d 4h 32min 21.237s
     CGroup: /system.slice/pscheduler-runner.service
             ├─ 208349 python3 /usr/lib/pscheduler/daemons/runner --dsn @/etc/pscheduler/database/database-dsn
             ├─ 209332 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmp9llhwemw/19bf6139-dbad-4>
             ├─ 209335 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmp1v26vftx/9e1b0cde-2cfd-4>
             ├─ 209344 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmpbg5up0z8/c6369895-baf8-4>
             ├─ 209373 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmpn6k72o5x/94daa834-4005-4>
             ├─ 209676 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmphd5_eutq/20ea35ad-6f98-4>
             ├─ 209821 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmpnud6kh1t/adfd6403-880a-4>
             ├─ 209914 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmph8r6vmvd/ece8679d-4766-4>
             ├─ 210019 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmpdmibzjui/f5cf050b-4bab-4>
             ├─ 210021 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmpgby5cbzk/1125fdcb-cfe5-4>
             ├─ 210045 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmpu3w9zu9u/f82787e5-7d51-4>
             ├─ 210097 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmpkxxvt5c9/348bb479-db93-4>
             ├─ 210136 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmppivf1rog/670bd97b-d71b-4>
             ├─ 210145 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmpcbgzdhxo/555cebea-89a7-4>
             ├─ 210209 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmp0xxj1i8k/b2ce9c60-66f9-4>
             ├─ 210224 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmpqn_syejd/9917076d-f01f-4>
             ├─ 210329 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmpak_4ht0m/8322580c-278e-4>
             ├─ 210482 /usr/bin/powstream -p -d /var/pscheduler-server/runner/tmp/tmpcv0pe8eh/b80248fc-3c3d-4>
             ├─ 513701 python3 /usr/lib/pscheduler/daemons/runner --dsn @/etc/pscheduler/database/database-dsn
             ├─2329031 python3 /usr/lib/pscheduler/daemons/runner --dsn @/etc/pscheduler/database/database-dsn
             └─3976111 python3 /usr/lib/pscheduler/daemons/runner --dsn @/etc/pscheduler/database/database-dsn
psudouser@sask0:~$ 
$ psconfig stats pscheduler
Agent Last Run Start Time: 2026-01-14 13:40:09
Agent Last Run End Time: 2026-01-14 13:41:41
Agent Last Run Process ID (PID): 196327
Agent Last Run Log GUID: 1c817651-87ff-42ae-bf41-054930520b35
Total tasks managed by agent: 49
From remote definitions: 49
    https://stats.perfsonar.pionier.net.pl/psconfig/psconfig-pionier.json: 49

psudouser@sask0:~$ 
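
For anyone who wants to check many mesh hosts for the same symptom, here is a minimal sketch that scans the text printed by pscheduler troubleshoot and lists the steps that ended in "Failed.". It only parses text in the layout shown above; the function name failed_steps and the sample string are my own illustration, not part of pScheduler, and collecting the output from each host is left out.

```python
import re

# Matches indented step lines such as
#   "  Idle test.... 9 seconds... Pending, probably missed... Failed."
# The step name is everything before the first run of three or more dots.
STEP_RE = re.compile(r"^\s+(?P<name>[^.]+?)\.{3,}\s*(?P<rest>.*)$")


def failed_steps(troubleshoot_output: str) -> list[str]:
    """Return the names of troubleshooting steps whose line ends in 'Failed.'.

    Assumes the output layout shown in this issue; other pScheduler
    versions may format the steps differently.
    """
    failed = []
    for line in troubleshoot_output.splitlines():
        match = STEP_RE.match(line)
        if match and match.group("rest").rstrip().endswith("Failed."):
            failed.append(match.group("name").strip())
    return failed


# Hypothetical sample, copied from the transcript above.
SAMPLE = """\
sask0:

  Checking clock... OK.
  Checking services... Ticker... Scheduler... Runner... Archiver... OK.
  Last run scheduled... 45 minutes ago
  Idle test.... 9 seconds... Pending, probably missed... Failed.

Test was scheduled but not run. Check that the [pscheduler-runner] service is running.
"""

if __name__ == "__main__":
    print(failed_steps(SAMPLE))
```

On the sample above this reports only the idle test, which matches what I saw: every service check passes while the scheduled test never runs.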
