Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sending job to gearmand failed since 5.1.2 (still OK in 5.1.0) #171

Open
pvdputte opened this issue Aug 9, 2023 · 13 comments
Open

sending job to gearmand failed since 5.1.2 (still OK in 5.1.0) #171

pvdputte opened this issue Aug 9, 2023 · 13 comments

Comments

@pvdputte
Copy link

pvdputte commented Aug 9, 2023

Hi,

Could there be a regression between 5.1.0 and 5.1.2 in submitting jobs to gearmand?

I'm using a clean Debian 11 install with the packages from OBS (as you recommend). This is from /var/log/mod-gearman/mod_gearman_neb.log with debug=4

[...]
[2023-08-09 01:15:37][26182][TRACE] move_results_to_core()
[2023-08-09 01:15:37][26182][TRACE] handle_svc_check(6, data)
[2023-08-09 01:15:37][26182][TRACE] handle_perfdata(6)
[2023-08-09 01:15:37][26182][TRACE] handle_svc_check(6, data)
[2023-08-09 01:15:37][26182][TRACE] got target queue from service custom variable: servicegroup_somegroup
[2023-08-09 01:15:37][26182][DEBUG] received job for queue servicegroup_somegroup: some.host.tld - MEMUSED, check_options: 0   latency so far: 0.286s
[2023-08-09 01:15:37][26182][TRACE] cmd_line: /usr/local/mon/libexec/check_nrpe -H 1.2.3.4 -t "15" -p "7070"  -c check-memused -a "-" "-" "-" "95" "97"
[2023-08-09 01:15:37][26182][TRACE] add_job_to_queue(servicegroup_somegroup, F5699DEADBEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEF, 1, 1, 1, 0, 60)
[2023-08-09 01:15:37][26182][TRACE] 280 --->type=service
result_queue=check_results
host_name=some.host.tld
service_description=MEMUSED
core_time=1691536537.000288
timeout=120
command_line=/usr/local/mon/libexec/check_nrpe -H 1.2.3.4 -t "15" -p "7070"  -c check-memused -a "-" "-" "-" "95" "97"


<---
[2023-08-09 01:15:37][26182][TRACE] 384 +++>
12Py+XD5K<REDACTED>iW
<+++


[2023-08-09 01:15:42][26182][TRACE] create_client()
[2023-08-09 01:15:42][26182][INFO ] add_job_to_queue() retrying... 0
[2023-08-09 01:15:42][26182][TRACE] add_job_to_queue(servicegroup_somegroup, F5699DEADBEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEF, 1, 0, 1, 0, 60)
[2023-08-09 01:15:42][26182][TRACE] 280 --->type=service
result_queue=check_results
host_name=some.host.tld
service_description=MEMUSED
core_time=1691536537.000288
timeout=120
command_line=/usr/local/mon/libexec/check_nrpe -H 1.2.3.4 -t "15" -p "7070"  -c check-memused -a "-" "-" "-" "95" "97"


<---
[2023-08-09 01:15:42][26182][TRACE] 384 +++>
12Py+XD5K<REDACTED>iW
<+++
[2023-08-09 01:15:47][26182][ERROR] sending job to gearmand failed: gearman_wait(GEARMAN_TIMEOUT) timeout reached, 1 servers were poll(), no servers were available, pipe:false -> libgearman/universal.cc:346: pid(26182)
[2023-08-09 01:15:47][26182][TRACE] create_client()
[2023-08-09 01:15:47][26182][TRACE] add_job_to_queue() finished with errors: 0
[2023-08-09 01:15:47][26182][ERROR] failed to send service check to gearmand
[2023-08-09 01:15:47][26182][TRACE] move_results_to_core()
[2023-08-09 01:15:47][26182][TRACE] handle_progam_status_data_events(11, data)
[2023-08-09 01:15:47][26182][TRACE] log file rotated: 0
[2023-08-09 01:15:47][26182][TRACE] handle_svc_check(6, data)
[2023-08-09 01:15:47][26182][TRACE] handle_perfdata(6)
[...]

By chance I had the 5.1.0 .deb still in the /var/cache/apt of a different testing VM. (note: this deb is from the labs.consol.de repository as I just recently switched, not sure if that matters).

If I dpkg -i the 5.1.0 one without changing my conffiles, it works again:

[...]
[2023-08-09 15:30:03][36755][TRACE] move_results_to_core()
[2023-08-09 15:30:03][36755][TRACE] handle_svc_check(6, data)
[2023-08-09 15:30:03][36755][TRACE] handle_perfdata(6)
[2023-08-09 15:30:03][36755][TRACE] handle_svc_check(6, data)
[2023-08-09 15:30:03][36755][TRACE] got target queue from service custom variable: servicegroup_somegroup
[2023-08-09 15:30:03][36755][DEBUG] received job for queue servicegroup_somegroup: some.host.tld - SSH, check_options: 0   latency so far: 0.513s
[2023-08-09 15:30:03][36755][TRACE] cmd_line: /usr/local/mon/libexec/check_ssh -t "10" -p "22" 1.2.3.4
[2023-08-09 15:30:03][36755][TRACE] add_job_to_queue(servicegroup_somegroup, 0D169733BEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEFDEADBEEFE68, 1, 1, 1, 1, 60)
[2023-08-09 15:30:03][36755][TRACE] 227 --->type=service
result_queue=check_results
host_name=some.host.tld
service_description=SSH
core_time=1691587803.000045
timeout=120
command_line=/usr/local/mon/libexec/check_ssh -t "10" -p "22" 1.2.3.4


<---
[2023-08-09 15:30:03][36755][TRACE] 320 +++>
12Py+XD5Kr8<REDACTED>kPddZBa552c6cpucUT0
<+++
[2023-08-09 15:30:03][36755][INFO ] gearmand submission statistics: jobs:      3   errors:       0   submit_rate:    0.0/s   avg_submit_duration: 0.000098s   max_submit_duration: 0.000145s
[2023-08-09 15:30:03][36755][TRACE] add_job_to_queue() finished successfully
[2023-08-09 15:30:03][36755][TRACE] handle_svc_check() finished successfully
[2023-08-09 15:30:03][36755][TRACE] handle_svc_check() finished successfully -> 206
[2023-08-09 15:30:03][36755][TRACE] got result H:com-dmcore001:47
[2023-08-09 15:30:03][36755][TRACE] 492 +++>
O8eJdV+43h55e<REDACTED>zXhzaSc=
<+++
[2023-08-09 15:30:03][36755][TRACE] 368 --->
type=active
host_name=some.host.tld
service_description=SSH
core_start_time=1691587803.000045
start_time=1691587804.909941
finish_time=1691587804.964980
return_code=0
exited_ok=1
source=Mod-Gearman Worker @ some-worker.tld
output=SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) | time=0.053162s;;;0.000000;10.000000





<---
[2023-08-09 15:30:03][36755][DEBUG] service job completed: some.host.tld SSH: exit 0, latency: 0.518, exec_time: 0.055
[2023-08-09 15:30:04][36755][TRACE] move_results_to_core()
[...]
@pvdputte
Copy link
Author

pvdputte commented Aug 9, 2023

So, for the record, this is a working combination.

# dpkg -l |grep -e naemon -e gearm
ii  gearman-job-server                   1.1.19.1+ds-2+b2               amd64        Job server for the Gearman distributed job queue
ii  gearman-tools                        1.1.19.1+ds-2+b2               amd64        Tools for the Gearman distributed job queue
ii  libgearman8:amd64                    1.1.19.1+ds-2+b2               amd64        Library providing Gearman client and worker functions
ii  libnaemon:amd64                      1.4.1-1                        amd64        Library for Naemon - common data files
ii  mod-gearman-module                   5.1.0                          amd64        Event broker module to distribute service checks.
ii  mod-gearman-tools                    5.1.2-1                        amd64        Tools for mod-gearman
ii  naemon-core                          1.4.1-1                        amd64        host/service/network monitoring and management system
ii  naemon-livestatus                    1.4.1-1                        amd64        contains the Naemon livestatus eventbroker module

Same on Debian 12 'bookworm' with gearman-job-server 1.1.20 by the way (which is where I noticed the issue first).

@pvdputte
Copy link
Author

I did some manual compiles. 5.1 was still fine for me, 5.1.1 was broken. The problem starts right after this commit:

87e2220

@sni
Copy link
Owner

sni commented Aug 13, 2023

Does it fail immediately?

@pvdputte
Copy link
Author

Yes. 0 jobs arrive in gearmand.

@pvdputte
Copy link
Author

pvdputte commented Jan 9, 2024

FYI I did a reinstall on Debian 12 today using

ii  naemon-core    1.4.2-1      amd64        host/service/network monitoring and management system
ii  mod-gearman-module 5.1.2-1      amd64        Event broker module to distribute service checks.

and I got the same result.

I was keeping an old copy of the 5.1 binary to work around it, I'll have to recompile for the new v6 event broker API.

Error: Module '/usr/lib/x86_64-linux-gnu/mod_gearman/mod_gearman_naemon.o' is using an incompatible version (v5) of the event broker API (current version: v6). Module will be unloaded.

@pvdputte
Copy link
Author

pvdputte commented Jan 9, 2024

Or should I look into something else that could be wrongly configured on my system, as I find it hard to believe no-one else has been using mod-gearman-module on Debian the last few months. 🤔

@sni
Copy link
Owner

sni commented Jan 9, 2024

or those packages from OBS? I assumed they automatically rebuild if there is an update in naemon-core.

@pvdputte
Copy link
Author

pvdputte commented Jan 9, 2024

I'm using

$ cat /etc/apt/sources.list.d/naemon-stable.list 
deb [signed-by=/etc/apt/trusted.gpg.d/naemon.asc] http://download.opensuse.org/repositories/home:/naemon/Debian_12/ ./

apt show confirms it.

@sni
Copy link
Owner

sni commented Jan 9, 2024

i will look into this. There was a new naemon release some days ago, i assume that's the issue here. NEB modules have to be rebuild against that version. I will trigger a rebuild for mod-gearman.

@pvdputte
Copy link
Author

pvdputte commented Jan 9, 2024

The build is fine for all I know, there's no NEB error starting naemon after installing. But using those packages still results in the problem described in the issue here: check jobs don't arrive in gearmand.

I only get that NEB v6 error when I try to replace /usr/lib/x86_64-linux-gnu/mod_gearman/mod_gearman_naemon.o with an older 5.1 version that still worked for me on Debian 12 with gearman-job-server 1.1.20+ds-1.

In other words: the issue is still the same with naemon 1.4.2 and the 5.1.2-1 version of mod-gearman-module on Debian 12, fetched from download.opensuse.org/repositories/home:/naemon/Debian_12

At least for me 🙂

@sni
Copy link
Owner

sni commented Jan 10, 2024

I find it hard to believe no-one else has been using mod-gearman-module on Debian

I use it on several debian 12 machines, but i am using it with OMD-Labs which comes with its own gearmand. Had no issues so far.

@pvdputte
Copy link
Author

I'm currently digging a bit deeper as I found that a plain vanilla install of all packages results in a working configuration.

Still a bit at a loss though as in the failing environment I now have identical naemon-core mod-gearman-module packages with identical

/etc/naemon/naemon.cfg
/etc/naemon/module-conf.d/mod-gearman.cfg
/etc/mod-gearman/module.conf

but no fix. I'll update this issue as soon as I find out more.

@pvdputte
Copy link
Author

Turns out it's because I've been running gearmand for ages with --listen=0.0.0.0 to let remote mod-gearman-worker instances connect.

So I had no reason to suspect anything wrong with the gearmand setup (that I had not touched for a long time) and the mod-gearman-module suddenly stopped working properly with version 5.1.2 / this commit: 87e2220

I.e. these errors appeared out of nowhere:

sending job to gearmand failed: gearman_wait(GEARMAN_TIMEOUT) timeout reached, 1 servers were poll(), no servers were available

But by default in the Debian gearman-job-server package, gearmand is started with --listen=localhost, and when I revert to that, version 5.1.2 starts working properly again.

What had me puzzled is that the check_results queue was created and had naemon in it. But the host/notification/service queues etc. were never created because that part of mod-gearman-module 5.1.2 apparently no longer works if gearmand listens on 0.0.0.0 instead of localhost .

5.1.0 still worked with gearmand on 0.0.0.0 though. 🤷

Anyway, the fix for me is to just remove the --listen=... thing, by default it listens on everything then.

Sorry to bother you, maybe this helps someone else later 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants