Naemon 1.4.1 livestatus not stable #428
Works now. I did not notice that the rpm upgrade had commented out this line in naemon.cfg: broker_module=/usr/lib64/naemon/naemon-livestatus/livestatus.so inet_addr=0.0.0.0:6557 debug=1 max_backlog=128 max_response_size=2000000000. I uncommented it and it worked again. |
Glad to see it's working now. Nevertheless, Naemon should not segfault :-) If an incompatible module tries to load, it should print a message and unload the module again. |
Had to revert back to 1.3.0 since livestatus was not stable; it sometimes did not respond. I verified this with the following command, which sometimes did not work: echo -e "GET hosts\nColumns: name\nLimit: 100\n" | netcat -w5 127.0.0.1 6557 Stable with 1.3.0 but not with 1.4.1. There is one error message when starting, not sure if relevant: naemon[2080]: Successfully launched command file worker with pid 2159 |
Alright, so it is not working with the latest release? Which repository did you use, Consol Labs or the SUSE one? And on which OS is that? |
This is on Rocky Linux 8.5, using repo: [labs_consol_stable] |
Any hints on troubleshooting that can be done to investigate why Naemon 1.4.1 does not respond to livestatus requests at times? |
Well, I tend to start with strace pretty soon. But enabling the debug log might also reveal something already. |
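As a rough sketch of that approach (the strace flags and the first-PID selection are editorial assumptions, not from the thread), one could attach strace to the running naemon daemon and watch its network syscalls while reproducing a hanging query:

# Hypothetical troubleshooting sketch: follow forked children and trace
# network-related syscalls of the running naemon process.
strace -f -tt -e trace=network -p "$(pidof naemon | awk '{print $1}')"
# The debug=1 option on the broker_module line sends further livestatus
# details to /var/log/naemon/livestatus.log (see the startup log later in
# this thread).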
I'm also experiencing issues where livestatus won't respond, but I'm upgrading from 1.0.x to 1.4.1. Currently testing with the production dataset in a staging environment. Not intending to hijack this issue; if you feel this is a different problem, I'll open a separate one. Environment: Debian 11 'bullseye'
I've only been testing for a few days, trying to use the native TCP support in livestatus instead of my previous xinetd solution (running Thruk and custom dashboards from other locations). If I leave everything running overnight, in the morning I'm greeted with 'No backend available' in Thruk. I'm not sure yet how long it takes for the problems to start. I tried to isolate the problem by stopping all external livestatus clients and doing checks with netcat on the naemon host itself. It seems that the connection is stuck in SYN_SENT very often. Here you see a session where I try to run
The matching livestatus debug log only has the successful attempts:
Matching netstat grep for netcat connections:
So no other livestatus connections during this test. System load is very low (0.11) as in staging I run checks with way higher intervals. |
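The probe session itself is not preserved above; as a hedged illustration of such a repeated netcat check (the query, limit, timeout, and interval are editorial assumptions), something like the following could be used:

# Hypothetical probe loop: query livestatus over TCP once a minute and
# report when nc exits non-zero (e.g. the connect attempt timed out).
while true; do
    date
    echo -e "GET hosts\nColumns: name\nLimit: 1\n" | nc -w5 127.0.0.1 6557 \
        || echo "no response"
    sleep 60
done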
I'm also using TCP, so perhaps it's related to that. I haven't had time to try/troubleshoot this lately, so I'm still on 1.3.0. So, your issue has nothing to do with load? Or how much traffic do you have to it besides your script? |
Zero load (active checks disabled) and almost zero livestatus calls (except for some). It took a few hours to get into this state after the restart (I couldn't disable active checks because Thruk wouldn't load 😉 so... I restarted naemon), but I've got a trace now.
Netstat calls:
Trace: strace.txt |
What's odd though is that I cannot reproduce it in my vagrant environment, which has been running for two days now. |
I think I've found something suspicious. On the staging host where naemon has issues after a while, I noticed there are actually two PIDs listening on tcp/6557:
When I compare with my vagrant environment, there is only one listener there. So I killed the extra process to see what would happen.
Apparently the parent process is not reaping it, so it turned into a zombie.
Interestingly, the size of the responses also changed. Naemon seems to end up in this state after a reload. So I can now trigger it on this host by doing a simple
On my vagrant naemon host, this does not happen. Both are running the same Debian version and are configured by the same Ansible playbooks though. |
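As a hedged aside (the ss filter syntax is an editorial assumption, not taken from the report), one way to spot such a duplicate listener and the unreaped child it leaves behind:

# Hypothetical check: list every process listening on tcp/6557; after a
# reload there should be exactly one.
ss -ltnp 'sport = :6557'
# A child the parent never reaped shows up as <defunct> here.
ps -ef | grep '[n]aemon'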
I just realized staging had all package updates while my vagrant box did not. After running a full upgrade, I have the same problem in vagrant.
I'll start from scratch and see if I can pinpoint it further. |
After the kernel update itself plus a VM restart the problem was there, but when retrying with the old kernel the problem persisted, so that was inconclusive. But I'm now able to reliably reproduce it in vagrant:
Next, copy these commands into a script and execute it.
This should result in something like
|
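The script itself is not preserved above; as a rough reconstruction of the reproduction described in this thread (reload naemon, then check the listeners and probe livestatus), it could look roughly like the sketch below. The reload command, port, and query are editorial assumptions, not the reporter's original script.

#!/bin/bash
# Hypothetical reproduction sketch: reload naemon, give it a moment, then
# see how many processes listen on tcp/6557 and whether livestatus still
# answers a trivial query.
systemctl reload naemon
sleep 5
ss -ltnp 'sport = :6557'
echo -e "GET hosts\nColumns: name\nLimit: 1\n" | nc -w5 127.0.0.1 6557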
Same problem with |
Hey @sni - I think we're running into this issue too; we haven't delved into it as much as pvdputte has, but running naemon-livestatus 1.4.1 results in a crash of our Naemon stack after a couple of hours. Logs show that Livestatus shuts down and takes everything else with it - I see this error each time (not sure if it's related):
We've reverted back to 1.4.0 for now - let me know if you need more details. |
@pvdputte How long does it take for you to get into the duplicate listener state when reloading? I'm running your script but have not entered that state yet. However, I'm running with a minimal host/service config, which is not the same as production. |
Just ran the script in vagrant again, success on the first try.
I'm not running the tcp livestatus in production yet. |
I think I've been hit by the same bug :) Before reload:
After reload:
I'm willing to put in the time to debug this, but I'm not sure where to begin. In the meantime I'm thinking of using socat to share the socket over a TCP connection, or using Docker shared volumes (as my whole setup is Docker-based). Thanks for the time taken to maintain this project :) Edit, as it might be important: if I kill the second process, the whole thing starts working again. |
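A minimal sketch of that socat workaround, assuming the module's default unix socket path (the path and port are editorial assumptions, not confirmed in the thread):

# Hypothetical workaround: expose the local livestatus unix socket over TCP
# with socat instead of using the module's built-in inet_addr listener.
socat TCP4-LISTEN:6557,reuseaddr,fork UNIX-CONNECT:/var/cache/naemon/live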
(moved my input to a separate issue because I feel it's different, never had naemon crash etc.) |
Hi,
I get issues with livestatus causing naemon to core dump. It seems I get one of two issues: either Livestatus loads with version 1.3.0 and then naemon core dumps shortly after starting, or Livestatus does not load at all.
I update with:
yum update naemon naemon-core libnaemon naemon-livestatus naemon-core-debugsource naemon-core-dbg naemon-thruk naemon-vimvault-debugsource naemon-livestatus-debugsource naemon-devel naemon-vimvault mod_gearman
It could be that when it crashed I had not updated mod-gearman. When I include mod-gearman, it seems the livestatus broker module is not loaded at all.
Is the livestatus configuration in naemon.cfg OK?
broker_module=/usr/lib64/naemon/naemon-livestatus/livestatus.so inet_addr=0.0.0.0:6557 debug=1 max_backlog=128 max_response_size=2000000000
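As a hedged aside (the rpm queries are an editorial assumption based on the yum command above), one way to cross-check which package the loaded livestatus.so belongs to, since the crash log below shows Naemon 1.4.1 loading Naemon Livestatus 1.3.0:

# Hypothetical version cross-check: the loaded livestatus.so should come
# from the same release series as naemon-core.
rpm -q naemon-core naemon-livestatus
rpm -qf /usr/lib64/naemon/naemon-livestatus/livestatus.so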
With naemon core dump:
Apr 29 12:55:00 server naemon[53434]: Naemon 1.4.1 starting... (PID=53434)
Apr 29 12:55:00 server naemon[53434]: Local time is Sat Apr 29 12:55:00 UTC 2023
Apr 29 12:55:00 server naemon[53434]: LOG VERSION: 2.0
Apr 29 12:55:00 server naemon[53434]: qh: Socket '/var/lib/naemon/naemon.qh' successfully initialized
Apr 29 12:55:00 server naemon[53434]: nerd: Channel hostchecks registered successfully
Apr 29 12:55:00 server naemon[53434]: nerd: Channel servicechecks registered successfully
Apr 29 12:55:00 server naemon[53434]: nerd: Fully initialized and ready to rock!
Apr 29 12:55:01 server naemon[53434]: mod_gearman: initialized version 3.3.3 (libgearman 0.33)
Apr 29 12:55:01 server naemon[53434]: Event broker module '/usr/lib64/mod_gearman/mod_gearman_naemon.o' initialized successfully.
Apr 29 12:55:01 server naemon[53434]: livestatus: Setting debug level to 1
Apr 29 12:55:01 server naemon[53434]: livestatus: Setting listen backlog to 128
Apr 29 12:55:01 server naemon[53434]: livestatus: Setting maximum response size to 2000000000 bytes (1907.3 MB)
Apr 29 12:55:01 server naemon[53434]: livestatus: Naemon Livestatus 1.3.0, TCP: '0.0.0.0:6557'
Apr 29 12:55:01 server naemon[53434]: livestatus: Setup socket to listen on all interfaces
Apr 29 12:55:01 server naemon[53434]: livestatus: Opened TCP socket 0.0.0.0:6557, backlog 128
Apr 29 12:55:01 server naemon[53434]: livestatus: Your event_broker_options are sufficient for livestatus.
Apr 29 12:55:01 server naemon[53434]: livestatus: Finished initialization. Further log messages go to /var/log/naemon/livestatus.log
Apr 29 12:55:01 server naemon[53434]: Event broker module '/usr/lib64/naemon/naemon-livestatus/livestatus.so' initialized successfully.
Apr 29 12:55:02 server naemon[53434]: livestatus: Cannot delete non-existing downtime/comment 103637
Apr 29 12:55:03 server naemon[53434]: Successfully launched command file worker with pid 53504
Apr 29 12:55:03 server naemon[53434]: TIMEPERIOD TRANSITION: 24x7;-1;1
Apr 29 12:55:03 server naemon[53434]: TIMEPERIOD TRANSITION: 24x7_sans_holidays;-1;1
Apr 29 12:55:03 server naemon[53434]: TIMEPERIOD TRANSITION: none;-1;0
Apr 29 12:55:03 server naemon[53434]: TIMEPERIOD TRANSITION: us-holidays;-1;0
Apr 29 12:55:03 server naemon[53434]: TIMEPERIOD TRANSITION: workhours;-1;0
Apr 29 12:55:03 server kernel: naemon[53519]: segfault at 8 ip 00007ff936ea4e70 sp 00007ff93662e768 error 4 in livestatus.so[7ff936e31000+94000]
Apr 29 12:55:03 server naemon[53504]: Command file worker: Failed to read from bufferqueue (Inappropriate ioctl for device)
Apr 29 12:55:03 server systemd-coredump[53525]: Resource limits disable core dumping for process 53434 (naemon).
Apr 29 12:55:03 server systemd-coredump[53525]: Process 53434 (naemon) of user 989 dumped core.
Apr 29 12:55:03 server systemd[1]: naemon.service: Main process exited, code=killed, status=11/SEGV
Apr 29 12:55:05 server systemd[1]: naemon.service: Failed with result 'signal'.
With livestatus not loaded and not working:
Apr 29 13:14:58 server naemon[43859]: Naemon 1.4.1 starting... (PID=43859)
Apr 29 13:14:58 server naemon[43859]: Local time is Sat Apr 29 13:14:58 UTC 2023
Apr 29 13:14:58 server naemon[43859]: LOG VERSION: 2.0
Apr 29 13:14:58 server naemon[43859]: qh: Socket '/var/lib/naemon/naemon.qh' successfully initialized
Apr 29 13:14:58 server naemon[43859]: nerd: Channel hostchecks registered successfully
Apr 29 13:14:58 server naemon[43859]: nerd: Channel servicechecks registered successfully
Apr 29 13:14:58 server naemon[43859]: nerd: Fully initialized and ready to rock!
Apr 29 13:14:58 server naemon[43859]: mod_gearman: initialized version 5.0.2 (libgearman 1.1.19.1)
Apr 29 13:14:58 server naemon[43859]: Event broker module '/usr/lib64/mod_gearman/mod_gearman_naemon.o' initialized successfully.