No failure and buffering of data happening when syslog server (deployed through kubernetes) becomes unreachable #56
Thanks for reporting!
I don't know how effective it is, but what about
Hi @daipom
Analysis: We believe this 15-minute timeout is related to a TCP communication issue. It could be a result of the default value of tcp_retries2 on Linux, which is 15. To solve this, we tried setting the TCP socket option TCP_USER_TIMEOUT at the time of socket creation with the syslog server so that the timeout happens sooner.
With this change, we observed that when syslog went down, fluentd detected the connection loss after 6 seconds, and buffering/retry worked as expected. Does this look like the right approach to solve this issue? Can we expect a fix in the plugin along these lines?
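A minimal sketch of this kind of change on a plain Ruby TCPSocket (not the plugin's actual code; the host, port, and the 6000 ms value are placeholders, and Socket::TCP_USER_TIMEOUT is Linux-only):

```ruby
require "socket"

# Sketch only: TCP_USER_TIMEOUT makes writes fail once data stays
# unacknowledged for the given time, instead of waiting for the ~15
# retransmissions driven by net.ipv4.tcp_retries2 (roughly 15 minutes).
socket = TCPSocket.new("syslog.example.com", 514) # placeholder endpoint

if defined?(Socket::TCP_USER_TIMEOUT) # only defined on supporting platforms
  # Value is in milliseconds; 6000 ms matches the ~6 second detection above.
  socket.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_USER_TIMEOUT, 6000)
end

socket.write("<14>test message\n")
```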
Thanks for finding this out! I also found an interesting behavior of TCPSocket: after stopping a TCP server, the client can send data once without an error.

Server: I used Fluentd as a TCP server with the following config:

```
<source>
  @type tcp
  tag test
  <parse>
    @type none
  </parse>
  <transport tcp>
    linger_timeout 5
  </transport>
</source>
<match test.**>
  @type stdout
</match>
```

Client: irb

```
require "socket"
socket = TCPSocket.new("IP", 5170)

# The server is alive.
socket.write("fuga\n")
=> 5

# After stopping the server.
socket.write("fuga\n")
=> 5 # Surprisingly, the first one succeeds!!
socket.write("fuga\n")
in `write': Broken pipe (Errno::EPIPE)
        from (irb):43:in `<main>'
        from /home/daipom/.rbenv/versions/3.2.0/lib/ruby/gems/3.2.0/gems/irb-1.6.2/exe/irb:11:in `<top (required)>'
        from /home/daipom/.rbenv/versions/3.2.0/bin/irb:25:in `load'
        from /home/daipom/.rbenv/versions/3.2.0/bin/irb:25:in `<main>'
```

If I remove the `linger_timeout` setting:

```
<source>
  @type tcp
  tag test
  <parse>
    @type none
  </parse>
</source>
```

```
require "socket"
socket = TCPSocket.new("IP", 5170)

# The server is alive.
socket.write("fuga\n")
=> 5

# After stopping the server.
socket.write("fuga\n")
in `write': Connection reset by peer (Errno::ECONNRESET)
        from (irb):31:in `<main>'
        from /home/daipom/.rbenv/versions/3.2.0/lib/ruby/gems/3.2.0/gems/irb-1.6.2/exe/irb:11:in `<top (required)>'
        from /home/daipom/.rbenv/versions/3.2.0/bin/irb:25:in `load'
        from /home/daipom/.rbenv/versions/3.2.0/bin/irb:25:in `<main>'
```

This difference is related to `linger_timeout`. I'm thinking this issue may have something to do with this specification of TCP. The problem that sending data to a closed TCP socket doesn't fail immediately seems to be a fairly well-known one: https://stackoverflow.com/questions/11436013/writing-to-a-closed-local-tcp-socket-not-failing

This is what I found so far. Thanks for the information about TCP_USER_TIMEOUT.
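A hedged sketch of how a client can notice the closed peer before writing, when the peer does send a FIN or RST (it does not cover the case where the host disappears silently, which is where TCP_USER_TIMEOUT helps; the address and port are placeholders):

```ruby
require "socket"

# Sketch: a write to a half-closed socket succeeds once (the kernel only learns
# about the close when the peer answers with RST), but a non-blocking read
# reports EOF as soon as the peer's FIN has arrived.
def peer_closed?(socket)
  # exception: false => nil on EOF, :wait_readable when no data is available,
  # or up to 1 byte of pending data (which this check would consume).
  socket.read_nonblock(1, exception: false).nil?
rescue Errno::ECONNRESET
  true
end

socket = TCPSocket.new("127.0.0.1", 5170) # placeholder address/port
puts "peer closed? #{peer_closed?(socket)}"
```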
Certainly, the TCP_USER_TIMEOUT approach could be a valid solution.
I now understand the mechanics of this problem. There are 2 points we need to consider.
@aggarwalShivani
Hi @daipom
So perhaps that is general k8s behaviour: if there is no endpoint for a service, kube-dns will be unable to resolve it, and it would respond with port unreachable.
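A hedged probe along these lines (the Service name, port, and timeout are placeholders) shows how the two failure modes look from a client: a reject/ICMP port unreachable surfaces as Errno::ECONNREFUSED, whereas silently dropped packets surface as a timeout:

```ruby
require "socket"
require "timeout"

# Sketch: classify how a new TCP connection to the syslog Service fails.
def probe(host, port, timeout_sec: 3)
  Timeout.timeout(timeout_sec) { TCPSocket.new(host, port).close }
  :reachable
rescue Errno::ECONNREFUSED
  :port_unreachable       # e.g. rejected because the Service has no endpoints
rescue SocketError
  :dns_resolution_failed  # e.g. the Service name no longer resolves
rescue Timeout::Error, Errno::ETIMEDOUT, Errno::EHOSTUNREACH
  :dropped_or_unreachable # nothing answers; packets are silently dropped
end

puts probe("syslog-service.default.svc.cluster.local", 514) # placeholders
```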
Sorry for my delay, and thanks so much for your reply.
Hi @daipom, have you been able to conclude on an alternative approach to fix this issue, or is it okay to go ahead with the TCP_USER_TIMEOUT approach?
Version details:
fluent-plugin-remote_syslog (1.0.0)
remote_syslog_sender (1.2.2)
syslog_protocol (0.9.2)
Background:
Config looks like this:
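(The exact configuration was not preserved in this issue text; the following is an illustrative sketch of a fluent-plugin-remote_syslog output with file buffering, with a placeholder host, port, and buffer path.)

```
<match **>
  @type remote_syslog
  # Placeholder Kubernetes Service name and port
  host syslog-service.default.svc.cluster.local
  port 514
  protocol tcp
  <buffer>
    @type file
    # Placeholder buffer path
    path /var/log/fluentd/buffer
    flush_interval 10s
    retry_forever true
  </buffer>
</match>
```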
Issue:
Fluentd successfully connects to the configured syslog endpoint and keeps pushing the records as per the flush interval.
But when the k8s service of the syslog server goes down (i.e. when the syslog pod gets deleted or goes down to 0/1), fluentd does not detect any failure in the connection to syslog.
It also keeps flushing all the chunks from the file buffer and does not retain anything in the buffer, in spite of the destination being unreachable.
Please note that the syslog service is unreachable (as seen by other clients such as logger and curl).
Fluentd logs: there are no errors in fluentd. New chunks keep getting created and keep getting cleared from the file buffer location. Trace level logging is enabled.
So why are the chunks getting flushed from the buffer when the destination is unreachable?
Observations:
1) When the syslog server is not running as a k8s pod but as a standalone service on Linux (i.e. managed through systemctl), and the service is stopped (using systemctl stop rsyslog), we immediately see an error in the fluentd logs when it tries to flush the next chunk from its buffer to the syslog endpoint.
As the flush fails due to connectivity, fluentd retains the chunk in the file buffer and keeps retrying the flush (as per the configuration).
Later, when the rsyslog service is restarted and up again, it connects and successfully pushes all buffered chunks to the destination without any loss of data. For example:
2023-03-21 10:23:24 +0000 [warn]: #0 retry succeeded. chunk_id="5f7664fef1ec6c112220332f2732de46"
2) For syslog running on Kubernetes, fluentd detects a connectivity issue with syslog if it exists when fluentd starts running, and retry/buffering works fine as expected.
But once fluentd has successfully established a connection with syslog, if the syslog destination then becomes unreachable, fluentd fails to detect the connection errors.
Concern:
Why is the loss of connectivity not identified in the case of the rsyslog server running as a Kubernetes service?
Please let us know if there are any additional configs in the syslog plugin that could help us achieve retry/buffering properly in this case too.