IO::EINPROGRESSWaitWritable with Redis 6 #972
Comments
@supercaracal We do not use threaded IO; it is disabled.
Looks like our code might be incorrect: redis-rb/lib/redis/connection/ruby.rb Lines 149 to 165 in 72bd3a3
The documentation for
Hum, actually:
So we should enter the
It seems that the following logic should be called if we use an SSL/TLS connection: redis-rb/lib/redis/connection/ruby.rb, lines 253 to 273 in 7587668
@beanieboi Could you tell us which connection options you specified?
The provided backtrace says line 192, so it's a regular TCPSocket. Based on the code itself, that backtrace should be impossible, as the exception reported should be caught by the rescue. So really I'm puzzled as to what is happening here.
Our Sidekiq configuration looks like this: We have two Redis instances and they are configured the same. We have one Redis just for Sidekiq, which we connect to like this:

We connect to our second Redis like this:
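For orientation, a rough sketch of a typical Sidekiq setup against a TLS-enabled Redis follows; the env var name, ssl_params, and network_timeout values are illustrative assumptions, not the reporter's actual settings.

require "sidekiq"
require "openssl"

# Illustrative only -- not the reporter's configuration. A common Sidekiq setup
# pointing at a TLS-enabled (rediss://) Redis; the relaxed verify_mode is often
# needed with Heroku Redis self-signed certificates.
Sidekiq.configure_server do |config|
  config.redis = {
    url: ENV.fetch("REDIS_URL"),
    ssl_params: { verify_mode: OpenSSL::SSL::VERIFY_NONE },
    network_timeout: 5
  }
end

Sidekiq.configure_client do |config|
  config.redis = {
    url: ENV.fetch("REDIS_URL"),
    ssl_params: { verify_mode: OpenSSL::SSL::VERIFY_NONE }
  }
end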
@byroot I'm sorry for my oversights. Thank you for your explanation; I understand now. @beanieboi Is the Redis'
I have to correct what I said in the beginning: it definitely happens when we connect to a normal
we have
I got this same error trace when Redis is not installed or stopped. Starting Redis fixed it.
@beanieboi would you mind confirming if this is still an issue and/or what ended up happening?
FWIW, we just saw this running Heroku Redis with SSL (redis-rb 4.6.0, Redis 6.2.3, Ruby 2.7.6), but the event is very rare.

backtrace:
I'm unclear what a
So it sounds like (based solely on this information) that the process is async and the status should be checked again later. It sounds like in Ruby this is related to a timeout where we are waiting for a connection, but then hit our timeout and bubble up the exception.

Right now it's unclear what exact circumstances cause this issue or what specific set of circumstances enables the failure. For example, when I debugged connection errors with Puma on Heroku, first I had to reproduce the behavior: https://blog.heroku.com/puma-4-hammering-out-h13s-a-debugging-story. In that case I used a fixture to move the failure mode from production to local: https://github.com/hunterloftis/heroku-node-errcodes/tree/master/h13.

In debugging I like to take a step back and think in terms of the scientific method. What are our theories as to why this happens, and how could we test those hypotheses? Some things I'm curious about at a high level:

What can be done to resolve the issue once it begins? After it occurs, is it a persistent issue or does it go away? One theory is that the machine making the request is put into a "bad state", possibly due to killing a thread, and that future requests will also fail in the same way until the machine is restarted or a process is killed. We could test this by catching the failure, logging it, and then performing some action: for example, retry the same request; if it succeeds, then the problem does not persist beyond the original request. We could also take another action such as killing a process so that its memory is then in a "clean" state.
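A minimal sketch of that experiment, assuming a redis-rb handle and a logger (neither name comes from this project): catch the failure, log it, retry once, and note whether the retry succeeds.

require "redis"
require "logger"

logger = Logger.new($stdout)
redis  = Redis.new(url: ENV.fetch("REDIS_URL", "redis://localhost:6379"))

attempts = 0
begin
  attempts += 1
  redis.ping
  logger.info("ping succeeded on attempt #{attempts}") if attempts > 1
rescue IO::EINPROGRESSWaitWritable, Redis::CannotConnectError => e
  logger.warn("connect failure on attempt #{attempts}: #{e.class}: #{e.message}")
  retry if attempts < 2   # if the retry succeeds, the bad state did not persist
  raise
end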
This implies that the event does not persist. If that's the case, then I would be curious about the state of the overall system when this happens. Can we correlate the time of this happening with something else? Perhaps a cron job coincides with a large amount of work from a customer endpoint (for example). Collecting more information around when this is happening could be good, such as the number of connections on the data store, memory, load, etc.; maybe something stands out as an anomaly.

Another approach is: find an input that changes the output (even if it doesn't fix it). If this seems related to timeout and disabling the timeout removes the problem, does this mask the issue or is it related to the issue? Disabling the timeout clearly stops the issue because it would wait forever instead of raising the error. The questions I have are: do those requests eventually finish, or would they wait forever until some other timeout occurs (such as a server being rotated every 24 hours)?

If changing the timeout value can change the nature of our failure mode, is there a correlation with Redis 6? Does Redis 6 do something different in terms of performance when connecting that might mandate a longer default timeout value, or is this a coincidence? Sometimes problems that are caused by increased load aren't noticed until other changes are made. For instance, if a database is close to some limit, developers might not notice until they take a maintenance action and inspect the database logs a bit closer. Then how could we test that? For example: is it feasible to run a Redis 5 DB for a bit to see if the error occurs within a reasonable timeframe?

Also consider that there may be multiple different failure modes that all appear the same but are different. As juanito pointed out, trying to connect to a non-existent Redis instance may give this error sometimes.

All of that to say: I think we need more theories of what failure mode is causing this error and associated steps to prove or disprove them.
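To go with the "correlate the event with system state" idea, a sketch (names like admin_redis and logger are assumptions) that snapshots server-side state via a separate, already-established connection whenever the failure is observed:

require "redis"
require "logger"
require "time"

logger      = Logger.new($stdout)
admin_redis = Redis.new(url: ENV.fetch("REDIS_URL", "redis://localhost:6379"))

# Wrap a Redis-using block; on the suspect failure, record server-side context
# (connection count, memory) so occurrences can be correlated with load.
def with_redis_diagnostics(admin_redis, logger)
  yield
rescue IO::EINPROGRESSWaitWritable, Redis::CannotConnectError => e
  info = admin_redis.info rescue {}
  logger.error(
    "redis failure at #{Time.now.utc.iso8601}: #{e.class}; " \
    "connected_clients=#{info['connected_clients']} used_memory=#{info['used_memory_human']}"
  )
  raise
end

# Example usage:
# with_redis_diagnostics(admin_redis, logger) { Sidekiq.redis(&:ping) }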
From the Ruby source for connect_nonblock:

/* :nodoc: */
static VALUE
sock_connect_nonblock(VALUE sock, VALUE addr, VALUE ex)
{
VALUE rai;
rb_io_t *fptr;
int n;
SockAddrStringValueWithAddrinfo(addr, rai);
addr = rb_str_new4(addr);
GetOpenFile(sock, fptr);
rb_io_set_nonblock(fptr);
n = connect(fptr->fd, (struct sockaddr*)RSTRING_PTR(addr), RSTRING_SOCKLEN(addr));
if (n < 0) {
int e = errno;
if (e == EINPROGRESS) {
if (ex == Qfalse) {
return sym_wait_writable;
}
rb_readwrite_syserr_fail(RB_IO_WAIT_WRITABLE, e, "connect(2) would block");
}
if (e == EISCONN) {
if (ex == Qfalse) {
return INT2FIX(0);
}
}
rsock_syserr_fail_raddrinfo_or_sockaddr(e, "connect(2)", addr, rai);
}
return INT2FIX(n);
}

Then later:

case EINPROGRESS:
    c = rb_eEINPROGRESSWaitWritable;
    break;

So this exception is raised when connect(2) fails with EINPROGRESS. The reference manual says: https://man7.org/linux/man-pages/man2/connect.2.html

Which is kinda expected, as we explicitly rescue Errno::EINPROGRESS:

begin
sock.connect_nonblock(sockaddr)
rescue Errno::EINPROGRESS
raise TimeoutError unless sock.wait_writable(timeout)
begin
sock.connect_nonblock(sockaddr)
rescue Errno::EISCONN
end
end

So it should work, but clearly it doesn't. What makes no sense to me is the reported backtraces:
That line 196 is the first call to connect_nonblock, and:

>> IO::EINPROGRESSWaitWritable < Errno::EINPROGRESS
=> true

So over a year later, this bug report is still at the WTF stage to me, unless somehow
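For what it's worth, the subclass relationship can be checked end to end with a tiny standalone script. This is a sketch only; it assumes 10.255.255.1 is unroutable so the non-blocking connect reliably reports EINPROGRESS, and it shows that rescuing Errno::EINPROGRESS does catch the IO::EINPROGRESSWaitWritable raised by connect_nonblock.

require "socket"

# Assumed-unroutable address: the non-blocking connect cannot complete
# immediately, so connect(2) returns EINPROGRESS and Ruby raises
# IO::EINPROGRESSWaitWritable (a subclass of Errno::EINPROGRESS).
addr = Socket.sockaddr_in(6379, "10.255.255.1")
sock = Socket.new(Socket::AF_INET, Socket::SOCK_STREAM, 0)

begin
  sock.connect_nonblock(addr)
rescue Errno::EINPROGRESS => e
  puts e.class                        # IO::EINPROGRESSWaitWritable
  puts sock.wait_writable(1).inspect  # nil here, since the address never answers
end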
@schneems I read through your response about what was going on at the time, and upon reflection, I thought I might know what was going on and went to check.
So, timing is correlated with this nutso behavior. Basically, among other things, our app tracks trucks picking a thing up at one place, dropping it off at another, driving back to the start, and repeating the cycle. We call the round-trip time the cycle time and, related to it, have a metric of tons per cycle. In this (representative error) case:
It seems the cause was some combination of the dyno (Performance-M) running out of memory, Redis running out of memory, or the database barfing on our GB-sized INSERT, resulting in the worker crashing. And maybe we haven't seen a recurrence since I added a maximum of 500 cycles? I believe my app-crash scenario may explain why the error wasn't rescued in one place, though perhaps not why it was rescued in the other.
We are also seeing this issue in our staging environment. If it can be of any help, please find the attached backtrace and config of our system.

backtrace:
We are facing a similar issue too.
Hi,
we recently upgraded to Redis 6 and started using the built-in TLS. Since the upgrade we frequently see:
IO::EINPROGRESSWaitWritable: Operation now in progress - connect(2) would block
versions:
ruby: 2.7.2
redis-rb: 4.2.5
redis: 6.0.8
We use Redis in combination with Sidekiq.
What we found out so far:
ensure_connected (redis-rb/lib/redis/client.rb, lines 383 to 392 in 72bd3a3)
This is happening in bursts every 5 minutes, which is also the timeout setting on our Redis. Setting timeout on the Redis to 0 (disabled) "fixes" this problem.

What I think is happening: the connection in the pool was timed out by Redis, and when we try to use it, it fails to connect again.
This only started to happen after upgrading from Redis 4 to 6. Our old Redis 4 had the same 300s timeout. I believe it might be related to TLS.
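Not a fix, but for anyone checking the same theory: the server-side idle timeout can be inspected (and, as a workaround, disabled) from Ruby. A rough sketch, with the connection URL as an assumption:

require "redis"

redis = Redis.new(url: ENV.fetch("REDIS_URL", "redis://localhost:6379"))

# CONFIG GET returns the idle timeout Redis applies to client connections;
# a value of "300" matches the 5-minute bursts described above.
redis.config(:get, "timeout")        # e.g. {"timeout"=>"300"}

# Setting it to 0 disables the idle timeout -- the workaround that "fixes"
# the symptom, at the cost of never reaping idle connections server-side.
# redis.config(:set, "timeout", "0")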
stack trace:
So far I have failed to reproduce this issue in a new, isolated project.
Let me know if there is more information that I can provide.