Scalability limits? #104
In fact, unless you're using `disable-monitoring-lock`, Gatus should never run into concurrency issues: there's a lock that prevents two services from being evaluated at the same time. This is the default behavior so that the measured response time is as accurate as possible.

I occasionally use Gatus for stress testing, and on average, 50 services with an interval of 1s each usually end up consuming roughly 0.1 vCPU (100m). Given that your interval is 60s rather than 1s, I doubt CPU is to blame, even if you're using […].

It's a bit hard to help you without being able to reproduce it, but it has never happened to me, and if I had to guess, I'd say it's either that the service is unreachable, or the service being tested really is timing out.
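For readers skimming the thread, here is a minimal Go sketch of what the monitoring lock described above amounts to. The identifiers (`monitoringMutex`, `monitor`, `evaluate`) are illustrative placeholders and not Gatus's actual code.

```go
package main

import (
	"sync"
	"time"
)

// monitoringMutex mirrors the idea described above: unless the
// disable-monitoring-lock option is set, only one service is evaluated
// at a time, so response-time measurements aren't skewed by concurrent
// checks competing for CPU and network resources. (Illustrative names only.)
var monitoringMutex sync.Mutex

func evaluate(service string) {
	// Placeholder for the actual health check (HTTP/DNS request + condition evaluation).
	_ = service
}

func monitor(service string, interval time.Duration, disableLock bool) {
	for {
		if !disableLock {
			monitoringMutex.Lock()
		}
		evaluate(service)
		if !disableLock {
			monitoringMutex.Unlock()
		}
		time.Sleep(interval)
	}
}

func main() {
	go monitor("https://example.org", 60*time.Second, false)
	select {} // block forever; illustration only
}
```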
I'm not using `disable-monitoring-lock`, so that can be ruled out. The networking is convoluted: it tests the service as a user would, and it gets proxied via Cloudflare. You're right about it hitting the 10s timeout, but when manually checking the service, it's fine, and restarting Gatus solves it (whilst restarting the service does not). I've been running on 2.4.0 all day now without issue, so it does seem tricky to reproduce. I opened this after seeing it twice in an hour, about 20 minutes or so after adding a new service, and I've not seen it since. Is there any additional verbosity I can turn on to collect useful logging output? I may run a test and scale this up to see if I can more reliably reproduce this behaviour.
Thing is, if it's timing out, there's no extra verbosity that could be provided, since it's actually timing out. It's not currently possible to modify the timeout, but it wouldn't be complicated to implement. If the issue happens again, let me know and I'll implement it so that you can test it on `latest`.
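As a rough illustration of how a configurable timeout could be wired up, here is a minimal Go sketch. The `GATUS_HTTP_CLIENT_TIMEOUT` environment variable is an assumption for illustration only and is not necessarily how it was later implemented.

```go
package main

import (
	"net/http"
	"os"
	"time"
)

// newHTTPClient builds an HTTP client whose timeout can be overridden via an
// environment variable. The variable name is hypothetical, used only to show
// the general approach; the 10s default matches the timeout mentioned above.
func newHTTPClient() *http.Client {
	timeout := 10 * time.Second
	if raw := os.Getenv("GATUS_HTTP_CLIENT_TIMEOUT"); raw != "" {
		if parsed, err := time.ParseDuration(raw); err == nil {
			timeout = parsed
		}
	}
	return &http.Client{Timeout: timeout}
}

func main() {
	client := newHTTPClient()
	_, _ = client.Get("https://example.org")
}
```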
Not seen this since! Closing until/unless I have more evidence to support my initial observations.
I've added another 4 services (DNS checks) and I have seen a number of new instances of this popping up on other HTTP services. If there's an option to increase the timeout, that would be great to test. What would happen if the response time exceeded the poll interval? (i.e., how do I know whether it's still failing?)
@dchidell Done. Once the build is done, it'll be available on […]. To use it, all you need to do is set the […].
I can see this issue has shifted a little bit, so: is it possible to scale Gatus? Let's say I needed to test 1,000 or even 10,000 websites and have accurate response times, is that possible? I can see that Gatus uses a file to store data permanently when the server stops, and everything is in RAM while the server is running. If we had two types of Gatus instances (master and worker), where the master simply manages workers and stores data while the workers do the work (pinging), we could use WebSockets or Redis to sync data between all of them. With these changes, we should be able to start one master and 100 workers, and they should be able to ping an "unlimited" amount of websites. Does anyone think this update would be useful?
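As a rough illustration of the master/worker idea proposed above, here is a minimal Go sketch in which a worker runs one check and reports the result to a master over plain HTTP. The endpoint path, payload shape, and URLs are assumptions made for this sketch; none of this exists in Gatus today.

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// Result is a hypothetical payload a worker could report back to a master.
type Result struct {
	Service      string        `json:"service"`
	Success      bool          `json:"success"`
	ResponseTime time.Duration `json:"responseTime"`
}

// checkAndReport performs one health check and POSTs the result to the
// master's (assumed) results endpoint.
func checkAndReport(masterURL, serviceURL string) error {
	start := time.Now()
	resp, err := http.Get(serviceURL)
	success := err == nil && resp != nil && resp.StatusCode < 400
	if resp != nil {
		resp.Body.Close()
	}
	result := Result{Service: serviceURL, Success: success, ResponseTime: time.Since(start)}
	payload, _ := json.Marshal(result)
	_, err = http.Post(masterURL+"/api/results", "application/json", bytes.NewReader(payload))
	return err
}

func main() {
	for {
		if err := checkAndReport("http://master:8080", "https://example.org"); err != nil {
			log.Println("report failed:", err)
		}
		time.Sleep(60 * time.Second)
	}
}
```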
@Meldiron I've got a few large instances of Gatus running with ~100 services, and thanks to the monitoring lock, by default there will never be two services monitored at the exact same time, therefore the response time should remain accurate. Furthermore, the memory usage should remain fairly low even if you were monitoring 1000 services, granted the UI might be a bit hard to navigate with that many services 🤔

A distributed approach is being discussed in #64 (and #124 may further contribute to making that easier); I just haven't had the time to work on said issue yet.

@dchidell Do you have any news? Has the issue happened again?
Nope, I've seen absolutely no problems. I ran the new build with the HTTP header set to 120 seconds and saw no issues, so I reverted back and still have not seen anything of consequence.
@dchidell How long ago did you revert back?
Was running that build for around a week, so ~3 weeks ago I reverted.
Alright, sweet. I'll get rid of […].
Glad to know that the issue no longer happens!
I'm monitoring around 50-55 services with Gatus. Most are HTTP, and 29 of them are using the `pat` keyword (with wildcards, so about as expensive a query as it can get), all using the default poll interval of 60s.

I am starting to see some responses of `context deadline exceeded (Client.Timeout exceeded while awaiting headers)` in the body check. I have manually checked the services in question and they're healthy. Restarting Gatus, or just waiting a few minutes, seems to resolve this. This does not occur continuously, but I have seen it twice in the space of a few hours.

I can only assume that this is due to a concurrency issue, as it's more than possible that the combination of service times takes longer than 60s to respond. I do not know enough of the Gatus architecture to know if this is a problem or not.
I am running v2.3.0 with #100 changed locally (as I've not updated since I tested it). I will repeat the test with 2.4.0 and report the results here.
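For context, a Gatus configuration resembling the setup described in this report might look roughly like the sketch below. The service name, URL, and pattern are placeholders rather than the reporter's actual configuration.

```yaml
# Minimal sketch of a service using the wildcard pat() body condition, which
# is the relatively expensive check mentioned in this issue.
services:
  - name: example-service
    url: "https://example.org/health"
    interval: 60s              # the default poll interval used in this report
    conditions:
      - "[STATUS] == 200"
      - "[BODY] == pat(*UP*)"  # wildcard body match via the pat keyword
```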