Scalability Issue: outages / timeouts / slow responses in the recrawler service may lead to message queue buildups #43
Description
Describe the bug
The recrawler service has been switched off since early January because it was not returning query results; that problem will be opened and tracked as a separate issue for that service.
If no recrawler pods are available, requests to that service fail with connection errors -- after a considerable timeout -- as visible here in the backend-worker deployment logs:
[2021-01-27 18:28:19,290: WARNING/ForkPoolWorker-2] Recrawling failed due to "ConnectionError" exception
[2021-01-27 18:28:19,291: WARNING/ForkPoolWorker-3] Recrawling failed due to "ConnectionError" exception
[2021-01-27 18:30:30,362: WARNING/ForkPoolWorker-1] Recrawling failed due to "ConnectionError" exception
[2021-01-27 18:30:30,366: WARNING/ForkPoolWorker-3] Recrawling failed due to "ConnectionError" exception
This causes the throughput of the backend-worker instances to drop dramatically, since most of the task worker time is spent waiting for connection attempts to time out.
It may be useful to consider both a short-term and a longer-term fix here. Since we are not currently receiving results from the recrawler service, a short-term patch would be to re-deploy that service so that it responds with empty results (effectively a no-op). Longer-term, we likely want to isolate the queue workers that handle event logs, and perhaps add circuit breakers and/or adjust the connection timeouts they use.
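As a rough illustration of the circuit breaker idea, here is a minimal sketch in Python. It is not taken from the RecipeRadar codebase: the `CircuitBreaker` and `CircuitOpenError` names, the failure threshold and the cool-down period are all assumptions made for the example. After a few consecutive failures the breaker stops calling the failing service for a while, so workers fail fast instead of waiting on a connection timeout each time.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    """Hypothetical helper: skip calls to a service that keeps failing."""

    def __init__(self, max_failures=3, reset_after=60.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds while open
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: do not touch the network while the circuit is open.
                raise CircuitOpenError("circuit is open; skipping call")
            # Cool-down has elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                # Open (or re-open) the circuit and restart the cool-down.
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In a worker, the wrapped callable would be the HTTP request to the recrawler service, ideally made with an explicit short timeout. If the worker uses the `requests` library (an assumption here), that could look like `breaker.call(requests.get, url, timeout=(2, 5))`, where the tuple sets the connect and read timeouts in seconds. A task that catches `CircuitOpenError` can then skip the recrawl step immediately and move on to the next message.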
Expected behavior
Throughput for the majority of the RecipeRadar message queues should not be adversely affected by outages in a minor service.