Recovering from "Too many messages have been received without being deleted" #12

Open
danieroux opened this issue Nov 16, 2018 · 1 comment


@danieroux
Contributor

When hammering a filled-up queue with many peers to pull messages off as fast as possible, Amazon throws an exception which kills the job.

I would like to know how to recover from this. My stopgap is to use fewer peers:

 [{:type clojure.lang.ExceptionInfo
   :message "Too many messages have been received without being deleted.\nPlease delete your received messages or let them timeout before receiving more. (Service: AmazonSQS; Status Code: 403; Error Code: OverLimit; Request ID: f53c9b7c-76d4-57a4-9e55-cf1e77e7e885)"
   :data {:original-exception :com.amazonaws.services.sqs.model.OverLimitException}
   :at [com.amazonaws.http.AmazonHttpClient$RequestExecutor handleErrorResponse "AmazonHttpClient.java" 1639]}]

As far as I can figure out:

  • sqs/delete-message-async-batch gets called in checkpointed!. With many peers, this only happens after 100k messages have already been read off the queue.
  • Which means that poll! fails with >100k messages in flight.

Can I get some guidance on how to handle it?

  • Is it as simple as only doing a sqs/receive-messages when (< (count @processing) 100000)?
  • Or would a separate counter be more useful/efficient?
  • What else should I be aware of before I touch the code?
@lbradstreet
Member

Yes, I think we need to add a backoff mechanism to only allow X messages at a time. I thought we had already added one but I reviewed the code and it looks like we didn't.

Gating sqs/receive-messages on (< (count @processing) 100000) would probably be the best way of doing this, rather than using a separate counter, since @processing gives the best picture of how many messages we have outstanding at any given time. A rough sketch is below.
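
Something along these lines, as a minimal sketch — here `receive-fn` stands in for however sqs/receive-messages is invoked inside poll!, and the 100k ceiling and the shape of the processing atom are assumptions taken from this thread rather than the plugin's actual internals:

```clojure
(def max-in-flight 100000) ;; assumed SQS OverLimit ceiling, per the error above

(defn receive-when-under-limit
  "Only poll SQS for more messages while the count of received-but-not-yet-deleted
   messages is under max-in-flight; otherwise return no messages so checkpointed!
   can delete/ack the backlog before we poll again.
   `receive-fn` stands in for the actual sqs/receive-messages call."
  [receive-fn processing]
  (if (< (count @processing) max-in-flight)
    (receive-fn)
    []))
```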

You can also add a lifecycle handler to handle the exception so that the job won't be killed, though this obviously won't help with the root cause.
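
For example, something like this handle-exception lifecycle — a minimal sketch, where the task name is hypothetical and the ex-data check mirrors the :original-exception key shown in the stack trace above:

```clojure
(ns my.job.lifecycles)

(defn handle-sqs-over-limit
  [event lifecycle lifecycle-name throwable]
  ;; Restart the task when SQS reports OverLimit instead of killing the
  ;; whole job; any other exception keeps the default behaviour.
  (if (= :com.amazonaws.services.sqs.model.OverLimitException
         (:original-exception (ex-data throwable)))
    :restart
    :kill))

(def sqs-exception-calls
  {:lifecycle/handle-exception handle-sqs-over-limit})

(def lifecycles
  [{:lifecycle/task :read-sqs-messages ;; hypothetical reader task name
    :lifecycle/calls :my.job.lifecycles/sqs-exception-calls}])
```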

I'd be happy to accept a PR to implement this. Make sure to implement the schemas defined in https://github.com/onyx-platform/onyx-amazon-sqs/blob/0.14.x/src/onyx/tasks/sqs.clj

Thanks!
