Use task names to reduce the risk of duplicate update-feed tasks. #319
If a duplicate task occurs, an error will be logged: "taskqueue: task has already been added". This should be harmless: it means that duplicate work has been avoided. Also reduce the log level for new feed-update tasks.
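For context, here is a minimal sketch of the mechanism this change relies on, using the google.golang.org/appengine packages. The function, the queue name "update-feed", and the import paths are illustrative assumptions, not the project's actual code:

```go
import (
	"context"

	"google.golang.org/appengine/log"
	"google.golang.org/appengine/taskqueue"
)

// queueFeedUpdate sketches the dedupe: two tasks with the same Name cannot
// both be enqueued, so the second Add fails with ErrTaskAlreadyAdded, which
// is the harmless error mentioned above.
func queueFeedUpdate(c context.Context, name string, t *taskqueue.Task) error {
	t.Name = name
	_, err := taskqueue.Add(c, t, "update-feed")
	if err == taskqueue.ErrTaskAlreadyAdded {
		log.Infof(c, "duplicate update-feed task skipped: %v", name)
		return nil // duplicate work avoided
	}
	return err
}
```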
The docs say:
This suggests that it can take up to 7 days to get a task name back before it's reusable. If that is true, I don't think we can use this method.
// The URL is hex-escaped but hopefully still human-readable.
newTask.Name = fmt.Sprintf("%v_%v",
	feed.NextUpdate.UTC().Format("2006-01-02T15-04-05Z07-00"),
	taskNameEscape(id))
Why not just base64.URLEncoding.EncodeToString([]byte(id))?
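For illustration, that suggestion would look roughly like the sketch below. One caveat worth flagging: App Engine task names must match [0-9a-zA-Z_-]{1,500}, so the '=' padding that base64.URLEncoding emits would make some names invalid; the sketch therefore swaps in the unpadded base64.RawURLEncoding. The same character restriction is presumably why the time format above uses '-' in place of ':'.

```go
import (
	"encoding/base64"
	"fmt"
	"time"
)

// taskName64 sketches the reviewer's suggestion: base64 the feed id rather
// than hex-escaping it. RawURLEncoding is used because task names may not
// contain the '=' padding that URLEncoding can produce.
func taskName64(next time.Time, id string) string {
	return fmt.Sprintf("%v_%v",
		next.UTC().Format("2006-01-02T15-04-05Z07-00"),
		base64.RawURLEncoding.EncodeToString([]byte(id)))
}
```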
I wanted it to be human-readable, for debugging. Good tooling could fix that. Or if that isn't important, I also considered using a one-way hash.
How do we know there won't be any collisions? i.e., two different feeds that result in the same name?
Task names look like 2014-12-16T16-58-34Z07-00_http_3A_2F_2Fxkcd_2Ecom_2Fatom_2Exml: the time to 1-second precision, plus the full URL in escaped form. I think 1-second precision is plenty to avoid collisions, but we could add fractional seconds if needed.
With that format I don't see how we could generate collisions, so I must be missing something. Can you explain?
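taskNameEscape itself isn't shown in this excerpt. A sketch consistent with the example output above: escape every byte outside [0-9A-Za-z] as "_" plus two hex digits. Since "_" is itself escaped, the mapping is injective, which backs the no-collisions argument:

```go
import (
	"fmt"
	"strings"
)

// taskNameEscape, sketched to match the example above: keep ASCII letters
// and digits, and escape every other byte as "_XX" (uppercase hex), so
// "http://xkcd.com/atom.xml" becomes "http_3A_2F_2Fxkcd_2Ecom_2Fatom_2Exml".
// Because '_' is itself escaped (to "_5F"), distinct feed ids can never
// produce the same escaped name.
func taskNameEscape(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		c := s[i]
		if ('0' <= c && c <= '9') || ('A' <= c && c <= 'Z') || ('a' <= c && c <= 'z') {
			b.WriteByte(c)
			continue
		}
		fmt.Fprintf(&b, "_%02X", c)
	}
	return b.String()
}
```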
Re the 7 days: hence the second commit, which adds NextUpdate to the task name.
If we add the feed's update time to the task name, how does this change avoid the duplicate update problems? If the update orders described in #231 happen, would this change prevent double updating?
Maybe I'm misunderstanding #231 or the update process? Here's the problem I thought I was solving:
With task names using

In practice I seem to see something like this in my self-hosted deployment. Here's an example of a duplicate task caught by this code, as logged. I suspect this one was a combination of slow datastore queries and slow URL fetches. First, here's the successful queue:
Two minutes later here's a duplicate task with the same value of
But the update finally happened and there were no further errors:
At 18:47:47.421 the same feed was updated again, this time without any duplicate tasks:
I thought of another situation worth mentioning: what if we queue an update-feed task but

When I wrote the task-name code I wasn't thinking about that. But looking at it now, I think it's ok, because the existing
Ok, I've had a chance to look over this fully now. I don't particularly care about human-readable task names, especially since the feed URL is already embedded in the task data. I'd prefer to use base64.URLEncoding for the names because I think it's safer and better tested. The main part of the solution here is appending the feed's next update time to the task name. But that means changing the keys-only query (a small datastore operation, which is free) into a read operation (not free). I think we can keep the keys-only query and still achieve this. Obviously, that means we can no longer depend on the feed's next update time; I believe we should instead use UpdateMin in settings.go, like so:
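The snippet that "like so" referred to isn't preserved in this excerpt. One reading of the suggestion, sketched here as an assumption (treating UpdateMin as a time.Duration defined in settings.go, and reusing the PR's taskNameEscape), is to bucket the current time by the minimum update interval so that no datastore read is needed:

```go
import (
	"fmt"
	"time"
)

// Sketch only: derive the time component of the task name from the current
// time truncated to UpdateMin buckets, instead of reading NextUpdate from
// the datastore. The keys-only query can stay, and two attempts to queue
// the same feed within one UpdateMin window produce the same task name,
// so the second one is rejected as a duplicate.
func taskNameFromBucket(id string, updateMin time.Duration) string {
	bucket := time.Now().UTC().Truncate(updateMin)
	return fmt.Sprintf("%v_%v",
		bucket.Format("2006-01-02T15-04-05Z07-00"),
		taskNameEscape(id))
}
```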
Looking at the log messages I posted above, would using

For me, the read operation is worth the cost: it's staying well under 20% of the free quota. It may be a question of which is more expensive: the datastore reads or the extra tasks. But it might also be worth considering something based on TCP/IP timeouts and the settings from
The task name is based on the feed id (URL). Setting this should reduce the risk of duplicate update-feed tasks (#231), because the queue shouldn't accept tasks with duplicate names. According to the docs this isn't an ironclad guarantee, but it should work under most circumstances.
I've been running this change self-hosted for a day or so. I'm not sure whether it's caught any duplicates, but it doesn't seem to cause any new problems.