This repository has been archived by the owner on Mar 16, 2020. It is now read-only.

Use task names to reduce the risk of duplicate update-feed tasks. #319

Open
mblakele wants to merge 2 commits into master

Conversation

mblakele

The task name is based on the feed id (the feed URL). Setting an explicit name should reduce the risk of duplicate update-feed tasks (#231), because the queue shouldn't accept two tasks with the same name. According to the docs this isn't an ironclad guarantee, but it should hold under most circumstances.

I've been running this change self-hosted for a day or so. I'm not sure whether or not it's caught any duplicates, but it doesn't seem to cause any new problems.

If a duplicate task occurs, an error will be logged:
"taskqueue: task has already been added"
This should be harmless: it means that duplicate work has been avoided.
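For concreteness, here's roughly what the queuing side does. This is only a sketch against the standalone google.golang.org/appengine packages, not the exact code in this PR; queueFeedUpdate and the "update-feed" queue name are illustrative, and this taskNameEscape is just a stand-in for the real helper:

package sketch

import (
	"bytes"
	"context"
	"fmt"
	"net/url"

	"google.golang.org/appengine/log"
	"google.golang.org/appengine/taskqueue"
)

// taskNameEscape is a stand-in for the real helper: task names only allow
// [a-zA-Z0-9_-], so every other byte of the feed URL is escaped as _XX hex.
func taskNameEscape(s string) string {
	var b bytes.Buffer
	for _, c := range []byte(s) {
		switch {
		case c >= 'a' && c <= 'z', c >= 'A' && c <= 'Z', c >= '0' && c <= '9', c == '-':
			b.WriteByte(c)
		default:
			fmt.Fprintf(&b, "_%02X", c)
		}
	}
	return b.String()
}

func queueFeedUpdate(c context.Context, id string) error {
	t := taskqueue.NewPOSTTask("/tasks/update-feed", url.Values{"feed": {id}})
	t.Name = taskNameEscape(id) // the queue rejects a second task with the same name
	if _, err := taskqueue.Add(c, t, "update-feed"); err == taskqueue.ErrTaskAlreadyAdded {
		// Harmless: a task for this feed is already queued, so the duplicate is skipped.
		log.Infof(c, "duplicate update-feed task skipped: %v", t.Name)
		return nil
	} else if err != nil {
		return err
	}
	return nil
}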

Also reduce log level for new feed-update tasks.
@maddyblue
Owner

The docs say:

If a push task is created successfully, it will eventually be deleted (at most seven days after the task successfully executes). Once deleted, its name can be reused.

This suggests it can take up to seven days before a task name becomes reusable. If that's true, I don't think we can use this method.

// The URL is hex-escaped but hopefully still human-readable.
newTask.Name = fmt.Sprintf("%v_%v",
feed.NextUpdate.UTC().Format("2006-01-02T15-04-05Z07-00"),
taskNameEscape(id))
Owner

Why not just base64.URLEncoding.EncodeToString([]byte(id))?

Author

I wanted the name to be human-readable for debugging; good tooling could make base64 names just as easy to inspect. If readability isn't important, I also considered using a one-way hash.
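For example, the hashed variant might look something like this (a sketch only; SHA-1 is just an illustrative choice, and it needs crypto/sha1 and fmt imported):

// Hypothetical one-way-hash variant of the name: not human-readable, but
// fixed-length and collision-resistant. SHA-1 here is only an example.
sum := sha1.Sum([]byte(id))
newTask.Name = fmt.Sprintf("%v_%x",
	feed.NextUpdate.UTC().Format("2006-01-02T15-04-05Z07-00"),
	sum[:])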

Owner

How do we know there won't be any collisions? i.e., two different feeds that result in the same name?

Author

Task names look like 2014-12-16T16-58-34Z07-00_http_3A_2F_2Fxkcd_2Ecom_2Fatom_2Exml: the time to 1-sec precision plus the full URL in escaped form. I think 1-sec is plenty precise enough to avoid collisions, but we could add fractional seconds if needed.

With that format I don't see how we could generate collisions — so I must be missing something. Can you explain?

@mblakele
Author

Re the seven days: that's why the second commit adds NextUpdate to the task name.

@maddyblue
Owner

If we add the feed update time to the task name, how does this change avoid the duplicate-update problem? If the update orderings described in #231 happen, would it prevent double updating?

@mblakele
Author

Maybe I'm misunderstanding #231 or the update process? Here's the problem I thought I was solving:

  1. Some feed F has NextUpdate == t0
  2. UpdateFeeds selects feeds where NextUpdate <= now at t1, including feed F with NextUpdate t0.
  3. UpdateFeeds queues a task.
  4. The task queue is long or slow, or UpdateFeed takes a long time to fetch the feed URL. Meanwhile NextUpdate is still t0.
  5. UpdateFeeds runs at t2, selecting F with NextUpdate t0 and queuing it for update — again.
  6. The queue contains duplicate tasks for feed F.

With task names using F.NextUpdate and F.StringID(), the value for NextUpdate is the same in steps 2 & 5. In step 5 the task name collision prevents the duplicate task from queuing, avoiding duplicate work.

In practice I do seem to see something like this in my self-hosted deployment. Here's an example of a duplicate task caught by this code, as logged; I suspect this one was a combination of slow datastore queries and slow URL fetches. First, the successful queuing:

18:13:42.745
queuing feed 2014-12-15T01-55-10Z07-00_http_3A_2F_2Fwiki_2Exen_2Eprgmr_2Ecom_2Fxenophilia_2Fatom_2Exml

Two minutes later, here's a duplicate task with the same NextUpdate value for the same feed URL:

18:15:43.281 error for task &{/tasks/update-feed [102 101 101 100 61 104 116 116 112 37 51 65 37 50 70 37 50 70 119 105 107 105 46 120 101 110 46 112 114 103 109 114 46 99 111 109 37 50 70 120 101 110 111 112 104 105 108 105 97 37 50 70 97 116 111 109 46 120 109 108] map[Content-Type:[application/x-www-form-urlencoded]] POST
     2014-12-15T01-55-10Z07-00_http_3A_2F_2Fwiki_2Exen_2Eprgmr_2Ecom_2Fxenophilia_2Fatom_2Exml 0 0001-01-01 00:00:00 +0000 UTC 0 <nil>}: taskqueue: task has already been added

But the update finally happened and there were no further errors:

18:15:43.177 update feed http://wiki.xen.prgmr.com/xenophilia/atom.xml
18:15:44.759 hasUpdate: true, isFeedUpdated: false, storyDate: 2014-12-15 01:21:38.37879 +0000 UTC, stories: 15
18:15:44.765 0 update stories
18:15:44.765 putting 1 entities
18:15:45.133 next update scheduled for 30m16s from now

At 18:47:47.421 the same feed was updated again, this time without any duplicate tasks:

18:47:47.305 queuing feed 2014-12-15T02-46-01Z07-00_http_3A_2F_2Fwiki_2Exen_2Eprgmr_2Ecom_2Fxenophilia_2Fatom_2Exml

@mblakele
Author

I thought of another situation worth mentioning: what if we queue an update-feed task but fetchFeed fails? Maybe the server is temporarily down or unreachable, or it's returning error pages instead of valid RSS or Atom. Will we be able to try again with a new task, or is this feed stuck forever?

When I wrote the task-name code I wasn't thinking about that, but looking at it now, I think it's fine because the existing feedError handler in UpdateFeed takes care of it. If anything goes wrong with the fetch or parse, NextUpdate becomes time.Now().Add(time.Hour * time.Duration(v)), where v is calculated from the number of errors previously observed. So when UpdateFeeds tries to queue a new task for this problem feed, it will see an updated NextUpdate value and generate a fresh task name.
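Roughly, the backoff works like this (a sketch; the growth rule and the one-week cap below are illustrative assumptions, not the exact values in feedError):

package sketch

import "time"

// Sketch of the backoff described above: after each failed fetch or parse,
// NextUpdate moves further into the future, so the next task name differs.
func nextUpdateAfterError(errorCount int) time.Time {
	v := errorCount + 1 // back off more as consecutive errors accumulate
	const maxHours = 24 * 7
	if v > maxHours {
		v = maxHours // assumed cap: still retry at least once a week
	}
	return time.Now().Add(time.Hour * time.Duration(v))
}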

@maddyblue
Owner

Ok, I've had a chance to look over this fully now. I don't particularly care about human-readable task names, especially since the feed URL is already embedded in the task data. I'd prefer to use base64.URLEncoding for the names because I think it's safer and better tested.

The main part of the solution here is appending the feed's next update time to the task name. That turns the keys-only query (a small datastore operation, which is free) into a read operation (not free). I think we can keep the keys-only query and still achieve this. Obviously that means we can no longer depend on the feed's next update time; I believe we should instead use UpdateMin from settings.go, like so: time.Now().Truncate(UpdateMin). Since we never want to update more than once per UpdateMin, this will work most of the time. It will still fail when that boundary is crossed, but it's much better than what we have now, with no additional cost. Still, because of the boundary problem I'm not convinced this is a good idea. Will ponder.
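Concretely, something like this (a sketch; it assumes UpdateMin is a time.Duration defined in settings.go, and it leaves the id-encoding question aside by reusing taskNameEscape from the diff):

// Sketch of the UpdateMin idea: derive the time component from the wall
// clock instead of feed.NextUpdate, so the keys-only query can stay.
// UpdateMin is assumed to be a time.Duration in settings.go.
newTask.Name = fmt.Sprintf("%v_%v",
	time.Now().Truncate(UpdateMin).UTC().Format("2006-01-02T15-04-05Z07-00"),
	taskNameEscape(id))

Two UpdateFeeds runs inside the same UpdateMin window would then produce the same name, while runs on either side of a window boundary still produce different names, which is the boundary problem mentioned above.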

@mblakele
Author

Looking at the log messages I posted above, would using UpdateMin have helped? Also consider a scenario where many feeds are responding slowly and so the queue is backed up.

For me the read operation is worth the cost: it's staying well under 20% of the free quota. It may be a question of which is more expensive: the datastore reads or the extra tasks. But it might also be worth considering something based on TCP/IP timeouts and the settings from cron.yaml and queue.yaml. Another option might be to keep task metadata in memcache, if that's cheaper than using the datastore.
