
Only 50 posts are backed up using --incremental #223

Open

jorong1 opened this issue Nov 4, 2020 · 7 comments

Labels
cannot reproduce (Bug cannot be reproduced)

Comments

@jorong1

jorong1 commented Nov 4, 2020

Version used: f8ae83d

command:
python2.7 tumblr-utils/tumblr_backup.py --incremental --save-video-tumblr --no-ssl-verify --save-audio --json BLOGNAME

Output: 50 posts backed up

I am backing up into a folder that already contains post, JSON, and media files as well as the archive and index files.
Without --incremental it grabs posts with no problem, but it will presumably want to grab all of the posts, which I don't want. I do want the archive and index.html files to be regenerated, but cancelling a full run doesn't do that.

My last successful --incremental backup was at the end of September, using 08cbe44 with my own API key.

I use --incremental to avoid overwriting existing older posts and media, and to avoid going too far back. If I don't use it, will the script just stop once it finds existing files?
I could use --count this time to catch up, but I'm sure future incremental runs will still only grab 50 posts.

I have my own API key and I'm using the latest git version of this project. I also tried cebtenzzre@e5537c0 and got the same 50-post output.
I don't know what to do here. Thanks!

@cebtenzzre
Collaborator

At the time you ran the incremental backup, were there really more than 50 new posts? --incremental stops reading posts from the API as soon as it sees a post that was already backed up.

  • If you run the command again, it should back up no posts; if it still claims to back up 50, that is a bug.
  • --incremental may stop earlier than desired if the previous backup was interrupted or used more restrictive post-filtering options; that is not a bug, it's intended behavior.
  • If --incremental stops too early and leaves a gap between the last post it backs up and the first post from the previous backup, that is a bug, but that seems unlikely unless posts are being skipped somehow or received out of order.
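
Roughly, the stop condition works like this (a minimal sketch of the idea, not the actual tumblr_backup.py code; all names here are illustrative):

def save_post(post):
    # Placeholder: the real script writes the post's HTML/JSON/media to disk.
    print("saving post", post["id"])

def incremental_backup(posts_newest_first, last_backed_up_id):
    """Back up posts until one we already have appears.

    posts_newest_first: dicts with an integer 'id', ordered newest-first,
    as the Tumblr API is expected to return them.
    """
    count = 0
    for post in posts_newest_first:
        if post["id"] <= last_backed_up_id:
            break  # familiar post: everything after it is assumed older and already saved
        save_post(post)
        count += 1
    return count

# If the API ever returned posts out of order, the break would fire early
# and newer posts behind it would be silently skipped.
posts = [{"id": i} for i in (105, 104, 103, 102, 101)]
print(incremental_backup(posts, 102), "posts backed up")  # -> 3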

Without --incremental, you'll have to wait for the script to read the entire blog, but it will download all missing posts regardless of what is already backed up. It will overwrite all post HTML unless you use --no-post-clobber on my fork. bbolli's version will never redownload media; on my fork, --no-post-clobber skips downloading media for existing posts and --timestamping prevents redownloading media if the file on disk matches the server.

If you at any point just want to regenerate the archive and index.html, use my fork and pass --count 0.
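
For example, a one-off catch-up run followed by a later regeneration of just the archive and index might look like this (--no-post-clobber, --timestamping, and --count 0 are options specific to my fork):

$ python2 tumblr_backup.py --no-post-clobber --timestamping BLOGNAME
$ python2 tumblr_backup.py --count 0 BLOGNAME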

@jorong1
Author

jorong1 commented Nov 25, 2020

I got sidetracked, sorry, but this is still not working as it did before.
I can confirm there are more than 50 posts available. My last update was on 2020-09-03, and since then there have been a few thousand new posts on this particular blog.

--incremental works as you describe: it only does 50 posts, and if I try again, it does nothing.
The backup on 2020-09-03 was done with bbolli's tools, but then I tried an incremental run, I think, and it may have caused a bad stop.
But I also cleaned up the files, so that should have been before whatever update I did.

I just saw this update:
cebtenzzre@8417f94

And no-clobber works: I am backing up beyond 50 posts now. I made a backup of the folder beforehand, but hopefully it doesn't overwrite the media, because it's over 50 GB.

So now I guess I'd need to run without --incremental but with --no-post-clobber and --timestamping, which would basically read the entire blog again (around 70,000 posts) so that the index and so on are updated, and then I guess I can do incremental runs beyond 50 posts?

So where do I go from here for the future? I don't want to download the whole blog again every time, and there are always more than 100 posts.

Can I grab only what is missing and regenerate the index and so on somehow? It's weird that --incremental stopped working the way it did; it used to grab hundreds of posts. I have my own API key.

@cebtenzzre
Collaborator

cebtenzzre commented Nov 25, 2020

I forgot to mention that --no-post-clobber was only available on my fork's experimental branch. I just moved it to the main branch, so a fresh clone of my fork will have that option now. In this case you need it to make a quick non-incremental backup: --incremental stops when it sees familiar posts, while --no-post-clobber just skips them (and should go back far enough to discover posts that weren't backed up). In theory, --no-post-clobber has no effect if you are already using --incremental. It also won't change the behavior of future incremental backups, but it will find posts that were missed by previous incremental backups.
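
In loop terms, the difference between the two options is roughly this (an illustrative sketch, not the real code):

def backup(posts_newest_first, already_have, incremental):
    # already_have: set of post IDs that already exist on disk.
    for post in posts_newest_first:
        if post["id"] in already_have:
            if incremental:
                break     # --incremental: stop at the first familiar post
            continue      # --no-post-clobber: skip it and keep scanning
        print("backing up", post["id"])

# With a familiar post (ID 8) in the middle, --incremental would miss 7 and 6;
# --no-post-clobber keeps scanning and finds them.
backup([{"id": i} for i in (9, 8, 7, 6)], already_have={8}, incremental=False)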

Assuming none of your previous incremental backups were interrupted, it sounds like --incremental could actually be malfunctioning. But it relies on such a simple and seemingly guaranteed property of blogs (older posts have lower IDs, the API returns the newest posts first) that I would need solid evidence to believe it. It would help to have the name of the blog you're backing up, but if you'd prefer, I could write a script to make a list of post IDs and dates from a backup, and you could run it after a backup without --incremental to see if the ordering is broken in a way that confuses --incremental.
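
Something along those lines could be as simple as the following sketch. It assumes the layout produced by --json (one <post-id>.json file per post in the backup's json/ directory) and that each file contains the Tumblr API's integer 'id' and GMT 'date' fields:

#!/usr/bin/env python3
"""List post IDs and dates from a backup, highest ID first, and flag
any spot where the date order disagrees with the ID order."""
import json
import sys
from pathlib import Path

backup_dir = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
posts = []
for path in (backup_dir / "json").glob("*.json"):
    with open(path) as f:
        data = json.load(f)
    posts.append((data["id"], data.get("date", "?")))

posts.sort(reverse=True)  # highest (newest) ID first
prev_date = None
for post_id, date in posts:
    note = ""
    # GMT date strings like "2020-09-03 12:34:56 GMT" sort chronologically.
    if prev_date is not None and date > prev_date:
        note = "  <-- newer date than the higher-ID post above"
    print("%20d  %s%s" % (post_id, date, note))
    prev_date = date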

@slowcar

slowcar commented Jan 8, 2021

We have reproduced the issue twice; the -i parameter only takes the 50 most recent posts into account. We are running the script daily now, but a fix would be welcome just in case.

@cebtenzzre added the cannot reproduce (Bug cannot be reproduced) label Jan 8, 2021
@cebtenzzre
Collaborator

@jorong1 @slowcar If I run these commands, I get the expected result:

$ python2 tumblr_backup.py -s 175 -n 20 just-art
just-art: 20 posts backed up
$ python2 tumblr_backup.py -i just-art
just-art: 175 posts backed up

The first command makes a small backup that is 175 posts out of date, and the second backs up the missing posts using --incremental; more than 50 posts are backed up. This is probably a blog-specific issue, so it would be helpful if someone could provide an example of a blog where, after running these two commands in a clean directory, the second command reports "50 posts backed up".

@jorong1
Author

jorong1 commented Jan 24, 2021

Hi. I finally did another blog backup with cebtenzzre@ce10f29, and it seems --incremental is working properly now.

(py3venv-cebtenzzre) $ python tumblr-utils-cebtenzzre/tumblr_backup.py --incremental --no-ssl-verify --save-audio --save-video-tumblr --json --no-post-clobber --timestamping myblog
myblog: 2851 posts backed up

My process was to do a full non-incremental backup, as @cebtenzzre recommended, followed by an incremental one. This "reset" whatever I was stuck on.
Now I've done a new incremental run without issue.

I don't think the issue should be closed, because @slowcar is still having a problem with it, so it must be something else; but I'm good now, so I don't mind if it's closed.
I can't provide an example blog because the issue happened with my personal blog, which I don't feel comfortable sharing.

A possible solution here is to use cebtenzzre's fork if you're having issues. I don't know if that's appropriate to recommend.

@jorong1
Author

jorong1 commented Feb 5, 2021

I was hoping I wouldn't have to post this, but I am now getting the same error again.
I'll try a full redo, which sounds like the proper solution here, but I got the same 50-post error, and I know there's a gap:
Newest post from last run: 01/24/2021 11:32:24 AM
Oldest post from current run: 02/05/2021 08:55:41 AM

There's a good couple of weeks of content missing there.
I know this can't be reproduced; I'm just noting that it's still happening. Sorry to raise hopes.
