Any way to improve archiving success rate? #1618

MelanieTanaka · 2025-06-16T13:35:06Z

I've noticed that a lot of articles I try to save fail to archive. I'll save a bunch of articles while browsing Google News on my phone in the morning, and the app says "hoarded!" each time. When I check them later in the day, almost all of them show "null" as the text in the mobile app and nothing else.

In the web browser version it says the following:

Are there any settings or configuration I can use to avoid this?

To be clear, it isn't that archiving doesn't work at all; a large share of the links I save end up like this, roughly 50-60% if I had to guess.

Thanks,

peter-avila · 2025-06-16T22:37:52Z

It would be great if there were an easy way to filter for failed crawls, including those stopped at captchas, and recrawl them through the Wayback Machine, 12ft.io, etc.

I'm going to play around with the karakeep CLI to see if there's a way to do it through a script. I hope there is; if so, I may open an issue with a feature request or code a PR if time allows.

ETA: Looks like I wasn't the only person with this suggestion: #594, #999, #1306
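
For anyone who wants to experiment with the "filter for failed crawls" idea outside the app, here is a minimal sketch. It assumes you already have your bookmarks as a list of records with `url` and `text` keys; those field names and the sample data are hypothetical and are not Karakeep's actual export or API schema. It treats missing, blank, or literal "null" text as a failed crawl, which matches what the mobile app shows in the original post.

```python
from typing import Iterable, Mapping, Optional


def looks_failed(text: Optional[str]) -> bool:
    """True when the stored text is missing, blank, or the literal string "null"."""
    return text is None or text.strip() in ("", "null")


def failed_crawl_urls(bookmarks: Iterable[Mapping[str, Optional[str]]]) -> list[str]:
    """Collect the URLs of bookmarks whose text looks like a failed crawl.

    Each bookmark is assumed to be a mapping with hypothetical keys
    "url" and "text"; adapt the key names to whatever your data uses.
    """
    return [
        b["url"]
        for b in bookmarks
        if b.get("url") and looks_failed(b.get("text"))
    ]


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        {"url": "https://example.com/ok", "text": "Full article text"},
        {"url": "https://example.com/blocked", "text": "null"},
        {"url": "https://example.com/empty", "text": None},
    ]
    for url in failed_crawl_urls(sample):
        print(url)
```

The resulting list could then be fed into whatever recrawl step you prefer.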

2 replies

cecoates Jun 20, 2025

I think being able to filter out failed crawls would be great!

Is it possible to do something with Smart Lists?

https://docs.karakeep.app/Guides/search-query-language

Like could there be a way to have a list that's something like -is:crawled?

(I genuinely don't know how that stuff works, but that would be pretty neat.)

cecoates Jun 20, 2025

Once full text search is allowed: #907

Maybe it'd be possible to have a Smart List for an exact string such as:

Failed to fetch link content ...

Like this?

I wasn't sure if that text is part of the record or if it's more of an error message.

Edit: After messing around with the regular search more, I'm not sure it's part of the entry. Searching for "Failed to fetch link content" doesn't lead to any of the failed crawls.

MelanieTanaka · 2025-06-16T23:30:48Z

I was thinking the other day about using archive.is manually as an intermediary like in #1306 but somehow that failed for me as well.

Here's one I attempted earlier today before making this discussion post:

After more testing it looks like I need to refresh the link an arbitrary number of times inside of karakeep before it goes through.

I'll do more manual tests and see what the success rate looks like over the next few days. If there are other such services that you think are better suited, I could test those too.

0 replies

MelanieTanaka · 2025-06-18T13:53:39Z

I've been testing archive.is for a bit now by manually archiving pages there and then adding the archive URL to karakeep.
Overall it works fairly well, but as mentioned in my previous comment it sometimes fails. I now have a sample of 28 archive.is links, and 9 of them failed the first time. When I retry the same links some arbitrary time later (anywhere from minutes to later in the day), they end up going through, for an eventual 100% success rate in getting the content.
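
To put the numbers above in rate form, here is a small worked calculation. It uses only the two figures from this comment (28 links, 9 first-try failures); the function name is just for illustration.

```python
def first_try_rates(total: int, failed: int) -> tuple[float, float]:
    """Return (success_rate, failure_rate) for the first attempt, as fractions."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= failed <= total:
        raise ValueError("failed must be between 0 and total")
    failure = failed / total
    return 1.0 - failure, failure


# Figures reported in the comment: 28 archive.is links, 9 failed on the first try.
success, failure = first_try_rates(total=28, failed=9)
print(f"first-try success: {success:.1%}, first-try failure: {failure:.1%}")
# prints: first-try success: 67.9%, first-try failure: 32.1%
```

So roughly one in three links needed at least one retry, even though all of them eventually went through.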

I think the ideal scenario would be a system that lets the user pick their archive service of choice: the karakeep server requests archival of the link, waits for a period of time, then attempts to retrieve the archived link at different intervals.
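
The "wait, then retry at different intervals" part could look roughly like the sketch below. This is not how Karakeep works today; `fetch_with_retries`, the `fetch` callback, and the default delays are all assumptions made for illustration. The fetch function is passed in so the retry logic stays independent of any particular archive service.

```python
import time
from typing import Callable, Iterable, Optional


def fetch_with_retries(
    fetch: Callable[[str], Optional[str]],
    url: str,
    delays_seconds: Iterable[float] = (0, 60, 300, 1800),
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(url) once per entry in delays_seconds.

    Before each attempt, wait for that entry's delay (a delay of 0 means
    "try immediately"). Return the first non-empty result, or None if
    every attempt comes back empty or raises.
    """
    for delay in delays_seconds:
        if delay > 0:
            sleep(delay)
        try:
            content = fetch(url)
        except Exception:
            content = None  # treat an error like an empty result and keep going
        if content:
            return content
    return None


if __name__ == "__main__":
    attempts = []

    def flaky_fetch(url: str) -> Optional[str]:
        # Stand-in for a real request: fails twice, then succeeds.
        attempts.append(url)
        return "article text" if len(attempts) >= 3 else None

    # Short delays and a no-op sleep so the demo finishes instantly.
    result = fetch_with_retries(
        flaky_fetch, "https://example.com/a", delays_seconds=(0, 1, 2), sleep=lambda s: None
    )
    print(result, "after", len(attempts), "attempts")
```

Spacing the attempts out this way would also match what I saw manually, where a link that failed at first went through minutes or hours later.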

There's probably a better way but I'm not a developer so it's beyond me at that point.

0 replies

Eragos · 2025-06-20T11:30:11Z

Hey!

It would be best if you could give some more information: which URL you are trying, some log excerpts, the Docker Compose file used for the stack, and so on. Just put some more meat on the bone ;-)

This will increase the chances of reproducing and catching the problem.

Best Michael

1 reply

MelanieTanaka Jun 20, 2025
Author

Hi,
Sorry if I made it sound like I need help with a particular site or URL. This is more of a discussion about how the success rate of archiving can be improved in general.

There are always going to be sites that try to stop the kind of scraping that karakeep, Pocket, Linkwarden, etc. do. Some publicly available web archiving services seem able to get around it, so I'm currently testing them in manual workflows to see how well they work. If it works well manually, we can maybe write up a feature request to let users optionally use such services as a middleman between karakeep and the real site when karakeep detects a failure to archive.

MelanieTanaka · 2025-06-23T13:24:23Z

After testing, archive.is wasn't a good choice, but web.archive.org works really well. I put up a feature request here: #1652

In the meantime I'm using a shoddily patched-together Discord bot to make the process as easy as possible for myself.
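
For reference, the fallback idea boils down to something like this sketch. It is not the bot's code and not what #1652 specifies; `crawl_with_fallback` and the `fetch` callback are hypothetical names. The only external detail it relies on is the `https://web.archive.org/web/<url>` address form, which as far as I know resolves to a capture of the page when the Wayback Machine has one.

```python
from typing import Callable, Optional

# Prefix for Wayback Machine snapshot lookups; the original URL is appended as-is.
WAYBACK_SNAPSHOT_PREFIX = "https://web.archive.org/web/"


def wayback_snapshot_url(url: str) -> str:
    """Build the Wayback Machine address for a capture of `url`."""
    return WAYBACK_SNAPSHOT_PREFIX + url


def crawl_with_fallback(
    fetch: Callable[[str], Optional[str]], url: str
) -> tuple[Optional[str], str]:
    """Try the real site first; if that yields nothing, try the archived copy.

    Returns (content, source_url). content is None when both attempts fail,
    in which case source_url is the Wayback address that was tried last.
    """
    content = fetch(url)
    if content:
        return content, url
    fallback = wayback_snapshot_url(url)
    return fetch(fallback), fallback


if __name__ == "__main__":
    def fake_fetch(u: str) -> Optional[str]:
        # Stand-in for a real HTTP request: pretend only the archive answers.
        return "archived copy" if u.startswith(WAYBACK_SNAPSHOT_PREFIX) else None

    print(crawl_with_fallback(fake_fetch, "https://example.com/post"))
```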

1 reply

peter-avila Jul 26, 2025

Hey @MelanieTanaka. Thanks again for working on this. I do have a suggestion: I think having multiple options would be better than just two.

What I had in mind are these:

Off
On for failed crawls.
On for a set list of domains.
On for failed crawls & a list of domains.
Always on.

I think these would provide some flexibility and keep the number of requests reasonable and under their rate limit. As the app grows and for instances of large bookmark imports, the volume and rate of requests could be problematic.
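
To make the five options concrete, here is a rough sketch of how the decision could be expressed. All names (`ArchiveFallbackMode`, `should_use_archive`) are made up for illustration and nothing here reflects Karakeep's code. One assumption to flag: I read "failed crawls & a list of domains" as "use the archive if the crawl failed or the domain is on the list".

```python
from enum import Enum
from urllib.parse import urlsplit


class ArchiveFallbackMode(Enum):
    OFF = "off"
    FAILED_CRAWLS = "failed_crawls"
    DOMAIN_LIST = "domain_list"
    FAILED_CRAWLS_AND_DOMAIN_LIST = "failed_crawls_and_domain_list"
    ALWAYS = "always"


def should_use_archive(
    mode: ArchiveFallbackMode,
    url: str,
    crawl_failed: bool,
    domains: frozenset = frozenset(),
) -> bool:
    """Decide whether a bookmark should go through the archive service."""
    host = (urlsplit(url).hostname or "").lower()
    # A listed domain also covers its subdomains (news.example.com matches example.com).
    on_list = any(host == d or host.endswith("." + d) for d in domains)

    if mode is ArchiveFallbackMode.OFF:
        return False
    if mode is ArchiveFallbackMode.ALWAYS:
        return True
    if mode is ArchiveFallbackMode.FAILED_CRAWLS:
        return crawl_failed
    if mode is ArchiveFallbackMode.DOMAIN_LIST:
        return on_list
    # FAILED_CRAWLS_AND_DOMAIN_LIST: either condition is enough (assumption).
    return crawl_failed or on_list


if __name__ == "__main__":
    listed = frozenset({"example.com"})
    print(should_use_archive(ArchiveFallbackMode.DOMAIN_LIST, "https://news.example.com/a", False, listed))
```

Only the "always on" mode sends every bookmark to the archive service, so the other modes keep the request volume down.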
