Automated identification/scraping/tagging #1802

WithoutPants · 2021-10-04T03:29:24Z

Maintainer

Large-scale population of scene metadata remains a significant pain-point for new and existing users.

Scraping is currently an onerous process, requiring going to each scene page at a minimum, and in many cases requiring the user to know the original URL of the scene.

The auto-tagger is (imo) a ham-fisted sledgehammer approach to tagging, and requires the user to have pre-populated their database with performers, studios and tags.

Stash-box currently provides the best mechanism for populating metadata, with the Tagger view facilitating an easier and more rapid way of populating metadata.

I propose a new Task - the name of which I haven't settled on, but could be Identify or Auto-Scrape (or repurpose Auto-Tag). This task would be configured with a prioritised list of data sources. The task would try to identify the scene from each data source in turn, stopping as soon as it has found a match.
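A minimal sketch of that fall-through behaviour, in Python for brevity. Everything here is hypothetical and only illustrates the control flow: `identify`, the `Source` type, and the two stand-in sources are made-up names, not part of Stash.

```python
from typing import Callable, Optional, Sequence

# Hypothetical type: a "source" is any callable that takes a scene path and
# returns a metadata dict, or None when it cannot identify the scene.
Source = Callable[[str], Optional[dict]]


def identify(scene_path: str, sources: Sequence[Source]) -> Optional[dict]:
    """Try each source in priority order and return the first match."""
    for source in sources:
        result = source(scene_path)
        if result is not None:
            return result  # stop as soon as one source finds a match
    return None


# Stand-in sources for illustration only.
def stash_box_lookup(path: str) -> Optional[dict]:
    return None  # pretend no fingerprint match was found


def filename_scraper(path: str) -> Optional[dict]:
    return {"title": "Example", "source": "filename_scraper"}


match = identify("/videos/example.mp4", [stash_box_lookup, filename_scraper])
```

Here the first source finds nothing, so the result comes from the second one; with only the first source configured, `identify` would return `None`.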

Configurable data sources could be stash-box instances or scrapers. The scraper configuration structure will be changed to optionally include filename patterns. Scene filenames will be matched against these patterns and the scraper will be run only if the filename matches a pattern.
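As a rough illustration of the filename-pattern gate, here is a Python sketch. The `ScraperConfig` shape and the `filename_patterns` field name are assumptions, not the actual scraper config format, and treating a scraper with no patterns as always eligible is also an assumption.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScraperConfig:
    # Hypothetical shape; the real scraper configuration may look different.
    name: str
    filename_patterns: List[str] = field(default_factory=list)

    def applies_to(self, filename: str) -> bool:
        # Assumption: a scraper with no patterns configured is always eligible.
        if not self.filename_patterns:
            return True
        return any(
            re.search(pattern, filename, re.IGNORECASE)
            for pattern in self.filename_patterns
        )


scrapers = [
    ScraperConfig("ExampleStudio", [r"^examplestudio[ ._-]"]),
    ScraperConfig("Generic"),
]
eligible = [
    s.name
    for s in scrapers
    if s.applies_to("ExampleStudio - 2021-10-04 - Title.mp4")
]
```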

The existing auto-tagger code will be used to create a new internal scraper. This scraper would match the scene filename against performers, studios, and tags and populate the results based on matches.
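A simplified Python sketch of what such a filename-matching scraper could do. This is not the existing auto-tagger algorithm; `name_regex` and `match_filename` are hypothetical helpers that match whole names while tolerating separators such as dots or underscores.

```python
import re
from typing import Dict, List


def name_regex(name: str):
    # Allow any run of non-alphanumeric characters (including none) between
    # the words of a name, and require that the match is not embedded in a
    # longer alphanumeric word.
    words = [re.escape(word) for word in name.split()]
    body = r"[^a-z0-9]*".join(words)
    return re.compile(r"(?<![a-z0-9])" + body + r"(?![a-z0-9])", re.IGNORECASE)


def match_filename(filename: str, known: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return, per kind (performers/studios/tags), the known names found in the filename."""
    return {
        kind: [name for name in names if name_regex(name).search(filename)]
        for kind, names in known.items()
    }


known = {
    "performers": ["Jane Doe", "Ann"],
    "studios": ["Example Studio"],
    "tags": ["Outdoor"],
}
result = match_filename("Example.Studio - Jane_Doe - Joanna outdoor scene.mp4", known)
```

Note that "Ann" is not matched inside "Joanna", because the pattern refuses matches that sit inside a longer word.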

This new task could be run as part of the scan process, on its own (ignoring organised scenes), on a single scene or a selection of scenes. It could optionally be run as an interactive action, much like the current scraping implementation.

The default set of data sources could be configurable, with this configuration able to be overridden when running the task. Where possible the task could also optionally create performers, tags and studios as needed.

Once this feature is mature enough, I believe that the existing auto-tagger functionality should be retired.

kermieisinthehouse · 2021-10-04T03:43:14Z

Collaborator

As someone who's set up no scrapers and only uses Auto-Tag, I can't say I'm thrilled about de-emphasizing the auto-tag functionality, but I think an automatic scraping task is a good idea.

stg-annon · 2021-10-04T03:56:56Z

Collaborator

I think a good starting point would be auto-scraping/tagging scenes from stashbox, with a reasonable threshold for a % match to fingerprints. That way there is one task that could get a stash with generally popular content tagged very quickly.
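One possible reading of a "% match" threshold, sketched in Python: compare 64-bit perceptual hashes by the fraction of bits that agree, and accept the best candidate only if it clears a cut-off. The function names, the 64-bit hash width, and the 0.95 default are assumptions for illustration, not how stash-box matching actually works.

```python
from typing import Iterable, Optional, Tuple


def phash_similarity(a: int, b: int, bits: int = 64) -> float:
    """Fraction of bits that agree between two perceptual hashes."""
    differing = bin(a ^ b).count("1")
    return 1.0 - differing / bits


def best_match(
    local_phash: int,
    candidates: Iterable[Tuple[str, int]],
    threshold: float = 0.95,  # assumed cut-off; would need tuning
) -> Optional[str]:
    """Return the id of the most similar candidate at or above the threshold."""
    best_id, best_score = None, 0.0
    for scene_id, phash in candidates:
        score = phash_similarity(local_phash, phash)
        if score >= threshold and score > best_score:
            best_id, best_score = scene_id, score
    return best_id


local = 0xFFFF0000FFFF0000
candidates = [("scene-a", 0xFFFF0000FFFF0001), ("scene-b", 0x0)]
chosen = best_match(local, candidates)
```

In this example `scene-a` differs from the local hash by a single bit and is accepted, while `scene-b` agrees on only half of the bits and is rejected.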

As for bulk scraping scenes, this is something I currently do via a plugin, which works well enough. The big issue most of the time for things not getting tagged is inherently bad structure, naming, and missing metadata, which I'm not sure any amount of parsing the file structure will help with. That is why stashbox is such a game changer in that regard.

spincity07 · 2021-10-04T04:00:33Z

I like this strategy a lot. Anything we can do to automate off of StashDB would be a help. I think StashDB as a central repository of scenes is really the primary party-trick of Stash.

My current workflow is to import scenes, then run the auto-tagger. Considering I have so many studios and performers, it helps greatly to add those to new scenes. If I could instead run an auto-scraper that matches these or the phash from StashDB, then we've made this one hell of a solid user experience.

I still come back to scene matching against StashDB: when no fingerprint matches, matching on the intersection of Performer, Studio, and Scene Duration should find a very close match.
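A sketch of that fallback in Python, under stated assumptions: the `Scene` shape is made up, "performer match" is taken to mean at least one shared performer, and the two-second duration tolerance is an arbitrary illustrative value.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Scene:
    # Hypothetical shape, for illustration only.
    title: str
    studio: str
    performers: FrozenSet[str]
    duration: float  # seconds


def fallback_candidates(local: Scene, remote: List[Scene], tolerance: float = 2.0) -> List[Scene]:
    """Keep remote scenes that agree on studio, share a performer, and have a close duration."""
    return [
        r
        for r in remote
        if r.studio.lower() == local.studio.lower()
        and local.performers & r.performers  # at least one shared performer
        and abs(r.duration - local.duration) <= tolerance
    ]


local = Scene("local file", "Example Studio", frozenset({"Jane Doe"}), 1805.0)
remote = [
    Scene("Right one", "Example Studio", frozenset({"Jane Doe", "Ann Lee"}), 1804.0),
    Scene("Wrong length", "Example Studio", frozenset({"Jane Doe"}), 900.0),
    Scene("Wrong studio", "Other", frozenset({"Jane Doe"}), 1805.0),
]
matches = [scene.title for scene in fallback_candidates(local, remote)]
```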

2 replies

Chodbin Oct 5, 2021

I'm new here, only came across stashapp a few days ago so still trying to get to grips with everything and I'm not a programmer/developer (looked at some of the channels in discord and had no idea what people were talking about for example) so with that in mind, consider my comments from more of an end user point only.

I have been using plex to organise up until now but some frustrations led me to search for alternatives, which initially led me to ThePornDB agent/scraper for plex and its discord, which is where i heard about stashapp.

My first impressions are that the process of importing/scanning and then correctly identifying/tagging scenes is so much slower and time consuming than i'm used to with plex but i think long term stash will potentially be a better solution for me, especially because you can edit so much more info than you can in plex.

So anything that speeds up this process would definitely be welcomed, and i look forward to seeing this implemented!

However, whatever method ends up being implemented will only be as good as the source data, which will need to be:

Comprehensive
Up to date
Accurate (especially concerning performer names/aliases across multiple studios)

From the posts above it seems that the intention is to use StashDB? Which, unless i am missing something (I am new, no idea how things get added to it), is:

Not comprehensive, missing a whole host of studios
Not up to date, newest scene is from over 3 months ago
Reasonably accurate, it seems, and from reading discord it looks like people are actively working on the issue of performer aliases.

Maybe it's because i'm using the stashDB and not running a stash-box?

I'm guessing it should be possible to allow users to use regex to parse filenames and then match/create scenes? For example almost all of my files are formatted as: {studio} - {date} - {title} ~ {performer1}, {performer2}, {performer3}.{ext}
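For the naming scheme above, a regex with named groups would be enough. A minimal Python sketch, assuming the date is written as YYYY-MM-DD (the post does not say) and using a hypothetical `parse_filename` helper:

```python
import re
from typing import Optional

# Matches: {studio} - {date} - {title} ~ {performer1}, {performer2}.{ext}
# Assumption: dates are formatted as YYYY-MM-DD.
PATTERN = re.compile(
    r"^(?P<studio>.+?) - (?P<date>\d{4}-\d{2}-\d{2}) - (?P<title>.+?)"
    r" ~ (?P<performers>.+)\.(?P<ext>[^.]+)$"
)


def parse_filename(filename: str) -> Optional[dict]:
    """Split a filename into its fields, or return None if it does not fit the scheme."""
    m = PATTERN.match(filename)
    if m is None:
        return None
    fields = m.groupdict()
    fields["performers"] = [p.strip() for p in fields["performers"].split(",")]
    return fields


parsed = parse_filename("Example Studio - 2021-10-05 - Some Title ~ Jane Doe, Ann Lee.mp4")
```

Files that do not follow the scheme simply return `None`, so they could be passed on to another data source.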

Is it possible to import from stashDB to stash?
I thought of this after i had created my first studios in stash, that it would have been much quicker and easier to select a studio (or studios/whole network of studios) in stashDB, press an import button, and hey presto! they appear in stash, with all the correct information and stashIDs.

I'm beginning to ramble, i'll leave it there for now. Thanks to your post I learnt about the Tagger view, which has sped things up for me!

Chodbin Oct 5, 2021

oops, meant to reply to WithoutPants' first post

7dJx1qP · 2021-10-05T20:11:08Z

This automatic scraper/tagger idea sounds great and would fill in the gaps in my stash metadata perfectly. Most of my filenames are well structured with studio, performer, title, and date information and the corresponding stash entries currently have that populated. However, I'm missing urls for most scenes so the automatic scraper would help me get the url, cover, details, and tags filled in. I also have many scenes with urls that I know are also in stash-box. The fingerprints for these scenes match in the Tagger view, but it's very tedious to have to go through and fingerprint match each file, so being able to automatically match against stash-box would be a great time saver.

philpw99 · 2021-10-09T23:01:48Z

I have a different idea: why not just leave the scraping to the contributors, and have the rest just use the contributors' work?
So you just need 1 contributor to scrape a file, like "abcd.mp4", and set its studio, performers, details, etc. Then all the other 10,000 people just need to retrieve this info by file hash and file name.
Why do users need to scrape at all, when we just need a small number of contributors to make the data for us?
Now let's have an example to explain my point. Today I downloaded "abcd.mp4", and Stash reports that it cannot find a match. So as a contributor, I scrape the info and set all the data, then click on a "submit" button. Instantly all other users can now get the same info for "abcd.mp4" from me. What is so difficult about it?
Maybe you worry about someone putting in wrong info for this file? Use a vote count. Other people get the info about "abcd.mp4" and they like it, so they vote 5 stars for it. Then my info is proven legit. Voting for a contribution is not that hard to implement, right?
So how do you become a contributor? Just register an account on your website, and that's it. If you get too many 0-star votes, you get kicked out. If you get lots of 5-star votes, you are promoted to "Good Contributor". Simple, right?

4 replies

WithoutPants Oct 9, 2021
Maintainer Author

That's basically the end goal, but we're quite a few steps away from that, and this system provides a fallback mechanism if scene metadata isn't found in stash-box.

philpw99 Oct 10, 2021

This is what I don't understand. You have Stash-App to scrape information; you have people who want to scrape scenes and submit the information. You have hash id ready. Now all you need is just let them submit and retrieve the info by others. All the steps are ready IMHO.

There is no need to provide a fallback mechanism. If the scene is too new, the users can scrape the info with StashApp themselves. Or someone scrapes the info and submits it to stash-box, and then it's done. There is no need to help further than that. You don't actually need to develop an AI to guess a scene's info from its file name. You don't need to develop an automatic system to get the scene info from several sources. Just let people do that for you.

WithoutPants Oct 10, 2021
Maintainer Author

"Now all you need is just let them submit and retrieve the info by others."

This step requires a significant amount of development effort on both stash-box and stash. Again, it's what we're building towards, but it's not like we can hit a switch and suddenly have all this functionality in both systems. The functionality described in the original post is significantly less effort and, most importantly, can be developed quickly, allowing users to do this now rather than waiting until we have completed all of the development effort required for a system like the one you describe.

philpw99 Oct 10, 2021

Sorry, I don't have much experience in javascript/react/graphql programming. My database experience mostly came from my Access/Filemaker/MySQL days, so there is a huge gap in my understanding of your coding situation. I am sorry to have raised the issue like that. You guys have already done quite impressive work and are trying to accomplish much more. I hope you won't take any offense at my previous words, and that you keep doing what you are planning to do.
