Automated identification/scraping/tagging #1802
-
As someone who's set up no scrapers and only uses Auto-tag, I can't say I'm thrilled about de-emphasizing the auto-tag functionality, but I think an automatic scraping task is a good idea.
-
I think a good starting point would be auto-scraping/tagging scenes from stash-box, using a reasonable threshold for the % match to fingerprints. That way there is one task that could get a stash with generally popular content tagged very quickly. As for bulk scraping scenes, this is something I currently do via a plugin, which works well enough. Most of the time, the big issue for things not getting tagged is inherently bad structure, naming, and missing metadata, which I'm not sure any amount of parsing the file structure will help with. That is why stash-box is such a game changer in that regard.
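To make the threshold idea concrete, here is a rough Python sketch. It is only an illustration: the function names, the `"type:value"` fingerprint strings, and the 0.5 default threshold are assumptions made for this example, not anything taken from Stash or stash-box.

```python
from __future__ import annotations


def match_ratio(local: set[str], remote: set[str]) -> float:
    """Fraction of the local scene's fingerprints also found on the remote scene."""
    if not local:
        return 0.0
    return len(local & remote) / len(local)


def best_match(local: set[str], candidates: dict[str, set[str]],
               threshold: float = 0.5) -> str | None:
    """Return the id of the highest-scoring candidate at or above the threshold.

    `candidates` maps a hypothetical remote scene id to its fingerprints.
    Returns None when no candidate reaches the threshold.
    """
    best_id, best_score = None, 0.0
    for scene_id, fingerprints in candidates.items():
        score = match_ratio(local, fingerprints)
        if score >= threshold and score > best_score:
            best_id, best_score = scene_id, score
    return best_id
```

A higher threshold trades coverage for fewer false matches, so it would probably need to be user-configurable.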
-
I like this strategy a lot. Anything we can do to automate off of StashDB would be a help. I think StashDB as a central repository of scenes is really the primary party trick of Stash. My current workflow is to import scenes, then run the auto-tagger. Considering I have so many studios and performers, it helps greatly in adding those to new scenes. If I could instead run an auto-scraper that matches these or the phash from StashDB, then we've made this one hell of a solid user experience. I still come back to scene matching against StashDB: when no fingerprint matches, matching on the intersection of Performer, Studio, and Scene Duration should find a very close match.
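A minimal sketch of that fallback, in Python. Everything here is hypothetical: the `Scene` type, the 2-second duration tolerance, and the choice to require at least one shared performer (rather than an identical cast) are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scene:
    studio: str
    performers: frozenset
    duration: float  # seconds


def is_close_match(local: Scene, remote: Scene, tolerance: float = 2.0) -> bool:
    """True only when studio, performers and duration all agree."""
    same_studio = local.studio.lower() == remote.studio.lower()
    # Assumption: one shared performer is enough to count as agreement.
    shared_performer = bool(local.performers & remote.performers)
    similar_duration = abs(local.duration - remote.duration) <= tolerance
    return same_studio and shared_performer and similar_duration


def fallback_candidates(local: Scene, remotes: list, tolerance: float = 2.0) -> list:
    """Remote scenes that pass all three checks, in their original order."""
    return [r for r in remotes if is_close_match(local, r, tolerance)]
```

If this returns more than one candidate, the result is ambiguous and would likely need manual review rather than an automatic match.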
-
This automatic scraper/tagger idea sounds great and would fill in the gaps in my stash metadata perfectly. Most of my filenames are well structured with studio, performer, title, and date information, and the corresponding stash entries currently have that populated. However, I'm missing URLs for most scenes, so the automatic scraper would help me get the URL, cover, details, and tags filled in. I also have many scenes with URLs that I know are also in stash-box. The fingerprints for these scenes match in the Tagger view, but it's very tedious to have to go through and fingerprint match each file, so being able to automatically match against stash-box would be a great time saver.
-
I have a different idea: why not just leave the scraping to the contributors, and have everyone else just use the contributors' work?
-
Large-scale population of scene metadata remains a significant pain-point for new and existing users.
Scraping is currently an onerous process: at a minimum it requires visiting each scene page, and in many cases it requires the user to know the original URL of the scene.
The auto-tagger is (imo) a ham-fisted sledgehammer approach to tagging, and requires the user to have pre-populated their database with performers, studios and tags.
Stash-box currently provides the best mechanism for populating metadata, with the Tagger view facilitating an easier and more rapid way of populating metadata.
I propose a new Task - the name of which I haven't settled on, but could be Identify or Auto-Scrape (or repurpose Auto-Tag). This task would be configured with a prioritised list of data sources. The task would try to identify the scene from each data source in turn, stopping as soon as it has found a match.
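As a rough illustration of the "try each source in turn, stop at the first match" behaviour, here is a Python sketch. The `identify` function and the shape of a source (a name paired with a lookup callable that returns metadata or `None`) are assumptions made for this example and do not describe any existing Stash code.

```python
from typing import Callable, Optional

# A data source is modelled as a callable taking a file path and returning
# scene metadata, or None when it cannot identify the scene (assumption).
Lookup = Callable[[str], Optional[dict]]


def identify(path: str, sources: list) -> Optional[tuple]:
    """Query (name, lookup) pairs in priority order; stop at the first match.

    Returns (source_name, metadata), or None if every source misses.
    """
    for name, lookup in sources:
        result = lookup(path)
        if result is not None:
            # Lower-priority sources are never queried once a match is found.
            return name, result
    return None
```

Because the loop returns early, the order of the list is the priority order, and expensive sources can be placed last.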
Configurable data sources could be stash-box instances or scrapers. The scraper configuration structure will be changed to optionally include filename patterns. Scene filenames will be matched against these patterns and the scraper will be run only if the filename matches a pattern.
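One way the filename gating could look, sketched in Python. The `ScraperConfig` type and its `filename_patterns` field are hypothetical, and treating the patterns as case-insensitive regular expressions is an assumption; the proposal does not say whether they would be regexes, globs, or something else.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ScraperConfig:
    name: str
    # Optional: a scraper with no patterns is never selected by filename here.
    filename_patterns: list = field(default_factory=list)

    def matches(self, filename: str) -> bool:
        """True if any configured pattern is found in the filename."""
        return any(
            re.search(pattern, filename, re.IGNORECASE)
            for pattern in self.filename_patterns
        )


def scrapers_for(filename: str, scrapers: list) -> list:
    """Names of the scrapers whose patterns match, in configured order."""
    return [s.name for s in scrapers if s.matches(filename)]
```

For example, a scraper configured with the pattern `^studioa[._ -]` would be selected for `StudioA.2021-01-01.Title.mp4` and skipped for everything else.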
The existing auto-tagger code will be used to create a new internal scraper. This scraper would match the scene filename against performers, studios, and tags and populate the results based on matches.
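A minimal Python sketch of what such an internal scraper might do. This is not the existing auto-tagger algorithm; the function names and the matching rule (whole words, case-insensitive, with any run of punctuation, spaces or underscores allowed between the words of a name) are assumptions chosen for illustration.

```python
import re


def _name_regex(name: str):
    """Compile a whole-word, case-insensitive pattern for a multi-word name."""
    words = [re.escape(w) for w in name.split()]
    # Allow separators such as ".", "_", "-" or spaces between the words.
    body = r"[\W_]+".join(words)
    return re.compile(rf"(?<![A-Za-z0-9]){body}(?![A-Za-z0-9])", re.IGNORECASE)


def scrape_from_filename(filename: str, performers: list,
                         studios: list, tags: list) -> dict:
    """Return the known performers, studios and tags that appear in the filename."""
    def hits(names):
        # Blank names are skipped so they cannot match everything.
        return [n for n in names if n.strip() and _name_regex(n).search(filename)]

    return {
        "performers": hits(performers),
        "studios": hits(studios),
        "tags": hits(tags),
    }
```

The lookarounds keep a short name from matching inside a longer word, so a tag called "door" would not be applied to a file containing "outdoor".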
This new task could be run as part of the scan process, on its own (ignoring organised scenes), on a single scene or a selection of scenes. It could optionally be run as an interactive action, much like the current scraping implementation.
The default set of data sources could be configurable, with this configuration able to be overridden when running the task. Where possible the task could also optionally create performers, tags and studios as needed.
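The default-plus-override behaviour could be modelled roughly as follows. This is a Python sketch only: `IdentifyOptions`, its fields, and the source names are hypothetical, and `None` is assumed to mean "not specified for this run".

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class IdentifyOptions:
    # Hypothetical defaults; the real source names would come from user config.
    sources: tuple = ("stash-box", "auto-tag")
    create_missing: bool = False  # create performers/tags/studios as needed


def resolve_options(defaults: IdentifyOptions,
                    overrides: Optional[dict] = None) -> IdentifyOptions:
    """Apply per-run overrides on top of the stored defaults.

    Keys whose value is None are ignored, so the default is kept for them.
    """
    given = {k: v for k, v in (overrides or {}).items() if v is not None}
    return replace(defaults, **given)
```

Keeping the defaults immutable means a one-off run with different settings cannot accidentally change the saved configuration.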
Once this feature is mature enough, I believe that the existing auto-tagger functionality should be retired.