@Lc4B commented Jun 14, 2025

Hentai Update

Updated:

  • hstream.yml
  • hanime.yml

Added:

  • Oppai.yml
  • HentaiSaturn.yml
  • HentaiSubIta.yml

Details:

  • Improved URL construction via sceneByFragment, with more accurate filename cleaning, especially for older hentai (*Oppai is partially excluded due to the use of symbols in its URLs).
  • Standardized output titles; each scraper had a different format, e.g.:
    Hentai 2 2 | Hentai 2 - 2 | Hentai 2 Ep 2 | Hentai 2 Episode 02 | ...
    Now the episode number in the title is displayed with a dash and a two-digit number (- 02), ensuring a consistent library (see the sketch after this list).
  • Removed capture of tags related to resolution like "4K" or "HD"; Stash already displays a banner with the correct resolution of the file you own, so those tags were just misleading.
  • In a hentai context, each scene is an episode of a series, so I decided to use the Groups (Movie) section to create series entries; I added group construction directly in the sceneScraper, so a manual scrape of an episode can create the corresponding group.
  • [hstream, hanime] While capturing the series during scene scraping, the cover of episode 1 will be used.
  • [hstream] Added groupScraper support from URL.
  • [hstream] During search, the cover URL was built from the thumbnail URL; now the search URL is modified directly, removing the need for this conversion.
  • [hstream] With the latest update, the preview image was changed to the cover in scenes; I reverted it back to the preview, which seems more appropriate since these are episode-specific preview images and better suited for how Stash uses them (we’ll use covers for series).
  • [hanime] Added scene URL construction from the cover URL.
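For illustration, the title normalization and the in-scene group construction look roughly like this in the YAML (selectors and regexes here are simplified placeholders, not the exact ones shipped in the PR):

    xPathScrapers:
      sceneScraper:
        scene:
          Title:
            selector: //h1/text()
            postProcess:
              - replace:
                  # unify "Name 2" / "Name - 2" / "Name Ep 2" / "Name Episode 02" -> "Name - <N>"
                  - regex: '\s*(?:-|Ep|Episode)?\s*(\d+)\s*$'
                    with: ' - $1'
                  # pad single-digit episode numbers: "Name - 2" -> "Name - 02"
                  - regex: ' - (\d)$'
                    with: ' - 0$1'
          # build the series (group/movie) entry directly during the scene scrape,
          # so a manual scrape of an episode creates the corresponding group
          Movies:
            Name:
              selector: //h1/text()
              postProcess:
                - replace:
                    # strip the episode suffix so every episode maps to the same series
                    - regex: '\s*(?:-|Ep|Episode)?\s*\d+\s*$'
                      with: ''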

(HentaiSaturn and HentaiSubIta are somewhat rough in various aspects, mainly used to retrieve Italian plot summaries, but can be useful for some older hentai.)

Lc4B added 5 commits June 15, 2025 00:49
 * Improved URL construction via sceneByFragment
 * Standardized output titles
 * Removed capture of tags related to resolution
 * Added group construction in the sceneScraper
 * Added groupScraper support from URL
 * Changed the search URL (with covers) to remove the post-process conversion
 * Restored image "preview" for scenes
 * Improved URL construction via sceneByFragment
 * Standardized output titles
 * Removed capture of tags related to resolution
 * Added group construction in the sceneScraper
 * Added scene URL construction from the cover URL
 * Italian plots
 * Italian plots
@Lc4B (Author) commented Jun 15, 2025

hentaisaturn and hentaisubita have a commonXPaths section that is simply an aggregator; it is not used directly. The object fields are grouped there and taken individually where needed, so even if it contains "wrong" fields they do not cause errors.
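As a sketch of that layout (the commonXPaths key is from the scrapers; the example field and selectors are hypothetical):

    # aggregator only: never consumed as a whole, it just hosts the anchored blocks
    commonXPaths:
      plot: &commonPlot
        selector: //div[@class="desc"]/text()
        postProcess:
          - replace:
              - regex: '\s+'
                with: ' '

    xPathScrapers:
      sceneScraper:
        scene:
          Details: *commonPlot   # each field is pulled in individually where needed
      groupScraper:
        movie:
          Synopsis: *commonPlot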

::fixed

@feederbox826 (Collaborator) commented

That is A Lot. What filename format are you using? I'm unable to replicate it with yt-dlp, or are you just stripping it down to a Least Common (Filename)?

@Lc4B (Author) commented Jun 24, 2025

> That is A Lot. What filename format are you using? I'm unable to replicate it with yt-dlp, or are you just stripping it down to a Least Common (Filename)?

Sorry, but I don't understand what you mean. If you mean the cleaning of the sceneByFragment filename: I simply try to remove "common" fields or symbols that would break the URL, also taking the URL patterns into account; whatever comes after the last number is cut as well, for example:
Name - 02 - Title
[HSUB] Name 02v2 (HD) [3bb935c6]
Name_02_sub_eng
Obviously the name between the beginning and the episode number must be correct or it won't find the scene anyway, but it can avoid renaming a few more files than before.
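To make that concrete: the cleaning happens in sceneByFragment before the URL is built. A minimal sketch of the idea, with an illustrative site URL and simplified regexes (not the exact shipped ones):

    sceneByFragment:
      action: scrapeXPath
      queryURL: "https://example.com/hentai/{filename}"
      queryURLReplace:
        filename:
          - regex: '\.\w+$'                 # drop the file extension
            with: ''
          - regex: '\[[^\]]*\]|\([^)]*\)'   # strip fansub tags, hashes, "(HD)", ...
            with: ''
          - regex: '^(.*\d).*$'             # cut everything after the last number
            with: '$1'
          - regex: '[\s_]+'                 # turn separators into slug dashes
            with: '-'
          - regex: '-{2,}'                  # collapse runs of dashes
            with: '-'
      scraper: sceneScraper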

If instead you mean the output format of the title: I only cut or added to the title obtained from scraping to bring it to a name resembling the template used by hstream; I used that one because it is the most complete scraper, so I assume the most used. I wasn't trying to match a common pattern, only a uniform one across the scrapers.
Maybe I should change it to something more common; series are usually named "Name S01E02", but I have no way to recognize the season, so as a season-less pattern I could opt for "Name ep02".

(I have to change hanime, which uses "Name 2" for new entries and "Name Ep 2" for old ones; I hadn't considered this case.)

Lc4B added 7 commits June 25, 2025 03:30
 * updated title template: Name ep02
 * updated title template: Name ep02
 * updated title template: Name ep02
 * updated title template: Name ep02
 * updated title template: Name ep02
 * updated title with more cases
 * updated sceneByURL
 * added sceneByName/sceneByQueryFragment via python
@Lc4B (Author) commented Jun 25, 2025

  • changed the output title of all scrapers to a better-known pattern; I opted for: Name ep02 (Kodi/Episode Naming/No Season).
  • [hanime] I found out that it uses many more title formats than I knew; I found:
    • Name 1
    • Name Ep 2
    • Name ep 3
    • Name Ep. 4
    • Name Episode 5
    • Name - Episode 6
      so I fixed the scene and group title output to handle all these cases (see the sketch at the end of this comment).
  • [hanime] inserted a more complete link for sceneByURL
  • [hanime] created a Python script to search for scenes! The script needs no configuration and only requires requests to fetch the search results.
    I don't know Python well; I wanted to avoid as many dependencies as possible and not port all the YAML scraper functions to it, so the script DOES NOT RUN any scene scraper: it only performs the search and gives you the title (original, not formatted) and the URL of the chosen scene, and from those you can run the scrapers.
    Hanime has a large amount of media, and with this script you can search for scenes not only by "official name" but also by alias, e.g.:
    • Bible Black
    • Czarna księga
    • 바이블 블랙
    • バイブルブラック
    • Im Banne des Satans
    • (...)
(demo video attached: hanime.search.scene.mp4)

If someone is able to start the scraper directly from the Python script with the obtained URL (instead of exiting), possibly without adding dependencies and without removing the current sceneByURL operation in the YAML, I'd be grateful, because unfortunately I'm not able to do it :/
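As for the title fix above, a postProcess that folds all those hanime variants into the Name ep02 pattern might look like this (a sketch; the regexes are illustrative, not the exact shipped ones):

    Title:
      selector: //h1/text()
      postProcess:
        - replace:
            # "Name 1", "Name Ep 2", "Name ep 3", "Name Ep. 4",
            # "Name Episode 5", "Name - Episode 6" -> "Name ep<N>"
            - regex: '\s*(?:-\s*)?(?:Episode|Ep\.?|ep)?\s*(\d+)\s*$'
              with: ' ep$1'
            # pad single-digit episode numbers: "Name ep1" -> "Name ep01"
            - regex: ' ep(\d)$'
              with: ' ep0$1'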

@feederbox826 (Collaborator) commented

pretty sure it's because you're filtering by views and it's falling back when there aren't close enough results

you're still failing validation and whatever LLM you're using doesn't understand the scraper return format

@Lc4B (Author) commented Jun 25, 2025

hentaisaturn and hentaisubita fail validation because I use anchors like this:

      Date: &date
        selector: //div/b[text()="Data di uscita:"]/following-sibling::text()[1]
        postProcess:
          - replace:
              - regex: 'Gennaio'   
                with: 'January'
              - regex: 'Febbraio'  
                with: 'February'
              - regex: 'Marzo'     
                with: 'March'
              - regex: 'Aprile'    
                with: 'April'
              - regex: 'Maggio'    
                with: 'May'
              - regex: 'Giugno'    
                with: 'June'
              - regex: 'Luglio'    
                with: 'July'
              - regex: 'Agosto'    
                with: 'August'
              - regex: 'Settembre' 
                with: 'September'
              - regex: 'Ottobre'   
                with: 'October'
              - regex: 'Novembre'  
                with: 'November'
              - regex: 'Dicembre'  
                with: 'December'
          - parseDate: 2 January 2006

so with the alias *date I get the whole block (selector and postProcess) and I don't have to rewrite anything.
I tested it and it works in Stash without problems (if you want to test it, create a new group and scrape from a link, for example: https://www.hentaisaturn.tv/hentai/Amai-Ijiwaru, and you will see the date formatted correctly).
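In other words, the anchored block is defined once and the alias is reused wherever the field appears, e.g. (placement illustrative):

    xPathScrapers:
      groupScraper:
        movie:
          # the alias expands to the whole anchored block: selector + postProcess
          Date: *date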

The validation instead expects a value attached directly to the anchor, like:
Date: &date //div/b[text()="Data di uscita:"]/following-sibling::text()[1]
but that way it is no longer a "shortcut", and I would have to rewrite the postProcess for each one. Do I have to change it?

@Maista6969 (Collaborator) commented

> hentaisaturn and hentaisubita fail validation because I use anchors like this:

HentaiSaturn isn't failing validation as far as I can tell, only HentaiSubIta. The validation failure is giving the wrong error message there; your actual error is that the "parseDate" operation expects a string but is receiving the integer 2006:

      Date: &date
        selector: //span[@class="split"][b[text()="Anno:"]]/text()
        postProcess:
          - replace:
              - regex: '\s*(\d{4})'
                with: '$1'
          - parseDate: 2006
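If the year-only Go layout was the intent, presumably quoting it would satisfy the parser, since YAML would then pass "2006" as a string rather than an integer:

          - parseDate: '2006'   # quoted: a Go date-layout string, not the integer 2006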

Lc4B added 2 commits June 25, 2025 14:58
 * removed incorrect parseDate
 * changed the search to sort results by ascending title
@feederbox826 (Collaborator) commented

I do not like the addition of all the filename filtering; it seems to be overdoing it, and I've opened an RFC: https://discourse.stashapp.cc/t/rfc-scraper-queryurlreplace/2375

Also, the conversion into a Python script seems needlessly complicated, and I don't like the direction your LLM of choice has taken, since we natively support JSON

@Lc4B (Author) commented Jul 4, 2025

> I do not like the addition of all the filename filtering; it seems to be overdoing it, and I've opened an RFC: https://discourse.stashapp.cc/t/rfc-scraper-queryurlreplace/2375

I can assure you that it may seem exaggerated, but it is not. If we were talking about porn scrapers there would be no need, because the scene filename is generally either exact or not. But a hentai filename can have both different formatting and various extra information, which also changes depending on where the file was obtained (subtitle language, fansub/site, hash, version, codec, original title, ...), and there is always at least one of these. I have never gotten a hentai file with a completely clean name, never.

So I tried an automatic identification of my library with the "original" scrapers, but they found almost nothing. I studied what the main cases to isolate could be and built this filtering, which got good results for me and let me avoid working manually on each scene or renaming everything by hand.

Also, I just found these examples you could take inspiration from: Scanning files without renaming them
(my files were all like this)

> Also, the conversion into a Python script seems needlessly complicated, and I don't like the direction your LLM of choice has taken, since we natively support JSON

I absolutely don't want to convert everything to Python; as I already said, I want to keep these scrapers working in YAML. But I wanted to add that function to hanime because it didn't have it and its search is really good: being able to use aliases means you can search by a secondary title, different from or in another language than the main one used by the site, which is a great help.
I used Python + requests simply because I have used it on other occasions, so I knew I could get the search results that way and went in that direction. If someone can get them directly without Python, they are welcome to modify it however they prefer; that would be even better 👍
