Aggregating adult performers metadata - authority file / schema discussion #10

Open
laurus-lx opened this issue Oct 4, 2021 · 2 comments

Comments

@laurus-lx

Currently stash-box supports only a single "source of truth" for scenes, performers, and studios, whereas performer data aggregated from various sources (index sites, tubes, social media, studios) may differ, with varying degrees of confidence.

This is a proposal to create an authority file that will:

  1. Have a list of data sources (sites)
  2. Have a regularly updated scrape of scene/performer metadata
  3. Keep track of metadata as it changes over time
  4. Normalize metadata (birthdays, locations, scene dates and titles, performer physical attributes)
  5. Generate periodic snapshots:
    a. Assign a confidence value to performer matches across sources, then link and de-dup performers
    b. Assign a confidence value to metadata and de-dup it
    c. Generate an output dump of scenes/performers/studios
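To make step 5a a bit more concrete, here is a minimal sketch in Python of scoring performer matches across sources. The record fields, the weights, and the threshold are all assumptions for illustration; nothing here is an agreed design.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass
class PerformerRecord:
    source: str                       # hypothetical source label, e.g. "site_a"
    name: str
    birthdate: Optional[str] = None   # ISO 8601 date string, if the source has one


def match_confidence(a: PerformerRecord, b: PerformerRecord) -> float:
    """Score in [0, 1] that two records describe the same performer.

    The 0.6 / 0.4 / 0.2 weights are arbitrary placeholders.
    """
    name_score = SequenceMatcher(None, a.name.casefold(), b.name.casefold()).ratio()
    if a.birthdate and b.birthdate:
        # Matching birthdates add evidence, conflicting ones add nothing.
        date_score = 0.4 if a.birthdate == b.birthdate else 0.0
    else:
        # A missing birthdate is treated as neutral rather than as a conflict.
        date_score = 0.2
    return 0.6 * name_score + date_score


def link_candidates(
    records: List[PerformerRecord], threshold: float = 0.75
) -> List[Tuple[PerformerRecord, PerformerRecord, float]]:
    """Return cross-source pairs whose score reaches the threshold."""
    links = []
    for a, b in combinations(records, 2):
        if a.source == b.source:
            continue  # de-duplication within one source is a separate problem
        score = match_confidence(a, b)
        if score >= threshold:
            links.append((a, b, score))
    return links
```

A real matcher would probably also weigh aliases and shared external identifiers, which are stronger evidence than a fuzzy name comparison.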


There is a discussion about adding that functionality to stash-box itself: https://discord.com/channels/559159668438728723/798641040029777980/894662081830322206

Whether this is integrated into stash-box or kept separate, we need to come up with a schema, so I wanted to start this discussion.
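As a starting point for the schema discussion, one possible shape is a flat table of observations, where every scraped value keeps its source, scrape date, and confidence. The sketch below assumes that shape; the class name, field names, and the tie-breaking rule are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Observation:
    performer_id: str   # hypothetical authority-file identifier
    attribute: str      # e.g. "birthdate", "height_cm"
    value: str
    source: str         # site the value was scraped from
    observed_at: str    # ISO 8601 date of the scrape
    confidence: float   # 0.0 to 1.0, assigned during snapshot generation


def history(observations: List[Observation], performer_id: str, attribute: str) -> List[Observation]:
    """All recorded values for one attribute, oldest first."""
    matching = [
        o for o in observations
        if o.performer_id == performer_id and o.attribute == attribute
    ]
    return sorted(matching, key=lambda o: o.observed_at)


def snapshot(observations: List[Observation]) -> Dict[Tuple[str, str], Observation]:
    """Keep the most confident value per (performer, attribute).

    Ties are broken in favour of the more recent observation.
    """
    best: Dict[Tuple[str, str], Observation] = {}
    for obs in observations:
        key = (obs.performer_id, obs.attribute)
        current = best.get(key)
        if current is None or (obs.confidence, obs.observed_at) > (
            current.confidence,
            current.observed_at,
        ):
            best[key] = obs
    return best
```

Keeping every observation instead of overwriting values is what lets the same data answer both "what changed over time" and "what is the best current value".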

@laurus-lx
Author

Query for pulling external identifiers from Wikipedia / Wikidata (credit: Tweeticoats, on Discord):

https://query.wikidata.org/#SELECT%20%3Fpornographic_actor%20%3Fpornographic_actorLabel%20%3Fdate_of_birth%20%3Fmass%20%3Fheight%20%3Feye_color%20%3Feye_colorLabel%20%3Fhair_color%20%3Fhair_colorLabel%20%20%3Fsex_or_gender%20%3Fsex_or_genderLabel%20%3Fplace_of_birth%20%3Fwork_period_start%20%3FTwitter_username%20%3FInstagram_username%20%3FPornhub_ID%20%3FFacebook_ID%20%3FIMDb_ID%20%3FIAFD_female_performer_ID%20%3FIAFD_male_performer_ID%20%3FAdult_Film_Database_actor_ID%20%3Fyouporn_ID%20%3FRedTube_ID%20%3FAVN_performer_ID%20%3FAWMDB_performer_ID%20%3FOnlyFans_ID%20%3FEGAFD_ID%20%3FxHamster_performer_ID%20%3FTMDb_person_ID%20%3FXXXBios_female_performer_ID%20%3FXXXBios_transgender_performer_ID%20%3FModelhub_ID%20%3Fofficial_website%20%3FVIAF_ID%20%3FPenthouse_ID%20%3FSnapchat_username%20%3FTwitch_channel_ID%20%20%3Fimage%20%3FCommons_category%20WHERE%20%7B%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%22.%20%7D%0A%20%20%3Fpornographic_actor%20wdt%3AP106%20wd%3AQ488111%20.%0A%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP569%20%3Fdate_of_birth.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP2067%20%3Fmass.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP2048%20%3Fheight.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP1340%20%3Feye_color.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP1884%20%3Fhair_color.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP21%20%3Fsex_or_gender.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP19%20%3Fplace_of_birth.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP2031%20%3Fwork_period_start.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP2002%20%3FTwitter_username.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP8718%20%3FFacebook_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP2003%20%3FInstagram_username.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP5246%20%3FPornhub_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP3869%20%3FIAFD_female_performer_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP4505%20%3FIAFD_male_performer_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP3351%20%3FAdult_Film_Database_actor_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP4267%20%3Fyouporn_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP5540%20%3FRedTube_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP8718%20%3FAVN_performer_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP8721%20%3FAWMDB_performer_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP8604%20%3FOnlyFans_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP8767%20%3FEGAFD_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP8720%20%3FxHamster_performer_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP345%20%3FIMDb_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP4985%20%3FTMDb_person_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP9233%20%3FXXXBios_female_performer_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP9174%20%3FXXXBios_transgender_performer_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP8280%20%3FModelhub_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP856%20%3Fofficial_website.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP214%20%3FVIAF_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP6290%20%3FPenthouse_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP2984%20%3FSnapchat_username.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP5797%20%3FTwitch_channel_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP18%20%3Fimage.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP373%20%3FCommons_category.%20%7D%0A%0A%7D%0A
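The URL-encoded query is hard to read and to extend, so here is a small Python sketch that rebuilds the same kind of query from a name-to-property mapping and produces an equivalent link. The property IDs are copied from a subset of the query above; `build_query` and `query_url` are hypothetical helper names. One thing worth checking in the original: it binds P8718 to both `?Facebook_ID` and `?AVN_performer_ID`, which looks like a copy-paste slip, so the sketch leaves both out.

```python
from urllib.parse import quote

# Variable name -> Wikidata property ID, taken from the query above (subset only).
PROPERTIES = {
    "date_of_birth": "P569",
    "height": "P2048",
    "Twitter_username": "P2002",
    "IMDb_ID": "P345",
}

SUBJECT = "?pornographic_actor"


def build_query(properties: dict) -> str:
    """Build a SPARQL query with one OPTIONAL clause per property."""
    select_vars = " ".join(f"?{name}" for name in properties)
    optionals = "\n".join(
        f"  OPTIONAL {{ {SUBJECT} wdt:{pid} ?{name}. }}"
        for name, pid in properties.items()
    )
    return (
        f"SELECT {SUBJECT} {SUBJECT}Label {select_vars} WHERE {{\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }\n'
        # Same selection triple as the original query.
        f"  {SUBJECT} wdt:P106 wd:Q488111 .\n"
        f"{optionals}\n"
        "}\n"
    )


def query_url(query: str) -> str:
    """Percent-encode the query into a query.wikidata.org link."""
    return "https://query.wikidata.org/#" + quote(query, safe="")


if __name__ == "__main__":
    print(query_url(build_query(PROPERTIES)))
```

Adding another identifier then only means adding one entry to `PROPERTIES`.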

@laurus-lx
Author

For collaborating on the performers authority file, I think the easiest way to proceed is to share scraped performer data via torrents or file-hosting sites. Metadata can be packed into JSON. We'll then have an Extract/Transform/Load script pull these files and transform them into a usable dataset (cross-referencing, normalization, validation), so anybody can replicate the process without relying on any central host. The list of sources would be periodically expanded with new sites and with updates from existing sites.
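A rough sketch of what such an Extract/Transform/Load script could look like in Python. It assumes each shared dump is a JSON file containing a list of objects with `name` and `birthdate` keys; that layout, the accepted date formats, and the function names are all assumptions, since no dump format has been agreed yet.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Iterable, Iterator, List, Optional

# Date formats the sketch accepts; ambiguous numeric forms such as 03/05/1990
# are deliberately left out because day/month order cannot be guessed safely.
DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%d %B %Y")


def normalize_birthdate(raw: Optional[str]) -> Optional[str]:
    """Return an ISO 8601 date, or None if the value cannot be parsed."""
    if not raw:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def extract(dump_dir: Path) -> Iterator[dict]:
    """Yield raw records from every *.json dump in a directory."""
    for path in sorted(dump_dir.glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            for record in json.load(handle):
                record["source_file"] = path.name  # keep provenance
                yield record


def transform(records: Iterable[dict]) -> List[dict]:
    """Normalize fields and drop records that have no usable name."""
    cleaned = []
    for record in records:
        name = (record.get("name") or "").strip()
        if not name:
            continue
        cleaned.append({
            "name": name,
            "birthdate": normalize_birthdate(record.get("birthdate")),
            "source_file": record.get("source_file"),
        })
    return cleaned


def load(records: List[dict], out_path: Path) -> None:
    """Write the normalized dataset as a single JSON file."""
    out_path.write_text(json.dumps(records, indent=2, sort_keys=True), encoding="utf-8")
```

Because the script only reads local files, anyone who downloads the same dumps should be able to reproduce the same dataset.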
