
Conversation


@Druidblack Druidblack commented May 12, 2025

I have made a number of changes that help the scraper get data from pages with different designs.

Examples to test

https://pornolab.net/forum/viewtopic.php?t=1908520
https://pornolab.net/forum/viewtopic.php?t=3177748
https://pornolab.net/forum/viewtopic.php?t=2333222
https://pornolab.net/forum/viewtopic.php?t=1582609
https://pornolab.net/forum/viewtopic.php?t=2137727
https://pornolab.net/forum/viewtopic.php?t=2747150
https://pornolab.net/forum/viewtopic.php?t=1984852
https://pornolab.net/forum/viewtopic.php?t=1384220
https://pornolab.net/forum/viewtopic.php?t=2861716
https://pornolab.net/forum/viewtopic.php?t=1580232

Short description

  1. I changed the logic for getting the name: we now take it from the page header, since the name is not always present in the page body.
  2. I added alternative ways of retrieving every field. Since there is no single page layout, fields with the same meaning can have different labels; the scraper now has a better chance of finding the information.
  3. Added an image-availability check. If an image is not reachable, the scraper skips it and moves on to the next live image.
  4. Changed the logic for getting image links. The scraper first looks for an image placed on the left side of the page, then on the right (if none is found on the left); if neither is found, it falls back to the first image on the page.
  5. If several images share the same placement on the page, the scraper picks one of them at random, so repeating the request may return a different image. This matters when a page has multiple covers, different sides of a cover, or both a logo and a cover.
  6. To avoid unwanted images, the scraper skips GIF files and ignores images placed below the block with the video characteristics (screenshots from the video may appear there, and nothing in that part of the page is suitable as a cover). In other words, the scraper does not look at the second part of the page.
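The image-selection rules above (left first, then right, then any generic image; skip GIFs; random pick among ties) can be sketched roughly as follows. The function name and the three list parameters are hypothetical, not the PR's actual identifiers:

```python
import random

def pick_cover(left_imgs, right_imgs, generic_imgs):
    """Pick a cover URL: prefer left-aligned images, then right-aligned,
    then any generic image; skip GIFs; choose randomly among ties."""
    for candidates in (left_imgs, right_imgs, generic_imgs):
        usable = [u for u in candidates if u and not u.lower().endswith(".gif")]
        if usable:
            # Several images with the same placement: random choice, so a
            # repeated request may return a different one.
            return random.choice(usable)
    return None
```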


There are many edits that help you get data from the site page

@feederbox826 feederbox826 left a comment


I'll need to go over it again with a translator and a fine-tooth comb, but it seems to contain a lot of redundant, repetitive code.

My biggest irks and blockers are:

  • anonymized variable names
  • `.rstrip().rstrip().strip()` scattered all over the code, and inconsistently
  • runtime errors, where functions are run before nullish tests and after regex matching

return scraped

# "Имя актрисы" ("Actress name") branch
raw = self.get_field_text(post_b, ["Имя актрисы"])

Would have liked to preserve meaningful variable names rather than `raw` over and over again.

scraped.append(ScrapedPerformer(name=name))
return scraped
# if nothing but ":" and <br> was found after "В ролях" ("Starring") — return an empty list
return []

def get_image(self):

Not a huge fan of the randomized image, tbh.

if url.lower().endswith('.gif'):
continue

if not url:

`not url` has to come before `url.lower()`, or you'll get a runtime error.
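A minimal illustration of the ordering issue — the emptiness check must come before any method call, since the URL may be `None` at this point. This is a hypothetical helper, not the PR's exact code:

```python
def is_usable_image(url):
    # Guard first: calling .lower() on None raises AttributeError.
    if not url:
        return False
    # Safe to call string methods only after the nullish test.
    return not url.lower().endswith(".gif")
```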

parts = split_pattern.split(raw)
scraped = []
for part in parts:
m = re.search(r"\((.*?)\)", part)

Again, meaningful variable names replaced with an obscure naming scheme.

if isinstance(sib, Tag) and 'post-color-text' in sib.get('class', []):
raw = sib.get_text(strip=True)
for tag in raw.rstrip('.').split(','):
t = tag.strip()

Double strip? You're doing it so often you might as well just handle it in a function.
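One way to centralize the repeated trimming, as suggested — a hypothetical helper, not anything already in the PR:

```python
def clean(text, trailing="."):
    """Trim whitespace and a trailing punctuation character in one place,
    instead of chaining .rstrip().rstrip().strip() at every call site."""
    return text.strip().rstrip(trailing).strip() if text else ""
```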

for part in parts:
u = part.strip()
if not u:
continue

check before stripping, unless you're expecting an empty string that also matched your regex rule

if director:
return director

return director.rstrip('.')

Again with the double `lstrip`/`rstrip` and the boolean check.


# If there is neither img-left nor img-right — shuffle the generic images and take them in random order
if not left_imgs and not right_imgs and generic_imgs:
random.shuffle(generic_imgs)

This doesn't match the logic described in your PR: here only the generic images are shuffled, but then the left and right candidate images are added to an empty candidates list?

@feederbox826

If anyone wants to use the PR, they have uploaded it on their own repo https://github.com/Druidblack/Stash-Scrapers/tree/main
