JavLibrary scrapers not working #1817

Open
theeda opened this issue May 6, 2024 · 11 comments
@theeda

theeda commented May 6, 2024

** Scraper name **
JavLibrary_python

** Scraper method **
Getting the following error when trying to scrape with JavLibrary_python:

error while fragment scraping with scraper JavLibrary_python: could not unmarshal json from script output: EOF

@theeda theeda added the bug Something isn't working label May 6, 2024
@Maista6969
Collaborator

Thank you for reporting this. I suspect this is failing for you because the Python scrapers need a little more setup than other scrapers, as described in the README for this repo (we have not yet found a way to display this in the Stash UI).
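
For anyone wondering why a missing setup step shows up as `EOF`: as far as I understand it, Stash expects a script scraper to print a JSON object on stdout, so a script that crashes before printing anything (for example on a missing Python dependency) leaves nothing to decode. A minimal sketch of that failure mode follows. `parse_script_output` is a hypothetical helper written for illustration, not Stash's actual code.

```python
import json


def parse_script_output(stdout_text: str) -> dict:
    """Decode a scraper script's stdout the way a JSON consumer would."""
    if not stdout_text.strip():
        # Nothing was printed, e.g. the script died on a missing import.
        raise ValueError("script produced no output (reported as EOF)")
    return json.loads(stdout_text)


# A healthy script prints one JSON object describing the scene.
print(parse_script_output('{"title": "example"}'))

# A script that crashed before printing anything yields empty output.
try:
    parse_script_output("")
except ValueError as err:
    print("error:", err)
```

Running the scraper script by hand and looking at stderr is usually the quickest way to see which dependency is missing.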

But even after you've got that set up I'm afraid the JavLibrary scraper will still fail due to their recent application of a more aggressive Cloudflare strategy. We have reports from users who have used a VPN to get a Japanese IP address and that seems to work for scraping JavLibrary, but for right now there's no easy way to scrape their site.

@Net005
Contributor

Net005 commented Jul 7, 2024

If you set the FlareSolverr URL in the JavLibrary Python file, you can get it going again even with a non-JP VPN.
It's a bit more work and you need a Windows machine running FlareSolverr, but I'm getting consistently good results with Stash.
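
To check a FlareSolverr instance independently of Stash, a request like the one below can be sent by hand. This is only a sketch: the payload fields (`cmd`, `url`, `maxTimeout`) match the request bodies visible in the FlareSolverr logs posted in this thread, while `FLARESOLVERR_URL`, `build_payload` and `fetch_via_flaresolverr` are assumed names that should be adapted to your own setup.

```python
import json
import urllib.request

# Assumption: FlareSolverr is reachable here; change host and port as needed.
FLARESOLVERR_URL = "http://localhost:8191/v1"


def build_payload(target_url: str, max_timeout_ms: int = 60000) -> dict:
    """Build the JSON body for a FlareSolverr GET request."""
    return {"cmd": "request.get", "url": target_url, "maxTimeout": max_timeout_ms}


def fetch_via_flaresolverr(target_url: str) -> dict:
    """POST the payload to FlareSolverr and return its decoded JSON reply."""
    body = json.dumps(build_payload(target_url)).encode("utf-8")
    request = urllib.request.Request(
        FLARESOLVERR_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Wait slightly longer than maxTimeout so FlareSolverr can answer first.
    with urllib.request.urlopen(request, timeout=70) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_via_flaresolverr("https://www.javlibrary.com/en/vl_searchbyid.php?keyword=BBAN-414")` from the machine that runs Stash shows whether the URL configured in the scraper is reachable at all.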

@theeda
Author

theeda commented Aug 13, 2024

@Net005 did you have to do anything special apart from the standard Flaresolverr setup to get it to work? When I try to use Flaresolverr, I get the following errors:

The Cloudflare 'Verify you are human' button not found on the page.
Waiting for title (attempt X): Just a moment...
Timeout waiting for selector

It keeps retrying but it can't find the checkbox

@Pheromir

Pheromir commented Oct 3, 2024

@Net005 did you have to do anything special apart from the standard Flaresolverr setup to get it to work?

I was able to get the scraper working for me with flaresolverr in a separate docker container:
-> It seems to be an issue with FlareSolverr itself: it can't find the checkbox. There is a fork that doesn't have this issue, as I found out here: FlareSolverr/FlareSolverr#1380 (comment) (21hsmw/flaresolverr:nodriver)

-> I had to change line 445 in the .py from response_html.status_code = json_input['solution']['status'] to response_html.status_code = responseJson.status_code, because FlareSolverr always returned "None" in the JSON while the code expected a status code (like 200). I'm not sure whether this is an issue with the FlareSolverr fork, as it's my first time using FlareSolverr.

response_html.status_code = json_input['solution']['status']

Maybe this will help someone.

(Edited: Fixed wrong information)
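
A slightly more defensive variant of that change, as a sketch only: prefer the status reported inside `solution` when it is a real integer, and fall back to the HTTP status of the FlareSolverr reply otherwise. `resolve_status_code` is a hypothetical helper and is not part of the scraper.

```python
def resolve_status_code(flaresolverr_json: dict, http_status: int) -> int:
    """Pick a usable status code from a FlareSolverr reply.

    flaresolverr_json: decoded JSON body returned by FlareSolverr.
    http_status: HTTP status of the FlareSolverr response itself.
    """
    solution = flaresolverr_json.get("solution") or {}
    status = solution.get("status")
    if isinstance(status, int):
        return status
    # Some builds report None here, so fall back to the outer HTTP status.
    return http_status


print(resolve_status_code({"solution": {"status": None}}, 200))
print(resolve_status_code({"solution": {"status": 403}}, 200))
```

This keeps the original behaviour for builds that do return a status and only falls back when the field is missing or `None`.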

@janemba

janemba commented Oct 3, 2024

Line 445 in which file? docker-compose.yml has only about 20 lines.

@Pheromir

Pheromir commented Oct 3, 2024

Sorry, had a brainfart I guess.
I meant the .py of the scraper and forgot the link:

response_html.status_code = json_input['solution']['status']

@TecnoCreeper

Reproduction steps for my problem:
I am running in a Linux container on a Windows 10 host.

docker-compose.yml

# APPNICENAME=Stash
# APPDESCRIPTION=An organizer for your porn, written in Go
services:
  stash:
    image: stashapp/stash:latest
    container_name: stash
    hostname: stash
    restart: unless-stopped
    ## the container's port must be the same with the STASH_PORT in the environment section
    ports:
      - "3003:9999"
    ## If you intend to use stash's DLNA functionality uncomment the below network mode and comment out the above ports section
    # network_mode: host
    logging:
      driver: "json-file"
      options:
        max-file: "10"
        max-size: "2m"
    environment:
      - TZ="Europe/Rome"
      - STASH_STASH=/data/
      - STASH_GENERATED=/generated/
      - STASH_METADATA=/metadata/
      - STASH_CACHE=/cache/
      ## Adjust below to change default port (9999)
      - STASH_PORT=9999
    volumes:
      ## Adjust below paths (the left part) to your liking.
      ## E.g. you can change ./config:/root/.stash to ./stash:/root/.stash
      
      ## Keep configs, scrapers, and plugins here.
      - ./config:/root/.stash
      ## Point this at your collection.
      - D:/private/media:/data:ro
      ## This is where your stash's metadata lives
      - ./metadata:/metadata
      ## Any other cache content.
      - ./cache:/cache
      ## Where to store binary blob data (scene covers, images)
      - ./blobs:/blobs
      ## Where to store generated content (screenshots,previews,transcodes,sprites)
      - ./generated:/generated

  # ===== FlareSolverr =====
  flaresolverr:
    image: 21hsmw/flaresolverr:fixlooping
    container_name: flaresolverr-stash
    environment:
      - LOG_LEVEL=${LOG_LEVEL:-info}
      - LOG_HTML=${LOG_HTML:-false}
      - CAPTCHA_SOLVER=${CAPTCHA_SOLVER:-none}
      - TZ=Europe/Rome
      - LANG=fr-FR
    restart: unless-stopped

networks:
  default:
    name: caddy_net
    external: true

I installed the JavLibrary_python scraper from the settings, then edited JavLibrary_python.py:

FLARESOLVERR_ENABLED = True
FLARESOLVERR_URL = "http://flaresolverr-stash:8191/v1"

On the Tagger page I have a file named BBAN-414.mp4. When I query BBAN-414 and press search, I get "The search doesn't return any result."

I tried debugging the script on my own but couldn't get it working. It looked as if the script might not be following redirects: when I entered the value of JAV_SEARCH_HTML.url in my browser, it redirected me to the correct entry.
For example:
JAV_SEARCH_HTML.url = https://www.javlibrary.com/en/vl_searchbyid.php?keyword=BBAN-414 redirects to https://www.javlibrary.com/en/?v=javmeebk4e in my browser.
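
One way to test the redirect theory while debugging is to classify the final URL of a fetch: a `vl_searchbyid.php?keyword=...` URL means the response is still the search page, while a `?v=...` URL means the redirect to a single entry happened. The sketch below is based only on the two URLs quoted above; `classify_javlibrary_url` is a hypothetical helper, not a function from the scraper.

```python
from urllib.parse import parse_qs, urlparse


def classify_javlibrary_url(url: str) -> str:
    """Return 'detail', 'search' or 'unknown' for a JavLibrary URL."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    if "v" in query:
        # Entry pages carry the video id in the ?v= parameter.
        return "detail"
    if parsed.path.endswith("vl_searchbyid.php") and "keyword" in query:
        return "search"
    return "unknown"


print(classify_javlibrary_url("https://www.javlibrary.com/en/vl_searchbyid.php?keyword=BBAN-414"))
print(classify_javlibrary_url("https://www.javlibrary.com/en/?v=javmeebk4e"))
```

If the URL returned for the search request is still classified as `search`, the redirect was not followed (or the ID matched more than one entry).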

@TecnoCreeper

I solved my problem. I had to use 21hsmw/flaresolverr:nodriver as the image instead of fixlooping. No other changes (other than enabling flaresolverr and configuring the url).

@Gykes

Gykes commented Oct 25, 2024

As an extra data point, I tried to get it running as well.

With FlareSolverr in Docker I couldn't get it to run at all. I installed FlareSolverr on my Windows PC and routed through that. I now at least get logs, and FlareSolverr seems to detect the Cloudflare challenge but then just times out.

2024-10-25 15:11:22 INFO     Challenge detected. Title found: Just a moment...
2024-10-25 15:12:24 ERROR    Error: Error solving the challenge. Timeout after 60.0 seconds.
2024-10-25 15:12:24 INFO     Response in 62.984 s
2024-10-25 15:12:24 INFO     192.168.1.54 POST http://192.168.1.10:8191/v1 500 Internal Server Error
2024-10-25 15:12:29 INFO     Incoming request => POST /v1 body: {'cmd': 'request.get', 'url': 'https://www.javlibrary.com/en/vl_searchbyid.php?keyword=SHYN-213', 'maxTimeout': 60000}
2024-10-25 15:12:30 INFO     Challenge detected. Title found: Just a moment...
2024-10-25 15:13:32 ERROR    Error: Error solving the challenge. Timeout after 60.0 seconds.
2024-10-25 15:13:32 INFO     Response in 63.026 s
2024-10-25 15:13:32 INFO     192.168.1.54 POST http://192.168.1.10:8191/v1 500 Internal Server Error
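
When FlareSolverr does return HTML, a quick way to tell an unsolved challenge from a real page is to look at the page title, which the log above reports as "Just a moment..." while the challenge is active. A rough sketch under that assumption follows. `looks_like_challenge` is a hypothetical helper, and the title text is taken from these logs only, so it may change on Cloudflare's side.

```python
import re

# Capture the text between <title> and </title>, ignoring case and newlines.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def looks_like_challenge(html: str) -> bool:
    """Guess whether the HTML is still the Cloudflare interstitial page."""
    match = TITLE_RE.search(html)
    if match is None:
        return False
    return match.group(1).strip().lower().startswith("just a moment")


print(looks_like_challenge("<html><head><title>Just a moment...</title></head></html>"))
print(looks_like_challenge("<html><head><title>Some other page</title></head></html>"))
```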

@Gykes

Gykes commented Oct 26, 2024

Using the image built from this PR:

alexfozor/flaresolverr:pr-1300-experimental

I was able to get around the CF issue.

2024-10-26 00:30:07 INFO     Serving on http://0.0.0.0:8191
2024-10-26 00:34:09 INFO     Incoming request => POST /v1 body: {'cmd': 'request.get', 'url': 'https://www.javlibrary.com/en/vl_searchbyid.php?keyword=mdvr00324 1', 'maxTimeout': 60000}
2024-10-26 00:34:17 INFO     Challenge detected.
2024-10-26 00:34:24 INFO     Challenge solved!
2024-10-26 00:34:25 INFO     Response in 15.336 s
2024-10-26 00:34:25 INFO     172.17.0.1 POST http://192.168.1.54:8191/v1 200 OK
2024-10-26 00:34:43 INFO     Incoming request => POST /v1 body: {'cmd': 'request.get', 'url': 'https://www.javlibrary.com/en/vl_searchbyid.php?keyword=SHYN-213', 'maxTimeout': 60000}
2024-10-26 00:34:46 INFO     Challenge detected.
2024-10-26 00:34:49 INFO     Challenge solved!
2024-10-26 00:34:50 INFO     Response in 6.463 s
2024-10-26 00:34:50 INFO     172.17.0.1 POST http://192.168.1.54:8191/v1 200 OK 

It still returns no data for some reason but at least it isn't a CF block anymore. Maybe @Maista6969 can look into why that's happening

@mgfjx

mgfjx commented Oct 29, 2024

I have the same issue. I run the Stash app in Docker on my NAS. The NAS can reach missav.com, but the MissAV(jp) scraper is not working:
[Screenshot: QQ_1730218163645]

8 participants