fix(mass): update scraper to use new HTML endpoint after URL change #1720
base: main
for more information, see https://pre-commit.ci
…ted-url' into 1714-masssuperct-changed-or-deleted-url
grossir
left a comment
looks like the court filter is not working
for more information, see https://pre-commit.ci
…ted-url' into 1714-masssuperct-changed-or-deleted-url # Conflicts: # tests/examples/opinions/united_states/massappct_example.compare.json # tests/examples/opinions/united_states/masssuperct_example.compare.json
…ted-url' into 1714-masssuperct-changed-or-deleted-url
grossir
left a comment
- Clean up the example file; it has data it shouldn't have.
- Have you tried running the scraper? I am getting a 403, both locally and on the server. It seems this will need further research, probably some intermediate requests to get cookies.
tests/examples/opinions/united_states/mass_example.compare.json
Yes, now it's blocking me as well (with or without a VPN). The website is using Cloudflare bot protection.
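One way to confirm that a 403 comes from Cloudflare rather than from the court's own server is to inspect the response headers. The sketch below is a heuristic only: the thread does not show the actual response, so the header names checked here (`Server: cloudflare` and `cf-mitigated: challenge`) are assumptions about how Cloudflare typically labels its responses, and `looks_like_cloudflare_block` is a hypothetical helper, not part of the scraper.

```python
def looks_like_cloudflare_block(status_code, headers):
    """Heuristic: True if a response resembles a Cloudflare block or challenge.

    `headers` is any mapping of header names to values, for example the
    `headers` attribute of a response object.
    """
    # HTTP header names are case-insensitive, so normalize before comparing
    normalized = {
        str(name).lower(): str(value).lower()
        for name, value in headers.items()
    }

    # Assumption: Cloudflare-fronted responses carry "Server: cloudflare"
    served_by_cloudflare = "cloudflare" in normalized.get("server", "")

    # Assumption: challenge pages are marked with "cf-mitigated: challenge"
    challenged = normalized.get("cf-mitigated") == "challenge"

    return challenged or (status_code in (403, 503) and served_by_cloudflare)
```

A check like this could be logged next to the failing request, so that a genuine 403 from the site is not mistaken for bot protection.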
@Luis-manzur Can you try using cookies / replicating headers or something similar and see if that fixes the issue?
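The cookies / headers idea could be tried roughly as follows. This is a minimal sketch using only the standard library rather than the scraper's own request session. The header values and the helper names `build_cookie_opener` and `fetch_after_warmup` are hypothetical, and the landing and target URLs must be supplied by the caller. It may well not be enough if the protection looks at more than cookies and headers.

```python
import urllib.request
from http.cookiejar import CookieJar

# Hypothetical header set for illustration; not taken from the PR
EXAMPLE_HEADERS = {
    "User-Agent": "Juriscraper",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_cookie_opener(headers):
    """Return an opener that stores cookies and sends `headers` on every request."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Replace the default "Python-urllib" User-agent header with our own set
    opener.addheaders = list(headers.items())
    return opener, jar


def fetch_after_warmup(opener, landing_url, target_url, timeout=60):
    """Visit `landing_url` first so cookies are set, then fetch `target_url`."""
    # Intermediate request: any Set-Cookie headers end up in the shared jar
    opener.open(landing_url, timeout=timeout).close()
    # Real request: the stored cookies are sent back automatically
    with opener.open(target_url, timeout=timeout) as response:
        return response.read()
```

If the second request still returns a 403, that would suggest the block is not based on cookies or headers alone.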
I found a request library, curl-cffi, that may help us bypass Cloudflare bot protection. It can be used to impersonate a web browser. I tested it with this solution:

    from curl_cffi import requests as curl_requests

    def _request_url_get(self, url):
        """Override to use curl_cffi to bypass Cloudflare protection.

        Execute GET request using curl_cffi with browser impersonation
        to bypass Cloudflare's bot detection.
        """
        self.request["url"] = url
        # Use curl_cffi to impersonate a real Chrome browser
        self.request["response"] = curl_requests.get(
            url,
            impersonate="chrome",
            timeout=60,
            **self.request["parameters"],
        )
        if self.save_response:
            self.save_response(self)
If this depends on The solution looks promising, but I am not sure we should "impersonate" something we are not, given our scraping policy. I know we sometimes change some headers, but this is a more involved step. We should ask in the sprint channel.


fix issue #1714