-
Notifications
You must be signed in to change notification settings - Fork 34
Move fullDomFetcher to Playwright #1144
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -94,10 +94,8 @@ | |||
"morgan": "^1.10.0", | |||
"node-fetch": "^3.1.0", | |||
"octokit": "2.0.2", | |||
"patchright": "1.50.1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just flagging that this is an explicit pin at the moment due to Kaliiiiiiiiii-Vinyzu/patchright#58. Issue is closed but the fix is not yet part of the latest release. This version should be adjusted after review, prior to merging.
I noticed the tests are failing due to linting and commit/changelog issues. I'll fix these, but happy to have a first high-level review first to ensure this is useful and worth merging and fix everything at once afterwards :) |
Thanks @LVerneyEC for this contribution! Fully agree with a first high-level overview before ironing out details :) |
Not so much. I have another PR to come for the htmlOnlyFetcher, for which this increases widely coverage. Here, the main benefit is to move away from Also, more high-level updates such as supporting corporate proxies and offering the ability to run headful for debugging purposes. |
Hi @LVerneyEC, I've conducted a series of benchmark tests to evaluate the potential benefits of switching from Puppeteer to Playwright. Below are the detailed results:
Observations:
Based on these benchmark results, I do not recommend switching to Playwright at this time. Even if it has faster execution times, its higher failure rates and blocking issues is a blocking point for me. Regarding the other points mentioned:
Have I missed any key points in my analysis, and do you still see reasons to switch to Playwright despite these results? |
Would you have more details the benchmark and the results? I am a bit surprised about the 403 and selector errors, since it does not really match my experience so far. |
Hi @LVerneyEC, For the benchmark, I used the PGA collection as it includes many VLOPs and for which we have many blocking issues. I ran the engine five times consecutively using version 5.0.3 with Puppeteer, and another five times using the version from this PR with Playwright. All runs were executed on a server hosted on OVH Horizon. You can find the output for each run here. Since your experience seems different, could you share a bit more about your own results and setup? It would be great to compare and understand the differences. |
Hi, I ran a quick comparison on our collection with both current All terms tracked by Puppeteer are tracked by Playwright. The following terms succeed with Playwright but not with Puppeteer:
Best |
Hi @LVerneyEC, I've conducted additional testing across 5 collections, including yours, comparing Puppeteer, Playwright/Patchright, and rebrowser. Here are my results:
In addition to the fix of Puppeteer configuration, I've implemented two improvements to enhance tracking reliability:
Can you test these improvements with your collection and give me a feedback? |
Hi, Thanks for the updates to headless browser and retry mechanism! Would you have some tables or logs from your latest tests? Also, did you backport the latest changes into this PR? Example https://github.com/OpenTermsArchive/engine/blob/main/src/archivist/fetcher/fullDomFetcher.js#L24 which is missing from this PR. Finally, did you get the Shein and Temu documents to be scraped with the latest version using Puppeteer? Thanks! |
Here are some tracking logs I saved. I didn’t keep all of them because there were enough for me to draw solid conclusions, and saving and referencing them all would have taken a lot of time, as these tests have already taken up quite a lot of time. Also, I only ran the services that were bot-blocked, not the functioning ones, to speed up the process.
At the moment, based on our tests, Puppeteer still appears to be a better option compared to Playwright. Since there’s no strong reason to switch, we don’t plan to merge this PR. So I haven’t backported the changes.
Yes I did. Can you confirm that on your side with the latest release? I won’t close this PR until you confirm that the latest improvements have resolved your issues. If you still encounter cases where Playwright performs better than Puppeteer, please share some tracking logs. |
Hi, Many thanks for the extra details. I did some extensive testing and comparison on my end, comparing with the latest baseline in First, the current Puppeteer implementation cannot use an authenticating proxy. I think it correctly grabs Then, I did a full comparison on our dataset of declarations (https://code.europa.eu/dsa/terms-and-conditions-database/vlops-and-vloses/vlop-vlose-declarations). Apart from a few transient and blinking terms (which are quite similar for both browsers), the striking difference is:
In light of this, I would propose to strengthen and expand your latest retry strategy as such:
Beyond the fact that there might be transient issues, the reason for this is that depending on your infrastructure/proxy solution you might end up doing the second scraping try with a different IP and therefore augmenting your success rate. Given the results on my testing set (no clear winner, 1 vs 1 failures), I'm wondering whether it would make sense to keep both browsers and either have it configurable in the Best |
Hi @LVerneyEC, Thanks for the testing and feedback. Regarding the issue you mentioned with While adding a fallback to Playwright when Puppeteer fails could be a practical solution, it would significantly increase code complexity and maintenance overhead. For this reason, I prefer to avoid it unless it's absolutely necessary and there's no reliable workaround within the existing Puppeteer setup. Regarding support for authenticating proxies, would you be open to proposing an implementation in a separate pull request? |
And I forgot, but indeed, it also seems like a good idea to expand the retry strategy to include failures that occurred with client script enabled since we’ve both seen transient blocking errors in that scenario. |
Hi @LVerneyEC, Just following up as we haven’t received a reply, and the issues you were facing with Puppeteer have been addressed. Closing this for now, but feel free to reopen if you run into other limitations that could be addressed with Playwright. |
Hi @Ndpnt, Sorry for coming back late on this. I cannot use the current upstream engine directly at the moment due to the sandboxing mechanism in Puppeteer (not possible to use in a Docker image running as root, which is imposed by my infrastructure constraints, due to the Puppeteer sandboxing). I would need basically this to be backported from this PR. If this is OK for you, let me know whether you'd rather push it or should I open a small PR for these environment variables toggle for sandbox and/or headless. I ran a test run on our infra (manually overloading the sandbox, see before), and I got two unexpected errors on Temu documents: https://code.europa.eu/dsa/terms-and-conditions-database/vlops-and-vloses/vlop-vlose-declarations/-/issues/13#note_461065 and https://code.europa.eu/dsa/terms-and-conditions-database/vlops-and-vloses/vlop-vlose-declarations/-/issues/34#note_461064. Error message was
And this is unexpected to me at the moment (looks like a bug in OTA logic with puppeteer, not really a selector/antibot issue). These are failing with Playwright, but with a proper selector error due to hitting an antibots page with the same infra in use. Apart from this, it seems to be running on-par with the Playwright-based code from this PR. |
Yes, that works for me. You can go ahead and open a small PR for that. Thanks!
I’m currently investigating this further to understand the root cause, and I’ll get back to you once I’ve identified anything concrete. |
Hi,
Here is a proposal for a rewriting of the full DOM fetcher, moving it to Playwright instead of Puppeteer.
This edited browser also has support for HTTP/HTTPS proxy (e.g. corporate proxy) and behavior can be adjusted by two environment variables:
PLAYWRIGHT_NO_SANDBOX
to disable all the sandboxing in Chrome (required for running in Docker, depending on the Docker setup).PLAYWRIGHT_NO_HEADLESS
to run it in headful mode (sometimes useful for debugging purposes)This is using patchright wrapper around Playwright, which adds several patches for obvious Playwright detection mechanisms. Similar to the previously used
puppeteer-extra-plugin-stealth
.Best,