Move fullDomFetcher to Playwright #1144

Closed
wants to merge 3 commits into from

Conversation

@LVerneyEC LVerneyEC commented Apr 3, 2025

Hi,

Here is a proposal to rewrite the full DOM fetcher, moving it from Puppeteer to Playwright.

The updated browser setup also supports HTTP/HTTPS proxies (e.g. a corporate proxy), and its behavior can be adjusted with two environment variables:

  • PLAYWRIGHT_NO_SANDBOX to disable all the sandboxing in Chrome (required for running in Docker, depending on the Docker setup).
  • PLAYWRIGHT_NO_HEADLESS to run it in headful mode (sometimes useful for debugging purposes)
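
As an illustration of the two toggles above, here is a minimal sketch of how they could be turned into browser launch options. The helper name `buildLaunchOptions` is hypothetical, and treating any non-empty value as "enabled" is an assumption; the PR's actual parsing may differ. `headless`, `chromiumSandbox` and `args` are standard Playwright launch options.

```javascript
// Hypothetical helper: maps the two environment variables to a
// Playwright-style launch options object.
// Assumption: any non-empty value switches the toggle on.
function buildLaunchOptions(env = process.env) {
  const noSandbox = Boolean(env.PLAYWRIGHT_NO_SANDBOX);
  const noHeadless = Boolean(env.PLAYWRIGHT_NO_HEADLESS);

  return {
    // Headless unless PLAYWRIGHT_NO_HEADLESS is set
    headless: !noHeadless,
    // Chromium sandbox stays on unless PLAYWRIGHT_NO_SANDBOX is set
    chromiumSandbox: !noSandbox,
    // Extra Chromium flags, only added when sandboxing is disabled
    args: noSandbox ? ['--no-sandbox', '--disable-setuid-sandbox'] : [],
  };
}

// Usage sketch (needs the patchright package, which is not loaded here):
//   const { chromium } = require('patchright');
//   const browser = await chromium.launch(buildLaunchOptions());
```

Keeping the mapping in a pure function makes it possible to check the toggles without starting a browser.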

This uses the patchright wrapper around Playwright, which adds several patches against obvious Playwright detection mechanisms, similar to the previously used puppeteer-extra-plugin-stealth.

Best,

@@ -94,10 +94,8 @@
"morgan": "^1.10.0",
"node-fetch": "^3.1.0",
"octokit": "2.0.2",
"patchright": "1.50.1",
Author

Just flagging that this is an explicit pin at the moment due to Kaliiiiiiiiii-Vinyzu/patchright#58. Issue is closed but the fix is not yet part of the latest release. This version should be adjusted after review, prior to merging.

@LVerneyEC
Author

I noticed the tests are failing due to linting and commit/changelog issues. I'll fix these, but I'm happy to get a high-level review first, to ensure this is useful and worth merging, and then fix everything at once afterwards :)

@MattiSG
Member

MattiSG commented Apr 4, 2025

Thanks @LVerneyEC for this contribution! Fully agree with a first high-level overview before ironing out details :)
The intervention seems minimal. Do you have examples of cases that were blocked with the previous implementation and are unblocked with that switch? 🙂

@LVerneyEC
Author

> Do you have examples of cases that were blocked with the previous implementation and are unblocked with that switch? 🙂

Not so much. I have another PR to come for the htmlOnlyFetcher, for which this widely increases coverage.

Here, the main benefit is to move away from puppeteer-extra-stealth, which has been unmaintained for a couple of years: https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth.

There are also more high-level updates, such as supporting corporate proxies and offering the ability to run headful for debugging purposes.

@Ndpnt Ndpnt closed this Apr 10, 2025
@Ndpnt Ndpnt reopened this Apr 10, 2025
@Ndpnt
Member

Ndpnt commented Apr 10, 2025

Hi @LVerneyEC,

I've conducted a series of benchmark tests to evaluate the potential benefits of switching from Puppeteer to Playwright. Below are the detailed results:

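| Browser automation tool | Run # | Total failures | 403 errors | Navigation timeouts | Selector timeouts | 404 errors | Duration |
|---|---|---|---|---|---|---|---|
| Puppeteer | 1 | 57 | 48 | 9 | 0 | 0 | 6m 8s |
| Puppeteer | 2 | 47 | 36 | 10 | 0 | 1 | 6m 22s |
| Puppeteer | 3 | 30 | 18 | 11 | 0 | 1 | 5m 50s |
| Puppeteer | 4 | 31 | 19 | 11 | 0 | 1 | 5m 37s |
| Puppeteer | 5 | 31 | 29 | 1 | 0 | 0 | 5m 54s |
| Playwright | 1 | 76 | 59 | 0 | 17 | 0 | 3m 9s |
| Playwright | 2 | 75 | 59 | 0 | 16 | 0 | 2m 46s |
| Playwright | 3 | 72 | 59 | 0 | 13 | 0 | 2m 47s |
| Playwright | 4 | 76 | 59 | 0 | 17 | 0 | 2m 54s |
| Playwright | 5 | 69 | 59 | 0 | 10 | 0 | 2m 59s |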

Observations:

  • Playwright is faster than Puppeteer
  • Playwright shows more consistent 403 error counts
  • Puppeteer shows more variation in error types and counts
  • Puppeteer is less frequently blocked than Playwright

Based on these benchmark results, I do not recommend switching to Playwright at this time. Even though it has faster execution times, its higher failure rates and blocking issues are a blocking point for me.

Regarding the other points mentioned:

  • It seems that Puppeteer natively supports HTTP proxies through environment variables
  • Puppeteer supports non-headless mode through the headless: false parameter in the launch function
  • While recommended to keep enabled, Puppeteer allows sandbox disabling using the --no-sandbox option

Have I missed any key points in my analysis, and do you still see reasons to switch to Playwright despite these results?

@LVerneyEC
Author

Would you have more details about the benchmark and the results? I am a bit surprised by the 403 and selector errors, since they do not really match my experience so far.

@Ndpnt
Member

Ndpnt commented Apr 29, 2025

Hi @LVerneyEC,
Sorry for the late reply, I was on holiday and when I got back we had a seminar.

For the benchmark, I used the PGA collection, as it includes many VLOPs for which we have many blocking issues.

I ran the engine five times consecutively using version 5.0.3 with Puppeteer, and another five times using the version from this PR with Playwright. All runs were executed on a server hosted on OVH Horizon.

You can find the output for each run here.

Since your experience seems different, could you share a bit more about your own results and setup? It would be great to compare and understand the differences.

@LVerneyEC
Author

Hi,

I ran a quick comparison on our collection with both the current main of the engine and the proposed patch with Playwright/Patchright. Run time is comparable for both.

All terms tracked by Puppeteer are tracked by Playwright. The following terms succeed with Playwright but not with Puppeteer:

  • Shein — Privacy Policy
  • Shein — Terms of Purchase
  • Shein — Terms of Service
  • Temu — Prohibited Product List
  • Temu — Quality Guidelines
  • Youtube — Community Guidelines

Best

@Ndpnt
Member

Ndpnt commented May 20, 2025

Hi @LVerneyEC,

I've conducted additional testing across 5 collections, including yours, comparing Puppeteer, Playwright/Patchright, and rebrowser.

Here are my results:

  • While Playwright/Patchright successfully tracked some terms that Puppeteer failed on (like the Shein and Temu documents you mentioned), Puppeteer performed better overall across most terms.
  • I identified that Puppeteer's failures were primarily due to configuration issues that caused navigation timeouts. After fixing this, Puppeteer consistently outperformed both Playwright/Patchright and rebrowser.

In addition to the fix of Puppeteer configuration, I've implemented two improvements to enhance tracking reliability:

Can you test these improvements with your collection and give me feedback?

@LVerneyEC
Author

Hi,

Thanks for the updates to headless browser and retry mechanism! Would you have some tables or logs from your latest tests?

Also, did you backport the latest changes into this PR? Example https://github.com/OpenTermsArchive/engine/blob/main/src/archivist/fetcher/fullDomFetcher.js#L24 which is missing from this PR.

Finally, did you get the Shein and Temu documents to be scraped with the latest version using Puppeteer?

Thanks!

@Ndpnt
Member

Ndpnt commented May 21, 2025

> Thanks for the updates to headless browser and retry mechanism! Would you have some tables or logs from your latest tests?

Here are some tracking logs I saved. I didn’t keep all of them because these were enough for me to draw solid conclusions, and saving and referencing them all would have taken even more time on top of what these tests have already required. Also, I only ran the services that were bot-blocked, not the functioning ones, to speed up the process.
Please note that these tests were conducted before the latest improvements were made.

playwright - puppeeter.zip

> Also, did you backport the latest changes into this PR? Example https://github.com/OpenTermsArchive/engine/blob/main/src/archivist/fetcher/fullDomFetcher.js#L24 which is missing from this PR.

At the moment, based on our tests, Puppeteer still appears to be a better option compared to Playwright. Since there’s no strong reason to switch, we don’t plan to merge this PR. So I haven’t backported the changes.

> Finally, did you get the Shein and Temu documents to be scraped with the latest version using Puppeteer?

Yes I did. Can you confirm that on your side with the latest release?

I won’t close this PR until you confirm that the latest improvements have resolved your issues. If you still encounter cases where Playwright performs better than Puppeteer, please share some tracking logs.

@LVerneyEC
Author

Hi,

Many thanks for the extra details. I did some extensive testing and comparison on my end, comparing with the latest baseline in main as of today.

First, the current Puppeteer implementation cannot use an authenticating proxy. I think it correctly picks up the http_proxy environment variable, but it cannot handle the credentials (env variable of the form http_proxy=http://username:password@host:port). This would require some tuning (see https://github.com/Decodo/Puppeteer/blob/master/puppeteer.js#L8-L15), similar to the proxy parsing in this PR. If sticking to Puppeteer, this should be backported.

Then, I did a full comparison on our dataset of declarations (https://code.europa.eu/dsa/terms-and-conditions-database/vlops-and-vloses/vlop-vlose-declarations). Apart from a few transient and blinking terms (which are quite similar for both browsers), the striking differences are:
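
To make the proxy point concrete, here is a minimal sketch of splitting such a proxy URL into the server address and the credentials. `parseProxy` is a hypothetical helper written for illustration; it is not taken from this PR or from the engine. The usage comments rely on Puppeteer's `--proxy-server` Chromium flag and `page.authenticate()`.

```javascript
// Hypothetical helper: splits http://username:password@host:port into
// the server address and the (optional) credentials.
function parseProxy(proxyUrl) {
  const url = new URL(proxyUrl);

  return {
    // Scheme, host and port only, with the credentials stripped
    server: `${url.protocol}//${url.host}`,
    // URL keeps credentials percent-encoded, so decode them
    username: url.username ? decodeURIComponent(url.username) : null,
    password: url.password ? decodeURIComponent(url.password) : null,
  };
}

// Usage sketch with Puppeteer (not loaded here):
//   const proxy = parseProxy(process.env.http_proxy);
//   const browser = await puppeteer.launch({ args: [`--proxy-server=${proxy.server}`] });
//   const page = await browser.newPage();
//   if (proxy.username) {
//     await page.authenticate({ username: proxy.username, password: proxy.password });
//   }
```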

  • Amazon Store - Global Store terms and conditions was only working with your Puppeteer implementation (HTTP code 403 otherwise, systematically working in one case / systematically failing in the other, over a few tries).
  • TikTok - Commercial Terms was only working with our Playwright implementation (no match of the selector - getting a captcha page, systematically working in one case / systematically failing in the other, over a few tries).
  • Shein - Seller Agreement is never working, no matter the browser.
  • A few other platforms are blinking, but this is not statistically relevant and this is similar under both situations.

In light of this, I would propose to strengthen and expand your latest retry strategy as follows:

  • If scraping without client scripts and getting a failure, then retry with fullDomFetcher (this is your latest addition)
  • If scraping with client scripts and getting a failure, then retry once (or a configurable number of times) with fullDomFetcher (just a basic retry for transient issues)
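
The two rules above could be sketched as follows. `fetchHtmlOnly` and `fetchFullDom` are hypothetical async functions standing in for the engine's fetchers, and `maxRetries` is an assumed name for the configurable retry count; this is an illustration of the proposal, not the engine's implementation.

```javascript
// Sketch of the proposed retry strategy. The fetchers are injected so the
// logic can be exercised without a browser.
async function fetchWithRetry(url, { executeClientScripts, fetchHtmlOnly, fetchFullDom, maxRetries = 1 }) {
  if (!executeClientScripts) {
    try {
      return await fetchHtmlOnly(url);
    } catch (error) {
      // Rule 1: an HTML-only failure escalates to the full DOM fetcher
      return fetchFullDom(url);
    }
  }

  // Rule 2: a full DOM failure is retried up to maxRetries more times
  let lastError;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fetchFullDom(url);
    } catch (error) {
      lastError = error;
    }
  }

  throw lastError;
}
```

With the default `maxRetries = 1`, a term with client scripts enabled gets at most two full DOM attempts before the last error is surfaced.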

Beyond the fact that there might be transient issues, the reason for this is that, depending on your infrastructure/proxy solution, you might end up doing the second scraping try with a different IP, thereby increasing your success rate.

Given the results on my testing set (no clear winner, 1 vs 1 failures), I'm wondering whether it would make sense to keep both browsers and either have it configurable in the executeClientScripts (e.g. true == Puppeteer, false == htmlOnlyFetcher, "playwright" == playwright codebase) or use it as an escalation when retrying a failed term?
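
One way to picture this proposal is a small function that maps the setting to an ordered list of fetchers to try. The function name, the `escalate` flag and the string labels are all hypothetical and only serve the illustration.

```javascript
// Hypothetical mapping from a term's executeClientScripts value to the
// ordered list of fetchers to try. Labels are illustrative, not module names.
function fetcherChain(executeClientScripts, { escalate = false } = {}) {
  if (executeClientScripts === 'playwright') {
    return ['playwright'];
  }

  if (executeClientScripts === true) {
    // With escalation, Playwright is only tried after Puppeteer fails
    return escalate ? ['puppeteer', 'playwright'] : ['puppeteer'];
  }

  return ['htmlOnly'];
}
```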

Best

@Ndpnt
Member

Ndpnt commented May 29, 2025

Hi @LVerneyEC,

Thanks for the testing and feedback.

Regarding the issue you mentioned with TikTok - Commercial Terms not working under Puppeteer: I dug into it and found that the problem can be fixed by setting the waitUntil option, which determines when the page is considered loaded, to either networkidle0 or networkidle2. However, reverting to that approach would lead to navigation timeouts on other terms.
To avoid that, I implemented a solution that explicitly waits for the expected elements to be present and non-empty on the page.
With this update, the TikTok - Commercial Terms page now loads successfully with Puppeteer both on my local machine and on our OVH experimentation server.
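
For readers who want to picture the "present and non-empty" wait, here is a minimal sketch built on Puppeteer's `page.waitForFunction`. The function names and the 30-second default timeout are assumptions for illustration; the engine's actual implementation may differ.

```javascript
// Predicate evaluated in the page: true once every selector matches an
// element whose text is not blank. `doc` defaults to the page's document and
// is only overridden when checking the predicate outside a browser.
function allPresentAndNonEmpty(selectors, doc = document) {
  return selectors.every(selector => {
    const element = doc.querySelector(selector);

    return Boolean(element) && element.textContent.trim().length > 0;
  });
}

// `page` is expected to be a Puppeteer Page. waitForFunction re-evaluates the
// predicate in the page until it returns true or the timeout elapses.
function waitForContent(page, selectors, timeout = 30000) {
  return page.waitForFunction(allPresentAndNonEmpty, { timeout }, selectors);
}
```

Compared with waiting for network idle, this ties the wait to the content that is actually going to be extracted, so pages that keep background requests open do not have to hit the navigation timeout.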

While adding a fallback to Playwright when Puppeteer fails could be a practical solution, it would significantly increase code complexity and maintenance overhead. For this reason, I prefer to avoid it unless it's absolutely necessary and there's no reliable workaround within the existing Puppeteer setup.

Regarding support for authenticating proxies, would you be open to proposing an implementation in a separate pull request?

@Ndpnt
Member

Ndpnt commented Jun 2, 2025

And I forgot to mention: it does indeed also seem like a good idea to expand the retry strategy to include failures that occurred with client scripts enabled, since we’ve both seen transient blocking errors in that scenario.

@Ndpnt
Member

Ndpnt commented Jun 19, 2025

Hi @LVerneyEC,

Just following up as we haven’t received a reply, and the issues you were facing with Puppeteer have been addressed.
Thanks again for contributing to improving the tracking success rate. All improvements have been available since engine version 5.6.0.
If you're interested in adding support for authenticated proxies, we'd be happy to review and merge a PR for that.

Closing this for now, but feel free to reopen if you run into other limitations that could be addressed with Playwright.

@Ndpnt Ndpnt closed this Jun 19, 2025
@LVerneyEC
Author

Hi @Ndpnt,

Sorry for coming back late on this. I cannot use the current upstream engine directly at the moment due to the sandboxing mechanism in Puppeteer (it cannot run in a Docker image running as root, which my infrastructure constraints impose). I would basically need this to be backported from this PR.

If this is OK for you, let me know whether you'd rather push it yourself or whether I should open a small PR for these environment variable toggles for sandbox and/or headless.

I ran a test run on our infra (manually overriding the sandbox, see above), and I got two unexpected errors on Temu documents: https://code.europa.eu/dsa/terms-and-conditions-database/vlops-and-vloses/vlop-vlose-declarations/-/issues/13#note_461065 and https://code.europa.eu/dsa/terms-and-conditions-database/vlops-and-vloses/vlop-vlose-declarations/-/issues/34#note_461064.

Error message was

Fetch failed: Fetch failed: Execution context was destroyed, most likely because of a navigation.

And this is unexpected to me at the moment (looks like a bug in OTA logic with puppeteer, not really a selector/antibot issue). These are failing with Playwright, but with a proper selector error due to hitting an antibots page with the same infra in use.

Apart from this, it seems to be running on-par with the Playwright-based code from this PR.

@Ndpnt
Member

Ndpnt commented Jun 26, 2025

> Hi @Ndpnt,
>
> Sorry for coming back late on this. I cannot use the current upstream engine directly at the moment due to the sandboxing mechanism in Puppeteer (not possible to use in a Docker image running as root, which is imposed by my infrastructure constraints, due to the Puppeteer sandboxing). I would need basically this to be backported from this PR.
>
> If this is OK for you, let me know whether you'd rather push it or should I open a small PR for these environment variables toggle for sandbox and/or headless.

Yes, that works for me. You can go ahead and open a small PR for that. Thanks!

> I ran a test run on our infra (manually overloading the sandbox, see before), and I got two unexpected errors on Temu documents: https://code.europa.eu/dsa/terms-and-conditions-database/vlops-and-vloses/vlop-vlose-declarations/-/issues/13#note_461065 and https://code.europa.eu/dsa/terms-and-conditions-database/vlops-and-vloses/vlop-vlose-declarations/-/issues/34#note_461064.
>
> Error message was
>
> Fetch failed: Fetch failed: Execution context was destroyed, most likely because of a navigation.
>
> And this is unexpected to me at the moment (looks like a bug in OTA logic with puppeteer, not really a selector/antibot issue). These are failing with Playwright, but with a proper selector error due to hitting an antibots page with the same infra in use.

I’m currently investigating this further to understand the root cause, and I’ll get back to you once I’ve identified anything concrete.
