Move fullDomFetcher to Playwright #1144

LVerneyEC · 2025-04-03T12:18:56Z

Hi,

Here is a proposal to rewrite the full DOM fetcher, moving it from Puppeteer to Playwright.

The updated fetcher also supports HTTP/HTTPS proxies (e.g. a corporate proxy), and its behavior can be adjusted with two environment variables:

PLAYWRIGHT_NO_SANDBOX to disable all the sandboxing in Chrome (required for running in Docker, depending on the Docker setup).
PLAYWRIGHT_NO_HEADLESS to run it in headful mode (sometimes useful for debugging purposes)
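
As a rough illustration of how such toggles could be wired up, the sketch below maps the two variables onto browser launch options. It assumes that any non-empty value enables a toggle, and `buildLaunchOptions` is a hypothetical helper rather than the code from this PR. `headless` and `args` are regular Playwright launch options, and `--no-sandbox` / `--disable-setuid-sandbox` are Chromium flags.

```javascript
// Hypothetical helper: derive browser launch options from the environment.
// Assumption: a toggle is "on" when its variable is set to any non-empty value.
function buildLaunchOptions(env = process.env) {
  const args = [];

  if (env.PLAYWRIGHT_NO_SANDBOX) {
    // Chromium flags that disable sandboxing (often needed when running as root in Docker).
    args.push('--no-sandbox', '--disable-setuid-sandbox');
  }

  return {
    headless: !env.PLAYWRIGHT_NO_HEADLESS, // headful only when explicitly requested
    args,
  };
}

console.log(buildLaunchOptions({ PLAYWRIGHT_NO_SANDBOX: '1' }));
```

The returned object could then be passed to the browser's `launch()` call; keeping the mapping in a pure function makes it easy to test without starting a browser.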

This uses the patchright wrapper around Playwright, which patches several obvious Playwright detection mechanisms, similar to the previously used puppeteer-extra-plugin-stealth.

Best,

LVerneyEC · 2025-04-03T12:19:57Z

package.json

@@ -94,10 +94,8 @@
    "morgan": "^1.10.0",
    "node-fetch": "^3.1.0",
    "octokit": "2.0.2",
+    "patchright": "1.50.1",


Just flagging that this is an explicit pin at the moment due to Kaliiiiiiiiii-Vinyzu/patchright#58. The issue is closed, but the fix is not yet part of the latest release. This version should be adjusted after review, prior to merging.

LVerneyEC · 2025-04-04T14:55:14Z

I noticed the tests are failing due to linting and commit/changelog issues. I'll fix these, but I'd be happy to get a first high-level review to make sure this is useful and worth merging, and then fix everything at once afterwards :)

MattiSG · 2025-04-04T15:00:22Z

Thanks @LVerneyEC for this contribution! Fully agree with a first high-level overview before ironing out details :)
The intervention seems minimal. Do you have examples of cases that were blocked with the previous implementation and are unblocked with that switch? 🙂

LVerneyEC · 2025-04-04T15:16:14Z

> Do you have examples of cases that were blocked with the previous implementation and are unblocked with that switch? 🙂

Not so much. I have another PR to come for the htmlOnlyFetcher, for which this widely increases coverage.

Here, the main benefit is to move away from puppeteer-extra-plugin-stealth, which has been unmaintained for a couple of years: https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth.

It also brings higher-level improvements, such as supporting corporate proxies and offering the ability to run headful for debugging purposes.

Ndpnt · 2025-04-10T15:21:52Z

Hi @LVerneyEC,

I've conducted a series of benchmark tests to evaluate the potential benefits of switching from Puppeteer to Playwright. Below are the detailed results:

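| Browser automation tool | Run # | Total failures | 403 errors | Navigation timeouts | Selector timeouts | 404 errors | Duration |
|---|---|---|---|---|---|---|---|
| Puppeteer | 1 | 57 | 48 | 9 | 0 | 0 | 6m 8s |
| Puppeteer | 2 | 47 | 36 | 10 | 0 | 1 | 6m 22s |
| Puppeteer | 3 | 30 | 18 | 11 | 0 | 1 | 5m 50s |
| Puppeteer | 4 | 31 | 19 | 11 | 0 | 1 | 5m 37s |
| Puppeteer | 5 | 31 | 29 | 1 | 0 | 0 | 5m 54s |
| Playwright | 1 | 76 | 59 | 0 | 17 | 0 | 3m 9s |
| Playwright | 2 | 75 | 59 | 0 | 16 | 0 | 2m 46s |
| Playwright | 3 | 72 | 59 | 0 | 13 | 0 | 2m 47s |
| Playwright | 4 | 76 | 59 | 0 | 17 | 0 | 2m 54s |
| Playwright | 5 | 69 | 59 | 0 | 10 | 0 | 2m 59s |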

Observations:

Playwright is faster than Puppeteer
Playwright shows more consistent 403 error counts
Puppeteer shows more variation in error types and counts
Puppeteer is less frequently blocked than Playwright

Based on these benchmark results, I do not recommend switching to Playwright at this time. Even though it has faster execution times, its higher failure rates and blocking issues are a blocking point for me.

Regarding the other points mentioned:

It seems that Puppeteer natively supports HTTP proxies through environment variables
Puppeteer supports non-headless mode through the headless: false parameter in the launch function
While recommended to keep enabled, Puppeteer allows sandbox disabling using the --no-sandbox option

Have I missed any key points in my analysis, and do you still see reasons to switch to Playwright despite these results?

LVerneyEC · 2025-04-11T06:45:38Z

Would you have more details on the benchmark and the results? I am a bit surprised by the 403 and selector errors, since they do not really match my experience so far.

Ndpnt · 2025-04-29T08:41:29Z

Hi @LVerneyEC,
Sorry for the late reply, I was on holiday and when I got back we had a seminar.

For the benchmark, I used the PGA collection, as it includes many VLOPs, for which we have many blocking issues.

I ran the engine five times consecutively using version 5.0.3 with Puppeteer, and another five times using the version from this PR with Playwright. All runs were executed on a server hosted on OVH Horizon.

You can find the output for each run here.

Since your experience seems different, could you share a bit more about your own results and setup? It would be great to compare and understand the differences.

LVerneyEC · 2025-05-05T12:07:32Z

Hi,

I ran a quick comparison on our collection with both the current main of the engine and the proposed patch with Playwright/Patchright. Run time is comparable for both.

All terms tracked by Puppeteer are tracked by Playwright. The following terms succeed with Playwright but not with Puppeteer:

Shein — Privacy Policy
Shein — Terms of Purchase
Shein — Terms of Service
Temu — Prohibited Product List
Temu — Quality Guidelines
Youtube — Community Guidelines

Best

Ndpnt · 2025-05-20T15:20:31Z

Hi @LVerneyEC,

I've conducted additional testing across 5 collections, including yours, comparing Puppeteer, Playwright/Patchright, and rebrowser.

Here are my results:

While Playwright/Patchright successfully tracked some terms that Puppeteer failed on (like the Shein and Temu documents you mentioned), Puppeteer performed better overall across most terms.
I identified that Puppeteer's failures were primarily due to configuration issues that caused navigation timeouts. After fixing this, Puppeteer consistently outperformed both Playwright/Patchright and rebrowser.

In addition to the fix of the Puppeteer configuration, I've implemented two improvements to enhance tracking reliability:

Automatic fallback to headless browser when bot detection is encountered (eliminating the need for the executeClientScripts workaround to bypass bot protection)
Automatic retry mechanism for transient errors (network/server issues)

Can you test these improvements with your collection and give me feedback?

LVerneyEC · 2025-05-21T09:09:43Z

Hi,

Thanks for the updates to headless browser and retry mechanism! Would you have some tables or logs from your latest tests?

Also, did you backport the latest changes into this PR? Example https://github.com/OpenTermsArchive/engine/blob/main/src/archivist/fetcher/fullDomFetcher.js#L24 which is missing from this PR.

Finally, did you get the Shein and Temu documents to be scraped with the latest version using Puppeteer?

Thanks!

Ndpnt · 2025-05-21T09:58:48Z

> Thanks for the updates to headless browser and retry mechanism! Would you have some tables or logs from your latest tests?

Here are some tracking logs I saved. I didn’t keep all of them: these were enough for me to draw solid conclusions, and saving and referencing every log would have taken a lot of time on top of the time already spent on these tests. Also, I only ran the services that were bot-blocked, not the functioning ones, to speed up the process.
Please note that these tests were conducted before the latest improvements were made.

playwright - puppeeter.zip

> Also, did you backport the latest changes into this PR? Example https://github.com/OpenTermsArchive/engine/blob/main/src/archivist/fetcher/fullDomFetcher.js#L24 which is missing from this PR.

At the moment, based on our tests, Puppeteer still appears to be a better option compared to Playwright. Since there’s no strong reason to switch, we don’t plan to merge this PR. So I haven’t backported the changes.

> Finally, did you get the Shein and Temu documents to be scraped with the latest version using Puppeteer?

Yes I did. Can you confirm that on your side with the latest release?

I won’t close this PR until you confirm that the latest improvements have resolved your issues. If you still encounter cases where Playwright performs better than Puppeteer, please share some tracking logs.

LVerneyEC · 2025-05-22T15:04:46Z

Hi,

Many thanks for the extra details. I did some extensive testing and comparison on my end, comparing with the latest baseline in main as of today.

First, the current Puppeteer implementation cannot use an authenticating proxy. I think it correctly reads the http_proxy environment variable, but it cannot handle the credentials (environment variable of the form http_proxy=http://username:password@host:port). This would require some tuning (see https://github.com/Decodo/Puppeteer/blob/master/puppeteer.js#L8-L15), similar to the proxy parsing in this PR. If sticking to Puppeteer, this should be backported.
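
To make the credentials problem concrete, here is a minimal sketch of the kind of parsing involved. It is not the code from this PR: `parseProxyUrl` and the `proxy.example` host are made up for illustration. The idea is to split the URL into a credential-free server address plus a username and password, which Puppeteer can then consume through the `--proxy-server` launch argument and `page.authenticate()`.

```javascript
// Hypothetical helper: split a proxy URL with embedded credentials into its parts.
function parseProxyUrl(proxyUrl) {
  const url = new URL(proxyUrl);

  return {
    // Server address without credentials, suitable for `--proxy-server=<server>`.
    server: `${url.protocol}//${url.host}`,
    // The URL parser keeps credentials percent-encoded, so decode them.
    username: decodeURIComponent(url.username),
    password: decodeURIComponent(url.password),
  };
}

const proxy = parseProxyUrl('http://username:password@proxy.example:3128');

// With Puppeteer, the credentials would then be supplied per page, e.g.
// `await page.authenticate({ username: proxy.username, password: proxy.password })`.
console.log(proxy.server);
```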

Then, I did a full comparison on our dataset of declarations (https://code.europa.eu/dsa/terms-and-conditions-database/vlops-and-vloses/vlop-vlose-declarations). Apart from a few transient and blinking terms (which are quite similar for both browsers), the striking differences are:

Amazon Store - Global Store terms and conditions was only working with your Puppeteer implementation (HTTP code 403 otherwise, systematically working in one case / systematically failing in the other, over a few tries).
TikTok - Commercial Terms was only working with our Playwright implementation (no match of the selector - getting a captcha page, systematically working in one case / systematically failing in the other, over a few tries).
Shein - Seller Agreement is never working, no matter the browser.
A few other platforms are blinking, but this is not statistically relevant and this is similar under both situations.

In light of this, I would propose to strengthen and expand your latest retry strategy as follows:

If scraping without client scripts and getting a failure, then retry with fullDomFetcher (this is your latest addition)
If scraping with client scripts and getting a failure, then retry once (or a configurable number of times) with fullDomFetcher (just a basic retry for transient issues)
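
The two rules above could be sketched roughly as follows. This is only an illustration: `planAttempts` and `fetchWithRetries` are hypothetical names, `fetchers` is assumed to be an object mapping fetcher names to async functions, and for simplicity the same configurable retry count is applied to both cases.

```javascript
// Hypothetical planner: list the fetchers to try, in order, for one document.
function planAttempts(executeClientScripts, maxRetries = 1) {
  const first = executeClientScripts ? 'fullDomFetcher' : 'htmlOnlyFetcher';

  // Every retry goes through fullDomFetcher, whatever the first attempt used.
  return [ first, ...Array(maxRetries).fill('fullDomFetcher') ];
}

// Walk the plan until one fetcher succeeds.
// Assumption: `fetchers` maps fetcher names to async functions taking a URL.
async function fetchWithRetries(url, executeClientScripts, fetchers, maxRetries = 1) {
  let lastError;

  for (const name of planAttempts(executeClientScripts, maxRetries)) {
    try {
      return await fetchers[name](url);
    } catch (error) {
      lastError = error; // keep the most recent failure and try the next fetcher
    }
  }

  throw lastError;
}

console.log(planAttempts(false));
console.log(planAttempts(true, 2));
```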

Beyond the fact that there might be transient issues, the reason for this is that, depending on your infrastructure/proxy solution, you might end up doing the second scraping try with a different IP, thereby increasing your success rate.

Given the results on my testing set (no clear winner, 1 vs 1 failures), I'm wondering whether it would make sense to keep both browsers and either make the choice configurable through executeClientScripts (e.g. true == Puppeteer, false == htmlOnlyFetcher, "playwright" == playwright codebase) or use Playwright as an escalation when retrying a failed term?

Best

Ndpnt · 2025-05-29T09:49:47Z

Hi @LVerneyEC,

Thanks for the testing and feedback.

Regarding the issue you mentioned with TikTok - Commercial Terms not working under Puppeteer: I dug into it and found that the problem can be fixed by changing the waitUntil option, which determines when the page is considered loaded, to either networkidle0 or networkidle2. However, reverting to that approach would lead to navigation timeouts on other terms.
To avoid that, I implemented a solution that explicitly waits for the expected elements to be present and non-empty on the page.
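
For readers unfamiliar with this pattern, here is a hedged sketch of what waiting for non-empty elements can look like with Puppeteer's `page.waitForFunction`. It is not necessarily the implementation that was merged: `selectorsHaveContent` and `waitForContent` are hypothetical names, and the rule used here (each selector must match at least one element containing text) is an assumption.

```javascript
// Hypothetical predicate: true once every selector matches at least one element with text.
// In the page, `root` defaults to the global `document`; tests can pass a stub instead.
function selectorsHaveContent(selectors, root = document) {
  return selectors.every(selector =>
    Array.from(root.querySelectorAll(selector))
      .some(element => (element.textContent || '').trim().length > 0));
}

// Poll the predicate in the page context until it holds or the timeout elapses.
// `page` is assumed to be a Puppeteer Page; extra arguments are passed to the predicate.
async function waitForContent(page, selectors, timeout = 30000) {
  await page.waitForFunction(selectorsHaveContent, { timeout }, selectors);
}

// Stub DOM for a quick local check, no browser needed.
const stubRoot = {
  querySelectorAll: selector => (selector === 'main' ? [{ textContent: ' Terms ' }] : []),
};

console.log(selectorsHaveContent(['main'], stubRoot));
```

Compared with `networkidle0` or `networkidle2`, this ties readiness to the content actually being tracked rather than to network activity, which is the trade-off described above.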
With this update, the TikTok - Commercial Terms page now loads successfully with Puppeteer both on my local machine and on our OVH experimentation server.

While adding a fallback to Playwright when Puppeteer fails could be a practical solution, it would significantly increase code complexity and maintenance overhead. For this reason, I prefer to avoid it unless it's absolutely necessary and there's no reliable workaround within the existing Puppeteer setup.

Regarding support for authenticating proxies, would you be open to proposing an implementation in a separate pull request?

Ndpnt · 2025-06-02T13:51:56Z

And I forgot, but indeed, it also seems like a good idea to expand the retry strategy to include failures that occurred with client scripts enabled, since we’ve both seen transient blocking errors in that scenario.

Ndpnt · 2025-06-19T08:45:00Z

Hi @LVerneyEC,

Just following up as we haven’t received a reply, and the issues you were facing with Puppeteer have been addressed.
Thanks again for contributing to improve the tracking success rate. All improvements are available since engine version 5.6.0.
If you're interested in adding support for authenticated proxies, we'd be happy to review and merge a PR for that.

Closing this for now, but feel free to reopen if you run into other limitations that could be addressed with Playwright.

LVerneyEC · 2025-06-23T09:49:10Z

Hi @Ndpnt,

Sorry for coming back late on this. I cannot use the current upstream engine directly at the moment because of the sandboxing mechanism in Puppeteer: it cannot be used in a Docker image running as root, which my infrastructure constraints impose. I would basically need this to be backported from this PR.

If this is OK for you, let me know whether you'd rather push it yourself or have me open a small PR for these environment variable toggles for sandbox and/or headless.

I ran a test run on our infra (manually overriding the sandbox, as mentioned above), and I got two unexpected errors on Temu documents: https://code.europa.eu/dsa/terms-and-conditions-database/vlops-and-vloses/vlop-vlose-declarations/-/issues/13#note_461065 and https://code.europa.eu/dsa/terms-and-conditions-database/vlops-and-vloses/vlop-vlose-declarations/-/issues/34#note_461064.

Error message was

Fetch failed: Fetch failed: Execution context was destroyed, most likely because of a navigation.

This is unexpected to me at the moment (it looks like a bug in the OTA logic with Puppeteer, not really a selector/anti-bot issue). These documents also fail with Playwright on the same infra, but with a proper selector error caused by hitting an anti-bot page.

Apart from this, it seems to be running on par with the Playwright-based code from this PR.

Ndpnt · 2025-06-26T08:40:05Z

> Hi @Ndpnt,
>
> Sorry for coming back late on this. I cannot use the current upstream engine directly at the moment because of the sandboxing mechanism in Puppeteer: it cannot be used in a Docker image running as root, which my infrastructure constraints impose. I would basically need this to be backported from this PR.
>
> If this is OK for you, let me know whether you'd rather push it yourself or have me open a small PR for these environment variable toggles for sandbox and/or headless.

Yes, that works for me. You can go ahead and open a small PR for that. Thanks!

> I ran a test run on our infra (manually overriding the sandbox, as mentioned above), and I got two unexpected errors on Temu documents: https://code.europa.eu/dsa/terms-and-conditions-database/vlops-and-vloses/vlop-vlose-declarations/-/issues/13#note_461065 and https://code.europa.eu/dsa/terms-and-conditions-database/vlops-and-vloses/vlop-vlose-declarations/-/issues/34#note_461064.
>
> Error message was
>
> Fetch failed: Fetch failed: Execution context was destroyed, most likely because of a navigation.
>
> This is unexpected to me at the moment (it looks like a bug in the OTA logic with Puppeteer, not really a selector/anti-bot issue). These documents also fail with Playwright on the same infra, but with a proper selector error caused by hitting an anti-bot page.

I’m currently investigating this further to understand the root cause, and I’ll get back to you once I’ve identified anything concrete.

Move fullDomFetcher to Playwright

8f92235

LVerneyEC added 2 commits April 3, 2025 14:26

Forgot to rebuild package-lock.json

b18893d

And forgot to lint

ffceb87

Ndpnt closed this Apr 10, 2025

Ndpnt reopened this Apr 10, 2025

Ndpnt mentioned this pull request May 29, 2025

Add content validation to ensure selectors contain actual text before considering them loaded #1159

Merged

Ndpnt closed this Jun 19, 2025
