Perf Testing: Incorrect numbers via a headless browser? #51376
Seems like Chromium's headed mode has an overhead which skews the TTFB result when first loading the page. I tested it in both Puppeteer and Playwright and they gave the same result. However, try reloading after the initial page.goto():

await page.goto( '/' );
await page.reload();

// TTFB = responseStart of the (buffered) navigation entry.
const ttfb = await page.evaluate(
	() =>
		new Promise( ( resolve ) => {
			new window.PerformanceObserver( ( entryList ) => {
				const [ pageNav ] =
					entryList.getEntriesByType( 'navigation' );
				resolve( pageNav.responseStart );
			} ).observe( {
				type: 'navigation',
				buffered: true,
			} );
		} )
);
I'm not sure what the question is. A difference between headless and normal modes is something I would expect, and I'd be really surprised if it weren't there. What's the problem we're attempting to solve? In our perf tests I would generally suggest that consistency is of far more value than absolute numbers, because the test environment is unlike normal user environments and is unlikely to ever be so. We want to be able to track the direction and relative magnitude of changes, but don't need the measurements themselves.
I'm trying to find out if paint-based metrics like FCP, LCP or TTI are correctly measured in a headless browser. How do we know that those metrics are meaningful in a headless environment?
Aha, as an experiment then this makes sense, and I guess the answer is "no"? It's not strong evidence, but the fact that those are such drastically different numbers suggests to me (in corroboration with some intuition) that headless tests are not going to be a reliable predictor of real in-browser rendering performance.
I think we can skip the first sample to account for the initialization overhead. For the subsequent samples, IMO, it would be best to fire a fresh context/page for each iteration. Something like this: gutenberg/test/e2e/specs/performance/site-editor.spec.js, lines 95 to 128 at 4bf1c77.
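A rough sketch of that pattern, assuming a Playwright test with a configured baseURL; this is not the exact code from the referenced spec, and the sample counts and metric are illustrative only:

const { test } = require( '@playwright/test' );

test( 'TTFB with a fresh page per sample', async ( { browser } ) => {
	const throwaway = 1; // the first iteration absorbs initialization overhead
	const samples = 3;
	const ttfbResults = [];

	for ( let i = 0; i < samples + throwaway; i++ ) {
		// A fresh page (and implicit context) per iteration keeps samples isolated.
		const page = await browser.newPage();
		await page.goto( '/' );

		const ttfb = await page.evaluate(
			() =>
				new Promise( ( resolve ) => {
					new window.PerformanceObserver( ( entryList ) => {
						const [ pageNav ] =
							entryList.getEntriesByType( 'navigation' );
						resolve( pageNav.responseStart );
					} ).observe( { type: 'navigation', buffered: true } );
				} )
		);

		if ( i >= throwaway ) {
			ttfbResults.push( ttfb );
		}
		await page.close();
	}

	console.log( 'TTFB samples:', ttfbResults );
} );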
I don't think the goal was ever to capture real-life metrics but to have a solid baseline that we can track over time for regressions. This is a lab test, after all.
I think the question is whether, using a headless browser, we will actually catch regressions on the paint-related metrics, like LCP. We could probably design a regression test to see if it gets reflected by the numbers. Or do we already have some historic data that shows the current metrics catching regressions?
@kevin940726 I was responding to the description.
This topic is obviously incredibly complicated, and the more we look into it, the more meaningless our numbers become. Nothing about the testing environment is normal. Also, if we want to narrow down more tightly into the metrics, then we have to start making decisions that deviate from normal use cases.
My point is that it's a diversion to hunt for some truth in the performance metrics. Also, I think that very little has come through hard-to-measure changes. Most of the performance work that matters is instantly observable because it makes such a dramatic improvement. Everything else, including the main performance issues, is the kind of death by a thousand cuts that is almost impossible to measure, and impossible to measure improvement on.
Therefore I don't know if the question of whether this catches regressions on the paint-related metrics is relevant. Will the existing tests? They haven't caught the most egregious ones that freeze up my writing experience. We probably don't have the right tests to catch that kind of regression, and so even today our performance numbers can improve while the editing experience gets slower.
You could try writing a new test and, in a branch, intentionally degrade the LCP. Compare the results as you did already in headless vs. normal mode. I'm not sure how much that would speak to the core question of whether this will catch regressions, but it could show if it's possible to catch some specific regressions.
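One hypothetical way to stage that experiment in a throwaway branch: artificially block rendering before navigation in the perf test and check whether the reported LCP moves by roughly the same amount in both headless and headed runs. The 500 ms figure and the placement inside an existing test are assumptions, not a prescription:

// Hypothetical, deliberate regression for a validation branch only: block the
// main thread before any page script runs, which should delay first paint and
// therefore push FCP/LCP out by roughly the busy-wait duration.
await page.addInitScript( () => {
	const start = Date.now();
	while ( Date.now() - start < 500 ) {
		// Busy-wait to simulate expensive synchronous work during load.
	}
} );
await page.goto( '/' );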
@WunderBart I can share my experiments on this topic. In #48728 I disabled per-block stylesheets, meaning that we'd enqueue the stylesheets of all core blocks instead of only the stylesheets of the blocks in use on a given page. This means more work for the browser, so I expected the LCP-TTFB metric (the time spent painting) to be increased by that PR. I then measured using three different mechanisms: the CI job, curl, and a script created by the performance team. All of them confirmed the intuition, meaning they are catching regressions.
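For reference, a minimal sketch of how that LCP-TTFB (time spent painting) number can be read in the browser, in the same PerformanceObserver style as the TTFB snippet above; this may differ from what the CI job or the performance team's script actually does:

const lcpMinusTtfb = await page.evaluate(
	() =>
		new Promise( ( resolve ) => {
			new window.PerformanceObserver( ( entryList ) => {
				const entries = entryList.getEntries();
				// The last buffered entry is the latest LCP candidate.
				const lcp = entries[ entries.length - 1 ];
				const [ pageNav ] =
					window.performance.getEntriesByType( 'navigation' );
				// Time from first byte received to the largest contentful paint.
				resolve( lcp.startTime - pageNav.responseStart );
			} ).observe( {
				type: 'largest-contentful-paint',
				buffered: true,
			} );
		} )
);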
Regarding numbers being different: in my experience, they will be. While improving front-end performance and comparing numbers with several folks from the core performance team, we ran into this same thing.
As for the migration to Playwright: it's unfortunate that the numbers will be affected, though it's not a blocker from the point of view of the front-end performance tests. The major bumps I've seen in codevitals are related to the environment rather than a PR itself (changing the WordPress base, etc.). Riad and I talked at some point about being able to "annotate" the graph, so it's clear what the bumps are. For reassurance, we could use the experiment I shared above (#48728) to make sure the paint metrics reported by Playwright are working as expected.
The codevitals milliseconds are indeed unrealistic, but I don't think we should look at them as an actual performance indicator. Since they show the relative changes from commit to commit, their main objective should be catching regressions. That can only happen if the measurements are reliable, though. I think they're currently a bit too unstable due to many factors, but migrating to Playwright should be a good first step here. Regarding real-world numbers, I don't think we'll ever be able to get them via a CI job running in a headless browser. The only way to get such numbers that I can currently think of would be through some client-side hooks that measure what we care about in real visitors' browsers and report those measurements back (a rough sketch of that idea follows after this comment). I have no idea if that's even remotely feasible, though. Any ideas?
Interesting! I compared the standard deviation for perf metrics from the last 10 (published) jobs, and the minimum value seems to be the most stable for most metrics. Not FCP, though, where the minimum doubled the deviation compared to the other aggregates.
The annotations sound good. Adding the WP version used and commit info would be much better than just the hash.
That's great to hear! I think the best we can do now is to ensure our metrics are reliable. We can also design more experiments like #48728 to ensure we can catch specific regressions.
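On the client-side-hooks idea mentioned above: this is roughly what real-user measurement could look like with the web-vitals library (v3+ API). The /perf-metrics endpoint is hypothetical, and collecting and aggregating the beacons server-side is the actual hard part:

// Hypothetical real-user measurement: report paint metrics from actual
// visitors' browsers instead of a lab/CI environment.
import { onTTFB, onFCP, onLCP } from 'web-vitals';

function report( metric ) {
	// Endpoint is made up; sendBeacon survives page unloads.
	navigator.sendBeacon(
		'/perf-metrics',
		JSON.stringify( { name: metric.name, value: metric.value } )
	);
}

onTTFB( report );
onFCP( report );
onLCP( report );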
This seems too confident to me :P Any data that proves that Playwright numbers are more stable than Puppeteer ones?
We don't really care about having real-world numbers. We want numbers that we improve over time. "Real-world numbers" is meaningless because it depends on the environment, and there are so many real-world environments.
It would be cool to update codevitals to support some kind of "comment" per log; we could include the WP version in that comment.
💯 I've used the graphs so many times to catch regressions. The most recent example is the one fixed by this commit: #51675. Without the graphs, we wouldn't have caught this in time and we wouldn't even know where to start looking to fix it. So yeah, while they may seem unstable, I think they are stable enough to be very useful to us. Of course, the more stable they become, the better.
No, and my point was just that: we don't need to be shooting for realistic numbers because that's just not possible. It's liberating. As @youknowriad mentioned, all that really matters is long-term consistency.
I've found good reliability since @sgomes fixed the open-a-new-tab issues in the tests. When I run the tests in separate work, it is worth storing the raw data so we can see the distributions of the measurements; those are much more illuminating than any mean or minimum. It doesn't surprise me that the variance of the minimum is smaller, since that's a capped lower bound. I hope the Playwright changes help, but they aren't going to drastically change the realities of the perf tests. Currently the perf tests rarely fail anyway, so that's something they have that the e2e tests don't.
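One hypothetical way to keep the raw samples around rather than only the aggregate, so distributions can be inspected after a run; the file name and data shape are made up for the sketch:

const fs = require( 'fs' );

// Append every raw sample to a JSON file next to the aggregated result, so
// percentiles and histograms can be computed offline later.
function saveRawSamples( metric, samples ) {
	const file = `perf-raw-${ metric }.json`;
	const existing = fs.existsSync( file )
		? JSON.parse( fs.readFileSync( file, 'utf8' ) )
		: [];
	fs.writeFileSync(
		file,
		JSON.stringify( [ ...existing, ...samples ], null, 2 )
	);
}

saveRawSamples( 'ttfb', [ 118.4, 121.9, 130.2 ] );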
Alright, I was under the wrong impression then: I thought the metrics were a) not stable enough and b) not realistic enough. Now I'm learning that the current numbers are stable enough to catch regressions and that we're not really looking for realistic numbers. That's good news, of course – only less room for improvement than I thought.
By "good first step", I meant using a framework that makes it easier to write stable tests and provides better debugging tools. I don't know if the numbers will improve, but at least (from my own experience) Playwright will make it much easier to analyze the current tests and work on improving them. Looking at the numbers from the migrated tests, I can only say that they don't seem to be less stable. Having said that, I can do
Some numbers comparison I just did: #51084 (comment)
While migrating performance specs to Playwright, I noticed that I'm getting vastly different numbers for the theme specs while running headfully vs. headlessly. From a local run:
I googled a bit but couldn't find a good read on testing performance headfully vs. headlessly, so I asked GPT. Here's the relevant bit:
The part about TTFB doesn't seem to check out, though, because I'm getting much larger numbers for that metric as well. Anyhow, the part about simulating a more realistic user experience makes sense to me, so I would like to discuss whether we should start measuring via a headed browser instance to get more realistic metrics.
I'm curious what your thoughts are on this, @youknowriad, @dmsnell, @oandregal, @kevin940726 🙌
(I'm unsure yet if running headfully in CI is even possible, but I'm currently trying to verify it.)
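For what it's worth, a minimal sketch of what running headed in CI could look like, assuming a Linux runner with Xvfb available. The config keys are standard Playwright options; the exact npm script name for this repo is an assumption:

// playwright.config.js (hypothetical excerpt)
const { defineConfig } = require( '@playwright/test' );

module.exports = defineConfig( {
	use: {
		// Run a real, headed Chromium instance instead of headless mode.
		headless: false,
	},
} );

// On a Linux CI runner without a display, the suite would need a virtual
// framebuffer, e.g.:
//   xvfb-run --auto-servernum npm run test:performance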