fix: add block media feature for playwright - blocks images, videos, css #54

Open
wants to merge 4 commits into
base: master

@MQ37 MQ37 commented Mar 11, 2025

closes #47 (comment)

@MQ37 MQ37 requested a review from matyascimbulka March 11, 2025 19:10
Collaborator

@matyascimbulka matyascimbulka left a comment

Looks great. Nice feature that will reduce proxy usage for users. Thank you.

But you've forgotten to add the blockMedia input to the defaults object in const.ts.

export const defaults = {
    debugMode: inputSchema.properties.debugMode.default,
    dynamicContentWaitSecs: inputSchema.properties.dynamicContentWaitSecs.default,
    htmlTransformer: inputSchema.properties.htmlTransformer.default,
    initialConcurrency: inputSchema.properties.initialConcurrency.default,
    keepAlive: true, // Not in input_schema.json
    maxConcurrency: inputSchema.properties.maxConcurrency.default,
    maxRequestRetries: inputSchema.properties.maxRequestRetries.default,
    maxRequestRetriesMax: inputSchema.properties.maxRequestRetries.maximum,
    maxResults: inputSchema.properties.maxResults.default,
    maxResultsMax: inputSchema.properties.maxResults.maximum,
    minConcurrency: inputSchema.properties.minConcurrency.default,
    outputFormats: inputSchema.properties.outputFormats.default,
    proxyConfiguration: inputSchema.properties.proxyConfiguration.default,
    query: undefined, // No default value in input_schema.json
    readableTextCharThreshold: 100, // Not in input_schema.json
    removeCookieWarnings: inputSchema.properties.removeCookieWarnings.default,
    removeElementsCssSelector: inputSchema.properties.removeElementsCssSelector.default,
    requestTimeoutSecs: inputSchema.properties.requestTimeoutSecs.default,
    requestTimeoutSecsMax: inputSchema.properties.requestTimeoutSecs.maximum,
    serpMaxRetries: inputSchema.properties.serpMaxRetries.default,
    serpMaxRetriesMax: inputSchema.properties.serpMaxRetries.maximum,
    serpProxyGroup: inputSchema.properties.serpProxyGroup.default,
    scrapingTool: inputSchema.properties.scrapingTool.default,
};

As a result, input processing drops blockMedia from the input, so it is never applied.
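
To illustrate the point about the defaults object, here is a minimal sketch of how a defaults-driven input step can silently drop an option. The `processInput` helper, the two-property `inputSchema` stand-in, and the `blockMedia` default value are assumptions made for illustration; the repository's actual input processing may differ.

```typescript
// Hypothetical stand-in for the parsed input_schema.json (only two properties shown).
// The blockMedia default here is a placeholder, not the value used in the PR.
const inputSchema = {
    properties: {
        debugMode: { default: false },
        blockMedia: { default: true },
    },
};

// Defaults object in the style of const.ts, with blockMedia included.
const defaults = {
    debugMode: inputSchema.properties.debugMode.default,
    blockMedia: inputSchema.properties.blockMedia.default,
};

type Input = typeof defaults;

// Hypothetical input processing: only keys listed in `defaults` survive,
// so an option missing from `defaults` would never reach the crawler.
function processInput(raw: Record<string, unknown>): Input {
    const result: Record<string, unknown> = { ...defaults };
    for (const key of Object.keys(defaults)) {
        if (raw[key] !== undefined) result[key] = raw[key];
    }
    return result as Input;
}
```

With `blockMedia` present in `defaults`, a user-supplied `blockMedia: false` is preserved; keys that are not listed are discarded.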

@jirispilka
Collaborator

@MQ37, thank you!

I believe we want this enabled by default, right? It should be a non-breaking change that improves speed and reduces traffic with proxies.

Could you also provide 2-3 runs with this feature enabled and disabled so we can compare perf?
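
For context, one common way to block media in Playwright is request interception via `page.route`. The sketch below is not the code from this PR; the blocked resource types are taken from the PR title (images, videos, CSS), and the names `BLOCKED_RESOURCE_TYPES`, `shouldBlock`, and `blockMediaHandler` are hypothetical. The small interfaces mirror only the parts of Playwright's `Route` and `Request` that the handler uses, so the snippet stays self-contained.

```typescript
// Resource types treated as blockable in this sketch (assumption based on the PR title).
const BLOCKED_RESOURCE_TYPES = new Set(['image', 'media', 'stylesheet']);

// Minimal structural types covering the Playwright Route/Request methods used below.
interface RequestLike {
    resourceType(): string;
}
interface RouteLike {
    request(): RequestLike;
    abort(): Promise<void>;
    continue(): Promise<void>;
}

// Decide whether a request should be blocked based on its resource type.
function shouldBlock(resourceType: string): boolean {
    return BLOCKED_RESOURCE_TYPES.has(resourceType);
}

// Route handler: abort blocked resource types, let everything else through.
async function blockMediaHandler(route: RouteLike): Promise<void> {
    if (shouldBlock(route.request().resourceType())) {
        await route.abort();
    } else {
        await route.continue();
    }
}

// With a real Playwright page this would be registered as:
//   await page.route('**/*', blockMediaHandler);
```

Aborted requests are never downloaded, which is where the reduction in proxy traffic would come from.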

@MQ37
Author

MQ37 commented Mar 12, 2025

Sure, we can enable this by default 👍

Here is the list of runs.

Blocking disabled:

Blocking enabled:

Additional runs with performance analysis.
Blocking disabled:

Average time for each time measure event: Map(10) {
  'request-received' => [ 0, 0, 0, 0, 0 ],
  'before-cheerio-queue-add' => [ 0, 0, 0, 0, 0 ],
  'cheerio-request-handler-start' => [ 8372, 8372, 8372, 8372, 8372 ],
  'before-playwright-queue-add' => [ 17, 17, 17, 17, 17 ],
  'playwright-request-start' => [ 19394, 15182, 46501, 27085, 35871 ],
  'playwright-wait-dynamic-content' => [ 7178, 10009, 1018, 1086, 1001 ],
  'playwright-remove-cookie' => [ 5113, 3584, 7259, 6300, 6700 ],
  'playwright-parse-with-cheerio' => [ 16993, 24006, 2394, 25708, 18014 ],
  'playwright-process-html' => [ 12, 296, 408, 193, 99 ],
  'playwright-before-response-send' => [ 488, 303, 1194, 310, 200 ]
}
request-received: 0 s
before-cheerio-queue-add: 0 s
cheerio-request-handler-start: 8372 s
before-playwright-queue-add: 17 s
playwright-request-start: 28807 s
playwright-wait-dynamic-content: 4058 s
playwright-remove-cookie: 5791 s
playwright-parse-with-cheerio: 17423 s
playwright-process-html: 202 s
playwright-before-response-send: 499 s
Time taken for each request: [ 57567, 61769, 67163, 69071, 70274 ]
Time taken on average 65168.8

Blocking enabled:

Average time for each time measure event: Map(10) {
  'request-received' => [ 0, 0, 0, 0, 0 ],
  'before-cheerio-queue-add' => [ 0, 0, 0, 0, 0 ],
  'cheerio-request-handler-start' => [ 2571, 2571, 2571, 2571, 2571 ],
  'before-playwright-queue-add' => [ 13, 13, 13, 13, 13 ],
  'playwright-request-start' => [ 6831, 29224, 38932, 37044, 67323 ],
  'playwright-wait-dynamic-content' => [ 5906, 1000, 2694, 10079, 10001 ],
  'playwright-remove-cookie' => [ 1486, 3809, 301, 312, 246 ],
  'playwright-parse-with-cheerio' => [ 2206, 1689, 1212, 727, 784 ],
  'playwright-process-html' => [ 97, 414, 96, 66, 70 ],
  'playwright-before-response-send' => [ 210, 892, 294, 15, 117 ]
}
request-received: 0 s
before-cheerio-queue-add: 0 s
cheerio-request-handler-start: 2571 s
before-playwright-queue-add: 13 s
playwright-request-start: 35871 s
playwright-wait-dynamic-content: 5936 s
playwright-remove-cookie: 1231 s
playwright-parse-with-cheerio: 1324 s
playwright-process-html: 149 s
playwright-before-response-send: 306 s
Time taken for each request: [ 19320, 39612, 46113, 50827, 81125 ]
Time taken on average 47399.4
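
The two "Time taken" lists can be compared directly. The sketch below recomputes the reported averages and the relative reduction. The values are copied from the output above; treating them as milliseconds is an assumption (the log labels say "s", but 47399.4 is read as roughly 47 s later in the thread). The helper name `mean` is hypothetical.

```typescript
// Arithmetic mean of a non-empty list of numbers.
function mean(values: number[]): number {
    return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Per-request totals copied from the output above (assumed to be milliseconds).
const disabledTimes = [57567, 61769, 67163, 69071, 70274];
const enabledTimes = [19320, 39612, 46113, 50827, 81125];

const meanDisabled = mean(disabledTimes); // matches the reported 65168.8
const meanEnabled = mean(enabledTimes); // matches the reported 47399.4

// Relative reduction in average time with blocking enabled (roughly 27% for this sample).
const reduction = (meanDisabled - meanEnabled) / meanDisabled;

console.log(`disabled: ${meanDisabled.toFixed(1)}, enabled: ${meanEnabled.toFixed(1)}`);
console.log(`reduction: ${(reduction * 100).toFixed(1)}%`);
```

With only five requests per configuration and a wide spread in the enabled run (19320 to 81125), this figure is indicative at best.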

@MQ37
Author

MQ37 commented Mar 12, 2025

Thank you for noticing 👍 I tested it with a local browser and it actually works without blockMedia being added to the defaults object, but I will add it 👍

@MQ37 MQ37 requested a review from matyascimbulka March 12, 2025 21:15
Collaborator

@matyascimbulka matyascimbulka left a comment

Thank you for the change. Now the Actor behaves consistently in STANDALONE and STANDBY modes.

@jirispilka jirispilka self-requested a review March 13, 2025 09:11
Collaborator

@jirispilka jirispilka left a comment

@MQ37, thank you! This looks very promising.

I’m afraid I don’t fully understand the test and the numbers you provided.
Did you call the Actor with several requests and then record the value?

It seems that with each additional request, the response time gets progressively slower.
An average response time of 47s doesn’t sound quite right.

Also, we need to document this option and its benefits in the README 🙏🏻

@MQ37
Author

MQ37 commented Mar 13, 2025

Added this feature to the README, thank you for noticing 👍

Those performance numbers are from src/performance-measures.ts. Based on my manual testing, it is about 15-20% faster with media blocking enabled. Otherwise the run time is unpredictable: sometimes a run is fast and sometimes really slow with the same input.

@MQ37 MQ37 requested a review from jirispilka March 13, 2025 11:53
@MQ37
Author

MQ37 commented Mar 19, 2025

@jirispilka should this PR be closed?

Successfully merging this pull request may close these issues.

Check if we can exclude CSS and images when downloading HTML.
3 participants