fix: add block media feature for playwright - blocks images, videos, css #54

Open: wants to merge 4 commits into base: master
6 changes: 6 additions & 0 deletions .actor/input_schema.json
@@ -145,6 +145,12 @@
            "description": "If enabled, the Actor attempts to close or remove cookie consent dialogs to improve the quality of extracted text. Note that this setting increases the latency.",
            "default": true
        },
        "blockMedia": {
            "title": "Block media resources",
            "type": "boolean",
            "description": "If enabled, the Actor will block loading of images, videos and CSS resources when using the Playwright browser. This can improve performance and reduce bandwidth usage.",
            "default": true
        },
        "debugMode": {
            "title": "Enable debug mode",
            "type": "boolean",
2 changes: 2 additions & 0 deletions README.md
@@ -17,6 +17,7 @@ The extracted text can then be injected into prompts and retrieval augmented gen
- 📝 Output formats include **Markdown**, plain text, and HTML
- 🔌 Supports **OpenAPI and MCP** for easy integration
- 🪟 It's **open source**, so you can review and modify it
- 🖼️ **Media blocking** to skip images, videos, and CSS for faster scraping and lower bandwidth usage

## Example

@@ -119,6 +120,7 @@ The `/search` GET HTTP endpoint accepts the following query parameters:
| `maxRequestRetries` | number | `1` | The maximum number of times the Actor will retry loading the target web page on error. If the last attempt fails, the page will be skipped in the results. |
| `dynamicContentWaitSecs` | number | `10` | The maximum time in seconds to wait for dynamic page content to load. The Actor considers the web page as fully loaded once this time elapses or when the network becomes idle. |
| `removeCookieWarnings` | boolean | `true` | If enabled, removes cookie consent dialogs to improve text extraction accuracy. This might increase latency. |
| `blockMedia` | boolean | `true` | If enabled, blocks loading of images, videos, and CSS when using `browser-playwright`, which improves speed and reduces bandwidth usage. |
| `removeElementsCssSelector` | string | `see input` | A CSS selector matching HTML elements that will be removed from the DOM, before converting it to text, Markdown, or saving as HTML. This is useful to skip irrelevant page content. The value must be a valid CSS selector as accepted by the `document.querySelectorAll()` function. \n\nBy default, the Actor removes common navigation elements, headers, footers, modals, scripts, and inline images. You can disable the removal by setting this value to some non-existent CSS selector like `dummy_keep_everything`. |
| `debugMode` | boolean | `false` | If enabled, the Actor will store debugging information in the dataset's debug field. |
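
To illustrate how the new `blockMedia` query parameter might be passed to the `/search` endpoint, here is a minimal sketch. The helper `buildSearchUrl`, the `SearchParams` type, and the base URL are assumptions made for illustration and are not part of this PR. Whether `scrapingTool` is accepted as a query parameter is also assumed from the `Input` type rather than from the rows shown above.

```typescript
// Hypothetical helper (not part of this PR): builds a /search URL
// that explicitly sets the blockMedia query parameter.
// ASSUMED_BASE_URL is a placeholder; substitute your own Standby endpoint.
const ASSUMED_BASE_URL = 'https://rag-web-browser.apify.actor';

interface SearchParams {
    query: string;
    scrapingTool?: 'browser-playwright' | 'raw-http';
    blockMedia?: boolean;
}

function buildSearchUrl(params: SearchParams, baseUrl: string = ASSUMED_BASE_URL): string {
    const url = new URL('/search', baseUrl);
    url.searchParams.set('query', params.query);
    // Optional parameters are only sent when the caller provides them,
    // so the Actor's own defaults apply otherwise.
    if (params.scrapingTool !== undefined) {
        url.searchParams.set('scrapingTool', params.scrapingTool);
    }
    if (params.blockMedia !== undefined) {
        url.searchParams.set('blockMedia', String(params.blockMedia));
    }
    return url.toString();
}

// Example: keep images and CSS loading for a page whose rendering depends on them.
const exampleUrl = buildSearchUrl({
    query: 'web browser for RAG',
    scrapingTool: 'browser-playwright',
    blockMedia: false,
});
console.log(exampleUrl);
```

Since the default is `true`, a caller only needs to send `blockMedia=false` to opt out of blocking.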

1 change: 1 addition & 0 deletions src/const.ts
@@ -37,6 +37,7 @@ export const defaults = {
    query: undefined, // No default value in input_schema.json
    readableTextCharThreshold: 100, // Not in input_schema.json
    removeCookieWarnings: inputSchema.properties.removeCookieWarnings.default,
    blockMedia: inputSchema.properties.blockMedia.default,
    removeElementsCssSelector: inputSchema.properties.removeElementsCssSelector.default,
    requestTimeoutSecs: inputSchema.properties.requestTimeoutSecs.default,
    requestTimeoutSecsMax: inputSchema.properties.requestTimeoutSecs.maximum,
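
The diff does not show how a request-supplied value is combined with this schema default. As a rough sketch of one way it could work, the hypothetical function below resolves a raw query-string value to a boolean and falls back to the default when the value is missing or unrecognized. `resolveBlockMedia` and `assumedDefaults` are illustrative names, not code from this repository.

```typescript
// Hypothetical sketch (not from this PR): resolve blockMedia from a raw value.
// assumedDefaults mirrors "default": true from input_schema.json.
const assumedDefaults = { blockMedia: true };

function resolveBlockMedia(
    raw: string | boolean | undefined,
    fallback: boolean = assumedDefaults.blockMedia,
): boolean {
    // Actor input JSON already carries a real boolean.
    if (typeof raw === 'boolean') return raw;
    // Parameter not supplied: use the schema default.
    if (raw === undefined) return fallback;
    // Query-string values arrive as text, so normalize before comparing.
    const normalized = raw.trim().toLowerCase();
    if (normalized === 'true' || normalized === '1') return true;
    if (normalized === 'false' || normalized === '0') return false;
    // Unrecognized text: fall back rather than guess.
    return fallback;
}

console.log(resolveBlockMedia(undefined));
console.log(resolveBlockMedia('false'));
```

Falling back on unrecognized input keeps the documented default in effect instead of silently treating a typo as `false`.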
21 changes: 21 additions & 0 deletions src/input.ts
@@ -111,6 +111,27 @@ function createPlaywrightCrawlerOptions(input: Input, proxy: ProxyConfiguration
                maxConcurrency,
                minConcurrency,
            },
            preNavigationHooks: input.blockMedia ? [
                async ({ page }) => {
                    await page.route('**/*', async (route) => {
                        const resourceType = route.request().resourceType();
                        const url = route.request().url();

                        // Block if it's an image/video/css resource type or has an image/video extension
                        if (
                            resourceType === 'image'
                            || resourceType === 'video'
                            || resourceType === 'media'
                            || resourceType === 'stylesheet'
                            || /\.(jpg|jpeg|png|gif|bmp|webp|mp4|webm|ogg|mov|css)$/i.test(url)
                        ) {
                            await route.abort();
                        } else {
                            await route.continue();
                        }
                    });
                },
            ] : [],
        },
    };
}
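
One possible refinement, sketched here under stated assumptions: the extension regex in the hook is anchored with `$` and tested against the full URL, so a request such as `https://example.com/logo.png?v=2` would not match by extension. The hypothetical predicate below tests only the URL path instead. `shouldBlockRequest` and the two constants are illustrative names, not part of this PR. The `video` entry is kept only to mirror the diff; as far as I know, Playwright reports audio and video requests as `media`.

```typescript
// Hypothetical refinement (not part of this PR): a pure predicate that
// decides whether a request should be aborted when media blocking is on.
const BLOCKED_RESOURCE_TYPES = new Set(['image', 'video', 'media', 'stylesheet']);
const BLOCKED_EXTENSIONS = /\.(jpg|jpeg|png|gif|bmp|webp|mp4|webm|ogg|mov|css)$/i;

function shouldBlockRequest(resourceType: string, url: string): boolean {
    if (BLOCKED_RESOURCE_TYPES.has(resourceType)) return true;

    // Match the extension against the path only, so query strings and
    // fragments (e.g. "?v=2") do not defeat the "$" anchor.
    let pathname: string;
    try {
        pathname = new URL(url).pathname;
    } catch {
        // Not an absolute URL: fall back to the raw string.
        pathname = url;
    }
    return BLOCKED_EXTENSIONS.test(pathname);
}

console.log(shouldBlockRequest('script', 'https://example.com/logo.png?v=2'));
console.log(shouldBlockRequest('document', 'https://example.com/page?file=a.css'));
```

Inside the route handler, the result would select between `route.abort()` and `route.continue()`. Keeping the decision in a pure function also makes it straightforward to unit test without launching a browser.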
1 change: 1 addition & 0 deletions src/types.ts
@@ -32,6 +32,7 @@ export type Input = {
    removeElementsCssSelector: string;
    removeCookieWarnings: boolean;
    scrapingTool: 'browser-playwright' | 'raw-http';
    blockMedia: boolean;
};

export type StandbyInput = Input & {