Commit e6c0c28

feat: add Camoufox-based Crawlee JS/TS templates (#336)
Following apify/apify-sdk-js#364 and apify/crawlee#2842, this PR adds Camoufox-enabled templates to the Apify Actor templates. The implementation is heavily based on the existing Playwright + Chrome templates.

The only issue I'm currently aware of is the very large size of the resulting images, as they contain Chrome and we add the Camoufox binaries on top. Installing Camoufox directly into a `node-debian` image results in missing system dependencies. While it might be possible to install those manually in the Dockerfile, doing so might make the Dockerfile too complex for a regular user.

![image](https://github.com/user-attachments/assets/fb0050fd-fadc-4bbc-80f3-0681dcfa2b92)
1 parent 54466c9 commit e6c0c28

24 files changed: +555 -0 lines changed

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
# Specify the base Docker image. You can read more about
# the available images at https://crawlee.dev/docs/guides/docker-images
# You can also use any other image from Docker Hub.
FROM apify/actor-node-playwright-chrome:20

# Check preinstalled packages
RUN npm ls crawlee apify playwright

# Copy just package.json and package-lock.json
# to speed up the build using Docker layer cache.
COPY --chown=myuser package*.json ./

# Install NPM packages, skip optional and development dependencies to
# keep the image small. Avoid logging too much and print the dependency
# tree for debugging
RUN npm --quiet set progress=false \
    && npm install --omit=dev \
    && echo "Installed NPM packages:" \
    && (npm list --omit=dev --all || true) \
    && echo "Node.js version:" \
    && node --version \
    && echo "NPM version:" \
    && npm --version \
    && rm -r ~/.npm

# Next, copy the remaining files and directories with the source code.
# Since we do this after NPM install, quick build will be really fast
# for most source file changes.
COPY --chown=myuser . ./

# Run the image. If you know you won't need headful browsers,
# you can remove the XVFB start script for a micro perf gain.
CMD ./start_xvfb_and_run_cmd.sh && npm start --silent
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
{
    "actorSpecification": 1,
    "name": "project-playwright-camoufox-crawler-javascript",
    "title": "Camoufox Playwright Crawler JavaScript",
    "description": "Crawlee and Playwright project with Camoufox in JavaScript.",
    "version": "0.0",
    "meta": {
        "templateId": "js-crawlee-playwright-camoufox"
    },
    "input": "./input_schema.json",
    "dockerfile": "./Dockerfile"
}
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
{
    "title": "PlaywrightCrawler Template",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start with.",
            "editor": "requestListSources",
            "prefill": [
                {
                    "url": "https://apify.com"
                }
            ]
        }
    }
}
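
The schema above prefills `startUrls` with objects of the form `{ "url": "..." }`. As a minimal sketch of how such input could be checked before crawling, the following uses a hypothetical helper, `normalizeStartUrls`, which is not part of this commit or of the Apify SDK. It assumes each entry is either a plain string or an object with a `url` property, and relies only on the standard `URL` class.

```javascript
// Hypothetical helper (not part of the template): turns the `startUrls`
// input into a list of absolute URL strings, rejecting malformed entries.
function normalizeStartUrls(startUrls) {
    if (!Array.isArray(startUrls)) {
        throw new TypeError('startUrls must be an array');
    }
    return startUrls.map((source, index) => {
        // Accept both plain strings and { url } objects, as in the schema prefill.
        const raw = typeof source === 'string' ? source : source?.url;
        try {
            // The URL constructor throws for anything that is not an absolute URL.
            return new URL(raw).href;
        } catch {
            throw new TypeError(`startUrls[${index}] is not a valid absolute URL`);
        }
    });
}

const urls = normalizeStartUrls([{ url: 'https://apify.com' }, 'https://crawlee.dev']);
console.log(urls);
```

Failing early on a malformed entry is one possible design choice; a real Actor might prefer to log and skip bad entries instead.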
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
# configurations
.idea

# crawlee and apify storage folders
apify_storage
crawlee_storage
storage

# installed files
node_modules

# git folder
.git
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
root = true

[*]
indent_style = space
indent_size = 4
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true
end_of_line = lf
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
{
    "extends": "@apify",
    "root": true
}
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
# This file tells Git which files shouldn't be added to source control

.DS_Store
.idea
dist
node_modules
apify_storage
storage
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
## PlaywrightCrawler + Camoufox template

This template is a production-ready boilerplate for developing an [Actor](https://apify.com/actors) with `PlaywrightCrawler`. It has [Camoufox](https://github.com/daijro/camoufox) - a stealthy fork of Firefox - preinstalled. Note that Camoufox might consume more resources than the default Playwright-bundled Chromium or Firefox.

Use this template to bootstrap your projects using the most up-to-date code.

> We decided to split Apify SDK into two libraries, Crawlee and Apify SDK v3. Crawlee will retain all the crawling and scraping-related tools and will always strive to be the best [web scraping](https://apify.com/web-scraping) library for its community. At the same time, Apify SDK will continue to exist, but keep only the Apify-specific features related to building Actors on the Apify platform. Read the upgrading guide to learn about the changes.
>

## Resources

If you're looking for examples or want to learn more, visit:

- [Crawlee + Apify Platform guide](https://crawlee.dev/docs/guides/apify-platform)
- [Documentation](https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler) and [examples](https://crawlee.dev/docs/examples/playwright-crawler)
- [Node.js tutorials](https://docs.apify.com/academy/node-js) in Academy
- [Scraping single-page applications with Playwright](https://blog.apify.com/scraping-single-page-applications-with-playwright/)
- [How to scale Puppeteer and Playwright](https://blog.apify.com/how-to-scale-puppeteer-and-playwright/)
- [Integration with Zapier](https://apify.com/integrations), Make, GitHub, Google Drive and other apps
- [Video guide on getting data using Apify API](https://www.youtube.com/watch?v=ViYYDHSBAKM)
- A short guide on how to create Actors using code templates:

[web scraper template](https://www.youtube.com/watch?v=u-i-Korzf8w)
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
{
    "name": "crawlee-playwright-javascript-camoufox",
    "version": "0.0.1",
    "type": "module",
    "description": "This is an example of an Apify actor.",
    "engines": {
        "node": ">=20.0.0"
    },
    "dependencies": {
        "apify": "^3.2.6",
        "camoufox-js": "^0.1.3",
        "crawlee": "^3.11.5",
        "playwright": "*"
    },
    "devDependencies": {
        "@apify/eslint-config": "^0.4.0",
        "eslint": "^8.50.0"
    },
    "scripts": {
        "start": "node src/main.js",
        "lint": "eslint ./src --ext .js,.jsx",
        "lint:fix": "eslint ./src --ext .js,.jsx --fix",
        "test": "echo \"Error: oops, the actor has no tests yet, sad!\" && exit 1",
        "postinstall": "npx camoufox-js fetch"
    },
    "author": "It's not you it's me",
    "license": "ISC"
}
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
/**
 * This template is a production-ready boilerplate for developing with `PlaywrightCrawler`.
 * Use this to bootstrap your projects using the most up-to-date code.
 * If you're looking for examples or want to learn more, see README.
 */

// For more information, see https://docs.apify.com/sdk/js
import { Actor } from 'apify';
// For more information, see https://crawlee.dev
import { PlaywrightCrawler } from 'crawlee';
// This is an ESM project, and as such, it requires you to specify extensions in your relative imports.
// Read more about this here: https://nodejs.org/docs/latest-v18.x/api/esm.html#mandatory-file-extensions
import { router } from './routes.js';
import { firefox } from 'playwright';
import { launchOptions as camoufoxLaunchOptions } from 'camoufox-js';

// Initialize the Apify SDK
await Actor.init();

const {
    startUrls = ['https://crawlee.dev'],
} = await Actor.getInput() ?? {};

const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    requestHandler: router,
    launchContext: {
        launcher: firefox,
        launchOptions: await camoufoxLaunchOptions({
            headless: true,
            // fonts: ['Times New Roman'], // <- custom Camoufox options
        }),
    }
});

await crawler.run(startUrls);

// Exit successfully
await Actor.exit();
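
The input handling in this file combines `?? {}` with a destructuring default. As a minimal sketch of how that fallback behaves, the hypothetical function `readStartUrls` below (not part of this commit) applies the same expression to a plain value instead of the result of `Actor.getInput()`. A missing input is represented here as `null`, which is an assumption made for illustration.

```javascript
// Hypothetical stand-in for the destructuring in the template's main file.
// `input` plays the role of the value resolved by `Actor.getInput()`.
function readStartUrls(input) {
    // `?? {}` guards against a null or undefined input object;
    // the destructuring default applies only when `startUrls` is undefined.
    const { startUrls = ['https://crawlee.dev'] } = input ?? {};
    return startUrls;
}

console.log(readStartUrls(null)); // no input at all: default is used
console.log(readStartUrls({})); // property missing: default is used
console.log(readStartUrls({ startUrls: [] })); // empty array is kept as-is
console.log(readStartUrls({ startUrls: null })); // explicit null is kept as-is
```

The last two calls show that the default does not cover an empty array or an explicit `null`, so code that needs at least one URL would have to check for those cases separately.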
