Apify will sponsor your project: Crawl4AI Actor on Apify infrastructure #865

Open: 15 commits to merge into base branch main

@janbuchar commented Mar 21, 2025

Dear Crawl4AI maintainers,

I have wrapped Crawl4AI as an Apify Actor by adding the Actor definition in the .actor directory and published the Crawl4AI Actor on Apify Store. I've also added the Actor status badge and a brief usage description to the README, including the "Run on Apify" button.

For the full description of the Actor, please see the README file in the .actor directory.

Crawl4AI can now be used in the cloud without installation, free of charge. This makes it accessible to users without having to manage dependencies or configure their own environment. The Actor can be run from Apify Console, through the API, or locally with the CLI:

$ echo '{"startUrls": [{ "url": "https://docs.crawl4ai.com/" }], "maxCrawlDepth": 1}' | apify call -so janbuchar/crawl4ai
[{
  "url": "https://docs.crawl4ai.com/",
  "markdown": "https://api.apify.com/v2/key-value-stores/m1Sqnke1KWM0AI8co/records/content_4242424242.md",
  "html": "https://api.apify.com/v2/key-value-stores/m1Sqnke1KWM0AI8co/records/content_4242424242.html",
  "metadata": {
    "title": "Home - Crawl4AI Documentation (v0.5.x)",
    "description": "🚀🤖 Crawl4AI, Open-source LLM-Friendly Web Crawler & Scraper"
  }
},
{
  "url": "https://docs.crawl4ai.com/advanced/ssl-certificate/",
  "markdown": "https://api.apify.com/v2/key-value-stores/m1Sqnke1KWM0AI8co/records/content_4242424242.md",
  "html": "https://api.apify.com/v2/key-value-stores/m1Sqnke1KWM0AI8co/records/content_4242424242.html",
  "metadata": {
    "title": "SSL Certificate - Crawl4AI Documentation (v0.5.x)",
    "description": "🚀🤖 Crawl4AI, Open-source LLM-Friendly Web Crawler & Scraper"
  }
},
# ...
]
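
Since every item in the output above has the same shape, downstream code can index the results by URL. The following is only a minimal sketch: it assumes the dataset items have already been downloaded as a list of dicts with the `url`, `markdown`, and `metadata` keys shown above. The helper name `index_by_url` is hypothetical and is not part of any Apify or Crawl4AI API.

```python
import json

# Two items in the shape of the example output above (the "html" key is omitted for brevity).
SAMPLE_ITEMS = json.loads("""
[
  {
    "url": "https://docs.crawl4ai.com/",
    "markdown": "https://api.apify.com/v2/key-value-stores/m1Sqnke1KWM0AI8co/records/content_4242424242.md",
    "metadata": {"title": "Home - Crawl4AI Documentation (v0.5.x)"}
  },
  {
    "url": "https://docs.crawl4ai.com/advanced/ssl-certificate/",
    "markdown": "https://api.apify.com/v2/key-value-stores/m1Sqnke1KWM0AI8co/records/content_4242424242.md",
    "metadata": {"title": "SSL Certificate - Crawl4AI Documentation (v0.5.x)"}
  }
]
""")


def index_by_url(items):
    """Map each crawled URL to its page title and the link to its Markdown record."""
    return {
        item["url"]: {
            # Metadata may be missing for some pages, so fall back to None.
            "title": item.get("metadata", {}).get("title"),
            "markdown": item["markdown"],
        }
        for item in items
    }


index = index_by_url(SAMPLE_ITEMS)
for url, info in index.items():
    print(f"{info['title']}: {url}")
```

Note that the `markdown` and `html` fields hold links to records in a key-value store rather than the content itself, so fetching the actual Markdown would take one more HTTP request per item.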

The Actor processes the configured URLs, follows links, and stores the results in an Apify Dataset. It enhances Crawl4AI with additional features:

  • Persistence – The crawl results are stored in Apify storage and crawls can be paused and resumed.
  • HTTP proxy support – Enables use of Apify’s proxies or custom proxy settings.
  • Automatic parallelism – Dynamically adjusts the number of concurrent page crawls.
  • Integrations – Crawl4AI can be integrated with other Apify Actors and even with other automation platforms.

Please see another open-source Actor example in the Docling project for further inspiration.

Technical implementation

The Actor is based on Crawlee, Apify's open-source web scraping framework, ensuring robustness and efficiency. It runs in a Docker container optimized for cloud execution and includes:

  • Support for structured output in Markdown format.
  • Apify’s storage for result persistence and resumable crawling.
  • Proxy support for avoiding rate limits and geo-restrictions.
  • Parallel request handling for improved performance.
  • Clean error handling and input validation.

Do note that the Actor does not cover all possible configuration options that Crawl4AI provides. We will be happy to work with you to add or adjust anything based on your feedback!

Apify will sponsor your project

All the links to Apify in this PR are affiliate links under the Apify open source fair share program with id crawl4ai in the passive tier of the program. In the passive tier, Apify commits to sending a monthly commission via the GitHub Sponsor button from all new sign-ups that come through your link. The only action required on your part is to accept this pull request and set up GitHub Sponsors.

You can earn a larger commission and gain insights into traffic by registering directly with Apify, claiming ownership of the Actor on the Apify Store, and maintaining the Actor yourself. Simply contact support after signing up and pass the ownership challenge. The Actor will then be transferred, e.g., to unclecode/crawl4ai, and you’ll see it under your Apify account.

To further increase your income from Apify, you can convert your Actor on Apify Store to the pay-per-event pricing model and join the active developer tier. We offer an individual competitive advantage for the active developer tier in the form of either a significantly reduced Apify margin or discounted compute unit pricing. Feel free to ask for it!

Benefits of the Actor Programming Model

The Web Actor Programming Model is a new concept for building serverless microapps, which are easy to develop, share, integrate, and build upon. Actors are a reincarnation of the UNIX philosophy for programs running in the cloud. Actors are web automation tools that are easy to integrate and scale up. The main benefit is that even a small piece of software can be turned into a public cloud service in a heartbeat.

Apify is the largest ecosystem where developers build, deploy, and publish data extraction, web automation tools, and AI agents. With over 4,000 Actors on Apify Store and 10 years of experience in the market, Apify makes Crawl4AI accessible to over 250,000 developers using the platform monthly. This also enables integration with other Actors on Store, custom Actors, and platform integrations that can create much more powerful workflows than just individual parts.

It might pique your interest that even without any promotion whatsoever, the Crawl4AI Actor has already attracted 50 users to try it out.

Full disclosure

I work at Apify. Apify doesn’t sell your software, but we sell the computing resources needed to run your software in the cloud to the end users. Your project is one of the first we selected to pilot Apify's open source fair share program. Please let me know if there’s anything I can do to help you accept this PR! If you do, we’d be pleased to feature your project in our marketing communication.

If you have any questions or need assistance, don’t hesitate to reach out to me (@janbuchar) or my colleague @netmilk, or just write to us at [email protected]. We're always excited to work with the open source community! There is definitely room for improvement in performance, error handling, coverage of the Crawl4AI parameters, release management, and use of Apify proxies.

@unclecode
Owner

@janbuchar Thank you for your interest in Crawl4AI and for creating an Apify Actor for our project! I'm genuinely excited to see this kind of community integration, and I appreciate the effort you've put into making our tool more accessible to users through the Apify platform.

I'm definitely open to this collaboration and see great potential in joining the open source fair share program. The ability to run Crawl4AI in the cloud without installation is a wonderful addition that can help more users benefit from our tool.

I do have a minor concern that I'd like to address: I noticed that the Actor might be using an older version of our Docker image. Our codebase has evolved significantly since then, with substantial performance improvements and new features. This might impact the overall performance and capabilities of the Actor.

I'll have our head of product and community @aravindkarnam test this integration to provide detailed feedback. Once we've evaluated it thoroughly, we can work together to ensure the Actor is using the most up-to-date version of Crawl4AI.

Thanks again for this contribution!

@aravindkarnam - Could you please take some time to test this Apify integration? They've created an Actor for Crawl4AI that allows users to run it in the cloud without installation, but I'm concerned they might be using an older version of our code.

Specifically, I'd like you to:

  1. Test the performance of their Actor compared to our current version
  2. Check which version/build they're using and identify any significant differences
  3. Evaluate the feature coverage - are there critical features missing?
  4. Prepare a brief report on your findings that we can share with them

Thanks!

@janbuchar
Author

Hi @unclecode, that is great news! Regarding the Crawl4AI version, the Actor currently uses 0.5.0.post4. We install it from PyPI and use it as a library so that we can have a tighter integration with the Apify platform.

The important points are:

  1. It should be easy to update the Actor to the newest crawl4ai version; I will look into it today.
  2. We can adjust the Actor to use your Docker image and wrap the CLI, but I would like to do some evaluation first.

@janbuchar
Author

janbuchar commented Mar 26, 2025

I managed to update to 0.5.0.post6. There are some Playwright-related problems starting with post7; I will let you know once they are resolved.

EDIT: Updated to post8
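
For anyone who wants to check which crawl4ai release an environment actually has installed, here is a minimal sketch using only the standard library. `importlib.metadata.version` is a real standard-library call, while `installed_version` and `release_key` are hypothetical helpers written for this illustration. The version parser is deliberately simplified: it handles only plain releases with an optional post-release suffix, not the full PEP 440 grammar.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def release_key(v: str) -> Tuple[Tuple[int, ...], int]:
    """Sort key for simple versions such as '0.5.0' or '0.5.0.post6' (not full PEP 440)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)*)(?:\.?post(\d+))?", v)
    if match is None:
        raise ValueError(f"unsupported version string: {v!r}")
    release = tuple(int(part) for part in match.group(1).split("."))
    # A version without a post-release suffix sorts before any post release.
    post = int(match.group(2)) if match.group(2) is not None else -1
    return release, post


print("crawl4ai installed:", installed_version("crawl4ai"))
print(sorted(["0.5.0.post8", "0.5.0.post4", "0.5.0.post6"], key=release_key))
```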

@aravindkarnam
Collaborator

@janbuchar I've checked out the Actor. The time taken by Crawl4AI to finish its processing and produce output was only 6.36 seconds (as per the logs), but all the other activities included took about 19 seconds (run ID).

I ran it again, and this time Crawl4AI took 4.37 seconds in total to fetch, scrape, and produce output, but the rest of the activities took almost 40 seconds (run ID).

Is there any way to improve performance on this?

@janbuchar
Author

Thanks @aravindkarnam for giving the Actor a test drive! I gave you some additional Apify credits so that you can try out more stuff 🙂

> @janbuchar I've checked out the Actor. The time taken by Crawl4AI to finish its processing and produce output was only 6.36 seconds (as per the logs), but all the other activities included took about 19 seconds (run ID).
>
> I ran it again, and this time Crawl4AI took 4.37 seconds in total to fetch, scrape, and produce output, but the rest of the activities took almost 40 seconds (run ID).

I went through the logs and found the following causes of overhead (times are for the first and second run, respectively):

  • Pulling the Actor Docker container (~5s, ~28s)
  • Container initialization (~3s, ~2s)
  • Crawl4AI initialization (~2s, ~0.5s)
  • Crawlee initialization (~0.7s, ~0.5s)
  • Saving results (~0.6s, ~0.6s)

The actual Crawl4AI run took ~6s and ~4s.
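
The breakdown above can be added up to see how much of each run was overhead. This is a rough sketch that assumes each pair of numbers is (first run, second run) and treats the approximate values as exact. The function name `summarize` is introduced only for this illustration.

```python
# Approximate timings in seconds as (first run, second run), taken from the list above.
OVERHEAD = {
    "image pull": (5.0, 28.0),
    "container initialization": (3.0, 2.0),
    "Crawl4AI initialization": (2.0, 0.5),
    "Crawlee initialization": (0.7, 0.5),
    "saving results": (0.6, 0.6),
}
CRAWL = (6.0, 4.0)  # the actual Crawl4AI run


def summarize(run: int) -> dict:
    """Total the overhead for one run and compute the share caused by the image pull."""
    overhead = sum(times[run] for times in OVERHEAD.values())
    pull = OVERHEAD["image pull"][run]
    return {
        "overhead_s": round(overhead, 1),
        "total_s": round(overhead + CRAWL[run], 1),
        "pull_share": round(pull / overhead, 2),
    }


for run in (0, 1):
    print(f"run {run + 1}: {summarize(run)}")
```

Under these assumptions, the listed items add up to roughly 17 s and 36 s, slightly below the reported totals of about 19 s and almost 40 s, so a few seconds per run are not covered by the list. The image pull accounts for a bit under half of the overhead in the first run and nearly nine tenths in the second.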

> Is there any way to improve performance on this?

Well, most of the overhead comes from pulling the Docker image. At Apify, we are continuously working on improving this. There is a lot of caching going on, so if the Actor runs frequently enough, the image should be cached on all worker nodes, which keeps initialization times stable.

Also, if your Docker image is smaller than the apify/actor-python-playwright:3.12 image that the Actor currently uses, we could benefit from switching to that one.

It is also possible not to use BasicCrawler from Crawlee (our open source web scraping library). That would put you (Crawl4AI maintainers) in control of the overhead of crawling multiple pages, scaling to utilize available capacity, and so on. Custom code would, however, need to be written so that you can benefit from:

  • the Apify Request Queue (this enables Actor progress tracking and pausing/restarting the crawl)
  • proxy rotation
  • possibly other stuff that I can't think of right now 🙂
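
To give a feel for what such custom code would have to cover, here is a pure-Python stand-in for a request queue with deduplication, a depth limit, and round-robin proxy rotation. It does not use the Apify or Crawlee APIs: every name in it (`SimpleRequestQueue`, `crawl`, the `links` mapping that replaces real page fetching, the proxy labels) is hypothetical. Unlike the Apify Request Queue, this queue lives only in memory, so it cannot be paused and resumed across runs.

```python
from collections import deque
from itertools import cycle


class SimpleRequestQueue:
    """In-memory FIFO queue that deduplicates URLs and enforces a depth limit."""

    def __init__(self, max_depth: int) -> None:
        self.max_depth = max_depth
        self._pending = deque()
        self._seen = set()
        self.handled = []

    def add(self, url: str, depth: int = 0) -> bool:
        """Enqueue a URL unless it is too deep or was already seen."""
        if depth > self.max_depth or url in self._seen:
            return False
        self._seen.add(url)
        self._pending.append((url, depth))
        return True

    def fetch_next(self):
        """Return the next (url, depth) pair, or None when the queue is empty."""
        return self._pending.popleft() if self._pending else None

    def progress(self):
        """Return (handled, total seen), which is enough for basic progress tracking."""
        return len(self.handled), len(self._seen)


def crawl(start_urls, links, proxies, max_depth):
    """Walk `links` breadth-first, assigning proxies to requests in round-robin order."""
    queue = SimpleRequestQueue(max_depth)
    proxy_cycle = cycle(proxies)
    for url in start_urls:
        queue.add(url)
    visits = []
    while (request := queue.fetch_next()) is not None:
        url, depth = request
        visits.append((url, next(proxy_cycle)))  # a real crawler would fetch the page here
        for linked in links.get(url, []):
            queue.add(linked, depth + 1)
        queue.handled.append(url)
    return visits


LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/", "https://example.com/c"],
}
visits = crawl(["https://example.com/"], LINKS, ["proxy-1", "proxy-2"], max_depth=1)
for url, proxy in visits:
    print(proxy, url)
```

In this toy run the page at depth 2 is never enqueued and the already-seen start URL is skipped, which is roughly the bookkeeping a persistent queue would need to do in a real crawl, in addition to surviving restarts.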

I'm excited to hear your thoughts on this! Also, we'll be happy to work with you to adjust the Actor.

@janbuchar
Author

Do you see any blockers or room for improvement @aravindkarnam? Or is there anything I should explain or assist with?

Also, kinda unrelated, but do you have a square icon that I could use for the Apify Actor? 🙂
