Apify will sponsor your project: Crawl4AI Actor on Apify infrastructure #865
@janbuchar Thank you for your interest in Crawl4AI and for creating an Apify Actor for our project! I'm genuinely excited to see this kind of community integration, and I appreciate the effort you've put into making our tool more accessible to users through the Apify platform.
I'm definitely open to this collaboration and see great potential in joining the open source fair share program. The ability to run Crawl4AI in the cloud without installation is a wonderful addition that can help more users benefit from our tool.
I do have a minor concern that I'd like to address: I noticed that the Actor might be using an older version of our Docker image. Our codebase has evolved significantly since then, with substantial performance improvements and new features. This might impact the overall performance and capabilities of the Actor.
I'll have our head of product and community @aravindkarnam test this integration to provide detailed feedback. Once we've evaluated it thoroughly, we can work together to ensure the Actor is using the most optimal and up-to-date version of Crawl4AI. Thanks again for this contribution!
@aravindkarnam - Could you please take some time to test this Apify integration? They've created an Actor for Crawl4AI that allows users to run it in the cloud without installation, but I'm concerned they might be using an older version of our code. Specifically, I'd like you to:
Thanks! |
Hi @unclecode, that is great news! Regarding the Crawl4AI version, we use 0.5.0.post4 in the current version. We install from PyPI and use it as a library so that we can have a tighter integration with the Apify platform. Important points are
|
I managed to update to 0.5.0.post6. There are some Playwright-related problems starting with post7; I will let you know once they are resolved. EDIT: Updated to post8 |
@janbuchar I've checked out the Actor. According to the logs, Crawl4AI took only 6.36 seconds to finish its processing and produce output, but with all the other activities included, the run took about 19 seconds (run ID). I ran it again, and this time Crawl4AI took 4.37s in total to fetch, scrape, and produce output, but the rest of the activities took almost 40s (run ID). Is there any way to improve performance on this? |
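To put the two reported runs side by side, the arithmetic can be sketched as below. This is only an illustration under a stated assumption: both the "about 19 seconds" and the "almost 40 secs" figures are treated as approximate total run durations, which the comment does not state unambiguously. The `overhead` helper is a hypothetical name, not part of Crawl4AI or Apify.

```python
# Timings quoted in the comment above: (Crawl4AI seconds, approximate total seconds).
# ASSUMPTION: 19 s and 40 s are treated as total run durations.
runs = {
    "first run": (6.36, 19.0),
    "second run": (4.37, 40.0),
}


def overhead(crawl_s: float, total_s: float) -> tuple[float, float]:
    """Return (overhead in seconds, overhead as a fraction of the total run)."""
    extra = total_s - crawl_s
    return extra, extra / total_s


for name, (crawl_s, total_s) in runs.items():
    extra, share = overhead(crawl_s, total_s)
    print(f"{name}: {extra:.2f}s overhead ({share:.0%} of the run)")
```

Under that assumption, platform overhead accounts for roughly two thirds of the first run and close to nine tenths of the second.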
Thanks @aravindkarnam for giving the Actor a test drive! I gave you some additional Apify credits so that you can try out more stuff 🙂
I went through the logs and found the following causes of overhead:
The actual Crawl4AI run took ~6s and ~4s.
Most of the overhead comes from pulling the Docker image. At Apify, we are continuously working on improving this. There is a lot of caching going on, so if the Actor runs frequently enough, the image should be cached on all worker nodes, which ensures stable initialization times. Also, if your Docker image is smaller than the It is also possible not to use
I'm excited to hear your thoughts on this! Also, we'll be happy to work with you to adjust the Actor. |
Do you see any blockers or room for improvement @aravindkarnam? Or is there anything I should explain or assist with? Also, kinda unrelated, but do you have a square icon that I could use for the Apify Actor? 🙂 |
Dear Crawl4AI maintainers,
I have wrapped Crawl4AI as an Apify Actor by adding the Actor definition in the .actor directory and published the Crawl4AI Actor on Apify Store. I've also added the Actor status badge and a brief usage description to the README, including the "Run on Apify" button.
For the full description of the Actor, please see the README file in the .actor directory.
Crawl4AI can now be used in the cloud without installation, free of charge. This makes it accessible to users without having to manage dependencies or configure their own environment. The Actor can be used either from Apify Console, API, or CLI locally:
The Actor processes the configured URLs, follows links, and stores the results in an Apify Dataset. It enhances Crawl4AI with additional features:
Please see another open-source Actor example in the Docling project for further inspiration.
Technical implementation
The Actor is based on Crawlee, Apify's open-source web scraping framework, ensuring robustness and efficiency. It runs in a Docker container optimized for cloud execution and includes:
Do note that the Actor does not cover all possible configuration options that Crawl4AI provides. We will be happy to work with you to add or adjust anything based on your feedback!
Apify will sponsor your project
All the links to Apify in this PR are affiliate links under the Apify open source fair share program with id crawl4ai in the passive tier of the program. In the passive tier, Apify commits to sending a monthly commission via the GitHub Sponsor button from all new sign-ups that come through your link. The only action required on your part is to accept this pull request and set up GitHub Sponsors.
You can earn a larger commission and gain insights into traffic by registering directly with Apify, claiming ownership of the Actor on the Apify Store, and maintaining the Actor yourself. Simply contact support after signing up and pass the ownership challenge. The Actor will then be transferred, e.g., to unclecode/crawl4ai, and you’ll see it under your Apify account.
To further increase your income from Apify, you can convert your Actor on Apify Store to the pay-per-event pricing model and join the active developer tier. We offer an individual competitive advantage for the active developer tier in the form of either a significantly reduced Apify margin or discounted compute unit pricing. Feel free to ask for it!
Benefits of the Actor Programming Model
The Web Actor Programming Model is a new concept for building serverless microapps, which are easy to develop, share, integrate, and build upon. Actors are a reincarnation of the UNIX philosophy for programs running in the cloud. Actors are web automation tools that are easy to integrate and scale up. The main benefit is that even a small piece of software can be turned into a public cloud service in a heartbeat.
Apify is the largest ecosystem where developers build, deploy, and publish data extraction, web automation tools, and AI agents. With over 4,000 Actors on Apify Store and 10 years of experience in the market, Apify makes Crawl4AI accessible to over 250,000 developers using the platform monthly. This also enables integration with other Actors on Store, custom Actors, and platform integrations that can create much more powerful workflows than just individual parts.
It might pique your interest that, even without any promotion whatsoever, the Crawl4AI Actor has already attracted 50 users to try it out.
Full disclosure
I work at Apify. Apify doesn’t sell your software, but we sell the computing resources needed to run your software in the cloud to the end users. Your project is one of the first we selected to pilot Apify's open source fair share program. Please let me know if there’s anything I can do to help you accept this PR! If you do, we’d be pleased to feature your project in our marketing communication.
If you have any questions or need assistance, don’t hesitate to reach out to me @janbuchar, my colleague @netmilk, or just write to us at [email protected]. We're always excited to work with the open source community!
There's definitely room for improvement regarding performance, error handling, coverage of the Crawl4AI parameters, release management, and Apify proxy usage.