Description
crawl4ai version
0.7.8
Expected Behavior
When using crawl4ai in a Dockerized FastAPI application, crawling shouldn't create zombie threads.
Instead, each call to the /crawl endpoint creates approximately four additional threads.
These threads remain in a zombie state and are never cleaned up.
Zombie threads accumulate indefinitely with repeated crawl requests.
The threads are not released even though the AsyncWebCrawler context manager exits.
Additional Context
This behavior appears related to known Playwright issues when running inside Docker containers without an init process. Playwright documentation explicitly recommends running containers with an init system enabled.
Relevant references:
https://playwright.dev/python/docs/docker#recommended-docker-configuration
https://docs.docker.com/reference/cli/docker/container/run/#init
Using docker run --init mitigates similar issues by ensuring child processes are properly reaped. However, this is not documented anywhere in crawl4ai.
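For reference, enabling the flag only changes the run command from the repro steps below; a minimal sketch (untested in this exact setup):
docker run --init -p 7777:7777 fastapi-crawl-app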
Open Questions / Requested Clarification
Should crawl4ai users always run containers with --init when using AsyncWebCrawler?
Is this a known limitation inherited from Playwright, or is this a crawl4ai lifecycle issue?
How should this be handled in Kubernetes deployments, where docker run --init cannot be easily used?
Is there a recommended workaround, such as embedding an init system (e.g., tini) in the Docker image?
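As one possible shape for the embedded-init option, here is a minimal sketch of the repro Dockerfile with tini added. This is a suggestion, not a confirmed fix; the Debian tini package and the /usr/bin/tini path are assumptions:
FROM python:3.11-slim-bookworm
RUN pip install fastapi==0.120.2 crawl4ai==0.7.8 uvicorn
RUN crawl4ai-setup
RUN apt update && apt install -y procps tini
WORKDIR /app
COPY . /app
# tini runs as PID 1 and reaps the zombie children left behind by browser subprocesses
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7777"]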
Documentation Request
Regardless of whether this is an upstream Playwright limitation or a crawl4ai issue, the required runtime setup should be documented clearly. Users need explicit guidance on:
Whether --init is required
How to configure containers correctly
How to avoid zombie thread accumulation in production environments (especially Kubernetes)
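For Kubernetes specifically, one commonly cited option is shareProcessNamespace: true, which makes the pod's pause container PID 1 so it reaps orphaned zombies. A minimal sketch (the field is standard PodSpec; the pod and container names are made up, and the image name is taken from the repro below):
apiVersion: v1
kind: Pod
metadata:
  name: fastapi-crawl-app
spec:
  shareProcessNamespace: true  # pause container becomes PID 1 and reaps zombies
  containers:
    - name: app
      image: fastapi-crawl-app
      ports:
        - containerPort: 7777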
Current Behavior
When using crawl4ai’s AsyncWebCrawler inside a Docker container, each invocation of the crawler leaks OS-level threads. These threads remain in a zombie state and are never cleaned up. Over time, repeated crawl requests cause unbounded growth in zombie threads.
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Build and start a Docker image with the following steps:
Create a file named main.py with the following content:
import subprocess

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig
from fastapi import FastAPI

app = FastAPI()


@app.get("/crawl")
async def crawl():
    async with AsyncWebCrawler(
        config=BrowserConfig(browser_type="chromium", headless=True)
    ) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        return result


@app.get("/get-threads-info")
def get_threads_info():
    info = subprocess.run(
        ["top", "-H", "-b", "-n", "1"],
        capture_output=True,
        text=True,
    ).stdout
    print(info)
    return info
Create a Dockerfile with the following content:
FROM python:3.11-slim-bookworm
RUN pip install fastapi==0.120.2 crawl4ai==0.7.8 uvicorn
RUN crawl4ai-setup
RUN apt update && apt install -y procps
WORKDIR /app
COPY . /app
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7777"]
Build the Docker image:
docker build -t fastapi-crawl-app .
Run the container:
docker run -p 7777:7777 fastapi-crawl-app
Call the crawl endpoint:
http://localhost:7777/crawl
Call the threads inspection endpoint:
http://localhost:7777/get-threads-info
Observe the thread state output from top (in the response of the endpoint and in the command-line output of the Docker container). It will look similar to:
...
Threads: 16 total, 1 running, 3 sleeping, 0 stopped, 12 zombie
...
Repeat calls to the /crawl endpoint and re-run /get-threads-info to see the zombie threads accumulate.
Code snippets
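As an optional aid (not part of the original repro), a small hypothetical helper could quantify the accumulation instead of eyeballing top. It parses ps output and relies on procps, which the Dockerfile above already installs; count_zombies is an illustrative name:
import subprocess

def count_zombies() -> int:
    # One process state per line, no header; zombies report state "Z"
    out = subprocess.run(
        ["ps", "-eo", "stat="],
        capture_output=True,
        text=True,
    ).stdout
    return sum(1 for line in out.splitlines() if line.strip().startswith("Z"))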
OS
On Windows, with a Linux-based Docker image
Python version
3.11
Browser
Chrome
Browser version
No response
Error logs & Screenshots (if applicable)
No response