Elastic Open Web Crawler

Elastic Open Crawler is a lightweight, open code web crawler designed for discovering, extracting, and indexing web content directly into Elasticsearch.

This CLI-driven tool streamlines web content ingestion into Elasticsearch, enabling easy searchability through on-demand or scheduled crawls defined by configuration files.

Note

This repository contains code and documentation for the Elastic Open Web Crawler. Docker images are available for the crawler at the Elastic Docker registry.

Important

The Open Crawler is currently in beta. Beta features are subject to change and are not covered by the support SLA of generally available (GA) features. Elastic plans to promote this feature to GA in a future release.

Version compatibility

Elasticsearch    Open Crawler          Operating System
8.x              v0.2.x                Linux, OSX
9.x              v0.2.1 and above      Linux, OSX

Quick links

  • Hands-on quickstart: Run your first crawl to ingest web content into Elasticsearch.
  • Learn more: Learn how to configure advanced features and understand detailed options.
  • Developer guide: Learn how to build and run Open Crawler from source, for developers who want to modify or extend the code.

Quickstart

Get from zero to crawling your website into Elasticsearch in just a few steps.

Steps

Prerequisites

Step 1: Verify Docker setup and run a test crawl

First, let's test that the crawler works on your system by crawling a simple website and printing the results to your terminal. We'll create a basic config file and run the crawler against https://example.com.

Run the following in your terminal:

cat > crawl-config.yml << EOF
output_sink: console
domains:
  - url: https://example.com
EOF

docker run \
  -v "$(pwd)":/config \
  -it docker.elastic.co/integrations/crawler:latest jruby \
  bin/crawler crawl /config/crawl-config.yml

The -v "$(pwd)":/config flag maps your current directory to the container's /config directory, making your config file available to the crawler.
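
If the crawler reports that it can't find the config file, you can sanity-check the volume mount by listing the container's /config directory. This is only a sketch, and it assumes the image lets you override the command, just as the crawl example above does:

docker run --rm \
  -v "$(pwd)":/config \
  docker.elastic.co/integrations/crawler:latest \
  ls /config

You should see crawl-config.yml in the output.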

βœ… Success check: You should see HTML content from example.com printed to your console, ending with [primary] Finished a crawl. Result: success;

Step 2: Get your Elasticsearch details

Tip

If you haven't used Elasticsearch before, check out the Elasticsearch basics quickstart for a hands-on introduction to fundamental concepts.

Before proceeding with this step, make sure you have a running Elasticsearch instance. See the prerequisites.

For this step you'll need:

  • Your Elasticsearch endpoint URL
  • An API key

For step-by-step guidance on finding endpoint URLs and creating API keys in the UI, see connection details.

If you'd prefer to create an API key in the Dev Tools Console use the following command:

Create API key via Dev Tools Console

Run the following in Dev Tools Console:

POST /_security/api_key
{
  "name": "crawler-key",
  "role_descriptors": { 
    "crawler-role": {
      "cluster": ["monitor"],
      "indices": [
        {
          "names": ["web-crawl-*"],
          "privileges": ["write", "create_index", "monitor"]
        }
      ]
    }
  }
}

Save the encoded value from the response - this is your API key.
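
If you'd rather stay on the command line, the same request can be sent with curl. This is only a sketch: it assumes you have basic-auth credentials for a user allowed to create API keys, and it reuses the same role descriptor as the Dev Tools example above (replace the endpoint and credentials with your own values):

curl -X POST "https://your-deployment.es.region.aws.elastic.cloud:443/_security/api_key" \
  -u elastic:your_password \
  -H "Content-Type: application/json" \
  -d '{
    "name": "crawler-key",
    "role_descriptors": {
      "crawler-role": {
        "cluster": ["monitor"],
        "indices": [
          { "names": ["web-crawl-*"], "privileges": ["write", "create_index", "monitor"] }
        ]
      }
    }
  }'

The encoded field in the JSON response is the value you'll use as your API key.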

Step 3: Set environment variables (optional)

Tip

If you prefer not to use environment variables (or are on a system where they don't work as expected), you can skip this step and manually edit the configuration file in Step 4.

Set your connection details and target website as environment variables. Replace the values with your actual values.

export ES_HOST="https://your-deployment.es.region.aws.elastic.cloud"
export ES_PORT="443"
export ES_API_KEY="your_encoded_api_key_here"
export TARGET_WEBSITE="https://your-website.com"

Note

Connection settings differ based on where Elasticsearch is running (e.g., cloud hosted, serverless, or localhost).

  • ES_HOST: Your Elasticsearch endpoint URL
    • Cloud Hosted/Serverless: Looks like https://your-deployment.es.region.aws.elastic.cloud
    • Localhost:
      • Use http://host.docker.internal if Elasticsearch is running locally but not in the same Docker network
      • Use http://elasticsearch if Elasticsearch is running in a Docker container on the same network
  • ES_PORT: Elasticsearch port
    • Cloud Hosted/Serverless: 443
    • Localhost: 9200
  • ES_API_KEY: API key from Step 2
  • TARGET_WEBSITE: Website to crawl
    • Remove any trailing slash (/) from the URL, or you'll hit an error like ArgumentError: Domain "https://www.example.com/" cannot have a path. See the snippet after this list for a quick way to strip it.
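
If your URL ends with a slash, a minimal way to strip it is shell parameter expansion. This sketch assumes a POSIX-style shell such as bash or zsh:

# Removes a single trailing slash, if present, so the crawler accepts the domain URL
export TARGET_WEBSITE="${TARGET_WEBSITE%/}"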

Step 4: Update crawler configuration for Elasticsearch

Create your crawler config file by running the following command. This will use the environment variables you set in Step 3 to populate the configuration file automatically.

cat > crawl-config.yml << EOF
output_sink: elasticsearch
output_index: web-crawl-test

elasticsearch:
  host: $ES_HOST
  port: $ES_PORT
  api_key: $ES_API_KEY
  pipeline_enabled: false

domains:
  - url: $TARGET_WEBSITE
EOF

If you skipped Step 3 or the environment variables aren't working on your computer, create the config file and replace the placeholders manually.

Manual configuration
cat > crawl-config.yml << 'EOF'
output_sink: elasticsearch
output_index: web-crawl-test

elasticsearch:
  host: https://your-deployment.es.region.aws.elastic.cloud  # Your ES_HOST
  port: 443                                                   # Your ES_PORT (443 for cloud, 9200 for localhost)  
  api_key: your_encoded_api_key_here                          # Your ES_API_KEY from Step 2
  pipeline_enabled: false

domains:
  - url: https://your-website.com                             # Your target website
EOF
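
Whichever way you created the file, it's worth confirming the connection values were actually filled in before running the crawl. A quick sanity check, assuming a POSIX shell (empty or placeholder values mean the substitution didn't work):

# Print the connection-related lines from the generated config
grep -E 'host:|port:|api_key:|url:' crawl-config.yml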

Step 5: Crawl and ingest into Elasticsearch

Now you can ingest your target website content into Elasticsearch:

docker run \
  -v "$(pwd)":/config \
  -it docker.elastic.co/integrations/crawler:latest jruby \
  bin/crawler crawl /config/crawl-config.yml

βœ… Success check: You should see messages like:

  • Connected to ES at https://your-endpoint - version: 8.x.x
  • Index [web-crawl-test] was found!
  • Elasticsearch sink initialized
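
If you see authentication or connection errors instead, you can test the endpoint and API key directly with curl before re-running the crawl. This sketch reuses the environment variables from Step 3 (substitute your own values if you configured the file manually):

# A successful response returns cluster information as JSON
curl -H "Authorization: ApiKey ${ES_API_KEY}" "${ES_HOST}:${ES_PORT}/"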

Step 6: View your data

Now that the crawl is complete, you can view the indexed data in Elasticsearch:

Use the API

The fastest way is to use `curl` from the command line. This reuses the environment variables you set earlier.
curl -X GET "${ES_HOST}:${ES_PORT}/web-crawl-test/_search" \
    -H "Authorization: ApiKey ${ES_API_KEY}" \
    -H "Content-Type: application/json"

Alternatively, run the following API call in the Dev Tools Console:

GET /web-crawl-test/_search

Use Kibana/Serverless UI
  1. Go to the Kibana or Serverless UI
  2. Find the Index Management page using the global search bar
  3. Select the web-crawl-test index
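
To get a quick sense of how many documents were indexed, you can also run a count query against the index. Shown here with curl, reusing the environment variables from Step 3:

# Returns something like {"count": 1, ...} for the web-crawl-test index
curl -H "Authorization: ApiKey ${ES_API_KEY}" "${ES_HOST}:${ES_PORT}/web-crawl-test/_count"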

πŸ“– Learn more

πŸš€ Essential guides and concepts

  • CLI reference - Commands for running crawls, validation, and management
  • Configuration - Understand how to configure crawlers with crawler.yml and elasticsearch.yml files
  • Document schema - Understand how crawled content is indexed into a set of predefined fields in Elasticsearch and how to add fields using extraction rules
  • Crawl rules - Control which URLs the crawler visits

βš™οΈ Advanced topics

  • Crawl lifecycle - Understand how the crawler discovers, queues, and indexes content across two stages: the primary crawl and the purge crawl
  • Extraction rules - Define how crawler extracts content from HTML
  • Binary content extraction - Extract text from PDFs, DOCX files
  • Crawler directives - Use robots.txt, meta tags, or embedded data attributes to guide discovery and content extraction
  • Scheduling - Automate crawls with cron scheduling
  • Ingest pipelines - Elasticsearch ingest pipeline integration
  • Logging - Monitor and troubleshoot crawler activity

βš–οΈ Elastic Crawler comparison

  • Feature comparison - See how Open Crawler compares to Elastic Crawler, including feature support and deployment differences

πŸ‘©πŸ½β€πŸ’» Developer guide

Build from source

You can build and run the Open Crawler locally using the provided setup instructions. Detailed setup steps, including environment requirements, are in the Developer Guide.

Contribute

Want to contribute? We welcome bug reports, code contributions, and documentation improvements. Read the Contributing Guide for contribution types, PR guidelines, and coding standards.

πŸ’¬ Support

Learn how to get help, report issues, and find community resources in the Support Guide.
