OscartGiles/spider_crab


spider_crab

A simple web crawler written in Rust on the Tokio async runtime. It consists of a library and a CLI. The library can be reused in other contexts (e.g. a web service).

I wrote this as part of a coding challenge. Read more about the design decisions I made in the design overview.

Install CLI

Make sure you have Rust installed and then install with:

cargo install --git https://github.com/OscartGiles/spider_crab

Usage

Get help.

spider_crab --help

Crawl a website.

spider_crab https://oscartgiles.github.io/

Options

Save the results to a file.

spider_crab https://oscartgiles.github.io/ -o crawl_results.txt

Hide links in the output.

spider_crab https://oscartgiles.github.io/ --hide-links

Limit the number of pages visited.

spider_crab https://docs.rs/ --max-pages 5 --hide-links

Crawl for a set period of time (in seconds).

spider_crab https://docs.rs/ --max-time 1 --hide-links
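To illustrate how a page limit and a time limit can bound a crawl, here is a minimal std-only sketch. It is not the crate's implementation: `CrawlBudget` and `crawl` are hypothetical names, and an in-memory link graph stands in for real HTTP fetches. Both limits are checked before each page is visited, so the crawl stops as soon as either is exhausted.

```rust
use std::collections::{HashMap, HashSet, VecDeque};
use std::time::{Duration, Instant};

/// Hypothetical limits mirroring `--max-pages` and `--max-time`.
pub struct CrawlBudget {
    pub max_pages: Option<usize>,
    pub max_time: Option<Duration>,
}

/// Breadth-first crawl over an in-memory link graph (a stand-in for
/// fetching pages over HTTP). Returns the pages visited, in visit order.
pub fn crawl(
    graph: &HashMap<&str, Vec<&str>>,
    start: &str,
    budget: &CrawlBudget,
) -> Vec<String> {
    let started = Instant::now();
    let mut seen: HashSet<String> = HashSet::from([start.to_string()]);
    let mut queue: VecDeque<String> = VecDeque::from([start.to_string()]);
    let mut visited: Vec<String> = Vec::new();

    while let Some(page) = queue.pop_front() {
        // Stop once the page budget is used up.
        if let Some(max) = budget.max_pages {
            if visited.len() >= max {
                break;
            }
        }
        // Stop once the time budget is used up.
        if let Some(max) = budget.max_time {
            if started.elapsed() >= max {
                break;
            }
        }

        // "Fetch" the page: look up its outgoing links in the graph.
        let links = graph.get(page.as_str()).cloned().unwrap_or_default();
        visited.push(page);

        // Queue each link the first time it is seen.
        for link in links {
            if seen.insert(link.to_string()) {
                queue.push_back(link.to_string());
            }
        }
    }
    visited
}
```

Calling `crawl` with `max_pages: Some(5)` corresponds roughly to `--max-pages 5`. A real async crawler would also need to cancel requests that are still in flight when the deadline passes, which this sketch does not model.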

Limit the number of concurrent requests to a domain.

spider_crab https://docs.rs/ -c 1 --max-time 10
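As a rough picture of what a per-domain concurrency limit means, the sketch below keeps a count of in-flight requests for each domain and blocks callers once a domain is at its limit. It is a blocking, std-only illustration and an assumption on my part, not how the crate does it (the crate is async on Tokio); `DomainLimiter` and `Permit` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};

/// Caps the number of in-flight requests per domain (hypothetical type).
pub struct DomainLimiter {
    limit: usize,
    in_flight: Mutex<HashMap<String, usize>>,
    freed: Condvar,
}

/// Holds one slot for a domain and releases it when dropped.
pub struct Permit {
    limiter: Arc<DomainLimiter>,
    domain: String,
}

impl DomainLimiter {
    /// `limit` must be at least 1, otherwise `acquire` would block forever.
    pub fn new(limit: usize) -> Arc<Self> {
        Arc::new(Self {
            limit,
            in_flight: Mutex::new(HashMap::new()),
            freed: Condvar::new(),
        })
    }

    /// Blocks until `domain` has fewer than `limit` requests in flight.
    pub fn acquire(self: &Arc<Self>, domain: &str) -> Permit {
        let mut counts = self.in_flight.lock().unwrap();
        while counts.get(domain).copied().unwrap_or(0) >= self.limit {
            // Sleep until some permit is released, then re-check.
            counts = self.freed.wait(counts).unwrap();
        }
        *counts.entry(domain.to_string()).or_insert(0) += 1;
        Permit {
            limiter: Arc::clone(self),
            domain: domain.to_string(),
        }
    }

    /// Number of permits currently held for `domain`.
    pub fn in_flight(&self, domain: &str) -> usize {
        let counts = self.in_flight.lock().unwrap();
        counts.get(domain).copied().unwrap_or(0)
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        let mut counts = self.limiter.in_flight.lock().unwrap();
        if let Some(n) = counts.get_mut(&self.domain) {
            *n -= 1;
        }
        // Wake waiters so they can re-check their domain's count.
        self.limiter.freed.notify_all();
    }
}
```

With `DomainLimiter::new(1)`, requests to the same domain run one at a time, which is the behaviour `-c 1` asks for, while requests to other domains are unaffected. In async code the same idea is usually expressed with one semaphore per domain.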

Ignore robots.txt. spider_crab respects it by default.

spider_crab https://docs.rs/ --ignore-robots --max-time 10
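For readers unfamiliar with what respecting robots.txt involves, here is a simplified std-only sketch of the check a crawler makes before fetching a path. It is an illustration under stated assumptions, not the crate's code: `RobotsRules` is a hypothetical type, only the `User-agent: *` group is read, and wildcard patterns (`*`, `$`) are not handled. The longest matching rule wins, and `Allow` wins ties.

```rust
/// Path rules from the `User-agent: *` group of a robots.txt file
/// (hypothetical type; wildcards and named agents are not supported).
#[derive(Debug, Default)]
pub struct RobotsRules {
    allow: Vec<String>,
    disallow: Vec<String>,
}

/// Length of the longest rule that is a prefix of `path`, if any.
fn longest_match(rules: &[String], path: &str) -> Option<usize> {
    rules
        .iter()
        .filter(|rule| path.starts_with(rule.as_str()))
        .map(|rule| rule.len())
        .max()
}

impl RobotsRules {
    pub fn parse(robots_txt: &str) -> Self {
        let mut rules = Self::default();
        let mut in_star_group = false;
        let mut prev_was_agent = false;

        for line in robots_txt.lines() {
            // Drop trailing comments and surrounding whitespace.
            let line = line.split('#').next().unwrap_or("").trim();
            let (key, value) = match line.split_once(':') {
                Some(kv) => kv,
                None => continue,
            };
            let value = value.trim();

            match key.trim().to_ascii_lowercase().as_str() {
                "user-agent" => {
                    // Consecutive User-agent lines share one group.
                    let is_star = value == "*";
                    in_star_group = if prev_was_agent {
                        in_star_group || is_star
                    } else {
                        is_star
                    };
                    prev_was_agent = true;
                }
                other => {
                    prev_was_agent = false;
                    // An empty value (e.g. `Disallow:`) restricts nothing.
                    if in_star_group && !value.is_empty() {
                        match other {
                            "allow" => rules.allow.push(value.to_string()),
                            "disallow" => rules.disallow.push(value.to_string()),
                            _ => {}
                        }
                    }
                }
            }
        }
        rules
    }

    /// Longest matching rule wins; `Allow` wins ties; no match means allowed.
    pub fn is_allowed(&self, path: &str) -> bool {
        let allow = longest_match(&self.allow, path);
        let disallow = longest_match(&self.disallow, path);
        match (allow, disallow) {
            (_, None) => true,
            (None, Some(_)) => false,
            (Some(a), Some(d)) => a >= d,
        }
    }
}
```

A crawler that respects robots.txt would call something like `is_allowed` on each URL path before fetching it and skip the URL when the result is `false`. Passing `--ignore-robots` corresponds to skipping that check.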

Tracing

The CLI can export traces to an OTLP collector. For example, you could export traces to Jaeger. To try it out, start Jaeger with Docker:

docker run --name spider_crab_jaeger -e COLLECTOR_OTLP_ENABLED=true -p 6831:6831/udp -p 6832:6832/udp -p 16686:16686 -p 14268:14268 -p 4317:4317 jaegertracing/all-in-one:latest

and then run the CLI with the --otl-endpoint option.

spider_crab https://oscartgiles.github.io/ --otl-endpoint http://localhost:4317

You can then view the traces in the Jaeger UI at http://localhost:16686/.

Clean up the Jaeger container.

docker stop spider_crab_jaeger; docker rm spider_crab_jaeger

Library

Run tests.

cargo test

Run benchmarks (currently only HTTP parsing is benchmarked) and open the HTML report.

cargo bench
open ./target/criterion/report/index.html

Open library docs.

cargo doc --open
