Grävling

Grävling is a web crawler that searches a website for terms and saves the matches as a report in a format of your choosing.

Features

Search through human-readable text
Search page markup
Supports crawling either by sitemap or by following links
Get reports as CSV, JSON, or XML

Requirements

Python 3.6+
pip

Install

pip install -r requirements.txt

Usage

Gather URLs for pages that contain a certain text by crawling a sitemap

This command scrapes every page listed in the sitemap and returns each matching text snippet together with the URL where it was found. By default, only the human-readable text is searched.

cd gravling
scrapy crawl sitemap -o matches.csv -a url=https://client.test/sitemap.xml -a keywords="Lynx"

This generates a list that looks like this:

text,url
Tomorrow. Lynx browser is,https://example.com/about
 is easy. Lynx is quick a,https://example.com/installation
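
A report like the one above can be post-processed with a few lines of Python. The following is a minimal sketch, not part of Grävling itself: the helper name `urls_from_report` is hypothetical, and it assumes only the two-column `text,url` layout shown in the example output.

```python
import csv
import io


def urls_from_report(report_file):
    """Return the distinct URLs in a report, in first-seen order.

    Hypothetical helper: assumes a CSV with a header row containing a
    `url` column, as in the example output above.
    """
    reader = csv.DictReader(report_file)
    # dict.fromkeys keeps insertion order and drops duplicate URLs.
    return list(dict.fromkeys(row["url"] for row in reader))


# Sample rows copied from the example output above.
sample = io.StringIO(
    "text,url\n"
    "Tomorrow. Lynx browser is,https://example.com/about\n"
    " is easy. Lynx is quick a,https://example.com/installation\n"
)
print(urls_from_report(sample))
# ['https://example.com/about', 'https://example.com/installation']
```

To run it against a real report, open the file instead of using the in-memory sample, for example `open("matches.csv", newline="", encoding="utf-8")`.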

Gather URLs for pages whose markup contains a certain text by crawling a sitemap

This command searches only the page source markup and returns any matches it finds.

cd gravling
scrapy crawl sitemap -o matches.csv -a url=https://client.test/sitemap.xml -a keywords="Lynx" -a search_html=1 -a search_text=0

This generates a list that looks like this:

text,url
orrow. <b>Lynx</b> brows,https://example.com/credits
. <strong>Lynx</strong> ,https://example.com/unix
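
Each row in the example reports holds the keyword with a few characters of context on either side. The sketch below illustrates that idea; it is not taken from Grävling's source. The function name `snippets` and the context width of 10 characters are assumptions inferred from the example output, and the real crawler may extract matches differently.

```python
import re


def snippets(source, keyword, width=10):
    """Return each occurrence of `keyword` with `width` characters of
    context before and after it.

    Illustrative only: `width=10` is a guess based on the example rows.
    """
    results = []
    # re.escape makes the keyword match literally, even if it contains
    # characters that are special in regular expressions.
    for match in re.finditer(re.escape(keyword), source):
        start = max(0, match.start() - width)
        results.append(source[start:match.end() + width])
    return results


# Searching markup keeps the tags in the snippet.
print(snippets("Tomorrow. <b>Lynx</b> browser", "Lynx"))
# ['orrow. <b>Lynx</b> brows']
```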

Gather URLs for pages that contain a certain text by following links

cd gravling
scrapy crawl website -o matches.csv -a domain=lynx.browser.org -a keywords="lynx Lynx"

This generates a list that looks like this:

text,url
 write to lynx-dev@nongn,https://lynx.browser.org
d by  [email protected],https://lynx.browser.org

Security

If you believe you have found a security issue with any of our projects please email us at [email protected].
