
A web crawler that finds which pages of a website contain a given search term and generates a report.


Frojd/Gravling


Grävling

Grävling is a web crawler that lets you search for terms across a website and saves the matches as a report in a format of your choosing.

Features

  • Search through human readable text
  • Search page markup
  • Supports crawling either by sitemap or by following links
  • Get reports as CSV, JSON, or XML

Requirements

  • Python 3.6+
  • pip

Install

pip install -r requirements.txt

Usage

Gather URLs for pages that contain a certain text by crawling a sitemap

This command scrapes every page listed in the sitemap and returns each matching text excerpt along with the URL where it was found. By default, only the human-readable text is searched.

cd gravling
scrapy crawl sitemap -o matches.csv -a url=https://client.test/sitemap.xml -a keywords="Lynx"

This generates a list that looks like this:

text,url
Tomorrow. Lynx browser is,https://example.com/about
 is easy. Lynx is quick a,https://example.com/installation
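
A CSV report like the one above can be post-processed with Python's standard library. The following is a minimal sketch, assuming the report has exactly the `text` and `url` columns shown; the function name `group_matches_by_url` is hypothetical and not part of Grävling.

```python
import csv
from collections import defaultdict


def group_matches_by_url(report_file):
    """Group the excerpts of a CSV report by the URL they were found on.

    `report_file` is any open text file (or file-like object) whose first
    row is the header `text,url`, as in the sample report.
    """
    matches = defaultdict(list)
    for row in csv.DictReader(report_file):
        matches[row["url"]].append(row["text"])
    return dict(matches)


# Example usage (assumes matches.csv was produced by the crawl command):
#
#     with open("matches.csv", newline="", encoding="utf-8") as report:
#         for url, excerpts in group_matches_by_url(report).items():
#             print(url, len(excerpts))
```

Grouping by URL makes it easy to see which pages contain the term more than once, since the report holds one row per match rather than one row per page.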

Gather URLs for pages whose markup contains a certain text by crawling a sitemap

This command searches only the page source markup and returns any matches it finds.

cd gravling
scrapy crawl sitemap -o matches.csv -a url=https://client.test/sitemap.xml -a keywords="Lynx" -a search_html=1 -a search_text=0

This generates a list that looks like this:

text,url
orrow. <b>Lynx</b> brows,https://example.com/credits
. <strong>Lynx</strong> ,https://example.com/unix
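
The excerpts in these reports look like short windows of characters around each match. The sketch below illustrates that idea; it is not taken from Grävling's source. The function name `find_excerpts`, the 10-character window, the case-sensitive matching, and the whitespace-separated keyword list are all assumptions made for illustration.

```python
import re


def find_excerpts(content, keywords, context=10):
    """Return a short excerpt around every occurrence of each keyword.

    `content` may be human-readable text or raw markup. `keywords` is a
    whitespace-separated string, and matching is case-sensitive (both are
    assumptions, not confirmed Grävling behaviour).
    """
    excerpts = []
    for keyword in keywords.split():
        # re.escape ensures the keyword is matched literally, not as a pattern.
        for match in re.finditer(re.escape(keyword), content):
            start = max(match.start() - context, 0)
            end = min(match.end() + context, len(content))
            excerpts.append(content[start:end])
    return excerpts


markup = "See you tomorrow. <b>Lynx</b> browser"
print(find_excerpts(markup, "Lynx"))  # ['orrow. <b>Lynx</b> brows']
```

The same function works for both search modes because it only sees a string: pass the extracted text for a text search, or the page source for a markup search.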

Gather URLs for pages that contain a certain text by following links

cd gravling
scrapy crawl website -o matches.csv -a domain=lynx.browser.org -a keywords="lynx Lynx"

This generates a list that looks like this:

text,url
 write to lynx-dev@nongn,https://lynx.browser.org
d by  [email protected],https://lynx.browser.org

Security

If you believe you have found a security issue with any of our projects, please email us at [email protected].
