
stephenlstewart/WebCrawler

WebCrawler

This package turns Nuix into a web crawler. It leverages a third-party web crawler, Ache, that can dump the crawled pages out to a Kafka topic. Nuix then subscribes to the Kafka topic using the Nuix RealTime capability. The WSS.Web.Scraper.py WSS pulls the HTML page out of a metadata field and converts it into a child item.

It is pretty cool and highlights how easily you can build new use cases with Nuix and Kafka.
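
The metadata-to-child-item step described above can be sketched in plain Python. This is only an illustration of the idea, not the contents of WSS.Web.Scraper.py and not the Nuix API: the field names `html` and `url`, the function `split_child_item`, and the shape of the child dictionary are all assumptions made for the example.

```python
import json

# Hypothetical field name: the actual metadata field that carries the page
# is not documented in this README.
HTML_FIELD = "html"


def split_child_item(record_json, html_field=HTML_FIELD):
    """Split one crawled record into (parent_metadata, child_item).

    The HTML is removed from the parent's metadata and returned as a
    separate child item. If the record carries no HTML, child is None.
    """
    record = json.loads(record_json)
    html = record.pop(html_field, None)
    if html is None:
        return record, None
    child = {
        # "url" is also an assumed field, used here only to name the child.
        "name": record.get("url", "page") + ".html",
        "mime_type": "text/html",
        "data": html.encode("utf-8"),
    }
    return record, child
```

For example, `split_child_item('{"url": "http://example.com", "html": "<p>hi</p>"}')` returns the metadata without the `html` field, plus a child item whose `data` holds the page bytes.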

Docker Image - Web Crawler

https://github.com/VIDA-NYU/ache

docker run -p 8080:8080 vidanyu/ache:latest

Kafka Settings for Nuix

Zookeeper - 127.0.0.1:2181
Bootstrap.servers - 127.0.0.1:9092
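
Before pointing Nuix at these endpoints, it can help to confirm that both ports are actually open. The following is a minimal sketch using only the Python standard library. The helper names `parse_endpoint` and `is_reachable` are made up for this example, and a successful TCP connection only shows that something is listening, not that it is a healthy Zookeeper or Kafka broker.

```python
import socket

# The two settings listed above, as host:port strings.
KAFKA_SETTINGS = {
    "zookeeper": "127.0.0.1:2181",
    "bootstrap.servers": "127.0.0.1:9092",
}


def parse_endpoint(value):
    """Split 'host:port' into (host, port) with the port as an int."""
    host, _, port = value.rpartition(":")
    return host, int(port)


def is_reachable(value, timeout=2.0):
    """Return True if a TCP connection to the endpoint can be opened."""
    host, port = parse_endpoint(value)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, value in KAFKA_SETTINGS.items():
        status = "reachable" if is_reachable(value) else "unreachable"
        print(name, value, status)
```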
