Nutch Injector

This project provides a simple way to add new seed URLs to a Nutch crawl, bypassing the standard InjectorJob. It also makes it possible to store redirects in the crawl DB.

The original use case is to crawl links found in a Twitter stream using Nutch. In this scenario, we continuously receive new URLs, some of which may already have been crawled. Additionally, every link in a tweet is available both as a shortened link (t.co/...) and as the original link. We want to store this relation in the crawl DB to capture the connection between the two URLs and to avoid re-crawling the same URL.

Version compatibility

Only Nutch 2.x is supported. Version 0.1 is compatible with Nutch versions <= 2.2.1. Version 0.2 supports Nutch >= 2.3.

Usage

Include the module through Maven:

	<dependencies>
	  <dependency>
	    <groupId>de.l3s.icrawl</groupId>
	    <artifactId>nutch-injector</artifactId>
	    <version>0.2</version>
	  </dependency>
	</dependencies>
	
	<repositories>
	  <repository>
	    <id>icrawl-releases</id>
	    <url>http://maven.l3s.uni-hannover.de:8088/nexus/content/repositories/icrawl_release/</url>
	  </repository>
	</repositories>

and use it in your Java code:

	// The crawlId argument is optional: new Injector(conf, crawlId)
	Injector injector = new Injector(conf);
	
	injector.inject("http://www.l3s.de/");
	injector.addRedirect("http://t.co/ZNyOoEwAwN", "http://www.l3s.de/");
	
	Map<String, String> metadata = new HashMap<>();
	metadata.put("source", "#l3s");
	injector.inject("http://www.l3s.de/", metadata);
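
For the Twitter use case described earlier, the calls above could be combined as in the following sketch. It is only an illustration under stated assumptions: `SeedSink`, `RecordingSink`, and `TweetLinkFeeder` are hypothetical names that are not part of this project, the interface merely mirrors the `inject` and `addRedirect` calls shown above, and the in-memory duplicate check is a design choice of the sketch rather than something the module is known to do.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TweetLinkInjection {

    /** Hypothetical stand-in that mirrors the Injector calls from the usage example. */
    interface SeedSink {
        void inject(String url, Map<String, String> metadata);
        void addRedirect(String fromUrl, String toUrl);
    }

    /** In-memory sink that only records calls, so the sketch runs without Nutch. */
    static class RecordingSink implements SeedSink {
        final List<String> log = new ArrayList<>();

        @Override
        public void inject(String url, Map<String, String> metadata) {
            log.add("inject " + url + " " + metadata);
        }

        @Override
        public void addRedirect(String fromUrl, String toUrl) {
            log.add("redirect " + fromUrl + " -> " + toUrl);
        }
    }

    /** Feeds (shortened, original) link pairs from tweets into a sink. */
    static class TweetLinkFeeder {
        private final SeedSink sink;
        // Assumption of this sketch: remember what was already sent in this session.
        private final Set<String> injected = new HashSet<>();
        private final Set<String> redirected = new HashSet<>();

        TweetLinkFeeder(SeedSink sink) {
            this.sink = sink;
        }

        void onLink(String shortUrl, String originalUrl, String source) {
            // Inject the original URL once, tagged with where it came from.
            if (injected.add(originalUrl)) {
                Map<String, String> metadata = new HashMap<>();
                metadata.put("source", source);
                sink.inject(originalUrl, metadata);
            }
            // Record the t.co -> original relation once per shortened URL.
            if (redirected.add(shortUrl)) {
                sink.addRedirect(shortUrl, originalUrl);
            }
        }
    }

    public static void main(String[] args) {
        RecordingSink sink = new RecordingSink();
        TweetLinkFeeder feeder = new TweetLinkFeeder(sink);
        feeder.onLink("http://t.co/ZNyOoEwAwN", "http://www.l3s.de/", "#l3s");
        // The same pair seen again does not produce further calls.
        feeder.onLink("http://t.co/ZNyOoEwAwN", "http://www.l3s.de/", "#l3s");
        sink.log.forEach(System.out::println);
    }
}
```

In a real setup, an adapter implementing `SeedSink` would presumably delegate to an `Injector` instance; that wiring is left out here because the exact method signatures are not documented in this README.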

License

This code can be used under the Apache License Version 2.0 (see http://www.apache.org/licenses/).
