Skip to content

L3S/nutch-injector

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

33 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Nutch Injector

This project provides a simple way to add new seed URLs to a Nutch crawl, bypassing the standard InjectorJob. Additionally, it adds the possibility to store redirections in the crawl DB.

The original use case is to crawl links found in a Twitter stream using Nutch. In this scenario, we continously get new URLs, which might have been crawled earlier already. Additionally, all links in a tweet are available as a shortened link (t.co/...) and the original link. We want to insert this relation into the crawl DB to capture the relation between the two URLs and to avoid re-crawling the same URL.

Version compatibility

Only Nutch 2.x is supported. Version 0.1 is compatible with Nutch versions <= 2.2.1. Version 0.2 supports Nutch >= 2.3.

Usage

Include the module through Maven:

	<dependencies>
	  <dependency>
	    <groupId>de.l3s.icrawl</groupId>
	    <artifactId>nutch-injector</artifactId>
	    <version>0.2</version>
	  </dependency>
	</dependencies>
	
	<repositories>
	  <repository>
	    <id>icrawl-releases</id>
	    <url>http://maven.l3s.uni-hannover.de:8088/nexus/content/repositories/icrawl_release/</url>
	  </repository>
	</repositories>

and use it in your Java code:

	Injector injector = new Injector(conf[, crawlId]);
	
	injector.inject("http://www.l3s.de/");
	injector.addRedirect("http://t.co/ZNyOoEwAwN", "http://www.l3s.de/");
	
	Map<String, String> metadata = new HashMap<>();
	metadata.add("source", "#l3s");
	injector.inject("http://www.l3s.de/", metadata);

License

This code can be used under the Apache License Version 2.0 (see http://www.apache.org/licenses/).

About

Utility methods to inject seed URLs into a Nutch crawl

Resources

Stars

Watchers

Forks

Packages

No packages published

Contributors 2

  •  
  •  

Languages