
Search engine harvesting


If you want your site to be harvested by search engines, you will need to consider the effect this will have on server load: excessive crawling can degrade search performance for all users. It is a good idea to create a sitemap that tells crawlers which pages you want harvested. The SitemapGenerator gem can be used to generate a sitemap periodically and ping search engines to trigger new harvests. You can see an implementation that creates links to all relevant documents within Solr for the Danish Research database. It is a good idea to trigger this via cron at times of low activity (for example, at the weekend) so that harvesting doesn't impact human users.
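As a rough illustration, here is a minimal sketch of a `config/sitemap.rb` for the SitemapGenerator gem that pulls document IDs from Solr and adds a sitemap entry for each catalog page. The host name, Solr URL, core name, `id` field, and the `/catalog/:id` route are assumptions and will differ per installation.

```ruby
# config/sitemap.rb -- minimal sketch; host, Solr URL and routes are assumptions
require 'rsolr'

SitemapGenerator::Sitemap.default_host = 'https://example.org' # assumed host

SitemapGenerator::Sitemap.create do
  solr   = RSolr.connect(url: 'http://localhost:8983/solr/blacklight-core') # assumed Solr core
  cursor = '*'

  loop do
    # Page through every document with a cursor so large indexes avoid deep paging.
    # Assumes 'id' is the uniqueKey field (cursorMark requires sorting on it).
    response = solr.get('select', params: {
      q: '*:*', fl: 'id', rows: 1000, sort: 'id asc', cursorMark: cursor
    })

    response['response']['docs'].each do |doc|
      add "/catalog/#{doc['id']}", changefreq: 'monthly'
    end

    next_cursor = response['nextCursorMark']
    break if next_cursor == cursor # cursor stops advancing once all docs are seen
    cursor = next_cursor
  end
end
```

Running the gem's `sitemap:refresh` rake task regenerates the sitemap and pings the search engines, so one way to schedule it is a weekly cron entry such as `0 3 * * 0 cd /path/to/app && bundle exec rake sitemap:refresh` (paths assumed), which keeps harvesting away from peak hours.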

If you expose a sitemap with all the pages you do want harvested, it is also a good idea to tell crawlers which pages you do not want harvested. Some crawlers will construct URLs for search results pages, leading to a potentially infinite number of crawl targets. You should therefore include a robots.txt file that disallows search results pages. Here is an example:

```
# robots.txt
User-agent: *
Disallow: /catalog? # blocks search results pages
Disallow: /catalog/facet # blocks facet pages
```
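The robots.txt file can also advertise the sitemap location via the standard `Sitemap` directive, so crawlers can find it even without being pinged. The URL below is an assumption (SitemapGenerator writes `sitemap.xml.gz` by default):

```
Sitemap: https://example.org/sitemap.xml.gz
```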