
Search engine harvesting


If you want your site to be harvested by search engines, you will need to consider the effect this will have on server load: excessive crawling can degrade search performance for all users. It is a good idea to create a sitemap that tells crawlers which pages you want harvested. The SitemapGenerator gem can be used to generate a sitemap periodically and ping search engines to trigger new harvests. You can see an implementation that creates links to all relevant documents within Solr for the Danish Research database. It is a good idea to trigger this via cron at times of low activity (for example, at the weekend) so that harvesting doesn't impact human users.
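As a rough illustration, here is a minimal sketch of a `config/sitemap.rb` for the SitemapGenerator gem that pulls document IDs from Solr and adds a sitemap entry for each catalog page. The host name, Solr URL, core name, `id` field, and the `/catalog/:id` route are assumptions and will differ per installation.

```ruby
# config/sitemap.rb -- minimal sketch; host, Solr URL and routes are assumptions
require 'rsolr'

SitemapGenerator::Sitemap.default_host = 'https://example.org' # assumed host

SitemapGenerator::Sitemap.create do
  solr   = RSolr.connect(url: 'http://localhost:8983/solr/blacklight-core') # assumed Solr core
  cursor = '*'

  loop do
    # Page through every document with a cursor so large indexes avoid deep paging.
    # Assumes 'id' is the uniqueKey field (cursorMark requires sorting on it).
    response = solr.get('select', params: {
      q: '*:*', fl: 'id', rows: 1000, sort: 'id asc', cursorMark: cursor
    })

    response['response']['docs'].each do |doc|
      add "/catalog/#{doc['id']}", changefreq: 'monthly'
    end

    next_cursor = response['nextCursorMark']
    break if next_cursor == cursor # cursor stops advancing once all docs are seen
    cursor = next_cursor
  end
end
```

Running the gem's `sitemap:refresh` rake task regenerates the sitemap and pings the search engines, so one way to schedule it is a weekly cron entry such as `0 3 * * 0 cd /path/to/app && bundle exec rake sitemap:refresh` (paths assumed), which keeps harvesting away from peak hours.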

If you expose a sitemap with all the pages you do want harvested, it is also a good idea to tell crawlers which pages you do not want harvested. Some crawlers will construct URLs for search results pages, leading to a potentially infinite number of crawl targets. You should therefore include a robots.txt file that disallows search results pages. Here is an example:

```
# robots.txt
User-agent: *
Disallow: /catalog? # blocks search results pages
Disallow: /catalog/facet # blocks facet pages
```
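The robots.txt file can also advertise the sitemap location via the standard `Sitemap` directive, so crawlers can find it even without being pinged. The URL below is an assumption (SitemapGenerator writes `sitemap.xml.gz` by default):

```
Sitemap: https://example.org/sitemap.xml.gz
```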