MetaInspector

MetaInspector is a gem for web scraping. You give it a URL, and it lets you easily get the page's title, links, and meta tags.

Installation

Install the gem from RubyGems:

gem install metainspector

This gem is tested on Ruby versions 1.8.7, 1.9.2 and 1.9.3.

Usage

Initialize a scraper instance for a URL, like this:

page = MetaInspector::Scraper.new('http://pagerankalert.com')

or, for short, a convenience alias is also available:

page = MetaInspector.new('http://pagerankalert.com')

If you don’t include the scheme in the URL, http:// will be used by default:

page = MetaInspector.new('pagerankalert.com')
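
The scheme defaulting described above can be pictured with a small standalone sketch. The helper name with_default_scheme is hypothetical and is not part of MetaInspector's API; it only illustrates the idea of prepending http:// when no scheme is present.

```ruby
require 'uri'

# Hypothetical helper, not MetaInspector's own code: return the address
# unchanged if it already starts with a scheme, otherwise prepend one.
def with_default_scheme(address, default = 'http')
  return address if address.match?(%r{\A[a-z][a-z0-9+.\-]*://}i)
  "#{default}://#{address}"
end

url = with_default_scheme('pagerankalert.com')
puts url                    # the address with http:// prepended
puts URI.parse(url).scheme  # the scheme as parsed by the standard library
```

An address that already carries a scheme, such as an https:// URL, passes through untouched.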

Then you can see the scraped data like this:

page.url                # URL of the page
page.scheme             # Scheme of the page (http, https)
page.title              # title of the page, as string
page.links              # array of strings, with every link found on the page
page.absolute_links     # array of all the links converted to absolute urls
page.meta_description   # meta description, as string
page.description        # returns the meta description, or the first long paragraph if no meta description is found
page.meta_keywords      # meta keywords, as string
page.image              # Most relevant image, if defined with og:image
page.images             # array of strings, with every img found on the page
page.absolute_images    # array of all the images converted to absolute urls
page.feed               # Get rss or atom links in meta data fields as array
page.meta_og_title      # opengraph title
page.meta_og_image      # opengraph image
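
As a rough illustration of what converting links to absolute URLs involves, here is a minimal sketch using Ruby's standard uri library. The absolutify helper and the sample links are assumptions made for this example; the gem's own implementation may differ.

```ruby
require 'uri'

# Hypothetical helper: resolve every link against the URL of the page
# it was found on. Links that are already absolute are left as they are.
def absolutify(page_url, links)
  links.map { |link| URI.join(page_url, link).to_s }
end

links = ['/about', 'contact.html', 'http://example.com/faq']
puts absolutify('http://pagerankalert.com/', links)
```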

MetaInspector uses dynamic methods for meta tag discovery, so all of the following will work. Each call is converted into a search for a meta tag with the corresponding name and returns its content attribute:

page.meta_description       # <meta name="description" content="..." />
page.meta_keywords          # <meta name="keywords" content="..." />
page.meta_robots            # <meta name="robots" content="..." />
page.meta_generator         # <meta name="generator" content="..." />

It also works for meta tags of the form <meta http-equiv="name" ... />, like the following:

page.meta_content_language  # <meta http-equiv="content-language" content="..." />
page.meta_Content_Type      # <meta http-equiv="Content-Type" content="..." />

Please note that MetaInspector is case sensitive, so page.meta_Content_Type is not the same as page.meta_content_type.
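
To make the dynamic lookup and its case sensitivity concrete, here is a minimal sketch built on method_missing. MetaTagReader is an illustrative stand-in and not MetaInspector's real implementation: it reads from a plain hash instead of a parsed page, and the mapping of underscores in the method name to hyphens in the tag name is an assumption of this sketch.

```ruby
# Illustrative stand-in class; not part of the gem.
class MetaTagReader
  # tags: hash mapping a meta tag's name (or http-equiv value) to its content
  def initialize(tags)
    @tags = tags
  end

  # Any call of the form meta_xxx is treated as a lookup of the tag "xxx".
  def method_missing(name, *args)
    key = name.to_s[/\Ameta_(.+)\z/, 1]
    return super unless key && args.empty?
    # Exact, case-sensitive match; also try hyphens in place of underscores
    # (an assumption of this sketch). Returns nil when no tag matches.
    @tags[key] || @tags[key.tr('_', '-')]
  end

  def respond_to_missing?(name, include_private = false)
    name.to_s.start_with?('meta_') || super
  end
end

reader = MetaTagReader.new(
  'description'  => 'Track your PageRank changes',
  'Content-Type' => 'text/html; charset=utf-8'
)

p reader.meta_description
p reader.meta_Content_Type
p reader.meta_content_type  # different case, so no tag is found
```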

You can also access most of the scraped data as a hash:

page.to_hash               # { "url"=>"http://pagerankalert.com", "title" => "PageRankAlert.com", ... }

The full scraped document is accessible from:

page.document          # source of the page, as a string
page.parsed_document   # Nokogiri document that you can use to get any element from the page

Examples

You can find some sample scripts in the samples folder, including a basic scraper and a spider that follows external links using a queue. What follows is an example of use from irb:

$ irb
>> require 'metainspector'
=> true

>> page = MetaInspector.new('http://pagerankalert.com')
=> #<MetaInspector:0x11330c0 @url="http://pagerankalert.com">

>> page.title
=> "PageRankAlert.com :: Track your PageRank changes"

>> page.meta_description
=> "Track your PageRank(TM) changes and receive alerts by email"

>> page.meta_keywords
=> "pagerank, seo, optimization, google"

>> page.links.size
=> 8

>> page.links[5]
=> "http://pagerankalert.posterous.com"

>> page.document.class
=> String

>> page.parsed_document.class
=> Nokogiri::HTML::Document
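
The queue-based spider mentioned above can be sketched roughly as follows. This is not the script from the samples folder: the crawl method, the fetch_links callable, and the canned link graph are all assumptions made so that the example runs without network access. With the gem loaded, the callable could be something like ->(url) { MetaInspector.new(url).absolute_links }.

```ruby
require 'set'

# Breadth-first crawl driven by a queue. `fetch_links` is a hypothetical
# callable that returns the links found on a given URL.
def crawl(start_url, fetch_links, limit = 10)
  queue   = [start_url]
  visited = Set.new

  until queue.empty? || visited.size >= limit
    url = queue.shift
    next if visited.include?(url)

    visited << url
    fetch_links.call(url).each do |link|
      queue << link unless visited.include?(link)
    end
  end

  visited.to_a
end

# Canned link graph standing in for real pages, so the sketch runs offline.
graph = {
  'http://a.example' => ['http://b.example', 'http://c.example'],
  'http://b.example' => ['http://a.example', 'http://d.example'],
  'http://c.example' => [],
  'http://d.example' => []
}

order = crawl('http://a.example', ->(url) { graph.fetch(url, []) })
puts order
```

The limit argument caps how many pages are visited, which keeps a crawl of a real site from running indefinitely.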

ZOMG Fork! Thank you!

You’re welcome to fork this project and send pull requests. Special thanks to:

Ryan Romanchuk github.com/rromanchuk
Edmund Haselwanter github.com/ehaselwanter
Jonathan Hernández github.com/ionmx

To Do

Get page.base_dir from the URL
Distinguish between external and internal links, returning page.links for all of them as found, page.external_links and page.internal_links converted to absolute URLs
Be able to set a timeout in seconds
If keywords seem to be separated by blank spaces, replace them with commas
Mocks
Check content type, process only HTML pages, don’t try to scrape TAR files like ftp.ruby-lang.org/pub/ruby/ruby-1.9.1-p129.tar.bz2 or video files like isabel.dit.upm.es/component/option,com_docman/task,doc_download/gid,831/Itemid,74/
Autodiscover all available meta tags
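
As one possible approach to the keywords item above, the following sketch replaces blank-space separators with commas. The normalize_keywords helper is hypothetical and not implemented in the gem; it assumes that a keywords string containing no comma at all is space-separated.

```ruby
# Hypothetical helper: if the keywords contain no comma, assume they are
# separated by blank spaces and join them with commas instead.
def normalize_keywords(keywords)
  return keywords if keywords.include?(',')
  keywords.split.join(', ')
end

puts normalize_keywords('pagerank seo optimization google')
puts normalize_keywords('pagerank, seo, optimization, google')
```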
