We have open-sourced a library for web crawling called Grell. The main advantages of Grell over other tools are:

- It is written in Ruby, so it is very easy to use from a Rails application or another gem.
- It uses PhantomJS under the hood, which allows it to execute the JavaScript of the pages being crawled. This lets us, for instance, follow links added by Ajax requests, or elements which are not links but become clickable through JavaScript.
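Since Grell is distributed as a gem, pulling it into a project is a one-line change; a minimal sketch, assuming the gem is published under the name 'grell':

# Gemfile: add the crawler (assumes the gem is published as 'grell')
gem 'grell'

After that, a plain bundle install makes the Grell::Crawler class available.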

Crawling can be used to gather information about all the pages your application creates. For instance, you can visit every page to make sure none of them returns an error.

Grell is very easy to use: you only need to provide the start page, and Grell will yield each crawled page to your code. In this example we print out the pages returning a 404 error; this code will not stop until it has crawled every page reachable from Google:

require 'grell'

crawler = Grell::Crawler.new
crawler.start_crawling('http://www.google.com') do |page|
  # Grell will keep yielding to this block with each unique page it finds
  if page.status == 404
    puts "#{page.url} has status: #{page.status}, headers: #{page.headers}, body: #{page.body}"
    puts "page id and parent_id: #{page.id}, #{page.parent_id}"
  end
end

Grell adds two IDs to each page: ‘id’ and ‘parent_id’. With this information it is very easy to re-create the crawling tree, for instance to build visualizations. Grell also supports filtering with whitelists and blacklists, custom filtering, retries, and more. A sketch of rebuilding the tree from these IDs follows below.
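As an illustration of what those IDs enable, the snippet below collects every page Grell yields and then groups children under their parents to rebuild the crawl tree. The pages array and the print_tree helper are our own example code, not part of Grell's API, and we assume here that the root page has a nil parent_id.

require 'grell'

# Collect every page Grell yields so we can rebuild the tree afterwards.
pages = []
crawler = Grell::Crawler.new
crawler.start_crawling('http://www.example.com') do |page|
  pages << page
end

# Index children by their parent's id.
children = pages.group_by(&:parent_id)

# Print the crawl tree, indenting each level (helper defined only for this example).
def print_tree(page, children, depth = 0)
  puts "#{'  ' * depth}#{page.url}"
  (children[page.id] || []).each { |child| print_tree(child, children, depth + 1) }
end

# Assumption: the start page is the only one without a parent_id.
root = pages.find { |page| page.parent_id.nil? }
print_tree(root, children) if root

The same parent/child index could just as easily be exported to a graph library for visualization instead of being printed.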

For more information on all the possibilities of Grell, please visit the Grell repository and read the thorough documentation.

Analytics