Using anemone in Ruby

What is anemone?

Sample usage

  1. Require the library from your Ruby script.
  2. Provide a URL (or domain) and anemone will follow links and run your logic.
require 'anemone'

# Simple crawler that returns URLs
class Crawl
  # Only traverse one level deep
  def find_url(domain)
    urls = []
    Anemone.crawl(domain, depth_limit: 1) do |anemone|
      anemone.on_every_page do |page|
        urls.push(page.url)
      end
    end
    urls
  end
end

The method above receives a domain, crawls one level, and returns the discovered URLs as an array.

Options

You can pass options to Anemone.crawl:

Option              Purpose
depth_limit         Number of levels to traverse
delay               Wait time (seconds) between requests
skip_query_strings  Ignore query strings when true
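
As a rough sketch of how these options might be combined, the snippet below collects them in a hash and hands that hash to Anemone.crawl. The constant CRAWL_OPTIONS, the method crawl_urls, and the specific values are hypothetical names chosen for illustration; they are not part of anemone itself.

```ruby
# Hypothetical option values; tune them for the site being crawled.
CRAWL_OPTIONS = {
  depth_limit: 2,           # follow links up to two levels from the start URL
  delay: 1,                 # wait one second between requests
  skip_query_strings: true  # ignore the query string part of URLs
}.freeze

# Collects the URL of every crawled page, using the options above by default.
# crawl_urls is an illustrative helper, not an anemone API.
def crawl_urls(start_url, options = CRAWL_OPTIONS)
  # Loaded here so the options hash can be inspected without the gem installed.
  require 'anemone'

  urls = []
  Anemone.crawl(start_url, options) do |anemone|
    anemone.on_every_page { |page| urls << page.url }
  end
  urls
end
```

Keeping the options in one hash makes it easy to reuse the same settings across several crawls, or to override a single value with CRAWL_OPTIONS.merge(depth_limit: 1).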

Other helpers:

See the GitHub README for details.

Crawling behind a proxy

This tripped me up. When you set proxy options, the values differ from what Nokogiri expects.

Option      Value
proxy_host  Only the proxy host name
proxy_port  Port number as a string

Unlike Nokogiri, where you pass the full URL (http://xxx.co.jp:80), anemone expects only the host portion (xxx.co.jp). Including the protocol (e.g., http://) breaks the proxy configuration; as the option name “host” suggests, only the host name belongs there.
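
If you already have the proxy as a full URL, one way to avoid this mistake is to split it with Ruby's standard URI library before handing it to anemone. The helper name anemone_proxy_options below is hypothetical, and returning the port as a string simply follows the note in the table above; treat this as a minimal sketch rather than the definitive way to configure a proxy.

```ruby
require 'uri'

# Splits a full proxy URL into separate host and port values.
# anemone_proxy_options is an illustrative helper, not an anemone API.
def anemone_proxy_options(proxy_url)
  uri = URI.parse(proxy_url)
  {
    proxy_host: uri.host,      # host name only, without the scheme
    proxy_port: uri.port.to_s  # port as a string, per the table above
  }
end

options = anemone_proxy_options('http://xxx.co.jp:80')
# options[:proxy_host] is "xxx.co.jp"
# options[:proxy_port] is "80"
```

The resulting hash can be merged with other options, for example { depth_limit: 1 }.merge(options), and passed as the second argument to Anemone.crawl.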
