Using anemone in Ruby
- #Ruby
- #Tips
- #Know-how
- 2018/09/04
What is anemone?
- anemone is a Ruby crawling framework.
- Install it with `gem install anemone`.
Sample usage
- Require the library from your Ruby script.
- Provide a URL (or domain) and anemone will follow links and run your logic.
```ruby
require 'anemone'

# Simple crawler that returns URLs
class Crawl
  # Only traverse one level deep
  def find_url(domain)
    urls = []
    Anemone.crawl(domain, depth_limit: 1) do |anemone|
      anemone.on_every_page do |page|
        urls.push(page.url)
      end
    end
    urls
  end
end
```
The method above receives a domain, crawls one level, and returns the discovered URLs as an array.
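For example, assuming the class above is loaded and the target site is reachable, you could call it like this (the URL is a placeholder):

```ruby
crawler = Crawl.new
urls = crawler.find_url('http://www.example.com')  # placeholder domain
urls.each { |url| puts url }
```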
Options
You can pass options to Anemone.crawl:
| Option | Purpose |
|---|---|
| depth_limit | Number of levels to traverse |
| delay | Wait time (seconds) between requests |
| skip_query_strings | Ignore query strings when true |
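As a rough sketch, a polite crawl two levels deep that ignores query strings might be configured like this (the starting URL is a placeholder):

```ruby
require 'anemone'

Anemone.crawl('http://www.example.com',  # placeholder starting URL
              depth_limit: 2,            # follow links two levels deep
              delay: 1,                  # wait one second between requests
              skip_query_strings: true   # ignore query strings on visited URLs
             ) do |anemone|
  anemone.on_every_page { |page| puts page.url }
end
```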
Other helpers:
- `on_every_page` lets you run logic for every visited page.
- `focus_crawl` narrows down which pages `on_every_page` receives.
- `on_pages_like` runs only on URLs that match a regex.
- `skip_links_like` excludes URLs that match a regex.
- You can fetch the URL (`url`), the raw HTML (`body`), or a Nokogiri-ready doc (`doc`); see the sketch after this list.
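A minimal sketch combining these helpers, assuming a site whose article pages live under /articles/ (the URL and paths are placeholders):

```ruby
require 'anemone'

Anemone.crawl('http://www.example.com') do |anemone|  # placeholder URL
  # Never follow links matching this pattern
  anemone.skip_links_like(/logout|calendar/)

  # Restrict the crawl itself to links under /articles/
  anemone.focus_crawl do |page|
    page.links.select { |link| link.path.to_s.start_with?('/articles/') }
  end

  # Run this block only for URLs that match the regex
  anemone.on_pages_like(%r{/articles/}) do |page|
    title = page.doc&.at_css('title')&.text  # page.doc is a Nokogiri document
    puts "#{page.url}: #{title}"
  end
end
```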
See the GitHub README for details.
Crawling behind a proxy
This tripped me up. When you set proxy options, the values differ from what Nokogiri expects.
| Option | Value |
|---|---|
| proxy_host | Only the proxy host name |
| proxy_port | Port number as a string |
Unlike Nokogiri, where you pass `http://xxx.co.jp:80`, anemone expects just the host portion (`xxx.co.jp`). Including the protocol (e.g., `http://`) breaks the proxy configuration even though the option is called "host."
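Putting that together, a crawl through a proxy might look like this (the proxy host, port, and starting URL are placeholders):

```ruby
require 'anemone'

Anemone.crawl('http://www.example.com',               # placeholder starting URL
              proxy_host: 'proxy.example.co.jp',      # host name only, no "http://"
              proxy_port: '8080'                      # port number as a string
             ) do |anemone|
  anemone.on_every_page { |page| puts page.url }
end
```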