Using anemone in Ruby
- #Ruby
- #Tips
- #Know-how
- 2018/09/04
What is anemone?
- anemone is a Ruby crawling framework.
- Install it with `gem install anemone`.
Sample usage
- Require the library from your Ruby script.
- Provide a URL (or domain) and anemone will follow links and run your logic.
```ruby
require 'anemone'

# Simple crawler that returns URLs
class Crawl
  # Only traverse one level deep
  def find_url(domain)
    urls = []
    Anemone.crawl(domain, depth_limit: 1) do |anemone|
      anemone.on_every_page do |page|
        urls.push(page.url)
      end
    end
    urls
  end
end
```
The method above receives a domain, crawls one level, and returns the discovered URLs as an array.
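For example, assuming the class above is loaded and the target site is reachable, you could call it like this (the URL is a placeholder):

```ruby
crawler = Crawl.new
urls = crawler.find_url('http://www.example.com')  # placeholder domain
urls.each { |url| puts url }
```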
Options
You can pass options to Anemone.crawl:
| Option | Purpose |
|---|---|
| depth_limit | Number of levels to traverse |
| delay | Wait time (seconds) between requests |
| skip_query_strings | Ignore query strings when true |
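As a rough sketch, a polite crawl two levels deep that ignores query strings might be configured like this (the starting URL is a placeholder):

```ruby
require 'anemone'

Anemone.crawl('http://www.example.com',  # placeholder starting URL
              depth_limit: 2,            # follow links two levels deep
              delay: 1,                  # wait one second between requests
              skip_query_strings: true   # ignore query strings on visited URLs
             ) do |anemone|
  anemone.on_every_page { |page| puts page.url }
end
```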
Other helpers:
- `on_every_page` lets you run logic for every visited page.
- `focus_crawl` narrows down which pages `on_every_page` receives.
- `on_pages_like` runs only on URLs that match a regex.
- `skip_links_like` excludes URLs that match a regex.
- You can fetch the URL (`url`), the raw HTML (`body`), or a Nokogiri-ready doc (`doc`); see the sketch after this list.
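A minimal sketch combining these helpers, assuming a site whose article pages live under /articles/ (the URL and paths are placeholders):

```ruby
require 'anemone'

Anemone.crawl('http://www.example.com') do |anemone|  # placeholder URL
  # Never follow links matching this pattern
  anemone.skip_links_like(/logout|calendar/)

  # Restrict the crawl itself to links under /articles/
  anemone.focus_crawl do |page|
    page.links.select { |link| link.path.to_s.start_with?('/articles/') }
  end

  # Run this block only for URLs that match the regex
  anemone.on_pages_like(%r{/articles/}) do |page|
    title = page.doc&.at_css('title')&.text  # page.doc is a Nokogiri document
    puts "#{page.url}: #{title}"
  end
end
```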
See the GitHub README for details.
Crawling behind a proxy
This tripped me up. When you set proxy options, the values differ from what Nokogiri expects.
| Option | Value |
|---|---|
| proxy_host | Only the proxy host name |
| proxy_port | Port number as a string |
Unlike Nokogiri, where you pass `http://xxx.co.jp:80`, anemone expects just the host portion (`xxx.co.jp`). Including the protocol (e.g., `http://`) breaks the proxy configuration even though the option is called "host."
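Putting that together, a crawl through a proxy might look like this (the proxy host, port, and starting URL are placeholders):

```ruby
require 'anemone'

Anemone.crawl('http://www.example.com',               # placeholder starting URL
              proxy_host: 'proxy.example.co.jp',      # host name only, no "http://"
              proxy_port: '8080'                      # port number as a string
             ) do |anemone|
  anemone.on_every_page { |page| puts page.url }
end
```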