Building a web crawler

I've been building a web crawler in Ruby, but decided I didn't like how resource-intensive it was, so I switched to Crystal. Anyway, I was using HTTParty and Nokogiri for it, and was wondering if there is anything similar in Crystal?

There sure is, and both are in the standard library so you don’t even need to install anything (other than libxml2 if your system doesn’t already have it installed).

HTTP::Client works a lot like HTTParty:

require "http"
puts HTTP::Client.get("https://crystal-lang.org/").body

You can also instantiate the client to avoid creating a new TCP connection for every request, just like you can with HTTParty. Docs
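
For instance, something along these lines should reuse a single connection across requests (the host and paths here are just placeholders):

require "http"

# One client instance keeps one TCP (and TLS) connection open to the host
client = HTTP::Client.new("crystal-lang.org", tls: true)
puts client.get("/").status_code
puts client.get("/docs/").status_code
client.close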

And then for parsing the HTML bodies, you can use XML.parse_html:

require "http"
require "xml"

html = HTTP::Client.get(url).body
parsed = XML.parse_html(html)
parsed.xpath_nodes("//a").each do |link|
  puts link["href"]?
end
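
One wrinkle for a crawler: hrefs are often relative ("/docs" rather than a full URL). The standard library's URI can resolve them against the page you fetched. A rough sketch, with the base URL again just an example:

require "http"
require "uri"
require "xml"

base = URI.parse("https://crystal-lang.org/")
html = HTTP::Client.get(base).body
# //a[@href] skips anchors that have no href attribute at all
XML.parse_html(html).xpath_nodes("//a[@href]").each do |link|
  puts base.resolve(link["href"]) # turns "/docs" into "https://crystal-lang.org/docs"
end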

Before you sink a bunch of time into this, check out GitHub - watzon/arachnid: Powerful web scraping framework for Crystal and see if it would suit your needs.

EDIT: NVM, apparently it’s broken…

/cc @watzon

Dang :joy:, that would’ve made my life easier.

Based on Incompatible with crystal v1.1.1 · Issue #8 · watzon/arachnid · GitHub, it just needs some updates for post-1.0 Crystal. Someone could make a PR to touch those things up and it should be good to go.

I really need to fix arachnid. It's a pretty good crawler; I just got too ambitious with the Marionette integration and ended up making it a lot harder to work with. I need to take it back to the drawing board and simplify the architecture.

Back when I wrote GitHub - bcardiff/tasko.cr, the main use case was powering a web crawler. But that crawler lives in a private repo and isn't in the samples.
