Building a web crawler

I've been building a web crawler in Ruby, but decided I didn't like how resource-intensive it was, so I switched to Crystal. Anyway, I was using HTTParty and Nokogiri for it, and was wondering if there is anything similar in Crystal?

There sure is, and both are in the standard library so you don’t even need to install anything (other than libxml2 if your system doesn’t already have it installed).

HTTP::Client works a lot like HTTParty:

require "http"
puts HTTP::Client.get("https://crystal-lang.org/").body

You can also instantiate the client to avoid creating a new TCP connection for every request, just like you can with HTTParty. Docs
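
For instance, something along these lines should reuse a single connection across requests (the host and paths here are just placeholders):

require "http"

# One client instance keeps one TCP (and TLS) connection open to the host
client = HTTP::Client.new("crystal-lang.org", tls: true)
puts client.get("/").status_code
puts client.get("/docs/").status_code
client.close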

And then for parsing the HTML bodies, you can use XML.parse_html:

require "http"
require "xml"

html = HTTP::Client.get(url).body
parsed = XML.parse_html(html)
parsed.xpath_nodes("//a").each do |link|
  puts link["href"]?
end
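
One wrinkle for a crawler: hrefs are often relative ("/docs" rather than a full URL). The standard library's URI can resolve them against the page you fetched. A rough sketch, with the base URL again just an example:

require "http"
require "uri"
require "xml"

base = URI.parse("https://crystal-lang.org/")
html = HTTP::Client.get(base).body
# //a[@href] skips anchors that have no href attribute at all
XML.parse_html(html).xpath_nodes("//a[@href]").each do |link|
  puts base.resolve(link["href"]) # turns "/docs" into "https://crystal-lang.org/docs"
end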

Before you sink a bunch of time into this, check out GitHub - watzon/arachnid: Powerful web scraping framework for Crystal and see if it would suit your needs.

EDIT: NVM, apparently it’s broken…

/cc @watzon

Dang :joy:, that would’ve made my life easier.

Based on Incompatible with crystal v1.1.1 · Issue #8 · watzon/arachnid · GitHub, it just needs some updates for post-1.0 Crystal. Someone could make a PR to touch those things up and it should be good to go.

I really need to fix arachnid. It's a pretty good crawler; I just got too ambitious with the Marionette integration and ended up making it a lot harder to work with. I need to take it back to the drawing board and simplify the architecture.

Back when I wrote GitHub - bcardiff/tasko.cr, the main use case was powering a web crawler. But that crawler lives in a private repo and isn't in the samples.
