How do I load and scrape a webpage?

uri looks like it might be useful?
E.g., read a table and get the contents of every table cell.

require "http"
require "xml"
require "uri"

# Provide the URL on the CLI
uri = URI.parse(ARGV[0])

# Fetch the HTML page
response = HTTP::Client.get(uri)
# Parse the HTML document
doc = XML.parse_html(response.body)
# Find all `td` elements that are inside of `table` elements
table_cells = doc.xpath_nodes("//table//td")

pp table_cells.map(&.text)

The string passed to xpath_nodes can be customized for the content you’re scraping. But it’s important to keep in mind that it’s XPath syntax rather than CSS selectors.
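
As a rough sketch of what customizing that string can look like, the snippet below runs a few XPath variations against an inline HTML string instead of a fetched page. The sample HTML, the `prices` id, and the `price` class are made up for illustration; only `XML.parse_html` and `xpath_nodes` come from the example above.

```crystal
require "xml"

# Hypothetical sample page, inlined so the example needs no network access.
html = <<-HTML
  <html><body>
    <table id="prices">
      <tr><th>Item</th><th>Price</th></tr>
      <tr><td>Apple</td><td class="price">1.20</td></tr>
      <tr><td>Pear</td><td class="price">0.95</td></tr>
    </table>
    <table id="other"><tr><td>ignored</td></tr></table>
  </body></html>
  HTML

doc = XML.parse_html(html)

# Only the cells inside the table whose id attribute is "prices"
price_table_cells = doc.xpath_nodes("//table[@id='prices']//td").map(&.text)

# Only the cells whose class attribute is exactly "price"
prices = doc.xpath_nodes("//td[@class='price']").map(&.text)

# The second cell of every row in that table (XPath positions are 1-based)
second_cells = doc.xpath_nodes("//table[@id='prices']//tr/td[2]").map(&.text)

pp price_table_cells, prices, second_cells
```

Note that `[@class='price']` matches the whole attribute value, so an element with several classes would need something like `contains(@class, 'price')` instead.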

If you want to use CSS selectors, you can use the kostya/lexbor shard. It’s easy enough to use, and I use it in one of my own apps.

I get this error:

/home/drhuffman12/.cache/crystal/crystal-run-web_scraper.tmp: error while loading shared libraries: libssl.so.1.1: cannot open shared object file: No such file or directory

and this doesn’t help:
sudo apt-get install libssl-dev
nor
sudo apt-get install libssl3.0-dev

Also,

libssl-dev is already the newest version (3.0.2-0ubuntu1.10).

Also:

require "lexbor"

tells me Error: can't find file 'lexbor', even though I added it to my shard.yml like this:

dependencies:
  lexbor:
    github: kostya/lexbor

and ran shards install

It sounds like you’ve got v3 of LibSSL/OpenSSL installed and Crystal is trying to use v1.1. I ran into a similar issue recently with Ruby on macOS. To work around it, I had to ensure that the default OpenSSL version was 1.1 and not 3. On Homebrew that involved running brew link --overwrite openssl@1.1, but I don’t remember how to do that with apt.

Are you maybe using an old version of Crystal? I am on Debian unstable with just libssl 3.0.10-1, no version 1.1, and the latest Crystal, 1.9.2, is working fine with it.

Apparently, I have:

$ openssl version -a
OpenSSL 1.1.1s  1 Nov 2022
built on: Tue Nov  1 12:36:10 2022 UTC
platform: linux-x86_64
options:  bn(64,64) md2(char) rc4(8x,int) des(int) idea(int) blowfish(ptr) 
compiler: gcc-11 -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG
OPENSSLDIR: "/home/linuxbrew/.linuxbrew/etc/openssl@1.1"
ENGINESDIR: "/home/linuxbrew/.linuxbrew/Cellar/openssl@1.1/1.1.1s/lib/engines-1.1"
Seeding source: os-specific