Streaming HTML parser?

As it seems the forum beats my googling skills every time (I must be getting old), I thought I’d ask:

Is there any HTML parser shard that works on a stream rather than chewing through the whole thing in one go?

I’ve always been fascinated by templating systems that work directly with the HTML structure ever since I stumbled across Amrita in Ruby.

lexbor has a streaming parse method: see src/lexbor/parser.cr on master in kostya/lexbor on GitHub.

Google is getting worse by the day

Oh my, if that existed for XML I’d be quite excited. We have this service at work that would be absolutely perfect for that.

Wouldn’t this just be `XML::Reader` (Crystal 1.20.0-dev API docs)?

I’d need to be able to modify the structure as I go. I don’t think that is possible with `Reader`? But thanks for the suggestion, there may be use cases for just parsing too.

Ahh, gotcha. You might be able to pair it with an `XML::Builder`: create a reader on the source IO and a builder on the destination IO, then, as you read, call the matching write methods to pass nodes straight through. Where you need a change, write the modified node instead of the existing one.

EDIT: Did something similar with XML::Reader/JSON::Builder and XML::Builder/JSON::PullParser for oq.
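
For anyone landing here later, here is a minimal sketch of that reader-to-builder pass-through, assuming only the stdlib `XML::Reader` and `XML::Builder`. The `rename_elements` helper and its parameters are made up for illustration. It ignores processing instructions, doctypes, and entity references, so treat it as a starting point rather than a complete copier.

```crystal
require "xml"

# Hypothetical helper: copies XML from `input` to `output` one node at a time,
# renaming every element called `from` to `to` on the way through.
def rename_elements(input : IO, output : IO, from : String, to : String) : Nil
  reader = XML::Reader.new(input)

  XML.build_fragment(output) do |xml|
    while reader.read
      case reader.node_type
      when .element?
        name = reader.name
        # Must be checked while still positioned on the element itself.
        empty = reader.empty_element?

        xml.start_element(name == from ? to : name)

        # Copy attributes unchanged, then step back to the owning element.
        if reader.move_to_first_attribute
          loop do
            xml.attribute(reader.name, reader.value)
            break unless reader.move_to_next_attribute
          end
          reader.move_to_element
        end

        # Self-closing elements never produce an END_ELEMENT node.
        xml.end_element if empty
      when .end_element?
        xml.end_element
      when .text?, .whitespace?, .significant_whitespace?
        xml.text(reader.value)
      when .cdata?
        xml.cdata(reader.value)
      when .comment?
        xml.comment(reader.value)
      end
    end
  end
end

output = IO::Memory.new
rename_elements(IO::Memory.new(%(<a><old id="1">hi</old></a>)), output, "old", "new")
puts output.to_s
```

Only the current node is held at any time, so swapping the `IO::Memory` instances for a file or socket should keep memory use roughly flat regardless of document size.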

Oh, great, now I just need to figure out how to use that. :grin:

Well, I lied a bit, it was DuckDuckGo, but I don’t think the difference is major.

crystal-html5 also supports parsing an HTML stream at two different levels.

Token-level (no tree built, constant memory): useful when you only need the tokens and not the tree.

HTML5.each_token(io) do |token|
  puts token.data if token.type.start_tag?
end

Also available as HTML5.token_iterator(io) if you want an Iterator.

SAX-style (event callbacks during tree construction): useful when you need tree-aware events.

class MyHandler
  include HTML5::StreamingHandler

  def on_element_open(tag, attrs, namespace)
    puts "<#{tag}>"
  end
end

doc = HTML5.stream(io, MyHandler.new)

HIH
//Ali

Oooh, I’m getting spoilt for options here.

That’s why the forum beats google…