Syntax highlighting shard: Tartrazine

Tartrazine is a port of Pygments/Chroma, the Python/Go “standard” syntax highlighting libraries. It’s available on GitHub.

As of now it passes 98.6% of the test suite and supports 241 languages and 65 styles.

The only formatter implemented is a very simple HTML one, but it’s easy to add more and I plan to do that this week.

Here’s some example code:

require "tartrazine"

lexer = Tartrazine.lexer("crystal")
theme = Tartrazine.theme("catppuccin-macchiato")
puts Tartrazine::Html.new.format(File.read(ARGV[0]), lexer, theme)

And here is how the output looks:

[screenshot: example of highlighted output]

If you are in the mood for reading, this is how I did it:

11 Likes

Update: make that 332 themes, thanks to sixteen.

Sadly, while usable, this is very slow.

The regexes from Chroma require the “anchored” flag, and that makes PCRE2 really, really slow (slower-than-Python slow).
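
For reference, “anchored” means a match may only start exactly at the given position, which is what a rule-based lexer needs. A minimal illustration using Crystal’s stdlib (my own snippet, not Tartrazine code):

# An anchored pattern can only match at the start position; PCRE2 can't
# use its usual scan-ahead optimizations for it.
re = Regex.new("\\bdef\\b", Regex::Options::ANCHORED)
re.match("def foo")  # matches
re.match(" def foo") # => nil: no scanning forward past position 0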

Does that mean the performance is worse than the Python implementation?
That would be surprising, because AFAIK Python uses PCRE as its regex engine, like Crystal.

Go uses RE2, which might just perform better with the anchored flag than PCRE2.
How much worse is it, though?
Maybe PCRE2 could be optimized with some fine-tuning?

Your monkeypatch to support MULTILINE_ONLY does not seem to support JIT compilation. Enabling this could be quite a performance factor.

NO_UTF_CHECK could also be a useful match flag. I would assume the input text has to be valid UTF-8, and you could ensure that once before executing any regular expressions, instead of checking again at every regex match.
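
A sketch of that suggestion, assuming Crystal ≥ 1.9 (which exposes Regex::MatchOptions); the snippet is mine, not from the shard:

text = File.read(ARGV[0])
# Validate the whole input once up front...
raise "input is not valid UTF-8" unless text.valid_encoding?

# ...then tell PCRE2 to skip its per-call UTF check on every match.
if md = /\w+/.match(text, options: Regex::MatchOptions::NO_UTF_CHECK)
  puts md[0]
end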

Interesting!

Setting NO_UTF_CHECK (I can assume everything is valid) does make it about 10% faster.

Because all the regexes need to be anchored, the JIT is never going to be used anyway; PCRE2 disables it, so setting the flag or not makes no difference.

I did spend some time writing an RE2 wrapper yesterday to compare, and it is faster, but I gave up because it would not process all the regexes (it complained about things like “\2” being an invalid escape character), and it seemed like too much work for what is really not a very important thing :-)

I am very confused about how the hell Chroma does it, since Go’s engine is RE2, AFAIK.

I also tried PCRE (using the -D flag to select it) and it was slower than PCRE2, which is to be expected.
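
(For context, my understanding rather than something stated above: Crystal 1.8+ uses PCRE2 by default, and building with -Duse_pcre switches the stdlib back to legacy PCRE, which I take to be the -D option meant here. Path hypothetical:)

$: crystal build -Duse_pcre src/tartrazine.cr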

Currently, tokenizing a file takes about 40% longer in my code than it does in Pygments; I can try taking a real look at their code to find optimizations too.

AFAIK RE2 does not support backreferences. So, very confusing indeed :person_shrugging:

Mystery solved: they are using dlclark/regexp2 (GitHub), a full-featured regex engine in pure Go which is (of all things) a port of .NET’s native regex engine.

1 Like

That’s actually pretty funny. I wonder how the original .NET implementation performs.

And finally, after trying every regex library in the universe and finding them all lacking, I found a way to significantly improve performance: not using Strings.

Turns out a lot of the time was being spent on things like calculating what byte index a given char position corresponds to.

This was only because the regex usage in this program is pretty extreme, it may try to do hundreds of thousands of regex matches, so all those small things added up to a lot.
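
To see why: converting a char position to a byte offset is O(n) on multibyte text, so every match attempt can pay a walk over the string. A tiny self-contained illustration (mine, with made-up sizes):

require "benchmark"

text = "ü" * 100_000 # 2 bytes per char, so char and byte offsets diverge

Benchmark.ips do |x|
  # Each call walks the string from the start to find the byte offset;
  # a lexer doing hundreds of thousands of matches pays this repeatedly.
  x.report("char->byte index") { text.char_index_to_byte_index(99_999) }
end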

So I wrote a tiny wrapper around LibPCRE that does regex matching on Bytes and returns matches as Bytes, and it’s … faster. Here’s an example (the improvement is only 2x because there is code doing other things too):

[screenshot: benchmark, ~2x improvement]

This depends a lot on the input (and the lexer used), but in the worst case performance seems to be identical, and in the best case it gets up to 5x better.

5 Likes

And then, if you pass NO_UTF_CHECK to match (not to regex compilation, but to the match call) …

[screenshot: benchmark with NO_UTF_CHECK on the match call]
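
This makes sense given PCRE2’s semantics: at compile time NO_UTF_CHECK only skips validating the pattern, while at match time it skips validating the subject, which is where the per-call cost was. Roughly, using the stdlib API rather than the LibPCRE wrapper (my sketch):

text = "tokens to scan"

# Compile-time flag: only the *pattern* skips UTF validation.
re = Regex.new("\\w+", Regex::Options::NO_UTF8_CHECK)

# Match-time flag: the *subject* skips validation on each call.
re.match_at_byte_index(text, 0, options: Regex::MatchOptions::NO_UTF_CHECK)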

1 Like

You haven’t said which of the images above used that parameter.

  • Around 500 msec is just using Regex
  • Around 230 msec is using Regex on Bytes
  • Around 30 msec is Regex on Bytes + NO_UTF_CHECK on the match() call
3 Likes

Have you tried Regex#match_at_byte_index etc.?
I’d expect matching on byte indices to be possible with the current Regex API, and thus avoid calculating char indices all the time.
If not, we should make it happen.
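
Those APIs do exist; a minimal example of matching at a byte offset (my snippet):

text = "héllo wörld"

# "wörld" starts at byte 7 ("é" takes 2 bytes); its char index is 6.
if md = /w[a-zö]+/.match_at_byte_index(text, 7)
  puts md[0]            # => "wörld"
  puts md.byte_begin(0) # => 7
  puts md.byte_end(0)   # => 13
end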

1 Like

Well, GitHub is down; I’ll try that one of these days.

And I am now going to stop trying to improve performance, because hyperfocus is ok only up to a point :-D

1 Like

Then I tried a very large file (a 600KB C header) and performance was atrocious (3x slower than Chroma!). When I noticed that 90% of the time was now spent in the garbage collector, malloc, and memcpy, I turned the lexer into an iterator so it would not resize arrays, and …
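
The shape of that change, roughly (hypothetical names, not the actual lexer code): instead of accumulating every token in one growing Array, the lexer implements Crystal’s Iterator and hands out one token at a time, so no array ever has to be resized.

record Token, type : String, value : String

class TokenIterator
  include Iterator(Token)

  def initialize(@text : String)
    @pos = 0
  end

  def next
    return stop if @pos >= @text.bytesize
    # The real lexer would try its rules at @pos; this stand-in just
    # emits one byte per token.
    token = Token.new("text", @text.byte_slice(@pos, 1))
    @pos += 1
    token
  end
end

TokenIterator.new("abc").each { |t| print t.value } # prints "abc"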

2 Likes

I just released version 0.6.0, which has support for more languages, and for combining (some of) them in useful ways. For example, AFAIK there is no way to properly highlight a Jinja template that generates a Dockerfile with any other tool.

Here, you can use the jinja+dockerfile lexer and it will do the correct two-pass highlighting.
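
Assuming combined lexers are requested by name just like single ones (extrapolating from the example at the top of the thread; the file name is made up):

require "tartrazine"

lexer = Tartrazine.lexer("jinja+dockerfile")
theme = Tartrazine.theme("catppuccin-macchiato")
puts Tartrazine::Html.new.format(File.read("Dockerfile.j2"), lexer, theme)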

Also new: a markdown lexer with correct fenced-code-block highlighting in the block’s language, bug fixes, and so on.

1 Like

Hi, I tested with the following command, but no highlighted output was shown when I opened 1.html in Firefox.

$: bin/tartrazine -fhtml -lcrystal src/styles.cr > 1.html

The output looks like the following screenshot.

Pretty sure you need to include some CSS for theming, because Pygments-style output does not embed the colors directly in HTML attributes; it uses class names instead.

Add --standalone, otherwise you get the HTML with classes but not the actual CSS.
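
So the full command would be something like:

$: bin/tartrazine -fhtml -lcrystal --standalone src/styles.cr > 1.html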

1 Like