Does that mean the performance is worse than the Python implementation?
That would be surprising, because AFAIK Python uses PCRE as its regex engine, like Crystal.
Go uses RE2, which might just perform better with the anchored flag than PCRE2.
How much worse is it though?
Maybe PCRE2 could be optimized with some fine tuning?
Your monkeypatch to support MULTILINE_ONLY does not seem to support JIT compilation. Enabling it could be quite a performance factor.
NO_UTF_CHECK could also be a useful match flag. I would assume the input text has to be valid UTF-8, and you could ensure that once before executing any regular expression, instead of rechecking on every match.
Setting NO_UTF_CHECK (I can assume everything is valid) does make it about 10% faster.
Because all the regexes need to be anchored, the JIT is never going to be used anyway; PCRE2 disables it, so setting the flag or not makes no difference.
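The validate-once idea is simple enough to sketch. Note that the flag itself has to be passed down at the C-binding level (PCRE2_NO_UTF_CHECK, `0x40000000`); see the Bytes wrapper sketch further down for one way to do that:

```crystal
# Check the whole input for UTF-8 validity a single time, up front...
source = File.read("input.c")
raise ArgumentError.new("input is not valid UTF-8") unless source.valid_encoding?
# ...so every one of the (potentially hundreds of thousands of) matches
# that follow can safely skip PCRE2's per-call UTF scan.
```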
I did spend some time yesterday writing an RE2 wrapper to compare, and it is faster, but I gave up because it was not processing all the regexes (it complained about things like “\2” being an invalid escape character, since RE2 does not support backreferences) and it seemed like too much work for what is really not a very important thing :-)
I am very confused about how the hell Chroma does it, since Go's regex engine is RE2 AFAIK.
I also tried PCRE (selecting it with the compiler's -D flag) and it was slower than PCRE2, which is to be expected.
Currently, tokenizing a file takes about 40% longer in my code than it does in pygments; I can try taking a real look at their code to find optimizations too.
And finally, after trying every regex library in the universe and finding them all lacking, I found a way to significantly improve performance: Not using Strings.
It turns out a lot of the time was being spent on things like calculating the byte index for a given char position.
That only mattered because the regex usage in this program is pretty extreme: it may attempt hundreds of thousands of regex matches, so all those small costs added up to a lot.
So I did a tiny wrapper around LibPCRE that does regex matching on Bytes and returns matches as Bytes, and it's … faster. Here's an example (the improvement is only 2x because there is code doing other things too):
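Here is a minimal sketch of what such a wrapper might look like (not the actual code from the project; `LibPCRE2Min` and `BytesRegex` are illustrative names, and only the handful of pcre2 functions needed are bound):

```crystal
@[Link("pcre2-8")]
lib LibPCRE2Min
  # Option bits from pcre2.h
  UTF          = 0x00080000_u32 # compile: pattern and subject are UTF-8
  ANCHORED     = 0x80000000_u32 # compile: anchor matches at the start offset
  NO_UTF_CHECK = 0x40000000_u32 # match: skip the per-call UTF validity scan

  fun compile = pcre2_compile_8(pattern : UInt8*, length : LibC::SizeT, options : UInt32,
                                errorcode : Int32*, erroroffset : LibC::SizeT*,
                                ccontext : Void*) : Void*
  fun match_data_create_from_pattern = pcre2_match_data_create_from_pattern_8(
    code : Void*, gcontext : Void*) : Void*
  fun match = pcre2_match_8(code : Void*, subject : UInt8*, length : LibC::SizeT,
                            startoffset : LibC::SizeT, options : UInt32,
                            match_data : Void*, mcontext : Void*) : Int32
  fun get_ovector_pointer = pcre2_get_ovector_pointer_8(match_data : Void*) : LibC::SizeT*
end

# Matching happens directly on Bytes at a byte offset, and the match
# comes back as a sub-slice of the input: no String allocation and no
# char-index/byte-index translation anywhere.
class BytesRegex
  @code : Void*
  @match_data : Void*

  def initialize(pattern : String)
    errorcode = 0
    erroroffset = LibC::SizeT.new(0)
    @code = LibPCRE2Min.compile(pattern.to_unsafe, LibC::SizeT.new(pattern.bytesize),
                                LibPCRE2Min::UTF | LibPCRE2Min::ANCHORED,
                                pointerof(errorcode), pointerof(erroroffset), nil)
    raise "PCRE2 compile error #{errorcode} at #{erroroffset}" if @code.null?
    @match_data = LibPCRE2Min.match_data_create_from_pattern(@code, nil)
  end

  # Returns the matched sub-slice, or nil. The subject was validated as
  # UTF-8 once up front, so NO_UTF_CHECK is safe on every call.
  def match(subject : Bytes, offset : Int) : Bytes?
    rc = LibPCRE2Min.match(@code, subject.to_unsafe, LibC::SizeT.new(subject.size),
                           LibC::SizeT.new(offset), LibPCRE2Min::NO_UTF_CHECK,
                           @match_data, nil)
    return nil if rc < 0
    ovector = LibPCRE2Min.get_ovector_pointer(@match_data)
    subject[ovector[0], ovector[1] - ovector[0]]
  end
end

text = File.read("input.c")
raise "not UTF-8" unless text.valid_encoding?
bytes = text.to_slice
if m = BytesRegex.new("\\w+").match(bytes, 0)
  puts String.new(m) # materialize a String only when actually needed
end
```

A real wrapper would also free the compiled code and match data (pcre2_code_free / pcre2_match_data_free) in a finalizer; that is omitted here for brevity.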
This depends a lot on the input (and on the lexer used), but in the worst case performance seems to be identical, and in the best case it gets up to 5x better.
Have you tried Regex#match_at_byte_index etc.?
I’d expect that matching on byte indices should be possible with the current Regex API, and thus that you shouldn't have to calculate char indices all the time.
If not, we should make it happen.
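For instance, with the existing stdlib API:

```crystal
text = "años años"
re = /años/
# Start the search at byte offset 6 (the second word); no char-to-byte
# translation of the string is needed.
if md = re.match_at_byte_index(text, 6)
  puts md.byte_begin(0) # => 6
  puts md[0]            # => años
end
```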
Then I tried a very large file (a 600KB C header) and performance was atrocious (3x slower than Chroma!). When I noticed that 90% of the time was now going to the garbage collector, malloc, and memcpy, I turned the lexer into an iterator so it would not resize arrays, and …
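A minimal sketch of that iterator shape (the real lexer's token type and rules differ):

```crystal
# Illustrative token type; the project's real Token is different.
record Token, type : String, value : Bytes

class TokenIterator
  include Iterator(Token)

  def initialize(@input : Bytes)
    @pos = 0
  end

  # Produce one token per call instead of accumulating them all in an
  # Array, so no array ever has to grow and be copied.
  def next
    return stop if @pos >= @input.size
    # (real code: try each rule of the current lexer state at @pos)
    token = Token.new("text", @input[@pos, 1])
    @pos += 1
    token
  end
end

# Tokens are produced lazily, one at a time:
TokenIterator.new("abc".to_slice).each { |tok| p tok }
```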
I just released version 0.6.0, which has support for more languages and for combining (some of) them in useful ways. For example, AFAIK there is no way to properly highlight a Jinja template that generates a Dockerfile with any other tool.
Here, use the jinja+dockerfile lexer and it will do the correct 2-pass highlighting.
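Usage might look something like this; `Tartrazine.lexer` and `Tartrazine::Html` here are assumptions about the shard's general shape, not confirmed API:

```crystal
require "tartrazine"

# Hypothetical names; check the project's README for the real API.
lexer = Tartrazine.lexer("jinja+dockerfile") # pass 1: Jinja, pass 2: Dockerfile
formatter = Tartrazine::Html.new
puts formatter.format(File.read("Dockerfile.j2"), lexer)
```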
Also new: a markdown lexer with correct fenced-code-block highlighting in the block's language, bug fixes, and so on.
Pretty sure you need to include some CSS for theming, because pygments does not embed the colors directly in the HTML attributes; it uses class names instead. You can generate the stylesheet with e.g. `pygmentize -S default -f html -a .highlight > style.css`.