Does that mean the performance is worse than the Python implementation?
That would be surprising, because AFAIK Python uses PCRE as its regex engine, like Crystal.
Go uses RE2, which might just perform better with the anchored flag than PCRE2.
How much worse is it though?
Maybe PCRE2 could be optimized with some fine tuning?
Your monkeypatch to support MULTILINE_ONLY does not seem to support JIT compilation. Enabling it could be quite a performance factor.
NO_UTF_CHECK could also be a useful match flag. I would assume the input text has to be valid UTF-8, and you could ensure that once before executing any regular expression, instead of rechecking on every match.
Setting NO_UTF_CHECK (I can assume everything is valid) does make it about 10% faster.
Because all the regexes need to be anchored, the JIT is never going to be used anyway; PCRE2 disables it, so setting the flag or not makes no difference.
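The validate-once idea is simple enough to sketch. Note that the flag itself has to be passed down at the C-binding level (PCRE2_NO_UTF_CHECK, `0x40000000`); see the Bytes wrapper sketch further down for one way to do that:

```crystal
# Check the whole input for UTF-8 validity a single time, up front...
source = File.read("input.c")
raise ArgumentError.new("input is not valid UTF-8") unless source.valid_encoding?
# ...so every one of the (potentially hundreds of thousands of) matches
# that follow can safely skip PCRE2's per-call UTF scan.
```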
I did spend some time yesterday writing an RE2 wrapper to compare, and it is faster, but I gave up because it was not processing all the regexes (it complained about things like “\2” being an invalid escape character, since RE2 does not support backreferences) and it seemed like too much work for what is really not a very important thing :-)
I am very confused about how the hell Chroma does it, since Go's regex engine is RE2 AFAIK.
I also tried PCRE (selecting it with the compiler's -D flag) and it was slower than PCRE2, which is to be expected.
Currently, tokenizing a file takes about 40% longer in my code than it does in pygments; I can try taking a real look at their code to find optimizations too.
And finally, after trying every regex library in the universe and finding them all lacking, I found a way to significantly improve performance: Not using Strings.
It turns out a lot of the time was being spent on things like calculating the byte index for a given char position.
That only mattered because the regex usage in this program is pretty extreme: it may attempt hundreds of thousands of regex matches, so all those small costs added up to a lot.
So I did a tiny wrapper around LibPCRE that does regex matching on Bytes and returns matches as Bytes, and it's … faster. Here's an example (the improvement is only 2x because there is code doing other things too):
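Here is a minimal sketch of what such a wrapper might look like (not the actual code from the project; `LibPCRE2Min` and `BytesRegex` are illustrative names, and only the handful of pcre2 functions needed are bound):

```crystal
@[Link("pcre2-8")]
lib LibPCRE2Min
  # Option bits from pcre2.h
  UTF          = 0x00080000_u32 # compile: pattern and subject are UTF-8
  ANCHORED     = 0x80000000_u32 # compile: anchor matches at the start offset
  NO_UTF_CHECK = 0x40000000_u32 # match: skip the per-call UTF validity scan

  fun compile = pcre2_compile_8(pattern : UInt8*, length : LibC::SizeT, options : UInt32,
                                errorcode : Int32*, erroroffset : LibC::SizeT*,
                                ccontext : Void*) : Void*
  fun match_data_create_from_pattern = pcre2_match_data_create_from_pattern_8(
    code : Void*, gcontext : Void*) : Void*
  fun match = pcre2_match_8(code : Void*, subject : UInt8*, length : LibC::SizeT,
                            startoffset : LibC::SizeT, options : UInt32,
                            match_data : Void*, mcontext : Void*) : Int32
  fun get_ovector_pointer = pcre2_get_ovector_pointer_8(match_data : Void*) : LibC::SizeT*
end

# Matching happens directly on Bytes at a byte offset, and the match
# comes back as a sub-slice of the input: no String allocation and no
# char-index/byte-index translation anywhere.
class BytesRegex
  @code : Void*
  @match_data : Void*

  def initialize(pattern : String)
    errorcode = 0
    erroroffset = LibC::SizeT.new(0)
    @code = LibPCRE2Min.compile(pattern.to_unsafe, LibC::SizeT.new(pattern.bytesize),
                                LibPCRE2Min::UTF | LibPCRE2Min::ANCHORED,
                                pointerof(errorcode), pointerof(erroroffset), nil)
    raise "PCRE2 compile error #{errorcode} at #{erroroffset}" if @code.null?
    @match_data = LibPCRE2Min.match_data_create_from_pattern(@code, nil)
  end

  # Returns the matched sub-slice, or nil. The subject was validated as
  # UTF-8 once up front, so NO_UTF_CHECK is safe on every call.
  def match(subject : Bytes, offset : Int) : Bytes?
    rc = LibPCRE2Min.match(@code, subject.to_unsafe, LibC::SizeT.new(subject.size),
                           LibC::SizeT.new(offset), LibPCRE2Min::NO_UTF_CHECK,
                           @match_data, nil)
    return nil if rc < 0
    ovector = LibPCRE2Min.get_ovector_pointer(@match_data)
    subject[ovector[0], ovector[1] - ovector[0]]
  end
end

text = File.read("input.c")
raise "not UTF-8" unless text.valid_encoding?
bytes = text.to_slice
if m = BytesRegex.new("\\w+").match(bytes, 0)
  puts String.new(m) # materialize a String only when actually needed
end
```

A real wrapper would also free the compiled code and match data (pcre2_code_free / pcre2_match_data_free) in a finalizer; that is omitted here for brevity.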
This depends a lot on the input (and on the lexer used), but in the worst case performance seems to be identical, and in the best case it gets up to 5x better.
Have you tried Regex#match_at_byte_index etc.?
I’d expect that matching on byte indices should be possible with the current Regex API, and thus that you shouldn't have to calculate char indices all the time.
If not, we should make it happen.
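For instance, with the existing stdlib API:

```crystal
text = "años años"
re = /años/
# Start the search at byte offset 6 (the second word); no char-to-byte
# translation of the string is needed.
if md = re.match_at_byte_index(text, 6)
  puts md.byte_begin(0) # => 6
  puts md[0]            # => años
end
```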
Then I tried a very large file (a 600KB C header) and performance was atrocious (3x slower than Chroma!). When I noticed that 90% of the time was now going to the garbage collector, malloc, and memcpy, I turned the lexer into an iterator so it would not resize arrays, and …
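A minimal sketch of that iterator shape (the real lexer's token type and rules differ):

```crystal
# Illustrative token type; the project's real Token is different.
record Token, type : String, value : Bytes

class TokenIterator
  include Iterator(Token)

  def initialize(@input : Bytes)
    @pos = 0
  end

  # Produce one token per call instead of accumulating them all in an
  # Array, so no array ever has to grow and be copied.
  def next
    return stop if @pos >= @input.size
    # (real code: try each rule of the current lexer state at @pos)
    token = Token.new("text", @input[@pos, 1])
    @pos += 1
    token
  end
end

# Tokens are produced lazily, one at a time:
TokenIterator.new("abc".to_slice).each { |tok| p tok }
```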
I just released version 0.6.0, which has support for more languages and for combining (some of) them in useful ways. For example, AFAIK there is no way to properly highlight a Jinja template that generates a Dockerfile with any other tool.
Here, use the jinja+dockerfile lexer and it will do the correct 2-pass highlighting.
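Usage might look something like this; `Tartrazine.lexer` and `Tartrazine::Html` here are assumptions about the shard's general shape, not confirmed API:

```crystal
require "tartrazine"

# Hypothetical names; check the project's README for the real API.
lexer = Tartrazine.lexer("jinja+dockerfile") # pass 1: Jinja, pass 2: Dockerfile
formatter = Tartrazine::Html.new
puts formatter.format(File.read("Dockerfile.j2"), lexer)
```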
Also new: a markdown lexer with correct fenced-code-block highlighting in the block's language, bug fixes, and so on.
Pretty sure you need to include some CSS for theming, because pygments does not embed the colors directly in the HTML attributes; it uses class names instead. You can generate the stylesheet with e.g. `pygmentize -S default -f html -a .highlight > style.css`.