A blog article on performant vs idiomatic code (using Crystal examples)

Hola folks,

Happy New Year!

I’m trying to make a bit more effort on posting to my blog this year, and thought I’d share a link to my recent post:

https://nogginly.dev/posts/performant-code-not-always-idiomatic/

Feedback is welcome; I’m also happy for suggestions on faster idiomatic implementations in Crystal, as well as improvements to the bottom-up one.

9 Likes

This is a nice case study.
Keep them blog posts coming! :laughing:

1 Like

The code for generating the representation seems excessively complex in all variations.
I think it can be shortened to just two branches.
Here is an example for the last implementation, but the equivalent applies to the others as well:

if count >= 4 || c == '{'
  into << '{' << c << count << '}'
else
  count.times { into << c }
end
1 Like

An idea for further optimizations: If you have long segments of characters that don’t get encoded (i.e. they’re copied 1:1 from the original string), it would be more efficient to copy them as a big batch instead of writing every single character.
For example, encoding the string abababababababab (either as the entire input or as a segment between two curly encodings) could be a single memcpy of 16 bytes instead of 16 individual writes of one byte.

This would require going even further down, using Char::Reader directly instead of String#each_char so you can keep track of byte indices.

This mechanism is used in some places in stdlib, by the way, for example in HTML.escape
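
For illustration, here is a rough sketch of what that batch-copy approach could look like (my own sketch under those assumptions, not the article’s implementation; the name encode_bulk is made up): keep track of the byte index where the current unencoded segment starts, and flush it with a single bulk write whenever an encoded run is emitted.

def encode_bulk(s : String, into : IO) : Nil
  bytes = s.to_slice
  reader = Char::Reader.new(s)
  segment_start = 0 # byte index where the pending copied-1:1 segment begins

  while reader.has_next?
    c = reader.current_char
    run_start = reader.pos
    count = 0
    while reader.has_next? && reader.current_char == c
      count += 1
      reader.next_char
    end

    if count >= 4 || c == '{'
      # flush everything copied 1:1 so far as one bulk write
      into.write(bytes[segment_start, run_start - segment_start]) if run_start > segment_start
      into << '{' << c << count << '}'
      segment_start = reader.pos
    end
    # otherwise the short run stays inside the pending segment and is copied verbatim later
  end

  into.write(bytes[segment_start, bytes.size - segment_start]) if bytes.size > segment_start
end

With long unencoded stretches (like the abababababababab example), this turns many single-character appends into one write per segment.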

1 Like

Thank you.

An idea for further optimizations: If you have long segments of characters that don’t get encoded (i.e. they’re copied 1:1 from the original string), it would be more efficient to copy them as a big batch instead of writing every single character.

Excellent suggestion. Using Char::Reader and Slice, I have a new variant of the encoder, and it is ~1.3x faster than the bottom-up variant for my test data. I’ll update the post after I clean up the code.

FYI, I’ve updated the blog post with yet another faster lower-level implementation.

If you amend a blog post after publishing, it would be nice to point out which parts were added in an update (and when). This makes it easier for readers who visited before to know which parts are new.

Makes sense, and I appreciate the thoughtful feedback.

I’ve taken a first stab at annotating the changes with revision call-outs and updated the article.

Question: Is it sufficient to refer to the discussion here generally, or should I be more specific in attributing each change? I’m happy to be more specific; my goal with these blog posts is to put ideas out there, and I want to encourage feedback and discussion.

That looks nice. IMO a general reference is sufficient.
But if you can refer to a specific comment for an individual update, pointing to it directly would be a service to the reader.

Makes sense. And done. :slight_smile:

Do you mind sharing the test data, if appropriate?

Hola. I used two ASCII art data files (references below) which I concatenated to create a string as the input for the performance tests, like so:

s = (File.read("data/aa_hitman.txt") + File.read("data/aa_zelda.txt")).delete('\n')

(The code in the article will work without deleting the \n; however, I have an unpublished encoder that uses a regular expression, which did not like newlines in the input, and I wanted to keep the testing fair.)

References

  1. Hitman by Blazej Kozlowski here
  2. Zelda Four Swords here
1 Like

Still, I like to try and see how close I can get a functional implementation to the performance of an optimal imperative one. It’s fun to see what we have to give up and what we can gain.

Reminds me about this video: https://www.youtube.com/watch?v=tD5NrevFtbU

Kinda misses the point that “clean code” is more about readability than performance, but it is worth noting that he manages a 15 times performance increase by making it “unclean”. As he says, that’s approximately 15 years of advances in computer hardware. But it’s also a very specific example.

But it does resonate with something I heard in a Dave Farley video (can’t remember which one): that one of the bigger problems for projects these days is building for scale without considering the real need. As you build your todo list application to handle a million concurrent users, you end up with an overly complex project, which slows development down. Realizing that it’s only you, your mom, and a handful of people who’ll realistically ever use it means you might figure out that the file system is a better data store than the cloud database you had in mind.

Whenever I look in the source of a CLI command and discover that it needs a dependency injection container and a handful of classes just to parse its arguments, I wonder if there isn’t a bit too much eagerness for overengineering and overabstracting in the business.

So I’m kinda trying to go the other way, in particular with a side project I kinda got lost in and am now redoing in Crystal instead. It had three microservices and an event-driven architecture, and I ended up spending most of my time working on anything but the main business logic. So I’m trying to go with “the simplest thing that could possibly work” for each feature in the Crystal version.

And it’s actually hard. Crystal’s base speed helps, but after 20 years in the business I keep catching myself thinking “hmm, I should make a background queue for processing these” before actually testing whether processing it synchronously is a problem at all. The first time I made the effort to do the simple thing and then see how it worked out, I discovered that the processing time wasn’t the bottleneck (at least in Crystal); it was decoding the JSON it got as input.

So I’m kinda doing the same thing, at a higher abstraction level.

2 Likes

I’m with you on this.

Putting aside the “clean code vs performance” debate, one point Casey makes that resonates with me is that as time passes it becomes easier to develop software without knowing how CPUs work. In this podcast interview Casey says there is no “map over iterator” instruction to make the point that it’s an abstraction that hides a lot of how a computer works.

Over time I’ve learned that there is a lot to be gained from being able to, first and foremost, “solve a problem quickly”. Afterwards, knowing how things work under the hood helps me figure out how far I need to go to improve the solution, if and as needed.

What makes Crystal so interesting is that while I am working on solving the problem, I seem to be farther ahead on the performance curve than I used to be, sometimes even for what might be lower-level problems. It’s really nice to be able to “map over an iterator” of an array of records and get the same performance as a more verbose for loop over an array in C.
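
As a rough illustration of the iterator-versus-loop part of that point (my own sketch, not from the article; actual numbers will vary), a quick Benchmark.ips comparison in Crystal might look like this:

require "benchmark"

# Compare a lazy iterator pipeline against a hand-written loop over the same data.
data = Array.new(10_000) { |i| i }

Benchmark.ips do |x|
  x.report("each.map iterator") do
    data.each.map { |v| v * 2 }.sum
  end

  x.report("manual while loop") do
    sum = 0
    i = 0
    while i < data.size
      sum += data[i] * 2
      i += 1
    end
    sum
  end
end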

1 Like

Or networking, or filesystems, or databases, or… (I thought I’d said that, but apparently not).

Which is fine, most of the time. Hell, there are a lot of things I haven’t really got a clue how they work (query optimization in database servers, for one thing). And then you have the new frontend developers happily firing off fetch left and right, and you have the opportunity to go “well… do you know what’s really happening?” and share your wisdom.

1 Like