Impact of `--release` vs `-O3`

I’m trying to iterate quickly on an app I run in production. Deploying was taking 5-6 minutes every time I made a change, which is actually pretty decent for continuous deployment, but I’m impatient.

Since the --release option is just -O3 --single-module, I decided to check what impact --single-module had vs plain -O3 on this app, for both compilation time and runtime performance. I know LLVM can perform more optimizations when everything is in a single module, but I’d never seen the magnitude of that on the kinds of apps I write (usually web services and infrastructure tooling). Ary showed some numbers here, which are great, but his examples aren’t the kinds of apps that I work on, so I wasn’t sure how impactful it would be.

Using -O3 instead of --release brought my build times on GitHub Actions paid runners down from 5-6 minutes to below 3 minutes (builds and deploys are not yet gated behind CI — I’m running specs locally).
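In CI terms the change is just swapping the flag on the build step. A sketch of what that looks like as a GitHub Actions step (the step name, source path, and output path here are placeholders, not from my actual workflow):

```yaml
# Hypothetical build step; paths and names are made up for illustration.
- name: Build (fast deploys)
  run: crystal build src/app.cr -O3 -o bin/app
  # For a full release build, swap the flags:
  #   crystal build src/app.cr --release -o bin/app
```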

Build times this fast are worth a moderate tradeoff in runtime performance, especially since this app is relatively new. Latency between my house and the server makes generating synthetic load difficult (it takes a lot of concurrency to saturate the gaps created by that latency), so I ran it against the app running locally. I copied a request as a curl command from my browser dev tools and converted it to wrk, hitting an endpoint for an authenticated user that shows all organizations the user is a member of, which makes queries to both Redis and Postgres. I have live data in my local DB, so these are realistic requests.

For -O3:

```
$ wrk 'http://localhost:3201/organizations' \
  [HEADERS REMOVED FOR BREVITY]
Running 10s test @ http://localhost:3201/github_organizations
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.10ms  314.42us   6.54ms   98.37%
    Req/Sec     4.63k   131.63     4.81k    75.74%
  93073 requests in 10.10s, 456.41MB read
Requests/sec:   9214.29
Transfer/sec:     45.18MB
```

Over 9k requests per second is pretty solid, especially for requests that are running several real Redis and Postgres queries. I was like “wait, how much better could --release be?”

Turns out, it’s pretty significant:

```
$ wrk 'http://localhost:3202/organizations' \
  [HEADERS OMITTED FOR BREVITY]
Running 10s test @ http://localhost:3202/github_organizations
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   560.38us    1.53ms  40.94ms   97.71%
    Req/Sec    11.76k   692.11    12.68k    89.11%
  236375 requests in 10.10s, 1.13GB read
Requests/sec:  23402.55
Transfer/sec:    114.76MB
```

About 2.5x as fast for a web app running real database queries just by adding --single-module. I knew LLVM performed some great optimizations, but that’s quite a bit more than I expected. I assumed it would be around 50% faster, which still would’ve been worth it for apps that need the performance, but 2.5x is incredible.
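For what it’s worth, the 2.5x figure checks out against the two Requests/sec numbers wrk reported above:

```shell
# Ratio of the two Requests/sec figures from the wrk runs above
awk 'BEGIN { printf "%.2f\n", 23402.55 / 9214.29 }'
# → 2.54
```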


Related: Optimize runtime for non single-module(01,02) compilation by kostya · Pull Request #14225 · crystal-lang/crystal · GitHub


So one thing to remember here is everything that’s included in a C -O3 build but isn’t included in Crystal without --single-module. There’s Pointer (though we tell the compiler to inline it, who knows how often it actually is), there are arrays and integers, etc., which are separate modules in Crystal but not in C. So a slightly fairer comparison would be an -O3 build that still links a big part of the prelude statically.

Hmm. I wonder how big a difference it would make to also link libgc statically.

It’s worth noting that in Crystal, every type maps to an LLVM module. That means without --single-module, it can only optimize across methods defined on the same type.
Any call to a method of a different type is a barrier to the optimizer. And frankly, there are a lot of those in Crystal. Even in very basic stdlib features. The type of application probably doesn’t matter too much for this.

Thanks for this. I’d also been using -O3 -Drelease just for the compilation speed and to reuse the codegen output (bc+obj), but given your tests, I think I’ll start using --release for release builds. <3


The -O args are kinda moot in Crystal; there are missed opportunities to inline low-level code.

Since they’re Crystal methods, not backed by a primitive, they end up in a distinct type (hence a distinct module and object file), and LLVM only optimizes and inlines within a module. Since everything’s a type in Crystal, even Intrinsics, Pointer or Int32 becomes a distinct module/object, so every basic method call such as pointer + 0 or pointer[0] leads to an actual call :snail:

As stated in the issue linked above, always inlining methods marked with @[AlwaysInline] leads to great performance improvements without --single-module, when paired with more inlined methods in Pointer and Slice. That would be very nice to test out, to see the impact on compile time vs runtime performance for -O1 and higher.
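A minimal sketch of what that looks like in code (a hypothetical example, not taken from the PR — the Buffer type and its accessor are made up for illustration): without --single-module, the call into Pointer#[] below lives in a different LLVM module, so it stays an opaque call unless the compiler force-inlines the method.

```crystal
struct Buffer(T)
  def initialize(@ptr : Pointer(T))
  end

  # Hypothetical use of the annotation: ask the compiler to always
  # inline this accessor so the underlying Pointer#[] call doesn't
  # remain a cross-module call at -O1 and higher.
  @[AlwaysInline]
  def [](index : Int32) : T
    @ptr[index]
  end
end
```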
