Specs on Crystal 0.36.0 are taking a LONG time

Just upgraded from 0.35.1 → 0.36.0, and when trying to run the spec suite on Lucky, it takes 10+ minutes to run. Is anyone else seeing this?

It seems that when I run crystal spec the first time, it’s like 10 minutes until I see the first green dot output. Then if I run crystal spec right after, the whole thing takes about 3 seconds to complete. After that, if I delete my ~/.cache/crystal/ and run crystal spec again, it’s back to the 10+ minutes to complete. With 0.35.1, it was always ~5sec to run the suite.

Here’s a few things I noticed.

  1. I upgraded Crystal through homebrew on Mac (not the M1), and it took me from LLVM 10 to LLVM 9.
  2. With no cache, it takes a really long time, but with the cache, it’s quick
  3. While the specs are running, I checked my process and saw these:
// When running super slow
❯ ps aux | ack -i "crystal"
jeremywoertink    2635 100.0  1.3  4609252 223000 s000  R+    9:24AM   0:13.64 /usr/local/Cellar/crystal/0.36.0/libexec/crystal spec
// When running fast
❯ ps aux | ack -i "crystal"
jeremywoertink    2484  98.7  0.7  4528900 115888 s000  R+    9:21AM   0:00.69 /usr/local/Cellar/crystal/0.36.0/libexec/crystal eval require "./tmp/console/1611768102528_console.cr";
jeremywoertink    2480   8.4  0.2  4392600  26440 s000  S+    9:21AM   0:00.29 /Users/jeremywoertink/.cache/crystal/crystal-run-spec.tmp
jeremywoertink    2365   0.0  4.3  5202960 725908 s000  S+    9:21AM   0:07.84 /usr/local/Cellar/crystal/0.36.0/libexec/crystal spec
  1. Clearing cache, and then building a spec runner file seems to take all the time.
# spec/spec_runner.cr
require "./**"

crystal build spec/spec_runner.cr → With cache: 9s, Without cache: 9m36s

  1. This isn’t an issue when just building a normal crystal file.
❯ crystal -v
Crystal 0.36.0 (2021-01-26)

LLVM: 9.0.1
Default target: x86_64-apple-macosx
MacOS: 10.15.7 (Catalina)

Well, I guess the “good” news is that this happens on the Avram spec suite as well. So not exclusive to Lucky :thinking:

To clarify, the problem is that it takes a lot of time to compile specs? But it’s not a problem to compile, say, a lucky app?

Ok, just tried the app I’m working on. There’s 3 of us working on this Lucky app. The first dev (using Apple M1 on crystal 0.35.1) took 12s to boot the app. The second dev (using a Arch Linux on crystal 0.35.1) took 10s to boot the app. Me using the Apple Intel on crystal 0.36 took 6 minutes and 53 seconds to boot the app. :frowning_face:

So it is a problem of booting the app but not compiling it, right?

Does the same happen on later execution of the same binary? So to discard if Gatekeeper is affecting you.

Booting or compiling the app. Running crystal build src/start_server.cr takes just as long as crystal src/start_server.cr. Also the same for crystal spec and crystal build spec_runner.cr

This is before cache with Crystal 0.36.0

❯ time crystal build src/start_server.cr
crystal build src/start_server.cr  422.70s user 5.64s system 106% cpu 6:42.32 total

This is after cache with Crystal 0.36.0

❯ time crystal build src/start_server.cr
crystal build src/start_server.cr  20.00s user 2.29s system 144% cpu 15.403 total

This is before cache with Crystal 0.35.1

❯ time crystal build src/start_server.cr
crystal build src/start_server.cr  41.33s user 2.83s system 142% cpu 30.887 total

This is after cache with Crystal 0.35.1

❯ time crystal build src/start_server.cr
crystal build src/start_server.cr  9.05s user 1.62s system 133% cpu 8.006 total

FWIW, this isn’t really anything new: to_i64 takes very long to release-compile · Issue #9654 · crystal-lang/crystal · GitHub. I updated that comment with 0.36.0, things are about twice as slow compared to 0.35.0. However the extra time is spent in LLVM land, so most likely just related to using LLVM 10 versus 8?

@jwoertink Could you compile with -s or --stats in the fast and slow scenarios so we can see which phase(s) have a significant time difference?

If you can run that using th darwin-targz, which use the same llvm (6.0.1) it could help

untar and use directly ~/Downloads/crystal-0.36.0-1/bin/crystal and you should be able to switch between those easily.

Crystal 0.35.1 from brew was using llvm 10, and 0.36.0 is using llvm 9. Probably but those numbers are really weird tbh.

Here is the slow version:

❯ crystal build --stats src/start_server.cr
Parse:                             00:00:00.000147604 (   0.75MB)
Semantic (top level):              00:05:17.036021855 ( 284.32MB)
Semantic (new):                    00:00:00.007973806 ( 284.32MB)
Semantic (type declarations):      00:00:00.069950239 ( 284.32MB)
Semantic (abstract def check):     00:00:00.016679054 ( 284.32MB)
Semantic (ivars initializers):     00:00:00.107012517 ( 300.32MB)
Semantic (cvars initializers):     00:00:00.012729690 ( 300.32MB)
Semantic (main):                   00:01:43.738350958 ( 943.82MB)
Semantic (cleanup):                00:00:00.006942938 ( 943.82MB)
Semantic (recursive struct check): 00:00:00.002674448 ( 943.82MB)
Codegen (crystal):                 00:00:05.938186130 ( 979.82MB)
Codegen (bc+obj):                  00:00:06.977458114 ( 979.82MB)
Codegen (linking):                 00:00:00.876010137 ( 979.82MB)
dsymutil:                          00:00:00.652352870 ( 979.82MB)

Macro runs:
 - /Users/jeremywoertink/Sites/myapp/lib/teeplate/src/lib/file_tree/macros/directory.cr: 00:01:43.625728256
 - /Users/jeremywoertink/Sites/myapp/lib/lucky/src/run_macros/generate_asset_helpers.cr: 00:01:41.797523013
 - /Users/jeremywoertink/Sites/myapp/lib/avram/src/run_macros/infer_table_name.cr: 00:01:49.246910285
 - /usr/local/Cellar/crystal/0.36.0/src/ecr/process.cr: 00:01:40.525327088

Codegen (bc+obj):
 - no previous .o files were reused

Here’s the fast version

❯ crystal build --stats src/start_server.cr
Parse:                             00:00:00.000163159 (   0.75MB)
Semantic (top level):              00:00:01.641595030 ( 187.88MB)
Semantic (new):                    00:00:00.006273244 ( 187.88MB)
Semantic (type declarations):      00:00:00.067652090 ( 187.88MB)
Semantic (abstract def check):     00:00:00.015063394 ( 203.88MB)
Semantic (ivars initializers):     00:00:00.246961155 ( 219.94MB)
Semantic (cvars initializers):     00:00:00.010917117 ( 219.94MB)
Semantic (main):                   00:00:03.429520286 ( 764.81MB)
Semantic (cleanup):                00:00:00.007368792 ( 764.81MB)
Semantic (recursive struct check): 00:00:00.002628261 ( 764.81MB)
Codegen (crystal):                 00:00:06.379794689 ( 895.81MB)
Codegen (bc+obj):                  00:00:02.871129433 ( 895.81MB)
Codegen (linking):                 00:00:00.781666617 ( 895.81MB)
dsymutil:                          00:00:00.617167133 ( 895.81MB)

Macro runs:
 - /Users/jeremywoertink/Sites/myapp/lib/teeplate/src/lib/file_tree/macros/directory.cr: reused previous compilation (00:00:00.025369863)
 - /Users/jeremywoertink/Sites/myapp/lib/lucky/src/run_macros/generate_asset_helpers.cr: reused previous compilation (00:00:00.025183963)
 - /Users/jeremywoertink/Sites/myapp/lib/avram/src/run_macros/infer_table_name.cr: reused previous compilation (00:00:00.024973046)
 - /usr/local/Cellar/crystal/0.36.0/src/ecr/process.cr: reused previous compilation (00:00:00.034067091)

Codegen (bc+obj):
 - 1957/1989 .o files were reused

These modules were not reused:
 - _main (_main.bc)
 - Process (P-rocess.bc)
 - Crystal::Signal (C-rystal5858S-ignal.bc)
 - Log::AsyncDispatcher (L-og5858A-syncD-ispatcher.bc)
 - Hash(Int32, PG::Decoders::Decoder) (H-ash40I-nt324432P-G-5858D-ecoders5858D-ecoder41.bc)
 - ASN1::BER::Length (A-S-N-15858B-E-R-5858L-ength.bc)
 - Tasker::CRON(Nil) (T-asker5858C-R-O-N-40N-il41.bc)
 - Tasker::Future(Nil) (T-asker5858F-uture40N-il41.bc)
 - Timer (T-imer.bc)
 - Carbon::SpawnStrategy (C-arbon5858S-pawnS-trategy.bc)
 - DB::Database (D-B-5858D-atabase.bc)
 - Cable::Handler (C-able5858H-andler.bc)
 - HTTP::Server (H-T-T-P-5858S-erver.bc)
 - HTTP::WebSocketHandler (H-T-T-P-5858W-ebS-ocketH-andler.bc)
 - Lucky::Params (L-ucky5858P-arams.bc)
 - UserQuery (U-serQ-uery.bc)
 - Avram::QueryBuilder (A-vram5858Q-ueryB-uilder.bc)
 - Cable::Server (C-able5858S-erver.bc)
 - AwardJobWorker (A-wardJ-obW-orker.bc)
 - Tasker::OneShot(Transaction) (T-asker5858O-neS-hot40T-ransaction41.bc)
 - Tasker::Future(Transaction) (T-asker5858F-uture40T-ransaction41.bc)
 - GraphQL::Schema (G-raphQ-L-5858S-chema.bc)
 - GraphQL::Language::ParserContext (G-raphQ-L-5858L-anguage5858P-arserC-ontext.bc)
 - TransactionQuery (T-ransactionQ-uery.bc)
 - FollowQuery (F-ollowQ-uery.bc)
 - Carbon::AwsSesAdapter::Email (C-arbon5858A-wsS-esA-dapter5858E-mail.bc)
 - HTTP::Client (H-T-T-P-5858C-lient.bc)
 - WelcomeNewUserEmail (W-elcomeN-ewU-serE-mail.bc)
 - Carbon::SendGridAdapter::Email (C-arbon5858S-endG-ridA-dapter5858E-mail.bc)
 - SceneQuery (S-ceneQ-uery.bc)
 - Cable::WebsocketPinger (C-able5858W-ebsocketP-inger.bc)
 - Schedule::Runner (S-chedule5858R-unner.bc)

EDIT: Just saw the note from @bcardiff These stats are just from my brew built Crystal. I’ll post again with new stats from those tars

1 Like

Since these are long, I threw them in to a gist

These are using those tar builds. Surprising that it was a lot quicker :thinking: The “slow” 0.36 only took about 1 minute to run. So I wonder what that one is doing that the brew version isn’t.

Do you have big specs files with lots of variables shared from the top level and inside specs? I think it might be related to one PR I sent, but I’m not sure.

I guess checking out dexter and running the specs is enough to reproduce this?

Oh, so there is a difference between the brew and non brew? Disregard then.

1 Like

I’m seeing the same slowness with Homebrew-installed Crystal. Most of my apps are probably smaller than Jeremy’s, but the one I’m working on took over a minute to compile with no cache and 1.5 seconds with a populated cache — a factor of 40x on a first compile.

I wonder if the problem is the LLVM change down to 9.0.1. Compiling the Crystal compiler looks like it uses LLVM 11 and only takes 90 seconds on the first, 16 seconds on subsequent compiles — a factor of 5.5x.

➜  crystal git:(master) rm -rf ~/.cache/crystal
➜  crystal git:(master) time (make clean crystal)
Using /usr/local/opt/llvm/bin/llvm-config [version=11.0.0]
rm -rf .build
rm -rf ./docs
rm -rf src/llvm/ext/llvm_ext.o
rm -rf src/ext/sigfault.o src/ext/libcrystal.a
c++ -c  -o src/llvm/ext/llvm_ext.o src/llvm/ext/llvm_ext.cc -I/usr/local/Cellar/llvm/11.0.0_1/include -std=c++14 -stdlib=libc++   -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS
cc -fPIC  -I/usr/local/opt/llvm/include  -c -o src/ext/sigfault.o src/ext/sigfault.c
ar -rcs src/ext/libcrystal.a src/ext/sigfault.o
CRYSTAL_CONFIG_LIBRARY_PATH="/usr/local/Cellar/crystal/0.35.1_1/embedded/lib:/usr/local/lib" CRYSTAL_CONFIG_BUILD_COMMIT="be5de4ac1" SOURCE_DATE_EPOCH="1611696425" ./bin/crystal build --link-flags="-L/usr/local/opt/llvm/lib"  -o .build/crystal src/compiler/crystal.cr -D without_openssl -D without_zlib
( make clean crystal; )  111.58s user 4.02s system 129% cpu 1:29.36 total
➜  crystal git:(master) time (make clean crystal)
Using /usr/local/opt/llvm/bin/llvm-config [version=11.0.0]
rm -rf .build
rm -rf ./docs
rm -rf src/llvm/ext/llvm_ext.o
rm -rf src/ext/sigfault.o src/ext/libcrystal.a
c++ -c  -o src/llvm/ext/llvm_ext.o src/llvm/ext/llvm_ext.cc -I/usr/local/Cellar/llvm/11.0.0_1/include -std=c++14 -stdlib=libc++   -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS
cc -fPIC  -I/usr/local/opt/llvm/include  -c -o src/ext/sigfault.o src/ext/sigfault.c
ar -rcs src/ext/libcrystal.a src/ext/sigfault.o
CRYSTAL_CONFIG_LIBRARY_PATH="/usr/local/Cellar/crystal/0.35.1_1/embedded/lib:/usr/local/lib" CRYSTAL_CONFIG_BUILD_COMMIT="be5de4ac1" SOURCE_DATE_EPOCH="1611696425" ./bin/crystal build --link-flags="-L/usr/local/opt/llvm/lib"  -o .build/crystal src/compiler/crystal.cr -D without_openssl -D without_zlib
( make clean crystal; )  23.52s user 2.04s system 155% cpu 16.402 total
1 Like

Okay, I found it.

The compiler compiles “macro run” programs in release. For example it compiles “src/ecr/process.cr” from the standard library once, then it will invoke that executable once for each “macro run”.

The reason that program is compiled in release mode is because supposedly we are going to compile it once and run it multiple times. Then the program is also cached and reused in next compilations.

These are the times with the old compiler:

crystal build -s --release src/ecr/process.cr
Parse:                             00:00:00.000136675 (   0.25MB)
Semantic (top level):              00:00:00.229377405 (  45.38MB)
Semantic (new):                    00:00:00.001104609 (  45.38MB)
Semantic (type declarations):      00:00:00.013862049 (  45.38MB)
Semantic (abstract def check):     00:00:00.003336744 (  45.38MB)
Semantic (ivars initializers):     00:00:00.005020561 (  45.38MB)
Semantic (cvars initializers):     00:00:00.064744379 (  76.51MB)
Semantic (main):                   00:00:00.031409134 (  76.51MB)
Semantic (cleanup):                00:00:00.003429698 (  76.51MB)
Semantic (recursive struct check): 00:00:00.000552440 (  76.51MB)
Codegen (crystal):                 00:00:00.205167882 (  92.57MB)
Codegen (bc+obj):                  00:00:05.910215561 (  92.57MB)
Codegen (linking):                 00:00:00.153066810 (  92.57MB)
dsymutil:                          00:00:00.046683514 (  92.57MB)

These are the times with the new compiler:

bin/crystal build -s --release src/ecr/process.cr
Using compiled compiler at .build/crystal
Parse:                             00:00:00.000309128 (   0.75MB)
Semantic (top level):              00:00:00.425719001 (  45.20MB)
Semantic (new):                    00:00:00.001362605 (  45.20MB)
Semantic (type declarations):      00:00:00.042852339 (  45.20MB)
Semantic (abstract def check):     00:00:00.012192835 (  45.20MB)
Semantic (ivars initializers):     00:00:00.036538063 (  60.26MB)
Semantic (cvars initializers):     00:00:00.168997900 (  76.26MB)
Semantic (main):                   00:00:00.149201186 (  76.32MB)
Semantic (cleanup):                00:00:00.012628783 (  76.32MB)
Semantic (recursive struct check): 00:00:00.002083400 (  76.32MB)
Codegen (crystal):                 00:00:00.677731959 (  92.32MB)
Codegen (bc+obj):                  00:00:30.447228009 (  92.32MB)
Codegen (linking):                 00:00:00.157901840 (  92.32MB)
dsymutil:                          00:00:00.049124261 (  92.32MB)

Codegen (bc+obj):
 - no previous .o files were reused

It goes from 6 seconds of bc+obj to 30 seconds! :disappointed:

That bc+obj phase is just asking LLVM to go from a .bc file to an .o file, essentially compiling LLVM code.

The old compiler I’m using is linked against LLVM 10. The new one is against LLVM 11. My guess is that LLVM 11 is much, much slower than LLVM 10, but I don’t know.

I’ll try installing LLVM 9, compiling a compiler against it and seeing if that lowers the time (LLVM 10 is not available in Mac for some reason.)

As a workaround we can change the compiler so that it compiles macro run programs in non-release. Maybe it will still be fast enough.

2 Likes

Well, my guess was wrong. With LLVM 9 it takes 1 minute 38 seconds to compile that ecr process.

Okay, I found the culprit: me :smiley:

In the last release I made two changes to always initialize constants and class variables unless they are used before they are declared:

It seems now all that initialization happens in the “main” function and that function ends up being huge and it makes LLVM have a really hard time optimizing that. Or that’s my guess.

I think we should revert those two PRs. After all they are just optimizations. And they might optimize the runtime performance of programs a bit, but if it makes compilation that much slower it’s not worth it, or we can rethink it later.

~And also, it seems reverting that also fixes LLVM 11 optimizations breaks code · Issue #10220 · crystal-lang/crystal · GitHub for some reason, so that optimization might actually be doing something wrong.~

Actually, with LLVM 11 it’s still broken, I forgot I temporarily switched to LLVM 9 :man_facepalming:

8 Likes

That explains why the Lucky and Avram specs take so long. We use those run macros a few times. Thanks for finding this!

I was curious if there was a difference on Linux, too, so I created a branch upgrading one of my apps to Crystal 0.36. This is the CI build result (both were run just a few minutes apart):

The Docker builds compile 3 different app entrypoints with shards build --release --production. The spec run is just crystal spec. The master branch is using Crystal 0.35.1.

@jwoertink are you seeing similar on your apps?

To be clear, this is nowhere near as large a difference as on macOS with Homebrew, but still seems significant.