`crystal spec -O1` is VERY slow, with very high CPU usage, on its first run

Please check the following screenshot.

Why is the first spec run with -O1 so slow? (15 minutes on a very small shard.) I almost suspected
crystal spec was stuck in an infinite loop.

You can reproduce it using GitHub - crystal-china/captcha (a Crystal library that generates image CAPTCHAs) and running the specs with: crystal spec -O1
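A reproduction sketch, for reference (the repository URL is inferred from the link text above; `shards install` is assumed in case the library has dependencies):

```shell
# Clone the library and run its specs with -O1 on a cold cache.
# URL inferred from the "crystal-china/captcha" link above.
git clone https://github.com/crystal-china/captcha.git
cd captcha
shards install          # fetch dependencies, if any
crystal clear_cache     # ensure a cold compilation cache
time crystal spec -O1   # the first run is the slow one
```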

I don’t usually run specs with -O1, but that does take an exceptionally long time to compile. I let it run for 27 minutes on my machine before killing it.

➜  captcha git:(master) time csp -O1
^C
crystal spec -O1  24.24s user 0.60s system 1% cpu 27:48.08 total

Without -O1, it takes significantly less time even on a cold compilation:

➜  captcha git:(master) crystal clear_cache
➜  captcha git:(master) time csp
...

Finished in 557.48 milliseconds
3 examples, 0 failures, 0 errors, 0 pending
crystal spec  10.23s user 0.76s system 126% cpu 8.715 total

Even --release wasn’t bad:

➜  captcha git:(master) crystal clear_cache
➜  captcha git:(master) time csp --release
...

Finished in 101.54 milliseconds
3 examples, 0 failures, 0 errors, 0 pending
crystal spec --release  14.58s user 1.38s system 99% cpu 16.127 total

Since the second run took 2 seconds, I suspect the cache mechanism is involved. Adding the -O1 option likely forces many standard library classes to be recompiled into object files. However, I’m not sure that alone explains the slow compilation.

Created issue


ysbaddaden has already commented on the GitHub issue, but I was also able to reproduce the problem on my side.

rm -r tmp/*
CRYSTAL_CACHE_DIR=/home/kojix2/Crystal/captcha/tmp crystal spec -O1 --stats
Parse:                             00:00:00.000041077 (   0.98MB)
Semantic (top level):              00:00:00.379616584 ( 151.17MB)
Semantic (new):                    00:00:00.001428824 ( 151.17MB)
Semantic (type declarations):      00:00:00.030816626 ( 159.17MB)
Semantic (abstract def check):     00:00:00.009114052 ( 167.17MB)
Semantic (restrictions augmenter): 00:00:00.010037323 ( 167.17MB)
Semantic (ivars initializers):     00:00:00.014306303 ( 175.17MB)
Semantic (cvars initializers):     00:00:00.188233005 ( 207.23MB)
Semantic (main):                   00:00:00.446909827 ( 287.36MB)
Semantic (cleanup):                00:00:00.000357046 ( 287.36MB)
Semantic (recursive struct check): 00:00:00.000592775 ( 287.36MB)
Codegen (crystal):                 00:00:00.544523850 ( 335.36MB)
Codegen (bc+obj):                  00:48:03.114563685 ( 343.36MB)
Codegen (linking):                 00:00:00.431805899 ( 343.36MB)

Codegen (bc+obj):
 - no previous .o files were reused
...

Finished in 206.41 milliseconds
3 examples, 0 failures, 0 errors, 0 pending

As shown above, Codegen (bc+obj) took 48 minutes and 3 seconds.
Next, I inspected the timestamps in the cache directory:

ls -ltr --time-style="+%H:%M:%S" tmp/home-kojix2-Crystal-captcha-spec/ | tail -n 5
-rw-r--r--. 1 kojix2 kojix2   4316 18:17:53 S-lice40U-I-nt3241.o1.bc
-rw-r--r--. 1 kojix2 kojix2   4248 18:17:53 S-lice40U-I-nt3241.o1.o
-rw-r--r--. 1 kojix2 kojix2 690024 19:05:28 S-taticA-rray40U-I-nt3244321638441.o1.o
-rw-r--r--. 1 kojix2 kojix2   2596 19:05:28 C-rI-mage5858T-I-F-F-.o1.bc
-rw-r--r--. 1 kojix2 kojix2   2912 19:05:28 C-rI-mage5858T-I-F-F-.o1.o

It appears that generating S-taticA-rray40U-I-nt3244321638441.o1.o alone took 47 minutes and 35 seconds.


Oh, nice idea to use file timestamps!


To be honest, what I still don’t quite understand is why the llc command doesn’t take much time to compile the .bc file.

llc-18 -O1 -time-passes -filetype=obj S-taticA-rray40U-I-nt3244321638441.o1.bc -o out.o

This takes only 29 seconds.
What is the difference between the Crystal compiler and the llc command?

llc -O1 only runs backend optimizations like instruction selection and register allocation. It does NOT run middle-end optimization passes, which is why it’s fast.

Crystal’s compiler runs the full LLVM optimization pipeline via LLVM.run_passes with “default”. This is equivalent to running opt -O1 first, then doing what llc does.
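Based on that description, the compiler's per-unit work can be approximated on the command line as two separate stages (tool names with the -18 suffix match the commands used in this thread; the filename is the one from the cache listing). Timing each stage of the actual pipeline, rather than each tool in isolation, may show where the time goes:

```shell
# Stage 1: middle-end optimization (what opt does).
# Stage 2: backend codegen on the *optimized* bitcode (what llc does).
# Running llc directly on the original .bc skips stage 1, so neither
# tool alone necessarily reproduces the combined pipeline's cost.
time opt-18 -O1 S-taticA-rray40U-I-nt3244321638441.o1.bc -o optimized.bc
time llc-18 -O1 -filetype=obj optimized.bc -o out.o
```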


That’s right.
But when I run the opt command, it takes less than a second.

opt-18 -O1 -time-passes S-taticA-rray40U-I-nt3244321638441.o1.bc -o /dev/null

So it’s hard to tell which part of the pipeline is actually slow.
In fact, llc seems to be taking longer.
My understanding of compiler internals is fairly limited, and this investigation is based largely on AI assistance.

Can anyone reproduce the long object-file generation time from .bc files, and explain why?

Just to clarify: crimage switched from StaticArray to Slice in this recent commit, so anyone who wants to reproduce this needs to use the code from before that commit.