Crystal 0.34.0 slower?

I compiled, ran all my test specs and performance test
using Crystal 0.30, 0.31, 0.32, 0.33 and 0.34
The performance degradation starts with Crystal 0.31.

System information

$ inxi -MSC
System:    Host: laptop Kernel: 4.4.0-21-generic x86_64 (64 bit) Desktop: Cinnamon 3.0.7
           Distro: Linux Mint 18 Sarah
Machine:   System: ASUSTeK (portable) product: N551JX v: 1.0
           Mobo: ASUSTeK model: N551JX v: 1.0 Bios: American Megatrends v: N551JX.207 date: 08/06/2015
CPU:       Quad core Intel Core i7-4750HQ (-HT-MCP-) cache: 6144 KB
           clock speeds: max: 3200 MHz 1: 2000 MHz 2: 2447 MHz 3: 1999 MHz 4: 2000 MHz 5: 2005 MHz 6: 2033 MHz
           7: 2098 MHz 8: 3028 MHz

Summary

Crystal      0.30     0.31     0.32     0.33    0.34
(seconds)

Compile    105.09   144.28   147.32   145.88  154.12
Specs      150.47   163.34   163.36   163.74  169.20
Fibonacci    1.93     3.43     3.62     3.71    3.48

Detail report

===========================================
Crystal 0.30.1
===========================================

$ crystal version
Crystal 0.30.1 [5e6a1b672] (2019-08-12)

LLVM: 4.0.0
Default target: x86_64-unknown-linux-gnu

$ time crystal build main.cr -o crasy.out --release --no-debug
user : 105.09 system : 1.50 total : 1:45.15  CPU : 101%

$ ll crasy.out
... 4135464 May 18 17:45 crasy.out

$ time crystal spec
Finished in 1:49 minutes
5272 examples, 0 failures, 0 errors, 15 pending
user : 150.47 system : 2.56 total : 2:09.27  CPU : 118%

$ time ./crasy.out tests/testfibr.asy
fibonnaci(035) : 9227465
user : 1.93 system : 0.00 total : 0:01.93  CPU : 100%

===========================================
Crystal 0.31.1
===========================================
$ crystal version

Crystal 0.31.1 [0e2e1d067] (2019-09-30)

LLVM: 8.0.0
Default target: x86_64-unknown-linux-gnu

$ time crystal build main.cr -o crasy.out --release --no-debug
user : 144.28 system : 1.35 total : 2:24.47  CPU : 100%

$ ll crasy.out
... 3960576 May 18 17:58 crasy.out

$ time crystal spec
5272 examples, 0 failures, 0 errors, 15 pending
user : 163.34 system : 2.31 total : 2:18.15  CPU : 119%

$ time ./crasy.out tests/testfibr.asy
fibonnaci(035) : 9227465
user : 3.43 system : 0.00 total : 0:03.43  CPU : 100%

===========================================
Crystal 0.32.1
===========================================
$ crystal version
Crystal 0.32.1 [41bd18fbe] (2019-12-18)

LLVM: 8.0.0
Default target: x86_64-unknown-linux-gnu

$ time crystal build main.cr -o crasy.out --release --no-debug
user : 147.32 system : 1.43 total : 2:27.13  CPU : 101%

$ ll crasy.out
... 3897608 May 18 18:06 crasy.out

$ time crystal spec
5272 examples, 0 failures, 0 errors, 15 pending
user : 163.36 system : 2.51 total : 2:16.67  CPU : 121%

$ time ./crasy.out tests/testfibr.asy
fibonnaci(035) : 9227465
user : 3.62 system : 0.00 total : 0:03.62  CPU : 100%

===========================================
Crystal 0.33.0
===========================================
$ crystal version

Crystal 0.33.0 [612825a53] (2020-02-14)

LLVM: 8.0.0
Default target: x86_64-unknown-linux-gnu

$ time crystal build main.cr -o crasy.out --release --no-debug
user : 145.88 system : 1.64 total : 2:26.41  CPU : 100%

$ ll crasy.out
.. 3913952 May 18 18:15 crasy.out

$ time crystal spec
5272 examples, 0 failures, 0 errors, 15 pending
user : 163.74 system : 2.56 total : 2:16.79  CPU : 121%

$ time ./crasy.out tests/testfibr.asy
fibonnaci(035) : 9227465
user : 3.71 system : 0.00 total : 0:03.71  CPU : 100%

===========================================
Crystal 0.34.0
===========================================
$ crystal version

Crystal 0.34.0 [4401e90f0] (2020-04-06)

LLVM: 8.0.0
Default target: x86_64-unknown-linux-gnu

$ time crystal build main.cr -o crasy.out --release --no-debug
user : 154.12 system : 1.66 total : 2:34.67  CPU : 100%

$ ll crasy.out
...  3914176 May 18 18:30 crasy.out

$ time crystal spec
5272 examples, 0 failures, 0 errors, 15 pending
user : 169.20 system : 2.78 total : 2:23.69  CPU : 119%

$ time ./crasy.out tests/testfibr.asy
fibonnaci(035) : 9227465
user : 3.48 system : 0.00 total : 0:03.50  CPU : 99%
1 Like

I still think it’s related to overflow checking. That would fit in with where the performance hit first starts. Is the Fibonacci code small enough to share? Would be a good place to start at least.

I also think it has to do with overflow checking, but yesterday I ran the same code with overflow checking disabled (using &+, &- etc) per your suggestion, and things only got worse.

The fibonacci code interpreted by my bytecode interpreter is as follows

int fib(int n) {
  if (n < 3) {
    return 1;
  }
  return fib(n - 1) + fib(n - 2);
}

string s; int x; int i;
i = 35;
s = format("fibonnaci(%03d) : ",i);
x = fib(i);
write(s,x);

And you replaced every instance of those? Not just within the fib method?

EDIT: Also could be related to // not having a &//?

The fibonacci code I showed you, is interpreted by my bytecode interpreter.
In the bytecode interpreter I replaced all the + into &+, all the - into &- and all the * into &* .
That is the test I did yesterday using Crystal 0.34 and the fibonacci code ran even slower.

Today I used my original code base to do the tests as per straight-shoota’s suggestion.
I think it is safe to say the overflow checking is making things slower as seen in the test output I posted.

Perhaps tomorrow I will redo all the tests (from 0.31 to 0.34) with overflow checks disabled, not sure yet.

EDIT: Also could be related to // not having a &// ?

I don’t think so, the fibonacci code is not using any divisions only subtraction and addition.

1 Like

This is how you use Instruments: https://www.mikeperham.com/2016/06/24/profiling-crystal-on-osx/

You can also open it as a stand-alone program, load the executable and then run it.

4 Likes

I was interested in the discussion of profiling a crystal executable with valgrind, so I went down a bit of a rabbit hole today figuring out how to do that. I’m a relative newbie to valgrind, but I’ve used it a little bit in profiling C++, so I had somewhere to start.

While Crystal 0.34.0 doesn’t seem to emit function names in a way that Valgrind can consume, the most recent nightly “release” (via docker; thanks, @Blacksmoke16, for the help with that) has @ComputerMage’s recently-merged work on debugging, which fixes the function name output. I was able, after some trial and error, to figure out that when I compiled the executable on the container with the newer compiler and ran Valgrind there (as opposed to on my host machine) I could get really useful results.

If you’re using a docker container to try this until the next release (which should have the debugging changes), I recommend you also make a mount point on the host and view the callgrind output in kcachegrind, as has been mentioned. It’s still a little arcane to a newcomer, but it’s a whole lot better than the text output.

1 Like

Not sure about Crystal emitting the function names. As per earlier versions of the Crystal compiler I
am seeing function names which are just the function names I am using in my code.

The only thing I do, is to instruct the compiler to include debug information.

 crystal build main.cr -o debug_exe --debug

Some instructions on profiling:

https://crystal-lang.org/reference/guides/performance.html

Maybe somebody should add the word “profiling” to the title… :)

Basically profile once with 0.30 then again with 0.31, look for anything changes.

Maybe LLVM version used in by Crystal 0.30 vs 0.31 changed the code generated for fibonacci test substantially?

$ crystal version

Crystal 0.31.1 [0e2e1d067] (2019-09-30)

LLVM: 8.0.0

The fibonacci code is written in a language I am designing, it is not written in Crystal.

My program, the interpreter (written in Crystal), generates bytecode instructions for the fibonacci program. The bytecode instructions are the same in both situations, independent of any Crystal compiler or any version of LLVM.

Thanks for making the observation, any help/input is appreciated.

Conclusion

The performance degradation starts with Crystal 0.31 and
it remains slow up to Crystal 0.34. The overflow checks
do have a performance impact as expected, but the effect
is small. Still unclear what is causing the slowness.
See the table below for the results of the various tests.

Perhaps I am using older versions of required library files
or the code generated by the Crystal compiler is not optimized
for my system architecture.

I read all the release notes from 0.30 up to 0.34 to see
I missed some step but I could not find anything special.

Any thoughts as what to do next to find the cause ?

Summary (20-05-2020)
I changed my code to disable any overflow checks.
Compiled and ran the fibonnaci recursive test for n = 35
using Crystal 0.30, 0.31, 0.32, 0.33 and 0.34

Crystal      0.30     0.31     0.32     0.33    0.34
timing

Compile       N/A   149.48   148.88   148.55  152.66
Fibonacci     N/A     3.41     3.27     3.42    3.36

Summary (18-05-2020)
No changes made in my codebase

Crystal      0.30     0.31     0.32     0.33    0.34
timing

Compile    105.09   144.28   147.32   145.88  154.12
Fibonacci    1.93     3.43     3.62     3.71    3.48

System information

$ inxi -MSC
System:    Host: laptop Kernel: 4.4.0-21-generic x86_64 (64 bit) Desktop: Cinnamon 3.0.7
           Distro: Linux Mint 18 Sarah
Machine:   System: ASUSTeK (portable) product: N551JX v: 1.0
           Mobo: ASUSTeK model: N551JX v: 1.0 Bios: American Megatrends v: N551JX.207 date: 08/06/2015
CPU:       Quad core Intel Core i7-4750HQ (-HT-MCP-) cache: 6144 KB
           clock speeds: max: 3200 MHz 1: 2000 MHz 2: 2447 MHz 3: 1999 MHz 4: 2000 MHz 5: 2005 MHz 6: 2033 MHz
           7: 2098 MHz 8: 3028 MHz
```

Have you tried compiling crystal from master, and then run against that? I’m curious if the issue was “fixed”, or if it gets worse.

No, I did not do that. A lot of work, to be honest. I am still looking for a LLVM version 8 for my platform.
Not keen on the whole idea.

Perhaps clone the Crystal repo and looking at the compiler source for version 0.30 compared to 0.31 should give some clues. The overflow checks were introduced but as shown that is not the reason of the slowdown. There must be something else, perhaps the newer version of LLVM is the problem. Just thinking out loud here.

I think a good next step would be to use Callgrind while running your Fibonacci program (with a smaller n) on Crystal 0.30 and 0.34 to compare the output of each.

The time is being spent somewhere, so Callgrind should show that. Once you can isolate the Crystal code, it will be easier to debug.

This is what I did yesterday, correct me where I am wrong (new to valgrind)
for my fibonnaci test with n = 20

$ valgrind --tool=callgrind --branch-sim=yes --cache-sim=yes ./crasy030.out tests/fib.asy
$ valgrind --tool=callgrind --branch-sim=yes --cache-sim=yes ./crasy034.out tests/fib.asy

Then I started kcachegrind for both callgrind outputs and visually compared the parts doing the actually work. The column IR in kcachegrind shows the absolute total number of times a given instruction is executed (that is how I understand it)

I do a lot of pushing and popping in my interpreter and also a lot of constant lookups so I concentrated my investigation on those parts. The number given below are the total number of times the function is called, the number in parens are the number of IR’s.

What is clear that the number of total IR’s is much higher using the version 0.34 executable. I just listed 2 function calls but on the whole the number of IR’s using 0.34 is a lot higher. So for some reason more work is done and thus takes more time to complete.

Why this is the case is still a mystery

Crystal 				   0.30           						  0.34

# called         

constant lookup    33873 (# Ir 16563897)          33873 (# Ir 17817198)
push               33873 (# Ir  9044091)          33873 (# Ir  9450567)

Just to make sure I understand correctly, your language is compiled to byte code which is then run by your VM written in Crystal, right? And you are running Valgrind against the VM, not your compiled Fibonacci program right?

If you’ve traced it back to constant lookup and push, could you paste the Crystal source to those two functions?

I am now also curious to know if each release of Crystal is benchmarked for performance regressions. Anyone know?

In response to mgomes questions, I hope this will make sense.

I am writing an interpreter for my Crasy language in my most favorite
language Crystal. After compilation of the Crystal code, I use the executable
‘crasy,out’ to read/evaluate/process bytecode instructions for a given Crasy program
in a run loop.

To quickly test performance I am using a small program which computes
the fibonnaci number (recursive) for a given input value in my example n = 35.

I did this test using Crystal 0.30 and Crystal 0.34, see

https://forum.crystal-lang.org/t/crystal-0-34-0-slower/2124/21

for results of these tests.

The Crasy source code for this program is

int fib(int n) {

  if (n < 3) {
    return 1;
  }
  return fib(n - 1) + fib(n - 2);
}

and the bytecode instructions for this program (length 54)

0035 0000 0004 00-3 0047 0003 0045 0020 0006 0003 0047 0001 0024 0019 0001 0003
0004 00-3 0047 0001 0040 0023 0000 0001 0004 00-3 0047 0002 0040 0023 0000 0001
0039 0024 0007 0001 0007 0002 0007 0003 0047 0002 0008 0004 0003 0006 0005 0023
0000 0001 0008 0006 0003 0034

and the disassembled bytecode

================================================================================
function : fib
================================================================================
line [offset] : name                       opcode

0003 [0000]   : OP_FUNCTION                0035   [value fib]
0003 [0002]   : OP_GET_LOCAL               0004   [param -3]
0005 [0004]   : OP_ICONSTANT               0047   [value 3]
0005 [0006]   : OP_ILESS                   0045
0005 [0007]   : OP_JUMP_IF_FALSE           0020   [jump 0007 -> 0015]
0005 [0009]   : OP_POP                     0003
0005 [0010]   : OP_ICONSTANT               0047   [value 1]
0006 [0012]   : OP_RETURN                  0024
0006 [0013]   : OP_JUMP                    0019   [jump 0013 -> 0016]
0006 [0015]   : OP_POP                     0003
0006 [0016]   : OP_GET_LOCAL               0004   [param -3]
0009 [0018]   : OP_ICONSTANT               0047   [value 1]
0009 [0020]   : OP_ISUBTRACT               0040
0009 [0021]   : OP_CALL                    0023   [address : 0 - args : 1]
0009 [0024]   : OP_GET_LOCAL               0004   [param -3]
0009 [0026]   : OP_ICONSTANT               0047   [value 2]
0009 [0028]   : OP_ISUBTRACT               0040
0009 [0029]   : OP_CALL                    0023   [address : 0 - args : 1]
0009 [0032]   : OP_IADD                    0039
0009 [0033]   : OP_RETURN                  0024

================================================================================
function : main
================================================================================
line [offset] : name                       opcode

0009 [0034]   : OP_DEFINE_GLOBAL           0007   [index 1 variable s]
0012 [0036]   : OP_DEFINE_GLOBAL           0007   [index 2 variable x]
0013 [0038]   : OP_DEFINE_GLOBAL           0007   [index 3 variable i]
0013 [0040]   : OP_ICONSTANT               0047   [value 2]
0014 [0042]   : OP_SET_GLOBAL              0008   [index 4 value i]
0016 [0044]   : OP_POP                     0003
0016 [0045]   : OP_GET_GLOBAL              0006   [index 5 value i]
0016 [0047]   : OP_CALL                    0023   [address : 0 - args : 1]
0018 [0050]   : OP_SET_GLOBAL              0008   [index 6 value x]
0018 [0052]   : OP_POP                     0003
0018 [0053]   : OP_HALT                    0034

The crasy program reads these bytecode instructions and depending
on the code (opcode), will call a Crystal function in the interpreter
(virtual machice). The virtual machine uses a stack (Crystal stack array)
to push and pop values.

Next is some of the Crystal code of the virtual machine
which is called for popping and pushing values from the stack as well as
code to read an opcode and add and subtract integer values.


STACK_MAX = 1024

class VM
  property compiler  : Compiler
  property sp        : Int32 = -1
  property ip        : Int32 = 0
  property fp        : Int32 = 0
  property stack     : StaticArray(Env::Value,STACK_MAX)
  property status    : Int32 = 0

  # read a bytecode
  def read_int() : Int32
    opcode = @compiler.code[@ip]
    @ip    = @ip + 1
    return opcode
  end # read_int

  # lookup
  def do_iconstant()
    value  = read_int()
    push(Env::Value.new(value))
  end

  def pop()
    val = @stack[@sp]
    @sp = @sp - 1
    return val
  end

  def push(val : Env::Value)
    @sp = @sp + 1
    if @sp >= STACK_MAX
      runtime_error("stack overflow",true)
    end

    @stack[@sp] = val
  end

  def do_iadd()
    b = pop()
    a = pop()

    # Disabled Crystal overflow check
    r = a.vint &+ b.vint
    val = Env::Value.new(r)
    push(val)

  end

  def do_isubtract()
    b = pop()
    a = pop()

    # Disabled Crystal overflow check
    r = a.vint &- b.vint
    val = Env::Value.new(r)
    push(val)

  end

  # ... more Crystal code ...

end # class VM

Maybe could do a git bisect to try and figure out the slowdown (between 0.33 and 0.33.1 or whatever it was).