Invalid memory access in Crystal compiler [pre-solved]

Hello!

In one of our projects, Crystal compiler segfaults on:

Dependencies are satisfied
Building: *something*
Error target *something* failed to compile:
Invalid memory access (signal 11) at address 0x7f54bae43de8
[0x7f54dd6a5c56] ???
[0x7f54dd5efddb] ???
[0x7f54de4734f1] ???

On average, 3-4 release builds out of ten segfault. Each compilation attempt is made as follows:

rm -rf ~/.cache/crystal/ && time shards build --release
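To put a number on an intermittent failure like this, the build can be repeated in a loop and the failures counted. The following is a minimal sketch; count_failures is a hypothetical helper name, not part of Crystal or shards, and it counts every non-zero exit, not only segfaults.

```shell
# Run a command N times and print how many of the runs failed.
# Usage: count_failures N command [args...]
# (count_failures is a hypothetical helper, not a Crystal or shards tool.)
count_failures() {
  runs=$1
  shift
  failures=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    # Any non-zero exit status is counted as a failure.
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  echo "$failures"
}
```

For example, count_failures 10 sh -c 'rm -rf ~/.cache/crystal/ && shards build --release' would repeat the cold build above ten times and print the number of failed attempts.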

We can’t share the project source code, because it’s private.

I can’t create a GitHub issue until I have some relevant information instead of several question marks and meaningless hex addresses. What can I do next? Compile the Crystal compiler myself in debug mode? Are there any instructions on how to do that? If I had a debug stacktrace, would that be enough to track down a compiler error?

Thanks.

A full debug stacktrace will be helpful in narrowing things down, but it’ll probably be impossible to fix this without a reproducing example.

Unfortunately, debug data is still a hotspot for these kinds of errors, so you might want to check whether --no-debug improves things for you.

You can get a compiler with debug symbols by installing LLVM (with development headers), cloning the repo, and running make FLAGS="--debug". Then use the bin/crystal wrapper script to compile your project.
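The steps above could be scripted roughly as follows. This is only a sketch: it assumes LLVM and its development headers are already installed, and the run and build_debug_compiler helpers and the DRY_RUN switch are hypothetical names introduced here for illustration.

```shell
# With DRY_RUN=1 each step is printed instead of executed.
# (run, build_debug_compiler and DRY_RUN are hypothetical names.)
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Clone the compiler sources into the given directory and build
# the compiler there with debug symbols.
build_debug_compiler() {
  dir=$1
  run git clone https://github.com/crystal-lang/crystal.git "$dir" &&
    run make -C "$dir" FLAGS=--debug
}
```

After a successful build, the bin/crystal wrapper inside that directory can be used in place of the installed crystal binary.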

You can then also look for interesting places in the backtrace and add debug prints to the compiler itself. This might help you identify which areas of your codebase trigger the crash, so that you can reduce it to a shareable example. Another good thing to watch out for is loops in the stacktrace: they indicate infinite recursion in the compiler, which can still be triggered by defining an infinitely recursive type.

The easiest way to reproduce these bugs is to remove a bit of code and compile again. If it segfaults, remove more code and try again. If it doesn’t segfault, put back the code that you just removed and try removing something else.

It’s tedious but eventually you’ll get to the bottom of it.

This also includes removing code from shards. It’s easy because all the code is in the project directory.

I have done this at least ten times now, and I always managed to find the smallest code that triggers the bug. One day, if I have time, I’ll make a video on how to do it, but essentially it’s what I explained above.
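The decision step of this remove-and-recompile cycle can be sketched as a small helper that runs one build and reports whether it segfaulted, succeeded, or failed for another reason (for example because the removed code was still needed). classify_build is a hypothetical name, and matching on the text "Invalid memory access" is an assumption based on the error output quoted at the top of this thread.

```shell
# Run one build command and print "ok", "segfault" or "other-error".
# (classify_build is a hypothetical helper name.)
classify_build() {
  if output=$("$@" 2>&1); then
    echo ok
  elif printf '%s\n' "$output" | grep -q 'Invalid memory access'; then
    # Same message as the compiler crash reported in this thread.
    echo segfault
  else
    echo other-error
  fi
}
```

With an intermittent crash, a single "ok" does not prove that the removed code was the trigger, so each candidate removal would need several builds before deciding whether to keep it.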

Well, that works less well if you have seemingly random memory corruption in the compiler, as seems to be the case here.

On average, 3-4 release builds out of ten segfault

Still, I’d love for somebody to attempt something like bugpoint for Crystal.

In fact, I forgot to mention: you can compile with --verbose and at least see which stage it’s faulting in. If it’s inside LLVM, you can use --emit=llvm-bc and see if bugpoint can pinpoint it, then try to correlate the output back to the original code. If it’s inside LLVM, getting an LLVM debug build can also be helpful, as it runs more assertions.

It’s strange that the builds don’t always segfault. Usually a given piece of code either always causes a segfault or never does; it isn’t random, except when the segfault is runtime-related. Maybe that is what’s happening here, at program startup? However, you said that a build is either segfaulty or not, so that’s not the case here.

Where are these builds compiled, in CI or on your local machine? Do you see the same 3-4 out of ten ratio on other machines?

@j8r - no, it’s not the compiled target binary that segfaults. It’s the compiler itself that segfaults (Error target *something* failed to compile:), and only in release mode (--release).

The build is compiled on a developer laptop (Core i5, Ubuntu 20.04). It also segfaults in a local docker build (not the server CI, the dev laptop again), and yes, my colleague confirmed that his segfault ratio is similar (Core i7, Pop OS 20.04).

These random compiler segfaults occur in all Crystal 0.3x versions, including the new 0.35.

@asterite We can’t use the “remove a bit of code, compile again” cycle to find the smallest code that triggers the bug (IMHO), because there are hundreds of classes in the project - I can’t imagine where to start with this method :)

@jhass with the --no-debug switch, the segfault ratio is a little better.

I will try to build a debug version of the Crystal compiler in a docker container and post the stacktrace.

Ha sorry. Did you try:

  • Compiling the compiler on the host, then using it
  • Using the compiler from Alpine Linux, both from alpine@edge and from the official Alpine Crystal image

The result may be the same, but you never know; it could be due to the LLVM version.

Pop OS is based on Ubuntu, so it is kind of expected.

Thanks for the tip regarding the Alpine docker images. Here are the next compile test results in local docker:

  • FROM crystallang/crystal:0.35.0 - first Crystal compiler segfault after 3-4 builds
  • FROM crystallang/crystal:0.35.0-alpine - first Crystal compiler segfault after 3-4 builds

and the winner is:

  • FROM alpine:edge - no Crystal compiler segfault (tested 15 builds)

The cache ~/.cache/crystal/ is empty before each build.
One build takes about 5 minutes to compile.
Segfaults occur only in --release mode.
If I use --progress --time, the segfault occurs in the Codegen (bc+obj) compiler stage.

I still don’t know what the difference is (maybe Crystal on alpine:edge is compiled with a different LLVM or different library versions)… but it’s “interesting” and we have at least “something”.

What do you think about it?

Oh yes! The current Crystal compiler package on Ubuntu is compiled with LLVM 8.0.0, and it segfaults randomly.

I compiled my own Crystal compiler 0.35.0 in the Ubuntu 20.04 docker image with the current llvm-8 package, which is 8.0.1, and it’s not segfaulting! Bingooo! Bonus: compilation looks faster with LLVM 8.0.1.

Can you (Crystal team) rebuild Crystal on all possible platforms with LLVM 8.0.1?

@asterite @jhass @j8r Should I create an issue on GitHub?

I don’t think so; this may be exactly why LLVM 8.0.1 exists. The issue may be related to a regression fixed in that patch release.

Edit: may be fixed with LLVM 10: https://github.com/crystal-lang/distribution-scripts/pull/68

LLVM 10.0.0 looks good (tested just now). That’s the reason why Crystal on alpine:edge is not segfaulting: it also uses LLVM 10.0.0.
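For anyone comparing compilers across images, the LLVM version a given compiler was built against can be read from its version output. The sketch below assumes that crystal --version prints a line starting with "LLVM:"; llvm_version_of is a hypothetical helper name.

```shell
# Print the LLVM version reported by a compiler's --version output.
# Usage: llvm_version_of crystal
# Assumes the output contains a line of the form "LLVM: 8.0.0".
# (llvm_version_of is a hypothetical helper name.)
llvm_version_of() {
  "$@" --version | sed -n 's/^LLVM:[[:space:]]*//p'
}
```

Running llvm_version_of crystal inside each docker image would show whether that image’s compiler is linked against 8.0.0, 8.0.1 or 10.0.0.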