Invalid memory access in Crystal compiler [pre-solved]

pfischer · June 13, 2020, 7:38am

Hello!

In one of our projects, the Crystal compiler segfaults during the build:

Dependencies are satisfied
Building: *something*
Error target *something* failed to compile:
Invalid memory access (signal 11) at address 0x7f54bae43de8
[0x7f54dd6a5c56] ???
[0x7f54dd5efddb] ???
[0x7f54de4734f1] ???

On average, 3-4 release builds out of ten segfault. Each compilation attempt is made as follows:

rm -rf ~/.cache/crystal/ && time shards build --release
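
A minimal sketch for measuring that failure rate over repeated clean builds. The helper names count_failures and clean_release_build are made up for illustration; only the rm and shards commands are the ones actually used above.

```shell
# count_failures N CMD...: run CMD N times and print how many runs
# exited with a non-zero status.
count_failures() {
  runs=$1
  shift
  failures=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  echo "$failures"
}

# Hypothetical wrapper around the build command quoted above:
# wipe the compiler cache, then do a release build.
clean_release_build() {
  rm -rf ~/.cache/crystal/ && shards build --release
}

# Usage (not run automatically, since each build takes minutes):
#   count_failures 10 clean_release_build
```

Note that this counts any failed build, not only segfaults, so it assumes the project otherwise compiles cleanly.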

We can’t share the project source code, because it’s private.

I can’t create a GitHub issue until I have some relevant information instead of several question marks and irrelevant hex addresses. What can I do next? Compile the Crystal compiler myself in debug mode? Are there any instructions on how to do it? If I had a debug stacktrace, would that be enough to track down a compiler error?

Thanks.

jhass · June 13, 2020, 8:17am

A full debug stacktrace will be helpful in narrowing things down, but it’ll probably be impossible to fix this without a reproducing example.

A hotspot for these kinds of errors unfortunately is still debug data, so you might want to check whether --no-debug improves things for you.

You can get a compiler with debug symbols by installing LLVM (with development headers), then cloning the repo and running make FLAGS="--debug". Then use the bin/crystal wrapper script to compile your project.
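
Roughly, those steps could be scripted like this. This is a sketch, not official build instructions: the run and build_debug_compiler helpers and the DRY_RUN switch are assumptions for illustration, LLVM with development headers is assumed to be installed already, and path/to/your/main.cr is a placeholder.

```shell
# run CMD...: execute CMD, or just print it when DRY_RUN=1 (illustrative helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# build_debug_compiler [DIR]: clone the Crystal repo into DIR (default: crystal)
# and build the compiler with debug symbols.
build_debug_compiler() {
  dest=${1:-crystal}
  run git clone https://github.com/crystal-lang/crystal.git "$dest" &&
    run make -C "$dest" FLAGS="--debug"
}

# Usage (not run automatically):
#   build_debug_compiler crystal
#   crystal/bin/crystal build --release path/to/your/main.cr
```

Setting DRY_RUN=1 before calling build_debug_compiler only prints the commands, which is handy for checking them before a long build.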

What you can then also do is spot interesting places in the backtrace and add debug prints to the compiler itself there. That might help you identify which areas of your codebase trigger this, so you can reduce it to a sharable example. Another good thing to watch out for is loops in the stacktrace, which indicate infinite recursion in the compiler; that can still be triggered by defining an infinitely recursive type.

asterite · June 13, 2020, 10:40am

The easiest way to reproduce these bugs is to remove a bit of code and compile again. If it segfaults, remove more code and try again. If it doesn’t segfault, put back the code that you just removed and try removing something else.

It’s tedious but eventually you’ll get to the bottom of it.

This also includes removing code from shards. It’s easy because all code is in the project directory.

I did this at least ten times now and I always managed to find the smallest code that triggers the bug. One day if I have time I’ll make a video on how to do it, but essentially it’s what I explain above.
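
A rough sketch of that loop at the granularity of whole files. The names still_crashes and reduce_files are invented for illustration, and the check assumes the crash message matches the "Invalid memory access" text from the first post.

```shell
# still_crashes: hypothetical check that succeeds only when the build
# output contains the compiler crash message.
still_crashes() {
  shards build --release 2>&1 | grep -q "Invalid memory access"
}

# reduce_files CHECK FILE...: move each FILE aside in turn. If CHECK still
# succeeds (the crash is still there), leave the file removed; otherwise
# put it back and move on to the next one.
reduce_files() {
  check=$1
  shift
  for f in "$@"; do
    mv "$f" "$f.removed"
    if "$check"; then
      echo "removed: $f"
    else
      mv "$f.removed" "$f"
      echo "kept: $f"
    fi
  done
}

# Usage (not run automatically), on a copy of the project:
#   reduce_files still_crashes src/some_file.cr src/other_file.cr
```

A removal that merely causes an ordinary compile error counts as "no crash", so that file is restored. For a crash that only shows up in some builds, the check would need to retry the build several times before concluding that the crash is gone.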

jhass · June 13, 2020, 10:43am

Well, that does work less well if you have seemingly random memory corruption in the compiler as it seems here.

On average, 3-4 release builds out of ten segfault

Still, I’d love somebody attempting to do something like bugpoint for Crystal.

In fact I forgot to mention, you can compile with --verbose and at least see which stage it’s faulting in. If it’s inside LLVM, you can use --emit=llvm-bc and see if bugpoint can pinpoint it, then try to correlate the output back to the original code. If it’s inside LLVM, getting an LLVM debug build can also be helpful as it runs more assertions.

j8r · June 13, 2020, 7:03pm

It’s strange that the builds are not always segfaulting. Usually a given piece of code either always causes a segfault or never does; it isn’t random, unless the segfault is runtime-related. Maybe that is the case here, at program startup? However, you said that a build is either segfaulty or not, so that’s not the case here.

Where are these builds compiled, in CI or on your local machine? Do you see the same 3-4 out of ten ratio on other machines?

pfischer · June 13, 2020, 11:00pm

@j8r - no, it’s not the compiled target binary that segfaults. It’s the compiler itself that segfaults, with Error target *something* failed to compile:. And only in release mode (--release).

The build is compiled on a developer laptop (Core i5, Ubuntu 20.04). It also segfaults in a local docker build (not the server CI, the dev laptop again), and yes, my colleague confirmed that his segfault ratio is similar (Core i7, Pop OS 20.04).

These random compiler segfaults occur across all Crystal 0.3x versions, including the new 0.35.

@asterite We can’t use the “remove a bit of code, compile again” cycle to find the smallest code that triggers the bug (IMHO), because there are hundreds of classes in the project - I can’t imagine where to start with this method :)

@jhass with the --no-debug switch, the segfault ratio is a little better.

I will try to build a debug version of the Crystal compiler in a docker container and post the stacktrace.

j8r · June 14, 2020, 11:44pm

Ha, sorry. Did you try:

Compiling the compiler on the host, then using it
Using the compiler from Alpine Linux, both from alpine@edge and from the official Alpine Crystal image

The result may be the same, but you never know; it could be due to the LLVM version.

Pop OS is based on Ubuntu, so it is kind of expected.

pfischer · June 16, 2020, 11:40pm

Thanks for the tip regarding the Alpine docker images. Here are the next compile test results in local docker:

FROM crystallang/crystal:0.35.0 - first Crystal compiler segfault after 3-4 builds
FROM crystallang/crystal:0.35.0-alpine - first Crystal compiler segfault after 3-4 builds

and the winner is:

FROM alpine:edge - no Crystal compiler segfault (tested 15 builds)

The cache ~/.cache/crystal/ is empty before each build.
One build compilation time is about 5 minutes.
Segfaults occur only in --release mode.
If I use --progress --time, the segfault occurs in the Codegen (bc+obj) compiler stage.

I still don’t know what the difference is (maybe Crystal on alpine:edge is compiled with a different LLVM or different library versions)… but it’s “interesting” and we have at least “something”.

What do you think about it?

pfischer · June 17, 2020, 2:52pm

Oh yes! The current Crystal compiler package on Ubuntu is compiled with LLVM 8.0.0, and it segfaults randomly.

I compiled my own Crystal 0.35.0 compiler in the Ubuntu 20.04 docker image with the current llvm-8 package, which is 8.0.1, and it’s not segfaulting! Bingooo! Bonus: compilation looks faster with LLVM 8.0.1.
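
To check which LLVM a given compiler binary was built against, the output of crystal --version includes an LLVM: line. A small sketch for pulling it out, where llvm_version is just a made-up helper name:

```shell
# llvm_version: read `crystal --version` output on stdin and print
# only the version from its "LLVM:" line.
llvm_version() {
  sed -n 's/^LLVM: *//p'
}

# Usage (not run automatically):
#   crystal --version | llvm_version
```

Running it against both the packaged compiler and a self-built one makes the 8.0.0 vs 8.0.1 difference easy to confirm.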

Can you (Crystal team) rebuild Crystal on all possible platforms with LLVM 8.0.1?

@asterite @jhass @j8r Should I create an issue on GitHub?

j8r · June 17, 2020, 4:41pm

I don’t think so; this may be exactly why LLVM 8.0.1 exists. The issue may be related to a regression fixed in that patch release.

Edit: may be fixed with LLVM 10: https://github.com/crystal-lang/distribution-scripts/pull/68

pfischer · June 17, 2020, 6:12pm

LLVM 10.0.0 looks good (tested now). That’s why Crystal on alpine:edge is not segfaulting: it also uses LLVM 10.0.0.
