Segfault while bootstrapping from Ruby

I’ve been trying to bootstrap the compiler from Ruby using crystal-lang/bootstrap-script (“Automated script to bootstrap the crystal compiler from source”), with the goal of updating the script to the latest versions and running it in containers, but I can’t get past the second stage:

==> Bootstrapping Crystal (001/160)
patching file bin/crystal
patching file bootstrap/crystal/compiler.cr
patching file bootstrap/crystal/program.cr
patching file lib/crystal/compiler.rb
                           user     system      total        real
parse                  0.000000   0.000000   0.000000 (  0.000615)
normalize              2.910000   0.120000   3.040000 (  3.030746)
type inference:       26.740000   1.190000  27.930000 ( 27.953474)
fix empty types        0.350000   0.000000   0.350000 (  0.352627)
afert type inference   0.730000   0.020000   0.750000 (  0.739854)
codegen-llvm          17.450000   0.930000  18.380000 ( 18.401006)
codegen-llc          .crystal//home/geopjr/bootstrap-script/buildroot/src/crystal/bootstrap/crystal.cr/main.s: Assembler messages:
.crystal//home/geopjr/bootstrap-script/buildroot/src/crystal/bootstrap/crystal.cr/main.s:8847: Warning: stand-alone `data16' prefix
.crystal//home/geopjr/bootstrap-script/buildroot/src/crystal/bootstrap/crystal.cr/main.s:8849: Warning: stand-alone `data16' prefix
.crystal//home/geopjr/bootstrap-script/buildroot/src/crystal/bootstrap/crystal.cr/main.s:8850: Warning: stand-alone `data16' prefix
  0.850000   0.170000  54.040000 ( 54.169867)
codegen-clang        /usr/bin/ld: .crystal//home/geopjr/bootstrap-script/buildroot/src/crystal/bootstrap/crystal.cr/main.o: in function `_2A_Crystal_3A__3A_Compiler_23_compile_3C_Crystal_3A__3A_Compiler_3E__3A_Nil':
.crystal//home/geopjr/bootstrap-script/buildroot/src/crystal/bootstrap/crystal.cr/main.bc:(.text+0x11b46): warning: the use of `tmpnam' is dangerous, better use `mkstemp'
/usr/bin/ld: warning: .crystal//home/geopjr/bootstrap-script/buildroot/src/crystal/bootstrap/crystal.cr/main.o: missing .note.GNU-stack section implies executable stack
/usr/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
  0.000000   0.110000   0.570000 (  0.576038)
TOTAL:                49.030000   2.540000 105.060000 (105.224228)
==> Bootstrapping Crystal (002/160)
Normalize: 0.618294 seconds
Type inference: 10.2995 seconds
Codegen (crystal): 5.31784 seconds
Codegen (bitcode): 1.07611 seconds
./bootstrap: line 192: 24294 Segmentation fault      (core dumped) "$input" "${crystal_args[@]}" -o "$output" src/compiler/crystal.cr
Normalize: 0.598149 seconds
Type inference: 10.868 seconds
Codegen (crystal): 5.3217 seconds
Codegen (bitcode): 1.17425 seconds
./bootstrap: line 192: 25957 Segmentation fault      (core dumped) "$input" "${crystal_args[@]}" -o "$output" src/compiler/crystal.cr

Using the compiler itself is a bit more verbose:

1-crystal -e "puts \"foo\""

No such file or directory:
_2A_Exception_23_initialize_3C_Errno_2C__20_String_3E__3A_Array_28_String_29_ + [0]
_2A_Errno_23_initialize_3C_Errno_3E__3A_Array_28_String_29_ + [0]
_2A_Errno_3A__3A_new_3C_Errno_3A_Class_3E__3A_Errno + [0]
_2A_File_3A__3A_expand_path_3C_File_3A_Class_2C__20_String_3E__3A_String + [0]
_2A_Crystal_3A__3A_Program_23_require_from_load_path_3C_Crystal_3A__3A_Program_2C__20_String_3E__3A_Nil_20__7C__20_Crystal_3A__3A_ASTNode_2B_ + [0]
_2A_Crystal_3A__3A_Program_23_require_3C_Crystal_3A__3A_Program_2C__20_String_2C__20_Nil_20__7C__20_String_20__7C__20_Crystal_3A__3A_VirtualFile_3E__3A_Nil_20__7C__20_Crystal_3A__3A_ASTNode_2B_ + [0]
_2A_Crystal_3A__3A_Normalizer_23_transform_3C_Crystal_3A__3A_Normalizer_2C__20_Crystal_3A__3A_Require_2B__3E__3A_Crystal_3A__3A_ASTNode_2B_ + [0]
_2A_Crystal_3A__3A_ASTNode_2B__23_transform_3C_Crystal_3A__3A_ASTNode_2B__2C__20_Crystal_3A__3A_Normalizer_3E__3A_Crystal_3A__3A_ASTNode_2B_ + [0]
_2A_Crystal_3A__3A_Normalizer_23_transform_3C_Crystal_3A__3A_Normalizer_2C__20_Crystal_3A__3A_Expressions_2B__3E__3A_Crystal_3A__3A_ASTNode_2B_ + [0]
_2A_Crystal_3A__3A_ASTNode_2B__23_transform_3C_Crystal_3A__3A_ASTNode_2B__2C__20_Crystal_3A__3A_Normalizer_3E__3A_Crystal_3A__3A_ASTNode_2B_ + [0]
_2A_Crystal_3A__3A_Normalizer_23_normalize_3C_Crystal_3A__3A_Normalizer_2C__20_Crystal_3A__3A_ASTNode_2B__3E__3A_Crystal_3A__3A_ASTNode_2B_ + [0]
_2A_Crystal_3A__3A_Program_23_normalize_3C_Crystal_3A__3A_Program_2C__20_Crystal_3A__3A_ASTNode_2B__3E__3A_Crystal_3A__3A_ASTNode_2B_ + [0]
_2A_Crystal_3A__3A_Compiler_23_compile_3C_Crystal_3A__3A_Compiler_3E__3A_Nil + [0]
__crystal_main + [0]
main + [0]
__libc_init_first + [0]
__libc_start_main + [0]
_start + [0]

More info:

  • If you try to run the script as-is, it will fail because of pcl’s download location; the URL needs the www. prefix:
-- curl -L -o "$downloads"/pcl-1.12.tar.gz http://xmailserver.org/pcl-1.12.tar.gz
++ curl -L -o "$downloads"/pcl-1.12.tar.gz http://www.xmailserver.org/pcl-1.12.tar.gz
  • LLVM 3 seems to require Python 2, which your distro may have replaced with Python 3 (and removed from its repos entirely). You can run the following to install it in the build environment (run it when the build fails, then start the script again once it finishes):
curl -L -o ./downloads/Python-2.7.3.tgz http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tgz && cd ./buildroot/src/ && rm -Rf Python-2.7.3.tgz && tar xzf ../../downloads/Python-2.7.3.tgz && cd Python-2.7.3 && ./configure --prefix="$(pwd)"/../../ --enable-shared && make && make install && cd ../../../ && rm -rf buildroot/src/llvm-3.3.src/
  • It probably won’t be able to find librt, so you may need to symlink it (I’m not sure this is the best approach, but I didn’t want to modify the stage patches yet): ln -s /usr/lib/x86_64-linux-gnu/librt.so.1 ./buildroot/lib/librt.so
  • The script fails on my machine while linking one of the dependencies, so all my attempts have been in Debian containers and VMs, including Debian versions from before the glibc librt and Python changes; I doubt the symlink or Python 2 is the issue.
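The pcl URL change in the first bullet can be applied in place with sed. A minimal sketch, assuming the script is the ./bootstrap file named in the stage-2 log above and that the old URL appears in it literally; fix_pcl_url is a hypothetical helper, and sed -i as used here is the GNU sed form.

```shell
#!/usr/bin/env bash
# fix_pcl_url [SCRIPT]: rewrite the old pcl download URL to the www. host.
# Hypothetical helper; assumes the URL appears literally in SCRIPT.
# Safe to run twice: the pattern no longer matches once www. is present.
fix_pcl_url() {
  local script="${1:-./bootstrap}"
  sed -i 's#http://xmailserver\.org/pcl-1\.12\.tar\.gz#http://www.xmailserver.org/pcl-1.12.tar.gz#' "$script"
}

# Example:
#   fix_pcl_url ./bootstrap
```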

Any ideas? 🤷

What’s the goal?

I mean, you mention updating the script and using the container, but do you have a more specific goal?

do you have a more specific goal?

Yes, kind of… There’s a certain distro in alpha with an unusual stack, and while figuring out the best way to package Crystal for it, I went down the bootstrapping-from-Ruby rabbit hole. Fixing the script isn’t that important for reaching that goal, as there are better ways to port Crystal over, but it’d be nice to have it as a working alternative.

However, I’m about to give up: even using the packages from that era yields the same result, and I don’t think anyone is up for debugging a decade-old build:

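# Run first, on the host, in the directory holding this Dockerfile:
#   git clone https://github.com/crystal-lang/crystal/ -b ruby
# llvm-3.3.tar.gz also needs to be in that directory (the build context).
FROM ubuntu:precise
WORKDIR /app
COPY . .

# Unpack LLVM 3.3 into /usr
RUN cat llvm-3.3.tar.gz | tar xz --strip-components=1 -C /usr
# precise is EOL, so point apt at old-releases.ubuntu.com
RUN sed -i -re 's/([a-z]{2}\.)?archive.ubuntu.com|security.ubuntu.com/old-releases.ubuntu.com/g' /etc/apt/sources.list
RUN apt-get update && apt-get install -y zlib1g-dev ruby1.9.3 build-essential git libpcre3-dev libunwind7-dev wget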


WORKDIR /app/crystal
RUN gem install bundler -v '< 1.8' --no-ri --no-rdoc
RUN bundle
# RUN bin/crystal -e "puts 1" # Works
RUN bin/crystal -stats bootstrap/crystal.cr
RUN ./crystal -e "puts 1"
# segfault

Hi, @GeopJr and other readers…

I’ve also tried to get the bootstrap script working and ran into similar issues (on Debian 12 “bookworm”, in my case). Here’s what I’ve done so far:

  1. Like you, I saw I needed Python 2. I built Python 2.7.18 from source and made a symlink so that the python2 command would run this version.
  2. I saw a message about needing to install GraphViz. This is a red herring: GraphViz was already installed, and this message is displayed regardless of whether or not it’s installed. So, you don’t need to worry about this message.
  3. I had an error about “Cannot find llvm-config”. This was displayed immediately after the GraphViz message (which is why I ended up looking into that). I’m still not sure why this happened, but after a bit of investigating, I ended up just doing rm -rf buildroot/src/llvm-3.3.src and re-running the bootstrap script, and it worked that time.
  4. I got the error with librt.so. Now I can share some useful information about this error: it happens because that library has been removed from modern glibc. It was only needed by time.linux.cr, which is easily fixed with the following patch (add this to the end of stage1.patch, and note that the empty-looking line before class Time needs a single space on it to be a valid patch file):
diff --git a/std/time.linux.cr b/std/time.linux.cr
index 8aa07734b..dbe094b9a 100644
--- a/std/time.linux.cr
+++ b/std/time.linux.cr
@@ -1,4 +1,4 @@
-lib Librt("rt")
+lib C
   struct TimeSpec
     tv_sec, tv_nsec : Int64
   end
@@ -7,7 +7,7 @@ end
 
 class Time
   def initialize
-    Librt.clock_gettime(0, out time)
+    C.clock_gettime(0, out time)
     @seconds = time.tv_sec + time.tv_nsec / 1e9
   end
 end

If you search for clock_gettime, you might find yourself here, which states “Link with -lrt (only for glibc versions before 2.17)”; that’s how I figured out I could simply change Librt to C.

  5. I got an error about libunwind.so. I tried the following workaround (I don’t recommend it, as it doesn’t seem to work; see below):
# as root
cd /usr/lib/x86_64-linux-gnu
ln -s libunwind.so.8.0.1 libunwind.so

Doing this does allow stage 1 to build, but when the script then runs it, it segfaults, maybe with the same segfault you’re getting…
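The Librt-to-C change in step 4 relies on clock_gettime being in libc itself, which per the man-page quote holds from glibc 2.17 on. A minimal sketch for checking that on a build host, assuming GNU coreutils (for sort -V) and a glibc system where getconf GNU_LIBC_VERSION works (it does not on musl); version_ge and needs_librt are hypothetical helper names.

```shell
#!/usr/bin/env bash
# version_ge A B: succeed if version A >= version B (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# needs_librt: exit status 0 if this glibc is older than 2.17, so
# clock_gettime still needs -lrt; 1 if it is 2.17 or newer, so the
# Librt -> C patch is safe; 2 if the glibc version can't be determined.
# getconf GNU_LIBC_VERSION prints e.g. "glibc 2.36" on glibc systems.
needs_librt() {
  local ver
  ver="$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{ print $2 }')"
  [ -n "$ver" ] || return 2
  ! version_ge "$ver" 2.17
}
```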

I did a little research, which revealed that there are multiple different libunwind projects, so maybe we’re linking to the wrong one. My libunwind.so.8.0.1, which is part of my Debian 12 system, comes from “nongnu” libunwind, whatever “nongnu” means. But there’s also an LLVM libunwind, as discussed here, and I see it’s mentioned here in the LLVM docs. Since Crystal uses LLVM, I’m wondering whether I need to somehow get it to link against the LLVM libunwind instead of the “nongnu” one, and whether that would resolve the segfault.

But I’m getting very out of my depth here, so that’s where I stopped. If anyone has some pointers for how to make further progress, I’d really appreciate it!

Relatedly, I’ve also posted on the Debian User Forums to try to find out how Debian compiles Crystal, since I feel sure they must be bootstrapping it somehow.


Hi all, I’ve made some progress…!

I installed Debian 7, 8, 9, 10 and 11 in virtual machines, in addition to my desktop Debian 12 installation.

I have now gotten past stage 2 without a segfault, but only on Debian 9 and 10; on Debian 11 and 12 it still segfaults. I didn’t test 7 and 8, since once it succeeded on Debian 9 there seemed to be no reason to try older versions yet. Note that I cancelled the bootstrap process once it reached stage 3, due to time constraints, so I don’t yet know whether there’ll be further trouble later on.

As I suspected, you don’t want to manually create a libunwind.so symlink like I described above. There’s a package for it; it was only a matter of finding it.

From a clean Debian install, as well as ensuring that python2 runs Python 2, you need at least:

apt-get install \
  automake \
  build-essential \
  cmake \
  git \
  libgc-dev \
  libpcre3-dev \
  libunwind-dev \
  libyaml-dev \
  zlib1g-dev
  • libyaml-dev and zlib1g-dev are required to build ruby in a way that lets you install gems. If either is missing, ruby still builds successfully, but you get an error later when it tries to install gems, and nothing in that error points to these packages being missing.
  • libpcre3-dev is what you need for the bootstrap script to work, not libpcre2-dev. Confusingly, although libpcre2-dev means PCRE 2, libpcre3-dev means PCRE 1 and is older. This had me stumped for a while.
  • libunwind-dev is (unsurprisingly) the package which creates libunwind.so. There are various numbered alternatives available as well, depending on the Debian version.
  • cmake is only required later in the bootstrap process. I haven’t actually gotten to the point where it’s required yet, but figured I may as well list it.
  • EDIT 2025-01-25: added libgc-dev, which is required for some later stages.
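To see at a glance which of the packages above are still missing on a given Debian install, a small check like the following could help. It’s a sketch that assumes a Debian-style system with dpkg-query; missing_pkgs and BOOTSTRAP_PKGS are hypothetical names, and the list is simply the one from the apt-get command above.

```shell
#!/usr/bin/env bash
# Minimum packages for a clean Debian install, per the apt-get list above.
BOOTSTRAP_PKGS=(automake build-essential cmake git libgc-dev libpcre3-dev
                libunwind-dev libyaml-dev zlib1g-dev)

# missing_pkgs PKG...: print each package that dpkg does not report as
# fully installed, one per line.
missing_pkgs() {
  local pkg
  for pkg in "$@"; do
    if ! dpkg-query -W -f='${Status}' "$pkg" 2>/dev/null |
        grep -q 'install ok installed'; then
      printf '%s\n' "$pkg"
    fi
  done
}

# Example (as root; $missing is left unquoted on purpose so it splits):
#   missing="$(missing_pkgs "${BOOTSTRAP_PKGS[@]}")"
#   [ -z "$missing" ] || apt-get install -y $missing
```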

One basic thing I want to explore is whether substituting one of the other available libunwind packages on Debian 11 and/or 12 resolves the segfault there. I suspect it won’t, though; in that case, the next step will be to check which other system libraries are being picked up, whether I can replace more of them with older versions compiled from source, and whether some combination avoids the segfault.

I’ll also let the bootstrap script run further on 9 & 10 when I have more time, and see whether it can get all the way to the end yet or if I have any further trouble beyond stage 3.

A while ago I spent some time trying to get this working again on Arch Linux (the same install I wrote the script on…). I’m starting to suspect I may have to build a more substantial container to reproduce the build.

However, the idea was to keep it as lightweight as possible, so that the bootstrap could be reproduced in an official Debian package. But the set of things that need to be built into the local prefix is much larger now, and the failures are harder to debug.


Thanks for the reply!

I’m planning to do some more work on this soon, and I’m happy to help out where I can. I’ve already got several changes to the bootstrap script; once I’ve gotten further, I can certainly make a PR.

Yeah, that’s a total footgun 🤦‍♂️ Who could come up with such a crooked idea?

Right, now that the holidays are over, I’ve had another chance to get further with this…

LLVM 3.9.1

On Debian 10, after additionally doing apt install libgc-dev, it now successfully gets to stage 130, but then it tries to build and install LLVM 3.9.1, which exits with an obscure ld error.

So I tried just doing the LLVM 3.9.1 step on its own on Debian 12… where it works fine. I didn’t look into this error any further.

libunwind

Meanwhile, I looked into the libunwind versions. The numbered libunwind-N-dev packages are not what you want, because those are the LLVM versions; only libunwind-dev, the NonGNU one, is relevant here. So I tried lifting libunwind.so from my Debian 10 installation onto my Debian 12 installation (while, rather dangerously, temporarily moving the system-wide one on Debian 12 out of the way so it definitely wouldn’t be picked up by mistake)… unfortunately, the stage-2 segfault is still there.

So, I conclude the stage-2 segfault has nothing to do with which version of (NonGNU) libunwind is installed.
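A less risky way to run the same experiment might be to leave the system-wide library in place, point the dynamic loader at the lifted copy with LD_LIBRARY_PATH, and confirm with ldd which file actually gets loaded. A sketch, assuming the Debian 10 libunwind.so.8 has been copied into a directory such as /opt/deb10-libs (a hypothetical path); resolved_path is a hypothetical helper that parses ldd’s “name => path (address)” lines. As far as I know, LD_LIBRARY_PATH only affects which copy is loaded when 1-crystal itself runs (which is when the segfault happens), not which libunwind the linker uses when stage 1 links stage 2.

```shell
#!/usr/bin/env bash
# resolved_path NAME: read `ldd` output on stdin and print the file that the
# first library whose soname starts with NAME resolved to.
resolved_path() {
  awk -v lib="$1" 'index($1, lib) == 1 && $2 == "=>" { print $3; exit }'
}

# Example (hypothetical directory holding the Debian 10 libunwind.so.8):
#   LD_LIBRARY_PATH=/opt/deb10-libs ldd stages/1-crystal | resolved_path libunwind.so
# If that prints the lifted copy, rerun stage 2 with the same LD_LIBRARY_PATH.
```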

gdb

I tried good old gdb, but (as expected) that didn’t produce anything useful either, except that it crashes right after spawning threads: the crashing thread is at an address gdb can’t resolve (??), called directly from start_thread:

(gdb) run -stats -o "/home/.../crystal-lang-bootstrap-script/stages/2-crystal" src/compiler/crystal.cr
Starting program: /home/.../crystal-lang-bootstrap-script/stages/1-crystal -stats -o "/home/.../crystal-lang-bootstrap-script/stages/2-crystal" src/compiler/crystal.cr
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after vfork from child process 38089]
[Detaching after vfork from child process 38091]
[Detaching after vfork from child process 38096]
[Detaching after vfork from child process 38098]
Normalize: 0.164507 seconds
Type inference: 3.15898 seconds
Codegen (crystal): 1.49343 seconds
[Detaching after vfork from child process 38100]
Codegen (bitcode): 0.445166 seconds
[New Thread 0x7ffff563f6c0 (LWP 38102)]
[New Thread 0x7ffff4e3e6c0 (LWP 38103)]
[New Thread 0x7ffff463d6c0 (LWP 38104)]

Thread 2 "1-crystal" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff563f6c0 (LWP 38102)]
0x00005555ec4413b0 in ?? ()
(gdb) bt
#0  0x00005555ec4413b0 in ?? ()
#1  0x00007ffff69c91c4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#2  0x00007ffff6a4985c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) thread apply all bt

Thread 4 (Thread 0x7ffff463d6c0 (LWP 38104) "1-crystal"):
#0  clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:62
#1  0x0000000000000000 in ?? ()

Thread 3 (Thread 0x7ffff4e3e6c0 (LWP 38103) "1-crystal"):
#0  0x00005555ec432430 in ?? ()
#1  0x00007ffff69c91c4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#2  0x00007ffff6a4985c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 2 (Thread 0x7ffff563f6c0 (LWP 38102) "1-crystal"):
#0  0x00005555ec4413b0 in ?? ()
#1  0x00007ffff69c91c4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#2  0x00007ffff6a4985c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 1 (Thread 0x7ffff7ebc740 (LWP 38086) "1-crystal"):
#0  clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:62
#1  0x00007ffff6a498ad in __GI___clone_internal (cl_args=cl_args@entry=0x7fffffffd170, func=func@entry=0x7ffff69c8ef0 <start_thread>, arg=arg@entry=0x7ffff463d6c0) at ../sysdeps/unix/sysv/linux/clone-internal.c:54
#2  0x00007ffff69c8df0 in create_thread (pd=pd@entry=0x7ffff463d6c0, attr=attr@entry=0x7fffffffd270, stopped_start=stopped_start@entry=0x7fffffffd266, stackaddr=stackaddr@entry=0x7ffff3e3d000, stacksize=<optimized out>, thread_ran=thread_ran@entry=0x7fffffffd267) at ./nptl/pthread_create.c:295
#3  0x00007ffff69c98bd in __pthread_create_2_1 (newthread=<optimized out>, attr=<optimized out>, start_routine=<optimized out>, arg=<optimized out>) at ./nptl/pthread_create.c:831
#4  0x0000555555b6eb1e in _2A_Thread_28_Crystal_3A__3A_Compiler_2C__20_Nil_29__23_initialize_3C_Thread_28_Crystal_3A__3A_Compiler_2C__20_Nil_29__2C__20_Crystal_3A__3A_Compiler_2C__20_Crystal_3A__3A_Compiler_20__2D__3E__20_Nil_3E__3A_Void ()
#5  0x0000555555b6eab5 in _2A_Thread_28_Crystal_3A__3A_Compiler_2C__20_Nil_29__3A__3A_new_3C_Thread_28_Crystal_3A__3A_Compiler_2C__20_Nil_29__3A_Class_2C__20_Crystal_3A__3A_Compiler_2C__20_Crystal_3A__3A_Compiler_20__2D__3E__20_Nil_3E__3A_Thread_28_Crystal_3A__3A_Compiler_2C__20_Nil_29_ ()
#6  0x000055555556c4c7 in _2A_Crystal_3A__3A_Compiler_23_compile_3C_Crystal_3A__3A_Compiler_3E__3A_Nil ()
#7  0x000055555555a767 in __crystal_main ()
#8  0x000055555555b959 in main ()
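To make a gdb session like the one above easy to rerun (for example after each library swap) and to share, it can be captured non-interactively. A sketch using gdb’s -batch, -ex and --args options; capture_crash is a hypothetical helper, and info sharedlibrary is added to record which copies of libunwind, libLLVM and libc were actually loaded at the time of the crash.

```shell
#!/usr/bin/env bash
# capture_crash LOG CMD [ARGS...]: run CMD under gdb in batch mode and write
# the crashing thread's backtrace, all threads' backtraces and the list of
# loaded shared libraries to LOG.
capture_crash() {
  local log="$1"
  shift
  gdb -batch \
    -ex 'set pagination off' \
    -ex run \
    -ex bt \
    -ex 'thread apply all bt' \
    -ex 'info sharedlibrary' \
    --args "$@" >"$log" 2>&1
}

# Example (paths are placeholders; run from the directory containing src/):
#   capture_crash crash-deb12.log /path/to/stages/1-crystal -stats \
#     -o /path/to/stages/2-crystal src/compiler/crystal.cr
```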

other libs

The stage-1 crystal binary doesn’t link against many system libraries:

objdump -p stages/1-crystal

...

  NEEDED               libpcre.so.3
  NEEDED               libunwind.so.8
  NEEDED               libLLVM-3.3.so
  NEEDED               libm.so.6
  NEEDED               libc.so.6
  NEEDED               ld-linux-x86-64.so.2

...
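To check whether stage 1 ends up with the same set of direct dependencies on Debian 10 and on 12, the NEEDED entries can be pulled out of objdump -p and diffed. A minimal sketch; needed_libs and the output file names are hypothetical. Keep in mind that NEEDED only lists direct dependencies by soname, so two identical lists can still resolve to different library builds on each host.

```shell
#!/usr/bin/env bash
# needed_libs: read `objdump -p` output on stdin and print the NEEDED
# (DT_NEEDED) entries, one per line, in a stable sorted order for diffing.
needed_libs() {
  awk '$1 == "NEEDED" { print $2 }' | LC_ALL=C sort
}

# Example, run once on each host, then compare:
#   objdump -p stages/1-crystal | needed_libs > needed-deb12.txt
#   diff needed-deb10.txt needed-deb12.txt
```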

So I’m really struggling to understand what’s causing this segfault and why it breaks between Debian 10 and 11.

let’s just lift it

One more thing I tried: since I’d managed to get stages up to 130 working on Debian 10 but not 12, and LLVM 3.9.1 working on Debian 12 but not 10, I rsync’d the stages up to 130 from 10 to 12 and then ran the bootstrap from there on 12 to see what would happen. It successfully built stages 131 through 162 but then failed on the very last stage (really stage 163, though the script wrongly numbers it 162) with the following error:

src/llvm/ext/llvm_ext.cc: In function ‘LLVMBool LLVMExtCreateMCJITCompilerForModule(LLVMOpaqueExecutionEngine**, LLVMModuleRef, LLVMMCJITCompilerOptions*, size_t, LLVMBool, char**)’:
src/llvm/ext/llvm_ext.cc:548:50: error: ‘AttributeList’ has not been declared
  548 |       Attrs = Attrs.addAttribute(F.getContext(), AttributeList::FunctionIndex,
      |                                                  ^~~~~~~~~~~~~
src/llvm/ext/llvm_ext.cc:561:45: error: no matching function for call to ‘unwrap(LLVMCodeModel&, bool&)’
  561 |   if (Optional<CodeModel::Model> CM = unwrap(options.CodeModel, JIT))
      |                                       ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
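These errors suggest that this stage’s llvm_ext.cc expects a different LLVM API than the one it is being compiled against (no AttributeList declaration, no matching unwrap overload). Before digging into the code, it may be worth confirming which llvm-config the build is actually picking up. A sketch; the LLVM_CONFIG variable is this sketch’s own convention, llvm_major and require_llvm_major are hypothetical helpers, and the minimum version to require is not something determined here.

```shell
#!/usr/bin/env bash
# llvm_major: print the major version reported by llvm-config --version
# (e.g. "3.9.1" -> "3"). Set LLVM_CONFIG to test a specific llvm-config;
# it defaults to whichever llvm-config is first on PATH.
llvm_major() {
  local v
  v="$("${LLVM_CONFIG:-llvm-config}" --version)" || return 1
  printf '%s\n' "${v%%.*}"
}

# require_llvm_major N: fail with a message if the LLVM in use is older
# than major version N.
require_llvm_major() {
  local have
  have="$(llvm_major)" || { echo "llvm-config not found" >&2; return 1; }
  if [ "$have" -lt "$1" ]; then
    echo "LLVM $have found, need >= $1" >&2
    return 1
  fi
}
```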

So, what next?

We’re now at the point where, if we’re happy to build stages up to 130 on Debian 10 and then switch to 12 for the remaining stages, we just need to fix stage 163 and catch up with the last (four years of?) additional stages.

That said, it’d be far preferable if we could do everything on Debian 12 (and newer versions going forward), but I’m stumped on what’s causing the stage-2 segfault.

Can anyone offer some help with this? @RX14 have you had any luck in the meantime?

Unfortunately I haven’t had time to look into it. What really should happen is that the whole build runs in a chroot, so the external environment has no effect at all… but that would require building a lot more dependencies to get a full toolchain (libc, clang, etc.).
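A chroot along those lines could start from debootstrap, which installs a minimal Debian release into a directory. A rough sketch, assuming root access and the debootstrap package on the host; make_root and in_root are hypothetical helpers, the suite name, target path and default mirror are assumptions (older releases may only be available from archive.debian.org), and the --include list is the package list from earlier in the thread. It only isolates userspace: the process still runs on the host kernel, so uname -r inside the chroot reports the host’s kernel.

```shell
#!/usr/bin/env bash
# make_root SUITE TARGET [MIRROR]: install a minimal Debian SUITE into TARGET
# with the bootstrap build dependencies. Must run as root.
make_root() {
  local suite="$1" target="$2" mirror="${3:-http://deb.debian.org/debian}"
  debootstrap --variant=minbase \
    --include=automake,build-essential,cmake,git,libgc-dev,libpcre3-dev,libunwind-dev,libyaml-dev,zlib1g-dev \
    "$suite" "$target" "$mirror"
}

# in_root TARGET CMD [ARGS...]: run CMD inside TARGET with /proc mounted,
# unmount /proc afterwards and return CMD's exit status.
in_root() {
  local target="$1" status
  shift
  mount -t proc proc "$target/proc" || return 1
  chroot "$target" "$@"
  status=$?
  umount "$target/proc"
  return "$status"
}

# Example (hypothetical paths):
#   make_root buster /srv/buster
#   in_root /srv/buster uname -r   # prints the host kernel version
```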

That should be relatively easy.
A Crystal 1.0 compiler can build 1.15 as well, so it’s not necessary to go through all the intermediate versions. Neither the build instructions nor the dependencies have changed much, except for the switch to libpcre2 in Crystal 1.8.
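Since the one dependency change mentioned is the switch to libpcre2 in Crystal 1.8, a script that jumps straight to recent releases mainly needs to pick the right PCRE development package for each target version. A sketch; pcre_pkg_for is a hypothetical helper, it takes 1.8 as the cut-over exactly as stated above, and it uses the Debian package names from earlier in the thread (libpcre3-dev is PCRE 1, libpcre2-dev is PCRE 2). It relies on GNU sort -V.

```shell
#!/usr/bin/env bash
# version_ge A B: succeed if version A >= version B (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# pcre_pkg_for VERSION: print the Debian PCRE dev package for a Crystal
# version: libpcre2-dev (PCRE 2) from 1.8 on, libpcre3-dev (PCRE 1,
# despite the "3") before that.
pcre_pkg_for() {
  if version_ge "$1" 1.8; then
    echo libpcre2-dev
  else
    echo libpcre3-dev
  fi
}
```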

Although I’m wondering whether all the intermediate steps in the bootstrap script are even necessary. IIRC there was some forward compatibility, even pre-1.0.

Right. I might try learning about chroot properly, then; it’s something I’ve read about a lot by now but never actually used myself.

One thing I’m wondering is whether this stage-2 segfault is caused by some undefined behavior (UB) that “just by accident” doesn’t cause a problem on old-enough kernels but suddenly turns into a crash on newer ones. If that were the case, a chroot wouldn’t help, since AIUI it doesn’t affect the kernel.

Yes, I was hoping to go through and remove some of the intermediate builds once I’d gotten the “comprehensive” build working. If the builds can be made reproducible (bit-for-bit identical), it would be trivial to prove that skipping stage X changes nothing: the final output would be identical with and without stage X. But getting the bootstrap working at all is my first priority here :)
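The skip-a-stage check described above comes down to comparing two final compilers byte for byte. A minimal sketch; same_build and the binary names in the example are hypothetical, and the comparison is only meaningful once the builds really are reproducible (any nondeterminism, for example absolute build paths ending up in the binary, will make otherwise equivalent builds differ).

```shell
#!/usr/bin/env bash
# same_build A B: succeed if two compiler binaries are bit-for-bit identical.
# On a mismatch, print both SHA-256 sums to stderr so they can be recorded.
same_build() {
  if cmp -s "$1" "$2"; then
    return 0
  fi
  sha256sum "$1" "$2" >&2
  return 1
}

# Example (hypothetical names): one full run, one run with stage X skipped.
#   same_build final-full/crystal final-skip-X/crystal && echo "stage X is redundant"
```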

At least before 0.1.0, I took pains to keep the number of build steps as small as possible. After that, some could maybe be removed, but prior to 0.5.0 even the previous release wasn’t guaranteed to build the next. Things are messier than they seem: sometimes you successfully build a compiler with an old version, but that compiler then builds a compiler which in turn builds compilers that segfault, so you can be sent back 3 or 4 steps when figuring out what broke. Thinning out the steps is a much trickier process than you’d expect.