--release generates a single LLVM module in addition to applying aggressive optimizations, so as to get the best performance possible. I believe it’s just an alias for -O3 --single-module.
If you compile with just -O3 then you’ll see re-use.
--release applies the best code optimization possible.
This optimization doesn’t work across module boundaries, so all code is merged into a single module.
One way around that is link-time optimization (LTO). We used to have some sort of support for it, but since it was not on by default it suffered from code rot, so it was removed.
The implementation was also not faster. It would allow a higher level of parallelization in the initial phase, where the object files were generated, but then the LTO step would be slow enough to eat any gains.
I wonder a bit if it could be structured differently and be faster that way. My impression is that link_with_LTO(A.o, B.o, C.o, D.o) was essentially LTO(LTO(LTO(A.o, B.o), C.o), D.o). Perhaps it would be possible to structure it like a tree instead: LTO(LTO(A.o, B.o), LTO(C.o, D.o)). That could potentially let some of the steps execute in parallel and thus be faster in aggregate, assuming multiple CPUs. But I don’t know whether that is possible, or whether it would be faster if it is. Perhaps someone more familiar with compiling big C/C++ projects could chime in on whether something like that would work?
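To make the chain-versus-tree idea concrete, here is a rough Python sketch. It is only an illustration of the shape of the computation: merge is a hypothetical stand-in for a single LTO step (it just builds a string showing the nesting), and chain_merge and tree_merge are made-up names, not anything from the compiler or from LLVM.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def merge(a, b):
    # Hypothetical stand-in for one LTO step; it only records the nesting.
    return f"LTO({a}, {b})"


def chain_merge(objs):
    # Left fold: every step needs the previous result, so nothing can overlap.
    return reduce(merge, objs)


def tree_merge(objs, pool):
    # Pairwise rounds: the merges inside one round do not depend on each other,
    # so they can be handed to a pool of workers.
    level = list(objs)
    while len(level) > 1:
        pairs = [(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        merged = list(pool.map(lambda p: merge(*p), pairs))
        if len(level) % 2:
            merged.append(level[-1])  # odd one out moves up unchanged
        level = merged
    return level[0]


objs = ["A.o", "B.o", "C.o", "D.o"]
with ThreadPoolExecutor() as pool:
    print(chain_merge(objs))       # LTO(LTO(LTO(A.o, B.o), C.o), D.o)
    print(tree_merge(objs, pool))  # LTO(LTO(A.o, B.o), LTO(C.o, D.o))
```

With four inputs the chain needs three merges one after another, while the tree needs two rounds, the first of which has two independent merges. Whether a real LTO step can be split up like this, and whether the later, larger merges would dominate the cost anyway, is exactly the open question.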
Theoretically some of those could also be done in parallel with code generation, but that would require more extensive changes to the compiler to make it less focused on phases than it is today.