Collections, intermediate objects, and GC

Hi, first-time poster, recovering C developer from the ’90s, long-time Rubyist, and roughly two-year admirer (but not practitioner) of Crystal.

I see that there’s a directive called @[AlwaysInline] that can be applied at the method level.

I’m working on a PoC for a stream cipher. The code is hastily written against collections / enumerables APIs and thrashes the GC (as expected) in both TruffleRuby and MRI (JRuby is in fact slower than MRI for my PoC). Kotlin has a bevy of features for tapping into the JVM’s JIT compiler to solve these kinds of issues, but my appetite is pretty low for discovering obscure tweaks and design patterns by trial and error.

Are there any tips for not loading up the GC in Crystal without resorting to boxing off data and pipelining it through a lot of tedious for loops w/ unsafe pointers?

-a lazy, no good hack


The whole std-lib assumes there is a GC.

You can build things with the GC disabled so that allocations go through libc malloc, but no free will ever happen: -Dgc_none is the compiler flag for that.

I don’t follow why AlwaysInline would make a difference here.

If you want to build a C library with Crystal, you should init the runtime as needed, both for the GC and for the concurrency machinery. If you want to go outside that you can, but you might need to discover some things along the way.


Also, the standard library is designed in many ways to try to avoid memory allocations. If you have a program in mind we can try to think together how to design it to avoid allocations.


Thanks, both. My Ruby PoC code compiled with very few changes: just initializing arrays with a type, and that was it! But, to my surprise, I just saw that popcount is implemented in the stdlib, which is going to make my life a lot easier. I’m rewriting the code to take better advantage of slices, which I suspect will quell any fears about the GC.
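For anyone following along, a minimal sketch of the kind of thing I mean (Int#popcount and Slice are both stdlib; the buffer contents here are made up for illustration):

```crystal
# popcount counts set bits directly on any integer type
puts 0b1011.popcount    # => 3
puts 0xAB_u8.popcount   # => 5

# A Slice is a single heap allocation that can be reused in place,
# instead of building a fresh intermediate array per block.
buf = Slice(UInt8).new(32, 0_u8)
buf.each_index { |i| buf[i] = (i & 0xFF).to_u8 }
puts buf[5] # => 5
```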

Thanks for that. I was looking at the Pointer docs in the API and see that Pointer assumes GC too, and there is no method for free, so the GC seems like an easy-to-live-with must. I assume that anything behind a pointerof is cleaned up by the GC once the underlying allocation becomes unreachable? Being able to accurately zeroize memory here is great.

“Init the runtime as needed”: if the GC is not specifically disabled, and we are being called as a lib from external code, is there some other warm-up ceremony that needs to happen?
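On the zeroization point, a hedged sketch of what I have in mind: Slice#fill overwrites a buffer in place (with the usual caveat that a sufficiently clever optimizer could, in principle, elide stores to memory it considers dead):

```crystal
# A toy "key" buffer -- values are made up for illustration
key = Slice(UInt8).new(32) { |i| i.to_u8 }
puts key[3] # => 3

key.fill(0_u8)         # overwrite the secret material in place
puts key.all?(&.zero?) # => true
```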

We use bdw-gc, which is conservative. When removing elements from an Array, the pointers are zeroed explicitly. Any reachable object is not freed.

GC.free is available, but not used in the std-lib. Read more about it in the readme of bdwgc (the Boehm-Demers-Weiser conservative C/C++ garbage collector, also known as bdw-gc, boehm-gc, or libgc) on GitHub at ivmai/bdwgc.

The logic is mainly at:

Related discussions:


Slices are amazing, and the Slice struct itself lives on the stack, but the buffer it points to is GC-managed heap memory. If you’re interested in working entirely on the stack, you can try using StaticArrays. They are allocated completely on the stack, with no heap memory at all, and might save you a bit of CPU time as well. Here’s a quick benchmark of allocating a bunch of one vs. the other:

require "benchmark"

# Assigning to values declared outside the block so the block
# executions don't get optimized away
slice = Slice(Int32).new(512, 42)
array = StaticArray(Int32, 512).new(42)
Benchmark.ips do |x|
  x.report "slice" { slice = Slice(Int32).new(512, 42) }
  x.report "static array" { array = StaticArray(Int32, 512).new(42) }
end

# We also sometimes need to use the values after so the assignment
# doesn't get optimized away :-)
p slice.size
p array.size
       slice   1.72M (579.85ns) (± 2.77%)  2.01kB/op   2.49× slower
static array   4.29M (233.06ns) (± 1.17%)    0.0B/op        fastest

Static arrays aren’t perfect, and if you’re passing them heavily between methods you might get better performance with slices because of time spent in memcpy, but if you spend a significant amount of time in GC, they might be worthwhile.
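To illustrate the memcpy point with a sketch (not from the benchmark above): a StaticArray is copied by value when passed to a method, but StaticArray#to_slice gives a no-copy view of the same stack memory, provided the slice doesn’t outlive the frame that owns the array:

```crystal
def sum(data : Slice(Int32)) : Int32
  data.sum
end

arr = StaticArray(Int32, 8).new { |i| i }
# to_slice wraps a pointer to the stack buffer -- no copy is made,
# but the slice must not escape the scope that owns `arr`
puts sum(arr.to_slice) # => 28
```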


Amazing tip - and very much appreciated! Like TLS, the code is working against a fixed record size so keeping the final encrypt fn on the stack as much as possible is ideal, even if the keystream and ciphertext are being fed from the heap. Having a blast - it feels like Christmas :smile:

Every time I alt-tab to Visual Studio Code and see Crystal code instead of JavaScript, it feels like Christmas every day! :laughing:


Thanks for the links! Ultimately I would like to bundle this for the usual players: Ruby, Python, and Node.js. Pony has a similar requirement for its runtime when distributing static executables. Nim cheats by transpiling to C :P hehe

Compared to server-side Node, I can only imagine … welcome back to sanity :P TypeScript, I think, is a really good effort that cemented ES6’s course correction against JS anarchy. I was really surprised the last time I played around with it. Not bad at all if you greenfield a project on it, with a dev or two who has experience in a strongly typed language helping out with code review.

There are so many good language choices out there right now that it’s impossible to keep track of them all, but I hope to see Crystal keep making adoption gains over the next 12–18 months. When Go devs figure out they can keep CSP, ditch the inane error handling, and get a bump in performance, I expect a lot of people will be open to making the change.


For sure I’ll be back to pick your brain with better-informed questions after revising my code. You’ve really put a lot of thought and effort into Crystal, and it is mature way beyond its years (or for being 0.3.*). Big kudos to the core team for where you’re at.

I have to admit that I started the port to Crystal just to increase demo performance for my PoC and get a good baseline impl in a typed lang before porting it to either Kotlin or Rust, neither of which is my forte. After a few days, I’m starting to rethink that. Like Kotlin, Crystal takes care of null pointer references; after that, a lot of Rust’s safety features feel a little oversold.

I’m working on a framework to integrate cryptocurrencies with fintech apps, and the stream cipher is just one piece. Managing keys, nonces, and seed inputs from securerandom has to be done very carefully. This is an absolute nightmare on the JVM, even though there are NIST-approved libraries all over the place from HSM providers, and now the Bouncy Castle folks. With the tips from this thread alone, I feel a lot more confident about using Crystal to handle crypto, and I wish I’d pinged y’all earlier!

Yeah, Node.js will always have a special place in my heart, probably because I went to callback hell and back with it. Maybe that’s why I’m so hopeless and pessimistic.

This was before await was a thing, and Bluebird’s promises had just come out. I started to adopt promises, but it was too late; my game-server code became an absolute nightmare. Five- to six-level-deep db.query waterfall chains, etc. EWWW!

Transitioning from JS to Crystal is one of the greatest things I have ever done. I am at the point where if I think of an idea, my thoughts will transcend from my brain, into the editor as Crystal code. This happened with JS, however, my confidence level with Crystal is through the roof in comparison. It’s a wonderful feeling.

Callback hell in Ruby has a name: <fade-in distant crying, whispering winds> EventMachine. :grimacing:

Hehe. I never used Ruby before, so I guess I dodged a bullet ;). I’ll have to google that and see some examples.

It’s not part of the stdlib; EM is used to achieve non-blocking IO in Ruby-based web servers. It became popular around the same time that Python’s Twisted framework came out. If you try to program against EM directly, it can lead you to the same place as any other evented lib.

CSP and its use of channels is so much easier to reason about. For similar reasons, Vert.x has become a popular reactive toolkit on the JVM. It’s really awesome to have this functionality as a core part of the stdlib in Crystal.
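A minimal sketch of what that looks like in Crystal (Channel and spawn are both stdlib; the values are made up for illustration):

```crystal
ch = Channel(Int32).new

spawn do
  # runs in a fiber; the channel synchronizes the hand-off
  ch.send 21
end

puts ch.receive * 2 # => 42
```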


Have you profiled it and found that the GC is slow?

Great question, and the answer is no, I didn’t. When I wrote my post I was trying to figure out what “real” language to bring my PoC code into, and I was worried about intermediate objects, because all the collections code in it is mostly strung together with .map {} and .each_with_index loops, which thrashes MRI and runs really, really badly on JRuby. I got it to compile in Crystal but didn’t bench it, or even run it, after hitting a snag with an upstream Keccak dependency. No biggie.

The revised version of this code in Crystal should, I think, eliminate any performance concerns about the GC. Using byte slices and IO::ByteFormat::LittleEndian represents a big cleanup of the crappy PoC code into something that should run very well.

I think one of the big culprits in the old Ruby code is .pack and .unpack in the upstream function, which has to convert between strings and byte arrays. Will keep the thread posted after I get it finished and benchmarked. Exciting stuff!
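For reference, a sketch of the Crystal replacement for those pack/unpack round-trips, using IO::ByteFormat to decode and encode fixed-width integers directly against byte slices, with no intermediate strings (the bytes here are made up for illustration):

```crystal
bytes = Slice[0xEF_u8, 0xBE_u8, 0xAD_u8, 0xDE_u8]

# Ruby's "V" unpack, Crystal style: read a little-endian UInt32 from a slice
word = IO::ByteFormat::LittleEndian.decode(UInt32, bytes)
puts word.to_s(16) # => deadbeef

# ...and the inverse of pack: encode back into a reusable buffer
out = Slice(UInt8).new(4)
IO::ByteFormat::LittleEndian.encode(word, out)
puts out == bytes # => true
```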


Don’t forget to profile first, too; it’s possible the GC isn’t even a big deal in Crystal (but… maybe you already ran it in Crystal and it was slow there?). Good luck, report back. There are a few tweaks to the GC that I might be able to expose as well, if desired (more or less aggressive collection, for instance). Hopefully I’ll check back here sometime to see your response; it’s hard to catch responses in Discourse, LOL.


Without any deep dives into profiling or fuzzing tools yet, the first version of the proof of concept is running very well. Through judicious use of StaticArray and the record macro, performance is bang-on and the GC is totally a non-issue.
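For anyone landing here later, a hedged sketch of the record + StaticArray combo (the type and field names are made up for illustration):

```crystal
# `record` defines an immutable struct -- a value type, so constructing
# one performs no GC allocation
record BlockState, counter : UInt64, key : StaticArray(UInt8, 32)

state = BlockState.new(0_u64, StaticArray(UInt8, 32).new(0_u8))
puts state.counter  # => 0
puts state.key.size # => 32
```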

Someone who isn’t named after a poltergeist could make a pretty good implementation of QUIC. The performance here is the real deal.
