Hey guys, help me believe in Crystal’s potential: I’m converting a heavy scraper implemented in Node.js, but I found Crystal consumes the same amount of resources and is much slower… I tried optimizing everything according to the official docs, but now I feel like I’m fighting the language to get the holy grail… If you have 5 minutes to spare, please take a look at my code: https://gist.github.com/thelinuxlich/c459ca5cd77718307a58a4a3e3c335c5
Mmh, I can’t spot any major red flag, only small or structural things.
I don’t know how optimized halite is; it might be worth changing it for the stdlib HTTP::Client, or even a libcurl binding for libcurl’s multi interface. Also, having a pool of HTTP::Client instances that keep connections open to the same hosts might improve things some. If you do a lot of requests against the same host, have more than one instance per host; if you do requests to a lot of different hosts, make sure to limit the amount of kept connections.
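As a minimal sketch of such a pool, using a buffered Channel as the queue (the pool size and request path are made up; the host is taken from your gist):

```crystal
require "http/client"

POOL_SIZE = 8 # hypothetical; tune to your workload

# A buffered Channel doubles as a simple blocking queue of clients.
pool = Channel(HTTP::Client).new(POOL_SIZE)
POOL_SIZE.times { pool.send HTTP::Client.new("api.mercadolibre.com", tls: true) }

def with_client(pool)
  client = pool.receive
  begin
    yield client
  ensure
    pool.send client # always hand the client back to the pool
  end
end

# The request path is just an illustration:
response = with_client(pool) { |client| client.get("/categories/MLB1234") }
```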
https://gist.github.com/thelinuxlich/c459ca5cd77718307a58a4a3e3c335c5#file-scraper-cr-L65 Avoid allocations if you can. For example, here you can avoid all the intermediate arrays just by swapping the `[]` for `{}` within the `map`. You can avoid the bigger intermediate array by swapping the `map` for `to_h { |match| {match[1], match[2]} }`, and you can avoid the hash by using `scan` with a block and a little `case` statement within.
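To illustrate the `scan` variant, a hedged sketch (the regex and keys are invented, not from your gist; `html` is the page body):

```crystal
# Handle each match as it's found instead of materializing
# intermediate arrays or a hash.
attrs = {} of String => String
html.scan(/data-(\w+)="([^"]*)"/) do |match|
  case match[1]
  when "price", "title"
    attrs[match[1]] = match[2]
  end
end
```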
https://gist.github.com/thelinuxlich/c459ca5cd77718307a58a4a3e3c335c5#file-scraper-cr-L86 Avoid calling the regex engine if possible; for example, this is just `.delete("^0-9")`. That doesn’t call the regex engine but uses `Char#in_set?` internally.
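For example:

```crystal
"R$ 1.234,56".delete("^0-9") # => "123456"
```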
https://gist.github.com/thelinuxlich/c459ca5cd77718307a58a4a3e3c335c5#file-scraper-cr-L87-L88 Avoid doing conversions twice: `x = v if v = x.to_i?`.
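Spelled out, the idea is to let the single `to_i?` call double as the check:

```crystal
# One parse instead of two: `to_i?` returns nil on failure,
# so the assignment itself acts as the guard.
if v = x.to_i?
  x = v
end
```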
https://gist.github.com/thelinuxlich/c459ca5cd77718307a58a4a3e3c335c5#file-scraper-cr-L105 Does this happen frequently? If so, it might be worth avoiding the raise and calling `next` instead to restart the loop.
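Roughly like this (the names are placeholders):

```crystal
# `next` skips the iteration without the cost of raising and
# unwinding an exception for an expected condition.
loop do
  item = fetch_item # placeholder for your per-iteration work
  next if item.nil?
  process(item)
end
```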
It might also be worth preparing the SQL statements outside the loop (http://crystal-lang.github.io/crystal-db/api/0.8.0/DB/SessionMethods.html#prepared(query)-instance-method). I would also play with wrapping things in transactions, with a commit every thousand iterations or so. Having a pool of DB and Redis connections, rather than all the workers fighting over a single one, might also be beneficial.
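A sketch combining both ideas, assuming PostgreSQL and a made-up table (swap in your actual driver and schema; `items` stands in for your scraped data):

```crystal
require "db"
require "pg" # assumption: PostgreSQL

DB.open("postgres://localhost/scraper") do |db|
  items.each_slice(1000) do |batch|
    db.transaction do |tx|
      # Prepare once per transaction instead of re-parsing SQL per insert:
      insert = tx.connection.prepared("INSERT INTO products (name, price) VALUES ($1, $2)")
      batch.each { |item| insert.exec(item[:name], item[:price]) }
    end
  end
end
```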
Collecting some timings with `Benchmark.measure` and calculating averages for sections of the program in some global state might help in understanding what the actual bottleneck is and what should be focused on first. If you extract things into methods, you might even have a chance to learn something about that using standard `perf` tools.
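For instance (the section name, `fetch_page`, and `url` are hypothetical):

```crystal
require "benchmark"

# Accumulate wall-clock time per program section in global state:
TIMINGS = Hash(String, Float64).new(0.0)

elapsed = Benchmark.measure { fetch_page(url) } # `fetch_page`/`url` are placeholders
TIMINGS["http"] += elapsed.real
```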
Finally, you might want to toy with `-Dpreview_mt`.
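Something along these lines (the worker count is just an example):

```console
$ crystal build --release -Dpreview_mt scraper.cr
$ CRYSTAL_WORKERS=8 ./scraper
```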
I realize these are mostly not simple changes, and each requires testing and verification, but such is performance optimization anywhere :) Also realize that this problem is mostly IO bound, so applying the same level of optimization to your Node.js implementation will likely yield comparable speed.
I’d also expect most issues to be with the HTTP and DB clients.
I’m not sure how much overhead halite adds, but it’s based on HTTP::Client and I don’t think it reuses connections. So every time you send a request, it needs to initialize a connection and then drop it.
If your scraper connects to only a single host (or a couple), you should try to keep a reusable HTTP connection per fiber. The repeated connections to api.mercadolibre.com could definitely be cached.
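A sketch of what that could look like (the `jobs` channel and worker count are assumptions, not from the gist):

```crystal
require "http/client"

jobs = Channel(String).new(100) # paths to fetch, fed elsewhere

50.times do
  spawn do
    # One long-lived client per fiber: the TCP/TLS handshake to
    # api.mercadolibre.com happens once per worker, not per request.
    client = HTTP::Client.new("api.mercadolibre.com", tls: true)
    while path = jobs.receive?
      response = client.get(path)
      # ... extract data from response.body ...
    end
    client.close
  end
end
```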
Database connections are cached and reused internally by the database driver, so that’s already dealt with. But managing them still adds some overhead, and you could try to use a single DB connection per fiber.
Considering all that, you should also take care with your worker count. 1000 might be too high to be efficient, especially when reusing resources per worker.
Maybe splitting the jobs into separate worker tasks might also be a good idea. You could have one set of workers doing the actual scraping, sending HTTP requests and extracting data; that data is then sent to a channel and picked up by a separate set of workers which handle DB insertion. This could help improve resource efficiency.
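A rough sketch of such a pipeline (the `Product` record and worker counts are invented for illustration):

```crystal
record Product, name : String, price : Int32

products     = Channel(Product).new(1000)
scraped_done = Channel(Nil).new
db_done      = Channel(Nil).new

# Stage 1: scraping workers do HTTP + extraction only.
50.times do
  spawn do
    # ... fetch pages, extract fields ...
    products.send Product.new("example", 100)
    scraped_done.send nil
  end
end

# Stage 2: a few DB workers drain the channel and handle insertion.
4.times do
  spawn do
    while product = products.receive?
      # ... insert `product` into the database ...
    end
    db_done.send nil
  end
end

50.times { scraped_done.receive } # wait for all scrapers,
products.close                    # signal end of input,
4.times { db_done.receive }       # then wait for DB workers to drain
```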
This kind of code is I/O bound. The CPU is waiting around for the responses and so just switching to a faster language won’t help.
Node.js has a pretty good model for dealing with code like this through its event loop. Crystal does as well with its concurrent fiber scheduler. That’s why you see similar performance between the two. If you tried something like Ruby or Python, you would see a much larger slowdown. I would suspect if you wrote your sample in Go (using goroutines for fibers), you would see similar performance as well.
@mgomes, you might think that at first glance, but I’ve found the regex extraction plays a good part (it’s a huge string of HTML). So I followed probably 99% of the tips here and on the Crystal Gitter channel (thanks @watzon!) and achieved the dream: half the Node.js processing time, and from 25GB of RAM down to 1.5GB!
The processing time might be cut down even more once I migrate the other Node.js scrapers, so it will run faster and distribute the available resources better!
I’m gonna post another gist here with the changes, not yet well organized, just like the original one, so anyone stumbling upon this post can make a comparison:
Thank you very much for the support, it’s a hell of a ride, but I’m loving it!
I imagine you mean MB rather than GB (either that or 2.5 rather than 25)? But awesome! Glad we could help!
It’s exactly that: the Node scraper was consuming 25GB and Crystal is consuming 1.5GB at maximum.