What are the best practices in Crystal for handling compute-intensive tasks?

I recently wrote a web application using Kemal that has an endpoint for generating QR code images: /key/qrcode. A single call takes on the order of tens of milliseconds.
If I use wrk to test an endpoint that only serves a static page, such as the home page, this is the result:

➜  ~ wrk -c 8000 -t 6 -d 15 http://localhost:8080/
Running 15s test @ http://localhost:8080/
  6 threads and 8000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    61.02ms   11.64ms 265.48ms   89.10%
    Req/Sec     2.74k   847.52     4.29k    67.90%
  243625 requests in 15.04s, 1.38GB read
  Socket errors: connect 6983, read 0, write 0, timeout 0
Requests/sec:  16201.40
Transfer/sec:     94.14MB

But when testing the QR code generation endpoint, the result looks like this:

➜  ~ wrk -c 8000 -t 6 -d 15 http://localhost:8080/key/qrcode
Running 15s test @ http://localhost:8080/key/qrcode
  6 threads and 8000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.00s   571.39ms   1.98s    58.33%
    Req/Sec    10.76      8.27    40.00     76.26%
  455 requests in 15.03s, 2.83MB read
  Socket errors: connect 6983, read 0, write 0, timeout 395
Requests/sec:     30.28
Transfer/sec:    193.00KB

The QR code generation is implemented using C bindings for libqrencode, together with stumpy_png.

Worse, my Kemal app couldn’t respond to any other requests while I was load-testing the /key/qrcode endpoint.

How should this situation be handled in Crystal?

What’s the benchmark for just the QR logic, without Kemal in the middle?
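E.g. something like this would isolate it (generate_qr below is just a placeholder for your libqrencode + stumpy_png code):

```crystal
require "benchmark"

# Placeholder for the actual libqrencode + stumpy_png pipeline.
def generate_qr(data : String) : Bytes
  # ... your existing generation code ...
  Bytes.new(0)
end

Benchmark.ips do |x|
  x.report("qrcode") { generate_qr("https://example.com/some-key") }
end
```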

That’s expected, because Crystal is currently single-threaded and the work task doesn’t allow any fiber switches for concurrency. So while a handler is generating a QR code, the server is unresponsive.

You could mitigate this by allowing fiber switches during the work task (which would probably require rewriting the QR generator), but that would obviously only lead to even longer response times. Another option is to run multiple processes in parallel with SO_REUSEPORT, which would allow handling requests in parallel on different cores.
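A minimal sketch of what that could look like with Kemal (assuming you start the compiled binary once per core yourself; the reuse_port option comes from HTTP::Server#bind_tcp, which Kemal’s config exposes):

```crystal
require "kemal"

# Start this binary N times (e.g. once per core); with reuse_port the kernel
# distributes incoming connections across the processes.
Kemal.run do |config|
  server = config.server.not_nil!
  server.bind_tcp "0.0.0.0", 8080, reuse_port: true
end
```

Then start a few copies, e.g. `./app & ./app & ./app &`.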


The way I’m reading it, allowing fiber switches would lead to slightly longer response times but should increase overall throughput across the application? e.g. allowing responses to /home while /key/qrcode is doing its thing.
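Concretely, I take “allowing fiber switches” to mean something like this rough sketch, where write_row is a stand-in for the actual stumpy_png drawing code:

```crystal
# Hypothetical: if the image is produced row by row, yielding between rows
# lets other fibers (i.e. other requests) run in between.
def write_row(row)
  # placeholder for the real drawing code
end

def render_qr_rows(rows : Array)
  rows.each do |row|
    write_row(row)
    Fiber.yield # hand control back to the scheduler
  end
end
```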

I’m not sure of a way to handle this purely within Crystal, but the suggestion of running more processes will help. You might need to change the design/architecture of your app in order to support this at higher levels of load, for example by pushing QR code generation into a background job and then having the client poll an endpoint (or receive a websocket push) for completion. You’d essentially be taking the QR generation out of the request/response cycle, and it would require additional services to support this one function, so whether that’s worth it is only for you to say. You’re also approaching a microservice architecture at that point, and for me, I’d go with straight-shoota’s suggestion.

Multiple processes will probably get you across the line unless you’re serving insane levels of traffic or someone’s hammering your /key/qrcode endpoint.

No, throughput would not really benefit. The program can execute different fibers concurrently, but it can’t send more data overall. Latency would improve for fast requests and worsen for long-running requests (because they’re interrupted). And availability should also improve in cases where latency currently causes requests to time out.


I am wondering if I can keep the /key/qrcode endpoint from affecting the entire program?

It doesn’t matter if the QR code is generated slowly, but it must not stop the application from responding to other requests; otherwise, hitting this API can easily take down my web application.

I know that I can run the QR code generation task independently in another process and then call it through RPC.
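Roughly, what I have in mind is something like this (./qrcode_worker and the data query parameter are hypothetical stand-ins; the worker would be a separate binary that writes the PNG to stdout):

```crystal
require "kemal"

get "/key/qrcode" do |env|
  data = env.params.query["data"]? || ""
  png = IO::Memory.new
  # Hypothetical worker binary that prints the PNG for the given data to stdout.
  # Waiting on the child process suspends only this fiber, so other requests
  # keep being served in the meantime.
  status = Process.run("./qrcode_worker", args: [data], output: png)
  halt env, status_code: 500 unless status.success?
  env.response.content_type = "image/png"
  env.response.write png.to_slice
end

Kemal.run
```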

But since I came here for help, I just wanted to know if there were other ways to solve it.

There’s no way to solve it. A CPU-intensive task will block the only Thread of execution that exists right now in Crystal, and that means it will block every fiber in that Thread (and thus block everything).

Eventually Crystal will be able to spawn multiple threads, so if one request is busy generating a QR code, another fiber in another thread can still take requests. I see this feature (parallelism) as something really good, exactly because of cases like this, even if it means generally slowing down the performance of the entire application.

Long-running tasks in web applications are usually delegated to a dedicated task runner. There are a few available for Crystal, such as mosquito, Onyx::Background, Ost and sidekiq.cr. The general idea is to offload compute-intensive tasks to a separate worker process (or several of them). You could implement some kind of callback or polling either in the client or directly in the server handler. This keeps the server responsive, because the fiber handling the QR-code request can be put to sleep while the result is generated in a separate process. Once the result is available, the fiber continues and sends it back to the client.
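A very rough sketch of the polling variant, keeping the result store in memory and running the worker as a child process per job (RESULTS, the data query parameter and the ./qrcode_worker binary are all hypothetical stand-ins; a real setup would use Redis or one of the task runners above, with the worker in its own long-lived process):

```crystal
require "kemal"

# Hypothetical in-memory result store; a real setup would use Redis (or
# similar) shared with a dedicated worker process.
RESULTS = {} of String => Bytes

# Kick off generation and return 202 immediately.
post "/key/qrcode" do |env|
  data = env.params.query["data"]? || ""
  spawn do
    png = IO::Memory.new
    # Waiting on the child process suspends only this fiber,
    # so the server stays responsive while the worker runs.
    status = Process.run("./qrcode_worker", args: [data], output: png)
    RESULTS[data] = png.to_slice if status.success?
  end
  env.response.status_code = 202
end

# The client polls this endpoint until the image is ready.
get "/key/qrcode" do |env|
  data = env.params.query["data"]? || ""
  if png = RESULTS[data]?
    env.response.content_type = "image/png"
    env.response.write png
  else
    env.response.status_code = 404
  end
end

Kemal.run
```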