I recently wrote a web application with Kemal that has an endpoint for generating QR code images, /key/qrcode. A single call takes a few tens of milliseconds.
When I use wrk to benchmark an endpoint that only serves a static page, such as the home page, this is the result:
➜ ~ wrk -c 8000 -t 6 -d 15 http://localhost:8080/
Running 15s test @ http://localhost:8080/
6 threads and 8000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 61.02ms 11.64ms 265.48ms 89.10%
Req/Sec 2.74k 847.52 4.29k 67.90%
243625 requests in 15.04s, 1.38GB read
Socket errors: connect 6983, read 0, write 0, timeout 0
Requests/sec: 16201.40
Transfer/sec: 94.14MB
But when I benchmark the QR code endpoint, the result looks like this:
➜ ~ wrk -c 8000 -t 6 -d 15 http://localhost:8080/key/qrcode
Running 15s test @ http://localhost:8080/key/qrcode
6 threads and 8000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.00s 571.39ms 1.98s 58.33%
Req/Sec 10.76 8.27 40.00 76.26%
455 requests in 15.03s, 2.83MB read
Socket errors: connect 6983, read 0, write 0, timeout 395
Requests/sec: 30.28
Transfer/sec: 193.00KB
The QR code generation is implemented with a C binding to libqrencode, plus stumpy_png for encoding the image.
Worse, while I was benchmarking /key/qrcode, my Kemal app couldn’t respond to any other requests.
How should I handle this situation in Crystal?
What do you get when you benchmark just the QR logic, without Kemal in the middle?
That’s expected: Crystal is currently single-threaded, and the QR generation task never yields, so no other fiber gets a chance to run concurrently. While a handler is generating a QR code, the whole server is unresponsive.
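A minimal sketch of this effect, assuming the default single-threaded Crystal runtime (no -Dpreview_mt). busy_work is a hypothetical stand-in for the QR generator, and the "heartbeat" fiber stands in for other requests the server would like to answer. When the work never yields, the heartbeat cannot run at all; with an occasional Fiber.yield (the mitigation discussed in the next paragraph) it gets scheduled between chunks of work.

```crystal
# Hypothetical stand-in for a CPU-bound task such as QR code generation.
# When `cooperative` is true it calls Fiber.yield every 100_000 iterations,
# giving the scheduler a chance to run other fibers.
def busy_work(iterations : Int32, cooperative : Bool) : Int64
  sum = 0_i64
  iterations.times do |i|
    sum += i
    Fiber.yield if cooperative && i % 100_000 == 0
  end
  sum
end

# Counts how often a "heartbeat" fiber (standing in for other requests)
# gets to run while busy_work executes on the main fiber.
def heartbeat_ticks_during_work(cooperative : Bool) : Int32
  ticks = 0
  finished = false

  spawn do
    until finished
      ticks += 1
      Fiber.yield
    end
  end

  Fiber.yield # let the heartbeat fiber start
  ticks = 0   # only count ticks that happen during the work itself
  busy_work(5_000_000, cooperative)
  during_work = ticks

  finished = true
  Fiber.yield # let the heartbeat fiber see `finished` and exit
  during_work
end

puts "heartbeat ticks, blocking:    #{heartbeat_ticks_during_work(false)}"
puts "heartbeat ticks, cooperative: #{heartbeat_ticks_during_work(true)}"
```

On a single-threaded build the blocking run should report 0 ticks and the cooperative run a positive number. Each Fiber.yield also adds scheduling overhead to the work itself, which is why yielding makes the QR responses slower.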
You could mitigate this by allowing fiber switches during the work task (which would probably require rewriting the QR generator), but that would obviously make the QR responses themselves even slower. Another option is to run multiple processes in parallel with SO_REUSEPORT, which would allow requests to be handled in parallel on different cores.
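A minimal sketch of the SO_REUSEPORT approach using Crystal's standard HTTP::Server, assuming Linux or another OS that supports SO_REUSEPORT. HTTP::Server#bind_tcp accepts a reuse_port argument; the PORT environment variable, the build_server helper name and the response text are illustrative choices, not part of the original app. Kemal runs on top of HTTP::Server, but how to pass reuse_port through Kemal's own configuration is not shown here.

```crystal
require "http/server"

# Builds a server whose listening socket is bound with SO_REUSEPORT, so
# several independent processes can bind the same port at the same time.
def build_server(port : Int32) : HTTP::Server
  server = HTTP::Server.new do |context|
    context.response.content_type = "text/plain"
    context.response.print "served by PID #{Process.pid}\n"
  end
  server.bind_tcp("0.0.0.0", port, reuse_port: true)
  server
end

# Start listening only when a PORT is given, e.g.:
#   PORT=8080 ./app & PORT=8080 ./app & PORT=8080 ./app &
if port = ENV["PORT"]?
  server = build_server(port.to_i)
  puts "PID #{Process.pid} listening on port #{port}"
  server.listen
end
```

With several copies running, the kernel spreads new connections across the processes, so one process busy generating a QR code no longer stalls requests that land on the others. It does not prevent a new connection from being assigned to the busy process.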
The way I read it, allowing fiber switches would lead to slightly longer response times but should increase overall throughput across the application, e.g. by allowing responses to /home while /key/qrcode is doing its thing?
I’m not sure of a way to handle this purely within Crystal, but running more processes, as suggested, will help. To support higher levels of load you might need to change the design or architecture of your app, for example by pushing QR code generation into a background job and having the client poll an endpoint (or receive a websocket push) until it completes. That takes QR generation out of the request/response cycle, but it requires additional services to support this one function, so only you can say whether it’s worth it. You’re also approaching a microservice architecture at that point; personally, I’d go with straight-shoota’s suggestion.
Multiple processes will probably get you across the line unless you’re serving insane levels of traffic or someone’s hammering your /key/qrcode endpoint.
No, throughput would not really benefit. The program can interleave different fibers, but it can’t get more total work done on a single core. Latency would improve for fast requests and worsen for long-running requests (because they’re interrupted). Availability should also improve, since fast requests are less likely to time out while waiting behind slow ones.
I’m wondering whether I can keep the /key/qrcode endpoint from affecting the entire program.
It doesn’t matter if the QR code is generated slowly, but it must not stop the application from responding to other requests; otherwise anyone could easily take down my web application just by calling this endpoint.
I know I could run QR code generation independently in a separate process and call it through RPC, but since I’m asking for help here anyway, I wanted to know whether there are other ways to solve it.
There’s no way to solve it. A CPU-intensive task blocks the only thread of execution that currently exists in Crystal, which means it blocks every fiber on that thread (and thus everywhere).
Eventually Crystal will be able to spawn multiple threads, so while one request is generating a QR code, a fiber on another thread can keep taking requests. I see this feature (parallelism) as really valuable precisely because of cases like this, even if it generally slows down the performance of the application as a whole.
Long-running tasks in web applications are usually delegated to a dedicated task runner. There are a few available for Crystal, such as mosquito, Onyx::Background, Ost and sidekiq.cr. The general idea is to offload computationally intensive tasks to a separate worker process (or several). You could implement some kind of callback or polling either in the client interface or directly in the server handler. This keeps the server responsive, because the fiber handling the QR code request can sleep while the result is generated in a separate process. Once the result is available, the fiber resumes and sends it back to the client.
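A simplified sketch of the offloading idea, assuming the heavy work is packaged as a separate executable and a Unix-like system with sh. Instead of a queue-based task runner like the ones listed above, it just starts a child process with Process.run; the calling fiber waits for the child without blocking the scheduler, so other fibers keep running. run_in_worker is a hypothetical helper name, and the shell command is a placeholder for a real QR generator binary.

```crystal
# Hypothetical helper: runs a CPU-heavy job in a separate OS process and
# returns its standard output. While the child runs, only the calling fiber
# waits; the scheduler keeps running other fibers (e.g. other requests).
def run_in_worker(command : String, args : Array(String)) : String
  output = IO::Memory.new
  error = IO::Memory.new
  status = Process.run(command, args, output: output, error: error)
  raise "worker #{command} failed: #{error}" unless status.success?
  output.to_s
end

# Demo: a heartbeat fiber keeps ticking while the main fiber waits on the
# child. `sleep 0.3; echo qr-bytes` is a placeholder for a slow QR generator.
ticks = 0
done = false
spawn do
  until done
    ticks += 1
    sleep 10.milliseconds
  end
end

result = run_in_worker("sh", ["-c", "sleep 0.3; echo qr-bytes"])
done = true
puts "worker returned #{result.strip.inspect}; heartbeat ticked #{ticks} times meanwhile"
```

Inside a Kemal route you would call something like run_in_worker from the handler block. Spawning one process per request is the simplest version of this pattern; a task runner or a long-lived worker pool avoids the process startup cost and caps how many QR jobs run at once.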