Concurrency issues with an API server

Hello, Crystal community,

We’ve recently used Crystal to develop a Kemal-based API service in our company. It is a very simple one, having a couple endpoints; and it reads its data from a Postgres database.

We’ve hit some issues, two of which we could not find any workaround for:

When a batch of tens of requests is thrown against the service, the response times, initially small, grow all the time up to the last successfully processed request; for example, for 20 concurrent requests, response processing times (in seconds) look like this:

0.46
0.75
0.69
1.30
1.59
1.45
1.93
2.38
2.47
2.91
3.14
3.81
3.74
4.21
4.36
4.72
5.11
5.35
5.89
6.17

Monitoring the database server reveals that for the requests processed longer, at least half of the time is spent in the ClientWrite event (which is explained as “Waiting to write data to the client”, seems like the database server is ready to push data to the client but the client does something else instead of readily consuming it, at least for a while).

We would expect the times to grow slower, where the main component would be the time of processing a single request, plus a relatively small concurrency overhead.

On the contrary, times continue to grow quite fast, linearly with the number of concurrent requests, e.g. for a batch of 50 requests, the last would have times close to 15 seconds.

The other issue we could not solve or diagnose is that when errors are raised (e.g. clients time out and disconnect, or a PoolTimeout exception is raised) certain fibers are lost, never returning to the main section, where an ensure clause decrements a counter of requests in-flight - this counter remains positive instead of going back to zero. The number of fibers lost differs but can by as big as 5 or 8 for a batch of 80 requests.

The database pool is initialized having max_idle_pool_size equal to max_pool_size so that all created connections are accepted back into the pool.

Has anyone experienced something similar? Any hints on what might be wrong or what we may be missing? Any idea or help is highly appreciated.

1 Like

Some follow up questions in regards to the first issue:

  1. Do you get the same increase in response time w/o the DB interactions?
  2. Are the endpoints doing anything with class vars in regards to sharing state?
  3. Are you using the DB::Database methods within each endpoint? I.e. to ensure each request gets its own connection from the pool.

Some code/commands that can be used to reproduce this behavior would be :pray:.

2 Likes

@Blacksmoke16: Thanks a lot for your reply!

  1. Do you get the same increase in response time w/o the DB interactions?

No, if the endpoint gets the response body e.g. from memory or a file, the response times are close to what it takes to process a single request and stay that way even for a batch of several hundred requests.

  1. Are the endpoints doing anything with class vars in regards to sharing state?

The stats counters are stored in class variables; they are updated on each request (naturally).

  1. Are you using the DB::Database methods within each endpoint? I.e. to ensure each request gets its own connection from the pool.

We are using query_each and we can observe the pool size growing up to the limit given, so yes, it is confirmed that each request gets its own connection. Large enough batches do produce PoolTimeouts.

Some code/commands that can be used to reproduce this behavior would be :pray:.

I guess we need to prepare an example equivalent to the application at stake. Wanted to check first if we aren’t missing something obvious :slight_smile: Any ideas based on my feedback? Thanks! :bowing_man:

Just to confirm, something like:

DATABASE = DB.open "..."

# ...

DATABASE.query_each "..." do 
  # ...
end

yea?

Yea this would be quite helpful for debugging. Doesn’t have to be the exact query or anything. Tho would be worth trying something super simple like DATABASE.scalar "SELECT 1;" and see if that reproduces it, otherwise maybe the problem comes from something more unique used in your query if its not a simple select.

Just to confirm, something like …

Yes, exactly -

db = ::DB.open "postgres://scott:tiger@localhost/database"

db.query_each sql_statement, query_param

The query itself is a union of two SELECTs returning sometimes quite a lot of records (~ 40,000). The issue seems to appear for big resultsets, not so much for smaller ones.

we need to prepare an example

… this would be quite helpful for debugging

Yeah, I guess so. Will try to find the simplest setup that reproduces it.

Thanks so much for attending! :bowing_man:

1 Like

I experienced something similar to this a while ago which if I remember correctly I think I solved using using_connection to take a connection out of the pool explicitly.

db.using_connection do |conn|
  conn.query_each statement, params
end

I never fully tracked down the issue but had a hunch it was related to connections not properly being released.

1 Like

@garymardell: Thanks a lot for sharing! Tried that but doesn’t seem to change the outcome for this case, unfortunately :slightly_frowning_face:

@Blacksmoke16:
I’ve set up a repository here with a db-seeding tool and a showcase application which spawns a bunch of fibers doing queries to the db backend. Seems to reproduce the issue quite well.

The default backend is Postgres. If you have the time to also setup a SQLite database and run the showcase app against that, you will see the difference – for the SQLite case there is no growth of the processing times. Also, now I often experience the “lost fibers” phenomenon with the Postgres backend but never had that occur with the SQLite one. Just FYI.

Again, thanks so much for caring!

2 Likes

Thanks a lot for the reproduce @ya55en, I’ll also check this issue and will try my best to find the cause :+1:

@sdogruyol: Thank you so much! We would really love to have this resolved and that application to start working in production. Other than these issues, it is quite fast and has low memory requirements and a small footprint.

Looking forward to hearing from you guys.
Please let me know if I can be of any further assistance.

Little update: the “lost fibers” are simply fibers that didn’t got a response from the db driver, and are therefore stuck.

1 Like

@beta-ziliani: Thanks for the update!

I’ve added MySQL support to the sample application and experimented a bit with that. The behavior is similar to the one of the Postgres backend, time grows linearly (kind of), and often some fibers get lost.

HTH, and thanks for taking the time to look at this.

1 Like

One thing: At the moment I only care about the missing packets. The times being linear is reasonable to me: take into account that you start measuring immediately after the fiber starts, so the measured time can only grow from there waiting for the db to respond to the previous requests. Try adding a growing sleep time to let the db respond in time and you see what I mean: every request just takes some ms, from the first to the last.

Try changing the Makefile for the showcase compilation to include the flag -Dpreview_mt:

bin/showcase: $(SHOWCASE_SRC)
  @mkdir -p bin/
  crystal build -o bin/showcase --error-trace src/app/main.cr -Dpreview_mt

Then also at runtime you can increase the amount of concurrency by adjusting the CRYSTAL_WORKERS env var based on your machine. For example:

CRYSTAL_WORKERS=20 bin/showcase -c100

You’re running this true multithread?