Introducing bindings to DuckDB, an in-process SQL OLAP DBMS

Hi everyone!

For my work I use Crystal to pre-process data for statistical analysis in R, but due to the lack of support for certain formats used in analytics (e.g., Parquet and Arrow) I often have to resort to CSV, which is inefficient for larger datasets.

Recently I discovered DuckDB, which seemed apt for my use case to do Online Analytical Processing (OLAP) with larger-than-memory datasets. DuckDB fills the void of an in-process relational database (like SQLite) but for OLAP workloads (like ClickHouse).

DuckDB uses a columnar store for data, like most DB engines intended for analytics, and it is one of the most performant for medium-sized data, beating even Pandas and Spark on certain operations. In addition, it offers an appending feature to efficiently add rows to the database directly from the host application (orders of magnitude faster than insert statements). In short, it’s a relatively young but exciting project.

For these reasons, I have created crystal-duckdb, which offers a driver for crystal-db plus some features specific to DuckDB. This is an initial implementation compatible with the recently released DuckDB v0.2.8, but it should already cover many use cases.
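
Usage goes through the standard crystal-db API. Here is a rough sketch of what that looks like (the duckdb:// connection scheme and the ? placeholders below are just illustrative assumptions; the README has the exact details):

require "db"
require "duckdb" # the crystal-duckdb shard

# Illustrative connection string; the actual scheme is documented in the README.
DB.open "duckdb://./analytics.db" do |db|
  db.exec "CREATE TABLE points (x INTEGER, y INTEGER)"
  db.exec "INSERT INTO points VALUES (?, ?)", 1, 2

  db.query_each "SELECT x, y FROM points" do |rs|
    puts "#{rs.read(Int32)}, #{rs.read(Int32)}"
  end
end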

If you need to do analytics and would like to use Crystal, give it a try. As always, PRs are welcome!

Awesome! More drivers!

Feel free to send a PR adding it to the crystal-lang/crystal-db README to give (hopefully) some visibility.

I see the sqlite driver served as a basis, am I right? Are there any particular nuances you stumbled upon while implementing it?

Thanks Brian! Yes, the SQLite driver was my main reference, but I also had to look at the other drivers to better understand what was needed :sweat_smile:. BTW, DB::DriverSpecs was a huge time saver; I only needed to skip nested transactions.

There is a particular nuance that I found for which I would appreciate your feedback. Unlike other DBMSs, even SQLite, to use DuckDB via the C API one first needs to create a duckdb_database handle, and then with this handle one can create one or more duckdb_connection handles. I did not find a clear way to do this with crystal-db, but honestly I didn’t delve into it.

For instance, with ruby-duckdb:

require 'duckdb'

DuckDB::Database.open do |db|
  db.connect do |con|
    con.query("SELECT 1")
  end
end

And in R, following the DBI spec (analogous to crystal-db):

library(duckdb)
drv <- duckdb(":memory:", read_only = FALSE)
con <- dbConnect(drv)
dbGetQuery(con, "SELECT 1")

For the moment, in crystal-duckdb a duckdb_database handle is created for every connection. It works, and given the OLAP context it is not an issue. However, ideally, when using DB.open a single duckdb_database handle would be created and shared by all of the pool connections, and when using DB.connect directly a duckdb_database would automatically be created for the connection handle.

Really every database connection workflow is like this AFAIK.

crystal-db just provides an API abstraction that works without explicitly creating a connection. It uses a connection pool internally to re-use existing connections.

So when you call db.query(sql), that actually translates to db.using_connection {|connection| connection.query(sql) }. Database#using_connection checks out a connection from the pool (or creates a new one) and returns it back to the pool afterwards.
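
So these two snippets are roughly equivalent (using scalar just as an example query method):

# Implicit: checks out a pooled connection, runs the query, returns the connection.
db.scalar "SELECT 1"

# Explicit equivalent via Database#using_connection.
db.using_connection do |connection|
  connection.scalar "SELECT 1"
end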

Yes, such a connection workflow is common across most DBMSs. However, the issue concerns reusing a single database handle (a duckdb_database C struct) for the connections (with underlying duckdb_connection C structs) that are recycled inside the pool, not reusing the connections themselves. Most likely I didn’t explain myself clearly, or perhaps there is a way to easily do this in crystal-db that I missed.

For instance, with the SQLite C API you only need sqlite3_open() and sqlite3_close() to get a connection handle (there is no separate database handle). This is in line with other DBMSs that use a similar connection concept (e.g., via sockets), since the database is usually an external process. However, with the DuckDB C API you do need a database handle to create subsequent connection handles.

// Using the DuckDB C API
#include <duckdb.h>

int main(void) {
    duckdb_database db;
    duckdb_connection con;

    // NULL opens an in-memory database; pass a filename string to persist.
    if (duckdb_open(NULL, &db) == DuckDBError) {
        // handle error
    }
    // Connections are created from the database handle.
    if (duckdb_connect(db, &con) == DuckDBError) {
        // handle error
    }

    duckdb_disconnect(&con);
    duckdb_close(&db);
    return 0;
}

Therefore, I’m not sure which approach I should take within crystal-db (e.g., which classes/methods to inherit from) in order to reuse a single duckdb_database handle when the pool connections are created.

I’m not that familiar with crystal-db, so I hope it is clearer now. :stuck_out_tongue:

Ah, I understand now. Thanks for bearing with me.

You could probably just keep the db handle in the driver implementation. Each driver instance is specific to a database, so no worries about sharing state between instances.
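
Something along these lines, as a very rough sketch (LibDuckDB, DuckDB.open_database and the extra Connection.new argument are just placeholders for whatever your binding code looks like):

class DuckDB::Driver < DB::Driver
  # Lazily opened duckdb_database handle, shared by every connection
  # built by this driver instance.
  @db_handle : LibDuckDB::Database? = nil

  def build_connection(context : DB::ConnectionContext) : DB::Connection
    handle = (@db_handle ||= DuckDB.open_database(context.uri))
    DuckDB::Connection.new(context, handle)
  end
end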

Maybe @bcardiff has some better idea, though =)

I see. The only way I can think of with crystal-db as it is right now is to reopen module ConnectionContext with a lazily initialized duckdb_handle property (probably nilable).

So from the DuckDB::Connection you can access it. Note that @context : ConnectionContext is the initializer’s argument in Connection.

The property would need to be lazily initialized since the types including ConnectionContext are not parametrized per driver; they are shared.
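
Something like this, as a rough sketch (LibDuckDB and DuckDB.open_database are placeholders for the actual binding code):

module DB::ConnectionContext
  # Nilable and lazily set, since ConnectionContext is shared by all drivers.
  property duckdb_handle : LibDuckDB::Database?
end

class DuckDB::Connection < DB::Connection
  def initialize(context : DB::ConnectionContext)
    super(context)
    # Reuse the database handle if another pooled connection already opened it.
    db = context.duckdb_handle ||= DuckDB.open_database(context.uri)
    # ... create this connection's duckdb_connection from `db` ...
  end
end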

If this is not clear enough let me know and I can try to iterate a bit :-)

I never thought of it but sqlite3 :memory: seems to require a ?cache=shared in order to work with the connection pool. In-Memory Databases #TIL

I’ll give it a shot when working on the next release for DuckDB v0.2.9. Thanks!