Unified Intermediary Serialized Data Representation

Introduction

I had an opportunity to work with Symfony’s Serializer component recently. While I don’t think we need something like it in the stdlib, one part did stick out as possibly useful: how it uses an associative array as an intermediary between the raw format and an object.

Crystal currently handles this with per-format data structures: JSON::Any and YAML::Any. These structures are very closely related in terms of their function and API:

Method                 JSON::Any   YAML::Any   Notes
#==(other : self)      X           X
#==(other)             X           X
#[]                    X           X           JSON has separate overloads, which I guess makes it impossible to have Int32-keyed hashes?
#[]?                   X           X           Ditto
#as_a, #as_a?          X           X
#as_bool, #as_bool?    X           X
#as_bytes, #as_bytes?              X
#as_f, #as_f?          X           X
#as_f32, #as_f32?      X           X
#as_h, #as_h?          X           X
#as_i, #as_i?          X           X
#as_i64, #as_i64?      X           X
#as_nil                X           X
#as_s, #as_s?          X           X
#as_time, #as_time?                X
#dig                   X           X
#raw                   X           X
#size                  X           X

Proposal

I had the thought of merging these two concepts into a Serializable::Any. I also considered just Any, as there’s nothing that makes this specific to serialization, but I wanted to make it clearer that this is something used in relation to serialization, not something you should reach for any time you want “dynamic” types. This type would then represent data currently in between a serialization format and a deserialized value. This comes with a few benefits that I can think of:

  1. Provide a common type to reduce duplication and enable standardization (see the small example after this list).
    1. Adding support for a new underlying type/value would work across the board
  2. Function as an intermediary state for objects as well, i.e. https://github.com/crystal-lang/crystal/issues/6309
    1. Could be used to implement a single representation of the state of an object to be (de)serialized, using standardized annotations as well
  3. Make it easier for third-party libs to handle custom formats. For example, if you create a shard to handle, say, TOML, it would work with anything set up to handle Serializable::Any, in both the to/from-object and to/from-format directions.
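
To make the duplication concrete, here’s a small stdlib-only illustration (the port_from helper is just an example): generic code over the two Any types currently has to spell out a union or separate overloads, even though the calls are identical. A shared Serializable::Any would make that unnecessary:

require "json"
require "yaml"

# Generic code over the intermediary types has to enumerate every format's Any type.
def port_from(any : JSON::Any | YAML::Any) : Int64
  any["server"]["port"].as_i64
end

puts port_from JSON.parse(%({"server": {"port": 8080}})) # => 8080
puts port_from YAML.parse("server:\n  port: 8080")       # => 8080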

Open Questions

  • This is probably not a replacement to the current streaming APIs
    • Given this approach needs to load all the data into memory, it wouldn’t be as efficient, but it would be more flexible. Third-party shards could implement their own serialization logic using it if they want, but stdlib would still use JSON::PullParser, for example.
  • How to implement this in a backwards compatible way?
    • Given the APIs are so similar, we might be able to get away with just aliasing both of the existing Any types to this new type (see the sketch after this list). The constructors are basically the same, minus the ones specific to JSON::PullParser and/or YAML::ParseContext, so it’s TBD how to handle those.
    • Another option would be an entirely separate method/alternate to the existing .parse methods that returns this type instead.
    • Option three, provide some built-in way to pass an existing any type to this type and it’ll convert itself.
  • How to handle forward compatibility, given that adding, say, #to_i128 could be considered a breaking change?
  • TBD
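
As a very rough sketch of the aliasing option (none of this exists today; Serializable::Any is the proposed type and its body is elided):

module Serializable
  # The proposed shared intermediary type; it would carry the same API
  # surface as today's JSON::Any / YAML::Any.
  struct Any
    alias Type = Nil | Bool | Int64 | Float64 | String | Array(Any) | Hash(Any, Any)

    getter raw : Type

    def initialize(@raw : Type)
    end
  end
end

# The existing names become aliases, so code that spells out JSON::Any or
# YAML::Any keeps compiling.
module JSON
  alias Any = Serializable::Any
end

module YAML
  alias Any = Serializable::Any
end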

Starting this off as a forum thread as I’d like to get some feedback/flesh things out a bit more before I think it’s actionable.


I’m not sure if merging JSON::Any and YAML::Any (and potentially other formats) would be a good idea. These representations are very specific to the respective data format. JSON and YAML are quite similar because the former is a subset of the latter, but we already have some differences that affect the typed interface. For example, in YAML, keys can be any type, while they must be strings in JSON.
They’re little things, but there are far more substantial differences when incorporating other data formats such as TOML, XML, or CSV. A generalization like this should be explicitly open for such extensions.

Instead, I see much potential for merging JSON::Serializable and YAML::Serializable into a generalized and extensible feature.
If I understand correctly, that’s what Symfony’s Serializer component is primarily about: being able to define a universal serialization interface which supports a number of different data formats. Currently, if you want to support different serialization formats, you have to duplicate annotations for each format (@[JSON::Field], @[YAML::Field] etc.), and new formats need to implement the entire serialization process instead of focusing on the part that’s specific to the data format.
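
To illustrate the duplication with today’s stdlib (the Event type here is just an example), even an identical key mapping has to be declared once per format:

require "json"
require "yaml"

struct Event
  include JSON::Serializable
  include YAML::Serializable

  # The same mapping has to be repeated for every format the type supports.
  @[JSON::Field(key: "createdAt")]
  @[YAML::Field(key: "createdAt")]
  getter created_at : Time
end

event = Event.from_json(%({"createdAt": "2024-08-24T04:52:53Z"}))
puts event.created_at # => 2024-08-24 04:52:53 UTC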

I previously mentioned Rust’s serde as another inspiration for that.

I’ve been wanting a unified serialization mixin with different adapters for the different serialization formats for a long time as well, specifically because of the issues with annotations. Serializing a type designed around JSON into a different format with Crystal is an amazing experience until you need to customize serialization on a type you don’t own.

I would honestly love to see something like Serde, though with a more Crystal-like API. I started playing around with the idea a while back and ended up finishing a proof of concept today:

require "serializable/json"
require "serializable/msgpack"
require "serializable/db"
require "pg"
require "uuid"

struct User
  include Serializable::Object

  getter id : UUID
  getter name : String
  @[Serializable::Field(converter: String::UpcaseConverter)]
  getter capitalized_string : String
  @[Serializable::Field(converter: Time::NanosecondsConverter)]
  getter created_at : Time

  def initialize(*, @id = UUID.v7, @name, @capitalized_string, @created_at)
  end

  # Strictly for the example
  def to_db_args
    {
      id.to_s,
      name,
      capitalized_string,
      created_at.to_rfc3339(fraction_digits: 9),
    }
  end
end

pg = DB.open("postgres:///")
user = User.new(
  name: "Jamie",
  capitalized_string: "FOO bar",
  created_at: Time.utc,
)
pp(
  canonical_user: user,
  from_msgpack: User.from_msgpack(user.to_msgpack),
  from_json: User.from_json(user.to_json),
  from_db: pg.query_one(
    <<-SQL,
      SELECT
        $1::uuid AS id,
        $2::text AS name,
        lower($3::text) AS capitalized_string,
        $4::timestamptz AS created_at
      SQL
    *user.to_db_args,
    as: User,
  ),
)
# {canonical_user:
#   User(
#    @capitalized_string="FOO bar",
#    @created_at=2024-08-24 04:52:53.955393000 UTC,
#    @id=UUID(019182ba-ec43-74a7-820a-5cc50c2f0a61),
#    @name="Jamie"),
#  from_msgpack:
#   User(
#    @capitalized_string="FOO BAR",
#    @created_at=2024-08-24 04:52:53.955393000 UTC,
#    @id=UUID(019182ba-ec43-74a7-820a-5cc50c2f0a61),
#    @name="Jamie"),
#  from_json:
#   User(
#    @capitalized_string="FOO BAR",
#    @created_at=2024-08-24 04:52:53.955393000 UTC,
#    @id=UUID(019182ba-ec43-74a7-820a-5cc50c2f0a61),
#    @name="Jamie"),
#  from_db:
#   User(
#    @capitalized_string="FOO BAR",
#    @created_at=2024-08-23 23:52:53.955393000 -05:00 America/Chicago,
#    @id=UUID(019182ba-ec43-74a7-820a-5cc50c2f0a61),
#    @name="Jamie")}

module String::UpcaseConverter
  extend self

  def from_json(json : JSON::PullParser) : String
    json.read_string.upcase
  end

  def to_json(string : String, json : JSON::Builder) : Nil
    json.string string.downcase
  end

  def from_msgpack(msgpack : MessagePack::Unpacker) : String
    msgpack.read_string.upcase
  end

  def to_msgpack(string : String, msgpack : MessagePack::Packer) : Nil
    msgpack.write string.downcase
  end

  def from_rs(rs : DB::ResultSet)
    rs.read(String).upcase
  end
end

module Time::NanosecondsConverter
  extend self

  def to_json(time : Time, json : JSON::Builder) : Nil
    json.string do |io|
      time.to_rfc3339 io, fraction_digits: 9
    end
  end

  def from_json(json : JSON::PullParser) : Time
    Time.new(json)
  end

  def to_msgpack(time : Time, msgpack : MessagePack::Packer) : Nil
    msgpack.write_array_start 2
    msgpack.write time.@seconds
    msgpack.write time.@nanoseconds
  end

  def from_msgpack(msgpack : MessagePack::Unpacker) : Time
    seconds, nanoseconds = Tuple(Int64, Int32).new(msgpack)
    Time.new(
      seconds: seconds,
      nanoseconds: nanoseconds,
      location: Time::Location::UTC,
    )
  end

  def from_rs(rs : DB::ResultSet) : Time
    rs.read Time
  end
end

My biggest gripe with a mix-in approach is that it just feels like bad design, IMO. It very tightly couples the serialization logic to the object being serialized, which ultimately makes testing and the like a whole lot harder.

I tried with Athena Serializer to keep things somewhat separated with a dedicated Serializer type/interface. I’m not really happy with how it’s implemented though. I might look into refactoring it at some point to see if Symfony’s implementation is any better. Either way I think it’s going to be tricky, due to PHP being dynamically typed and having runtime reflection and such to construct objects outside of the instance itself. I do like the approach it takes design-wise, so I’ll just have to see if it would be possible to port, and what the DX would be like.

It’s just a bit unclear to me at the moment how to handle going from the decoded data to an object instance when each format has its own representation. In PHP land it’s easy, as it’s all just an array. Maybe an internal representation that each decoder is responsible for converting the data into would be enough, versus having Serializable::Any be a public thing… I’ll have to play around with it a bit more.

Bad design? Or a compromise?

When I originally started this experiment I did try to have serialization and deserialization happen in separate objects. I’m a fan of small objects over throwing absolutely anything and everything related to a user into the User object. It was the driving force behind Interro separating queries from models. IMO, querying the object from the DB, representing it in the application, and serializing it over the wire are separate responsibilities and it makes sense to have that functionality live in separate objects.

However, ditching the mixin also forfeits the initialize method it defines, which, for most serializable objects I write, is the only initialize method for that type. I never instantiate them in application code (they’re instantiated exclusively by DB queries and remote-API responses), so defining initialize isn’t otherwise necessary, yet any non-nilable ivars must still be initialized even if that constructor is never called. I agree with you that the mixin isn’t ideal, but it’s a decent compromise.

In cases where you might want to represent an object differently in your API than you do in your database, you can (and should!) use a separate API-specific serialization. You probably want to store the user’s hashed password in the DB, but you probably don’t want to send it out over the API, for example. The following code defines a ModelSerializer mixin to use for objects whose sole purpose is to serialize models for an API response. A ModelSerializer can also reference other Serializable::Object instances, so the PostSerializer wraps both a Post DB model and a UserSerializer. I’ve provided a complete, executable example; you just need to load the jgaskins/serializable and jgaskins/interro shards if you want to run it.

require "interro"
require "crypto/bcrypt/password"
require "serializable/json"
require "serializable/db"

configure_db

# Create the objects in the DB and return them
user = UserQuery.new.create(email: "jamie@example.com", password: Password.create("password", cost: 4))
post = PostQuery.new.create(title: "First post", body: "Hello world", author: user)

pp user
# User(
#  @created_at=2024-08-25 13:59:55.013007000 -05:00 America/Chicago,
#  @email="jamie@example.com",
#  @id=UUID(c588af6e-e7a2-4546-956f-414533e755eb),
#  @password=$2a$04$GM3eayD5TbwTwZdArICLUuaalkdXiWy/oEesMnaL6sS1jWshV9J9i)
pp post
# Post(
#  @author_id=UUID(c588af6e-e7a2-4546-956f-414533e755eb),
#  @body="Hello world",
#  @created_at=2024-08-25 13:59:55.014550000 -05:00 America/Chicago,
#  @id=UUID(ad22b56b-cf51-450d-8b53-fb6bd4c93387),
#  @published_at=nil,
#  @title="First post")

# Set up the serialized output like you would in your API
output = PostSerializer.new(
  model: post,
  author: UserSerializer.new(user)
)

puts output.to_json
# {"id":"ad22b56b-cf51-450d-8b53-fb6bd4c93387","title":"First post","body":"Hello world","author":{"id":"c588af6e-e7a2-4546-956f-414533e755eb","email":"jamie@example.com","created_at":"2024-08-25T13:59:55-05:00"}}

# I'm not writing this whole type name out every time
alias Password = Crypto::Bcrypt::Password

struct User
  include Serializable::Object

  getter id : UUID
  getter email : String
  @[Serializable::Field(converter: Password)]
  getter password : Password
  getter created_at : Time
end

struct UserQuery < Interro::QueryBuilder(User)
  table "users"

  def find(id : UUID) : User?
    where(id: id).first?
  end

  def create(email : String, password : Password)
    insert email: email, password: password.to_s
  end
end

struct Post
  include Serializable::Object

  getter id : UUID
  getter title : String
  getter body : String
  getter author_id : UUID
  getter published_at : Time?
  getter created_at : Time
end

struct PostQuery < Interro::QueryBuilder(Post)
  table "posts"

  def create(title : String, body : String, author : User) : Post
    insert title: title, body: body, author_id: author.id
  end
end

module ModelSerializer
  include Serializable::Object

  annotation Field
  end

  def initialize(model)
    {% for ivar in @type.instance_vars %}
      {% ann = ivar.annotation(::ModelSerializer::Field) %}
      {% if !ann || ann[:ignore] != true %}
        @{{ivar}} = model.{{ivar}}
      {% end %}
    {% end %}
  end
end

struct UserSerializer
  include ModelSerializer

  getter id : UUID
  getter email : String
  getter created_at : Time
end

struct PostSerializer
  include ModelSerializer

  getter id : UUID?
  getter title : String?
  getter body : String?
  getter published_at : Time?

  # Provided directly in `initialize` so we don't want to infer it from the model
  @[ModelSerializer::Field(ignore: true)]
  getter author : UserSerializer

  def initialize(model, @author)
    super model
  end
end

class Password
  def self.from_rs(rs : DB::ResultSet)
    new rs.read(String)
  end
end

def configure_db(db = DB.open("postgres:///"))
  db.exec "DROP TABLE IF EXISTS posts"
  db.exec "DROP TABLE IF EXISTS users"
  db.exec <<-SQL
    CREATE TABLE users(
      id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
      email TEXT UNIQUE NOT NULL,
      password TEXT NOT NULL,
      created_at TIMESTAMPTZ NOT NULL DEFAULT now()
    )
    SQL
  db.exec <<-SQL
    CREATE TABLE posts(
      id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
      title TEXT NOT NULL,
      body TEXT NOT NULL,
      author_id UUID NOT NULL REFERENCES users(id),
      published_at TIMESTAMPTZ,
      created_at TIMESTAMPTZ NOT NULL DEFAULT now()
    )
    SQL

  Interro.config do |config|
    config.db = db
  end
end

Not necessarily. I don’t think you’re going to be able to ditch the mix-in entirely, but you could make it so that it really only defines the constructor and acts as a marker, versus also embedding all the (de)serialization logic/methods.

The libraries I’ve used in the past handle this via serialization groups and/or annotations that can be used to exclude specific properties, either uni- or bi-directionally. But yeah, I’ve found using a type separate from your DB model to be a good practice as well.

I started playing around with the Serializer component refactor. There are still a few things I need to figure out, but the interface basically looks like:

module SerializerInterface
  abstract def serialize(data : _, format : String, context : ASR::Context = ASR::Context.new) : String
  abstract def deserialize(data : _, type : T.class, format : String, context : ASR::Context = ASR::Context.new) forall T
end

The serializer is set up like:

serializer = ASR::Serializer.new(
  encoders: [ASR::Encoder::JSONEncoder.new],
  normalizers: [ASR::Normalizer::Time.new, ASR::Normalizer::UUID.new]
)

Encoders go from a raw format (JSON, YAML, etc.) to the ASR::Any representation (which is basically an internal clone of JSON::Any) and vice versa. Normalizers go from an object to ASR::Any and vice versa. This would allow custom objects to instantiate from the ASR::Any type and be generic enough to work with any format. In theory at least :sweat_smile:. I think this works well, as third-party shards can define/create their own encoders/normalizers and things just work.
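
As a toy illustration of that split, using JSON::Any as a stand-in for ASR::Any (these are made-up names to show the flow, not the actual Athena API):

require "json"
require "uuid"

# Toy stand-ins: the encoder only knows about the wire format,
# the normalizer only knows about the Crystal type.
module ToyJSONEncoder
  def self.decode(raw : String) : JSON::Any
    JSON.parse raw
  end

  def self.encode(any : JSON::Any) : String
    any.to_json
  end
end

module ToyUUIDNormalizer
  def self.denormalize(any : JSON::Any) : UUID
    UUID.new any.as_s
  end

  def self.normalize(uuid : UUID) : JSON::Any
    JSON::Any.new uuid.to_s
  end
end

# Deserialization = decode (format -> intermediary) + denormalize (intermediary -> object)
any  = ToyJSONEncoder.decode %("019182ba-ec43-74a7-820a-5cc50c2f0a61")
uuid = ToyUUIDNormalizer.denormalize any
puts uuid # => 019182ba-ec43-74a7-820a-5cc50c2f0a61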

From here you can go ahead and deserialize something via:

context = ASR::Context.new
  .for(ASR::Normalizer::Time, format: "%F")

raw = %("2024-08-24")
pp serializer.deserialize raw, Time, "json", context # => 2024-08-24 00:00:00.0 UTC

Context allows passing state around between the encoders/normalizers to control things globally. There would also be a way to define context scoped to specific properties, in addition to the global one.

Not sure if this is closer to what @straight-shoota had in mind/or is similar to how serde works. But it seems like a pretty solid implementation, if I can get everything working as I’d like.

This is actually the reason I stopped trying to avoid serialization inside the object. If I had to include a mixin to avoid having to write an initialize method, I was effectively right back where I started.

To me, the most important thing is my interface to the feature. Whatever it does behind the scenes is fine with me. I don’t see much difference between including a mixin that implements serialization/deserialization and including a mixin that allows the object to be serialized/deserialized. As a consumer of the framework, that feels like a distinction without a meaningful difference.

I do like your idea around normalizers, though. A given app should almost certainly serialize the same type of value uniformly for a given format, and the annotation approach requires you to specify the converter for every single one. And since interacting with third-party APIs requires using their particular serialization, you can’t just monkeypatch. I’ve wanted to monkeypatch Time#to_json(JSON::Builder) in so many apps so I didn’t have to copy/paste the annotation so many times, but I couldn’t, because the different APIs I was using all serialize times very differently:

API      Timestamp format
GitHub   ISO-8601 strings with 1s precision
OpenAI   integer Unix timestamps in seconds
NATS     integer Unix timestamps in nanoseconds
Slack    floating-point Unix timestamps in seconds with µs precision, but they’re serialized as strings

It made me wish there was a way to tell JSON::Serializable to use the same converter annotation for all Time ivars in any object in a given namespace.
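
For example, the OpenAI-style case is already covered by a stdlib converter, but it still has to be attached to every single Time ivar by hand (the Completion type here is just an illustration):

require "json"

struct Completion
  include JSON::Serializable

  # Without a namespace-wide default, this annotation gets copy/pasted onto
  # every Time ivar that the API serializes as Unix seconds.
  @[JSON::Field(converter: Time::EpochConverter)]
  getter created : Time
end

completion = Completion.from_json(%({"created": 1724475173}))
puts completion.created # => 2024-08-24 04:52:53 UTC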

IMO this is exactly the meaningful difference it provides. Having the separation of concerns makes it a lot easier to do almost everything. Easier to test the code, easier to add points of customization/extension, and easier to change things in the future since things are more or less decoupled from one another.

But thanks, I’ll see if I can get something working and report back if/when I do. Have to see how things play out in practice.

Alright, the code so far is on this branch: Comparing master...serializer-refactor · athena-framework/athena · GitHub. I have it all in one file atm just to make things easier while it’s a WIP, but I’ll eventually move things into a proper file structure.

Expanding upon my last comment, I think it’s working pretty darn well so far. It makes heavy use of free variables to pass around the type you want to deserialize/denormalize into. You’re able to access ivars off the type as well, so my understanding is each unique type will get its own overload of some of the methods. This seems to keep things type safe and prevents #deserialize from returning a union of all deserializable types.

I also learned you can access other types off of free variables, e.g. T::Context. That made things a bit more slick as it can throw compile time errors if you try to set context for a type that doesn’t support it.

As things stand, it’s seemingly able to handle the more primitive types out of the box (Number, Bool, String, Nil, Array, Hash, and unions). For other types, I think it would make sense to add them as Normalizers, like I did for Time and UUID.

In this first pass I did some unsafe things to try and see if I could make ASR::Serializable nothing more than a marker interface, but I think I’m going to have to make it define some sort of constructor to more safely instantiate the object, as atm it segfaults on structs.

Annotation support is pretty much only ASRA::Ignore as I need to wire up the logic for the others.

Symfony Serializer also has some other features I want to try and implement, but also quite a few that I don’t think make sense at all for us, so I’m going to punt on those.

At the moment, if a type is unable to be (de)serialized, a runtime error is raised. I’m not sure what people’s expectations around this are, but with the way it’s implemented it’s not really clear at compile time whether something can be (de)serialized or not. So that’s probably fine for now :person_shrugging:.

But yea, open to feedback/suggestions.

Example code, only deserialization is supported atm:

class Book
  include ASR::Serializable

  getter title : String? = nil
end

class User
  include ASR::Serializable

  getter name : String
  getter age : Int32
  getter active : Bool = true
  getter values : Array(Int32 | String) = [] of Int32 | String
  getter map : Hash(String, Bool) = {} of String => Bool
  getter book : Book = Book.new

  def initialize(@name, @age); end
end

serializer = ASR::Serializer.new(
  encoders: [ASR::Encoder::JSONEncoder.new],
  normalizers: [ASR::Normalizer::Object.new]
)

raw = %({"name":"Jon","age":16,"values":[6, "9", 12],"map":{"0": false,"1":true},"book":{"title":"Moby"}})
pp serializer.deserialize raw, User, "json"
# #<User:0x702ca29ae080
#  @active=true,
#  @age=16,
#  @book=#<Book:0x702ca29a9940 @title="Moby">,
#  @map={"0" => false, "1" => true},
#  @name="Jon",
#  @values=[6, "9", 12]>