Possible to serialize/deserialize an object like Ruby marshaling?

There’s something I’d like to see in Crystal (actually I started it, but with all the other things it’s going quite slowly…): a serialization mechanism similar to https://serde.rs, where you specify how your object is to be serialized but not with which serialization mechanism, so you can easily use anything like JSON, YAML, msgpack, …

And like you say, @asterite, you can configure some things using attributes, like ignoring fields, additional fields, conversions, and custom serialization (still mechanism-agnostic).


Yes, that’s something I’d like to see as well. It’s so much hassle when you have a type that should be serializable to different formats.

So in this generic serialization format… how do you say something should be an XML attribute or an XML element?

In this case there would be some specific flags for the XML (de)serializer, I guess…

https://github.com/RReverser/serde-xml-rs doesn’t seem to have that, maybe the (de-)serializer is smart for some things? (no time to check right now)

As far as I understand it, serde-xml-rs uses both attributes and child elements to deserialize an object. Serialization seems to not create any attributes by default. See their test suite.

I have experimented a bit, and got this:

require "json"

module JSON
  # Parses a JSON object from `data` and yields each key together with
  # the pull parser, so the caller can read the corresponding value.
  def self.deserialize(data, &block)
    pull = PullParser.new(data)
    pull.read_begin_object
    while !pull.kind.end_object?
      key = pull.read_object_key
      yield key, pull
    end
  end

  # Serializes the given members to a JSON string.
  def self.serialize(**members)
    String.build do |str|
      JSON.serialize str, **members
    end
  end

  # Serializes the given members to `io`.
  def self.serialize(io : IO, **members)
    JSON.build(io) do |json|
      members.build json
    end
  end
end

class String
  # Reads a String value from the pull parser (deserialization).
  def self.from(pull : JSON::PullParser)
    pull.read_string
  end

  # Writes this String with the JSON builder (serialization).
  def build(builder : JSON::Builder)
    to_json builder
  end
end

struct Int32
  # Reads an Int32 value from the pull parser (deserialization).
  def self.from(pull : JSON::PullParser)
    v = pull.int_value.to_i32
    pull.read_next
    v
  end

  # Writes this Int32 with the JSON builder (serialization).
  def build(builder : JSON::Builder)
    to_json builder
  end
end

struct NamedTuple
  # Writes this NamedTuple as a JSON object with the builder.
  def build(builder : JSON::Builder)
    to_json builder
  end
end

class Object
  # Builds an instance of `self` from `data`, using the given
  # serialization module (e.g. JSON) to iterate over the keys.
  def self.from(type, data)
    {% for ivar in @type.instance_vars %}
    _{{ivar.id}} = nil
    {% end %}

    {% begin %}
    # Standard JSON/YAML etc. iterator
    type.deserialize(data) do |key, pull|
      case key
      {% for ivar in @type.instance_vars %}
      when {{ivar.stringify}} then _{{ivar.id}} = {{ivar.type.id}}.from(pull)
      {% end %}
      else raise "unknown key: #{key}"
      end
    end

    {{@type.id}}.new(
      {% for ivar in @type.instance_vars %}\
        {{ivar.id}}: _{{ivar.id}}.as({{ivar.type}}),
      {% end %}\
    )
    {% end %}
  end

  # Serializes all instance variables with the given serialization
  # module (e.g. JSON).
  def build(type)
    type.serialize(
      {% for ivar in @type.instance_vars %}\
        {{ivar}}: @{{ivar}},
      {% end %}
    )
  end
end

record Point, x : Int32, y : String
data = %({"x": 1, "y": "abc"})

point = Point.from(JSON, data)
puts point #=> Point(@x=1, @y="abc")

puts point.build(JSON) #=> {"x":1,"y":"abc"}

The implementation is imperfect; it only exists to show that it’s possible. We can then implement custom generic annotations, inspired by Serializable.
If we implement a new way to map JSON/YAML, we have to think about how to phase out .mapping and Serializable. It won’t be reasonable to have 3 ways to do the same thing in the stdlib.

I’ve been thinking about this a bit, as I came up with a better way to implement CrSerializer that would be more flexible. However, I think it’s important for us to define what serialization in Crystal should look like, and the goals we wish to achieve. Then from there we have some criteria to evaluate various implementations against.

Some of my thoughts/ideas.

Annotations

Annotations should be used to control how properties get serialized/deserialized in a (mostly) format-agnostic way. The current approach of using JSON::Field and YAML::Field does not scale if every format needs its own annotation.

However, if each format had its own, it would allow you to control how a type is serialized on a per-format basis. I kind of doubt this is common/useful enough to worry about?

Pluggable

Supporting a new format shouldn’t require anything more than defining how each type gets serialized. The “framework” around the serialization logic should be kept separate from the actual implementation.

There will be exceptions to this. For example, XML: annotating a property with @[XmlAttribute] would be specific to how that property is serialized in XML.
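
For illustration, such an XML-specific annotation might look like this (XmlAttribute, Link, and the serialized forms in the comments are assumptions, not an existing API):

annotation XmlAttribute
end

class Link
  # Serialized as an XML attribute: <link href="/home"/>
  @[XmlAttribute]
  getter href : String = "/home"

  # Serialized as a child element: <link><text>Home</text></link>
  getter text : String = "Home"
end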

Flexible

The current implementations make it hard to add new features/customization beyond converters. The ideal framework would allow greater control over how, when, and if a property should be serialized.

Having some extra control/flexibility would be great. A few examples (sketched in code after this list):

  • Serialize based on a version: since/until x
  • Changing the view based on the groups a property is in
  • Being able to consume a property on deserialization but skip it on serialization
  • Something custom the user wants to implement
    • E.g. whether a property should be serialized
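
A minimal, runnable sketch of how a couple of these could work, using hypothetical Since and Groups annotations (all names and the version/group semantics are assumptions, not a proposed API):

annotation Since
end

annotation Groups
end

class Article
  @[Groups("admin")]
  getter internal_notes : String = "draft"

  @[Since(2)]
  getter slug : String = "hello-world"

  getter title : String = "Hello"
end

# Collects the properties visible for a given schema version and group.
def visible_fields(object : O, version : Int32, group : String) forall O
  fields = {} of String => String
  {% for ivar in O.instance_vars %}
    {% since = ivar.annotation(Since) %}
    {% groups = ivar.annotation(Groups) %}
    if {{(since ? since[0] : 0)}} <= version &&
       ({{(groups ? groups[0] : nil)}} || group) == group
      fields[{{ivar.stringify}}] = object.@{{ivar.id}}.to_s
    end
  {% end %}
  fields
end

puts visible_fields(Article.new, 1, "public") # => {"title" => "Hello"}
puts visible_fields(Article.new, 2, "admin")  # => {"internal_notes" => "draft", "slug" => "hello-world", "title" => "Hello"}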

API

I’m thinking a good API would be something like (a rough sketch in code follows the parameter list):

Serializer.serialize(data : _, format : Format, context : Context? = nil) : R

Where:

  • data - The obj/type you want to serialize
  • format - An enum value representing the supported formats
  • context - A class that could be used to pass generic data to the framework
    • Like which groups to use, or what version, etc.
  • R - The return type: String, Bytes, etc.
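
A rough, runnable sketch of that shape; the Format enum, Context class, and the method body (which just delegates to the existing to_json/to_yaml) are placeholder assumptions, not an actual implementation:

require "json"
require "yaml"

enum Format
  JSON
  YAML
end

# Carries generic data for the framework, like which groups or version to use.
class Context
  getter groups : Array(String)

  def initialize(@groups = [] of String)
  end
end

module Serializer
  # Returns the serialized representation of `data` in the given format.
  def self.serialize(data, format : Format, context : Context? = nil) : String
    case format
    in Format::JSON then data.to_json
    in Format::YAML then data.to_yaml
    end
  end
end

puts Serializer.serialize({"x" => 1}, Format::JSON) # => {"x":1}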

Then we could probably retain the to_json method, but have it internally call serialize while passing an optional context object.

Final Thoughts

While this by no means represents what the actual implementation will look like, I think it’s a conversation we should start having sooner rather than later.

I think it’s also important to understand we don’t have to do all of this in macro land. Using macros/annotations to provide “metadata” objects for each property, which can then be processed into the final output at runtime, is much easier and gives way more flexibility.

Annotations

Yes, there should be a generic annotation. I suppose we won’t get around needing some format-specific options as well. But annotation arguments are flexible, so you could just put format-specific options into the generic annotation.

@[Serializer::Field(json_bignum_to_string: true)]
# vs.
@[Serializer::Field]
@[JSON::Field(bignum_to_string: true)]

It might get a bit convoluted though when there are a lot of specifics. But it avoids having duplicate annotation types, and questions like: if JSON::Field is present, do you still need Serializer::Field as well?

Flexibility & API

These examples look like they’re only specific to an individual serializer implementation. So it could just be kept to that.

The important feature of a serialization framework is to standardize data types and mappings. IMHO it does not need a generic API to dispatch different serializers.
We don’t have to care how an individual serializer is invoked, just provide a basis for it to work on. That’s much more flexible than trying to fit everything into a unified API call, especially for providing custom options.

JSON.serialize(data, **options)
JSON.deserialize(string, **options)

Serde doesn’t have a unified API either. You just call serde_json::to_string and serde_json::from_str.

I think I’d rather there be separate annotations for each “option”, e.g.:

@[Serializer::SerializedName("some_key")]
@[Serializer::Expose]
property some_prop : String

vs

@[Serializer::Field(serialized_name: "some_key", expose: true)]
property some_prop : String

IMO this makes it easier to read: while there would be more annotations, the intent of each is clearer. It also would be more flexible for adding user-defined functionality, i.e. a user could define their own annotation to use, which wouldn’t conflict with other keys, and it allows greater control over which keys are valid.

A pattern I got into recently is having a class/struct that maps to an annotation, which you could then build like MyCustomAnnotation.new {{ivar.annotation(MyCustom).named_args.double_splat}} (all at compile time, in a macro). You would inherently be able to control the valid fields allowed on the annotation and their types.
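
A minimal sketch of that pattern, with hypothetical MyCustom/MyCustomAnnotation names; the struct’s constructor is what defines which annotation arguments are valid and which types they take:

annotation MyCustom
end

struct MyCustomAnnotation
  getter serialized_name : String?
  getter expose : Bool

  def initialize(@serialized_name : String? = nil, @expose : Bool = true)
  end
end

class Model
  @[MyCustom(serialized_name: "some_key")]
  property some_prop : String = ""

  # Builds the configuration objects from the annotations at compile time;
  # an unknown or mistyped named argument becomes a compile error in #initialize.
  def annotation_configs
    {% begin %}
      [
        {% for ivar in @type.instance_vars %}
          {% if ann = ivar.annotation(MyCustom) %}
            MyCustomAnnotation.new({{ann.named_args.double_splat}}),
          {% end %}
        {% end %}
      ]
    {% end %}
  end
end

p Model.new.annotation_configs # => [MyCustomAnnotation(@serialized_name="some_key", @expose=true)]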

Fair enough, I was mainly thinking about how to share the generic portion of each format, but that could easily be done via another method that is used in each format’s implementation. Possibly via a module that can be included into each format’s module. That would also allow the format to modify the context if needed before serializing.

I’ve been working on some refactoring of CrSerializer. I’d be happy to hear some thoughts on its implementation when I get it to a good enough place to share.


I’m pretty sure I’d rather choose the latter style. Declaring serialization options is a single feature; we shouldn’t have to use a bunch of different annotations for it.


The current serializer is pretty simple feature-wise, with only 5 options you can edit. The biggest downside of having everything in one annotation is that it can quickly become unwieldy, especially if additional features are added, like groups, versioning, etc.

Either way, I don’t think it would change the implementation that much. The question now is how to go about implementing something: the API we want it to have, and how it will be used and fit into the current ecosystem.

FWIW I refactored my serialization shard to take a more generic approach. There is some work left to do on the deserialization side of things, but I’m quite happy with how it came out.

https://blacksmoke16.github.io/CrSerializer/CrSerializer.html

I would like to relaunch the topic with a new approach I found to implement out-of-the-box serialization/deserialization.

Compared to my previous approach and the stdlib:

  • no monkey-patching for serialization (maybe some for deserialization)
  • out-of-the-box serialization/deserialization for any type.
require "json"

module Crystalizer::JSON
  extend self

  def serialize(object)
    String.build do |str|
      serialize str, object
    end
  end

  def serialize(io : IO, object : O) forall O
    ::JSON.build(io) do |builder|
      serialize builder, object
    end
  end

  def serialize(builder : ::JSON::Builder, object : Int32)
    object.to_json builder
  end

  def serialize(builder : ::JSON::Builder, object : String)
    object.to_json builder
  end

  def serialize(builder : ::JSON::Builder, object : O) forall O
    builder.object do
      {% for ivar in O.instance_vars %}
        builder.field {{ivar.stringify}} do
          serialize builder, object.@{{ivar}}
        end
      {% end %}
    end
  end

  def deserialize(data : String | IO, *, to type : O.class) forall O
    deserialize ::JSON::PullParser.new(data), type
  end
  
  def deserialize(pull : ::JSON::PullParser, type : O.class) forall O
    {% begin %}
    {% properties = {} of Nil => Nil %}
    {% for ivar in O.instance_vars %}
      {% ann = ivar.annotation(::Serialization) %}
      {% unless ann && ann[:ignore] %}
        {%
          properties[ivar.id] = {
            type:        ivar.type,
            key:         ((ann && ann[:key]) || ivar).id.stringify,
            has_default: ivar.has_default_value?,
            default:     ivar.default_value,
            nilable:     ivar.type.nilable?,
            root:        ann && ann[:root],
            converter:   ann && ann[:converter],
            presence:    ann && ann[:presence],
          }
        %}
      {% end %}
    {% end %}

    {% for name, value in properties %}
      %var{name} = nil
      %found{name} = false
    {% end %}

    pull.read_begin_object
    while !pull.kind.end_object?
      key = pull.read_object_key
      case key
      {% for name, value in properties %}
      when {{value[:key]}}
        raise "duplicated key: #{key}" if %found{name}
        %found{name} = true
        %var{name} = deserialize pull, {{value[:type]}}
      {% end %}
      else raise "unknown key: #{key}"
      end
    end
    # Consume the closing brace, so nested objects leave the
    # parser in a consistent state.
    pull.read_end_object

    O.new(
    {% for name, value in properties %}
      {{name}}: %var{name}.as({{value[:type]}}),
    {% end %}
    )
    {% end %}
  end

  def deserialize(pull : ::JSON::PullParser, type : String.class)
    pull.read_string
  end

  def deserialize(pull : ::JSON::PullParser, type : Int32.class)
    v = pull.int_value.to_i32
    pull.read_next
    v
  end
end


annotation Serialization
end

struct Point
  getter x : Int32
  @[Serialization(key: "YYY")]
  getter y : String

  def initialize(@x, @y)
  end
end


struct MainPoint
  getter p : Point

  def initialize(@p)
  end
end


data = %({"p": {"x": 1, "YYY": "abc"}})

point = Crystalizer::JSON.deserialize data, to: MainPoint
puts point # => MainPoint(@p=Point(@x=1, @y="abc"))
# An analogous Crystalizer::YAML module could implement the same interface:
# {Crystalizer::YAML, Crystalizer::JSON}.each do |type|
{Crystalizer::JSON}.each do |type|
  puts type.serialize point

  # YAML output (with Crystalizer::YAML):
  # ---
  # p:
  #   x: 1
  #   y: abc
  #
  # JSON output:
  # {"p":{"x":1,"y":"abc"}}
end

This POC is of course only the very base; it is not correct as is.

I would like to have others’ opinions on this. It would be a shard at first.

One point where I don’t have a golden solution: how to create an instance of a type T for deserialization?
T.new won’t necessarily take all ivars, and defining any custom method will require monkey-patching either T or Object.

Another point, annotations: we should be able to tell how to serialize/deserialize a type without monkey-patching annotations onto it. This can be done with named arguments, or with an object passed as an argument defining how to (de)serialize.
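
One possible shape for that last idea, as a sketch (TimeToUnix and the commented Crystalizer call are assumptions, not an existing API):

struct TimeToUnix
  # Converts a Time into the value the format will actually store.
  def serialize(value : Time) : Int64
    value.to_unix
  end

  # Converts the stored value back into a Time.
  def deserialize(raw : Int64) : Time
    Time.unix(raw)
  end
end

# Hypothetical call site: the converter travels with the call,
# not with the type being serialized.
# Crystalizer::JSON.serialize(event, converters: {created_at: TimeToUnix.new})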

Maybe the following won’t work for all situations, but this works fairly well for https://github.com/drhuffman12/ai4cr/blob/master/src/ai4cr/neural_network/backpropagation.cr#L93 [better than the marshalling methods that the Ruby version used]:

require "json"
class Foo
  include ::JSON::Serializable
  def initialize(@bar : ...)
  ...
end
...

# then save `Foo.new(...).to_json` to a file or db txt field

# and load via Foo.from_json(previously_saved_json)

However, it doesn’t seem to work so well if Foo has an instance [or class] variable that is defined as a union of types or as a parent type and you try assigning it values of child types.

Is there [community interest to create] a common list [specs?] for what should be serializable that these ‘Better Serializable’ libs could be run against?

The problem with this approach, IMO, is that things are still tightly coupled to JSON. If you wanted to support YAML, you would essentially have to duplicate the entire module. This is fine for things that are specific to JSON, like the (de)serialization logic, but isn’t ideal for things that aren’t going to change between formats, like the part of the main deserialize method that determines the name of the key to use, for example.

I’m also not sure blindly serializing all properties of a type is the best idea. I think it would be better to take a more explicit approach, only serializing things that opt into it. Take this for example:

# Imagine it's the base for some ORM;
# inside, it could have common internal properties,
class Base
  # such as an array of errors.
  getter errors : Array(Error) = [] of Error
end

# Now you extend this type to create a model class
class User < Base
  getter id : Int32
  getter name : String
end

# You now go to serialize a user object,
# but notice it includes things from the base class:
User.new.to_json # => {"id": 1, "name": "Jim", "errors": []}

If a goal of this POC is “out of the box serialization/deserialization for any type”: how would you go about excluding the errors without monkey-patching Base, or redefining @errors in every subclass with an annotation or something?

Not to steal the show, but I’ve also been working on a new serialization shard; an evolution of “Possible to serialize/deserialize an object like Ruby marshaling?”, you could say. Within it, I made an effort to abstract the data from the format that it should be (de)serialized into/from.

This is achieved by defining some methods when you include the module that return an array of metadata objects with information about each property, such as internal/external names, type, owning class, etc. Annotations/macros are used to filter out unwanted properties at compile time.
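
As an illustration, such a metadata object could be as small as this record (the field names are assumptions for the sketch, not the shard’s actual interface):

record PropertyMetadata,
  name : String,          # internal (ivar) name
  external_name : String, # name used in the serialized document
  type : String,          # the property's type, stringified for simplicity
  owner : String          # the class that declares the property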

The benefit of this is that the object itself doesn’t need to know what format it will be (de)serialized into/from. The format implementations just need to handle working with a common interface. New formats can be added without altering any model code, or using format-specific annotations.

The way I handled this is similar to what I did with the metadata objects: I created an abstraction around the data. Currently this is just built on top of the existing JSON/YAML parsing logic and is essentially just JSON::Any | YAML::Any. In the future a more robust abstraction could be implemented if needed, but this is sufficient for now.
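
Based on that description, the data abstraction amounts to little more than a single alias (the shard names it ASR::Any; Any here is just for the sketch):

require "json"
require "yaml"

# One alias over the existing parser outputs gives the deserialization
# logic a single data interface to work against.
alias Any = JSON::Any | YAML::Any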

Since there is now a singular interface for working with the data, I was able to have my module define a single initializer that takes a navigator used for deserialization, the metadata objects related to the type, and the data.

def initialize(navigator : ASR::Navigators::DeserializationNavigatorInterface, properties : Array(ASR::PropertyMetadataBase), data : ASR::Any)
  ...
end

Essentially similar to what JSON::Serializable does, but in a more generic way.

Not sure I follow what your point is there. I would say that’s a valid way to store the state of an object for use at a later point, yes.

As I said, there will be annotations/arguments to tell which ivars have to be serialized.

The custom initialize will need to be monkey-patched onto the object we want to deserialize. As I suspected, it can’t work out-of-the-box for any object :confused:

For the deserialize logic, it can be put in a common base module that will be included in JSON, YAML, etc.
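
A small sketch of that layout (module and method names are hypothetical):

# Format-agnostic logic lives in one shared module...
module Crystalizer::Common
  # Resolves the external key for an ivar, the same way for every format.
  def key_for(ivar_name : String, annotation_key : String?) : String
    annotation_key || ivar_name
  end
end

# ...and each format implementation pulls it in.
module Crystalizer::JSON
  extend Crystalizer::Common
end

module Crystalizer::YAML
  extend Crystalizer::Common
end

Using extend here makes the helper callable as Crystalizer::JSON.key_for(...), matching the module-method style of the POC above.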

That’s a good idea. I am also thinking of separating the actual objects from how to (de)serialize them. This is even more important because formats are not all the same, and can require different properties.

I found a way to instantiate objects without monkey-patching them (of course, that’s unsafe):

class Point
  getter x : Int32
  getter y : String

  def initialize(@x, @y)
  end
end

# Allocates the object without calling any initializer.
instance = Point.allocate

# `allocate` does not register a GC finalizer the way `new` does.
GC.add_finalizer(instance) if instance.responds_to?(:finalize)

# Write each instance variable directly through a pointer.
pointerof(instance.@x).value = 1
pointerof(instance.@y).value = "abc"

p! instance # => Point(@x=1, @y="abc")

Sadly this assumes there is an initializer argument for each ivar. I don’t think there is a way around needing to add some custom initializer for the deserialization process :/.

I released Crystalizer, which brings out-of-the-box [de]serialization for any type from/to YAML and JSON.
A common core is used, there is no monkey patching, annotations are shared, and virtually anyone can use the library interface to add support for new formats.
