More on Symbols

I am fairly new to Crystal, having been a Ruby programmer for many years. One of the main reasons I looked at Crystal is that it is so Ruby-like, which means I can port quite a bit of existing Ruby code relatively easily. So I am concerned when there is serious talk about making Crystal less Ruby-like, and that includes talk about removing Symbols.

Ruby programmers are encouraged to use Symbols as hash keys rather than Strings, not only for performance but also for program readability. These port over to Crystal only in some cases: they fail when the keys come from outside the program and need to be converted from Strings. As others have pointed out, Crystal also has Enums, which can be used in a very similar way to how Symbols currently work, except that you have to define them (which can be a good thing). Partly because of this, some people on this forum have suggested that Symbols be removed completely from Crystal. I think that would be a bad idea.
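To make the conversion problem concrete, here is a minimal sketch (the Kind enum is just an invented example): in Ruby I would call .to_sym on the incoming String, whereas the Enum route requires defining the members up front:

enum Kind
  Fruit
  Vegetable
end

counts = Hash(Kind, Int32).new(0)

input = "fruit" # e.g. read from a file or an HTTP parameter
# Enum.parse plays the role Ruby's String#to_sym would have played,
# but only for members defined above; unknown names raise.
counts[Kind.parse(input)] += 1
counts[Kind::Fruit] # => 1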

What I would much rather see is that Symbols be implemented more like they are in Ruby. That includes implementing String#to_sym. Crystal has StringPools which seem to work in much the same way, but are far uglier to use. Although the performance of Symbols would then probably not be much different than using Strings, there would be better Ruby compatibility. People who want ultimate performance can use Enums. Perhaps StringPools could then be removed instead?

2 Likes

I think this is fundamentally a semantic issue. Ruby goes out of its way to make a distinction between strings as opaque data and symbols as specifically names. Symbols are the native class for method and ivar identifiers, as well as a central element that makes the key: value notation work so well.

Well, Crystal doesn’t really have run-time reflection, so the first point is moot, and key: value is reused as NamedTuple, so there you go.

I still think some kind of “strings that are names” with a distinct syntax would probably be a good idea, and Symbols are already there, but after a while I’ve become somewhat less hot on this topic, since I’ve stopped using free-form hashes as a primary data structure and moved to JSON.mapping classes. I mean, the compiler gives you names at compile time, why not use them?
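For illustration, the kind of thing I mean (JSON::Serializable is the modern spelling of JSON.mapping; the Point class is just an example):

require "json"

# The compiler knows the field names; no free-form Symbol or String
# keys are needed at runtime.
class Point
  include JSON::Serializable

  getter x : Int32
  getter y : Int32
end

point = Point.from_json(%({"x": 1, "y": 2}))
point.x # => 1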

2 Likes

Hi @stronny, the problem for me is that some new “strings that are names” mechanism would just create another construct similar to Symbols and Enums. And since the compiler already gives you names at compile time in the form of Enums, it makes sense to me to change Symbols to be more Ruby-like without adding new syntax. I accept that the {key: value, ...} notation is already used by NamedTuples - my main issue is that there is lots of Ruby code that gets fields from the outside world and immediately maps them to Symbols (including code I wrote!), and I see no good reason why this shouldn’t be possible in Crystal.

For fun, I am actually trying to have a look at the compiler in my spare time to see what scope there is for this. My thinking is that perhaps Symbols could be compiled almost exactly the way they are now, but add a mapping from their name as a string to the symbol value (assuming that it doesn’t exist at the moment). Then if a program uses String#to_sym we could at that point create a hash table of all the symbols, allowing new Symbols to be created dynamically and subsequent String#to_sym calls to work efficiently. That means that if String to Symbol mapping is never used, there is no performance penalty. Perhaps there is a hole in this - I’m not sure yet!

Symbols are definitely a really nice feature. But they work better in a dynamic language like Ruby. In Crystal there is essentially no real benefit over using either enums (for a fixed set of values, which also gives type safety) or strings (for dynamic values).

In Crystal, symbols are implemented completely statically. They’re essentially converted to integers by the compiler and thus need to be fully known at compile time. You can’t add symbols dynamically by converting a string as in Ruby. Without that, you could only map strings to known symbols, which isn’t a desirable solution because it only goes halfway.
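To illustrate:

x = :foo   # compiled to a bare integer; only a name table for #to_s remains
x.to_s     # => "foo"

# There is no String#to_sym; a Symbol cannot be created from a runtime value:
# "foo".to_sym # Error: undefined method 'to_sym' for String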

In order to be able to do that, the symbol implementation would need to be completely changed, making symbols essentially equivalent to strings. That’s going to impact performance and reduce the advantage of symbols; using a hash table won’t be able to mitigate this. All in all, this would make symbols more complex and less useful.

The major use case for symbols right now is for named tuple keys. If we trade away efficiency for flexibility, this use case vanishes, giving even more reason to remove symbols entirely from the language.

Please name a good reason why this should be possible instead of turning the argument around?

In the way this would be implemented, there would be no essential difference from simply using strings directly. It would just be a very similar alternative without any major benefits. And it makes the language harder to use, because people need to decide between the two, which can be very cumbersome.

In Ruby this is already known to be a problem, which resulted in “solutions” like HashWithIndifferentAccess. Without symbols, there wouldn’t be such problems in the first place.

Matz even tried to remove them from Ruby, but that would have broken too much code. In Crystal we still can.

I know it seems hard at first. But when you think about it a bit longer, you’ll notice that you can do very well without symbols. I can’t remember having used a single symbol in Crystal for anything other than named tuples and autocasted enum values.

3 Likes

To be fair, this

And it makes the language harder to use, because people need to decide between the two, which can be very cumbersome.

is not a problem if you add autocasting from symbols to strings. There is also no static String in Crystal, which again will lead to difficulties when trying to optimize for allocations.

Or you may go the other way and just change the syntax so that :text will also be a String. Most people don’t care about optimizations imo, they just like the syntax.

1 Like

@straight-shoota wrote:

Please name a good reason why this should be possible instead of turning the argument around?

I think you only have to read my first post in this topic to answer that. There are way more Ruby programmers than Crystal programmers, and way more Ruby code than Crystal code in existence, and making it harder to port just makes Ruby programmers less likely to use Crystal. I think I am a reasonable example of this. I started off playing with Crystal by trying to compile existing Ruby code. I hit several roadblocks on the way, some of which I accept as reasonable, and some of which I do not. Being persistent, and because I really like the concept of Crystal, I thought it would be better to voice my concerns & suggestions here rather than going away and dropping Crystal quietly.

<soapbox>
This discussion comes down to why you believe Crystal should exist. If it is merely a language “inspired by Ruby” which will go off in its own direction, I think it will most likely wither and die like most of the multitude of languages out there today. However, if it implements as much of Ruby as is feasible, then its chances of success are much higher, as it will keep on attracting Rubyists.

I accept that certain features in Ruby are not feasible in Crystal. I also accept that Ruby has evolved and has left in a certain amount of junk which could and perhaps should be removed, as has been done in Crystal. I even accept that Crystal has corrected Ruby’s English grammar (include? → includes?) :smiley:. However, I think one of Crystal’s goals should be to maintain good Ruby compatibility, certainly in commonly used features (such as Symbols and, dare I say it, also Strings, which I would argue should be made mutable, but that’s another topic), whether or not that means it is a “perfect” language.
</soapbox>

I understand that Crystal compiles Symbols to numbers. But I think you have misunderstood my suggestion, which would not impact the performance of Symbols and does not turn them into strings. They are left as-is, with the compiler adding a run-time accessible list of symbol names & their values if it is not already available. That is all. The other changes would be at run-time, to add support for to_sym. Only if to_sym is actually used would a hash table of Symbols be built. When a new Symbol is created, a new number would be allocated for it, different from all the numbers assigned to existing Symbols, including those assigned by the compiler.

I agree with @mselig; in principle it seems possible to add the extra flexibility without sacrificing the existing runtime performance. Perhaps just a small compilation overhead would be needed.

The basic idea is that the compiler would build a String->Int32 mapping, with keys being the symbol names it has seen in the code and the values would be the Int32s they map to.

I suppose that would be a rather small overhead, if any, because the compiler already needs to somehow keep track of the symbol names it has seen to ensure that the same name always maps to the same number.

The only method that would ever make use of this mapping would be String#intern (i.e. String#to_sym). It would do the lookup into the aforementioned mapping and add a new element if the symbol name has not been seen yet.

This way only the users of String#intern would pay the runtime price of the added flexibility, and since String#intern does not exist yet, no existing code would be affected.
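In pseudo-stdlib terms, the runtime half could look roughly like this (a sketch with invented names; PREDEFINED stands for the table the compiler would emit, and Int32 stands in for the real Symbol type):

# Hypothetical: the compiler emits the names it saw and their values.
PREDEFINED = {"foo" => 0, "bar" => 1}

# Runtime copy that String#intern can extend.
TABLE = PREDEFINED.dup

class String
  def intern : Int32
    # Known name: return its compile-time number.
    # Unknown name: assign the next free number, past everything
    # the compiler allocated.
    TABLE[self] ||= TABLE.size
  end
end

"foo".intern # => 0
"baz".intern # => 2 (newly allocated)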

The difference from what @mselig wrote above is that the “symbol table” would always be built, independently of the String#intern usage, because

  1. I believe this table must already be available in some form to ensure that the symbol->Int32 mapping is unique, and
  2. if it is not already available, one would not like to re-scan the entire code if String#intern is encountered in some deep require.

@straight-shoota: do you see any fundamental problems with this idea?

There’s the problem that the symbol table could grow indefinitely, especially as a result of users passing malicious data, and crash your program. This affected Ruby in the past. Now they GC the symbols. If we go that route it will only get more and more complex.

As one of the original designers of the language I see no point in having symbols anymore. I’d like to remove them. Enums serve that purpose very well.

It doesn’t matter if people who come from Ruby have to do things in a different way. People are already migrating to languages far more different from Ruby than Crystal is.

Hi @noc, yes, what you are saying is what I was suggesting. Though I am a bit confused about where the method String#intern came from and what it is; wouldn’t it be better to be compatible with Ruby and call it to_sym?

Actually, it may be possible that this could be implemented with no compiler changes at all, if there is a way of enumerating the symbol table at runtime and determining which of the entries are actually Symbols (in the Crystal & Ruby class meaning). Currently I do not know the compiler well enough to know if this is possible or whether a special table needs to be built.

Other people have commented that a program could simply use Strings in the first place (which requires minimal editing of existing Ruby code) without conversion to a Symbol. However, there is a performance penalty: each time a string is used as a hash table key it has to be hashed, and testing equality of such Strings is slower than testing equality of Symbols. Using a StringPool is another option to address these performance issues, but its usage as a constant is ugly (pool.get("str") vs :str) - you’d want to create a macro, I’d imagine. Anyhow, I think my suggestion for Symbols is clean and provides much better Ruby compatibility.
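For example, the macro might look something like this (a hypothetical sketch; sym and POOL are invented names):

require "string_pool"

POOL = StringPool.new

# Approximates the :str literal syntax on top of a StringPool.
macro sym(name)
  POOL.get({{ name.id.stringify }})
end

a = sym(str)
b = sym(str)
a.same?(b) # => true; both are the same pooled String instance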

Hi @mselig, IIRC String#to_sym is just an alias for String#intern in Ruby.

As for performance, I did some quick ad-hoc “benchmarking”: I populated a hash with 2 million entries and then looked up a particular key repeatedly. The results are:

EDIT: forget it, wasn’t compiling crystal with --release, and with --release, the relevant code just gets evaluated at compile time.

EDIT 2: here the new “benchmark”, the CPU time used seems to scale with the number of loop iterations (here 10 million):

$ ./hash_speed_vs_key_type
Int32 populate: N=10000000 5.282001s == 1893221.9058648418/s 
Int32 lookup: N=10000000 0.162289s == 61618470.752792865/s 
String populate: N=10000000 10.443109s == 957569.244944202/s 
String lookup: N=10000000 0.391429s == 25547417.283849686/s   

So it seems that string keys entail a ca. 2.5x performance penalty, which is quite a lot.
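For anyone who wants to try it, the measurement had roughly this shape (a sketch, not the exact code used above):

N = 10_000_000

def bench(label : String, n : Int32)
  elapsed = Time.measure { n.times { |i| yield i } }.total_seconds
  puts "#{label}: N=#{n} #{elapsed}s == #{n / elapsed}/s"
end

int_hash = Hash(Int32, Int32).new
str_hash = Hash(String, Int32).new
keys = Array.new(N, &.to_s) # pre-built keys, so String allocation isn't timed

bench("Int32 populate", N) { |i| int_hash[i] = i }
bench("Int32 lookup", N) { |i| int_hash[i]? }
bench("String populate", N) { |i| str_hash[keys[i]] = i }
bench("String lookup", N) { |i| str_hash[keys[i]]? }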
Enums, on the other hand, are quite impractical, if not impossible, to use if you don’t know beforehand what keys your hash might contain.

In addition to that, symbols are syntactically sooo much nicer than strings. Even if they are abandoned as a separate type, I hope the syntax will remain valid and we will just get strings instead of symbols out of it.

1 Like

@asterite: OK, I get your point: between Enums and Strings, there is little place left for Symbols as a separate type.
But would that mean that the nice symbol syntax would also be gone? I.e., would
h = {a: 1, b: 2} then necessarily become h = {"a" => 1, "b" => 2}, or would the symbol syntax remain valid, but just produce (say) strings instead of symbols?

Ruby isn’t the only language with Symbols; Scala has them too. Being a bytecode-compiled language, Scala may have more in common with Crystal — I don’t know the semantics of symbols in Scala.

@mselig I take your desire to make or keep Crystal more like Ruby to heart. I’ve been a Ruby dev for a decade and I love so much about it. But Crystal is a very different language from Ruby. Java is a very different language from C, though they also share a lot of syntax.

Crystal’s goal as a syntax is very similar to Ruby’s: to make a language very high-level and easy to use. Crystal also has another core tenet: speed. I’m here for the speed. High-level syntax is a major perk, and the Ruby-ish stdlib is too.

When you make a transition from one language to another, it takes time to adjust your thinking to become idiomatic. It is easy to write what is essentially idiomatic Java in Ruby syntax – I’ve seen many new-to-Ruby coworkers do it. When you first come to Crystal, it’s easy to just write Ruby and see that it works and is fast, but I submit that it is actually not Crystal you’re writing.

As I learned more and more Crystal, I relaxed my dependence on Symbols. They’re unnecessary at compile time, which is where I was using them most, building a DSL of some sort (just leave off the : and there’s no need to make a string). You may be using symbols in places you shouldn’t, even in Ruby. For me, String#to_sym is strictly bad practice in Ruby, but that comes from the past vulnerabilities.

JSON in a compiled language is a very different beast than in a dynamic one. Mappings are smooth, but extra keys and missing keys suddenly become typing problems. It’s easy to see why the previous attempt at APIs landed with XML and DTDs.

I guess my tldr is, I like symbols. And I don’t use them very often. While I would hate to see them go away, I don’t think it’ll really change what idiomatic Crystal becomes.

I’d like to expand a bit on why I like symbols. My previous post comes off as more anti-symbol than I am.

Symbols and Enums are both answers to magic-number programming. The benefit is that you don’t have to remember which number corresponds to which state; you have names instead. High Level Programming.

Enums are SO much more verbose. MyModule::EnumName::Value vs :value

Enums imply ownership where it is unnecessary. Why should ModuleA::State::Success and ModuleB::State::Success be different? As a library user, it is far more readable to type :success in both cases. It’s more high level.

Symbols are not harder to convert to actual numbers than Enums; Enums just have a DSL. If you assign explicit numbers to your Enum (for maintainability), it’s just as verbose as writing a symbol mapping method, just not as pretty, because no DSL has been provided for that yet.
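Concretely, the comparison I have in mind (both versions hypothetical):

# With an enum and explicit numbers:
enum State
  Success = 0
  Failure = 1
end

State::Success.value # => 0

# The hand-rolled symbol equivalent, about as verbose:
def state_to_i(state : Symbol) : Int32
  case state
  when :success then 0
  when :failure then 1
  else raise "unknown state: #{state}"
  end
end

state_to_i(:success) # => 0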

1 Like

FWIW if you have a method argument, ivar, etc. typed as MyModule::EnumName, then you can give it :value and it’s autocasted to that enum member. The key difference is that if you messed up and gave it :valu, it wouldn’t compile; but if you had it typed as Symbol, it would, and you’d just have introduced a bug.

enum Test
  One
  Two
  Three
end
 
def test(member : Test) : String
  "You chose #{member}"
end
 
pp test :two      # => "You chose Two"
pp test Test::One # => "You chose One"

https://play.crystal-lang.org/#/r/bjpu

Because in the context of a library there may be two different states representing different things. Just because they share a Success member doesn’t mean they’re redundant. As such, modifying one shouldn’t affect the other. Of course that doesn’t mean they can’t just be consolidated into a single enum and shared if they do in fact represent the same thing.

As mentioned previously, this point is moot, as the :success symbol would be autocasted to the enum member (assuming it had that member) anyway. IMO Symbols should be removed from the language in their current form and kept solely for enum autocasting. This way you get the benefits of an enum with the ease of use of a symbol.

EDIT:

Technically these are two separate types. The first is a NamedTuple, which shouldn’t really be used to store data, as it is meant to represent named arguments; you’d be better off with a record (struct). The latter is a Hash, which could have Symbol keys, i.e. h = {:a => 1, :b => 2}. But for the reasons mentioned previously, there’s not really a benefit of that over using String keys, and especially not for using NamedTuple as a means to store data.

5 Likes

The lexer uses symbols to identify tokens:

def next_token
  # ...
  case current_char
  when '='
    case next_char
    when '='
      case next_char
      when '='
        next_char :"==="
      else
        @token.type = :"=="
      end
    when '>'
      next_char :"=>"
    when '~'
      next_char :"=~"
    else
      @token.type = :"="
    end
  # ...
  end
end

I think these read better than enum constants like EQ_EQ_EQ or EQ_EQ_TILDE. A lexer benefits from this because the collection of symbols truly ranges over all possible strings, whereas constant names necessarily have a smaller alphabet.

For NamedTuple you wouldn’t use it over a Hash for the same reasons you wouldn’t use a Tuple over an Array, but NamedTuple's special subtyping rules give rise to some uses that are difficult to mimic without it (e.g. disjoint unions).
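A sketch of what I mean by disjoint unions (Circle, Rect, and area are invented for the example):

alias Circle = NamedTuple(radius: Float64)
alias Rect = NamedTuple(width: Float64, height: Float64)

# The members of the union are distinguished by their key sets,
# checked at compile time; an exhaustive case needs no tag field.
def area(shape : Circle | Rect) : Float64
  case shape
  in Circle
    Math::PI * shape[:radius] ** 2
  in Rect
    shape[:width] * shape[:height]
  end
end

area({radius: 1.0})             # => 3.14159...
area({width: 2.0, height: 3.0}) # => 6.0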

Apart from these I don’t think there are any significant use cases for symbols in Crystal.

1 Like

In the past I wanted to change those symbols to enums and I faced the same “issue.” Well, it’s not an issue, you just have to think of names, be consistent and do it.

I still think enums are generally better than symbols and that, if we can get rid of symbols, it would be nice.

We could use a custom lookup table for tokens like this:

SYMBOLS = {} of _ => _ # only read and mutated at macro expansion time

enum Token
  NONE
  
  def to_s
    {% begin %}
      case to_i
      {% for key, value in SYMBOLS %}
      when {{ value }}
        {{ key }}
      {% end %}
      end
    {% end %}
  end
end

macro tok(name)
  Token.new({{ SYMBOLS[name] || (SYMBOLS[name] = SYMBOLS.size + 1) }})
end

def foo(token : Token)
  case token
  when tok("==")
    puts "Found =="
  else
      puts "Unknown token: #{token}"
  end
end

foo(tok("=="))
foo(tok("+"))

In case anyone wants to try, all that is needed to implement dynamic Symbols is the ability to query the number of predefined Symbols:

# src/compiler/crystal/codegen/primitives.cr
class Crystal::CodeGenVisitor
  def codegen_primitive(call, node, target_def, call_args)
    @call_location = call.try &.name_location

    @last = case node.name
            # ...
            when "symbol_predefined_count"
              int(@symbols.size)
            else
              raise "BUG: unhandled primitive in codegen: #{node.name}"
            end

    @call_location = nil
  end
end
# src/primitives.cr
struct Symbol
  @[Primitive(:symbol_predefined_count)]
  def self.predefined_count : Int32
  end

  # renamed from `#to_s`
  @[Primitive(:symbol_to_s)]
  protected def to_s_primitive : String
  end
end
# src/symbol.cr
require "string_pool"

struct Symbol
  @@strings = StringPool.new(Math.pw2ceil(predefined_count))
  @@s_to_sym = Hash(String, Symbol).new(initial_capacity: predefined_count).compare_by_identity
  @@i_to_s = Array(String).new(predefined_count)

  private def self.add_symbol(str : String) : self
    i = @@i_to_s.size
    @@i_to_s << str
    @@s_to_sym[str] = i.unsafe_as(Symbol)
  end

  private def self.init_string_pool
    predefined_count.times do |i|
      add_symbol(@@strings.get(i.unsafe_as(Symbol).to_s_primitive))
    end
  end

  init_string_pool

  def self.new(str : String)
    str = @@strings.get(str)
    @@s_to_sym.fetch(str) { add_symbol(str) }
  end

  def to_s : String
    @@i_to_s[to_i]
  end

  def self.each(& : Symbol ->)
    @@i_to_s.size.times do |i|
      yield i.unsafe_as(Symbol)
    end
  end

  def self.all_symbols : Array(Symbol)
    Array.new(@@i_to_s.size, &.unsafe_as(Symbol))
  end

  def to_sym : self
    self
  end
end

class String
  def to_sym : Symbol
    Symbol.new(self)
  end
end

Symbol.all_symbols # => [:sequentially_consistent, :xchg, :skip, :none, :unchecked, :add, :active, :done, :to_s, :file]

a = "xchg".to_sym
b = "xchg".to_sym
a.to_i     # => 1
b.to_i     # => 1
:xchg.to_i # => 1

a = String.build(&.<< "foo").to_sym
b = String.build(&.<<("f").<< "foo").to_sym
a.to_i # => 10
b.to_i # => 10

Symbol.all_symbols # => [:sequentially_consistent, :xchg, :skip, :none, :unchecked, :add, :active, :done, :to_s, :file, :foo]

The counterargument is that if Symbols are used in this capacity, then one can simply use Strings and the StringPool directly, without pulling in all the predefined Symbol constants, thereby avoiding global state, and the memory usage / performance will be the same. One key difference, however, is that the entire StringPool can be garbage-collected even if its elements cannot be removed individually. In fact I did this in a custom JSON (de)serializer to reduce the generated document’s size (with a hard limit on the number of distinct strings).
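The gist of that trick, heavily simplified (KeyInterner and the cap are invented for illustration, not the actual (de)serializer):

require "string_pool"

# Deduplicate repeated strings during (de)serialization, with a hard
# cap so malicious input cannot grow the pool indefinitely.
class KeyInterner
  MAX_DISTINCT = 1024

  @pool = StringPool.new

  def intern(key : String) : String
    return key if @pool.size >= MAX_DISTINCT
    @pool.get(key)
  end
end

interner = KeyInterner.new
a = interner.intern("id")
b = interner.intern("id")
a.same?(b) # => true; duplicates share one allocation

# Unlike a global Symbol table, dropping `interner` lets the GC
# reclaim the entire pool at once.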

We tackled the operator token issue recently, and we are now pretty close to removing all Symbol-typed variables in the standard library and the compiler; what remains are API changes. These are #11775, #11020, and the following:

# src/time/location.cr

class Time::Location
  struct Zone
    # Prints `#offset` to *io* in the format `+HH:mm:ss`.
    # When *with_colon* is `false`, the format is `+HHmmss`.
    #
    # When *with_seconds* is `false`, seconds are omitted; when `:auto`, seconds
    # are omitted if `0`.
    def format(io : IO, with_colon = true, with_seconds = :auto)
      # ...
    end

    # Returns the `#offset` formatted as `+HH:mm:ss`.
    # When *with_colon* is `false`, the format is `+HHmmss`.
    #
    # When *with_seconds* is `false`, seconds are omitted; when `:auto`, seconds
    # are omitted if `0`.
    def format(with_colon = true, with_seconds = :auto)
      String.build do |io|
        format(io, with_colon: with_colon, with_seconds: with_seconds)
      end
    end
  end
end

with_seconds is unrestricted, but the method body expects it to be one of true, false, or :auto. If we deprecate these as well, then symbols can only appear in compile-time contexts, e.g. as arguments to #responds_to?; I think those are fine, because such symbols emphasize their compile-time aspect.
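For reference, one conceivable enum-based replacement (a sketch, not an actual proposal; SecondsFormat and format_offset are invented names):

enum SecondsFormat
  Never
  Always
  Auto # omit seconds when they are 0
end

def format_offset(io : IO, with_seconds : SecondsFormat = SecondsFormat::Auto)
  # ... formatting elided ...
end

# Call sites keep the symbol ergonomics via enum autocasting,
# and a typo like :aut becomes a compile-time error:
format_offset(STDOUT, with_seconds: :auto)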

4 Likes

I miss symbols a lot because I also used them as a distinct type from strings in Ruby: a method would react differently depending on whether it received a number, a symbol, or a string. I used symbols especially in situations like user ids, match codes, etc., whenever it was important that those would be unique and one couldn’t stand for different objects. This can be done manually of course, but with symbols I got it completely for free, without any further action on my side. And sometimes really just as a second type of string. Imo they also make code far more readable, while also drawing the focus to a different kind of thing: a symbol is primarily for the dev, while strings are usually for the user of an application (and need translations if multiple languages have to be supported).

Imo they bring a lot more benefits than just a performance boost for Ruby. And while Crystal might not need them for the performance boost, I think that Crystal loses a lot by not having them (available at runtime) for those additional benefits.

If Crystal doesn’t want to support them the way people are used to them in Ruby, I can fully accept that, but in this case I would suggest getting rid of them completely as a core feature, so people who really want them can have their own Symbol class (named Symbol). It would be great if it were somehow possible to give those the ability to continue using the :foo syntax.

But tbh I don’t see a reason why they shouldn’t be offered just the way people are used to them. Those who don’t know symbols probably won’t use them anyway, and those who miss them would appreciate them a lot. I just don’t see how they would hurt anyone if they were fully supported.

1 Like