Js and java are faster in my test, how can that be?

madex · December 29, 2019, 3:54pm

Solve difficult sudoku as a performance test for different programming languages

The work on this started in 2013 and I´ve done it in a lot of programming languages since then.

The concept is a recursive algorithm that need as little memory as possible.
It manipulates the field directly. The rules are made of arrays, so this solution is quite minimal.

See

The tests ard made on macOs on macbook pro 15 from early 2013.

lang	time	exe-size
c (clang)	7 ms	9 KB
rust	6 ms	320 KB
cystal	32 ms	420 KB
js Chorme	24 ms
js Firefox	21 ms
java	24 ms
lua	1,7 s
python3.7	2.7 s

module Sudoku
  VERSION           = "0.1.0"
  RULE_ROWS         = (0..8).cycle.first(9*9).to_a
  RULE_COLUMS       = (0..80).map { |x| x // 9 }.to_a
  RULE_BASIC_BLOCKS =
    (0..80).map { |x| (x % 9 // 3) + ((x//27) * 3) % 9 }.to_a
  RULES = [RULE_ROWS, RULE_COLUMS, RULE_BASIC_BLOCKS]

  def self.print_board(board : Array(Int32))
    puts ""
    board.each_with_index do |x, i|
      print " #{x.zero? ? "_" : x}"
      print " " if i %  3 ==  2
      puts ""   if i %  9 ==  8
      puts ""   if i % 27 == 26
    end
  end

  private def self.set_field?(index : Int32, board : Array(Int32), set_mask : Array(Int32))
    field_no = index
    loop do
      return true if field_no >= 81
      break if board[field_no].zero?
      field_no += 1
    end
    (1..9).each do |i|
      if ((1 << i) & set_mask[field_no]).zero?
        board[field_no] = i
        new_set_mask = set_mask.clone
        if RULES.all? { |rule| valid?(board, rule, new_set_mask) }
          if set_field?(field_no + 1, board, new_set_mask)
            return true
          end
        end
      end
    end
    board[field_no] = 0
    false
  end

  private def self.valid?(board, rule, set_mask)
    flags = Array.new(9, 0)
    board.zip(rule).each do |(f, r)|
      next if f.zero?
      mask = 1 << f
      return false if !(flags[r] & mask).zero?
      flags[r] |= mask
    end
    rule.each_with_index do |r, i|
      set_mask[i] |= flags[r] # mark numbers that are taken for this rule
    end
    true
  end

  def self.solve?(board : Array(Int32))
    set_mask = Array.new(81, 0)
    set_field?(0, board, set_mask)
   end

  def self.main
    field = [
      1, 0, 0,  0, 0, 7,  0, 9, 0,
      0, 3, 0,  0, 2, 0,  0, 0, 8,
      0, 0, 9,  6, 0, 0,  5, 0, 0,

      0, 0, 5,  3, 0, 0,  9, 0, 0,
      0, 1, 0,  0, 8, 0,  0, 0, 2,
      6, 0, 0,  0, 0, 4,  0, 0, 0,

      3, 0, 0,  0, 0, 0,  0, 1, 0, 
      0, 4, 0,  0, 0, 0,  0, 0, 7,
      0, 0, 7,  0, 0, 0,  3, 0, 0
    ]

    print_board(field)
    start_time = Time.local
    result = solve?(field)
    stop_time = Time.local
    puts "Millisecounds: #{(stop_time - start_time).nanoseconds() / 1000_000.0}"
    if result
      print_board(field)
    else
      puts "no solition found"
    end
  end

  main
end

girng_github · December 29, 2019, 4:36pm

inb4 the code is crystal’d by lead devs and roffle stomps that chart

straight-shoota · December 29, 2019, 4:43pm

Java and JavaScript have some very sophisticated virtual machines which can lead to pretty good performance with JIT compilation and other optimizations. It obviously depends on the specifics of the use case. And generally, Crystal should usually outperform them.

First of all, did you compile with --release flag? Otherwise the code won’t be optimized.

In .valid? there is an unnecessary heap allocation with board.zip(rule).each do |(f, r)|. This is probably just a minor issue, but this seems to be frequently executed. You should use board.zip(rule) do |f, r| instead to avoid instantiating an iterator.

straight-shoota · December 29, 2019, 4:45pm

Also, for measuring elapsed time, please use Time.monotonic (or the convenient Time.measure helper). This won’t improve performance but make sure that the measurements are taken correctly.

madex · December 29, 2019, 6:05pm

Thank you.

Of course I run it with --release.
Remove the .each as you describe reduces the execution time to 15 ms!!!. but rust is twice as fast in this case.

crystal is now faster than js and java and this fits my expectations. Are there Arrays with fixed size on the stack?

If someone has other tips the improvements, please.

asterite · December 29, 2019, 6:15pm

I don’t understand why, but with these changes the code runs in 0.14ms:

# TODO: Write documentation for `Sudoku`
module Sudoku
  VERSION   = "0.1.0"
  RULE_ROWS = StaticArray[
    0, 0, 0, 0, 0, 0, 0, 0, 0,
    1, 1, 1, 1, 1, 1, 1, 1, 1,
    2, 2, 2, 2, 2, 2, 2, 2, 2,
    3, 3, 3, 3, 3, 3, 3, 3, 3,
    4, 4, 4, 4, 4, 4, 4, 4, 4,
    5, 5, 5, 5, 5, 5, 5, 5, 5,
    6, 6, 6, 6, 6, 6, 6, 6, 6,
    7, 7, 7, 7, 7, 7, 7, 7, 7,
    8, 8, 8, 8, 8, 8, 8, 8, 8,
  ]
  RULE_COLUMS       = (0..80).map { |x| x // 9 }.to_a
  RULE_BASIC_BLOCKS =
    (0..80).map { |x| (x % 9 // 3) + ((x//27) * 3) % 9 }.to_a
  RULES = [RULE_ROWS, RULE_COLUMS, RULE_BASIC_BLOCKS]

  def self.print_board(board : Array(Int32))
    puts ""
    board.each_with_index do |x, i|
      print " #{x.zero? ? "_" : x}"
      print " " if i % 3 == 2
      puts "" if i % 9 == 8
      puts "" if i % 27 == 26
    end
  end

  private def self.set_field?(index : Int32, board : Array(Int32), set_mask : Array(Int32))
    field_no = index
    loop do
      return true if field_no >= 81
      break if board[field_no].zero?
      field_no += 1
    end
    (1..9).each do |i|
      if ((1 << i) & set_mask[field_no]).zero?
        board[field_no] = i
        new_set_mask = set_mask.dup
        if RULES.all? { |rule| valid?(board, rule, new_set_mask) }
          if set_field?(field_no + 1, board, new_set_mask)
            return true
          end
        end
      end
    end
    board[field_no] = 0
    false
  end

  private def self.valid?(board, rule, set_mask)
    flags = StaticArray(Int32, 9).new(0)
    board.zip(rule) do |f, r|
      next if f.zero?
      mask = 1 << f
      return false if !(flags[r] & mask).zero?
      flags[r] |= mask
    end
    rule.each_with_index do |r, i|
      set_mask[i] |= flags[r] # mark numbers that are taken for this rule
    end
    true
  end

  def self.solve?(board : Array(Int32))
    set_mask = Array.new(81, 0)
    set_field?(0, board, set_mask)
  end

  def self.main
    field = [
      1, 0, 0, 0, 0, 7, 0, 9, 0,
      0, 3, 0, 0, 2, 0, 0, 0, 8,
      0, 0, 9, 6, 0, 0, 5, 0, 0,

      0, 0, 5, 3, 0, 0, 9, 0, 0,
      0, 1, 0, 0, 8, 0, 0, 0, 2,
      6, 0, 0, 0, 0, 4, 0, 0, 0,

      3, 0, 0, 0, 0, 0, 0, 1, 0,
      0, 4, 0, 0, 0, 0, 0, 0, 7,
      0, 0, 7, 0, 0, 0, 3, 0, 0,
    ]

    print_board(field)
    start_time = Time.monotonic
    result = solve?(field)
    stop_time = Time.monotonic
    puts "Millisecounds: #{(stop_time - start_time).nanoseconds / 1000_000.0}"
    if result
      print_board(field)
    else
      puts "no solition found"
    end
  end

  main
end

Moving more things to static array makes it worse (like 12~15ms).

My advice is, if you are going to benchmark two languages, it’s better to write the exact same code. In Crystal you initialize the arrays doing some array manipulation, mapping, etc., while in Java you use static arrays with constants (no need to compute anything), plus I’m almost sure Java will optimize that away to static (immutable) arrays.

asterite · December 29, 2019, 6:21pm

You are also using byte in Java and Int32 in Crystal.

The benchmarks aren’t comparing the same code at all. But, I think once you change zip to avoid creating an intermediate array the difference is 9ms Java vs 15ms Crystal and I think that’s pretty acceptable, given that the Crystal version is objectively worse (coded with less performance in mind) than Java.

asterite · December 29, 2019, 6:29pm

Actually, the code I provided gives a different (I think wrong) result. In any case, what I said about different codes being benchmarked still holds.

There was a talk recently, I can’t remember where, in StrangeLoop or similar, about how little changes in code layout can change drastically how code behaves. So I wouldn’t rely on these micro-benchmarks for anything.

madex · December 29, 2019, 10:02pm

Thank you !! I really enjoy programming in crystal.

I don´t like the java code, this might be better, with iterators. And this is done years ago from a friend.

I was originally compared to rust or other compiled languages like c. The crystal implementation is really similar to the rust.

I switched to UInt8 and the UInt16 for the StaticArray() and now I got down to 9.5 ms.
So everything is fine. crystal has a GC, and a heap objects so I expect rust to be a little bit faster.

Oh if you got no solution you might have done something wrong, and the time is quite short.

Eureka · December 30, 2019, 5:56am

When you do intend to benchmark with Go programming too?

rogerdpack · December 30, 2019, 6:18am

You can profile it: https://github.com/crystal-lang/crystal-book/blob/master/guides/performance.md and try and see where the bottlenecks are…

ksec · January 6, 2020, 9:36am

“Performance Matters” by Emery Berger

I think it is this one? Amazing talk on performance, it just open my eyes how little I knew about computing.

asterite · January 7, 2020, 3:43am

Indeed, that talk is crazy! After watching it I don’t trust microbenchmarks anymore.

vlazar · January 7, 2020, 7:40am

Thank you for sharing this mind-blowing talk!
Everyone who is interested in performance go watch this, it’s worth every second of your time.

straight-shoota · January 7, 2020, 12:29pm

Can we make coz work with Crystal?

vlazar · January 7, 2020, 1:58pm

It looks like there is only a C and Java versions of coz profiler currently https://github.com/plasma-umass/coz

If Crystal can emit C code via LLVM and mark corresponding sections in C code with COZ_PROGRESS_NAMED, COZ_BEGIN/COZ_END macros https://github.com/plasma-umass/coz then probably it can be usable.

RX14 · January 7, 2020, 4:58pm

The coz.h which is actually part of the executable is definitely able to be rewritten in crystal using the C bindings.

There’s no need to compile to C at all.

RX14 · January 7, 2020, 11:03pm

Wrote it :)

Unfortunately it’s very limited by the incredibly inaccurate debug info which crystal emits.

ComputerMage · January 7, 2020, 11:41pm

@RX14 I am still working on it ;)
I am trying to fix line locations first after that will implement union debug types. nullable types are already working in my branch.

madex · January 10, 2020, 11:10pm

This website is interesting for performance testing of languages
https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/nbody-cpu.html
It could be interesting to implement these programs in crystal too and get listed.

But some of them are really heavy optimized and uses, for instance, SIMD Cpu instructions directly.
There are also Ruby solutions.

Topic		Replies	Views
There is a way to optimize this program? Help & Support	61	1140	November 22, 2024
WoW! I finally found a way to show the differentiation in speed between JSON::Any and static types	20	1455	April 15, 2019
How to make this faster? Help & Support	10	312	January 23, 2024
An update on Crystal performance News	7	640	December 23, 2022
Array.rotate! can be faster Learning Resources	4	816	November 25, 2019

Js and java are faster in my test, how can that be?

Solve difficult sudoku as a performance test for different programming languages

“Performance Matters” by Emery Berger

Related topics