Is there a way to optimize this program?

I have an old Intel-based MacBook running macOS 10.15.

% sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz

% crystal --version
Crystal 1.12.0 (2024-04-09)

LLVM: 17.0.6
Default target: x86_64-apple-macosx

% crystal build --release collatz3.cr
% ./collatz3
Seed 1000000000 Steps: 100
collatz took: 00:06:20.618209571

% crystal build --release --mcpu native collatz3.cr
% ./collatz3
Seed 1000000000 Steps: 100172
collatz took: 00:06:17.123127458

So the difference in my case is negligible: less than 1%.


Strange that the number of steps is different, though. Is that expected?

Weird, I tried building the above examples again on my laptop:

 ╰─ $ ./collatz3
Seed 1000000000 Steps: 100
collatz took: 00:02:05.057338846

--mcpu native
 ╰─ $ ./collatz3
Seed 1000000000 Steps: 100
collatz took: 00:02:06.073337643
 ╰─ $ ./collatz1
Seed 1000000000 Steps: 100
collatz took: 00:02:45.230708896

--mcpu native
 ╰─ $ ./collatz1
Seed 1000000000 Steps: 100
collatz took: 00:02:49.934104333
+ cr build --release 1.cr -o 1
+ ./1
collatz took: 00:02:23.130819476
+ echo -----------------
+ cr build --release --mcpu native 1.cr -o 2
+ ./2
collatz took: 00:03:07.363442950

In all my benchmarks the --mcpu native version is slower than the plain one; the results are completely opposite to before!

I don't know what happened; this is really weird.

Some factors to consider:

  1. When I last ran these benchmarks, my laptop was powered through a Dell monitor over USB-C (90W), but this time I connected a 120W AC adapter. Although the laptop performs much better on the latter, I ran both tests under the same conditions, so they should be comparable.

  2. When I last ran them, I used a Crystal compiler built without --mcpu native, but my current compiler was built with this option.

  3. When I last ran them, the compiler was built with LLVM 17, but with LLVM 18 this time.

EDIT: I tried recompiling my Crystal compiler without --mcpu native as before and ran the benchmark again. As you can see, there is almost no difference, so factor 2 above can be ruled out.

 ╰─ $ sh 1.sh
+ cr build --release 1.cr -o 1
+ ./1
collatz took: 00:02:22.366686933
+ echo -----------------
+ cr build --release --mcpu native 1.cr -o 2
+ ./2
collatz took: 00:03:07.554092518

And I don't think factor 1 is the reason, so the only remaining factor is the LLVM update from 17 to 18?

CPU: AMD Ryzen 7 3700X

crystal build --release collatz3.cr

Seed 1000000000 Steps: 100
collatz took: 00:02:58.970378338

crystal build --release --mcpu native collatz3.cr

Seed 1000000000 Steps: 100
collatz took: 00:02:46.273276544

On my slightly older computer I saw a speed improvement of about 7.6%.
But this is just one run, it could be different if you run it many times.

I don't know if this is the appropriate topic for this forum, but another thing I would like to know is the speedup from -D preview_mt. Are you guys using this option?

For the current CPU-bound benchmark, adding this option has no benefit; in fact, it hurts performance.

Ha! I didn't even notice it.
Ran it again and got similar timing and 100 steps. No idea why:

% ./collatz3
Seed 1000000000 Steps: 100
collatz took: 00:06:17.701735560

What you may be experiencing is a change in your runtime environment.

Make sure your system is always plugged in with the power supply.
On my system, it has 3 performance settings: lo, med, hi.
Be sure youā€™re always running in hi (performance) mode.

Run tests on a quiet system (no competing applications running).

I get repeatable consistent times running performance tests in this manner.

I also run htop in a terminal to see thread activity, and in another I run:

watch -n1 "grep \"cpu MHz\" /proc/cpuinfo"

This will show the speed at which each thread is running.

This thread started by comparing the speed of Go and Crystal. However, in real-world scenarios, parallel execution is often necessary for fast calculations.

The Collatz function's steps calculation can be easily parallelized since each seed's processing is independent. Here, I will extend jzakiya's collatz3.cr to include parallel execution.

In Crystal, we can use spawn and channel, and compile with the -D preview_mt flag to enable parallel execution. The simplest method is to place collatzResult = collatz(seed) inside a spawn block.

  until seed > 1_000_000_000
    spawn do
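      # note: the closure captures `seed` by reference, so a fiber may read a value the loop has already advanced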
      collatzResult = collatz(seed)
      if seed.divisible_by? 1_000_000
        print "Seed #{seed} Steps: #{collatzResult}\r"
      end
    end
    seed += 1
  end

However, this method creates too many fibers and might freeze your computer. Instead, we need to increase the batch size and reduce the number of fibers.

Programming with parallel execution is quite difficult for me, so I wrote this with the help of ChatGPT. Here's what I came up with:

collatz_mt.cr

def collatz(seed : UInt64)
  steps = 0_u64
  while seed > 1
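    # strip all trailing factors of 2 at once; each removed bit counts as one halving step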
    tz_count = seed.trailing_zeros_count
    steps += tz_count
    seed >>= tz_count
    if seed > 1
      steps += 1
      seed = seed * 3 + 1
    end
  end
  return steps
end

TOTAL_SEEDS = 1_000_000_000
BATCH_SIZE = 1000

num_workers = (ENV["CRYSTAL_WORKERS"]? || 4).to_i
channel = Channel(Tuple(Int32, Slice(UInt64))).new(num_workers)

def process_results(channel, num_workers)
  num_workers.times do
    batch_index, results = channel.receive
    results.each_with_index do |steps, seed_index|
      seed = BATCH_SIZE * batch_index + seed_index + 1
      if seed % 1_000_000 == 0
        print "Seed #{seed} Steps: #{steps}\r"
      end
    end
  end
end

start = Time.measure do
  active_workers = 0

  (TOTAL_SEEDS / BATCH_SIZE).to_i.times do |batch_index|
    spawn do
      results = Slice(UInt64).new(BATCH_SIZE)
      BATCH_SIZE.times do |seed_index|
        seed = (BATCH_SIZE * batch_index + seed_index + 1).to_u64
        results[seed_index] = collatz(seed)
      end
      channel.send({batch_index, results})
    end
    active_workers += 1

    if active_workers >= num_workers
      process_results(channel, num_workers)
      active_workers -= num_workers
    end
  end

  process_results(channel, active_workers)
end

puts "\ncollatz(mt) took: #{start}"
crystal build --release -D preview_mt collatz_mt.cr

For strict result order, you might need to sort the results. However, for terminal output, the results will be the same since num_workers * BATCH_SIZE is much smaller than 1_000_000. This was checked using puts instead of print.
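
If strict ordering were needed, one option (a minimal sketch, not part of the program above; process_results_ordered is a hypothetical name) is to collect one round of batches first, sort by batch index, and only then print:

def process_results_ordered(channel, num_workers)
  # gather one round of batches, then emit them in batch order
  round = Array(Tuple(Int32, Slice(UInt64))).new(num_workers) { channel.receive }
  round.sort_by! { |batch| batch[0] }
  round.each do |batch_index, results|
    results.each_with_index do |steps, seed_index|
      seed = BATCH_SIZE * batch_index + seed_index + 1
      print "Seed #{seed} Steps: #{steps}\r" if seed.divisible_by? 1_000_000
    end
  end
end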

On my Mac mini (Apple M2), the results were as follows.

collatz took: 00:02:31.422758917
collatz(mt) took: 00:00:52.956233708

There is some overhead, but that is expected.

(English is also difficult, so I communicate with you all with the help of ChatGPT and DeepL. AI is really indispensable.)


For fiber-related concerns, please check this discussion for how to use a fixed number of fibers as a pool to limit memory usage. A minimal sketch of the idea follows.
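
The basic idiom (POOL_SIZE and the seed range below are illustrative values, not code from that discussion): a fixed set of worker fibers all pull seeds from one channel, so the fiber count and memory stay bounded:

def collatz(seed : UInt64) # same algorithm as shown earlier in the thread
  steps = 0_u64
  while seed > 1
    tz_count = seed.trailing_zeros_count
    steps += tz_count
    seed >>= tz_count
    if seed > 1
      steps += 1
      seed = seed * 3 + 1
    end
  end
  steps
end

POOL_SIZE = 8
jobs = Channel(UInt64).new(1024) # buffered, so the producer rarely blocks
done = Channel(Nil).new

POOL_SIZE.times do
  spawn do
    while seed = jobs.receive? # returns nil once the channel is closed
      collatz(seed)
    end
    done.send(nil)
  end
end

(1_u64..1_000_000_u64).each { |seed| jobs.send(seed) }
jobs.close
POOL_SIZE.times { done.receive } # wait for every worker to drain and exit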

And I think Go can easily defeat the currently implemented preview_mt when multiple threads are involved; there is really no need for a comparison. Crystal can only defeat Go in single-thread mode.

Comparing with Go is important to identify optimization opportunities in Crystal. However, I'm more interested in writing tools with Crystal. Since I don't plan to use Go, I don't really care if Crystal defeats Go or not :slightly_smiling_face:


Still using Crystal 1.12.2.

I did some slight code modifications.

(TOTAL_SEEDS / BATCH_SIZE).to_i.times do

to do integer division

(TOTAL_SEEDS // BATCH_SIZE).times do

and

if seed % 1_000_000 == 0

to

if seed.divisible_by? 1_000_000

My system currently only has 16GB of RAM, and would bomb
for BATCH_SIZE = 1000, so I kept raising it until it worked.

These are single run times (done once just to see the trends).

compiled as:
crystal build --release -D preview_mt (--mcpu native) collatz3-mt1.cr

run as:
CRYSTAL_WORKERS=16 ./collatz3-mt1

  BATCH_SIZE |  non --mcpu  | --mcpu native
-------------|--------------|---------------
    20_000   | 21.462461732 | 20.64834345
-------------|--------------|---------------
    25_000   | 19.648608257 | 17.970281164
-------------|--------------|---------------
    40_000   | 17.290314075 | 16.784457625
-------------|--------------|---------------
    50_000   | 16.940723973 | 15.741673784
-------------|--------------|---------------
    80_000   | 16.105024553 | 14.965173563
-------------|--------------|---------------
   100_000   | 15.863517012 | 14.846495923
-------------|--------------|---------------
   125_000   | 15.644565438 | 14.439763090
-------------|--------------|---------------
   160_000   | 15.571596193 | 14.378076468
-------------|--------------|---------------
   200_000   | 15.429948915 | 14.351718695
-------------|--------------|---------------
   250_000   | 15.037558804 | 13.954640639
-------------|--------------|---------------
   400_000   | 14.852013307 | 13.714046571
-------------|--------------|---------------
   500_000   | 14.690805390 | 13.684973576
-------------|--------------|---------------
 1_000_000   | 14.328056481 | 13.256448490
-------------|--------------|---------------
 2_000_000   | 14.151256391 | 13.186936261
-------------|--------------|---------------
 2_500_000   | 14.078736752 | 13.105861409
-------------|--------------|---------------
 4_000_000   | 14.239786044 | 13.226919653
-------------|--------------|---------------
 5_000_000   | 14.155460105 | 13.213493322
-------------|--------------|---------------

First made this display code modification.

Changed:
print "Seed #{seed} Steps: #{steps}\r"

To:
print "Seed #{seed.format} Steps: #{steps}\r"

Now formatted output like: Seed 997,000,000 Steps: 22400

Before, using Crystal 1.12.2,

➜  ~ crystal -v
Crystal 1.12.2 [04998c0c7] (2024-05-31)

LLVM: 15.0.7
Default target: x86_64-unknown-linux-gnu

Got these results.

  BATCH_SIZE |  non --mcpu  | --mcpu native
-------------|--------------|---------------
 1_000_000   | 14.328056481 | 13.256448490
-------------|--------------|---------------
 2_000_000   | 14.151256391 | 13.186936261
-------------|--------------|---------------
 2_500_000   | 14.078736752 | 13.105861409
-------------|--------------|---------------
 4_000_000   | 14.239786044 | 13.226919653
-------------|--------------|---------------
 5_000_000   | 14.155460105 | 13.213493322
-------------|--------------|---------------

Got Crystal 1.13.1 today (2024/07/12) and ran code with it.

➜  ~ crystal -v
Crystal 1.13.1 [0cef61e51] (2024-07-12)

LLVM: 18.1.6
Default target: x86_64-unknown-linux-gnu

Got these results.

  BATCH_SIZE |  non --mcpu  | --mcpu native
-------------|--------------|---------------
 1_000_000   | 12.771790424 | 12.742333954
-------------|--------------|---------------
 2_000_000   | 12.712604724 | 12.699085654
-------------|--------------|---------------
 2_500_000   | 12.501386958 | 12.508320673
-------------|--------------|---------------
 4_000_000   | 12.698391035 | 12.699728326
-------------|--------------|---------------
 5_000_000   | 12.641033522 | 12.698599464
-------------|--------------|---------------

Observations:

  1. Around BATCH_SIZE = 2_500_000 appears optimum on my system.

  2. The performance with 1.13.1 is greatly improved, and
    compiling with --mcpu native has no real advantage anymore.

I don't know how much to attribute this to a better
concurrency implementation in Crystal, or whether it's just
that LLVM 18.1.6 does a much better job than LLVM 15.0.7.

But clearly, 1.13.1 is a lot faster running this code than 1.12.2 was.

Kudos to the Crystal Devs and community!! :partying_face:


Re my previous comment, I guess this is caused by the LLVM update, although I still haven't ruled out the USB-C vs. AC power supply factor (both tests were run under the same conditions).

Same here on an Intel-based Mac.

% sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz

Updating to Crystal 1.13.1 (which uses LLVM: 18.1.8) gives better results:

% crystal build --release collatz3.cr
% ./collatz3
Seed 1000000000 Steps: 100
collatz took: 00:06:01.066968219

% crystal build --release --mcpu native collatz3.cr
% ./collatz3
Seed 1000000000 Steps: 100
collatz took: 00:05:59.658690231

With Crystal 1.12.0 (and LLVM: 17.0.6) it was

% crystal build --release collatz3.cr
% ./collatz3
Seed 1000000000 Steps: 100
collatz took: 00:06:20.618209571

% crystal build --release --mcpu native collatz3.cr
% ./collatz3
Seed 1000000000 Steps: 100
collatz took: 00:06:17.701735560

Here's a simpler|faster parallel implementation.
As with the previous version, BATCH_SIZE must divide TOTAL_SEEDS evenly (e.g. 1_000_000_000 // 3_125_000 = 320 batches exactly).

This could actually be made even simpler|faster by performing the work of
process_results directly inside the spawn loop, but to maintain the
structure of the previous code I keep it separate here too.

(I created a test suite that ran|stored the results of the serial|parallel
versions and verified their outputs for the 1 billion seeds were the same.)

def collatz(seed : UInt64)
  steps = 0_u64
  while seed > 1
    tz_count = seed.trailing_zeros_count
    steps += tz_count
    seed >>= tz_count
    if seed > 1
      steps += 1
      seed = seed * 3 + 1
    end
  end
  steps
end

TOTAL_SEEDS = 1_000_000_000
BATCH_SIZE  = 3_125_000
batches = TOTAL_SEEDS // BATCH_SIZE

puts "BATCH_SIZE = #{BATCH_SIZE.format}"
done = Channel(Nil).new(batches)

def process_results(batch_index, results)
  results.each_with_index do |steps, seed_index|
    seed = BATCH_SIZE * batch_index + seed_index + 1
    if seed.divisible_by? 1_000_000
      print "Seed #{seed.format} Steps: #{steps}\r"
    end
  end
end

start = Time.measure do
  batches.times do |batch_index|
    spawn do
      results = Slice(UInt64).new(BATCH_SIZE)
      BATCH_SIZE.times do |seed_index|
        seed = (BATCH_SIZE * batch_index + seed_index + 1).to_u64
        results[seed_index] = collatz(seed)
      end
      process_results(batch_index, results)
      done.send(nil)
    end
  end
  batches.times { done.receive }
end

puts "\ncollatz(mt) took: #{start}"

As shown, the times are faster than the previous version, and much closer.
Because of that, I used the best time out of 5 runs.

  BATCH_SIZE |  non --mcpu  | --mcpu native
-------------|--------------|---------------
 1_000_000   | 11.983658699 | 12.016212832
-------------|--------------|---------------
 2_000_000   | 11.965050076 | 11.962311806
-------------|--------------|---------------
 2_500_000   | 11.897171669 | 11.829114604
-------------|--------------|---------------
 3_125_000   | 11.864008350 | 11.769375493
-------------|--------------|---------------
 4_000_000   | 11.988934988 | 11.958665643
-------------|--------------|---------------
 5_000_000   | 12.006046679 | 11.975759521
-------------|--------------|---------------

Thus on my system, the original Go code took 181.47s (3m1.47s),
and the fastest with Crystal 1.12.2 was 109.75s (1m49.75s).

Now with Crystal 1.13.1, this parallel version consistently does under 12s!


I see. This is very fast, and it seems to be very efficient based on the CPU usage shown in htop.

I was concerned that increasing the channel buffer would increase memory usage, but htop shows that this is not the case; in fact, memory usage is significantly reduced.

If you want to increase computation speed through parallelization in Crystal language, it seems like a good idea to increase the buffer size of the channels and reduce the number of context switches (if that is the correct term). Thank you for sharing your experiment and results.
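
For reference, a Crystal Channel is unbuffered by default; passing a capacity to Channel.new makes it buffered (a minimal illustration, not from the code above):

unbuffered = Channel(Int32).new      # send blocks until a receiver is ready
buffered   = Channel(Int32).new(128) # send only blocks once 128 items are queued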


You can make it use much less memory by creating only as many fibers as there are threads, sending the batches through a Channel({Int64, Int64}), and having each fiber loop to process the next batch. It may even be a bit faster, especially if the channel buffer size is the number of batches.

Notice that by creating as many fibers as there are batches, you get faster times with larger batches… because you end up creating fewer fibers. With only a fixed set of fibers, it would be the opposite: small batches (e.g. 1000) are faster than larger ones.

require "wait_group"

def collatz(seed : Int64)
  steps = 0_i64

  while seed > 1
    while seed % 2 == 0
      steps &+= 1
      seed //= 2
    end

    if seed > 1
      steps &+= 1
      seed = seed &* 3 &+ 1
    end
  end

  steps
end

n = ENV.fetch("N", "1_000_000_000").to_i64(underscore: true)
b = ENV.fetch("B", "1_000").to_i64(underscore: true)
w = ENV.fetch("CRYSTAL_WORKERS", "4").to_i

channel = Channel({Int64, Int64}).new((n // b + 1).to_i32)
wg = WaitGroup.new(w)

w.times do |i|
  spawn(name: "WORKER-#{i}") do
    while r = channel.receive?
      (r[0]...r[1]).each do |seed|
        steps = collatz(seed)

        if seed % 1_000_000 == 0
          print "Seed: #{seed} Steps: #{steps}\r"
        end
      end
    end
    wg.done
  end
end

start = Time.measure do
  s = n // b
  r0 = 0_i64

  s.times do
    r1 = r0 &+ b
    channel.send({r0, r1})
    r0 = r1
  end

  if n - b &* s > 0
    channel.send({r0, n})
  end

  channel.close
  wg.wait
end

puts "\ncollatz took: #{start}"

As for timings, on my old i7 1472HQ, the Go example takes 6m30s, the direct Crystal port 5m30s and the above 1m30s (8 threads).


Here I want to conceptually explain my parallel implementation so that people reading this thread (now|future) will understand how to use these techniques.

First, threads|cores are what your hardware supplies; fibers are the lightweight software abstraction that Crystal schedules onto those threads.

The biggest difference between my parallel implementation and the previous one is the use of Channels to pass only information, not data.

You should ideally prevent|minimize intra thread communication so that each thread is not dependent on information|data from another.

For this application, there's no need for information|data sharing between threads, so Channels don't need to be used for that purpose.

Here, each thread does arithmetic processing and output display, and only needs to signal to the outside that it's finished, by sending a message.

Thus for this task, we do in parallel the math to process|display each batch of seeds run through the collatz algorithm.

So for BATCH_SIZE = 2_000_000 there are 500 batches of 2_000_000 seeds that need processing.

Thus spawn hands each batch to its own fiber, and the runtime schedules those fibers onto threads until all 500 are done.

So done = Channel(Nil).new(batches) creates a channel to receive 500 done messages, one for each batch.

All we need is a way to indicate the spawned fibers are all done.

Thus we use a Channel to pass a message that each fiber has finished: done.send(nil)

Outside the spawning process we sit and wait for 500 done messages to know we're finished: batches.times { done.receive }

That's it. Simple!! :partying_face:
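
In skeleton form, the pattern described above (using the names from the code earlier in this post) is just:

done = Channel(Nil).new(batches) # capacity for one message per batch
batches.times do |batch_index|
  spawn do
    # ... process and display batch batch_index ...
    done.send(nil) # signal this batch is finished
  end
end
batches.times { done.receive } # block until every batch has reported in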

Spawning Fibers
The spawning process allocates each batch's fiber to a worker thread in round-robin fashion. If you set CRYSTAL_WORKERS to the max number of hardware threads you get maximum performance, and max system memory use. You can also set a pool of threads for separate tasks, but don't make their total greater than your hardware thread count.

Memory Use
Total system memory use here is a function of the results|BATCH_SIZE size and the number of threads used. A smaller BATCH_SIZE means smaller results Slices, which means lower memory use per thread (and vice versa). And fewer used threads means lower system memory use, and vice versa. For example, each results Slice holds BATCH_SIZE UInt64 values, so BATCH_SIZE = 2_500_000 costs about 2_500_000 × 8 bytes ≈ 20MB per batch in flight. Thus if system memory use is a concern, pick a smaller BATCH_SIZE and/or allocate fewer threads.

This covers most of the conceptual basics for Crystal parallel coding design, at least for this simple task.

I would encourage you to understand the conceptual flow of your problem as much as possible first before coding.

The simpler you can make your problem, the simpler you can implement it in code, and the more likely it will be correct.

Extra: A Wish
I really enjoyed this thread, and personally learned new stuff. I think extracting the flow and information of this thread into a blog post, or even better a video, would be an excellent tool to teach people how to program for performance using Crystal.

This kind of tutorial information would go a long way to show people coming from other languages the benefits|utility|simplicity of using Crystal.


Just curious: you state it shares information but not data; how are you defining information and data in this context?