There is a way to optimize this program?

I have an old Intel based MacBook, MacOS 10.15, LLVM

% sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz

% crystal --version
Crystal 1.12.0 (2024-04-09)

LLVM: 17.0.6
Default target: x86_64-apple-macosx

% crystal build --release collatz3.cr
% ./collatz3
Seed 1000000000 Steps: 100
collatz took: 00:06:20.618209571

% crystal build --release --mcpu native collatz3.cr
% ./collatz3
Seed 1000000000 Steps: 100172
collatz took: 00:06:17.123127458

So the difference in my case is negligible, less than 1%

1 Like

Strange that the number of steps are different though. Is that expected?

Wired, i tried build above examples again on my laptop

 ╰─ $ ./collatz3
Seed 1000000000 Steps: 100
collatz took: 00:02:05.057338846

--mcpu native
 ╰─ $ ./collatz3
Seed 1000000000 Steps: 100
collatz took: 00:02:06.073337643
 ╰─ $ ./collatz1
Seed 1000000000 Steps: 100
collatz took: 00:02:45.230708896

--mcpu native
 ╰─ $ ./collatz1
Seed 1000000000 Steps: 100
collatz took: 00:02:49.934104333
+ cr build --release 1.cr -o 1
+ ./1
collatz took: 00:02:23.130819476
+ echo -----------------
+ cr build --release --mcpu native 1.cr -o 2
+ ./2
collatz took: 00:03:07.363442950

All my benchmark, --mcpu native version is slower than another, the results completely opposite to before!

I don’t know what happened, this really wired.

some factors concerned:

  1. When i last ran this benchmarks, my laptop is plugged by a Dell monitor, TypeC, 90W, but, i connect a 120W AC adapter this time, although latter has much better performance, but, i consider run both with same condition is should comparable.

2. When i last ran, I use Crystal compiler which built without --mcpu native, but my current compiler built with this option.

  1. when i last ran, compiler built with llvm 17, but, llvm 18 this time.

EDIT: I tried recompile my crystal compiler without --mcpu native as before, and run benchmark again, as you can see, almost no difference, so above factor 2 can be ignored.

 ╰─ $ sh 1.sh
+ cr build --release 1.cr -o 1
+ ./1
collatz took: 00:02:22.366686933
+ echo -----------------
+ cr build --release --mcpu native 1.cr -o 2
+ ./2
collatz took: 00:03:07.554092518

And i don’t think the factor 1 is the reason, so, the only valid factor is, llvm update from 17 to 18?

CPU: AMD Ryzen 7 3700X

crystal build --release collatz3.cr

Seed 1000000000 Steps: 100
collatz took: 00:02:58.970378338

crystal build --release --mcpu native collatz3.cr

Seed 1000000000 Steps: 100
collatz took: 00:02:46.273276544

On my slightly older computer I saw a speed improvement of about 7.6%.
But this is just one run, it could be different if you run it many times.

I don’t know if this is the appropriate topic for this forum, but another thing I would like to know is the speedup by -D preview_mt. Are you guys using this option?

For current CPU bounded benchmark, adding this option has no any benefits, in fact, it hurt it.

Ha! I didn’t even notice it.
Run that again and got similar timing and 100 steps. No idea why:

% ./collatz3
Seed 1000000000 Steps: 100
collatz took: 00:06:17.701735560
1 Like

What you may be experiencing is a change in your runtime environment.

Make sure your system is always plugged in with the power supply.
On my system, it has 3 performance settings: lo, med, hi.
Be sure you’re always running in hi (performance) mode.

Run tests on a quiet system (no competing applications running).

I get repeatable consistent times running performance tests in this manner.

I also run htop in a terminal to see threads activity, and in another I run:

watch -n1 "grep \"cpu MHz\" /proc/cpuinfo"

This will show the speed at which each thread is running.

This thread started by comparing the speed of Go and Crystal. However, in real-world scenarios, parallel execution is often necessary for fast calculations.

The Collatz function’s steps calculation can be easily parallelized since each seed’s processing is independent. Here, I will extend jzakiya’s collatz3.cr to include parallel execution.

In Crystal, we can use spawn and channel, and compile with the -D preview_mt flag to enable parallel execution. The simplest method is to place collatzResult = collatz(seed) inside a spawn block.

  until seed > 1_000_000_000
    spawn do
      collatzResult = collatz(seed)
      if seed.divisible_by? 1_000_000
        print "Seed #{seed} Steps: #{collatzResult}\r"
      end
    end
    seed += 1
  end

However, this method creates too many fibers and might freeze your computer. Instead, we need to increase the batch size and reduce the number of fibers.

Programming with parallel execution is quite difficult for me, so I wrote this with the help of ChatGPT. Here’s what I came up with:

collatz_mt.cr

def collatz(seed : UInt64)
  steps = 0_u64
  while seed > 1
    tz_count = seed.trailing_zeros_count
    steps += tz_count
    seed >>= tz_count
    if seed > 1
      steps += 1
      seed = seed * 3 + 1
    end
  end
  return steps
end

TOTAL_SEEDS = 1_000_000_000
BATCH_SIZE = 1000

num_workers = (ENV["CRYSTAL_WORKERS"]? || 4).to_i
channel = Channel(Tuple(Int32, Slice(UInt64))).new(num_workers)

def process_results(channel, num_workers)
  num_workers.times do
    batch_index, results = channel.receive
    results.each_with_index do |steps, seed_index|
      seed = BATCH_SIZE * batch_index + seed_index + 1
      if seed % 1_000_000 == 0
        print "Seed #{seed} Steps: #{steps}\r"
      end
    end
  end
end

start = Time.measure do
  active_workers = 0

  (TOTAL_SEEDS / BATCH_SIZE).to_i.times do |batch_index|
    spawn do
      results = Slice(UInt64).new(BATCH_SIZE)
      BATCH_SIZE.times do |seed_index|
        seed = (BATCH_SIZE * batch_index + seed_index + 1).to_u64
        results[seed_index] = collatz(seed)
      end
      channel.send({batch_index, results})
    end
    active_workers += 1

    if active_workers >= num_workers
      process_results(channel, num_workers)
      active_workers -= num_workers
    end
  end

  process_results(channel, active_workers)
end

puts "\ncollatz(mt) took: #{start}"
crystal build --release -Dpreview_mt collatz3_mt.cr

For strict result order, you might need to sort the results. However, for terminal output, the results will be the same since num_workers * BATCH_SIZE is much smaller than 1_000_000. This was checked using puts instead of print.

On my Mac mini (Apple M2), the results were as follows.

collatz took: 00:02:31.422758917
collatz(mt) took: 00:00:52.956233708

There is some overhead, but that is expected.

(English is also difficult, so I communicate with you all with the help of ChatGPT and DeepL. AI is really indispensable.)

1 Like

For fiber involved concerns, please check this discusstion for how use fixed number fiber as a pool to limit the memory usage.

And, i consider go can easily defeat the currently implemented preview_mt if multi-thread involved, there is even no need for comparison anyway, Crystal can only defeat go in single thread mode.

Comparing with Go is important to identify optimization opportunities in Crystal. However, I’m more interested in writing tools with Crystal. Since I don’t plan to use Go, I don’t really care if Crystal defeat Go or not :slightly_smiling_face:

1 Like

Still using Crystal 1.12.2.

I did some slight code modifications.

(TOTAL_SEEDS / BATCH_SIZE).to_i.times do

to do integer division

(TOTAL_SEEDS // BATCH_SIZE).times do

and

if seed % 1_000_000 == 0

to

if seed.divisible_by? 1_000_000

My system currently only has 16GB of ram, and would bomb
for BATCH_SIZE = 1000, so I kept raising it til it worked.

These are single run times (done once just to see the trends).

compiled as:
crystal build --release -D preview_mt (--mcpu native) collatz3-mt1.cr

run as:
CRYSTAL_WORKERS=16 ./collatz3-mt1

  BATCH_SIZE |  non --mpcu  | --mcpu native
-------------|--------------|---------------
    20_000   | 21.462461732 | 20.64834345
-------------|--------------|---------------
    25_000   | 19.648608257 | 17.970281164
-------------|--------------|---------------
    40_000   | 17.290314075 | 16.784457625
-------------|--------------|---------------
    50_000   | 16.940723973 | 15.741673784
-------------|--------------|---------------
    80_000   | 16.105024553 | 14.965173563
-------------|--------------|---------------
   100_000   | 15.863517012 | 14.846495923
-------------|--------------|---------------
   125_000   | 15.644565438 | 14.439763090
-------------|--------------|---------------
   160_000   | 15.571596193 | 14.378076468
-------------|--------------|---------------
   200_000   | 15.429948915 | 14.351718695
-------------|--------------|---------------
   250_000   | 15.037558804 | 13.954640639
-------------|--------------|---------------
   400_000   | 14.852013307 | 13.714046571
-------------|--------------|---------------
   500_000   | 14.690805390 | 13.684973576
-------------|--------------|---------------
 1_000_000   | 14.328056481 | 13.256448490
-------------|--------------|---------------
 2_000_000   | 14.151256391 | 13.186936261
-------------|--------------|---------------
 2_500_000   | 14.078736752 | 13.105861409
-------------|--------------|---------------
 4_000_000   | 14.239786044 | 13.226919653
-------------|--------------|---------------
 5_000_000   | 14.155460105 | 13.213493322
-------------|--------------|---------------
2 Likes

First made this display code modification.

Changed:
print "Seed #{seed} Steps: #{steps}\r"

To:
print "Seed #{seed.format} Steps: #{steps}\r"

Now formatted output like: Seed 997,000,000 Steps: 22400

Before using Crystal 1.12.2

➜  ~ crystal -v
Crystal 1.12.2 [04998c0c7] (2024-05-31)

LLVM: 15.0.7
Default target: x86_64-unknown-linux-gnu

Got these results.

  BATCH_SIZE |  non --mpcu  | --mcpu native
-------------|--------------|---------------
 1_000_000   | 14.328056481 | 13.256448490
-------------|--------------|---------------
 2_000_000   | 14.151256391 | 13.186936261
-------------|--------------|---------------
 2_500_000   | 14.078736752 | 13.105861409
-------------|--------------|---------------
 4_000_000   | 14.239786044 | 13.226919653
-------------|--------------|---------------
 5_000_000   | 14.155460105 | 13.213493322
-------------|--------------|---------------

Got Crystal 1.13.1 today (2024/07/12) and ran code with it.

  ~ crystal -v
Crystal 1.13.1 [0cef61e51] (2024-07-12)

LLVM: 18.1.6
Default target: x86_64-unknown-linux-gnu

Get these results.

  BATCH_SIZE |  non --mpcu  | --mcpu native
-------------|--------------|---------------
 1_000_000   | 12.771790424 | 12.742333954
-------------|--------------|---------------
 2_000_000   | 12.712604724 | 12.699085654
-------------|--------------|---------------
 2_500_000   | 12.501386958 | 12.508320673
-------------|--------------|---------------
 4_000_000   | 12.698391035 | 12.699728326
-------------|--------------|---------------
 5_000_000   | 12.641033522 | 12.698599464
-------------|--------------|---------------

Observations:

  1. Around BATCH_SIZE = 2_500_000 appears optimum on my system.

  2. The performance with 1.13.1 is greatly improved, and
    compiling with --mcpu native has no real advantage anymore.

I don’t know how much to attribute this to a better
concurrency implementation in Crystal, or is it just
that LLVM 18.1.6 does a much better job than LLVM 15.0.7.

But clearly, 1.13.1 is a lot faster running this code than 1.12.2 was.

Kudos to the Crystal Devs and community!! :partying_face:

2 Likes

Ref my previous comment, i guess this caused by the update of llvm. although, i still not ruled out the factors of TypeC power supply and AC power. (both test under same situation)

Same here on Intel base Mac.

% sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz

Updating to Crystal 1.13.1 (which uses LLVM: 18.1.8) gives better results:

% crystal build --release collatz3.cr
% ./collatz3
Seed 1000000000 Steps: 100
collatz took: 00:06:01.066968219

% crystal build --release --mcpu native collatz3.cr
% ./collatz3
Seed 1000000000 Steps: 100
collatz took: 00:05:59.658690231

With Crystal 1.12.0 (and LLVM: 17.0.6) it was

% crystal build --release collatz3.cr
% ./collatz3
Seed 1000000000 Steps: 100
collatz took: 00:06:20.618209571

% crystal build --release --mcpu native collatz3.cr
% ./collatz3
Seed 1000000000 Steps: 100
collatz took: 00:06:17.701735560