Is there a way to optimize this program?

@jzakiya I just tried your version on my 10-year-old laptop with 4c/8t.

It’s performing quite well, but requires lots of memory: with a batch size of 1_000_000 it takes ~126MB of RES memory and ~9GB of VIRT memory, which isn’t really allocated but still counted somehow, because with a lower batch size the GC panics with out-of-memory (OOM) errors. It finishes in 1m07s.

I tweaked my version above to allocate a full buffer for all batches in the channel (so fibers will never block). It only needs 3MB of RES memory and 0.7GB VIRT memory. It finishes in 1m06s.

I reduced the batch size to 1000, it then needed 18MB of RES memory (larger channel buffer), VIRT memory didn’t change, and it finished in… 1m05s. One second less :man_shrugging:

takeaways:

  • starting more CPU-bound fibers than there are available hardware threads only hogs more memory: it won’t be faster (maybe even slower, since starting lots of fibers takes time);
  • channels are fast, but you must pay attention to the buffer size: too small and the threads will have to wait (:zzz:), but too large and you will queue more jobs than can actually be processed (here, for a benchmark, we don’t care). A small sizing sketch follows this list.
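
For reference, the buffer size is just the argument to Channel.new. A tiny sketch with made-up numbers (the tuple type is the one used in the code discussed below):

workers  = 8           # e.g. number of hardware threads
capacity = workers * 2 # made-up sizing; tune for the workload
channel  = Channel(Tuple(Int32, Slice(UInt64))).new(capacity) # senders only block once `capacity` items are queued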

tips:

  • Crystal 1.13 introduced WaitGroup to replace Channel(Nil) (a minimal sketch of the replacement follows this list).
  • You don’t need to collect each result, you only need 1 every 1_000_000 (but still have to save 'em all).
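
Not benchmark code, just a minimal sketch of what that replacement looks like:

require "wait_group"

def do_work
  sleep 10.milliseconds # stand-in for real work
end

# Before: signaling completion with a Channel(Nil)
done = Channel(Nil).new
4.times { spawn { do_work; done.send(nil) } }
4.times { done.receive }

# Since Crystal 1.13: the same idea with WaitGroup
wg = WaitGroup.new(4)
4.times { spawn { do_work; wg.done } }
wg.wait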

In the original parallel version process_results is defined as:

def process_results(channel, num_workers)

where channel is defined as:

channel = Channel(Tuple(Int32, Slice(UInt64))).new(num_workers)

So the channel passes data used for arithmetic processing, and a data Slice.

The design is also dependent on the number of CRYSTAL_WORKERS.

In my design process_results is defined as:

def process_results(batch_index, results)

All it needs is the batch_index for the current thread, and its results Slice.

No knowledge of the number of threads is needed; spawn manages that. You can give it the Tuple(Int32, Slice(UInt64)) data directly.

Here data means information that needs to be arithmetically processed.

In my design a channel is only used to pass a done message.
It isn’t used for arithmetic processing, but only for batches counting.

And as you see, the code becomes much simpler, faster, and memory efficient.

Thus: pass data directly for arithmetic processing, and use messages for event processing.
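
To make that shape concrete, here is a stripped-down, self-contained sketch of the design (placeholder numbers and simplified stand-in methods, not the actual benchmark code):

BATCH_SIZE = 1_000
batches    = 1_000 # 1_000_000 seeds total; placeholder numbers

# Stand-ins so the sketch compiles on its own; the real versions appear elsewhere in this thread.
def collatz(seed : UInt64) : UInt64
  steps = 0_u64
  while seed > 1
    seed = seed.even? ? seed // 2 : seed * 3 + 1
    steps += 1
  end
  steps
end

def process_results(batch_index, results)
  puts "Seed: #{BATCH_SIZE * (batch_index + 1)} Steps: #{results.last}" if (batch_index + 1) % 100 == 0
end

done = Channel(Nil).new(batches) # carries only "this batch is finished" messages

batches.times do |batch_index|
  spawn do
    results = Slice(UInt64).new(BATCH_SIZE)
    BATCH_SIZE.times do |seed_index|
      seed = (BATCH_SIZE * batch_index + seed_index + 1).to_u64
      results[seed_index] = collatz(seed)
    end
    process_results(batch_index, results) # data handed over directly, not through a channel
    done.send(nil)                        # message: an event, not data
  end
end

batches.times { done.receive } # count finished batches, not results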


I wasn’t aware of this.

Can you provide links to documentation that explain its purpose and how to use it?

When I looked at the release notes for Crystal 1.13.1, I noticed the WaitGroup. I thought it might be related to this thread, but I wasn’t sure how to use it.

However, @ysbaddaden, the author of WaitGroup, posted about how to use it. I understand that WaitGroup is a library for performing operations like @fiber.enqueue and Fiber.suspend at the appropriate timing for Fibers.

In the implementations by @kojix2 and @jzakiya, each fiber is assigned a predetermined job and retrieves the result from the channel.

However, in @ysbaddaden’s implementation, a certain number of fibers exist from start to finish and receive jobs sequentially through the channel. This implementation seems closer to “The Producer Consumer Pattern.”

Image Source: The Producer Consumer Pattern
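
In code, the pattern boils down to something like this (a toy sketch, not the measured code):

require "wait_group"

jobs = Channel(Int32).new(32)
wg   = WaitGroup.new(4)

# A fixed pool of consumer fibers pulls jobs from one channel until it is closed.
4.times do
  spawn do
    while job = jobs.receive? # returns nil once the channel is closed and drained
      # ... process job ...
    end
    wg.done
  end
end

# A single producer enqueues the jobs, then closes the channel.
100.times { |i| jobs.send(i) }
jobs.close
wg.wait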

When measuring @jzakiya’s implementation and @ysbaddaden’s implementation with /usr/bin/time, the results were as follows:

Metric                   jzakiya                  ysbaddaden / WaitGroup
User Mode Time           189.07 seconds           229.26 seconds
Kernel Mode Time         0.36 seconds             0.02 seconds
Total Elapsed Time       47.46 seconds            57.33 seconds
CPU Usage                399%                     399%
Max Resident Set Size    200832 KB (~200.8 MB)    19840 KB (~19.8 MB)
Input Operations         0                        0
Output Operations        0                        0
Major Page Faults        0                        0
Minor Page Faults        50145                    4488
Swap Operations          0                        0

This shows that jzakiya’s implementation is faster, but the memory usage in ysbaddaden’s implementation (WaitGroup) is reduced to 1/10. It seems that by reducing the number of fibers created, memory usage can be greatly reduced.

Here are my thoughts:

  1. When you want to retrieve all calculation results, would it be better to create another channel for result retrieval?

Image Source: Fork–join model - Wikipedia

  2. Is the practice of keeping the number of fibers low appropriate even when -Dpreview_mt is not enabled? (I want the program to work efficiently even when multithreading is not enabled.)

(This text was written with the help of ChatGPT)

WaitGroup is documented in the API.

Yes, it implements a producer/consumer pattern (Channel is an MPMC queue); here there’s only one producer, but there could be many producers without any changes.

  1. Reporting could use another Channel, either one by one (lots of contention), or send back an Array (no contention, but requires allocating memory). The advantage is that results can be processed as they come (a rough sketch follows below).

    In the context of a benchmark, each fiber could write its result directly into a global array (but that’s ugly) or pass it through the initial channel. The disadvantage is that processing is postponed until everything’s done.

  2. It depends on the case. Here the algorithm is CPU bound, with little to no synchronization that could give an opportunity to switch fibers while we wait on something. Limiting the number of fibers keeps resource usage in check.

    Another case with lots of IO or Channel interaction (lots of time spent waiting) will benefit from having many more fibers.

So: it all depends on the problem at hand (and what it implies for the future).
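
For the first option, a rough sketch of the “send back an Array” variant (made-up names and a stand-in computation):

# A second channel reports one Array per finished batch, so the main fiber
# can process results as they arrive.
jobs    = Channel(Range(Int32, Int32)).new(64)
results = Channel(Array(Int32)).new(64)

4.times do
  spawn do
    while range = jobs.receive?
      results.send(range.map { |n| n * n }) # stand-in computation
    end
  end
end

(0...10).each { |i| jobs.send(i * 100...(i + 1) * 100) }
jobs.close

10.times { puts results.receive.sum } # consume batches as they complete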


First, thanks @kojix2 for the information on WaitGroups.

(Compiled with --mcpu native, consistently faster by about 0.2s.)

@ysbaddaden’s code on my system consistently gets: 12.1..12.2 secs.

But it didn’t use the faster collatz version code below.

def collatz(seed : UInt64)
  steps = 0_u64
  while seed > 1
    tz_count = seed.trailing_zeros_count
    steps += tz_count
    seed >>= tz_count
    if seed > 1
      steps += 1
      seed = seed * 3 + 1
    end
  end
  steps
end

So when using it I consistently get: 11.7..11.8 secs

However, I then used the &*|&+ operators to optimize it more, shown below.

def collatz(seed : UInt64)
  steps = 0_u64
  while seed > 1
    tz_count = seed.trailing_zeros_count
    steps &+= tz_count
    seed  >>= tz_count
    if seed > 1
      steps &+= 1
      seed = seed &* 3 &+ 1
    end
  end
  steps
end

I used this optimal collatz version in my code too, and here are the times:

@ysbaddaden: 11.7..11.8s --> 5.94..5.95s

@jzakiya:    11.8..11.9s --> 6.1..6.2s

This is an insane ~50% speedup by just using the &*|&+ operators!

And my serial (single core) version is now: ~110s --> ~72s

I then used WaitGroups in my version, with no significant time differences.

require "wait_group" # WaitGroup ships with Crystal >= 1.13

wg = WaitGroup.new(batches)

start = Time.measure do
  batches.times do |batch_index|
    spawn do
      results = Slice(UInt64).new(BATCH_SIZE)
      BATCH_SIZE.times do |seed_index|
        seed = (BATCH_SIZE * batch_index + seed_index + 1).to_u64
        results[seed_index] = collatz(seed)
      end
      process_results(batch_index, results)
      wg.done
    end
  end
  wg.wait
end

However, the two parallel versions are not apples-to-apples comparisons.

@ysbaddaden’s version is more faithful to the Go version, because it doesn’t store the seeds step values in a Slice.

So the difference in mem use has little to do with the fibers; it’s because his version doesn’t use Slices to store|pass results.

Using htop shows, my version uses a peak of ~650MB (BATCH_SIZE = 3_125_000), and his is ~20MB, out of 16GB.

I’ll play around to see if I can do a faster version that doesn’t store|pass the seeds step values.

But let me emphasize again, the information in this forum thread needs to be extracted into some kind of tutorial blog|video.

Until I started to play around with the code, and learned the techniques shown herein, I would have never thought you could turn Go code that took 3m1s into fairly simple Crystal code that takes 6s.

It would be interesting to see what Rust, C|C++, et al, can do.


There are some new ideas that are interesting. But right now, I am very confused about something basic. It is about operators like &* and &+. What are these? At first, I thought they were for bitwise operations, but do they just work the same as * and + ? Why do they make things faster? Are there any disadvantages or side effects to using these operators? Why doesn’t the compiler optimize this automatically?

Because of:

Because the wrapping operator does not perform the overflow checks, the overhead of that can be avoided entirely, at the cost of possibly producing the “incorrect” result. E.g. adding 1 to Int64::MAX in most cases would be considered a bug, hence why Crystal raises by default and leaves the wrapping behavior opt-in via the & prefix methods.
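
A minimal illustration of the difference:

x = Int64::MAX
p x &+ 1 # => -9223372036854775808 (wraps around to Int64::MIN, no check)
p x + 1  # raises OverflowError at runtime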

Keep in mind this is comparing multi-threaded code with single-threaded code. I’m sure Go could be re-written to also be MT, in which case the difference in timing may be less dramatic.


Thank you, Blacksmoke16. I understand now. I joined the thread later, so I missed your important comment at the beginning. I also think the current Crystal behavior of raising an error on overflow is good. I was surprised that allowing overflow for UInt64 can make calculations so much faster. But it seems wise to avoid this unless you really need high performance. (Yes, I make a lot of mistakes.)

Hopefully this can help understanding why this works.

It’s generally faster to do all the math at the CPU’s native word size, no matter how small the values. This helps the compiler, because all the CPU registers|mem addrs are 64-bit based (on 64-bit systems). So in general, don’t mix math types if you can help it, or make the first (left) value in a math expression be the largest type.

The collatz method is performed for every seed value, i.e. 1 billion times here. Each loop iteration does 3 adds and 1 multiply, so that’s at least 3 billion adds and 1 billion multiplications total.

UInt64 range is 0--18,446,744,073,709,551,615

Thus +|* would then perform at least 4 billion overflow checks against UInt64::MAX.
But &+|&* eliminate them, and we see the significant time difference, which is over 5 seconds (parallel version) on my system.

We can safely do this here because no seed|step value comes anywhere close to this MAX (or even UInt32::MAX|Int32::MAX).

The moral of the story is, the better you know your problem’s data profile the better you can optimally code it.

In fact, back in the discussion when the &(**|*|+) operators were created, there was some sentiment to make unchecked arithmetic the default behavior (for speed), and have users specify in their code when to do overflow checking. But the current behavior won out, to ensure runtime overflow error notification.


I am addicted to ChatGPT, so I loaded this entire thread into ChatGPT and created a table summarizing optimization methods. (To export a Discourse thread to text, see here)

Optimization Method          Description
Type Unification             Changed UInt32 to Int64 to avoid type conversions and improve efficiency.
Operator Changes             Used //= for correct floor division.
Wrapping Operators           Applied &+, &* to avoid overflow checks, speeding up arithmetic operations.
Compilation Flags            Used --release --mcpu native to optimize binaries for the specific CPU.
Parallel Execution           Employed spawn and Channel, with -Dpreview_mt for multi-threading, to utilize multiple CPU cores.
Batch Processing             Processed tasks in batches to reduce overhead and enhance memory usage.
Algorithm Optimization       Used trailing_zeros_count for faster even-number division in the Collatz function.
Avoiding Unnecessary IO      Reduced puts calls in loops to minimize IO overhead.
Reduced Memory Usage         Used Slices instead of Arrays and avoided unnecessary allocations.
Fixed Number of Fibers       Limited fibers to the number of CPU cores, using WaitGroup to synchronize, reducing context switching and memory usage.
Benchmarking and Profiling   Used htop and GNU time (/usr/bin/time) to identify bottlenecks and verify optimizations.
Community Collaboration      Shared insights and techniques within the community to discover and apply new optimization methods.

Are the important points covered?
Is the output correct?


Really impressive.

I tested on my AMD 7840HS laptop.

I create 16 fibers, with a buffered channel of size 16 as well (because my laptop is 8c/16t, so $(nproc) = 16).

I ran it with CRYSTAL_WORKERS=$(nproc) ./1 several times; it takes only 12 seconds when built with MT enabled.

 ╰─ $ CRYSTAL_WORKERS=16 ./6
Seed: 999000000 Steps: 162
collatz took: 00:00:12.770363768

Hi, @ysbaddaden, I forked your original version of the code (with only small renaming changes) and wrote my own version with just some adjustments to batch_size and batches, but it gives a totally different (wrong) result.

Runnable code:
require "wait_group"

def collatz(seed : Int64)
  steps = 0_i64

  while seed > 1
    while seed % 2 == 0
      steps &+= 1
      seed //= 2
    end

    if seed > 1
      steps &+= 1
      seed = seed &* 3 &+ 1
    end
  end

  steps
end

def calculate(total_seeds, batch_size, worker_size, batches)
  channel = Channel({Int64, Int64}).new(batches + 1)
  wg = WaitGroup.new(worker_size)

  p! total_seeds
  p! batch_size
  p! worker_size
  p! batches

  worker_size.times do |i|
    spawn(name: "WORKER-#{i}") do
      while r = channel.receive?
        (r[0]...r[1]).each do |seed|
          steps = collatz(seed)

          if seed % 1_000_000 == 0
            print "Seed: #{seed} Steps: #{steps}\r"
          end
        end
      end
      wg.done
    end
  end

  start = Time.measure do
    r0 = 0_i64

    batches.times do
      r1 = r0 &+ batch_size
      channel.send({r0, r1})
      r0 = r1
    end

    if total_seeds - batch_size &* batches > 0
      channel.send({r0, total_seeds})
    end

    channel.close
    wg.wait
  end

  puts "\ncollatz took: #{start}"
end

# ---------------- common part ----------------

total_seeds = 1_000_000_000_i64
worker_size = ENV.fetch("CRYSTAL_WORKERS").to_i

batch_size = 1000_i64
batches = (total_seeds // batch_size).to_i32
puts "(@ysbaddaden original version)large batches, with a small batch_size, result correct, and faster."
calculate(total_seeds, batch_size, worker_size, batches)

puts "-"*100

batches = worker_size
batch_size = (total_seeds // batches).to_i32
puts "(@zw963 changed version)small batches, with a large batch_size, result is unpredictable, and slower(most of time)."
calculate(total_seeds, batch_size, worker_size, batches)

Then I built it with:

crystal build --release -Dpreview_mt 1.cr

then ran it with:

CRYSTAL_WORKERS=16 ./1

The output looks like the following:

 ╰──➤ $ CRYSTAL_WORKERS=16 ./1
(@zw963 changed version)small batches, with a large batch_size, result is unpredictable, and slower(most of time).
total_seeds # => 1000000000
batch_size # => 62500000
worker_size # => 16
batches # => 16
Seed: 937000000 Steps: 224
collatz took: 00:00:13.270829818
----------------------------------------------------------------------------------------------------
(@ysbaddaden original version)large batches, with a small batch_size, result correct, and faster.
total_seeds # => 1000000000
batch_size # => 1000
worker_size # => 16
batches # => 1000000
Seed: 999000000 Steps: 162
collatz took: 00:00:12.969579920

I am very confused about the following two points:

  1. After my adjustment, the result is unpredictable; as you can see, my version outputs 937000000 instead of 999000000.

  2. My version is slower (most of the time). I thought that if the number of fibers is the same as the channel buffer size (as in my version), and each fiber runs one big loop without needing to channel.receive? data frequently, it should be faster, but I think I was wrong.

Could you please help explain what’s happening?

Thanks

EDIT: one more question.

It only needs 3MB of RES memory and 0.7GB VIRT memory. It finishes in 1m06s.

Can I ask how you measured the RES and VIRT memory usage of a Crystal program on Linux?

/usr/bin/time -v
Is that not good enough?

It works:

      Command being timed: "./1"
        User time (seconds): 393.76
        System time (seconds): 0.32
        Percent of CPU this job got: 1512%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:26.05
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 20312
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 4699
        Voluntary context switches: 512
        Involuntary context switches: 89134
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Maximum memory usage should be 19M.

I thought maybe there was a Crystal way to do this? If we’re getting the result from an external tool, we can use htop too.

I assumed he was using htop because RES and VIRT are headers it uses. I don’t recall seeing them anywhere else, at least not written like that specifically.

I don’t recall seeing them anywhere else, at least not written like that specifically.

htop follows the same column headers as the standard top command.
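
If you want a “Crystal way” from inside the program itself, one option on Linux is to just read /proc/self/status (a small sketch; VmRSS roughly corresponds to RES and VmSize to VIRT):

# Print the process's own resident and virtual memory as reported by the kernel.
File.each_line("/proc/self/status") do |line|
  puts line if line.starts_with?("VmRSS") || line.starts_with?("VmSize")
end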

Actually, we can still further improve the performance of the original program with some tricks:

  1. We know that this sequence alternates between odds and evens: when n is odd, multiplying by 3 and adding 1 always produces an even number, so the very next step is guaranteed to be a division.
  2. Use a bit operation (n & 1) instead of the even? check for odd and even.
  3. Always use the in-place wrapping operators (&+=, &*) instead of separate addition operations.
def collatz_sequence(n : UInt64)
  loop do
    if n & 1 == 0
      tz_count = n.trailing_zeros_count
      n >>= tz_count         # strip every factor of 2 in a single shift
      yield n, tz_count &+ 1 # count the divisions plus the upcoming 3n+1 step
    end
    n = n &* 3 &+ 1
  end
end

def collatz(initial : UInt64)
  return 0 if initial == 1

  length = 0
  collatz_sequence(initial) do |x, steps|
    length &+= steps
    if x == 1
      # Each yield pre-counts one 3n+1 step. For an odd initial that extra step cancels
      # the first (un-yielded) 3n+1; for an even initial it must be subtracted.
      return length - 1 if initial & 1 == 0
      return length
    end
  end
end

seed = 1_u64
start = Time.measure do
  until seed > 100_000_000
    collatzResult = collatz seed
    # if seed.divisible_by? 1_000_000
    #   puts "Seed #{seed} Steps: #{collatzResult}\r"
    # end
    seed += 1
  end
end
puts "collatz took: #{start}"

Now we see the runtime cut roughly in half compared to the original:

PS > .\collatz_cr.exe
collatz took: 00:00:09.831338000
PS > .\collatz_go.exe
collatz took 19.6764526s

I think I know the answer to the first question now. This is not a calculation error, it’s just a printing artifact: there is no guarantee which fiber will exit last, so the printed seed differs each time. I verified this by running the script 100 times, and the output is always one of a fixed set of values.

So now there’s only one question left: why are many batches with a small batch_size faster than few batches with a large batch_size?


I don’t know if this is the right idea, but this is the first idea that came to mind.
