Is there a way to optimize this program?

@jzakiya I just tried your version on my 10-year-old laptop with 4c/8t.

It’s performing quite well, but requires lots of memory: with a batch size of 1_000_000 it takes ~126MB of RES memory and ~9GB of VIRT memory, which isn’t really allocated but still counted somehow, because with a lower batch size the GC panics with out-of-memory (OOM) errors. It finishes in 1m07s.

I tweaked my version above to allocate a full buffer for all batches in the channel (so fibers will never block). It only needs 3MB of RES memory and 0.7GB VIRT memory. It finishes in 1m06s.

I reduced the batch size to 1000, it then needed 18MB of RES memory (larger channel buffer), VIRT memory didn’t change, and it finished in… 1m05s. One second less :man_shrugging:

takeaways:

  • starting more CPU-bound fibers than there are available hardware threads only hogs more memory: it won’t be faster (maybe even slower, since starting lots of fibers takes time);
  • channels are fast, but you must pay attention to the buffer size: too small and the threads will have to wait (:zzz:), but too large and you will queue more jobs than can actually be processed (here, for a benchmark, we don’t care). A small sizing sketch follows this list.
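
For reference, the buffer size is just the argument to Channel.new. A tiny sketch with made-up numbers (the tuple type is the one used in the code discussed below):

workers  = 8           # e.g. number of hardware threads
capacity = workers * 2 # made-up sizing; tune for the workload
channel  = Channel(Tuple(Int32, Slice(UInt64))).new(capacity) # senders only block once `capacity` items are queued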

tips:

  • Crystal 1.13 introduced WaitGroup to replace Channel(Nil) (a minimal sketch of the replacement follows this list).
  • You don’t need to collect each result, you only need 1 every 1_000_000 (but still have to save 'em all).
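
Not benchmark code, just a minimal sketch of what that replacement looks like:

require "wait_group"

def do_work
  sleep 10.milliseconds # stand-in for real work
end

# Before: signaling completion with a Channel(Nil)
done = Channel(Nil).new
4.times { spawn { do_work; done.send(nil) } }
4.times { done.receive }

# Since Crystal 1.13: the same idea with WaitGroup
wg = WaitGroup.new(4)
4.times { spawn { do_work; wg.done } }
wg.wait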

In the original parallel version process_results is defined as:

def process_results(channel, num_workers)

where channel is defined as:

channel = Channel(Tuple(Int32, Slice(UInt64))).new(num_workers)

So the channel passes data used for arithmetic processing, and a data Slice.

The design is also dependent on the number of CRYSTAL_WORKERS.

In my design process_results is defined as:

def process_results(batch_index, results)

All it needs is the batch_index for the current thread, and its results Slice.

No knowledge of the number of threads is needed; spawn manages that. You can give it the Tuple(Int32, Slice(UInt64)) data directly.

Here data means information that needs to be arithmetically processed.

In my design a channel is only used to pass a done message.
It isn’t used for arithmetic processing, but only for batches counting.

And as you see, the code becomes much simpler, faster, and memory efficient.

Thus: pass data directly for arithmetic processing, and use messages for event processing.
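
To make that shape concrete, here is a stripped-down, self-contained sketch of the design (placeholder numbers and simplified stand-in methods, not the actual benchmark code):

BATCH_SIZE = 1_000
batches    = 1_000 # 1_000_000 seeds total; placeholder numbers

# Stand-ins so the sketch compiles on its own; the real versions appear elsewhere in this thread.
def collatz(seed : UInt64) : UInt64
  steps = 0_u64
  while seed > 1
    seed = seed.even? ? seed // 2 : seed * 3 + 1
    steps += 1
  end
  steps
end

def process_results(batch_index, results)
  puts "Seed: #{BATCH_SIZE * (batch_index + 1)} Steps: #{results.last}" if (batch_index + 1) % 100 == 0
end

done = Channel(Nil).new(batches) # carries only "this batch is finished" messages

batches.times do |batch_index|
  spawn do
    results = Slice(UInt64).new(BATCH_SIZE)
    BATCH_SIZE.times do |seed_index|
      seed = (BATCH_SIZE * batch_index + seed_index + 1).to_u64
      results[seed_index] = collatz(seed)
    end
    process_results(batch_index, results) # data handed over directly, not through a channel
    done.send(nil)                        # message: an event, not data
  end
end

batches.times { done.receive } # count finished batches, not results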


I wasn’t aware of this.

Can you provide links to documentation that explain its purpose and how to use it?

When I looked at the release notes for Crystal 1.13.1, I noticed the WaitGroup. I thought it might be related to this thread, but I wasn’t sure how to use it.

However, @ysbaddaden, the author of WaitGroup, posted about how to use it. I understand that WaitGroup is a library for performing operations like @fiber.enqueue and Fiber.suspend at the appropriate timing for Fibers.

In the implementations by @kojix2 and @jzakiya, each fiber is assigned a predetermined job and retrieves the result from the channel.

However, in @ysbaddaden’s implementation, a certain number of fibers exist from start to finish and receive jobs sequentially through the channel. This implementation seems closer to “The Producer Consumer Pattern.”

Image Source: The Producer Consumer Pattern
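
In code, the pattern boils down to something like this (a toy sketch, not the measured code):

require "wait_group"

jobs = Channel(Int32).new(32)
wg   = WaitGroup.new(4)

# A fixed pool of consumer fibers pulls jobs from one channel until it is closed.
4.times do
  spawn do
    while job = jobs.receive? # returns nil once the channel is closed and drained
      # ... process job ...
    end
    wg.done
  end
end

# A single producer enqueues the jobs, then closes the channel.
100.times { |i| jobs.send(i) }
jobs.close
wg.wait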

When measuring @jzakiya’s implementation and @ysbaddaden’s implementation with /usr/bin/time, the results were as follows:

Metric                   jzakiya                  ysbaddaden / WaitGroup
User Mode Time           189.07 seconds           229.26 seconds
Kernel Mode Time         0.36 seconds             0.02 seconds
Total Elapsed Time       47.46 seconds            57.33 seconds
CPU Usage                399%                     399%
Max Resident Set Size    200832 KB (~200.8 MB)    19840 KB (~19.8 MB)
Input Operations         0                        0
Output Operations        0                        0
Major Page Faults        0                        0
Minor Page Faults        50145                    4488
Swap Operations          0                        0

This shows that jzakiya’s implementation is faster, but the memory usage in ysbaddaden’s implementation (WaitGroup) is reduced to 1/10. It seems that by reducing the number of fibers created, memory usage can be greatly reduced.

Here are my thoughts:

  1. When you want to retrieve all calculation results, would it be better to create another channel for result retrieval?

Image Source: Fork–join model - Wikipedia

  2. Is the practice of keeping the number of fibers low appropriate even when -Dpreview_mt is not enabled? (I want the program to work efficiently even when multithreading is not enabled.)

(This text was written with the help of ChatGPT)

WaitGroup is documented in the API.

Yes, it implements a producer/consumer pattern (Channel is an MPMC queue); here there’s only one producer, but there could be many producers without any changes.

  1. Reporting could use another Channel, either one by one (lots of contention), or send back an Array (no contention, but requires allocating memory). The advantage is that results can be processed as they come (a rough sketch follows below).

    In the context of a benchmark, each fiber could write its result directly into a global array (but that’s ugly) or pass it through the initial channel. The disadvantage is that processing is postponed until everything’s done.

  2. It depends on the case. Here the algorithm is CPU bound, with little to no synchronization that could give an opportunity to switch fibers while we wait on something. Limiting the number of fibers keeps resource usage in check.

    Another case with lots of IO or Channel interaction (lots of time spent waiting) will benefit from having many more fibers.

So: it all depends on the problem at hand (and what it implies for the future).
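
For the first option, a rough sketch of the “send back an Array” variant (made-up names and a stand-in computation):

# A second channel reports one Array per finished batch, so the main fiber
# can process results as they arrive.
jobs    = Channel(Range(Int32, Int32)).new(64)
results = Channel(Array(Int32)).new(64)

4.times do
  spawn do
    while range = jobs.receive?
      results.send(range.map { |n| n * n }) # stand-in computation
    end
  end
end

(0...10).each { |i| jobs.send(i * 100...(i + 1) * 100) }
jobs.close

10.times { puts results.receive.sum } # consume batches as they complete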


First, thanks @kojix2 for the information on WaitGroups.

(Compiled with --mcpu native, consistently faster by about 0.2s.)

@ysbaddaden’s code on my system consistently gets: 12.1..12.2 secs.

But it didn’t use the faster collatz version code below.

def collatz(seed : UInt64)
  steps = 0_u64
  while seed > 1
    tz_count = seed.trailing_zeros_count
    steps += tz_count
    seed >>= tz_count
    if seed > 1
      steps += 1
      seed = seed * 3 + 1
    end
  end
  steps
end

So when using it I consistently get: 11.7..11.8 secs

However, I then used the &*|&+ operators to optimize it more, shown below.

def collatz(seed : UInt64)
  steps = 0_u64
  while seed > 1
    tz_count = seed.trailing_zeros_count
    steps &+= tz_count
    seed  >>= tz_count
    if seed > 1
      steps &+= 1
      seed = seed &* 3 &+ 1
    end
  end
  steps
end

I used this optimal collatz version in my code too, and here are the times:

@ysbaddaden: 11.7..11.8s --> 5.94..5.95s

@jzakiya:    11.8..11.9s --> 6.1..6.2s

This is an insane ~50% speedup by just using the &*|&+ operators!

And my serial (single core) version is now: ~110s --> ~72s

I then used WaitGroups in my version, with no significant time differences.

require "wait_group" # WaitGroup ships with Crystal >= 1.13

wg = WaitGroup.new(batches)

start = Time.measure do
  batches.times do |batch_index|
    spawn do
      results = Slice(UInt64).new(BATCH_SIZE)
      BATCH_SIZE.times do |seed_index|
        seed = (BATCH_SIZE * batch_index + seed_index + 1).to_u64
        results[seed_index] = collatz(seed)
      end
      process_results(batch_index, results)
      wg.done
    end
  end
  wg.wait
end

However, the two parallel versions are not apples-to-apples comparisons.

@ysbaddaden’s version is more faithful to the Go version, because it doesn’t store the seeds step values in a Slice.

So the difference in mem use has little to do with the fibers; it’s because his version doesn’t use Slices to store|pass results.

Using htop shows, my version uses a peak of ~650MB (BATCH_SIZE = 3_125_000), and his is ~20MB, out of 16GB.

I’ll play around to see if I can do a faster version that doesn’t store|pass the seeds step values.

But let me emphasize again, the information in this forum thread needs to be extracted into some kind of tutorial blog|video.

Until I started to play around with the code, and learned the techniques shown herein, I would have never thought you could turn Go code that took 3m1s into fairly simple Crystal code that takes 6s.

It would be interesting to see what Rust, C|C++, et al, can do.


There are some new ideas that are interesting. But right now, I am very confused about something basic. It is about operators like &* and &+. What are these? At first, I thought they were for bitwise operations, but do they just work the same as * and + ? Why do they make things faster? Are there any disadvantages or side effects to using these operators? Why doesn’t the compiler optimize this automatically?

Because of:

Because the wrapping operator does not perform the overflow checks, the overhead of that can be avoided entirely, at the cost of possibly producing the “incorrect” result. E.g. adding 1 to Int64::MAX in most cases would be considered a bug, hence why Crystal raises by default and leaves the wrapping behavior opt-in via the & prefix methods.
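
A minimal illustration of the difference:

x = Int64::MAX
p x &+ 1 # => -9223372036854775808 (wraps around to Int64::MIN, no check)
p x + 1  # raises OverflowError at runtime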

Keep in mind this is comparing multi-threaded code with single-threaded code. I’m sure Go could be re-written to also be MT, in which case the difference in timing may be less dramatic.


Thank you, Blacksmoke16. I understand now. I joined the thread later, so I missed your important comment at the beginning. I also think the current Crystal behavior of raising an error on overflow is good. I was surprised that allowing overflow for UInt64 can make calculations so much faster. But it seems wise to avoid this unless you really need high performance. (Yes, I make a lot of mistakes.)

Hopefully this can help understanding why this works.

It’s generally faster to do all the math at the CPU’s native word size, no matter how small the values. This helps the compiler, because all the CPU registers|mem addrs are 64-bit based (on 64-bit systems). So in general, don’t mix math types if you can help it, or make the first (left) value in a math expression be the largest type.

The collatz method is performed for every seed value, i.e. 1 billion times here. Each loop iteration does 3 adds and 1 multiply, so that’s at least 3 billion adds and 1 billion multiplications total.

UInt64 range is 0--18,446,744,073,709,551,615

Thus +|* would then perform at least 4 billion overflow checks against UInt64::MAX.
But &+|&* eliminate them, and we see the significant time difference, which is over 5 seconds (parallel version) on my system.

We can safely do this here because no seed|step value comes anywhere close to this MAX (or even UInt32::MAX|Int32::MAX).

The moral of the story is, the better you know your problem’s data profile the better you can optimally code it.

In fact, back in the discussion when the &(**|*|+) operators were created, there was some sentiment to make unchecked arithmetic the default behavior (for speed), and have users specify in their code when to do overflow checking. But the current behavior won out, to ensure runtime overflow error notification.


I am addicted to ChatGPT, so I loaded this entire thread into ChatGPT and created a table summarizing optimization methods. (To export a Discourse thread to text, see here)

Optimization Method          Description
Type Unification             Changed UInt32 to Int64 to avoid type conversions and improve efficiency.
Operator Changes             Used //= for correct floor division.
Wrapping Operators           Applied &+, &* to avoid overflow checks, speeding up arithmetic operations.
Compilation Flags            Used --release --mcpu native to optimize binaries for the specific CPU.
Parallel Execution           Employed spawn and Channel, with -Dpreview_mt for multi-threading, to utilize multiple CPU cores.
Batch Processing             Processed tasks in batches to reduce overhead and enhance memory usage.
Algorithm Optimization       Used trailing_zeros_count for faster even-number division in the Collatz function.
Avoiding Unnecessary IO      Reduced puts calls in loops to minimize IO overhead.
Reduced Memory Usage         Used Slices instead of Arrays and avoided unnecessary allocations.
Fixed Number of Fibers       Limited fibers to the number of CPU cores, using WaitGroup to synchronize, reducing context switching and memory usage.
Benchmarking and Profiling   Used htop and GNU time (/usr/bin/time) to identify bottlenecks and verify optimizations.
Community Collaboration      Shared insights and techniques within the community to discover and apply new optimization methods.

Are the important points covered?
Is the output correct?


Really impressive.

I tested on my AMD 7840HS laptop.

I create 16 fibers, with a buffered channel of size 16 as well (because my laptop is 8c/16t, so $(nproc) = 16).

I ran it with CRYSTAL_WORKERS=$(nproc) ./1 several times; it takes only 12 seconds when built with MT enabled.

 ╰─ $ CRYSTAL_WORKERS=16 ./6
Seed: 999000000 Steps: 162
collatz took: 00:00:12.770363768

Hi, @ysbaddaden, I forked your original version of the code (with only small renaming changes) and wrote my own version with just some adjustments to batch_size and batches, but it gives a totally different (wrong) result.

Runnable code:
require "wait_group"

def collatz(seed : Int64)
  steps = 0_i64

  while seed > 1
    while seed % 2 == 0
      steps &+= 1
      seed //= 2
    end

    if seed > 1
      steps &+= 1
      seed = seed &* 3 &+ 1
    end
  end

  steps
end

def calculate(total_seeds, batch_size, worker_size, batches)
  channel = Channel({Int64, Int64}).new(batches + 1)
  wg = WaitGroup.new(worker_size)

  p! total_seeds
  p! batch_size
  p! worker_size
  p! batches

  worker_size.times do |i|
    spawn(name: "WORKER-#{i}") do
      while r = channel.receive?
        (r[0]...r[1]).each do |seed|
          steps = collatz(seed)

          if seed % 1_000_000 == 0
            print "Seed: #{seed} Steps: #{steps}\r"
          end
        end
      end
      wg.done
    end
  end

  start = Time.measure do
    r0 = 0_i64

    batches.times do
      r1 = r0 &+ batch_size
      channel.send({r0, r1})
      r0 = r1
    end

    if total_seeds - batch_size &* batches > 0
      channel.send({r0, total_seeds})
    end

    channel.close
    wg.wait
  end

  puts "\ncollatz took: #{start}"
end

# ---------------- common part ----------------

total_seeds = 1_000_000_000_i64
worker_size = ENV.fetch("CRYSTAL_WORKERS").to_i

batch_size = 1000_i64
batches = (total_seeds // batch_size).to_i32
puts "(@ysbaddaden original version)large batches, with a small batch_size, result correct, and faster."
calculate(total_seeds, batch_size, worker_size, batches)

puts "-"*100

batches = worker_size
batch_size = (total_seeds // batches).to_i32
puts "(@zw963 changed version)small batches, with a large batch_size, result is unpredictable, and slower(most of time)."
calculate(total_seeds, batch_size, worker_size, batches)

Then I built it with:

crystal build --release -Dpreview_mt 1.cr

then ran it with:

CRYSTAL_WORKERS=16 ./1

The output looks like the following:

 ╰──➤ $ CRYSTAL_WORKERS=16 ./1
(@zw963 changed version)small batches, with a large batch_size, result is unpredictable, and slower(most of time).
total_seeds # => 1000000000
batch_size # => 62500000
worker_size # => 16
batches # => 16
Seed: 937000000 Steps: 224
collatz took: 00:00:13.270829818
----------------------------------------------------------------------------------------------------
(@ysbaddaden original version)large batches, with a small batch_size, result correct, and faster.
total_seeds # => 1000000000
batch_size # => 1000
worker_size # => 16
batches # => 1000000
Seed: 999000000 Steps: 162
collatz took: 00:00:12.969579920

I am very confused about the following two points:

  1. After my adjustment, the result is unpredictable; as you can see, my version outputs 937000000 instead of 999000000.

  2. My version is slower (most of the time). I thought that if the number of fibers is the same as the channel buffer size (as in my version), and each fiber runs one big loop without needing to channel.receive? data frequently, it should be faster, but I think I was wrong.

Could you please help explain what’s happening?

Thanks

EDIT: one more question.

It only needs 3MB of RES memory and 0.7GB VIRT memory. It finishes in 1m06s.

Can I ask how you measured the RES and VIRT memory usage of a Crystal program on Linux?

/usr/bin/time -v
Is that not good enough?

It works:

      Command being timed: "./1"
        User time (seconds): 393.76
        System time (seconds): 0.32
        Percent of CPU this job got: 1512%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:26.05
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 20312
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 4699
        Voluntary context switches: 512
        Involuntary context switches: 89134
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Maximum memory usage should be 19M.

I thought maybe there was a Crystal way to do this? If we’re getting the result from an external tool, we can use htop too.

I assumed he was using htop because RES and VIRT are headers it uses. I don’t recall seeing them anywhere else, at least not written like that specifically.

I don’t recall seeing them anywhere else, at least not written like that specifically.

htop follows the same column headers as the standard top command.
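
If you want a “Crystal way” from inside the program itself, one option on Linux is to just read /proc/self/status (a small sketch; VmRSS roughly corresponds to RES and VmSize to VIRT):

# Print the process's own resident and virtual memory as reported by the kernel.
File.each_line("/proc/self/status") do |line|
  puts line if line.starts_with?("VmRSS") || line.starts_with?("VmSize")
end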

Actually, we can still further improve the performance of the original program with some tricks:

  1. We know that this sequence alternates between odds and evens: when n is odd, multiplying by 3 and adding 1 always produces an even number, so the very next step is guaranteed to be a division.
  2. Use a bit operation (n & 1) instead of the even? check for odd and even.
  3. Always use the in-place wrapping operators (&+=, &*) instead of separate addition operations.
def collatz_sequence(n : UInt64)
  loop do
    if n & 1 == 0
      tz_count = n.trailing_zeros_count
      n >>= tz_count         # strip every factor of 2 in a single shift
      yield n, tz_count &+ 1 # count the divisions plus the upcoming 3n+1 step
    end
    n = n &* 3 &+ 1
  end
end

def collatz(initial : UInt64)
  return 0 if initial == 1

  length = 0
  collatz_sequence(initial) do |x, steps|
    length &+= steps
    if x == 1
      # Each yield pre-counts one 3n+1 step. For an odd initial that extra step cancels
      # the first (un-yielded) 3n+1; for an even initial it must be subtracted.
      return length - 1 if initial & 1 == 0
      return length
    end
  end
end

seed = 1_u64
start = Time.measure do
  until seed > 100_000_000
    collatzResult = collatz seed
    # if seed.divisible_by? 1_000_000
    #   puts "Seed #{seed} Steps: #{collatzResult}\r"
    # end
    seed += 1
  end
end
puts "collatz took: #{start}"

Now we see the runtime cut roughly in half compared to the original:

PS > .\collatz_cr.exe
collatz took: 00:00:09.831338000
PS > .\collatz_go.exe
collatz took 19.6764526s

I think I know the answer to the first question now. This is not a calculation error, it’s just a printing artifact: there is no guarantee which fiber will exit last, so the printed seed differs each time. I verified this by running the script 100 times, and the output is always one of a fixed set of values.

So now there’s only one question left: why are many batches with a small batch_size faster than few batches with a large batch_size?


I don’t know if this is the right idea, but this is the first idea that came to mind.
