Slow performance of Channels in MT mode

lribeiro · April 20, 2020, 8:55am

While testing the overhead of serialisation with MessagePack or JSON over TCP,
I used Channels as a baseline to check how much of a performance penalty we would incur,
and found something unexpected.

Channels perform much better when running without the preview_mt flag.
Some performance loss could be justified by the extra locking required for all threads to behave,
but about 80x slower seems to point to a possible bottleneck.

The reduced code

require "benchmark"

record Params, a : Int32, b : String
record Result, r : Float64

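# Unbuffered channels: every send blocks until the other fiber receives.
ch = Channel(Params).new
chr = Channel(Result).new

# Worker fiber: parse the string, add it to the integer, send the result back.
spawn do
  loop do
    params = ch.receive
    result = Result.new(params.a + params.b.to_f)
    chr.send(result)
  end
end

# Each iteration is one full round trip: send a request, wait for the reply.
Benchmark.ips do |r|
  r.report("channel") do
    ch.send(Params.new(3, "5.5"))
    res = chr.receive
  end
end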

Now, if we run this without preview_mt, we get about 8M ops/s:

➜  time crystal build channel_mt_slow.cr --release 
crystal build channel_mt_slow.cr --release --warnings none  0.74s user 0.14s system 115% cpu 0.757 total
➜  ./channel_mt_slow 
channel   7.98M (125.26ns) (± 2.73%)  0.0B/op  fastest

With preview_mt enabled, it drops to about 100k ops/s:

➜  time crystal build channel_mt_slow.cr --release  --warnings none -Dpreview_mt
crystal build channel_mt_slow.cr --release --warnings none -Dpreview_mt  13.50s user 0.22s system 100% cpu 13.624 total
➜  ./channel_mt_slow                                                            
channel 101.69k (  9.83µs) (± 1.38%)  0.0B/op  fastest

80x slower with MT: is this something we should expect for channel performance?
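As a quick sanity check on the two figures above, the slowdown factor and the share of throughput lost can be computed directly. The sketch below is in Go only to keep it runnable; the helper name `slowdown` is made up for illustration, and the inputs are the 7.98M and 101.69k ops/s reported by the two runs.

```go
package main

import "fmt"

// slowdown returns how many times lower the MT throughput is than the
// baseline, and the percentage of baseline throughput that was lost.
// (Hypothetical helper, not part of any benchmark library.)
func slowdown(baseOps, mtOps float64) (factor, dropPct float64) {
	factor = baseOps / mtOps
	dropPct = (1 - mtOps/baseOps) * 100
	return factor, dropPct
}

func main() {
	// Ops per second taken from the two benchmark runs above.
	factor, drop := slowdown(7.98e6, 101.69e3)
	fmt.Printf("%.1fx slower, %.2f%% of throughput lost\n", factor, drop)
}
```

With these inputs the factor comes out just under 80x, which corresponds to losing a little under 99% of the single-threaded throughput.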

lribeiro · April 20, 2020, 9:30am

Doing something similar in Go, I see a much smaller performance drop (~27%).

1 Thread

➜  GOMAXPROCS=1 go test -bench=. 
goos: darwin
goarch: amd64
BenchmarkFibComplete 	 3887295	       300 ns/op
PASS
ok  	_/Users/lribeiro/Work/bench	1.482s

2 Threads

➜  GOMAXPROCS=2 go test -bench=.
goos: darwin
goarch: amd64
BenchmarkFibComplete-2   	 3249416	       351 ns/op
PASS
ok  	_/Users/lribeiro/Work/bench	1.519s

12 Threads

➜  GOMAXPROCS=12 go test -bench=.
goos: darwin
goarch: amd64
BenchmarkFibComplete-12    	 3051240	       390 ns/op
PASS
ok  	_/Users/lribeiro/Work/bench	1.596s

Go code

package main

import (
	"strconv"
	"testing"
)

type Params struct {
	a int32   
	b string
}

type Result struct {
	r float64
}

var result float64

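// Worker: parses param.b, adds it to param.a, and sends back the sum.
// Note: on a parse error nothing is sent, so the caller would block forever.
func run(pch chan Params, rch chan Result) {
	for {
		param := <-pch
		svalue, err := strconv.ParseFloat(param.b, 64)

		if err == nil {
			value := float64(param.a) + svalue
			result := Result{value}
			rch <- result
		}
	}
}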

func BenchmarkFibComplete(b *testing.B) {
	var r float64
	pch := make(chan Params)
	rch := make(chan Result)

	go run(pch, rch)

	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		pch <- Params{3, "5.5"}
		res := <-rch
		r = res.r
	}
	// always store the result to a package level variable
	// so the compiler cannot eliminate the Benchmark itself.
	result = r
}

straight-shoota · April 20, 2020, 9:46am

MT causing some overhead is to be expected. 80% might be a bit much, though, and needs investigation.

But this is also not a good example for a comparison of actual MT behaviour. With MT enabled, it still behaves the same way as the single-threaded program. Because of the blocking channels, you don’t get any parallel execution benefits, but you still have to pay the synchronization price.
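To illustrate why a request/reply exchange over blocking channels cannot run in parallel, here is a minimal Go sketch (not the benchmark itself). It tracks how many goroutines are doing work at the same moment; the names `pingPong`, `enter` and `leave` are hypothetical, and the trivial arithmetic is only a stand-in for real work.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// pingPong runs n request/reply round trips over unbuffered channels and
// returns the highest number of goroutines that were doing work at once.
func pingPong(n int) int32 {
	var active, maxActive int32

	// enter marks the caller as working and records a new maximum if needed.
	enter := func() {
		cur := atomic.AddInt32(&active, 1)
		for {
			old := atomic.LoadInt32(&maxActive)
			if cur <= old || atomic.CompareAndSwapInt32(&maxActive, old, cur) {
				return
			}
		}
	}
	// leave marks the caller as no longer working.
	leave := func() { atomic.AddInt32(&active, -1) }

	req := make(chan int)
	rep := make(chan int)

	// Worker: works only between receiving a request and sending the reply.
	go func() {
		for v := range req {
			enter()
			v *= 2 // stand-in for the worker's real computation
			leave()
			rep <- v
		}
	}()

	// Caller: works only between receiving a reply and sending the next request.
	for i := 0; i < n; i++ {
		enter()
		v := i + 1 // stand-in for preparing the request
		leave()
		req <- v
		<-rep
	}
	close(req)
	return atomic.LoadInt32(&maxActive)
}

func main() {
	fmt.Println("max goroutines working at once:", pingPong(100000))
}
```

Because each side only works while the other is blocked on a channel operation, the reported maximum stays at 1 regardless of how many threads are available.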

Also be aware that another effect might be some over-optimization by LLVM. I’m not sure whether there’s a realistic chance, but it might just happen to drop some of the code because it’s not considered relevant, and that could work differently in single-threaded and multi-threaded builds. It’s just a wild guess and probably not the case.

asterite · April 20, 2020, 12:24pm

Advice: profile it. Run it with Xcode Instruments and see what takes so much time. Just run the benchmarked code in a plain loop (without the Benchmark harness).
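The same idea, sketched in Go to match the Go benchmark posted earlier: drop the benchmarking harness and drive the round trips from a plain fixed-count loop, so the profiler only sees channel traffic. The function name `roundTrips` and the iteration count are assumptions made for this sketch; the Crystal version would follow the same shape with a fixed-count loop in place of Benchmark.ips.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

type Params struct {
	a int32
	b string
}

type Result struct {
	r float64
}

// roundTrips performs n request/reply exchanges with a worker goroutine
// and returns the sum of the results so the work cannot be optimized away.
func roundTrips(n int) float64 {
	pch := make(chan Params)
	rch := make(chan Result)

	go func() {
		for p := range pch {
			f, err := strconv.ParseFloat(p.b, 64)
			if err != nil {
				f = 0 // keep replying so the caller never blocks forever
			}
			rch <- Result{float64(p.a) + f}
		}
	}()

	var sum float64
	for i := 0; i < n; i++ {
		pch <- Params{3, "5.5"}
		sum += (<-rch).r
	}
	close(pch)
	return sum
}

func main() {
	const n = 1000000 // hypothetical count, long enough for a profiler to sample
	start := time.Now()
	sum := roundTrips(n)
	elapsed := time.Since(start)
	fmt.Printf("sum=%.1f, %v per round trip\n", sum, elapsed/time.Duration(n))
}
```

Running the resulting binary under a profiler once per configuration should make it easier to see where the extra time goes.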

lribeiro · April 20, 2020, 1:15pm

I would take an 80% drop over the current 98.75%.

Because adding more threads can’t improve this scenario, it’s a good way to test the overhead of multiple threads vs. one thread.

[Image: Single Thread profile]

[Image: Multi Thread profile]

Update with profiling zips:
Single Thread MacOS Instruments Profile
http://s000.tinyupload.com/index.php?file_id=26876922846129458324

Multi Thread MacOS Instruments Profile
http://s000.tinyupload.com/index.php?file_id=56342273467726017859

straight-shoota · April 20, 2020, 2:22pm

Oops, my bad. Even more so, then.

AFAIK you can only upload images on the forum. But you could just put your ZIP file on any file host and share the download link.
