Slow performance of Channels in MT mode

While testing the overhead of serialisation using MSGPACK or JSON over TCP,
I used Channels as a control point to check how much of a performance penalty we would incur,
and found something unexpected.

Channels have much better performance when running without the preview_mt flag.
Some performance loss could be justified by the extra locking required to keep all threads well behaved,
but 80x slower seems to point to a possible bottleneck.

The reduced code:

require "benchmark"

record Params, a : Int32, b : String
record Result, r : Float64

ch = Channel(Params).new
chr = Channel(Result).new
# Worker
spawn do
	while true
		params = ch.receive
		result = Result.new(params.a + params.b.to_f)
		chr.send(result)
	end
end

Benchmark.ips do |r|
	r.report("channel") do
		ch.send(Params.new(3, "5.5"))
		res = chr.receive
	end
end

Now if we run this without preview_mt we get about 8M ops / s

➜  time crystal build --release 
crystal build --release --warnings none  0.74s user 0.14s system 115% cpu 0.757 total
➜  ./channel_mt_slow 
channel   7.98M (125.26ns) (± 2.73%)  0.0B/op  fastest

With preview_mt enabled it drops to about 100k ops / s

➜  time crystal build --release  --warnings none -Dpreview_mt
crystal build --release --warnings none -Dpreview_mt  13.50s user 0.22s system 100% cpu 13.624 total
➜  ./channel_mt_slow                                                            
channel 101.69k (  9.83µs) (± 1.38%)  0.0B/op  fastest

80x slower on MT: is this something we should expect for channel performance?

Doing something similar in Golang, the performance drop is a lot smaller (~27%)

1 Thread

➜  GOMAXPROCS=1 go test -bench=. 
goos: darwin
goarch: amd64
BenchmarkFibComplete 	 3887295	       300 ns/op
ok  	_/Users/lribeiro/Work/bench	1.482s

2 Threads

➜  GOMAXPROCS=2 go test -bench=.
goos: darwin
goarch: amd64
BenchmarkFibComplete-2   	 3249416	       351 ns/op
ok  	_/Users/lribeiro/Work/bench	1.519s

12 Threads

➜  GOMAXPROCS=12 go test -bench=.
goos: darwin
goarch: amd64
BenchmarkFibComplete-12    	 3051240	       390 ns/op
ok  	_/Users/lribeiro/Work/bench	1.596s

Go code:

package main

import (
	"strconv"
	"testing"
)

type Params struct {
	a int32
	b string
}

type Result struct {
	r float64
}

var result float64

// Worker
func run(pch chan Params, rch chan Result) {
	for {
		param := <-pch
		svalue, err := strconv.ParseFloat(param.b, 64)
		if err == nil {
			value := float64(param.a) + svalue
			result := Result{value}
			rch <- result
		}
	}
}

func BenchmarkFibComplete(b *testing.B) {
	var r float64
	pch := make(chan Params)
	rch := make(chan Result)

	go run(pch, rch)

	for n := 0; n < b.N; n++ {
		pch <- Params{3, "5.5"}
		res := <-rch
		r = res.r
	}
	// always store the result to a package level variable
	// so the compiler cannot eliminate the Benchmark itself.
	result = r
}

MT causing some overhead is obvious. 80% might be a bit much, though, and needs investigation.

But this is also not a good example for a comparison of actual MT behaviour. With MT enabled it still behaves the same way as the single-threaded program: because of the blocking channels you don’t get any parallel execution benefits, but you still have to pay the synchronization price.

Also be aware that another effect might be some over-optimization by LLVM. I’m not sure whether there’s a realistic chance, but it might just happen to drop some of the code because it’s not considered relevant, and that could work differently in single- and multi-threaded builds. It’s just a wild guess and probably not the case.

Advice: profile it. Run it with Xcode Instruments and see what takes so much time. Just run the workload in a loop (without the benchmark harness).


Would take an 80% drop over the current 98.75% :slightly_smiling_face:

Because adding more threads can’t improve this scenario, it’s a good way to test the overhead of multiple threads vs. one thread.

Single Thread

Multi Thread

– Update with profiling zips
Single Thread MacOS Instruments Profile

Multi Thread MacOS Instruments Profile

Oops, my bad. Even more so, then :smiley:

AFAIK you can only upload images on the forum. But you could just put your ZIP file on any file hoster and share the download link.