I have been watching and reading some articles about performance. In summary using array like data structures on the stack instead of the heap.
Here is a link to a blog with some more in depth explanation with a link to the video that first got me interested in this. Small vector optimization | Prog stuff
So I wanted to check what the performance is like in Crystal and got some interesting results. Here is a link to some benchmarking code and some graphing.
gistfile1.txt
require "benchmark"
std_array_sm = Array(Int32).new(8) { rand(0..1000) }
std_array_md = Array(Int32).new(16) { rand(0..1000) }
std_array_lg = Array(Int32).new(32) { rand(0..1000) }
static_array_sm = StaticArray(Int32, 8).new { rand(0..1000) }
static_array_md = StaticArray(Int32, 16).new { rand(0..1000) }
static_array_lg = StaticArray(Int32, 32).new { rand(0..1000) }
std_sorted_array_sm = Array(Int32).new(8) { rand(0..1000) }
This file has been truncated. show original
There are some definite performance gains in some cases but I have some questions.
Some cases are worse
Some scaling is worse then array
Some cases are not as performant then I thought
I was looking into performance gains from StaticArray but I don’t think I totally understand the tradeoffs and the intuition of when they would not perform as well compared to Arrays.
2 Likes
Do you have an automated solution for the graphing, or did you do it manually?
Here is some expanded benchmarking.
I do not have an automated solution for graphing. I am just putting it into apple numbers.
Okay, so I did get different results when using the benchmarks to modify a global variable (in an attempt to avoid things being optimized away).
Here’s the new code: gist
And here are my timings:
That looks much more in line with what I expect from StaticArray
.
3 Likes
that is way more what I expect. It is interesting that an optimization could cause different results.
mselig
July 22, 2021, 6:51am
7
I tried increasing the array sizes by a factor of 10, and got results that seemed much more as-expected:
#bsearch(&block : T -> Bool)
std_sorted_array_sm 95.37M ( 10.49ns) (± 1.64%) 0.0B/op 1.17× slower
std_sorted_array_md 81.29M ( 12.30ns) (± 1.65%) 0.0B/op 1.38× slower
std_sorted_array_lg 70.81M ( 14.12ns) (± 0.90%) 0.0B/op 1.58× slower
std_sorted_array_xlg 63.07M ( 15.86ns) (± 0.94%) 0.0B/op 1.77× slower
std_sorted_array_xxlg 56.40M ( 17.73ns) (± 1.42%) 0.0B/op 1.98× slower
static_sorted_array_sm 111.88M ( 8.94ns) (± 1.39%) 0.0B/op fastest
static_sorted_array_md 94.27M ( 10.61ns) (± 0.86%) 0.0B/op 1.19× slower
static_sorted_array_lg 81.24M ( 12.31ns) (± 0.90%) 0.0B/op 1.38× slower
static_sorted_array_xlg 71.04M ( 14.08ns) (± 1.39%) 0.0B/op 1.57× slower
static_sorted_array_xxlg 63.47M ( 15.76ns) (± 0.47%) 0.0B/op 1.76× slower
#all?(&)
std_array_sm 19.07M ( 52.45ns) (± 1.18%) 0.0B/op 1.06× slower
std_array_md 10.66M ( 93.80ns) (± 0.63%) 0.0B/op 1.89× slower
std_array_lg 5.66M (176.68ns) (± 0.79%) 0.0B/op 3.57× slower
std_array_xlg 2.91M (343.13ns) (± 1.36%) 0.0B/op 6.93× slower
std_array_xxlg 1.49M (673.03ns) (± 0.28%) 0.0B/op 13.59× slower
static_array_sm 20.20M ( 49.51ns) (± 2.16%) 0.0B/op fastest
static_array_md 11.70M ( 85.48ns) (± 0.73%) 0.0B/op 1.73× slower
static_array_lg 6.38M (156.75ns) (± 2.35%) 0.0B/op 3.17× slower
static_array_xlg 3.35M (298.45ns) (± 0.71%) 0.0B/op 6.03× slower
static_array_xxlg 1.71M (584.16ns) (± 1.49%) 0.0B/op 11.80× slower
#count(&)
std_array_sm 15.91M ( 62.86ns) (± 1.34%) 0.0B/op 1.07× slower
std_array_md 8.87M (112.76ns) (± 0.83%) 0.0B/op 1.92× slower
std_array_lg 4.54M (220.25ns) (± 1.48%) 0.0B/op 3.75× slower
std_array_xlg 2.31M (432.84ns) (± 1.20%) 0.0B/op 7.37× slower
std_array_xxlg 1.16M (858.90ns) (± 0.91%) 0.0B/op 14.63× slower
static_array_sm 17.03M ( 58.71ns) (± 1.38%) 0.0B/op fastest
static_array_md 8.89M (112.54ns) (± 1.50%) 0.0B/op 1.92× slower
static_array_lg 4.58M (218.41ns) (± 0.75%) 0.0B/op 3.72× slower
static_array_xlg 2.33M (429.74ns) (± 2.39%) 0.0B/op 7.32× slower
static_array_xxlg 1.16M (859.57ns) (± 1.37%) 0.0B/op 14.64× slower
#first
std_array_sm 563.27M ( 1.78ns) (± 1.00%) 0.0B/op 1.00× slower
std_array_md 561.96M ( 1.78ns) (± 1.60%) 0.0B/op 1.00× slower
std_array_lg 563.98M ( 1.77ns) (± 0.56%) 0.0B/op fastest
std_array_xlg 562.01M ( 1.78ns) (± 1.57%) 0.0B/op 1.00× slower
std_array_xxlg 563.63M ( 1.77ns) (± 0.86%) 0.0B/op 1.00× slower
static_array_sm 563.28M ( 1.78ns) (± 0.98%) 0.0B/op 1.00× slower
static_array_md 562.72M ( 1.78ns) (± 1.54%) 0.0B/op 1.00× slower
static_array_lg 563.13M ( 1.78ns) (± 1.03%) 0.0B/op 1.00× slower
static_array_xlg 562.42M ( 1.78ns) (± 1.52%) 0.0B/op 1.00× slower
static_array_xxlg 563.33M ( 1.78ns) (± 1.07%) 0.0B/op 1.00× slower
#minmax
std_array_sm 28.62M ( 34.94ns) (± 1.42%) 0.0B/op 19.68× slower
std_array_md 16.49M ( 60.64ns) (± 1.07%) 0.0B/op 34.17× slower
std_array_lg 9.44M (105.93ns) (± 1.47%) 0.0B/op 59.69× slower
std_array_xlg 4.92M (203.10ns) (± 0.75%) 0.0B/op 114.44× slower
std_array_xxlg 2.56M (390.23ns) (± 1.29%) 0.0B/op 219.88× slower
static_array_sm 563.46M ( 1.77ns) (± 1.00%) 0.0B/op fastest
static_array_md 16.60M ( 60.26ns) (± 0.93%) 0.0B/op 33.95× slower
static_array_lg 9.29M (107.64ns) (± 1.06%) 0.0B/op 60.65× slower
static_array_xlg 4.99M (200.57ns) (± 0.45%) 0.0B/op 113.01× slower
static_array_xxlg 2.55M (392.54ns) (± 1.41%) 0.0B/op 221.18× slower
#reduce(memo, &)
std_array_sm 18.63M ( 53.66ns) (± 1.18%) 0.0B/op 1.06× slower
std_array_md 10.15M ( 98.55ns) (± 0.68%) 0.0B/op 1.95× slower
std_array_lg 5.35M (187.00ns) (± 0.48%) 0.0B/op 3.70× slower
std_array_xlg 2.75M (363.59ns) (± 1.24%) 0.0B/op 7.20× slower
std_array_xxlg 1.40M (716.62ns) (± 0.84%) 0.0B/op 14.19× slower
static_array_sm 19.80M ( 50.50ns) (± 1.24%) 0.0B/op fastest
static_array_md 11.48M ( 87.13ns) (± 2.10%) 0.0B/op 1.73× slower
static_array_lg 6.21M (161.01ns) (± 1.13%) 0.0B/op 3.19× slower
static_array_xlg 3.34M (299.51ns) (± 1.46%) 0.0B/op 5.93× slower
static_array_xxlg 1.66M (602.03ns) (± 0.81%) 0.0B/op 11.92× slower
Sorry, I didn’t graph them, but it seems to show that StaticArray
is perhaps slightly faster than Array
, but there isn’t really much in it. This is to be expected - stack allocation should be much faster, and less pressure on garbage collection, but these benchmarks don’t measure that.