I have been watching and reading some articles about performance. In summary using array like data structures on the stack instead of the heap.
Here is a link to a blog with some more in depth explanation with a link to the video that first got me interested in this. Small vector optimization | Prog stuff
So I wanted to check what the performance is like in Crystal and got some interesting results. Here is a link to some benchmarking code and some graphing.
gistfile1.txt
require "benchmark"
std_array_sm = Array(Int32).new(8) { rand(0..1000) }
std_array_md = Array(Int32).new(16) { rand(0..1000) }
std_array_lg = Array(Int32).new(32) { rand(0..1000) }
static_array_sm = StaticArray(Int32, 8).new { rand(0..1000) }
static_array_md = StaticArray(Int32, 16).new { rand(0..1000) }
static_array_lg = StaticArray(Int32, 32).new { rand(0..1000) }
std_sorted_array_sm = Array(Int32).new(8) { rand(0..1000) }
This file has been truncated. show original
There are some definite performance gains in some cases but I have some questions.
Some cases are worse
Some scaling is worse then array
Some cases are not as performant then I thought
I was looking into performance gains from StaticArray but I don’t think I totally understand the tradeoffs and the intuition of when they would not perform as well compared to Arrays.
Do you have an automated solution for the graphing, or did you do it manually?
Here is some expanded benchmarking.
I do not have an automated solution for graphing. I am just putting it into apple numbers.
Okay, so I did get different results when using the benchmarks to modify a global variable (in an attempt to avoid things being optimized away).
Here’s the new code: gist
And here are my timings:
That looks much more in line with what I expect from StaticArray.
that is way more what I expect. It is interesting that an optimization could cause different results.
mselig
July 22, 2021, 6:51am
7
I tried increasing the array sizes by a factor of 10, and got results that seemed much more as-expected:
#bsearch(&block : T -> Bool)
std_sorted_array_sm 95.37M ( 10.49ns) (± 1.64%) 0.0B/op 1.17× slower
std_sorted_array_md 81.29M ( 12.30ns) (± 1.65%) 0.0B/op 1.38× slower
std_sorted_array_lg 70.81M ( 14.12ns) (± 0.90%) 0.0B/op 1.58× slower
std_sorted_array_xlg 63.07M ( 15.86ns) (± 0.94%) 0.0B/op 1.77× slower
std_sorted_array_xxlg 56.40M ( 17.73ns) (± 1.42%) 0.0B/op 1.98× slower
static_sorted_array_sm 111.88M ( 8.94ns) (± 1.39%) 0.0B/op fastest
static_sorted_array_md 94.27M ( 10.61ns) (± 0.86%) 0.0B/op 1.19× slower
static_sorted_array_lg 81.24M ( 12.31ns) (± 0.90%) 0.0B/op 1.38× slower
static_sorted_array_xlg 71.04M ( 14.08ns) (± 1.39%) 0.0B/op 1.57× slower
static_sorted_array_xxlg 63.47M ( 15.76ns) (± 0.47%) 0.0B/op 1.76× slower
#all?(&)
std_array_sm 19.07M ( 52.45ns) (± 1.18%) 0.0B/op 1.06× slower
std_array_md 10.66M ( 93.80ns) (± 0.63%) 0.0B/op 1.89× slower
std_array_lg 5.66M (176.68ns) (± 0.79%) 0.0B/op 3.57× slower
std_array_xlg 2.91M (343.13ns) (± 1.36%) 0.0B/op 6.93× slower
std_array_xxlg 1.49M (673.03ns) (± 0.28%) 0.0B/op 13.59× slower
static_array_sm 20.20M ( 49.51ns) (± 2.16%) 0.0B/op fastest
static_array_md 11.70M ( 85.48ns) (± 0.73%) 0.0B/op 1.73× slower
static_array_lg 6.38M (156.75ns) (± 2.35%) 0.0B/op 3.17× slower
static_array_xlg 3.35M (298.45ns) (± 0.71%) 0.0B/op 6.03× slower
static_array_xxlg 1.71M (584.16ns) (± 1.49%) 0.0B/op 11.80× slower
#count(&)
std_array_sm 15.91M ( 62.86ns) (± 1.34%) 0.0B/op 1.07× slower
std_array_md 8.87M (112.76ns) (± 0.83%) 0.0B/op 1.92× slower
std_array_lg 4.54M (220.25ns) (± 1.48%) 0.0B/op 3.75× slower
std_array_xlg 2.31M (432.84ns) (± 1.20%) 0.0B/op 7.37× slower
std_array_xxlg 1.16M (858.90ns) (± 0.91%) 0.0B/op 14.63× slower
static_array_sm 17.03M ( 58.71ns) (± 1.38%) 0.0B/op fastest
static_array_md 8.89M (112.54ns) (± 1.50%) 0.0B/op 1.92× slower
static_array_lg 4.58M (218.41ns) (± 0.75%) 0.0B/op 3.72× slower
static_array_xlg 2.33M (429.74ns) (± 2.39%) 0.0B/op 7.32× slower
static_array_xxlg 1.16M (859.57ns) (± 1.37%) 0.0B/op 14.64× slower
#first
std_array_sm 563.27M ( 1.78ns) (± 1.00%) 0.0B/op 1.00× slower
std_array_md 561.96M ( 1.78ns) (± 1.60%) 0.0B/op 1.00× slower
std_array_lg 563.98M ( 1.77ns) (± 0.56%) 0.0B/op fastest
std_array_xlg 562.01M ( 1.78ns) (± 1.57%) 0.0B/op 1.00× slower
std_array_xxlg 563.63M ( 1.77ns) (± 0.86%) 0.0B/op 1.00× slower
static_array_sm 563.28M ( 1.78ns) (± 0.98%) 0.0B/op 1.00× slower
static_array_md 562.72M ( 1.78ns) (± 1.54%) 0.0B/op 1.00× slower
static_array_lg 563.13M ( 1.78ns) (± 1.03%) 0.0B/op 1.00× slower
static_array_xlg 562.42M ( 1.78ns) (± 1.52%) 0.0B/op 1.00× slower
static_array_xxlg 563.33M ( 1.78ns) (± 1.07%) 0.0B/op 1.00× slower
#minmax
std_array_sm 28.62M ( 34.94ns) (± 1.42%) 0.0B/op 19.68× slower
std_array_md 16.49M ( 60.64ns) (± 1.07%) 0.0B/op 34.17× slower
std_array_lg 9.44M (105.93ns) (± 1.47%) 0.0B/op 59.69× slower
std_array_xlg 4.92M (203.10ns) (± 0.75%) 0.0B/op 114.44× slower
std_array_xxlg 2.56M (390.23ns) (± 1.29%) 0.0B/op 219.88× slower
static_array_sm 563.46M ( 1.77ns) (± 1.00%) 0.0B/op fastest
static_array_md 16.60M ( 60.26ns) (± 0.93%) 0.0B/op 33.95× slower
static_array_lg 9.29M (107.64ns) (± 1.06%) 0.0B/op 60.65× slower
static_array_xlg 4.99M (200.57ns) (± 0.45%) 0.0B/op 113.01× slower
static_array_xxlg 2.55M (392.54ns) (± 1.41%) 0.0B/op 221.18× slower
#reduce(memo, &)
std_array_sm 18.63M ( 53.66ns) (± 1.18%) 0.0B/op 1.06× slower
std_array_md 10.15M ( 98.55ns) (± 0.68%) 0.0B/op 1.95× slower
std_array_lg 5.35M (187.00ns) (± 0.48%) 0.0B/op 3.70× slower
std_array_xlg 2.75M (363.59ns) (± 1.24%) 0.0B/op 7.20× slower
std_array_xxlg 1.40M (716.62ns) (± 0.84%) 0.0B/op 14.19× slower
static_array_sm 19.80M ( 50.50ns) (± 1.24%) 0.0B/op fastest
static_array_md 11.48M ( 87.13ns) (± 2.10%) 0.0B/op 1.73× slower
static_array_lg 6.21M (161.01ns) (± 1.13%) 0.0B/op 3.19× slower
static_array_xlg 3.34M (299.51ns) (± 1.46%) 0.0B/op 5.93× slower
static_array_xxlg 1.66M (602.03ns) (± 0.81%) 0.0B/op 11.92× slower
Sorry, I didn’t graph them, but it seems to show that StaticArray is perhaps slightly faster than Array, but there isn’t really much in it. This is to be expected - stack allocation should be much faster, and less pressure on garbage collection, but these benchmarks don’t measure that.