StaticArray vs Array

I have been watching and reading some articles about performance. In summary using array like data structures on the stack instead of the heap.

Here is a link to a blog with some more in depth explanation with a link to the video that first got me interested in this. Small vector optimization | Prog stuff

So I wanted to check what the performance is like in Crystal and got some interesting results. Here is a link to some benchmarking code and some graphing.

There are some definite performance gains in some cases but I have some questions.

  • Some cases are worse
  • Some scaling is worse then array
  • Some cases are not as performant then I thought

I was looking into performance gains from StaticArray but I don’t think I totally understand the tradeoffs and the intuition of when they would not perform as well compared to Arrays.

Do you have an automated solution for the graphing, or did you do it manually?

Here is some expanded benchmarking.

I do not have an automated solution for graphing. I am just putting it into apple numbers.

Okay, so I did get different results when using the benchmarks to modify a global variable (in an attempt to avoid things being optimized away).

Here’s the new code: gist

And here are my timings:

That looks much more in line with what I expect from StaticArray.

that is way more what I expect. It is interesting that an optimization could cause different results.

I tried increasing the array sizes by a factor of 10, and got results that seemed much more as-expected:

#bsearch(&block : T -> Bool)
     std_sorted_array_sm  95.37M ( 10.49ns) (± 1.64%)  0.0B/op   1.17× slower
     std_sorted_array_md  81.29M ( 12.30ns) (± 1.65%)  0.0B/op   1.38× slower
     std_sorted_array_lg  70.81M ( 14.12ns) (± 0.90%)  0.0B/op   1.58× slower
    std_sorted_array_xlg  63.07M ( 15.86ns) (± 0.94%)  0.0B/op   1.77× slower
   std_sorted_array_xxlg  56.40M ( 17.73ns) (± 1.42%)  0.0B/op   1.98× slower
  static_sorted_array_sm 111.88M (  8.94ns) (± 1.39%)  0.0B/op        fastest
  static_sorted_array_md  94.27M ( 10.61ns) (± 0.86%)  0.0B/op   1.19× slower
  static_sorted_array_lg  81.24M ( 12.31ns) (± 0.90%)  0.0B/op   1.38× slower
 static_sorted_array_xlg  71.04M ( 14.08ns) (± 1.39%)  0.0B/op   1.57× slower
static_sorted_array_xxlg  63.47M ( 15.76ns) (± 0.47%)  0.0B/op   1.76× slower
#all?(&)
     std_array_sm  19.07M ( 52.45ns) (± 1.18%)  0.0B/op   1.06× slower
     std_array_md  10.66M ( 93.80ns) (± 0.63%)  0.0B/op   1.89× slower
     std_array_lg   5.66M (176.68ns) (± 0.79%)  0.0B/op   3.57× slower
    std_array_xlg   2.91M (343.13ns) (± 1.36%)  0.0B/op   6.93× slower
   std_array_xxlg   1.49M (673.03ns) (± 0.28%)  0.0B/op  13.59× slower
  static_array_sm  20.20M ( 49.51ns) (± 2.16%)  0.0B/op        fastest
  static_array_md  11.70M ( 85.48ns) (± 0.73%)  0.0B/op   1.73× slower
  static_array_lg   6.38M (156.75ns) (± 2.35%)  0.0B/op   3.17× slower
 static_array_xlg   3.35M (298.45ns) (± 0.71%)  0.0B/op   6.03× slower
static_array_xxlg   1.71M (584.16ns) (± 1.49%)  0.0B/op  11.80× slower
#count(&)
     std_array_sm  15.91M ( 62.86ns) (± 1.34%)  0.0B/op   1.07× slower
     std_array_md   8.87M (112.76ns) (± 0.83%)  0.0B/op   1.92× slower
     std_array_lg   4.54M (220.25ns) (± 1.48%)  0.0B/op   3.75× slower
    std_array_xlg   2.31M (432.84ns) (± 1.20%)  0.0B/op   7.37× slower
   std_array_xxlg   1.16M (858.90ns) (± 0.91%)  0.0B/op  14.63× slower
  static_array_sm  17.03M ( 58.71ns) (± 1.38%)  0.0B/op        fastest
  static_array_md   8.89M (112.54ns) (± 1.50%)  0.0B/op   1.92× slower
  static_array_lg   4.58M (218.41ns) (± 0.75%)  0.0B/op   3.72× slower
 static_array_xlg   2.33M (429.74ns) (± 2.39%)  0.0B/op   7.32× slower
static_array_xxlg   1.16M (859.57ns) (± 1.37%)  0.0B/op  14.64× slower
#first
     std_array_sm 563.27M (  1.78ns) (± 1.00%)  0.0B/op   1.00× slower
     std_array_md 561.96M (  1.78ns) (± 1.60%)  0.0B/op   1.00× slower
     std_array_lg 563.98M (  1.77ns) (± 0.56%)  0.0B/op        fastest
    std_array_xlg 562.01M (  1.78ns) (± 1.57%)  0.0B/op   1.00× slower
   std_array_xxlg 563.63M (  1.77ns) (± 0.86%)  0.0B/op   1.00× slower
  static_array_sm 563.28M (  1.78ns) (± 0.98%)  0.0B/op   1.00× slower
  static_array_md 562.72M (  1.78ns) (± 1.54%)  0.0B/op   1.00× slower
  static_array_lg 563.13M (  1.78ns) (± 1.03%)  0.0B/op   1.00× slower
 static_array_xlg 562.42M (  1.78ns) (± 1.52%)  0.0B/op   1.00× slower
static_array_xxlg 563.33M (  1.78ns) (± 1.07%)  0.0B/op   1.00× slower
#minmax
     std_array_sm  28.62M ( 34.94ns) (± 1.42%)  0.0B/op   19.68× slower
     std_array_md  16.49M ( 60.64ns) (± 1.07%)  0.0B/op   34.17× slower
     std_array_lg   9.44M (105.93ns) (± 1.47%)  0.0B/op   59.69× slower
    std_array_xlg   4.92M (203.10ns) (± 0.75%)  0.0B/op  114.44× slower
   std_array_xxlg   2.56M (390.23ns) (± 1.29%)  0.0B/op  219.88× slower
  static_array_sm 563.46M (  1.77ns) (± 1.00%)  0.0B/op         fastest
  static_array_md  16.60M ( 60.26ns) (± 0.93%)  0.0B/op   33.95× slower
  static_array_lg   9.29M (107.64ns) (± 1.06%)  0.0B/op   60.65× slower
 static_array_xlg   4.99M (200.57ns) (± 0.45%)  0.0B/op  113.01× slower
static_array_xxlg   2.55M (392.54ns) (± 1.41%)  0.0B/op  221.18× slower
#reduce(memo, &)
     std_array_sm  18.63M ( 53.66ns) (± 1.18%)  0.0B/op   1.06× slower
     std_array_md  10.15M ( 98.55ns) (± 0.68%)  0.0B/op   1.95× slower
     std_array_lg   5.35M (187.00ns) (± 0.48%)  0.0B/op   3.70× slower
    std_array_xlg   2.75M (363.59ns) (± 1.24%)  0.0B/op   7.20× slower
   std_array_xxlg   1.40M (716.62ns) (± 0.84%)  0.0B/op  14.19× slower
  static_array_sm  19.80M ( 50.50ns) (± 1.24%)  0.0B/op        fastest
  static_array_md  11.48M ( 87.13ns) (± 2.10%)  0.0B/op   1.73× slower
  static_array_lg   6.21M (161.01ns) (± 1.13%)  0.0B/op   3.19× slower
 static_array_xlg   3.34M (299.51ns) (± 1.46%)  0.0B/op   5.93× slower
static_array_xxlg   1.66M (602.03ns) (± 0.81%)  0.0B/op  11.92× slower

Sorry, I didn’t graph them, but it seems to show that StaticArray is perhaps slightly faster than Array, but there isn’t really much in it. This is to be expected - stack allocation should be much faster, and less pressure on garbage collection, but these benchmarks don’t measure that.