StaticArray vs Array

I have been watching and reading some articles about performance. In summary using array like data structures on the stack instead of the heap.

Here is a link to a blog with some more in depth explanation with a link to the video that first got me interested in this. Small vector optimization | Prog stuff

So I wanted to check what the performance is like in Crystal and got some interesting results. Here is a link to some benchmarking code and some graphing.

There are some definite performance gains in some cases but I have some questions.

  • Some cases are worse
  • Some scaling is worse then array
  • Some cases are not as performant then I thought

I was looking into performance gains from StaticArray but I don’t think I totally understand the tradeoffs and the intuition of when they would not perform as well compared to Arrays.

2 Likes

Do you have an automated solution for the graphing, or did you do it manually?

Here is some expanded benchmarking.

I do not have an automated solution for graphing. I am just putting it into apple numbers.

Okay, so I did get different results when using the benchmarks to modify a global variable (in an attempt to avoid things being optimized away).

Here’s the new code: gist

And here are my timings:

That looks much more in line with what I expect from StaticArray.

3 Likes

that is way more what I expect. It is interesting that an optimization could cause different results.

I tried increasing the array sizes by a factor of 10, and got results that seemed much more as-expected:

#bsearch(&block : T -> Bool)
     std_sorted_array_sm  95.37M ( 10.49ns) (± 1.64%)  0.0B/op   1.17× slower
     std_sorted_array_md  81.29M ( 12.30ns) (± 1.65%)  0.0B/op   1.38× slower
     std_sorted_array_lg  70.81M ( 14.12ns) (± 0.90%)  0.0B/op   1.58× slower
    std_sorted_array_xlg  63.07M ( 15.86ns) (± 0.94%)  0.0B/op   1.77× slower
   std_sorted_array_xxlg  56.40M ( 17.73ns) (± 1.42%)  0.0B/op   1.98× slower
  static_sorted_array_sm 111.88M (  8.94ns) (± 1.39%)  0.0B/op        fastest
  static_sorted_array_md  94.27M ( 10.61ns) (± 0.86%)  0.0B/op   1.19× slower
  static_sorted_array_lg  81.24M ( 12.31ns) (± 0.90%)  0.0B/op   1.38× slower
 static_sorted_array_xlg  71.04M ( 14.08ns) (± 1.39%)  0.0B/op   1.57× slower
static_sorted_array_xxlg  63.47M ( 15.76ns) (± 0.47%)  0.0B/op   1.76× slower
#all?(&)
     std_array_sm  19.07M ( 52.45ns) (± 1.18%)  0.0B/op   1.06× slower
     std_array_md  10.66M ( 93.80ns) (± 0.63%)  0.0B/op   1.89× slower
     std_array_lg   5.66M (176.68ns) (± 0.79%)  0.0B/op   3.57× slower
    std_array_xlg   2.91M (343.13ns) (± 1.36%)  0.0B/op   6.93× slower
   std_array_xxlg   1.49M (673.03ns) (± 0.28%)  0.0B/op  13.59× slower
  static_array_sm  20.20M ( 49.51ns) (± 2.16%)  0.0B/op        fastest
  static_array_md  11.70M ( 85.48ns) (± 0.73%)  0.0B/op   1.73× slower
  static_array_lg   6.38M (156.75ns) (± 2.35%)  0.0B/op   3.17× slower
 static_array_xlg   3.35M (298.45ns) (± 0.71%)  0.0B/op   6.03× slower
static_array_xxlg   1.71M (584.16ns) (± 1.49%)  0.0B/op  11.80× slower
#count(&)
     std_array_sm  15.91M ( 62.86ns) (± 1.34%)  0.0B/op   1.07× slower
     std_array_md   8.87M (112.76ns) (± 0.83%)  0.0B/op   1.92× slower
     std_array_lg   4.54M (220.25ns) (± 1.48%)  0.0B/op   3.75× slower
    std_array_xlg   2.31M (432.84ns) (± 1.20%)  0.0B/op   7.37× slower
   std_array_xxlg   1.16M (858.90ns) (± 0.91%)  0.0B/op  14.63× slower
  static_array_sm  17.03M ( 58.71ns) (± 1.38%)  0.0B/op        fastest
  static_array_md   8.89M (112.54ns) (± 1.50%)  0.0B/op   1.92× slower
  static_array_lg   4.58M (218.41ns) (± 0.75%)  0.0B/op   3.72× slower
 static_array_xlg   2.33M (429.74ns) (± 2.39%)  0.0B/op   7.32× slower
static_array_xxlg   1.16M (859.57ns) (± 1.37%)  0.0B/op  14.64× slower
#first
     std_array_sm 563.27M (  1.78ns) (± 1.00%)  0.0B/op   1.00× slower
     std_array_md 561.96M (  1.78ns) (± 1.60%)  0.0B/op   1.00× slower
     std_array_lg 563.98M (  1.77ns) (± 0.56%)  0.0B/op        fastest
    std_array_xlg 562.01M (  1.78ns) (± 1.57%)  0.0B/op   1.00× slower
   std_array_xxlg 563.63M (  1.77ns) (± 0.86%)  0.0B/op   1.00× slower
  static_array_sm 563.28M (  1.78ns) (± 0.98%)  0.0B/op   1.00× slower
  static_array_md 562.72M (  1.78ns) (± 1.54%)  0.0B/op   1.00× slower
  static_array_lg 563.13M (  1.78ns) (± 1.03%)  0.0B/op   1.00× slower
 static_array_xlg 562.42M (  1.78ns) (± 1.52%)  0.0B/op   1.00× slower
static_array_xxlg 563.33M (  1.78ns) (± 1.07%)  0.0B/op   1.00× slower
#minmax
     std_array_sm  28.62M ( 34.94ns) (± 1.42%)  0.0B/op   19.68× slower
     std_array_md  16.49M ( 60.64ns) (± 1.07%)  0.0B/op   34.17× slower
     std_array_lg   9.44M (105.93ns) (± 1.47%)  0.0B/op   59.69× slower
    std_array_xlg   4.92M (203.10ns) (± 0.75%)  0.0B/op  114.44× slower
   std_array_xxlg   2.56M (390.23ns) (± 1.29%)  0.0B/op  219.88× slower
  static_array_sm 563.46M (  1.77ns) (± 1.00%)  0.0B/op         fastest
  static_array_md  16.60M ( 60.26ns) (± 0.93%)  0.0B/op   33.95× slower
  static_array_lg   9.29M (107.64ns) (± 1.06%)  0.0B/op   60.65× slower
 static_array_xlg   4.99M (200.57ns) (± 0.45%)  0.0B/op  113.01× slower
static_array_xxlg   2.55M (392.54ns) (± 1.41%)  0.0B/op  221.18× slower
#reduce(memo, &)
     std_array_sm  18.63M ( 53.66ns) (± 1.18%)  0.0B/op   1.06× slower
     std_array_md  10.15M ( 98.55ns) (± 0.68%)  0.0B/op   1.95× slower
     std_array_lg   5.35M (187.00ns) (± 0.48%)  0.0B/op   3.70× slower
    std_array_xlg   2.75M (363.59ns) (± 1.24%)  0.0B/op   7.20× slower
   std_array_xxlg   1.40M (716.62ns) (± 0.84%)  0.0B/op  14.19× slower
  static_array_sm  19.80M ( 50.50ns) (± 1.24%)  0.0B/op        fastest
  static_array_md  11.48M ( 87.13ns) (± 2.10%)  0.0B/op   1.73× slower
  static_array_lg   6.21M (161.01ns) (± 1.13%)  0.0B/op   3.19× slower
 static_array_xlg   3.34M (299.51ns) (± 1.46%)  0.0B/op   5.93× slower
static_array_xxlg   1.66M (602.03ns) (± 0.81%)  0.0B/op  11.92× slower

Sorry, I didn’t graph them, but it seems to show that StaticArray is perhaps slightly faster than Array, but there isn’t really much in it. This is to be expected - stack allocation should be much faster, and less pressure on garbage collection, but these benchmarks don’t measure that.