Thursday, June 21, 2012

Improving performance of std::vector

This article tells about a simple yet efficient way to optimize many cases of using std::vector by replacing it with the class that has very similar interface. That makes the transition trivial.
Cases that benefit from this optimization are:
  1. Working with very small vectors, e.g. when vector grows up to few items in majority of cases, but should not be limited in size anyway (e.g. by using std::array);
  2. Heavily-used stack-based vectors.

In-depths look at std::vector

Most vectors implementation (I said 'most', but I don't know a single one that differs significantly) have a memory layout that is similar to the following:
std::vector memory layout

There are two separate memory blocks:
  • First one contains pointer to the first item, pointer that points past the last item and pointer that points past the end of allocated buffer. It might be allocated on stack if std::vector is declared as local variable;
  • Second block contains all items and its size equals to total size of all items that the vector reserved memory for.
Note that actual contents of first block are implementation-specific, e.g. some implementations prefer to store vector size as an additional field. Also Alignment padding is optional variable-sized block and its size depends on compiler, run-time and target platform.
This is very good and robust implementation.
The problem with that implementation is that it requires dynamic memory allocation when the first item is added to the vector and also when reserved space is not enough expand a vector with new items.
This is usually not a problem because most of all std::vector implementations are clever enough not to grow linearly (e.g. by only 1 item when new item is added), and they tend to grow exponentially.
Also there is a well-known function reserve() method that allows to reserve certain amount of memory at the convenient time to make sure adding new items is not slowed down by memory allocations.
However there are few problems that still arise even when using reserve() properly:
  • There are still two memory blocks allocated instead of one. And each allocation brings memory overhead. Look into this post to see overhead estimations for few compilers and environments;
  • As reservation size grows memory overhead for the whole system grows too (especially when majority of vectors are empty in run-time). 

Small vector optimization

While I was looking at the performance-critical piece of code that dealt with syntactic analysis of natural language texts I noticed that most of std::vectors aggregated within syntactic node class are very small (e.g. in roughly 50% of cases they were empty, and in 49% of cases they grew just to contain 1 item).
I though that it would be great to get this item pre-allocated at the same memory block as the main vector.
This will improve both performance (because of reduced number of allocations and better data locality) and memory consumption of the system (as syntactic trees gets very big when dealing with larger texts).
After quick search I found that llvm project has exactly what I was looking for.
It has the class called llvm::SmallVector that implemented following concept:
  1. During template specialization number of items that should be pre-allocated is specified;
  2. llvm::SmallVector can grow up to the pre-allocated size without additional dynamic memory allocations;
  3. If items are added above the pre-allocated size new memory block is allocated on heap and all items are copied there.
That was exactly what I was looking for. Putting this class in place of std::vector was trivial, I just had to replace original declaration
    typedef std:vector<CAnnotation*> AnnotationPtrVector;
with the new one
    typedef llvm::SmallVector<CAnnotation*, 1> AnnotationPtrVector;
Please note that I was dealing with the vector of pointers, so the one pre-allocated pointer occupied exactly the same space as was previously occupied with Alignment padding memory block in original vector, so sizeof(AnnotationPtrVector) did not change (for some additional details on this please read on).
Obviously the performance increase was huge (the part that filled such vectors got twice faster, and overall system performance increase was about 10%).
Here is the memory layout of llvm::SmallVector class in two cases - when one item is added (and one is pre-allocated), and when more than 1 items are added.

Note that the latter case is very similar to original std::vector's layout.

Stack-based vector optimization

Another possible application of llvm::SmallVector class is heavily-updated vectors that are allocated on stack, e.g. to store some temporary data for further processing.
In this case we are (mostly) unrestricted with number of pre-allocated items (as long as the function is non-recursive), and can completely get rid of dynamic memory allocations even for hundreds of items.
This, of course, yields great performance results (and also reduces heap fragmentation levels).

Further improvements to llvm::SmallVector

When I looked at llvm::SmallVector I noticed few problems that I had to fix to make it applicable for my projects:
  1. The minimum number of pre-allocated items was two, not one, so all my vectors grew substantially when replaced with llvm::SmallVector;
  2. My application was a long-running server app so it used pools to store unused objects to reduce number of dynamic memory allocations. The problem that usual way to free all memory allocated for vec which is std::vector:
    std::vector<…>().swap(vec)
    did not work for llvm::SmallVector.
So I had to make a few tweaks to overcome these issues. Updated source code could be downloaded here. It has the same license as original code from LLVM, so you can freely use it in any open- or closed-source applications.
The only drawback of my modifications is that when we try to pre-allocate only 1 item zero-sized array is created within a llvm::SmallVector instance, and it produces a warning on the most compilers.
Also it might make sense to get rid of Boost dependency, but I believe that Boost is used in many modern C++ projects, so did not waste time on it.

Other implementations

Recently I have found a class that implemented exactly the same concept in Folly C++ library released for public use by Facebook.
I saw few issues with it, but it might be still usable in certain projects (I bet it is, as Facebook uses this library internally):
  1. It's dynamically allocated memory block can't be freed with usual swap() trick I mentioned above
  2. It makes a heavy use of Boost MPL which might not be available in all projects
  3. Many of the methods are surrounded by try / catch(…) blocks which might hide memory errors and also reduce performance (for Visual C++-compiler projects at least).
Please let me know if you come by similar classes in other libraries, I will add more to the article.

Happy optimizing!