Sunday, June 16, 2013

On pooling of objects that contain vectors

Recently I had an interesting exercise in de-leaking our rather complex linguistic analysis system. The system was actually rather solid and almost didn't leak, but under heavy production load its memory usage grew slowly but steadily.

My first approach was to use free leak-detection tools like Visual Leak Detector, but they didn't show any leaks: after application termination all objects were freed.

My next attempt was to analyze the growth pattern of memory allocations. I created a small script to dump memory allocations every five minutes with UMDH (which can be found in Debugging Tools for Windows). After running this overnight I parsed the outputs with another script and dumped them into a huge Excel spreadsheet for further analysis.

I was surprised to see that only a few allocations had a growing trend, and they all belonged to the memory allocator within std::vector.

After a few days I finally realized that it was not the size of the vectors that was growing, but their capacity. The problem was with a rather complex, heavy-to-construct object that had a few vectors in it. As the object was really complex and the algorithm had to run very fast, these objects were not constructed each time but taken from a pool.

Of course, before an object was put back into the pool it was cleared (i.e. the vector was empty after calling its clear() method), but the underlying memory buffer was not released. This would not be a problem if the vector always had a similar size; in our case, however, the average size was about 1.5 elements, but the maximum sometimes went up to 100.

This means that after running long enough, each pooled object eventually contains an empty vector holding a buffer large enough for 100 elements.

Of course the fix was trivial, and the technique is mentioned in Effective STL by Scott Meyers. Instead of calling std::vector::clear() I did the following:
std::vector<MyType>().swap(m_vMyTypes);
which swaps a newly constructed (and thus empty) vector with the existing one, freeing all memory that was held. A similar effect can be observed with std::deque: it also does not free allocated memory after a call to clear().
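
A minimal illustration of the difference (the exact capacity values are implementation-specific):

#include <iostream>
#include <vector>

int main()
{
    std::vector<int> v(100);    // capacity grows to at least 100
    v.clear();                  // size is now 0, but capacity is unchanged
    std::cout << v.capacity() << '\n';  // typically prints 100

    std::vector<int>().swap(v); // swap with a freshly constructed vector
    std::cout << v.capacity() << '\n';  // prints 0: the buffer is released
    return 0;
}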

All this means you should be very careful when putting complex objects that contain STL collections into object pools.

Tuesday, September 25, 2012

Premium C++ error detection is now available for Visual Studio 2012

Just two weeks ahead of the general public release of Visual Studio 2012, the small Russia-based company OOO "Program Verification Systems" (Co Ltd) announced the availability of a new version of their flagship product, PVS-Studio, which fully integrates into VS2012. PVS-Studio is a product that performs early diagnosis of programming errors in C/C++/C++11 code.

This is the first third-party static analysis tool for C++ available on the new platform, complementing the built-in static analysis that is now found in all editions of Visual Studio.

Besides VS2012 support, this new version adds a large number of new diagnostic rules to the vast set of heuristics already available. Andrey Karpov, CTO of the company, mentioned that "our rapid development cycle (we usually release a new version each month) allows lightning-fast delivery of heuristics for detecting new error patterns that our researchers discover. As our updates are free for all existing customers, this greatly increases the value of the product".

Considering the no-obligations trial download (you won't even be asked to enter your e-mail!), this product is worth trying on projects of any size and complexity.

Contributing back to the ecosystem

The team that created PVS-Studio is well known within C++ communities for invaluable contributions to various open-source projects. They use these projects as a playground when developing new rule sets, and always submit the bugs they discover, helping to make these products better.

While working on the newest version they validated the source code of all C++ libraries (including MFC) shipped with Visual Studio 2012, and to the team's sheer surprise a few issues popped up despite the overall high quality of the code. These issues are rather typical and were easily caught by PVS-Studio. A detailed description of the issues can be found in their recent blog post, and a bug was filed on the Microsoft Connect site. Hopefully these issues will be fixed right in the next update for Visual Studio, as some of them might actually pose a stability or security threat to applications built with these libraries.

Meanwhile, the team that created PVS-Studio continues its fight to improve the quality of native code.

Thursday, September 20, 2012

The cost of freeing memory

There is a common topic of discussion among C/C++ developers: avoiding extra memory allocations to optimize an application's performance. Indeed, careless memory allocations can kill the performance of the best algorithm out there.

During one of the performance exercises on my project I noticed a strange hotspot in the place where memory is freed. Later I did a brief investigation of the matter and found no further information except this short discussion on stackoverflow: http://stackoverflow.com/questions/5625842/cost-of-memory-deallocation-and-potential-compiler-optimizations-c. It mentions that although there are no guarantees about the cost of freeing memory, developers might expect it to be no higher than the cost of memory allocation.

As there were no numbers out there, I decided to do a simple benchmark, and the results were quite surprising. When the number of allocations was small the cost of deallocation was below the cost of allocation, but as the numbers approached millions of allocations the cost of freeing memory increased significantly (this, of course, depends on the compiler used).

Without further ado here are the numbers:

# of allocations    Visual C++ 10    Intel C++ 12.1
1K                  51.44%           46.95%
10K                 97.61%           96.55%
100K                100.13%          100.95%
1M                  106.33%          105.95%
2M                  134.35%          102.48%
3M                  157.52%          109.76%
4M                  176.04%          113.73%
5M                  193.23%          119.14%

The numbers here are the relative costs of freeing memory compared to the cost of allocation. For example, 50% means that deallocation takes half as long as allocation.

As you can see, with the Visual C++ compiler deallocation might take almost twice as long as allocation! Intel's compiler also shows a growing deallocation cost, but the growth is much slower.

What's the practical application of this knowledge? In rare cases where memory deallocation becomes a bottleneck, it might be useful to emulate manual garbage collection: remember all blocks that are to be freed, and then, when the application is idle, do the clean-up. I'm not sure any real-world application would benefit from such an optimization though :)
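
A sketch of the idea (the class and its names are mine, purely illustrative):

#include <cstdlib>
#include <vector>

class DeferredFree
{
public:
    // instead of calling free(p) right away, queue the block for later
    void Defer(void* p) { m_blocks.push_back(p); }

    // call this when the application is idle
    void Flush()
    {
        for (size_t i = 0; i < m_blocks.size(); ++i)
            free(m_blocks[i]);
        m_blocks.clear();
    }

private:
    std::vector<void*> m_blocks;
};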

Another possible optimization is simply avoiding the cost of freeing memory when the application is terminating: when a process exits, its memory is reclaimed by the OS in any case. This might speed up shutdown in some cases.

For those who want to know, here is some information about the benchmark application:

1) It allocates the needed number of memory blocks via malloc() calls
2) Block sizes are random, between 1 and 1024 bytes
3) Deallocation order is also random
4) All random number generation is taken out of the timed loops (i.e. the vectors with block sizes and deallocation order are prepared before the memory allocation starts); a simplified sketch follows below
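
A simplified sketch of such a benchmark (my reconstruction from the description above; the original code may differ in details):

#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <vector>

int main()
{
    const size_t count = 1000000;  // number of allocations, e.g. 1M
    std::vector<size_t> sizes(count);
    std::vector<size_t> order(count);
    for (size_t i = 0; i < count; ++i)
    {
        sizes[i] = 1 + rand() % 1024;  // random block size: 1..1024 bytes
        order[i] = i;
    }
    std::random_shuffle(order.begin(), order.end());  // random free order

    std::vector<void*> blocks(count);
    clock_t t0 = clock();
    for (size_t i = 0; i < count; ++i)  // timed allocation loop
        blocks[i] = malloc(sizes[i]);
    clock_t t1 = clock();
    for (size_t i = 0; i < count; ++i)  // timed deallocation loop
        free(blocks[order[i]]);
    clock_t t2 = clock();

    std::cout << "free vs. malloc cost: "
              << 100.0 * (t2 - t1) / (t1 - t0) << "%\n";
    return 0;
}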

Thursday, June 21, 2012

Improving performance of std::vector

This article describes a simple yet effective way to optimize many uses of std::vector by replacing it with a class that has a very similar interface, which makes the transition trivial.
The cases that benefit from this optimization are:
  1. Very small vectors, e.g. when the vector grows to only a few items in the majority of cases, but must not be limited in size (as it would be with std::array);
  2. Heavily-used stack-based vectors.

In-depth look at std::vector

Most vector implementations (I say 'most', though I don't know a single one that differs significantly) have a memory layout similar to the following:
[Figure: std::vector memory layout]

There are two separate memory blocks:
  • The first one contains a pointer to the first item, a pointer past the last item, and a pointer past the end of the allocated buffer. It may reside on the stack if the std::vector is declared as a local variable;
  • The second block contains all the items, and its size equals the total size of all items the vector has reserved memory for.
Note that the actual contents of the first block are implementation-specific; e.g. some implementations prefer to store the vector size as an additional field. The alignment padding is an optional variable-sized block whose size depends on the compiler, run-time, and target platform.
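
In code, the first block can be sketched roughly like this (a simplified illustration of mine; real implementations differ in details):

template <typename T>
struct VectorHeader   // hypothetical name, not a real standard type
{
    T* first;   // points to the first item
    T* last;    // points past the last item (size() == last - first)
    T* end;     // points past the end of the buffer (capacity() == end - first)
};
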
This is a very good and robust implementation.
The problem with it is that it requires a dynamic memory allocation when the first item is added to the vector, and again whenever the reserved space is not enough to hold new items.
This is usually not a problem, because most std::vector implementations are clever enough not to grow linearly (e.g. by only one item each time a new item is added); they tend to grow exponentially instead.
Also, the well-known reserve() method allows reserving a certain amount of memory at a convenient time, to make sure adding new items is not slowed down by memory allocations.
However, a few problems arise even when reserve() is used properly:
  • There are still two memory blocks allocated instead of one, and each allocation brings memory overhead. Look into this post to see overhead estimates for a few compilers and environments;
  • As the reservation size grows, memory overhead for the whole system grows too (especially when the majority of vectors are empty at run-time).

Small vector optimization

While looking at a performance-critical piece of code that dealt with syntactic analysis of natural-language texts, I noticed that most of the std::vectors aggregated within the syntactic node class were very small (in roughly 50% of cases they were empty, and in 49% of cases they grew to contain just one item).
I thought it would be great to get this one item pre-allocated in the same memory block as the main vector.
This would improve both the performance (because of the reduced number of allocations and better data locality) and the memory consumption of the system (as syntactic trees get very big when dealing with larger texts).
After a quick search I found that the LLVM project has exactly what I was looking for.
It has a class called llvm::SmallVector that implements the following concept:
  1. The number of items to pre-allocate is specified as a template parameter;
  2. llvm::SmallVector can grow up to the pre-allocated size without additional dynamic memory allocations;
  3. If items are added beyond the pre-allocated size, a new memory block is allocated on the heap and all items are copied there.
That was exactly what I was looking for. Putting this class in place of std::vector was trivial; I just had to replace the original declaration
    typedef std::vector<CAnnotation*> AnnotationPtrVector;
with the new one
    typedef llvm::SmallVector<CAnnotation*, 1> AnnotationPtrVector;
Please note that I was dealing with a vector of pointers, so the one pre-allocated pointer occupied exactly the space previously taken by the alignment padding block in the original vector, and sizeof(AnnotationPtrVector) did not change (for some additional details on this, read on).
Obviously the performance increase was huge: the part that filled such vectors got twice as fast, and the overall system performance increased by about 10%.
Here is the memory layout of the llvm::SmallVector class in two cases: when one item is added (and one is pre-allocated), and when more than one item is added.

Note that the latter case is very similar to the original std::vector's layout.
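
A minimal usage sketch (assuming the LLVM ADT headers are available; CAnnotation here is just a stand-in for the real class):

#include "llvm/ADT/SmallVector.h"

class CAnnotation {};  // stand-in for the real class

int main()
{
    llvm::SmallVector<CAnnotation*, 1> v;  // one pointer pre-allocated inline
    CAnnotation a, b;
    v.push_back(&a);  // fits into the pre-allocated slot: no heap allocation
    v.push_back(&b);  // exceeds the inline capacity: spills to the heap
    return 0;
}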

Stack-based vector optimization

Another possible application of the llvm::SmallVector class is for heavily-updated vectors allocated on the stack, e.g. to store temporary data for further processing.
In this case we are (mostly) unrestricted in the number of pre-allocated items (as long as the function is non-recursive), and can get rid of dynamic memory allocations completely, even for hundreds of items.
This, of course, yields great performance results (and also reduces heap fragmentation).
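
For example, a hypothetical helper like this (my own illustration, not code from the original project) can collect temporary results without touching the heap in the common case:

#include <cstddef>
#include "llvm/ADT/SmallVector.h"

// Collects offsets of spaces in a line. For lines with up to 128 spaces
// everything stays on the stack; longer lines still work correctly,
// they just fall back to a heap allocation.
void FindSpaces(const char* line, llvm::SmallVector<size_t, 128>& out)
{
    for (size_t i = 0; line[i] != '\0'; ++i)
        if (line[i] == ' ')
            out.push_back(i);
}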

Further improvements to llvm::SmallVector

When I looked at llvm::SmallVector I noticed a few problems that I had to fix to make it applicable to my projects:
  1. The minimum number of pre-allocated items was two, not one, so all my vectors grew substantially when replaced with llvm::SmallVector;
  2. My application was a long-running server app, so it used pools of unused objects to reduce the number of dynamic memory allocations. The problem is that the usual way to free all memory allocated for a std::vector vec:
    std::vector<…>().swap(vec)
    did not work for llvm::SmallVector.
So I had to make a few tweaks to overcome these issues. The updated source code can be downloaded here. It has the same license as the original LLVM code, so you can freely use it in any open- or closed-source application.
The only drawback of my modifications is that when we pre-allocate only one item, a zero-sized array is created within the llvm::SmallVector instance, which produces a warning on most compilers.
It might also make sense to get rid of the Boost dependency, but since Boost is used in many modern C++ projects anyway, I did not spend time on it.

Other implementations

Recently I found a class implementing exactly the same concept in the Folly C++ library released for public use by Facebook.
I see a few issues with it, but it might still be usable in certain projects (I bet it is, as Facebook uses this library internally):
  1. Its dynamically allocated memory block can't be freed with the usual swap() trick mentioned above;
  2. It makes heavy use of Boost MPL, which might not be available in all projects;
  3. Many of the methods are wrapped in try / catch(…) blocks, which might hide memory errors and also reduce performance (for projects compiled with Visual C++, at least).
Please let me know if you come across similar classes in other libraries; I will add them to the article.

Happy optimizing!

Monday, May 28, 2012

Memory allocation overhead in C++ - field study

This weekend I had a spare hour or so and made some measurements of the default allocation strategies of the Visual C++ 10 and GCC 4.5 compilers.
Update: added results for the Intel C++ compiler 12.1. They are very similar to GCC's, but two things are interesting about them:

  1. It seems that Intel's Large Object Heap starts with smaller objects
  2. Sometimes 64-bit allocations are smaller than 32-bit ones. Not sure how that happens.

The results were enlightening, to say the least. I always knew about the memory overhead of small allocations, but never thought it was that large.
Just look at the tables below. To create them I allocated 100MB in chunks of different sizes, from 4 bytes (an int on x86) up to 40000 bytes.
The Overhead column shows the average overhead in bytes per allocation; a sketch of how such numbers can be obtained follows right below.
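
Here is a rough sketch of such a measurement for Windows (my approximation; the actual code I used is linked at the end of the post):

#include <windows.h>
#include <psapi.h>
#include <cstdlib>
#include <iostream>
#include <vector>

#pragma comment(lib, "psapi.lib")

static size_t CommittedBytes()
{
    PROCESS_MEMORY_COUNTERS pmc = { sizeof(pmc) };
    GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc));
    return pmc.PagefileUsage;  // process commit charge
}

int main()
{
    const size_t chunk = 4;                  // vary: 4, 40, 400, 4000, 40000
    const size_t count = 100000000 / chunk;  // ~100MB in total
    std::vector<void*> blocks(count);

    size_t before = CommittedBytes();
    for (size_t i = 0; i < count; ++i)
        blocks[i] = malloc(chunk);
    size_t total = CommittedBytes() - before;

    // average overhead beyond the requested chunk size
    std::cout << "overhead per chunk: "
              << (total - count * chunk) / count << " bytes\n";

    for (size_t i = 0; i < count; ++i)
        free(blocks[i]);
    return 0;
}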

Visual C++, 32-bit compilation


Block, bytes    Total size allocated (all chunks + overhead), KB    Overhead, bytes per chunk
4               782788                                              28
40              156568                                              24
400             103708                                              25
4000            98444                                               32
40000           98052                                               162

Visual C++, 64-bit compilation


Block, bytes    Total size allocated (chunks + overhead), KB    Overhead, bytes per chunk
4               1565588                                         60
40              234824                                          56
400             109600                                          49
4000            99044                                           57
40000           98100                                           182

GCC, 32-bit compilation


Block, bytes    Total size allocated (chunks + overhead), KB    Overhead, bytes per chunk
4               390592                                          12
40              117952                                          8
400             99648                                           8
4000            97856                                           8
40000           98432                                           318

Intel C++, 32-bit compilation

Block, bytes    Total size allocated (chunks + overhead), KB    Overhead, bytes per chunk
4               392220                                          12
40              117652                                          8
400             100000                                          10
4000            102196                                          186
40000           98152                                           203

Intel C++, 64-bit compilation

Block, bytes    Total size allocated (chunks + overhead), KB    Overhead, bytes per chunk
4               392828                                          12
40              117620                                          8
400             102008                                          18
4000            102008                                          178
40000           98068                                           169
So here are a few obvious and not-so-obvious thoughts. Of course, they apply when dealing with large data structures that impact the application's performance:

  1. (A very obvious one) Dynamic allocation of small values or not-so-large structures is an extreme memory waste. Maybe pack them into a std::vector and pass iterators around instead?
  2. All node-based structures like std::list, std::map, and std::set that store relatively small items are very memory-inefficient. In most cases they should be replaced by std::vector or something similar (like the sorted vector mentioned by Scott Meyers)
  3. Even if your algorithm seems to require std::list for small items (e.g. because items should be swapped around or added dynamically), start with benchmarking! Even when both a std::vector and a std::list are populated in one place (so the list is packed well and not very fragmented), the processor cache might benefit from the more localized memory footprint of std::vector.
  4. Always create your shared_ptrs with make_shared instead of new. As discussed in a previous post, it is not only faster but also reduces the number of small allocations.

Any other examples we should mention here? All comments are welcome!


PS: Here is the link to the source I used to get this data.

Thursday, March 29, 2012

Improving on shared_ptr

In this post I will describe an optimization I recently discovered and successfully applied in my project. It is very simple to apply but yields very good results, in terms of both performance and memory footprint.

Let's start with the usual way shared_ptrs are created:
class Test
{
public:
    Test(int dummy) {}
};
...

std::shared_ptr<Test> pTest(new Test(1));


Now let's look step by step at what's going on here:
  1. A new object of class Test is allocated and constructed. Note that it is allocated on the heap, and each heap allocation incurs memory overhead beyond sizeof(Test).
  2. The pointer is then passed to the constructor of std::shared_ptr<Test>, which takes ownership of the newly created object.
That constructor of shared_ptr basically does two things: it assigns its internal pointer to the object it will hold, and it allocates (again on the heap) a shared buffer that will hold the reference counters.
As you see, that is another memory allocation: another performance and memory-footprint hit.

So the more efficient way of doing it is:
std::shared_ptr<Test> pTest = std::make_shared<Test>(1);


There are a few benefits of doing this:
  • There is only one memory allocation: both the reference counters and the object itself are placed in a single buffer (this is implementation-dependent, but it is the case for Visual C++ 10)
  • There is a constructor of std::shared_ptr<T> that takes an r-value reference as an argument, making this construct potentially more efficient
What are the drawbacks, you might ask? There are only a few:
  • As Visual C++ 10 (and the beta of VC11) does not support variadic templates, the number of arguments passed to the object's constructor is limited to 10 (and to 5 in VC11 with default settings)
  • boost::make_shared from the Boost library does not support this optimization; even the latest version available now (1.49.0) allocates two separate buffers, one for the object and one for the reference counters
So anyway, if you have boost::shared_ptrs and are using Boost 1.36.0 or later in your project (that's the version in which make_shared was brought from TR1 into Boost), it might still make sense to migrate to make_shared, even if you won't get the benefits right away.

Is it that simple?

No, there is one small pitfall I hit when I migrated my project to std::make_shared.
When an object is created by a factory class, its constructor is usually declared private and the factory class is declared as a friend.
The problem lies in the fact that when the factory creates a new object with make_shared, the constructor is called from some unknown implementation-specific class, and this call fails to compile because we can't declare that class a friend of ours.

The solution is not very elegant (at least for my taste), but it works, and it helps where this optimization makes a real difference in performance.
Here is a code sample:

class Obj;
typedef std::shared_ptr<Obj> ObjPtr;

class Obj
{
...
private:
    struct HideMe {};
    friend class ObjFactory;
public:
    Obj(HideMe, int param)
    { ... }
};

class ObjFactory
{
public:
    static ObjPtr CreateObj(int param)
    {
        return std::make_shared<Obj>(Obj::HideMe(), param);
    }
};
So the actual constructor called by ObjFactory is public, but no other code can invoke it thanks to the hidden structure Obj::HideMe.
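
A quick usage sketch (the argument value is just an example):

ObjPtr p = ObjFactory::CreateObj(42);        // OK: the factory can name HideMe
// std::make_shared<Obj>(Obj::HideMe(), 42); // error elsewhere: HideMe is private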

Happy optimizing!