There is a common topic of discussion for all C/C++ developers - avoiding extra memory allocations to optimize application's performance. Really, unthoughtful memory allocations might kill performance of the best algorithm out there.
During one of the performance exercises I did on my project I noticed a strange hotspot in the place when memory is freed. Later on I did a brief investigation on this matter, and found no further information except this short discussion on stackoverflow: http://stackoverflow.com/questions/5625842/cost-of-memory-deallocation-and-potential-compiler-optimizations-c. It mentioned that although there are no guarantees about the cost of freeing memory, developers might expect it to be no higher than the cost of memory allocation.
As there were no numbers our there I decided to do simple benchmark and results were quite surprising. Although when the number of allocations was small the cost of deallocation was below the cost of the allocation, but as the numbers come close to millions of allocations the cost of freeing memory increases significantly (this of course depends on the compiled used).
Without further ado here are the numbers:
| 
# of allocations | 
Visual C++ 10 | 
Intel C++ 12.1 | 
| 
1K | 
51.44% | 
46.95% | 
| 
10K | 
97.61% | 
96.55% | 
| 
100K | 
100.13% | 
100.95% | 
| 
1M | 
106.33% | 
105.95% | 
| 
2M | 
134.35% | 
102.48% | 
| 
3M | 
157.52% | 
109.76% | 
| 
4M | 
176.04% | 
113.73% | 
| 
5M | 
193.23% | 
119.14% | 
Numbers here are the relative costs of freeing compared to the costs of allocations. For example, 50% means that it takes twice less time to deallocate memory than to allocate it.
As you see, for Visual C++ compiler deallocation might take almost twice more than allocation! Intel's compiler also has a growing deallocation cost, but growth is much slower.
What's the practical application of this knowledge? In rare cases when memory deallocation becomes a bottleneck it might be useful to emulate a manual garbage collection - e.g. remember all blocks that are to be freed, and then, when application is idle, do the clean-up. Not sure if any real world application would benefit from such optimization though :)
Another possible optimization is just avoiding the cost of freeing memory when application is terminating - when process exists its memory will be reclaimed by OS in any case. This might speed up shutdown in some cases.
For those who want to know here is the information about benchmark application:
1) It allocates needed number of memory blocks via malloc() calls2) Block sizes are random, something between 1 and 1024 bytes
3) Deallocation order is also random
4) All random numbers were taken out of the loop (e.g. vectors with the block sizes and deallocation order are prepared before the memory allocation stats)
 
Seems like this was done on Windows - would be quite interesting to have the results for Linux, too.
ReplyDeleteI think the results will be quiet similar, baecause Windows and Linux are using similar heap management algorithms.
DeleteAgree, I don't expect any surprises, but will try to benchmark with GCC on Windows / Linux soon
DeleteI think the reason of slowing down in algorithm the libc is using. When freeing data it merge little consecutive chunks in bigger ones. Libc doesn't return memory to OS immediately, it is caching it.
ReplyDeleteAs long as you are interested in comparative benchmarks, would you
ReplyDelete(a) consider sharing the benchmark driver code?
(b) consider testing the difference when using libtcmalloc (from Google's perftools)?
(c) especially the behaviour under threading?
Given (a), I could do (b) and (c) perhaps :)