This weekend I had a spare hour or so and made some measurements of the default allocation strategies used by the Visual C++ 10 and GCC 4.5 compilers.

Update: added results for the Intel C++ compiler 12.1. They are very similar to GCC, but two things about them are interesting:
- It seems that Intel's large object heap kicks in at smaller object sizes
- Sometimes 64-bit allocations are smaller than 32-bit ones; I am not sure how that happens

The results were enlightening, to say the least. I always knew about the memory overhead of small allocations, but I never thought it was this large. Just look at the tables below. To create them I allocated 100 MB in chunks of different sizes, from 4 bytes (an int on x86) up to 40000 bytes. The Overhead column shows the average overhead in bytes per allocation.
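The actual program is linked at the end of the post; for readers who just want the gist, here is a minimal sketch of the approach - my reconstruction, not the original source. The current_working_set_bytes helper is an assumption about how the memory consumption could be read: GetProcessMemoryInfo on Windows, /proc/self/statm on Linux.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

#if defined(_WIN32)
#  include <windows.h>
#  include <psapi.h>   // link with psapi.lib
static std::size_t current_working_set_bytes()
{
    PROCESS_MEMORY_COUNTERS pmc = {};
    GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc));
    return pmc.WorkingSetSize;
}
#else
#  include <fstream>
#  include <unistd.h>
static std::size_t current_working_set_bytes()
{
    // /proc/self/statm: the second field is the resident set size in pages.
    std::ifstream statm("/proc/self/statm");
    std::size_t total_pages = 0, resident_pages = 0;
    statm >> total_pages >> resident_pages;
    return resident_pages * static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
}
#endif

int main()
{
    const std::size_t total = 100 * 1024 * 1024;  // ~100 MB of payload
    const std::size_t chunk = 4;                  // vary: 4, 40, 400, 4000, 40000
    const std::size_t count = total / chunk;

    // Pre-build the pointer bookkeeping so its memory is not counted
    // as allocation overhead in the measurement below.
    std::vector<char*> blocks(count);

    const std::size_t before = current_working_set_bytes();
    for (std::size_t i = 0; i < count; ++i)
        blocks[i] = new char[chunk];
    const std::size_t after = current_working_set_bytes();

    // overhead = (memory actually consumed) / (number of allocations) - chunk size
    const double overhead =
        static_cast<double>(after - before) / count - static_cast<double>(chunk);
    std::cout << "chunk " << chunk << ": ~" << overhead
              << " bytes of overhead per allocation\n";

    for (std::size_t i = 0; i < count; ++i)
        delete[] blocks[i];
    return 0;
}
```

Pre-building the pointer vector keeps its own memory out of the before/after difference, so the numbers reflect the heap's per-block cost rather than the bookkeeping of the test itself.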
Visual C++, 32-bit compilation

| Block, bytes | Total size allocated (chunks + overhead), KB | Overhead, bytes per chunk |
|---|---|---|
| 4 | 782788 | 28 |
| 40 | 156568 | 24 |
| 400 | 103708 | 25 |
| 4000 | 98444 | 32 |
| 40000 | 98052 | 162 |
Visual C++, 64-bit compilation

| Block, bytes | Total size allocated (chunks + overhead), KB | Overhead, bytes per chunk |
|---|---|---|
| 4 | 1565588 | 60 |
| 40 | 234824 | 56 |
| 400 | 109600 | 49 |
| 4000 | 99044 | 57 |
| 40000 | 98100 | 182 |
GCC, 32-bit compilation

| Block, bytes | Total size allocated (chunks + overhead), KB | Overhead, bytes per chunk |
|---|---|---|
| 4 | 390592 | 12 |
| 40 | 117952 | 8 |
| 400 | 99648 | 8 |
| 4000 | 97856 | 8 |
| 40000 | 98432 | 318 |
Intel C++, 32-bit compilation

| Block, bytes | Total size allocated (chunks + overhead), KB | Overhead, bytes per chunk |
|---|---|---|
| 4 | 392220 | 12 |
| 40 | 117652 | 8 |
| 400 | 100000 | 10 |
| 4000 | 102196 | 186 |
| 40000 | 98152 | 203 |
Intel C++, 64-bit compilation

| Block, bytes | Total size allocated (chunks + overhead), KB | Overhead, bytes per chunk |
|---|---|---|
| 4 | 392828 | 12 |
| 40 | 117620 | 8 |
| 400 | 102008 | 18 |
| 4000 | 102008 | 178 |
| 40000 | 98068 | 169 |
So here are a few obvious and not-so-obvious thoughts. Of course, they apply when dealing with large data structures that impact an application's performance:

- (A very obvious one) Dynamically allocating small values or not-so-large structures one by one is an extreme waste of memory. Maybe pack them into a std::vector and pass iterators around instead?
- All node-based structures like std::list, std::map and std::set that store relatively small items are very memory-inefficient. In most cases they should be replaced by std::vector or something similar, like the sorted vector mentioned by Scott Meyers (see the sketch after this list).
- Even if your algorithm seems to require std::list for small items (e.g. because items need to be swapped around or added dynamically), start with benchmarking! Even when both std::vector and std::list are populated in one place (so the list is packed well and not very fragmented), the processor cache may benefit from the more localized memory footprint of std::vector.
- Always create your shared_ptrs with make_shared instead of new. As discussed in a previous post, it is not only faster but also reduces the number of small allocations.
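To make the sorted-vector suggestion concrete, here is a minimal sketch - my own illustration, not from the original post; Payload and find_key are made-up names - of replacing a std::map<int, Payload> lookup with a sorted std::vector and std::lower_bound, which gives one contiguous allocation instead of one heap node per element:

```cpp
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

struct Payload { int value; };  // stand-in for a small mapped type

// A sorted vector of (key, payload) pairs: one contiguous block instead of
// one heap node (with per-block header overhead) per element, as std::map has.
typedef std::vector<std::pair<int, Payload> > SortedVec;

static bool key_less(const std::pair<int, Payload>& a, const std::pair<int, Payload>& b)
{
    return a.first < b.first;
}

static const Payload* find_key(const SortedVec& v, int key)
{
    std::pair<int, Payload> probe(key, Payload());
    SortedVec::const_iterator it = std::lower_bound(v.begin(), v.end(), probe, key_less);
    return (it != v.end() && it->first == key) ? &it->second : 0;
}

int main()
{
    SortedVec v;
    // Bulk-load, then sort once; cheap compared to per-insert rebalancing
    // when the data is built up front and mostly queried afterwards.
    for (int i = 0; i < 1000; ++i) {
        Payload p = { i * 10 };
        v.push_back(std::make_pair(i, p));
    }
    std::sort(v.begin(), v.end(), key_less);

    const Payload* p = find_key(v, 42);
    if (p)
        std::printf("42 -> %d\n", p->value);
    return 0;
}
```

The same reasoning is behind the make_shared advice: make_shared places the object and the shared_ptr control block into a single allocation instead of two.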
Any other examples we should mention here? All comments are welcome!
PS: Here is the link to the source code I used to get this data.
VS2010 - was it the debug heap or the release heap? The debug heap adds a lot of overhead for its debugging checks.

I was using the release heap.
One more tip: use your own allocators for small data types. They can be way faster than the default allocator.

Agree, sometimes this helps. I did these measurements to understand better where this 'sometimes' begins.
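To illustrate the commenter's tip, here is a minimal sketch of the idea (my own, not from the thread): a fixed-size pool that carves small blocks out of larger slabs, so each small object does not pay the general-purpose heap's per-block header. A production version would also need proper alignment guarantees, thread safety and the std::allocator boilerplate to plug into standard containers.

```cpp
#include <cstddef>
#include <vector>

// Minimal fixed-size pool: grabs big slabs from ::operator new and hands out
// small blocks from them, keeping freed blocks on an intrusive free list.
class FixedPool {
public:
    explicit FixedPool(std::size_t block_size, std::size_t blocks_per_slab = 1024)
        : block_size_(round_up(block_size)),
          blocks_per_slab_(blocks_per_slab),
          free_list_(0) {}

    ~FixedPool() {
        for (std::size_t i = 0; i < slabs_.size(); ++i)
            ::operator delete(slabs_[i]);
    }

    void* allocate() {
        if (!free_list_)
            grow();
        void* block = free_list_;
        free_list_ = *static_cast<void**>(free_list_);  // pop the free list
        return block;
    }

    void deallocate(void* block) {
        *static_cast<void**>(block) = free_list_;       // push onto the free list
        free_list_ = block;
    }

private:
    // Round the block size up to pointer size so the free-list link fits.
    static std::size_t round_up(std::size_t n) {
        const std::size_t a = sizeof(void*);
        return n < a ? a : (n + a - 1) / a * a;
    }

    void grow() {
        char* slab = static_cast<char*>(::operator new(block_size_ * blocks_per_slab_));
        slabs_.push_back(slab);
        // Thread every block of the new slab onto the free list.
        for (std::size_t i = 0; i < blocks_per_slab_; ++i)
            deallocate(slab + i * block_size_);
    }

    std::size_t block_size_;
    std::size_t blocks_per_slab_;
    void* free_list_;
    std::vector<char*> slabs_;
};

int main() {
    FixedPool pool(sizeof(int));
    int* a = static_cast<int*>(pool.allocate());
    *a = 42;
    pool.deallocate(a);
    return 0;
}
```

With a pool like this, a small object costs its own size plus a share of the slab, instead of its size plus the heap's per-block header and rounding.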
Could you explain your method of overhead calculation? As far as I know, the VC++ default heap's minimum user data size is 8 bytes, so for 8-byte blocks you should get the same overhead of 28 bytes. Am I right?

Basically, here is the very simple formula I used:

overhead = (total allocation size) / (number of allocations) - chunk size

Total allocation size is the working set size after the allocations minus the working set size before them. For 8-byte blocks I would expect an overhead of something like 24 bytes. I wrote 'something like' because the overhead comes not only from rounding up the memory block size but also from the additional structures required to hold information about allocated blocks.
Could you please publish your code?

The same request from me. I don't trust magic numbers :)

No magic here - I updated the article with the link to the source code. Let me know if you see any issues :)
1. Is this really the compiler's doing rather than the standard library's? I think it is the library's doing.
2. If one particular implementation consumes more memory, does it provide some kind of benefit over more compact implementations?

GCC and Visual C++ ship significantly different STL implementations; I would not dare to switch STLs without switching compilers. Regarding benefits - I totally agree. Designing the implementation of something as widely used as the STL is always a balance of various trade-offs.

What exactly is the benefit then, do you know? I think this decision had a reason...

I have yet to discover it, stay tuned.

In the case of glibc it is quite simple - 8 bytes (on 32-bit) is two pointers: if I am not mistaken, one points to the previous chunk and the other to the next one. Plus alignment.
You are not actually measuring the compiler's or the architecture's allocation overhead; you are measuring your allocator's overhead (which happens to be the default malloc implementation). Try running the same test with jemalloc (hint: LD_PRELOAD should help) for some improvements. I think you can even get these statistics from jemalloc itself.

Agree. The goal of this post is not to blame some compiler or architecture for inefficiency, but rather to make more developers aware of these overheads. I'd like people to make justified trade-offs between memory footprint, speed and maintenance cost (which increases as you pull in more third-party components such as custom allocators).
The post has a point. A similar problem I ran into was an application consuming much more heap memory than I expected, which I discovered with a custom memory profiler; heavy use of std::map with many small values was the main contributor.