Welcome to a collaborative enhancement of the C malloc API – a “version two” (v2) of the malloc(), realloc() and free() function specifications in the standard C runtime library which have been unchanged since the heady days of the Seventh Edition of Unix right back in 1979 (see http://cm.bell-labs.com/7thEdMan/v7vol1.pdf, page 297) and which has been a standard feature of ISO C and all languages which make use of the standard C library.
I intend to propose a new malloc API to the ISO C standards committee which I hope will be accepted into the next C1X draft standard, or failing that a TR library addendum shortly thereafter. Given the tight deadlines for that process, the deadline for these discussions is 11.59:59pm on the 15th October 2010.
I also have a C++ library standards change proposal based on the C malloc API changes which hopefully suggests what uses the C proposal could be put to.
Why this proposed change is important:
- The 1979 malloc API has no concept of greater than fundamental type alignment. This is becoming very important for hardware DMA transfers, vector units such as SSE/AVX on serial computing and stream/parallel computing, particularly for the resizing of aligned arrays which is currently impossible in the C library API.
- Lots and LOTS of languages other than C use the C malloc API, so its design has a disproportionately huge effect. For example, right now most major object orientated languages do not use realloc() to resize an array of objects because realloc() may unpredictably relocate the data in memory which would be catastrophic for internal bookkeeping. As a result, these languages waste CPU cycles, memory and memory bandwidth working around what is fundamentally a C library limitation. What we really need is the ability to speculatively extend in place, and if and only if that fails do we perform a copy/move construction of the array contents.
- The 1979 malloc API was designed for a time when virtual memory was an optional extra for many systems – as a result, it has no understanding of the hardware assisted virtual address space facilities which have been commonplace since the 1980s. This is important for two reasons:
- Most vector/array implementations in languages making use of the C malloc API substantially overallocate/overcommit memory in order to avoid the substantial penalties introduced by storage size changes. As a single example, in the two major C++ STL implementations in use today std::vector<> by default overallocates by 50%. Put another way, if your std::vector<> contains 10,000 items it has allocated memory for between 10,000 and 15,000 items. However, no major C++ STL std::vector<> implementation currently ever shrinks its allocation except when going to empty, so adding 20,000 items and removing 19,999 items will still leave storage for up to 30,000 items committed! This should not be the case in modern code! What one ought to be doing is reserving address space and committing into that not using the overcommitting of memory as a poor man’s substitute for a proper reservation mechanism.
- The lack of virtual memory support was excusable on 32 bit systems with their limited address space, but it is now an active inhibitor of performance on 64 bit systems where in many algorithms you can exchange address space for additional performance. Seeing as address space is very nearly free on most architectures, allowing the program to give hints to the memory allocator opens up a whole new range of improved optimisation opportunities under the bonnet.
Why getting this standardised as soon as possible is important:
Let me be honest: right now in 2010 the fact that realloc() can’t resize aligned arrays or arrays containing objects isn’t that important outside the High Performance Computing world (which includes some PC games). The problem isn’t what’s happening today, but rather what is coming during the next ten years – a typical lifetime of an ISO C standard.
This graph shows a plot of the speed and capacity of the “value sweetspot” in consumer PC RAM sticks between 1997 and 2009 where 1.0 = the value in 1997. Very simply I took the sticks with the best Mb/US dollar around the middle of each year and plotted them on a scatter chart. The speed is depicted in red according to the left scale, whereas the capacity is depicted in green according to the right scale.
As you can see, in the last twelve years capacity has outgrown speed by a factor of ten. Much more importantly, growth in capacity is approximately exponential (actually early logistic) whereas growth in speed is approximately linear. I chose to display twelve years simply because that is when the difference of a factor of ten appears – if one goes back, one finds that these rates of growth have been stable since the late 1980s.
What this means is that in the next twelve years, assuming that the trends of the past twelve years continue and all other things being equal, memory speed in 2021 will be a further 25x faster than today and therefore 625x faster than in 1997. However, assuming that its exponential growth continues exactly as before, memory capacity in 2021 will be log[base 44.8](250) = 1.131, pow(2*1.131, 44.8) = 7,657,443,735,288,281x larger than in 1997, or 30,629,774,941,153x larger than today – hereafter written as 3E13 (3 x 1013). In case you think that infeasible, consider that cutting edge research can already pack 3E18 bytes per square inch using electron quantum holography when today’s data storage technologies can manage perhaps 3E10 bytes per square inch.
One cannot establish confidence intervals for the course of logistic growth until a good bit after the point of inflection has passed i.e. when the second derivative has gone negative, so neither I nor anyone else can tell you if these predictions will come to pass for sure. However, even if growth in capacity were just linear from now on, one would still see a tenfold larger growth in capacity than in its access speed. So what I will predict is this: during the next twelve years memory capacity will outgrow its access speed by between a lower bound of ten times and an upper bound of one trillion times.
And what that translates into for computer programmers and operating system architects is that memory bandwidth must be conserved at all possible costs. One big user of memory bandwidth is memory copying, and it is precisely the avoidance of memory copying at which this “malloc v2″ proposal is aimed.
In summary, this standards change proposal is all about making it possible
via the standard API to remove the known major sources of unnecessary memory
accesses i.e. a latency reducing API.
Considering the years it takes to implement a new international standard, and then to get programmers to apply the new features to existing code, I hope you can see why this proposal needs to be added now even though it won’t affect everyday computer operation for many years yet.
If you have any questions or comments about this section, please add them below.
How you can help:
- Read the C Proposal Text, my notes upon it (many are linked) and the comments made by others below. Add your own comment if you wish.
- Now read the C++ Proposal Text, my notes upon it and the comments made by others below. Add your own comment if you wish.
- Lastly, go to the Polls section. Each major design disagreement by experts is listed here. Read their arguments and the counter-arguments of the other experts, and vote accordingly.
Thank you in advance for your time! You will have helped play your part in keeping computing technology in peak condition during the coming decades!
- [BACK TO POST] Note that an aligned_alloc() function has been added to the C1X draft standard, however I think it a bad idea without alignment property stickiness as otherwise it cannot permit resizing of aligned blocks. Adding a sticky alignment property to memory allocation needs a whole discussions page of its own.
- [BACK TO POST] Some may take the view here “so what?”. After all the kernel will surely page unused memory pages out into the swap file and use the physical RAM for more useful purposes?
This is a difficult concept to explain to those not versed in memory allocator theory, however the first reason is that the kernel’s memory pager is configured on the basis of a series of statistical assumptions about how the “average” program behaves as first detailed in Denning (1970). If some programs overcommit memory which they never use while other ones don’t, the programs which overcommit will be on average punished as those appear “fatter” relative to the average. If however the program can supply more metadata about what memory is really being used for, one can hint far more effectively to the paging mechanism what ought and ought not to be prioritised e.g. “I’m not going to use this memory here right now but I might soon” versus “I definitely will never use this memory ever again”. The kernel can only know how often a page is accessed on average which rewards frequently accessed pages but punishes those which are accessed infrequently. This translates into pathological performance in all sorts of unexpected corner cases which equals lots of extra debugging time and user frustration.
The second reason is that committed pages take up far more system resources than reserved (empty) address space. Obviously the process’ memory allocator must keep track of it, but so must also the kernel which must notionally allocate space for all committed pages in the swap file which inevitably drags in the filing system implementation and all its complexities. The greater the number of committed pages, the longer the page fault handler must take which adds latency every time the CPU strays outside its caches. And the longer these sorts of lists, the more time must be spent in serialisation locks and other sorts of performance inhibiting structures. Academic research on these sorts of overheads varies, however in certain scenarios memory management code may occupy up to a third of CPU time and this proportion is growing as CPUs continue to become faster than memory.