Welcome to a collaborative enhancement of the C malloc API – a “version two” (v2) of the malloc(), realloc() and free() function specifications in the standard C runtime library, which have been unchanged since the heady days of the Seventh Edition of Unix right back in 1979 (see http://cm.bell-labs.com/7thEdMan/v7vol1.pdf, page 297) and which have been a standard feature of ISO C and of every language which makes use of the standard C library.

I intend to propose a new malloc API to the ISO C standards committee which I hope will be accepted into the next C1X draft standard, or failing that a TR library addendum shortly thereafter. Given the tight deadlines for that process, the deadline for these discussions is 11:59:59pm on 15th October 2010.

I also have a C++ library standards change proposal based on the C malloc API changes which hopefully suggests what uses the C proposal could be put to.

Why this proposed change is important:

  1. The 1979 malloc API has no concept of alignment greater than that of the fundamental types. This is becoming very important for hardware DMA transfers, for vector units such as SSE/AVX in serial computing, and for stream/parallel computing[1] – particularly for the resizing of aligned arrays, which is currently impossible via the C library API.
  2. Lots and LOTS of languages other than C use the C malloc API, so its design has a disproportionately huge effect. For example, right now most major object-oriented languages do not use realloc() to resize an array of objects because realloc() may unpredictably relocate the data in memory, which would be catastrophic for internal bookkeeping. As a result, these languages waste CPU cycles, memory and memory bandwidth working around what is fundamentally a C library limitation. What we really need is the ability to speculatively extend in place, and if and only if that fails do we perform a copy/move construction of the array contents (see the sketch after this list).
  3. The 1979 malloc API was designed for a time when virtual memory was an optional extra for many systems – as a result, it has no understanding of the hardware assisted virtual address space facilities which have been commonplace since the 1980s. This is important for two reasons:
    1. Most vector/array implementations in languages making use of the C malloc API substantially overallocate/overcommit memory in order to avoid the substantial penalties introduced by storage size changes. As a single example, in the two major C++ STL implementations in use today, std::vector<> by default overallocates by 50%. Put another way, if your std::vector<> contains 10,000 items it has allocated memory for between 10,000 and 15,000 items. However, no major C++ STL std::vector<> implementation currently ever shrinks its allocation except when going to empty, so adding 20,000 items and removing 19,999 items will still leave storage for up to 30,000 items committed! This should not be the case in modern code! What one ought to be doing is reserving address space and committing into that, not using the overcommitting of memory as a poor man’s substitute for a proper reservation mechanism[2].
    2. The lack of virtual memory support was excusable on 32 bit systems with their limited address space, but it is now an active inhibitor of performance on 64 bit systems where in many algorithms you can exchange address space for additional performance. Seeing as address space is very nearly free on most architectures, allowing the program to give hints to the memory allocator opens up a whole new range of improved optimisation opportunities under the bonnet.
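
For concreteness, the speculative extend-in-place pattern from point 2 might look like the following. This is a minimal sketch only: realloc2() and the M2_PREVENT_MOVE flag are draft names from this proposal (they also appear in the comments below), not a settled API.

    /* Sketch only: realloc2() and M2_PREVENT_MOVE are draft-proposal
       names; the final signature is not settled. */
    #include <stdlib.h>
    #include <string.h>

    extern void *realloc2(void *ptr, size_t size, unsigned flags);
    #define M2_PREVENT_MOVE 0x1  /* fail rather than relocate the block */

    /* Grow an array to newcount items, copying only if extension in
       place fails. */
    void *grow_array(void *items, size_t itemsize,
                     size_t oldcount, size_t newcount)
    {
        /* Speculatively try to extend the existing block in place. */
        void *p = realloc2(items, newcount * itemsize, M2_PREVENT_MOVE);
        if (p)
            return p;            /* extended in place: nothing copied */

        /* Only now fall back to allocate + copy + free; with the 1979
           realloc() this move happens unpredictably inside the call. */
        p = malloc(newcount * itemsize);
        if (!p)
            return NULL;
        memcpy(p, items, oldcount * itemsize);
        free(items);
        return p;
    }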

Why getting this standardised as soon as possible is important:

Let me be honest: right now in 2010 the fact that realloc() can’t resize aligned arrays or arrays containing objects isn’t that important outside the High Performance Computing world (which includes some PC games). The problem isn’t what’s happening today, but rather what is coming during the next ten years – a typical lifetime of an ISO C standard.

This graph shows a plot of the speed and capacity of the “value sweetspot” in consumer PC RAM sticks between 1997 and 2009 where 1.0 = the value in 1997. Very simply I took the sticks with the best Mb/US dollar around the middle of each year and plotted them on a scatter chart. The speed is depicted in red according to the left scale, whereas the capacity is depicted in green according to the right scale.

As you can see, in the last twelve years capacity has outgrown speed by a factor of ten. Much more importantly, growth in capacity is approximately exponential (actually early logistic) whereas growth in speed is approximately linear. I chose to display twelve years simply because that is when the difference of a factor of ten appears – if one goes back, one finds that these rates of growth have been stable since the late 1980s.

What this means is that in the next twelve years, assuming that the trends of the past twelve years continue and all other things being equal, memory speed in 2021 will be a further 25x faster than today and therefore 625x faster than in 1997. However, assuming that its exponential growth continues exactly as before, memory capacity in 2021 will be (2 × 250^(1/44.8))^44.8 = (2 × 1.131)^44.8 ≈ 7,657,443,735,288,281x larger than in 1997, or 30,629,774,941,153x larger than today – hereafter written as 3E13 (3 × 10^13). In case you think that infeasible, consider that cutting edge research can already pack 3E18 bytes per square inch using electron quantum holography when today’s data storage technologies can manage perhaps 3E10 bytes per square inch.

One cannot establish confidence intervals for the course of logistic growth until a good bit after the point of inflection has passed i.e. when the second derivative has gone negative, so neither I nor anyone else can tell you if these predictions will come to pass for sure. However, even if growth in capacity were just linear from now on, one would still see a tenfold larger growth in capacity than in its access speed. So what I will predict is this: during the next twelve years memory capacity will outgrow its access speed by between a lower bound of ten times and an upper bound of one trillion times.
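
For reference, the standard logistic form makes the inflection argument concrete (the usual textbook parameterisation, nothing specific to the memory data):

    P(t) = \frac{K}{1 + e^{-r(t - t_0)}}

Here K is the eventual ceiling, r the growth rate and t_0 the single point of inflection, at which P(t_0) = K/2 and the second derivative changes sign. Before t_0 the curve is practically indistinguishable from a pure exponential, which is why K cannot be estimated with any confidence until the observed second derivative has clearly gone negative.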

And what that translates into for computer programmers and operating system architects is that memory bandwidth must be conserved at all possible costs. One big user of memory bandwidth is memory copying, and it is precisely the avoidance of memory copying at which this “malloc v2” proposal is aimed.

In summary, this standards change proposal is all about making it possible via the standard API to remove the known major sources of unnecessary memory accesses – in other words, it is a latency-reducing API.

Considering the years it takes to implement a new international standard, and then to get programmers to apply the new features to existing code, I hope you can see why this proposal needs to be added now even though it won’t affect everyday computer operation for many years yet.

If you have any questions or comments about this section, please add them below.

How you can help:

  1. Read the C Proposal Text, my notes upon it (many are linked) and the comments made by others below. Add your own comment if you wish.
  2. Now read the C++ Proposal Text, my notes upon it and the comments made by others below. Add your own comment if you wish.
  3. Lastly, go to the Polls section. Each major design disagreement between experts is listed there. Read their arguments and the counter-arguments of the other experts, and vote accordingly.

Thank you in advance for your time! You will have helped play your part in keeping computing technology in peak condition during the coming decades!

Notes:

  1. [BACK TO POST] Note that an aligned_alloc() function has been added to the C1X draft standard; however I think it a bad idea without alignment property stickiness, as otherwise it cannot permit the resizing of aligned blocks (see the sketch after these notes). Adding a sticky alignment property to memory allocation needs a whole discussion page of its own.
  2. [BACK TO POST] Some may take the view here “so what?”. After all the kernel will surely page unused memory pages out into the swap file and use the physical RAM for more useful purposes?

    This is a difficult concept to explain to those not versed in memory allocator theory, however the first reason is that the kernel’s memory pager is configured on the basis of a series of statistical assumptions about how the “average” program behaves as first detailed in Denning (1970). If some programs overcommit memory which they never use while other ones don’t, the programs which overcommit will be on average punished as those appear “fatter” relative to the average. If however the program can supply more metadata about what memory is really being used for, one can hint far more effectively to the paging mechanism what ought and ought not to be prioritised e.g. “I’m not going to use this memory here right now but I might soon” versus “I definitely will never use this memory ever again”. The kernel can only know how often a page is accessed on average which rewards frequently accessed pages but punishes those which are accessed infrequently. This translates into pathological performance in all sorts of unexpected corner cases which equals lots of extra debugging time and user frustration.

    The second reason is that committed pages take up far more system resources than reserved (empty) address space. Obviously the process’ memory allocator must keep track of it, but so must also the kernel which must notionally allocate space for all committed pages in the swap file which inevitably drags in the filing system implementation and all its complexities. The greater the number of committed pages, the longer the page fault handler must take which adds latency every time the CPU strays outside its caches. And the longer these sorts of lists, the more time must be spent in serialisation locks and other sorts of performance inhibiting structures. Academic research on these sorts of overheads varies, however in certain scenarios memory management code may occupy up to a third of CPU time and this proportion is growing as CPUs continue to become faster than memory.
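
To illustrate note [1]: a minimal sketch assuming the C1X draft’s aligned_alloc(), showing how alignment is silently lost on resize because the property is not sticky.

    #include <stdlib.h>

    void demo(void)
    {
        /* A 64 byte aligned block, e.g. for cache line or SSE/AVX use.
           (The C1X draft requires size to be a multiple of alignment;
           error handling elided.) */
        float *v = aligned_alloc(64, 1024 * sizeof(float));

        /* Legal, but the result need only be aligned for the
           fundamental types: the 64 byte alignment is silently lost,
           which is why aligned arrays cannot portably be resized. */
        v = realloc(v, 2048 * sizeof(float));

        free(v);
    }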

12 Comments

  1. Jeffrey Yasskin
    Posted September 22, 2010 at 10:48 pm | Permalink | Reply

    Are you sure that std::vector over*commits* by 50%, or does it just malloc() that much extra space? I believe most implementations don’t access that space until an object is actually stored there, so a good malloc() implementation should be able to reserve the address space instead of committing it.

    What is missing is a way for vector to communicate to the allocation system that it’s ok to _uncommit_ storage that’s become unused. I think your realloc2(ptr, smaller_size, M2_PREVENT_MOVE) would suffice for this. On systems that support it, this could forward to a call to madvise(ptr+smaller_size, old_size-smaller_size, MADV_DONTNEED).
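
    For instance (a sketch assuming POSIX madvise() and a page-aligned split point; MADV_DONTNEED semantics vary by platform):

        #include <sys/mman.h>

        /* Tell the VM system the tail of a shrunken block is unused. */
        static void uncommit_tail(void *ptr, size_t old_size,
                                  size_t smaller_size)
        {
            madvise((char *)ptr + smaller_size,
                    old_size - smaller_size, MADV_DONTNEED);
        }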

    I suspect vector implementations will continue to over-malloc by 50-100% rather than using M2_RESERVE_MULT(2) to keep the number of calls into the allocator around log(size).

    Here I’m assuming the Linux behavior of not actually allocating mmapped space until the process actually uses it. Are you saying that even this is expensive enough that it’s worth calling mmap extra times to explicitly manage the address space?

    • Posted September 23, 2010 at 5:19 pm | Permalink | Reply

      Thank you for your comment.

      The std::vector implementations in the MSVC (Dinkumware) and GCC (SGI) STLs both allocate (i.e. malloc) 50% more memory than they need, and neither ever reduce that memory allocation unless the vector is emptied.

      You would be correct that madvise() or VirtualAlloc(MEM_RESET) could be useful to tell the VM system that memory contents can optionally be thrown away. Academic research (Kimpe et al 2006) shows that these calls are so slow that, except for very large ranges, calling them doesn’t help … *except* in a memory constrained environment. In other words, their current design and performance mean they ought only ever to be called when memory gets tight. This is doable, but most STLs try to avoid such architecture dependent code.

      Rather better would be to move the VM implementation out of the kernel into user space. Given the much improved availability of usage metadata in user space, these sorts of bottlenecks can be eliminated.

      Considering your point about vector implementations still overallocating anyway, I would suggest that address reservation doesn’t make much sense up to a few dozen kilobytes anyway as memcpy/move construction is fast enough. One might overallocate by say 100-200% at smaller block sizes, but decrease the overallocation in favour of reservation as the block size grows. By say 1Mb block sizes one might reserve 8Mb and overallocate by 256Kb.

      Future allocator technology will render much of overallocation redundant as memory can be made available at any address location in practically zero time. In such a world it no longer makes sense to micromanage allocations because calling the allocator will be faster than any logic trying to batch or avoid calls into the allocator.

      In other words, mmap today is slow because it involves pushing lots of data over a connection to the kernel in which manipulations are fundamentally serialised. If one moves the VM system into user space, mmap just becomes a compatibility wrapper around a much more advanced API which operates purely on a process (and CPU cluster) local level.

  2. Posted September 23, 2010 at 3:29 am | Permalink | Reply

    I agree that the C and C++ allocator “framework” is fairly primitive and needs improvement. I’ve been studying this myself the last few months, and have been primarily influenced by the following works:

    - Howard Hinnant’s 2006 proposal, N1953 [1]
    - Ion Gaztanaga’s 2006 proposal, N2045 [2]
    - Ion’s follow-up study [3]
    - Paul Pedriana’s EASTL proposal, N2271 [4]
    - Pablo Halpern’s 2005 Lakos allocator model, N1850 [5]

    [1], [2], and [3] form a natural progression of ideas, and I’m currently investigating some of the ideas introduced in [3]. [4] has an interesting perspective from the gaming community and makes some convincing points. I’m less convinced of [5], especially the need to propagate allocators to “contained objects” (there were some other issues I didn’t find palatable, I think, but it’s been a while since I last reviewed that paper), but it does show a relatively early interest in improving C++ allocators.

    These all come from the C++ community; perhaps you have additional references from the C community that give additional insights.

    I guess I’d like to comment at this time on what features I would like the C++ allocator interface to have, without considering any backwards compatibility. I don’t think this list is exhaustive, but covers some important points; your opinions on these are welcome. Of the above references, I tend to favor Ion’s work the most, and these points reflect that.

    - Separation of array allocation and node allocation.
    - Aligned allocation. I agree with you that this is important.
    - “Hinted” allocation to improve locality, as is provided by the std::allocator interface (but lacking in the underlying call to new).
    - What Ion refers to as “burst allocation”, and I call “multiallocation” and “multideallocation”, which allows one to make multiple allocation and deallocation requests in a single function call and attempt to improve locality. I think the way Ion has handled this, by intrusively linking the allocated chunks together, is “nifty”.
    - Array expansion and contraction. Your proposed realloc2 achieves this to some extent, but what I’d like to see is an interface that allows forward expansion, backward expansion, forward+backward expansion, forward contraction (moving the front of the block forward), and backward contraction (moving the end of the block backward).
    - No more rebind mess (as promoted by the EASTL proposal).

    I’m looking forward to your thoughts on any of this.

    - Jeff

    [1] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n1953.html
    [2] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2045.html
    [3] http://www.drivehq.com/web/igaztanaga/allocplus/
    [4] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2271.html
    [5] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1850.pdf

    • Posted September 27, 2010 at 4:38 pm | Permalink | Reply

      Thank you very much for your comment here and on the Boost devs mailing list. My apologies for the late reply – stuff came up, but also I needed to digest the reading you gave me.

      Given its importance, I have uploaded a new page on the topic at http://mallocv2.wordpress.com/references/ where I have made a number of general statements and thoughts. In short, I don’t think that any of the proposals you suggested offer anything better than the original Stepanov STL allocator design, and I think given the new language features of C++0x we might as well aim high if we’re going to. Moreover, I think you’d get heavyweight support for the Stepanov design whereas anything less will be sidelined (as the prior proposals have been).

      To be specific about the papers suggested:

      [1]: I think versioned allocators unnecessary simply because type_traits can be used to discover if some type has the v2 API or not.

      [2] and [3]: Lots in here, so each in turn:

      The problems #1 to #5: These proposals are designed to prevent memory wastage when this isn’t a problem – memory bandwidth wastage is far more important. Therefore much of what they propose – or more often, how they intend it – isn’t necessary.

      Node vs. array allocation: I’m afraid I don’t get the difference at all. Strictly speaking the STL allocator works in units of sizeof(T). But there is nothing at all stopping you feeding it some arbitrary N of T=char and performing a cast to the required type. In fact, it is still legal to call std::malloc and work directly with the C API.

      Backwards expansion (in the sense of the front moving downwards): I would be strongly against backwards expansion, mainly because it’s a pain to implement quickly, and moreover allowing it at all makes the entire allocator implementation slower by introducing quadratic complexity where formerly there was linear. Also, backwards expansion is only really necessary when forward expansion isn’t available, and the address space reservation mechanism in this proposal obviates that problem in most cases. Lastly, programmers have a horrible (Western) cultural bias in favour of starts never moving while tails grow or shrink – I personally think that moveable starts would introduce a huge number of bugs by the average capability programmer, but I admit this could be elitism on my part.

      Returning how much memory is actually allocated: Good idea. When the allocator goes to mmap for allocating large blocks it can overallocate on some x64 platforms by up to 2Mb :)

      Expand + Allocate combined call: I am not sure that this is worth the additional API complexity … the in-place resize failing will usually in C++’s case result in a malloc for the new storage, but in practice implementations cannot reuse the work done in the resize attempt in order to speed the malloc. Therefore all one is saving at most is a spinlock traversal, and that assumes that the allocator isn’t running a per-thread lookaside cache (all big vendor system allocators today do just this).

      Burst allocation: I think that iterator based burst allocation is only safely possible in C++, not C where I could only see a macro based system offering any tangible benefits. One of the big advantages to a fixed size output buffer is that you can parallelise the call – a simple OpenMP declaration around the for loop suffices. I don’t get his concern that std::list is forced to allocate a temporary array when alloca() is your friend here. I agree that contiguousness ought to be more of an aspiration than a guarantee, and indeed the big reason why dlmalloc makes these calls so fast is that it actually finds a free space big enough for all the allocations and then proceeds to demarcate their internals all at once without having to search the free space lists for each allocation. Most allocators can also do this, but some cannot. In short, for C at least I can’t see an iterator based burst allocator being reasonable, and if C goes the independent_comalloc route then it automatically becomes available to C++. Then one has to ask, is the C API sufficient for C++ usage? I’d say yes it is when used with alloca. I might also add that Ion reports very similar performance between adaptive pools and independent_comalloc, but overhead is much higher (again, memory wastage I don’t view important).
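
      For the avoidance of doubt, the pattern I have in mind looks like this (a sketch using dlmalloc's independent_comalloc() signature; alloca() is compiler-specific, as discussed):

          #include <alloca.h>   /* GCC; MSVC spells it _alloca */
          #include <stddef.h>

          /* dlmalloc's batch allocator: n chunks of the given sizes. */
          extern void **independent_comalloc(size_t n, size_t sizes[],
                                             void *chunks[]);

          struct node { struct node *next; int value; };

          struct node *burst_alloc_nodes(size_t n)
          {
              /* Temporary arrays live on the stack: no heap traffic,
                 thrown away on return. Chunk this up if n is large
                 enough to threaten stack exhaustion. */
              size_t *sizes  = alloca(n * sizeof(size_t));
              void  **chunks = alloca(n * sizeof(void *));
              size_t i;

              for (i = 0; i < n; i++)
                  sizes[i] = sizeof(struct node);
              if (!independent_comalloc(n, sizes, chunks))
                  return NULL;

              /* Link the freshly demarcated nodes into a list. */
              for (i = 0; i < n; i++)
                  ((struct node *)chunks[i])->next =
                      (i + 1 < n) ? (struct node *)chunks[i + 1] : NULL;
              return (struct node *)chunks[0];
          }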

      Use of virtual machine for benchmarking: Any of Ion’s results using a virtual machine need to be ignored. Anyone who has compared benchmarks obtained from a VM versus a real system knows that the VM results are extremely unreliable. Even when test A is three times faster than test B on a VM it can be that test A is ten times slower than test B on real hardware.

      Adaptive pools: I agree that segregated pools are useful for all sorts of reasons. Batch free() performance is not one of them – soon-to-be-released allocators will execute free asynchronously, so calling free on a whole load of nodes isn’t going to be quite so slow anymore, and besides, doing a load of free2(M2_CONSTANT_TIME) will be a lot better than without. The other reason I would delay decisions on segregated pools is that there are big changes coming here, and soon – as core counts increase, a certain amount of NUMA, where some cores and memory regions are “closer” than others, is going to come in. Segregated pools are going to play an important part and I wouldn’t want to damage that through hasty API choices today.

      Total lack of VM awareness: I am fairly stunned that there is *still* zero awareness of virtual memory in ANY of these proposals. Hinting what memory will be used for to the VM system can be very beneficial – maybe not so much for existing kernel based VM, but for next generation user mode based VM certainly.

      [4]: There is plenty of good in the EASTL proposal, but I do think that developments since 2007 have likely obviated much of it. Regarding its allocator proposal, they appear to have removed the type dependency of the allocator as well as adding state. Both are reasonable choices executed well, but I still yearn for Stepanov’s original intent.

      [5]: From my brief reading, all of this proposal can be implemented, if somewhat hackily, with existing C++ features e.g. by using pointers to pointers to state in the template parameters of the allocator type. Again, I yearn for the superior Stepanov allocator design.

      I appreciate that this is a long reply, and it probably ought to go to email rather than as a comment. But I hope you find it useful, and my apologies once again for taking so long to reply.

      Niall

      • Posted September 29, 2010 at 4:25 pm | Permalink

        > Thank you very much for your comment here and on the Boost devs mailing list.
        > My apologies for the late reply – stuff came up, but also I needed to digest
        > the reading you gave me.

        Of course. My turn to digest a little, I suppose ;)

        > Given its importance, I have uploaded a new page on the topic at
        > http://mallocv2.wordpress.com/references/ where I have made a number of
        > general statements and thoughts. In short, I don’t think that any of the
        > proposals you suggested offer anything better than the original Stepanov STL
        > allocator design, and I think given the new language features of C++0x we
        > might as well aim high if we’re going to. Moreover, I think you’d get
        > heavyweight support for the Stepanov design whereas anything less will be
        > sidelined (as the prior proposals have been).

        I feel like I’m actually missing something from the “original Stepanov STL
        allocator design”, but I will post a comment on the References page where you
        can elaborate.

        > To be specific about the papers suggested:

        > [1]: I think versioned allocators unnecessary simply because type_traits can
        > be used to discover if some type has the v2 API or not.

        Sure, though I feel the importance of that paper was the proposed API
        extensions, not the specific mechanism to differentiate between the old and new
        API (the necessity of such a mechanism is a given, I think).

        > [2] and [3]: Lots in here, so each in turn:

        > The problems #1 to #5: These proposals are designed to prevent memory wastage
        > when this isn’t a problem – memory bandwidth wastage is far more important.
        > Therefore much of what they propose – or more often, how they intend it –
        > isn’t necessary.

        I think it would be helpful if you clarified what you mean by “memory bandwidth
        wastage”. I think I’ve been under the impression that “memory bandwidth
        wastage” is essentially an overcommittal of virtual memory pages (e.g., when
        realloc’ing a large array, you’d have roughly twice the number of pages
        committed in memory than you really need to store the data). I’m not as
        familiar with memory allocation as Mr. Nedmalloc is, but I’m trying ;)

        Superficially, at least, a lot of N2045/Ion’s followup seems to be aligned with
        your proposal. Your (2.) under “Why this proposed change is important” actually
        points to wasted CPU cycles and memory as a motivation for your proposal, which
        is what N2045’s #1-#4 address. So I was under the impression that you *did*
        think memory wastage was a problem. Please clarify.

        In any case, it would seem that N2045’s proposals are equally targeted toward
        minimizing fragmentation, reducing memory copying and memory usage, and reducing
        lock acquisitions. Your proposal seems to address the first two at least
        (although I like the idea of differentiating between limit_size, preferred_size,
        and received_size introduced in N1953 and expanded upon in N2045), and I gather
        the third one is not a major concern in your opinion (given thread-local caches,
        I presume). It seems the one major feature lacking in N2045 that you have in
        your proposal is address space reservation. Is this a fair assessment?

        > Node vs. array allocation: I’m afraid I don’t get the difference at all.
        > Strictly speaking the STL allocator works in units of sizeof(T). But there is
        > nothing at all stopping you feeding it some arbitrary N of T=char and
        > performing a cast to the required type. In fact, it is still legal to call
        > std::malloc from and work directly with the C API.

        I believe the primary difference is that blocks returned by node allocation
        cannot later be resized, which can reduce bookkeeping overhead and improve
        allocation/deallocation speed. Your experience may be different, but Ion’s
        followup study to N2045 seems to indicate advantages of separating node and
        array allocation. I would also expect a program that node-allocates a block of
        a given size typically allocates many same-sized blocks, where this would not
        (typically) be true for array allocation. Effectively, the program can
        communicate its usage patterns more clearly to the allocator. This is purely
        speculation, however.

        > Backwards expansion (in the sense of the front moving downwards): I would be
        > strongly against backwards expansion, mainly because it’s a pain to implement
        > quickly and moreover allowing it all makes the entire allocator implementation
        > slower by introducing quadratic complexity where formerly there was linear.
        > Also, backwards expansion is only really necessary when forward expansion
        > isn’t available, and the address space reservation mechanism in this proposal
        > obviates that problem in most cases. Lastly, programmers have a horrible
        > (Western) cultural bias in favour of starts never moving while tails grow or
        > shrink – I personally think that moveable starts would introduce a huge number
        > of bugs by the average capability programmer, but I admit this could be
        > elitism on my part.

        First of all, an implementation is free to return null upon a backwards
        expansion request ;)

        But more importantly, I don’t see how backwards expansion is inherently more
        difficult to implement. Perhaps you can elaborate? I also don’t see the
        quadratic complexity. It seems you’re assuming a particular allocator design
        (e.g., dlmalloc) here.

        I’m uncomfortable with assuming that address space reservation will always be
        available (and thus whether this really obviates the need for backwards
        expansion). Nested allocators? Fixed-size buffer allocators? I don’t know.
        It’s difficult for me to buy into arguments about a general allocator interface
        based on a specific implementation of that interface.

        Regarding the average capable programmer: The “average” (if there is such a
        thing) C++ programmer pretty much never has to worry about memory allocators, so
        I think this is moot as far as C++ is concerned. And C++ is well-known for
        providing the guns to shoot yourself in the foot anyway, so in that sense, one
        would be remiss to exclude a backwards expansion API.

        [...]
        > Expand + Allocate combined call: I am not sure that this is worth the
        > additional API complexity … the in-place resize failing will usually in C++’s
        > case result in a malloc for the new storage, but in practice implementations
        > cannot reuse the work done in the resize attempt in order to speed the malloc.
        > Therefore all one is saving at most is a spinlock traversal, and that assumes
        > that the allocator isn’t running a per-thread lookaside cache (all big vendor
        > system allocators today do just this).

        This is a valid argument. I haven’t decided myself if the additional API
        complexity is worth it or not.

        > Burst allocation: I think that iterator based burst allocation is only safely
        > possible in C++, not C where I could only see a macro based system offering
        > any tangible benefits. One of the big advantages to a fixed size output buffer
        > is that you can parallelise the call – a simple OpenMP declaration around the
        > for loop suffices. I don’t get his concern that std::list is forced to
        > allocate a temporary array when alloca() is your friend here. I agree that
        > contiguousness ought to be more of an aspiration than a guarantee, and indeed
        > the big reason why dlmalloc makes these calls so fast is that it actually
        > finds a free space big enough for all the allocations and then proceeds to
        > demarcate their internals all at once without having to search the free space
        > lists for each allocation. Most allocators can also do this, but some cannot.
        > In short, for C at least I can’t see an iterator based burst allocator being
        > reasonable, and if C goes the independent_comalloc route then it automatically
        > becomes available to C++. Then one has to ask, is the C API sufficient for C++
        > usage? I’d say yes it is when used with alloca. I might also add that Ion
        > reports very similar performance between adaptive pools and
        > independent_comalloc, but overhead is much higher (again, memory wastage I
        > don’t view important).

        The problem with alloca is portability, of course. What’s your feel on this
        when alloca is not an option?

        > Use of virtual machine for benchmarking: Any of Ion’s results using a virtual
        > machine need to be ignored. Anyone who has compared benchmarks obtained from a
        > VM versus a real system knows that the VM results are extremely unreliable.
        > Even when test A is three times faster than test B on a VM it can be that test
        > A is ten times slower than test B on real hardware.

        Agreed.

        [...]
        > Total lack of VM awareness: I am fairly stunned that there is *still* zero
        > awareness of virtual memory in ANY of these proposals. Hinting what memory
        > will be used for to the VM system can be very beneficial – maybe not so much
        > for existing kernel based VM, but for next generation user mode based VM
        > certainly.

        The address space reservation looks like a good idea, but I can understand how
        an abstract allocator interface would want to minimize the exposure of the
        details of the underlying memory system.

        > [4]: There is lots of plenty good in the EASTL proposal, but I do think that
        > developments since 2007 have likely done much to obviate much of it. Regarding
        > its allocator proposal, they appear to have removed the type dependency of the
        > allocator as well as adding state. Both are reasonable choices executed well,
        > but I still yearn for the Stepanov original intent.

        I find the argument to remove the type dependence convincing, as it will simplify
        container usage by removing the need to “rebind” your allocator. I think adding
        stateful allocators is a given in any modern C++ allocator proposal.

        > [5]: From my brief reading, all of this proposal can be implemented, if
        > somewhat hackily, with existing C++ features e.g. by using pointers to
        > pointers to state in the template parameters of the allocator type. Again, I
        > yearn for the superior Stepanov allocator design.

        There isn’t really much I agree with in N1850 other than a desire to address the
        current STL allocator interface’s shortcomings. But it does give an alternative
        viewpoint.

        > I appreciate that this is a long reply, and it probably ought to go to email
        > rather than as a comment. But I hope you find it useful, and my apologies once
        > again for taking so long to reply.

        Re: long reply, no problem. Also, I think public records of such discussions
        are useful for others, so I would consider this venue more appropriate than
        email.

        - Jeff

      • Posted October 1, 2010 at 11:03 pm | Permalink

        > > The problems #1 to #5: These proposals are designed to prevent memory
        > > wastage when this isn’t a problem – memory bandwidth wastage is far more
        > > important. Therefore much of what they propose – or more often, how they
        > > intend it – isn’t necessary.
        >
        > I think it would be helpful if you clarified what you mean by “memory
        > bandwidth wastage”. I think I’ve been under the impression that “memory
        > bandwidth wastage” is essentially an overcommittal of virtual memory pages
        > (e.g., when realloc’ing a large array, you’d have roughly twice the number
        > of pages committed in memory than you really need to store the data). I’m
        > not as familiar with memory allocation as Mr. Nedmalloc is, but I’m trying

        Capacity = how much you can store
        Bandwidth = how much of storage you can access per unit of time

        Overallocation – like std::vector – allocates 1.5Mb of storage despite only having 1Mb of data. This is capacity wastage.

        Memory bandwidth wastage is reading or writing data you do not need to. If you memset(0) a region and then – after it has left the L2 cache – you do another memset(0) on the same region, then you are wasting memory bandwidth.

        As the graph above shows, memory bandwidth is far more precious than memory capacity. And moreover, that preciousness is becoming exponentially more pronounced with time.

        > Superficially, at least, a lot of N2045/Ion’s followup seems to be aligned
        > with your proposal. Your (2.) under “Why this proposed change is important”
        > actually points to wasted CPU cycles and memory as a motivation for your
        > proposal, which is what N2045’s #1-#4 address. So I was under the impression
        > that you *did* think memory wastage was a problem. Please clarify.

        By far and away the most CPU cycle wastage is caused by the CPU being stalled waiting for main memory. I don’t have the figures to hand, but I remember an average of 40% of CPU time going this way for Intel Core 2. This is why they introduced Hyperthreading so the CPU can go execute a different thread while waiting for main memory.

        Obviously there is a correlation between increases in memory capacity usage and memory bandwidth usage, for the simple reason that if you double your memory usage one can see how one might need double the bandwidth to access it. However it is hardly a guaranteed correlation – if one uses an index or hash table, one can increase capacity usage tenfold with very little increase in bandwidth except when constructing or destructing the objects.

        Hence while at times my proposal and Ion’s are aligned, at times we also differ. Fundamentally speaking, in my opinion Ion’s proposal is deeply flawed because it assumes that memory capacity is the factor most needing conservation.

        > In any case, it would seem that N2045’s proposals are equally targeted
        > toward minimizing fragmentation, reducing memory copying and memory usage,
        > and reducing lock acquisitions. Your proposal seems to address the first two
        > at least (although I like the idea of differentiating between limit_size,
        > preferred_size, and received_size introduced in N1953 and expanded upon in
        > N2045), and I gather the third one is not a major concern in your opinion
        > (given thread-local caches, I presume). It seems the one major feature
        > lacking in N2045 that you have in your proposal is address space
        > reservation. Is this a fair assessment?

        I would rather say that I have made use of address space reservation as a mechanism which allows the removal of much of the complexity in N2045. I like things as simple as possible unless there is no other choice.

        > I believe the primary difference is that blocks returned by node allocation
        > cannot later be resized, which can reduce bookkeeping overhead and improve
        > allocation/deallocation speed. Your experience may be different, but Ion’s
        > followup study to N2045 seems to indicate advantages of separating node and
        > array allocation. I would also expect a program that node-allocates a block
        > of a given size typically allocates many same-sized blocks, where this would
        > not (typically) be true for array allocation. Effectively, the program can
        > communicate its usage patterns more clearly to the allocator. This is purely
        > speculation, however.

        Ah I see – he wants a specialised allocator which doesn’t keep the block metadata with the block. I can tell you right now that no OS vendor would permit such an allocator in an ISO standard – it doesn’t offer enough security against malicious memory corruption you see.

        That said, I see absolutely no reason why someone can’t write their own custom std::allocator with the existing API. If I understand what you meant that is.

        > But more importantly, I don’t see how backwards expansion is inherently more
        > difficult to implement. Perhaps you can elaborate? I also don’t see the
        > quadratic complexity. It seems you’re assuming a particular allocator design
        > (e.g., dlmalloc) here.

        Well I don’t want to go into too much detail – it’s a bit off topic, and this reply will be long enough anyway. However, one of the slowest parts of any allocator has to be working out what memory can be returned to the system. Generally speaking, when one frees a block one looks above and below it and, if those regions are also free, one conjoins the lot into a contiguous free region. However, most allocators only store a forward-only pointer per block to save on metadata overheads, so they can only coalesce with the block after rather than the block before. As a result, they need to run a periodic free space coalescing routine in which they traverse the free space list and coalesce anything which isn’t yet coalesced. Why forward only? One would store per block the block size and a pointer to its arena, which is 8 bytes on 32 bit and 16 bytes on 64 bit. One can work out from the block size where the next block is, but NOT where the previous block is.

        The problem with backwards expansion is that you absolutely require a previous pointer stored per block. That not only increases the metadata needed by 50%, but it also greatly increases the cache working set required during the free space coalescing routine because lots of pointers not near the most recently accessed location need to be touched. All this translates into a much slower allocator due to the CPU stalling on main memory unless you employ a free space index rather than using the free spaces as a linked list.

        By the way, there are lots of exceptions to what I just said by using all sorts of clever tricks. Nevertheless, I would oppose backwards expansion on the principle that, generally speaking, it is slow to implement.
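
        To make the forward-only layout concrete (a sketch only – real allocators such as dlmalloc differ in detail):

            #include <stddef.h>

            struct arena;                  /* owning arena/heap */
            struct block {
                size_t        size;        /* block size incl. header */
                struct arena *owner;       /* 8/16 bytes of metadata
                                              on 32/64 bit, as above */
                /* payload follows */
            };

            /* O(1): the size says where the NEXT block starts... */
            static struct block *next_block(struct block *b)
            {
                return (struct block *)((char *)b + b->size);
            }
            /* ...but prev_block() cannot be written: nothing records
               the previous block's size, hence the forward-only free
               space traversal described above. */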

        > I’m uncomfortable with assuming that address space reservation will always
        > be available (and thus whether this really obviates the need for backwards
        > expansion). Nested allocators? Fixed-size buffer allocators? I don’t know.
        > It’s difficult for me to buy into arguments about a general allocator
        > interface based on a specific implementation of that interface.

        Firstly, address reservation is a hint precisely because on current hardware it isn’t worth it when the block is less than a few hundred Kb long. You’d only turn it on when blocks exceed maybe a quarter of the L2 cache size.

        Secondly all major OSs have supported some form of address reservation for decades (e.g. mmap() or VirtualAlloc()). I implemented full support for this proposal in two days. I’d doubt it would take any longer than a man week for the main OS vendors.
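
        For example (a minimal POSIX sketch; VirtualAlloc(MEM_RESERVE) and VirtualAlloc(MEM_COMMIT) are the Windows equivalents, and MAP_ANONYMOUS is spelled MAP_ANON on some systems):

            #include <sys/mman.h>

            #define RESERVATION (8u * 1024 * 1024)  /* reserve 8Mb  */
            #define COMMITTED   (256u * 1024)       /* commit 256Kb */

            static void *reserve_and_commit(void)
            {
                /* Reserve: address space only, nothing committed. */
                void *base = mmap(NULL, RESERVATION, PROT_NONE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (base == MAP_FAILED)
                    return NULL;

                /* Commit the first chunk; extend as it fills up. */
                if (mprotect(base, COMMITTED, PROT_READ | PROT_WRITE)) {
                    munmap(base, RESERVATION);
                    return NULL;
                }
                return base;
            }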

        > The problem with alloca is portability, of course. What’s your feel on this
        > when alloca is not an option?

        I have to say alloca is always an option. GCC supports it on almost every supported platform. So does MSVC. Just there you’ve covered 98% of use cases due to those two being the two most popular compilers by far.

        The only possible problem is lack of stack space, so alloca causes stack exhaustion on embedded systems. That’s easy enough to fix though – just call independent_comalloc in smaller chunks.

        > The address space reservation looks like a good idea, but I can understand
        > how an abstract allocator interface would want to minimize the exposure of
        > the details of the underlying memory system.

        Sure, that’s why it’s as barebones as it is: just two macros which *could* significantly improve performance. If they’re no-ops then nothing is lost – performance is no worse than at present.

        > I find the argument to remove the type dependece convincing, as it will
        > simplify container usage by removing the need to “rebind” your allocator. I
        > think adding stateful allocators is a given in any modern C++ allocator
        > proposal.

      I wouldn’t agree, but I’ll explain why on the References page: what’s wrong with the STL allocator is how it has been implemented, not the design itself.

        Niall

      • Posted October 4, 2010 at 9:24 am | Permalink

        [...]
        > > I think it would be helpful if you clarified what you mean by “memory
        > > bandwidth wastage”.
        [...]
        > Bandwidth = how much of storage you can access per unit of time
        [...]
        > Memory bandwidth wastage is reading or writing data you do not need to. If you
        > memset(0) a region and then – after it has left the L2 cache – you do another
        > memset(0) on the same region, then you are wasting memory bandwidth.
        [...]

        That’s pretty much what I had in mind.

        > I would rather say that I have made use of address space reservation as a
        > mechanism which allows the removal of much of the complexity in N2045. I like
        > things as simple as possible unless there is no other choice.

        I still think there is some value in N2045 (and related) when address space
        reservation is unavailable or impractical (small allocations, fixed-size
        buffers), but at this point I agree it doesn’t address the same concerns that
        your (malloc) proposal does.

        > > I believe the primary difference is that blocks returned by node allocation
        > > cannot later be resized, which can reduce bookkeeping overhead and improve
        > > allocation/deallocation speed. Your experience may be different, but Ion’s
        > > followup study to N2045 seems to indicate advantages of separating node and
        > > array allocation. I would also expect a program that node-allocates a block
        > > of a given size typically allocates many same-sized blocks, where this would
        > > not (typically) be true for array allocation. Effectively, the program can
        > > communicate its usage patterns more clearly to the allocator. This is purely
        > > speculation, however.
        >
        > Ah I see – he wants a specialised allocator which doesn’t keep the block
        > metadata with the block. I can tell you right now that no OS vendor would
        > permit such an allocator in an ISO standard – it doesn’t offer enough security
        > against malicious memory corruption you see.

        I don’t see how this justifies your rejection of the *interface*…? The
        underlying allocator is free to use whatever redundant bookkeeping metadata it
        wants.

        > That said, I see absolutely no reason why someone can’t write their own custom
        > std::allocator with the existing API. If I understand what you meant that is.

        Hmmm…perhaps you can explain. The current std::allocator interface makes no
        distinction between array allocation and node allocation (although I understand
        some allocators attempt to make up for this fact by assuming that an array
        allocation of 1 object is really a node allocation, but, to me, that’s a hack
        necessitated by a deficient interface).

        > > But more importantly, I don’t see how backwards expansion is inherently more
        > > difficult to implement. Perhaps you can elaborate? I also don’t see the
        > > quadratic complexity. It seems you’re assuming a particular allocator design
        > > (e.g., dlmalloc) here.
        [...]
        > The problem with backwards expansion is that you absolutely require a previous
        > pointer stored per block.
        [...]
        > By the way, there are lots of exceptions to what I just said by using all
        > sorts of clever tricks. Nevertheless, I would oppose backwards expansion on
        > the principle that generally speaking it is usually slow to implement.

        “absolutely require” followed by “exceptions”…nice ;)

        For those following at home, I’ll just mention that [1] points out the use
        of a footer tag with the block size and embedding in the free bits of the header
        tag the “freeness” of the previous adjacent block, allowing you to determine, in
        constant time, if the previous adjacent block is free, and where its header is.
        There’s some additional bookkeeping overhead, but I didn’t find it to be much.
        This technique suggested to me that coalescing with the previous adjacent block
        was not an uncommon capability, and hence backwards expansion would not be
        difficult. I suppose your experience has suggested otherwise.

        [1] “Dynamic Storage Allocation: A Survey and Critical Review” by Paul R.
        Wilson, Mark S. Johnstone, Michael Neeley, and David Boles
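
        A sketch of that layout, for the curious (my own illustration, not code from [1]):

            #include <stddef.h>

            #define PREV_FREE 0x1u  /* spare low bit: sizes are even */

            struct header { size_t size_and_flags; };
            struct footer { size_t size; };  /* written only while free */

            /* Constant time: if the previous adjacent block is free,
               its footer sits immediately below our header and gives
               its size, and therefore where its header is. */
            static struct header *prev_header_if_free(struct header *h)
            {
                struct footer *f;
                if (!(h->size_and_flags & PREV_FREE))
                    return NULL;            /* previous block in use */
                f = (struct footer *)((char *)h
                                      - sizeof(struct footer));
                return (struct header *)((char *)h - f->size);
            }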

        - Jeff

      • Posted October 7, 2010 at 3:40 pm | Permalink

        > I still think there is some value in N2045 (and related) when address space
        > reservation is unavailable or impractical (small allocations, fixed-size
        > buffers), but at this point I agree it doesn’t address the same concerns that
        > your (malloc) proposal does.

        I am not saying that N2045 isn’t valuable – or indeed any of the other features you have mentioned here. My difficulty is (i) are these features necessary in an international standard and (ii) are they perceived as necessary by WG14?

        Backward expansion – yes it is useful in certain circumstances. But given it still requires move construction of the objects contained in the allocation, is it such a boon to efficiency that it needs codifying by ISO?

        Node vs. array allocation – is explicit C++ API support over and above something based on C’s independent_comalloc functions so necessary that it needs codifying by ISO?

        Given that N2045, and indeed all the others, drew little interest from the ISO committees up until now I would say it reasonable to conclude that they aren’t interested in such things because they don’t perceive them as a need. That would suggest to me that an approach of “the smaller the better” would be wise if you want the thing adopted.

        BTW, I’ve been allocated N1519 and it will need to be submitted next week. Would you like me to email it to you before it goes in?

        Niall

      • Posted October 10, 2010 at 6:00 am | Permalink

        > BTW, I’ve been allocated N1519 and it will need to be submitted next week.
        > Would you like me to email it to you before it goes in?

        Not necessary, but thank you. I think you’ve done a fine job rationalizing your
        design decisions in this discussion, something you may wish to add to your
        proposals. I’m sure the C and C++ committees will have much more insightful
        comments, which I’d be interested in reading. Keep me posted on any new
        developments.

      • Posted October 10, 2010 at 11:36 am | Permalink

        > Not necessary, but thank you. I think you’ve done a fine job rationalizing your
        > design decisions in this discussion, something you may wish to add to your
        > proposals. I’m sure the C and C++ committees will have much more insightful
        > comments, which I’d be interested in reading. Keep me posted on any new
        > developments.

        Thank you very much for saying so, though I feel like I just kept saying “No!” most of the time!

        N1519 is purely for C1X, and most of our discussion was on C++ and its STL rather than on C itself. Indeed, most of the strong feelings anyone has about memory allocation seem to involve C++. It’s interesting how C++ still arouses strong emotions even after all these years.

        However, I have incorporated some of your suggestions as well as those of others. There was a clear desire from many for keeping the API simple, so the basic plan is a two-tier API: a simple API which front-ends the complex API. The complex API is a pure batch API, but I’ll probably have full, medium and light variants because of the size of the data structures.

        I already have three pages of rationalisations of what was included and excluded ;) so hopefully that will make them happy. Also, because the proposed API has changed so much I’ll have to spend a day doing more surgery to dlmalloc to get a working implementation. Ah, the work!

        Anyway thank you very much indeed for your help. I have singled you out for special thanks in the foreword to N1519. Best of luck with whatever you pursue I guess.

        Niall

  3. Gene B
    Posted September 25, 2010 at 11:56 am | Permalink | Reply

    I don’t see how std::vector is a good example of needing realloc. This container employs an exponential growth strategy (I think a typical growth factor is around 1.3-1.4), thus amortising vector operations to O(1) complexity. What this means is that the number of reallocations (and possible copies) is O(log(n)). One can compare the performance of a vector which reserves capacity for n items in advance versus one that doesn’t: there will be little difference for large vectors.
    On another note, there is an old (2002) research paper, “Reconsidering Custom Memory Allocation”, which demonstrated that in the average case custom allocators don’t improve performance noticeably.

    • Posted September 27, 2010 at 4:59 pm | Permalink | Reply

      Thanks for the comment.

      Overcommitting memory is bad – one really ought to hint when memory is being used as a placeholder versus when it shall shortly be used. Address reservation lets application code supply the appropriate metadata to the memory allocation subsystem, though I agree that the POSIX method of mmap() + madvise() could be optimised so as to be equivalent. Right now, it’s far too slow to be useful.

      Right now reallocation for reasonably sized vectors ALWAYS means a full move or copy construction. That’s an O(N) operation with lots of memory bandwidth usage, and when it happens is usually unpredictable to the code using std::vector. Seeing as it is remarkably easy to fix this such that one never sees a relocation at all, it seems foolish that this is not already so.

      The Berger 2002 paper mainly reflected that general purpose allocator implementations at that time had become so much faster that they outperformed most of the specialised strategies of the time. Eight years on things have changed: general purpose allocators have become many times faster again, and even fewer specialised strategies make sense, especially in the light of ongoing code maintenance.

      In my own experience, just about the only specialised strategy that still makes sense is the pointer incrementing allocator i.e. it just advances a pointer by X bytes, then you throw away the entire block when done. Even then you’ll find that the cache and TLB exhaustion introduced makes your full malloc implementation faster when the working set exceeds a few hundred Kb. It’s nuts really – holistically a few hundred cycle malloc can be faster than a pointer increment, but that’s modern caching for you!
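
      That strategy in full, as a minimal sketch (real arena allocators add alignment guarantees and chaining of blocks):

          #include <stdlib.h>

          struct pool { char *base, *cur, *end; };

          static int pool_init(struct pool *p, size_t capacity)
          {
              p->base = p->cur = malloc(capacity);
              p->end = p->base ? p->base + capacity : NULL;
              return p->base != NULL;
          }

          static void *pool_alloc(struct pool *p, size_t n)
          {
              void *r = p->cur;
              /* Keep allocations pointer-aligned. */
              n = (n + sizeof(void *) - 1) & ~(sizeof(void *) - 1);
              if ((size_t)(p->end - p->cur) < n)
                  return NULL;      /* pool exhausted */
              p->cur += n;          /* the whole allocator: one add */
              return r;
          }

          static void pool_destroy(struct pool *p)
          {
              free(p->base);        /* throw the entire block away */
          }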

      Niall
