
Re: [RFC] malloc: add random offset to mmapped memory


2015-03-02 8:33 GMT+01:00 Mike Frysinger <vapier@gentoo.org>:
> while i'm not against making programs work faster when possible, i'm not sure
> your example here is a good one.  it seems like you're purposefully writing
> (imo) bad code that ignores the realities of cpu caches.
This was just the simplest example that shows the slowdown due to
cache conflicts for page-aligned mallocs.
Here the loop interchange is trivial and results in much better code.
But consider, for example, code like:
    for (size_t i = 0; i < length; i++) {
        arr[0][i] = arr[1][i] - arr[2][i]
                  + arr[3][i] - arr[4][i]
                  + arr[5][i] - arr[6][i]
                  + ... ;
    }
Now it is not really clear how to change this loop to avoid cache
conflicts. One could split the loop as follows:
    for (size_t i = 0; i < length; i++) {
        arr[0][i] = arr[1][i] - arr[2][i] + arr[3][i] - arr[4][i];
    }
    for (size_t i = 0; i < length; i++) {
        arr[0][i] += arr[5][i] - arr[6][i] + arr[7][i] - arr[8][i];
    }
    ...
But now the application has to know the associativity of the cache it
is running on, and of course it will never be as efficient as the
original loop.

One could also argue that an application that cares about this kind
of performance detail should add the offset to the malloc'ed memory
itself. But I think there is value in having glibc do it for all
applications.
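
For illustration, here is a minimal sketch of how an application could
add such an offset on top of plain malloc itself. The helper names, the
64-byte cache-line size and the one-page slack are my own assumptions,
not an existing interface:

    #include <stdlib.h>

    #define LINE   64          /* assumed cache-line size */
    #define SLACK  4096        /* up to one page of extra offset */

    /* Allocate extra slack and return a pointer shifted by a pseudo-random
       number of cache lines, so that several large arrays do not all start
       at the same offset within a page.  The real pointer is stored in the
       header in front of the returned block so it can be freed later. */
    static void *malloc_offset(size_t size)
    {
        char *base = malloc(size + SLACK + LINE);
        if (base == NULL)
            return NULL;
        size_t shift = ((size_t)rand() % (SLACK / LINE)) * LINE;
        char *user = base + LINE + shift;
        ((void **)user)[-1] = base;   /* remember the real allocation */
        return user;
    }

    static void free_offset(void *ptr)
    {
        if (ptr != NULL)
            free(((void **)ptr)[-1]);
    }

But every application would have to carry a wrapper like this, which is
why doing it once in glibc seems attractive.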

> especially when you start talking about creating artificially bad scenarios by
> turning up the MALLOC_MMAP_THRESHOLD_ knob.  forcing lots of allocations to
> come from direct mmap's will put pressure on the system and can be even worse
> for performance than cache-hostile code like you've shown here.
The knob was only used to be able to measure the difference between a
malloc served from an arena and a malloc served by an mmap call, while
keeping all the other variables the same. Of course the same
performance degradation comes into play when the arrays are made
larger (say hundreds of megabytes).
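
For reference, forcing the mmap path in the measurement only takes a
single mallopt call (or the MALLOC_MMAP_THRESHOLD_ environment
variable); the 64 KiB value below is just an illustrative choice, not
the one from my test:

    #include <malloc.h>

    int main(void)
    {
        /* Make allocations above 64 KiB come from direct mmap, so the
           arena case and the mmap case can be compared for arrays of the
           same size.  The same effect without recompiling:
               MALLOC_MMAP_THRESHOLD_=65536 ./benchmark            */
        mallopt(M_MMAP_THRESHOLD, 64 * 1024);
        /* ... run the allocation/benchmark code here ... */
        return 0;
    }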

> it might help your case if you had a real world example that didn't specifically
> do both of those things ...
I do have some real-world code; that is how I encountered this issue.
The curious thing was that, for an application that needed several
temporary arrays for a certain calculation, it was faster to allocate
and free the arrays each time the calculation was performed than to
allocate them once before the first calculation. After some
investigation it turned out that these arrays (around 1-4 MB) were of
such a size that, for the first couple of calculations, glibc would
return an mmap-backed pointer from malloc. So the first iterations
were slower because the pointers were page-aligned. But after a couple
of malloc-free cycles, the mmap threshold within glibc was adjusted
and these mallocs became non-page-aligned, speeding up the actual
calculation.
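
The effect is easy to make visible with a small standalone sketch
(the 2 MiB size is chosen arbitrarily, above glibc's default 128 KiB
mmap threshold): on a default glibc the first iteration typically
prints a zero page offset (mmap-backed) and later iterations do not,
once freeing the mmap'ed chunk has raised the dynamic threshold:

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    int main(void)
    {
        const size_t size = 2 * 1024 * 1024;

        for (int i = 0; i < 5; i++) {
            unsigned char *p = malloc(size);
            if (p == NULL)
                return 1;
            /* Page offset 0 means the block starts on a page boundary. */
            printf("iteration %d: page offset of pointer = 0x%zx\n",
                   i, (size_t)((uintptr_t)p & 0xfff));
            free(p);
        }
        return 0;
    }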

I provided the tiny sample code because this real-world code is not a
nice self-contained example that I would ask people on the mailing
list to read as an illustration of the kind of thing triggering this
behaviour. Furthermore, it is proprietary, so it cannot readily be
shared. Of course, I could see whether I can trim the code down to
show the same effect of cache conflicts for a slightly more
interesting calculation.

Maarten

