This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH 3/3] sysdeps/arm/bits/atomic.h: Use relaxed atomics for catomic_*


On Mon, 2014-10-06 at 15:13 +0100, Will Newton wrote:
> On 6 October 2014 14:43, Torvald Riegel <triegel@redhat.com> wrote:
> > On Fri, 2014-10-03 at 16:11 +0100, Will Newton wrote:
> >> Using the relaxed memory model for atomics when single-threaded allows
> >> a reduction in the number of barriers (dmb) executed and an improvement in
> >> single thread performance on the malloc benchtest:
> >
> > I'm aware that we do have catomic* functions and they are being used.
> > However, I'm wondering whether they are the right tool for what we want
> > to achieve.
> 
> They are kind of ugly and rather undocumented as to their precise
> semantics, so I share your general concerns about these functions.
> 
> > They simply allow to avoid some of the overhead of atomics (but not
> > necessarily all).  Wouldn't it be better to change the calling code to
> > contain optimized paths for single-threaded execution?
> 
> How would you suggest implementing that? My first instinct is that the
> result would look a lot like what we have now, i.e. some kind of
> wrapper round atomic functions that special-cases the single-threaded
> case.

First, I'd differentiate between functions that can be reentrant and
those that aren't.  For the former, we'd use sequential code, which
would probably also avoid a few loops and branches because there's
nothing happening concurrently.  For the latter, using catomic* or
memory_order_relaxed atomics directly (after the transition to C11
atomics for this particular piece of code) would be the way to go I
guess, because you still need atomic ops even though you don't need to
synchronize with other threads (although the rules as to which atomics
can really be used in signal handlers is still being discussed, at least
in the C++ committee).

> malloc is the main performance critical subsystem using catomic, so it
> may be possible to do more of this work there and reduce the
> complexity of the generic atomic implementation (although I believe an
> earlier patch did do this to some extent but was rejected).

Having a single-threaded malloc implementation sounds like the right
thing to me.  We'd have some additional code, but the sequential code
would only be a subset of the program logic for the concurrent case
(IOW, you just would leave out stuff that cares about other
interleavings with other threads).

> > Also, calling code could either be reentrant or not.  For the former,
> > you could even avoid actual atomic accesses instead of just avoiding the
> > barriers.  Also, the compiler could perhaps generate more efficient code
> > if it doesn't have to deal with (relaxed) atomics.
> 
> Yes, that would be ideal if we had that option. It's not clear to me
> what catomic_ actually means, it seems from the previous discussions
> that it has to be atomic wrt. signal handlers which is why the atomic
> operations remain (but the barriers can be dropped).

That's my understanding too.  IIRC, it drops the "lock" prefix on x86,
for example.

> malloc is
> generally not re-entrant or AS-safe so optimizing away the atomic
> instructions would be a bg win here...
> 
> >> Before: 259.073
> >> After: 246.749
> >
> > What is the performance number for a program that does have several
> > threads but runs with your patch (i.e., has conditional execution but
> > can't avoid the barriers)?
> 
> I don't have them, but I will look at that.
> 
> > Do you have numbers for a hacked malloc that uses no atomics?
> 
> No, I'll see what I can do.
> 

Thanks.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]