This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: Lock elision problems in glibc-2.18


On Tue, Sep 17, 2013 at 03:59:54PM +0200, Torvald Riegel wrote:
> On Tue, 2013-09-17 at 14:49 +0200, Dominik Vogt wrote:
> Just to recap, in this scenario the caller both misinterprets abort
> codes, and when it falls back to nontransactional execution eventually,
> it executes code that doesn't make use of glibc's lock elision.

Yes.

> > The initial design idea of the tuning was (at least as far as I
> > understood it) that it should automatically adapt and show
> > "sensible" behaviour after a while that at least makes sure that
> > there isn't a big performance loss.  Now we have a case where the
> > tuning code is never even executed.  This contradicts the design
> > idea, so either the design or the implementation needs to be
> > fixed (or my idea of what the design is).
> 
> I think the scenario above isn't a problem.  In the first phase of it,
> when the caller executes transactionally but has to retry, no tuning of
> nested code could ever take place unless the caller is aware of *and*
> initiating the tuning of the nested code.  This is because we do abort
> (for whatever reason), and once we do that we roll back any tuning as
> well.

> It is unreasonable to assume that any combination of caller and
> callee would be integrated in terms of tuning.  It's also impractical to
> assume that even the caller knows which callee needs to be tuned --
> there's no way to track where the abort happened, and why (unless you
> use performance counters or such).

Agreed; it is not obvious how tuning could be enforced in all
cases (though on z it is possible to perform nontransactional
stores inside a transaction, i.e. stores that are committed even
if the transaction aborts).
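To illustrate the point in plain C (this only simulates the hardware
behaviour; the struct and function names are made up for the example):
an ordinary store is rolled back when the transaction aborts, while a
nontransactional store survives, so e.g. an abort counter written that
way could reach the tuning code even from inside a transaction.

```c
#include <assert.h>

struct mem { int tuned_counter; int data; };

/* Run a "transaction" that aborts: roll back the ordinary store,
   keep the value written by the (simulated) nontransactional store.  */
static void run_and_abort(struct mem *m)
{
    struct mem snapshot = *m;        /* state at transaction begin */
    m->data = 42;                    /* ordinary transactional store */
    m->tuned_counter += 1;           /* pretend this used a nontransactional store */
    int survivor = m->tuned_counter; /* value that must survive the abort */
    *m = snapshot;                   /* abort: roll all stores back... */
    m->tuned_counter = survivor;     /* ...except the nontransactional one */
}
```

After an abort, `data` is back at its old value but `tuned_counter`
has advanced, which is exactly the channel tuning would need.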

Trying to guarantee a maximum performance loss in absence of
tuning is terribly difficult.

> In the second phase, the caller simply chooses to use other callees, and
> then it doesn't matter whether glibc's elision tuning is used or not.
> 
> The situation that we want to avoid in this case is aborting due to
> bad tuning (the only possible cause for this should be having to
> abort on trylock(); anything else shouldn't make things worse given that
> we're executing transactionally already).  But the only way out of this
> is for the caller to not use transactional execution, and let callees
> tune.  Thus, not misinterpreting abort codes, or being conservative on
> explicit aborts, would be the right thing to do.

Then maybe we should take a different approach to tuning.  At the
moment it's very liberal; elision is tried optimistically and
using locks is the "bad" fallback mode.  Perhaps using locks
should be the default, and a given mutex would have to prove that
it's worth trying transactions.  (With TSX there still remains the
problem of communicating this from inside nested transactions.)
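Such a "locks by default" policy could look roughly like the sketch
below: a mutex starts in lock-only mode, occasionally runs a trial
transaction, and only enables elision after enough trials commit.  All
names and thresholds are illustrative, not glibc's actual code.

```c
#include <assert.h>
#include <stdbool.h>

struct elision_state {
    int trial_countdown;   /* lock acquisitions until the next trial */
    int successes;         /* committed trial transactions so far */
    bool elide;            /* elision proven worthwhile? */
};

#define TRIAL_INTERVAL  100  /* try a transaction every N acquisitions */
#define PROOF_THRESHOLD 3    /* commits needed to enable elision */

/* Decide whether this acquisition should attempt a transaction.  */
static bool should_try_transaction(struct elision_state *s)
{
    if (s->elide)
        return true;
    if (--s->trial_countdown > 0)
        return false;
    s->trial_countdown = TRIAL_INTERVAL;
    return true;
}

/* Feed back the outcome of a trial transaction.  */
static void record_outcome(struct elision_state *s, bool committed)
{
    if (committed) {
        if (++s->successes >= PROOF_THRESHOLD)
            s->elide = true;
    } else {
        s->successes = 0;  /* an abort resets the accumulated proof */
    }
}
```

The point of the sketch is the default: in the worst case a mutex only
pays for one trial transaction per TRIAL_INTERVAL acquisitions, so the
maximum loss is bounded even without any feedback from nested code.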

> > Perhaps we should start with writing down the requirements for
> > the tuning algorithm.  Just some bullets that come to my mind:
> > 
> >  * The ultimate goal is to use elision where it helps and to not
> >    use it where it harms.
> >  * A (possibly recurring) training phase where the algorithm
> >    performs badly is acceptable.
> >  * After a training phase, a certain minimum or average
> >    performance must be guaranteed in all cases.  If not, what are
> >    the exceptions, and how do we deal with them (e.g. argue that
> >    they are irrelevant (reasons), easy to avoid etc.).
> 
> Those sound good.
> 
> >  * If lock elision is ever to be enabled by default, the tuning
> >    algorithm must make sure that no (relevant?) software shows
> >    a significant (unacceptable) performance loss.
> 
> Possibly.  Although this is certainly a trade-off between average
> performance gains and losses in particular cases.   We certainly want to
> avoid cases where performance really crashes.

If you're not comfortable with this proposed requirement, then
what is the requirement you have in mind?  I think we really need
to pin this down.  Otherwise we'll always have cases where one of
us thinks a certain scenario is unacceptable and someone else
thinks it is acceptable.

I've made a quick test on z.  There is only one thread that does

  pthread_mutex_lock(&mutex);
  if (in_transaction())
    TABORT(temporary);   /* abort with a temporary abort code */
  else
    ;                    /* do nothing */
  pthread_mutex_unlock(&mutex);

This is maybe the shortest possible transaction body that aborts
every transaction with a temporary abort code, thus triggering
the out of retries situation.

relative
performance  setup
-----------  ----------------------------------------------------
       100%  elision disabled through configure
        ~7%  elision enabled, without the out of retries patch
       ~81%  elision enabled, with the out of retries patch,
             skip_lock_out_of_retries set to very conservative
             value (32000)

Okay, that scenario _is_ artificial, but it shows roughly the
worst that can happen if a piece of software has extreme data
contention and uses short-lived locks.  The numbers may be
different for TSX.

With the out of retries patch the worst case is a performance
loss of about 20%; without the patch, performance can drop to
abysmal values.  Of course, none of this says anything about
average performance.
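The back-off that the patch adds can be sketched as follows: once the
retry budget for an acquisition is exhausted, elision is skipped for
the next skip_lock_out_of_retries acquisitions, so the plain lock path
dominates and the worst case stays bounded.  The field names follow
the mail; the structure itself is illustrative, not glibc's code.

```c
#include <assert.h>
#include <stdbool.h>

struct adapt {
    int retry_budget;              /* transactional retries per acquisition */
    int skip_lock_out_of_retries;  /* acquisitions to skip after giving up */
    int skip_count;                /* current skip countdown */
};

/* Returns true if this acquisition should try elision.  'aborts' is
   the number of temporary aborts the attempt would hit -- in the test
   program above every attempt aborts, so the budget is always
   exceeded.  */
static bool acquire(struct adapt *a, int aborts)
{
    if (a->skip_count > 0) {
        a->skip_count--;           /* back-off: take the lock directly */
        return false;
    }
    if (aborts >= a->retry_budget) {
        /* out of retries: fall back and skip elision for a while */
        a->skip_count = a->skip_lock_out_of_retries;
        return false;
    }
    return true;
}
```

With the conservative value from the table (32000), a pathological
mutex attempts elision roughly once per 32001 acquisitions, which is
what keeps the measured loss near 20% instead of the unpatched ~93%.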

> > > It
> > > can also expect that callers with enclosing transactions will eventually
> > > fall back to nontransactional execution.
> > 
> > Let's rather say:  Expect that the caller will eventually take
> > measures to ensure forward progress.  This includes mechanisms like
> > the aforementioned constrained transactions.
> 
> I believe that the primary use of constrained transactions is really
> small transactions (eg, implementing things like a double-CAS, custom
> concurrent algorithms, etc.); calling arbitrary code that you don't
> control (eg, glibc) is unlikely to be a successful use.

That is probably the case.  Implementing constrained transactions
must have been a huge effort for the hardware people, and to make
it possible there are really strict limitations (no more than 32
instructions, no backward branches, code has to reside in at most
two cache lines, a very limited memory footprint, and maybe more
that I can't recall off the top of my head).

Ciao

Dominik ^_^  ^_^

-- 

Dominik Vogt
IBM Germany

