This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision



On 15/11/2016 14:12, Rajalakshmi Srinivasaraghavan wrote:
> 
> The initial problem reported was memory corruption in MongoDB, seen only on Ubuntu 16.04 ppc64le (not on Ubuntu 15).
> The problem was not easily reproducible; after debugging with many tools and creative ideas, it was narrowed down to lock elision in glibc.
> Ubuntu 16.04 ships glibc 2.23 with lock elision enabled by default, which is why the testcase fails only on 16.04.
> 
> As stated in BZ 20822, the short description is:
> 
> "The update of *adapt_count after the release of the lock causes a race condition. Thread A unlocks, thread B continues and returns, destroying mutex on stack,
>  then gets into another function, thread A writes to *adapt_count and corrupts stack.The window is very narrow and requires that the
> machine be in smt mode, so likely the two threads have to be on the same core in order to hit this"
> 
> Looking back at the changes to sysdeps/unix/sysv/linux/powerpc/elision-unlock.c, the suspect is commit
> https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=fadd2ad9cc36115440d50b0eae9299e65988917d, which was
> introduced to improve performance.
> 
> I am thinking of the following fixes:
> 
> 1) Update adapt_count before the unlock. But I have not yet determined whether this will affect performance.
> 
> 2) In elision-lock.c/elision-trylock.c, update adapt_count right after the lock is acquired, at the end of the function.
> In either case we have to be careful not to introduce another race condition.
> 
> Any suggestions?
> 

The fadd2ad9c commit indeed should have raised a flag, since it
updates a variable ('adapt_count', i.e. pthread_mutex_t.__data.__elision)
both inside a transaction, where the access is semantically atomic,
and later in a non-atomic way.  This is wrong, and I think we should
either revert the patch or move the adapt_count update to 'inside'
the transaction (your first option).
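
To make the ordering concrete, the fallback path in
sysdeps/unix/sysv/linux/powerpc/elision-unlock.c after fadd2ad9c looks
roughly like this (a simplified sketch, not the exact source; the
helper names follow the real file but details may differ):

int
__lll_unlock_elision (int *lock, short *adapt_count, int pshared)
{
  /* When the lock word is still free we are inside a transaction:
     just commit it.  */
  if (*lock == 0)
    __libc_tend (0);
  else
    {
      /* Fallback path: release the lock first ...  */
      lll_unlock (*lock, pshared);

      /* ... and only then store to the adapt count, which lives
         inside the pthread_mutex_t itself.  By this point another
         thread may already have acquired, released and destroyed or
         deallocated the mutex, so this store can land in reused
         memory.  Your first option would move the decrement before
         lll_unlock, closing that window.  */
      if (*adapt_count > 0)
        (*adapt_count)--;
    }
  return 0;
}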

However, I am not sure this is the culprit of the issue.  The bug
report states the faulty scenario is:

"[...]  Thread A unlocks, thread B continues and returns, destroying
mutex on stack, then gets into another function, thread A writes to
*adapt_count and corrupts stack. [...]"

It is not clear what happens *after* thread B "destroys" the mutex:
whether the program just calls 'pthread_mutex_destroy' or whether the
thread deallocates the mutex on the stack (by returning or something
else).  The first option should not trigger an invalid access, but
the second is definitively a program error.
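
A hypothetical shape for the second case (illustrative only; the
'work_done' flag, 'thread_a'/'thread_b' and the polling loop are made
up, not taken from the actual MongoDB code):

#include <pthread.h>
#include <stdbool.h>

static bool work_done;          /* protected by the stack mutex below */

/* Thread A: signals completion under the lock.  The trailing
   adapt_count store in __lll_unlock_elision happens after
   pthread_mutex_unlock has already made the lock available.  */
static void *
thread_a (void *arg)
{
  pthread_mutex_t *m = arg;
  pthread_mutex_lock (m);
  work_done = true;
  pthread_mutex_unlock (m);
  return NULL;
}

/* Thread B: the mutex storage lives in this stack frame.  */
static void
thread_b (void)
{
  pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;   /* on the stack */
  pthread_t a;
  bool done = false;

  pthread_create (&a, NULL, thread_a, &m);
  pthread_detach (a);
  while (!done)
    {
      pthread_mutex_lock (&m);
      done = work_done;
      pthread_mutex_unlock (&m);
    }
  /* First case: call pthread_mutex_destroy (&m) here; the storage
     itself remains valid until the function returns.
     Second case: just return; the frame, and m with it, is reused
     while thread A may still be writing to m's adapt count.  */
}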

Either way, I am also not sure whether the current behaviour leads
to a correctness issue rather than just a performance one: the racy
'adapt_count' usage would lead to TLE being avoided, not really to an
invalid memory access.
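
For reference, the lock side only consults adapt_count to decide
whether to attempt a transaction at all, roughly (again a simplified
sketch of sysdeps/unix/sysv/linux/powerpc/elision-lock.c, not the
exact source):

int
__lll_lock_elision (int *lock, short *adapt_count, int pshared)
{
  /* A non-zero adapt_count just means "skip elision for now", so a
     racy or stale value changes whether TLE is attempted, not
     whether the lock itself is acquired correctly.  */
  if (*adapt_count > 0)
    goto use_lock;

  /* ... try __libc_tbegin () and elide the lock; on a persistent
     abort set *adapt_count so the next calls fall back ...  */

use_lock:
  return LLL_LOCK (*lock, pshared);
}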

