Re: [PATCH][AArch64] Single thread lowlevellock optimization


On 20/06/17 14:47, Torvald Riegel wrote:
> On Fri, 2017-06-16 at 17:26 +0100, Szabolcs Nagy wrote:
>> Do single thread lock optimization in aarch64 libc. Atomic operations
>> hurt the performance of some single-threaded programs using stdio
>> (usually getc/putc in a loop).
>>
>> Ideally such optimization should be done at a higher level and in a
>> target independent way as in
>> https://sourceware.org/ml/libc-alpha/2017-05/msg00479.html
>> but that approach will need more discussion so do it in lowlevellocks,
>> similarly to x86, until there is consensus.
> 
> I disagree that this is sufficient reason to do the right thing here
> (ie, optimize in the high-level algorithm).  What further discussion is
> needed re the high-level use case?
> 

One open issue is detecting malloc interposition at startup time so the
optimization can be disabled (this should be easy; I was just not sure
what the right place to do it is).
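
One possible way to do the detection, purely as an illustration of the
idea (it relies on __libc_malloc being an exported alias of libc's
malloc; the real check would live in libc startup code, not in an
application like this):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

extern void *__libc_malloc (size_t);   /* glibc's own implementation */

int main (void)
{
  /* If something (e.g. an LD_PRELOADed allocator) interposes malloc,
     default symbol resolution no longer points at libc's copy.  */
  void *resolved = dlsym (RTLD_DEFAULT, "malloc");
  printf ("malloc interposed: %s\n",
          resolved == (void *) __libc_malloc ? "no" : "yes");
  return 0;
}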

The current _IO_USER_LOCK flag could use the same mechanism: instead of
doing a check at every flockfile/funlockfile, check once at entry into
getc and jump to getc_unlocked. The stdio code may need some
refactoring to make this possible, though.

I allocated a new flags2 bit; I don't know if there are unwanted
implications (are the flags public ABI? If so, the getc_unlocked path
could even be inlined).
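
A minimal, self-contained sketch of the "check once at entry" idea;
my_file, MY_NEED_LOCK and the helper names are made up for
illustration and are not glibc's identifiers:

#include <stdio.h>

struct my_file {
  int flags2;                  /* per-FILE flag word */
  const unsigned char *buf;
  size_t pos, len;
};

#define MY_NEED_LOCK 0x1       /* set when locking is actually required */

static int my_getc_unlocked (struct my_file *fp)
{
  return fp->pos < fp->len ? fp->buf[fp->pos++] : EOF;
}

static int my_getc_locked (struct my_file *fp)
{
  /* ... take the FILE lock (atomic ops), read, release the lock ...  */
  return my_getc_unlocked (fp);
}

int my_getc (struct my_file *fp)
{
  /* One branch at entry instead of a check inside flockfile/funlockfile
     on every call.  */
  if (fp->flags2 & MY_NEED_LOCK)
    return my_getc_locked (fp);
  return my_getc_unlocked (fp);
}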

stdio can be compiled in a non-thread-safe mode; I'm not sure what that
does, and I certainly did not test that configuration.

There are a number of _IO* ABI symbols in libc, but they didn't quite
do what I wanted, so I introduced a new symbol that can be called from
libpthread to update FILE objects when a new thread is created. (I
think this should be OK, but again it's not clear to me what the
downsides of a new ABI symbol might be.)
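
Roughly the shape of it (the names here are made up and are not the
symbol the patch actually adds): when libpthread creates the first
additional thread it calls into libc, which walks the list of open
FILE objects and flips them back to locked mode.

struct my_file { int flags2; struct my_file *next; };
#define MY_NEED_LOCK 0x1
static struct my_file *my_file_list;   /* head of the open-FILE list */

void example_io_enable_locks (void)
{
  /* Called once, when the process stops being single-threaded.  */
  for (struct my_file *fp = my_file_list; fp != NULL; fp = fp->next)
    fp->flags2 |= MY_NEED_LOCK;
}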

>> Differences compared to the current x86_64 behaviour:
>> - The optimization is not silently applied to shared locks, in that
>> case the build fails.
>> - Unlock assumes the futex value is 0 or 1, there are no waiters to
>> wake (that would not work in single thread and libc does not use
>> such locks, to be sure lll_cond* is undefed).
>>
>> This speeds up a getchar loop by about 2-4x depending on the CPU,
>> while only causing around a 5-10% regression for the multi-threaded case
> 
> What measurement of what benchmark resulted in that number (the latter
> one)?  Without details of what you are measuring this isn't meaningful.
> 

These are all about getchar in a loop:

for (i=0; i<N; i++) getchar();

and then time ./a.out </dev/zero

This is, I think, idiomatic input-processing code for a number of
command-line tools, and those tools tend to be single-threaded.

The multi-threaded case just creates a dummy thread to disable the
optimization (see the sketch below).
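
Concretely, something like this (iteration count arbitrary; built with
-pthread and run the same way, time ./a.out </dev/zero):

#include <pthread.h>
#include <stdio.h>

static void *dummy (void *arg) { return arg; }

int main (void)
{
  pthread_t t;
  /* Creating (and immediately joining) a thread is enough to switch
     libc off the single-thread fast path.  */
  pthread_create (&t, NULL, dummy, NULL);
  pthread_join (t, NULL);

  for (long i = 0; i < 100000000L; i++)
    if (getchar () == EOF)
      break;
  return 0;
}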

>> (other libc internal locks are not expected to be performance
>> critical or significantly affected by this change).
> 
> Why do you think this is the case?
> 

There is only an extra branch in the lock and unlock code, and I don't
see locks in libc that can be hot enough for that to matter, except for
the stdio and malloc locks. (It does add some code bloat to libc,
though.)
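
For reference, a rough, self-contained sketch of the shape of the
change (not the actual aarch64 lowlevellock code; single_threaded and
the helpers stand in for glibc's internal machinery):

#include <stdatomic.h>
#include <stdbool.h>

static bool single_threaded = true;    /* cleared when a thread is created */

static void example_lll_lock (atomic_int *futex)
{
  if (single_threaded)
    {
      /* Plain store: no other thread can contend, no atomic RMW needed.  */
      atomic_store_explicit (futex, 1, memory_order_relaxed);
      return;
    }
  int expected = 0;
  while (!atomic_compare_exchange_weak (futex, &expected, 1))
    expected = 0;                      /* real code would futex-wait here */
}

static void example_lll_unlock (atomic_int *futex)
{
  if (single_threaded)
    {
      /* Unlock assumes the value is 0 or 1: no waiters to wake.  */
      atomic_store_explicit (futex, 0, memory_order_relaxed);
      return;
    }
  atomic_exchange (futex, 0);          /* real code would futex-wake here */
}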

In stdio, only getc/putc/getchar/putchar and their wide-character
variants are short enough to make the optimization practically
relevant.

The effect on malloc is already much smaller, since it has more
surrounding code beyond the lock/unlock: instead of a 2-4x speed-up you
get 10% or so with a naive free(malloc(1)) in a loop, and with more
complex workloads I'd expect a smaller effect, as they would probably
go through more branches in malloc/free.
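
The naive loop written out (iteration count arbitrary; the volatile
pointer keeps the compiler from eliding the malloc/free pair):

#include <stdlib.h>

int main (void)
{
  for (long i = 0; i < 100000000L; i++)
    {
      void *volatile p = malloc (1);
      free (p);
    }
  return 0;
}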


