This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH][AArch64] Single thread lowlevellock optimization

From: Szabolcs Nagy <szabolcs dot nagy at arm dot com>
To: Torvald Riegel <triegel at redhat dot com>
Cc: nd at arm dot com, GNU C Library <libc-alpha at sourceware dot org>
Date: Wed, 21 Jun 2017 10:22:15 +0100
Subject: Re: [PATCH][AArch64] Single thread lowlevellock optimization
References: <59440699.6080900@arm.com> <1497966465.18410.45.camel@redhat.com> <594939C9.6040308@arm.com> <1497982232.18410.88.camel@redhat.com>

On 20/06/17 19:10, Torvald Riegel wrote:
> On Tue, 2017-06-20 at 16:05 +0100, Szabolcs Nagy wrote:
>> On 20/06/17 14:47, Torvald Riegel wrote:
>>> On Fri, 2017-06-16 at 17:26 +0100, Szabolcs Nagy wrote:
>>>> Differences compared to the current x86_64 behaviour:
>>>> - The optimization is not silently applied to shared locks, in that
>>>> case the build fails.
>>>> - Unlock assumes the futex value is 0 or 1, there are no waiters to
>>>> wake (that would not work in single thread and libc does not use
>>>> such locks, to be sure lll_cond* is undefed).
>>>>
>>>> This speeds up a getchar loop about 2-4x depending on the cpu,
>>>> while only causing around a 5-10% regression for the multi-threaded case
>>>
>>> What measurement of what benchmark resulted in that number (the latter
>>> one)?  Without details of what you are measuring this isn't meaningful.
>>>
>>
>> these are all about getchar in a loop
>>
>> for (i=0; i<N; i++) getchar();
>>
>> and then time ./a.out </dev/zero
>>
>> it is i think idiomatic input processing code for a number
>> of cmdline tools and those tools tend to be single threaded.
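
For reference, the loop quoted above can be wrapped in a small helper so it is easy to time. This is only a sketch: the name read_bytes_locked and the FILE * parameter are assumptions added here (the original loop calls getchar() on stdin directly). getc() takes and releases the stream lock on every call, just as getchar() does, so the per-call locking cost being measured is the same.

```c
#include <stdio.h>

/* Hypothetical helper, not from the patch: read up to n bytes one at a
   time with getc(), which locks and unlocks the FILE on every call.
   Returns the number of bytes actually read before EOF. */
long read_bytes_locked(FILE *in, long n)
{
    long got = 0;
    for (long i = 0; i < n; i++) {
        if (getc(in) == EOF)
            break;
        got++;
    }
    return got;
}
```

Calling read_bytes_locked(stdin, N) from main and running time ./a.out </dev/zero reproduces the measurement described above.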
> 
> Can you measure any CPU time difference for these tools?
> 

gnu dc with some generated input:

$ time taskset -c 1 $NOLOCK/lib64/ld-linux-aarch64.so.1 --library-path $NOLOCK/lib64 ./dc <dcinput
0

real	0m21.083s
user	0m21.040s
sys	0m0.040s
$ time taskset -c 1 $ORIG/lib64/ld-linux-aarch64.so.1 --library-path $ORIG/lib64 ./dc <dcinput
0

real	0m29.734s
user	0m29.716s
sys	0m0.016s

(user time drops from 29.716s with $ORIG to 21.040s with $NOLOCK,
i.e. about 1.4x faster)

this also affects $customer tool

(most gnu tools have their own silly buffering
exactly to avoid the slow libc stdio, and some tools
use the _unlocked interfaces directly, which are less
portable, so there are plenty of maintenance issues
caused by leaving this unfixed)

>> the multi-threaded case is just creating a dummy thread to
>> disable the optimization.
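
A minimal sketch of the "dummy thread" setup described above, using plain POSIX threads. The function names are hypothetical. It assumes that creating any thread is enough to switch libc off the single-thread fast path for the rest of the process; if an implementation re-enables the fast path after the thread is joined, the thread would have to be kept alive (for example blocked) instead.

```c
#include <pthread.h>
#include <stddef.h>

/* Thread body that does nothing; it exists only so that the process
   has created a second thread at some point. */
static void *dummy_thread(void *arg)
{
    return arg;
}

/* Hypothetical helper: create and join one do-nothing thread.
   Returns 0 on success, otherwise the pthread error code. */
int spawn_dummy_thread(void)
{
    pthread_t t;
    int err = pthread_create(&t, NULL, dummy_thread, NULL);
    if (err != 0)
        return err;
    return pthread_join(t, NULL);
}
```

Calling spawn_dummy_thread() once before the timed loop gives the "multi-threaded" variant of the benchmark; older toolchains need -pthread when compiling and linking.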
> 
> Note that half of the overhead will be in the unlock code, and so is
> executed during the critical section.  That means that you make the
> sequential parts of a program longer, and that will limit the maximum
> amount of parallelism you can have.
> 
> Also, more and more programs will be multi-threaded (though maybe they
> don't do tight getchar() loops like the one above), so it's not quite
> clear whether the 5-10% are less important overall or not.
> 

if this optimization is so bad, then remove it
from x86_64, it affects a lot of users.

>>>> (other libc internal locks are not expected to be performance
>>>> critical or significantly affected by this change).
>>>
>>> Why do you think this is the case?
>>>
>>
>> there is only an extra branch in the lock and unlock
>> code, i don't see locks in libc that can be hot enough
>> to make that matter, except for stdio and malloc locks.
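
To illustrate what "an extra branch in the lock and unlock code" means, here is a toy lock written with C11 atomics. It is not the glibc lowlevellock code: toy_lock, toy_single_threaded, toy_trylock and toy_unlock are all hypothetical names, and a try-lock is used so the sketch needs no futex wait/wake. As in the patch description, the lock word is only ever 0 or 1 and unlock never wakes waiters.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy lock: 0 = unlocked, 1 = locked. No waiter state is tracked. */
struct toy_lock { atomic_int futex; };

/* Hypothetical process-wide flag; a real libc would clear it when the
   first additional thread is created. */
static bool toy_single_threaded = true;

bool toy_trylock(struct toy_lock *l)
{
    if (toy_single_threaded) {
        /* Fast path: no other thread can race, so a plain load and
           store replace the atomic read-modify-write. */
        if (atomic_load_explicit(&l->futex, memory_order_relaxed) != 0)
            return false;
        atomic_store_explicit(&l->futex, 1, memory_order_relaxed);
        return true;
    }
    /* Slow path: acquire compare-and-swap from 0 to 1. */
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &l->futex, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

void toy_unlock(struct toy_lock *l)
{
    if (toy_single_threaded) {
        atomic_store_explicit(&l->futex, 0, memory_order_relaxed);
        return;
    }
    /* Release store; this toy never has waiters to wake. */
    atomic_store_explicit(&l->futex, 0, memory_order_release);
}
```

The cost being debated is the test of the flag on every lock and unlock: it is nearly free when the flag is true, but it is pure overhead on the multi-threaded path.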
> 
> If it's just a few of the higher-level clients that you think would
> benefit, this is another reason to optimize there and leave the
> low-level lock unchanged.
> 

i can simplify the stdio patch a bit so it is only
applied to getc/putc/..; then malloc interposition
is not an issue, and neither are printf hooks.

that should be a safe first step.

>> (it does add some code bloat to libc though)
>>
>> in stdio only getc/putc/getchar/putchar +w variants are
>> short enough to make the optimization practically relevant.
>>
>> the effect on malloc is already much smaller since it has
>> more surrounding code beyond the lock/unlock (instead of
>> 2-4x speed up you get 10% or so with a naive free(malloc(1))
>> in a loop, with more complex workloads i'd expect smaller
>> effect as that would probably go through more branches in
>> malloc/free)
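
The naive free(malloc(1)) loop mentioned above might be written as follows. The function name and return value are additions for illustration, and the volatile pointer is an assumption about what is needed to keep a compiler from eliding the malloc/free pair; the original mail does not show the code.

```c
#include <stdlib.h>

/* Hypothetical helper: allocate and immediately free one byte, n times.
   Returns how many allocations succeeded. */
long malloc_free_loop(long n)
{
    long ok = 0;
    for (long i = 0; i < n; i++) {
        /* volatile pointer so the allocation is not optimized away */
        void *volatile p = malloc(1);
        if (p != NULL)
            ok++;
        free(p);
    }
    return ok;
}
```

Timing this with and without a previously created thread gives the single-threaded and multi-threaded malloc numbers discussed here.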
> 
> What about multi-threaded malloc?
> 

<= 5% (value depends on cpu)

Follow-Ups:
- Re: [PATCH][AArch64] Single thread lowlevellock optimization
  - From: Torvald Riegel

References:
- [PATCH][AArch64] Single thread lowlevellock optimization
  - From: Szabolcs Nagy
- Re: [PATCH][AArch64] Single thread lowlevellock optimization
  - From: Torvald Riegel
- Re: [PATCH][AArch64] Single thread lowlevellock optimization
  - From: Szabolcs Nagy
- Re: [PATCH][AArch64] Single thread lowlevellock optimization
  - From: Torvald Riegel