This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH][AArch64] Single thread lowlevellock optimization
- From: Torvald Riegel <triegel at redhat dot com>
- To: Szabolcs Nagy <szabolcs dot nagy at arm dot com>
- Cc: nd at arm dot com, GNU C Library <libc-alpha at sourceware dot org>
- Date: Wed, 21 Jun 2017 12:46:52 +0200
- Subject: Re: [PATCH][AArch64] Single thread lowlevellock optimization
- References: <59440699.6080900@arm.com> <1497966465.18410.45.camel@redhat.com> <594939C9.6040308@arm.com> <1497982232.18410.88.camel@redhat.com> <594A3AC7.1000404@arm.com>
On Wed, 2017-06-21 at 10:22 +0100, Szabolcs Nagy wrote:
> On 20/06/17 19:10, Torvald Riegel wrote:
> > On Tue, 2017-06-20 at 16:05 +0100, Szabolcs Nagy wrote:
> >> On 20/06/17 14:47, Torvald Riegel wrote:
> >>> On Fri, 2017-06-16 at 17:26 +0100, Szabolcs Nagy wrote:
> >>>> Differences compared to the current x86_64 behaviour:
> >>>> - The optimization is not silently applied to shared locks, in that
> >>>> case the build fails.
> >>>> - Unlock assumes the futex value is 0 or 1, i.e. there are no
> >>>> waiters to wake (that would not work single-threaded, and libc does
> >>>> not use such locks; to be sure, lll_cond* is undefed).
> >>>>
> >>>> This speeds up a getchar loop by about 2-4x depending on the cpu,
> >>>> while only causing around a 5-10% regression for the multi-threaded case
> >>>
> >>> What measurement of what benchmark resulted in that number (the latter
> >>> one)? Without details of what you are measuring this isn't meaningful.
> >>>
> >>
> >> these are all about getchar in a loop
> >>
> >> for (i=0; i<N; i++) getchar();
> >>
> >> and then time ./a.out </dev/zero
> >>
> >> it is i think idiomatic input processing code for a number
> >> of cmdline tools and those tools tend to be single threaded.
> >
> > Can you measure any CPU time difference for these tools?
> >
>
> gnu dc with some generated input:
>
> $ time taskset -c 1 $NOLOCK/lib64/ld-linux-aarch64.so.1 --library-path $NOLOCK/lib64 ./dc <dcinput
> 0
>
> real 0m21.083s
> user 0m21.040s
> sys 0m0.040s
> $ time taskset -c 1 $ORIG/lib64/ld-linux-aarch64.so.1 --library-path $ORIG/lib64 ./dc <dcinput
> 0
>
> real 0m29.734s
> user 0m29.716s
> sys 0m0.016s
Is dcinput realistic? Is dc that important? Anything else?
And I think we still have no real data on how it affects the non-getchar
case, in particular the multi-threaded case. Those questions go away
once this is an optimization that just affects getchar() ...
> this also affects $customer tool
Is that the real motivation behind your work? Can you roughly describe
what that tool does?
> (most gnu tools have their own silly buffering
> exactly to avoid the slow libc stdio, some tools
> use _unlocked interfaces directly which are less
> portable, so there are plenty of maintenance issues
> caused by leaving this unfixed)
Well, reading one character at a time with something that will likely
remain a function call isn't a great approach in any case, right?
> >> the multi-threaded case is just creating a dummy thread to
> >> disable the optimization.
> >
> > Note that half of the overhead will be in the unlock code, and so is
> > executed during the critical section. That means that you make the
> > sequential parts of a program longer, and that will limit the maximum
> > amount of parallelism you can have.
> >
> > Also, more and more programs will be multi-threaded (though maybe they
> > don't do tight getchar() loops like the one above), so it's not quite
> > clear whether the 5-10% are less important overall or not.
> >
>
> if this optimization is so bad, then remove it
> from x86_64, it affects a lot of users.
And I have indeed thought about doing that, wondering whether it's still
useful. In particular on x86, atomic operations on data in the cache
were much more costly in the past than they are now. But I haven't
looked at this closely with benchmarks and so forth so far.
Anyway, the fact that we might have an issue on x86 shouldn't be a
reason to potentially introduce the issue on aarch64 too.
> >>>> (other libc internal locks are not expected to be performance
> >>>> critical or significantly affected by this change).
> >>>
> >>> Why do you think this is the case?
> >>>
> >>
> >> there is only an extra branch in the lock and unlock
> >> code, i don't see locks in libc that can be hot enough
> >> to make that matter, except for stdio and malloc locks.
> >
> > If it's just a few of the higher-level clients that you think would
> > benefit, this is another reason to optimize there and leave the
> > low-level lock unchanged.
> >
>
> i can simplify the stdio patch a bit so it is only
> applied to getc/putc/.. then malloc interposition
> is not an issue, nor printf hooks.
>
> that should be a safe first step.
>
> >> (it does add some code bloat to libc though)
> >>
> >> in stdio only getc/putc/getchar/putchar +w variants are
> >> short enough to make the optimization practically relevant.
> >>
> >> the effect on malloc is already much smaller since it has
> >> more surrounding code beyond the lock/unlock (instead of
> >> 2-4x speed up you get 10% or so with a naive free(malloc(1))
> >> in a loop, with more complex workloads i'd expect smaller
> >> effect as that would probably go through more branches in
> >> malloc/free)
> >
> > What about multi-threaded malloc?
> >
>
> <= 5% (value depends on cpu)
>
And definitely not just that (eg, allocation pattern and frequency,
number of threads, etc.). Thus, without that as a context, your
statement of less than 5% doesn't tell me anything, sorry.