This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: RFC: x86-64: Use fxsave/xsave/xsavec in _dl_runtime_resolve [BZ #21265]
On 2017.10.20 at 04:11 -0700, H.J. Lu wrote:
> On Fri, Oct 20, 2017 at 12:24 AM, Markus Trippelsdorf
> <markus@trippelsdorf.de> wrote:
> > On 2017.10.19 at 15:36 -0700, H.J. Lu wrote:
> >> On Thu, Oct 19, 2017 at 2:55 PM, Carlos O'Donell <carlos@redhat.com> wrote:
> >> > On 10/19/2017 10:41 AM, H.J. Lu wrote:
> >> >> In _dl_runtime_resolve, use fxsave/xsave/xsavec to preserve all vector,
> >> >> mask and bound registers. It simplifies _dl_runtime_resolve and supports
> >> >> different calling conventions. ld.so code size is reduced by more than
> >> >> 1 KB. However, using fxsave/xsave/xsavec takes slightly more cycles
> >> >> than saving and restoring vector and bound registers individually.
> >> >>
> >> >> Latency for _dl_runtime_resolve to look up the function, foo, from one
> >> >> shared library plus libc.so:
> >> >>
> >> >>                              Before  After  Change
> >> >>
> >> >> Westmere (SSE)/fxsave           345    866    151%
> >> >> IvyBridge (AVX)/xsave           420    643     53%
> >> >> Haswell (AVX)/xsave             713   1252     75%
> >> >> Skylake (AVX+MPX)/xsavec        559    719     28%
> >> >> Skylake (AVX512+MPX)/xsavec     145    272     87%
> >> >
> >> > This is a good baseline, but as you note, the change may not be observable
> >> > in any real world programs.
> >> >
> >> > The case I made to David Kreitzer here:
> >> > https://sourceware.org/ml/libc-alpha/2017-03/msg00430.html
> >> > ~~~
> >> > ... Alternatively a more detailed performance analysis of
> >> > the impact on applications that don't use __regcall is required before adding
> >> > instructions to the hot path of the average application (or removing their use
> >> > in _dl_runtime_resolve since that penalizes the dynamic loader for all applications
> >> > on hardware that supports those vector registers).
> >> > ~~~
> >> >
> >> >> This is the worst case, where the portion of time spent saving and
> >> >> restoring registers is larger than in the majority of cases. With the
> >> >> smaller _dl_runtime_resolve code size, the overall performance impact
> >> >> is negligible.
> >> >>
> >> >> On IvyBridge, differences in build and test time of binutils with lazy
> >> >> binding GCC and binutils are in the noise. On Westmere, differences in
> >> >> bootstrap and "make check" time of GCC 7 with lazy binding GCC and
> >> >> binutils are also in the noise.
> >> > Do you have any statistics on the timing for large applications that
> >> > use a lot of libraries? I don't see gcc, binutils, or glibc as indicative
> >> > of the complexity of shared libraries in terms of loaded shared libraries.
> >>
> >> _dl_runtime_resolve is only called once, when an external function is
> >> called for the first time. Having many shared libraries isn't a problem
> >> unless all execution time is spent in _dl_runtime_resolve. I don't
> >> believe this is typical behavior.
> >>
> >> > Something like libreoffice's soffice.bin has 142 DSOs, or chrome's
> >> > 103 DSOs. It might be hard to measure if the lazy resolution is impacting
> >> > the performance or if you are hitting some other performance boundary, but
> >> > a black-box test showing performance didn't get *worse* for startup and
> >> > exit, would mean it isn't the bottleneck (but might be some day). To test
> >> > this you should be able to use libreoffice's CLI arguments to batch process
> >> > some files and time that (or the --cat files option).
> >
> > I did some testing on my old SSE-only machine and everything is in the
> > noise. For example:
> >
> > ~ % ldd /usr/lib64/libreoffice/program/soffice.bin | wc -l
> > 105
> > ~ % hardening-check /usr/lib64/libreoffice/program/soffice.bin
> > /usr/lib64/libreoffice/program/soffice.bin:
> > Position Independent Executable: no, normal executable!
> > Stack protected: no, not found!
> > Fortify Source functions: no, not found!
> > Read-only relocations: yes
> > Immediate binding: no, not found!
>
> I have
>
> [hjl@gnu-6 tmp]$ readelf -d /usr/lib64/libreoffice/program/soffice.bin
>
> Dynamic section at offset 0xdb8 contains 27 entries:
> Tag Type Name/Value
> 0x0000000000000001 (NEEDED) Shared library: [libuno_sal.so.3]
> 0x0000000000000001 (NEEDED) Shared library: [libsofficeapp.so]
> 0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
> 0x000000000000000f (RPATH) Library rpath: [$ORIGIN]
> 0x000000000000000c (INIT) 0x710
> 0x000000000000000d (FINI) 0x904
> 0x0000000000000019 (INIT_ARRAY) 0x200da0
> 0x000000000000001b (INIT_ARRAYSZ) 8 (bytes)
> 0x000000000000001a (FINI_ARRAY) 0x200da8
> 0x000000000000001c (FINI_ARRAYSZ) 8 (bytes)
> 0x000000006ffffef5 (GNU_HASH) 0x298
> 0x0000000000000005 (STRTAB) 0x478
> 0x0000000000000006 (SYMTAB) 0x2e0
> 0x000000000000000a (STRSZ) 301 (bytes)
> 0x000000000000000b (SYMENT) 24 (bytes)
> 0x0000000000000015 (DEBUG) 0x0
> 0x0000000000000003 (PLTGOT) 0x200fa8
> 0x0000000000000007 (RELA) 0x608
> 0x0000000000000008 (RELASZ) 264 (bytes)
> 0x0000000000000009 (RELAENT) 24 (bytes)
> 0x0000000000000018 (BIND_NOW)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ _dl_runtime_resolve isn't
> used at all.
Yes. That is why I posted the hardening-check output:
"Immediate binding: no, not found!" means that "-z lazy" was used in my
case.
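
For anyone who wants to check their own binary, the same lazy-vs-immediate
question can be answered from "readelf -d" output. Below is a rough sketch,
not a definitive tool: the function name uses_immediate_binding is made up
for illustration, and it assumes readelf prints the DT_BIND_NOW tag as
"(BIND_NOW)", the DF_BIND_NOW flag as "(FLAGS) BIND_NOW", and the DF_1_NOW
flag as "(FLAGS_1) Flags: NOW". If none of these appear, the binary was
most likely linked with lazy binding, so _dl_runtime_resolve is used.

```python
import re


def uses_immediate_binding(readelf_dynamic: str) -> bool:
    """Return True if `readelf -d` text shows immediate (non-lazy) binding.

    Hypothetical helper for illustration; it only inspects text that the
    caller has already captured from `readelf -d <binary>`.
    """
    for line in readelf_dynamic.splitlines():
        # Look for the three dynamic-section tags that can request BIND_NOW.
        m = re.search(r"\((BIND_NOW|FLAGS|FLAGS_1)\)\s*(.*)", line)
        if not m:
            continue
        tag, value = m.group(1), m.group(2)
        if tag == "BIND_NOW":
            return True
        # Drop the "Flags:" label readelf prints for FLAGS_1 entries.
        words = value.replace("Flags:", "").split()
        if tag == "FLAGS" and "BIND_NOW" in words:
            return True
        if tag == "FLAGS_1" and "NOW" in words:
            return True
    return False
```

Feeding it the readelf output quoted above (which contains a BIND_NOW entry)
should report immediate binding, while a binary linked with "-z lazy" should
not.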
--
Markus