This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: RFC: x86-64: Use fxsave/xsave/xsavec in _dl_runtime_resolve [BZ #21265]


On Fri, Oct 20, 2017 at 4:16 AM, Markus Trippelsdorf
<markus@trippelsdorf.de> wrote:
> On 2017.10.20 at 04:11 -0700, H.J. Lu wrote:
>> On Fri, Oct 20, 2017 at 12:24 AM, Markus Trippelsdorf
>> <markus@trippelsdorf.de> wrote:
>> > On 2017.10.19 at 15:36 -0700, H.J. Lu wrote:
>> >> On Thu, Oct 19, 2017 at 2:55 PM, Carlos O'Donell <carlos@redhat.com> wrote:
>> >> > On 10/19/2017 10:41 AM, H.J. Lu wrote:
>> >> >> In _dl_runtime_resolve, use fxsave/xsave/xsavec to preserve all vector,
>> >> >> mask and bound registers.  It simplifies _dl_runtime_resolve and supports
>> >> >> different calling conventions.  ld.so code size is reduced by more than
>> >> >> 1 KB.  However, using fxsave/xsave/xsavec takes a few more cycles
>> >> >> than saving and restoring vector and bound registers individually.
>> >> >>
>> >> >> Latency for _dl_runtime_resolve to look up the function, foo, from one
>> >> >> shared library plus libc.so:
>> >> >>
>> >> >>                              Before    After     Change
>> >> >>
>> >> >> Westmere (SSE)/fxsave         345      866       151%
>> >> >> IvyBridge (AVX)/xsave         420      643       53%
>> >> >> Haswell (AVX)/xsave           713      1252      75%
>> >> >> Skylake (AVX+MPX)/xsavec      559      719       28%
>> >> >> Skylake (AVX512+MPX)/xsavec   145      272       87%
>> >> >
>> >> > This is a good baseline, but as you note, the change may not be observable
>> >> > in any real world programs.
>> >> >
>> >> > The case I made to David Kreitzer here:
>> >> > https://sourceware.org/ml/libc-alpha/2017-03/msg00430.html
>> >> > ~~~
>> >> >   ... Alternatively a more detailed performance analysis of
>> >> >   the impact on applications that don't use __regcall is required before adding
>> >> >   instructions to the hot path of the average application (or removing their use
>> >> >   in _dl_runtime_resolve since that penalizes the dynamic loader for all applications
>> >> >   on hardware that supports those vector registers).
>> >> > ~~~
>> >> >
>> >> >> This is the worst case, where the portion of time spent saving and
>> >> >> restoring registers is larger than in the majority of cases.  With the
>> >> >> smaller _dl_runtime_resolve code size, the overall performance impact
>> >> >> is negligible.
>> >> >>
>> >> >> On IvyBridge, differences in build and test time of binutils with lazy
>> >> >> binding GCC and binutils are in the noise.  On Westmere, differences in
>> >> >> bootstrap and "make check" time of GCC 7 with lazy binding GCC and
>> >> >> binutils are also in the noise.
>> >> >
>> >> > Do you have any statistics on the timing for large applications that
>> >> > use a lot of libraries? I don't see gcc, binutils, or glibc as indicative
>> >> > of the complexity of shared libraries in terms of loaded shared libraries.
>> >>
>> >> _dl_runtime_resolve is only called once, when an external function is
>> >> called for the first time.  Having many shared libraries isn't a problem
>> >> unless all execution time is spent in _dl_runtime_resolve.  I don't
>> >> believe this is typical behavior.
>> >>
>> >> > Something like libreoffice's soffice.bin has 142 DSOs, or chrome's
>> >> > 103 DSOs. It might be hard to measure if the lazy resolution is impacting
>> >> > the performance or if you are hitting some other performance boundary, but
>> >> > a black-box test showing performance didn't get *worse* for startup and
>> >> > exit, would mean it isn't the bottleneck (but might be some day). To test
>> >> > this you should be able to use libreoffice's CLI arguments to batch process
>> >> > some files and time that (or the --cat files option).
>> >
>> > I did some testing on my old SSE only machine and everything is in the
>> > noise. For example:
>> >
>> >  ~ % ldd /usr/lib64/libreoffice/program/soffice.bin | wc -l
>> > 105
>> >  ~ % hardening-check /usr/lib64/libreoffice/program/soffice.bin
>> > /usr/lib64/libreoffice/program/soffice.bin:
>> >  Position Independent Executable: no, normal executable!
>> >  Stack protected: no, not found!
>> >  Fortify Source functions: no, not found!
>> >  Read-only relocations: yes
>> >  Immediate binding: no, not found!
>>
>> I have
>>
>> [hjl@gnu-6 tmp]$ readelf -d  /usr/lib64/libreoffice/program/soffice.bin
>>
>> Dynamic section at offset 0xdb8 contains 27 entries:
>>   Tag        Type                         Name/Value
>>  0x0000000000000001 (NEEDED)             Shared library: [libuno_sal.so.3]
>>  0x0000000000000001 (NEEDED)             Shared library: [libsofficeapp.so]
>>  0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
>>  0x000000000000000f (RPATH)              Library rpath: [$ORIGIN]
>>  0x000000000000000c (INIT)               0x710
>>  0x000000000000000d (FINI)               0x904
>>  0x0000000000000019 (INIT_ARRAY)         0x200da0
>>  0x000000000000001b (INIT_ARRAYSZ)       8 (bytes)
>>  0x000000000000001a (FINI_ARRAY)         0x200da8
>>  0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
>>  0x000000006ffffef5 (GNU_HASH)           0x298
>>  0x0000000000000005 (STRTAB)             0x478
>>  0x0000000000000006 (SYMTAB)             0x2e0
>>  0x000000000000000a (STRSZ)              301 (bytes)
>>  0x000000000000000b (SYMENT)             24 (bytes)
>>  0x0000000000000015 (DEBUG)              0x0
>>  0x0000000000000003 (PLTGOT)             0x200fa8
>>  0x0000000000000007 (RELA)               0x608
>>  0x0000000000000008 (RELASZ)             264 (bytes)
>>  0x0000000000000009 (RELAENT)            24 (bytes)
>>  0x0000000000000018 (BIND_NOW)
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^   _dl_runtime_resolve isn't
>> used at all.
>
> Yes. That is why I posted the hardening-check output:
>  "Immediate binding: no, not found!" means that "-z lazy" was used in my
>  case.
>
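
The check discussed above (does the binary bind immediately, in which case
_dl_runtime_resolve is never reached?) can be sketched as a small script
that scans `readelf -d` text.  This is only a sketch: the function name is
made up for illustration, the sample is an abbreviated copy of the listing
quoted above, and treating `(FLAGS) BIND_NOW` and `(FLAGS_1) Flags: NOW` as
equivalent markers is an assumption about how readelf prints those entries.

```python
import re

# Abbreviated sample of `readelf -d` output, copied from the listing quoted above.
SAMPLE = """\
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x0000000000000003 (PLTGOT)             0x200fa8
 0x0000000000000018 (BIND_NOW)
"""


def uses_immediate_binding(readelf_dynamic_output: str) -> bool:
    """Return True if the dynamic section text requests immediate binding.

    Hypothetical helper: it only inspects text, it does not run readelf.
    """
    for line in readelf_dynamic_output.splitlines():
        # The first parenthesised word on a line is the dynamic tag name.
        match = re.search(r"\((\w+)\)\s*(.*)", line)
        if not match:
            continue
        tag, value = match.group(1), match.group(2)
        if tag == "BIND_NOW":
            return True
        # Assumed alternative spellings of the same request.
        if tag == "FLAGS" and "BIND_NOW" in value.split():
            return True
        if tag == "FLAGS_1" and "NOW" in value.replace("Flags:", "").split():
            return True
    return False


if __name__ == "__main__":
    print(uses_immediate_binding(SAMPLE))
```

If this returns False for a binary, lazy binding is in effect and the
timings are relevant to _dl_runtime_resolve.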

Great.  The performance impact of my patch is just noise.  On Ivy Bridge, for
the binutils build and check with GCC, as and ld using lazy binding, I got

Before

191.83user 24.37system 0:51.83elapsed 417%CPU (0avgtext+0avgdata 145800maxresident)k
108.09user 37.37system 1:48.58elapsed 133%CPU (0avgtext+0avgdata 2098644maxresident)k

After

191.68user 24.06system 0:51.94elapsed 415%CPU (0avgtext+0avgdata 145852maxresident)k
107.52user 37.22system 1:45.87elapsed 136%CPU (0avgtext+0avgdata 2098712maxresident)k
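
To put numbers on "just noise", the relative change between the two sets of
timings can be worked out with a short script.  This is a sketch only: the
helper names are made up for illustration, the figures are copied from the
`time` output above, and the two lines of each set are simply called run 1
and run 2, since the output itself does not label them.

```python
import re

# Figures copied from the `time` output above (memory fields omitted).
BEFORE = [
    "191.83user 24.37system 0:51.83elapsed 417%CPU",
    "108.09user 37.37system 1:48.58elapsed 133%CPU",
]
AFTER = [
    "191.68user 24.06system 0:51.94elapsed 415%CPU",
    "107.52user 37.22system 1:45.87elapsed 136%CPU",
]

# Matches "<user>user <system>system <minutes>:<seconds>elapsed".
TIME_RE = re.compile(r"([\d.]+)user ([\d.]+)system (\d+):([\d.]+)elapsed")


def parse_time_line(line: str) -> dict:
    """Extract user, system and elapsed seconds from one `time` line."""
    match = TIME_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognised time output: {line!r}")
    user, system, minutes, seconds = match.groups()
    return {
        "user": float(user),
        "system": float(system),
        "elapsed": int(minutes) * 60 + float(seconds),
    }


def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    for run, (b_line, a_line) in enumerate(zip(BEFORE, AFTER), start=1):
        b, a = parse_time_line(b_line), parse_time_line(a_line)
        for key in ("user", "system", "elapsed"):
            print(f"run {run} {key:8s} {pct_change(b[key], a[key]):+.2f}%")
```

Elapsed time changes by roughly +0.2% for run 1 and -2.5% for run 2, so the
two runs do not even move in the same direction.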

-- 
H.J.

