This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: RFC: x86-64: Use fxsave/xsave/xsavec in _dl_runtime_resolve [BZ #21265]


On 2017.10.19 at 15:36 -0700, H.J. Lu wrote:
> On Thu, Oct 19, 2017 at 2:55 PM, Carlos O'Donell <carlos@redhat.com> wrote:
> > On 10/19/2017 10:41 AM, H.J. Lu wrote:
> >> In _dl_runtime_resolve, use fxsave/xsave/xsavec to preserve all vector,
> >> mask and bound registers.  It simplifies _dl_runtime_resolve and supports
> >> different calling conventions.  ld.so code size is reduced by more than
> >> 1 KB.  However, using fxsave/xsave/xsavec takes a few more cycles
> >> than saving and restoring vector and bound registers individually.
> >>
> >> Latency for _dl_runtime_resolve to lookup the function, foo, from one
> >> shared library plus libc.so:
> >>
> >>                              Before    After     Change
> >>
> >> Westmere (SSE)/fxsave         345      866       151%
> >> IvyBridge (AVX)/xsave         420      643       53%
> >> Haswell (AVX)/xsave           713      1252      75%
> >> Skylake (AVX+MPX)/xsavec      559      719       28%
> >> Skylake (AVX512+MPX)/xsavec   145      272       87%
> >
> > This is a good baseline, but as you note, the change may not be observable
> > in any real world programs.
> >
> > The case I made to David Kreitzer here:
> > https://sourceware.org/ml/libc-alpha/2017-03/msg00430.html
> > ~~~
> >   ... Alternatively a more detailed performance analysis of
> >   the impact on applications that don't use __regcall is required before adding
> >   instructions to the hot path of the average application (or removing their use
> >   in _dl_runtime_resolve since that penalizes the dynamic loader for all applications
> >   on hardware that supports those vector registers).
> > ~~~
> >
> >> This is the worst case, where the portion of time spent saving and
> >> restoring registers is bigger than in the majority of cases.  Given the
> >> smaller _dl_runtime_resolve code size, the overall performance impact
> >> is negligible.
> >>
> >> On IvyBridge, differences in build and test time of binutils with
> >> lazy-binding GCC and binutils are in the noise.  On Westmere, differences
> >> in bootstrap and "make check" time of GCC 7 with lazy-binding GCC and
> >> binutils are also in the noise.
> > Do you have any statistics on the timing for large applications that
> > use a lot of libraries? I don't see gcc, binutils, or glibc as indicative
> > of the complexity of shared libraries in terms of loaded shared libraries.
> 
> _dl_runtime_resolve is only called once, when an external function is
> called for the first time.  Having many shared libraries isn't a problem
> unless all execution time is spent in _dl_runtime_resolve, and I don't
> believe that is typical behavior.
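
One way to see this once-per-symbol behavior (a sketch, assuming glibc's
LD_DEBUG facility; /bin/true stands in for any dynamically linked program):

```shell
# Trace run-time symbol bindings under lazy binding (glibc only).
# Each "binding file ..." line on stderr corresponds to one trip
# through _dl_runtime_resolve; a symbol appears once, on first call.
LD_DEBUG=bindings /bin/true 2>&1 | head

# With eager binding, everything is resolved up front during startup
# relocation processing, and _dl_runtime_resolve is never entered.
LD_BIND_NOW=1 LD_DEBUG=bindings /bin/true 2>&1 | head
```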
> 
> > Something like libreoffice's soffice.bin has 142 DSOs, or chrome's
> > 103 DSOs. It might be hard to tell whether lazy resolution is impacting
> > performance or you are hitting some other performance boundary, but a
> > black-box test showing performance didn't get *worse* for startup and
> > exit would mean it isn't the bottleneck (though it might be some day).
> > To test this you should be able to use libreoffice's CLI arguments to
> > batch process some files and time that (or the --cat files option).

I did some testing on my old SSE-only machine and everything is in the
noise. For example:

 ~ % ldd /usr/lib64/libreoffice/program/soffice.bin | wc -l
105
 ~ % hardening-check /usr/lib64/libreoffice/program/soffice.bin
/usr/lib64/libreoffice/program/soffice.bin:
 Position Independent Executable: no, normal executable!
 Stack protected: no, not found!
 Fortify Source functions: no, not found!
 Read-only relocations: yes
 Immediate binding: no, not found!
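
The "Immediate binding: no" line (i.e. lazy binding is in effect) can also
be cross-checked directly with readelf from binutils; a quick sketch
against the same binary:

```shell
# Lazy binding is in effect when neither a BIND_NOW entry nor the NOW
# flag appears in the dynamic section (binary path from the run above).
readelf -d /usr/lib64/libreoffice/program/soffice.bin \
    | grep -E 'BIND_NOW|FLAGS' || echo "no BIND_NOW: lazy binding"
```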

(with H.J.'s patch)
 Performance counter stats for '/var/tmp/glibc-build/elf/ld.so /usr/lib64/libreoffice/program/soffice.bin --convert-to pdf kandide.odt' (4 runs):

       2463.681675      task-clock (msec)         #    1.040 CPUs utilized            ( +-  0.06% )
               414      context-switches          #    0.168 K/sec                    ( +-  8.88% )
                10      cpu-migrations            #    0.004 K/sec                    ( +- 11.98% )
            28,227      page-faults               #    0.011 M/sec                    ( +-  0.04% )
     7,823,762,346      cycles                    #    3.176 GHz                      ( +-  0.15% )  (67.30%)
     1,360,335,356      stalled-cycles-frontend   #   17.39% frontend cycles idle     ( +-  0.51% )  (66.78%)
     2,090,675,875      stalled-cycles-backend    #   26.72% backend cycles idle      ( +-  1.02% )  (66.70%)
     8,984,501,079      instructions              #    1.15  insn per cycle
                                                  #    0.23  stalled cycles per insn  ( +-  0.11% )  (66.96%)
     1,866,843,047      branches                  #  757.745 M/sec                    ( +-  0.28% )  (67.25%)
        73,973,482      branch-misses             #    3.96% of all branches          ( +-  0.15% )  (67.37%)

       2.368775642 seconds time elapsed                                          ( +-  0.21% )

(without)
 Performance counter stats for '/usr/lib64/libreoffice/program/soffice.bin --convert-to pdf kandide.odt' (4 runs):

       2467.698417      task-clock (msec)         #    1.040 CPUs utilized            ( +-  0.23% )
               540      context-switches          #    0.219 K/sec                    ( +- 17.02% )
                12      cpu-migrations            #    0.005 K/sec                    ( +- 14.85% )
            28,245      page-faults               #    0.011 M/sec                    ( +-  0.02% )
     7,806,607,838      cycles                    #    3.164 GHz                      ( +-  0.09% )  (67.06%)
     1,338,588,952      stalled-cycles-frontend   #   17.15% frontend cycles idle     ( +-  0.30% )  (66.99%)
     2,103,802,012      stalled-cycles-backend    #   26.95% backend cycles idle      ( +-  0.77% )  (66.92%)
     9,012,688,271      instructions              #    1.15  insn per cycle
                                                  #    0.23  stalled cycles per insn  ( +-  0.14% )  (67.02%)
     1,870,634,478      branches                  #  758.048 M/sec                    ( +-  0.31% )  (67.19%)
        73,921,605      branch-misses             #    3.95% of all branches          ( +-  0.13% )  (67.08%)

       2.373621006 seconds time elapsed                                          ( +-  0.27% )


Compile times using clang, which was built with shared libs, also don't
change at all.
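
For completeness, the same comparison can be driven from the shell with
eager binding forced on; a sketch (binary path and input file as in the
perf runs above; LD_BIND_NOW is a standard glibc environment variable):

```shell
# Baseline: lazy binding, one _dl_runtime_resolve trip per first call.
perf stat -r 4 /usr/lib64/libreoffice/program/soffice.bin \
    --convert-to pdf kandide.odt

# Eager binding: all symbols resolved at startup, bypassing
# _dl_runtime_resolve entirely; any delta bounds its total cost.
LD_BIND_NOW=1 perf stat -r 4 /usr/lib64/libreoffice/program/soffice.bin \
    --convert-to pdf kandide.odt
```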

-- 
Markus

