This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: RFC: x86-64: Use fxsave/xsave/xsavec in _dl_runtime_resolve [BZ #21265]
On 2017.10.19 at 15:36 -0700, H.J. Lu wrote:
> On Thu, Oct 19, 2017 at 2:55 PM, Carlos O'Donell <carlos@redhat.com> wrote:
> > On 10/19/2017 10:41 AM, H.J. Lu wrote:
> >> In _dl_runtime_resolve, use fxsave/xsave/xsavec to preserve all vector,
> >> mask and bound registers. This simplifies _dl_runtime_resolve and supports
> >> different calling conventions. ld.so code size is reduced by more than
> >> 1 KB. However, using fxsave/xsave/xsavec takes slightly more cycles
> >> than saving and restoring vector and bound registers individually.
> >>
> >> Latency for _dl_runtime_resolve to lookup the function, foo, from one
> >> shared library plus libc.so:
> >>
> >>                                Before    After    Change
> >>
> >> Westmere (SSE)/fxsave             345      866      151%
> >> IvyBridge (AVX)/xsave             420      643       53%
> >> Haswell (AVX)/xsave               713     1252       75%
> >> Skylake (AVX+MPX)/xsavec          559      719       28%
> >> Skylake (AVX512+MPX)/xsavec       145      272       87%
> >
> > This is a good baseline, but as you note, the change may not be observable
> > in any real world programs.
> >
> > The case I made to David Kreitzer here:
> > https://sourceware.org/ml/libc-alpha/2017-03/msg00430.html
> > ~~~
> > ... Alternatively a more detailed performance analysis of
> > the impact on applications that don't use __regcall is required before adding
> > instructions to the hot path of the average application (or removing their use
> > in _dl_runtime_resolve since that penalizes the dynamic loader for all applications
> > on hardware that supports those vector registers).
> > ~~~
> >
> >> This is the worst case, where the portion of time spent saving and
> >> restoring registers is larger than in the majority of cases. With the
> >> smaller _dl_runtime_resolve code size, the overall performance impact
> >> is negligible.
> >>
> >> On IvyBridge, differences in build and test time of binutils with
> >> lazy-binding GCC and binutils are in the noise. On Westmere, differences
> >> in bootstrap and "make check" time of GCC 7 with lazy-binding GCC and
> >> binutils are also in the noise.
> > Do you have any statistics on the timing for large applications that
> > use a lot of libraries? I don't see gcc, binutils, or glibc as indicative
> > of the complexity of shared libraries in terms of loaded shared libraries.
>
> _dl_runtime_resolve is only called once per symbol, the first time an
> external function is called. Many shared libraries aren't a problem
> unless all execution time is spent in _dl_runtime_resolve, and I don't
> believe that is typical behavior.
>
> > Something like libreoffice's soffice.bin has 142 DSOs, and chrome has
> > 103 DSOs. It might be hard to measure whether lazy resolution is impacting
> > performance or whether you are hitting some other performance boundary, but
> > a black-box test showing performance didn't get *worse* for startup and
> > exit would mean it isn't the bottleneck (though it might be some day). To
> > test this you should be able to use libreoffice's CLI arguments to batch
> > process some files and time that (or the --cat files option).
I did some testing on my old SSE only machine and everything is in the
noise. For example:
~ % ldd /usr/lib64/libreoffice/program/soffice.bin | wc -l
105
~ % hardening-check /usr/lib64/libreoffice/program/soffice.bin
/usr/lib64/libreoffice/program/soffice.bin:
Position Independent Executable: no, normal executable!
Stack protected: no, not found!
Fortify Source functions: no, not found!
Read-only relocations: yes
Immediate binding: no, not found!
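(Note the "Immediate binding: no" above: soffice.bin really is lazily bound, so _dl_runtime_resolve is on its startup path. One way to isolate the lazy-resolution cost in this kind of black-box test is to rerun the same workload with LD_BIND_NOW=1, which makes ld.so resolve all PLT relocations at startup and bypass _dl_runtime_resolve entirely. The file name is just the one from my runs below; any batch job works.)

```shell
# Baseline: lazy binding, _dl_runtime_resolve on each first call.
perf stat -r 4 /usr/lib64/libreoffice/program/soffice.bin \
    --convert-to pdf kandide.odt

# Eager binding: all PLT relocations done at startup, so any
# remaining difference cannot come from _dl_runtime_resolve.
LD_BIND_NOW=1 perf stat -r 4 /usr/lib64/libreoffice/program/soffice.bin \
    --convert-to pdf kandide.odt
```

If the two runs are within noise of each other, lazy resolution is not the bottleneck for this workload.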
(with H.J.'s patch)
Performance counter stats for '/var/tmp/glibc-build/elf/ld.so /usr/lib64/libreoffice/program/soffice.bin --convert-to pdf kandide.odt' (4 runs):
2463.681675 task-clock (msec) # 1.040 CPUs utilized ( +- 0.06% )
414 context-switches # 0.168 K/sec ( +- 8.88% )
10 cpu-migrations # 0.004 K/sec ( +- 11.98% )
28,227 page-faults # 0.011 M/sec ( +- 0.04% )
7,823,762,346 cycles # 3.176 GHz ( +- 0.15% ) (67.30%)
1,360,335,356 stalled-cycles-frontend # 17.39% frontend cycles idle ( +- 0.51% ) (66.78%)
2,090,675,875 stalled-cycles-backend # 26.72% backend cycles idle ( +- 1.02% ) (66.70%)
8,984,501,079 instructions # 1.15 insn per cycle
# 0.23 stalled cycles per insn ( +- 0.11% ) (66.96%)
1,866,843,047 branches # 757.745 M/sec ( +- 0.28% ) (67.25%)
73,973,482 branch-misses # 3.96% of all branches ( +- 0.15% ) (67.37%)
2.368775642 seconds time elapsed ( +- 0.21% )
(without)
Performance counter stats for '/usr/lib64/libreoffice/program/soffice.bin --convert-to pdf kandide.odt' (4 runs):
2467.698417 task-clock (msec) # 1.040 CPUs utilized ( +- 0.23% )
540 context-switches # 0.219 K/sec ( +- 17.02% )
12 cpu-migrations # 0.005 K/sec ( +- 14.85% )
28,245 page-faults # 0.011 M/sec ( +- 0.02% )
7,806,607,838 cycles # 3.164 GHz ( +- 0.09% ) (67.06%)
1,338,588,952 stalled-cycles-frontend # 17.15% frontend cycles idle ( +- 0.30% ) (66.99%)
2,103,802,012 stalled-cycles-backend # 26.95% backend cycles idle ( +- 0.77% ) (66.92%)
9,012,688,271 instructions # 1.15 insn per cycle
# 0.23 stalled cycles per insn ( +- 0.14% ) (67.02%)
1,870,634,478 branches # 758.048 M/sec ( +- 0.31% ) (67.19%)
73,921,605 branch-misses # 3.95% of all branches ( +- 0.13% ) (67.08%)
2.373621006 seconds time elapsed ( +- 0.27% )
Compile times using clang, which was built with shared libraries, also
don't change at all.
--
Markus