This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



RE: RFC: Should x86-64 support arbitrary calling conventions?


Carlos, thank you for taking the time to write up such a comprehensive
response. You have articulated the glibc position clearly, and it is quite
reasonable. I agree that it is up to Intel to make a more convincing
data-driven case to support __regcall and other custom conventions
"out-of-the-box" in the dynamic linker as you suggest here:

>>> If one argues that enabling ICC's __regcall does not slow down (4) in a
>>> statistically significant way, then I would like to see a contribution of
>>> a microbenchmark that tries to show that so we can have some objective
>>> measurable position on the topic.

In the meantime, I appreciate the suggestions you and Florian have made for
how to get __regcall working with the existing tools.

Thanks,
Dave Kreitzer

-----Original Message-----
From: Carlos O'Donell [mailto:carlos@redhat.com] 
Sent: Monday, March 20, 2017 2:30 PM
To: Kreitzer, David L <david.l.kreitzer@intel.com>; H.J. Lu <hjl.tools@gmail.com>; GNU C Library <libc-alpha@sourceware.org>; Joseph S. Myers <joseph@codesourcery.com>; Jeff Law <law@redhat.com>
Cc: Maslov, Sergey V <sergey.v.maslov@intel.com>
Subject: Re: RFC: Should x86-64 support arbitrary calling conventions?

On 03/17/2017 02:03 PM, Kreitzer, David L wrote:
> H.J. is correct. The __regcall calling convention may use up to 16 
> vector registers for passing arguments. And when not used for passing 
> arguments, registers xmm8-xmm15 are callee-save. The convention 
> doesn't pass arguments in mask registers nor treat them as 
> callee-save, but there still might be situations where it would be 
> useful to pass arguments in mask registers for performance reasons.
> 
> Ideally, _dl_runtime_resolve should preserve any registers that it 
> uses, similar to an interrupt handler.
> 
> https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=5ed3cc7b66af4758f7849ed6f65f4365be8223be
> 
> It is not strictly necessary to use xsave/xrstor for this purpose,
> though that is a convenient way to do it. An alternative, if
> xsave/xrstor is deemed too costly, is to avoid using vector registers
> at all within _dl_runtime_resolve.
> 
> Otherwise, we leave significant performance potential on the table in 
> situations where the "one size fits all" calling convention is inefficient.

David,

Thanks for your input and experience on the matter.

Performance spectrum:
---------------------

I absolutely agree that performance is left on the table, and how much depends on the choices made by the developer and by the runtime and developer tooling.

Trade-offs are made at all levels: performance versus debuggability, and the special case versus the general case.

I consider a spectrum of optimizations here that ranges from:

(1) Static linking.

    - No dynamic loader involved (unless using dlopen)
    - Developer can use any regparm or __regcall options they want.
    - There are some natural consequences to not using dynamic loading.

(2) Whole program optimization (in the abstract)

    - Could use special call sequences like those used with -fno-plt to
      make direct calls to functions and bypass the PLT.
    - Likely requires the runtime to be exactly the one used at build time.
    - Depending on the framework you could have inter-module ABI differences, e.g.
      the caller might know a given implementation of a shared library
      routine doesn't clobber certain registers and optimize for that.

(3) Dynamic linking with special options.

    - Use -fno-plt or -Wl,-z,now
    - Degraded developer tooling features because of current lack of support for
      alternate function call ABIs.
    - Inability to use LD_AUDIT audit framework without PLT entries.
    - ELF interposition still preserved.

(4) Dynamic linking

    - Following a published ABI.
    - Intra-module function calls may use non-standard procedure call ABIs:
      - Kernel syscalls are an example of a special call ABI (intra-module)
      - Use of regparm and __regcall for certain functions (intra-module)
      Note: Observable only by a debugger. Not observable by an audit module (LD_AUDIT).
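
To make the intra-module case in (4) concrete, here is a minimal sketch in C. It assumes a compiler that may or may not offer a regcall attribute; the LOCAL_CALL macro and both function names are hypothetical and are not taken from glibc or ICC. The internal helper has no external linkage, so it never gets a PLT entry and the dynamic loader never sees its calling convention, while the exported function keeps the published ABI.

```c
#include <assert.h>

/* Hypothetical macro for this sketch: pick a register-based convention only
   where the compiler advertises one, otherwise use the default ABI. */
#if defined(__clang__) && defined(__x86_64__) && defined(__has_attribute)
# if __has_attribute(regcall)
#  define LOCAL_CALL __attribute__((regcall))
# endif
#endif
#ifndef LOCAL_CALL
# define LOCAL_CALL /* default psABI calling convention */
#endif

/* Internal helper: static, so it is only ever called from inside this
   module and is never resolved through the PLT. */
static LOCAL_CALL double dot3(const double *a, const double *b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

/* Exported entry point: keeps the published ABI, so lazy binding,
   LD_AUDIT, and probe-based tooling see the convention they expect. */
double vec3_length_squared(const double *v)
{
    assert(v != 0);
    return dot3(v, v);
}
```

Calling vec3_length_squared() from another module goes through the standard convention; only the call to dot3() inside the module may use the alternate one.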

You are positioning ICC's __regcall as something which should fit into (4).

I argue it fits into (3) and will not be supported out of the box.

glibc's position:
-----------------

In https://sourceware.org/bugzilla/show_bug.cgi?id=21265#c7 I state the general principles that glibc should follow:

(a) Optimize for the special local case.

    - In the special local case glibc uses internal_function for all non-PLT internal
      function calls and that may include using regparm.

(b) Optimize for the global average case.

    - In the global average case glibc strives to make (3/4) as fast as
      possible while still following ELF. The dynamic loader is responsible for running
      a large number of applications, not all of which are compiled with
      __regcall or other arbitrary calling conventions (like stack alignment at function
      entry).

Again, you are positioning ICC's __regcall as something that fits into (b) without any impact on the global average case.

I argue ICC's __regcall is in (a) and does not warrant changes in the dynamic loader's runtime resolution trampoline.

Benchmarking:
-------------

If one argues that enabling ICC's __regcall does not slow down (4) in a statistically significant way, then I would like to see a contribution of a microbenchmark that tries to show that so we can have some objective measurable position on the topic.
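
As a rough illustration of the kind of measurement meant here, the following C sketch times the first call to a libc function (which may go through lazy resolution, depending on how the binary was linked) against the mean of later calls. It is an assumption-laden sketch, not glibc's benchmark framework: the struct, the function names, and the choice of strtol() are all hypothetical, and a single first-call sample also includes cache and page-fault effects.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Monotonic clock reading in nanoseconds. */
static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Hypothetical result type for this sketch. */
struct call_cost {
    long long first_ns;  /* first call: may include lazy symbol resolution */
    double steady_ns;    /* mean of later calls: symbol already bound */
};

/* Time the first strtol() call, then the mean of `iterations` later calls. */
struct call_cost measure_strtol(long iterations)
{
    volatile long sink = 0;  /* keeps the calls from being optimized away */
    struct call_cost r;
    long long t0;

    assert(iterations > 0);

    t0 = now_ns();
    sink += strtol("42", NULL, 10);
    r.first_ns = now_ns() - t0;

    t0 = now_ns();
    for (long i = 0; i < iterations; i++)
        sink += strtol("42", NULL, 10);
    r.steady_ns = (double)(now_ns() - t0) / (double)iterations;

    (void)sink;
    return r;
}
```

Running this in many fresh processes against two loader builds, and comparing the distributions of first_ns, would give one data point on the resolution cost; steady_ns serves as a control, since already-bound calls should not change.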

The use of -fno-plt (as Florian Weimer is suggesting), non-lazy binding, or LTO (in the future) can make it possible to optimize more of the call ABI.

Florian Weimer noticed that we do use internal_function on __libc_pthread_init@GLIBC_PRIVATE, which means that glibc is inconsistent about (b) above. I could not justify adding more support for alternate calling conventions just to satisfy a GLIBC_PRIVATE requirement. In fact I think that the use of regparm on __libc_pthread_init is a mistake that should be fixed.

Lastly, the use of all of these alternate ABIs can impact the developer's ability to use developer tooling such as systemtap. The developer tooling should be able to expect that external global symbols follow the published ABI for the architecture.

In summary
==========

- Optimizing for the general case in the dynamic loader means that we don't
  support __regcall functions in the PLT with lazy binding. Thus no out-of-the-box
  support for __regcall.

- Application developers have to choose to compile for (3) as above,
  e.g. by using -fno-plt or -Wl,-z,now, to safely use these
  high-performance features at the cost of debugging (no LD_AUDIT support,
  and problems with uprobe-using tooling like systemtap, which expects a given
  ABI).

- The support for regparm on i386 in the dynamic loader trampoline is historical,
  and glibc should remove the one usage in __libc_pthread_init@GLIBC_PRIVATE that
  has external linkage.

- I suggest bug 21265 be RESOLVED as WONTFIX because of the impact on applications
  that don't use __regcall. Alternatively, a more detailed performance analysis of
  the impact on applications that don't use __regcall is required before adding
  instructions to the hot path of the average application (or removing the use of
  vector registers in _dl_runtime_resolve, since that penalizes the dynamic loader
  for all applications on hardware that supports those vector registers).

Comments?

--
Cheers,
Carlos.
