This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [RFC] ifunc suck, use ufunc.
- From: Rich Felker <dalias at libc dot org>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: Szabolcs Nagy <nsz at port70 dot net>, libc-alpha at sourceware dot org
- Date: Sun, 24 May 2015 23:15:10 -0400
- Subject: Re: [RFC] ifunc suck, use ufunc.
- Authentication-results: sourceware.org; auth=none
- References: <20150524213858 dot GA18221 at domone> <20150525014323 dot GC26188 at port70 dot net> <20150525023652 dot GB29445 at domone>
On Mon, May 25, 2015 at 04:36:52AM +0200, Ondřej Bílka wrote:
> On Mon, May 25, 2015 at 03:43:23AM +0200, Szabolcs Nagy wrote:
> > * Ondřej Bílka <neleai@seznam.cz> [2015-05-24 23:38:58 +0200]:
> > > A main benefit would be interlibrary constant folding. Why waste cycles
> > > reinitializing a constant? Just save it in the ufunc structure. The
> > > resolver could then precompute tables to improve speed.
> > >
> > > As for interposing these, you would need to interpose the resolver.
> > >
> > > GCC support is not needed, but we could get something with an alternate
> > > calling convention, as passing the resolver struct is common and it could
> > > be preserved across loops with tail calls.
> > >
> > > A future direction could be to replace the PLT and linker with ufunc; it
> > > would require adding a function-name string pointer to the structure and
> > > first calling a generic resolver to select the specific resolver.
> > >
> > > Comments?
> > >
> >
> > this makes memset non-async-signal-safe. (qoi issue)
> >
> Did I explicitly say that it's an architecture-specific optimization, or
> did I forget?
AS-safety is broken regardless of arch. Only the barrier stuff is
arch-specific.
> > it is not thread-safe either and would need an acquire
> > load barrier on every invocation of memset to fix that
> > or the use of thread local storage. (conformance issue)
> >
> > (in the example only resolve->fn is modified and idempotently,
> > this would work in practice but as soon as ->data is accessed
> > too the memory ordering guarantees are required.. which can
> > be made efficient on some archs but only in asm)
> >
> > in the example memset is called through the wrong type
> > of function pointer: the resolver and resolvee are
> > incompatible so this is invalid c, only works in asm.
> >
> That's why I intended it as architecture-specific. On x64 it will work
> along with the memset prototype. Adding atomics/locking in the resolver
> would be unnecessary overhead.
>
> Could make this generic by defining macros that expand to an atomic read on
> archs that don't act as PRAM.
Do you realize the relative cost of an atomic read (barrier) versus a
small memset? This is like driving an extra mile to a cheaper gas
station to save $0.01 per gallon...
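
For concreteness, the ordering requirement Szabolcs describes can be
sketched in portable C11 roughly as follows. This is only an illustration
under assumptions: the thread names nothing beyond resolve->fn and ->data,
so the struct layout, the `claimed` flag, and every function name here are
hypothetical. The resolver and the implementation share one function type
so that the call is valid C. The acquire load in ufunc_call is the
per-invocation cost in question: it is a plain load on x86, but may need a
barrier on weakly ordered archs.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout; only "fn" and "data" are named in the thread. */
struct ufunc;
typedef void *(*ufunc_fn)(struct ufunc *, void *, int, size_t);

struct ufunc {
    _Atomic(ufunc_fn) fn;    /* starts out pointing at the resolver */
    atomic_int claimed;      /* hypothetical: set by the one initializing thread */
    unsigned char data[16];  /* fill pattern precomputed by the resolver */
};

/* Resolved implementation: reads u->data, so it must only run after the
   release store in the resolver has been observed. Assumes the fill byte
   is constant for this ufunc, hence c is ignored. */
static void *ufunc_impl(struct ufunc *u, void *s, int c, size_t n)
{
    unsigned char *p = s;
    size_t i = 0;
    (void)c;
    for (; i + sizeof u->data <= n; i += sizeof u->data)
        memcpy(p + i, u->data, sizeof u->data);
    memcpy(p + i, u->data, n - i);
    return s;
}

/* Resolver: exactly one caller fills data, then publishes fn with release
   ordering. Callers that lose the race never touch data. */
static void *ufunc_resolver(struct ufunc *u, void *s, int c, size_t n)
{
    int expected = 0;
    if (atomic_compare_exchange_strong(&u->claimed, &expected, 1)) {
        memset(u->data, c, sizeof u->data);
        atomic_store_explicit(&u->fn, ufunc_impl, memory_order_release);
    }
    return memset(s, c, n);  /* this call does not rely on u->data */
}

/* Every invocation pays for the acquire load. */
static void *ufunc_call(struct ufunc *u, void *s, int c, size_t n)
{
    ufunc_fn f = atomic_load_explicit(&u->fn, memory_order_acquire);
    return f(u, s, c, n);
}

/* Returns 0 if both the resolving call and the resolved call fill buf. */
static int ufunc_demo(void)
{
    struct ufunc u;
    unsigned char buf[40];
    atomic_init(&u.fn, ufunc_resolver);
    atomic_init(&u.claimed, 0);
    ufunc_call(&u, buf, 7, sizeof buf);   /* goes through the resolver */
    memset(buf, 0, sizeof buf);
    ufunc_call(&u, buf, 7, sizeof buf);   /* goes through ufunc_impl */
    for (size_t i = 0; i < sizeof buf; i++)
        if (buf[i] != 7)
            return 1;
    return 0;
}
```

With relaxed ordering instead of release/acquire, a thread could see the
new fn but stale data, which is the case that needs arch-specific asm to
make cheap.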
> > it is not clear to me how many such ufunc structs will be
> > in a program for a specific function and how their redundant
> > initialization is avoided.
> > (one for every call site? every tu? every dso?)
> >
> The main objective makes this necessary on a call-site basis. These aren't
> redundant, as the data will be different for different call sites.
> For example in sequence
>
> memset (x,0,n);
> memset (y,1,n);
>
> In the first memset the data would contain 16 zero bytes; the second would
> have 16 bytes of value 1, to save cycles on repeatedly creating the mask.
>
> Then there is a planned x64-specific optimization where I need to change
> the prototype further to pass the data in the xmm0 register along with the
> end of the string. Then you could call into different places in the
> unrolled moves, like
>
> ....
> movdqu %xmm0, -64(%rdi)
> movdqu %xmm0, -48(%rdi)
> movdqu %xmm0, -32(%rdi)
> movdqu %xmm0, -16(%rdi)
> ret
>
> It fixes the problem that gcc does similar unrolling but only with 8-byte
> moves, tends to be quite excessive with it, and causes a performance
> penalty on cold paths. Also, gcc couldn't do this without splitting the
> function into several per-CPU variants, as on some archs this would be slow
> without aligning, on others a rep stosq would be faster, etc., and you must
> do resolution to determine which case applies.
>
> A per-DSO scheme would be possible with more bookkeeping (as I don't know
> how to convince the compiler to do that); you would need to end
> compilation by adding a file with hidden variables with the protected
> attribute.
>
> A similar idea would make more sense as a gcc optimization that first
> extracts the address from the PLT.
This is utterly hideous...
Rich
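
For reference, the per-call-site scheme quoted above (one struct per call
site, holding a 16-byte fill mask) amounts to roughly the following in
plain C. This is a single-threaded sketch under assumptions, not the
proposed implementation: the names memset_site, memset_site_fill, and
MEMSET_AT_SITE are hypothetical, the fill byte is assumed constant at each
call site, and none of the thread-safety or AS-safety concerns raised in
this thread are addressed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-call-site state: one instance per textual call. */
struct memset_site {
    int ready;               /* nonzero once mask has been computed */
    unsigned char mask[16];  /* 16 copies of the fill byte */
};

/* Fill n bytes at s from the cached mask, computing it on first use.
   Assumes c never changes for a given site. Not thread-safe. */
static void *memset_site_fill(struct memset_site *site, void *s, int c,
                              size_t n)
{
    unsigned char *p = s;
    size_t i = 0;
    if (!site->ready) {
        memset(site->mask, c, sizeof site->mask);
        site->ready = 1;
    }
    for (; i + sizeof site->mask <= n; i += sizeof site->mask)
        memcpy(p + i, site->mask, sizeof site->mask);
    memcpy(p + i, site->mask, n - i);
    return s;
}

/* Each expansion gets its own static struct, so the two calls in
   site_demo below keep separate masks. */
#define MEMSET_AT_SITE(s, c, n)                        \
    do {                                               \
        static struct memset_site site_;               \
        memset_site_fill(&site_, (s), (c), (n));       \
    } while (0)

/* Mirrors the memset(x,0,n); memset(y,1,n); example. Returns 0 on success. */
static int site_demo(void)
{
    unsigned char x[40], y[40];
    MEMSET_AT_SITE(x, 0, sizeof x);
    MEMSET_AT_SITE(y, 1, sizeof y);
    for (size_t i = 0; i < sizeof x; i++)
        if (x[i] != 0 || y[i] != 1)
            return 1;
    return 0;
}
```

Note the cost side of the trade: every call site carries its own static
state and a first-call branch, which bears on the question above of how
many such structs a program ends up with.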