This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [RFC] ifunc suck, use ufunc.
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Szabolcs Nagy <nsz at port70 dot net>
- Cc: libc-alpha at sourceware dot org
- Date: Mon, 25 May 2015 04:36:52 +0200
- Subject: Re: [RFC] ifunc suck, use ufunc.
- Authentication-results: sourceware.org; auth=none
- References: <20150524213858 dot GA18221 at domone> <20150525014323 dot GC26188 at port70 dot net>
On Mon, May 25, 2015 at 03:43:23AM +0200, Szabolcs Nagy wrote:
> * Ondřej Bílka <neleai@seznam.cz> [2015-05-24 23:38:58 +0200]:
> > A main benefit would be interlibrary constant folding. Why waste cycles
> > on reinitializing constant, just save it to ufunc structure. Resolver
> > then could precompute tables to improve speed.
> >
> > As interposing these you would need to interpose resolver.
> >
> > An gcc support is not needed but we could get something with alternate
> > calling convention as passing resolver struct is common and could be
> > preserved for loops with tail calls.
> >
> > A future direction could be replace plt and linker with ufunc, it would
> > require adding function string pointer to structure and calling first
> > generic resolver to select specific resolver.
> >
> > Comments?
> >
>
> this makes memset non-async-signal-safe. (qoi issue)
>
Did I explicitly say that it's an architecture-specific optimization, or
did I forget?
> it is not thread-safe either and would need an acquire
> load barrier on every invocation of memset to fix that
> or the use of thread local storage. (conformance issue)
>
> (in the example only resolve->fn is modified and idempotently,
> this would work in practice but as soon as ->data is accessed
> too the memory ordering guarantees are required.. which can
> be made efficient on some archs but only in asm)
>
> in the example memset is called through the wrong type
> of function pointer: the resolver and resolvee are
> incompatible so this is invalid c, only works in asm.
>
That's why I intended it as architecture-specific. On x64 it will work,
along with the memset prototype. Adding atomics or locking to the
resolver would be unnecessary overhead.
This could be made generic by defining macros that expand to an atomic
read on archs that don't act as a PRAM.
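A minimal sketch of what such macros could look like, using C11 atomics.
The names ufunc_slot, UFUNC_LOAD, UFUNC_STORE and ufunc_memset are
hypothetical and not part of glibc; the libc memset stands in for a real
per-CPU selection. On x86-64, compilers typically emit ordinary moves for
acquire loads and release stores, so the macros should cost nothing
there, while weaker archs get whatever barrier they need.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

typedef void *(*memset_fn) (void *, int, size_t);

/* Hypothetical ufunc slot; fn stays null until the first call.  */
struct ufunc_slot
{
  _Atomic (memset_fn) fn;
};

/* Generic accessors: the compiler picks the per-arch instruction.  */
#define UFUNC_LOAD(p) \
  atomic_load_explicit ((p), memory_order_acquire)
#define UFUNC_STORE(p, v) \
  atomic_store_explicit ((p), (v), memory_order_release)

static void *
ufunc_memset (struct ufunc_slot *slot, void *dst, int c, size_t n)
{
  assert (slot != NULL);
  memset_fn fn = UFUNC_LOAD (&slot->fn);
  if (fn == 0)
    {
      /* Resolution is idempotent, so racing threads store the same
         value.  Assumption: libc memset stands in for the variant a
         real resolver would select.  */
      fn = memset;
      UFUNC_STORE (&slot->fn, fn);
    }
  return fn (dst, c, n);
}
```

A zero-initialized struct ufunc_slot is resolved on the first
ufunc_memset call and reused on later calls.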
> it is not clear to me how many such ufunc structs will be
> in a program for a specific function and how their redundant
> initialization is avoided.
> (one for every call site? every tu? every dso?)
>
For the main objective this is necessary on a per-call-site basis. These
aren't redundant, as the data will be different for different call sites.
For example, in the sequence
memset (x,0,n);
memset (y,1,n);
the data for the first memset would contain 16 zero bytes and the data
for the second 16 bytes of value 1, to save cycles on repeatedly creating
the mask.
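As a rough illustration of the per-call-site idea, here is a
single-threaded sketch in portable C. The struct layout and the names
ufunc_memset, memset_resolver, memset_resolved, site_zero, site_one and
fill_both are assumptions made for this example. It also assumes each
call site always passes the same fill value, and it ignores the memory
ordering concern raised above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-call-site state: target function plus cached mask.  */
struct ufunc_memset
{
  void *(*fn) (struct ufunc_memset *, void *, int, size_t);
  unsigned char pattern[16];
};

/* Fast path: the fill value is already baked into u->pattern.  */
static void *
memset_resolved (struct ufunc_memset *u, void *dst, int c, size_t n)
{
  unsigned char *p = dst;
  (void) c;
  for (; n >= 16; p += 16, n -= 16)
    memcpy (p, u->pattern, 16);
  memcpy (p, u->pattern, n);   /* remaining 0..15 bytes */
  return dst;
}

/* First call from a site: build the mask once, then switch targets.  */
static void *
memset_resolver (struct ufunc_memset *u, void *dst, int c, size_t n)
{
  assert (u != NULL);
  memset (u->pattern, c, sizeof u->pattern);
  u->fn = memset_resolved;
  return u->fn (u, dst, c, n);
}

/* One structure per call site, so the cached data can differ.  */
static struct ufunc_memset site_zero = { memset_resolver, { 0 } };
static struct ufunc_memset site_one = { memset_resolver, { 0 } };

static void
fill_both (void *x, void *y, size_t n)
{
  site_zero.fn (&site_zero, x, 0, n);   /* memset (x,0,n); */
  site_one.fn (&site_one, y, 1, n);     /* memset (y,1,n); */
}
```

After the first fill_both call, both sites point at memset_resolved and
the mask is never rebuilt.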
Then there is a planned x64-specific optimization where I need to change
the prototype further, to pass the data in the xmm0 register along with
the end of the string. Then you could call into different places of a
sequence of unrolled moves like
...
movdqu %xmm0, -64(%rdi)
movdqu %xmm0, -48(%rdi)
movdqu %xmm0, -32(%rdi)
movdqu %xmm0, -16(%rdi)
ret
This addresses the problem that gcc does similar unrolling but only with
8-byte moves, tends to be quite excessive with it, and causes a
performance penalty on cold paths. Also, gcc couldn't do this without
splitting the function into several per-CPU variants, since on some archs
this would be slow without aligning, on others a rep stosq would be
faster, etc., and you must do resolution to determine which case applies.
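The effect of entering the unrolled sequence at different points can be
approximated in portable C with a fall-through switch. This is only a
sketch: store16 and fill_tail are hypothetical names, a 16-byte array
stands in for xmm0, and the length is assumed to be a multiple of 16 of
at most 64 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for "movdqu %xmm0, off(%rdi)": one unaligned 16-byte store.  */
static inline void
store16 (unsigned char *end, ptrdiff_t off, const unsigned char v[16])
{
  memcpy (end + off, v, 16);
}

/* end points one past the last byte to fill; blocks is the number of
   16-byte stores.  Each case label is an entry point into the
   unrolled sequence.  */
static void
fill_tail (unsigned char *end, const unsigned char v[16], unsigned blocks)
{
  assert (blocks <= 4);
  switch (blocks)
    {
    case 4: store16 (end, -64, v); /* fall through */
    case 3: store16 (end, -48, v); /* fall through */
    case 2: store16 (end, -32, v); /* fall through */
    case 1: store16 (end, -16, v); /* fall through */
    case 0: break;
    }
}
```

Calling fill_tail (buf + 80, v, 3) writes the last 48 bytes of an 80-byte
buffer and leaves the first 32 untouched.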
A per-DSO variant would be possible with more bookkeeping (as I don't
know how to convince the compiler to do that): you would need to end
compilation by adding a file with hidden variables with the protected
attribute.
A similar idea would make more sense as a gcc optimization that first
extracts the address from the PLT.