This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [RFC] ifunc suck, use ufunc.


On Mon, May 25, 2015 at 03:43:23AM +0200, Szabolcs Nagy wrote:
> * Ondřej Bílka <neleai@seznam.cz> [2015-05-24 23:38:58 +0200]:
> > A main benefit would be interlibrary constant folding. Why waste cycles
> > on reinitializing a constant? Just save it to the ufunc structure. The
> > resolver could then precompute tables to improve speed.
> > 
> > As for interposing these, you would need to interpose the resolver.
> > 
> > GCC support is not needed, but we could get something with an alternate
> > calling convention, as passing the resolver struct is common and could be
> > preserved for loops with tail calls.
> > 
> > A future direction could be to replace the plt and linker with ufunc; it
> > would require adding a function string pointer to the structure and first
> > calling a generic resolver to select the specific resolver.
> > 
> > Comments?
> > 
> 
> this makes memset non-async-signal-safe. (qoi issue)
>
Did I explicitly say that it's an architecture-specific optimization, or did
I forget?
 
> it is not thread-safe either and would need an acquire
> load barrier on every invocation of memset to fix that
> or the use of thread local storage. (conformance issue)
> 
> (in the example only resolve->fn is modified and idempotently,
> this would work in practice but as soon as ->data is accessed
> too the memory ordering guarantees are required.. which can
> be made efficient on some archs but only in asm)
> 
> in the example memset is called through the wrong type
> of function pointer: the resolver and resolvee are
> incompatible so this is invalid c, only works in asm.
> 
That's why I intended it as architecture-specific. On x64 it will work
with the memset prototype. Adding atomics/locking in the resolver would be
unnecessary overhead.

This could be made generic by defining macros that expand to an atomic read
on archs that don't act as a PRAM.
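
A rough sketch of what such macros might look like in C11, reading
"PRAM" as a strongly ordered memory model (that reading is an
assumption). All names here (ufunc_slot, UFUNC_LOAD_FN, UFUNC_STORE_FN,
UFUNC_ARCH_STRONGLY_ORDERED) are made up for illustration; the last one
is a switch a port would define by hand. Note that the relaxed variant
leans on an architecture guarantee that ISO C itself does not give for
the ->data accesses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef size_t (*impl_fn) (const char *);

/* Hypothetical slot holding the currently selected implementation.  */
struct ufunc_slot
{
  _Atomic (impl_fn) fn;
};

/* Hypothetical port switch: a port that trusts its hardware ordering
   defines UFUNC_ARCH_STRONGLY_ORDERED and gets a relaxed load; every
   other port gets an acquire load.  */
#ifdef UFUNC_ARCH_STRONGLY_ORDERED
# define UFUNC_LOAD_ORDER memory_order_relaxed
#else
# define UFUNC_LOAD_ORDER memory_order_acquire
#endif

#define UFUNC_LOAD_FN(slot) \
  atomic_load_explicit (&(slot)->fn, UFUNC_LOAD_ORDER)
#define UFUNC_STORE_FN(slot, f) \
  atomic_store_explicit (&(slot)->fn, (f), memory_order_release)

static size_t resolve_then_count (const char *s);

/* The slot starts out pointing at the resolver.  */
static struct ufunc_slot strlen_slot = { resolve_then_count };

/* Stand-in for a tuned implementation.  */
static size_t
count_bytes (const char *s)
{
  size_t n = 0;
  while (s[n] != '\0')
    n++;
  return n;
}

/* Resolver: publishes the chosen implementation, then does the work.
   The store is idempotent, so racing resolvers write the same value.  */
static size_t
resolve_then_count (const char *s)
{
  UFUNC_STORE_FN (&strlen_slot, count_bytes);
  return count_bytes (s);
}

/* Call-site wrapper: one load through the macro, then an indirect call.  */
static size_t
ufunc_strlen (const char *s)
{
  impl_fn f = UFUNC_LOAD_FN (&strlen_slot);
  assert (f != NULL);
  return f (s);
}
```

With the switch undefined every port pays for the acquire load; defining
it drops the ordering request on archs where the port maintainer judges
it unnecessary.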


> it is not clear to me how many such ufunc structs will be
> in a program for a specific function and how their redundant
> initialization is avoided.
> (one for every call site? every tu? every dso?)
> 
The main objective requires this on a per-call-site basis. These aren't
redundant, as the data will be different for different call sites.
For example in sequence

memset (x,0,n);
memset (y,1,n);

In the first memset the data would contain 16 zero bytes; the second would
have 16 bytes of value 1, to save cycles on repeatedly creating the mask.
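
To make the two call sites above concrete, here is a minimal sketch in
portable C. It is not the x64 scheme itself: resolver and implementation
share one prototype here so that the indirect call is valid C, and it
ignores the thread-safety and signal-safety concerns raised above. The
names struct ufunc, ufunc_memset_resolve, ufunc_memset_fast,
UFUNC_MEMSET_INIT and fill_both are all invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct ufunc;
typedef void *(*ufunc_fn) (struct ufunc *, void *, size_t);

/* Hypothetical per-call-site state: current entry point plus the
   precomputed 16-byte fill pattern.  */
struct ufunc
{
  ufunc_fn fn;
  unsigned char fill;      /* byte this call site always stores */
  unsigned char data[16];  /* fill byte replicated 16 times */
};

/* Fast path: the pattern already exists, so copy it out in 16-byte
   blocks and finish with a partial block.  */
static void *
ufunc_memset_fast (struct ufunc *u, void *dst, size_t n)
{
  unsigned char *p = dst;
  assert (u->fn == ufunc_memset_fast);  /* resolver must have run */
  for (; n >= 16; p += 16, n -= 16)
    memcpy (p, u->data, 16);
  memcpy (p, u->data, n);
  return dst;
}

/* Resolver: runs on the first call from a site, builds the pattern
   once, then replaces itself with the fast path.  */
static void *
ufunc_memset_resolve (struct ufunc *u, void *dst, size_t n)
{
  memset (u->data, u->fill, sizeof u->data);
  u->fn = ufunc_memset_fast;
  return u->fn (u, dst, n);
}

#define UFUNC_MEMSET_INIT(c) \
  { ufunc_memset_resolve, (unsigned char) (c), { 0 } }

/* The two call sites from the sequence above, each with its own
   structure and therefore its own precomputed data.  */
static void
fill_both (unsigned char *x, unsigned char *y, size_t n)
{
  static struct ufunc site_x = UFUNC_MEMSET_INIT (0);
  static struct ufunc site_y = UFUNC_MEMSET_INIT (1);
  site_x.fn (&site_x, x, n);
  site_y.fn (&site_y, y, n);
}
```

After the first pass through fill_both, both sites call the fast path
directly and the masks are never rebuilt.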

Then there is a planned x64-specific optimization where I need to change the
prototype further to pass the data in the xmm0 register along with the end of
the string. Then you could call into different places of the unrolled moves,
like

...
movdqu %xmm0, -64(%rdi)
movdqu %xmm0, -48(%rdi)
movdqu %xmm0, -32(%rdi)
movdqu %xmm0, -16(%rdi)
ret

This addresses the problem that gcc does similar unrolling but only with
8-byte moves, tends to be quite excessive with it, and so causes a
performance penalty on cold paths. Also, gcc couldn't do this without
splitting the function into several per-cpu variants, as on some archs this
would be slow without aligning, on others a rep stosq would be faster, etc.,
and you must do resolution to determine which case applies.
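
The idea of entering the unrolled moves at different places can be
imitated in plain C with a fall-through switch. This is only an
illustration: the hypothetical store_tail takes the end pointer (the
role of %rdi above) and a 16-byte mask (the role of %xmm0), and memcpy
stands in for movdqu; whether a compiler turns it into those
instructions is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Store BLOCKS (0 to 4) copies of the 16-byte MASK so that the last
   one ends at END.  Each case label corresponds to one entry point
   into the unrolled sequence.  */
static void
store_tail (unsigned char *end, const unsigned char mask[16],
            unsigned blocks)
{
  assert (blocks <= 4);
  switch (blocks)
    {
    case 4: memcpy (end - 64, mask, 16); /* fall through */
    case 3: memcpy (end - 48, mask, 16); /* fall through */
    case 2: memcpy (end - 32, mask, 16); /* fall through */
    case 1: memcpy (end - 16, mask, 16); /* fall through */
    case 0: break;
    }
}
```

A caller that knows the size at the call site would pick the entry point
once, which is the part that a direct call into the asm sequence gets for
free.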

A per-dso scheme would be possible with more bookkeeping (as I don't know how
to convince the compiler to do that); you would need to end compilation by
adding a file with hidden variables with the protected attribute.

A similar idea would make more sense as a gcc optimization that first
extracts the address from the plt.



