This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH v3 00/18] Improve generic string routines



On 10/01/2018 20:30, Ondřej Bílka wrote:
> On Wed, Jan 10, 2018 at 10:47:44AM -0200, Adhemerval Zanella wrote:
>> It is an update of Richard's previous patchset [1] to provide generic
>> string implementations for newer ports and let them focus only on
>> specific routines to get a better overall improvement.
>> It is done by:
>>
> This is basically a poorly reinvented version of my patch to make generic
> string routines. By dusting it off one could get a lot better performance
> than this.

Why not work together to get a better implementation instead of bashing a
patch proposal by just saying 'I know/did better'? I did take a look at
your suggestion, and my first impression was that it is more convoluted,
with a lot of over-engineering. I used Richard's proposal as a starting
point mainly because I thought it is better decomposed and organized
around the internal string operations, and a bit simpler for generic
implementations.

I am not saying your previous proposal is without value, and I would like
it if we could improve this patch proposal with your input.

> 
> This contains a lot of glaring issues; the main ones are:
> strnlen - quite ineffective, as testing vs zero is faster than generic C.
> Could be trivially improved by inlining memchr. 

Patch 08/08 is using memchr instead (it is just not inlined)...
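
For reference, the memchr-based version essentially reduces to a single
call; a minimal sketch (simplified for illustration, not the exact patch
code, and generic_strnlen is just a placeholder name):

  #include <stddef.h>
  #include <string.h>

  /* Sketch: strnlen expressed in terms of memchr.  If no NUL byte is
     found within MAXLEN bytes, the result is MAXLEN.  */
  size_t
  generic_strnlen (const char *s, size_t maxlen)
  {
    const char *p = memchr (s, '\0', maxlen);
    return p == NULL ? maxlen : (size_t) (p - s);
  }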

> 
> strcmp - this
> 
> +  /* Handle the unaligned bytes of p1 first.  */
> +  n = -(uintptr_t)p1 % sizeof(op_t);
> +  for (i = 0; i < n; ++i)
> 
> is the worst way to write memcmp, as you make a loop with an
> unpredictable branch about alignment. Also, in strcmp, trying to be
> clever usually causes a performance regression because it is likely that
> you get a difference in the starting bytes.

I am open to suggestions here; usually the other methods trade speed for
more complexity and hardware/ABI idiosyncrasies. Also keep in mind that
the idea of the implementation is to be used on a large set of different
hardware, where strategies like unaligned accesses within a page cannot
be guaranteed to work.


> 
> strcpy - this was previously changed to strlen+memcpy because, with
> platform-specific strlen/memcpy, it is faster. One should at least check
> whether some platforms are affected. I wouldn't be surprised if one could
> get a faster function just by adding some unrolling to memcpy. 
> 

It really depends; for instance, the aarch64 implementation sets a
threshold where strlen+memcpy is actually used and then switches to
load/store. Also, both strlen and memcpy are inlined, which amortizes the
function call cost.

I thought about adding such an optimization, but it would need to rely on
the compiler being able to inline size-bounded strlen/memcpy to actually
make sense for small sizes (which is not true for a lot of platforms).
And there is also the question of which size to use as the threshold.
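
Just to make the alternative concrete, the strlen+memcpy variant under
discussion is essentially the following sketch (strcpy_via_memcpy is a
placeholder name; whether it wins depends on the platform's strlen and
memcpy and on the compiler inlining them for small sizes):

  #include <string.h>

  /* Sketch: strcpy written as strlen + memcpy.  memcpy returns DST, so
     the result can be returned directly.  */
  char *
  strcpy_via_memcpy (char *dst, const char *src)
  {
    size_t len = strlen (src) + 1;   /* include the terminating NUL */
    return memcpy (dst, src, len);
  }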

