This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH v3 00/18] Improve generic string routines
- From: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: libc-alpha at sourceware dot org
- Date: Thu, 11 Jan 2018 08:54:15 -0200
- Subject: Re: [PATCH v3 00/18] Improve generic string routines
- Authentication-results: sourceware.org; auth=none
- References: <1515588482-15744-1-git-send-email-adhemerval.zanella@linaro.org> <20180110223020.GB12399@domone>
On 10/01/2018 20:30, Ondřej Bílka wrote:
> On Wed, Jan 10, 2018 at 10:47:44AM -0200, Adhemerval Zanella wrote:
>> It is an update of previous Richard's patchset [1] to provide generic
>> string implementation for newer ports and make them only focus on
>> just specific routines to get a better overall improvement.
>> It is done by:
>>
> This is basically a poorly reinvented version of my patch to make generic
> string routines. By dusting it off one could get a lot better performance
> than this.
Why not work together to get a better implementation instead of
bashing a patch proposal by just saying 'I know/did better'? I did take a
look at your suggestion and my first impression was that it was more
convoluted, with a lot of over-engineering. I used Richard's proposal as a
starting point mainly because I thought it is better decomposed and organized
around the internal string operations and a bit simpler for generic
implementations.
I am not saying your previous proposal is without value, and I would like it
if we could improve this patch proposal with your input.
>
> This contains a lot of glaring issues; the main ones are:
> strnlen - quite inefficient, as testing against zero is faster than generic C.
> Could be trivially improved by inlining memchr.
The 08/08 uses memchr instead (it is just not inlined)...
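For reference, a strnlen built on top of memchr amounts to something like the
sketch below (illustrative only, not the patch's actual code; the function
name is made up):

```c
#include <stddef.h>
#include <string.h>

/* Illustrative sketch only (not the patch's code): strnlen expressed
   in terms of memchr, which ports typically provide an optimized
   version of.  */
static size_t
sketch_strnlen (const char *s, size_t maxlen)
{
  const char *p = memchr (s, '\0', maxlen);
  return p != NULL ? (size_t) (p - s) : maxlen;
}
```

Whether this beats a direct word-at-a-time loop then depends on how good the
port's memchr is and on the call overhead when it is not inlined.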
>
> strcmp - this
>
> + /* Handle the unaligned bytes of p1 first. */
> + n = -(uintptr_t)p1 % sizeof(op_t);
> + for (i = 0; i < n; ++i)
>
> is the worst way to write memcmp, as you make a loop with an unpredictable
> branch about alignment. Also, in strcmp, trying to be clever usually
> causes performance regressions because it is likely that you get a difference
> in the starting bytes.
I am open to suggestions here; usually the other methods trade speed for
more complexity and hardware/ABI idiosyncrasies. Also keep in mind the idea
of the implementation is to be used on a large set of different hardware,
where strategies like unaligned access within a page cannot be guaranteed
to work.
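For context, the alignment-prologue pattern quoted above looks roughly like
this (an illustrative sketch, not the actual patch; op_t and sketch_strcmp
are placeholder names, and the aligned word-at-a-time main loop is elided):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef unsigned long op_t;  /* word type for the word-at-a-time loop */

/* Illustrative sketch of the quoted prologue: compare bytes one at a
   time until p1 reaches an op_t boundary.  The objection above is
   that this byte loop adds an unpredictable branch, and that strcmp
   inputs usually differ in the first few bytes anyway.  */
static int
sketch_strcmp (const char *p1, const char *p2)
{
  size_t n = -(uintptr_t) p1 % sizeof (op_t);
  for (size_t i = 0; i < n; ++i)
    {
      unsigned char c1 = p1[i], c2 = p2[i];
      if (c1 != c2 || c1 == '\0')
        return c1 - c2;
    }
  /* The aligned word-at-a-time main loop would go here; fall back to
     a plain byte compare to keep the sketch short.  */
  return strcmp (p1 + n, p2 + n);
}
```

The advantage is that the main loop can then use aligned op_t loads safely on
any hardware, which is exactly the portability constraint mentioned above.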
>
> strcpy - this was previously changed to strlen+memcpy because with
> platform-specific strlen/memcpy it is faster. One should at least check
> if some platforms are affected. I wouldn't be surprised if one could
> get a faster function just by adding some unrolling to memcpy.
>
It really depends; for instance, the aarch64 implementation sets a threshold
below which strlen+memcpy is actually used, and above which it switches to
load/store. Also, both strlen and memcpy are inlined, which amortizes the
function call cost.
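As a reference point, the strlen+memcpy strategy amounts to something like
the sketch below (illustrative only, with a made-up name; it is not the
aarch64 code and omits its size threshold):

```c
#include <stddef.h>
#include <string.h>

/* Illustrative sketch: strcpy via strlen+memcpy.  When the compiler
   can inline both calls, the function-call cost is amortized, which
   is the point made above.  */
static char *
sketch_strcpy (char *dst, const char *src)
{
  size_t len = strlen (src);
  memcpy (dst, src, len + 1);  /* +1 to include the terminating NUL */
  return dst;
}
```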
I thought about adding such an optimization, but it would need to rely on the
compiler being able to inline a size-bounded strlen/memcpy to actually make
sense for small sizes (which is not true for a lot of platforms). And
there is also the issue of which size to use as the threshold.