This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH 4/*] Generic string memchr and strnlen
- From: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: libc-alpha at sourceware dot org
- Date: Tue, 28 Jul 2015 11:05:17 -0300
- Subject: Re: [PATCH 4/*] Generic string memchr and strnlen
- Authentication-results: sourceware.org; auth=none
- References: <003b01d0c622$dd96f530$98c4df90$ at com> <20150724153758 dot GA855 at domone> <003c01d0c62f$3440aa00$9cc1fe00$ at com> <20150724170425 dot GA10041 at domone> <003d01d0c63b$0080b730$01822590$ at com> <20150724191248 dot GA2889 at domone> <004201d0c873$bd3b9fe0$37b2dfa0$ at com> <20150727165642 dot GA22842 at domone> <55B67A68 dot 5070804 at linaro dot org> <20150728010059 dot GB21851 at domone>
On 27-07-2015 22:00, Ondřej Bílka wrote:
> On Mon, Jul 27, 2015 at 03:37:28PM -0300, Adhemerval Zanella wrote:
>>> The best example is strrchr, where almost all architectures got it
>>> wrong the first time until I told them otherwise. The problem is that
>>> they calculate the position of the byte at each iteration instead of
>>> just at the end, which is inefficient. Now tile and powerpc have this
>>> problem. My proposed alternative
>>>
>>> return memrchr (s, c, strlen (s) + 1);
>>>
>>> would beat these on suitable inputs. I don't know if it does in
>>> practice, as there is the extra strlen call.
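The proposed one-liner can be sketched as a standalone function (a sketch only, not glibc's implementation; memrchr is a GNU extension, so _GNU_SOURCE is needed, and the + 1 makes the backward search cover the terminator, so searching for '\0' also works as strrchr requires):

```c
#define _GNU_SOURCE
#include <string.h>

/* Hypothetical generic strrchr built from strlen + memrchr, as suggested
   above: one forward pass to find the length, then one backward search.
   The position of the match is computed only once, at the end.  */
static char *
strrchr_via_memrchr (const char *s, int c)
{
  /* strlen (s) + 1 includes the terminator, so c == '\0' is found too.  */
  return memrchr (s, c, strlen (s) + 1);
}
```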
>>
>> You make a lot of assumptions, and I doubt very much that it will be
>> faster on every input. Assuming assembly implementations (no function
>> call), I can see this being faster for a long string with the character
>> found near the end, but it could be slower when searching large strings
>> with the character near the start or middle. It will really depend on
>> the input character size and distribution.
>>
> No, I don't make any assumptions. I know that the performance of these
> could be terrible. For the normal benchtest you get the following.
>
> simple_strrchr __strrchr_power7 __strrchr_ppc
> Length 2048, alignment in bytes 0: 1334.92 114.344 182.391
> Length 256, alignment in bytes 1: 173.266 22.5625 35.4375
>
> If I change it to call strrchr("aaaa...aaaa",'a') I get the following:
>
> Length 2048, alignment in bytes 0: 1602 578.984 1555.7
> Length 2048, alignment in bytes 0: 1332.17 485.922 8393.11
> Length 256, alignment in bytes 1: 115.25 63.9531 1018.58
>
> And when I change the benchtest generation to
>
> for (i = 0; i < len; ++i)
> buf[align + i] = 1 + ((unsigned)(random())) % 8;
>
> I get the following:
>
> Length 2048, alignment in bytes 0: 1602 578.984 1555.7
> Length 2048, alignment in bytes 1: 1724.23 581.375 1789.81
> Length 256, alignment in bytes 1: 186.875 57.9375 183.5
>
>
> where strlen+memrchr is definitely faster. Similar inputs can occur
> when you search for the last '/' in a path, since its frequency is
> typically less than 1/8.
Regarding this I agree with Wilco's remarks [1]; let's continue on that
sub-thread.
[1] https://sourceware.org/ml/libc-alpha/2015-07/msg00937.html
>
>
>>>
>>> Second would be strcpy and strncpy, where the assembly implementations
>>> of some architectures look dubious and you could beat them with a
>>> memcpy call + strlen. I know definitely about the powerpc power7 and
>>> generic powerpc implementations.
>>
>> From the previous discussion, the idea of the power7 strcpy/strncpy is
>> to not use unaligned reads/writes due to architecture constraints. And
>> again you make a lot of assumptions: I doubt that using the power7
>> memcpy/strlen (which also do only aligned accesses) would be faster than
>> the current strcpy/strncpy. It could be the case for large strings,
>> where memcpy will gain from the use of VSX instructions; but even so, I
>> would like some data before drawing conclusions.
>>
> No, I don't make assumptions, I just look at results. These clearly show
> that the implementation is flawed. If I add the following implementations
> to the ifunc_impl_list and change simple_strcpy to the one from
> string/strcpy.c, then I get the following, which clearly shows that the
> current implementation is bad; by inlining strlen you could probably
> decrease the threshold to 32 bytes.
>
> +extern __typeof (memcpy) __memcpy_power7 attribute_hidden;
> +extern __typeof (strlen) __strlen_power7 attribute_hidden;
> +
> +
> +char *__strcpy_power7b (char *dest, const char *src)
> +{
> + return __memcpy_power7 (dest, src, __strlen_power7 (src) + 1);
> +}
> +
> +extern __typeof (memcpy) __memcpy_ppc attribute_hidden;
> +extern __typeof (strlen) __strlen_ppc attribute_hidden;
> +
> +
> +char *__strcpy_ppcb (char *dest, const char *src)
> +{
> + return __memcpy_ppc (dest, src, __strlen_ppc (src) + 1);
> +}
> +
> +
>
>
> simple_strcpy __strcpy_power7 __strcpy_power7b __strcpy_ppcb __strcpy_ppc
>
> Length 15, alignments in bytes 0/ 7: 15.8438 9 11.8281 15.3125 10.25
> Length 15, alignments in bytes 7/ 0: 9 8.25 8.09375 9.09375 8.1875
> Length 16, alignments in bytes 0/ 0: 9 5 8.25 8.6875 4.85938
> Length 16, alignments in bytes 7/ 2: 9.53125 9.75 8.26562 9.6875 7.85938
> Length 32, alignments in bytes 0/ 0: 11.4062 4.79688 9.32812 10.7656 5.03125
> Length 32, alignments in bytes 6/ 4: 11.3125 11.2344 10 12.625 16.9219
> Length 64, alignments in bytes 0/ 0: 12.25 6.75 11.2656 13 7.28125
> Length 64, alignments in bytes 5/ 6: 25.4375 17.4844 24.5 25.0625 30.3125
> Length 128, alignments in bytes 0/ 0: 15.5938 8.96875 14.5 17.1562 13
> Length 128, alignments in bytes 4/ 0: 16.5 22.4219 15 22.75 22.7188
> Length 256, alignments in bytes 0/ 0: 20.3438 20 19.0781 27.1875 21.9531
> Length 256, alignments in bytes 3/ 2: 23.0781 40.5 22.1875 41.5 97.5625
> Length 512, alignments in bytes 0/ 0: 35.3906 33.7031 34.8594 44.6094 37.5
> Length 512, alignments in bytes 2/ 4: 58.0625 64.2344 57.4531 78.5 1897.17
> Length 1024, alignments in bytes 0/ 0: 61.2812 61.5156 60.6406 75 69.75
> Length 1024, alignments in bytes 1/ 6: 82.5 118.266 81.4219 128.312 966.078
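For reference, the idea in the patch above can be expressed with the public interfaces instead of the glibc-internal __memcpy_power7/__strlen_power7 symbols (the function name here is illustrative, not glibc's):

```c
#include <string.h>

/* Sketch of the strcpy-as-memcpy+strlen idea from the patch above, using
   the public memcpy/strlen rather than glibc-internal symbols.  The + 1
   copies the terminating '\0' along with the string bytes, and memcpy
   returns dest, matching strcpy's return value.  */
static char *
strcpy_via_memcpy (char *dest, const char *src)
{
  return memcpy (dest, src, strlen (src) + 1);
}
```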
I checked your suggestion and it does seem better on most inputs.
>
>>>
>>> Third would be strcmp et al., where writing a correct loop is very
>>> tricky and only my x64 implementation does it.
>>>
>>>>> And as I read the assembly, it isn't particularly well optimized for
>>>>> most architectures. In most cases the code looks as if you took the
>>>>> current generic implementation and ran gcc -S. For example, most
>>>>> assembly implementations don't do loop unrolling.
>>>>
>>>> Loop unrolling doesn't always increase performance.
>>>>
>>> But it often does, by allowing other optimizations: you need only one
>>> addition instead of four, you can do four loads in parallel without
>>> worrying that a speculative load would fault, etc.
>>>
>>> While it doesn't have to improve performance, in most cases it does.
>>
>> It should be addressed case by case: point out specific assembly
>> implementations that you think would benefit from some loop unrolling.
>
> It's basically every implementation; the question is whether it's worth
> the code-size increase. I want to test my four-times-unrolled strlen
> against archs that must do the search using only arithmetic, without a
> bytewise-equality instruction.
>
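A four-times-unrolled, arithmetic-only strlen of the kind described can be sketched roughly as follows. This is not glibc's code: it uses the classic "has zero byte" bitmask, names are illustrative, and for simplicity the unrolled loop may read a few words past the terminator, so callers must guarantee that memory is readable (real implementations ensure this via page-aligned loads):

```c
#include <stdint.h>
#include <string.h>

#define ONES   ((uintptr_t) -1 / 0xff)   /* 0x0101...01 */
#define HIGHS  (ONES * 0x80)             /* 0x8080...80 */
/* Nonzero iff some byte of X is zero; pure arithmetic, no bytewise
   equality instruction needed.  */
#define HASZERO(x) ((((x) - ONES) & ~(x)) & HIGHS)

static size_t
strlen_unrolled (const char *s)
{
  const char *p = s;

  /* Advance byte by byte until word-aligned, so word loads are aligned.  */
  for (; (uintptr_t) p % sizeof (uintptr_t) != 0; p++)
    if (*p == '\0')
      return p - s;

  /* Main loop: four word loads per iteration and a single pointer
     update, with no per-iteration position calculation.  */
  const uintptr_t *w = (const uintptr_t *) p;
  while (!(HASZERO (w[0]) | HASZERO (w[1]) | HASZERO (w[2]) | HASZERO (w[3])))
    w += 4;

  /* Locate the terminator bytewise only once, at the end.  */
  p = (const char *) w;
  while (*p != '\0')
    p++;
  return p - s;
}
```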