This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH 4/*] Generic string memchr and strnlen



On 27-07-2015 22:00, Ondřej Bílka wrote:
> On Mon, Jul 27, 2015 at 03:37:28PM -0300, Adhemerval Zanella wrote:
>>> Best example is strrchr, where almost all architectures got it wrong
>>> the first time until I told them otherwise. The problem is that they
>>> calculate the position of the byte at each iteration instead of just
>>> at the end, which is inefficient. Currently tile and powerpc have this
>>> problem. My proposed alternative
>>>
>>> return memrchr (s, c, strlen (s) + 1);
>>>
>>> would beat these on suitable inputs. I don't know if it would in
>>> practice, as there is an extra strlen call.
>>
>> You make a lot of assumptions, and I doubt very much that it will be faster
>> on every input.  Assuming assembly implementations (no function call) I
>> do see that this could be faster for a long string with the character
>> found at the end of the string, but I see it could potentially be slower
>> if it searches large strings with the character at the start/middle.  It
>> will really depend on the input size and character distribution.
>>
> No, I don't make any assumptions. I know that the performance of these
> could be terrible. For the normal benchtest you get the following:
> 
>         simple_strrchr  __strrchr_power7        __strrchr_ppc
> Length 2048, alignment in bytes  0:     1334.92 114.344 182.391
> Length  256, alignment in bytes  1:     173.266 22.5625 35.4375
> 
> If I change it to call strrchr("aaaa...aaaa",'a') I get the following:
> 
> Length 2048, alignment in bytes  0:     1602    578.984 1555.7
> Length 2048, alignment in bytes  0:     1332.17 485.922 8393.11
> Length  256, alignment in bytes  1:     115.25  63.9531 1018.58
> 
> And when I change benchtest generation to 
> 
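>  /* Fill the buffer with random byte values 1..8, so any given value
>     occurs with a frequency of about 1/8.  */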
>  for (i = 0; i < len; ++i)
>    buf[align + i] = 1 + ((unsigned)(random())) % 8;
> 
> I get the following:
> 
> Length 2048, alignment in bytes  0:     1602    578.984 1555.7
> Length 2048, alignment in bytes  1:     1724.23 581.375 1789.81
> Length  256, alignment in bytes  1:     186.875 57.9375 183.5
> 
> 
> where strlen+memrchr is definitely faster. Similar inputs could occur
> if you search for the last '/' in paths, as its frequency is typically
> less than 1/8.

Regarding this I agree with Wilco's remarks [1]; let's continue on that
sub-thread.

[1] https://sourceware.org/ml/libc-alpha/2015-07/msg00937.html
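
For reference, the alternative quoted above, written out as a standalone
function, would look roughly like the sketch below (the function name is
only illustrative; memrchr is a GNU extension, so _GNU_SOURCE is assumed,
and the +1 keeps the terminating NUL visible so that searching for '\0'
still works):

#define _GNU_SOURCE
#include <string.h>

/* Sketch only: strrchr expressed as one forward strlen pass plus one
   backward memrchr pass, so the byte position is computed only once.  */
char *
strrchr_via_memrchr (const char *s, int c)
{
  return memrchr (s, (unsigned char) c, strlen (s) + 1);
}

Whether the extra forward strlen pass pays off depends, as discussed in
this sub-thread, on where and how often the character occurs in the string.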

> 
>  
>>>
>>> Second would be strcpy and strncpy, where the assembly implementations of
>>> some architectures look dubious and you could beat them with a memcpy
>>> call + strlen. I know this definitely for the powerpc power7 and ppc
>>> implementations.
>>
>> From previous discussion, the idea of the power7 strcpy/strncpy is to not use
>> unaligned reads/writes due to architecture constraints.  And again you make
>> a lot of assumptions: I doubt that using the power7 memcpy/strlen (which also
>> do only aligned accesses) would be faster than the current strcpy/strncpy.
>> It could be the case for large strings, where memcpy will gain because of the
>> use of VSX instructions; but even so I would like some data before drawing
>> conclusions.
>>
> No, I don't make assumptions, I just look at the results. These clearly show
> that the implementation is flawed. If I add the following implementations to
> ifunc_impl_list and change simple_strcpy to the one from string/strcpy.c,
> then I get the following, which clearly shows that the current implementation
> is bad; with an inlined strlen you could probably decrease the threshold to 32
> bytes.
> 
> +extern __typeof (memcpy) __memcpy_power7 attribute_hidden;
> +extern __typeof (strlen) __strlen_power7 attribute_hidden;
> +
> +
> +char *__strcpy_power7b (char *dest, const char *src)
> +{
> +  return __memcpy_power7 (dest, src, __strlen_power7 (src) + 1);
> +}
> +
> +extern __typeof (memcpy) __memcpy_ppc attribute_hidden;
> +extern __typeof (strlen) __strlen_ppc attribute_hidden;
> +
> +
> +char *__strcpy_ppcb (char *dest, const char *src)
> +{
> +  return __memcpy_ppc (dest, src, __strlen_ppc (src) + 1);
> +}
> +
> +
> 
> 
>                simple_strcpy   __strcpy_power7 __strcpy_power7b __strcpy_ppcb   __strcpy_ppc
> 
> Length   15, alignments in bytes  0/ 7: 15.8438 9       11.8281 15.3125	10.25
> Length   15, alignments in bytes  7/ 0: 9       8.25    8.09375 9.09375	8.1875
> Length   16, alignments in bytes  0/ 0: 9       5       8.25    8.6875	4.85938
> Length   16, alignments in bytes  7/ 2: 9.53125 9.75    8.26562 9.6875	7.85938
> Length   32, alignments in bytes  0/ 0: 11.4062 4.79688 9.32812 10.7656 5.03125
> Length   32, alignments in bytes  6/ 4: 11.3125 11.2344 10      12.625	16.9219
> Length   64, alignments in bytes  0/ 0: 12.25   6.75    11.2656 13	7.28125
> Length   64, alignments in bytes  5/ 6: 25.4375 17.4844 24.5    25.0625 30.3125
> Length  128, alignments in bytes  0/ 0: 15.5938 8.96875 14.5    17.1562 13
> Length  128, alignments in bytes  4/ 0: 16.5    22.4219 15      22.75	22.7188
> Length  256, alignments in bytes  0/ 0: 20.3438 20      19.0781 27.1875 21.9531
> Length  256, alignments in bytes  3/ 2: 23.0781 40.5    22.1875 41.5	97.5625
> Length  512, alignments in bytes  0/ 0: 35.3906 33.7031 34.8594 44.6094 37.5
> Length  512, alignments in bytes  2/ 4: 58.0625 64.2344 57.4531 78.5	1897.17
> Length 1024, alignments in bytes  0/ 0: 61.2812 61.5156 60.6406 75	69.75
> Length 1024, alignments in bytes  1/ 6: 82.5    118.266 81.4219 128.312 966.078

I checked your suggestion and it does seem better for most of the inputs.
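
On the threshold remark above: one way to read it is a length-dispatching
strcpy along the following lines. This is purely illustrative; the function
name, the structure, and the 32-byte cutoff are assumptions taken from the
figure speculated about above, not from measurements:

#include <string.h>

/* Hypothetical sketch only: copy byte by byte below a cutoff, and let
   memcpy handle longer strings once strlen has given the length.  The
   32-byte cutoff is just the figure mentioned above, not a tuned value.  */
char *
strcpy_threshold (char *dest, const char *src)
{
  size_t len = strlen (src) + 1;   /* include the terminating NUL */

  if (len <= 32)
    {
      char *d = dest;
      const char *p = src;
      while ((*d++ = *p++) != '\0')
        ;
    }
  else
    memcpy (dest, src, len);

  return dest;
}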

>  
>>>
>>> Third would be strcmp et al.; there, writing a correct loop is very tricky,
>>> and only my x64 implementation does it.
>>>
>>>>> And as I read the assembly it isn't particularly well optimized for most
>>>>> architectures. In most cases the code looks like you took the current
>>>>> generic implementation and ran gcc -S. For example, most assembly
>>>>> implementations don't do loop unrolling.
>>>>
>>>> Loop unrolling doesn't always increase performance.
>>>>
>>> But it often does, by allowing other optimizations: having only one
>>> addition instead of four, doing four loads in parallel without
>>> worrying that a speculative load would fault, etc.
>>>
>>> While it doesn't have to improve performance, in most cases it does.
>>
>> It should be addressed case by case, e.g. by pointing out specific assembly
>> implementations where you think there can be benefits from some loop
>> unrolling.
> 
> It's basically every implementation; the question is whether it's worth the
> code size increase. I want to test my four-times-unrolled strlen against archs
> that must do the search using just arithmetic, without a bytewise equality
> instruction.
> 
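
For what it is worth, the kind of four-times-unrolled, arithmetic-only strlen
being described would look roughly like the sketch below. This is not glibc
code and the names are made up; it assumes 8-byte words and aligns to the
full 32-byte block first, so the four unconditional loads can never cross
into an unmapped page:

#include <stddef.h>
#include <stdint.h>

/* Sketch of a word-at-a-time strlen with the main loop unrolled four
   times, using only the arithmetic "has a zero byte" test rather than
   a bytewise-equality instruction.  */

#define ONES  ((uint64_t) 0x0101010101010101ULL)
#define HIGHS ((uint64_t) 0x8080808080808080ULL)
#define HAS_ZERO(x) ((((x) - ONES) & ~(x)) & HIGHS)

size_t
strlen_unrolled4 (const char *s)
{
  const char *p = s;

  /* Align to a 32-byte block so the four loads below stay inside the
     page that contains the string and cannot fault.  */
  while (((uintptr_t) p & 31) != 0)
    {
      if (*p == '\0')
        return p - s;
      p++;
    }

  const uint64_t *w = (const uint64_t *) p;

  /* One pointer update and one branch per 32 bytes; the four loads are
     independent and can be issued in parallel.  */
  while (!(HAS_ZERO (w[0]) | HAS_ZERO (w[1])
           | HAS_ZERO (w[2]) | HAS_ZERO (w[3])))
    w += 4;

  /* The terminator is somewhere in this 32-byte block; find it.  */
  p = (const char *) w;
  while (*p != '\0')
    p++;
  return p - s;
}

A real implementation would also locate the zero byte arithmetically instead
of the final bytewise rescan, but the rescan keeps the sketch short and
endian-neutral.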

