This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH 4/*] Generic string memchr and strnlen
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>
- Cc: libc-alpha at sourceware dot org
- Date: Tue, 28 Jul 2015 03:00:59 +0200
- Subject: Re: [PATCH 4/*] Generic string memchr and strnlen
- Authentication-results: sourceware.org; auth=none
- References: <003b01d0c622$dd96f530$98c4df90$ at com> <20150724153758 dot GA855 at domone> <003c01d0c62f$3440aa00$9cc1fe00$ at com> <20150724170425 dot GA10041 at domone> <003d01d0c63b$0080b730$01822590$ at com> <20150724191248 dot GA2889 at domone> <004201d0c873$bd3b9fe0$37b2dfa0$ at com> <20150727165642 dot GA22842 at domone> <55B67A68 dot 5070804 at linaro dot org>
On Mon, Jul 27, 2015 at 03:37:28PM -0300, Adhemerval Zanella wrote:
> > The best example is strrchr, where almost all architectures got it
> > wrong the first time until I told them otherwise. The problem is that
> > they calculate the position of the byte at each iteration instead of
> > just at the end, which is inefficient. Currently tile and powerpc have
> > this problem. My proposed alternative
> >
> > return memrchr (s, c, strlen (s) + 1);
> >
> > would beat these on suitable inputs. I don't know if it does in
> > practice, as there is an extra strlen call.
>
> You make a lot of assumptions, and I doubt very much that it will be
> faster on every input. Assuming assembly implementations (no function
> call), I do see that this could be faster for a long string with the
> character found at the end of the string, but I see it could
> potentially be slower when searching large strings with the character
> at the start/middle. It will really depend on the input string size
> and character distribution.
>
No, I don't make any assumptions; I know that the performance of these
could be terrible. For the normal benchtest you get the following:
simple_strrchr __strrchr_power7 __strrchr_ppc
Length 2048, alignment in bytes 0: 1334.92 114.344 182.391
Length 256, alignment in bytes 1: 173.266 22.5625 35.4375
If I change it to call strrchr("aaaa...aaaa",'a') I get the following:
Length 2048, alignment in bytes 0: 1602 578.984 1555.7
Length 2048, alignment in bytes 0: 1332.17 485.922 8393.11
Length 256, alignment in bytes 1: 115.25 63.9531 1018.58
And when I change benchtest generation to
for (i = 0; i < len; ++i)
  buf[align + i] = 1 + ((unsigned) (random ())) % 8;
I get the following:
Length 2048, alignment in bytes 0: 1602 578.984 1555.7
Length 2048, alignment in bytes 1: 1724.23 581.375 1789.81
Length 256, alignment in bytes 1: 186.875 57.9375 183.5
where strlen+memrchr is definitely faster. Similar inputs could occur
when searching for the last '/' in paths, as its frequency is typically
less than 1/8.
> >
> > Second would be strcpy and strncpy, where the assembly
> > implementations of some architectures look dubious and you could
> > beat these with a memcpy call + strlen. I know definitely about the
> > powerpc power7 and ppc implementations.
>
> From the previous discussion, the idea of the power7 strcpy/strncpy is
> to not use unaligned reads/writes due to architecture constraints. And
> again you make a lot of assumptions: I doubt that using the power7
> memcpy/strlen (which also do only aligned accesses) would be faster
> than the current strcpy/strncpy. It could be the case for large
> strings, where memcpy will gain because of its use of VSX
> instructions; but even so I would like some data before drawing
> conclusions.
>
No, I don't make assumptions; just look at the results. These clearly
show that the implementation is flawed. If I add the following
implementations to ifunc_impl_list and change simple_strcpy to the one
from string/strcpy.c, then I get the results below, which clearly show
that the current implementation is bad; by inlining strlen you could
probably decrease the threshold to 32 bytes.
+extern __typeof (memcpy) __memcpy_power7 attribute_hidden;
+extern __typeof (strlen) __strlen_power7 attribute_hidden;
+
+
+char *__strcpy_power7b (char *dest, const char *src)
+{
+ return __memcpy_power7 (dest, src, __strlen_power7 (src) + 1);
+}
+
+extern __typeof (memcpy) __memcpy_ppc attribute_hidden;
+extern __typeof (strlen) __strlen_ppc attribute_hidden;
+
+
+char *__strcpy_ppcb (char *dest, const char *src)
+{
+ return __memcpy_ppc (dest, src, __strlen_ppc (src) + 1);
+}
+
+
simple_strcpy __strcpy_power7 __strcpy_power7b __strcpy_ppcb __strcpy_ppc
Length 15, alignments in bytes 0/ 7: 15.8438 9 11.8281 15.3125 10.25
Length 15, alignments in bytes 7/ 0: 9 8.25 8.09375 9.09375 8.1875
Length 16, alignments in bytes 0/ 0: 9 5 8.25 8.6875 4.85938
Length 16, alignments in bytes 7/ 2: 9.53125 9.75 8.26562 9.6875 7.85938
Length 32, alignments in bytes 0/ 0: 11.4062 4.79688 9.32812 10.7656 5.03125
Length 32, alignments in bytes 6/ 4: 11.3125 11.2344 10 12.625 16.9219
Length 64, alignments in bytes 0/ 0: 12.25 6.75 11.2656 13 7.28125
Length 64, alignments in bytes 5/ 6: 25.4375 17.4844 24.5 25.0625 30.3125
Length 128, alignments in bytes 0/ 0: 15.5938 8.96875 14.5 17.1562 13
Length 128, alignments in bytes 4/ 0: 16.5 22.4219 15 22.75 22.7188
Length 256, alignments in bytes 0/ 0: 20.3438 20 19.0781 27.1875 21.9531
Length 256, alignments in bytes 3/ 2: 23.0781 40.5 22.1875 41.5 97.5625
Length 512, alignments in bytes 0/ 0: 35.3906 33.7031 34.8594 44.6094 37.5
Length 512, alignments in bytes 2/ 4: 58.0625 64.2344 57.4531 78.5 1897.17
Length 1024, alignments in bytes 0/ 0: 61.2812 61.5156 60.6406 75 69.75
Length 1024, alignments in bytes 1/ 6: 82.5 118.266 81.4219 128.312 966.078
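The same approach in generic form, using the public functions rather than the powerpc-internal symbols wired up in the patch above, is simply (hypothetical name, for illustration):

```c
#include <string.h>

/* Generic form of the memcpy+strlen strcpy tried above.  strlen
   computes the length once; memcpy then copies the string including
   its terminating NUL, and returns DEST as strcpy must.  */
char *
strcpy_via_memcpy (char *dest, const char *src)
{
  return memcpy (dest, src, strlen (src) + 1);
}
```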
> >
> > Third would be strcmp et al.; writing a correct loop there is very
> > tricky, and only my x64 implementation does it.
> >
> >>> And as I read the assembly, it isn't particularly well optimized
> >>> for most architectures. In most cases the code looks like you took
> >>> the current generic implementation and ran gcc -S. For example,
> >>> most assembly implementations don't do loop unrolling.
> >>
> >> Loop unrolling doesn't always increase performance.
> >>
> > But it often does, by allowing other optimizations: you need only
> > one addition instead of four, you can do four loads in parallel
> > without worrying that a speculative load would fault, etc.
> >
> > While it doesn't have to improve performance, in most cases it does.
>
> It should be addressed case by case, e.g. by pointing out specific
> assembly implementations where you think there can be benefits from
> some loop unrolling.
It's basically every implementation; the question is whether it's worth
the code-size increase. I want to test my four-times-unrolled strlen
against architectures that must do the search using just arithmetic,
without a bytewise-equality instruction.
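A four-times-unrolled strlen of that kind might look like the following sketch (hypothetical code, not the actual implementation; the zero-byte test is the classic arithmetic trick for architectures without a bytewise-equality instruction). Aligning to a 32-byte block first means the four loads of an iteration never cross into an unmapped page, which is the point made above about speculative loads not faulting:

```c
#include <stddef.h>
#include <stdint.h>

/* Classic arithmetic zero-byte test: nonzero iff some byte of X is 0.  */
static inline int
has_zero (uint64_t x)
{
  return ((x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL) != 0;
}

/* Hypothetical four-times-unrolled word-at-a-time strlen sketch.  */
size_t
strlen_unrolled (const char *s)
{
  const char *p = s;

  /* Advance bytewise until P is aligned to a 32-byte block, so the
     unrolled loop's four loads stay within one mapped region.  */
  while ((uintptr_t) p % 32 != 0)
    {
      if (*p == '\0')
        return p - s;
      p++;
    }

  /* Main loop: four word loads per iteration but only one pointer
     addition, and the four loads can issue in parallel.  */
  const uint64_t *w = (const uint64_t *) p;
  while (!(has_zero (w[0]) | has_zero (w[1])
           | has_zero (w[2]) | has_zero (w[3])))
    w += 4;

  /* A zero byte lies somewhere in these 32 bytes; locate it bytewise.  */
  p = (const char *) w;
  while (*p != '\0')
    p++;
  return p - s;
}
```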