This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH 4/*] Generic string memchr and strnlen


On Mon, Jul 27, 2015 at 03:37:28PM -0300, Adhemerval Zanella wrote:
> > Best example is strrchr, where almost all architectures got it wrong
> > the first time until I told them otherwise.  The problem is that they
> > calculate the position of the byte at each iteration instead of just
> > at the end, which is inefficient.  Now tile and powerpc have this
> > problem.  My proposed alternative
> > 
> > return memrchr (s, c, strlen (s) + 1);
> > 
> > would beat these on suitable inputs.  I don't know if it does in
> > practice, as there is the extra strlen call.
> 
> You make a lot of assumptions, and I doubt very much that it will be
> faster on every input.  Assuming assembly implementations (no function
> call), I do see that this could be faster for a long string with the
> character found near the end, but it could potentially be slower when
> searching large strings with the character near the start or middle.
> It will really depend on the input size and character distribution.
>
No, I don't make any assumptions; I know that the performance of these
can be terrible.  For the normal benchtest you get the following:

                                        simple_strrchr  __strrchr_power7  __strrchr_ppc
Length 2048, alignment in bytes  0:     1334.92         114.344           182.391
Length  256, alignment in bytes  1:     173.266         22.5625           35.4375

If I change it to call strrchr ("aaaa...aaaa", 'a') I get the following:

                                        simple_strrchr  __strrchr_power7  __strrchr_ppc
Length 2048, alignment in bytes  0:     1602            578.984           1555.7
Length 2048, alignment in bytes  0:     1332.17         485.922           8393.11
Length  256, alignment in bytes  1:     115.25          63.9531           1018.58

And when I change the benchtest generation to

 for (i = 0; i < len; ++i)
   buf[align + i] = 1 + ((unsigned)(random())) % 8;
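   /* each byte is a random value in 1..8, so a given character occurs
      with frequency of about 1/8 */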

I get the following:

                                        simple_strrchr  __strrchr_power7  __strrchr_ppc
Length 2048, alignment in bytes  0:     1602            578.984           1555.7
Length 2048, alignment in bytes  1:     1724.23         581.375           1789.81
Length  256, alignment in bytes  1:     186.875         57.9375           183.5


where strlen + memrchr is definitely faster.  Similar inputs occur when
you search for the last '/' in paths, as its frequency is typically
less than 1/8.
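
Written out as plain C the alternative is just this (a sketch with an
illustrative name; memrchr is a GNU extension, and the + 1 means the
terminating null is searched too, as strrchr requires):

#include <string.h>

char *
strrchr_via_memrchr (const char *s, int c)
{
  /* strlen (s) + 1 includes the terminating null byte, so searching
     for '\0' returns a pointer to the terminator, as strrchr must.  */
  return (char *) memrchr (s, c, strlen (s) + 1);
}

All the work is then in memrchr and strlen themselves, which is the
point; the cost is the extra strlen pass over the string.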

 
> > 
> > Second would be strcpy and strncpy, where the assembly
> > implementations of some architectures look dubious and you could beat
> > them with a memcpy call + strlen.  I know this definitely about the
> > powerpc power7 and ppc implementations.
> 
> From the previous discussion, the idea of the power7 strcpy/strncpy is
> to not use unaligned reads/writes due to architecture constraints.  And
> again you make a lot of assumptions: I doubt that using the power7
> memcpy/strlen (which also do only aligned accesses) would be faster
> than the current strcpy/strncpy.  It could be the case for large
> strings, where memcpy gains from the use of VSX instructions; but even
> so I would like some data before drawing conclusions.
>
No, I don't make assumptions; just look at the results.  They clearly
show that the implementation is flawed.  If I add the following
implementations, register them in ifunc-impl-list.c, and change
simple_strcpy to the one from string/strcpy.c, then I get the results
below, which clearly show that the current implementation is bad; with
an inlined strlen you could probably decrease the threshold to 32 bytes.

+extern __typeof (memcpy) __memcpy_power7 attribute_hidden;
+extern __typeof (strlen) __strlen_power7 attribute_hidden;
+
+
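+/* strcpy built from the power7 memcpy and strlen: copy strlen (src) + 1
+   bytes so the terminating null is copied as well.  */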
+char *__strcpy_power7b (char *dest, const char *src)
+{
+  return __memcpy_power7 (dest, src, __strlen_power7 (src) + 1);
+}
+
+extern __typeof (memcpy) __memcpy_ppc attribute_hidden;
+extern __typeof (strlen) __strlen_ppc attribute_hidden;
+
+
+char *__strcpy_ppcb (char *dest, const char *src)
+{
+  return __memcpy_ppc (dest, src, __strlen_ppc (src) + 1);
+}
+
+
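
The ifunc-impl-list change is not shown above; the strcpy entries in
sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c would gain
something along these lines (a sketch; the exact hwcap condition for
the power7 variant is my guess):

+	      IFUNC_IMPL_ADD (array, i, strcpy,
+			      hwcap & PPC_FEATURE_HAS_VSX,
+			      __strcpy_power7b)
+	      IFUNC_IMPL_ADD (array, i, strcpy, 1, __strcpy_ppcb)

With those registered, the benchtest reports: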


                                        simple_strcpy  __strcpy_power7  __strcpy_power7b  __strcpy_ppcb  __strcpy_ppc
Length   15, alignments in bytes  0/ 7: 15.8438        9                11.8281           15.3125        10.25
Length   15, alignments in bytes  7/ 0: 9              8.25             8.09375           9.09375        8.1875
Length   16, alignments in bytes  0/ 0: 9              5                8.25              8.6875         4.85938
Length   16, alignments in bytes  7/ 2: 9.53125        9.75             8.26562           9.6875         7.85938
Length   32, alignments in bytes  0/ 0: 11.4062        4.79688          9.32812           10.7656        5.03125
Length   32, alignments in bytes  6/ 4: 11.3125        11.2344          10                12.625         16.9219
Length   64, alignments in bytes  0/ 0: 12.25          6.75             11.2656           13             7.28125
Length   64, alignments in bytes  5/ 6: 25.4375        17.4844          24.5              25.0625        30.3125
Length  128, alignments in bytes  0/ 0: 15.5938        8.96875          14.5              17.1562        13
Length  128, alignments in bytes  4/ 0: 16.5           22.4219          15                22.75          22.7188
Length  256, alignments in bytes  0/ 0: 20.3438        20               19.0781           27.1875        21.9531
Length  256, alignments in bytes  3/ 2: 23.0781        40.5             22.1875           41.5           97.5625
Length  512, alignments in bytes  0/ 0: 35.3906        33.7031          34.8594           44.6094        37.5
Length  512, alignments in bytes  2/ 4: 58.0625        64.2344          57.4531           78.5           1897.17
Length 1024, alignments in bytes  0/ 0: 61.2812        61.5156          60.6406           75             69.75
Length 1024, alignments in bytes  1/ 6: 82.5           118.266          81.4219           128.312        966.078
 
> > 
> > Third would be strcmp et al., where writing a correct loop is very
> > tricky and only my x64 implementation does it.
> > 
> >>> And as I read the assembly, it isn't particularly well optimized for
> >>> most architectures.  In most cases the code looks like you took the
> >>> current generic implementation and ran gcc -S.  For example, most
> >>> assembly implementations don't do loop unrolling.
> >>
> >> Loop unrolling doesn't always increase performance.
> >>
> > But it often does, by enabling other optimizations: you need only one
> > addition instead of four, you can do four loads in parallel without
> > worrying that a speculative load would fault, etc.
> > 
> > While it doesn't have to improve performance, in most cases it does.
> 
> It should be addressed case by case; point out the specific assembly
> implementations where you think there would be benefits from some loop
> unrolling.

It's basically every implementation; the question is whether it's worth
the code size increase.  I want to test my four-times unrolled strlen
against architectures that must do the search using only arithmetic,
without a bytewise-equality instruction.
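
To make that concrete, this is roughly the shape of such a strlen in C:
zero-byte detection with the (v - 0x01...01) & ~v & 0x80...80 trick and
a four-times unrolled main loop.  It is only a sketch assuming 64-bit
words and the usual aliasing liberties of this kind of code, not the
version I benchmark:

#include <stddef.h>
#include <stdint.h>

#define ONES  ((uint64_t) 0x0101010101010101ULL)
#define HIGHS ((uint64_t) 0x8080808080808080ULL)
/* Nonzero iff some byte of V is zero; arithmetic and logic only.  */
#define HAS_ZERO(v) ((((v) - ONES) & ~(v)) & HIGHS)

size_t
strlen_unrolled (const char *s)
{
  const char *p = s;

  /* Go byte by byte until P is aligned to a 32-byte block, so the
     unrolled loop never reads outside an aligned block that the
     string itself touches.  */
  while ((uintptr_t) p % (4 * sizeof (uint64_t)) != 0)
    {
      if (*p == '\0')
        return p - s;
      ++p;
    }

  const uint64_t *w = (const uint64_t *) p;

  /* One pointer update and four independent loads per iteration.  */
  while (!(HAS_ZERO (w[0]) | HAS_ZERO (w[1])
           | HAS_ZERO (w[2]) | HAS_ZERO (w[3])))
    w += 4;

  /* Find the word containing the zero byte, then the byte itself.  */
  while (!HAS_ZERO (*w))
    ++w;
  p = (const char *) w;
  while (*p != '\0')
    ++p;
  return p - s;
}

The unrolling pays off exactly as described above: the main loop does
one pointer update instead of four, and the four loads are independent
and cannot fault because each 32-byte block is aligned.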

