This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] powerpc: strcasestr optimization




On 06/10/2015 04:18 PM, OndÅej BÃlka wrote:

  The onus for making all this happen is on the person who wants to change the code from its status quo.

That would mean that its Rajalakshmi and responsibility to
write benchmark that shows its performance improvement. I stated my
objections explicitly described which inputs to use.

Problem is that Steven failed to do his duty as a powerpc maintainer
which is to read and review change, check benchmark inputs and that
they are adequate. I could supply really bad implementation of strstr
which first checks special case that needle and haystack are periodic
abcdef sequences and returns result expected from benchmark.

I did just simple request that every freshmen should know and Steven
should as powerpc maintainer check how its handled. Its that naive
algorithm and proposed algorithm are slow on inputs of form

strstr("aaaaa....aaaa","aaaa...aaab");

As from your previous mail its their responsibility to add that case for
benchmark. A 'solution' on v2 would be to switch into two-way algorithm
when size is bigger than 2048. That isn't solution as you still have
problem with size 2047. I seen somewhere comment that its security
problem when you have website with search function on user comments and
implementation internally uses strstr then attacker could just post
comment with aaaaa...a and search few times for aa...ab.

2048 number is chosen by checking various length inputs and comparing the performance. So for 2047 case, there wont be worst performance.

Roland as its really basic issue I think that its proposer
responsibility to check. Also for strstr thread Carlos said about strstr
quadratic case:

We should add gnulib's test to the microbenchmarks.


Then as I commented that algorithm is flawed its from my experience,
Raji's description was:

For case 1, I always compare 8 characters at a time using cmpb instruction.
instruction.
-> perform basic strlen, strnlen, checks. Move haystack based on strchr result.
strchr result.
-> Load first 8 bytes(taking care of aligned load)from haystack and needle and compare them.
needle and compare them.
If they are same proceed to next 8 bytes of haystack and needle.
Else load next doubleword from haystack.Form a doubleword with last
seven characters from first load and first character from
second load and compare again with needle. Proceed in a similar way.
This gives good improvement.

Lets not mix the review comments of strstr and strcasestr.(as the algorithm is different)
My comment was that it wastes time most of time as in common case these
already differ in first byte so all these shifts were completely
unnecessary, a simple optimization of naive algorithm using

while(s=strchr(s,n[0]))
If I use while(s=strchr(s,n[0]) as mentioned in
https://sourceware.org/ml/libc-alpha/2015-05/msg00060.html
the quadratic testcase gives worst peformance. Attached the testcase.

would be faster. Steven instead of doing his duty as powerpc maintainer
and checking result of benchtest that I send instead tried to bullshit
himself by telling that powerpc architecture is too special. Then he
asked me do do benchmark instead as stalling tactics.

If he did his duty of reading benchmark results there is already
evidence. First to notice is there are regressions in original
benchmark, like following:

Length   30/2, alignment  0/ 9, found:  20.7812 32.2656 13.125  19.7188
Length   30/2, alignment  0/ 9, fail:   32.4688 33.9219 36.75   31.4688

I agree there are few regression cases(79 out of 13442 cases) in 'proposed_benchresults.txt' attached in
https://sourceware.org/ml/libc-alpha/2015-05/msg00276.html.
This is minimal when compared to the overall performance gain.

That he should comment, as strstr on average end on 10.6 byte after
haystack start its quite relevant.

But if he read proposed benchmark where needle and haystack are randomly
generated from list of 128 characters then he would easily see it.
I have used your proposed benchtest, and it always gives improvement for strcasestr.Please check the file 'newresults' in https://sourceware.org/ml/libc-alpha/2015-06/msg00032.html

Here is trimmed version to see pattern. Note that first there are lot of
times where performance is less than 13. Thats case where strchr didn't
find first character and we just return(generic case now does
unnecessary work leading to worse numbers but thats addressed on patch that I send with strchr loop.)

Its natural to ask what would happen if you called strchr again. In 120
bytes it would probably didn't found second character and we would
finish in 26 cycles at most not 111.5

I expect that average loop performance would be 13.4375/135 = 0.1 cycles
per byte not cycle per byte that this algorithm exhibits after he loses
performance gains from first correct strchr call.

                         simple_strstr   stupid_strstr   __strstr_power7
__strstr_ppc

Length    4/2, alignment  0/ 3, found:  9.35938 8.09375 7.90625 8.5
Length   12/2, alignment  3/ 9, fail:   9.23438 16.5625 7.90625 9.20312
Length   24/2, alignment  3/ 9, fail:   9.54688 26.2812 8.32812 10.1875
Length   44/4, alignment  9/ 3, found:  13.4688 40.9219 9.59375 13.9688
Length   44/4, alignment  9/ 3, fail:   13.4688 40.9531 50.6875 13.9844
Length   52/4, alignment  3/ 9, found:  13.5312 47.2812 9.5625  14.7344
Length   52/4, alignment  3/ 9, fail:   45.9688 53.625  35.9688 47.5938
Length   75/5, alignment 15/ 9, fail:   73.75   67.9375 64.2188 75.9062
Length   75/5, alignment 15/15, found:  15.2188 65.9062 10.7344 17.8438
Length   75/5, alignment 15/15, fail:   61.5469 68.6562 50.375  63.0312
Length  120/8, alignment 15/ 3, fail:   113.281 111.5   103.5   114.406
Length  120/8, alignment 15/ 9, found:  21.25   99.1406 12.8438 28.1406
Length  135/9, alignment 15/15, found:  23.125  110.734 13.4375 30.7969
Length  135/9, alignment 15/15, fail:   147.484 113.781 110.516 165.984
Length  168/12, alignment  3/15, found: 31.2344 134.172 13.6719 39.9375
Length  168/12, alignment  3/15, fail:  170.766 137.609 131.859 179.859
Length  168/12, alignment  9/ 0, found: 100.562 137.922 76.6562 112.578
Length  168/12, alignment  9/ 0, fail:  168.703 144.078 144.828 181.203
Length  168/12, alignment  9/ 3, found: 129.938 137.703 109.828 136.266
Length  168/12, alignment  9/ 3, fail:  178.734 137.875 165.469 183.609
Length  465/31, alignment 15/ 0, fail:  73.0469 363.375 32.2969 87.0625
Length  465/31, alignment 15/ 3, found: 507.656 373.703 429.797 543.625
Length  465/31, alignment 15/ 3, fail:  512.922 373.188 437.047 556.844
Length  465/31, alignment 15/ 9, found: 443.969 379.359 392.828 464.75
Length  465/31, alignment 15/ 9, fail:  484.312 376.406 423.703 513.672
Length  465/31, alignment 15/15, found: 326.859 376     256.625 350.422
Length  465/31, alignment 15/15, fail:  512.641 376.203 439.234 545.906
Length 131071/16, alignment  0/ 0, found:       116583  108378  122125
127809
Length 131071/16, alignment  0/ 0, fail:        118298  108639  122208
128700



--
Thanks
Rajalakshmi S
#include <string.h>		/* for strstr */
#include <stdlib.h>		/* for malloc */
#include <unistd.h>		/* for alarm */
char *my_strstr (char *s, char *n);
int
main ()
{

  int result = 0;
  size_t m = 1000000;
  char *haystack = (char *) malloc (2 * m + 2);
  char *needle = (char *) malloc (m + 2);
  /* Check for quadratic performance.  */
  if (haystack && needle)
    {
      memset (haystack, 'A', 2 * m);
      haystack[2 * m] = 'B';
      haystack[2 * m + 1] = 0;
      memset (needle, 'A', m);
      needle[m] = 'B';
      needle[m + 1] = 0;
      if (!my_strstr (haystack, needle))
	result |= 1;
    }
  return result;

}

char *
my_strstr (char *s, char *n)
{
  if (n[0] == 0)
    return s;
  while (s = strchr (s, n[0]))
    {
      long i;
      for (i = 1; n[i] && n[i] == s[i]; i++);
      if (n[i] == 0)
	return s;
      s++;
    }
  return NULL;
}

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]