This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] powerpc: strcasestr optimization
- From: Rajalakshmi Srinivasaraghavan <raji at linux dot vnet dot ibm dot com>
- To: OndÅej BÃlka <neleai at seznam dot cz>, Roland McGrath <roland at hack dot frob dot com>
- Cc: munroesj at linux dot vnet dot ibm dot com, GNU C Library <libc-alpha at sourceware dot org>, Steve Munroe <sjmunroe at us dot ibm dot com>
- Date: Thu, 11 Jun 2015 00:04:08 +0530
- Subject: Re: [PATCH] powerpc: strcasestr optimization
- Authentication-results: sourceware.org; auth=none
- References: <556C36D8 dot 2070208 at linux dot vnet dot ibm dot com> <20150601122830 dot GA14649 at domone> <1433169407 dot 10235 dot 5 dot camel at sjmunroe-ThinkPad-W500> <20150601162215 dot GA8955 at domone> <1433270823 dot 14958 dot 15 dot camel at sjmunroe-ThinkPad-W500> <20150602210000 dot GB20004 at domone> <1433882432 dot 21101 dot 71 dot camel at sjmunroe-ThinkPad-W500> <20150609210555 dot 3A7CA2C3BDC at topped-with-meat dot com> <1433885184 dot 21101 dot 83 dot camel at sjmunroe-ThinkPad-W500> <20150609214202 dot 5D0612C3BE3 at topped-with-meat dot com> <20150610104842 dot GA4871 at domone>
On 06/10/2015 04:18 PM, OndÅej BÃlka wrote:
The onus for making all this happen is on the person who wants to change the code from its status quo.
That would mean that its Rajalakshmi and responsibility to
write benchmark that shows its performance improvement. I stated my
objections explicitly described which inputs to use.
Problem is that Steven failed to do his duty as a powerpc maintainer
which is to read and review change, check benchmark inputs and that
they are adequate. I could supply really bad implementation of strstr
which first checks special case that needle and haystack are periodic
abcdef sequences and returns result expected from benchmark.
I did just simple request that every freshmen should know and Steven
should as powerpc maintainer check how its handled. Its that naive
algorithm and proposed algorithm are slow on inputs of form
strstr("aaaaa....aaaa","aaaa...aaab");
As from your previous mail its their responsibility to add that case for
benchmark. A 'solution' on v2 would be to switch into two-way algorithm
when size is bigger than 2048. That isn't solution as you still have
problem with size 2047. I seen somewhere comment that its security
problem when you have website with search function on user comments and
implementation internally uses strstr then attacker could just post
comment with aaaaa...a and search few times for aa...ab.
2048 number is chosen by checking various length inputs and comparing
the performance. So for 2047 case, there wont be worst performance.
Roland as its really basic issue I think that its proposer
responsibility to check. Also for strstr thread Carlos said about strstr
quadratic case:
We should add gnulib's test to the microbenchmarks.
Then as I commented that algorithm is flawed its from my experience,
Raji's description was:
For case 1, I always compare 8 characters at a time using cmpb instruction.
instruction.
-> perform basic strlen, strnlen, checks. Move haystack based on strchr result.
strchr result.
-> Load first 8 bytes(taking care of aligned load)from haystack and needle and compare them.
needle and compare them.
If they are same proceed to next 8 bytes of haystack and needle.
Else load next doubleword from haystack.Form a doubleword with last
seven characters from first load and first character from
second load and compare again with needle. Proceed in a similar way.
This gives good improvement.
Lets not mix the review comments of strstr and strcasestr.(as the
algorithm is different)
My comment was that it wastes time most of time as in common case these
already differ in first byte so all these shifts were completely
unnecessary, a simple optimization of naive algorithm using
while(s=strchr(s,n[0]))
If I use while(s=strchr(s,n[0]) as mentioned in
https://sourceware.org/ml/libc-alpha/2015-05/msg00060.html
the quadratic testcase gives worst peformance. Attached the testcase.
would be faster. Steven instead of doing his duty as powerpc maintainer
and checking result of benchtest that I send instead tried to bullshit
himself by telling that powerpc architecture is too special. Then he
asked me do do benchmark instead as stalling tactics.
If he did his duty of reading benchmark results there is already
evidence. First to notice is there are regressions in original
benchmark, like following:
Length 30/2, alignment 0/ 9, found: 20.7812 32.2656 13.125 19.7188
Length 30/2, alignment 0/ 9, fail: 32.4688 33.9219 36.75 31.4688
I agree there are few regression cases(79 out of 13442 cases) in
'proposed_benchresults.txt' attached in
https://sourceware.org/ml/libc-alpha/2015-05/msg00276.html.
This is minimal when compared to the overall performance gain.
That he should comment, as strstr on average end on 10.6 byte after
haystack start its quite relevant.
But if he read proposed benchmark where needle and haystack are randomly
generated from list of 128 characters then he would easily see it.
I have used your proposed benchtest, and it always gives improvement for
strcasestr.Please check the file 'newresults' in
https://sourceware.org/ml/libc-alpha/2015-06/msg00032.html
Here is trimmed version to see pattern. Note that first there are lot of
times where performance is less than 13. Thats case where strchr didn't
find first character and we just return(generic case now does
unnecessary work leading to worse numbers but thats addressed on patch that I send with strchr loop.)
Its natural to ask what would happen if you called strchr again. In 120
bytes it would probably didn't found second character and we would
finish in 26 cycles at most not 111.5
I expect that average loop performance would be 13.4375/135 = 0.1 cycles
per byte not cycle per byte that this algorithm exhibits after he loses
performance gains from first correct strchr call.
simple_strstr stupid_strstr __strstr_power7
__strstr_ppc
Length 4/2, alignment 0/ 3, found: 9.35938 8.09375 7.90625 8.5
Length 12/2, alignment 3/ 9, fail: 9.23438 16.5625 7.90625 9.20312
Length 24/2, alignment 3/ 9, fail: 9.54688 26.2812 8.32812 10.1875
Length 44/4, alignment 9/ 3, found: 13.4688 40.9219 9.59375 13.9688
Length 44/4, alignment 9/ 3, fail: 13.4688 40.9531 50.6875 13.9844
Length 52/4, alignment 3/ 9, found: 13.5312 47.2812 9.5625 14.7344
Length 52/4, alignment 3/ 9, fail: 45.9688 53.625 35.9688 47.5938
Length 75/5, alignment 15/ 9, fail: 73.75 67.9375 64.2188 75.9062
Length 75/5, alignment 15/15, found: 15.2188 65.9062 10.7344 17.8438
Length 75/5, alignment 15/15, fail: 61.5469 68.6562 50.375 63.0312
Length 120/8, alignment 15/ 3, fail: 113.281 111.5 103.5 114.406
Length 120/8, alignment 15/ 9, found: 21.25 99.1406 12.8438 28.1406
Length 135/9, alignment 15/15, found: 23.125 110.734 13.4375 30.7969
Length 135/9, alignment 15/15, fail: 147.484 113.781 110.516 165.984
Length 168/12, alignment 3/15, found: 31.2344 134.172 13.6719 39.9375
Length 168/12, alignment 3/15, fail: 170.766 137.609 131.859 179.859
Length 168/12, alignment 9/ 0, found: 100.562 137.922 76.6562 112.578
Length 168/12, alignment 9/ 0, fail: 168.703 144.078 144.828 181.203
Length 168/12, alignment 9/ 3, found: 129.938 137.703 109.828 136.266
Length 168/12, alignment 9/ 3, fail: 178.734 137.875 165.469 183.609
Length 465/31, alignment 15/ 0, fail: 73.0469 363.375 32.2969 87.0625
Length 465/31, alignment 15/ 3, found: 507.656 373.703 429.797 543.625
Length 465/31, alignment 15/ 3, fail: 512.922 373.188 437.047 556.844
Length 465/31, alignment 15/ 9, found: 443.969 379.359 392.828 464.75
Length 465/31, alignment 15/ 9, fail: 484.312 376.406 423.703 513.672
Length 465/31, alignment 15/15, found: 326.859 376 256.625 350.422
Length 465/31, alignment 15/15, fail: 512.641 376.203 439.234 545.906
Length 131071/16, alignment 0/ 0, found: 116583 108378 122125
127809
Length 131071/16, alignment 0/ 0, fail: 118298 108639 122208
128700
--
Thanks
Rajalakshmi S
#include <string.h> /* for strstr */
#include <stdlib.h> /* for malloc */
#include <unistd.h> /* for alarm */
char *my_strstr (char *s, char *n);
int
main ()
{
int result = 0;
size_t m = 1000000;
char *haystack = (char *) malloc (2 * m + 2);
char *needle = (char *) malloc (m + 2);
/* Check for quadratic performance. */
if (haystack && needle)
{
memset (haystack, 'A', 2 * m);
haystack[2 * m] = 'B';
haystack[2 * m + 1] = 0;
memset (needle, 'A', m);
needle[m] = 'B';
needle[m + 1] = 0;
if (!my_strstr (haystack, needle))
result |= 1;
}
return result;
}
char *
my_strstr (char *s, char *n)
{
if (n[0] == 0)
return s;
while (s = strchr (s, n[0]))
{
long i;
for (i = 1; n[i] && n[i] == s[i]; i++);
if (n[i] == 0)
return s;
s++;
}
return NULL;
}