This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] powerpc: strcasestr optimization
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Rajalakshmi Srinivasaraghavan <raji at linux dot vnet dot ibm dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>, Steve Munroe <sjmunroe at us dot ibm dot com>
- Date: Thu, 18 Jun 2015 14:58:38 +0200
- Subject: Re: [PATCH] powerpc: strcasestr optimization
- Authentication-results: sourceware.org; auth=none
- References: <1433270823 dot 14958 dot 15 dot camel at sjmunroe-ThinkPad-W500> <20150602210000 dot GB20004 at domone> <1433882432 dot 21101 dot 71 dot camel at sjmunroe-ThinkPad-W500> <20150609210555 dot 3A7CA2C3BDC at topped-with-meat dot com> <1433885184 dot 21101 dot 83 dot camel at sjmunroe-ThinkPad-W500> <20150609214202 dot 5D0612C3BE3 at topped-with-meat dot com> <20150610104842 dot GA4871 at domone> <55788320 dot 5090405 at linux dot vnet dot ibm dot com> <20150616072122 dot GA15725 at domone> <55829DA3 dot 1010108 at linux dot vnet dot ibm dot com>
On Thu, Jun 18, 2015 at 03:59:55PM +0530, Rajalakshmi Srinivasaraghavan wrote:
> >>The number 2048 was chosen by checking inputs of various lengths and
> >>comparing performance. So the 2047 case won't hit worst-case
> >>performance.
> >>
> >What are you talking about? I am saying that an aaaa... haystack with an
> >aaa...ab needle of size 2047 is the worst case for your algorithm's
> >performance.
> >
> I ran the attached test.c in a loop 10000 times.
> proposed patch took 19s and existing code took 20s on power machine.
>
How could you have run it? There are two problems with it:
1. It doesn't compile; it's missing the final } to end main.
2. It doesn't measure strstr at all. It's a pure function, so gcc
optimized the call away.
When I fixed these two issues, just 100 iterations took 1.002s with my
x64 vector implementation and 44.538s with the generic implementation,
so your test couldn't have taken just 20s for the existing code.
> The reason I mentioned them as different is that in the strstr() patch
> the input is loaded as a doubleword, i.e. 8 bytes at a time, while in
> strcasestr it is loaded byte by byte. The input is handled in a
> different way.
>
I see now that you wrote basically
  long i = 0;
  int n0 = tolower (n[0]);
  while (s[i])
    {
      if (tolower (s[i]) == n0)
        {
          int j = 1;
          while (n[j] && tolower (n[j]) == tolower (s[i + j]))
            j++;
          if (n[j] == '\0')
            return (char *) s + i;
        }
      i++;
    }
  return NULL;
For some reason I skimmed that patch and thought that you did the obvious
optimization once you have a vector strstr: use a vectorized tolower to
convert 8 bytes at once. An example of that conversion is in
sysdeps/x86_64/multiarch/strcmp-sse42.S
> >
> >Second is about average-case performance. The worst case there is
> >irrelevant, as you use buy-or-rent techniques to handle it. I submitted
> >the following patch with an easy way to handle that.
> >
> >[PATCH] Improve generic strstr performance.
> >
> I applied your proposed patches and ran the benchtests, and I could see
> the proposed patch (__strstr_power7) is better than the existing code
> (__strstr_ppc).
> 13363 out of 13442 combinations show improvement.
> Attached benchtest results.
>
I said to use that patch. The next line of my previous mail explains that.
> >it's currently superseded by the generic vector bigraph search, but it
> >should beat your strstr.
So you should also do a comparison against that.
Your benchmark is invalid for a simple reason: I guard the vector patch
with:
if (__BYTE_ORDER == __LITTLE_ENDIAN)
{
So as powerpc is big endian, my patch does nothing at all and you are
just measuring the generic strstr again. If you want to test the bigraph
search, do that on powerpc64le.
> >What overall performance gain? Did you try to find an application and
> >measure the performance difference before and after? Big sizes are not
> >very relevant in practice. You look beyond the 32nd byte of the
> >haystack in only 1.5% of calls, so a 10% regression at those sizes is
> >much worse than a 30% gain in the 256-byte range. In my profiler a
> >simple collection shows the following
> No, I didn't check any specific application; I used the glibc
> benchtests for performance measurement, which is considered the
> standard way of measuring performance before proposing a patch.
Then you have the problem that you don't know which data are important
and which are not.
Proof by numbers is quite suspect. The input ranges are completely
arbitrary. So I could decide that I want to check, for sizes 1..64, all
possible 64-byte needle and haystack alignments. That generates 262144
test cases, which should be enough.