This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] powerpc: strcasestr optimization




On 06/18/2015 06:28 PM, OndÅej BÃlka wrote:
On Thu, Jun 18, 2015 at 03:59:55PM +0530, Rajalakshmi Srinivasaraghavan wrote:
2048 number is chosen by checking various length inputs and
comparing the performance. So for 2047 case, there wont be worst
performance.

What are you talking about? I am saying that aaaa haystack with aaa...ab
  needle of size 2047 is worst case for your algorithm performance.

I ran the attached test.c in a loop 10000 times.
proposed patch took 19s and existing code took 20s on power machine.

How could you ran it? You have two problems with it
1. It doesn't compile, its missing final } to end main
2. It doesn't measure strstr at all. Its pure function so gcc optimized
that away.

As when I fixed these two issues just 100 iterations take 1.002s with my
x64 vector implementation and 44.538s on generic implemenation your test
couldn't take just 20s for existing code.


The reason I mentioned them as different is, in strstr() patch the
input is loaded as double word ie. 8 bytes at a time by byte and in
strcasestr its byte by byte load.The input is handled in different
way.


I see now that you wrote basically

long i=0;
n0 = tolower(n[0])
while (s[i])
   {
     int j=1;
     if (tolower(s[i])==n0)
       while (n[j] && tolower(n[j]) == tolower(s[i+j]))
         j++;
     if (n[j]=='\0')
       return s+i;
     i++;
   }
return NULL;

For some reason I skimmed that patch and though that you did obvious
optimization once you have vector strstr. You use vectorized tolower to
convert 8 bytes at once. Example of conversion is in
  sysdeps/x86_64/multiarch/strcmp-sse42.S

For strcasestr() optimization I am trying to use tolower to convert 8 bytes and will send the modified patch.
For strstr() I have sent the new version
https://sourceware.org/ml/libc-alpha/2015-06/msg00788.html

Second is about average case perforance. There worst case is irrelevant
as you use buy-or-rent techniques to handle it. I sumbitted following
patch with easy way to handle that.

[PATCH] Improve generic strstr performance.

I applied your proposed patches and the benchtests and I could see
proposed patch(__strstr_power7) is better than existing code
(__strstr_ppc).
13363 out of 13442 combinations shows improvement.
Attached benchtest results.

I said use that patch. Next line of my previous mail explains that.

its currently superseeded by generic vector bigraph search but it should
beat your strstr.

So you should do also comparison with that.


Your benchmark is invalid for simple reason, I guard vector patch with:

if (__BYTE_ORDER == __LITTLE_ENDIAN)
   {

Modified new version works better for both LE and BE. Results attached in https://sourceware.org/ml/libc-alpha/2015-06/msg00788.html

So as powerpc is big endian my patch does nothing at all and you are
just measuring generic strstr again. If you want test bigraph search to
that on power64le.

Used your proposed code on powerpc64le and observed that quadratic worst case still takes more time with your proposed code. It took more than 2 minutes when the needle size is 100000 and haystack is of size 1000000.

What overall performance gain? Did you try to find application and
measure performance difference before and after? Big sizes are not very
relevant in practice. You look behind 32'th byte of haystack only in
1.5% of calls so 10% regression at these sizes is much worse that 30%
gain on 256 byte range. In my profiler a simple collection shows
following
No I don't check on any specific application and I use the glibc
benchtests for performance measurement which is considered to be the
standard way of performance measurement before proposing a patch.

Then you have problem that you don't know which data are important and
which are not.

Proof by numbers is quite suspect. Input ranges are completely
arbitrary. So I could decide that I want to check for sizes 1..64 all
possible 64byte needle and haystack alignments. That generates 262144
test cases which should be enough.



--
Thanks
Rajalakshmi S


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]