This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH 4/*] Generic string memchr and strnlen


On 07/27/2015 07:22 PM, OndÅej BÃlka wrote:
On Mon, Jul 27, 2015 at 02:42:47PM -0400, Chris Metcalf wrote:
On 07/27/2015 12:56 PM, OndÅej BÃlka wrote:
Best example is strrchr where almost all architectures got it wrong
first time until I tell them otherwise. Problem is that they calculate
position of byte at each iteration instead just at end which is
ineffective. Now tile and powerpc have this problem. My proposed
alternative

return memrchr (s, c, strlen (s) + 1);

would beat these on suitable inputs.
I didn't say that a generic is optimal only that it avoids problem with
tile implementation on some inputs.

A problem is that your analysis is incomplete. It assumes that character
occurs rarely which doesn't have to be case. A forward scan could be
very expensive if character has probability 1/16 which causes branch
misprediction each second iteration.

True that more frequent occurrence leads to a high likelihood of
the string being in the final 1/3rd of the string, which as I said
earlier does mean strlen + memrchr is faster.

Note that tilegx does not do branch prediction dynamically;
the compiler just provides static branch prediction.

A problem that I described in english is that this isn't inner loop as
you jump from it each second iteration when c probability is 1/16.

True, and you're right that by restructuring the loop I can
get it to run with conditional moves and only a single loop exit:

string/../sysdeps/tile/tilegx/strrchr.c:58
  38:	{ cmovnez r0, r11, r10 ; addi r10, r10, 8 }
  40:	{ cmovnez r13, r11, r11 ; ld r12, r10 }
string/../sysdeps/tile/tilegx/strrchr.c:46
  48:	{ v1cmpeqi r14, r12, 0 ; v1cmpeq r11, r12, r15 }
string/../sysdeps/tile/tilegx/strrchr.c:48
  50:	{ beqzt r14, 38 <strrchr+0x38> }


With the branch predicted true we will definitely run at
0.625 cycles/byte all the way to the end of the string, so that's
probably a good micro-optimization.

For example if application uses strrchr to find last / in path name then
its likely that it would be among last 8 characters and occurs
frequently previously.

I think this use case is pretty interesting.  As Wilco says, if we can
update the benchtests to show more such than it is probably worth
machine maintainers' time to improve their algorithms.  Or switch
to a framework where we can do word-at-a-time loads with the
machine maintainers just providing the primitives for SIMD ops
and the like (like Linux's <asm/word-at-a-time.h>).

This mainly depends on size of string, if you assume that each character
has fixed probability it happens for sufficiently large strings.

A tricky part of writing these is that inputs are short so you need to
write separate header to get better performance.

What do you mean by "separate header" here?

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]