This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.



[Bug string/21331] New: strcpy/strncpy suffering performance downgrade with short string copy from glibc2.15


https://sourceware.org/bugzilla/show_bug.cgi?id=21331

            Bug ID: 21331
           Summary: strcpy/strncpy suffering performance downgrade with
                    short string copy from glibc2.15
           Product: glibc
           Version: unspecified
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P2
         Component: string
          Assignee: unassigned at sourceware dot org
          Reporter: mousuanming at huawei dot com
  Target Milestone: ---

Created attachment 9954
  --> https://sourceware.org/bugzilla/attachment.cgi?id=9954&action=edit
Patch and testcase

Hi,

After upgrading glibc from 2.11 to 2.17, we observed a performance regression
of about 30% in the strcpy test of the Libmicro benchmark when copying short
strings (e.g. 10 bytes), which is a very common use of strcpy.
Both runs used the same x86_64 machine with an Intel(R) Xeon(R) CPU E5-2620 v4
@ 2.10GHz, and the two glibc builds had no performance-related configuration
differences. The CPUs were set to performance mode, and the Libmicro strcpy
test case was pinned to the same CPUs via taskset. The regression still
appeared.

Digging into the code, we found substantial changes in the x86_64 strcpy
implementation, and we narrowed down the code that causes the regression for
short strings of fewer than 32 bytes. In 2.17, strcpy runs the
STRCPY_SSE2_UNALIGNED code, as seen in the disassembly. Unlike the 2.11 code,
the 2.17 code uses 'and' and 'cmp' so that a strcpy whose source address has
its low 6 bits no greater than 0x20 jumps to 'SourceStringAlignmentLess32'.
Most calls match this condition, including the Libmicro test case. The
'SourceStringAlignmentLess32' code uses a 'movdqu' to copy 16 bytes of the
source string into xmm1, and this is what causes the regression.
We understand that, since the source address may not be 16-byte aligned, it is
OK to copy the string into xmm1 first and then search for the '\0'. However,
the glibc 2.11 strcpy code aligns first and then searches for the '\0'
directly in the memory addressed by the 'rsi' register, with no copies into
the xmm0/xmm1 registers involved. The glibc 2.11 legacy code that handles
source addresses not aligned to 16 bytes is still present, right next to the
'SourceStringAlignmentLess32' jump. The legacy code always performs better,
but it now serves only source strings whose low 6 address bits range from 0x21
to 0x3f.
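
As a rough illustration of the dispatch condition described above, here is a
minimal C sketch. It is not glibc code: the helper names are hypothetical, and
the boundary (offset 0x20 taking the 'SourceStringAlignmentLess32' path) is an
assumption inferred from the statement that the other path serves 0x21 to
0x3f.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not glibc code: keep only the low 6 bits of the
   source address, as the 'and' step described in the report would. */
static unsigned src_offset_in_64(const void *src)
{
    return (unsigned)((uintptr_t)src & 0x3f);
}

/* Assumed condition for branching to the path the report calls
   SourceStringAlignmentLess32: offset in the range 0x00..0x20.
   An offset of at most 0x20 leaves at least 32 bytes before the
   next 64-byte boundary (plain arithmetic: offset + 32 <= 64). */
static bool takes_less32_path(const void *src)
{
    return src_offset_in_64(src) <= 0x20;
}

/* Count how many of the 64 possible offsets satisfy the condition. */
static unsigned count_less32_offsets(void)
{
    static _Alignas(64) char buf[128];
    unsigned n = 0;

    for (size_t off = 0; off < 64; off++)
        n += takes_less32_path(buf + off);
    return n;
}
```

Under this assumption, just over half of all possible source offsets take the
'SourceStringAlignmentLess32' path, and buffers with typical allocator
alignment fall into that group, which fits the observation that most calls
match.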

The 'new' 'SourceStringAlignmentLess32' code also confused us, since our
analysis suggests that strcpy works without it. We created a test case
covering different alignments and lengths to verify that strcpy still works
correctly without it, and it does. The short 10-byte string copy also runs
more than 20% faster in the Libmicro test case. Since the code does not appear
to serve a purpose and short string copies are used very frequently, it may be
better to comment out this code for strcpy, and likewise for strncpy. We
checked that the latest code is still the same as in 2.17, so we created a
patch based on the latest version to make Virgo happy.
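
For readers without access to the attachment, a check of this kind could look
roughly like the sketch below. This is not the attached test case: the
function name, buffer sizes, and ranges of offsets and lengths are all
assumptions chosen for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative limits, not taken from the attached test case. */
enum { MAX_OFF = 64, MAX_DST_OFF = 16, MAX_LEN = 80, GUARD = 16 };
#define GUARD_BYTE 0x5a

/* Returns the number of (source offset, destination offset, length)
   combinations for which strcpy produced a wrong result. */
static int check_strcpy_all(void)
{
    static _Alignas(64) char src_buf[MAX_OFF + MAX_LEN + 1];
    static _Alignas(64) char dst_buf[MAX_OFF + MAX_LEN + 1 + GUARD];
    /* Call through a volatile pointer so the compiler cannot replace
       the library strcpy with an inlined builtin. */
    char *(*volatile copy)(char *, const char *) = strcpy;
    int failures = 0;

    for (size_t so = 0; so < MAX_OFF; so++) {
        for (size_t dof = 0; dof < MAX_DST_OFF; dof++) {
            for (size_t len = 0; len <= MAX_LEN; len++) {
                char *src = src_buf + so;
                char *dst = dst_buf + dof;

                /* Build a NUL-terminated source string of exactly len bytes. */
                for (size_t i = 0; i < len; i++)
                    src[i] = (char)('a' + i % 26);
                src[len] = '\0';

                /* Fill the destination with guard bytes to detect overruns. */
                memset(dst_buf, GUARD_BYTE, sizeof dst_buf);

                char *ret = copy(dst, src);

                int ok = ret == dst
                         && memcmp(dst, src, len + 1) == 0
                         && dst[len + 1] == GUARD_BYTE
                         && (dof == 0 || dst[-1] == GUARD_BYTE);
                failures += !ok;
            }
        }
    }
    return failures;
}
```

Running it once against the unpatched library and once against the patched
one, and expecting zero failures in both cases, gives the kind of evidence
described above. It only checks results, not performance.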

The Libmicro benchmark tool is available at:
https://java.net/projects/libmicro/sources

The benchmark command we use is:
taskset -c x-y ./bin-x86_64/strcpy -E -C 200 -L -S -W -N strcpy_10 -s 10 -I 5
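
For a quick sanity check without Libmicro, a timing loop along the following
lines can be used. It is only a sketch and not a substitute for the benchmark
above: the function name and parameters are hypothetical, and clock() measures
CPU time at coarse resolution, so the absolute numbers will not match
Libmicro's output.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: average CPU time in nanoseconds per strcpy call
   for a string of len bytes (len is clamped to fit the local buffers). */
static double ns_per_strcpy(size_t len, long iterations)
{
    char src[64], dst[64];
    /* Volatile function pointer forces a real call to the library strcpy
       on every iteration instead of an inlined or hoisted copy. */
    char *(*volatile copy)(char *, const char *) = strcpy;

    if (iterations <= 0)
        return 0.0;
    if (len >= sizeof src)
        len = sizeof src - 1;

    memset(src, 'x', len);
    src[len] = '\0';

    clock_t start = clock();
    for (long i = 0; i < iterations; i++)
        copy(dst, src);
    clock_t end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC * 1e9 / (double)iterations;
}
```

Calling ns_per_strcpy(10, 100000000L) while pinned with taskset, once per
glibc build, gives a rough comparison for the 10-byte case.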

The patch and the test code for verifying it are attached.

Please let us know if anything is incorrect.
Thank you.

BR,
Mou

