This is the mail archive of the
libc-ports@sources.redhat.com
mailing list for the libc-ports project.
Re: [PATCH] Optimize MIPS memcpy
- From: Maxim Kuvyrkov <maxim at codesourcery dot com>
- To: Steve Ellcey <sellcey at mips dot com>
- Cc: Andrew T Pinski <pinskia at gmail dot com>, "Joseph S. Myers" <joseph at codesourcery dot com>, <libc-ports at sourceware dot org>
- Date: Tue, 9 Oct 2012 11:30:47 +1300
- Subject: Re: [PATCH] Optimize MIPS memcpy
- References: <5044746c.23eb440a.75e2.618f@mx.google.com> <1346771341.14333.20.camel@ubuntu-sellcey> <596797ED-6575-456D-98FD-C13A209DBC49@mentor.com> <1346948701.14333.152.camel@ubuntu-sellcey> <F265181B-6B5C-401B-B7CC-DC206B601795@codesourcery.com> <1347376645.14333.319.camel@ubuntu-sellcey> <F0856A02-839F-45C7-BF93-AA59E030A015@codesourcery.com> <1348166309.6170.55.camel@ubuntu-sellcey> <25105334-8813-4532-AC0E-B3A44BE69A19@codesourcery.com> <5B30D440-A918-4352-8DED-A7D681DF0338@codesourcery.com> <1349715796.30194.131.camel@ubuntu-sellcey>
On 9/10/2012, at 6:03 AM, Steve Ellcey wrote:
> On Sat, 2012-10-06 at 17:43 +1300, Maxim Kuvyrkov wrote:
>
>> Steve and I have debugged these failures and they now seem to be resolved. I'll let Steve to followup with analysis and a new patch.
>>
>> Meanwhile, I've benchmarked Steve's patch against mine. On the benchmark that I use both implementations provide equal performance for N64 ABI, but on N32 ABI Steve's patch is only half as fast. This is, probably, due to using 4-byte operations instead of 8-byte operations for N32 ABI:
>>
>> #if _MIPS_SIM == _ABI64
>> #define USE_DOUBLE
>> #endif
>>
>> It should be easy to improve Steve's patch for N32 ABI. Steve, will you look into that?
>>
>> I would also appreciate if you look into making your version of memcpy memmove-safe, if it is not already.
>>
>> Thank you,
>>
>> --
>> Maxim Kuvyrkov
>> CodeSourcery / Mentor Graphics
>
> Maxim, do you know if your test is doing a memcpy on overlapping memory?
> While our analysis showed that the problem was due to the use of the
> 'prepare to store' prefetch hint, the code I sent earlier should have
> worked fine for any code that was not doing an overlapping memcpy.
The test does not use overlapping memcpy.
>
> For anyone who may be interested, the 'prepare for store' prefetch hint
> is different then other 'safe' prefetches which can be executed or
> ignored without affecting the results of the code being executed.
>
> Instead of bringing a chunk of memory into the cache, it simply
> allocates a line of cache for use and zeros it out. If you write to
> every byte of that line of cache, you are OK. But if you use the
> 'prepare to store' cache hint and do not write to the entire cache line
> then the bytes you don't write to get written back to memory as zeros,
> overwriting whatever was there before. The code in my memcpy routine
> accounts for this, by checking the length of the buffer before doing the
> 'prepare to store' prefetches and only using them when it knows that it
> is going to write to the entire cache line.
Can there be a bug in logic that decides that a prepare-for-store prefetch is safe?
I've checked documentation for XLP (which is the target I'm using for testing) and it specifies 32-byte prefetch.
>
> The other issue though is if the source and destination of the memcpy
> overlap and if you use the prepare to store prefetch on a memory address
> that is also part of the source of the memcpy you will get incorrect
> results. That means that if we want to have memcpy be 'memmove-safe'
> we cannot use the 'prepare to store' hint.
I don't think this is a concern. Memmove will use memcpy only if the memory locations don't overlap. And for the record's sake, I'm testing without the memcpy-in-memmove patch.
>
> I will fix the code to use double loads and stores with the N32 ABI
> and add comments about the 'prepare to store' hint. I hate to give up
> on using the 'prepare for store' prefetch hint, since it does result in
> the best peformance, but given the various issues maybe it is not the
> best idea for glibc.
I too want to keep prepare-for-store prefetches is possible. For debugging purposes you could amend prepare-for-store prefetch macros to trigger a loop that would unconditionally clobber memory locations that prepare-for-store is expected to zero-out. Or add some other assertions to help out with debugging.
Thanks,
--
Maxim Kuvyrkov
CodeSourcery / Mentor Graphics