This is the mail archive of the libc-ports@sources.redhat.com mailing list for the libc-ports project.

Re: [PATCH] Optimize MIPS memcpy


On 9/4/2012 11:09 AM, Steve Ellcey wrote:
> On Mon, 2012-09-03 at 02:12 -0700, Andrew T Pinski wrote:
>> Forgot to CC libc-ports@.
>> On Sat, 2012-09-01 at 18:15 +1200, Maxim Kuvyrkov wrote:
>>> This patch improves the MIPS assembly implementations of memcpy.  Two optimizations are added:
>>> prefetching of data for subsequent iterations of the memcpy loop, and pipelined expansion of the
>>> unaligned memcpy.  These optimizations speed up MIPS memcpy by about 10%.
>>>
>>> The prefetching part is straightforward: it adds prefetching of a cache line (32 bytes) for the +1
>>> iteration in the unaligned case and the +2 iteration in the aligned case.  The rationale is that the
>>> prefetch will take about the same time to complete as 1 iteration of the unaligned loop or 2
>>> iterations of the aligned loop.  Values for these parameters were tuned on a modern MIPS processor.
>>>
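
To make the prefetch-ahead idea concrete, here is a rough C sketch; the real code is MIPS
assembly using the `pref' instruction, and the 32-byte line size and one-line-ahead distance
below are just the values described above, not anything taken from the patch itself:

#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32                      /* assumed line size, per the description above */
#define WORDS_PER_LINE (CACHE_LINE / sizeof (uint32_t))

/* Illustrative word-by-word aligned copy that prefetches one cache line
   ahead of the line currently being copied.  */
static void
copy_words_prefetch (uint32_t *dst, const uint32_t *src, size_t nwords)
{
  size_t i = 0;

  for (; i + WORDS_PER_LINE <= nwords; i += WORDS_PER_LINE)
    {
      /* Start fetching the next line while this one is copied.  */
      __builtin_prefetch (src + i + WORDS_PER_LINE, 0 /* for read */);
      for (size_t j = 0; j < WORDS_PER_LINE; j++)
        dst[i + j] = src[i + j];
    }

  for (; i < nwords; i++)                  /* copy the tail */
    dst[i] = src[i];
}

Prefetching two lines ahead, as described for the aligned case, would just mean adding two
times WORDS_PER_LINE to the prefetch address instead of one.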
>>
>> This might hurt Octeon, as the cache line size there is 128 bytes.  Can
>> you say which modern MIPS processor this has been tuned on?  And is
>> there a way to avoid hard-coding 32 in the assembly and use a macro
>> instead?
>>
>> Thanks,
>> Andrew Pinski
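
On the macro question, something along these lines would let a port override the prefetch
granularity instead of baking 32 into the .S file; the macro names and the 128-byte Octeon
value are purely illustrative, not taken from any existing header:

/* Hypothetical sketch: a per-core sysdep header could define PREFETCH_CHUNK
   (e.g. to 128 for Octeon) before the memcpy source is assembled.  */
#ifndef PREFETCH_CHUNK
# define PREFETCH_CHUNK 32               /* default: 32-byte cache line */
#endif

/* Prefetch distance derived from the chunk size rather than hard-coded:
   one chunk ahead for the unaligned loop, two for the aligned loop.  */
#define PREFETCH_AHEAD_UNALIGNED  (1 * PREFETCH_CHUNK)
#define PREFETCH_AHEAD_ALIGNED    (2 * PREFETCH_CHUNK)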
> 
> I've been looking at the MIPS memcpy and was planning on submitting a
> new version based on the one that MIPS submitted to Android.  It has
> prefetching like Maxim's, though I found that using the load and 'prepare
> for store' hints instead of the 'load streaming' and 'store streaming'
> hints gave me better results on the 74K and 24K cores that I did
> performance testing on.
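
For reference, the hints being compared correspond roughly to the following `pref' hint
encodings; the values are from my reading of the MIPS32 architecture manual, so verify them
against your core's documentation, and the PREFETCH macro is just an illustration of how a
hint would be issued from C rather than anything in the attached assembly:

/* Hint values under discussion, as encoded in the MIPS `pref' instruction.  */
#define PREF_HINT_LOAD              0
#define PREF_HINT_STORE             1
#define PREF_HINT_LOAD_STREAMED     4
#define PREF_HINT_STORE_STREAMED    5
#define PREF_HINT_PREPAREFORSTORE   30  /* allocate the line without fetching its old contents */

/* Illustrative way to issue a prefetch with a given hint from C;
   the actual memcpy issues `pref' directly in assembly.  */
#define PREFETCH(hint, addr) \
  __asm__ volatile ("pref %0, 0(%1)" : : "i" (hint), "r" (addr) : "memory")

The appeal of 'prepare for store' for the destination buffer is presumably that the line is
about to be completely overwritten, so there is no need to read its old contents from memory
first.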
> 
> This version has more unrolling too, and between that and the difference
> in hints I got a small performance improvement over Maxim's version when
> doing small memcpys and a fairly substantial improvement on large
> memcpys.
> 
> I also merged the 32-bit and 64-bit versions together so we would only
> have one copy to maintain.  I haven't tried building it as part of glibc
> yet; I have been testing it standalone first and was going to try to
> integrate it into glibc and submit it this week or next.  I'll attach it
> to this email so folks can look at it, and I will see if I can
> parameterize the cache line size.  This one also assumes a 32-byte cache
> line for its prefetches.
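
As a rough picture of how a single source can cover both word sizes, the idea is to select
the copy width once and write the rest of the code in terms of macros; the names below are
made up for this sketch, and the real merged file would wrap the actual ld/sd versus lw/sw
instructions in assembler macros rather than C defines:

/* Hypothetical sketch only.  __mips64 is predefined by GCC for 64-bit
   MIPS targets.  */
#ifdef __mips64
# define COPY_UNIT   8          /* bytes moved per load/store */
# define C_LD        "ld"       /* doubleword load  */
# define C_ST        "sd"       /* doubleword store */
#else
# define COPY_UNIT   4
# define C_LD        "lw"       /* word load  */
# define C_ST        "sw"       /* word store */
#endif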

Exactly what benchmarks did you run to verify the performance gains?

The one thing I'd like to continue seeing is a strong rationale for
performance patches, so that we have reproducible data in the event that
someone else comes along and wants to make a change.

For example see:
http://sourceware.org/glibc/wiki/benchmarking/results_2_17

and:
http://sourceware.org/glibc/wiki/benchmarking/benchmarks
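
Even a trivial standalone harness along the following lines can give a first approximation
while the proper benchtests are being set up; it is purely illustrative (the sizes and
iteration counts are made up), and it is not the glibc benchtests harness:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Trivial, illustrative memcpy timing loop: copy `size' bytes `iters'
   times and report MB/s.  */
static double
bench_memcpy (size_t size, int iters)
{
  char *src = malloc (size);
  char *dst = malloc (size);
  struct timespec t0, t1;

  if (src == NULL || dst == NULL)
    exit (1);
  memset (src, 0x5a, size);

  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (int i = 0; i < iters; i++)
    {
      memcpy (dst, src, size);
      __asm__ volatile ("" : : "r" (dst) : "memory");  /* keep the copy from being optimized away */
    }
  clock_gettime (CLOCK_MONOTONIC, &t1);

  double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
  free (src);
  free (dst);
  return (double) size * iters / secs / 1e6;
}

int
main (void)
{
  for (size_t size = 32; size <= (1 << 20); size <<= 2)
    printf ("%8zu bytes: %8.1f MB/s\n", size, bench_memcpy (size, 10000));
  return 0;
}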

Cheers,
Carlos.
-- 
Carlos O'Donell
Mentor Graphics / CodeSourcery
carlos_odonell@mentor.com
carlos@codesourcery.com
+1 (613) 963 1026

