Re: [PATCH] more problems with newlib/libc/machine/m68k/memcpy.S


On Wed, Feb 10, 2010 at 05:47:04PM +0300, Maxim Kuvyrkov wrote:
> On 2/9/10 5:15 PM, Josef Wolf wrote:
>> Hello folks,
>>
>> I have done a closer look at newlib/libc/machine/m68k/memcpy.S and I think
>> there are a couple of problems with this code:
>>
>> 1. When the destination pointer is mis-aligned, up to three bytes are copied
>>     before getting into the optimized main loop. But d1, which contains the
>>     number of bytes to copy, is never adjusted to account for this fact. As a
>>     consequence, up to 3 byte too much can be copied by memcpy().
>
> When destination is mis-aligned, D1 is properly adjusted here:
>
> 	and.l	#3,d0		| look for the lower two only
> 	beq	2f		| is aligned?
> 	sub.l	d0,d1
>
> Or am I missing something?

Ah, I had overlooked this, sorry.
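
For reference, the quoted prologue computes, in C terms, roughly the
following (just an illustrative sketch of mine, with names mirroring
the registers):

#include <stdint.h>

/* The alignment prologue copies (-dest) & 3 bytes, and the same amount
 * is subtracted from the remaining count held in d1. */
static unsigned long adjust_count(uintptr_t dest, unsigned long d1)
{
        unsigned long d0 = (unsigned long)((-dest) & 3); /* neg.l d0; and.l #3,d0 */

        if (d0 != 0)            /* beq 2f: dest already long-aligned */
                d1 -= d0;       /* sub.l d0,d1 */
        return d1;              /* bytes left for the long-word loop */
}

So the byte count really is adjusted before the long-word loop is entered.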

>> 2. When MISALIGNED_OK==0, no attempt is made to get the pointers to aligned
>>     positions. This results in much slower code than the old code for such
>>     CPUs. IMHO, the cause of the problem reported in the thread starting with
>>     http://sourceware.org/ml/newlib/2009/msg01079.html was mis-interpreted.
>>     The problem was _not_ the attempt to align dest. The problem was that
>>     even after dest was aligned, src was still not aligned. Thus, disabling
>>     alignment completely is the wrong response to that problem.
>
> The problem was not misinterpreted; and it is true that memcpy can be 
> optimized for CPUs without misaligned-access support by using 2-byte copies.

Uh, it _was_ optimized before. But with your patch, even the case
src==xxx2, dst==yyy2 falls back to the bytewise loop, although long-word
access to even addresses is perfectly legal on all CPUs (AFAIK). Even if it
were not legal, the old code aligns the pointers to src==xxx4, dst==yyy4,
which is the fastest possible method and always legal on _all_ CPUs. My
patch brings this behavior back.

With my patch, dest is always aligned to a 4-byte address (that was already
the case in the code before your patch), and the optimization is only
omitted when src is still mis-aligned afterwards (which the old code did not
check and which your patch was supposed to introduce).
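
In C-like terms, the strategy my patch implements looks roughly like this
(only a sketch to illustrate the idea, not the actual assembly; the names
are made up):

#include <stddef.h>
#include <stdint.h>

/* Long-align the write pointer first, then use the fast long-word loop
 * only if the read pointer ended up long-aligned as well; otherwise fall
 * back to the bytewise loop. */
static void *memcpy_sketch(void *dest, const void *source, size_t len)
{
        uint8_t *d = dest;
        const uint8_t *s = source;

        /* align dest to 4 bytes, adjusting the remaining count */
        while (((uintptr_t)d & 3) && len) {
                *d++ = *s++;
                len--;
        }

        if (((uintptr_t)s & 3) == 0) {
                /* both pointers are long-aligned at this point */
                while (len >= 4) {
                        *(uint32_t *)d = *(const uint32_t *)s;
                        d += 4;
                        s += 4;
                        len -= 4;
                }
        }

        /* residue, or the whole copy if src stayed mis-aligned */
        while (len--)
                *d++ = *s++;

        return dest;
}

Note that for src==xxx2, dst==yyy2 the prologue copies two bytes and
thereby aligns src as well, so the fast path is still taken; only truly
incompatible pointer pairs (e.g. src==xxx1, dst==yyy2) end up in the
bytewise loop.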

BTW: the current code has one more issue: the .Lresidue loop can't handle
sizes of more than 64 kbytes, because dbra only decrements the low 16 bits
of its counter. This was not a problem in earlier releases, since the old
code never executed the .Lresidue loop for more than 8 bytes. But your patch
sends even cases like src==xxx2, dst==yyy2 to .Lresidue when
MISALIGNED_OK==0, so _now_ this _is_ a problem.

Either way, the current code is both inefficient and broken for any CPU
with MISALIGNED_OK==0.
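
To make the 64 kbyte limit concrete: dbra decrements only the low 16 bits
of its counter and falls through once that word wraps to -1. Here is a C
model of the *patched* .Lresidue loop (an illustrative sketch only, not
newlib code); drop the sub/bpl pair and it is clear that at most 64 kbytes
can be copied:

#include <stdint.h>

static void residue_copy(uint8_t *dst, const uint8_t *src, uint32_t len)
{
        uint32_t d1 = len;                      /* the counter register */

        for (;;) {
                /* dbra d1,1b: decrement the low word of d1 and keep
                 * looping while it has not wrapped to -1 */
                uint16_t low = (uint16_t)(d1 - 1);
                d1 = (d1 & 0xFFFF0000u) | low;
                if (low != 0xFFFF) {
                        *dst++ = *src++;        /* 1: move.b (a1)+,(a0)+ */
                        continue;
                }
                d1 -= 0x10000;                  /* sub.l #0x10000,d1 */
                if (d1 & 0x80000000u)           /* bpl 1b: stop when negative */
                        break;
                *dst++ = *src++;                /* branch back to label 1 */
        }
}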

> It is 
> not straightforward to implement a single memcpy routine that will be 
> optimal for both CPUs that support unaligned accesses and those which don't 

Well, we can try anyway ;-)

> It seems that your patch lengthens the critical path for CPUs which support 
> unaligned accesses.

Can you give a more specific example of which path is lengthened, and for
which CPU?

Here is a second version of the patch. This version fixes the .Lresidue
length bug and further improves performance for CPUs that can run a DBcc
loop faster than the unrolled loop (such as the CPU32, but probably other
CPUs, too).

diff -ruw newlib-1.18.0.orig/newlib/libc/machine/m68k/memcpy.S newlib-1.18.0/newlib/libc/machine/m68k/memcpy.S
--- newlib-1.18.0.orig/newlib/libc/machine/m68k/memcpy.S	2009-12-14 21:50:53.000000000 +0100
+++ newlib-1.18.0/newlib/libc/machine/m68k/memcpy.S	2010-02-10 15:49:04.798570648 +0100
@@ -15,10 +15,21 @@
 
 #include "m68kasm.h"
 
-#if defined (__mcoldfire__) || defined (__mcpu32__) || defined (__mc68010__) || defined (__mc68020__) || defined (__mc68030__) || defined (__mc68040__) || defined (__mc68060__)
-# define MISALIGNED_OK 1
+#define WORD_ALIGNMENT 1
+#define LONG_ALIGNMENT 3
+
+#if defined (__mcpu32__)
+# define UNROLL_LOOP 0
+#else
+# define UNROLL_LOOP 1
+#endif
+
+#if defined (__mcpu32__)
+# define ALIGN_BITS WORD_ALIGNMENT
+#elif defined (__mcoldfire__) || defined (__mc68010__) || defined (__mc68020__) || defined (__mc68030__) || defined (__mc68040__) || defined (__mc68060__)
+ /* don't need alignment */
 #else
-# define MISALIGNED_OK 0
+# define ALIGN_BITS LONG_ALIGNMENT
 #endif
 	
 	.text
@@ -35,6 +46,9 @@
  *       - make sure the destination pointer (the write pointer) is long word
  *         aligned. This is the best you can do, because writing to unaligned
  *         addresses can be the most costfull thing you could do.
+ *       - after destination pointer is aligned, we still have to check the
+ *         source pointer. If it is not aligned, we have to fall back to
+ *         bytewise copy.
  *       - Once you have figured that out, we do a little loop unrolling
  *         to further improve speed.
  */
@@ -43,37 +57,43 @@
 	move.l	4(sp),a0	| dest ptr
 	move.l	8(sp),a1	| src ptr
 	move.l	12(sp),d1	| len
+
 	cmp.l	#8,d1		| if fewer than 8 bytes to transfer,
 	blo	.Lresidue	| do not optimise
 
-#if !MISALIGNED_OK
-	/* Goto .Lresidue if either dest or src is not 4-byte aligned */
-	move.l	a0,d0
-	and.l	#3,d0
-	bne	.Lresidue
-	move.l	a1,d0
-	and.l	#3,d0
-	bne	.Lresidue
-#else /* MISALIGNED_OK */
-	/* align dest */
-	move.l	a0,d0		| copy of dest
-	neg.l	d0
-	and.l	#3,d0		| look for the lower two only
-	beq	2f		| is aligned?
-	sub.l	d0,d1
-	lsr.l	#1,d0		| word align needed?
-	bcc	1f
+	move.l	a0,d0		| copy dest
+
+        /* make sure dest is word aligned */
+	btst	#0,d0		| need word align?
+	beq	1f		| no
+
 	move.b	(a1)+,(a0)+
+	subq.l	#1,d1
+	move.l	a0,d0		| dest has changed, copy dest again
+
 1:
-	lsr.l	#1,d0		| long align needed?
-	bcc	2f
-	move.w	(a1)+,(a0)+
+        /* make sure dest is long aligned */
+	btst	#1,d0
+	beq	2f		| OK
+
+        move.b	(a1)+,(a0)+
+        move.b	(a1)+,(a0)+
+	subq.l	#2,d1
 2:
-#endif /* !MISALIGNED_OK */
+
+#if defined (ALIGN_BITS)
+	/* dest is aligned now. But src might still be mis-aligned, so check
+           for that */
+	move.l	a1,d0
+	and.l	#ALIGN_BITS,d0
+	bne	.Lresidue
+#endif
 
 	/* long word transfers */
 	move.l	d1,d0
 	and.l	#3,d1		| byte residue
+
+#if UNROLL_LOOP
 	lsr.l	#3,d0
 	bcc	1f		| carry set for 4-byte residue
 	move.l	(a1)+,(a0)+
@@ -89,6 +109,15 @@
 	move.l	(a1)+,(a0)+
 	move.l	(a1)+,(a0)+
 .Lcopy:
+
+#else
+
+	lsr.l	#2,d0
+        subq.l	#1,d0
+1:
+        move.l	(a1)+,(a0)+
+
+#endif
 #if !defined (__mcoldfire__)
 	dbra	d0,1b
 	sub.l	#0x10000,d0
@@ -96,6 +125,7 @@
 	subq.l	#1,d0
 #endif
 	bpl	1b
+
 	bra	.Lresidue
 
 1:
@@ -104,9 +134,11 @@
 .Lresidue:
 #if !defined (__mcoldfire__)
 	dbra	d1,1b		| loop until done
+	sub.l	#0x10000,d1
 #else
 	subq.l	#1,d1
-	bpl	1b
 #endif
+	bpl	1b
+
 	move.l	4(sp),d0	| return value
 	rts

