This is the mail archive of the mailing list for the binutils project.
Re: [PATCH] enable fdpic targets/emulations for sh*-*-linux*
- From: Oleg Endo <oleg dot endo at t-online dot de>
- To: Rich Felker <dalias at libc dot org>
- Cc: binutils at sourceware dot org
- Date: Sat, 03 Oct 2015 18:04:19 +0900
- Subject: Re: [PATCH] enable fdpic targets/emulations for sh*-*-linux*
- Authentication-results: sourceware.org; auth=none
- References: <20150929235801 dot GA8408 at brightrain dot aerifal dot cx> <1443612038 dot 2509 dot 140 dot camel at t-online dot de> <20150930142533 dot GC8645 at brightrain dot aerifal dot cx> <20150930143555 dot GD8645 at brightrain dot aerifal dot cx> <1443627005 dot 2509 dot 189 dot camel at t-online dot de> <20150930183810 dot GE8645 at brightrain dot aerifal dot cx> <1443715139 dot 2031 dot 134 dot camel at t-online dot de> <20151001164630 dot GI8645 at brightrain dot aerifal dot cx> <1443804962 dot 2031 dot 290 dot camel at t-online dot de> <20151002175223 dot GU8645 at brightrain dot aerifal dot cx>
On Fri, 2015-10-02 at 13:52 -0400, Rich Felker wrote:
> > We get around 5 cycles on SH4 (the SH4A LLCS version is 4 cycles). So
> > it's not much slower than a non-atomic read-modify-write sequence.
> > If you pipe it through function calls:
> > 	mov.l	.L3,r0
> > 	mov	#5,r6
> > 	jmp	@r0
> > 	mov	#1,r5
> > .L4:
> > 	.align 2
> > .L3:
> > 	.long	___atomic_fetch_add_4
> > That's about 4 cycles just for the function call alone. Without any reg
> > saves/restores. Plus increased probability of a icache miss yada yada.
> > At best, this is twice as slow as inlined.
> That's not PIC-compatible, and it also requires additional branch
> logic in the called function. So I think it's a lot worse in practice
> if you do it that way.
It was just an example of a minimal function call to demonstrate that
the smallest possible overhead of atomics-via-calls is 2x.
> I would aim to expose a function pointer for
> the runtime-selected version and inline loading that function pointer.
Sure, that can be done, too. Actually, you can have the function
pointer table in the TLS, which makes it reachable via GBR:
mov.l @(disp, gbr), r0
Because the run-time selection has to be done only once during loading,
it'd always point to the right function / set of functions.
There are around 15 * 3 = 45 different __atomic / __sync functions. If
having a 45*4 = 180 byte atomics function table in the TLS is not
acceptable, it could be just one pointer to the selected set of
functions. The function itself would then have to be selected by adding
a known constant offset before the jump/call.
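A minimal C sketch of that table dispatch (all names here are made up for illustration; on SH the table would sit in the TLS block so that a single mov.l @(disp,gbr),r0 reaches the selected entry):

```c
#include <stdint.h>

typedef uint32_t (*fetch_add_4_fn)(uint32_t *, uint32_t);

/* One possible implementation, standing in for whichever variant
 * (gUSA, LLCS, IRQ-off, ...) the loader would pick on real hardware. */
static uint32_t fetch_add_4_plain(uint32_t *p, uint32_t v)
{
    uint32_t old = *p;   /* imagine this guarded by gUSA/IRQ masking */
    *p = old + v;
    return old;
}

/* Filled in once during loading, after CPU detection. */
static fetch_add_4_fn atomic_table[] = { fetch_add_4_plain };

enum { IDX_FETCH_ADD_4 = 0 };  /* known constant index into the table */

uint32_t my_atomic_fetch_add_4(uint32_t *p, uint32_t v)
{
    return atomic_table[IDX_FETCH_ADD_4](p, v);
}
```

Since the selection happens once at load time, every later call is just an indirect jump through the table slot, with no per-call branching on the CPU type.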
> I would also aim to make the calling convention avoid needing a GOT
> pointer in the callee and avoid clobbering pr; this can be done by
> using a sequence like:
> mova .L1,r0
> mov.l @r0+,rn
> braf r0
> .L1: .long whatever
> [next code]
> and the callee returns by jumping to the value it received as r0 or
If there are other function calls in that calling function, it won't be
a win, because PR will be clobbered anyway.
> IFUNC is rather a mess and non-solution (this has been discussed a lot
> in the musl community) and it's not clear how to make it work with
> static linking at all.
Any refs to those discussions?
> OK. Do you have an opinion on it, whether we should just drop the
> legacy variant of the struct missing the space for floating point
> registers, or introduce a personality framework to support two
> different ABIs for the structure?
Sorry, no, I don't have any opinions w.r.t. linux at the moment.
> Negative offsets would at least make it compatible with the TLS ABI,
> where the "TCB" is below the thread pointer rather than above.
The resulting sequence would look something like this:
	mov	r0,r1              // exit point during sequence in r1
	mov.l	.Loffset,r0        // or something else to get the constant
	or.b	#(0f-1f),@(r0,gbr) // set sequence length and enter sequence
0:	mov.l	@r4,r1
1:	and.b	#0,@(r0,gbr)       // exit and clear sequence length
This would allow negative offsets. However, because of the GBR logical
insns it'll be slower. We can also lift the offset restriction of the
current implementation by not using @(disp,GBR) type insns if the
specified offset is not in the range as required by the insns. Please
open a new GCC PR for this, if you're interested in that.
> Of course the TLS ABI design was bad to begin with. There's no
> advantage to using the "Type I" form where TCB is below TP(GBR) and
> application TLS is above. In theory you would have the advantage of
> being able to use small immediate GBR offsets to access some
> application TLS, but this can't be done because the compiler can't
> know the offset the linker will assign to a particular object and
> whether it will be "in range".
Yes, it would require some sort of link time optimization.
> Multilibs solve a completely different problem than forward-compatible
I'm still not sure I understand your definition of "forward-compatible"
binaries here. According to my understanding, a binary can't really be
"forward-compatible" unless somebody can precisely predict the future.
A system can be backward-compatible by being able to run older binaries
in some way. Can you please clarify what you mean by
"forward-compatible"?
> I realize binary deployability may not seem the most interesting or
> agreeable goal on the FSF side, but I think it is worthwhile. I would
> much rather have works-anywhere busybox, etc. binaries I can drop onto
> an embedded device when exploring/extending it than have to first
> figure out the exact ISA-level/ABI and build the right binaries for
I think what you describe is more the situation/convenience we have with
desktop systems. That compatibility has a price/inefficiency tag
attached to it. In embedded systems the whole system is often modified,
tuned and rebuilt from source/scratch (e.g. buildroot). Of course it's
possible to define what "compatibility" means for desktop-SH. But I
guess for a niche system/market it's easier to say: "user, for your
system, please use the binaries/toolchain from this subdirectory". If
building those different variants is a problem, then that can be
improved in other ways.
> > The -fpu vs -nofpu problem can be solved as it's done on ARM with a
> > soft-float calling convention (passing all args in GP regs). Maybe it'd
> > make sense to define a new ABI for compatibility.
> No new ABI is needed for something analogous to ARM's "softfp"; the
> whole point is that the ABI is the same, but use of fpu is allowed
> "under the hood".
Right. You can define the SH2-nofpu ISA/ABI as the base level for your
system. Then anything higher than that has to be made backwards
compatible. This is what is currently not fully supported by the tools.
An sh4-linux system should already be able to run a fully self-contained,
statically linked sh2-linux program. Mixing pre-compiled libraries won't
work, though.
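The ARM-"softfp"-style idea mentioned earlier can be illustrated in C (an illustrative sketch only; the bit-passing helpers are invented names): the ABI-visible signature uses only integer types, i.e. GP registers, while the implementation is still free to use the FPU "under the hood".

```c
#include <stdint.h>
#include <string.h>

/* Helpers to move a double in and out of its raw 64-bit pattern,
 * standing in for "the value travels in GP registers". */
static double bits_to_double(uint64_t b)
{
    double d;
    memcpy(&d, &b, sizeof d);
    return d;
}

static uint64_t double_to_bits(double d)
{
    uint64_t b;
    memcpy(&b, &d, sizeof d);
    return b;
}

/* Callee: the ABI-level signature is integer-only, so nofpu callers
 * can use it, but internally it may use hardware FP when available. */
uint64_t soft_add(uint64_t a_bits, uint64_t b_bits)
{
    return double_to_bits(bits_to_double(a_bits) + bits_to_double(b_bits));
}
```

The per-call cost of such a convention is exactly the GP<->FP copying mentioned below.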
Of course the end result would be an overall less efficient system and
would be a step backwards in some cases (copying GP regs <-> FP regs,
resetting FPSCR.SZ/.PR, etc.). That's the price of backwards
compatibility.
> In general "new ABI" is something I don't like. My view is that "ABI
> combinatorics" (or config combinatorics in general) is a huge part of
> what lead to uClibc's demise. When there are too many combinations to
> test (and often more combinations than actual regular ongoing users!)
> it's impractical to keep code well-maintained.
From my point of view, the tools (compiler, assembler, linker, etc.)
should provide the options, and toolchain providers should configure
them with reasonable defaults for their systems. Yes, testing becomes a
bit more difficult, but it's not impossible. Some combinations don't get
tested and occasionally break. That's life, I guess.
Going back to your original idea w.r.t. libatomic... a "clean" way of
achieving what you want might be:
- add explicit -matomic-model=call
(which would also define the corresponding __SH_ATOMIC_MODEL_CALL__
and maybe implement some special ABI as above)
- add support to (somehow) allow different ABIs to be mixed within one
ELF, e.g. --isa=sh2,sh3,sh4,sh4a,...
- maybe put the function table etc into libgcc
With that, there's no need for the libatomic dependency and the
__atomic* primitives would "just work" (which in turn can be used by
libatomic). Then you can configure the toolchain for your system to use
-matomic-model=call by default. I wouldn't make it the default for
sh4-linux or sh4a-linux though. Those are not fully backwards
compatible software systems. Maybe it's better to leave them as they
are and add a new sub-target definition which is "SH2 or better" and
create variants of sh4/sh4a-linux which are backwards compatible with
that minimum baseline ABI. They'd also need to include SW FP libs (and
maybe some other support functions from/for libgcc, not sure). These
libs in turn can utilize SH4* HW functions (e.g. shld, FPU, ..). But
currently there is no easy way to make the compiler generate that kind
of code because the function call ABI is tied to the ISA caps.
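A rough C sketch of how the -matomic-model=call idea could look from libgcc's side (names and the init hook are hypothetical; real code would select among gUSA/LLCS/interrupt-masking variants after CPU detection):

```c
#include <stdint.h>

typedef uint32_t (*cas_4_fn)(uint32_t *, uint32_t, uint32_t);

/* Portable fallback, standing in for one of the runtime variants. */
static uint32_t cas_4_generic(uint32_t *p, uint32_t expected,
                              uint32_t desired)
{
    uint32_t old = *p;
    if (old == expected)
        *p = desired;
    return old;
}

static cas_4_fn cas_4_impl;

/* In reality this would run once at program load, e.g. from an init
 * hook in libgcc, after inspecting the CPU (SH2/SH3/SH4A, ...). */
static void atomics_init(void)
{
    cas_4_impl = cas_4_generic;
}

/* The entry point every compiled __atomic_* call would land on. */
uint32_t cas_4_call(uint32_t *p, uint32_t e, uint32_t d)
{
    if (!cas_4_impl)
        atomics_init();
    return cas_4_impl(p, e, d);
}
```

With this in libgcc, the compiler only ever emits a plain call, so the generated code is the same for every ISA level, which is the property the baseline-ABI variant above needs.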