This is the mail archive of the binutils@sourceware.org mailing list for the binutils project.



Re: [PATCH] enable fdpic targets/emulations for sh*-*-linux*


On Thu, Oct 01, 2015 at 12:30:05AM +0900, Oleg Endo wrote:
> On Wed, 2015-09-30 at 10:35 -0400, Rich Felker wrote:
> > On Wed, Sep 30, 2015 at 10:25:33AM -0400, Rich Felker wrote:
> > > On Wed, Sep 30, 2015 at 08:20:38PM +0900, Oleg Endo wrote:
> > > > On Tue, 2015-09-29 at 19:58 -0400, Rich Felker wrote:
> > > > > Currently sh/fdpic support in binutils is only enabled for
> > > > > sh{1,2,}-*-uclinux*. This patch adds it to the sh*-*-linux* targets
> > > > > which are what I'm using for musl's j2/sh2 support. sh2eb-*-linux-musl
> > > > > toolchains treat the target as regular Linux (modulo no fork and
> > > > > restricted mmap), produce binaries which are forward-compatible with
> > > > > sh3/4 Linux,
> > > > 
> > > > Do you already have a suggestion how to encode the atomic model that is
> > > > being used by the SH1*/SH2* ELF?  Without at least that, true "forward
> > > > compatibility" is difficult to achieve, I guess.
> > > 
> > > On the musl side, we have all atomics go through a function that
> > > chooses which atomic to use based on runtime detection. LLSC (sh4a),
> > > GUSA (sh3/4), and imask (sh2 single-core) are supported now and I'm
> > > going to add j2 cas.l. For sh4a+ targets, this is optimized out and
> > > the inline LLSC atomics are used.
> 
> How is this "optimized out" done?

#ifdef __SH4A__

If __SH4A__ is defined, this implies an ISA that includes the LLSC
instructions (movli.l and movco.l) so we can use them unconditionally.

We could do the same for the J2 cas.l instruction and a J2-specific
macro, but I want to be able to consider J4 (future), SH4 and SH4A as
ISA-supersets of J2. The J4 will have cas.l but SH4 and SH4A obviously
don't.
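
As a rough sketch of the difference between the compile-time and
runtime paths (not musl's actual code): the CAP_* bits, detected_caps,
select_model and a_cas below are hypothetical names, and the GCC/Clang
__atomic builtin stands in for the model-specific instruction
sequences so that the example builds on any host.

```c
#include <assert.h>

/* Hypothetical capability bits, not real hwcap values. */
#define CAP_LLSC 1u   /* movli.l/movco.l present (SH4A) */
#define CAP_CASL 2u   /* cas.l present (J2) */

enum atomic_model { MODEL_GUSA, MODEL_CASL, MODEL_LLSC };

/* Assumed to be set once at startup from what the kernel reports. */
unsigned detected_caps;

/* Map the capability bits to one atomic model (ordering is an
 * assumption of this sketch). */
enum atomic_model select_model(unsigned caps)
{
    if (caps & CAP_LLSC) return MODEL_LLSC;
    if (caps & CAP_CASL) return MODEL_CASL;
    return MODEL_GUSA;
}

/* Stand-in for a model-specific sequence. Returns the old value. */
static int cas_standin(volatile int *p, int expected, int desired)
{
    __atomic_compare_exchange_n(p, &expected, desired, 0,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;
}

/* Compare-and-swap entry point: returns the value seen at *p. */
int a_cas(volatile int *p, int expected, int desired)
{
    assert(p);
#ifdef __SH4A__
    /* The ISA guarantees LLSC, so the runtime branch disappears. */
    return cas_standin(p, expected, desired);
#else
    switch (select_model(detected_caps)) {
    case MODEL_LLSC: /* would be a movli.l/movco.l loop */
    case MODEL_CASL: /* would be a cas.l sequence */
    case MODEL_GUSA: /* would be a gUSA restartable sequence */
    default:
        return cas_standin(p, expected, desired);
    }
#endif
}
```

A caller uses a_cas(&v, old, new) and compares the result with old to
see whether the swap happened. The only per-call cost of the runtime
path is the branch on detected_caps.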

> > > On the GCC side, I think we should default to using libatomic except
> > > perhaps for sh4a and do the same runtime selection in libatomic.
> 
> That'd sort of defeat the purpose of compiler-provided inlined atomic
> builtins.  Every atomic op would have to go through a function call.  A
> safe but not a very attractive default, e.g. when one wants to build a
> SH4-only system.  We can try to lower the function call overhead with
> some special atomic library function ABI (as it's done for e.g. shifts
> and division libfuncs), if that helps.

Perhaps, but GUSA-style atomics are likely so much faster than actual
memory-synchronizing instructions that the total cost is still
relatively small.

I have considered doing a custom calling convention for this in musl's
atomics (and "calling" the "functions" from inline asm) but before
spending effort on that I'd want to see if there's actually a
practical performance issue. I'm not a fan of premature optimization.

> > Also note that "encoding the atomic model" (for a fixed model) is not
> > useful for forward-compatibility. 
> 
> It is useful.  It gives the system a chance of checking the required
> capabilities of the binary that is to be executed.  If the system
> doesn't implement those requirements, the binary can't run.  I think
> it's better to not run a program than running a buggy/misbehaving
> program.  At least that way, everybody knows what's wrong.

Oh, yes. But I don't see a good way to do this automatically without
adversely affecting programs that use runtime selection. It would be
really unfortunate for programs to get flagged as incompatible just
because they contain code that's not compatible, when they're
explicitly avoiding using that code.

> There are actually even more such things, like floating point ABI (pass
> args in GP vs pass in FP regs, default FP mode), or the presence of FP
> SW emulation.

There's a big problem here right now: the definition of
sigcontext/mcontext_t is dependent on whether the _kernel_/_cpu_ has
fpu, and thus programs built for the no-fpu layout are incompatible
with kernels for fpu-supporting cpus. (and sh4-nofpu ABI binaries
cannot run _anywhere_ correctly). I want to fix this in the kernel by
just always using the same structure layout with space for fpu
registers (and possibly having a personality bit for the old one if
people think it's necessary for backwards-compat) but the fact that
there's presently no maintainer for SH makes it really hard to advance
any changes that would be "policy"-like... :(
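
To illustrate why the two layouts cannot be mixed, here is a
simplified sketch. The struct and field names are illustrative
assumptions, not copies of the kernel headers, and the trailing
sigmask member only stands for "whatever is stored after the register
area".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register-save area as seen by a no-fpu build. */
struct regs_nofpu {
    uint32_t gregs[16];
    uint32_t pc, pr, sr, gbr, mach, macl;
};

/* Hypothetical register-save area as seen by an fpu build. */
struct regs_fpu {
    uint32_t gregs[16];
    uint32_t pc, pr, sr, gbr, mach, macl;
    uint32_t fpregs[16];
    uint32_t xfpregs[16];
    uint32_t fpscr, fpul, ownedfp;
};

/* A ucontext-like wrapper: anything placed after the register area
 * moves when the register area grows. */
struct ctx_nofpu { struct regs_nofpu mc; uint32_t sigmask[2]; };
struct ctx_fpu   { struct regs_fpu   mc; uint32_t sigmask[2]; };

/* The integer registers agree; only the later fields disagree. */
static_assert(offsetof(struct regs_nofpu, macl) ==
              offsetof(struct regs_fpu, macl),
              "integer registers share offsets");

/* Byte distance between the two views of the trailing member. */
size_t sigmask_shift(void)
{
    return offsetof(struct ctx_fpu, sigmask) -
           offsetof(struct ctx_nofpu, sigmask);
}
```

A signal handler built against one layout and run on a kernel that
writes the other reads the trailing member at the wrong offset, which
is the incompatibility described above. Always reserving the fpu
space would make the two views identical.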

> E.g. an SH4 system has to provide SW FP library to run
> SH2 code, in case it uses FP.  Unless *everything* is linked statically,
> of course.

Generally we static-link libgcc.a. I think it's hard to get a net
advantage from dynamic-linking it unless you have a lot more programs
running than would be typical on nommu systems. Going dynamic wastes a
good bit of ram data/got/function-descriptors and also requires
carrying around all of libgcc.so, even the parts you don't need.

> > A program compiled to use imask will
> > crash on cpus/kernels with privilege enforcement (although perhaps it
> > could trap and emulate).
> 
> Yes, that's one option.
> 
> >  A program compiled to use imask or GUSA will
> > fail to be atomic on multi-core machines.
> 
> Not necessarily.  It will have to be restricted to run exclusively on
> one core (via processor affinity).

Atomics are not just within a process. They can be used on shared
memory too. You would have to restrict all processes that have access
to the same shared memory to a single core.

> Probably that will require using
> only one particular core for running processes that use imask/gUSA
> atomics, because cross process atomics on shared memory will not work
> either.  It would run.  The usefulness might be debatable though.

Right. I don't think this is useful.

> Another option (maybe only viable with an MMU) is patching the binary
> code during loading or when downloading code into the flash or
> something.  Detecting the compiler generated atomic sequences shouldn't
> be that difficult, although it'd be easier and faster if the loader knew
> where they are (some meta info in the ELF).

This is impossible without tagging it with relocations. You can't
assume data is code just because it's in .text. The most obvious
exception is constant pools.

> >  The failure of GUSA to be
> > atomic also affects qemu linux-user; for this reason I got them to
> > change the default cpu model for qemu-sh4 to sh4a (and report it via
> > hwcap), so that real atomics get used and qemu has a chance to emulate
> > them (but I'm still not entirely convinced they work right; emulating
> > llsc on a cas-based architecture seems hard).
> 
> I'm still puzzled why somebody wanted to run a program/library compiled
> for SH2 on e.g. an SH4A multi-core system.  It's easier to just
> recompile and redistribute it.  Performance and efficiency are better,
> too...

The most important usage case I can think of right off is being able
to build software whose build systems are not cross-friendly by
running on the higher-performance MMU-ful hardware at build time. This
only works if stuff built for the target also runs on the build
system. (As a partial analogy, I can quickly and easily build
cross-unfriendly software for i386 by running on a recent high-end
x86, but only because the target binaries are forward-compatible.)

In general, this is just part of the value of having an ISA as a
platform, which is what makes revision N+1 of an ISA X a more
attractive target when you're already using revision N of ISA X, as
opposed to just using revision M of ISA Y. It's a much nicer developer
and user experience to be able to throw your existing binaries on a
new device and be able to use them right away than having to adjust
your builds for the new device, rebuild, test for regressions (e.g.
moving from soft float to hard float), etc., and even if you do want
to transition, there's a lot of value in being able to do so
incrementally (and possibly only for the software where performance
actually matters).

I don't know if J2/SH2 -> SH4A is likely to be a real-world transition
path anyone would actually care about, mainly because I don't have any
experience with SH4A hardware. But I also think it's a much bigger
mess to be trying to figure out which transition paths make sense and
building a forward-compatibility model around the resulting
assumptions than to just use simple runtime detection.

Programs that actually need high-performance atomics on multiple
objects with low latency between them will likely ignore
forward-compatibility concerns and simply hard-code whatever option
makes sense for the hardware they're targeted to. But this is likely
to be a very small minority of programs. For the needs of most
programs, which are using high-level sync primitives like
pthread_mutex_lock, an extra level of conditional or indirection
around the actual atomic is relatively very small compared to the code
paths already present for implementing the high-level sync primitive
correctly.

> Anyway, I still think there should be more flags or SH attributes in ELF
> for encoding all the various ABIs and options.

I'm not opposed to this as long as they don't break usage cases that
otherwise would/should work. One approach I'd be happy to see is first
adding support for runtime switching in gcc and making that the
default, then setting the ABI flags via new gas directives if the
default is overridden and the choice requires a specific cpu model
that's not forwards-compatible.
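
A minimal sketch of how such flags could be checked at load time,
assuming a purely hypothetical set of requirement bits (no such SH ELF
flags exist in binutils today, and REQ_* and can_run are invented
names):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical requirement bits recorded in the binary. */
#define REQ_LLSC  1u  /* hard-coded movli.l/movco.l */
#define REQ_CASL  2u  /* hard-coded cas.l */
#define REQ_IMASK 4u  /* hard-coded imask, needs privileged single core */

/* A binary may run if every requirement it records is provided.
 * Binaries that select their atomics at runtime record nothing,
 * so they are accepted everywhere. */
bool can_run(unsigned required, unsigned provided)
{
    return (required & ~provided) == 0;
}
```

Under this scheme only a binary whose default was overridden carries a
nonzero requirement mask, so runtime-selecting programs are never
flagged as incompatible.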

Rich

