This is the mail archive of the binutils@sourceware.org mailing list for the binutils project.
Re: [PATCH] enable fdpic targets/emulations for sh*-*-linux*
- From: Oleg Endo <oleg dot endo at t-online dot de>
- To: Rich Felker <dalias at libc dot org>
- Cc: binutils at sourceware dot org
- Date: Sun, 04 Oct 2015 12:32:54 +0900
- Subject: Re: [PATCH] enable fdpic targets/emulations for sh*-*-linux*
- Authentication-results: sourceware.org; auth=none
- References: <1443612038 dot 2509 dot 140 dot camel at t-online dot de> <20150930142533 dot GC8645 at brightrain dot aerifal dot cx> <20150930143555 dot GD8645 at brightrain dot aerifal dot cx> <1443627005 dot 2509 dot 189 dot camel at t-online dot de> <20150930183810 dot GE8645 at brightrain dot aerifal dot cx> <1443715139 dot 2031 dot 134 dot camel at t-online dot de> <20151001164630 dot GI8645 at brightrain dot aerifal dot cx> <1443804962 dot 2031 dot 290 dot camel at t-online dot de> <20151002175223 dot GU8645 at brightrain dot aerifal dot cx> <1443863059 dot 2031 dot 433 dot camel at t-online dot de> <20151003185947 dot GC8645 at brightrain dot aerifal dot cx>
On Sat, 2015-10-03 at 14:59 -0400, Rich Felker wrote:
> > Sure, that can be done, too. Actually, you can have the function
> > pointer table in the TLS, which makes it reachable via GBR:
> > mov.l @(disp, gbr), r0
> > jsr @r0
> > nop
> Again, that's unfortunately not possible because positive offsets from
> GBR belong to the application's initial-exec TLS. The TLS ABI really
> should have defined GBR to point 1024 bytes below the start of TLS
> rather than at the start of TLS, so that up to 1k of TCB space could
> be accessed via the short/fast GBR-based addressing. This would not
> require reserving that much actual space (which would be a horrible
> idea -- huge waste of memory per thread) but would just allow it to
> be assigned from the end downwards as needed. This is what most other
> risc archs with limited-range immediates did.
So fix the TLS ABI? Anyway, you're building a new system...
The same @(disp,gbr) loads/stores can be used to get/set errno. Not
that a lot of apps out there actually use errno, but the standard
requires it.
> In the big scheme of TCB access it probably doesn't matter. You can
> just do:
> stc gbr,r0
> add #imm,r0
> mov.l @r0,...
Yeah, it's just that it's 2 or 3 extra cycles of "effectively doing
nothing" for the memory access...
> Indeed. But ideally functions which perform locking are either leaf
> functions or have a shrink-wrappable code path that should avoid
> setting up a call frame and saving the return address. I doubt the
> current sh backend makes any such optimizations, so before we even
> think about ugly micro-optimization hacks that require complex
> cooperation between different parts of the toolchain and runtime code,
It's true, things usually have to go hand in hand.
> I think we should focus on the big performance problem that would make
> a much much bigger difference: very bad codegen by gcc. Aside from
> lack of shrink-wrapping, poor handling of the PIC register (like the
> way x86 used to handle %ebx, as permanently-reserved and unusable)
> stands out as something that needs to be fixed.
Shrink wrapping is being done for all backends. In fact, it has been
improved recently. Of course there could be some SH specific cases
which aren't optimized well. Please open PRs in GCC's bugzilla for
issues with the compiler.
> I do like your above trick for using negative offsets efficiently. BTW
> for small negative offsets (which are the only reasonable ones) you
> can avoid the .Loffset and just use an immediate.
> For most archs it's very simple -- you have a linear progression of
> ISA levels/models, and forwards-compatible just means anything built
> with the -march for ISA level A runs on a host with ISA level B, for
> any A<B.
I think your definition is reversed. E.g. the 386 is not forwards
compatible, but the 486 is backwards compatible... But anyway.
> Note that the main big real-world obstacle to forward-compatibility
> through an ISA progression is lack of proper atomics/barriers on old
> versions of the ISA. Whereas most code for an older ISA runs fine on a
> superset of that ISA, if the old ISA lacked real atomics/barriers and
> the newer model supports SMP, you're pretty much completely out of luck.
Yep. And at this point it might actually be better to make a cut and
start afresh with SMP and HW atomics in place. But that's not possible
either because with J2's cas.l now we have/are getting yet another
atomics impl. on SH.
> The only hope for the code running without knowledge and
> conditional use of the newer ISA extensions is that the OS can
> reliably notice and trap whatever old simulated atomics were used and
> convert them to something that synchronizes memory. I advised the
> OpenRISC developers on this issue early in their porting of musl to
> or1k and quickly got real atomics added to the ISA so that they
> wouldn't run into a nasty issue like this in the future. OTOH
> Linux/MIPS handled the issue just by pretending all MIPS ISA levels
> have the ll/sc instructions and requiring the kernel to trap and
> emulate them on ancient hardware. That would have worked for J2 as
> well but would have given really really bad performance.
I'm severely confused. First you say performance of atomics doesn't
matter (you're OK to add a 2x..3x overhead for the runtime switched
version compared to compiler inlined). But now you are concerned about
atomics performance. So which is it?
> Things have always been done that way with uClibc, but that doesn't
> mean it's the right way; I'm trying to do something better with musl.
> Trying to micro-optimize out every single code path you possibly can
> with highly target-specific knowledge is simply not an efficient path
> to small size and performance; it takes too much human maintenance
> effort and distracts from the real opportunities for big gains from
> higher-level optimizations.
That's why it's better to let the compiler do it. Of course it won't do
every imaginable trick, but it can be improved. This way all the SW
gets the improvement, not just a particular library.
> Modulo the sigcontext ABI issue and the gratuitously different syscall
> trap numbers (the latter of which I have a pending kernel patch to
> fix, but it's not getting any attention because there's no maintainer
> for SH and without a maintainer nobody can really touch design/policy
> type issues like this...).
Maybe for now it's more productive to create an SH linux branch (which
is updated from mainline periodically of course) and send a pull request
to some global maintainer after things have settled and have been
working for a while.
> Of course if you're running on actual sh2 hardware all the libs need
> to refrain from using instructions from sh3/sh4/sh4a. But the same
> dynamic binaries (built for sh2 ISA) can run just fine on sh4 (modulo
> the sigcontext issue) with sh4-nofpu versions of the libraries
> installed for better performance (and even with hard-float used
> internally, like on ARM softfp).
Yes, that's what I was saying. Except that using hard-float
"internally" and soft-float "externally" is not supported by the
compiler at the moment.
AFAIK, there is also no mechanism for the dynamic linker to pick the
right libraries. E.g. when loading an SH2-nofpu ELF on an SH4-fpu
system, it should pick the SH2-nofpu compatible libraries.
> > I wouldn't make it the default for
> > sh4-linux or sh4a-linux though. Those are not fully backwards
> > compatible software systems.
> That's actually one of the biggest areas it's needed -- right now,
> binaries built for sh4 are not safe to run on sh4a. Their atomics are
> non-atomic on sh4a if it's SMP or if they're sharing memory with
> programs using the real atomic instructions. This is the original
> reason musl implemented the runtime selection of atomics, way before I
> even thought about sh2 and nommu support.
If running old SH4/gUSA binaries on SH4A/LLCS multi-core is needed that
badly (gUSA and LLCS are compatible on single-core), then probably the
easiest thing to do is to add some backwards compatibility to
sh4a-linux. As we've figured out earlier, there are several options
which could work. The only severe problem is that the atomic model used
by a binary is not recorded in the ELF file (which would be an issue for
binutils). So the system either has to assume the worst, or try to
figure out at runtime whether gUSA or LLCS is being used.
It's running in circles a bit. For your proposed fix (to do all atomics
via function calls), you'll get SH4A multi-core compatible SH4 SW only
after rebuilding all of that SW from source. At that point you might as
well just rebuild it for SH4A/LLCS. In fact, for your new system you'll
probably be rebuilding everything for FDPIC, too ...