Bug 12403

Summary: futex issues with --enable-kernel=2.6.22 to 2.6.28
Product: glibc
Reporter: Allan McRae <allan>
Component: nptl
Assignee: Ulrich Drepper <drepper.fsp>
Status: RESOLVED FIXED
Severity: normal
CC: bz-glibc, hjl.tools, j, toolchain
Priority: P2
Flags: fweimer: security-
Version: unspecified
Target Milestone: ---
Host: x86_64-unknown-linux-gnu
Target:
Build:
Last reconfirmed:
Attachments: Fix stack imbalance under --assume-kernel=2.6.{22..29} in rwlock code

Description Allan McRae 2011-01-16 09:39:14 UTC
When building glibc on x86_64 linux with --enable-kernel set for 2.6.22 to 2.6.28 inclusive, the following tests fail:

make[2]: *** [/build/glibc-build/nptl/tst-rwlock6.out] Error 1
make[2]: *** [/build/glibc-build/nptl/tst-rwlock7.out] Error 1
make[2]: *** [/build/glibc-build/nptl/tst-rwlock9.out] Error 1
make[2]: *** [/build/glibc-build/nptl/tst-rwlock11.out] Error 1
make[2]: *** [/build/glibc-build/nptl/tst-rwlock12.out] Error 11
make[2]: *** [/build/glibc-build/nptl/tst-rwlock14.out] Error 1
make[2]: *** [/build/glibc-build/nptl/tst-abstime.out] Error 1

This issue also results in crashes in various real-world applications.

Looking at what is enabled at the failure boundaries indicates a futex issue:
Support for private futexes was added in 2.6.22
Support for the FUTEX_CLOCK_REALTIME flag was added in 2.6.29
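
The two boundaries above can be expressed as a small lookup. This is a minimal sketch, not glibc's own code: it assumes the minimum kernel version is packed the way Linux's KERNEL_VERSION() macro packs it, and the names kver, futex_assumptions, assumptions_for and in_failing_window are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: pack a kernel version as
   (major << 16) | (minor << 8) | patch, like Linux's KERNEL_VERSION(). */
static unsigned int kver(unsigned int major, unsigned int minor,
                         unsigned int patch)
{
    assert(minor < 256 && patch < 256);
    return (major << 16) | (minor << 8) | patch;
}

/* Which futex features a build may assume for a given minimum kernel.
   The thresholds are the two boundaries quoted in this report. */
struct futex_assumptions {
    bool private_futex;        /* models __ASSUME_PRIVATE_FUTEX */
    bool futex_clock_realtime; /* models __ASSUME_FUTEX_CLOCK_REALTIME */
};

static struct futex_assumptions assumptions_for(unsigned int min_kernel)
{
    struct futex_assumptions a;
    a.private_futex = min_kernel >= kver(2, 6, 22);
    a.futex_clock_realtime = min_kernel >= kver(2, 6, 29);
    return a;
}

/* The failing window is exactly where the two assumptions disagree. */
static bool in_failing_window(unsigned int min_kernel)
{
    struct futex_assumptions a = assumptions_for(min_kernel);
    return a.private_futex != a.futex_clock_realtime;
}
```

Under these assumptions, in_failing_window(kver(2, 6, 27)) is true, while 2.6.21 and 2.6.29 both fall outside the window, which matches the range of --enable-kernel values that fail.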

To confirm that this is an issue with futex support, I built glibc with one of the bad values for --enable-kernel (2.6.27) and manually adjusted the following defines:

1 - default - 2.6.22 - 2.6.28:
  # define __ASSUME_PRIVATE_FUTEX    1
  # undef __ASSUME_FUTEX_CLOCK_REALTIME
Glibc tests fail.

2 - default pre 2.6.22:
  # undef __ASSUME_PRIVATE_FUTEX
  # undef __ASSUME_FUTEX_CLOCK_REALTIME
Glibc tests pass.

3 - default 2.6.29 and later:
  # define __ASSUME_PRIVATE_FUTEX    1
  # define __ASSUME_FUTEX_CLOCK_REALTIME    1
Glibc tests pass.

This issue does not occur on i686-pc-linux-gnu.
Comment 1 Allan McRae 2011-01-16 09:53:22 UTC
Naively trying to locate the source of this bug...  Generating a list of files that have #ifdef/#ifndef on the __ASSUME_PRIVATE_FUTEX and __ASSUME_FUTEX_CLOCK_REALTIME defines (assuming this is not some more complex interaction) and are x86_64 specific (as this does not occur on i686 builds) gives:

nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
nptl/sysdeps/unix/sysv/linux/x86_64/lowlevelrobustlock.S
nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S
nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_timedrdlock.S
nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_timedwrlock.S

If we further assume that the bug requires a nested #ifdef for these two values, that restricts the issue to small parts of the last three files and given the test suite failures we can exclude the first of those.
Comment 2 Ulrich Drepper 2011-01-16 16:03:22 UTC
This is no place to report such problems.

*** This bug has been marked as a duplicate of bug 333 ***
Comment 3 Allan McRae 2011-01-16 22:50:05 UTC
I have shown the issue occurs on a specific platform and with specific values for --enable-kernel and shown it is a specific combination of defines that results in the issue.

I would have thought that specific enough to be able to replicate the issue and for it not to be a #333 duplicate. What further information is needed to show this is a genuine glibc issue?
Comment 4 Bryan Kadzban 2011-01-23 22:37:14 UTC
Created attachment 5208
Fix stack imbalance under --assume-kernel=2.6.{22..29} in rwlock code
Comment 5 Bryan Kadzban 2011-01-23 22:38:12 UTC
I'm seeing this as well.  I've tracked it down to a bug in the cleanup code in pthread_rwlock_timedwrlock.S (causing a stack imbalance just before "retq") -- it uses __ASSUME_PRIVATE_FUTEX when deciding whether or not to clean up after the local variables (and saved register) created for __ASSUME_FUTEX_CLOCK_REALTIME.  When these two are set differently, "retq" jumps off into never-never-land.
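
The mismatch described above can be modelled in a few lines of C. This is only a sketch of the guard logic, not the actual assembly: it assumes the locals and saved register are set up only in builds that do not define __ASSUME_FUTEX_CLOCK_REALTIME, and SCRATCH_BYTES, stack_delta and check_fixed_guard are hypothetical names with a made-up byte count.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical size of the locals plus the saved register; the real
   figure in the assembly is not given in this report. */
enum { SCRATCH_BYTES = 24 };

/* Net stack-pointer change between entry and "retq" for one build
   configuration. Zero means retq finds the return address. */
static long stack_delta(bool assume_private_futex,
                        bool assume_futex_clock_realtime,
                        bool fixed)
{
    long sp = 0;

    /* Setup: assumed to exist only in builds that cannot rely on
       FUTEX_CLOCK_REALTIME. */
    if (!assume_futex_clock_realtime)
        sp -= SCRATCH_BYTES;

    /* Cleanup: the buggy code keys on the wrong macro, the fixed code
       keys on the same macro as the setup. */
    bool cleanup = fixed ? !assume_futex_clock_realtime
                         : !assume_private_futex;
    if (cleanup)
        sp += SCRATCH_BYTES;

    return sp;
}

/* Self-check: with matching guards, every combination is balanced. */
static void check_fixed_guard(void)
{
    for (int p = 0; p <= 1; p++)
        for (int c = 0; c <= 1; c++)
            assert(stack_delta(p != 0, c != 0, true) == 0);
}
```

In this model, the buggy guard leaves the stack unbalanced only when __ASSUME_PRIVATE_FUTEX is set and __ASSUME_FUTEX_CLOCK_REALTIME is not, which is the 2.6.22 to 2.6.28 configuration; the other two configurations from the description come out balanced.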

There's a related bug in pthread_rwlock_timedrdlock.S, which emits the wrong CFI directives, but I don't think this will affect runtime.  (Could be wrong though; I don't know a lot about CFI.)

Attached is a patch that fixes both issues; with this, all crashing in the testsuite is gone.
Comment 6 Allan McRae 2011-01-24 05:17:46 UTC
Thanks. I can confirm that patch fixes the issues I was observing.
Comment 7 Jürg Billeter 2011-07-25 09:43:57 UTC
I've just spent some time debugging a crash in asterisk which I tracked down to the issue described here; my patch looks identical. Is there any reason why this is not yet in master (and at least 2.13 and 2.14)?
Comment 8 Andreas Schwab 2011-08-18 06:51:58 UTC
*** Bug 13106 has been marked as a duplicate of this bug. ***
Comment 9 Ulrich Drepper 2011-09-09 03:55:40 UTC
I checked in a patch.
Comment 10 Mike Frysinger 2012-01-09 19:39:22 UTC
guess only one site needed updating:

http://sourceware.org/git/?p=glibc.git;a=commitdiff;h=1e4bd093e664f2889c48e63714583ef06b90d5b9
Comment 11 Bryan Kadzban 2012-01-10 07:11:17 UTC
> guess only one site needed updating:

Sort of.  Only one site affects the generated machine code (and that site was fixed in the git change that you linked to), but git head still has broken CFI data at the other site.

Which may or may not be an actual problem, depending on what happens.  If the kernel tries to trace back into userspace from one of the other syscalls, the info still might be totally broken.  But at least the code works now, which is a step up from before.  I'm not very hopeful about the CFI data *ever* getting fixed, unfortunately.  :-/
Comment 12 Mike Frysinger 2012-01-10 21:03:52 UTC
that sounds like a diff (if semi-related) bug.  could you file a new one for us to track it ?