This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH] Remove CPU mask size detection from setaffinity


On 05/23/2015 04:21 AM, Carlos O'Donell wrote:

> Is this the problem?
> 
> "The glibc sched_setaffinity routine incorrectly attempts to compute the
>  kernel cpumask_t, and therefore does more harm than good imposing a 
>  incorrectly computed cpumask_t size as a limit." With "harm" exaplained
>  in detail.

The main harm I see is unnecessary complexity, and perhaps misleading
application developers who read the glibc implementation but not the
kernel code.

I think it's also unreasonable to assume that the current state of CPU
count and affinity handling is the final development in this area.
Things like CPU hotplug and process migration will likely become more
important over time, and making glibc rely on the present implementation
details could cause problems in the future.

(An example of such a change is the route cache removal.  There is now a
sysctl, kept for backwards compatibility, which reports the kernel's
maximum table size as INT_MAX.  If you have code that reads this sysctl
and trusts the value, it will likely fail in some way.)
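
To illustrate: I am assuming the compatibility sysctl in question is
/proc/sys/net/ipv4/route/max_size.  A quick sketch that reads it now
just gets INT_MAX back, so anything that sizes a table from that value
breaks:

	#include <limits.h>
	#include <stdio.h>

	int
	main (void)
	{
	  /* Backwards-compatibility sysctl left behind by the route cache
	     removal; on current kernels it reads back as INT_MAX.  */
	  FILE *f = fopen ("/proc/sys/net/ipv4/route/max_size", "r");
	  if (f == NULL)
	    {
	      perror ("fopen");
	      return 1;
	    }
	  long max_size;
	  if (fscanf (f, "%ld", &max_size) != 1)
	    {
	      fclose (f);
	      return 1;
	    }
	  fclose (f);
	  printf ("max_size = %ld%s\n", max_size,
	          max_size == INT_MAX ? " (INT_MAX)" : "");
	  return 0;
	}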

> The glibc API description for sched_setaffinity says EINVAL
> will be returned when the affinity bit mask contains no processors
> that are currently physically on the system, or are permitted to
> run. It does not fail if the bit mask is larger than the kernel
> cpumask_t and that extra space contains zeroes. In fact the glibc
> manual states that this information can even be used in the *future*
> by the scheduler, one presumes, when more cpus come online.
> 
> The present code in sysdeps/unix/sysv/linux/sched_setaffinity.c
> uses sched_getaffinity to compute what it believes is the maximum
> size of the kernel cpumask_t. The loop there will succeed at the
> first call since most systems default NR_CPUS[1] to 1023 or less.

The value here is not NR_CPUS, but its dynamically reduced counterpart,
nr_cpu_ids, which you call possible CPUs below.  NR_CPUS varies wildly
between distributions (I've seen values ranging from 512 to 5120), but
nr_cpu_ids should be fairly consistent for a single piece of hardware.
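
For reference, the probing scheme the current sched_setaffinity.c uses
is roughly the following.  This is my own standalone sketch (raw
syscall, heap buffer), not the glibc source: keep doubling the buffer,
starting at 128 bytes, until sched_getaffinity stops failing with
EINVAL, and treat the syscall's return value as the kernel cpumask size
in bytes.

	#define _GNU_SOURCE
	#include <errno.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* Probe the kernel's idea of the cpumask size by calling the raw
	   sched_getaffinity syscall with ever larger buffers.  */
	static long
	probe_kernel_cpumask_size (void)
	{
	  size_t psize = 128;	/* 1024 bits; enough unless nr_cpu_ids > 1024.  */

	  for (;;)
	    {
	      unsigned long *buf = calloc (1, psize);
	      if (buf == NULL)
	        return -1;
	      long res = syscall (SYS_sched_getaffinity, 0, psize, buf);
	      int err = errno;
	      free (buf);
	      if (res >= 0)
	        return res;	/* Kernel cpumask size in bytes.  */
	      if (err != EINVAL)
	        {
	          errno = err;
	          return -1;	/* Unexpected failure.  */
	        }
	      psize *= 2;	/* Buffer too small; retry with a larger one.  */
	    }
	}

	int
	main (void)
	{
	  long size = probe_kernel_cpumask_size ();
	  if (size < 0)
	    {
	      perror ("sched_getaffinity");
	      return 1;
	    }
	  printf ("kernel cpumask size: %ld bytes (%ld bits)\n", size, size * 8);
	  return 0;
	}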

> If NR_CPUS were very high, the limit the kernel uses is nr_cpu_ids,
> which is still the correct limit and reflects maximum possible cpus
> as detected by hardware (as opposed to the compile time max).
> The kernel code and glibc code as far as I can tell will not correctly
> compute nr_cpu_ids nor NR_CPUS, but will compute the largest power of
> two size, starting with 128 bytes, that the kernel will accept as a
> cpumask_t.

This does not match my testing with either the Fedora 21 kernel or the
Red Hat Enterprise Linux 7 kernel (x86_64).  I only tested full 64-bit
words, but the kernel implementation of sched_getaffinity succeeded when
passed 3 words of affinity bits (24 bytes, 192 bits) on a 144-processor
system.  A power of two does not seem to be required.
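
In case it helps, the test was essentially the following (a
hand-written sketch, assuming 64-bit unsigned long, not the exact
program I ran):

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	int
	main (void)
	{
	  /* 3 x 8-byte words = 24 bytes = 192 bits: not a power of two, but
	     larger than nr_cpu_ids on the 144-CPU machine and a multiple of
	     sizeof (unsigned long).  */
	  unsigned long mask[3];
	  memset (mask, 0, sizeof mask);
	  long res = syscall (SYS_sched_getaffinity, 0, sizeof mask, mask);
	  if (res < 0)
	    {
	      perror ("sched_getaffinity");
	      return 1;
	    }
	  printf ("accepted len %zu, kernel returned %ld bytes\n",
	          sizeof mask, res);
	  return 0;
	}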

> This value need not be nr_cpu_ids, and if nr_cpu_ids is
> larger than 1024, then it certainly won't be since the kernel is happy
> to copy back a partial set of bits e.g. len < nr_cpu_ids.

Are you sure about that?  If so, that would be a clear kernel bug.  The
sources I
looked at have this right at the top of the sched_getaffinity
implementation:

	if ((len * BITS_PER_BYTE) < nr_cpu_ids)
		return -EINVAL;

So I think the current glibc code is not downright buggy on systems with
more than 1024 CPUs; it just does unnecessary work.

> If the kernel ignores non-allowed cpus, that would seem like a bug
> in the interface. Either the call succeeds according to your request
> or it doesn't.

Based on the kernel code for sched_setaffinity, this seems to be the intent.

> However, I have not worked on these interfaces on the kernel side,
> and I would appreciate Motohiro-san's experience in this matter.

Same here; we really need to know what the intent is on the kernel side.

-- 
Florian Weimer / Red Hat Product Security

