Re: What *is* the API for sched_getaffinity? Should sched_getaffinity always succeed when using cpu_set_t?


On 07/17/2013 04:06 PM, KOSAKI Motohiro wrote:
> (7/17/13 3:43 PM), Carlos O'Donell wrote:
>> On 07/15/2013 09:25 PM, KOSAKI Motohiro wrote:
>>> I meant that _SC_NPROCESSORS_ONLN is racy and not recommended if cpu hotplug happens.
>>
>> (1) Race-free fix.
>>
>> Given that it's racy to use _SC_NPROCESSORS_ONLN, and it's also racy
>> to rely on sched_getaffinity returning EINVAL, the best we can do is
>> add a _SC_NPROCESSORS_MAX_NP and use that to query the maximum number
>> of processors the OS supports and use that.
> 
> No. sched_getaffinity is not racy because it uses possible cpus instead
> of online cpus. That is because Ulrich Drepper requested it when I
> implemented the current kernel logic.

Thank you for the clarification. 

I think I see what you mean. 

There is no race.

(1) No race in sched_getaffinity internally.

Just to clarify:

kernel/sched/core.c:
...
SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,
                unsigned long __user *, user_mask_ptr)
...
        if ((len * BITS_PER_BYTE) < nr_cpu_ids)
                return -EINVAL;

This code causes EINVAL to be returned when the length of the userspace
buffer is smaller than the size of the mask for ONLINE cpus.

...
        if (ret == 0) {
                size_t retlen = min_t(size_t, len, cpumask_size());

                if (copy_to_user(user_mask_ptr, mask, retlen))
                        ret = -EFAULT;
                else
                        ret = retlen;
        }

This code causes at most the mask of POSSIBLE cpus (cpumask_size())
to be copied to the userspace buffer even if the userspace buffer is
bigger. No error is returned if the userspace buffer is bigger than
the POSSIBLE cpu mask size.

Thus the worst case behaviour is as follows:

sched_getaffinity (glibc)
  Allocate space for glibc buffer based on /sys values.
  Kernel hot plug adds a CPU and the ONLINE cpu count increases by 1 or more.
  Call kernel sched_getaffinity, result is EINVAL.
  Kernel hot plug adds a CPU and the ONLINE cpu count increases by 1 or more.
  Call kernel sched_getaffinity, result is EINVAL.
  ... Repeat until you reach the POSSIBLE cpu limit.
  Call kernel sched_getaffinity, result is size of POSSIBLE cpu mask.
  Copy all or part of glibc buffer to user buffer.

So there is no race, but possibly a long wait if the kernel
is in the process of adding a lot of cpus quickly.
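
For illustration, here is a minimal userspace sketch of that grow-and-retry
behaviour (a sketch only, not glibc's implementation; it uses the GNU
CPU_ALLOC interfaces and starts from _SC_NPROCESSORS_ONLN):

#define _GNU_SOURCE
#include <sched.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

int
main (void)
{
  /* Start from the ONLINE count; the kernel's cpu mask may be bigger.  */
  long count = sysconf (_SC_NPROCESSORS_ONLN);
  if (count < 1)
    count = 1;

  for (;;)
    {
      cpu_set_t *set = CPU_ALLOC (count);
      size_t size = CPU_ALLOC_SIZE (count);

      if (set == NULL)
        return 1;

      if (sched_getaffinity (0, size, set) == 0)
        {
          printf ("mask accepted at %zu bytes (sized for %ld cpus)\n",
                  size, count);
          CPU_FREE (set);
          return 0;
        }

      CPU_FREE (set);
      if (errno != EINVAL)
        {
          perror ("sched_getaffinity");
          return 1;
        }

      /* Buffer too small for the kernel's cpu mask; retry with more.  */
      count *= 2;
    }
}

On a machine where the POSSIBLE cpu count exceeds the ONLINE count, the
first call may fail with EINVAL and the loop simply retries with a doubled
count until the kernel accepts the buffer.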

(2) No race between sched_getaffinity and sched_setaffinity.

The call to sched_getaffinity always succeeds, and glibc knows the
size of the ONLINE cpu mask (since the kernel sched_getaffinity
succeeds if you have at least enough space for ONLINE cpus).

You manipulate the mask and call sched_setaffinity, which succeeds
because either the length of the user buffer is less than the
POSSIBLE cpu mask size and that is OK (the kernel copies what it
can from userspace), or the buffer is larger than the POSSIBLE cpu 
mask size and the kernel ignores the higher bytes.

e.g.

kernel/sched/core.c:
static int get_user_cpu_mask(unsigned long __user *user_mask_ptr, unsigned len,
                             struct cpumask *new_mask)
{
        if (len < cpumask_size())
                cpumask_clear(new_mask);
        else if (len > cpumask_size())
                len = cpumask_size();

        return copy_from_user(new_mask, user_mask_ptr, len) ? -EFAULT : 0;
}

Thus there is no race between the get and set calls, and the sequence
always works without failure.
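
As a concrete sketch of that get/modify/set sequence (assuming, for
brevity, that the fixed 1024-bit cpu_set_t is large enough for the machine
at hand, which is the very assumption under discussion elsewhere in this
thread):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int
main (void)
{
  cpu_set_t set;

  /* Get the current mask; per the kernel code above this succeeds as
     long as sizeof (cpu_set_t) covers the kernel's cpu mask.  */
  if (sched_getaffinity (0, sizeof (set), &set) != 0)
    {
      perror ("sched_getaffinity");
      return 1;
    }

  /* Restrict the calling thread to cpu 0, if cpu 0 was in the mask.  */
  if (CPU_ISSET (0, &set))
    {
      CPU_ZERO (&set);
      CPU_SET (0, &set);

      /* The same buffer size is passed back; get_user_cpu_mask truncates
         or zero-extends as needed, so the size itself cannot cause
         EINVAL here.  */
      if (sched_setaffinity (0, sizeof (set), &set) != 0)
        {
          perror ("sched_setaffinity");
          return 1;
        }
    }

  return 0;
}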
 
>> (2) Accept that the race exists and live with it.
>>
>> Accept that sched_getaffinity may take an infinite amount of time as
>> it attempts to query the kernel's cpu affinity mask size (via sys or
>> some other means) and then get the affinity mask. It insulates the user
>> from writing such code, and if you aren't plugging in and out cpus
>> continually it should succeed at some point.
>>
>> This solution does not require a new non-portable sysconf value.
>> The use of sched_setaffinity may still fail if cpus were taken offline
>> and the user at that point has to call sched_getaffinity again and
>> try again. It is documented that sched_setaffinity might fail with
>> EINVAL in this case.
> 
> Hmmm...
> Sorry, I haven't caught the point. If the user uses _SC_NPROCESSORS_ONLN,
> sched_getaffinity() may always fail because it uses possible cpus
> internally.
> 
> Example:
> 
> online cpus: 0-1023
> possible cpus: 0-4095
> 
> and, if the user allocates and uses a 1024-bit-wide bitmap, sched_getaffinity()
> always fails even though he doesn't use hot plugging. So, this is not only a
> race matter.
 
I agree there is no race.

However, I expected the current kernel code to work for your example,
since you changed sched_getaffinity in the kernel to work with ONLINE
cpus?

The current glibc sched_getaffinity will return EINVAL when we have:

user buffer: 1024 bits.
online cpus: 0-4095.
possible cpus: 0-65535.

This is because of:

kernel/sched/core.c:
        if ((len * BITS_PER_BYTE) < nr_cpu_ids)
                return -EINVAL;

In this case the *new* sched_getaffinity in glibc would loop or
use _SC_NPROCESSORS_MAX_NP (new constant) to determine the possible
number of cpus before making the syscall.
 
>> (3) Accept the interface as it is today.
>>
>> Both sched_getaffinity and sched_setaffinity can fail with EINVAL
>> at any time and you need to adjust the call to succeed. Code that
>> uses a fixed cpu_set_t might never work correctly.
>>
>> Perhaps the most important question is:
>>
>> Should sched_getaffinity be able to fail with EINVAL and what does
>> that mean?
>>
>> My opinion is that it should not, and that such internal kernel
>> details should be kept out of the API. It should just copy whatever
>> size the user has.
>>
>> It should be OK to use _SC_NPROCESSORS_ONLN here, but we can document
>> that it's racy in the presence of hot plug.
>>
>> We can document the new _SC_NPROCESSORS_MAX_NP as a non-portable way
>> to get the OS configured maximum which you can use if you want to get
>> all possible CPUs.
> 
> I hope to clarify the meaning of _SC_NPROCESSORS_CONF first. I'm still
> unclear whether we need to add a new _SC_NPROCESSORS_MAX_NP.

glibc uses /sys/devices/system/cpu or /proc/cpuinfo for
computing the value of _SC_NPROCESSORS_CONF.

We can't use _SC_NPROCESSORS_CONF here because this constant
covers only configured CPUs, which may be brought online by
the OS.

You might have an empty socket with no configured CPU and
that would contribute towards possible cpus, but not configured
or online cpus.

It really seems to me that we need _SC_NPROCESSORS_MAX_NP to
indicate the physical maximum number of CPUs the OS can connect
given a particular topology. It could return -1 to indicate a
completely dynamic topology, in which case all we can do is call
sched_getaffinity with larger and larger buffers to find the
current maximum (and we can't cache it).

Does that clarify why I think we may need _SC_NPROCESSORS_MAX_NP?
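
As an aside: until something like _SC_NPROCESSORS_MAX_NP exists, one rough
way for userspace to size an affinity buffer up front is to parse the
kernel's /sys/devices/system/cpu/possible file. A sketch, assuming the
common case where that file contains a single "0-N" range (this is not
what glibc does today for _SC_NPROCESSORS_CONF):

#include <stdio.h>

/* Return the number of possible cpus, or -1 on failure.  Assumes the
   file holds a single range such as "0-63" (or a single "0").  */
static long
possible_cpus (void)
{
  FILE *f = fopen ("/sys/devices/system/cpu/possible", "r");
  long lo = 0, hi = 0;
  int n;

  if (f == NULL)
    return -1;
  n = fscanf (f, "%ld-%ld", &lo, &hi);
  fclose (f);
  if (n == 2)
    return hi + 1;
  if (n == 1)
    return lo + 1;
  return -1;
}

int
main (void)
{
  printf ("possible cpus: %ld\n", possible_cpus ());
  return 0;
}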

Cheers,
Carlos.



