This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Linux: Implement interfaces for memory protection keys



On 06/11/2017 18:01, Florian Weimer wrote:
> On 11/06/2017 05:49 PM, Adhemerval Zanella wrote:
>>>
>>> What's the general feeling about these interfaces?  Shall we add them, despite their limitations (especially with regard to key reuse)?
>>>
>>> Then I'll write documentation.
>>
>> This is currently only supported on x86 and I am not sure if it will be
>> ever used by other architectures, but Linux thinks its interface is
>> sufficient generic since it decided to extend it be part of generic
>> ABI.  I also do not foresee this interface to be deprecated by any other
>> soon, so I think it should be a valuable adition.
>>
>> Do you know any project that is already using it?
> 
> I'm not aware of any users.  To be honest, I just needed a distraction over the weekend.
> 
>> On 04/11/2017 15:49, Florian Weimer wrote:
>>> This adds system call wrappers for pkey_alloc, pkey_free, pkey_mprotect,
>>> and x86-64 implementations of pkey_get and pkey_set, which abstract over
>>> the PKRU CPU register and hide the actual number of memory protection
>>> keys supported by the CPU.
>>>
>>> The system call wrapers use unsigned int instead of unsigned long for
>>> parameters, so that no special treatment for x32 is needed.  The flags
>>> argument is currently unused, and the access rights bit mask is limited
>>> to two bits by the current PKRU register layout anyway.
>>
>> AFAIK there is no need to special handling of unsigned long, which is
>> 32 bits on x32.  There were issues of using {INLINE,INTERNAL}_SYSCALL
>> along with 64 bits arguments (off_t for instance) and it should be fixed
>> by 78ca091cdd2c.  The only special handling is required for syscalls that
>> return 64 bits arguments, for instance 'times', which can't be handled
>> by {INLINE,INTERNAL}_SYSCALLS (x32 times.c get around by redefining
>> internal_syscall1 to return a 64 bits value).
> 
> Oh, I was worried that these arguments were expected to be true 64-bit values eventually.  Not today, but perhaps in the future.   Then using unsigned long would be no good on x32.  But this does not seem to be the case, so we can just use unsigned int, to make clear that these arguments are actually 32 bits only.

Even they were 64 bits argument, current {INLINE,INTERNAL}_SYSCALL on
x32 pass them correctly in syscall arguments.

> 
>> So I think there is no reason to use a different symbol signature than
>> kernel.
> 
> I still think using unsigned long should be avoided.

Is your objection because it will have different argument types for 32
and 64 bits? Currently for x86 indeed it does not make difference since
for 'wrpkru' instruction on x86_64 the high-order 32-bits of RCX, RDX 
and RAX are ignored.

But if we aim to make it a generic API I think we should stick to what
kernel provides for main reason we can not foresee what kind of usage
future architecture might do with this value.

> 
>>> +/* Compare the two numbers LEFT and RIGHT and report failure if they
>>> +   are different.  */
>>> +#define TEST_COMPARE(left, right)                                       \
>>> +  ({                                                                    \
>>> +    __typeof__ (left) __left_value = (left);                            \
>>> +    __typeof__ (right) __right_value = (right);                         \
>>> +    _Static_assert (sizeof (__left_value) <= sizeof (long long),        \
>>> +                    "left value fits into long long");                  \
>>> +    _Static_assert (sizeof (__right_value) <= sizeof (long long),       \
>>> +                    "right value fits into long long");                 \
>>> +    if (__left_value != __right_value                                   \
>>> +        || ((__left_value > 0) != (__right_value > 0)))                 \
>>> +      support_test_compare_failure                                      \
>>> +        (__FILE__, __LINE__,                                            \
>>> +         #left, __left_value, __left_value > 0,                         \
>>> +         #right, __right_value, __right_value > 0);                     \
>>> +  })
>>> +
>>
>> I think the macro name should be more explicit since it is not only
>> comparing two number but also their size.
> 
> Huh?  No, it checks that conversion to long long plus the sign bit is not lossy and that the signs match.  So I think the name is accurate.

Yeah, but it ties some type (long long) and also some sign (since it
is comparing if both sides are positive) to a quite generic name.
Maybe TEST_COMPARE_LL_INT_POSITIVE?

> 
>>
>>> +/* Internal implementation of TEST_COMPARE.  LEFT_POSITIVE and
>>> +   RIGHT_POSITIVE are used to fit both unsigned long long and long
>>> +   long arguments into LEFT_VALUE and RIGHT_VALUE.  */
>>> +void support_test_compare_failure (const char *file, int line,
>>> +                                   const char *left_expr,
>>> +                                   long long left_value,
>>> +                                   int left_positive,
>>> +                                   const char *right_expr,
>>> +                                   long long right_value,
>>> +                                   int right_positive);
>>> +
>>>   /* Internal function called by the test driver.  */
>>>   int support_report_failure (int status)
>>>     __attribute__ ((weak, warn_unused_result));
>>> diff --git a/support/support_test_compare_failure.c b/support/support_test_compare_failure.c
>>> new file mode 100644
>>> index 0000000000..38fec1ca89
>>> --- /dev/null
>>> +++ b/support/support_test_compare_failure.c
>>> @@ -0,0 +1,46 @@
>>> +/* Reporting mumeric comparison failure.
>>
>> Typo.
> 
> Thanks, fixed.
> 
>>> diff --git a/sysdeps/unix/sysv/linux/bits/mman-linux.h b/sysdeps/unix/sysv/linux/bits/mman-linux.h
>>> index b091181960..da5ec79334 100644
>>> --- a/sysdeps/unix/sysv/linux/bits/mman-linux.h
>>> +++ b/sysdeps/unix/sysv/linux/bits/mman-linux.h
>>> @@ -109,3 +109,38 @@
>>>   # define MCL_ONFAULT    4        /* Lock all pages that are
>>>                          faulted in.  */
>>>   #endif
>>> +
>>> +/* Memory protection key support.  */
>>> +#ifdef __USE_GNU
>>> +
>>> +/* FLags for pkey_alloc.  */
>>> +# define PKEY_DISABLE_ACCESS 0x1
>>> +# define PKEY_DISABLE_WRITE 0x2
>>
>> Typo on 'FLags'.
> 
> Fixed.
> 
>>> diff --git a/sysdeps/unix/sysv/linux/syscalls.list b/sysdeps/unix/sysv/linux/syscalls.list
>>> index 40c4fbb9ea..6f657eea2e 100644
>>> --- a/sysdeps/unix/sysv/linux/syscalls.list
>>> +++ b/sysdeps/unix/sysv/linux/syscalls.list
>>> @@ -110,3 +110,6 @@ setns        EXTRA    setns        i:ii    setns
>>>   process_vm_readv EXTRA    process_vm_readv i:ipipii process_vm_readv
>>>   process_vm_writev EXTRA    process_vm_writev i:ipipii process_vm_writev
>>>   memfd_create    EXTRA    memfd_create    i:si    memfd_create
>>> +pkey_alloc    EXTRA    pkey_alloc    i:ii    pkey_alloc
>>> +pkey_free    EXTRA    pkey_free    i:i    pkey_free
>>> +pkey_mprotect    EXTRA    pkey_mprotect    i:aiii  pkey_mprotect
>>
>> I do not think we can them as default since build against kernel headers
>> older than v4.9 won't have __NR_pkey_{alloc,free,mprotect).  We need
>> default implementation with #ifdef __NR_....
> 
> Hmm.  I checked that we'd fall back to an ENOSYS stub, but this was for memfd_create, which had the stub in the generic API.  That won't apply here, so I will have to rework this a bit.
> 
> I will sent an updated patch.
> 
>>> +/* Return the value of the PKRU register.  */
>>> +static inline unsigned int
>>> +pkey_read (void)
>>> +{
>>> +  unsigned int result;
>>> +  __asm__ volatile (".byte 0x0f, 0x01, 0xee"
>>> +                    : "=a" (result) : "c" (0) : "rdx");
>>> +  return result;
>>> +}
>>
>> I think we should also constrain 'edx' since it is explict cleared on
>> 'rdpkru' instruction:
>>
>>    static inline unsigned int
>>    pkey_read (void)
>>    {
>>      unsigned int result, edx;
>>      __asm__ volatile (".byte 0x0f, 0x01, 0xee"
>>                        : "=a" (result), "=d" (edx)
>>               : "c" (0));
>>      return result;
>>    }
> 
> I think the clobber for rdx takes care of that.
> 
>> I think the kernel example to implement pkey_get is more parametrized and
>> may lead to a more portable code if the idea is to extend this syscalls
>> to other architectures than x86.  It adds a arch-specific PKRU_BITS_PER_PKEY
>> and sets the mask based on current supported flags (which I think won't
>> be extended at least for x86 since there is no more available unused bits):
> 
> Sorry, I disagree with that.  Premature generalization usually does not work out because you don't know which parts to generalize.
> 
> If we get key revocation on pkey_alloc, pkey_set would have to turn into a vsyscall anyway (where the code sequence is known to the kernel, and the revocation code can override the PKRU value the pkey_set implementation has read, thus closing the race).

Alright, that is fair point. I am thinking more about the sense of
what should be generic such a maximum number of bits for the mask,
the possible mask value and to set them on the internal flag.  But
I think if it were the case in the future we can work a more generic
implementation.

> 
>>> +int
>>> +pkey_set (int key, unsigned int rights)
>>> +{
>>> +  if (key < 0 || key > 15 || rights > 3)
>>> +    {
>>> +      __set_errno (EINVAL);
>>> +      return -1;
>>> +    }
>>> +  unsigned int mask = 3 << (2 * key);
>>> +  unsigned int pkru = pkey_read ();
>>> +  pkru = (pkru & ~mask) | (rights << (2 * key));
>>> +  pkey_write (pkru);
>>> +  return 0;
>>> +}
>>
>> According to this LWN article [2] and man pages [3], the pkey 0 is set
>> aside for the default one and this key will never be allocated with
>> pkey_alloc(), and its restrictions cannot be changed.  Assuming this
>> is still true (since the article is based on non upstream code), what
>> happen if try issue a pkey_set with a key 0? Does the kernel silent
>> ignore it or trap with an invalid instruction?
> 
> I think userspace can write whatever it wants to the PKRU register.  The register update is always valid.  It's just that some values will not be very useful.
> 
> I added
> 
>   TEST_COMPARE (pkey_set (0, PKEY_DISABLE_WRITE), 0);
> 
> to the test case, and it crashes here:
> 
>    0x00000000004023c3 <+2083>:  xor    %edi,%edi
>    0x00000000004023c5 <+2085>:  mov    $0x2,%esi
>    0x00000000004023ca <+2090>:  callq  0x4013d0 <pkey_set@plt>
>    0x00000000004023cf <+2095>:  cmp    $0x0,%eax
>    0x00000000004023d2 <+2098>:  jne    0x4026d0 <do_test+2864>
>    0x00000000004023d8 <+2104>:  mov    $0x6062e0,%ebx
>    0x00000000004023dd <+2109>:  mov    (%rbx),%rdi
>    0x00000000004023e0 <+2112>:  mov    %r12,%rsi
>    0x00000000004023e3 <+2115>:  add    $0x8,%rbx
> => 0x00000000004023e7 <+2119>:  callq  0x403730 <xmunmap>
> 
> This is the first memory write after the PKRU update.  It is exactly what I expected because the stack has key 0, and we revoked write access.  With PKEY_DISABLE_ACCESS, things get a bit weird because the coredump writer can't read the memory.  But with single-stepping, I was able to confirm that the SIGSEGV happens on the retq instruction in pkey_set, which is again as expected.
> 
> So I don't think we need to document key 0 as special in any way.  It is just a key that will never be returned from pkey_alloc, so unexpected things may happen.  I think key 1 is currently special as well because it is PKEY_DISABLE_ACCESS, and execute-only mappings will use that.
> 
> Thanks,
> Florian

What about new pkey_mprotect (..., 0) after pkey_set (0, ...)? Will it follow
the new key setting for key 0 or kernel will use a default value?

My question is of you think we need to filter and/or document key 0 (and 1 as
you indicate) as being special?


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]