This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)


On Tue, Jul 21, 2015 at 06:18:03PM +0000, Mathieu Desnoyers wrote:
> ----- On Jul 21, 2015, at 2:00 PM, Ondřej Bílka neleai@seznam.cz wrote:
> 
> > On Tue, Jul 21, 2015 at 05:45:26PM +0000, Mathieu Desnoyers wrote:
> >> ----- On Jul 21, 2015, at 11:16 AM, Ondřej Bílka neleai@seznam.cz wrote:
> >> 
> >> > On Tue, Jul 21, 2015 at 12:58:13PM +0000, Mathieu Desnoyers wrote:
> >> >> ----- On Jul 21, 2015, at 3:30 AM, Ondřej Bílka neleai@seznam.cz wrote:
> >> >> 
> >> >> > On Tue, Jul 21, 2015 at 12:25:00AM +0000, Mathieu Desnoyers wrote:
> >> >> >> >> Does it solve the Wine problem?  If Wine uses gs for something and
> >> >> >> >> calls a function that does this, Wine still goes boom, right?
> >> >> >> > 
> >> >> >> > So the advantage of just making a global segment descriptor available
> >> >> >> > is that it's not *that* expensive to just save/restore segments. So
> >> >> >> > either wine could do it, or any library users would do it.
> >> >> >> > 
> >> >> >> > But anyway, I'm not sure this is a good idea. The advantage of it is
> >> >> >> > that the kernel support really is _very_ minimal.
> >> >> >> 
> >> >> >> Considering that we'd at least also want this feature on ARM and
> >> >> >> PowerPC 32/64, and that the gs segment selector approach clashes with
> >> >> >> existing apps (wine), I'm not sure that implementing a gs segment
> >> >> >> selector based approach to cpu number caching would lead to an overall
> >> >> >> decrease in complexity if it leads to performance similar to those of
> >> >> >> portable approaches.
> >> >> >> 
> >> >> >> I'm perfectly fine with architecture-specific tweaks that lead to
> >> >> >> fast-path speedups, but if we have to bite the bullet and implement
> >> >> >> an approach based on TLS and registering a memory area at thread start
> >> >> >> through a system call on other architectures anyway, it might end up
> >> >> >> being less complex to add a new system call on x86 too, especially if
> >> >> >> fast path overhead is similar.
> >> >> >> 
> >> >> >> But I'm inclined to think that some aspect of the question eludes me,
> >> >> >> especially given the amount of interest generated by the gs-segment
> >> >> >> selector approach. What am I missing ?
> >> >> >> 
> >> >> > As I wrote before, you don't have to bite that bullet. It suffices to
> >> >> > create a 128k-element array with the current cpu for each tid, make
> >> >> > that an mmapable file, and userspace could get the cpu with nearly the
> >> >> > same performance, without hacks.
> >> >> 
> >> >> I don't see how this would be acceptable on memory-constrained embedded
> >> >> systems. They have multiple cores, and performance requirements, so
> >> >> having a fast getcpu would be useful there (e.g. telecom industry),
> >> >> but they clearly cannot afford a 512kB table per process just for that.
> >> >> 
> >> > Which just means that you need a more complicated api and implementation
> >> > for that, but the idea stays the same. You would need
> >> > register/deregister_cpuid_idx syscalls that hand out an index to use
> >> > instead of the tid. The kernel would need to handle many ids being
> >> > registered per thread and to resize the mmapped file in those syscalls.
> >> 
> >> I feel we're talking past each other here. What I propose is to implement
> >> a system call that registers a TLS area. It can be invoked at thread start.
> >> The kernel can then keep the current CPU number within that registered
> >> area up-to-date. This system call does not care how the TLS is implemented
> >> underneath.
> >> 
> >> My understanding is that you are suggesting a way to speed up TLS accesses
> >> by creating a table indexed by TID. Although it might lead to interesting
> >> speed ups useful when reading the TLS, I don't see how your proposal is
> >> useful in addressing the problem of caching the current CPU number (other
> >> than possibly speeding up TLS accesses).
> >> 
> >> Or am I missing something fundamental to your proposal ?
> >>
> > No, I am still talking about getting the cpu number. My first proposal is
> > that the kernel allocates a table of current cpu numbers indexed by tid. A
> > process could mmap that and get the cpu with cpu_tid_table[tid]. As you
> > said that size is a problem, I replied that you need to be more careful:
> > instead of the tid you use a different id that you get with, say,
> > register_cpucache, store it in a tls variable, and get the cpu with
> > cpu_cid_table[cid]. That limits the space used to only the threads that
> > use this.
> > 
> > The tls speedup was a side remark: if you implemented a per-cpu page then
> > you could speed up tls as well. Fast tls access and getting the tid are
> > equivalent, as you could easily implement one with the other.
> 
> Thanks for the clarification. There is then a fundamental question
> I need to ask: what is the upside of going for a dedicated array of
> current cpu number values rather than using a TLS variable ?
> The main downside I see with the array of cpu number is false sharing
> caused by having many current cpu number variables sitting on the same
> cache line. It seems like an overall performance loss there.
>
It's considerably simpler to implement: you don't need to mark tls pages
to avoid a page fault in the context switch, and you avoid the security
issues where an attacker could try to unmap the tls area for a possible
privilege escalation if the kernel would then write into a different
process's memory, etc.
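
Roughly the userspace side I have in mind, as a sketch only --
register_cpucache() and the table fd below are hypothetical names, not an
existing interface:

    /* Hypothetical cid-indexed table: the kernel keeps one current-cpu
     * slot per registered thread in a shared read-only mapping and hands
     * each thread an index (cid) into it.  register_cpucache() and the
     * table fd are made-up names, not an existing ABI. */
    #include <stdint.h>
    #include <sys/mman.h>

    extern uint32_t register_cpucache(void);    /* hypothetical syscall wrapper */

    static volatile uint32_t *cpu_cid_table;    /* kernel-updated, mapped once */
    static __thread uint32_t my_cid;            /* this thread's slot index */

    static int cpucache_map(int table_fd, size_t table_size)
    {
            cpu_cid_table = mmap(NULL, table_size, PROT_READ,
                                 MAP_SHARED, table_fd, 0);
            return cpu_cid_table == MAP_FAILED ? -1 : 0;
    }

    static void cpucache_thread_init(void)
    {
            my_cid = register_cpucache();       /* ask the kernel for a slot */
    }

    static inline unsigned int fast_getcpu(void)
    {
            /* Plain load; the kernel rewrites the slot on context switch. */
            return cpu_cid_table[my_cid];
    }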

And as for false sharing, it simply doesn't matter. The table is mostly
read-only, written only on a context switch, so the entries will be
resident in cache. Also, when you migrate to another cpu you get the same
cache miss with a tls variable, so it comes out the same.
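
For comparison, the tls-area registration Mathieu describes would look
roughly like this on the user side (again only a sketch;
getcpu_cache_register() is a placeholder wrapper, not the proposed
syscall's final name or signature):

    /* Per-thread registration, as in the getcpu_cache proposal: each
     * thread tells the kernel where its cached cpu number lives and the
     * kernel keeps that word up to date across migrations.
     * getcpu_cache_register() is a placeholder, not the real ABI. */
    #include <stdint.h>

    extern int getcpu_cache_register(volatile int32_t *addr);  /* placeholder */

    static __thread volatile int32_t cpu_cache = -1;

    static void cpucache_thread_start(void)
    {
            /* Invoked once at thread start, as the RFC describes. */
            getcpu_cache_register(&cpu_cache);
    }

    static inline int fast_getcpu_tls(void)
    {
            /* Single TLS load; stays -1 until registration takes effect. */
            return cpu_cache;
    }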

