This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH RFC] introduce dl_iterate_phdr_parallel


On Wed, Aug 03, 2016 at 07:26:26PM +0200, Torvald Riegel wrote:
> On Mon, 2016-08-01 at 23:41 +0300, Gleb Natapov wrote:
> > On Mon, Aug 01, 2016 at 10:19:55PM +0200, Torvald Riegel wrote:
> > > On Mon, 2016-08-01 at 23:07 +0300, Gleb Natapov wrote:
> > > > On Mon, Aug 01, 2016 at 09:46:35PM +0200, Torvald Riegel wrote:
> > > > > If we built something custom for this and are willing to make the
> > > > > wrlock / exclusive-access case much more costly, we can decrease this
> > > > > overhead.  This could be roughly similar to one lock per thread or a set
> > > > > of rwlocks as you mentioned, but with less space overhead.
> > > > > 
> > > > IMO space overhead is negligible. A more efficient rwlock is, of
> > > > course, better and could be useful in many more places. If you have
> > > > something to test I am willing to do so, but if a custom rwlock will
> > > > take time to materialize we may start with the lock array and change
> > > > it later. The lock array is not part of the interface, only an
> > > > implementation detail. What I would like to avoid is stalling the
> > > > effort while waiting for something better. Exception scalability is
> > > > a very pressing issue for us.
> > > 
> > > I haven't looked at the workload in detail, and so can't say off-hand
> > > what would be best and how long it would take.  Is there a summary
> > > describing the synchronization that is necessary?  Once I know that, I
> > > can give a closer estimate.  It could be simple to do, so it might be
> > > worth having a look before trying the (rw?)lock-array approach.
> > > 
> > rwlock semantics are pretty much what is needed. There is a list of
> > loaded objects that changes rarely (only during dlopen/dlclose), but it
> > is read on each exception.
> 
> Does "for read" mean that the only thing you are interested in is some
> consistent snapshot of the list, or something like a yes/no result for a
> query about whether some element is contained?
No. The list contains objects that hold pointers into other memory, and
holding the list lock is what prevents that memory from being unmapped.

> Is this list in memory that is never unmapped?  Can we construct it in
> such a way that we'll follow only a few pointers for which we don't
> quite know whether they point to proper elements of the list?
> 
> If we can do that, then we should be able to run an atomic snapshot
> speculatively.  IOW, do what many STMs do, or seqlocks, etc.
I do not see how.

> 
> > Exceptions on different threads
> > are independent, so they should not interfere with each other. Having
> > only one lock (even an rw one) means that all threads will access the
> > same atomic variable for synchronisation, which has a lot of overhead
> > on modern CPUs. The lock-array approach is a way to let multiple
> > threads proceed completely independently, without any inter-core
> > communication at all. It is hard to beat.
> 
> You can try to never modify memory in the common case, which is better.
Which memory? No memory besides the lock is modified in the common case.

> Additionally, the problem with the lock array is that you have to put
> the locks somewhere, preferably so that they are in a page that is local
> to where the thread runs. This makes a simple lock array suboptimal, at
> least on bigger systems.  TLS space would most likely be backed by
> local memory, but as Adhemerval has mentioned, we might have lots of
> threads, and if we do TLS, we need to sync between thread destruction
> and the heavyweight operation that wants to acquire all locks.
> 
The lock array is global; no TLS is involved, so there is no problem of
the kind you describe.
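
Roughly, the idea looks like the sketch below (illustrative only, not
the actual patch; NLOCKS, the hash and the function names are made up).
A reader takes a single slot derived from its own thread ID, so readers
on different cores normally touch different cache lines; only
dlopen/dlclose has to take every slot:

#include <pthread.h>
#include <stdint.h>

#define NLOCKS 64

/* One rwlock per slot, padded to its own cache line to avoid false
   sharing between readers on different cores.  */
struct slot
{
  pthread_rwlock_t lock;
} __attribute__ ((aligned (64)));

static struct slot locks[NLOCKS] =
  { [0 ... NLOCKS - 1] = { .lock = PTHREAD_RWLOCK_INITIALIZER } };

static pthread_rwlock_t *
my_slot (void)
{
  /* Any cheap, stable per-thread value works as a hash input; casting
     pthread_self like this is a Linux-specific shortcut.  */
  return &locks[(uintptr_t) pthread_self () % NLOCKS].lock;
}

/* Reader side (dl_iterate_phdr): one mostly uncontended slot, no
   cache line shared with readers running on other cores.  */
pthread_rwlock_t *
list_read_lock (void)
{
  pthread_rwlock_t *l = my_slot ();
  pthread_rwlock_rdlock (l);
  return l;
}

void
list_read_unlock (pthread_rwlock_t *l)
{
  pthread_rwlock_unlock (l);
}

/* Writer side (dlopen/dlclose): deliberately expensive, takes all slots.  */
void
list_write_lock (void)
{
  for (int i = 0; i < NLOCKS; i++)
    pthread_rwlock_wrlock (&locks[i].lock);
}

void
list_write_unlock (void)
{
  for (int i = NLOCKS - 1; i >= 0; i--)
    pthread_rwlock_unlock (&locks[i].lock);
}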

> > The new rwlock is substantially slower on only 4
> > cores (it is much better than the current rwlock since it avoids
> > entering the kernel if there are only readers).
> 
> Is this a microbenchmark just for acquisition/release of the rwlock, or
> does this significant difference also show up in the real scenario you
> are concerned about (the exception storm you mentioned)?
That's on a microbenchmark, but it simulates pretty closely what
sometimes happens on our real system when multiple threads encounter a
burst of exceptional conditions (a network connection drops and all
queued requests, sometimes thousands, terminate with an exception).
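
The shape of the microbenchmark is basically many threads doing what
the unwinder does on every throw, i.e. calling dl_iterate_phdr in a
tight loop. A simplified stand-in (not the exact benchmark I ran;
thread and iteration counts are arbitrary):

#define _GNU_SOURCE
#include <link.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS    100000

/* The unwinder's callback does real work; an almost-empty one is
   enough to measure the cost of the iteration's locking itself.  */
static int
count_cb (struct dl_phdr_info *info, size_t size, void *data)
{
  (void) info; (void) size;
  ++*(unsigned long *) data;
  return 0;                       /* 0 = continue with the next object */
}

static void *
worker (void *arg)
{
  unsigned long objects = 0;
  for (int i = 0; i < ITERS; i++)
    dl_iterate_phdr (count_cb, &objects);
  return (void *) objects;
}

int
main (void)
{
  pthread_t t[NTHREADS];
  for (int i = 0; i < NTHREADS; i++)
    pthread_create (&t[i], NULL, worker, NULL);
  for (int i = 0; i < NTHREADS; i++)
    pthread_join (t[i], NULL);
  puts ("done");
  return 0;
}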

> 
> > > The POSIX rwlock's complexity is partly due to having to support the
> > > different modes (eg, reader preference), so a custom rwlock for a
> > > specific use case could be much simpler.  I would guess that it looks
> > > more like reference counting, or if the snapshots of the readers are
> > > suitable, it could also be seqlock based (and thus require no memory
> > > modification by readers at all).
> > > 
> > I evaluated the seqlock approach, but snapshotting the object list
> > requires a pretty deep copy (for the exception-handling case)
> 
> Why do you need to copy? 
We need to copy because if we access this data without holding the lock
it may disappear.
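
A seqlock-style read would look roughly like the sketch below
(illustrative, not glibc code, and simplified with respect to the C11
memory model). The reader must copy everything it is going to use
before the validation succeeds; that copy is exactly the deep copy I
mean, because the user callback later dereferences pointers that no
shallow snapshot can cover:

#include <stdatomic.h>
#include <string.h>

/* Stand-in for the data a reader would have to snapshot.  */
struct snapshot { int dummy; /* ... */ };

static _Atomic unsigned int seq;   /* even: stable, odd: write in progress */
static struct snapshot shared;

/* Reader: copy, then validate; retry if a writer ran concurrently.  */
static void
read_snapshot (struct snapshot *out)
{
  for (;;)
    {
      unsigned int s0 = atomic_load_explicit (&seq, memory_order_acquire);
      if (s0 & 1)
        continue;                    /* writer active, try again */
      memcpy (out, &shared, sizeof *out);
      atomic_thread_fence (memory_order_acquire);
      unsigned int s1 = atomic_load_explicit (&seq, memory_order_relaxed);
      if (s0 == s1)
        return;                      /* nothing changed, snapshot is valid */
    }
}

/* Writer: assumed to be serialized externally (e.g. by the loader's
   own lock around dlopen/dlclose).  */
static void
write_snapshot (const struct snapshot *in)
{
  unsigned int s = atomic_load_explicit (&seq, memory_order_relaxed);
  atomic_store_explicit (&seq, s + 1, memory_order_relaxed);   /* now odd */
  atomic_thread_fence (memory_order_release);
  shared = *in;
  atomic_store_explicit (&seq, s + 2, memory_order_release);   /* even again */
}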
 
>                          Can you point me to the list datastructure you
> need to snapshot?
> 
Look at __dl_iterate_phdr(), where it iterates over the
GL(dl_ns)[ns]._ns_loaded list. It calls the provided callback with a
structure that holds addresses which are protected from being unmapped
by the same lock that protects the list from being modified. This is
the list of all loaded dynamic libraries.
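
For reference, the public interface looks like this (the callback is
just an example); the fields it touches -- dlpi_name, dlpi_addr,
dlpi_phdr/dlpi_phnum -- are exactly the pointers that must stay mapped
while the callback runs:

#define _GNU_SOURCE
#include <link.h>
#include <stdio.h>

/* dlpi_name and dlpi_phdr point into memory owned by the loader and
   the mapped .so files; they are only safe to dereference while the
   iteration holds the lock that also keeps dlclose from unmapping
   them.  */
static int
show_object (struct dl_phdr_info *info, size_t size, void *data)
{
  (void) size; (void) data;
  printf ("%s loaded at %p with %d program headers\n",
          info->dlpi_name[0] ? info->dlpi_name : "(main program)",
          (void *) info->dlpi_addr, (int) info->dlpi_phnum);
  return 0;                       /* continue with the next object */
}

int
main (void)
{
  dl_iterate_phdr (show_object, NULL);
  return 0;
}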

> > and the data accessed
> > varies for each user of dl_iterate_phdr (elements of the list point into
> > mmapped parts of .so files).
> 
> That doesn't matter.  The only overhead is how often we need to validate
> so that we know that a subsequent pointer dereference is safe.  The
> number of these validations compared to a lock acquisition+release is the
> deciding factor here.
Pointer dereferencing happens outside of glibc, in a user-provided
callback.

--
			Gleb.

