This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH RFC] introduce dl_iterate_phdr_parallel


On Thu, Aug 04, 2016 at 07:58:19PM +0200, Torvald Riegel wrote:
> On Thu, 2016-08-04 at 20:37 +0300, Gleb Natapov wrote:
> > On Thu, Aug 04, 2016 at 07:16:55PM +0200, Torvald Riegel wrote:
> > > On Wed, 2016-08-03 at 20:47 +0300, Gleb Natapov wrote:
> > > > On Wed, Aug 03, 2016 at 07:26:26PM +0200, Torvald Riegel wrote:
> > > > > On Mon, 2016-08-01 at 23:41 +0300, Gleb Natapov wrote:
> > > > > > On Mon, Aug 01, 2016 at 10:19:55PM +0200, Torvald Riegel wrote:
> > > > > > > On Mon, 2016-08-01 at 23:07 +0300, Gleb Natapov wrote:
> > > > > > > > On Mon, Aug 01, 2016 at 09:46:35PM +0200, Torvald Riegel wrote:
> > > > > > > > > If we built something custom for this and are willing to make the
> > > > > > > > > wrlock / exclusive-access case much more costly, we can decrease this
> > > > > > > > > overhead.  This could be roughly similar to one lock per thread or a set
> > > > > > > > > of rwlocks as you mentioned, but with less space overhead.
> > > > > > > > > 
> > > > > > > > IMO space overhead is negligible. A more efficient rwlock is, of course,
> > > > > > > > better and can be useful in many more places. If you have something to
> > > > > > > > test I am willing to do so, but if a custom rwlock will take time to
> > > > > > > > materialize we may start with the lock array and change it later. The lock
> > > > > > > > array is not part of the interface, but an implementation detail. What I
> > > > > > > > would like to avoid is stalling the effort while waiting for something
> > > > > > > > better. Exception scalability is a very pressing issue for us.
> > > > > > > 
> > > > > > > I haven't looked at the workload in detail, and so can't say off-hand
> > > > > > > what would be best and how long it would take.  Is there a summary
> > > > > > > describing the synchronization that is necessary?  Once I know that, I
> > > > > > > can give a closer estimate.  It could be simple to do, so it might be
> > > > > > > worth having a look before trying the (rw?)lock-array approach.
> > > > > > > 
> > > > > > rwlock semantics is pretty much what is needed. There is a list of
> > > > > > loaded objects that changes rarely (only during dlopen/dlclose), but it
> > > > > > is accessed for read on each exception.
> > > > > 
> > > > > Does "for read" mean that the only thing you are interested in is some
> > > > > consistent snapshot of the list, or something like a yes/no result for a
> > > > > query about whether some element is contained?
> > > > No. The list contains objects that have pointers into some other memory that
> > > > the locking of the list prevents from being unmapped.
> > > 
> > > Then maybe something more specific to this case like hazard pointers
> > > would be better.  This could be used for other things in glibc too, and
> > > you wouldn't need atomic RMW as for the lock (which still have more
> > > overhead than stores, even though the RMWs are not as costly anymore on,
> > > for example, x86 as in the past).
> > > 
> > Access is done outside of glibc's control. User code does a regular
> > dereference.
> 
> https://en.wikipedia.org/wiki/Hazard_pointer
> 
Can you elaborate on how this can be used to solve the problem of user
code accessing an object's memory that can be unmapped?
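
As far as I understand it, the textbook read side looks roughly like the
sketch below (illustrative C only, not a proposal; MAX_READERS and the
helper names are made up). The publish/re-validate step has to wrap every
dereference, and here the dereferences happen in user callbacks, outside
of glibc:

  #include <stdatomic.h>
  #include <stddef.h>

  enum { MAX_READERS = 128 };           /* arbitrary for the sketch */

  /* One published slot per reader thread.  */
  static _Atomic(void *) hazard_slot[MAX_READERS];

  static void *
  hp_acquire (int self, _Atomic(void *) *src)
  {
    void *p;
    do
      {
        p = atomic_load (src);
        atomic_store (&hazard_slot[self], p);   /* publish */
      }
    while (p != atomic_load (src));             /* re-validate */
    return p;   /* safe to dereference until the slot is cleared */
  }

  static void
  hp_release (int self)
  {
    atomic_store (&hazard_slot[self], NULL);
  }

  /* A reclaimer (dlclose here) may unmap an object only after scanning
     hazard_slot[] and finding no published pointer into it.  */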

> It can be used for reference-counting-like things too, like this
> example.  The locks you suggest have the same goal, but a different
> implementation.
> 
The locks I suggest prevent the list from being modified while any object on
the list is accessed; they do not try to make link manipulation lockless.
There is no point in doing so since list modification is rare.
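
Concretely, the read side takes just one slot of a global array of rwlocks
and dlopen/dlclose takes all of them. A simplified sketch (the names and
the way a thread is hashed to a slot are illustrative, not the patch as
posted):

  #include <pthread.h>

  #define NLOCKS 64     /* see below for why 64 */

  static pthread_rwlock_t dl_locks[NLOCKS] =
    { [0 ... NLOCKS - 1] = PTHREAD_RWLOCK_INITIALIZER };

  /* Reader side (dl_iterate_phdr): lock a single slot, so concurrent
     readers rarely contend on the same lock.  */
  static pthread_rwlock_t *
  read_lock_one (void)
  {
    /* Illustrative hash only; pthread_t is not portably an integer.  */
    size_t slot = (size_t) pthread_self () % NLOCKS;
    pthread_rwlock_rdlock (&dl_locks[slot]);
    return &dl_locks[slot];
  }

  /* Writer side (dlopen/dlclose): lock every slot, which is expensive
     but rare.  Unlocking is symmetric and omitted here.  */
  static void
  write_lock_all (void)
  {
    for (int i = 0; i < NLOCKS; i++)
      pthread_rwlock_wrlock (&dl_locks[i]);
  }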

> > > > > Additionally, the problem with the lock array is that you have to put
> > > > > the locks somewhere, preferably so that they are in a page that is local
> > > > > to where the thread runs. This makes a simple lock array suboptimal, at
> > > > > least on bigger systems.  TLS space would be most likely to be backed by
> > > > > local memory, but as Adhemerval has mentioned, we might have lots of
> > > > > threads, and if we do TLS, we need to sync between thread destruction
> > > > > and the heavyweight operation that wants to acquire all locks.
> > > > > 
> > > > The lock array is global; no TLS is involved. There is no problem of the
> > > > kind you describe.
> > > 
> > > If the lock array is global, it will be backed by memory on a single
> > > NUMA node.  This is inefficient if you access it from a CPU on a
> > > different NUMA node in a larger system.  You could split the array
> > > between NUMA nodes, but we don't have a mechanism to access CPU-local
> > > memory efficiently right now (it can be approximated, or one can try to
> > > amortize over several calls, but that's it).
> > > So the problem I am describing is having to make a choice between this
> > > problem and the problems that come with putting the lock into TLS.
> > > 
> > The NUMA problem is no different from having just one lock as we do
> > now. Can we do something better? Yes, a per-thread lock for instance,
> > but I do not see a point in a more complex solution without evidence of
> > real performance issues in the proposed one.
> 
> Have you measured it?
I did not, but as I said it is an improvement over the current situation,
which has exactly the same problem. "Better is the enemy of good." Do we
want to search for a perfect solution forever, or have a good one now and
improve it afterwards if the need arises?

> 
> You are trying to make a performance improvement, yet we have no
> evidence that what you picked is actually useful beyond being better
> than a single lock. 
I showed that what I picked is much better than a single lock. Its
usefulness is in being better. What else should have been shown?

>                      Why 64 locks, for example.  Why haven't you put
> them at least on separate cache lines?  Is this just an arbitrary choice
> (which might be fine, but then we need to make this clear in the code)?
Putting them on separate cache lines is definitely a good idea. And no, I
do not have a scientific explanation for 64; it is a reasonable number of
cores for a modern medium-sized server.
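
Something like this would do it (illustrative sketch only):

  #include <pthread.h>

  #define CACHELINE 64   /* typical line size on current x86 */

  /* Give each lock its own cache line so readers hitting different
     slots do not false-share.  The aligned attribute also rounds the
     struct size up to a multiple of CACHELINE.  */
  struct padded_lock
  {
    pthread_rwlock_t lock;
  } __attribute__ ((aligned (CACHELINE)));

  static struct padded_lock dl_locks[64] =
    { [0 ... 63] = { PTHREAD_RWLOCK_INITIALIZER } };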

> 
> > > > > > The new rwlock is substantially slower on only 4
> > > > > > cores (it is much better than the current rwlock since it avoids
> > > > > > entering the kernel if there are only readers).
> > > > > 
> > > > > Is this a microbenchmark just for acquisition/release of the rwlock, or
> > > > > does this significant difference also show up in the real scenario you
> > > > > are concerned about (the exception storm you mentioned)?
> > > > That's on a micro benchmark. But it simulates pretty closely what
> > > > sometimes happens on our real system when multiple threads encounter a
> > > > burst of exceptional conditions (a network connection drops and all
> > > > queued requests, sometimes thousands, terminate with an exception).
> > > 
> > > So the micro benchmark does the exception stuff in the rwlock critical
> > > sections (e.g., you throw exceptions from multiple threads concurrently
> > > and measure overall throughput)?
> > What do you mean by "does the exception stuff in the rwlock critical
> > sections"? Stack unwinding happens in rwlock critical sections; the
> > benchmark just throws exceptions concurrently.
> 
> That was my question (see the example in there).
> 
Where should I see the example?
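
In any case, the shape of the benchmark is nothing more than the following
(a simplified C approximation that calls dl_iterate_phdr directly instead
of throwing; the real benchmark throws C++ exceptions, which reach
dl_iterate_phdr through the unwinder):

  #define _GNU_SOURCE
  #include <link.h>
  #include <pthread.h>

  #define NTHREADS 64
  #define ITERS    100000

  static int
  count_cb (struct dl_phdr_info *info, size_t size, void *data)
  {
    *(size_t *) data += info->dlpi_phnum;   /* touch each object */
    return 0;                               /* keep iterating */
  }

  static void *
  worker (void *arg)
  {
    size_t sum = 0;
    for (int i = 0; i < ITERS; i++)
      dl_iterate_phdr (count_cb, &sum);
    return (void *) sum;                    /* keep the work live */
  }

  int
  main (void)
  {
    pthread_t threads[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
      pthread_create (&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
      pthread_join (threads[i], NULL);
    return 0;
  }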

> > > 
> > > > > > and the data accessed
> > > > > > varies for each user of dl_iterate_phdr (elements of the list point into
> > > > > > mmaped parts of .so files).
> > > > > 
> > > > > That doesn't matter.  The only overhead is how often we need to validate
> > > > > so that we know that a subsequent pointer dereference is safe.  The
> > > > > number of these validations compared to a lock acquisition+release is the
> > > > > deciding factor here.
> > > > Pointer dereferencing happens outside of glibc in a user provided
> > > > callback.
> > > 
> > > Ah, right.  I must have skipped the last part of the last sentence you
> > > wrote previously.
> > > 
> > > Is there a way to make callbacks only work on memory that is not going
> > > to be unmapped (e.g., under the assumption that the user won't ever
> > > dlclose stuff that's currently needed for any exception processing)?
> > > 
> > A callback may work on any memory of a loaded object. That's the point of
> > the interface: to run some code on each loaded object.
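
For instance, an unwinder-style callback scans the program headers of each
object and follows pointers into the object's mapped data. Roughly
(illustrative only):

  #define _GNU_SOURCE
  #include <link.h>

  /* Both info->dlpi_phdr and the data it leads to live in memory that
     dlclose would unmap; that is what the lock must keep mapped while
     the callback runs.  */
  static int
  find_eh_frame_cb (struct dl_phdr_info *info, size_t size, void *data)
  {
    for (int i = 0; i < info->dlpi_phnum; i++)
      if (info->dlpi_phdr[i].p_type == PT_GNU_EH_FRAME)
        {
          *(const void **) data
            = (const void *) (info->dlpi_addr + info->dlpi_phdr[i].p_vaddr);
          return 1;   /* stop after this object (illustrative) */
        }
    return 0;
  }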
> 
> You asked for comments on a different interface.  Though we currently
> seem to favor not adding a different interface, my question is clearly
> relevant in that context.
> 
The interface I proposed is a fix of the current one, since the current
one failed to spell out that the callback may be called in parallel. It
should be easy to move from the old one to the new one for users that
never made such an assumption in the first place. The only reason I
proposed a new interface instead of fixing the old one is to be on the
safe side.

--
			Gleb.

