This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH RFC] introduce dl_iterate_phdr_parallel
- From: Gleb Natapov <gleb at scylladb dot com>
- To: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>
- Cc: libc-alpha at sourceware dot org
- Date: Mon, 1 Aug 2016 21:49:46 +0300
- Subject: Re: [PATCH RFC] introduce dl_iterate_phdr_parallel
- Authentication-results: sourceware.org; auth=none
- References: <20160725142326.GM1018@scylladb.com> <579A6F54.2080709@linaro.org> <20160731091642.GF2502@scylladb.com> <579F8FA8.9060009@linaro.org>
On Mon, Aug 01, 2016 at 03:06:32PM -0300, Adhemerval Zanella wrote:
>
>
> On 31/07/2016 06:16, Gleb Natapov wrote:
> > Sorry, I missed your answer since I am not subscribed and the reply was
> > sent to the mailing list only.
> >
> > On Thu, Jul 28, 2016 at 05:47:16PM -0300, Adhemerval Zanella wrote:
> >>
> >>
> >> On 25/07/2016 11:23, Gleb Natapov wrote:
> >>> Problem: exception handling is not scalable. The attached program mt.cc
> >>> demonstrates this easily. It runs in around one second with 1 thread, but
> >>> with only 4 threads it takes 4.5 seconds to complete. The reason is the
> >>> locks taken on the unwind path.
> >>>
> >>> There are two locks right now.
> >>>
> >>> The first is in libgcc's _Unwind_Find_registered_FDE, and it is eliminated by
> >>> https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01629.html (for dynamic
> >>> executables only, but this is the common case).
> >>>
> >>> The second is dl_load_write_lock in __dl_iterate_phdr. It serves a dual
> >>> purpose: it prevents the list of loaded objects from being modified while
> >>> iterating over it, and it makes sure that no two callbacks
> >>> run in parallel. This means that even if we find a clever way
> >>> to make __dl_iterate_phdr iterate over the object list locklessly, the
> >>> lock will still have to be taken around the call to the callback to preserve
> >>> the second guarantee.
> >>>
> >>> This patch proposes to introduce another API: dl_iterate_phdr_parallel,
> >>> which does exactly the same thing as dl_iterate_phdr but may run more than
> >>> one provided callback in parallel. To make it more scalable, it
> >>> breaks the single dl_load_write_lock into an array of locks. A reader takes
> >>> only one lock, but a writer has to lock all of them to proceed with modifying
> >>> the list.
> >>
> >> Wouldn't a read-write lock solve the performance problem in dl_iterate_phdr?
> >> It should scale better than the current code, since dl_iterate_phdr should act
> >> just as a reader on the link_map structures, and multiple readers should only
> >> incur a lll_lock/lll_unlock on rwlock->__data.__lock. I am assuming that
> >> we set it to prefer readers and use a read lock in dl_iterate_phdr.
> >>
> > I did try a read-write lock first and it did not improve the test case
> > at all. The reason is (as I was very surprised to discover) that an rwlock
> > takes an internal lock even if there are only readers. Though that lock is
> > held only for a short time, once it becomes contended scalability drops
> > to the floor, since all new waiters go to sleep in the kernel.
> >
>
> Yes, and I assumed this short lock contention wouldn't yield worse
> performance, but that's not the case.
>
> > Attached is a, rather hacky, patch that I used for testing.
> >
> >> I see this as slightly better than trading memory for limited scalability
> >> (the _DL_LOCKS value), and it also avoids adding another exported API.
> > The memory tradeoff is rather small. Using an rwlock will not avoid
> > the API addition though, since two callbacks would be able to run in
> > parallel, and this is a user-visible difference in behaviour from the
> > current dl_iterate_phdr. The C++ exception handling code, for instance,
> > relies on the fact that callbacks do not run in parallel: it uses a
> > global cache, which I made per-thread in the libgcc patch.
>
> Now I understand better the issue your proposed ABI is trying to fix, after
> checking the libstdc++/libgcc unwind code. It looks like the mutual
> exclusion of callback execution came from libstdc++ usage, since it is not
> mentioned in the 'dl_iterate_phdr' documentation.
>
Yes, precisely. I looked hard to find anything relevant in the docs, but did
not see it documented.
> Now, I see we have two options here:
>
> 1. Work on your proposal for dl_iterate_phdr_parallel and *document* that
> the current dl_iterate_phdr serializes callback execution, and that programs
> which would like to scale its execution should move to dl_iterate_phdr_parallel.
> We would need to further document that scalability relies on an internal array
> of 64 locks, and that depending on the number of threads and processors this
> limit can be hit.
The 64 locks are an implementation detail. It could be changed to a lock per
thread, where a writer has to lock the thread list plus all of the per-thread
locks.
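To make the lock-array idea concrete, here is a minimal sketch (all names are invented for illustration, not glibc's actual ones): a reader locks only the one slot derived from its thread id, so unrelated readers never contend, while a writer must take every slot, always in index order to avoid deadlock.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>

// Stand-in for _DL_LOCKS; the slot count bounds reader parallelism.
constexpr std::size_t kLockSlots = 64;
static std::array<std::mutex, kLockSlots> phdr_locks;

// Pick this thread's slot from its id, so a given thread always
// contends on at most one of the 64 mutexes.
static std::size_t my_slot() {
    return std::hash<std::thread::id>{}(std::this_thread::get_id())
           % kLockSlots;
}

// Reader side: what a dl_iterate_phdr_parallel-style call would do.
// Callbacks in different threads may run concurrently.
template <typename Fn> void read_side(Fn fn) {
    std::lock_guard<std::mutex> g(phdr_locks[my_slot()]);
    fn();   // iterate the object list here
}

// Writer side: what dlopen/dlclose would do before mutating the list.
// Taking all slots in a fixed order excludes every reader.
template <typename Fn> void write_side(Fn fn) {
    for (auto &m : phdr_locks)
        m.lock();
    fn();   // mutate the object list here
    for (auto it = phdr_locks.rbegin(); it != phdr_locks.rend(); ++it)
        it->unlock();
}
```

Writers become proportionally more expensive (64 lock/unlock pairs), which is the intended trade: list mutation via dlopen/dlclose is rare compared to unwinding.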
>
> 2. Another option is to push to clean up the dl_iterate_phdr interface to *not*
> require callback serialization and use a rdlock while accessing the
> maps list. With the current rwlock implementation performance won't change,
> as you noticed; however, since we are reworking it into a more
> scalable one [1], the read-only pass should be *much* better:
> __pthread_rwlock_rdlock_full should issue just an atomic_fetch_add_acquire
> in the uncontended read case.
I saw the new rwlock implementation and tested it. It is much better than the
current rwlock, but still almost twice as slow as multiple locks.
Profiling shows that the exchange in the unlock path takes a lot of CPU. No
wonder, since a contended atomic operation is very expensive. IMO the ideal
solution is an array of rwlocks.
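The read-side cost being discussed can be modeled with a toy counter (this is an illustrative model, not glibc's actual __pthread_rwlock_rdlock_full): the uncontended read lock is a single fetch-add with acquire ordering, but the unlock is another read-modify-write on the same word, and with many readers that one cache line becomes the bottleneck, which is why an array of rwlocks (one word per slot) scales better.

```cpp
#include <atomic>
#include <cassert>

// Toy model of a reader-preferring rwlock fast path; writers are not
// modeled, only the reader-counter traffic that shows up in profiles.
struct toy_rwlock {
    std::atomic<int> readers{0};   // count of readers currently inside

    void rdlock() {
        // Fast path: one atomic fetch-add with acquire ordering.
        readers.fetch_add(1, std::memory_order_acquire);
    }
    void rdunlock() {
        // The matching RMW on unlock hits the same cache line,
        // so every reader still serializes on this one word.
        readers.fetch_sub(1, std::memory_order_release);
    }
};
```

With an array of such locks, each reader's fetch-add/fetch-sub pair lands on its own slot's word, removing the shared hot line at the cost of writers sweeping all slots.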
>
> The downside is that the current libgcc/libstdc++ requires serialized callback
> execution, so we would have to version dl_iterate_phdr. Also, I am not
> sure how other programs use dl_iterate_phdr, so it might be an issue.
Yes, I was worried about users we do not know about. The one in libgcc can be
fixed. I am not sure how to do the symbol versioning magic, though, and will
need help with that.
>
> Also, I noticed that other libcs (FreeBSD, uClibc, and musl) do not serialize
> the callback execution, and adding a glibc-specific symbol (the parallel
> version) would add more portability issues. So I would prefer to work
> toward 2.
How does libgcc work on those systems? Or do they provide their own
unwinding?
>
> [1] https://sourceware.org/ml/libc-alpha/2016-07/msg00636.html
--
Gleb.