This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH RFC] introduce dl_iterate_phdr_parallel
- From: Gleb Natapov <gleb at scylladb dot com>
- To: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>
- Cc: libc-alpha at sourceware dot org
- Date: Mon, 1 Aug 2016 21:49:46 +0300
- Subject: Re: [PATCH RFC] introduce dl_iterate_phdr_parallel
- Authentication-results: sourceware.org; auth=none
- References: <20160725142326.GM1018@scylladb.com> <579A6F54.2080709@linaro.org> <20160731091642.GF2502@scylladb.com> <579F8FA8.9060009@linaro.org>
On Mon, Aug 01, 2016 at 03:06:32PM -0300, Adhemerval Zanella wrote:
>
>
> On 31/07/2016 06:16, Gleb Natapov wrote:
> > Sorry, I missed your answer since I am not subscribed and the reply was
> > sent to the mailing list only.
> >
> > On Thu, Jul 28, 2016 at 05:47:16PM -0300, Adhemerval Zanella wrote:
> >>
> >>
> >> On 25/07/2016 11:23, Gleb Natapov wrote:
> >>> Problem: exception handling is not scalable. The attached program mt.cc
> >>> demonstrates this easily. It runs in around one second with 1 thread, but
> >>> with only 4 threads it takes 4.5 seconds to complete. The reason is the
> >>> locks taken on the unwind path.
> >>>
> >>> There are two locks right now.
> >>>
> >>> The first is in libgcc's _Unwind_Find_registered_FDE, and it is eliminated by
> >>> https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01629.html (for dynamic
> >>> executables only, but this is the common case).
> >>>
> >>> The second is dl_load_write_lock in __dl_iterate_phdr. It serves a dual
> >>> purpose: it prevents the list of loaded objects from being modified while
> >>> iterating over it, and it makes sure that no two callbacks
> >>> run in parallel. This means that even if we find a clever way
> >>> to make __dl_iterate_phdr iterate over the object list locklessly, the
> >>> lock will still have to be taken around the call to the callback to preserve
> >>> the second guarantee.
> >>>
> >>> This patch proposes to introduce another API: dl_iterate_phdr_parallel,
> >>> which does exactly the same thing as dl_iterate_phdr but may run more than
> >>> one provided callback in parallel. To make it more scalable, it
> >>> breaks the single dl_load_write_lock into an array of locks. A reader takes
> >>> only one lock, but a writer has to lock all of them to proceed with modifying
> >>> the list.
> >>
> >> Wouldn't a read-write lock solve the performance problem in dl_iterate_phdr?
> >> It should scale better than the current code, since dl_iterate_phdr should act
> >> just as a reader on the link_map structures, and multiple readers should only
> >> incur a lll_lock/lll_unlock on rwlock->__data.__lock. I am assuming that
> >> we set it to prefer readers and use a read lock in dl_iterate_phdr.
> >>
> > I did try a read-write lock first and it did not improve the test case
> > at all. The reason is (as I was very surprised to discover) that an rwlock
> > takes an internal lock even if there are only readers. Though that lock is
> > held only for a short time, once it becomes contended scalability drops
> > to the floor, since all new waiters go to sleep in the kernel.
> >
>
> Yes, and I assumed this short lock contention wouldn't yield worse
> performance, but that's not the case.
>
> > Attached is a, rather hacky, patch that I used for testing.
> >
> >> I see this as slightly better than trading memory for limited scalability
> >> (the _DL_LOCKS value), and it also avoids adding another exported API.
> > The memory tradeoff is rather small. Using an rwlock will not avoid
> > the API addition though, since two callbacks would be able to run in
> > parallel, and this is a user-visible difference in behaviour from the
> > current dl_iterate_phdr. The C++ exception handling code, for instance,
> > relies on the fact that callbacks do not run in parallel: it uses a
> > global cache, which I made per-thread in the libgcc patch.
>
> Now I understand better the issue your proposed ABI is trying to fix, after
> checking the libstdc++/libgcc unwind code. It looks like the mutual
> exclusion of callback execution came from libstdc++ usage, since it is not
> mentioned in the 'dl_iterate_phdr' documentation.
>
Yes, precisely. I looked hard to find anything relevant in the docs, but did
not see it documented.
> Now, I see we have two options here:
>
> 1. Work on your proposal for dl_iterate_phdr_parallel and *document* that
> the current dl_iterate_phdr serializes callback execution, and that programs
> which would like to scale its execution should move to dl_iterate_phdr_parallel.
> We would need to further document that scalability relies on an internal array
> of 64 locks, and that depending on the number of threads and processors this
> limit can be hit.
The 64 locks are an implementation detail. It could be changed to a lock per
thread, where a writer has to lock the thread list plus all of the per-thread
locks.
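To make the lock-array idea concrete, here is a minimal sketch (all names are invented for illustration, not glibc's actual ones): a reader locks only the one slot derived from its thread id, so unrelated readers never contend, while a writer must take every slot, always in index order to avoid deadlock.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>

// Stand-in for _DL_LOCKS; the slot count bounds reader parallelism.
constexpr std::size_t kLockSlots = 64;
static std::array<std::mutex, kLockSlots> phdr_locks;

// Pick this thread's slot from its id, so a given thread always
// contends on at most one of the 64 mutexes.
static std::size_t my_slot() {
    return std::hash<std::thread::id>{}(std::this_thread::get_id())
           % kLockSlots;
}

// Reader side: what a dl_iterate_phdr_parallel-style call would do.
// Callbacks in different threads may run concurrently.
template <typename Fn> void read_side(Fn fn) {
    std::lock_guard<std::mutex> g(phdr_locks[my_slot()]);
    fn();   // iterate the object list here
}

// Writer side: what dlopen/dlclose would do before mutating the list.
// Taking all slots in a fixed order excludes every reader.
template <typename Fn> void write_side(Fn fn) {
    for (auto &m : phdr_locks)
        m.lock();
    fn();   // mutate the object list here
    for (auto it = phdr_locks.rbegin(); it != phdr_locks.rend(); ++it)
        it->unlock();
}
```

Writers become proportionally more expensive (64 lock/unlock pairs), which is the intended trade: list mutation via dlopen/dlclose is rare compared to unwinding.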
>
> 2. Another option is to push to clean up the dl_iterate_phdr interface to *not*
> require callback serialization and use a rdlock while accessing the
> maps list. With the current rwlock implementation performance won't change,
> as you noticed; however, since we are reworking it into a more
> scalable one [1], the read-only pass should be *much* better:
> __pthread_rwlock_rdlock_full should issue just an atomic_fetch_add_acquire
> in the uncontended read case.
I saw the new rwlock implementation and tested it. It is much better than the
current rwlock, but still almost twice as slow as multiple locks.
Profiling shows that the exchange in the unlock path takes a lot of CPU. No
wonder, since a contended atomic operation is very expensive. IMO the ideal
solution is an array of rwlocks.
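The read-side cost being discussed can be modeled with a toy counter (this is an illustrative model, not glibc's actual __pthread_rwlock_rdlock_full): the uncontended read lock is a single fetch-add with acquire ordering, but the unlock is another read-modify-write on the same word, and with many readers that one cache line becomes the bottleneck, which is why an array of rwlocks (one word per slot) scales better.

```cpp
#include <atomic>
#include <cassert>

// Toy model of a reader-preferring rwlock fast path; writers are not
// modeled, only the reader-counter traffic that shows up in profiles.
struct toy_rwlock {
    std::atomic<int> readers{0};   // count of readers currently inside

    void rdlock() {
        // Fast path: one atomic fetch-add with acquire ordering.
        readers.fetch_add(1, std::memory_order_acquire);
    }
    void rdunlock() {
        // The matching RMW on unlock hits the same cache line,
        // so every reader still serializes on this one word.
        readers.fetch_sub(1, std::memory_order_release);
    }
};
```

With an array of such locks, each reader's fetch-add/fetch-sub pair lands on its own slot's word, removing the shared hot line at the cost of writers sweeping all slots.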
>
> The downside is that the current libgcc/libstdc++ requires serialized callback
> execution, so we would have to version dl_iterate_phdr. Also, I am not
> sure how other programs use dl_iterate_phdr, so it might be an issue.
Yes, I was worried about users we do not know about. The one in libgcc can be
fixed. I am not sure how to do the symbol versioning magic, though, and will
need help with that.
>
> Also, I noticed that other libcs (FreeBSD, uClibc, and musl) do not serialize
> the callback execution, and adding a glibc-specific symbol (the parallel
> version) would add more portability issues. So I would prefer to work
> toward 2.
How does libgcc work on those systems? Or do they provide their own
unwinding?
>
> [1] https://sourceware.org/ml/libc-alpha/2016-07/msg00636.html
--
Gleb.