Re: [PATCH RFC] introduce dl_iterate_phdr_parallel



On 31/07/2016 06:16, Gleb Natapov wrote:
> Sorry, I missed your answer since I am not subscribed and the reply was
> sent to the mailing list only.
> 
> On Thu, Jul 28, 2016 at 05:47:16PM -0300, Adhemerval Zanella wrote:
>>
>>
>> On 25/07/2016 11:23, Gleb Natapov wrote:
>>> Problem: exception handling is not scalable. The attached program mt.cc
>>> demonstrates this easily. It runs in around 1 second with 1 thread, but
>>> with only 4 threads it takes 4.5 seconds to complete. The reason is the
>>> locks that are taken on the unwind path.
>>>
>>> There are two locks right now.
>>>
>>> The first is in libgcc's _Unwind_Find_registered_FDE, and it is eliminated by
>>> https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01629.html (for dynamic
>>> executables only, but this is the common case).
>>>
>>> The second one is dl_load_write_lock in __dl_iterate_phdr. It serves a dual
>>> purpose: it stops the list of loaded objects from being modified while
>>> iterating over it, and it makes sure that more than one callback will
>>> not run in parallel. This means that even if we find a clever way
>>> to make __dl_iterate_phdr iterate over the object list locklessly, the
>>> lock will still have to be taken around the call to the callback to preserve
>>> the second guarantee.
>>>
>>> This patch proposes to introduce another API, dl_iterate_phdr_parallel,
>>> which does exactly the same thing as dl_iterate_phdr but may run more than
>>> one provided callback in parallel. To make it more scalable it breaks
>>> the single dl_load_write_lock into an array of locks. A reader takes only
>>> one lock, but a writer has to lock all of them to proceed with modifying
>>> the list.
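[A minimal sketch of the lock-array scheme described above, purely for
illustration; this is not the actual patch, and the names and the slot
count are invented.  A reader hashes to one slot, a writer must take
every slot.]

#include <pthread.h>
#include <stdint.h>

#define NLOCKS 64   /* illustrative lock-array size */

/* GNU designated range initializer; fine in a glibc context.  */
static pthread_mutex_t phdr_locks[NLOCKS] =
  { [0 ... NLOCKS - 1] = PTHREAD_MUTEX_INITIALIZER };

/* Reader side (dl_iterate_phdr_parallel): take just one slot, picked
   per calling thread, so independent readers rarely contend.  */
static int
reader_lock (void)
{
  int slot = (int) ((uintptr_t) pthread_self () % NLOCKS);
  pthread_mutex_lock (&phdr_locks[slot]);
  return slot;
}

static void
reader_unlock (int slot)
{
  pthread_mutex_unlock (&phdr_locks[slot]);
}

/* Writer side (dlopen/dlclose): take every slot, always in the same
   order to avoid deadlock, before modifying the object list.  */
static void
writer_lock_all (void)
{
  for (int i = 0; i < NLOCKS; i++)
    pthread_mutex_lock (&phdr_locks[i]);
}

static void
writer_unlock_all (void)
{
  for (int i = NLOCKS - 1; i >= 0; i--)
    pthread_mutex_unlock (&phdr_locks[i]);
}

[The writer pays for 64 lock acquisitions, but dlopen/dlclose are rare
compared to unwinding, which is the point of the tradeoff.]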
>>
>> Wouldn't a read-write lock solve the performance problem at dl_iterate_phdr?
> 
>> It should scale better than the current code, since dl_iterate_phdr should act
>> just as a reader on link_map structures, and multiple readers should only incur
>> a lll_lock/lll_unlock on rwlock->__data.__lock.  I am assuming that
>> we set it to prefer readers and use a read-lock on dl_iterate_phdr.
>>
> I did try a read-write lock first and it did not improve the test case
> at all. The reason is (as I was very surprised to discover) that the rwlock
> takes a lock even if there are only readers. Though the lock is taken
> only for a short period of time, when it becomes contended scalability drops
> to the floor since all new waiters go to sleep in the kernel.
> 

Yes, and I assumed this short lock contention wouldn't yield worse performance,
but that's not the case.
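For context, the effect described above shows up with any read-write lock
whose read path goes through a single internal mutex: readers serialize on
that mutex just to bump the reader count.  A deliberately simplified sketch
(not the actual glibc rwlock code; the write side is omitted):

#include <pthread.h>

struct simple_rwlock
{
  pthread_mutex_t lock;   /* protects the fields below */
  pthread_cond_t cond;
  int nr_readers;
  int writer;
};

/* Even the read acquisition goes through the internal mutex, so with
   many concurrent readers the mutex itself becomes the contended
   resource and new waiters end up sleeping in the kernel.  */
static void
simple_rdlock (struct simple_rwlock *rw)
{
  pthread_mutex_lock (&rw->lock);
  while (rw->writer)
    pthread_cond_wait (&rw->cond, &rw->lock);
  rw->nr_readers++;
  pthread_mutex_unlock (&rw->lock);
}

static void
simple_rdunlock (struct simple_rwlock *rw)
{
  pthread_mutex_lock (&rw->lock);
  if (--rw->nr_readers == 0)
    pthread_cond_broadcast (&rw->cond);
  pthread_mutex_unlock (&rw->lock);
}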

> Attached is a rather hacky patch that I used for testing. 
> 
>> I see this as slightly better than trading memory for limited scalability
>> (the _DL_LOCKS value), and it also avoids adding another exported API.
> The memory tradeoff is rather small. Using a rwlock will not avoid the
> API addition though, since two callbacks will be able to run in
> parallel, and this is a user-visible difference in behaviour from the
> current dl_iterate_phdr. The C++ exception handling code, for instance,
> relies on the fact that callbacks do not run in parallel. It uses a global
> cache which I made per-thread in the libgcc patch.

Now, after checking the libstdc++/libgcc unwind code, I understand better the
issue your proposed ABI is trying to fix.  It looks like the callback mutual
exclusion requirement came from libstdc++ usage, since it is not part of the
'dl_iterate_phdr' documentation.
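To illustrate why that guarantee is user-visible: a callback written against
the current behaviour may update shared state with no locking of its own,
roughly like this (hypothetical example, not the actual libgcc code):

#include <link.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared lookup cache.  With the current dl_iterate_phdr,
   callbacks never run concurrently, so the unsynchronized update below
   is safe; with parallel callbacks it would become a data race.  */
static struct { uintptr_t pc; uintptr_t base; } lookup_cache;

static int
find_object_cb (struct dl_phdr_info *info, size_t size, void *data)
{
  uintptr_t pc = *(uintptr_t *) data;
  (void) size;

  for (int i = 0; i < info->dlpi_phnum; i++)
    {
      const ElfW(Phdr) *phdr = &info->dlpi_phdr[i];
      if (phdr->p_type == PT_LOAD)
        {
          uintptr_t start = info->dlpi_addr + phdr->p_vaddr;
          uintptr_t end = start + phdr->p_memsz;
          if (pc >= start && pc < end)
            {
              lookup_cache.pc = pc;                 /* unsynchronized write */
              lookup_cache.base = info->dlpi_addr;
              return 1;                             /* stop the iteration */
            }
        }
    }
  return 0;
}

/* Usage: uintptr_t pc = ...; dl_iterate_phdr (find_object_cb, &pc); */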

Now, I see we have two options here:

1. Work on your proposal for dl_iterate_phdr_parallel and *document* that
   the current dl_iterate_phdr serializes callback execution and that programs
   which want to scale should move to dl_iterate_phdr_parallel.
   We would need to further document that the scalable version uses an internal
   array of 64 locks, and that depending on the number of threads and processors
   this limit can be hit.

2. Another option is to push to clean up the dl_iterate_phdr interface to *not*
   require callback serialization and to use a rdlock while accessing the
   maps list.  With the current rwlock implementation performance won't change,
   as you noticed, but since we are reworking it into a more scalable one [1],
   the read-only path should perform *much* better:
   __pthread_rwlock_rdlock_full should issue just an atomic_fetch_add_acquire
   for the uncontended read case (a rough sketch of this shape follows the list).

   The downside is that current libgcc/libstdc++ requires serialized callback
   execution, so we would have to version dl_iterate_phdr.  Also, I am not
   sure how other programs use dl_iterate_phdr, so it might be an issue.
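For option 2, the core of dl_iterate_phdr would only need a read lock around
the walk, roughly the following shape (a sketch with illustrative names; the
real list head and lock live in glibc's internal namespace structures, this
is not the actual source):

#include <link.h>
#include <pthread.h>
#include <stddef.h>

static pthread_rwlock_t load_lock = PTHREAD_RWLOCK_INITIALIZER;
static struct link_map *loaded_maps;   /* stand-in for the link map list head */

int
iterate_phdr_sketch (int (*callback) (struct dl_phdr_info *, size_t, void *),
                     void *data)
{
  int ret = 0;

  /* Readers only exclude writers (dlopen/dlclose would take the write
     lock); they no longer exclude each other, so callbacks may run in
     parallel.  With the new rwlock an uncontended rdlock is roughly a
     single atomic_fetch_add_acquire.  */
  pthread_rwlock_rdlock (&load_lock);

  for (struct link_map *l = loaded_maps; l != NULL; l = l->l_next)
    {
      struct dl_phdr_info info = { 0 };
      info.dlpi_addr = l->l_addr;
      info.dlpi_name = l->l_name;
      /* ... dlpi_phdr/dlpi_phnum would be filled in from the object ... */
      ret = callback (&info, sizeof info, data);
      if (ret != 0)
        break;
    }

  pthread_rwlock_unlock (&load_lock);
  return ret;
}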

I also noticed that other libcs (FreeBSD, uclibc, and musl) do not serialize
the callback execution, and adding a glibc-specific symbol (the parallel
version) would add more portability issues.  So I would prefer to work
toward option 2.

[1] https://sourceware.org/ml/libc-alpha/2016-07/msg00636.html

