This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.
Re: Hashtable
- From: Arkady <arkady dot miasnikov at gmail dot com>
- To: David Smith <dsmith at redhat dot com>
- Cc: systemtap at sourceware dot org
- Date: Thu, 6 Jul 2017 20:50:44 +0300
- Subject: Re: Hashtable
P.S.2
Convenient links for copy&paste
https://gist.githubusercontent.com/larytet/10ceddea609d2da17aa09558ed0e04bc/raw/05037d536e5edf0e2f5a45282c41b8fa46d1fd55/SystemTap_tests.sh
https://gist.githubusercontent.com/larytet/fc147587e9dfecfe99ab6bac2ba4aaa0/raw/670e385cb76798b526de9f4265046cf576c42f4e/SystemTap_tests
On Thu, Jul 6, 2017 at 8:45 PM, Arkady <arkady.miasnikov@gmail.com> wrote:
> P.S. The performance is very sensitive to TRYLOCKDELAY, which is expected.
>
>
> On Thu, Jul 6, 2017 at 8:25 PM, Arkady <arkady.miasnikov@gmail.com> wrote:
>> On Thu, Jul 6, 2017 at 7:36 PM, David Smith <dsmith@redhat.com> wrote:
>>> On Wed, Jul 5, 2017 at 11:46 AM, Arkady <arkady.miasnikov@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> I have a CPU bottleneck in some situations on heavy loaded servers.
>>>>
>>>> From the tests it appears that associative maps contribute significant
>>>> part of the overhead.
>>>
>>> ... stuff deleted ...
>>>
>>> Can you show us your script (or the associate map portion) that
>>> illustrates the performance problem? Perhaps we can make some
>>> suggestions.
>>>
>> My test is a tight loop:
>>
>>     file=echo_file_`date +%s%N`
>>     echo $file
>>     echo > $file
>>     counter=1
>>     end=$((SECONDS+10))
>>     while [ $SECONDS -lt $end ]; do
>>         echo $counter >> $file
>>         counter=$((counter+1))
>>     done
>>     tail -n 1 $file
>>     rm -f $file
>>
>> I run a number of these - one per core.
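[Editor's note] The per-core fan-out described above can be sketched as a single script. This is a sketch, not the original harness: the `DURATION` variable, the `nproc`-based worker count, and the per-worker file names are illustrative additions.

```shell
#!/bin/sh
# Sketch: run one copy of the tight echo loop per core, in parallel.
# DURATION (seconds) and the nproc fan-out are assumptions added here.
DURATION="${DURATION:-1}"
cores=$(nproc)
c=0
while [ "$c" -lt "$cores" ]; do
  (
    # each worker appends an incrementing counter to its own file
    file="echo_file_${c}_$(date +%s%N)"
    : > "$file"
    counter=1
    end=$(( $(date +%s) + DURATION ))
    while [ "$(date +%s)" -lt "$end" ]; do
      echo "$counter" >> "$file"
      counter=$((counter + 1))
    done
    tail -n 1 "$file"   # print the last counter reached by this worker
    rm -f "$file"
  ) &
  c=$((c + 1))
done
wait
```

Each worker prints the number of iterations it managed, so the per-core output gives a rough throughput baseline to compare against runs with the stap scripts loaded.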
>>
>>
>> My stap script is something like this (8 probes):
>>
>>     stap -g -e '
>>     %{
>>     long long counter;
>>     u8 shm[256];
>>     static void *w_shm(void);
>>     static void *w_shm() { memset(shm, 0, sizeof(shm)); return shm; }
>>     %}
>>     probe syscall.close        { %{ { counter++; w_shm(); } %} }
>>     probe syscall.close.return { %{ { counter++; w_shm(); } %} }
>>     probe syscall.open         { %{ { counter++; w_shm(); } %} }
>>     probe syscall.open.return  { %{ { counter++; w_shm(); } %} }
>>     probe syscall.dup2         { %{ { counter++; w_shm(); } %} }
>>     probe syscall.dup2.return  { %{ { counter++; w_shm(); } %} }
>>     probe syscall.read         { %{ { counter++; w_shm(); } %} }
>>     probe syscall.read.return  { %{ { counter++; w_shm(); } %} }
>>     probe end { %{ { printk("\n%lli\n", counter); } %} }'
>>
>>
>> w_shm() simulates writes to the shared memory.
>> The performance impact is ~15% with 4 cores.
>>
>> I am adding a map (global ar%):
>>
>>     stap -D MAXSKIPPED=0 -D MAXTRYLOCK=1000000 -D TRYLOCKDELAY=10 -g -e '
>>     global ar%;
>>     function w_ar() { ar[tid()] = tid(); }
>>     %{
>>     long long counter;
>>     u8 shm[256];
>>     static void *w_shm(void);
>>     static void *w_shm() { memset(shm, 0, sizeof(shm)); return shm; }
>>     %}
>>     probe syscall.close        { w_ar(); %{ { counter++; w_shm(); } %} }
>>     probe syscall.close.return { w_ar(); %{ { counter++; w_shm(); } %} }
>>     probe syscall.open         { w_ar(); %{ { counter++; w_shm(); } %} }
>>     probe syscall.open.return  { w_ar(); %{ { counter++; w_shm(); } %} }
>>     probe syscall.dup2         { w_ar(); %{ { counter++; w_shm(); } %} }
>>     probe syscall.dup2.return  { w_ar(); %{ { counter++; w_shm(); } %} }
>>     probe syscall.read         { w_ar(); %{ { counter++; w_shm(); } %} }
>>     probe syscall.read.return  { w_ar(); %{ { counter++; w_shm(); } %} }
>>     probe end { %{ { printk("\n%lli\n", counter); } %} }'
>>
>> With the map I am getting a ~35% hit, and the overhead grows with the number of cores.
>>
>> The scripts roughly reflect what I am doing in the actual code. I have
>> 1-3 associative arrays per syscall type. For example, I keep separate
>> arrays for probe syscall.read and probe syscall.write.
>>
>> I have ~30 probes - I/O, networking, thread life cycle.
>>
>>> (Also note that I've started a background personal task to reduce the
>>> use of locks in systemtap. I don't have much to show for it yet.)
>>>
>>
>> It looks like probe performance does not scale well with the number of
>> cores: the overhead increases as cores are added. I suspect the spin
>> locks taken at the beginning of every probe are to blame.
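[Editor's note] The suspicion above (one shared lock serializing all CPUs) can be illustrated with a hypothetical userspace analogue. Nothing here is SystemTap internals: `run_demo`, the flock(1)-guarded counter file, and all names are invented for this sketch. Every worker funnels each update through one shared lock, the way every probe hit funnels through the map's lock.

```shell
#!/bin/sh
# Hypothetical analogue of lock contention: W workers each perform I
# increments of a shared counter file, serialized by flock(1).
run_demo() {
  W="$1"; I="$2"
  dir=$(mktemp -d)
  cnt="$dir/counter"; lock="$dir/lock"
  echo 0 > "$cnt"
  : > "$lock"
  start=$(date +%s%N)
  w=0
  while [ "$w" -lt "$W" ]; do
    (
      j=0
      while [ "$j" -lt "$I" ]; do
        # every increment takes the one shared lock, serializing all workers
        flock "$lock" sh -c "read n < \"$cnt\"; echo \$((n + 1)) > \"$cnt\""
        j=$((j + 1))
      done
    ) &
    w=$((w + 1))
  done
  wait
  end=$(date +%s%N)
  # the final total equals W*I because the lock made updates atomic
  echo "workers=$W total=$(cat "$cnt") elapsed_ns=$((end - start))"
  rm -rf "$dir"
}
run_demo "${1:-2}" "${2:-50}"
```

Running it with a growing worker count (same per-worker work) shows elapsed time per increment rising as workers are added, the same shape as the per-core overhead growth reported above; it is an analogy only, since kernel spin locks and flock(1) have very different costs.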
>>
>>> --
>>> David Smith
>>> Principal Software Engineer
>>> Red Hat
>>
>> Thank you, Arkady.