This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.



Re: Hashtable


On Thu, Jul 6, 2017 at 7:36 PM, David Smith <dsmith@redhat.com> wrote:
> On Wed, Jul 5, 2017 at 11:46 AM, Arkady <arkady.miasnikov@gmail.com> wrote:
>> Hi,
>>
>> I have a CPU bottleneck in some situations on heavily loaded servers.
>>
>> From the tests it appears that associative maps contribute a
>> significant part of the overhead.
>
> ... stuff deleted ...
>
> Can you show us your script (or the associative map portion) that
> illustrates the performance problem? Perhaps we can make some
> suggestions.
>
My test is a tight loop:

file=echo_file_`date +%s%N`; echo $file; echo > $file; counter=1
end=$((SECONDS+10))
while [ $SECONDS -lt $end ]; do echo $counter >> $file; counter=$((counter+1)); done
tail -n 1 $file; rm -f $file

I run a number of these - one per core.


My stap script is something like this (8 probes plus an end probe):

stap -g -e '
%{
    long long counter;
    u8 shm[256];
    /* simulate a write into shared memory */
    static void *w_shm(void) { memset(shm, 0, sizeof(shm)); return shm; }
%}
probe syscall.close        { %{ { counter++; w_shm(); } %} }
probe syscall.close.return { %{ { counter++; w_shm(); } %} }
probe syscall.open         { %{ { counter++; w_shm(); } %} }
probe syscall.open.return  { %{ { counter++; w_shm(); } %} }
probe syscall.dup2         { %{ { counter++; w_shm(); } %} }
probe syscall.dup2.return  { %{ { counter++; w_shm(); } %} }
probe syscall.read         { %{ { counter++; w_shm(); } %} }
probe syscall.read.return  { %{ { counter++; w_shm(); } %} }
probe end { %{ { printk("\n%lli\n", counter); } %} }
'


w_shm() simulates writes to shared memory.
The performance impact is ~15% with 4 cores.

I am adding a map (global ar%):

stap -D MAXSKIPPED=0 -D MAXTRYLOCK=1000000 -D TRYLOCKDELAY=10 -g -e '
global ar%
function w_ar() { ar[tid()] = tid() }
%{
    long long counter;
    u8 shm[256];
    static void *w_shm(void) { memset(shm, 0, sizeof(shm)); return shm; }
%}
probe syscall.close        { w_ar(); %{ { counter++; w_shm(); } %} }
probe syscall.close.return { w_ar(); %{ { counter++; w_shm(); } %} }
probe syscall.open         { w_ar(); %{ { counter++; w_shm(); } %} }
probe syscall.open.return  { w_ar(); %{ { counter++; w_shm(); } %} }
probe syscall.dup2         { w_ar(); %{ { counter++; w_shm(); } %} }
probe syscall.dup2.return  { w_ar(); %{ { counter++; w_shm(); } %} }
probe syscall.read         { w_ar(); %{ { counter++; w_shm(); } %} }
probe syscall.read.return  { w_ar(); %{ { counter++; w_shm(); } %} }
probe end { %{ { printk("\n%lli\n", counter); } %} }
'

With the map I am getting a ~35% hit, and the overhead grows with the number of cores.

The scripts roughly reflect what I am doing in the actual code. I have
1-3 associative arrays per syscall type; for example, I keep separate
arrays for syscall.read and syscall.write, roughly like the sketch below.
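
For illustration only (the array names and stored values here are made
up, not the ones from my real script), the per-syscall split looks
something like this:

# illustrative names only; the real arrays hold different values
global read_info%
global write_info%

probe syscall.read  { read_info[tid()]  = fd }
probe syscall.write { write_info[tid()] = fd }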

I have ~30 probes - I/O, networking, thread life cycle.

> (Also note that I've started a background personal task to reduce the
> use of locks in systemtap. I don't have much to show for it yet.)
>

It looks like probe performance does not scale well with the number of
cores: the more cores I add, the larger the per-probe overhead becomes.
I suspect the spin locks taken at the beginning of every probe handler
are to blame.
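
One experiment I may try to isolate the lock cost is replacing the map
write with a statistics aggregate. As far as I understand, the <<<
operator accumulates into per-CPU storage and only needs a shared lock,
so the writes should not serialize across cores. This is just a sketch
(it does not give me the per-tid lookups I need):

# sketch only: count events per tid via a per-CPU aggregate
global hits

probe syscall.read, syscall.close, syscall.open, syscall.dup2 {
    hits[tid()] <<< 1
}

probe end {
    foreach (t in hits)
        printf("%d: %d\n", t, @count(hits[t]))
}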

> --
> David Smith
> Principal Software Engineer
> Red Hat

Thank you, Arkady.

