This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.


Re: Poor scaling


On 08/09/2016 01:38 AM, Avi Kivity wrote:
> 
> 
> On 08/08/2016 06:00 PM, David Smith wrote:
>> On 08/07/2016 07:47 AM, Avi Kivity wrote:
>>> Running a process with 12 active threads, each bound to its own core,
>>> and touching a systemtap array fairly frequently, I see:
>>>
>>>    72.91%  scylla           [kernel.kallsyms]     [k] _raw_spin_lock
>>>     3.49%  scylla           [kernel.kallsyms]     [k] start_callback.isra.62
>>>     1.77%  scylla           scylla                [.] _ZSt13__adjust_heapIN9__gnu_cxx17__normal_iteratorIPPN8logalloc11region_implESt6vectorIS4_SaIS4_EEEElS4_NS0_5__ops15_Iter_comp_iterIZNS2_7tracker4impl15compact_on_idleERKSt8functionIFbvEEEUlS4_S4_E_EEEvT_T0_SM_T1_T2_.isra.437.constprop.582
>>>     1.54%  scylla           [kernel.kallsyms]     [k] _raw_spin_unlock
>>>     1.39%  scylla           scylla                [.] _ZN7reactor10smp_pollfn4pollEv
>>>     1.29%  scylla           libc-2.19.so          [.] epoll_pwait
>>>     1.21%  swapper          [kernel.kallsyms]     [k] poll_idle
>>>     1.06%  scylla           [kernel.kallsyms]     [k] _raw_read_lock_irqsave
>>>     1.01%  scylla           scylla                [.] _ZSt11__push_heapIN9__gnu_cxx17__normal_iteratorIPPN8logalloc11region_implESt6vectorIS4_SaIS4_EEEElS4_NS0_5__ops14_Iter_comp_valIZNS2_7tracker4impl15compact_on_idleERKSt8functionIFbvEEEUlS4_S4_E_EEEvT_T0_SM_T1_T2_.isra.436.constprop.583
>>>     0.80%  scylla           [kernel.kallsyms]     [k] _raw_read_unlock_irqrestore
>>>     0.78%  scylla           scylla                [.] _ZN17smp_message_queue20flush_response_batchEv
>>>     0.75%  scylla           [kernel.kallsyms]     [k] utrace_report_syscall_entry
>>>     0.63%  scylla           [kernel.kallsyms]     [k] task_utrace_struct
>>>     0.50%  scylla           [kernel.kallsyms]     [k] native_read_tsc
>>>     0.37%  scylla           [kernel.kallsyms]     [k] utrace_report_syscall_exit
>>>     0.32%  scylla           [kernel.kallsyms]     [k] delay_tsc
>>>     0.28%  scylla           scylla                [.] _ZN17smp_message_queue13process_queueILm2EZNS_16process_incomingEvEUlPNS_9work_itemEE_EEmRNS_8lf_queueET0_.isra.2371
>>>     0.26%  scylla           scylla                [.] _ZN17smp_message_queue13process_queueILm4EZNS_19process_completionsEvEUlPNS_9work_itemEE_EEmRNS_8lf_queueET0_.isra.2278
>>>     0.20%  scylla           [kernel.kallsyms]     [k] _raw_spin_lock_irq
>>>     0.18%  scylla           scylla                [.] _ZN6memory9cpu_pages4freeEPv
>>>     0.18%  scylla           liblz4.so.1.0.0       [.] LZ4_compress
>>>     0.17%  scylla           scylla                [.] _ZN7reactor9run_tasksER15circular_bufferISt10unique_ptrI4taskSt14default_deleteIS2_EESaIS5_EE
>>>
>>> Now this is not something I can run to monitor a production load.
>>>
>>>
>>> Is there any way I can improve scaling?
>> It depends; we'd have to see your script to make suggestions.
>>
> 
> 
> Here is the latest version of my script (I'm not sure whether tid() is a
> process-local or global identifier, so I used a pid(), tid() pair).

You shouldn't need to use the pair; both are unique system-wide. pid()
returns the thread group id (task->tgid) and tid() returns the thread
id (task->pid). (The naming is an artifact of how thread group ids were
shoehorned into the Linux kernel.)
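
For instance, a minimal sketch (untested) of the same bookkeeping keyed
on tid() alone; it should behave the same as the pid()/tid() version
you posted below:

global start_time

probe process(@1).mark("reactor_run_tasks_single_start") {
    # tid() (task->pid) is already unique system-wide
    start_time[tid()] = gettimeofday_us()
}

probe process(@1).mark("reactor_run_tasks_single_end") {
    delete start_time[tid()]
}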

> #!/usr/bin/stap
> 
> # usage: task_latency.stap process_name latency_threshold_ms
> 
> global start_time
> 
> probe process(@1).mark("reactor_run_tasks_single_start") {
>     start_time[pid(), tid()] = gettimeofday_us()
> }
> 
> probe process(@1).mark("reactor_run_tasks_single_end") {
>     delete start_time[pid(), tid()]
> }
> 
> probe timer.profile {
>     if ([pid(), tid()] in start_time) {
>         now = gettimeofday_us()
>         start = start_time[pid(), tid()]
>         if ((now - start) > $2 * 1000) {
>             printf("detected tasks running for >%sms\n", @2)
>             print_usyms(ubacktrace())
>         }
>     }
> }
> 
> 
> Are entire probes serialized, or just individual variables?  If the
> latter, perhaps you can detect that an array's index is always tid(),
> remove that index, and convert the global variable to a thread-local one
> (with an exception for begin and end probes).

The same probe handler can run simultaneously in different threads;
probe handlers (and probe-local variables) aren't serialized. However,
access to global variables is serialized.
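
If that serialization shows up as contention, one thing that can help
in some scripts: as far as I know, globals that are written only with
the '<<<' aggregation operator are kept in per-cpu storage by the
runtime and are only combined when read, so concurrent writers on
different CPUs don't fight over a single lock. A rough sketch
(untested; the names here are made up for illustration):

global ticks   # written only via '<<<', so stored per-cpu

probe timer.profile {
    ticks[tid()] <<< 1   # per-cpu write, no single global lock
}

probe end {
    # reading the aggregate combines the per-cpu copies
    foreach (t in ticks)
        printf("tid %d: %d profile ticks\n", t, @count(ticks[t]))
}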

-- 
David Smith
dsmith@redhat.com
Red Hat
http://www.redhat.com
256.217.0141 (direct)
256.837.0057 (fax)

