This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: Task watchers v2

From: Matt Helsley <matthltc at us dot ibm dot com>
To: Paul Jackson <pj at sgi dot com>
Cc: yanmin_zhang at linux dot intel dot com, akpm at osdl dot org, linux-kernel at vger dot kernel dot org, jes at sgi dot com, hch at lst dot de, viro at zeniv dot linux dot org dot uk, sgrubb at redhat dot com, linux-audit at redhat dot com, systemtap at sources dot redhat dot com
Date: Tue, 19 Dec 2006 04:05:55 -0800
Subject: Re: Task watchers v2
Organization: IBM Linux Technology Center
References: <20061215000754.764718000@us.ibm.com> <20061215000817.771088000@us.ibm.com> <1166420641.15989.117.camel@ymzhang> <1166447901.995.110.camel@localhost.localdomain> <20061218214159.2d571bf5.pj@sgi.com>

On Mon, 2006-12-18 at 21:41 -0800, Paul Jackson wrote:
> Matt wrote:
> > - Task watchers can actually improve kernel performance slightly (up to
> > 2% in extremely fork-heavy workloads for instance).
> 
> Nice.
> 
> Could you explain why?

After the last round of patches I set out to improve instruction and
data cache hits.

Previous iterations of task watchers would prevent the code in these
paths from being inlined. Furthermore, the code certainly wouldn't be
placed near the table of function pointers (which was in an entirely
different ELF section). By placing them adjacent to each other in the
same ELF section we can improve the likelihood of cache hits in
fork-heavy workloads (which were the ones that showed a performance
decrease in the previous iteration of these patches).

Suppose we have two functions to invoke during fork -- A and B. Here's
what the memory layout looked like in the previous iteration of task
watchers:

+--------------+<----+
|  insns of B  |     |
       .             |
       .             |
       .             |
|              |     |
+--------------+     |
       .             |
       .             |
       .             |
+--------------+<-+  |
|  insns of A  |  |  |
       .          |  |
       .          |  |
       .          |  |
|              |  |  |
+--------------+  |  |
       .          |  |
       .          |  |
       .          |  |  .text
==================|==|========= ELF Section Boundary
+--------------+  |  |  .task
| pointer to A----+  |
+--------------+     |
| pointer to B-------+
+--------------+

The notify_task_watchers() function would first load the pointer to A from the .task
section. Then it would immediately jump into the .text section and force the
instructions from A to be loaded. When A was finished, it would return to
notify_task_watchers() only to jump into B by the same steps.

As you can see things can be rather spread out. Unless the compiler inlined the
functions called from copy_process() things are very similar in a mainline
kernel -- copy_process() could be jumping to rather distant portions of the kernel
text and the pointer table would be rather distant from the instructions to be loaded.

Here's what the new patches look like:

===============================
+--------------+        .task
| pointer to A----+
+--------------+  |
| pointer to B-------+
+--------------+  |  |
       .          |  |
       .          |  |
+--------------+<-+  |
|  insns of A  |     |
       .             |
       .             |
       .             |
|              |     |
+--------------+<----+
|  insns of B  |
       .
       .
       .
|              |
+--------------+
===============================

Which is clearly more compact and also follows the order of calls (A
then B). The instructions are all in the same section. When A finishes
executing we soon jump into B which could be in the same instruction
cache line as the function we just left. Furthermore, since the sequence
always goes from A to B I expect some anticipatory loads could be done.

For fork-heavy workloads I'd expect this to explain the performance
difference. For workloads that aren't fork-heavy I suspect we're just as
likely to experience instruction cache misses -- whether the functions
are inlined, adjacent, or not -- since the fork happens relatively
infrequently and other instructions are likely to push fork-related
instructions out.

Cheers,
	-Matt Helsley

Follow-Ups:
- Re: Task watchers v2
  - From: Paul Jackson

References:
- Re: Task watchers v2
  - From: Zhang, Yanmin
- Re: Task watchers v2
  - From: Matt Helsley
- Re: Task watchers v2
  - From: Paul Jackson

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]