This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.



Re: health monitoring scripts


On 09/01/2009 10:59 AM, David Smith wrote:
> On 08/20/2009 02:45 PM, Frank Ch. Eigler wrote:
>> Hi -
>>
>> I was asked to share some snippets of an old idea regarding a possibly
>> compelling application for systemtap.  Here goes, from a few months
>> back:
>>
>> ------------------------------------------------------------------------
>>
>> The technical gist of the idea would have several parts: to create a
>> suite of systemtap script fragments (a new "health" tapset); and to
>> build one or more front-ends for monitoring the ongoing probes
>> graphically and via other tools.
>>
>> Each tapset piece would represent a single subsystem or concern.  A
>> variety of probes could act to represent a view of its health
>> (tracking allocation counts, rates of activity, latency trends,
>> whatever makes sense).  The barest sketch ...
>>
> 
> ... sketch removed ...
> 
> I've been taking a stab at implementing this.  Here's what I've discovered.

... stuff deleted ...

> - number of context switches:  You can see the current number of context
> switches by looking at the 'ctxt' line in /proc/stat.  This information
> comes from calling the nr_context_switches() function in kernel/sched.c.
> nr_context_switches() gets this information from a per-CPU runqueue
> structure (which contains lots of interesting information).
> Unfortunately, neither the nr_context_switches() function nor the
> underlying runqueue data structure is exported.  The nr_switches field
> of the runqueue structure gets incremented in schedule(), but it is
> possible for schedule() to increment nr_switches more than once (and we
> have no way to detect this).

One of the things I've discovered is that I need to look at our existing
tapsets more closely - there is already a 'scheduler.ctxswitch' probe
point in the correct spot.
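
For example, the barest throwaway script (not the attached one) to watch
that probe point fire would be something like:

  global n

  # count every context switch reported by the scheduler tapset
  probe scheduler.ctxswitch { n++ }

  # report the total when the script is stopped (e.g. with Ctrl-C)
  probe end { printf("saw %d context switches\n", n) }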

Here's a baby implementation of this idea.  It only reports context
switches.  After untar'ing, you'd run it like this:

# stap -I tapset/health resource_monitor.stp 'health.*'
1252098915,context_switches,2149
1252098925,context_switches,4352
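
The tarball itself isn't reproduced inline, but the rough shape is a probe
alias under the health tapset plus a small driver script.  Purely as an
illustration (the names and layout here are guesses, not necessarily what
is in the attachment), it amounts to something like:

  # tapset/health/context_switches.stp -- hypothetical tapset fragment
  probe health.context_switches = scheduler.ctxswitch
  {
      name = "context_switches"
  }

  # resource_monitor.stp -- hypothetical driver; $1 is the probe pattern
  # given on the command line, e.g. 'health.*'
  global counts

  probe $1 { counts[name]++ }

  # emit one "timestamp,name,count" line per tracked resource every 10 seconds
  probe timer.s(10)
  {
      foreach (n in counts)
          printf("%d,%s,%d\n", gettimeofday_s(), n, counts[n])
  }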

Besides needing more information sources, we also need to think about
what makes a system "unhealthy".  For instance, in the case of
context_switches, the health monitoring code could check for too many
context switches within a certain time interval.  Of course the hard
part is knowing what is "too many" (or at least how to make it
configurable).
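
As a strawman for that (the threshold below is an arbitrary placeholder,
and the "unhealthy" check is only a sketch, not part of the attached
script), the reporting probe could compare each interval's count against a
threshold that would eventually come from configuration:

  # arbitrary placeholder; a real version would take this from a script
  # argument or a config file rather than hard-coding it
  global threshold = 50000
  global switches

  probe scheduler.ctxswitch { switches++ }

  # every 10 seconds, flag the interval if the count crossed the threshold
  probe timer.s(10)
  {
      if (switches > threshold)
          printf("%d,context_switches,unhealthy,%d\n", gettimeofday_s(), switches)
      switches = 0
  }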

-- 
David Smith
dsmith@redhat.com
Red Hat
http://www.redhat.com
256.217.0141 (direct)
256.837.0057 (fax)

Attachment: health_monitor.tar.bz2
Description: application/bzip

