
Re: Optimizing watchpoints


You've brought up two issues, which I think each deserve their own separate
thread of discussion.  The second thread is about indirection, or generally
speaking, dynamic specification of watchpoint addresses.  That is a worthy
and interesting subject, but I don't think you need to worry about it now.
For considering the watchpoint implementation per se, we can just talk
about "a watchpoint" as being a request to watch a given fixed address.

There are two aspects of specification that will poke through to the
lowest-level implementation layers.  Those are how you can specify the
address range and whose actions you want to watch.

For the latter, that means an individual thread or a group of threads that
share a set of watchpoints.  Right now, the implementation can only be done
by setting each watchpoint individually on each thread.  But it is likely
that future facilities will be able to share some low-level resources and
interface overhead by treating uniformly an arbitrary subset of threads in
the same address space.  It is also likely to matter whether the chosen subset
is in fact the whole set of all threads in the same address space, and
whether a thread has only the breakpoints shared with its siblings in a
chosen subset, or has those plus additional private breakpoints of its own.
So it's worthwhile to think about how the structure of keeping track of
watchpoints (and other kinds of low-level breakpoints) can reflect those
groupings of threads from the high-level semantic control plane down to the
lowest-level implementation, where the most important sharing can occur.
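
To make that bookkeeping concrete, here is a rough C sketch of a structure
that keeps shared and private watchpoints separate; the type and field names
are illustrative only, not frysk's actual classes or any kernel interface.

#include <stddef.h>
#include <stdint.h>

struct watchpoint
{
  uintptr_t addr;                 /* naturally-aligned address */
  size_t len;                     /* 1, 2, 4, or 8 bytes */
  struct watchpoint *next;
};

struct thread;

struct thread_group               /* threads sharing an address space */
{
  struct watchpoint *shared;      /* watched by every member thread */
  struct thread *members;
};

struct thread
{
  struct thread_group *group;
  struct watchpoint *private_wps; /* extra per-thread watchpoints */
  struct thread *next;
};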

For the address range specification, the low-level implementation pokes up
to constrain the granularity at which you can usefully specify what you
want to watch.  On the most common machines, it's only naturally-aligned
chunks whose size is some machine-dependent subset of 1, 2, 4, 8 bytes
(i.e. 1, 2, 4 for 32-bit x86 kernels; 1, 2, 4, 8 for x86_64 kernels; and 8
for powerpc and ia64).  Probably future facilities will add page size to that
set of sizes (using a software VM facility rather than hardware features).
So the first step has to be turning the semantic request of a given address
range into one or more aligned-address, size pairs (or if you prefer,
address, mask pairs) of sizes supported at low level.  Then you maintain
the active set in that form, and identify the redundancies as they go in.
Only when the duplicate-free active set changes do you poke the low level.
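
For instance, a 6-byte field starting at an odd address has to become three
low-level watchpoints on an x86_64-style machine.  A minimal C sketch of that
first step, assuming the {1, 2, 4, 8} size set, greedily takes the largest
supported size that both fits the remaining length and keeps the address
naturally aligned:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

static void
split_range (uintptr_t addr, size_t len)
{
  static const size_t sizes[] = { 8, 4, 2, 1 };
  while (len > 0)
    for (size_t i = 0; i < 4; i++)
      {
        size_t s = sizes[i];
        if (s <= len && (addr & (s - 1)) == 0)
          {
            printf ("watch %#lx size %zu\n", (unsigned long) addr, s);
            addr += s;
            len -= s;
            break;
          }
      }
}

int
main (void)
{
  split_range (0x1003, 6);   /* -> 0x1003/1, 0x1004/4, 0x1008/1 */
  return 0;
}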

There is one final aspect of organization to consider.  At the lowest
level, there is a fixed-size hardware resource of watchpoint slots.  When
you set them with ptrace, the operating system just context-switches them
for each thread in the most straightforward way.  So the hardware resource
is yours to decide how to allocate.  However, this is not what we expect to
see in future facilities.  The coming model is that hardware watchpoints
are a shared resource managed and virtualized to a certain degree by the
operating system.  The debugger may be one among several noncooperating
users of this resource, for both per-thread and system-wide uses.  Rather
than having the hardware slots to allocate as you choose, you will specify
what you want in a slot, and a priority, and can get dynamic feedback about
the availability of a slot for your priority.  (For compatibility, ptrace
itself will use that facility to virtualize the demands made by
PTRACE_SET_DEBUGREG and the like.  ptrace uses a known priority number that
is fairly high, so that some system-wide or other background tracing would
have to knowingly intend to interfere with traditional user application use
by choosing an even higher priority.)
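
For reference, today's interface on x86 is roughly the following sketch: the
debugger owns the slots outright and writes the debug registers per-thread by
poking u_debugreg through PTRACE_POKEUSER (the x86 counterpart of
PTRACE_SET_DEBUGREG; error handling and other-arch details omitted).  The DR7
bits shown set up slot 0 as a 4-byte data-write watchpoint.

#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <stddef.h>
#include <stdint.h>

static int
set_write_watchpoint (pid_t tid, uintptr_t addr)
{
  /* DR0 holds the linear address to watch.  */
  if (ptrace (PTRACE_POKEUSER, tid,
              offsetof (struct user, u_debugreg) + 0 * sizeof (long),
              addr) < 0)
    return -1;
  /* DR7: L0 = 1 (slot 0 enabled), R/W0 = 01 (break on data writes),
     LEN0 = 11 (4-byte range).  */
  unsigned long dr7 = 0x1UL | (0x1UL << 16) | (0x3UL << 18);
  if (ptrace (PTRACE_POKEUSER, tid,
              offsetof (struct user, u_debugreg) + 7 * sizeof (long),
              dr7) < 0)
    return -1;
  return 0;
}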

In the long run, the way to look at this is not as a set of resources to be
allocated, but as a bag of tricks each with different tradeoffs of
constraints, costs, and dynamic resource contention.  

At one extreme you have single-step, i.e. software watchpoints by storing
the old value, stepping an instruction, and checking if the value in memory
changed.  This has few constraints on specification (only that you can't
distinguish stored-same from no-store, and it's not a mechanism for data
read breakpoints).  It has no resource contention issues at all.  It is
inordinately expensive in CPU time (though a straightforward in-kernel
implementation could easily be orders of magnitude faster than the
traditional debugger experience of implementing this).
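
A minimal sketch of that extreme, using plain ptrace and assuming the thread
tid is already attached and stopped; note, as said above, that a store of the
same value is indistinguishable from no store at all:

#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stdint.h>
#include <stdio.h>

static void
software_watch (pid_t tid, uintptr_t addr)
{
  long old = ptrace (PTRACE_PEEKDATA, tid, addr, 0);
  for (;;)
    {
      int status;
      /* Step one instruction, then see whether the watched word changed.  */
      if (ptrace (PTRACE_SINGLESTEP, tid, 0, 0) < 0
          || waitpid (tid, &status, 0) < 0
          || !WIFSTOPPED (status))
        break;
      long new = ptrace (PTRACE_PEEKDATA, tid, addr, 0);
      if (new != old)
        {
          printf ("watchpoint hit at %#lx: %#lx -> %#lx\n",
                  (unsigned long) addr, (unsigned long) old,
                  (unsigned long) new);
          old = new;
        }
    }
}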

Hardware watchpoints have some precise constraints and they compete for a
very limited dynamic resource, but they are extremely cheap in CPU time.

In the future we might have the option of VM tricks.  Those have their own
constraints (page granularity on addresses), they consume an important
dynamic resource that is finite but often not too constrained in practice
(page tables), and they have a somewhat complex balance of cost.

When I talk about cost, I mean roughly the setup overhead, plus the
exception overhead of a hit, plus the overhead of taking and sorting out
any false positive hits (due to the granularity of the facility you choose
being coarser than what you're actually aiming for).  For cases like VM
tricks, this takes in the complex set of factors affecting scaling and
secondary overhead (page table locking, TLB, cache, etc).
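
One way to make that comparable across tricks is a small per-trick cost
record; the numbers that would fill it in are hypothetical, the point is only
the shape of the tradeoff:

struct trick_cost
{
  double setup;        /* one-time installation overhead */
  double exception;    /* overhead of taking a trap at all */
  double sort_out;     /* extra work to dismiss a false positive */
};

/* Expected total cost, given predicted true hits and false positives
   (false positives come from a granularity coarser than the range
   actually requested).  */
static double
expected_cost (const struct trick_cost *t,
               double true_hits, double false_hits)
{
  return t->setup
         + (true_hits + false_hits) * t->exception
         + false_hits * t->sort_out;
}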

To satisfy a set of requests percolated down from the higher levels, you
take those requirements, your bag of tricks, and each trick's particular
tradeoffs for each case, and figure out dynamically what to do.  For tricks
that depend on resources with priority allocations, i.e. hardware
watchpoints, you have to rate the cost-effectiveness of the next-best trick
you could use instead, somehow scaled by how badly you want to get this
done properly, to come up with the right priority number to use so as best
to express decent global tradeoffs among the competing needs in the system.
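
A hypothetical way to turn that into a number, reusing the expected_cost
sketch above: the more you would lose by falling back to the next-best trick,
scaled by how much you care about this particular request, the higher the
priority you bid for the slot.  The 0..100 range and the scaling here are
made up, not any defined kernel interface.

static int
slot_priority (double hw_cost, double next_best_cost, double importance)
{
  double benefit = (next_best_cost - hw_cost) * importance;
  if (benefit < 0.0)
    benefit = 0.0;
  if (benefit > 100.0)
    benefit = 100.0;
  return (int) benefit;
}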

In a simple example, you can do a one-byte data write watchpoint on powerpc
but only as a two-tier operation.  You set the hardware to catch stores to
the aligned 8-byte word containing the target byte address.  You get a
fault before the store happens, but you only know it will store to one or
more of those 8 bytes.  So, you can save the old 8 bytes, disable the
hardware breakpoint, single-step, and compare the new 8 bytes to the old to
decide which smaller-granularity data-change watchpoints to report as having hit.
This is a pretty good tradeoff, since while the total overhead of a hit is
at least twice that of an optimal case (aligned 8-byte watchpoint), the
likely rate of false positives is quite low.  (Compared to "sorry, just
can't do it", that's really quite good.)

In an oft-imagined future example, VM tricks give an effectively unlimited
resource in how many watchpoints (of that kind) you can have installed
simultaneously.  Even if you can only have one word-granularity watchpoint,
or four, or none, you can set lots of page-granularity watchpoints.  You
can dynamically track the rate of false-positives engendered by those (page
faults you have to examine and let go), potentially feeding a complex
estimate of the cost of maintaining particular watchpoints in the page
tables.  You can install many page-level watchpoints to start, then respond
to their hits and to the dynamic feedback of hardware watchpoint slot
pressure to choose the hottest watchpoints and give them the hardware
slots.  In the absence of hardware slots, this might always lead to
single-stepping over stores to watched pages, or to declaring the situation too
costly and alerting the user we had to give up.
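
The promotion side of that feedback loop might look something like this
hypothetical sketch: each page-granularity watchpoint counts the page faults
it has cost us, and whenever a hardware slot frees up, the most expensive one
gets it.

#include <stddef.h>
#include <stdint.h>

struct page_watch
{
  uintptr_t page;          /* page-aligned address being watched */
  unsigned long faults;    /* false positives observed so far */
  int has_hw_slot;         /* already promoted to a hardware slot? */
};

/* Pick the page watchpoint paying the most in faults that does not yet
   own a hardware slot; the caller hands it the freed slot.  */
static struct page_watch *
pick_promotion (struct page_watch *w, size_t n)
{
  struct page_watch *best = NULL;
  for (size_t i = 0; i < n; i++)
    if (!w[i].has_hw_slot
        && (best == NULL || w[i].faults > best->faults))
      best = &w[i];
  return best;
}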

The trivial case is that you request at low level exactly one watchpoint
for what the user needs, with normal ptrace-equivalent priority.  If the
slot is (or becomes) unavailable, you tell the user "no can do", end of story.
(In today's implementation based on ptrace, this is what the low level
always has to boil down to.)


This view of the whole bag of tricks should be unified across all sorts of
low-level breakpoint resources, i.e. instruction breakpoints as well as
data read/write breakpoints (aka watchpoints).  Normal breakpoint insertion
and related techniques are more of the tricks in the bag; like some of the
hardware features mentioned earlier, they are very constrained in their
kind of specification (an exact instruction address).  
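
For contrast, normal breakpoint insertion is roughly the following on
little-endian x86: plant an int3 (0xcc) opcode over the first byte of the
instruction, keeping the original word so it can be restored when the
breakpoint is removed or stepped over (a real implementation would also
check errno on the peek).

#include <sys/ptrace.h>
#include <sys/types.h>
#include <stdint.h>

static long
insert_int3 (pid_t tid, uintptr_t addr, long *saved)
{
  long word = ptrace (PTRACE_PEEKTEXT, tid, addr, 0);
  *saved = word;
  /* On little-endian x86 the byte at addr is the low byte of the word
     read at addr; overwrite it with the int3 opcode.  */
  long patched = (word & ~0xffL) | 0xcc;
  return ptrace (PTRACE_POKETEXT, tid, addr, patched);
}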

The most common case is one with no watchpoints at all, but just a few
instruction breakpoints.  On x86 and ia64, the same hardware features used
for data breakpoints work for these and are orders of magnitude less costly
(and far less complicated) than even the sexiest hypothetical breakpoint
assistance techniques.  When the pressure for those resources allows, it's
always optimal to use hardware breakpoints in preference to memory insertion.


Thanks,
Roland

