This is the mail archive of the systemtap@sources.redhat.com mailing list for the systemtap project.



Re: x86_64 kprobes wart removal


On Fri, 2005-04-08 at 08:20, William Cohen wrote:
> Jim Keniston wrote:
...
> > I propose the following alternative:
> > - Allocate one executable page at the beginning of time. [See note 1.]
> > - Store the instruction copy in the kprobe object, as in other
> > architectures.
> > - When it comes time to single-step an instruction, just copy the
> > instruction from the kprobe object to the executable page.
> > - In resume_execution, adjust copy_rip accordingly.
> 
> Copying the instruction just before the single step could be expensive, 
> looking more like self-modifying code.

If we allocate a whole L1 cache line for each single-step scratch area,
as you suggest below, is this still a performance concern?  We would
copy the instruction into the scratch area, then eventually iret, which
triggers the single-step.  A memory expert I talked to here said it
shouldn't be an issue, although he admitted that he's not 100% sure
about what the x86_64 CPUs do in such situations.
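
A rough user-space sketch of the copy step described above, for illustration only. The names probe_sketch, prepare_singlestep, MAX_INSN_BYTES, and SLOT_BYTES are hypothetical and not taken from the kprobes source. The buffer here is ordinary memory rather than an executable page, so this shows only the data movement, not the single-step itself.

```c
#include <string.h>

#define MAX_INSN_BYTES 15  /* longest x86-64 instruction, per the thread */
#define SLOT_BYTES     64  /* assumed: one L1 cache line per scratch area */

/* Hypothetical stand-in for the kprobe object holding the saved copy. */
struct probe_sketch {
    unsigned char insn[MAX_INSN_BYTES];
};

/*
 * Copy the saved instruction into this CPU's scratch area just before
 * single-stepping. Returns the address the saved rip would be pointed
 * at before the iret (that part is not modeled here).
 */
static unsigned char *prepare_singlestep(unsigned char *slot,
                                         const struct probe_sketch *p)
{
    memcpy(slot, p->insn, MAX_INSN_BYTES);
    return slot;
}
```

The copy is at most 15 bytes into a line that only this CPU touches, which is the case the performance question above is about.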

> 
> > Note 1: If we go to per-CPU locking, we may need to allocate enough
> > space for NR_CPUS instructions.  Also, we still want to use Roland's
> > trick of allocating the memory close to where the modules live.
> 
> Wouldn't the allocations need to be large enough to fill a cache line to 
> avoid false sharing and cache lines getting bounced between processors?

Yes, good point.

>  
> Cache lines are significantly larger than the 15 bytes or so for the 
> largest x86-64 instruction.

64 bytes is the largest allowable L1 cache line for x86_64, right? 
(L1_CACHE_SHIFT_MAX = 6).  If NR_CPUS is 64 or less, we can fit all the
CPUs' scratch areas in one page (4096 / 64 = 64).
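
The arithmetic above can be sketched as a slot layout. This is a minimal illustration that assumes a 4096-byte page and a 64-byte cache line, as in the paragraph above. The SCRATCH_* constants and both helpers are hypothetical names, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values from the discussion; not taken from any kernel header. */
#define SCRATCH_PAGE_SIZE 4096u  /* one executable page */
#define SCRATCH_LINE_SIZE 64u    /* 1 << 6, one L1 cache line per CPU */
#define SCRATCH_SLOTS (SCRATCH_PAGE_SIZE / SCRATCH_LINE_SIZE)

/* Byte offset of a CPU's scratch area within the page. */
static size_t scratch_offset(unsigned int cpu)
{
    assert(cpu < SCRATCH_SLOTS);  /* more CPUs would need another page */
    return (size_t)cpu * SCRATCH_LINE_SIZE;
}

/* Address of a CPU's scratch area, given the base of the page. */
static unsigned char *scratch_slot(unsigned char *page, unsigned int cpu)
{
    return page + scratch_offset(cpu);
}
```

Because each slot starts on a cache-line boundary (given a page-aligned base), no two CPUs share a line, which addresses the false-sharing concern raised above.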

> 
> > I don't have a patch yet, but does that sound like the right approach? 
> > I wish I'd thought of this a year ago. :-}
> 
> It sounds like this approach might be slower and consume more memory.

It can't consume any more memory unless NR_CPUS > 64.  If the number of
currently installed probes exceeds NR_CPUS * 4, the new scheme could
even consume less memory.

> 
> -Will
> 

Jim

