This is the mail archive of the
systemtap@sources.redhat.com
mailing list for the systemtap project.
Re: x86_64 kprobes wart removal
On Fri, 2005-04-08 at 08:20, William Cohen wrote:
> Jim Keniston wrote:
...
> > I propose the following alternative:
> > - Allocate one executable page at the beginning of time. [See note 1.]
> > - Store the instruction copy in the kprobe object, as in other
> > architectures.
> > - When it comes time to single-step an instruction, just copy the
> > instruction from the kprobe object to the executable page.
> > - In resume_execution, adjust copy_rip accordingly.
>
> Copying the instruction just before the single step could be expensive,
> looking more like self-modifying code.
If we allocate a whole L1 cache line for each single-step scratch area,
as you suggest below, is this still a performance concern? We would
copy the instruction into the scratch area, then eventually iret, which
triggers the single-step. A memory expert I talked to here said it
shouldn't be an issue, although he admitted that he's not 100% sure
about what the x86_64 CPUs do in such situations.
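
For concreteness, the copy-then-step sequence under discussion might look roughly like the user-space sketch below. It is not a patch. The names probe_sketch, scratch_page, scratch_slot, prepare_single_step and fixup_rip are made up for illustration, the page is an ordinary static array rather than executable memory, and the rip fixup shown only covers instructions that fall through to the next instruction (branches, calls and rip-relative operands would need extra handling).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SCRATCH_PAGE_SIZE 4096  /* assumed page size */
#define SCRATCH_SLOT_SIZE 64    /* one L1 cache line per CPU, as discussed */
#define MAX_INSN_BYTES    15    /* longest x86-64 instruction */

/* Hypothetical stand-in for the kprobe object, which keeps the copy. */
struct probe_sketch {
    uintptr_t addr;                  /* address of the probed instruction */
    uint8_t   insn[MAX_INSN_BYTES];  /* saved copy of the original instruction */
};

/* Stand-in for the single executable page allocated up front. */
static _Alignas(SCRATCH_SLOT_SIZE) uint8_t scratch_page[SCRATCH_PAGE_SIZE];

/* Each CPU gets its own cache-line-sized slot within the page. */
static uint8_t *scratch_slot(unsigned int cpu)
{
    assert(cpu < SCRATCH_PAGE_SIZE / SCRATCH_SLOT_SIZE);
    return scratch_page + (size_t)cpu * SCRATCH_SLOT_SIZE;
}

/* Copy the saved instruction into this CPU's slot just before the
 * single step; the return value is where rip would be pointed. */
static uintptr_t prepare_single_step(const struct probe_sketch *p,
                                     unsigned int cpu)
{
    uint8_t *slot = scratch_slot(cpu);

    memcpy(slot, p->insn, MAX_INSN_BYTES);
    return (uintptr_t)slot;
}

/* After the step, translate rip from the scratch slot back to the
 * original instruction stream (fall-through instructions only). */
static uintptr_t fixup_rip(const struct probe_sketch *p, unsigned int cpu,
                           uintptr_t rip_after_step)
{
    return p->addr + (rip_after_step - (uintptr_t)scratch_slot(cpu));
}
```

With a one-byte instruction probed at address 0x1000 and stepped on CPU 3, prepare_single_step() returns the start of the fourth slot, and fixup_rip() maps "slot + 1" back to 0x1001.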
>
> > Note 1: If we go to per-CPU locking, we may need to allocate enough
> > space for NR_CPUS instructions. Also, we still want to use Roland's
> > trick of allocating the memory close to where the modules live.
>
> Wouldn't the allocations need to be large enough to fill a cache line to
> avoid false sharing and cache lines getting bounced between processors?
Yes, good point.
>
> Cache lines are significantly larger than the 15 bytes or so for the
> largest x86-64 instruction.
64 bytes is the largest allowable L1 cache line for x86_64, right?
(L1_CACHE_SHIFT_MAX = 6). If NR_CPUS is 64 or less, we can fit all the
CPUs' scratch areas in one page (4096 / 64 = 64).
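
The sizing arithmetic can be checked with a small C sketch. It assumes a 4096-byte page and takes the 64-byte slot size from L1_CACHE_SHIFT_MAX = 6 as quoted above. The struct scratch_slot layout and the pages_needed() helper are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define L1_CACHE_SHIFT_MAX 6                        /* value quoted above */
#define SLOT_BYTES         (1u << L1_CACHE_SHIFT_MAX)
#define PAGE_BYTES         4096u                    /* assumed page size */
#define MAX_INSN_BYTES     15u                      /* longest x86-64 instruction */

/* One scratch area per CPU, padded out to a full cache line so that
 * two CPUs never share a line. */
struct scratch_slot {
    _Alignas(SLOT_BYTES) unsigned char insn[MAX_INSN_BYTES];
};

static_assert(sizeof(struct scratch_slot) == SLOT_BYTES,
              "each slot occupies exactly one cache line");
static_assert(PAGE_BYTES / SLOT_BYTES == 64,
              "one page holds 64 per-CPU slots");

/* Pages required to give every CPU its own slot (rounded up). */
static unsigned int pages_needed(unsigned int nr_cpus)
{
    return (nr_cpus * SLOT_BYTES + PAGE_BYTES - 1) / PAGE_BYTES;
}
```

Under these assumptions, pages_needed(64) is 1 and pages_needed(65) is 2, so a second page is only needed once NR_CPUS goes above 64.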
>
> > I don't have a patch yet, but does that sound like the right approach?
> > I wish I'd thought of this a year ago. :-}
>
> It sounds like this approach might be slower and consume more memory.
It can't consume any more memory unless NR_CPUS > 64. If the number of
currently installed probes exceeds NR_CPUS * 4, the new scheme could
even consume less memory.
>
> -Will
>
Jim