This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.



Re: Hitachi djprobe mechanism


Hi,

I am a member of the djprobe team.

Thank you very much for your information.

We were not aware of that erratum when we designed djprobe.


We believe that djprobe can still modify the code safely with the following procedure:


step 1: set up the int3 bypass code using kprobe

step 2: safety check:
        make sure that no CPU is executing the code that will
        be replaced with the jmp instruction (also check whether any
        stack contains an EIP inside the code to be replaced)

step 3: (after all CPUs pass the safety check) write the jmp
        instruction except for its first byte. Leave the int3
        instruction unchanged at this time (new step).

step 4: i-cache flush or serialization:
        execute an i-cache flush instruction such as CLFLUSH or a
        serializing instruction such as CPUID on all CPUs (new step)

step 5: (after all CPUs have executed the i-cache flush or serializing
        instruction) replace the int3 instruction with the first byte
        of the jmp instruction
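
The write order in steps 1, 3 and 5 can be sketched as a small user-space simulation on a byte buffer. This is only an illustration under stated assumptions, not djprobe code: the helper names `encode_jmp` and `patch_states`, the sample bytes and the addresses are all hypothetical, and steps 2 and 4 (which write nothing) appear only as comments.

```python
import struct

INT3 = 0xCC       # one-byte breakpoint opcode
JMP_REL32 = 0xE9  # opcode of the 5-byte near jump (jmp rel32)


def encode_jmp(src: int, dst: int) -> bytes:
    """Encode a 5-byte `jmp rel32` located at src that targets dst."""
    rel = dst - (src + 5)  # displacement is relative to the next instruction
    return bytes([JMP_REL32]) + struct.pack("<i", rel)


def patch_states(code: bytearray, offset: int, jmp: bytes):
    """Apply the writes in order, yielding the 5 patched bytes after each one."""
    # step 1: int3 on the first byte only
    code[offset] = INT3
    yield bytes(code[offset:offset + 5])
    # step 2 (safety check on all CPUs) would run here; it writes nothing
    # step 3: write jmp bytes 1..4 while the int3 stays in place
    code[offset + 1:offset + 5] = jmp[1:]
    yield bytes(code[offset:offset + 5])
    # step 4 (flush or serialize on all CPUs) would run here; it writes nothing
    # step 5: replace the int3 with the first byte of the jmp
    code[offset] = jmp[0]
    yield bytes(code[offset:offset + 5])


if __name__ == "__main__":
    code = bytearray(b"\x55\x89\xe5\x83\xec\x08\x90\x90")  # hypothetical original code
    jmp = encode_jmp(src=0x1000, dst=0x2000)
    for state in patch_states(code, 0, jmp):
        print(state.hex(" "))
```

In every intermediate state the first byte is either the int3 or the opcode of a complete jmp, which is the property the new steps 3 to 5 are meant to provide.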

What do you think of this?


Richard J Moore wrote:



There is another issue to consider when looking into using probes other than int3:

Intel erratum 54 - Unsynchronized Cross-modifying code - refers to the
practice of modifying code on one processor where another has prefetched
the unmodified version of the code. Intel states that unpredictable general
protection faults may result if a synchronizing instruction (iret, int,
int3, cpuid, etc ) is not executed on the second processor before it
executes the pre-fetched out-of-date copy of the instruction.

When we became aware of this I had a long discussion with Intel's
microarchitecture guys. It turns out that the reason for this erratum
(which incidentally Intel does not intend to fix) is that the trace
cache - the stream of micro-ops resulting from instruction interpretation -
cannot be guaranteed to be valid. Reading between the lines, I assume this
issue arises because of optimization done in the trace cache, where it is
no longer possible to identify the original instruction boundaries. If the
CPU discovers that the trace cache has been invalidated because of
unsynchronized cross-modification then instruction execution will be
aborted with a GPF. Further discussion with Intel revealed that replacing
the first opcode byte with an int3 would not be subject to this erratum.

So, is cmpxchg reliable? One has to guarantee more than mere atomicity.



- -
Richard J Moore
IBM Advanced Linux Response Team - Linux Technology Centre
MOBEX: 264807; Mobile (+44) (0)7739-875237
Office: (+44) (0)1962-817072


From:    Andi Kleen <ak@suse.de>
Date:    31/07/2005 23:03
To:      Mathieu Desnoyers <compudj@krystal.dyndns.org>
cc:      Andi Kleen <ak@suse.de>, Karim Yaghmour <karim@opersys.com>, Masami Hiramatsu <masami.hiramatsu@gmail.com>, Masami Hiramatsu <hiramatu@sdl.hitachi.co.jp>, Roland McGrath <roland@redhat.com>, Richard J Moore/UK/IBM@IBMGB, systemtap@sources.redhat.com, sugita@sdl.hitachi.co.jp, Satoshi Oshima <soshima@redhat.com>, michel.dagenais@polymtl.ca
Subject: Re: Hitachi djprobe mechanism





On Sat, Jul 30, 2005 at 12:47:47PM -0400, Mathieu Desnoyers wrote:


* Andi Kleen (ak@suse.de) wrote:

As I see it, the write in memory is atomic, but not the instruction
fetching. In that case, the reader would see an inconsistent last jmp
address byte.


Yes, you're right. cmpxchg only helps when the replaced instruction
is >= the new instruction. For smaller instructions only an IPI to
stop all CPUs works.


It was not exactly the point of my comment. If we try to overwrite an
existing instruction, without any marker, two cases may show up:

* the instruction to replace is >= the jmp instruction (5 bytes)

It has been suggested that using cmpxchg8 would solve this problem.
cmpxchg8 does indeed commit 8 bytes of data to memory atomically, even
on 32-bit architectures.
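
As a rough illustration of the store side only, the 8-byte value for such a compare-and-exchange could be built as below. This is a sketch under assumptions: the helper names are hypothetical, and restricting the patch to a single naturally aligned 8-byte word is a conservative choice made here, not something stated in this thread. It says nothing about how another CPU fetches the bytes, which is the question that follows.

```python
JMP_LEN = 5  # size of a jmp rel32
WORD = 8     # bytes written by one 8-byte compare-and-exchange


def fits_in_aligned_word(addr: int, length: int = JMP_LEN, word: int = WORD) -> bool:
    """True if [addr, addr + length) lies inside one naturally aligned word."""
    return (addr % word) + length <= word


def build_new_word(old_word: bytes, addr: int, jmp: bytes) -> bytes:
    """Return the word to store: the old bytes with the jmp spliced in at addr."""
    if not fits_in_aligned_word(addr, len(jmp), len(old_word)):
        raise ValueError("patch site straddles an aligned word boundary")
    start = addr % len(old_word)
    return old_word[:start] + jmp + old_word[start + len(jmp):]


if __name__ == "__main__":
    old = bytes(range(8))  # hypothetical current contents of the aligned word
    new = build_new_word(old, 0x1002, b"\xe9\xfb\x0f\x00\x00")
    print(old.hex(" "), "->", new.hex(" "))
```

With an 8-byte word, a 5-byte jmp fits only when the patch address is at offset 0 to 3 within the word.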

My question is related to the instruction we want to replace: how is it
read by the CPU? If it's 5 bytes in size, it has to be read in two chunks
by the CPU on a 32-bit arch. Does the CPU lock the memory bus between
those two reads?


The 32-bit ISA has nothing to do with how the CPU fetches instructions
("32-bit" x86s usually have a much wider memory interface).

In general these things are done on cache lines of between 32 and 128 bytes,
depending on the CPU. Of course instructions can cross cache lines, but the
CPU should handle that atomically.

However, afaik there is no guarantee for that in the architecture, so you
cannot really rely on it. If, let's say, the 386 had this behaviour then it is
probably safe to assume later x86s implement it too for compatibility
(modulo bugs).

In practice it's more complicated. The CPU fetches the instruction
some time before actually executing it into its pipeline, and then sniffs
the bus for any modifications of it and then cancels and reexecutes the
instruction if needed.

However when you look at CPU errata sheets you will find quite a lot
of bugs in this area, so I would not really rely on frequent patching for
production.

I think just using the IPI is much simpler and easier.



* the instruction to replace is < the jmp instruction (4 bytes or less)

If our goal is to overwrite code which has not been surrounded by a
marker, an IPI wouldn't save us here. The marker is necessary in order to
disable interrupts and make the IPI meaningful.


You lost me here.




Actually there may be tricks possible to first int3 (or equivalent single
byte replacement on other archs) the second instruction, then the first,
then wait for an RCU quiescent period on all CPUs and then write the longer
jump. But an IPI is probably easier because it doesn't need a full
disassembler for this and setting probes should not be performance
critical.


Well, in fact, there is still a problem. (oh no, not again!) ;) RCU does
require the reader to disable preemption, otherwise there is no guarantee
that it won't be scheduled out in the middle of the critical section, and
RCU only guarantees that a non-schedulable reader will have finished by
the time the RCU period is over.

How do you plan to disable involuntary preemption around the instructions
you want to overwrite?



One way would be to just search the task list for any tasks blocked with an IP inside the patched region. If there are any, wait for another quiescent period.
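
A minimal user-space sketch of that check, with the task list modelled as plain data. The `Task` record and the `tasks_inside` helper are hypothetical names for illustration, not kernel APIs, and counting the first byte of the region as "inside" is a conservative assumption made here.

```python
from typing import Iterable, List, NamedTuple


class Task(NamedTuple):
    pid: int
    saved_ip: int  # instruction pointer saved when the task was scheduled out


def tasks_inside(tasks: Iterable[Task], start: int, length: int) -> List[int]:
    """Return the pids of tasks whose saved IP lies in [start, start + length)."""
    # Including `start` itself is conservative: it can only cause an extra wait.
    return [t.pid for t in tasks if start <= t.saved_ip < start + length]


if __name__ == "__main__":
    snapshot = [Task(1, 0x1000), Task(2, 0x1003), Task(3, 0x1005), Task(4, 0x0FFF)]
    blocked = tasks_inside(snapshot, start=0x1000, length=5)
    if blocked:
        print("wait for another quiescent period; tasks inside region:", blocked)
    else:
        print("no task is blocked inside the region")
```

A caller would repeat the scan after each quiescent period until the returned list is empty, and only then write the longer jump.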

-Andi



