Summary: | i386 RHEL4U2 problem with some variables | ||
---|---|---|---|
Product: | systemtap | Reporter: | Martin Hunt <hunt> |
Component: | translator | Assignee: | Unassigned <systemtap> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | dsmith, fche, kevinrs |
Priority: | P1 | ||
Version: | unspecified | ||
Target Milestone: | --- | ||
Host: | Target: | ||
Build: | Last reconfirmed: | ||
Bug Depends on: | |||
Bug Blocks: | 907, 1382 | ||
Attachments: |
probe script file from comment #12
test c program from comment #12 updated test script updated test program |
Description
Martin Hunt
2005-09-05 04:14:48 UTC
I've seen this too. I recall Roland mentioning on IRC that something was wrong with the %esp value being used for this target variable. Roland, could you elaborate? IIRC the location expression for that variable on x86 was encoded as a compuation from %esp. Prior to my 2005-08-30 fix to runtime/loc2c-runtime.h, incorrect values for %esp were being used and so any variable location computations based on it would yield bogus pointers. If that's the problem being seen here, that should be fixed using the current sources. The loc2c-runtime.h problem wouldn't explain why it works on FC4 and FC5 kernels. Also, it is still happening with the CVS sources as of 15 minutes ago. Which exact kernel rpm (uname -r)? This might be a different bug relating to particular debuginfo details that differ in the different kernel builds. 2.6.9-17.EL. Also happened with 2.6.9-16.EL. This is verified by 4 different people. I have the problem on two different machines. This sounds like something different from the problem Frank had in mind. I don't doubt that you can reproduce the problem. But I am not set up to do so easily (no RHEL4U2 install on hand). It takes a long time just to download the debuginfo rpm. Please attach some -v output that shows the exact PC it uses and so forth. With that and the debuginfo I can see what is going wrong, though I don't think I can get systemtap running on the same kernels myself very quickly. 2.6.9-17EL ----------------------------------------------------------------- Created temporary directory "/tmp/stappaSxOb" Searched '/usr/share/systemtap/tapset/2.6.9-17.EL/*.stp', match count 0 Searched '/usr/share/systemtap/tapset/2.6.9/*.stp', match count 0 Searched '/usr/share/systemtap/tapset/2.6/*.stp', match count 0 Searched '/usr/share/systemtap/tapset/*.stp', match count 4 Pass 1: parsed user script and 4 library script(s). parsed 'sys_open' -> func 'sys_open' pattern 'kernel' matches module 'kernel' focused on module 'kernel' = [c0100000-c0440dbc, bias 0] pattern 'sys_open' matches function 'sys_open' selected function sys_open querying prologue-end of function 'sys_open' finding location for local 'filename' near address c0167662, module bias 0 [...] I checked the generated code and the probe was set at 0xc0167662. ./loc2c-test -k 0xc0167662 [...] sys_open (0x2e): 0xc016765f (fs/open.c:947) .. 0xc01676dc (fs/open.c:984) filename [434a24] char const* DIsassembling sys_open, I see c016765f <sys_open>: sys_open(): fs/open.c:947 c016765f: 57 push %edi c0167660: 56 push %esi c0167661: 53 push %ebx c0167662: 8b 44 24 10 mov 0x10(%esp),%eax fs/open.c:954 c0167666: e8 05 f6 00 00 call c0176c70 <getname> ----------------------------------------------- Comparing the above with kernel 2.6.12-1.1447_FC4smp Created temporary directory "/tmp/stapv3MNGJ" Searched '/usr/share/systemtap/tapset/2.6.12-1.1447_FC4smp/*.stp', match count 0 Searched '/usr/share/systemtap/tapset/2.6.12/*.stp', match count 0 Searched '/usr/share/systemtap/tapset/2.6/*.stp', match count 0 Searched '/usr/share/systemtap/tapset/*.stp', match count 4 Pass 1: parsed user script and 4 library script(s). parsed 'sys_open' -> func 'sys_open' pattern 'kernel' matches module 'kernel' focused on module 'kernel' = [c0100000-c049dfb8, bias 0] pattern 'sys_open' matches function 'sys_open' selected function sys_open querying prologue-end of function 'sys_open' finding location for local 'filename' near address c0161f32, module bias 0 ./loc2c-test -k 0xc0161f32 sys_open (0x2e): 0xc0161f2b (fs/open.c:938) .. 0xc0161f86 (fs/open.c:974) filename [50a178] char const* Disassembling kernel at sys_open: c0161f2b <sys_open>: sys_open(): fs/open.c:938 c0161f2b: 57 push %edi c0161f2c: 56 push %esi c0161f2d: 53 push %ebx c0161f2e: 8b 44 24 10 mov 0x10(%esp),%eax fs/open.c:945 c0161f32: e8 89 c9 00 00 call c016e8c0 <getname> ---------------- Is that the problem? In the first example, the kprobe is placed before the " mov 0x10(%esp),%eax" and in the second (working) example, it is placed after. That is indeed the problem. The debug information says "filename" is in %eax throughout the function (which means after the prologue): [434a24] formal_parameter name "filename" decl_file 1 decl_line 946 type [42ab7d] location 1 byte block [ 0] reg0 But the ad hoc prologue marker in the debug line info says that 0xc0167662 is the end of the prologue: advance address by 90 to 0xc016765f special opcode 21: address+0 = 0xc016765f, line+6 = 947 special opcode 57: address+3 = 0xc0167662, line+0 = 947 special opcode 78: address+4 = 0xc0167666, line+7 = 954 I believe this should mean it's the first instruction after the prologue. However, that instruction is necessary to load the argument, so I would call it the final instruction of the prologue. The debug line info in the FC4 kernel you cited seems to bear this out. So we have a bug to file against the RHEL4U2 gcc. We might want to consider if we can work around this in the prologue detection code in systemtap, since getting a new kernel build with a fixed compiler at this point seems unlikely to happen for RHEL4U2. Many tapsets will depend on resolution of this bug. *** Bug 1356 has been marked as a duplicate of this bug. *** patch committed It corrects the specific scenario that Martin reported on (sys_open), but it seems there are still some quirks. I probed these 2 functions: long sys_time(time_t __user * tloc) long sys_open(const char __user * filename,...) Here is the script: probe kernel.function("sys_time") { if(execname()=="user") log("Addr of tloc = ".string($tloc)) } probe kernel.function("sys_open"){ if(execname()=="user") { log("Addr of filename = ".string($filename)) log("\tfilename = ".user_string($filename)) } } ...and here is a corresponding C test program: #include <time.h> #include <stdio.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> int main (void) { int fd; time_t t = 44; char *fn = "myfile"; time(&t); printf("Addr of tloc = %d\n",&t); fd = open(fn,O_RDONLY); printf("Addr of filename = %d\n",fn); } This is the result from the script: Addr of tloc = 4648948 Addr of filename = 134513960 filename = myfile This is the result of the C program: Addr of tloc = -1074533664 Addr of filename = 134513960 On sys_open, the __user * param gets resolved correctly, this does not seem to be the case on sys_time. Created attachment 674 [details] probe script file from comment #12 Created attachment 675 [details] test c program from comment #12 The sys_time case appears to be a different bug. We appear to find the end of the prologue close enough. The debugging data leads loc2c astray at the resulting PC. From my 19.EL i686 kernel, the sys_time disassembly begins like this: c0125d9c <sys_time>: c0125d9c: 56 push %esi c0125d9d: 53 push %ebx c0125d9e: 53 push %ebx c0125d9f: 53 push %ebx c0125da0: 8b 5c 24 14 mov 0x14(%esp),%ebx c0125da4: 89 e0 mov %esp,%eax c0125da6: e8 f5 6b fe ff call c010c9a0 <do_gettimeofday> Thew new proloogue heurstic nails ...da4 as the insn of interest; the old did ...d9e. The loc2c expression for $tval is (%esp+0x14+0x8) for the new case and (%esp+0x14) for the old. They both appear to be too high by 8 bytes - as if the frame_base value calculation gives wrong results in both cases. So the prologue detector change does not appear to relate to this bug. The question (roland?) is what's going on with the frame_base here (assuming my reading of the code/data is correct)? Roland, please advise. Yes, I think this is a gcc bug. I filed this report: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=169485 This problems is also seen on IA64. Following error is seen when "stap -v sysopen.stp" is run on IA64. -------------------------------------- ERROR: user string copy fault at 0xa0000001000 near embedded-code at /usr/local/share/systemtap/tapset/conversions.stp:32:4 ---------------------------------- The RHEL4 3.4.4-3 compiler revision mentioned in RH bug 159485 appears to fix the tloc frame-base problem in i686. (In reply to comment #19) > The RHEL4 3.4.4-3 compiler revision mentioned in RH bug 159485 appears to fix > the tloc frame-base problem in i686. Where can I get the fixed compiler to test the fix? Frank made a typo, btw; the correct reference to the RHEL4 compiler bug is: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=169485 The fix is in upstream gcc and backported to the RHEL4 gcc slated for RHEL4U3. That will be in beta fairly soon, available officially to IBM et al through Red Hat beta relationships. If you would do significant testing in the next few days and are eager to get it ASAP, we can put some unofficial unreleased RPMs up somewhere public sooner. [...]
> If you would do significant testing in the next few
> days and are eager to get it ASAP, we can put some unofficial unreleased RPMs up
> somewhere public sooner.
My final day is the 23 of Dec. If the U3 beta's are not available between now
and the 16th, I would like you to post the unofficial RPM's somewhere that I can
grab them and do some testing. Tx.
It looks like that beta may be later than I thought. There are some src.rpm's in http://people.redhat.com/roland/tmp/ (or ftp: same) for binutils and gcc that you can rpmbuild --rebuild and test with. Created attachment 5315 [details]
updated test script
Created attachment 5316 [details]
updated test program
I've tested this one with the latest RHEL4 rpms: kernel-2.6.9 100.EL.i686 gcc-3.4.6 11.el4_8.1.i386 Both with systemtap-1.3 5.el4.i386 and current git head systemtap, the behavior is now correct for both sys_time() and sys_open(). However, glibc has changed a bit making this harder to verify. When I originally ran the test program and systemtap script attached to this bug (after updating the script a bit), I was always getting '0' for the argument of sys_time(). After digging a bit deeper, I realized that strace also showed time() being called with NULL. So, the glibc time() wrapper is calling time() with NULL and then copying the output to the argument. After changing the test program to use 'syscall(SYS_TIME, &t)', I'm now seeing the proper argument to sys_time(). Here's the output I now get from the (updated) test program: Addr of tloc = 0xbffe0be0 Addr of filename = 0x80485a0 Here's the output from the (updated) script: Addr of tloc = 0x0 Addr of tloc = 0xbfe640a0 Addr of filename = 0x80485a0 filename = myfile So, the arguments to those two syscalls are now correct. |