Bug 5745

Summary: fstack cores
Product: frysk Reporter: Andrew Cagney <cagney>
Component: generalAssignee: Unassigned <frysk-bugzilla>
Status: ASSIGNED ---    
Severity: normal CC: mark
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Host: Target:
Build: Last reconfirmed:
Bug Depends on:    
Bug Blocks: 2244    

Description Andrew Cagney 2008-02-07 19:01:25 UTC
cagney@localhost$ ../frysk-core/frysk/bindir/fcatch
../frysk-core/frysk/bindir/fstack core.18692 ./TestRunner
Entered parseCommand
fcatch: from PID 18955 TID 18955:
SIGSEGV(11) detected - dumping stack trace for TID 18955
#0 0x082cf3b8 in read_proc_memory(void*, void*, unsigned long long, unsigned
int, unsigned int) () from
#1 0x0855348b in elf_from_remote_memory () from
#2 0x082cf35f in dwfl_frysk_proc_find_elf(Dwfl_Module*, void**, char const*,
unsigned long long, char**, Elf**) () from
#3 0x08551898 in find_file () from
#4 0x08551ae3 in find_symtab () from
#5 0x08551c6a in dwfl_module_getsymtab () from
#6 0x085540dc in dwfl_module_addrsym () from
#7 0x082cfd98 in hidden alias for void lib::dwfl::DwflModule::getSymbol(long
long, lib::dwfl::SymbolBuilder*) () from
#8 0x08218e7f in frysk::symtab::Symbol*
frysk::symtab::SymbolFactory::getSymbol(frysk::proc::Task*, long long) () from
#9 0x082091a2 in frysk::symtab::Symbol*
frysk::stack::LibunwindFrame::getSymbol() () from
#10 0x082072ea in void frysk::stack::Frame::toPrint(java::io::PrintWriter*,
bool, bool) () from /home/scratch/frysk/pid/native/frysk-core/frysk/bindir/fstack
#11 0x0820b720 in void
frysk::proc::Task*, bool, bool, int) () from
#12 0x08230901 in void frysk::util::StacktraceAction::printTasks() () from
#13 0x08230a52 in void
frysk::util::StacktraceAction::allExistingTasksCompleted() () from
#14 0x081e8dc7 in
frysk::proc::ProcObserver$ProcAction*) () from
#15 0x080fa5d3 in void
frysk::bindir::fstack::stackCore(frysk::util::CoreExePair*) () from
#16 0x080fa98d in void
frysk::bindir::fstack::access$1(frysk::util::CoreExePair*) () from
#17 0x080f9f15 in void
frysk::bindir::fstack$2::parseCores(JArray<frysk::util::CoreExePair*>*) () from
#18 0x0822d158 in JArray<java::lang::String*>*
frysk::util::CommandlineParser::doParse(JArray<java::lang::String*>*) () from
#19 0x0822ccd5 in JArray<java::lang::String*>*
frysk::util::CommandlineParser::parse(JArray<java::lang::String*>*) () from
#20 0x080fa92d in void frysk::bindir::fstack::main(JArray<java::lang::String*>*)
() from /home/scratch/frysk/pid/native/frysk-core/frysk/bindir/fstack
#21 0x02ddd3d1 in void gnu::java::lang::MainThread::call_main() () from
#22 0x02e56987 in void gnu::java::lang::MainThread::run() () from
#23 0x02defb63 in _Jv_ThreadRun(java::lang::Thread*) () from
#24 0x02d9b8ea in _Jv_RunMain(_Jv_VMInitArgs*, java::lang::Class*, char const*,
int, char const**, bool) () from /usr/lib/libgcj.so.8rh.0.0
#25 0x02d9baa4 in _Jv_RunMain(java::lang::Class*, char const*, int, char
const**, bool) () from /usr/lib/libgcj.so.8rh.0.0
#26 0x02d9bb2b in JvRunMain () from /usr/lib/libgcj.so.8rh.0.0
#27 0x080f66c1 in main () from
#28 0x00808390 in __libc_start_main () from /lib/libc-2.7.so
#29 0x080f65c1 in void java::awt::Component::transferFocusUpCycle() () from
SIGSEGV(11) detected - dumping stack trace for TID 18956
#0 0x00110402 in __kernel_vsyscall () from
#1 0x000518e0 in [unknown] from /home/scratch/frysk/pid/native/frysk-sys/Unknown
#2 0xf13c8001 in [unknown] from /home/scratch/frysk/pid/native/frysk-sys/Unknown
Comment 1 Mark Wielaard 2008-02-07 20:25:10 UTC
That looks familiar, but I lost the core file that produced a similar issues.
Can you make the core file available somewhere?
Comment 2 Phil Muldoon 2008-02-08 10:53:24 UTC
Pretty certain that Java does not glob ./Testrunner as 'pwd'/TestRunner. It just
sees the filename as ./Testrunner. Can you rerun the test with physical location
of TestRunner provided?
Comment 3 Phil Muldoon 2008-02-08 11:56:00 UTC
Looking a little closer, it seems LinuxCoreHost was not using the Canonical File
name, so the ../ and the ./ were not being correctly resolved in the path:

2008-02-08  Phil Muldoon  <pmuldoon@redhat.com>

	* LinuxCoreHost.java: Use CanonicalFile() througout.
Comment 4 Andrew Cagney 2008-02-08 14:31:52 UTC
There's a more fundamental issue here; this should not have lead to an _abort_

Since we clearly don't know what to do with the IO exception, getAbsolutePath()
is also a better option.
Comment 5 Phil Muldoon 2008-02-08 14:40:32 UTC
Feel free to split it into two bugs. One where the the process run aground in
the dwfl call, and another where ./ and ../ are not being dealt with other than
in a canonical file case. They need to be addressed differently.

As for the IOException, the converstion of a IOException to a RuntimeException
is just conversion to an unchecked versus checked exception. I'm quite open to
throwing either, but in keeping with the rest of frysk's usage on unchecked

I think think the case of using a canonical file is correct in LinuxCoreHost as
it is correctly constructing the location of the file. getAbsolutePath wasn't.
It was using the file as passed by the user, with the expectation that ./ and
../ would be correctly parsed. I'll say usability trumps pedantic do as I mean,
now what I say (type) here.

I won't address the dwfl issue as I have no expertise there.
Comment 6 Phil Muldoon 2008-02-08 16:39:07 UTC
Attempting to replicate the issue with a simple process produces the following
sacktrace below. I added the debug statements for emphasis. The ideal solution
here would be for LinuxCoreHost not to deal with paths at all, and have Elf just
provide the File based constructor. Also, a change to how StatelessFile itself
constructs the filename would be required:

     public StatelessFile(File file)
	byte[] path = file.getAbsolutePath().getBytes();
	// NUL terminate it - an array start with NUL
	unixPath = new byte[path.length + 1];
	System.arraycopy (path, 0, unixPath, 0, path.length);

This also uses AbsolutePath, and does not account for encoding.  

There is another improvement in that the metadata instead of storing strings
extracted from the linkmap for library names, store File objects instead. This
once again shunts the native -> java path fixing off to one central location.

Anyway, the bug I tried to replicate showed as:

[pmuldoon@localhost frysk-core]$ ./frysk/bindir/fstack core.2157 ./sleep

Exe file backend is print as System.out.println(File): ./sleep

Exe file backend is print as System.out.println(File.getAbsolutePath())

Exe file backend is print as System.out.println(File.getCanoncialPath())

Exception in thread "main" java.lang.NullPointerException
   at lib.dwfl.DwflModule.getSymbol(fstack)
   at frysk.symtab.SymbolFactory.getSymbol(fstack)
   at frysk.stack.LibunwindFrame.getSymbol(fstack)
   at frysk.stack.Frame.toPrint(fstack)
   at frysk.stack.StackFactory.printTaskStackTrace(fstack)
   at frysk.util.StacktraceAction.printTasks(fstack)
   at frysk.util.StacktraceAction.allExistingTasksCompleted(fstack)
   at frysk.proc.ProcCoreAction.<init>(fstack)
   at frysk.bindir.fstack.stackCore(fstack)
   at frysk.bindir.fstack.access$1(fstack)
   at frysk.bindir.fstack$2.parseCores(fstack)
   at frysk.util.CommandlineParser.doParse(fstack)
   at frysk.util.CommandlineParser.parse(fstack)
   at frysk.bindir.fstack.main(fstack)
Comment 7 Phil Muldoon 2008-02-12 15:05:36 UTC
Some additional root cause analysis. I've spent the last few days working down
through the code to the dwlf native calls. Given this command reproducer:

[pmuldoon@localhost frysk-core]$ cp /bin/sleep .
[pmuldoon@localhost frysk-core]$ ./sleep 5000 &
[3] 8966
[pmuldoon@localhost frysk-core]$ ./frysk/bindir/fcore 8966
[pmuldoon@localhost frysk-core]$ ./frysk/bindir/fstack core.8966 ./sleep


Exception in thread "main" java.lang.NullPointerException
   at lib.dwfl.DwflModule.getSymbol(DwflModule.cxx:126)
   at frysk.symtab.SymbolFactory.getSymbol(SymbolFactory.java:86)
   at frysk.stack.LibunwindFrame.getSymbol(LibunwindFrame.java:185)
   at frysk.stack.Frame.toPrint(Frame.java:163)
   at frysk.stack.StackFactory.printTaskStackTrace(StackFactory.java:105)
   at frysk.util.StacktraceAction.printTasks(StacktraceAction.java:156)
   at frysk.proc.ProcCoreAction.<init>(ProcCoreAction.java:60)
   at frysk.bindir.fstack.stackCore(fstack.java:145)
   at frysk.bindir.fstack.access$1(fstack.java:140)
   at frysk.bindir.fstack$2.parseCores(fstack.java:166)
   at frysk.util.CommandlineParser.doParse(CommandlineParser.java:169)
   at frysk.util.CommandlineParser.parse(CommandlineParser.java:109)
   at frysk.bindir.fstack.main(fstack.java:278)

If we add in the following debug statements to: frysk-sys/lib/dwfl/cni/Dwfl.cxx
in function: lib::dwfl::Dwfl::dwfl_report_module

lib::dwfl::Dwfl::dwfl_report_module(jstring moduleName, jlong low, jlong high)
	jsize len = JvGetStringUTFLength(moduleName);
	char modName[len+1]; 
	JvGetStringUTFRegion(moduleName, 0, len, modName);
	modName[len] = '\0';
	printf("Reporting module: %s\n at address 0x%lx-0x%lx\n",modName,low,high);
	::dwfl_report_module(DWFL_POINTER, modName, (::Dwarf_Addr) low, 
	                     (::Dwarf_Addr) high);  

The output is produced:

Reporting module: ./sleep
 at address 0x400000-0x626000
Reporting module: /lib64/librt.so.1
 at address 0x3235400000-0x3235609000
Reporting module: /lib64/ld-linux-x86-64.so.2
 at address 0x38ca800000-0x38caa1c000
Reporting module: /lib64/libc.so.6
 at address 0x38cba00000-0x38cbd57000
Reporting module: /lib64/libpthread.so.0
 at address 0x38cc600000-0x38cc81b000
Reporting module: [vdso]
 at address 0x7fff997fe000-0x7fff99800000

Note the the module is reported as provided by the user: ./sleep

The backtrace eventually grounds out at GetSymbol in DwflModule.cxx. Adding some
debug statements here:

lib::dwfl::DwflModule::getSymbol(jlong address, lib::dwfl::SymbolBuilder*
	Dwarf_Addr addr = (Dwarf_Addr) address;
	GElf_Sym closest_sym;

	printf("Testing error no with dwfl pointer 0x%lx address 0x%lx\n",(long)
DWFL_MODULE_POINTER, (long) address);
	const char* methName = dwfl_module_addrsym(DWFL_MODULE_POINTER, addr, 
                                                   &closest_sym, NULL);
	if (errno != 0)
	  printf("Erro no is %d.  Message: %s\n", errno, strerror(errno));
	  printf("Method name is: %s\n", methName);

	jstring jMethodName;
	if (methName == NULL)	
		jMethodName = NULL;
		jMethodName = JvNewStringUTF(methName);	



Produces this output:

Testing error no with dwfl pointer 0x107b880 address 0x38cba9ac30
Method name is: __nanosleep_nocancel

Testing error no with dwfl pointer 0x107b500 address 0x402f1a
Exception in thread "main" java.lang.NullPointerException
   at lib.dwfl.DwflModule.getSymbol(DwflModule.cxx:126)

Note that the first symbol lookup is fine. When a second symbol lookup occurs at
the memory address represented by the ./bash named module, the function:
dwfl_module_addrsym never returns, or sets an errno. It seems to die in the
elfutils lib. Subsequently, if you follow the exception path back up to
Frame.java toPrint() method, the symbol is null:

   public void toPrint (PrintWriter writer, boolean printSource, boolean fullpath) {
	// the address, padded with 0s based on the task's word size, ...
	String addr = Long.toHexString(getAddress());
	int padding = 2 * getTask().getISA().wordSize() - addr.length();
	for (int i = 0; i < padding; ++i)
	// the symbol, if known append (), ..
	Symbol symbol = getSymbol();
	writer.write(" in ");

and the  call to getDemangledName() is null. The backtrace is produced.

Comment 8 Phil Muldoon 2008-02-12 15:14:30 UTC
I wanted to be sure that the dwfl_report_module was not setting errno, so I
checked it after the call with the code below. For all modules including:

[pmuldoon@localhost frysk-core]$ ./frysk/bindir/fstack core.8966 ./sleep
Reporting module: ./sleep
 at address 0x400000-0x626000
Erro no is 0, message is Success

I get success. Code:

lib::dwfl::Dwfl::dwfl_report_module(jstring moduleName, jlong low, jlong high)
	jsize len = JvGetStringUTFLength(moduleName);
	char modName[len+1]; 
	JvGetStringUTFRegion(moduleName, 0, len, modName);
	modName[len] = '\0';
	printf("Reporting module: %s\n at address 0x%lx-0x%lx\n",modName,low,high);
	::dwfl_report_module(DWFL_POINTER, modName, (::Dwarf_Addr) low, 
	                     (::Dwarf_Addr) high);  
	printf("Erro no is %d, message is %s\n", errno, strerror(errno));
Comment 9 Phil Muldoon 2008-02-12 15:20:18 UTC
Upstream already has a LinuxCoreHost.java patch that converts ./sleep to the
canonical name `pwd`/sleep, so this error will not occur in Frysk at present.
This is neither the right place to do the converstion, or the right fix. IMO the
dwfl issue should be further investigated into the dwfl lib code as this code
masks the problem, rather than solving it.

I'm removing my assignment from this bug.