This is the mail archive of the elfutils-devel@sourceware.org mailing list for the elfutils project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

caching bug example


To play along at home, use the roland/dwarf-hacking branch.
This has the cache reenablement patch I included before, plus
some debugging hacks I'm using.

The test file is just elfcmp from my build.  But to keep us stably working
with the same data and so we can talk about file offsets and all, it's here:
	http://roland.fedorapeople.org/tmp/test-elfcmp

My build is by gcc-4.4.4-10.fc13.x86_64 and uses:

 CFLAGS='-fexceptions -g' CXXFLAGS='-D_GLIBCXX_DEBUG -g' --enable-maintainer-mode --enable-tests-rpath --prefix=/usr --libdir=/usr/lib64 --program-prefix=eu- 

You'll need to match that for the frame counts in my gdb fragments to be
right for you.  Otherwise, you can poke around with "up RET RET RET ..."
until you hit the right place and see how many frames it actually is for
each case in your build.

The test is the first failure in make check on this branch:

	src/dwarfcmp-test -T test-elfcmp{,}

It fails because the copied dwarf_output doesn't match the dwarf.
It comes out arcanely different because of bad reference matches
that influenced what subtrees got unified into one shared tree.

Firstly, we look at the trees for reference:

	tests/dwarf-print --sort-attrs test-elfcmp > a
	tests/dwarf-print --output test-elfcmp > b
	tests/dwarf-print --offsets test-elfcmp > c
	tests/dwarf-print --output --stats --silent test-elfcmp > s

The --sort-attrs output is the input file (dwarf class), but with
attributes sorted by (integer) name so that the conceptually unordered
attribute set is in a canonical order that prints identically regardless of
the order the input file uses.  The --output (dwarf_output class) data
always has sorted attributes, so that makes a and b directly comparable.

The --offsets output (you can add --sort-attrs after it if you want) is
handy for cross-referencing while you debug.  It prints a fake attribute
offset=[0x1234] in each DIE (whereas the other format prints a fake
attribute ref="ref123").  It's otherwise the same as a, but it's not
textually comparable to b.  Importantly, it has the line breaks in all the
same places.  So each line of c corresponds to the same-numbered line in a
(and should to b, if dwarf_output is not too buggy).

So when you are debugging, look at the DIE offset you are dealing with and
then you can look that up in c, go to the same line number in a and b, and
see how they each look.

The dump in s can also come in handy.  You can look up original input file
DIE (hex) offsets in there to see what did or didn't get shared, etc.  To
the extent that the output formats are not obvious, just ask.  (I actually
don't remember off hand exactly what all the numbers are.)

So, with this data in hand, now you can do:

	diff -u a b

and you'll see that all the differences seem to be in the ref numbers,
so you know it's harder than that.

I'm using these gdb macros:

	(gdb) define pp
	Type commands for definition of "pp".
	End with a line saying just "end".
	>p/x b.offset()
	>p b.to_string()
	>p/x *(a.info())
	>end
	(gdb) define refmatch
	Type commands for definition of "refmatch".
	End with a line saying just "end".
	>up 7
	>pp
	>end
	(gdb) define diematch
	Type commands for definition of "diematch".
	End with a line saying just "end".
	>up 8
	>pp
	>end
	(gdb)

Btw, I always use:

	(gdb) catch throw
	Catchpoint 1 (throw)
	(gdb) 

first thing.  

Set a breakpoint at dwarf_comparator:309.
That's the debugging hack line with asm("nop").
I always use C-x SPC, but that's the line number.

Now run it:

	(gdb) r -T ../test-elfcmp{,}
	Starting program: /home/roland/build/elfutils/src/dwarfcmp-test -T ../test-elfcmp{,}
	[Thread debugging using libthread_db enabled]

	Breakpoint 5, elfutils::dwarf_comparator<elfutils::dwarf_output, elfutils::dwarf, false, elfutils::dwarf_ref_tracker<elfutils::dwarf_output, elfutils::dwarf> >::match_rhs::operator() (this=0x7fffffff60f0, it1=..., it2=...)
	    at ../../../redhat/elfutils/src/../libdw/c++/dwarf_comparator:309

(If you followed these instructions exactly, yours will be "Breakpoint 2".)
That breakpoint hits only when an attribute value mismatch was just seen.
So we are down inside however many layers of recursion we might be when
the first wrong detail was noticed.

	(gdb) up 4
	#4  0x00000000004d386d in elfutils::dwarf_comparator<elfutils::dwarf_output, elfutils::dwarf, false, elfutils::dwarf_ref_tracker<elfutils::dwarf_output, elfutils::dwarf> >::reference_match (this=0x7fffffff7530, ref1=..., ref2=...)
	    at ../../../redhat/elfutils/src/../libdw/c++/dwarf_comparator:551

This gets you to reference_match.  The DIEs that had differing values for
some attribute were being compared to satisfy a reference comparison.
That's what the first macro is for:

	(gdb) pp
	$110 = 0x6c63
	$111 = "<typedef name=\"int64_t\"/>"
	$112 = {
	  first = {
	    _m_children = 0x8712f0, 
	    _m_attributes = 0x8a8310, 
	    _m_hash = 0xbb7652b333a4027, 
	    _m_tag = 0x16
	  }, 
	  second = {
	    _m_parent = 0xff65c0, 
	    _m_refs = std::queue wrapping: std::__debug::deque with 1 elements = {
	      0x893c30
	    }, 
	    _m_originals = std::__debug::set with 4 elements = {
	      [0] = 0x534,
	      [1] = 0x2b28,
	      [2] = 0x5097,
	      [3] = 0x5e4e
	    }, 
	    _m_original_cost = 0x2c, 
	    _m_with_sibling = std::__debug::bitset = {
	      [1] = 0x1
	    }, 
	    _m_uses = 0x4
	  }
	}
	(gdb) 

The output shows us we are comparing (references to) input DIE [6c63] and
what should be its corresponding dwarf_output::debug_info_entry
incarnation.  The debugging hack there gets the die_info_pair, of which
.first is the debug_info_entry itself.  The die_info part holds all the
interesting stuff we recorded in the dwarf_output building process.

In particular, notice some nice STL pretty-printing there and we have the
_m_originals set nicely shown.  That's where we record (just for this kind
of debugging) the offsets of all the original dwarf::debug_info_entry's
that were consolidated into this one dwarf_output::debug_info_entry.
If you find 0x534 in s, you'll see:

4	hash=0x281033975398eb41	(10) 44 (33)	<typedef name="int64_t"/>	 0x534 0x2b28 0x5097 0x5e4e

That's just printing the same _m_originals set.
You can also find 0x6c63 and see:

1	hash=0x28105a1a3bd43b54	(10) 11 (0)	<typedef name="int64_t"/>	 0x6c63

So 4 instances of int64_t were merged and one was not.
You can wonder why that might have been and do:

	$ src/dwarfcmp -l test-elfcmp{,} 0x534 0x6c63
	534 vs 6c63: decl_file="/usr/include/sys/types.h" vs decl_file="/usr/include/stdint.h"
	534 vs 6c63: decl_line=198 vs decl_line=41
	$

So, they differ because they really do differ.  Fine and dandy.
So, how did we get here, thinking they should be the same?

Remember, our current frame now is in reference_match.
We have a macro for that.

	(gdb) refmatch
	#11 0x0000000000489121 in elfutils::dwarf_comparator<elfutils::dwarf_output, elfutils::dwarf, false, elfutils::dwarf_ref_tracker<elfutils::dwarf_output, elfutils::dwarf> >::match (this=0x7fffffff7530, a=..., b=...)
	    at ../../../redhat/elfutils/src/../libdw/c++/dwarf_comparator:265
	$113 = 0x76cf
	$114 = "<formal_parameter/>"
	$115 = {
	  first = {
	    _m_children = 0x8712f0, 
	    _m_attributes = 0x13c2a60, 
	    _m_hash = 0xc237b1935c5, 
	    _m_tag = 0x5
	  }, 
	  second = {
	    _m_parent = 0x13c3ae0, 
	    _m_refs = std::queue wrapping: std::__debug::deque with 1 elements = {
	      0x13c0d10
	    }, 
	    _m_originals = std::__debug::set with 8 elements = {
	      [0] = 0x1170,
	      [1] = 0x1190,
	      [2] = 0x3f7a,
	      [3] = 0x3f9a,
	      [4] = 0x5ab0,
	      [5] = 0x5ad0,
	      [6] = 0x686d,
	      [7] = 0x688d
	    }, 
	    _m_original_cost = 0x2c, 
	    _m_with_sibling = std::__debug::bitset = {
	      [0] = 0x1,
	      [1] = 0x1
	    }, 
	    _m_uses = 0x2
	  }
	}

Now our current frame is in DIE matching, not reference_match (you just
have to look).  We have a macro for that.

	(gdb) diematch
	#19 0x00000000004d3946 in elfutils::dwarf_comparator<elfutils::dwarf_output, elfutils::dwarf, false, elfutils::dwarf_ref_tracker<elfutils::dwarf_output, elfutils::dwarf> >::reference_match (this=0x7fffffff7f00, ref1=..., ref2=...)
	    at ../../../redhat/elfutils/src/../libdw/c++/dwarf_comparator:557
	$116 = 0x76c5
	$117 = "<subroutine_type>"
	$118 = {
	  first = {
	    _m_children = 0x13c1020, 
	    _m_attributes = 0x13b69d0, 
	    _m_hash = 0x8e92e3c9b8cb158d, 
	    _m_tag = 0x15
	  }, 
	  second = {
	    _m_parent = 0xff65c0, 
	    _m_refs = std::queue wrapping: std::__debug::deque with 1 elements = {
	      0x13c1d60
	    }, 
	    _m_originals = std::__debug::set with 4 elements = {
	      [0] = 0x1166,
	      [1] = 0x3f70,
	      [2] = 0x5aa6,
	      [3] = 0x6863
	    }, 
	    _m_original_cost = 0x68, 
	    _m_with_sibling = std::__debug::bitset = {
	      [1] = 0x1
	    }, 
	    _m_uses = 0x4
	  }
	}

Now we're back in reference_match, it so happens.  This could be all
thoroughly automated with some gdb python, but I just did it the old
school way with the macros that are easy to write, and the eyeball test
at each iteration.  Doing that all the way back, I culled the sequence
of comparison steps that got us here (shown here innermost first).  (I
didn't record which were refs and which were DIEs, but you can repeat
the experiment for yourself.)

	0x6c63 "<typedef name=\"int64_t\"/>"
	      [0] = 0x534,
	      [1] = 0x2b28,
	      [2] = 0x5097,
	      [3] = 0x5e4e
	0x76cf "<formal_parameter/>"
	      [0] = 0x1170,
	      [1] = 0x1190,
	      [2] = 0x3f7a,
	      [3] = 0x3f9a,
	      [4] = 0x5ab0,
	      [5] = 0x5ad0,
	      [6] = 0x686d,
	      [7] = 0x688d
	0x76c5 "<subroutine_type>"
	      [0] = 0x1166,
	      [1] = 0x3f70,
	      [2] = 0x5aa6,
	      [3] = 0x6863
	0x76df "<pointer_type/>"
	      [0] = 0x1180,
	      [1] = 0x3f8a,
	      [2] = 0x5ac0,
	      [3] = 0x687d
	0x7337 "<member name=\"dynamic_tag_name\"/>"
	      [0] = 0x3be8,
	      [1] = 0x5718,
	      [2] = 0x64d5
	0x7222 "<structure_type name=\"ebl\">"
	      [0] = 0x5603,
	      [1] = 0x63c0
	0x7217 "<typedef name=\"Ebl\"/>"
	      [0] = 0x55f8,
	      [1] = 0x63b5
	0x7601 "<pointer_type/>"
	      [0] = 0x59e2,
	      [1] = 0x679f
	0x75f6 "<formal_parameter/>"
	      [0] = 0x59d7,
	      [1] = 0x5a80,
	      [2] = 0x5ba7,
	      [3] = 0x5c9a,
	      [4] = 0x5cd9,
	      [5] = 0x5d42,
	      [6] = 0x6794,
	      [7] = 0x683d,
	      [8] = 0x6964,
	      [9] = 0x6a57,
	      [10] = 0x6a96,
	      [11] = 0x6aff,
	      [12] = 0x75f6,
	      [13] = 0x769f
	0x75ec "<subroutine_type>"
	      [0] = 0x59cd,
	      [1] = 0x678a,
	      [2] = 0x75ec
	0x7607 "<pointer_type/>"
	      [0] = 0x59e8,
	      [1] = 0x67a5,
	      [2] = 0x7607
	0x729b "<member name=\"reloc_simple_type\"/>"
	      [0] = 0x567c,
	      [1] = 0x6439,
	      [2] = 0x729b
	0x7222 "<structure_type name=\"ebl\">"
	      [0] = 0x7222
	0x7217 "<typedef name=\"Ebl\"/>"
	      [0] = 0x7217
	0x6c2c "<compile_unit name=\"../../../redhat/elfutils/libebl/ebldebugscnp.c\">"

So, what's really going on here is that there are some different entries
for <structure_type name="ebl"> that have gotten mixed up weirdly.

M-x occur says:

4 matches for "<structure_type name="ebl">" in buffer: s
    363:2	hash=0x25ae7a82f071dcfe	(10) 1180 (590)	<structure_type name="ebl">	 0x5603 0x63c0
    433:1	hash=0x6570b7d2bd1d5b7d	(10) 590 (0)	<structure_type name="ebl">	 0x7222
    434:1	hash=0x2e42f8c64d6c4a2c	(10) 590 (0)	<structure_type name="ebl">	 0x3ad3
    435:1	hash=0x9248a5043861eca7	(10) 590 (0)	<structure_type name="ebl">	 0xcc9

So, there were 5 different DIEs in the input (in different CUs, we expect).
Two got consolidated together and the others didn't.  That's fine for now.
They might differ for good reasons or ill reasons, but that's not quite our
concern at the moment.

What's certainly a concern is how that comparison path somehow led from one
copy to another.  In that comparison path, we started comparing 0x7222 to
the output DIE that came from 0x7222.  But when the input side ("b" in gdb)
went circular at 0x7217->0x7222, the output side ("a" in gdb) instead went
to a different <typedef name="Ebl"/> instead.  If we look back, the first
comparison step that was looking at an output copy that did not include the
input peer in its originals set is:

	0x7601 "<pointer_type/>"
	      [0] = 0x59e2,
	      [1] = 0x679f

There is no 0x7601 in that set.  But the <formal_parameter/> copied from
0x75f6 points to it.  So, how did that come to be?

It means the dwarf_output tracker decided that some type=[0x7601] attribute
was equal to some type=[0x59e2] attribute so it could merge 0x75f6 and its
peers together.  But then when it came to 0x7601 vs 0x59e2 as DIEs, it
didn't think they were equal, so 0x7601 didn't merge with 0x59e2.

We can try to figure it out with:

	$ src/dwarfcmp test-elfcmp{,} 0x7601 0x59e2
	7601 vs 59e2: type=[0x7217] vs type=[0x55f8] (<typedef name="Ebl"/> with reference mismatches)
	reference 7217 vs 55f8: 7217 vs 55f8: type=[0x7222] vs type=[0x5603] (<structure_type name="ebl"> with reference mismatches)
	reference 7222 vs 5603: 7337 vs 5718: type=[0x76df] vs type=[0x5ac0] (<pointer_type/> with reference mismatches)
	reference 76df vs 5ac0: 76df vs 5ac0: type=[0x76c5] vs type=[0x5aa6] (<subroutine_type> with reference mismatches)
	reference 76c5 vs 5aa6: 76cf vs 5ab0: type=[0x6c63] vs type=[0x5097] (<typedef name="int64_t"/> != <typedef name="int64_t"/>)
	reference 6c63 vs 5097: 6c63 vs 5097: decl_file="/usr/include/stdint.h" vs decl_file="/usr/include/sys/types.h"
	$ 

But that just tells us that dwarfcmp is not as confused as dwarf_output--at
least when starting from those two DIEs, though perhaps not so further out
in the circularity path.  We can see good reason that these two were not
conflated together.  But we don't know why references to them *did* get
conflated.  So, get to investigating!  (I expect this to generate a lot
of questions about dwarf_output::copier.)


Thanks,
Roland

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]