Re: dwarf_edit_output testcase sanity checking


> - First we walk the DIE tree ignoring any reference attributes.
>   Since we ignore reference attributes, this is just walking the tree,
>   so no cycles and we are guaranteed to terminate.
>   - For basic attributes we calculate the hash as before, likewise for
>     attr_sets of basic attributes.
>   - The first hash of a DIE is calculated based on the first hash of its
>     children set (broods); again, all reference attributes are ignored.

Ok.  This is a fine way to describe the algorithm in the abstract.  In
the actual implementation, I think we'll want this first walk to do all
the work of copying the non-reference attribute values.  One reason is
that instantiating them in their dwarf_output/dwarf_data representation is
our main means of hashing the values.  It would just be extra hacking to
define a canonical hash function for each value-space that uses the
abstract dwarf:: template API to examine the values.  It would also be
extra CPU, though there is conversely some cost to be considered in the
copy-construct-and-discard work done when a value matches an existing
entry in its hash table.
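
To make that concrete, here is a rough sketch of the shape I have in mind
for the first walk.  Everything here (die, attr, attr_value, first_hash) is
a made-up stand-in, not the real dwarf_output/dwarf_data classes; the point
is just that each non-reference value gets examined exactly once, where the
real walk would also construct the copied output-side value, and that the
children's first hashes get folded into the parent's.

#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <variant>
#include <vector>

struct die;

// A value is either a constant, a string, or a reference to another DIE.
typedef std::variant<uint64_t, std::string, const die *> attr_value;

struct attr
{
  int name;                     // DW_AT_* code
  attr_value value;
};

struct die
{
  int tag;                      // DW_TAG_* code
  std::vector<attr> attrs;
  std::vector<die> children;
};

static size_t
hash_mix (size_t seed, size_t v)
{
  return seed ^ (v + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

// First hash: the tag, every non-reference attribute (name and value), and
// the children's first hashes.  Reference attributes are skipped entirely,
// so this is a plain tree walk and cannot cycle.
size_t
first_hash (const die &d)
{
  size_t h = std::hash<int> () (d.tag);
  for (const attr &a : d.attrs)
    {
      if (std::holds_alternative<const die *> (a.value))
        continue;               // ignore references in this pass
      h = hash_mix (h, std::hash<int> () (a.name));
      if (const uint64_t *u = std::get_if<uint64_t> (&a.value))
        h = hash_mix (h, std::hash<uint64_t> () (*u));
      else
        h = hash_mix (h, std::hash<std::string> ()
                           (std::get<std::string> (a.value)));
    }
  for (const die &child : d.children)
    h = hash_mix (h, first_hash (child));
  return h;
}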

A second reason is that we'll eventually be doing some value-rewriting
during the copying.  The primary case for that is rewriting the source
file names (to replace /usr/lib/rpm/debugedit).  We'll probably also do
some things like canonicalizing DW_AT_data_member_location attributes of
exprloc:DW_OP_plus_uconst(n) to be constant:n.  We don't want the code
doing that to have to consider each value more than once.  These will be
later refinements (with various knobs to control them), but we should
plan for them now.
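
For instance, the DW_AT_data_member_location canonicalization could look
roughly like this.  The DW_OP_plus_uconst opcode (0x23) and the ULEB128
operand encoding are from the DWARF spec; the function names and plumbing
around them are purely illustrative.

#include <cstdint>
#include <optional>
#include <vector>

// Decode a single ULEB128 operand; fail if the buffer is malformed or has
// trailing bytes beyond the one operand.
static std::optional<uint64_t>
decode_uleb128 (const uint8_t *p, const uint8_t *end)
{
  uint64_t result = 0;
  unsigned int shift = 0;
  while (p < end && shift < 64)
    {
      uint8_t byte = *p++;
      result |= uint64_t (byte & 0x7f) << shift;
      if ((byte & 0x80) == 0)
        {
          if (p != end)
            return std::nullopt;        // more than just this one operand
          return result;
        }
      shift += 7;
    }
  return std::nullopt;
}

// If EXPR is exactly DW_OP_plus_uconst followed by one operand, return the
// operand so the caller can store the attribute in constant form instead;
// otherwise leave the exprloc alone.
std::optional<uint64_t>
canonical_member_location (const std::vector<uint8_t> &expr)
{
  constexpr uint8_t DW_OP_plus_uconst = 0x23;
  if (expr.size () < 2 || expr[0] != DW_OP_plus_uconst)
    return std::nullopt;
  return decode_uleb128 (expr.data () + 1, expr.data () + expr.size ());
}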

A third reason is that the string_cache hack gave us big wins on CPU
time.  We can keep using that as it is for copying values.  It would be
easy enough to do an incarnation that just cached the hashes or whatnot,
but still some extra hacking.
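
If we did want the hash-caching flavor, I imagine something shaped like the
following would do (hypothetical names, not the existing string_cache):
intern each incoming string once and remember its hash alongside the single
shared copy, so repeated values are neither re-copied nor re-hashed.

#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

struct interned_string
{
  std::shared_ptr<const std::string> text;      // single shared copy
  size_t hash;                                  // computed exactly once
};

class string_intern_cache
{
  std::unordered_map<std::string, interned_string> cache_;

public:
  const interned_string &
  intern (const char *raw)
  {
    std::string key (raw);
    auto it = cache_.find (key);
    if (it == cache_.end ())
      {
        interned_string entry;
        entry.text = std::make_shared<const std::string> (key);
        entry.hash = std::hash<std::string> () (key);
        it = cache_.emplace (std::move (key), std::move (entry)).first;
      }
    return it->second;
  }
};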

Does the first hash ignore reference attributes completely as if they
weren't there?  Or does the hash function factor those attribute names
in too, and just omit any effect of the values (referents)?  It doesn't
matter either way to the correctness of the algorithm.  In practice, it
probably doesn't change the collision rate on real-world data to an
important degree.  For example, <pointer_type byte_size=8/> (which is
"void *") vs <pointer_type byte_size=8 type="#refN"/> ("something *")
would or wouldn't collide in the first hash.  But all the different ones
with different "#refN"s are going to collide anyway.
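
In code the two flavors differ only in whether the loop skips a reference
attribute entirely or still mixes in its name (again, attr_stub and the
helper are hypothetical stand-ins, not the real representation):

#include <cstddef>
#include <functional>
#include <vector>

struct attr_stub
{
  int name;                     // DW_AT_* code
  bool is_reference;
  size_t value_hash;            // unused when is_reference is true
};

static size_t
hash_mix (size_t seed, size_t v)
{
  return seed ^ (v + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

// Variant A: reference attributes are invisible to the first hash, so
// <pointer_type byte_size=8/> and <pointer_type byte_size=8 type="#refN"/>
// collide.
size_t
first_hash_ignoring_refs (const std::vector<attr_stub> &attrs)
{
  size_t h = 0;
  for (const attr_stub &a : attrs)
    if (!a.is_reference)
      h = hash_mix (hash_mix (h, std::hash<int> () (a.name)), a.value_hash);
  return h;
}

// Variant B: the names of reference attributes still count and only their
// referents are omitted, so those two examples hash differently.
size_t
first_hash_with_ref_names (const std::vector<attr_stub> &attrs)
{
  size_t h = 0;
  for (const attr_stub &a : attrs)
    {
      h = hash_mix (h, std::hash<int> () (a.name));
      if (!a.is_reference)
        h = hash_mix (h, a.value_hash);
    }
  return h;
}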

It might make sense to completely split the final storage of attribute
maps into the basic set and the reference set.  Then it becomes very
straightforward to completely finalize the basic set in the first walk.
Thus you'd completely uniquify the basic set and not need duplicative
storage for them at all, even temporarily.  This also yields shared
storage for the basic sets of the two pointer_type examples above.  (I
suspect that this one example
is more or less the only case that actually collides that way, so there
is little material storage reduction available from doing this as
opposed to not.)  Having the basic set and the reference set split into
two maps in the final form breaks the current hacks that rely on a
canonicalized name order to compare attribute maps.  But that really
only matters to some micro-optimization of dwarfcmp.  It also slightly
complicates the implementation of dwarf_output::attributes_type and its
iterators.  This is converse to the idea I floated yesterday about
storing a "basic plus reference stubs" map, which complicates the
niggling implementation details differently (and perhaps more strangely).
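
Something like the following is the storage shape I mean by the complete
split (all names here are hypothetical, not the real
dwarf_output::attributes_type): the basic sets get uniquified into a
collector during the first walk, and each pending DIE carries only a
pointer to its canonical basic set plus its own still-unresolved reference
set.

#include <map>
#include <set>
#include <string>
#include <utility>

// Canonical, finalized set of non-reference attributes (DW_AT_* -> value).
typedef std::map<int, std::string> basic_attr_set;

struct pending_die;

// Reference attributes stay separate and unresolved until the second pass.
typedef std::map<int, const pending_die *> ref_attr_set;

struct pending_die
{
  int tag;
  const basic_attr_set *basic;  // points into shared, uniquified storage
  ref_attr_set refs;            // per-DIE until the second pass finalizes it
};

// Owns exactly one copy of each distinct basic attribute set.
class basic_set_collector
{
  std::set<basic_attr_set> sets_;

public:
  const basic_attr_set &
  intern (basic_attr_set s)
  {
    return *sets_.insert (std::move (s)).first;
  }
};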

It may even make sense to completely finalize any whole DIEs that have
no references in the first pass.  I'm not sure offhand if that really
saves any storage or computation, as opposed to just moving the same
storage allocation and the same computation from the second pass to the
first pass.  I guess there's always a little: you don't need any
"pending_entry" object with either pointers to the attribute sets or
just storage of their hashes; you can just have the final pointer
directly to the collector die_info_pair.
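
In other words, the bookkeeping for a reference-free subtree could collapse
from a pending_entry down to the final pointer right away, something like
this (stub types, just to show the shape):

#include <variant>

struct die_info_pair;           // final collector entry (stub)
struct pending_entry;           // second-pass bookkeeping (stub)

// A reference-free DIE can resolve to its final collector entry in the
// first pass; only DIEs with references somewhere below need a
// pending_entry at all.
typedef std::variant<pending_entry *, const die_info_pair *> first_pass_result;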

> - Secondly we walk the DIE tree just creating an additional hash just
>   for the reference attributes. Here each DIE hash is extended with a
>   hash based on just the reference attributes. We do need the results of
>   the first hash calculated for each DIE, but we can just enumerate each
>   DIE, so again no cyclic walks.
>   - Each reference attribute's hash is based on its name and the
>     referent DIE first hash. We can create attr_sets of just these
>     reference attributes like we did for the basic attributes sets.

Ok.  That sounds like you intend the "completely split" plan.
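
For concreteness, the second pass as quoted would be something like this
sketch (die_record, ref_attr, and the id-keyed maps are stand-ins, not the
real types): each final hash is just the first hash extended by one term
per reference attribute, built from the attribute name and the referent's
first hash, so there is no traversal through referents and cycles never
come into play.

#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

struct ref_attr
{
  int name;                     // DW_AT_* code
  size_t referent_id;           // enumeration id of the referent DIE
};

struct die_record
{
  size_t first_hash;            // result of the first walk
  std::vector<ref_attr> refs;
};

static size_t
hash_mix (size_t seed, size_t v)
{
  return seed ^ (v + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

// DIES maps an arbitrary enumeration id to each DIE's first-pass record.
std::unordered_map<size_t, size_t>
second_pass (const std::unordered_map<size_t, die_record> &dies)
{
  std::unordered_map<size_t, size_t> final_hash;
  for (const auto &entry : dies)        // plain enumeration, no tree walk
    {
      size_t h = entry.second.first_hash;
      for (const ref_attr &r : entry.second.refs)
        h = hash_mix (hash_mix (h, std::hash<int> () (r.name)),
                      dies.at (r.referent_id).first_hash);
      final_hash[entry.first] = h;
    }
  return final_hash;
}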

> - In case the above isn't unique enough we could extend the algorithm
>   with a third (or even fourth) pass like the first or the second.

I concur with the suspicion that the inclusion of the children in the
first hash will tend to make the collisions rare enough.  We can see how
it goes in the actual data.  The only case that comes to mind for "long
simple reference chains" is the several type DIEs in "foo *****" and such.

> So in the above scheme the final hash for such singleton_tags with just
> one reference attribute is based on the (first) hash of the type DIE. So
> if they point at the same basic_type they would hash the same, just like
> when they point to a structure_type with the same children. But they
> will hash differently if the basic_types are not equal or the
> structure_type has different children.

I see.  This does look promising.  I'm not sure I understand the
ramifications of the algorithm clearly enough to be sure whether it's
sound and/or complete.  By "complete", I mean that it does guarantee
perfect detection of all duplicates.  By "sound" I mean that, if the
hash function produced infinite bits (i.e. not considering collisions
produced by the actual hash functions' bit swizzling techniques), it
would never hash two DIEs the same that are not truly equivalent.  As
always, it's trying to think through the complex cyclic cases that makes
my brain fall out.  Do you think you can state a proof of its soundness?
Also, do you think you can calculate the formal algorithmic complexity?

Since the hash will not have infinite bits, there will always be the
theoretical possibility of collisions.  Collisions in the first hash are
straightforward to resolve, because the comparisons without regard to
references are easy to understand.  When two DIEs do match completely
without regard to any references, then it becomes possible to have a
collision in both the first and second hashes.  What's the plan then?

One approach is to use a cryptographic hash and just never even try to
do any comparisons at all.  Then the mathematicians will assure us that
it's a million times less likely that we'd have a hash collision than
that we'll all simultaneously quantum teleport to the heart of the sun.
Once we've been vaporized, we'll be more than happy to leave it to the
living to worry about the poor bastards whose debugging we have
confounded by lousing up their DWARF files with wrongly-conflated type
information.  Personally, I speculate it's far more likely that the
mathematicians will tell us "oops, Bob just got a Nobel prize for
figuring out how wrong we were".  But when that happens, Bob will
already have sold out to mafiosi who use it to break all the encryption
in the world and clear out everybody's bank accounts, the high-tech
economy will fall, and the roving bands of hoodlums will kill us
just for having been well-paid nerds in the prior epoch without even
asking about the soundness of our algorithms.  So that might be a
satisfactory plan despite its inelegance.  It might even be that a
standard cryptographic hash, with a well-tuned implementation supplied by
someone else, is enough more efficient than a weaker hash plus ever doing
any full comparisons at all that it's the better choice in
practice.  But it does have a fundamentally different character than
old-fashioned "hashes we think are good enough, and comparisons we
really know are perfect".

Aside from the "less likely than roving bands of hoodlums" criterion,
we do need the algorithm to be formally sound.  Unlike dwarfcmp, for
dwarf_output we don't really need it to be formally complete, just
"complete enough" that uncaught duplicates don't appear in practice
with real-world data, or are not large, or are exceptionally rare.

As an aside for later contemplation: it seems good, for the sake of its
algorithmic complexity and thus its efficiency, that this algorithm is not
based on cycle-chasing comparisons.  I also find it a good thing that it's
substantially
different from a plan that's based more on pure comparison (as my
original plan was).  My thinking is that we still want an algorithm
for dwarfcmp that is more like pure comparison rather than just
duplicating this algorithm.  It's then a good thing that they differ
because it means they can serve as material soundness checks on each
other.  Since dwarfcmp is a validation and debugging tool for DWARF
hacking ideas, rather than a production tool, it doesn't really need
to be nearly as efficient as we want dwarf_output to be.  But it does
still need to be sufficiently fast that it's not completely
impractical to use on huge cases with lots of hairy cycles.  So we
should also consider the alternate plan for handling the cyclic
weirdness in a pure comparison algorithm.  I hope it's not intractable
to figure out some approach that is not pathological in its performance
on large real-world cases but also is not just a copy of this
hashing-based algorithm.  But I really don't know.


Thanks,
Roland
