This is the mail archive of the elfutils-devel@sourceware.org mailing list for the elfutils project.
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]
Re: new dwarf_output algorithm

From: Roland McGrath <roland at redhat dot com>
To: elfutils-devel at lists dot fedorahosted dot org
Date: Thu, 06 Jan 2011 10:49:49 -0800
Subject: Re: new dwarf_output algorithm
> Just hashing all references to zero that are part of a cycle makes too
> many structures hash identical, and cycle detection in arbitrary graphs
> is not fun. So I was proposing a way to define categorizing/hashing the
> DIEs based on properties that don't involve arbitrary cycles.

Indeed.  But recall that this hash is imperfect.  Hence, we will always
need to fall back of "full comparison".  In fact, each detection of an
authentic duplicate will involve at least one full comparison to verify
that the hash match is a true equivalence and not just a collision.  So,
this scheme still relies on having a sound equivalence comparison algorithm
that does not have pathological performance.

> The algorithm depends on the fact that you can see the CU as two
> different trees/forests without cycles in themselves. First there is the
> DIE trees where we ignore the attribute references and only look at the
> children of a DIE. Secondly there is the attribute reference tree
> (attributes reference other DIEs to describe properties, but these
> properties itself aren't cyclic or self-referential (this seems
> semantically true for all current dwarf reference attribute types, but
> structurally dwarf seems to allow DIEs with reference attributes that
> would allow backreferences, so we would need to detect this as a sanity
> check).

Correct.

> The original proposal was worded as walking the DIE tree multiple times.
> https://fedorahosted.org/pipermail/elfutils-devel/2010-November/001699.html
> But it might be easier to describe/fit into the current copier design by
> describing it as when elements can be finalized into the collector
> and/or how hashes are calculated.

Yes, these are different ways of saying the same thing.  The "first walk"
is the "copy construction" proper, where the navigation is directly via the
input DIE tree structure.  This creates a 'struct entry' for each node
(that is, dwarf_output::copier::entry).  The second walk is the
finalization pass, triggered where we now have the call to 'final_unit'.
Here the navigation is via the 'struct entry' graph.

That graph exists in bifurcated ways.  When an entry is already finalized,
there is entry._m_final pointing to its collector object.  This will arise
in the finalization pass in two ways.  One is if we do some short-circuit
logic to do immediate finalization during the first walk when it's easy.
The second is when the input data already used DW_TAG_imported_unit, so
that the same input DIE identity appears more than once in the input tree.
(This latter possibility is the main reason why 'struct entry' lives for
the entire life of the copier, i.e. the whole walk of all CUs in an input
file, or later even multiple input files, rather than only living until
that entry is final.)

During the first walk, entry._m_pending holds is (that is,
dwarf_output::copier::pending_entry).  Right now it has a whole attributes
map and children vector of its own.  That allows quick enumeration at the
cost of more storage.  It would also be possible instead to just hold the
input DIE pointer (i.e. debug_info_entry::children_type::const_iterator of
the input dwarf-like class).  Then enumerating the children for the
finalization pass would entail walking the input DIE's children list again
and doing the _m_entries lookup (hash table on identity) again for each.
We can see later on how that CPU vs memory tradeoff looks in practice.

> In the below I assume we will keep two hashes per element. The
> "children-hash" and the "references-hash". An element can be
> finalized/put in the collector if both are fully defined. We can
> probably keep one hash, but it seemed simpler to describe using two.

I think I prefer the terms "copier hash" and "final hash" for these,
not that it really matters which terms we use.

> (These two are different from the use pointer as hash recently discussed
> on the list.)

Hmm, calling the "references-hash" the "final hash" implies/presumes that
this hash function is the same one that we use for the collector's hash
table.  So perhaps that's not a good term to use, at least not yet.

> - The children-hash of a DIE without children is calculated from its
> attributes but only using the values of basic attributes (ignoring the
> value of reference attributes).

Let's call this the "local hash" of a DIE.

> - The children-hash of a DIE with children is the same as one without,
> combined with all children-hashes of its children.

An equivalent way to state it would be to say that the copier hash of any
DIE is its local hash combined with the copier hashes of each child.  Note
that this is slightly different from the approach of the current DIE hash
function, where the local hash is combined with a single "children_type
hash".  The latter is itself the combination of the hashes of each child.
The difference is x vs hash_combine (x, 0) for a childless DIE.

> - The references-hash of a DIE without reference attributes is equal to
> its children-hash.
> 
> - The references-hash of a DIE with reference attributes is the
> combination of the children DIE of the DIE itself, combined with all
> reference-hashes of all DIEs that are referenced by the reference
> attributes.

Again, we could either define it this way, or that the final hash is the
local hash combined with the "references list hash".  Again, the difference
is whether for a DIE without references the final hash is exactly the
copier hash or is hash_combine (copier_hash, 0).

I'm not sure whether it really makes a difference to the bit distribution
or not.  It just makes a difference to how you organize the control flow of
the computation, in a subtle way that is easily overlooked.

> - For an attribute set to be finalized all DIEs pointed to by referece
> attributes should have a reference-hash set. The hash of an attribute
> set then is the combination of simple attribute values hashes combined
> with the reference-hashes of all those DIEs.
> 
> - For a vector of DIEs (children of another DIE or the CU) to be placed
> in the collector all DIEs need to have reference-hashes. The hash of a
> children vector is the combination of all DIE reference-hashes.
> 
> - For a DIE to be finalized, the attribute set and the children vector
> need to have been finalized and it has to have its reference-hash set.
> 
> So by calculating the children-hash of a DIE always before the
> reference-hash of the DIE, we make sure we are chasing down one "tree"
> at a time.

Right.

> I think the above should fit into the current copier code setup. But I
> am sure I forgot something. It will come up when actually trying to
> implement it this way. 

I haven't noticed anything you forgot.  The things I mentioned above are
some implementation details you didn't mention, but not material issues.

> Missing is the sanity check to make sure that a reference attribute never
> references a DIE for which its reference hash is currently being resolved
> (which shouldn't ever happen, but might be caused by funky dwarf).

Right.  The existing code has machinery that detects cycles akin to these.
So it should be moderately clear how to turn entry_finalizer into just
doing this.  For now, we can just make it throw an exception when a cycle
is detected.  If it ever comes up in real-world data, we'll have to
consider the subject more deeply.


Thanks,
Roland
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]