This is the mail archive of the dwarf2@corp.sgi.com mailing list for the dwarf2 project.



PROPOSAL: duplicate elimination. 991026.3



The following document is a detailed proposal for 991026.3
Compression -- Duplicate Dwarf data deletion

$Date: 2001/02/19 22:23:15 $
Dwarf Duplicate Elimination -- Space Compression.

The following is close to being a proposal, and after review
by the compression subcommittee will be submitted as such.

This describes a way to emit dwarf that is shared
by multiple compilation units, avoiding duplication.
It deals with #include duplications, function duplications,
and more.  Parts of this have been implemented in gcc
(as is described below) and parts are simply invention.


The following has major sections, preceded by lines like this:
===================================================<name>

Section numbers here are draft 5 section numbers.

The first section is revised wording for section 3.1.1
and a new section numbered 3.1.2 (renumbering
later 3.1.x sections). <3.1.[12] wording>

The second section is revised wording for 3.3.8.3
with respect to abstract roots. <3.3.8.3 wording>

The third section is a new document section, which
we call 3.8 just to be specific here.
Some parts of this third section might be better
as appendices. <3.8 compression>

We present these in this order as that's the order a
reader of the dwarf document will encounter them.





===================================================<3.1.[12] wording>
Replace the first paragraph of Section 3.1
with:

    "An object file may be derived from one or more compilation units.
    Each such compilation unit is described by a debugging information
    unit with the tag DW_TAG_compile_unit or the tag
    DW_TAG_subunit or by a combination of both.
    In simple normal compilation, a single compilation unit with
    the tag DW_TAG_compile_unit is emitted per object file, and
    DW_TAG_subunit is not emitted.
    When DWARF space compression, duplicate elimination, or
    the like is being done, additional compilation-unit tags
    may be emitted in an object file (and these additional
    compilation-unit tags may be DW_TAG_compile_unit or
    DW_TAG_subunit as appropriate).

    

    "A DW_TAG_compile_unit entry owns debugging information entries that
    represent the declarations made in the corresponding compilation unit.
    A DW_TAG_subunit entry owns debugging information entries that represent
    some portion of the declarations made in a related compilation
    unit. The declarations of a subunit have no defined relationship of
    themselves to the scopes of the compilation unit to which they are
    related. 
 
    <i>A DW_TAG_subunit does not necessarily correspond to any 
    source language
    syntax; it is part of a mechanism by which a compiler may attempt
    to make the DWARF description of a program more 
    space efficient.</i>
 
    <i>The place where the declarations of a subunit logically
    occur is indicated by means of a DW_TAG_import_subunit 
    debugging information entry that refers to the subunit.</i>"

    See section 3.8, 
    "Space Compression", for the definition and use of
    DW_TAG_compile_unit and DW_TAG_subunit when the producer
    is attempting to save space in the debugging information.


Following bullet 10, remove the paragraph
  "A compilation unit entry owns debugging information entries
  that represent the declarations made in the corresponding
  compilation unit." 
as that information is now earlier in 3.1.



    

===================================================<3.3.8.3 wording>
Part of this proposal is that item 2 at the end of section 3.3.8.3,
"Out-of-Line Instances of Inline Subroutines",
have its wording changed to:

2. The root entry for a concrete out-of-line-instance tree
is normally owned by the same parent entry that also owns
the root entry of the associated abstract instance.
The parent may be different if the concrete out-of-line-instance
or the abstract root, or both, are in Section Groups,
as without this allowance some kinds of duplicate elimination
would not be possible. 


===================================================<3.8 compression>


Section 3.8  Data Compression

3.8.1 Motivation

DWARF2 can use a lot of disk space, especially for C++.  

The depth and complexity of C++ headers means that
many (possibly thousands of) declarations are repeated in every
compilation unit.

C++ templates mean some functions and their dwarf get duplicated.  

For maximum flexibility, implementations want to be able to
move functions around (so that frequently called code can be
placed to avoid excessive instruction-page references or
icache thrashing).  Putting the dwarf2 for all the
functions in a single compilation unit makes this harder.  Consider, for
example, a function that is dead (never called).  How can the
unneeded dwarf information be removed?

Since discussing this seems inextricably tied to 
object-file aspects, various object-format-specific
terms are used.  Such terms are intended to
aid in explaining the concepts, 
not to prescribe use of one object format
or another.


3.8.2 Overview

The solution is to break up the debug information into
separate sections and separate compilation units in
the output from compiling a single source file.

<i>
We'll use some traditional section naming here but
aside from the dwarf sections, the names are just meant
to suggest traditional contents as a way
of explaining the approach, not to be limiting in any way
on an implementation.
</i>

Where a traditional relocatable-object output from a
single source file might contain sections named:

	.data
	.text
	.debug_info
	.debug_abbrev
	.debug_line
	.debug_aranges

A relocatable object from a compilation system attempting 
some duplicate-dwarf elimination might contain

	.data
	.text
	.debug_info
	.debug_abbrev
	.debug_line
	.debug_aranges

   followed (or preceded; order is not significant) by a series
   of 'section groups':
   section-group 1
	.debug_info
	.debug_abbrev
	.debug_line
   ...
   section-group N
	.debug_info
	.debug_abbrev
	.debug_line
	
Section groups might or might not contain executable code
(.text sections).

The contents of a section group can be
discarded as a group (if a linker determines that is appropriate).
For example, if a linker determined that section-group 1
from A.o and section-group 3 from B.o were identical, it
could discard one group and arrange that all references in A.o and
B.o apply to the remaining one of the two identical
section groups.  This space compression
definition is intended to
make that 'arranging' trivial and automatic, because
the references are simply to external names and the linker
already knows how to match up references and definitions.
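
<i>
The discard step can be sketched roughly as follows.  This is
only an illustration, not a description of any real linker: the
SectionGroup record and the link_groups function are hypothetical
names, and a group is assumed to be identified by its name alone.
</i>

```python
from dataclasses import dataclass, field


@dataclass
class SectionGroup:
    """Hypothetical model of one section group in a relocatable object."""
    name: str      # group identifier (a linker global symbol)
    origin: str    # object file the group came from, e.g. "A.o"
    defines: list = field(default_factory=list)  # global DIE labels defined here


def link_groups(groups):
    """Keep the first group seen for each name; discard later duplicates.

    Returns (kept, symbol_table), where symbol_table maps every global
    label to the object file whose copy of the group was kept.
    """
    kept = {}
    symbol_table = {}
    for group in groups:
        if group.name in kept:
            continue  # duplicate: the whole group is dropped as a unit
        kept[group.name] = group
        for label in group.defines:
            symbol_table[label] = group.origin
    return list(kept.values()), symbol_table


groups = [
    SectionGroup("my.compiler.company.cpp.wa.h.123456", "A.o",
                 ["DW.cpp.wa.h.123456.1", "DW.cpp.wa.h.123456.2"]),
    SectionGroup("my.compiler.company.cpp.wa.h.123456", "B.o",
                 ["DW.cpp.wa.h.123456.1", "DW.cpp.wa.h.123456.2"]),
]
kept, symbols = link_groups(groups)
```

<i>
In this sketch a reference from B.o to DW.cpp.wa.h.123456.2 needs
no special handling: it resolves through the ordinary symbol table
to the copy kept from A.o.
</i>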

What is minimally needed from the object file format
(outside of dwarf2 itself, and  normal object/linker facilities
such as simple relocations):

  A means of having multiple .debug_info etc sections from
  a single compilation.

  A means of identifying a section-group (giving it a name).

  A means of identifying which groups of sections go together
  (the Elf Section Group, or COMDAT, notion) so that
  a group can be treated as a group (kept or discarded).
 
  A means of referencing from inside one .debug_info
  compilation-unit to another .debug_info compilation unit
  (DW_FORM_ref_addr provides this).


The remainder of this section uses current UNIX and Elf
terminology for specificity, though
nothing here is inherently Elf or UNIX specific.

3.8.3 Terminology

The following terms are not all used, but the sketchy
definitions may help communicate the meaning and use of Section
Groups.

Relocatable-object.  A simple object file, to be bound together
with others to make an executable or shared-library.  Also
known as a '.o'.  It may contain SECTION GROUPs (defined
below).  Many UNIX static-linkers have a -r flag which
enables the creation of a new relocatable-object from several
relocatable-objects as input.  Static linker implementors (and
linker users) have to realize that -r may impact duplicate
handling and possibly even executable correctness, depending on
exactly what the static linker does with -r.

Shared-library.  Also known as Dynamic Shared Object.  All
static relocations are done.  It may reference other
shared-libraries and use such at run time.  Never contains
SECTION GROUPs.

Executable.  An application.  It may reference shared-libraries
and use such at run time.  All static relocations are done.
Never contains SECTION GROUPs.

3.8.4 Example 1, C++
A simple example (a sketch of parts of the relocatable object
a compiler might output to an assembler -- showing
assembler-like output so we can show the labels):

Source file wa.h
struct A {
	int i;
};

Source file wa.C
#include "wa.h"
int 
f(A &a) 
{
  return a.i +2;
}


Base CU sections of the relocatable object:

== section .text
   [function f code]

== section .debug_info
   DW_TAG_compile_unit
.L1 (local):
     DW_TAG_reference_type
	DW_AT_type   ref to DW.cpp.wa.h.123456.3
     DW_TAG_subprogram
	DW_AT_name "f"
	DW_AT_type   ref to DW.cpp.wa.h.123456.2
	DW_TAG_variable
	  DW_AT_name "a"
	  DW_AT_type  ref to <.L1>
	...
== section .debug_abbrev
	...
== section .debug_aranges
	...
== section .debug_line
	...


SectionGroup sections (COMDAT sections) of the same relocatable
object:

group identifier my.compiler.company.cpp.wa.h.123456 (linker global symbol)

== section .debug_info
DW.cpp.wa.h.123456.1: (linker global symbol)
   DW_TAG_compile_unit
     DW_AT_language DW_LANG_C_plus_plus
     ...
DW.cpp.wa.h.123456.2: (linker global symbol)
     DW_TAG_base_type
	DW_AT_name "int"
DW.cpp.wa.h.123456.3: (linker global symbol)
     DW_TAG_structure_type
DW.cpp.wa.h.123456.4: (linker global symbol)
       DW_TAG_member
	DW_AT_name "i"
	DW_AT_type  DW_FORM_ref to DW.cpp.wa.h.123456.2,
	    (it is a local ref, so the more compact
	     DW_FORM_ref can be used)
== section .debug_abbrev
	...
== section .debug_line
	...

<i>
This example is C++-like in that it uses
DW_TAG_compile_unit for the Section Group, implying
that the contents of the compilation unit are
globally visible (following the language rules).
</i>


3.8.5 Example 2, C
A simple example (a sketch of parts of the relocatable object
a compiler might output to an assembler -- showing
assembler-like output so we can show the labels):

Source file wa.h
struct A {
	int i;
};

Source file wa.c
#include "wa.h"
int 
f(struct A *a) 
{
  return a->i + 2;
}


Base CU sections of the relocatable object:

== section .text
   [function f code]

== section .debug_info
   DW_TAG_compile_unit
     DW_TAG_import_subunit
        DW_AT_import  ref to DW.c.wa.h.123456.1
.L1 (local):
     DW_TAG_pointer_type
	DW_AT_type   ref to DW.c.wa.h.123456.3
     DW_TAG_subprogram
	DW_AT_name "f"
	DW_AT_type   ref to DW.c.wa.h.123456.2
	DW_TAG_variable
	  DW_AT_name "a"
	  DW_AT_type  ref to <.L1>
	...
== section .debug_abbrev
	...
== section .debug_aranges
	...
== section .debug_line
	...


SectionGroup sections (COMDAT sections) of the same relocatable
object:

group identifier my.compiler.company.c.wa.h.123456 
(linker global symbol)

== section .debug_info
DW.c.wa.h.123456.1: (linker global symbol)
   DW_TAG_subunit
     DW_AT_language DW_LANG_C89
     ...
DW.c.wa.h.123456.2: (linker global symbol)
     DW_TAG_base_type
	DW_AT_name "int"
DW.c.wa.h.123456.3: (linker global symbol)
     DW_TAG_structure_type
DW.c.wa.h.123456.4: (linker global symbol)
       DW_TAG_member
	DW_AT_name "i"
	DW_AT_type  DW_FORM_ref to DW.c.wa.h.123456.2,
	    (it is a local ref, so the more compact
	     DW_FORM_ref can be used)
== section .debug_abbrev
	...
== section .debug_line
	...

<i>
This example is C-like in that it uses
DW_TAG_subunit for the Section Group, implying
that the contents of the compilation unit are
not globally visible. The only way DW.c.wa.h.123456.1
as a whole
is made visible in some context
is by a DW_TAG_import_subunit with the attribute
DW_AT_import referring to DW.c.wa.h.123456.1.
</i>


3.8.6 Naming

A precise description of the means of deriving names
usable by the linker to access dwarf entities
is not part of the
dwarf2 specification;
it is a quality-of-implementation issue.

Nonetheless, an outline of a usable approach is given here
to make this more understandable and to guide implementors.

Section Groups (Elf) must have a section group name.
For the above example a name like
   <producer-prefix>.<file-designator>.<gidnumber>
would suffice, where 
  <producer-prefix> is some string specific to the producer,
	which has a language-designation embedded in
	the name when appropriate.
	Or the language name could be embedded in the <gidnumber>.
  <file-designator> names the file, such as wa.h in the example.
  <gidnumber> is a string generated to identify
	that specific wa.h header file in such a way that
	a) a 'matching' output from another 
	   compile generates the same <gidnumber>
	b) a non-matching (say because of #defines) output generates
           a different <gidnumber>.
	<i>It may be useful to think of a <gidnumber> as a
	   kind of hash code.</i>

So, for example, the trivial example wa.h above
is assigned the section group name my.compiler.company.c.wa.h.123456

The section-group-name is a name assigned to an entire
section group.

Global labels for DIEs (the need for them is explained below) within
a section group could be
	<prefix>.<file-designator>.<gidnumber>.<die-number>
such as
	my.compiler.company.dw.c.wa.h.123456.987
where 
	<prefix> distinguishes this as a dwarf debug info name,
	and should identify the producer and, when appropriate,
	the language. 
	<die-number> could be a number sequentially assigned
	    to entities (tokens, perhaps) found during compilation.
	<file-designator>, <gidnumber> are as above.
	


It is up to the producer to ensure that whenever the <die-number>s
assigned in separate compilations would not match properly,
a distinct <gidnumber> is generated.

This means that every point in the section-group
.debug_info that could be referenced from outside
by *any* compilation unit
must normally have an external name
	<prefix>.<file-designator>.<gidnumber>.<die-number>
generated for it in the linker symbol table, whether the
current compile references all those points or not.
(The completeness of the set of names generated
is a quality of implementation issue.)
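
<i>
As a rough illustration of the naming outline above, the sketch
below derives a <gidnumber> by hashing the header text and then
builds a section group name and a DIE label from it.  Everything
here is an assumption made for illustration: the function names,
the choice of SHA-256, the 8-character truncation, and hashing the
raw text are not prescribed by this proposal.
</i>

```python
import hashlib


def gidnumber(preprocessed_text: str) -> str:
    """Hypothetical <gidnumber>: a short hash of the header contents
    as seen by the compiler, so differing #defines give differing values."""
    digest = hashlib.sha256(preprocessed_text.encode("utf-8")).hexdigest()
    return digest[:8]


def group_name(producer_prefix: str, file_designator: str, gid: str) -> str:
    """<producer-prefix>.<file-designator>.<gidnumber>"""
    return f"{producer_prefix}.{file_designator}.{gid}"


def die_label(prefix: str, file_designator: str, gid: str, die_number: int) -> str:
    """<prefix>.<file-designator>.<gidnumber>.<die-number>"""
    return f"{prefix}.{file_designator}.{gid}.{die_number}"


header_a = "struct A {\n\tint i;\n};\n"
header_b = "struct A {\n\tlong i;\n};\n"  # stands in for a non-matching compile

gid_a = gidnumber(header_a)
gid_b = gidnumber(header_b)
name = group_name("my.compiler.company.c", "wa.h", gid_a)
label = die_label("DW.c", "wa.h", gid_a, 2)
```

<i>
Two compiles that see the same header text produce the same names,
and a compile that sees different text produces different ones.  A
real producer would keep enough hash bits to make accidental
collisions implausible.
</i>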

Note that only section-groups that are designated as
duplicate-removal-applies actually require the
	<prefix>.<file-designator>.<gidnumber>.<die-number>
external labels for DIEs, as all other section group sections
can use 'local' labels (section-relative relocations).
(This is a consequence of separate compilation, not
a rule imposed by this document).
<i>
Local labels would be references with DW_FORM_ref4
or DW_FORM_ref8 (these are affected by
relocations so DW_FORM_ref_udata, DW_FORM_ref1
and DW_FORM_ref2 are normally not usable and
DW_FORM_ref_addr is not necessary for a local label).
</i>

Implementations should clearly document their naming
conventions.

3.8.7 DW_TAG_subunit and DW_TAG_compile_unit

A Section Group compilation unit using
DW_TAG_compile_unit is like 
any other compilation unit, in that its contents
would be evaluated by consumers as if it were an
ordinary compilation unit.

Consider a #include within a C++ namespace
declaration or within a function definition as examples where
the DIEs in the Section Group should not be used
independently of being referenced from elsewhere.
They are not (necessarily) file-level entities.
Another example is #include in C, as there is no notion
of a 'global' namespace for the types in C.

Consequently a compiler would use
	DW_TAG_subunit
in place of DW_TAG_compile_unit in a section-group whenever
the section-group contents are not necessarily globally-visible.
This directs consumers to ignore that compilation unit
when scanning top level declarations and definitions.
The DW_TAG_subunit 'compilation unit' will be
referenced from elsewhere, and the referencing locations
give the appropriate context in which the DW_TAG_subunit
is to be scanned.

A DW_TAG_subunit may have, as appropriate, any of
the attributes assigned to a DW_TAG_compile_unit.
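
<i>
The consumer behavior described in this section might look roughly
like the sketch below.  The dictionary layout and the function
names are hypothetical, chosen only to mirror Example 2; a real
consumer would work on decoded DIEs rather than on dictionaries.
</i>

```python
def top_level_units(units):
    """Units a consumer scans for file-level declarations.
    DW_TAG_subunit units are skipped here."""
    return [u for u in units if u["tag"] == "DW_TAG_compile_unit"]


def visible_declarations(unit, units_by_label):
    """Names visible in `unit`, expanding each DW_TAG_import_subunit
    at the place where it occurs."""
    result = []
    for child in unit["children"]:
        if child["tag"] == "DW_TAG_import_subunit":
            subunit = units_by_label[child["import"]]
            result.extend(visible_declarations(subunit, units_by_label))
        else:
            result.append(child["name"])
    return result


units = [
    {"label": None, "tag": "DW_TAG_compile_unit", "children": [
        {"tag": "DW_TAG_import_subunit", "import": "DW.c.wa.h.123456.1"},
        {"tag": "DW_TAG_subprogram", "name": "f"},
    ]},
    {"label": "DW.c.wa.h.123456.1", "tag": "DW_TAG_subunit", "children": [
        {"tag": "DW_TAG_base_type", "name": "int"},
        {"tag": "DW_TAG_structure_type", "name": "A"},
    ]},
]
by_label = {u["label"]: u for u in units if u["label"]}
scanned = top_level_units(units)
names = visible_declarations(scanned[0], by_label)
```

<i>
The subunit is never scanned on its own; its declarations appear
only through the import in the base compilation unit.
</i>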

3.8.8 DW_FORM_ref_addr

Use DW_FORM_ref_addr to reference from one compilation
unit's
debugging-information-entries to those of another
compilation-unit.

When referencing into a removable-section-group .debug_info
from another .debug_info (from anywhere), the
	<prefix>.<file-designator>.<gidnumber>.<die-number>
name should be used for an external symbol and a relocation
generated based on that.  

<i>
When referencing into a non-section-group .debug_info,
from another .debug_info
(from anywhere) DW_FORM_ref_addr is still the form to be used, but
a section-relative relocation generated by use of 
a non-exported name (often called an 'internal
name') may be used. 
</i>
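
<i>
The two cases above can be summarized in a small sketch.  The
function name and the returned dictionary are hypothetical; the
sketch only restates the rule that both cases use DW_FORM_ref_addr
and differ in the kind of relocation.
</i>

```python
def cross_unit_reference(target_label: str, target_in_removable_group: bool) -> dict:
    """Describe how a producer might emit a reference to a DIE in
    another compilation unit."""
    if target_in_removable_group:
        # The target group may be replaced by an identical copy from
        # another object, so the reference goes through an external
        # (global) symbol that the linker resolves by name.
        return {"form": "DW_FORM_ref_addr",
                "relocation": "external-symbol",
                "symbol": target_label}
    # The target cannot be swapped for another copy, so a
    # section-relative relocation against an internal name suffices.
    return {"form": "DW_FORM_ref_addr",
            "relocation": "section-relative",
            "symbol": target_label}


to_header_type = cross_unit_reference("DW.c.wa.h.123456.3", True)
to_base_unit = cross_unit_reference(".L1", False)
```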
	

3.8.9 #include compression

C++ has a much greater problem than C with the number and size
of the headers included and the amount of data in each, but
even with C there is substantial header file information duplication.

A reasonable approach is to put each header file in its
own section group, using the naming rules mentioned above.
The section groups would be marked to ensure duplicate removal.
All data instances and code instances (even if they came from
the header files above) would be put into 
non-section-group sections such as the base object file .debug_info
section.
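
<i>
A producer following this approach has to decide, for each DIE,
whether it belongs in a header's section group or in the base
.debug_info.  The sketch below shows one way to make that split.
The "kind" and "header" fields and the function name are
assumptions made for illustration only.
</i>

```python
def partition_dies(dies):
    """Split DIEs between per-header section groups and the base unit.

    Data and code instances always go to the base unit, even when
    they were declared in a header; everything else that came from a
    header goes into that header's group.
    """
    groups = {}  # header name -> DIE names placed in that header's group
    base = []    # DIE names kept in the base .debug_info
    for die in dies:
        header = die.get("header")
        if die["kind"] in ("data", "code") or header is None:
            base.append(die["name"])
        else:
            groups.setdefault(header, []).append(die["name"])
    return groups, base


dies = [
    {"name": "A", "kind": "type", "header": "wa.h"},
    {"name": "int", "kind": "type", "header": "wa.h"},
    {"name": "counter", "kind": "data", "header": "wa.h"},
    {"name": "f", "kind": "code", "header": None},
]
groups, base = partition_dies(dies)
```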




<i>
There may be no predefined order in which headers
are #included, and there may be odd interactions, such that the source of
the definition of some subtype differs depending on the order of
inclusion.  Due to intention or error, such does happen.   In
such a case the 'signature' of the header had better be very
precise, else the users will be quite annoyed when the debugger
works in a way that does not reflect the real source.
</i>


3.8.10 Eliminating function duplication

Function templates (C++) result in code for the same template
instantiation being compiled into multiple archives or
relocatable-objects.  The linker wants to keep only one of a
given entity.  The debug information and everything else for the
duplicate copies of this function should be thrown away (keep just one copy).

For each such code group (function template in this example)
the compiler assigns a name for the group which will match all
other instantiations of this function but match nothing else.
(And the elf section group 'remove duplicates' flag would be
set).

The second and subsequent definitions seen by the static linker
are simply discarded.

References to other .debug_info sections (for DIEs) follow the
approach suggested above, but the naming rule might be slightly
different as <file-designator> should be interpreted as
<function-designator>.

3.8.11 single-function-per-dwarf-compilation-unit

This is related to the section group above (as implementations
may want to produce a single relocatable-object with multiple
section groups, one per function).
One purpose of such is to allow a linker to easily
reorganize the order of functions in the executable
(perhaps to improve cache performance).
Another is to make it easy for a linker to completely
remove unused functions.
These would not be marked as 'remove duplicates', since
the functions are not duplicates of anything.

Each function is given a compilation unit (and a section group).

Each compilation unit is complete, with text, data, and dwarf
sections.

And there is a compilation unit that has the file-level
declarations and definitions.  Other per-function
compilation-unit dwarf information (.debug_info) points to this
file-level compilation unit's .debug_info entries.


Elf Note:
  The section groups could have the section group flag
  set to zero (see the Elf section group definition near the end
  of this document) so there is no need for a unique
  section group name. 

Here the section groups can use DW_FORM_ref_addr 
and internal labels (section-relative relocations) to refer to
the main object file sections, as the section groups here are
either deleted as unused or kept. There is no possibility
(aside from error) of a group from some other compilation being
used in place of one of these groups.
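
<i>
Removing unused per-function groups amounts to a reachability
walk.  The sketch below is only an illustration: the live_groups
function, the choice of roots, and the shape of the reference map
are assumptions, and a real linker would derive the references
from relocations.
</i>

```python
def live_groups(roots, references):
    """Return the set of group names reachable from `roots`.

    `references` maps a group name to the names of the groups it
    refers to.  Groups not in the result can be deleted as unused.
    """
    live = set()
    pending = list(roots)
    while pending:
        name = pending.pop()
        if name in live:
            continue
        live.add(name)
        pending.extend(references.get(name, ()))
    return live


references = {
    "main": ["helper"],
    "helper": [],
    "unused": ["helper"],  # refers to a live group but is never referenced
}
kept = live_groups(["main"], references)
```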

3.8.12 Inlining and out-of-line-instances

Abstract instances and concrete-out-of-line instances
may be put in distinct compilation units if Section Groups
are in use.  This makes some useful duplicate-dwarf
elimination possible, since out-of-line-instances and
abstract roots can then be placed
in distinct Section Groups.

<i>
No special provision for eliminating class duplication
resulting from template instantiation is made here,
though nothing prevents eliminating such duplicates
with the techniques of this section.
</i>

3.8.13 gcc example
[perhaps this should be in appendix, and just referenced.]
gcc-specific rules, mentioned to try to make this clearer 

This is with respect to what will likely be
in gcc version 3.30 
and is not turned on by default as of February 2001.
	.gnu.linkonce.wi.wa.h.92485121
is an example of a section-group name.

At this time gcc does not implement Elf Section Groups, but
instead uses a section name like
	.gnu.linkonce.wi.wa.h.92485121
with the linker applying special rules.
gcc will probably transition to the Elf Section Group rules.

DW.wa.h.92485121.4
is a sample name of a DIE in the wa.h section group,
92485121 being the <gidnumber> and 4 being the die id.

Data is never in these header section groups, but is
always in the object file base sections. Only
types are put in the section groups.
Abstract inline DIEs are planned to be put in 
gcc-section-groups,
though this is not done as of February 2001.

3.8.14 Elf Section Group
[ This should probably be in an appendix or left out entirely.]
Elf specifics of SECTION GROUP (COMDAT)

The following attempts to be an accurate rendering of section
group but should be taken as general information only. 
The generic Elf specification is the only true definition.

A section of type SHT_GROUP defines a grouping of sections.

In the section header of an SHT_GROUP section, the sh_link field
gives the section number of a symbol table
section.  The sh_info field gives the symbol index of the
identifying entry of this section group.  The symbol indexed is
the "identifying symbol" of the section group.

The content of an SHT_GROUP section is
a) A single flag word. If set to 1, it
   means that duplicates are to be discarded.
   If not set, some other criterion might be applied
   by the linker to discard the section group (such as 
   removing unreferenced functions).
b) A list of section numbers.
   The listed section numbers are the sections in this section group.
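
<i>
Decoding that layout is straightforward.  The sketch below assumes
the section content has already been read into a byte string and
that each entry is a 32-bit word; parse_group_section and its
little_endian parameter are hypothetical names.  GRP_COMDAT is the
flag value 1 from the generic Elf specification.
</i>

```python
import struct

GRP_COMDAT = 0x1  # "discard duplicates" flag in the first word


def parse_group_section(data: bytes, little_endian: bool = True):
    """Split SHT_GROUP content into (discard_duplicates, member_sections)."""
    if len(data) < 4 or len(data) % 4 != 0:
        raise ValueError("SHT_GROUP content must be a non-empty array of 32-bit words")
    count = len(data) // 4
    fmt = ("<" if little_endian else ">") + f"{count}I"
    words = struct.unpack(fmt, data)
    flags = words[0]
    members = list(words[1:])
    return bool(flags & GRP_COMDAT), members


# A group with the duplicate-removal flag set and three member sections.
raw = struct.pack("<4I", GRP_COMDAT, 5, 6, 7)
discard_duplicates, members = parse_group_section(raw)
```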

Only relocatable-objects have identifying symbols or section groups.

Given two section groups with the same identifying symbol the
linker will simply discard and ignore the second group and all
its sections.

References to the sections comprising a group from sections
outside the group must be made via symbol table entries with
STB_GLOBAL or STB_WEAK binding and section SHN_UNDEF.  If there
is a definition of the same symbol in the relocatable-object
containing the references, it must have a separate symbol table
entry from the references.  Sections outside the group may not
reference symbols with STB_LOCAL binding for addresses
contained in the group's sections, including symbols with type
STT_SECTION.

No non-symbol references may exist from outside a section group
to the inside of the group.

A symbol table entry that is defined relative to one of the
group's sections and that is contained in a symbol table
section that is not part of the group, must be removed if the
group members are discarded.

	
------------------
Acknowledgements:

Jason Merrill outlined this design in
a posting to the dwarf2 mailing list 19 Jan 2001:
this document is derived from his ideas and implementation.

Ron Brender made crucial contributions to the design
and the document.

Jim Dehnert helped with an earlier version of this document.

------------------
Corrections/questions to:
davea@sgi.com


