== consolidated debug archive ==

I've had an idea for a long time about sharing of DWARF data (and some
other ELF bits) across files.  The plan is to have one big file that is a
container for mostly-normal ELF files, but modified such that they can
share some data.  I call this big file a ''consolidated debug archive'',
abbreviated ''CDAR''.

The original plan was to use a traditional '''ar''' archive.  But now I
think it should be a special file format, somewhat inspired by glibc's
''locale archive'' files.  I've thought about the details of this quite a
lot, but never worked it all out concretely.  So everything here is subject
to heavy reworking if someone actually tries to implement it.

=== concept ===

The observation is that usually many related ''.debug'' files travel
together, such as in a '''-debuginfo''' rpm.  Just as many CUs in the same
module repeat the same information, many files in a package repeat the same
information.  This includes whole DWARF trees, because several modules use
the same types from the same header files, and so forth.  It also
includes the same names used in many places: symbol names, section names,
DWARF string tables, source file names.

 * Every file pretty much has the same section names, so all
   '''.shstrtab''' sections are redundant.
 * When a DSO defines exported symbols, each DSO or executable that links
   against those symbols repeats the same symbol names in its own .strtab.
 * The ELF symbol names in .strtab sections are the same names that appear
   in DW_AT_name (for C) or DW_AT_linkage_name (for C++ mangled names)
   values in DWARF data.
 * The source file names that appear in .debug_line file tables also appear
   in the DW_AT_name attributes of DW_TAG_compile_unit entries.

=== archive format ===

The archive consists of these sections:

 * an archive header
 * a build ID table
 * a file name table
 * a ''constant pool''
 * a ''CU pool''
 * individual files

=== files in a CDAR ===

The main contents of a CDAR are individual ELF files.  These are the same
things we see today in ''.debug'' files, with a few differences.

 * Each "secondary" section in the file might be '''SHT_NOBITS''' instead
   of its normal type.  This includes:

   * string tables: .strtab, .shstrtab, .debug_str
   * secondary DWARF sections: .debug_abbrev, .debug_line, .debug_macinfo,
     .debug_loc, .debug_ranges

   When a section is '''SHT_NOBITS''', its contents are part of the
   ''constant pool'' (see below).  The offsets that would normally refer to
   the section ('''st_name''', '''sh_name''', '''DW_FORM_strp''',
   '''DW_FORM_sec_offset''', etc.) are instead interpreted as absolute
   offsets into the ''constant pool''.

   This is easy to integrate into the reader code.  When the reader
   initializes its pointers into the mapped section data at startup and
   finds that a section is '''SHT_NOBITS''', it replaces that section's
   '''Elf_Data''' with one pointing to the ''constant pool''.  The existing
   reader code then
   automatically finds offsets inside the pool.  Existing code that
   maintains caches indexed by the mapped data pointer will automatically
   reuse and share its caches for data shared in the pool by multiple CUs
   or multiple files.

 * The DWARF entries in the file can use the new form
   '''DW_FORM_GNU_ref_cdar'''.  This is treated similarly to
   '''DW_FORM_ref_addr''', but its offset is taken as a position in the
   ''CU pool'' rather than in the file's own .debug_info section.
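
To make the '''SHT_NOBITS''' substitution concrete, here is a minimal
sketch in Python rather than in the real reader.  The helper names
'''section_data''' and '''read_string''' are made up for illustration, and
the rule "NOBITS means use the pool" is just the proposal above, not
anything implemented.  The point is only that the string-reading code does
not change; just the buffer it is handed does.

```python
SHT_STRTAB = 3   # ELF section type: string table
SHT_NOBITS = 8   # ELF section type: occupies no space in the file

def section_data(sh_type, own_bytes, constant_pool):
    """Pick the buffer that offsets for this section index into.

    Hypothetical rule from the proposal: a secondary section marked
    SHT_NOBITS keeps its contents in the archive's constant pool.
    """
    return constant_pool if sh_type == SHT_NOBITS else own_bytes

def read_string(data, offset):
    """Read the NUL-terminated string that starts at offset."""
    end = data.index(b"\0", offset)
    return data[offset:end].decode()

# Toy data: a shared pool, and an ordinary per-file string table.
pool = b"\0.debug_info\0main\0"
plain_strtab = b"\0main\0"

# The same reader call works for both; only the backing buffer differs.
name_a = read_string(section_data(SHT_STRTAB, plain_strtab, pool), 1)
name_b = read_string(section_data(SHT_NOBITS, b"", pool), 13)
```

Both calls yield the same name, one through the file's own table and one
through a pool offset.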

The storage of each individual file's contents is preceded by a simple file
header that just gives its size.  It would be possible to extend this to
include other fields like owner, mode, and mtime, like an '''ar''' file
header has.  But it's not clear any of these are useful to have in a CDAR.
If any, perhaps just mtime, but even that doesn't seem especially useful.
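
As a rough sketch of how small that file header could be, the following
Python packs and walks size-prefixed records.  Everything about the layout
is an assumption for illustration: a single little-endian 64-bit size
field, records stored back to back, and no alignment padding.

```python
import struct

# Assumed header: one little-endian 64-bit size and nothing else.
SIZE = struct.Struct("<Q")

def pack_files(files):
    """Concatenate file contents, each preceded by its size header."""
    return b"".join(SIZE.pack(len(f)) + f for f in files)

def iter_files(data, offset=0):
    """Yield (record_offset, contents) for consecutive file records."""
    while offset < len(data):
        (size,) = SIZE.unpack_from(data, offset)
        start = offset + SIZE.size
        yield offset, data[start:start + size]
        offset = start + size

# Stand-ins for real ELF files.
blob = pack_files([b"\x7fELF-one", b"\x7fELF-second"])
records = list(iter_files(blob))
```

The record offsets produced here are the kind of absolute offset that the
lookup tables described below would store.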

=== build ID table ===

This table supports quick lookup of files by their build IDs.  It
associates a build ID with an entry in the file name table and with a file
data record.

My original idea was that its entries would be sorted by the build ID bits
so that consumers can use binary search for a build ID.  Alternatively, it
could be some scheme encoding a hash table, if that makes for faster
lookups without the table being too much larger.

The table format would be designed to be compact and alignment-friendly.
Since build IDs are in general of arbitrary length, the archive header or
the table's own header would need to indicate the length of IDs being used.
There is no need to support disparate ID lengths inside a single CDAR.
I imagine that each table entry would be something like three aligned words:
ID table index, name table index, file record's absolute offset.
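
Here is a minimal sketch of the sorted-table variant, in Python.  The entry
layout is an assumption chosen to keep the example short: it stores the ID
bytes inline in each entry instead of an index into a separate ID table,
uses a fixed 20-byte ID, and follows it with a 32-bit name table index and
a 64-bit file record offset.  None of these widths is decided anywhere
above.

```python
import struct

ID_LEN = 20  # assumed: one ID length for the whole archive
# Assumed entry layout: ID bytes, name table index, file record offset.
ENTRY = struct.Struct(f"<{ID_LEN}sIQ")

def build_table(records):
    """Pack (build_id, name_index, file_offset) records sorted by ID."""
    return b"".join(ENTRY.pack(*r) for r in sorted(records))

def lookup(table, build_id):
    """Binary-search the packed table; return (name_index, file_offset)."""
    lo, hi = 0, len(table) // ENTRY.size
    while lo < hi:
        mid = (lo + hi) // 2
        bid, name_index, file_offset = ENTRY.unpack_from(
            table, mid * ENTRY.size)
        if bid == build_id:
            return name_index, file_offset
        if bid < build_id:
            lo = mid + 1
        else:
            hi = mid
    return None  # no file with this build ID in the archive

table = build_table([
    (b"\x22" * ID_LEN, 1, 4096),
    (b"\x11" * ID_LEN, 0, 512),
    (b"\x33" * ID_LEN, 2, 9000),
])
hit = lookup(table, b"\x11" * ID_LEN)
miss = lookup(table, b"\x44" * ID_LEN)
```

With this layout each entry is 32 bytes, so the search can work directly on
the mapped table without building any in-memory index first.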

=== file name table ===

This table associates file names with the files and build IDs.
Because of hard links, symlinks, or copies among the input files,
there might be multiple file name table entries for a single file.

I imagine that each table entry would be something like two aligned words:
''constant pool'' offset of the file name, and file record's absolute offset.

=== constant pool ===

This is the "big soup" for everything that does not need any more outside
structure.  It doesn't need any kind of header; the archive header can
just give its position and length.

All string tables can be merged together and be part of this.  Also all
.debug_line tables can just live here, etc.  Because they are all in the
same pool, strings that match a file name in some file table need not
appear separately in a string table.  Those string table offsets will just
be pool offsets, so they can point directly into part of a .debug_line file
table where the same string appears.  This will avoid duplicating the same
string that appears both in a file table and in a CU's '''DW_AT_name'''.
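
The sharing can be sketched with a naive pool writer in Python.  The
function name '''intern''' and the byte layout of the toy file table are
made up for illustration, and a real writer would want something faster
than a linear search.  The idea is only that a string is appended to the
pool when no NUL-terminated copy of it already exists anywhere in the pool,
including inside a .debug_line file table.

```python
def intern(pool, s):
    """Return the pool offset of NUL-terminated s, appending if absent."""
    needle = s + b"\0"
    off = pool.find(needle)       # any existing bytes that match will do
    if off < 0:
        off = len(pool)
        pool.extend(needle)
    return off

# Toy pool that already holds a fragment shaped like a .debug_line file
# table: each name is NUL-terminated and followed by three small fields.
pool = bytearray(b"main.c\0\x01\x00\x00util.h\0\x01\x00\x00")

# A CU's DW_AT_name of "util.h" can point straight into the file table.
off_name = intern(pool, b"util.h")

# A symbol name that is not present yet gets appended once and reused.
off_new = intern(pool, b"main")
off_again = intern(pool, b"main")
```

In this toy pool the first lookup adds no bytes at all, and the second
string is stored exactly once no matter how many tables refer to it.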

Whatever else can be shared goes in here too.  None of the "secondary"
sections is read as a whole.  Each is read only in chunks that have their
own headers (or none) and their own terminators, starting at an offset
given in an attribute value, a CU header, or a similar place.  That is why
their contents can be mixed freely in one pool.

=== CU pool ===

This is like a big .debug_info section.  The archive header gives its
position in the file and its length.  Its internal structure is nothing but
the normal sequence of CU headers and DIE data.

It contains all '''DW_TAG_partial_unit''' entries, so a consumer enumerating
any particular file's own .debug_info section never comes across them and
has nothing to skip.

When a top-level CU uses '''DW_TAG_imported_unit''', its '''DW_AT_import'''
uses a '''DW_FORM_GNU_ref_cdar''' reference to point into the CU pool.
Other references into shared entries do the same.