== consolidated debug archive ==

I've had an idea for a long time about sharing DWARF data (and some other ELF bits) across files. The plan is to have one big file that is a container for mostly-normal ELF files, but modified such that they can share some data. I call this big file a ''consolidated debug archive'', abbreviated ''CDAR''.

The original plan was to use a traditional '''ar''' archive. But now I think it should be a special file format, somewhat inspired by glibc's ''locale archive'' files.

I've thought about the details of this quite a lot, but never worked it all out concretely. So everything here is subject to heavy reworking if someone actually tries to implement it.

=== concept ===

The observation is that usually many related ''.debug'' files travel together, such as in a '''-debuginfo''' rpm. Just as many CUs in the same module repeat the same information, many files in a package repeat the same information. This includes whole DWARF trees, because several modules use the same types from the same header files and so forth. It also includes the same names used in many places: symbol names, section names, DWARF string tables, source file names.

* Every file has pretty much the same section names, so all '''.shstrtab''' sections are redundant.
* When a DSO defines exported symbols, each DSO or executable that links to those symbols repeats the same symbol names in its .strtab too.
* The ELF symbol names in .strtab sections are the same names that appear in DW_AT_name (for C) or DW_AT_linkage_name (for C++ mangled names) values in DWARF data.
* The source file names that appear in .debug_line file tables also appear in the DW_AT_name attributes of DW_TAG_compile_unit entries.

=== archive format ===

The archive consists of these sections:
* an archive header
* a build ID table
* a file name table
* a ''constant pool''
* a ''CU pool''
* individual files

=== files in a CDAR ===

The main contents of a CDAR are individual ELF files.
These are the same things we see today in ''.debug'' files, with a few differences.

* Each "secondary" section in the file might be '''SHT_NOBITS''' instead of its normal type. This includes:
  * string tables: .strtab, .shstrtab, .debug_str
  * secondary DWARF sections: .debug_abbrev, .debug_line, .debug_macinfo, .debug_loc, .debug_ranges
  When a section is '''SHT_NOBITS''', its contents are part of the ''constant pool'' (see below). The offsets that would normally refer to the section ('''st_name''', '''sh_name''', '''DW_FORM_strp''', '''DW_FORM_sec_offset''', etc.) are instead interpreted as absolute offsets into the ''constant pool''.
  This is easy to integrate into the reader code. When the reader initializes its pointers into the mapped section data at startup, it replaces the '''Elf_Data''' of any '''SHT_NOBITS''' section with one pointing to the ''constant pool''. The existing reader code then automatically finds offsets inside the pool. Existing code that maintains caches indexed by the mapped data pointer will automatically reuse and share its caches for data shared in the pool by multiple CUs or multiple files.
* The DWARF entries in the file can use the new form '''DW_FORM_GNU_ref_cdar'''. This is treated similarly to '''DW_FORM_ref_addr''', but its offset is taken as a position in the ''CU pool'' rather than in the file's own .debug_info section.

The storage of each individual file's contents is preceded by a simple file header that just gives its size. It would be possible to extend this to include other fields like owner, mode, and mtime, as an '''ar''' file header has. But it's not clear that any of these are useful to have in a CDAR. If any, perhaps just mtime, but even that doesn't seem especially useful.

=== build ID table ===

This table supports quick lookup of files by their build IDs. It associates a build ID with an entry in the file name table and with a file data record.
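
To make the lookup concrete, here is a rough Python sketch of one possible arrangement. Everything in it is an assumption for illustration, not a settled format: the IDs are stored packed at one fixed length and sorted by their bytes, each table entry is three little-endian 32-bit words, and the names `ENTRY`, `build_table`, and `lookup` are hypothetical.

```python
import struct

# Hypothetical entry layout: ID index, name table index, file record offset.
ENTRY = struct.Struct("<III")


def build_table(files, id_len):
    """Pack (build_id, name_index, file_offset) tuples into a sorted table.

    Returns (ids, entries): the packed ID bytes and the packed entries.
    """
    rows = sorted(files, key=lambda row: row[0])
    for build_id, _, _ in rows:
        if len(build_id) != id_len:
            raise ValueError("all build IDs in one archive must share a length")
    ids = b"".join(row[0] for row in rows)
    entries = b"".join(
        ENTRY.pack(index, row[1], row[2]) for index, row in enumerate(rows)
    )
    return ids, entries


def lookup(ids, entries, id_len, build_id):
    """Binary-search the table; return (name_index, file_offset) or None."""
    lo, hi = 0, len(entries) // ENTRY.size
    while lo < hi:
        mid = (lo + hi) // 2
        id_index, name_index, file_offset = ENTRY.unpack_from(
            entries, mid * ENTRY.size
        )
        candidate = ids[id_index * id_len:(id_index + 1) * id_len]
        if candidate == build_id:
            return name_index, file_offset
        if candidate < build_id:  # bytes compare as unsigned, lexicographic
            lo = mid + 1
        else:
            hi = mid
    return None
```

A consumer would call `lookup(ids, entries, id_len, wanted_id)` and then follow the returned file record offset; a miss returns `None`.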
My original idea is that its entries would be sorted by the build ID bits so that consumers can use binary search for a build ID. Alternatively, it could use some scheme encoding a hash table, if that makes for faster lookups without making the table too much larger. The table format would be designed to be compact and alignment-friendly.

Since build IDs are in general of arbitrary length, the archive header or the table's own header would need to indicate the length of the IDs being used. There is no need to support disparate ID lengths inside a single CDAR.

I imagine that each table entry would be something like three aligned words: ID table index, name table index, file record's absolute offset.

=== file name table ===

This table associates file names with the files and build IDs. Because of hard links, symlinks, or copies among the input files, there might be multiple file name table entries for a single file.

I imagine that each table entry would be something like two aligned words: ''constant pool'' offset of the file name, and file record's absolute offset.

=== constant pool ===

This is the "big soup" for everything that does not need any more outside structure. It doesn't need any kind of header; the archive header could just give its position and length.

All string tables can be merged together and be part of this. All .debug_line tables can just live here too, and so on. Because they are all in the same pool, strings that match a file name in some file table need not appear separately in a string table. Those string table offsets will just be pool offsets, so they can point directly into the part of a .debug_line file table where the same string appears. This avoids duplicating a string that appears both in a file table and in a CU's '''DW_AT_name'''.

Whatever else can be shared goes in here too.
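
The sharing of strings with file tables can be sketched roughly as follows. This is only an illustration under stated assumptions: the `ConstantPool` class and its methods are hypothetical names, strings are NUL-terminated, and the linear search stands in for whatever index a real archive writer would keep.

```python
class ConstantPool:
    """Hypothetical append-only pool; every offset is absolute in the pool."""

    def __init__(self):
        self.data = bytearray()

    def add_blob(self, blob):
        """Append opaque bytes (e.g. a line table); return their offset."""
        offset = len(self.data)
        self.data += blob
        return offset

    def intern_string(self, text):
        """Return an offset where text plus a NUL already exists, else append.

        Any existing occurrence will do, including one inside a file table
        or the tail of a longer string.
        """
        needle = text.encode("utf-8") + b"\x00"
        found = self.data.find(needle)
        if found >= 0:
            return found
        return self.add_blob(needle)

    def string_at(self, offset):
        """Read the NUL-terminated string starting at a pool offset."""
        end = self.data.index(b"\x00", offset)
        return self.data[offset:end].decode("utf-8")


pool = ConstantPool()
# A blob shaped like a file table entry: name, NUL, then three small fields.
table_offset = pool.add_blob(b"\x01src/foo.c\x00\x00\x00\x00")
name_offset = pool.intern_string("src/foo.c")  # reuses the bytes in the table
```

Here `name_offset` points one byte into the blob, so a '''DW_AT_name''' carrying that offset costs no extra pool space.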
None of the "secondary" sections is read as a whole. Each is read only in chunks that have their own headers (or none) and their own terminators, starting at an offset given in an attribute value, a CU header, or a similar place.

=== CU pool ===

This is like a big .debug_info section. The archive header gives its position in the file and its length. Its internal structure is nothing but the normal sequence of CU headers and DIE data.

It contains all '''DW_TAG_partial_unit''' entries, so a consumer enumerating any particular file's .debug_info section never encounters them and has nothing to skip. When a top-level CU uses '''DW_TAG_imported_unit''', its '''DW_AT_import''' uses a '''DW_FORM_GNU_ref_cdar''' reference to point into the CU pool. Other references into shared entries do the same.
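
As a rough sketch of how a reader might treat the two kinds of reference, the function below picks the buffer an offset applies to. The numeric value given to `DW_FORM_GNU_ref_cdar` is a made-up placeholder, since no code has been assigned to it; `DW_FORM_ref_addr` is 0x10 as in the DWARF standard. The name `resolve_reference` and the use of plain byte buffers are assumptions for illustration.

```python
DW_FORM_ref_addr = 0x10        # value from the DWARF standard
DW_FORM_GNU_ref_cdar = 0x1FFF  # placeholder only: no real value is assigned


def resolve_reference(form, offset, own_debug_info, cu_pool):
    """Return (buffer, offset) naming where the referenced DIE lives.

    DW_FORM_ref_addr stays inside the file's own .debug_info data, while
    DW_FORM_GNU_ref_cdar is taken as a position in the shared CU pool.
    """
    if form == DW_FORM_ref_addr:
        section = own_debug_info
    elif form == DW_FORM_GNU_ref_cdar:
        section = cu_pool
    else:
        raise ValueError("not a section-relative reference form: %#x" % form)
    if not 0 <= offset < len(section):
        raise ValueError("reference offset %d is out of range" % offset)
    return section, offset


own_info = bytes(16)  # stand-in for one file's .debug_info contents
shared = bytes(64)    # stand-in for the archive's CU pool
target, position = resolve_reference(DW_FORM_GNU_ref_cdar, 40, own_info, shared)
```

With these stand-in buffers, `target` is the CU pool, and the same offset with '''DW_FORM_ref_addr''' would be rejected because the file's own data is only 16 bytes long.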