This is the mail archive of the binutils@sourceware.org mailing list for the binutils project.



Dynamic linking optimizations


The dynamic linking process, as I understand it, has some basic performance considerations. Particularly, you may have many symbols with similar prefixes to look up, such as the following:

libFunkyStuff_main_loop
libFunkyStuff_order_data
libFunkyStuff_order_ops
libGraphics_garbage
sndlib_close_sound_device
sndlib_open_sound_device
sndlib_play_sound

To handle this, we do a number of things:

1. Generate a simple chaining hash table, drop collisions into buckets.

2. Point from that hash table into a symbol table.

3. Somehow associate the symbol with its offset in the file.


So on lookup we have the following task:


1. Generate a hash for the symbol

2. Look up the proper entry in the hash table for a DSO.

3. Character-by-character compare for each entry in the bucket.

4. If we get a match, we've found it. Else check the next DSO.


For the lookup, (1) can be handled by generating a hash or by pre-computing hashes and storing them (DT_GNU_HASH).


So the simple question I want to ask: why in the hell would you do this?

If you have to flat out compare everything, no hash table, then you make duplicate comparisons. For $NUM_STRINGS strings prefixed with the same prefix of $PFX_LEN characters, looking up every such string would cause you to traverse that exact prefix $NUM_STRINGS * ($NUM_STRINGS + 1) / 2 times.

Okay, so hashing gives you a way to mitigate this. If you have 2048 symbols, and 1024 buckets in the hash table, and your hash algorithm is perfect for the given case, then you'll traverse each prefix at most TWO times per lookup (i.e. two similarly-prefixed strings hit the same bucket) and at least ONE time per lookup. This is much better.

In practice, you can still end up with 40 similarly-prefixed symbols in the same bucket, just by pure bad luck and just the right (or wrong) hash table size. Arbitrarily adjust the size and your luck changes and you're good (welcome to -Wl,-O1).

Again: why in the hell would you do this?


Let's start playing with theory (nobody cares, show us the code! I don't have any). I say something blindingly obvious, and you tell me why this is an obviously bad approach. If I can A) prove you wrong; or B) change the approach to erase the bad stuff, then I have a better solution. If not, then the obvious has happened: A bunch of people who are a hell of a lot smarter than me came up with something much better than I could this morning over coffee, and I'll be unplugging my speakers before they start emitting penis-shaped sound waves.



Let's look at the symbol list given above, and work on the naive case without the hash table. Now say we want to look for sndlib_play_sound() ... we'll do the following comparisons:


l{skip}
l{skip}
l{skip}
l{skip}
sndlib_c{skip}
sndlib_o{skip}
sndlib_play_sound\0

That's 38 character comparisons, 6 adjunct operations ("Skip this
symbol, it's not the one we want").

Now here's an idea you've seen before, because I posted it before with little to no thought on how to implement it. This time I've come up with something for that, but more on that later. Let's arrange this in a Patricia Trie.

 ---lib--FunkyStuff_--main_loop
 |    |            |-order_--data
 |    |                    |-ops
 |    |-Graphics_garbage
 |-sndlib_--close_sound_device
          |-open_sound_device
          |-play_sound

Starting at the root, we'd need ...

l{skip}
s{follow}
ndlib_{branch}
c{skip}
o{skip}
p{follow}
lay_sound\0

Now let's say that each of these branches is an array of structs:

struct trie_branch {
	char		key;	/* Next character in string */
	unsigned int	offset;	/* distance to jump to continue */
};

And that each time we branch, we see another struct:

struct trie_branch_header {
	unsigned char	branches; /* how many branches */
	/* What?  You think it's more complicated than that? */
};

This means "skip" is effectively a character comparison plus an index
increment:

for (i = 0; i < num_branches; i++) {
    if (branch_list[i].key == symbol_name[index]) {
        follow_branch(branch_list[i].offset);
        break;
    }
}

Following of course has an adjunct operation attached to it: you have to compute an address from an address offset (pointer arithmetic). Amortizing 'skip' to the character comparison that it is gives us the below behavior:

ls{follow}
ndlib_{branch}
cop{follow}
lay_sound\0

21 character comparisons, 3 operations in between. (There is no corner case that creates more comparisons. It may be possible to intentionally create a horrible trie where you branch at every character, by iterating every possible duple at every level, i.e. aa* ab* ac* -> aaa* aab* aac* ...; but you would need ($valid_chars)^($length_of_string) symbols to pull this off, over 200,000 symbols for any string of 3 characters given [a-zA-Z0-9].)

Now here's the kicker: How do you lay this out as a flat data structure? Sure this is a beautiful idea, you'll never need a hash table, each time you enter the DSO to do a symbol lookup against its symbol table you'll only compare the same prefix ONE or ZERO times; but how do you get this on disk and in a format appropriate for loading by the dynamic linker? How do you deal with cache behavior using this garbage?

If symbols are normally looked up in roughly alphabetical order, then the
basic answer is simple:  sort the branches in alphabetical order (i.e.
sndlib{c,o,p}, not sndlib{p,c,o}).  When converting this complex
structure to a flat bitstream, follow it by naive recursion: branch from
the root out to the FIRST branch, and the FIRST branch off that, and the
FIRST branch off that, etc.; then back off ONE level and follow the
SECOND, then THIRD; and so on.  For our example:

 ---lib--FunkyStuff_--main_loop
 |    |            |-order_--data
 |    |                    |-ops
 |    |-Graphics_garbage
 |-sndlib_--close_sound_device
          |-open_sound_device
          |-play_sound

Write Branches: {l,s}
Write 'l':  ib
Write Branches: {F,G}
Write 'F':  unkyStuff_
Write Branches:  {m,o}
Write 'm':  ain_loop
Write 'o':  rder_
Write Branches:  {d,o}
Write 'd':  ata
Write 'o':  ps
Write 'G':  raphics_garbage
Write 's':  ndlib_
Write Branches:  {c,o,p}
Write 'c':  lose_sound_device
Write 'o':  pen_sound_device
Write 'p':  lay_sound

So from the root...

{hdr:2}\0{'l',$addr_to_l0}{'s',$addr_to_s0}{hdr:2}ib\0{'F',$addr_to_F0}{'G',$addr...}

and so on. Breaking down the above, {hdr:2} means this header:

struct trie_branch_header {
	unsigned char	branches; /* how many branches */
	/* What?  You think it's more complicated than that? */
};

with a value that indicates '2' branches. In practice a node can have anywhere from 1 to 256 branches while an unsigned char only holds 0 to 255, so we'd probably store ($NUM_BRANCHES - 1), i.e. store 0 for 1, 1 for 2, and so on. Implementation detail.
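That detail is trivial to pin down (hypothetical helper names, just to show the off-by-one):

```c
#include <assert.h>

/* Store ($NUM_BRANCHES - 1) so branch counts 1..256 fit in one byte. */
static unsigned char encode_branch_count(unsigned n)
{
    assert(n >= 1 && n <= 256);
    return (unsigned char)(n - 1);
}

static unsigned decode_branch_count(unsigned char byte)
{
    return (unsigned)byte + 1;
}
```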

Following this header is the string for this portion. For the root, that's ... the null string, i.e. no characters at all followed by a null terminator '\0' (so it's just the single byte \0).

Following this is {'l',$addr_to_l0}{'s',$addr_to_s0}, which represents the two branches off the root, towards "lib" and "sndlib_" respectively. These are:

struct trie_branch {
	char		key;	/* Next character in string */
	unsigned int	offset;	/* distance to jump to continue */
};

And "offset" represents the offset from the beginning of the whole trie. Since we know how many branches are available to take, there's no way to run off the end of this, so we can order these things however we like...

Following that is {hdr:2}ib\0{'F',$addr_to_F0}{'G',$addr...}. This is where the branch for 'l' off the root goes. It's the root's "\0", then the branch key 'l', then "ib\0", i.e. "lib" so far. It has 2 branches, 'F' and 'G', heading to the libFunkyStuff and libGraphics symbols.

You get the idea.

Reading it back would be trivial. Further, if you search for symbols in alphabetical order and write out in the order stated, you'll always do the bulk of the work near the beginning of the data structure, inching further along as you continue. It wouldn't be a linear read, but it'd be concentrated in hot areas as you moved through the lookup process. Theoretically, when you finished with one branch, you'd be reading just before the start of where the next branch would jump to, and you may just wind up following the next branch right into something the CPU's prefetch algorithms pulled into cache already.

Non-trivially, you could read ordered *and* *keep* *track* *of* *the* *dynamic* *linker's* *state*. In other words, if I'm reading around the libFunkyStuff_* branch, I can push each level I travel deeper onto a stack. When I'm no longer matching libFunkyStuff_ (determined through heuristic logic), I can pop off that stack, step back, and not have to repeat the comparison from the root down. My lookup for libGraphics_garbage can start looking at the branch for 'G' (knowing I just came from 'F' and it's sorted), avoiding a few character comparisons... or if there's a hell of a lot under there (huge C++ namespaces with thousands of symbols), a LOT of character comparisons.

Of course that's non-trivial and I have no idea how you'd do that or what the performance costs are. Simplicity is key here.

Does anyone have any valid reason why this wouldn't work? Am I a raging nut? Please explain any flaws in my logic, because if they're there I've obviously ALREADY failed to see them and telling me to think about it a little harder won't work (I first came up with this idea when Meeks proposed DynSort).

