This is the mail archive of the
mailing list for the binutils project.
[GOLD] ideas for speeding up gold
- From: Geoff Pike <gpike at chromium dot org>
- To: binutils at sourceware dot org
- Date: Tue, 2 Oct 2012 15:02:02 -0700
- Subject: [GOLD] ideas for speeding up gold
I would like to speed up gold. I'm not sure what benchmark(s) to use
to measure progress. (Please let me know.) What I've been using is the
link of chromium version 154897. The source for chromium 154897 is
available as a tarball.
The final invocation of gold takes ~24.5s and produces
out/Release/chrome, which is 1.3e9 bytes. I'm using an HP z600: dual
socket, quad-core Xeons with hyperthreading, so /proc/cpuinfo shows 16
"cpus." 12 GB RAM. 1 TB disk, unencrypted, probably a Seagate Barracuda.
Plus a second disk which shouldn't have much to do with these tests.
Note that 24.5s can only be achieved if the "caches are warm" and not
much else is using RAM.
Looking at a CPU profile, some of the top items are:
  SHA-1 (#1, at ~16%)
  hash table operations (#2 or #3, I forget)
  string hashing for hash tables, i.e., string_hash() from gold.h
    (#4 or so, 5.x%)
(Also, I played with a different benchmark and found it was spending
time in the logic to find strings that are suffixes of other strings,
i.e., set_string_offsets(). Please let me know if speeding that up
would be helpful, or let me know which benchmarks to use.)
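For readers unfamiliar with the suffix logic mentioned above, here is a minimal sketch of suffix sharing in a string table. It is not gold's set_string_offsets(); the function name build_string_table and the sort-by-reversed-text approach are assumptions made for illustration. The idea is that if "intf" is a suffix of "printf", both can point into the same NUL-terminated bytes.

```python
def build_string_table(strings):
    """Lay out NUL-terminated strings, letting a string that is a suffix
    of another reuse the longer string's bytes.

    Hypothetical helper for illustration; takes an iterable of bytes
    objects and returns (table_bytes, {string: offset}).
    """
    # Sorting by reversed text, in descending order, places every string
    # directly after some string it is a suffix of, if one exists.
    ordered = sorted(set(strings), key=lambda s: s[::-1], reverse=True)

    table = bytearray()
    offsets = {}
    prev = None
    for s in ordered:
        if prev is not None and prev.endswith(s):
            # Reuse the tail of the previous string (and its terminator).
            offsets[s] = offsets[prev] + len(prev) - len(s)
        else:
            offsets[s] = len(table)
            table += s + b"\0"
        prev = s
    return bytes(table), offsets


if __name__ == "__main__":
    table, offsets = build_string_table([b"printf", b"intf", b"scanf", b"f"])
    print(len(table), offsets)
```

The sort dominates the cost here, which may be one reason this step shows up in a profile when the string pool is large.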
For #1 (SHA-1), I think the fix is to
A. Add another hash function to the list of ones that are available
for build ID, and
B. Make it, instead of SHA-1, the default for "ld --build-id" (or,
alternatively, make gcc use it by default)
I have a prototype that does this, and the choice of hash function
doesn't matter much as long as it is parallelizable. SHA-1 on my
benchmark is a sequential bottleneck at the end of the 24.5s and it
takes about 5.2s. With a parallelizable hash function I can get that
down to well under a second. E.g., in my prototype, the function does
MD5 on 2MB chunks, and the array of MD5 hashes is then hashed with
SHA-1. I'll send a patch today or tomorrow.
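The chunked scheme described above can be sketched as follows. This is an illustration only, not the prototype or the patch: the function name chunked_build_id, the use of a thread pool, and the way the per-chunk digests are concatenated before the final SHA-1 are all assumptions. Only the 2MB chunk size and the MD5-then-SHA-1 structure come from the description.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 2 * 1024 * 1024  # 2MB chunks, as in the description above


def _md5_chunk(view):
    # Each chunk is hashed independently, so chunks can run in parallel.
    return hashlib.md5(view).digest()


def chunked_build_id(data, chunk_size=CHUNK_SIZE, workers=None):
    """Hash fixed-size chunks with MD5 in parallel, then hash the
    concatenated chunk digests with SHA-1 (hypothetical helper)."""
    view = memoryview(data)  # slicing a memoryview avoids copying chunks
    chunks = [view[i:i + chunk_size] for i in range(0, len(view), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves chunk order, so the result is deterministic.
        digests = list(pool.map(_md5_chunk, chunks))
    return hashlib.sha1(b"".join(digests)).digest()


if __name__ == "__main__":
    payload = b"\x7fELF" + bytes(5 * CHUNK_SIZE)
    print(chunked_build_id(payload).hex())
```

Threads are enough in CPython because hashlib releases the GIL while hashing buffers larger than 2047 bytes. The final SHA-1 only sees 16 bytes per chunk, so the sequential part becomes negligible.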
For #2 (hash table operations), I think the fix is to use a better
hash_map, instead of tr1::unordered_map or whatnot. One possibility
would be to have the
configure script check if dense_hash_map is available, and use it if
so. Or we could copy dense_hash_map into gold. I'm not sure what's
best. Or we could ignore it, as the size of the gain may not be that
much... I haven't prototyped it so I can't quantify the gain.
For #3 (string hashing), I think the fix is to use a better hash
function. For x86-64, CityHash is always good (I'm biased). The same
questions as in the previous paragraph arise, though. As a
quick-and-dirty prototype, I just changed it to a different bad hash
function that has higher ILP (instruction-level parallelism), and that
does indeed help a little.
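To make the ILP point concrete, here is a sketch contrasting a byte-at-a-time hash with a four-lane variant. FNV-1a is used as a stand-in; no claim is made that it matches gold's string_hash() or the quick-and-dirty prototype, and four_lane_hash is a hypothetical function. Python itself gains nothing from this; the sketch only shows the dependency structure that a compiled implementation could exploit.

```python
MASK64 = (1 << 64) - 1
FNV_OFFSET = 0xCBF29CE484222325  # 64-bit FNV offset basis
FNV_PRIME = 0x100000001B3        # 64-bit FNV prime


def fnv1a_64(data):
    """Single dependency chain: each step needs the previous result."""
    h = FNV_OFFSET
    for b in data:
        h = ((h ^ b) * FNV_PRIME) & MASK64
    return h


def four_lane_hash(data):
    """Four independent chains over interleaved bytes, mixed at the end.

    Hypothetical variant: byte i only updates lane i % 4, so a CPU can
    overlap the multiplies of different lanes.
    """
    lanes = [FNV_OFFSET + i for i in range(4)]
    for i, b in enumerate(data):
        lane = i & 3
        lanes[lane] = ((lanes[lane] ^ b) * FNV_PRIME) & MASK64
    # Fold in the length and the lanes with the same xor-multiply step.
    h = len(data)
    for lane_value in lanes:
        h = ((h ^ lane_value) * FNV_PRIME) & MASK64
    return h


if __name__ == "__main__":
    print(hex(fnv1a_64(b"_ZN4gold6Layout")), hex(four_lane_hash(b"_ZN4gold6Layout")))
```

In the single-chain version the multiply latency bounds throughput at one byte per multiply; with four lanes the critical path per byte is roughly a quarter of that, at the cost of a short final mixing step.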
Final thought: when it is time to compute the build ID, would calling
msync() help encourage the OS to write dirty pages to the output file?
I tried this and didn't see much effect, but I may try again after #1