A profiled build of systemtap, running "stap -p2 -e 'probe kernel.syscall.* {}'" gives some hints as to why that translation stage takes on the order of a minute on a 3GHz machine. It looks like query_dwarf_func/etc. are called a ridiculous number of times, considering that those syscall aliases expand to a mere 200-odd kernel.function()s. Here are some gprof highlights. Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 27.30 4.54 4.54 query_dwarf_func(Dwarf_Die*, void*) 13.65 6.81 2.27 6274460 0.00 0.00 dwflpp::focus_on_function(Dwarf_Die*) 13.35 9.03 2.22 12548920 0.00 0.00 dwflpp::func_is_inline() 7.82 10.33 1.30 6524382 0.00 0.00 dwflpp::default_name(char const*, char const*) 4.63 11.10 0.77 3421068 0.00 0.00 dwflpp::function_name_matches(std::basic_string<char, std::char_traits<char>, std::allocator<char> >) 2.89 11.58 0.48 231852 0.00 0.02 query_cu(Dwarf_Die*, void*) .... index % time self children called name <spontaneous> [1] 66.8 4.54 6.57 query_dwarf_func(Dwarf_Die*, void*) [1] 2.27 1.25 6274460/6274460 dwflpp::focus_on_function(Dwarf_Die*) [5] 2.22 0.00 12548920/12548920 dwflpp::func_is_inline() [6] 0.77 0.00 3421068/3421068 dwflpp::function_name_matches(std::basic_string<char, std::char_traits<char>, std::allocator<char> >) [13] 0.01 0.04 275/275 std::map<unsigned long long, func_info, std::less<unsigned long long>, std::allocator<std::pair<unsigned long long const, func_info> > >::operator[](unsigned long long const&) [90] 0.01 0.00 275/275 dwflpp::function_file(char const**) [278] 0.00 0.00 275/275 dwflpp::function_entrypc(unsigned long long*) [1068] 0.00 0.00 275/550 func_info::func_info() [978] 0.00 0.00 275/275 dwflpp::function_die(Dwarf_Die*) [1066] 0.00 0.00 275/275 dwflpp::function_line(int*) [1067] 0.00 0.00 275/275 func_info::operator=(func_info const&) [1074] 0.00 0.00 275/1100 func_info::~func_info() [855] ----------------------------------------------- <spontaneous> [2] 25.6 0.02 4.24 query_module(Dwfl_Module*, void**, char const*, unsigned long long, void*) [2] 0.18 4.01 278/278 dwflpp::iterate_over_cus(int (*)(Dwarf_Die*, void*), void*) [3] 0.04 0.00 18070/18070 dwflpp::focus_on_module(Dwfl_Module*) [108] 0.01 0.00 18348/18348 dwflpp::module_name_matches(std::basic_string<char, std::char_traits<char>, std::allocator<char> >) [322] ----------------------------------------------- 0.18 4.01 278/278 query_module(Dwfl_Module*, void**, char const*, unsigned long long, void*) [2] [3] 25.2 0.18 4.01 278 dwflpp::iterate_over_cus(int (*)(Dwarf_Die*, void*), void*) [3] 0.48 3.52 231852/231852 query_cu(Dwarf_Die*, void*) [4] 0.01 0.00 278/278 dwflpp::get_module_dwarf(bool) [272]
This arises due to the fact that we match each function pattern in a probe header independently against the dwarf files. Thus for the 200-odd kernel.function("foo") patterns which result from the glob and alias expansion, we do 200-odd passes over the dwarf data. A plausible solution would be to change the pattern-match code to take a set of globs, rather than a single glob, and to amalgamate all (or some suitably-similar subset of) the dwarf probe-points in a probe header into a single pass over the dwarf data.
A somewhat similar problem used to occur with respect to prologue analysis: all CUs were searched for line records, even if no functions were actually selected for them. A one-line conditional (tapsets.cxx:2181 in v1.95) made a huge difference in eliminating wasted effort. Might a similar oversight be responsible for the excessive searching now? Perhaps some results may be cached from earlier searches?
I have committed a change which makes the dwflpp structures persist between pattern elaboration calls, adds an indexed CU and function DIE cache to the dwflpp, and reimplements the CU and function DIE-iteration loops via these indices (multimap::lower_bound on pattern-stems with fnmatch() filtering on the remaining set). kernel.syscall.* reduced from ~1.5m to ~2s, on my workstation. Let me know if you find further problems.
Wow, nicely done.
Not a big deal, but the new code appears to work differently with respect to semok/twenty.stp. On kernels that have debuginfo-less modules loaded (e.g. vmware), this test case used to pass, I suppose skipping over them. Now, it results in an overall failure. It would be nice to restore the old behavior if that was not too hard. (It would work like an implicit blacklist over debuginfo-less modules.)
really? I see failures both ways. what sort of messages do you get when it "succeeds"?