This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.



revamp sdt.h


I have spoken before about some of the shortcomings of the .probes
section format the sdt.h macros generate.  (I'm not really sure how
much I've written that in any postings here and how much it may have
been only in verbal grumblings in some unrecorded voice meetings.)
With Rayson's recent work, we've also noted the need to have sdt.h
macros that can work with hand-written assembly code.

So here is a first discussion draft of an entirely revamped set of
sdt.h macros and binary format they generate.  There is no conceptual
change here at all, it is just a new encoding of exactly the same
information as today's v2 sdt.h probes.  The only change to the
translator is the new binary format decoder.  Actually, that's not
true, but the changes are small and I'll explain them all in a moment.

Two files are attached at the end, which you should read or skim along
with my explanation here.  First is the core macro nest of the new
sdt.h, with some example macro uses.  The other file is a small
standalone C program based on libelf (elfutils >= 0.130) that decodes
the new binary format and prints out the probes.  That can serve as
the model for the new translator code.

The essential macros in this first draft are actually pretty complete
and usable (I didn't include all the sugar, just enough for examples).
The one thing they are not is friendly to -pedantic (unless used with
-std=c99).  They use variadic macros heavily and I'm not sure I could
have hashed this out without them and not become homicidal.  But
chances are we can rejigger the macro nest without them later if we
have to, with only slight dangers to life and limb of bystanders.
Anyway, not an issue for a discussion draft.

This version addresses the following issues, which the existing macros
handle poorly.  Some are purely about the macros and some are about
the format itself.

* can be used in assembly (.S) source files
* can be used inside inline asm statements in C source

Both of these matter for the places probes should go in libpthread functions.

* no data relocs

The old formats are non-starters for libc/libpthread, where the number
of dynamic relocs of any kind is very carefully tuned to keep the
startup cost on every program in the system as small as possible.

* minimal memory footprint of any kind

The cost is exactly one byte of rodata (rounded to alignment, so one
word at least) total in the final file, plus just the size of the nop
instruction itself times the number of probes.

These last two are achieved by putting the data into a non-allocated
ELF note.  This has some nice properties we get for free:
* no runtime cost, it's all fixed at link time and never in memory
* preserved in both stripped files and .debug files
It also has one new wrinkle we didn't have before (which is the flip
side of not having any dynamic relocs), which is that prelink won't
adjust its contents for address offsets.

One drawback of using a naked ELF note for every probe is that there
is a proportionally large per-probe overhead for the note headers.
But that just means something like another 20 bytes on top of the 16
or 24 you might have had for each probe, before counting the name
strings.  We could make a much more compact note format if we wanted
to rely on a link-time step.  But the absolute numbers involved in the
size of the notes are still pretty small (I think it's smaller per
probe than v2 .probes is, and it's just ELF file size instead of being
runtime memory footprint).  IMHO there is quite a lot to be said for
'#include <sys/sdt.h>' (and maybe later -lsdt, but now not even that)
being the sum total of extra fiddling to an existing build setup
needed to add static probes.

Ok, now it's time to look at new-sdt.h, attached below.  You can just
look at /* Example uses */ and below for the moment.  That file can be
compiled either as C or as assembly to show those examples in a binary.
(The assembly is for x86-64, though you can trivially change the operand
expressions to something that will work on another machine if you want
to see an example there.)

The scenario below is for building a DSO.  You could just as well drop
the -fPIC and -shared flags and create an executable instead.  I'm only
showing the one example because both are really just the same, and the
DSO case lets me illustrate the prelink issue.

	$ gcc -c -o s.o -xc new-sdt.h -O2  -fPIC
	$ gcc -c -o s2.o -Dfrob=diddle -Dmain=dummy -xc new-sdt.h -O2 -fPIC
	$ gcc -c -o s3.o -x assembler-with-cpp new-sdt.h -O2 -fPIC
	$ gcc -shared -o s.so s.o s2.o s3.o

Ok.  So now we compiled two objects from C sources and one from assembly
sources, and linked those together into a DSO (or executable).  The
different objects use overlapping sets of provider and probe names,
i.e. some probes have instances in two of the objects.

Now let's build the little decoder program:

	$ gcc -std=gnu99 -g sdt-extractor.c -o sdt-extractor -lelf

And now we can run it:

	$ ./sdt-extractor s.so
	0x5a0	libfoo.noargs              :
	0x5a1	libfoo.frob                -4@%edi 4@(%rsi)
	0x5a2	libfoo.diddle              8@%rsi -4@%edi
	0x5a3	libfoo.asm_noargs          
	0x5a4	libfoo.asmfrob             %edi %rax (%rsi)
	0x5af	libfoo.asmfrobarg          4@(%rsi,%rdi,4) 8@%rax 4@$2
	0x5d0	libfoo.noargs              :
	0x5d1	libfoo.diddle              -4@%edi 4@(%rsi)
	0x5d2	libfoo.diddle              8@%rsi -4@%edi
	0x5d3	libfoo.asm_noargs          
	0x5d4	libfoo.asmfrob             %edi %rax (%rsi)
	0x5df	libfoo.asmfrobarg          4@(%rsi,%rdi,4) 8@%rax 4@$2
	0x5e4	libfoo.noargs              :
	0x5e5	libfoo.frob                %rax, -20(%rbp)
	0x5e6	libfoo.diddle              (%rdi), %rax

As you can see, we have a probe address, a provider name, a probe name,
and an argument format string.  Don't worry yet about the argument
details.  I'll get to that after covering the prelink issue.

So, all this probe information is stored in the .note.stapsdt section,
which is not allocated data, has no relocs, and does not get touched by
prelink.  So the probe addresses stored in there at link time stay as
they started.  But, prelink might adjust the actual text addresses:

	$ prelink -r 0x1000000 s.so
	$ ./sdt-extractor s.so
	0x10005a0	libfoo.noargs              :
	0x10005a1	libfoo.frob                -4@%edi 4@(%rsi)
	0x10005a2	libfoo.diddle              8@%rsi -4@%edi
	0x10005a3	libfoo.asm_noargs          
	0x10005a4	libfoo.asmfrob             %edi %rax (%rsi)
	0x10005af	libfoo.asmfrobarg          4@(%rsi,%rdi,4) 8@%rax 4@$2
	0x10005d0	libfoo.noargs              :
	0x10005d1	libfoo.diddle              -4@%edi 4@(%rsi)
	0x10005d2	libfoo.diddle              8@%rsi -4@%edi
	0x10005d3	libfoo.asm_noargs          
	0x10005d4	libfoo.asmfrob             %edi %rax (%rsi)
	0x10005df	libfoo.asmfrobarg          4@(%rsi,%rdi,4) 8@%rax 4@$2
	0x10005e4	libfoo.noargs              :
	0x10005e5	libfoo.frob                %rax, -20(%rbp)
	0x10005e6	libfoo.diddle              (%rdi), %rax
	$

As you can see, everything is still correct: the probe addresses got the
prelink offset applied, and nothing else changed.  So how does this work?

It uses the .stapsdt.base section.  This is a special section we add to
the text.  All the .ifndef and comdat magic in the macro for this is
just there so that we only ever have one of these sections in a final
link and it's only ever one byte long.  Really it could be 0 bytes long,
but the linker swallows the section if we make it empty, so we pad it
with a byte (and alignment padding will usually mean that it consumes at
least one word in the binary's text segment).  Nothing about this
section itself matters, we just use it as a marker to detect prelink
address adjustments.

Each probe note records the link-time address of the .stapsdt.base
section alongside the probe PC address.  The decoder compares the base
address stored in the note with the .stapsdt.base section's sh_addr.
Initially these are the same, but the section header will be adjusted by
prelink.  So the decoder applies the difference to the probe PC address
to get the correct prelinked PC address.
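
As a minimal sketch of that adjustment (the function name is made up
for illustration; only the arithmetic comes from the description
above):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: apply the prelink adjustment to a probe PC.
   note_pc   - probe PC stored in the note at link time
   note_base - link-time address of .stapsdt.base stored in the note
   cur_base  - sh_addr of .stapsdt.base as found in the file now  */
uint64_t
sdt_adjust_pc (uint64_t note_pc, uint64_t note_base, uint64_t cur_base)
{
  /* Unsigned wraparound makes this work for negative offsets too.  */
  return note_pc + (cur_base - note_base);
}
```

For an unprelinked file the two base values are equal and the PC comes
back unchanged.  The base address 0x5e8 used when trying this out is
an invented value, not one taken from the s.so example.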

I've put this magic into the macro and note format unconditionally, but
none of that is necessary for executables.  We could make it conditional
on #ifdef __PIC__.  But the cost (a word per note, plus the 1-byte
section of runtime rodata) seems small enough that it's nicer not to
bother with two variants of the format.

A library or application built using a custom linker script could
possibly remove, rename, or hide the .stapsdt.base section.  But that is
a rare thing to do (and even with some custom linker scripts, it may
well come through fine).  We do rely on the decoder in the translator
being able to find that section by name, but that is certainly no more
than the old .probes schemes relied on.


Now, some notes about the note format.

Note that the name of the notes section is not normative, and in a final
executable/DSO you might actually be looking at intermixed notes of
other kinds (follow the sdt-extractor.c example to consider all
appropriate sections and check all notes in them via gelf_getnote).

The ELF note format is variable-sized and includes a "vendor string" and
a type code.  Both the header and the "payload" after that are aligned
to 4 bytes within the section.

We're using the string "stapsdt" and that gives us complete control of
the meaning of that (32-bit) type code (GElf_Nhdr.n_type).  So if we
want to have different flavors of probes, or different encoding formats,
now or in the future, we can encode all such selections in that type
code.  For this discussion draft, I'm using just one flavor (intended
for uprobes probes, i.e. a nop) and n_type=3 (for "sdt v3").
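
To make the size arithmetic concrete, here is a rough sketch of how
many bytes one note occupies, assuming the usual layout of three
32-bit header words followed by the padded name and padded payload.
The function names are mine, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Round n up to the 4-byte alignment used inside a note section.  */
size_t
note_align4 (size_t n)
{
  return (n + 3) & ~(size_t) 3;
}

/* Total bytes one note occupies: three 32-bit header words
   (n_namesz, n_descsz, n_type), then the padded vendor string and
   the padded descriptor.  */
size_t
note_total_size (const char *vendor, size_t descsz)
{
  size_t namesz = strlen (vendor) + 1;  /* n_namesz counts the NUL.  */
  return 12 + note_align4 (namesz) + note_align4 (descsz);
}
```

With the vendor string "stapsdt" and an empty payload this gives 20
bytes, which is where the per-probe header overhead mentioned earlier
comes from.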

After the note header, the n_descsz bytes are:

	probe PC address (4 or 8 bytes)
	link-time sh_addr of .stapsdt.base section (4 or 8 bytes)
	provider name (null-terminated string)
	probe name (null-terminated string)
	argument format (null-terminated string)
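
A sketch of decoding that layout without libelf, assuming 8-byte
addresses already in host byte order (the attached sdt-extractor.c
handles both ELF classes and byte orders via gelf_xlatetom).  The
struct and function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decoded view of one 64-bit v3 probe descriptor.  */
struct sdt_probe
{
  uint64_t pc;            /* probe PC address */
  uint64_t base;          /* link-time .stapsdt.base address */
  const char *provider;
  const char *name;
  const char *args;
};

/* Decode DESC (LEN bytes), assuming 8-byte host-order addresses.
   Returns 0 on success, -1 if the descriptor is malformed.  */
int
sdt_parse_desc64 (const unsigned char *desc, size_t len,
                  struct sdt_probe *out)
{
  if (len < 16 + 3)       /* two addresses plus three NULs */
    return -1;
  memcpy (&out->pc, desc, 8);
  memcpy (&out->base, desc + 8, 8);

  const char *p = (const char *) desc + 16;
  const char *end = (const char *) desc + len;
  const char **field[3] = { &out->provider, &out->name, &out->args };
  for (int i = 0; i < 3; ++i)
    {
      const char *nul = memchr (p, '\0', end - p);
      if (nul == NULL)
        return -1;
      *field[i] = p;
      p = nul + 1;
    }
  return p == end ? 0 : -1;   /* the last NUL must end the descriptor */
}
```

The string pointers in the result point into DESC itself, so the
buffer has to outlive the decoded struct.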

Finally, I've made some changes to the v2 argument format string,
some trivial and one substantive.

* For no arguments, the string can be either "" or ":".
* Arguments can be separated by commas, whitespace, or both.

These differences are just for the convenience of writing the macro nest.
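
One thing a decoder has to watch for is that a memory operand such as
(%rsi,%rdi,4) contains commas of its own.  Here is a sketch of
counting arguments under the assumption that separators only count
outside parentheses.  That rule is my guess at a reasonable
tokenizer, not something the format spells out, and the function name
is invented.

```c
#include <assert.h>
#include <stddef.h>

/* Count the arguments in a v3 argument format string.  Commas and
   whitespace separate arguments, except inside parentheses, where a
   comma belongs to a memory operand such as (%rsi,%rdi,4).
   The parenthesis rule is this sketch's assumption.  */
size_t
sdt_count_args (const char *fmt)
{
  size_t count = 0;
  int depth = 0;          /* parenthesis nesting level */
  int in_arg = 0;         /* nonzero while inside an argument */

  if (fmt[0] == ':' && fmt[1] == '\0')
    return 0;             /* ":" means no arguments, same as "" */

  for (; *fmt != '\0'; ++fmt)
    {
      int sep = depth == 0 && (*fmt == ',' || *fmt == ' ' || *fmt == '\t');
      if (*fmt == '(')
        ++depth;
      else if (*fmt == ')' && depth > 0)
        --depth;
      if (sep)
        in_arg = 0;
      else if (!in_arg)
        {
          in_arg = 1;     /* first character of a new argument */
          ++count;
        }
    }
  return count;
}
```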

* Sized arguments.

In looking at the proposed libpthread probes using v2 sdt.h, I noticed
that adding the probes introduced not only the nop instructions
themselves, but some extra code before them to sign-extend or
zero-extend int arguments (on the hot path, even adding register
pressure!).  We really don't want that perturbation of the code
generation just for the common situation of having int-typed probe
parameters on a 64-bit machine.

So, these macros do not cast a probe argument to size_t as the existing
macros do.  Instead, they just make it an rvalue of int or wider (by
doing a plain + 0).  That coerces short (and bitfields and whatever) to
int, and coerces array references to pointers.  So arguments will still
wind up integers, and be either 32 or 64 bits (on a 32-bit machine,
there could be a 64-bit probe argument, which would be forced into a
memory operand).
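
A small illustration of what the + 0 does to the size the macros see.
ARG_SIZE is a stand-in I made up for the sizeof expression in
_SDT_ARG; it is not a macro from the attached file.

```c
#include <assert.h>

/* Hypothetical stand-in for the sizing step: the size of the rvalue
   (x) + 0, i.e. after the usual promotions and array decay.  */
#define ARG_SIZE(x) ((int) sizeof ((x) + 0))

struct bar { unsigned int baz; short int spaz; };

/* A short member is promoted, so this yields sizeof (int).  */
int
short_arg_size (void)
{
  struct bar b = { 1, 2 };
  return ARG_SIZE (b.spaz);
}

/* An array decays to a pointer, so this yields sizeof (int *),
   not the size of the whole array.  */
int
array_arg_size (void)
{
  static int ugh[3] = { 1, 2, 3 };
  return ARG_SIZE (ugh);
}
```

Without the + 0, sizeof applied to the array would report the size of
all three elements, which is not a valid argument size.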

This is encoded in the argument format string.  Each argument might
still be a plain assembly operand (from hand-written assembly), in which
case you should assume it's meant to be natural word size, or perhaps
the word size indicated by the register syntax (e.g. %eax or %r11d on
x86-64 mean the low 32 bits only).  But normally each argument will look
like "N@OP" where OP is the actual assembly operand, and N is one of:

	4	32 bits unsigned
	-4	32 bits signed
	8	64 bits unsigned
	-8	64 bits signed

The signedness doesn't really matter for 64 bits, though you could
potentially still use it to choose %d vs %u formatting for $parms$
and that sort of thing.  The -4@ notation tells you that you need to
extract it as 32 bits (low 32 of a register, or only address 4 bytes
if a memory access) and sign-extend it to 64 bits for a stap long.
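
Roughly, the decoder side of this could look like the following
sketch.  The function names are invented, and it assumes the raw
register or memory contents have already been fetched into a 64-bit
value.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse the "N@" prefix of one argument.  Stores N (4, -4, 8 or -8)
   in *SIZE and returns a pointer to the operand text, or returns NULL
   if ARG has no valid size prefix (a plain assembly operand).  */
const char *
sdt_arg_size (const char *arg, int *size)
{
  char *end;
  long n = strtol (arg, &end, 10);
  if (end == arg || *end != '@'
      || (n != 4 && n != -4 && n != 8 && n != -8))
    return NULL;
  *size = (int) n;
  return end + 1;
}

/* Widen a raw register or memory value to a 64-bit stap long
   according to the size code.  The narrowing casts rely on the
   wraparound behavior that GCC documents for such conversions.  */
int64_t
sdt_extend (int size, uint64_t raw)
{
  switch (size)
    {
    case 4:                     /* zero-extend the low 32 bits */
      return (int64_t) (uint32_t) raw;
    case -4:                    /* sign-extend the low 32 bits */
      return (int64_t) (int32_t) (uint32_t) raw;
    default:                    /* 8 and -8: already 64 bits */
      return (int64_t) raw;
    }
}
```

A plain operand like -20(%rbp) starts with a number but has no @
after it, so sdt_arg_size returns NULL and the caller falls back to
the natural word size.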

This shifts the work of sign extension (when you want it) to the
translator/generated probe runtime code, rather than putting it into
the probed hot path code to be run even when no probes are in use.
With this, we can choose probe points and arguments carefully for
libpthread/libc and reasonably expect not to perturb the generated
code at all beyond the actual nop insertions.


I think I've explained everything.  
The discussion draft for sdt.h glosses over some trivial nits,
but I think I was pretty thorough about all the important nits.

The one thing I didn't mention is the semaphore option.  There isn't
one.  I can't tell what the story is with the semaphore these days, but
it looks like we're not really doing that any more.  If we want it in,
or even optionally in at compile time, then it is easy enough to add it
to these macros, and use new n_type values to indicate with vs without
variants of the note format.


Thanks,
Roland


/* new-sdt.h */
#ifdef __ASSEMBLER__
# define _SDT_PROBE(provider, name, arg_format, ...) \
  _SDT_ASM_BODY(provider, name, arg_format, __VA_ARGS__)
# define _SDT_ASM_1(...)		__VA_ARGS__;
# define _SDT_ASM_STRING_1(...)		.asciz #__VA_ARGS__;
# define _SDT_ASM_ARGS(format, ...)	_SDT_ASM_STRING_1(__VA_ARGS__)
# define _SDT_ARG(n, x)			x
#else
# define _SDT_PROBE(provider, name, arg_format, ...) __asm__ __volatile__ \
  (_SDT_ASM_BODY(provider, name, arg_format, :) :: __VA_ARGS__)
# define _SDT_ASM_1(...)		#__VA_ARGS__ "\n"
# define _SDT_ASM_STRING_1(...)		_SDT_ASM_1(.asciz #__VA_ARGS__)
# define _SDT_ASM_ARGS(format, ...)	_SDT_ASM_STRING_1(format)
# define _SDT_ARGFMT(n)			%c[_SDT_S##n]@_SDT_ARGTMPL(_SDT_A##n)
# define _SDT_ARG(n, x)			\
  [_SDT_S##n] "n" ((__builtin_constant_p ((x) + 0 < 0) ? 1 : -1) \
		   * (int) sizeof ((x) + 0)),		 \
  [_SDT_A##n] "nor" ((x) + 0)
#endif
#define _SDT_ASM(...)			_SDT_ASM_1(__VA_ARGS__)
#define _SDT_ASM_STRING(...)		_SDT_ASM_STRING_1(__VA_ARGS__)

#if defined __powerpc__ || defined __powerpc64__
# define _SDT_ARGTMPL(id)	%I[id]%[id]
#else
# define _SDT_ARGTMPL(id)	%[id]
#endif

#include <bits/wordsize.h>
#if __WORDSIZE == 64
# define _SDT_ASM_ADDR	.quad
#else
# define _SDT_ASM_ADDR	.long
#endif

#define _SDT_NOP	nop

#define _SDT_NOTE_NAME	"stapsdt"
#define _SDT_NOTE_TYPE	3

#define _SDT_ASM_BODY(provider, name, arg_format, ...)			      \
  _SDT_ASM(990:	_SDT_NOP)						      \
  _SDT_ASM(	.section .note.stapsdt,"","note")			      \
  _SDT_ASM(	.balign 4)						      \
  _SDT_ASM(	.int 992f-991f, 994f-993f, _SDT_NOTE_TYPE)		      \
  _SDT_ASM(991:	.asciz _SDT_NOTE_NAME)					      \
  _SDT_ASM(992:	.balign 4)						      \
  _SDT_ASM(993:	_SDT_ASM_ADDR 990b)					      \
  _SDT_ASM(	_SDT_ASM_ADDR _.stapsdt.base)				      \
  _SDT_ASM_STRING(provider)						      \
  _SDT_ASM_STRING(name)							      \
  _SDT_ASM_ARGS(arg_format, __VA_ARGS__)				      \
  _SDT_ASM(994:	.balign 4)						      \
  _SDT_ASM(	.previous)						      \
  _SDT_ASM(.ifndef _.stapsdt.base)					      \
  _SDT_ASM(	.section .stapsdt.base,"aG","progbits",.stapsdt.base,comdat)  \
  _SDT_ASM(	.weak _.stapsdt.base)					      \
  _SDT_ASM(	.hidden _.stapsdt.base)					      \
  _SDT_ASM(_.stapsdt.base: .space 1)					      \
  _SDT_ASM(	.size _.stapsdt.base, 1)				      \
  _SDT_ASM(	.previous)						      \
  _SDT_ASM(.endif)

#define PROBE0(provider, name) \
  _SDT_PROBE(provider, name, :, :)
#define PROBE1(provider, name, arg1) \
  _SDT_PROBE(provider, name, _SDT_ARGFMT(1), _SDT_ARG(1, arg1))
#define PROBE2(provider, name, arg1, arg2) \
  _SDT_PROBE(provider, name, _SDT_ARGFMT(1) _SDT_ARGFMT(2), \
	     _SDT_ARG(1, arg1), _SDT_ARG(2, arg2))

#define PROBE_ASM(provider, name, ...)		\
  _SDT_ASM_BODY(provider, name, __VA_ARGS__, :)
#define PROBE_ASM_TEMPLATE(n)		_SDT_ASM_TEMPLATE_##n
#define PROBE_ASM_OPERANDS(n, ...)	_SDT_ASM_OPERANDS_##n(__VA_ARGS__)
#define _SDT_ASM_TEMPLATE_0		:
#define _SDT_ASM_TEMPLATE_1		_SDT_ARGFMT(1)
#define _SDT_ASM_TEMPLATE_2		_SDT_ASM_TEMPLATE_1 _SDT_ARGFMT(2)
#define _SDT_ASM_TEMPLATE_3		_SDT_ASM_TEMPLATE_2 _SDT_ARGFMT(3)
#define _SDT_ASM_OPERANDS_0()		/* no operands */
#define _SDT_ASM_OPERANDS_1(arg1)	_SDT_ARG(1, arg1)
#define _SDT_ASM_OPERANDS_2(arg1, arg2)	_SDT_ARG(1, arg1), _SDT_ARG(2, arg2)
#define _SDT_ASM_OPERANDS_3(arg1, arg2, arg3)	\
  _SDT_ARG(1, arg1), _SDT_ARG(2, arg2), _SDT_ARG(3, arg3)


/* Example uses */

#define LIB libfoo    /* Probe macros do support indirecting the names.  */

#ifdef __ASSEMBLER__

#define ARG1 %rax
#define ARG2 -20(%rbp)

/* Here in an assembly source file, probes look just like in C source.
   The arguments are assembly operands that the sdt decoder can grok;
   e.g. constants might need to be marked, etc.  */
PROBE0(LIB, noargs)
PROBE2(LIB, frob, ARG1, ARG2)
PROBE2(LIB, diddle, (%rdi), %rax)

#else

struct bar { unsigned int baz; short int spaz; };

void frob (int foo, struct bar *bar)
{
  /* Plain C use is as before.  */
  PROBE0(LIB, noargs);
  PROBE2(LIB, frob, foo, bar->baz);
  PROBE2(LIB, diddle, bar, bar->spaz);

  /* Here's a use inside traditional inline asm.
     Note that GCC does not do %format handling in this case.  */
  __asm (PROBE_ASM(LIB, asm_noargs)
	 "# standalone asm: %0 et al not translated, no %% needed");

  /* Here's a use inside a fancy GCC asm using operands from C.
     Here the asm writer is choosing which assembly operands to
     tell sdt, just like writing a probe in an assembly source file.
     Note spaces with no commas between the operands.
     Those might or might not be substituted GCC %format thingies.  */
  __asm volatile ("# do something with %0\n"
		  PROBE_ASM(LIB, asmfrob, %0 %%rax %1)
		  "# do something with %1"
		  : : "r" (foo), "m" (bar->baz));

  /* Here's an asm use where the probe arguments are specified separately
     in C, so they behave just like a plain C probe would.  The
     PROBE_ASM_TEMPLATE(n) macro says we have n arguments from C.
     Then PROBE_ASM_OPERANDS(n, ...) can appear anywhere in the
     asm's list of input operands.  */
  const int fold[3] = { 1, 2, 3 };
  static int ugh[3] = { 1, 2, 3 }; /* array as arg demonstrates why + 0 */
  __asm volatile (PROBE_ASM(LIB, asmfrobarg, PROBE_ASM_TEMPLATE(3))
		  "# magic insn uses no operands"
		  : : PROBE_ASM_OPERANDS(3, bar[foo].baz, ugh, fold[1]));
}

int main () {}

#endif

/* sdt-extractor.c */
#define _SDT_NOTE_NAME	"stapsdt"
#define _SDT_NOTE_TYPE	3

#define _GNU_SOURCE
#include <gelf.h>
#include <fcntl.h>
#include <unistd.h>
#include <error.h>
#include <errno.h>
#include <string.h>
#include <inttypes.h>
#include <assert.h>
#include <stdio.h>

static void
handle_probe (Elf *elf, GElf_Addr base, int type, const char *data, size_t len)
{
  if (type != _SDT_NOTE_TYPE)
    {
      error (0, 0, "unknown %s n_type %u", _SDT_NOTE_NAME, type);
      return;
    }

  union
  {
    Elf64_Addr a64[2];
    Elf32_Addr a32[2];
  } buf;
  Elf_Data dst =
    {
      .d_type = ELF_T_ADDR, .d_version = EV_CURRENT,
      .d_buf = &buf, .d_size = gelf_fsize (elf, ELF_T_ADDR, 2, EV_CURRENT)
    };
  assert (dst.d_size <= sizeof buf);

  if (len < dst.d_size + 3)
    {
      error (0, 0, "short note");
      return;
    }

  Elf_Data src =
    {
      .d_type = ELF_T_ADDR, .d_version = EV_CURRENT,
      .d_buf = (void *) data, .d_size = dst.d_size
    };

  if (gelf_xlatetom (elf, &dst, &src,
		     elf_getident (elf, NULL)[EI_DATA]) == NULL)
    {
      /* Without a successful translation, buf holds no valid addresses.  */
      error (0, 0, "gelf_xlatetom: %s", elf_errmsg (-1));
      return;
    }

  const char *provider = data + dst.d_size;
  const char *name = memchr (provider, '\0', data + len - provider);
  if (name == NULL)
    {
      error (0, 0, "corrupt probe");
      return;
    }

  ++name;
  const char *args = memchr (name, '\0', data + len - name);
  if (args++ == NULL ||
      memchr (args, '\0', data + len - args) != data + len - 1)
    {
      error (0, 0, "corrupt probe");
      return;
    }

  GElf_Addr pc;
  GElf_Addr base_ref;
  if (gelf_getclass (elf) == ELFCLASS32)
    {
      pc = buf.a32[0];
      base_ref = buf.a32[1];
    }
  else
    {
      pc = buf.a64[0];
      base_ref = buf.a64[1];
    }

  pc += base - base_ref;

  printf ("%#" PRIx64 "\t%s.%-20s%s\n", pc, provider, name, args);
}

static void
handle_notes (Elf *elf, Elf_Scn *scn, GElf_Addr base)
{
  if (base == (GElf_Addr) -1)
    {
      error (0, 0, "notes before base section");
      base = 0;
    }

  Elf_Data *data = elf_getdata (scn, NULL);
  size_t next;
  GElf_Nhdr nhdr;
  size_t name_off;
  size_t desc_off;
  for (size_t offset = 0;
       (next = gelf_getnote (data, offset, &nhdr, &name_off, &desc_off)) > 0;
       offset = next)
    if (nhdr.n_namesz == sizeof _SDT_NOTE_NAME
	&& !memcmp (data->d_buf + name_off,
		    _SDT_NOTE_NAME, sizeof _SDT_NOTE_NAME))
      handle_probe (elf, base,
		    nhdr.n_type, data->d_buf + desc_off, nhdr.n_descsz);
}

static void
handle_elf (Elf *elf)
{
  size_t shstrndx;
  if (elf_getshdrstrndx (elf, &shstrndx))
    {
      error (0, 0, "elf_getshdrstrndx: %s", elf_errmsg (-1));
      return;
    }

  GElf_Addr base = -1;

  Elf_Scn *scn = NULL;
  while ((scn = elf_nextscn (elf, scn)) != NULL)
    {
      GElf_Shdr shdr;
      if (gelf_getshdr (scn, &shdr) == NULL)
	{
	  error (0, 0, "gelf_getshdr: %s", elf_errmsg (-1));
	  continue;
	}
      switch (shdr.sh_type)
	{
	case SHT_NOTE:
	  if (!(shdr.sh_flags & SHF_ALLOC))
	    handle_notes (elf, scn, base);
	  break;

	case SHT_PROGBITS:
	  if (base == (GElf_Addr) -1
	      && (shdr.sh_flags & SHF_ALLOC) && shdr.sh_name != 0)
	    {
	      const char *scn_name = elf_strptr (elf, shstrndx, shdr.sh_name);
	      if (scn_name != NULL && !strcmp (scn_name, ".stapsdt.base"))
		base = shdr.sh_addr;
	    }
	  break;
	}
    }
}

static void
handle_file (const char *file)
{
  int fd = open64 (file, O_RDONLY);
  if (fd < 0)
    error (0, errno, "%s", file);
  else
    {
      Elf *elf = elf_begin (fd, ELF_C_READ_MMAP_PRIVATE, NULL);
      if (elf == NULL)
	error (0, 0, "elf_begin: %s: %s", file, elf_errmsg (-1));
      else
	{
	  handle_elf (elf);
	  elf_end (elf);
	}
      close (fd);
    }
}

int
main (int argc, char **argv)
{
  elf_version (EV_CURRENT);

  for (int argi = 1; argi < argc; ++argi)
    handle_file (argv[argi]);

  return error_message_count > 0;
}
