This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: memcmp-sse4.S EqualHappy bug


* Torvald Riegel (triegel@redhat.com) wrote:
> On Fri, 2015-06-19 at 13:55 -0400, Simo Sorce wrote:
> > On Fri, 2015-06-19 at 19:40 +0200, Torvald Riegel wrote:
> > > On Fri, 2015-06-19 at 13:18 -0400, Simo Sorce wrote:
> > > > On Fri, 2015-06-19 at 13:16 -0400, Simo Sorce wrote:
> > > > > On Fri, 2015-06-19 at 18:56 +0200, Torvald Riegel wrote:
> > > > > > On Fri, 2015-06-19 at 16:59 +0100, Dr. David Alan Gilbert wrote:
> > > > > > > * Torvald Riegel (triegel@redhat.com) wrote:
> > > > > > > > On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote:
> > > > > > > > > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote:
> > > > > > > > > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote:
> > > > > > > > > > > I agree with aborting, but only as long as the hot path's performance
> > > > > > > > > > > is not impacted and I haven't thought about how to do that.
> > > > > > > > > > > 
> > > > > > > 
> > > > > > > > > Clearly people are better off using atomic comparisons on canary values
> > > > > > > > > instead, but it seems easy to avoid false positives (returning 0 when
> > > > > > > > > memory is clearly different) and keep these things working, so why not
> > > > > > > > > do it?
> > > > > > > > 
> > > > > > > > I see two separate issues here.  First, where do we draw the line, and
> > > > > > > > what do we guarantee.  I strongly believe that programs must not have
> > > > > > > > data races, and that they should use atomics or other synchronization
> > > > > > > > properly (this doesn't mean locks, but relaxed memory order atomics, for
> > > > > > > > example).
> > > > > > > > Second, do we try to keep buggy programs working.  If it has no cost to
> > > > > > > > do so (e.g., as might be true in this case), then doing that can
> > > > > > > > help to trigger fewer errors.  But that doesn't mean the buggy programs
> > > > > > > > shouldn't still get fixed eventually.
> > > > > > > 
> > > > > > > I find it difficult to understand the boundaries of what the C library
> > > > > > > is allowed to do in this type of optimisation.
> > > > > > > 
> > > > > > > For example, consider the following:
> > > > > > > 
> > > > > > >     char a[128];
> > > > > > >     char b[128];
> > > > > > > 
> > > > > > >     put some static data in a[0-63]
> > > > > > >     put some static data in b[0-63]
> > > > > > >     a[64]=0;
> > > > > > >     b[64]=0;
> > > > > > >     start a thread doing stuff in a[65..]
> > > > > > >     start a thread doing stuff in b[65..]
> > > > > > >     
> > > > > > >     if (!strcmp(a,b)) {
> > > > > > >       /* Do something */
> > > > > > >     }
> > > > > > > 
> > > > > > >    a) Is that behaviour defined?
> > > > > > 
> > > > > > Good question.  I think it should be.  This depends on both the data
> > > > > > race definition and what strcmp/strncmp/memcmp are specified to do using
> > > > > > the abstract machine.
> > > > > > 
> > > > > > The data race definition uses memory locations as granularity, which is
> > > > > > in 3.14 in C11.  Separate characters in an array should be separate
> > > > > > memory locations.
> > > > > > 
> > > > > > C11 isn't very specific regarding strcmp, and just says that it
> > > > > > "compares the string[s]" (7.24.4.2).  C++14 is a bit more specific
> > > > > > regarding basic_string::compare (21.4.7.9), saying that first the length
> > > > > > of the strings are determined, and then a strncmp is run using the
> > > > > > smaller of the two lengths.
> > > > > > 
> > > > > > Assuming the C++ specs, only the array indices [0..64] should be
> > > > > > accessed by the abstract machine.  So no data race with the stuff going
> > > > > > on in [65..).
> > > > > > 
> > > > > > >    b) Is it defined with strncmp(a,b,64) ?
> > > > > > 
> > > > > > Yes.
> > > > > > 
> > > > > > >    c) Is it defined with strncmp(a,b,128)?
> > > > > > 
> > > > > > Not sure.  C11 says that not more than "n characters" are compared, and
> > > > > > characters that follow a null character aren't compared either.
> > > > > > This indicates it wouldn't be different from strncmp(a,b,64) in this
> > > > > > particular case.
> > > > > > Regarding C++11, I'm not sure.  The closest operation conceptually
> > > > > > copies a substring and then compares, but the substring copying has to
> > > > > > determine the length of the string and clamp it to the max length.
> > > > > > This would do a strlen first, which wouldn't access past index 64.
> > > > > > Thus, it should be fine too.
> > > > > > 
> > > > > > >    d) Is it defined with memcmp(a,b,64)?
> > > > > > 
> > > > > > No data race, IMO.
> > > > > > 
> > > > > > >    e) Is it defined with memcmp(a,b,128)?
> > > > > > 
> > > > > > Data race.  Undefined behavior.
> > > > > > 
> > > > > > >    f) If I moved that boundary off a nice round % 8 mark would
> > > > > > >       it matter?
> > > > > > 
> > > > > > No difference as far as the standard is concerned.
> > > > > > 
> > > > > > > I can imagine there may be lots of things that terminate
> > > > > > > a string and let other stuff happen in the allocated space
> > > > > > > after the end of the string in the belief that at the end
> > > > > > > of that string all is unknown.   Andrea's case is a bit different
> > > > > > > in that it's the later data that's static, but that doesn't
> > > > > > > sound like it should change the answer as to what's allowed.
> > > > > > 
> > > > > > I think it does, because the question is whether there is a data race on
> > > > > > the memory locations that the abstract machine would access.
> > > > > 
> > > > > Well, given we are making examples:
> > > > > 
> > > > > Assume 2 structures like this:
> > > > > 
> > > > > struct test {
> > > > >     void *pointer;
> > > > >     char start[16];
> > > > >     char end[240];
> > > > > };
> > > > > 
> > > > > and
> > > > > 
> > > > > struct test a;
> > > > > struct test b;
> > > > > 
> > > > > memset(a.end, 1, 240);
> > > > > memset(b.end, 2, 240);
> > > > > 
> > > > > In what case would it be expected/legal for
> > > > > memcmp(a.start, b.start, 256) to ever return 0?
> > > > 
> > > > Just to be clear, if there are data races in [ab].pointer or [ab].start
> > > > I see absolutely no problem in stopping short and returning a
> > > > positive/negative value.  I just do not see how returning 0 would ever
> > > > make sense.
> > > 
> > > It would be legal if the invocation of memcmp and at least one of the
> > > invocations of memset would be concurrent (ie, not ordered by
> > > happens-before).
> > 
> > Nope, sorry, there is a misunderstanding: consider the memsets part of
> > initialization, *never* run in parallel with memcmp.
> 
> Ah, I see.  I'm not really sure what the rules are in this case; my
> guess would be that this is not UB, because the access is not beyond the
> object that is struct test.
> 
> Perhaps this survey is interesting for you:  https://goo.gl/2TY0H5
> 
> > > I do see that this wouldn't be an intuitive outcome if one reasons based
> > > on assuming some simplified (or straightforward, or natural, or whatever
> > > the personal opinion might be...) implementation.  But a data race is
> > > undefined behavior, and that's where we're drawing the line.
> > 
> > Sure, I can understand that if the *whole* area under consideration is
> > racing.  But here we are talking about a large chunk of the area that is
> > different and *never* raced upon.
> 
> Well, undefined behavior is not constrained to some "neighborhood" or
> such around the cause of the undefined behavior.
> It makes sense that the standard doesn't constrain it, because then it
> would have to guarantee the constraints and those would require
> reasoning about a particular implementation, which a language standard
> is meant to prevent :)
> Consider a hypothetical GPU that wants to run C code.  It accepts code
> in a virtual ISA that requires using special atomic operations for all
> memory accesses that would be data races otherwise.  The implementation
> of this GPU uses a fast JIT compiler to transform to the native ISA of
> the GPU.  Now this JIT compiler detects in a program that a whole page
> is compared using memcmp (or a custom memcmp using non-atomics).  It
> decides to move this page into local memory, assuming because of the
> data-race-freedom requirement that no other thread will write to the
> page.  After the memcmp, it will move it back.  But your program isn't
> data-race-free, and another thread writes to one byte in the page.  This
> byte is then lost when the page is moved back from local memory.
> 
> This is a made-up example (although such a virtual ISA is not
> unrealistic) and maybe this implementation would try to fail gracefully
> instead of just losing the update.  But I think it helps show that it
> isn't quite clear in general which constraints would be realistic for
> undefined behavior.

Isn't that illegal, since the parameters to memcmp are 'const void *',
so the memory shouldn't be written to?

Going back to an earlier part of the thread; I came up with a reason
why I really wouldn't want these routines to 'assert()'.

In emulation/virtualisation we generally haven't got a clue what our
guests are doing, and we don't want to stop them most of the time.
So there are a couple of fairly common types of structure that
break these rules:

1) (e.g. framebuffer emulation)
  (Many untrusted threads writing to memory)
  One control thread
     do {
       read memory -> display it
     } while (!end)
     barrier;
     display it once more

  That 'display it' can be quite reasonably expected to get
inconsistent data, but at the higher level it will be safe because
we know we're always going to do one more 'display it' at the end
when things settle down.

Similarly:
2) (VM migration)

   Many untrusted threads
      do {
        n=some-value;
        change some data[n];
        write_barrier();
        atomically_set(dirty[n]);
      }

    One control thread
      do {
        for (all i) {
          read_barrier();
          if (atomic_read(dirty[i])) {
              send(data[i]);
          }
        }
      } while (!end)
      barrier();
      for (all i) {
        if (atomic_read(dirty[i])) {
            send(data[i]);
        }
      }

   and often that 'send' process involves looking for compressible data;
I had a look and I don't think we're currently using libc routines
in there on the data for the comparison in (2), but I'm not 100% sure
about (1).

In these cases we expect the data to be inconsistent for some period of
time, but we guarantee eventual consistency since we know we're going to
read it again at some better time.  We have no sensible way to stop all
the guest threads writing to memory.

Dave

> > I can accept that the C standard may still decide this is undefined
> > behavior even if just one small part of the total area sees concurrent
> > modifications, but again, you are returning "short" from the optimized
> > code, so it is unclear to me how you can return 0 when the memcmp code
> > hasn't checked all the memory.  Wouldn't it make more sense to return
> > based on the current position?  It is still undefined, but at least it
> > won't cause false positives.
> 
> Yes, that's the second issue: Trying fail gracefully or mask failures in
> buggy programs, provided the runtime overheads and maintenance costs
> aren't too high.
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

