This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: memcmp-sse4.S EqualHappy bug
- From: Torvald Riegel <triegel at redhat dot com>
- To: Simo Sorce <ssorce at redhat dot com>
- Cc: Andrea Arcangeli <aarcange at redhat dot com>, "Carlos O'Donell" <carlos at redhat dot com>, "Dr. David Alan Gilbert" <dgilbert at redhat dot com>, Szabolcs Nagy <nsz at port70 dot net>, libc-alpha at sourceware dot org, "H.J. Lu" <hongjiu dot lu at intel dot com>
- Date: Fri, 19 Jun 2015 19:36:22 +0200
- Subject: Re: memcmp-sse4.S EqualHappy bug
- Authentication-results: sourceware.org; auth=none
- References: <1434621040 dot 5250 dot 212 dot camel at localhost dot localdomain> <20150618124900 dot GD14955 at redhat dot com> <1434637415 dot 5250 dot 271 dot camel at localhost dot localdomain> <20150618145202 dot GG14955 at redhat dot com> <1434642635 dot 5250 dot 292 dot camel at localhost dot localdomain> <20150618161943 dot GN14955 at redhat dot com> <20150618172231 dot GS14955 at redhat dot com> <1434649785 dot 30819 dot 37 dot camel at localhost dot localdomain> <20150618181219 dot GL2248 at work-vm> <55841B6B dot 10104 at redhat dot com> <20150619140710 dot GA14955 at redhat dot com> <1434724946 dot 2716 dot 88 dot camel at willson dot usersys dot redhat dot com> <1434727945 dot 3061 dot 51 dot camel at localhost dot localdomain> <1434733721 dot 2716 dot 96 dot camel at willson dot usersys dot redhat dot com>
On Fri, 2015-06-19 at 13:08 -0400, Simo Sorce wrote:
> It is true that you identify 2 different issues, but I am not
> interested in either; I merely care about failing safe.
>
> On Fri, 2015-06-19 at 17:32 +0200, Torvald Riegel wrote:
> > I see two separate issues here. First, where do we draw the line, and
> > what do we guarantee. I strongly believe that programs must not have
> > data races, and that they should use atomics or other synchronization
> > properly (this doesn't mean locks, but relaxed memory order atomics,
> > for example).
>
> Programs should not have data races, true, but sometimes they do;
> failing safe keeps 'bad' programs from getting worse.
>
> > Second, do we try to keep buggy programs working. If it has no cost
> > to do so (e.g., as might be true in this case), then doing that
> > can help to trigger fewer errors. But that doesn't mean the buggy
> > programs shouldn't get fixed eventually.
>
> Sure, but sometimes it is hard to define something as a bug.
>
> In the specific case Andrea reported (which I pondered on), the race was
> clearly there, but also clearly the underlying memory was never
> identical at any given point, yet memcmp reported 0 which means: the
> whole memory area was checked and matched. This was clearly not the
> case: memcmp didn't even check the whole memory area, yet returned 0.
This is allowed when there is undefined behavior. A return value of 0
only has its specified meaning if there is no undefined behavior
elsewhere in the program.
> Now, if the 2 memory areas ever were (even for a nanosecond) identical,
> then returning zero would have been perfectly acceptable.
How do you know that when there are data races? That assumption
requires knowing how the modification happened. For example, if the
modifications were made by ordinary sequential code as well, that code
could potentially write temporary values on the way to the final ones.
Second, you say "even for a nanosecond", and that suggests you're
thinking in terms of some total order of events. That isn't
necessarily the case, though: even if you used relaxed memory-order
atomic stores and loads to access the array, what a memcmp using those
atomics would see would not necessarily be ordered as if the stores
happened one at a time, or in any way that can be totally ordered.
> During a race
> you can't know what the result is.
The standard "specifies" undefined behavior, which allows for much more
than not knowing what the result is. It would be conforming to burn
your computer, for example ;)
> But a function that supposedly checks
> a data area for equality and instead stops short and returns "equal"
> seems just wrong, and it is certainly unexpected.
I do understand that it could be surprising behavior, especially when
thinking at the level of what a straightforward implementation may or
may not do. But we have to draw the line somewhere to enable compiler
optimizations, and we don't want to describe compiler optimizations in
detail nor reason about that detail, so the big hammer of undefined
behavior is a sensible choice.
> I understand that the cause is that the program is concurrently
> operating on part of the area when it shouldn't, but still, I would
> expect memcmp to at least always go through the whole area it is
> expected to check and not stop short on a fluke and return a "random"
> result.
I am sympathetic to this viewpoint, but a random result, or undefined
behavior, is really not that unrealistic if you consider more complex C
code with data races, which may be subject to more compiler optimization
than one would expect for memcmp or such.
Just to be clear, I don't want to be simply nitpicking here, nor say
that the standard is the only true design choice, or something like
that. I just think that it's dangerous to start considering some data
races to be "benign" because that's a truly slippery slope and the
reasoning required for that can be really complicated.