This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: glibc segfault on "special" long double values is _ok_!?


On 6/8/07, Jan-Benedict Glaw <jbglaw@lug-owl.de> wrote:

In this setup, you control all the cluster and you can ensure that all
nodes use the same hardware and that no node will send data over the
network that wasn't the result of CPU calculation.

In the ticket, the case was different in that he got data fed in that
most probably was _not_ the result of a calculation done by the CPU,
but hand-crafted.

This won't happen in your controlled cluster.

It would be nice if that were true, but it is not, as I already wrote:



> Can the network infrastructure corrupt bits in the exchanged data?
> Yes.  Not often, but it does happen.  Same for the RAM.  So what do we
> do when we detect a problem?  Print debugging messages, as Nix already

Stop. Would you continue with known-wrong data once detected?

Exactly. Stop and prompt diagnosis of the problem.



hexdump (&my_long_double, sizeof my_long_double);
kill (getpid (), SIGABRT);

Or just call abort(), which is designed for the purpose.
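
A minimal sketch of the dump-then-abort idea, under some assumptions: hex_bytes() and dump_and_abort() are hypothetical names, not part of glibc, and the bytes are written in memory order so the reader has to know the platform's endianness. The dump deliberately avoids the floating-point conversions of printf(), since those are what is in question here.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: write each byte of the object at obj as two
   lowercase hex digits into out, in memory order, NUL-terminated.
   Returns the number of digits written, or 0 if out is too small. */
size_t hex_bytes(const void *obj, size_t size, char *out, size_t out_size)
{
    static const char digits[] = "0123456789abcdef";
    const unsigned char *p = obj;
    size_t i;

    if (out_size < 2 * size + 1)
        return 0;
    for (i = 0; i < size; i++) {
        out[2 * i]     = digits[p[i] >> 4];
        out[2 * i + 1] = digits[p[i] & 0x0f];
    }
    out[2 * size] = '\0';
    return 2 * size;
}

/* Hypothetical helper: log the raw bytes of a suspect value to stderr,
   then abort() so that a core dump is left behind for the debugger.
   Only fputs/fputc are used; no floating-point formatting is involved. */
void dump_and_abort(long double value)
{
    char buf[2 * sizeof value + 1];

    hex_bytes(&value, sizeof value, buf, sizeof buf);
    fputs("suspect long double bytes: ", stderr);
    fputs(buf, stderr);
    fputc('\n', stderr);
    abort();
}
```

Calling dump_and_abort(x) at the point where the bad value is detected leaves both a log line and a core file, which covers the case where only one of the two survives.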


That way, you get a nice core dump and can call GDB with it. With
"clean" floats, just use GDB's "print" to print it (or even call
printf() with it.)

If printf fails on the offending bit pattern, presumably that is not going to help.
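
One way to sidestep that is to classify the raw bytes before handing the value to any formatting routine. The sketch below assumes the x86 80-bit extended format stored little-endian (64-bit significand with an explicit integer bit, then a 15-bit exponent and a sign bit); x87_is_noncanonical() is a hypothetical name, and other long double formats would need a different check. It flags encodings that, as far as I know, the FPU itself does not produce as results: pseudo-denormals, unnormals, pseudo-NaNs and pseudo-infinities.

```c
#include <stdint.h>

/* Hypothetical check, assuming the x86 80-bit extended format in
   little-endian byte order: bytes[0..7] hold the significand (bit 63
   is the explicit integer bit), bytes[8..9] hold the 15-bit exponent
   and the sign bit.  Returns 1 for encodings the FPU would not
   normally generate, 0 for ordinary values. */
int x87_is_noncanonical(const unsigned char bytes[10])
{
    uint64_t significand = 0;
    unsigned exponent;
    int integer_bit;
    int i;

    /* Assemble the significand from its little-endian bytes. */
    for (i = 7; i >= 0; i--)
        significand = (significand << 8) | bytes[i];

    /* Mask off the sign bit to get the biased exponent. */
    exponent = ((unsigned)(bytes[9] & 0x7f) << 8) | bytes[8];
    integer_bit = (int)(significand >> 63);

    if (exponent == 0)
        return integer_bit == 1;   /* pseudo-denormal */
    return integer_bit == 0;       /* unnormal, pseudo-NaN, pseudo-infinity */
}
```

Because the function works on a byte array, it can be applied to data as it arrives from the network or a file, before the bytes are ever loaded as a long double.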

> Could we just print the raw bytes as hex or something?  Sure, but then
> we'd need to interpret that anyway.  The days of manually poring over
> core dumps that came out of the line printer should be behind us these
> days.

Once you have detected madness somewhere in your data, be sceptical of it.

Obviously. There needs to exist some strategy where the offending data can be logged and analysed. The mechanism for problem diagnosis needs to scale.

You can fully control your cluster, but in the case discussed here,
the data was injected by a non-controlled source.

No item of hardware is fully under control either. Push enough bits through it, some will get corrupted. As I said in the email to which you are replying, this happens in practice, for real.


James.


