Summary: Removal of sysdeps/x86_64/fpu/s_sincos.S causes regressions
Product: glibc
Reporter: Markus Trippelsdorf <markus>
Component: math
Assignee: Not yet assigned to anyone <unassigned>
Status: RESOLVED DUPLICATE
Severity: enhancement
CC: aj, bugdal, carlos, davem, faltet, jakub, liubov.dmitrieva, nshmyrev, siddhesh, wbrana
Priority: P2
Flags: fweimer:security
Version: 2.16
Target Milestone: 2.18
URL: https://bugs.freedesktop.org/show_bug.cgi?id=52551
Host:
Target:
Build:
Last reconfirmed:
Description
Markus Trippelsdorf
2012-07-26 21:07:44 UTC
Please have a look at the gcc -ffast-math option if you want fast sincos and can live with the limitations.

(In reply to comment #1)
> Please have a look at the gcc -ffast-math option if you want fast sincos and
> can live with the limitations.

Yes. I've already tried this, but unfortunately it doesn't seem to help in this case. Libpoppler behaves exactly the same when compiled with or without -ffast-math (or -Ofast). Actually, the current glibc behavior is equivalent to using "-fno-builtin-sin -fno-builtin-cos" when building poppler.

Judging from the profiling figures in the bug report, it looks like the problem is the cost of feraiseexcept, not the cost of correct sin/cos computation. Would dropping the use of feraiseexcept and replacing it with simple arithmetic that generates the desired exception flags be a viable solution?

Marking this as an enhancement. We should try to fix this for 2.17 since it's a large performance regression for this function.

Using Intel's SSE2 sinf and cosf (http://sourceware.org/ml/libc-alpha/2012-06/msg00692.html) solves the issue. It only takes ~26 seconds to open the testcase:

# Overhead  Command  Shared Object         Symbol
  11.16%    okular   libpoppler.so.26.0.0  [.] SplashXPathScanner::computeIntersections()
  11.04%    okular   libpoppler.so.26.0.0  [.] void std::__introsort_loop<SplashIntersect*, long, cmpIntersectFunctor>(SplashIntersect*, SplashIntersect*, long, cmpIntersectFunctor) [clone .isra.15]
   8.27%    okular   libpoppler.so.26.0.0  [.] Gfx::doRadialShFill(GfxRadialShading*)
   5.43%    okular   libc-2.16.90.so       [.] _int_malloc
   5.31%    okular   libm-2.16.90.so       [.] __cosf
   5.16%    okular   libpoppler.so.26.0.0  [.] SplashXPathScanner::addIntersection(double, double, unsigned int, int, int, int)
   5.01%    okular   libm-2.16.90.so       [.] __sinf
   2.98%    okular   libc-2.16.90.so       [.] _int_free
   2.83%    okular   libpoppler.so.26.0.0  [.] Lexer::getObj(Object*, int)

I had to manually hook up the new functions with:

#undef sin
#define sin sinf
#undef cos
#define cos cosf

*** Bug 14496 has been marked as a duplicate of this bug. ***

This needs more work and discussion, so retargeting this bug to 2.18.

this is critical bug
it shouldn't be marked as enhancement

(In reply to comment #9)
> this is critical bug
> it shouldn't be marked as enhancement

If we fix sinf to be fast again it will be imprecise, which is a critical bug for another class of users. There is no implementation that can satisfy all requirements from all users. At the very least the precise implementation is at least correct, if slow. Therefore this is an enhancement request to make the implementation fast.

We are working on the possibility of having 3 libm variants: fast, default, and precise. At that point -ffast-math will select the fast variant and provide a fast but imprecise implementation. Until then we need help looking into these performance issues to see if we can relax the feraiseexcept in some cases or optimize the code to call it fewer times.

Carlos, as I noted above, the horrible slowness seems to be coming not from the correct algorithm but from the silly use of feraiseexcept to raise exceptions. Fixing this should get the performance back at least close to what it was before, when the function was broken.

Sorry, I see that you did acknowledge that already. I just skimmed through the code, but I'm not familiar enough with how glibc's math library is laid out to figure out where feraiseexcept is getting called from... Once it's found, fixing the problem should just be a matter of replacing the call with code that would generate the exception naturally. However, since the only exception that should be needed is the inexact exception, it should already be getting raised by the code to compute sin/cos.
A quick test to see if this is the source of the problem would be to make a no-op implementation of feraiseexcept and link it in (might need to use static linking to get libm to use it), then measure the performance.

This should have been fixed with the patch for bug 14496. Markus, can you confirm that?

I can see how the fix to #14496 would help, but it's still a bug if feraiseexcept is being called at all. Certainly the code that was fixed in that patch should not even be taken as part of a call to sin or cos. There's simply no reason to be calling feraiseexcept.

That's the point of the fix to bug 14496. The root of the problem is that we save and restore rounding modes in functions for which we don't have non-RN modes implemented. This calls feupdateenv (IIRC), which in turn calls feraiseexcept if there was an exception to be raised. The exception mask check on x86_64 was incorrect, causing feraiseexcept to be called all the time. With the fix, it is only called when there actually is an exception.

(In reply to comment #13)
> This should have been fixed with the patch for bug 14496. Markus, can you
> confirm that?

It now takes ~40 seconds to render the testpage on trunk. Better than 60 seconds, but still 40% slower than s_sincos.S.

  25.82%  libm-2.17.90.so       [.] __cos
  22.06%  libm-2.17.90.so       [.] __sin
   6.33%  libm-2.17.90.so       [.] __dubsin
   4.59%  libpoppler.so.36.0.0  [.] GfxPath::lineTo(double, double)
   4.50%  libpoppler.so.36.0.0  [.] Gfx::doRadialShFill(GfxRadialShading*)

With the split into 3 libm variants, the fast alternative definitely doesn't have to support rounding other than round-to-even, but I'd say that even the default version shouldn't support it, only the precise one. After all, until very recently, those rounding modes were not supported in transcendentals in glibc, and -frounding-math isn't the default in gcc either.
Perhaps we should add a predefined macro for -frounding-math in gcc, and the precise variant of libm could be selected using that macro. If the correct-rounding precise variants were suffixed differently (like the *_finite entrypoints are for -ffast-math), then all 3 libm variants could live in the same libm.so.6 and people would just choose what they want using -ffast-math vs. default vs. -frounding-math.

(In reply to comment #16)
> It now takes ~40 seconds to render the testpage on trunk.
> Better than 60 seconds, but still 40% slower than s_sincos.S.

That's expected. The assembly implementation used the fsincos instruction and the current implementation calls the sin() and cos() functions. The rounding mode changes have nothing to do with this part of the performance regression. Maybe we could implement a sincos_finite that calls the fsincos instruction and actually gets used with -ffinite-math-only.

(In reply to comment #17)
> Perhaps we should add a predefined macro for -frounding-math in gcc, and the
> precise variant of libm could be selected using that macro. If the correct-rounding
> precise variants were suffixed differently (like the *_finite entrypoints are
> for -ffast-math), then all 3 libm variants could live in the same libm.so.6
> and people would just choose what they want using -ffast-math vs. default
> vs. -frounding-math.

I like this idea.

On Fri, 26 Apr 2013, jakub at redhat dot com wrote:

> Perhaps we should add a predefined macro for -frounding-math in gcc, and the
> precise variant of libm could be selected using that macro. If the correct-rounding
> precise variants were suffixed differently (like the *_finite

Basing things on a predefined macro like that won't work with any future GCC support for #pragma STDC FENV_ACCESS on (within the scope of that pragma, function versions supporting rounding modes should be called).
So since you'd need a new compiler feature anyway, maybe something like __builtin_rounding_math () would be better, as it would allow FENV_ACCESS support without further library changes (the safe default for older compilers without -ffast-math being to assume -frounding-math may be in effect). So you could have e.g.

#define sin(x) (__builtin_rounding_math () ? sin (x) : __sin_noround (x))

(I'm using plain sin as the version supporting rounding modes, for safety for programs declaring functions themselves rather than including the system header.)

TS 18661-1 constant rounding modes (see <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1676.pdf>) would require a further compiler feature to say "call this function with the dynamic rounding mode in effect being the constant mode for the current scope".

Note that having extra variants like this would definitely need the libm-test.inc tests to be run for the extra variants (there's an existing issue on the wiki todo list to run the tests not involving infinities and NaNs with -ffinite-math-only to provide coverage of the *_finite versions).

Siddhesh, still none of the fenv stuff should be called at all if the initial rounding mode was nearest (except of course checking the initial mode). This should be special-cased since it's true basically all the time in real-world applications.

Jakub, the default libm should conform to IEEE requirements and should not cause correct programs to crash or go into infinite loops (both of which happen if you just remove the check for rounding mode, due to poor design in the IBM math library). This is A LOT weaker than the current aim of glibc, which is to provide correct rounding for all transcendental functions despite the fact that IEEE explicitly chose not to require it (just <1ulp error). In my opinion, if there are 3 versions, they should be:

1. Fast: Whatever the gamer/graphics folks want.
2. Default: IEEE-conforming
3. Best: Correctly-rounded results for all functions

It's only a limited subset of transcendental functions for which glibc is trying to be correctly rounding, and only for double for those functions; for most functions the aim is generally errors within "a few" ulps (and ISO C doesn't specify particular accuracy requirements), though correct rounding would be some sort of ideal.

Implementing the proposed TS 18661 bindings to IEEE 754-2008 would be a huge amount of work (and require compiler work as well), but if anyone gets the time for that work then I believe part 4 of the draft TS will reserve names such as crsin for correctly rounding functions (and crsinf128 for correctly rounding sine for _Float128, etc.), without requiring those functions, so it might make sense eventually to provide those names for correctly-rounding functions where available, without making them the default versions.

So far I'm still working on going through the accumulated libm bugs that don't require such huge amounts of work as TS 18661 or ISO 24747 implementation (where one reported bug often shows up lots of related bugs not previously in Bugzilla that also need to be fixed) rather than trying to find time for larger libm projects (and I'm also mostly leaving aside libm bugs in areas where others have shown interest in fixing them, such as performance or errno handling).

(The present functions, producing correctly-rounded results only for round-to-nearest, would not of course meet the requirements for functions such as crsin, although it's probably possible to adapt them by just doing a little bit of final arithmetic in the original rounding mode, having done the intermediate computations to produce a higher-precision result in round-to-nearest.)

(In reply to comment #10)
> If we fix sinf to be fast again it will be imprecise, which is a critical bug
> for another class of users. There is no implementation that can satisfy all
> requirements from all users. At the very least the precise implementation is
> at least correct, if slow. Therefore this is an enhancement request to make
> the implementation fast.
>
> We are working on the possibility of having 3 libm variants: fast, default,
> and precise. At that point -ffast-math will select the fast variant and
> provide a fast but imprecise implementation.

http://sourceware.org/bugzilla/show_bug.cgi?id=13658

Impreciseness of fast sincos can't be a critical bug, because it took 15 years (1997-2012) to report this bug, which means s_sincos.S is precise enough for 99% of apps. s_sincos.S should be restored, because slow sincos can break many apps: most open source apps aren't properly tested, and app devs aren't aware that sincos is 50% slower since 2.16. Only users will notice that e.g. a new Ubuntu is slow, but users usually can't submit a proper bug report like this one to app devs. App devs won't know that they should add -ffast-math to restore performance. Very few apps use -ffast-math at present, and -ffast-math can add new bugs to apps which aren't broken by fast sincos. A new function named e.g. sincos_precise should be added to glibc for the 1% of apps which are affected by the impreciseness of fast sincos.

I've posted a patch to implement __sincos_finite:

http://sourceware.org/ml/libc-alpha/2013-04/msg00720.html

This should allow you to compile with -ffast-math or -ffinite-math-only to use this fast implementation of sincos.

(In reply to comment #25)
> I've posted a patch to implement __sincos_finite:
>
> http://sourceware.org/ml/libc-alpha/2013-04/msg00720.html
>
> This should allow you to compile with -ffast-math or -ffinite-math-only to
> use this fast implementation of sincos.

Will you also test about 35,000 Linux apps and libs to see if they are affected by slow sincos and not affected by -ffinite-math-only, and submit bug reports to app/lib devs with a request to add -ffinite-math-only to the environment-defined CFLAGS/CXXFLAGS?
wbrana, "getting the right answer is slow" is not an excuse for giving the wrong answer. The "optimized" version of sincos has ZERO bits of precision for certain inputs. With that said, there are not "35000" apps and libraries to check and send bug reports to. A quick check with nm(1) will show you which ones even call trig functions, and it's a tiny fraction of that. Moreover, of those that do, a good number (anything except gaming and graphics) actually want the correct answer, not the fast-but-wrong answer, and had bugs that had gone unnoticed with the old, wrong sincos.

Getting fast-but-wrong math results is easy; you don't need the standard library for this. The purpose of the C language providing math functions (and recommending that they conform to IEEE requirements) in the standard library is to give applications something that would be very difficult to achieve themselves. Being wrong-by-default but providing a special option to get correct answers is not a solution, because portable programs that were not written specifically for glibc will not be able to get the right answers. They would have to encode glibc-specific knowledge to override the incorrect default. Requiring platform-specific knowledge to get optimal performance on a given platform (glibc) is unfortunate but at least justifiable. Requiring platform-specific knowledge to get correct results is not justifiable.

With all that said, I still question the cause of the huge performance drop. The only cost of the correct sin/cos versus the incorrect one should be argument reduction, which should be a no-op if the argument is already in the range [-pi,pi] or so, and cheap even a good ways outside of that range...

according to http://sourceware.org/bugzilla/show_bug.cgi?id=13658#c2
fast functions can be used when source operand is in certain range.
Slow functions should be used only outside interval.

"The FPTAN, FSIN, FCOS, and FSINCOS instructions set the C2 flag to 1 to indicate that the source operand is beyond the allowable range of ±2^63 and clear the C2 flag if the source operand is within the allowable range."

So, outside the interval [-2^63,+2^63] ("allowable range"), these instructions must not be used (or they can be used, but with a fallback if the C2 flag is set to 1). But note that the glibc implementation is more accurate, even with (very probably) correct rounding, so that it is better to use it anyway.

(In reply to comment #28)
> according to http://sourceware.org/bugzilla/show_bug.cgi?id=13658#c2
> fast functions can be used when source operand is in certain range.
> Slow functions should be used only outside interval.
>
> "The FPTAN, FSIN, FCOS, and FSINCOS instructions set the C2 flag to 1 to
> indicate that the source operand is beyond the allowable range of ±2^63 and
> clear the C2 flag if the source operand is within the allowable range."
>
> So, outside the interval [-2^63,+2^63] ("allowable range"), these instructions
> must not be used (or they can be used, but with a fallback if the C2 flag is
> set to 1). But note that the glibc implementation is more accurate, even with
> (very probably) correct rounding, so that it is better to use it anyway.

The problem is not just about range reduction. fsincos is not always accurate within the allowable range either.

(In reply to comment #27)
> With all that said, I still question the cause of the huge performance drop.
> The only cost of the correct sin/cos versus the incorrect one should be
> argument reduction, which should be a no-op if the argument is already in the
> range [-pi,pi] or so, and cheap even a good ways outside of that range...

I considered the possibility of keeping the assembly default and then falling into the default implementation if the exception flags indicate that the result was not accurate enough, but that doesn't catch all cases.
In fact, the limited internal precision of PI (66-bit mantissa) guarantees that there will be cases where fsincos thinks it has hit the target, but it hasn't. So until there's a way to flag inaccurate results reliably, I think the __sincos_finite way is the best option. There is definitely some scope to look at the implementation of the __sin and __cos functions, or even to figure out some clever way to get results faster in sincos than just calling __sin and __cos, but that will need more work.

(In reply to comment #25)
> I've posted a patch to implement __sincos_finite:
>
> http://sourceware.org/ml/libc-alpha/2013-04/msg00720.html
>
> This should allow you to compile with -ffast-math or -ffinite-math-only to
> use this fast implementation of sincos.

Your patch doesn't work for me. I've rebuilt glibc and then recompiled poppler with "-ffinite-math-only", but I still only see calls to __sin and __cos with "perf top". The poppler code in question can be found here:

http://cgit.freedesktop.org/poppler/poppler/tree/poppler/Gfx.cc#n3112

When I just add the following file to glibc:

markus@x4 glibc % cat sysdeps/x86_64/fpu/s_sincos.S
#include <machine/asm.h>

	.text
ENTRY (__sincos)
	movsd	%xmm0, -8(%rsp)
	fldl	-8(%rsp)
	fsincos
	fnstsw	%ax
	testl	$0x400, %eax
	jnz	1f
	fstpl	(%rsi)
	fstpl	(%rdi)
	retq
1:	fldpi
	fadd	%st(0)
	fxch	%st(1)
2:	fprem1
	fnstsw	%ax
	testl	$0x400, %eax
	jnz	2b
	fstp	%st(1)
	fsincos
	fstpl	(%rsi)
	fstpl	(%rdi)
	retq
END (__sincos)
weak_alias (__sincos, sincos)

poppler gets fast again and uses __sincos.

On Mon, Apr 29, 2013 at 12:32:34PM +0000, wbrana at gmail dot com wrote:
> http://sourceware.org/bugzilla/show_bug.cgi?id=14412
>
> --- Comment #28 from wbrana at gmail dot com 2013-04-29 12:32:34 UTC ---
> according to http://sourceware.org/bugzilla/show_bug.cgi?id=13658#c2
> fast functions can be used when source operand is in certain range.
> Slow functions should be used only outside interval.
>
> "The FPTAN, FSIN, FCOS, and FSINCOS instructions set the C2 flag to 1 to
> indicate that the source operand is beyond the allowable range of ±2^63 and
> clear the C2 flag if the source operand is within the allowable range."
>
> So, outside the interval [-2^63,+2^63] ("allowable range"), these instructions
> must not be used (or they can be used, but with a fallback if the C2 flag is
> set to 1). But note that the glibc implementation is more accurate, even with
> (very probably) correct rounding, so that it is better to use it anyway.
>
Well, results of sin, cos, etc. with large numbers are pure garbage no matter
what you try. As cos(1.0e18*PI) = cos(1.0e18*PI + PI), you can basically print a random
number between -1.0 and 1.0 and be correct.
wbrana, unfortunately that information is mistaken; fsin DOES NOT work on the interval [-2^63,+2^63]. A quick test reveals:

#include <math.h>
#include <stdio.h>

double x87_sin(double x)
{
	__asm__("fsin" : "+t"(x));
	return x;
}

int main()
{
	double x = 0x1p63;
	printf("%.15g %.15g\n", sin(x), x87_sin(x));
}

$ ./a.out
0.999930376673442 9.22337203685478e+18

I suspect this is just a boundary issue (the [] should be () in the range), since slightly smaller values give more reasonable, but still badly wrong, results. For example if x is 0x1p62,

0.702922443619209 0.707132927452779

which is wrong in the third decimal place.

Siddhesh, could you explain the motivation for "overloading" the meaning of "_finite" with "fast but wrong"? I'm guessing the idea is that you're thinking finite math implies not just lack of infinities but also lack of "unreasonably large" inputs? This is definitely the glibc team's call, since there's no reasonable basis for assuming you'll get correct results from non-default performance-oriented math settings, but I think some justification would be nice.

Ondrej, the results are not pure garbage. There are only two correct results sin(0x1p1000) can give: either of the nearest representable neighbors (less than 1ulp error) of the real number sin(2**1000). Your fallacy is writing PI in your example. There is no such floating point number as PI, and this is part of why implementing correct trig functions is nontrivial; implementing ones that work in degree units, or a base-2 division of the unit circle, would be much easier.

Markus, your implementation has incorrect argument reduction. For inputs outside of a small range (somewhere around [-2pi,2pi] but maybe slightly larger), and near multiples of pi/2 (where either sin or cos has a near-zero result and thus needs extra precision), you really need to fall back to the C range-reduction code (or good luck reimplementing THAT in asm...). You could perhaps use the x87 instruction in limited cases, though.
(In reply to comment #32)
> wbrana, unfortunately that information is mistaken; fsin DOES NOT work on the
> interval [-2^63,+2^63]. A quick test reveals:

what about smaller interval [-2^31,+2^31]

(In reply to comment #30)
> Your patch doesn't work for me. I've rebuilt glibc and then
> recompiled poppler with "-ffinite-math-only", but I still
> only see calls to __sin and __cos with "perf top".

Looks like your code does not call sincos directly. sin and cos calls on the same input seem to get optimized into sincos, so I'm not sure if -ffinite-math-only can detect that. Also, the patch changes an installed header, so you need to somehow include that during your compilation.

At the outer edges of that range, the error is "only" around 1000 ulps. This isn't correct, but at least it's better than just having 4 or 5 bits of precision (or none at all). Since the x87 internally uses a 66-bit representation of pi, and doubles are 52-bit, I suspect precision starts to trail off around 16384*pi, but it will be bad even for much smaller values when the result is close to zero. Experimentally, 0x1p18 seems to be the first power of two for which the result is wrong within 15 decimal places, but 0x1p14*M_PI is already badly wrong (only the first 4 decimal places are correct).

On Mon, Apr 29, 2013 at 01:28:10PM +0000, bugdal at aerifal dot cx wrote:
> http://sourceware.org/bugzilla/show_bug.cgi?id=14412
>
> --- Comment #33 from Rich Felker <bugdal at aerifal dot cx> 2013-04-29 13:28:10 UTC ---
> Ondrej, the results are not pure garbage. There are only two correct results
> sin(0x1p1000) can give: either of the nearest representable neighbors (less
> than 1ulp error) of the real number sin(2**1000). Your fallacy is writing PI in your
> example. There is no such floating point number as PI, and this is part of why
> implementing correct trig functions is nontrivial; implementing ones that work
> in degree units, or a base2 division of the unit circle, would be much easier.
>
Ok, assume you wrote a function cos_deg that takes its input in degrees. You
still have

cos_deg(360*(2**60)) == cos_deg(360*(2**60)+180)

In short, when you have inputs with zero significant digits you can
expect only garbage as output.
(In reply to comment #36)
> (In reply to comment #30)
> > Your patch doesn't work for me. I've rebuilt glibc and then
> > recompiled poppler with "-ffinite-math-only", but I still
> > only see calls to __sin and __cos with "perf top".
>
> Looks like your code does not call sincos directly. sin and cos calls on the
> same input seem to get optimized into sincos, so I'm not sure if
> -ffinite-math-only can detect that. Also, the patch changes an installed
> header, so you need to somehow include that during your compilation.

I've just verified locally that gcc's transformation of sin + cos into sincos works fine with the further transformation into __sincos_finite with -ffinite-math-only, so it's most likely that your code is not using the header file changes. Easiest way to test this is to add the patch on top of your distribution glibc, build and install it.

BTW, the question of whether -ffinite-math-only is the right option for this kind of thing is still open, so it's likely that the resulting fix will be different. The crux would be similar though, i.e. a fast function assembly that is not always accurate.

(In reply to comment #39)
> I've just verified locally that gcc's transformation of sin + cos into
> sincos works fine with the further transformation into __sincos_finite with
> -ffinite-math-only, so it's most likely that your code is not using the
> header file changes. Easiest way to test this is to add the patch on top of
> your distribution glibc, build and install it.

Hmm, that was exactly how I tested your patch. And <math.h> is included in Gfx.cc, so bits/math-finite.h should be included when -ffinite-math-only is used. Will double-check later.

Ondrej, the input always has 52 significant bits unless it's denormal. If, due to the way you computed the input value, none of them are significant to you, then that's your problem. sin() and cos() are still required to return the correct result assuming all bits of input are significant.

On Mon, Apr 29, 2013 at 02:48:43PM +0000, bugdal at aerifal dot cx wrote:

> http://sourceware.org/bugzilla/show_bug.cgi?id=14412
>
> --- Comment #41 from Rich Felker <bugdal at aerifal dot cx> 2013-04-29 14:48:43 UTC ---
> Ondrej, the input always has 52 significant bits unless it's denormal. If, due
> to the way you computed the input value, none of them are significant to you,
> then that's your problem. sin() and cos() are still required to return the
> correct result assuming all bits of input are significant.

Citation needed.

I don't have a copy of the latest IEEE 754 on hand (there's no free online version), but the consensus among implementors seems to be that a few special functions (such as sqrt) require correct rounding, and that the remaining functions (mostly transcendental) are required to give results with less than 1ulp of error. If sin(0x1p1000) is computed with incorrect argument reduction, it may have up to approximately 1e16 ulp error.

Moreover, the C standard specifies:

"2 The sin functions compute the sine of x (measured in radians).

Returns

3 The sin functions return sin x."

Without annex F, the C standard places no requirements on the precision or accuracy of floating point results (e.g. 2.0+2.0==5.0 is legal), but annex F recommends IEEE-conforming behavior. If anyone else can provide some better citations, they'd be welcome.

On Mon, 29 Apr 2013, bugdal at aerifal dot cx wrote:
> I don't have a copy of the latest IEEE 754 on hand (there's no free online
> version), but the consensus among implementors seems to be that a few special
> functions (such as sqrt) require correct rounding, and that the remaining
> functions (mostly transcendental) are required to give results with less than
> 1ulp of error. If sin(0x1p1000) is computed with incorrect argument reduction,
> it may have up to approximately 1e16 ulp error.
IEEE 754-2008, subclause 5.4 "formatOf general-computational operations",
specifies squareRoot and fusedMultiplyAdd on the same basis as addition,
subtraction, multiplication, division, convertFromInt, various operations
converting to integer formats and a range of other operations. As of the
2008 revision, these are "formatOf" operations (so including e.g. fused
multiply-add of long double values, rounded once to float as the
destination format), which do not necessarily have C bindings as of C11;
for the formatOf bindings you need TS 18661-1.
Those operations are fully defined. C11 Annex F for sqrt says "sqrt is
fully specified as a basic arithmetic operation in IEC 60559. The returned
value is dependent on the current rounding direction mode." (F.10.4.5) and
similarly for fma (F.10.10.1) says "fma(x, y, z) computes xy + z,
correctly rounded once.".
For most functions, not fully defined in IEEE 754 or C11, no specific
accuracy requirements are imposed, although there are requirements on
particular special cases in Annex F. That is, Annex F leaves the
implementation-defined accuracy from 5.2.4.2.2 paragraph 6 for most of the
functions.
IEEE 754-2008 clause 9 "Recommended operations" lists a range of other
functions with the requirement "A conforming function shall return results
correctly rounded for the applicable rounding direction for all operands
in its domain.". These include, for example, both sin and sinPi (sine of
pi times the argument). There are no ISO C bindings for these operations
at present. I believe the intent is that TS 18661-4 will specify bindings
to them, using names such as crsin for the correctly rounding functions
(while probably adding names such as sinpi for non-correctly-rounding
versions). I believe the intent is that the correctly-rounding versions
will be optional for implementations of TS 18661-4. (The exhaustive
searches required to prove correctly-rounding, bounded-resource-use
implementations of various functions are, as far as I know, still not
feasible for binary128, for example, and correct rounding without
arbitrary-precision computation is particularly hard for functions of
multiple arguments.)
I don't know where the idea of a 1ulp limit on errors being some sort of
IEEE requirement came from. As far as I'm concerned, sin(0x1p1000) giving
a result that is "reasonably close" (say a few ulp) to the actual sine of
0x1p1000 is a matter of quality of implementation, and giving a result
that isn't wildly different is a matter of meeting the requirement that
sin implements the sine function rather than some other function that
returns random numbers for large inputs.
(For some of the additional functions in ISO 24747, the results for large
values of certain inputs are specified as implementation-defined because
of perceived difficulty of implementing the functions accurately for large
inputs. But none of the functions in C11 itself have such latitude to do
something arbitrary for large inputs.)
(In reply to comment #32)
> I suspect this is just a boundary issue (the [] should be () in the range),
> since slightly smaller values give more reasonable, but still badly wrong,
> results. For example if x is 0x1p62,

Yes, it is () and not [], according to the manual.

> 0.702922443619209 0.707132927452779
>
> which is wrong in the third decimal place.

This is not restricted to numbers on the boundary. A quick Google search on the accuracy of fsincos will show that there are numbers even close to pi/2 that are way off in terms of accuracy.

> Siddhesh, could you explain the motivation for "overloading" the meaning of
> "_finite" with "fast but wrong"? I'm guessing the idea is that you're
> thinking finite math implies not just lack of infinities but also lack of
> "unreasonably large" inputs? This is definitely the glibc team's call, since
> there's no reasonable basis for assuming you'll get correct results from
> non-default performance-oriented math settings, but I think some
> justification would be nice.

That was a mistake: I somehow muddled it up with -ffast-math in my head. I'm going to first try to see if the default sincos can be made faster. Jakub pointed out offline that the argument reduction code is duplicated in the calls to __sin and __cos, and consolidating that in sincos should definitely be a good first step.

Ondrej, I think the citation we were looking for is C99 7.12.1 Treatment of error conditions:

"1 The behavior of each of the functions in <math.h> is specified for all representable values of its input arguments, except where stated otherwise."

Another citation, from the C99 Rationale (http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf), supporting the option to give wrong results for trig functions, page 27, lines 9-13:

"Some math functions such as those that do argument reduction modulo an approximation of π have good accuracy for small arguments, but poor accuracy for large arguments. It is not unusual for an implementation of the trigonometric functions to have zero bits correct in the computed result for large arguments. For cases like this, an implementation might break the domain of the function into disjoint regions and specify the accuracy in each region."

Note that an implementation should still document the accuracy, and in the case of wrong argument reduction the error could theoretically be as large as 2**1074 ULPs for some inputs of sin/cos (if the correct result were denormal but the function returned 1.0 or -1.0), and much worse (even possibly infinite) for tan.

*** Bug 15936 has been marked as a duplicate of this bug. ***

Although, formally, the regression part came from removing that file, the substance (that the IBM libm functions are slow in special cases, and correctly-rounded functions should be under different names, not the default for sin, cos, sincos) is the same as bug 5781.

*** This bug has been marked as a duplicate of bug 5781 ***