This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: IEEE128 binary float to decimal float conversion routines
- From: Joseph Myers <joseph@codesourcery.com>
- To: Steven Munroe <munroesj@linux.vnet.ibm.com>
- Cc: Steve Munroe <sjmunroe@us.ibm.com>, "libc-alpha@sourceware.org" <libc-alpha@sourceware.org>, Michael R Meissner <mrmeissn@us.ibm.com>, "Paul E. Murphy" <murphyp@linux.vnet.ibm.com>, Tulio Magno Quites Machado Filho <tuliom@linux.vnet.ibm.com>
- Date: Wed, 16 Dec 2015 00:06:58 +0000
- Subject: Re: IEEE128 binary float to decimal float conversion routines
On Tue, 15 Dec 2015, Steven Munroe wrote:
> Ok, let me try with the simpler case of _Decimal128 to _Float128, where the
> significand conversion is exact (log2(10^34) -> 112.9 -> <= 113 bits).
> So you mention "continued fraction analysis", which was not part of my
> formal education (40+ years ago), but I will try.
The theory of rational approximations (determining how close a/b can be to
given x for integer a and b, given bounds on b) applies here, because you
want to determine how close a*10^m and b*2^n can be, for a and b in the
ranges of mantissa values and m and n corresponding exponents, which is
essentially equivalent to determining how close a/b can be to 2^n/10^m.
Earlier in this thread, Christoph Lauter referred to a paper
<http://www.computer.org/csdl/trans/tc/preprint/07271015-abs.html> about
comparison between binary and decimal floating-point values - conversions
are essentially the same issue, and that paper references previous work on
conversions. Section 4.2 of that paper discusses how the classical theory
of continued fractions can be applied to find the closest cases (the exact
details of the relevant analysis depend on the details of the code you use
to implement the conversions).
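As a toy illustration of that machinery (a Python sketch, not conversion code): the convergents of the continued fraction of x = 2^n/10^m are its best rational approximations with bounded denominator, which is exactly what bounds how close a*10^m can come to b*2^n.

```python
from fractions import Fraction

def continued_fraction(x, max_terms=32):
    """Continued-fraction expansion [a0; a1, a2, ...] of a rational x > 0."""
    terms = []
    for _ in range(max_terms):
        a = x.numerator // x.denominator
        terms.append(a)
        x -= a
        if x == 0:
            break
        x = 1 / x
    return terms

def convergents(terms):
    """Yield the convergents p/q of a continued fraction."""
    p_prev, q_prev, p, q = 1, 0, terms[0], 1
    yield Fraction(p, q)
    for a in terms[1:]:
        p_prev, p = p, a * p + p_prev
        q_prev, q = q, a * q + q_prev
        yield Fraction(p, q)

# Toy case: x = 2^10 / 10^3 = 128/125.  Its continued fraction terminates.
x = Fraction(2**10, 10**3)
cf = continued_fraction(x)            # [1, 41, 1, 2]
best = list(convergents(cf))          # 1, 42/41, 43/42, 128/125
# The convergent 43/42 beats every a/b with a smaller denominator:
err = abs(x - Fraction(43, 42))
assert all(abs(x - Fraction(round(x * b), b)) > err for b in range(1, 42))
```

The real analysis runs the same recurrence on 2^n/10^m for the exponent ranges of the two formats to locate the hardest-to-round inputs.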
> So any _Decimal128 < 9999999999999999999999999999999999e48 (~1.0e82) can
> be converted with one _Float128 multiply of 2 exact values, giving a
> result rounded to within 1 ulp.
Those cases are indeed straightforward (and if you use round-to-odd, that
also gives conversions of numbers within that range to narrower binary
types; similarly, you can divide instead of multiplying in cases of
integer * 1e-n, n <= 48).
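The exactness claims behind that single-multiply path can be checked mechanically. A small Python sketch (assuming only that binary128 carries a 113-bit significand):

```python
SIG_BITS = 113  # IEEE binary128 significand width, including the implicit bit

def bits_needed(n):
    """Significant bits in n once trailing zero bits (free in binary FP) drop."""
    while n % 2 == 0:
        n //= 2
    return n.bit_length()

# The largest _Decimal128 significand (34 nines) fits in 113 bits exactly:
assert bits_needed(10**34 - 1) == 113
# Powers of ten up to 10**48 are exact too (odd part 5**48 needs 112 bits):
assert all(bits_needed(10**e) <= SIG_BITS for e in range(49))
assert bits_needed(10**48) == 112
# 10**49 is the first power that cannot be stored exactly:
assert bits_needed(10**49) == 114
```

So for inputs below 10^34 * 10^48, both factors of the multiply are exact and a single correctly rounded multiplication gives the result the quoted text describes.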
> Now as the exponent of the _Decimal128 input exceeds 48, the table of
> _Float128 powers of 10 will contain values that have been rounded. Now I
> assume that some additional exponent range can be covered by ensuring
> that the table of _Float128 powers of 10 has been pre-rounded to odd?
But pre-rounding the larger powers to odd doesn't help; rounding to odd
generally only helps when it's the very last operation before rounding to
a narrower type that gets rounded to odd, with all previous operations
being exact. In this case, you'd have an inexact power followed by a
multiplication, and because one of the arguments to the multiplication is
inexact (and the multiplication is not by a power of 2), the result of the
multiplication may not be correctly rounded.
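A toy model makes the failure concrete. The sketch below (Python, with a hypothetical 4-bit significand standing in for binary128's 113 bits) pre-rounds an inexact "power" to odd, multiplies, and rounds to nearest; the result differs from correctly rounding the exact product.

```python
from fractions import Fraction

P = 4  # toy significand width; the same effect occurs at 113 bits

def scale(x):
    """Scale x > 0 into [2^(P-1), 2^P); return (scaled, exponent)."""
    e = 0
    while x >= 2**P:
        x /= 2; e += 1
    while x < 2**(P - 1):
        x *= 2; e -= 1
    return x, e

def round_nearest(x):
    """Round to nearest, ties to even, P-bit significand."""
    x, e = scale(x)
    m = int(x)                      # floor, since x > 0
    r = x - m
    if r > Fraction(1, 2) or (r == Fraction(1, 2) and m % 2 == 1):
        m += 1
    return Fraction(m) * Fraction(2) ** e

def round_odd(x):
    """Round to odd: on any inexact result, force the low significand bit to 1."""
    x, e = scale(x)
    m = int(x)
    if x != m and m % 2 == 0:
        m += 1
    return Fraction(m) * Fraction(2) ** e

power = Fraction(17, 16)            # needs 5 bits: inexact at P = 4
a = Fraction(13)                    # exact 4-bit significand
correct = round_nearest(a * power)              # exact product, one rounding
via_odd = round_nearest(a * round_odd(power))   # pre-rounded-to-odd power
assert correct == 14 and via_odd == 15          # off by one ulp
```

Round-to-odd is sound only as the final operation before the narrowing rounding, with everything earlier exact; here the inexact power sits one multiplication too early.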
Instead, the larger powers need to be stored with extra precision, and
extra precision used for the multiplication. (You don't need to store all
the powers in the range of _Float128; there is a speed/space trade-off:
you could store fewer powers and compute the one you need at runtime by
doing an extra-precision multiply of two or more stored powers, or you
could store them all if that's the right trade-off for the processors in
question.)
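One common way to hold the extra precision is an unevaluated hi/lo pair, as in double-double arithmetic. A Python sketch using binary64 pairs for illustration (a real table would use _Float128 pairs; the splitting scheme here is an assumption for the example, not glibc's actual layout):

```python
from fractions import Fraction

def pow10_pair(k):
    """Store 10**k as a (hi, lo) pair of binary64 floats: hi is the rounded
    power and lo the rounding error, so hi + lo carries ~twice the bits."""
    exact = Fraction(10**k)
    hi = float(exact)
    lo = float(exact - Fraction(hi))
    return hi, lo

k = 30                       # 10**30 needs 70 significant bits: inexact alone
hi, lo = pow10_pair(k)
err_hi = abs(Fraction(10**k) - Fraction(hi))
err_pair = abs(Fraction(10**k) - (Fraction(hi) + Fraction(lo)))
assert err_hi > 0            # hi alone is wrong
assert err_pair == 0         # here the pair happens to recover 10**30 exactly
```

Products of such pairs can then be formed with error-free transforms (Dekker/two-product style multiplies), so a missing power computed from stored ones keeps enough precision for the final rounding.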
The amount of extra precision needed depends both on how close the closest
cases for correct rounding are and on how much error can accumulate from
the multiplications: there needs to be enough precision that the
inaccuracy in the stored values and subsequent computations is small
enough not to affect the rounding of the final result.
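In a toy setting that analysis can even be done by exhaustive search. The sketch below (Python, hypothetical 2-digit decimal inputs into a 5-bit binary format) measures how close any exact input comes to a round-to-nearest midpoint; the negated log2 of that worst-case gap bounds the guard bits the intermediate computation must keep.

```python
from fractions import Fraction

P = 5  # toy binary significand bits

def gap_to_midpoint(x):
    """Distance (in ulps) from x > 0 to the nearest round-to-nearest midpoint."""
    while x >= 2**P:
        x /= 2
    while x < 2**(P - 1):
        x *= 2
    f = x - int(x)                      # fractional part of the significand
    return abs(f - Fraction(1, 2))

# Exhaustively scan 2-digit significands times 10^m; drop exact ties (gap 0),
# which round-to-nearest-even handles directly rather than by extra precision.
gaps = (gap_to_midpoint(Fraction(d) * Fraction(10) ** m)
        for d in range(10, 100) for m in range(-5, 6))
worst = min(g for g in gaps if g > 0)
assert 0 < worst < Fraction(1, 2)
```

The real formats are far too large for this brute force, which is why the continued-fraction analysis above is used to locate the worst cases instead.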
--
Joseph S. Myers
joseph@codesourcery.com