This is the mail archive of the elfutils-devel@sourceware.org mailing list for the elfutils project.
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]
Re: incorrect signed data

From: Josh Stone <jistone at redhat dot com>
To: elfutils-devel at lists dot fedorahosted dot org
Date: Wed, 05 Feb 2014 09:36:17 -0800
Subject: Re: incorrect signed data
On 02/05/2014 03:44 AM, Mark Wielaard wrote:
> On Tue, 2014-02-04 at 18:24 -0800, Josh Stone wrote:
>> On 02/04/2014 03:12 PM, Josh Stone wrote:
>>> There are only a few internal dwarf_formsdata calls: for the decls as I
>>> mentioned, and in array_size() for DW_AT_lower/upper_bound.  AFAICS the
>>> spec doesn't explicitly call bounds signed or unsigned, but only
>>> unsigned makes sense to me, so these also ought to use dwarf_formudata.
>>
>> http://www.dwarfstd.org/ShowIssue.php?issue=020702.1
>>
>> So Fortran allows negative bounds, yay, and this is the origin of the
>> standard's vague statements about data[1248] context.
> 
> Thanks for finding this, it explains the context nicely.
> 
>> Here's a little experiment with gcc-gfortran-4.8.2-7.fc20.x86_64:
>> (and forgive my fortran ignorance, but at least this compiles)
>>
>>      PROGRAM main
>>        INTEGER A(10:199)
>>        INTEGER B(-20:-10)
>>        A(10) = B(-10)
>>      END
>>
>> yields:
>>
>>  [    67]    array_type
>>              type                 (ref4) [    7f]
>>              sibling              (ref4) [    78]
>>  [    70]      subrange_type
>>                type                 (ref4) [    78]
>>                lower_bound          (data1) 10
>>                upper_bound          (data1) 199
>>  [    78]    base_type
>>              byte_size            (data1) 8
>>              encoding             (data1) signed (5)
>>              name                 (strp) "integer(kind=8)"
>>  [    7f]    base_type
>>              byte_size            (data1) 4
>>              encoding             (data1) signed (5)
>>              name                 (strp) "integer(kind=4)"
>>  [    86]    array_type
>>              type                 (ref4) [    7f]
>>              sibling              (ref4) [    a5]
>>  [    8f]      subrange_type
>>                type                 (ref4) [    78]
>>                lower_bound          (data8) 18446744073709551596
>>                upper_bound          (data8) 18446744073709551606
>>
>> Thus gfortran appears to support the current elfutils behavior - read it
>> as unsigned and cast it without sign extension.  It happily put 199 in
>> data1, and went all the way to data8 for negative values.  It could have
>> been more compact with sdata instead of data8 though.
>>
>> Also, apparently eu-readelf is not using dwarf_formsdata for bounds, and
>> it should.
> 
> It doesn't because it is very low-level and doesn't use any context. So
> it just sees the DW_FORM_data8 and will print its value.

That's a lazy answer, and not actually true.  The encodings above are
"just" data1, but it prints the meaning.  And high_pc is often data8
these days, but it computes and prints its symbolic meaning.  So there's
no reason it couldn't give lower/upper_bound some treatment.  And
assuming signed is probably ok, even for C. (see more below)

> But if I read
> the DWARF issue correctly then a higher-level interface seeing a
> DW_TAG_subrange_type would lookup the DW_TAG_type for the DIE first to
> see whether it is signed or not to decide how to interpret the
> DW_AT_lower and upper bound values. It can even be a reference or an
> exprloc that represents the actual value. We might want to introduce a
> dwarf_subrange_bounds () function that does that.

The subrange_type encoding indicates signed index, but that doesn't
actually change how GCC is using data[1248].  That data1=199 would be
negative if you sign extended it, but for actual negative values it's
using a full data8 with the MSB set.  It seems pretty clear that GCC
does not expect consumers to perform any sign extension on data[1248].

Note, even though the subrange type here is signed "integer(kind=8)", I
couldn't get gfortran to accept bounds greater than INT32_MAX.  So
there's no chance that 18446744073709551596 could be a legitimate
positive bound, at least for this language.

FWIW, C appears to get subrange_type unsigned "sizetype".  So you could
use this to determine how to show the resulting number, even though the
data1 should never be sign-extended in the process.  But if we just
decide to always print bounds signed, this will be fine until some C
program has an array bigger than INT64_MAX.

>>   Binutils readelf prints those as hex, no better.
>>
>> FWIW, libdwarf's dwarfdump just reveals its indecision:
>>   DW_AT_lower_bound           10
>>   DW_AT_upper_bound           199(as signed = -57)
>> and
>>   DW_AT_lower_bound           18446744073709551596(as signed = -20)
>>   DW_AT_upper_bound           18446744073709551606(as signed = -10)
> 
> Right, because dwarfdump is similar to eu-readelf, it doesn't use any
> context and so it doesn't know how to represent the value encoded with
> DW_FORM_data8. I actually like that it also prints the signed value if
> different. Maybe we should make eu-readelf do the same? Printing is hex
> like binutils readelf does is another way to mask the ambiguity at the
> low-level.
> 
>> So now I'm not sure anything needs to change.  At least dwarf_formsdata
>> should stay as-is for gcc.
> 
> Are you sure? I think your original analysis is correct that
> dwarf_formsdata () is wrong and really should sign-extend.

No, see above; GCC wrote signed index 199 into a data1:199, so it would
be wrong to sign-extend this.  We're at the mercy of the producer.

I also tried dragonegg to see what LLVM does to my fortran:

 [     b]  compile_unit
           producer             (strp) "4.8.2 20131212 (Red Hat 4.8.2-7)"
           language             (data2) C89 (1)
           name                 (strp) "foo.f90"

(note: wrong language!)

 [    a7]    base_type
             name                 (strp) "int"
             byte_size            (data1) 4
             encoding             (data1) signed (5)
 [    ae]    array_type
             type                 (ref4) [    a0]
 [    b3]      subrange_type
               type                 (ref4) [    a7]
               upper_bound          (data1) 188
 [    ba]    array_type
             type                 (ref4) [    a0]
 [    bf]      subrange_type
               type                 (ref4) [    a7]
               upper_bound          (data1) 9

So even though the subrange_type is signed, it 0-indexed both arrays
(like C89). :/  But still, that data1:188 should not be sign-extended to
have its proper meaning.  (my original 10:199 became 0:188)

>>   We could conceivably use dwarf_formudata for
>> DW_AT_decl_file/line/column, since those really are specified unsigned,
>> but this is unlikely to ever make a difference.  The values for
>> dwarf_decl_line/column are asserted 0..INT_MAX, and people with more
>> than INT64_MAX files are already insane.
> 
> You are right. Still using dwarf_formudata () would be more correct
> IMHO.

Yeah, udata is technically correct, which is the best kind of correct.
:)

I'll put together a patch.

Josh
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]