This is the mail archive of the
elfutils-devel@sourceware.org
mailing list for the elfutils project.
Re: incorrect signed data
- From: Josh Stone <jistone at redhat dot com>
- To: elfutils-devel at lists dot fedorahosted dot org
- Date: Wed, 05 Feb 2014 09:36:17 -0800
- Subject: Re: incorrect signed data
On 02/05/2014 03:44 AM, Mark Wielaard wrote:
> On Tue, 2014-02-04 at 18:24 -0800, Josh Stone wrote:
>> On 02/04/2014 03:12 PM, Josh Stone wrote:
>>> There are only a few internal dwarf_formsdata calls: for the decls as I
>>> mentioned, and in array_size() for DW_AT_lower/upper_bound. AFAICS the
>>> spec doesn't explicitly call bounds signed or unsigned, but only
>>> unsigned makes sense to me, so these also ought to use dwarf_formudata.
>>
>> http://www.dwarfstd.org/ShowIssue.php?issue=020702.1
>>
>> So Fortran allows negative bounds, yay, and this is the origin of the
>> standard's vague statements about data[1248] context.
>
> Thanks for finding this, it explains the context nicely.
>
>> Here's a little experiment with gcc-gfortran-4.8.2-7.fc20.x86_64:
>> (and forgive my fortran ignorance, but at least this compiles)
>>
>> PROGRAM main
>> INTEGER A(10:199)
>> INTEGER B(-20:-10)
>> A(10) = B(-10)
>> END
>>
>> yields:
>>
>> [ 67] array_type
>> type (ref4) [ 7f]
>> sibling (ref4) [ 78]
>> [ 70] subrange_type
>> type (ref4) [ 78]
>> lower_bound (data1) 10
>> upper_bound (data1) 199
>> [ 78] base_type
>> byte_size (data1) 8
>> encoding (data1) signed (5)
>> name (strp) "integer(kind=8)"
>> [ 7f] base_type
>> byte_size (data1) 4
>> encoding (data1) signed (5)
>> name (strp) "integer(kind=4)"
>> [ 86] array_type
>> type (ref4) [ 7f]
>> sibling (ref4) [ a5]
>> [ 8f] subrange_type
>> type (ref4) [ 78]
>> lower_bound (data8) 18446744073709551596
>> upper_bound (data8) 18446744073709551606
>>
>> Thus gfortran appears to support the current elfutils behavior - read it
>> as unsigned and cast it without sign extension. It happily put 199 in
>> data1, and went all the way to data8 for negative values. It could have
>> been more compact with sdata instead of data8 though.
>>
>> Also, apparently eu-readelf is not using dwarf_formsdata for bounds, and
>> it should.
>
> It doesn't because it is very low-level and doesn't use any context. So
> it just sees the DW_FORM_data8 and will print its value.
That's a lazy answer, and not actually true. The encodings above are
"just" data1, but it prints the meaning. And high_pc is often data8
these days, but it computes and prints its symbolic meaning. So there's
no reason it couldn't give lower/upper_bound some treatment. And
assuming signed is probably ok, even for C. (see more below)
> But if I read
> the DWARF issue correctly then a higher-level interface seeing a
> DW_TAG_subrange_type would lookup the DW_TAG_type for the DIE first to
> see whether it is signed or not to decide how to interpret the
> DW_AT_lower and upper bound values. It can even be a reference or an
> exprloc that represents the actual value. We might want to introduce a
> dwarf_subrange_bounds () function that does that.
The subrange_type encoding indicates signed index, but that doesn't
actually change how GCC is using data[1248]. That data1=199 would be
negative if you sign extended it, but for actual negative values it's
using a full data8 with the MSB set. It seems pretty clear that GCC
does not expect consumers to perform any sign extension on data[1248].
Note, even though the subrange type here is signed "integer(kind=8)", I
couldn't get gfortran to accept bounds greater than INT32_MAX. So
there's no chance that 18446744073709551596 could be a legitimate
positive bound, at least for this language.
FWIW, C appears to get subrange_type unsigned "sizetype". So you could
use this to determine how to show the resulting number, even though the
data1 should never be sign-extended in the process. But if we just
decide to always print bounds signed, this will be fine until some C
program has an array bigger than INT64_MAX.
>> Binutils readelf prints those as hex, no better.
>>
>> FWIW, libdwarf's dwarfdump just reveals its indecision:
>> DW_AT_lower_bound 10
>> DW_AT_upper_bound 199(as signed = -57)
>> and
>> DW_AT_lower_bound 18446744073709551596(as signed = -20)
>> DW_AT_upper_bound 18446744073709551606(as signed = -10)
>
> Right, because dwarfdump is similar to eu-readelf, it doesn't use any
> context and so it doesn't know how to represent the value encoded with
> DW_FORM_data8. I actually like that it also prints the signed value if
> different. Maybe we should make eu-readelf do the same? Printing is hex
> like binutils readelf does is another way to mask the ambiguity at the
> low-level.
>
>> So now I'm not sure anything needs to change. At least dwarf_formsdata
>> should stay as-is for gcc.
>
> Are you sure? I think your original analysis is correct that
> dwarf_formsdata () is wrong and really should sign-extend.
No, see above; GCC wrote signed index 199 into a data1:199, so it would
be wrong to sign-extend this. We're at the mercy of the producer.
I also tried dragonegg to see what LLVM does to my fortran:
[ b] compile_unit
producer (strp) "4.8.2 20131212 (Red Hat 4.8.2-7)"
language (data2) C89 (1)
name (strp) "foo.f90"
(note: wrong language!)
[ a7] base_type
name (strp) "int"
byte_size (data1) 4
encoding (data1) signed (5)
[ ae] array_type
type (ref4) [ a0]
[ b3] subrange_type
type (ref4) [ a7]
upper_bound (data1) 188
[ ba] array_type
type (ref4) [ a0]
[ bf] subrange_type
type (ref4) [ a7]
upper_bound (data1) 9
So even though the subrange_type is signed, it 0-indexed both arrays
(like C89). :/ But still, that data1:188 should not be sign-extended to
have its proper meaning. (my original 10:199 became 0:188)
>> We could conceivably use dwarf_formudata for
>> DW_AT_decl_file/line/column, since those really are specified unsigned,
>> but this is unlikely to ever make a difference. The values for
>> dwarf_decl_line/column are asserted 0..INT_MAX, and people with more
>> than INT64_MAX files are already insane.
>
> You are right. Still using dwarf_formudata () would be more correct
> IMHO.
Yeah, udata is technically correct, which is the best kind of correct.
:)
I'll put together a patch.
Josh