This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [RFC][BZ #17943] Use long for int_fast8_t


On 09/02/15 13:36, Maciej W. Rozycki wrote:
> On Sun, 8 Feb 2015, Ondřej Bílka wrote:
> 
>> Hi, as in the bugzilla entry: what is the rationale for using char as
>> int_fast8_t?
>>
>> It is definitely slower with division; the following code is 25% slower
>> on Haswell with char than with long.
> 
>  It may boil down to the choice of instructions produced by the 
> compiler.  I can hardly imagine 8-bit division being slower than 64-bit 
> division on a processor that implements subword integer arithmetic.
> 
>> There is also the question of other architectures and atomic operations:
>> are byte operations better than int ones?
>>
>> int main ()
>> {
>>   int i;
>>   char x = 32;
>>   for (i=0; i<1000000000; i++)
>>     x = 11 * x + 5 + x / 3;
>>   return x;
>> }
> 
>  On Intel Atom, for example, division latencies are as follows[1]:
> 
>               latency   throughput
> IDIV r/m8        33         32
> IDIV r/m16       42         41
> IDIV r/m32       57         56
> IDIV r/m64      197        196
> 
> I'd expect a similar ratio of elapsed times across the corresponding data 
> widths for a manual division algorithm used on processors that have no 
> hardware divider.  There is no latency difference between individual 
> data widths AFAICT for multiplication or general ALU operations.
> 
>  For processors whose hardware divider implements word calculation only, 
> I'd expect either a constant latency or, again, an operation time that 
> decreases with the actual width of significant data contained in the 
> operands.
> 
>  For example the M14Kc MIPS32 processor has an RTL configuration option to 
> include either an area-efficient or a high-performance MDU 
> (multiply/divide unit).  The area-efficient MDU has a latency of 33 clocks 
> for unsigned division (signed division adds up to 2 clocks for sign 
> reversal).  The high-performance MDU reduces the latency as follows[2]:
> 
> "Divide operations are implemented with a simple 1-bit-per-clock iterative 
> algorithm.  An early-in detection checks the sign extension of the 
> dividend (rs) operand.  If rs is 8 bits wide, 23 iterations are skipped. 
> For a 16-bit-wide rs, 15 iterations are skipped, and for a 24-bit-wide rs, 
> 7 iterations are skipped.  Any attempt to issue a subsequent MDU 
> instruction while a divide is still active causes an IU pipeline stall 
> until the divide operation has completed."
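
The early-in behaviour quoted above can be sketched as a small model.  This
is only an illustration, not code from the datasheet: the function names
are hypothetical, and it assumes that "N bits wide" means the dividend is
representable as an N-bit sign-extended value.  Only the skipped-iteration
counts (23, 15, 7) come from the quoted text.

```c
#include <stdint.h>

/* Hypothetical helper: nonzero when v is representable as a
   sign-extended value of the given width in bits (1 <= bits <= 31).  */
int fits_signed (int32_t v, int bits)
{
  int32_t lo = -((int32_t) 1 << (bits - 1));
  int32_t hi = ((int32_t) 1 << (bits - 1)) - 1;
  return v >= lo && v <= hi;
}

/* Number of divide iterations skipped for dividend rs, using the
   figures from the quoted datasheet passage.  The interpretation of
   "N bits wide" is an assumption of this sketch.  */
int mdu_skipped_iterations (int32_t rs)
{
  if (fits_signed (rs, 8))
    return 23;
  if (fits_signed (rs, 16))
    return 15;
  if (fits_signed (rs, 24))
    return 7;
  return 0;
}
```

Under this model the saving depends only on the value held in the register,
not on the C type it was declared with.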
> 
> As this happens automatically, there is no benefit from using a narrower 
> data type on such a processor, and the lack of subword arithmetic 
> operations means that using such a type will require a truncation 
> operation from time to time for multiplication or general ALU operations.
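
A minimal sketch of the truncation mentioned above, using the recurrence
from the benchmark loop.  The function names are hypothetical, and unsigned
types are an assumption made here to keep the arithmetic well defined (the
original loop uses plain char).

```c
#include <stdint.h>

/* One step of the benchmark recurrence on an 8-bit value.  The operands
   are promoted to int, and the conversion back to uint8_t truncates the
   result to 8 bits.  */
uint8_t step_narrow (uint8_t x)
{
  return (uint8_t) (11 * x + 5 + x / 3);
}

/* The same step with the value carried in a full-width variable.  The
   explicit mask is the truncation a word-only machine has to perform;
   x is assumed to be in the range 0..255 on entry.  */
unsigned long step_wide (unsigned long x)
{
  return (11 * x + 5 + x / 3) & 0xFFul;
}
```

The mask cannot be postponed to the end of the loop here, because x / 3
depends on the upper bits of x, so an unmasked wide value would diverge
from the 8-bit one after the first iteration.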
> 
>> --- a/sysdeps/generic/stdint.h
>> +++ b/sysdeps/generic/stdint.h
>> @@ -87,12 +87,13 @@ typedef unsigned long long int	uint_least64_t;
>>  /* Fast types.  */
>>  
>>  /* Signed.  */
>> -typedef signed char		int_fast8_t;
>>  #if __WORDSIZE == 64
>> +typedef long int		int_fast8_t;
>>  typedef long int		int_fast16_t;
>>  typedef long int		int_fast32_t;
>>  typedef long int		int_fast64_t;
>>  #else
>> +typedef int			int_fast8_t;
>>  typedef int			int_fast16_t;
>>  typedef int			int_fast32_t;
>>  __extension__

On AArch64 there's nothing to be gained in terms of performance from
using a 64-bit type over a 32-bit type when both can hold the required
range of values.  In fact, it's likely to make things slower, since
multiply and divide operations will most likely take longer.

So on AArch64 int_fast8_t, int_fast16_t and int_fast32_t should all map
to int, not long.

R.
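
To see which widths a given platform's <stdint.h> actually picked, a sketch
along the following lines can be used.  The function names are hypothetical;
the only guarantee assumed is the one ISO C gives, namely that each fast
type is at least as wide as its name says.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Width in bits (including any padding bits) of each signed fast type
   as defined by the platform's <stdint.h>.  */
size_t fast8_bits (void)
{
  return sizeof (int_fast8_t) * CHAR_BIT;
}

size_t fast16_bits (void)
{
  return sizeof (int_fast16_t) * CHAR_BIT;
}

size_t fast32_bits (void)
{
  return sizeof (int_fast32_t) * CHAR_BIT;
}
```

Printing the three results from a small main () on each target of interest
shows whether the header maps these types to char, int or long there.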

> 
>  So I find the choice of types above already questionable for a generic 
> header.  By default I'd expect fast data types to have the same width as 
> their fixed-width counterparts: the large benefit this provides on most 
> architectures that do implement subword arithmetic outweighs the small 
> loss it will likely incur on architectures that only implement word 
> arithmetic.  Individual ports could then override the defaults as they 
> see fit.
> 
>  At this point, however, I believe the discussion is moot: there will 
> already be relocatable objects out there with data embedded using these 
> types, so the ABI has been set and I don't see a way of changing it 
> without breaking binary compatibility.
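
The ABI concern above can be illustrated with a sketch.  Both structures
are hypothetical: rec_old stands for a record compiled while int_fast8_t
was signed char, rec_new for the same record after the proposed change to
long on a 64-bit target.

```c
#include <stddef.h>

/* Hypothetical record as laid out with int_fast8_t == signed char.  */
struct rec_old
{
  signed char flag;
  int value;
};

/* The same record as laid out with int_fast8_t == long.  */
struct rec_new
{
  long flag;
  int value;
};

/* Offset of the second member under each definition.  */
size_t old_value_offset (void)
{
  return offsetof (struct rec_old, value);
}

size_t new_value_offset (void)
{
  return offsetof (struct rec_new, value);
}
```

On a typical LP64 target one would expect the offset of value to move from
4 to 8, so objects built against the two definitions would disagree about
where the member lives.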
> 
>  References:
> 
> [1] "Intel 64 and IA-32 Architectures Optimization Reference Manual", 
>     Intel Corporation, Order Number: 248966-020, November 2009, Table 12-2 
>     "Intel Atom Microarchitecture Instructions Latency Data", p. 12-21
> 
> [2] "MIPS32 M14Kc Processor Core Datasheet", Revision 01.00, MIPS 
>     Technologies, Inc., Document Number: MD00672, November 2, 2009, 
>     Subsection "High-Performance MDU", p. 6
> 
>   Maciej
> 

