This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH 6/7] stdlib: Optimization qsort{_r} swap implementation



On 22/01/2018 06:27, Paul Eggert wrote:
> Adhemerval Zanella wrote:
>> +static inline bool
>> +check_alignment (const void *base, size_t align)
>> +{
>> +  return _STRING_ARCH_unaligned || ((uintptr_t)base % (align - 1)) == 0;
>> +}
> 
> Surely the '(align - 1)' was supposed to be 'align'. Has this been tested on an architecture that does not allow unaligned access?

Yes, I checked on a sparc64 machine.  This check is similar to the one the
Linux kernel uses in lib/sort.c.
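
For comparison, the kernel's check masks with 'align - 1' instead of taking a
modulo; an equivalent form here would be roughly the sketch below (it assumes
'align' is a power of two and that _STRING_ARCH_unaligned and the usual glibc
headers are available):

  static inline bool
  check_alignment (const void *base, size_t align)
  {
    /* 'align' is assumed to be a power of two; masking with 'align - 1'
       keeps only the low bits, which must all be zero for an aligned
       pointer.  */
    return _STRING_ARCH_unaligned || ((uintptr_t) base & (align - 1)) == 0;
  }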

> 
>> +static inline void
>> +swap_generic (void *a, void *b, size_t size)
> 
> Why is this inline? It's used only as a function pointer, and the other functions so used are not declared inline.

It should not be; I fixed it locally.

> 
>> +static inline swap_t
>> +select_swap_func (const void *base, size_t size)
>> +{
>> +  if (size == 4 && check_alignment (base, 4))
>> +    return swap_u32;
>> +  else if (size == 8 && check_alignment (base, 8))
>> +    return swap_u64;
>> +  return swap_generic;
>> +}
> 
> The conditions aren't portable enough. Use something like this instead for swap_u32, and similarly for swap_u64.
> 
>   if (size == sizeof (uint32_t) && check_alignment (base, alignof (uint32_t)))
>     return swap_u32;

Ack, fixed locally.
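
With that applied to both cases, the selection ends up roughly as below (a
sketch only; it assumes <stdalign.h> is included for alignof and that swap_t
is the patch's swap function pointer type):

  static inline swap_t
  select_swap_func (const void *base, size_t size)
  {
    if (size == sizeof (uint32_t) && check_alignment (base, alignof (uint32_t)))
      return swap_u32;
    else if (size == sizeof (uint64_t) && check_alignment (base, alignof (uint64_t)))
      return swap_u64;
    return swap_generic;
  }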

> 
>> +static void
>> +swap_u32 (void *a, void *b, size_t size)
> 
> The pointer arguments should be declared 'void *restrict'. This can help GCC generate better code. Similarly for the other swap functions.
> 
>> +  uint32_t tmp = *(uint32_t*) a;
>> +  *(uint32_t*) a = *(uint32_t*) b;
>> +  *(uint32_t*) b = tmp;
> 
> It's nicer to avoid casts when possible, as is the case here and elsewhere. This is because casts are too powerful in C. Something like this, say:
> 
>     uint32_t *ua = a, *ub = b, tmp = *ua;
>     *ua = *ub, *ub = tmp;

Right, I changed it to your suggestion.
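
Combined with the earlier 'restrict' suggestion, the fixed-size swap then
reads roughly as follows (the size parameter is unused and kept only so the
signature matches swap_t; swap_u64 is the same with uint64_t):

  static void
  swap_u32 (void *restrict a, void *restrict b, size_t size)
  {
    uint32_t *ua = a, *ub = b, tmp = *ua;
    *ua = *ub, *ub = tmp;
  }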

> 
>> +  unsigned char tmp[128];
> 
> Why 128? A comment seems called for.

It is indeed an arbitrary value, chosen based on real-world usage so that it
covers everything gcc and firefox need (the largest key size gcc uses is 56
bytes, and firefox's is 40).  I will add a comment about it.
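
Something along these lines, for instance (the exact wording is only a
suggestion):

  /* Temporary buffer used by the generic swap.  128 bytes is an arbitrary
     upper bound meant to cover the key sizes seen in practice (the largest
     gcc uses is 56 bytes and firefox uses 40), so those cases are handled
     in a single iteration.  */
  unsigned char tmp[128];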

> 
>> +static inline void
>> +swap_generic (void *a, void *b, size_t size)
>> +{
>> +  unsigned char tmp[128];
>> +  do
>> +    {
>> +      size_t s = size > sizeof (tmp) ? sizeof (tmp) : size;
>> +      memcpy (tmp, a, s);
>> +      a = __mempcpy (a, b, s);
>> +      b = __mempcpy (b, tmp, s);
>> +      size -= s;
>> +    }
>> +  while (size > 0);
>> +}
> 
> On my platform (GCC 7.2.1 20170915 (Red Hat 7.2.1-2) x86-64) this inlined the memcpy but not the mempcpy calls. How about something like this instead? It should let the compiler do a better job of block-move-style operations in the loop. If mempcpy is inlined for you, feel free to substitute it for two of the loop's calls to memcpy.
> 
>   static void
>   swap_generic (void *restrict va, void *restrict vb, size_t size)
>   {
>     char *a = va, *b = vb;
>     enum { n = 128 }; /* Why 128?  */
>     unsigned char tmp[n];
>     while (size >= n)
>       {
>     memcpy (tmp, a, n);
>     memcpy (a, b, n);
>     memcpy (b, tmp, n);
>     a += n;
>     b += n;
>     size -= n;
>       }
>     memcpy (tmp, a, size);
>     memcpy (a, b, size);
>     memcpy (b, tmp, size);
>   }

Because for this specific code inlining is not always a gain: it depends on
1. the minimum ISA the compiler targets when generating the inlined variants
and 2. which ifunc variants glibc provides for mempcpy (and whether they are
exposed to internal calls).  On recent x86_64, for instance, the mempcpy call
resolves to __memmove_avx_unaligned_erms, which is faster than the default
SSE2 inlined variants.  Your suggestion in fact turns out to be slower on my
machine (i7-4790K); base is the current approach and diff is the relative
change in percent:

Results for member size 32
  Sorted
  nmemb   |      base |   patched | diff
        32|      1184 |      1268 | 7.09
      4096|    325865 |    333332 | 2.29
     32768|   3331750 |   3431695 | 3.00
    524288|  69067176 |  68805735 | -0.38

  Repeated
  nmemb   |      base |   patched | diff
        32|      4813 |      5779 | 20.07
      4096|   1624137 |   1972045 | 21.42
     32768|  15896739 |  19705289 | 23.96
    524288| 316328778 | 393942797 | 24.54

  MostlySorted
  nmemb   |      base |   patched | diff
        32|      5332 |      6198 | 16.24
      4096|   1312703 |   1563919 | 19.14
     32768|  12360726 |  14990070 | 21.27
    524288| 231603294 | 283228681 | 22.29

  Unsorted
  nmemb   |      base |   patched | diff
        32|      6047 |      7115 | 17.66
      4096|   1695241 |   2010943 | 18.62
     32768|  16430388 |  19636166 | 19.51
    524288| 329496913 | 395355847 | 19.99

In fact, if I use -fno-builtin to force the memcpy calls to go through the
ifunc, I get another speedup (and it could be faster still, on x86_64 at
least, if glibc were built with -mavx):

Results for member size 32
  Sorted
  nmemb   |      base |   patched | diff
        32|      1184 |      1240 | 4.73
      4096|    325865 |    326596 | 0.22
     32768|   3331750 |   3613807 | 8.47
    524288|  69067176 |  74352201 | 7.65

  Repeated
  nmemb   |      base |   patched | diff
        32|      4813 |      4133 | -14.13
      4096|   1624137 |   1707452 | 5.13
     32768|  15896739 |  13999315 | -11.94
    524288| 316328778 | 280461810 | -11.34

  MostlySorted
  nmemb   |      base |   patched | diff
        32|      5332 |      4681 | -12.21
      4096|   1312703 |   1226684 | -6.55
     32768|  12360726 |  11362772 | -8.07
    524288| 231603294 | 212250739 | -8.36

  Unsorted
  nmemb   |      base |   patched | diff
        32|      6047 |      6676 | 10.40
      4096|   1695241 |   1492257 | -11.97
     32768|  16430388 |  14799600 | -9.93
    524288| 329496913 | 303681410 | -7.83

It might be that your approach is faster on other architectures which do not
have an ifunc mempcpy, but I do not want to over-engineer this code, since
most real-world usage corresponds to key sizes of 4 and 8.

