This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function



On 11/10/2017 18:44, Andrew Pinski wrote:
> On Wed, Oct 11, 2017 at 2:39 PM, Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>>
>>
>> On 11/10/2017 18:20, Wilco Dijkstra wrote:
>>> Adhemerval Zanella wrote:
>>>> LGTM, for the pointer brought by Szabolcs:
>>>>
>>>>  - I still prefer to have the macros in one file instead of multiple ones.
>>>>    It makes checking all the possible code paths easier, and one could
>>>>    theoretically build the desired memset by just defining the required
>>>>    macros instead of pulling source code from different files.
>>>
>>> At the design level, I don't think that's a good idea. It's essential that this
>>> code is easy to read and understand. It must be 100% correct and is
>>> extremely performance sensitive to minor changes (eg. instruction order
>>> or alignment can have major impact). Inline macros and lots of complex
>>> #ifdefs in the code mean getting this right becomes almost impossible.
>>> I really don't want to create large blobs of unmaintainable assembly code
>>> like several other targets.
>>
>> My idea is to avoid having different code paths in different files that
>> are 'injected' by macros or ifdefs.  Getting all of them in one place is
>> better imho than spreading them over multiple files.  Now, what you are
>> advocating is a different topic: whether the modifications to the generic
>> implementation are valuable.
>>
>>>
>>> Note we also should benchmark this on various other cores to see what
>>> the impact is. I wrote this memset code for specific reasons - changing that
>>> could have a big impact on some cores, so we need to show this doesn't
>>> cause regressions.
>>
>> Ideally I would prefer a more concise selection, such as:
>>
>>    1. A memset that reads dczid_el0 using mrs (as the default one).
>>       This will be selected as the default and thus should not incur any
>>       regression.
>>
>>    2. A memset that reads the ZVA size from a global variable and is selected
>>       solely for falkor (ideally it would handle both 64- and 128-byte
>>       cacheline sizes).
>>
>>    3. A no-ZVA variant for the case where the ZVA size is 0.
>>
>>>
>>> Also we need to decide on how to read the ZVA size. I don't think there is
>>> any point in supporting that many options (reading it, using a global, using
>>> an ifunc and not using ZVA at all). The very first memset had a few of these
>>> options, and I removed them precisely because it created way too many
>>> permutations to test and benchmark. So if we add ifuncs, then we shouldn't
>>> need to use a global too. For the dynamic linker we could just use a basic
>>> memset without any ifuncs (so then it becomes 2 variants, ifunc using ZVA,
>>> and no ZVA).
>>
>> Siddhesh can help me out, but the idea is *precisely* that Falkor is faster
>> using a global variable than accessing dczid_el0, and this is shown by its
>> benchmarks.  Now, the discussion is indeed whether this microarchitecture
>> difference is worth adding ifunc variants for.
> 
> For at least the micro-archs I work with, reading dczid_el0 can and
> will most likely be faster than reading from global memory.
> Especially if the global memory is not in the L1 cache.  This is one
> case where a micro-benchmark can fall down.  If the global memory is
> in L1 cache the read is 3 cycles while reading from dczid_el0 is 4
> cycles, but once it is out of L1 cache, reading becomes 10x worse plus
> it pollutes the L1 cache.

My understanding is that this is a Falkor idiosyncrasy.

