This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function
- From: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>
- To: Andrew Pinski <pinskia at gmail dot com>
- Cc: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>, "siddhesh at sourceware dot org" <siddhesh at sourceware dot org>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>, nd <nd at arm dot com>, Szabolcs Nagy <Szabolcs dot Nagy at arm dot com>
- Date: Wed, 11 Oct 2017 18:47:16 -0300
- Subject: Re: [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function
- Authentication-results: sourceware.org; auth=none
- References: <DB6PR0801MB20531D1099A99E5A4DF1E042834A0@DB6PR0801MB2053.eurprd08.prod.outlook.com> <ac874303-6330-1a84-6970-7768ea408e96@linaro.org> <CA+=Sn1kkFgQE1yKq6vODmjkbsqao01giVSr3fbYvYErPFfC-Sg@mail.gmail.com>
On 11/10/2017 18:44, Andrew Pinski wrote:
> On Wed, Oct 11, 2017 at 2:39 PM, Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>>
>>
>> On 11/10/2017 18:20, Wilco Dijkstra wrote:
>>> Adhemerval Zanella wrote:
>>>> LGTM, regarding the point brought up by Szabolcs:
>>>>
>>>> - I still prefer to have the macros in one file instead of multiple ones.
>>>> It makes checking all the possible code paths easier, and one could
>>>> theoretically build the desired memset by just defining the required
>>>> macros instead of pulling source code from different files.
>>>
>>> At the design level, I don't think that's a good idea. It's essential that this
>>> code is easy to read and understand. It must be 100% correct and is
>>> extremely performance sensitive to minor changes (e.g. instruction order
>>> or alignment can have a major impact). Inline macros and lots of complex
>>> #ifdefs in the code mean getting this right becomes almost impossible.
>>> I really don't want to create large blobs of unmaintainable assembly code
>>> like several other targets have.
>>
>> My idea is to avoid having different code paths in different files that
>> are 'injected' by macros or ifdefs. Getting all of them in one place is
>> better, imho, than spreading them over multiple files. Now, what you are
>> advocating is a different topic: whether the modifications to the generic
>> implementation are valuable.
>>
>>>
>>> Note we also should benchmark this on various other cores to see what
>>> the impact is. I wrote this memset code for specific reasons - changing that
>>> could have a big impact on some cores, so we need to show this doesn't
>>> cause regressions.
>>
>> Ideally I would prefer to have a more concise selection, such as:
>>
>> 1. A memset that reads dczid_el0 using mrs (as the default one).
>> This will be selected as the default and thus should not incur any
>> regression.
>>
>> 2. A memset that reads the ZVA size from a global variable and is selected
>> solely by Falkor (ideally it would handle both 64- and 128-byte cacheline
>> sizes).
>>
>> 3. A no-ZVA variant for the case where the ZVA size is 0.
>>
>>>
>>> Also we need to decide on how to read the ZVA size. I don't think there is
>>> any point in supporting that many options (reading it, using a global, using
>>> an ifunc and not using ZVA at all). The very first memset had a few of these
>>> options, and I removed them precisely because it created way too many
>>> permutations to test and benchmark. So if we add ifuncs, then we shouldn't
>>> need to use a global too. For the dynamic linker we could just use a basic
>>> memset without any ifuncs (so then it becomes 2 variants, ifunc using ZVA,
>>> and no ZVA).
>>
>> Siddhesh can help me out, but the idea is *precisely* that Falkor is faster
>> using a global variable than accessing dczid_el0, as shown by its
>> benchmarks. Now, the discussion is indeed whether this microarchitecture
>> difference is worth adding ifunc variants for.
>
> For at least the micro-archs I work with, reading dczid_el0 can and
> will most likely be faster than reading from global memory,
> especially if the global is not in the L1 cache. This is one
> case where a micro-benchmark can fall down. If the global memory is
> in the L1 cache the read is 3 cycles while reading from dczid_el0 is 4
> cycles, but once it is out of the L1 cache, reading becomes 10x worse,
> plus it pollutes the L1 cache.
My understanding is that this is a Falkor idiosyncrasy.