This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PING][PATCHv3 1/2] aarch64: Hoist ZVA check out of the memset function


Adhemerval Zanella wrote:
> LGTM. Regarding the points raised by Szabolcs:
>
>  - I still prefer to have the macros in one file instead of multiple ones.
>    It makes checking all the possible code paths easier, and one could
>    theoretically build the desired memset by just defining the required
>    macros instead of pulling source code from different files.

At the design level, I don't think that's a good idea. It's essential that this
code is easy to read and understand. It must be 100% correct and is
extremely performance sensitive to minor changes (e.g. instruction order
or alignment can have a major impact). Inline macros and lots of complex
#ifdefs in the code make getting this right almost impossible.
I really don't want to create large blobs of unmaintainable assembly code
like several other targets have.

Note we should also benchmark this on various other cores to see what
the impact is. I wrote this memset code for specific reasons - changing it
could have a big impact on some cores, so we need to show this doesn't
cause regressions.

Also we need to decide on how to read the ZVA size. I don't think there is
any point in supporting that many options (reading it directly, using a global,
using an ifunc, and not using ZVA at all). The very first memset had a few of
these options, and I removed them precisely because they created way too
many permutations to test and benchmark. So if we add ifuncs, then we
shouldn't need a global as well. For the dynamic linker we could just use a
basic memset without any ifuncs (so it comes down to two variants: an ifunc
version using ZVA, and one without ZVA).
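
For reference, a minimal C sketch of how the ZVA block size can be read from
DCZID_EL0 and used to pick between two variants with an ifunc. The variant
names (__memset_zva, __memset_nozva) and the resolver are purely illustrative,
not the symbols used in the patch:

#include <stddef.h>
#include <stdint.h>

/* DCZID_EL0: bit 4 (DZP) set means DC ZVA is prohibited; bits [3:0]
   hold log2 of the ZVA block size in words (4 bytes each).  */
static inline size_t
zva_size (void)
{
  uint64_t dczid;
  __asm__ ("mrs %0, dczid_el0" : "=r" (dczid));
  if (dczid & 16)               /* DZP: DC ZVA not allowed.  */
    return 0;
  return (size_t) 4 << (dczid & 15);
}

/* Hypothetical variants; the real ones would be written in assembly.  */
extern void *__memset_zva (void *, int, size_t);
extern void *__memset_nozva (void *, int, size_t);

typedef void *(*memset_t) (void *, int, size_t);

/* The resolver runs once at relocation time, so no global is needed,
   and the dynamic linker itself can keep using a plain non-ifunc
   memset that never uses ZVA.  */
static memset_t
memset_resolver (void)
{
  return zva_size () != 0 ? __memset_zva : __memset_nozva;
}

void *memset (void *, int, size_t)
  __attribute__ ((ifunc ("memset_resolver")));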

Finally we'll need to look in more detail at the new memcpy benchmarks -
while looping through memory seems like a good idea, it appears to increment
by only 1. So at first sight it's still testing exactly the same thing as the
existing benchmarks: all data is always cached in L1. For memset I guess
we're still missing a randomized benchmark based on real trace data.
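
On the increment-by-1 point above, a toy loop to make the concern concrete
(buffer and copy sizes here are arbitrary, not what the benchtests use):
advancing by 1 means successive copies overlap in all but one byte, so nearly
every access hits data that is already resident, while striding by the copy
length walks the whole buffer and lets cache misses show up in the measurement.

#include <stdlib.h>
#include <string.h>

#define BUF_SIZE (8u * 1024 * 1024)   /* larger than a typical L1 D-cache */
#define COPY_LEN 256u                 /* arbitrary copy size */

int
main (void)
{
  char *src = malloc (BUF_SIZE);
  char *dst = malloc (BUF_SIZE);
  if (src == NULL || dst == NULL)
    return 1;
  memset (src, 1, BUF_SIZE);

  /* Incrementing by 1: each copy reuses almost all of the previous
     copy's data, so the measurement mostly sees L1 hits.  */
  for (size_t i = 0; i + COPY_LEN <= BUF_SIZE; i++)
    memcpy (dst, src + i, COPY_LEN);

  /* Striding by COPY_LEN: each iteration touches new cache lines,
     so the larger working set (and its misses) affects the result.  */
  for (size_t i = 0; i + COPY_LEN <= BUF_SIZE; i += COPY_LEN)
    memcpy (dst + i, src + i, COPY_LEN);

  free (src);
  free (dst);
  return 0;
}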

Wilco


