This is the mail archive of the
libc-help@sourceware.org
mailing list for the glibc project.
Re: CPU dispatching in libc
Ryan S. Arnold wrote:
>Agner wrote:
>> Does such CPU dispatching exist in libc? How does it work? It should
>>be possible to compile a static binary on a system with SSE-whatever,
>>and run it on a system with SSE-something-else. Therefore, I want the
>>CPU-dispatching to be inside libc.
>We (IBM) had discussions with AMD and Intel at the 2007 GCC Summit where
>they indicated that they were interested in dynamic runtime checks for
>hardware capability which would route the application to the correct CPU
>optimized function implementation while the application was running by
>using a first-time-called hwcap check.
>The 'first-time-called' hwcap check would work by having a wrapper
>function check to see if it had an internal function pointer set for an
>optimized version of the function. If not, then it'd check the hwcap
>for the specific platform information, find the correct function pointer
>and set it. Subsequent calls wouldn't pay this resolution
>penalty. I'm not sure if they made any progress on this. H.J. Lu at
>Intel would probably be able to tell you.
The framework for CPU dispatching must be in place before any progress
can be made. So this is the reason why the memory and string functions
are so slow in libc. What are you doing with math functions? Most other
libraries use SSE2 for math functions if available. I can't find the
math functions in libc, so I don't know what you are doing here.
>You should contact H.J. Lu (via email and CC this mailing list) and ask
>him if they made any progress with their 'first-time-called'
>optimization checks idea.
I have CC'ed this mail to him.
If CPU dispatching is not implemented yet, here is my proposal for an
efficient mechanism:
The function entry has JMP POINTER, where POINTER is a pointer stored in
the data segment.
POINTER initially points to a dispatcher. The dispatcher calls a
function WhichInstructionSetDoIHave. According to the value returned, it
changes POINTER to point to the optimal version of the code and then
jumps to [POINTER]. The next time the function is called, it goes
through POINTER directly to the optimal version. The cost of dispatching
is then just a single instruction, except on the first call. (A 32-bit
position-independent version first needs to call a thunk to get a
reference address into ecx.)
The code for the most probable path should come immediately after
JMP POINTER.
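
The pointer scheme above can be sketched in C, with a function pointer
standing in for JMP [POINTER]. This is only an illustration under stated
assumptions: the names my_copy, copy_pointer, copy_generic and copy_sse2
and the ISET_* numbering are hypothetical, the "SSE2" version is a
placeholder that just calls memcpy, and which_instruction_set is a stub
that always reports the generic level.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical instruction-set levels; the numbering is an assumption. */
enum { ISET_UNKNOWN = 0, ISET_GENERIC = 1, ISET_SSE2 = 2 };

/* Stub for WhichInstructionSetDoIHave. A real version would use CPUID;
   this one always reports the generic level. */
static int which_instruction_set(void) { return ISET_GENERIC; }

/* Plain byte-by-byte version. */
static void *copy_generic(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--) *d++ = *s++;
    return dst;
}

/* Placeholder for an SSE2-optimized version (here it just calls memcpy). */
static void *copy_sse2(void *dst, const void *src, size_t n) {
    return memcpy(dst, src, n);
}

static void *copy_dispatch(void *dst, const void *src, size_t n);

/* POINTER: lives in the data segment and initially points to the dispatcher. */
static void *(*copy_pointer)(void *, const void *, size_t) = copy_dispatch;

/* Dispatcher: runs on the first call only, then rewrites POINTER. */
static void *copy_dispatch(void *dst, const void *src, size_t n) {
    int level = which_instruction_set();
    assert(level != ISET_UNKNOWN);
    copy_pointer = (level >= ISET_SSE2) ? copy_sse2 : copy_generic;
    return copy_pointer(dst, src, n);   /* "jump to [POINTER]" */
}

/* Function entry: one indirect call through POINTER. */
void *my_copy(void *dst, const void *src, size_t n) {
    return copy_pointer(dst, src, n);
}
```

After the first call to my_copy, copy_pointer no longer refers to the
dispatcher, so later calls go straight to the selected version. The
sketch does not address concurrent first calls from several threads.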
The WhichInstructionSetDoIHave function reads its value from a variable
CurrentInstructionSet in the data segment. This variable is initially
zero, indicating that it must use CPUID etc. to determine the
instruction set. It is possible to detect whether XMM registers are
enabled by using the FXSAVE/FXRSTOR instructions rather than asking the
operating system or catching an exception. This will make it easier to
port libc to different operating systems.
For testing purposes, it should be possible to change the value of
CurrentInstructionSet. Set it to a lower value to test older versions,
or to a higher value to test newer versions if you have an emulator for
that instruction set.
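
A minimal sketch of the cached detection and the test override, assuming
GCC or Clang on x86 (it uses __get_cpuid from their <cpuid.h> header and
falls back to a generic level elsewhere). The ISET_* numbering and the
name set_instruction_set_for_testing are hypothetical. The check that
the operating system has enabled the XMM registers (the FXSAVE/FXRSTOR
test described above) is left out here.

```c
#include <assert.h>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>   /* GCC/Clang header providing __get_cpuid */
#endif

/* Hypothetical levels; 0 means "not yet determined". */
enum { ISET_UNKNOWN = 0, ISET_GENERIC = 1, ISET_SSE = 2, ISET_SSE2 = 3 };

/* CurrentInstructionSet: initially zero, filled in on first use. */
static int current_instruction_set = ISET_UNKNOWN;

/* Ask CPUID which instruction set is available. OS support for the
   XMM registers is NOT checked in this sketch. */
static int detect_instruction_set(void) {
    int level = ISET_GENERIC;
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        if (edx & (1u << 25)) {            /* CPUID.1:EDX bit 25 = SSE */
            level = ISET_SSE;
            if (edx & (1u << 26))          /* CPUID.1:EDX bit 26 = SSE2 */
                level = ISET_SSE2;
        }
    }
#endif
    return level;
}

/* WhichInstructionSetDoIHave: detect once, then return the cached value. */
int which_instruction_set(void) {
    if (current_instruction_set == ISET_UNKNOWN)
        current_instruction_set = detect_instruction_set();
    return current_instruction_set;
}

/* Test hook: force a level, or pass ISET_UNKNOWN to re-run detection. */
void set_instruction_set_for_testing(int level) {
    assert(level >= ISET_UNKNOWN);
    current_instruction_set = level;
}
```

Calling set_instruction_set_for_testing(ISET_GENERIC) before the first
dispatched call would make a dispatcher pick the oldest code path even
on a newer CPU. Function pointers that were already resolved would have
to be reset separately for the override to take effect.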