2021-10-26 02:11:57 -07:00

Using XSTATE features in user space applications
================================================

The x86 architecture supports floating-point extensions which are
enumerated via CPUID. Applications consult CPUID and use XGETBV to
evaluate which features have been enabled by the kernel in XCR0.

Up to AVX-512 and PKRU states, these features are automatically enabled by
the kernel if available. Features like AMX TILE_DATA (XSTATE component 18)
are enabled by XCR0 as well, but the first use of a related instruction is
trapped by the kernel because by default the required large XSTATE buffers
are not allocated automatically.

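The XCR0 check described above can be sketched as follows. This is a
minimal illustration, not code taken from the kernel tree: the helper names
read_xcr0() and xcr0_has_component() are hypothetical, and the sketch
assumes a GCC or Clang toolchain that provides <cpuid.h> on x86.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

/*
 * Hypothetical helper: read XCR0 with XGETBV. Returns false when the OS
 * has not enabled XSAVE, in which case XGETBV must not be executed.
 */
static inline bool read_xcr0(uint64_t *xcr0)
{
	unsigned int eax, ebx, ecx, edx;

	/* CPUID.1:ECX.OSXSAVE (bit 27) tells whether XGETBV is usable. */
	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
		return false;

	/* XGETBV with ECX = 0 returns XCR0 in EDX:EAX. */
	__asm__ volatile("xgetbv" : "=a"(eax), "=d"(edx) : "c"(0));
	*xcr0 = ((uint64_t)edx << 32) | eax;
	return true;
}
#endif

/* Hypothetical helper: test whether XSTATE component 'nr' is set. */
static inline bool xcr0_has_component(uint64_t xcr0, unsigned int nr)
{
	return nr < 64 && ((xcr0 >> nr) & 1);
}
```

A caller would pass the value from read_xcr0() to xcr0_has_component()
with component number 18 to see whether TILE_DATA is enabled in XCR0. A set
bit does not imply that the process already has permission to use the
feature.
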
The purpose for dynamic features
--------------------------------

Legacy userspace libraries often have hard-coded, static sizes for
alternate signal stacks, often using MINSIGSTKSZ which is typically 2KB.
That stack must be able to store at *least* the signal frame that the
kernel sets up before jumping into the signal handler. That signal frame
must include an XSAVE buffer defined by the CPU.

However, that means that the size of signal stacks is dynamic, not static,
because different CPUs have differently-sized XSAVE buffers. A compiled-in
size of 2KB with existing applications is too small for new CPU features
like AMX. Instead of universally requiring a larger stack, dynamic
enabling lets the kernel require userspace applications to have
properly-sized altstacks.

Using dynamically enabled XSTATE features in user space applications
--------------------------------------------------------------------

The kernel provides an arch_prctl(2) based mechanism for applications to
request the usage of such features. The arch_prctl(2) options related to
this are:

-ARCH_GET_XCOMP_SUPP

 arch_prctl(ARCH_GET_XCOMP_SUPP, &features);

 ARCH_GET_XCOMP_SUPP stores the supported features in userspace storage of
 type uint64_t. The second argument is a pointer to that storage.

-ARCH_GET_XCOMP_PERM

 arch_prctl(ARCH_GET_XCOMP_PERM, &features);

 ARCH_GET_XCOMP_PERM stores the features for which the userspace process
 has permission in userspace storage of type uint64_t. The second argument
 is a pointer to that storage.

-ARCH_REQ_XCOMP_PERM

 arch_prctl(ARCH_REQ_XCOMP_PERM, feature_nr);

 ARCH_REQ_XCOMP_PERM allows an application to request permission for a
 dynamically enabled feature or a feature set. A feature set can be mapped
 to a facility, e.g. AMX, and can require one or more XSTATE components to
 be enabled.

 The feature argument is the number of the highest XSTATE component which
 is required for a facility to work.

When requesting permission for a feature, the kernel checks the
availability. The kernel ensures that sigaltstacks in the process's tasks
are large enough to accommodate the resulting large signal frame. It
enforces this both during ARCH_REQ_XCOMP_PERM and during any subsequent
sigaltstack(2) calls. If an installed sigaltstack is smaller than the
resulting sigframe size, ARCH_REQ_XCOMP_PERM results in -ENOSUPP. Also,
sigaltstack(2) results in -ENOMEM if the requested altstack is too small
for the permitted features.

Permission, when granted, is valid per process. Permissions are inherited
on fork(2) and cleared on exec(3).

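The options above can be combined so that permission is requested only
when it has not been granted yet, for example in a child that inherited
permission across fork(2). The following is a sketch under stated
assumptions: xcomp_prctl(), xcomp_permitted() and xcomp_ensure_permission()
are hypothetical helper names, and the option values are the ones from the
x86 UAPI header <asm/prctl.h>.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef ARCH_GET_XCOMP_PERM
#define ARCH_GET_XCOMP_PERM	0x1022
#endif
#ifndef ARCH_REQ_XCOMP_PERM
#define ARCH_REQ_XCOMP_PERM	0x1023
#endif

/* Hypothetical wrapper; fails with ENOSYS where arch_prctl(2) is absent. */
static inline long xcomp_prctl(int option, unsigned long arg)
{
#ifdef SYS_arch_prctl
	return syscall(SYS_arch_prctl, option, arg);
#else
	(void)option;
	(void)arg;
	errno = ENOSYS;
	return -1;
#endif
}

/* Returns 1 if 'feature_nr' is permitted, 0 if not, -1 on error. */
static inline int xcomp_permitted(unsigned int feature_nr)
{
	uint64_t perm = 0;

	if (feature_nr >= 64) {
		errno = EINVAL;
		return -1;
	}
	if (xcomp_prctl(ARCH_GET_XCOMP_PERM, (unsigned long)&perm))
		return -1;
	return (int)((perm >> feature_nr) & 1);
}

/* Requests permission only when it is not already held. Returns 0 or -1. */
static inline int xcomp_ensure_permission(unsigned int feature_nr)
{
	int permitted = xcomp_permitted(feature_nr);

	if (permitted < 0)
		return -1;
	if (permitted)
		return 0;
	return xcomp_prctl(ARCH_REQ_XCOMP_PERM, feature_nr) ? -1 : 0;
}
```

On failure the caller can inspect errno to tell an undersized sigaltstack
apart from a feature the system does not support.
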
The first use of an instruction related to a dynamically enabled feature is
trapped by the kernel. The trap handler checks whether the process has
permission to use the feature. If the process has no permission then the
kernel sends SIGILL to the application. If the process has permission then
the handler allocates a larger xstate buffer for the task so the large
state can be context switched. In the unlikely case that the allocation
fails, the kernel sends SIGSEGV.

x86/fpu: Optimize out sigframe xfeatures when in init state
tl;dr: AMX state is ~8k. Signal frames can have space for this
~8k and each signal entry writes out all 8k even if it is zeros.
Skip writing zeros for AMX to speed up signal delivery by about
4% overall when AMX is in its init state.
This is a user-visible change to the sigframe ABI.
== Hardware XSAVE Background ==
XSAVE state components may be tracked by the processor as being
in their initial configuration. Software can detect which
features are in this configuration by looking at the XSTATE_BV
field in an XSAVE buffer or with the XGETBV(1) instruction.
Both the XSAVE and XSAVEOPT instructions enumerate features as
being in the initial configuration via the XSTATE_BV field in the
XSAVE header. However, XSAVEOPT declines to actually write
features in their initial configuration to the buffer. XSAVE
writes the feature unconditionally, regardless of whether it is
in the initial configuration or not.
Basically, XSAVE users never need to inspect XSTATE_BV to
determine if the feature has been written to the buffer.
XSAVEOPT users *do* need to inspect XSTATE_BV. They might also
need to clear out the buffer if they want to make an isolated
change to the state, like modifying one register.
== Software Signal / XSAVE Background ==
Signal frames have historically been written with XSAVE itself.
Each state is written in its entirety, regardless of being in its
initial configuration.
In other words, the signal frame ABI uses the XSAVE behavior, not
the XSAVEOPT behavior.
== Problem ==
This means that any application which has acquired permission to
use AMX via ARCH_REQ_XCOMP_PERM will write 8k of state to the
signal frame. This 8k write will occur even when AMX was in its
initial configuration and software *knows* this because of
XSTATE_BV.
This problem also exists to a lesser degree with AVX-512 and its
2k of state. However, AVX-512 use does not require
ARCH_REQ_XCOMP_PERM and is more likely to have existing users
which would be impacted by any change in behavior.
== Solution ==
Stop writing out AMX xfeatures which are in their initial state
to the signal frame. This effectively makes the signal frame
XSAVE buffer look as if it were written with a combination of
XSAVEOPT and XSAVE behavior. Userspace which handles XSAVEOPT-
style buffers should be able to handle this naturally.
For now, include only the AMX xfeatures: XTILE and XTILEDATA in
this new behavior. These require new ABI to use anyway, which
makes their users very unlikely to be broken. This XSAVEOPT-like
behavior should be expected for all future dynamic xfeatures. It
may also be extended to legacy features like AVX-512 in the
future.
Only attempt this optimization on systems with dynamic features.
Disable dynamic feature support (XFD) if XGETBV1 is unavailable
by adding a CPUID dependency.
This has been measured to reduce the *overall* cycle cost of
signal delivery by about 4%.
Fixes: 2308ee57d93d ("x86/fpu/amx: Enable the AMX feature in 64-bit mode")
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: "Chang S. Bae" <chang.seok.bae@intel.com>
Link: https://lore.kernel.org/r/20211102224750.FA412E26@davehans-spike.ostc.intel.com
2021-11-02 15:47:50 -07:00

AMX TILE_DATA enabling example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Below is an example of how userspace applications enable
TILE_DATA dynamically:


1. The application first needs to query the kernel for AMX
   support::

      #include <asm/prctl.h>
      #include <sys/syscall.h>
      #include <stdio.h>
      #include <unistd.h>

      #ifndef ARCH_GET_XCOMP_SUPP
      #define ARCH_GET_XCOMP_SUPP  0x1021
      #endif

      #ifndef ARCH_XCOMP_TILECFG
      #define ARCH_XCOMP_TILECFG   17
      #endif

      #ifndef ARCH_XCOMP_TILEDATA
      #define ARCH_XCOMP_TILEDATA  18
      #endif

      /* Both AMX components must be supported. */
      #define MASK_XCOMP_TILE      ((1UL << ARCH_XCOMP_TILECFG) | \
                                    (1UL << ARCH_XCOMP_TILEDATA))

      unsigned long features;
      long rc;

      ...

      rc = syscall(SYS_arch_prctl, ARCH_GET_XCOMP_SUPP, &features);

      if (!rc && (features & MASK_XCOMP_TILE) == MASK_XCOMP_TILE)
          printf("AMX is available.\n");

2. After determining support for AMX, an application must
   explicitly ask permission to use it::

      #ifndef ARCH_REQ_XCOMP_PERM
      #define ARCH_REQ_XCOMP_PERM  0x1023
      #endif

      ...

      /* Request the highest component number required, here TILE_DATA. */
      rc = syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, ARCH_XCOMP_TILEDATA);

      if (!rc)
          printf("AMX is ready for use.\n");

Note this example does not include the sigaltstack preparation.

Dynamic features in signal frames
---------------------------------

Dynamically enabled features are not written to the signal frame upon signal
entry if the feature is in its initial configuration. This differs from
non-dynamic features which are always written regardless of their
configuration. Signal handlers can examine the XSAVE buffer's XSTATE_BV
field to determine if a feature was written.

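A signal handler could perform that XSTATE_BV check roughly as sketched
below. The sketch assumes the x86-64 signal frame layout from the UAPI
header <asm/sigcontext.h>: a 512-byte legacy FXSAVE region whose software
reserved bytes start at offset 464 and carry FP_XSTATE_MAGIC1 when an XSAVE
buffer follows, and an XSAVE header at offset 512 whose first 64 bits are
XSTATE_BV. The helper names and the local macro names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FXSAVE_AREA_SIZE	512		/* legacy region before the XSAVE header */
#define SW_RESERVED_OFFSET	464		/* software reserved bytes in that region */
#define XSTATE_MAGIC1		0x46505853U	/* FP_XSTATE_MAGIC1 */

/*
 * Hypothetical helper: fetch XSTATE_BV from the FP state of a signal
 * frame. Returns false if the frame carries no XSAVE buffer.
 */
static inline bool sigframe_xstate_bv(const void *fpstate, uint64_t *bv)
{
	const unsigned char *p = fpstate;
	uint32_t magic1;

	if (!p)
		return false;

	/* The magic value marks frames that extend beyond the FXSAVE area. */
	memcpy(&magic1, p + SW_RESERVED_OFFSET, sizeof(magic1));
	if (magic1 != XSTATE_MAGIC1)
		return false;

	/* XSTATE_BV is the first field of the XSAVE header. */
	memcpy(bv, p + FXSAVE_AREA_SIZE, sizeof(*bv));
	return true;
}

/* Hypothetical helper: was XSTATE component 'nr' written to the frame? */
static inline bool sigframe_has_component(const void *fpstate, unsigned int nr)
{
	uint64_t bv;

	return nr < 64 && sigframe_xstate_bv(fpstate, &bv) && ((bv >> nr) & 1);
}
```

With glibc on x86-64, an SA_SIGINFO handler would typically pass the
uc_mcontext.fpregs pointer of its ucontext_t argument as fpstate and use
component number 18 to find out whether TILE_DATA was written.
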
Dynamic features for virtual machines
-------------------------------------

The permission for the guest state component needs to be managed separately
from the host, as they are exclusive to each other. A couple of options
are extended to control the guest permission:

-ARCH_GET_XCOMP_GUEST_PERM

 arch_prctl(ARCH_GET_XCOMP_GUEST_PERM, &features);

 ARCH_GET_XCOMP_GUEST_PERM is a variant of ARCH_GET_XCOMP_PERM. It
 provides the same semantics and functionality but for the guest
 components.

-ARCH_REQ_XCOMP_GUEST_PERM

 arch_prctl(ARCH_REQ_XCOMP_GUEST_PERM, feature_nr);

 ARCH_REQ_XCOMP_GUEST_PERM is a variant of ARCH_REQ_XCOMP_PERM. It has the
 same semantics for the guest permission. While providing a similar
 functionality, this comes with a constraint. Permission is frozen when the
 first VCPU is created. Any attempt to change permission after that point
 is rejected. So, the permission has to be requested before the
 first VCPU creation.

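The ordering constraint can be illustrated with a small sketch in which a
VMM requests the guest permission during early setup, before it creates
any VCPU. The helper names guest_xcomp_prctl() and request_guest_xcomp()
are hypothetical, the option value is the one from the x86 UAPI header
<asm/prctl.h>, and the VCPU creation itself is left out.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef ARCH_REQ_XCOMP_GUEST_PERM
#define ARCH_REQ_XCOMP_GUEST_PERM	0x1025
#endif

/* Hypothetical wrapper; fails with ENOSYS where arch_prctl(2) is absent. */
static inline long guest_xcomp_prctl(int option, unsigned long arg)
{
#ifdef SYS_arch_prctl
	return syscall(SYS_arch_prctl, option, arg);
#else
	(void)option;
	(void)arg;
	errno = ENOSYS;
	return -1;
#endif
}

/*
 * Hypothetical helper: request guest permission for 'feature_nr'. Must
 * run before the first VCPU is created, because the permission is frozen
 * at that point. Returns 0 on success and -1 on error with errno set.
 */
static inline long request_guest_xcomp(unsigned int feature_nr)
{
	if (feature_nr >= 64) {
		errno = EINVAL;
		return -1;
	}
	return guest_xcomp_prctl(ARCH_REQ_XCOMP_GUEST_PERM, feature_nr) ? -1 : 0;
}
```

A VMM that wants to expose AMX to guests would call
request_guest_xcomp(18) first and create its VCPUs only after the call has
succeeded.
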
Note that some VMMs may have already established a set of supported state
components. These options are not presumed to support any particular VMM.