1
* Fix the guest view of the ID registers, making the relevant fields
   writable from userspace (affecting ID_AA64DFR0_EL1 and ID_AA64PFR1_EL1)
 
 * Correcly expose S1PIE to guests, fixing a regression introduced
   in 6.12-rc1 with the S1POE support
 
 * Fix the recycling of stage-2 shadow MMUs by tracking the context
   (are we allowed to block or not) as well as the recycling state
 
 * Address a couple of issues with the vgic when userspace misconfigures
   the emulation, resulting in various splats. Headaches courtesy
   of our Syzkaller friends
 
 * Stop wasting space in the HYP idmap, as we are dangerously close
   to the 4kB limit, and this has already exploded in -next
 
 * Fix another race in vgic_init()
 
 * Fix a UBSAN error when faking the cache topology with MTE
   enabled
 
 RISCV:
 
 * RISCV: KVM: use raw_spinlock for critical section in imsic
 
 x86:
 
 * A bandaid for lack of XCR0 setup in selftests, which causes trouble
   if the compiler is configured to have x86-64-v3 (with AVX) as the
   default ISA.  Proper XCR0 setup will come in the next merge window.
 
 * Fix an issue where KVM would not ignore low bits of the nested CR3
   and potentially leak up to 31 bytes out of the guest memory's bounds
 
 * Fix case in which an out-of-date cached value for the segments could
   by returned by KVM_GET_SREGS.
 
 * More cleanups for KVM_X86_QUIRK_SLOT_ZAP_ALL
 
 * Override MTRR state for KVM confidential guests, making it WB by
   default as is already the case for Hyper-V guests.
 
 Generic:
 
 * Remove a couple of unused functions
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmcVK54UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOfrgf7BRyihd28OGaqVuv2BqGYrxqfOkd6
 ZqpJDOy+X7UE3iG5NhTxw4mghCJFhOwIL7gDSZwPLe6D2k01oqPSP2pLMqXb5oOv
 /EkltRvzG0YIH3sjZY5PROrMMxnvSKkJKxETFxFQQzMKRym2v/T5LAzrium58YIT
 vWZXxo2HTPXOw/U5upAqqMYJMeeJEL3kurVHtOsPytUFjrIOl0BfeKvgjOwonDIh
 Awm4JZwk0+1d8sYfkuzsSrTQmtshDCx1jkFN1juirt90s1EwgmOvVKiHo3gMsVP9
 veDRoLTx2fM/r7TrhoHo46DTA2vbfmCltWcT0cn5x8P24BFGXXe/IDJIHA==
 =IVlI
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "ARM64:

   - Fix the guest view of the ID registers, making the relevant fields
     writable from userspace (affecting ID_AA64DFR0_EL1 and
     ID_AA64PFR1_EL1)

   - Correcly expose S1PIE to guests, fixing a regression introduced in
     6.12-rc1 with the S1POE support

   - Fix the recycling of stage-2 shadow MMUs by tracking the context
     (are we allowed to block or not) as well as the recycling state

   - Address a couple of issues with the vgic when userspace
     misconfigures the emulation, resulting in various splats. Headaches
     courtesy of our Syzkaller friends

   - Stop wasting space in the HYP idmap, as we are dangerously close to
     the 4kB limit, and this has already exploded in -next

   - Fix another race in vgic_init()

   - Fix a UBSAN error when faking the cache topology with MTE enabled

  RISCV:

   - RISCV: KVM: use raw_spinlock for critical section in imsic

  x86:

   - A bandaid for lack of XCR0 setup in selftests, which causes trouble
     if the compiler is configured to have x86-64-v3 (with AVX) as the
     default ISA. Proper XCR0 setup will come in the next merge window.

   - Fix an issue where KVM would not ignore low bits of the nested CR3
     and potentially leak up to 31 bytes out of the guest memory's
     bounds

   - Fix case in which an out-of-date cached value for the segments
     could by returned by KVM_GET_SREGS.

   - More cleanups for KVM_X86_QUIRK_SLOT_ZAP_ALL

   - Override MTRR state for KVM confidential guests, making it WB by
     default as is already the case for Hyper-V guests.

  Generic:

   - Remove a couple of unused functions"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (27 commits)
  RISCV: KVM: use raw_spinlock for critical section in imsic
  KVM: selftests: Fix out-of-bounds reads in CPUID test's array lookups
  KVM: selftests: x86: Avoid using SSE/AVX instructions
  KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memory
  KVM: VMX: reset the segment cache after segment init in vmx_vcpu_reset()
  KVM: x86: Clean up documentation for KVM_X86_QUIRK_SLOT_ZAP_ALL
  KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range()
  KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslot
  x86/kvm: Override default caching mode for SEV-SNP and TDX
  KVM: Remove unused kvm_vcpu_gfn_to_pfn_atomic
  KVM: Remove unused kvm_vcpu_gfn_to_pfn
  KVM: arm64: Ensure vgic_ready() is ordered against MMIO registration
  KVM: arm64: vgic: Don't check for vgic_ready() when setting NR_IRQS
  KVM: arm64: Fix shift-out-of-bounds bug
  KVM: arm64: Shave a few bytes from the EL2 idmap code
  KVM: arm64: Don't eagerly teardown the vgic on init error
  KVM: arm64: Expose S1PIE to guests
  KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule
  KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
  KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed
  ...
This commit is contained in:
Linus Torvalds 2024-10-21 11:22:04 -07:00
commit d129377639
25 changed files with 277 additions and 103 deletions

View File

@ -8098,13 +8098,15 @@ KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS By default, KVM emulates MONITOR/MWAIT (if
KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT is
disabled.
KVM_X86_QUIRK_SLOT_ZAP_ALL By default, KVM invalidates all SPTEs in
fast way for memslot deletion when VM type
is KVM_X86_DEFAULT_VM.
When this quirk is disabled or when VM type
is other than KVM_X86_DEFAULT_VM, KVM zaps
only leaf SPTEs that are within the range of
the memslot being deleted.
KVM_X86_QUIRK_SLOT_ZAP_ALL By default, for KVM_X86_DEFAULT_VM VMs, KVM
invalidates all SPTEs in all memslots and
address spaces when a memslot is deleted or
moved. When this quirk is disabled (or the
VM type isn't KVM_X86_DEFAULT_VM), KVM only
ensures the backing memory of the deleted
or moved memslot isn't reachable, i.e KVM
_may_ invalidate only SPTEs related to the
memslot.
=================================== ============================================
7.32 KVM_CAP_MAX_VCPU_ID

View File

@ -136,7 +136,7 @@ For direct sp, we can easily avoid it since the spte of direct sp is fixed
to gfn. For indirect sp, we disabled fast page fault for simplicity.
A solution for indirect sp could be to pin the gfn, for example via
kvm_vcpu_gfn_to_pfn_atomic, before the cmpxchg. After the pinning:
gfn_to_pfn_memslot_atomic, before the cmpxchg. After the pinning:
- We have held the refcount of pfn; that means the pfn can not be freed and
be reused for another gfn.

View File

@ -178,6 +178,7 @@ struct kvm_nvhe_init_params {
unsigned long hcr_el2;
unsigned long vttbr;
unsigned long vtcr;
unsigned long tmp;
};
/*

View File

@ -51,6 +51,7 @@
#define KVM_REQ_RELOAD_PMU KVM_ARCH_REQ(5)
#define KVM_REQ_SUSPEND KVM_ARCH_REQ(6)
#define KVM_REQ_RESYNC_PMU_EL0 KVM_ARCH_REQ(7)
#define KVM_REQ_NESTED_S2_UNMAP KVM_ARCH_REQ(8)
#define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
KVM_DIRTY_LOG_INITIALLY_SET)
@ -211,6 +212,12 @@ struct kvm_s2_mmu {
*/
bool nested_stage2_enabled;
/*
* true when this MMU needs to be unmapped before being used for a new
* purpose.
*/
bool pending_unmap;
/*
* 0: Nobody is currently using this, check vttbr for validity
* >0: Somebody is actively using this.

View File

@ -166,7 +166,8 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
int create_hyp_stack(phys_addr_t phys_addr, unsigned long *haddr);
void __init free_hyp_pgds(void);
void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64 size);
void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start,
u64 size, bool may_block);
void kvm_stage2_flush_range(struct kvm_s2_mmu *mmu, phys_addr_t addr, phys_addr_t end);
void kvm_stage2_wp_range(struct kvm_s2_mmu *mmu, phys_addr_t addr, phys_addr_t end);

View File

@ -78,6 +78,8 @@ extern void kvm_s2_mmu_iterate_by_vmid(struct kvm *kvm, u16 vmid,
extern void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu);
extern void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu);
extern void check_nested_vcpu_requests(struct kvm_vcpu *vcpu);
struct kvm_s2_trans {
phys_addr_t output;
unsigned long block_size;
@ -124,7 +126,7 @@ extern int kvm_s2_handle_perm_fault(struct kvm_vcpu *vcpu,
struct kvm_s2_trans *trans);
extern int kvm_inject_s2_fault(struct kvm_vcpu *vcpu, u64 esr_el2);
extern void kvm_nested_s2_wp(struct kvm *kvm);
extern void kvm_nested_s2_unmap(struct kvm *kvm);
extern void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block);
extern void kvm_nested_s2_flush(struct kvm *kvm);
unsigned long compute_tlb_inval_range(struct kvm_s2_mmu *mmu, u64 val);

View File

@ -146,6 +146,7 @@ int main(void)
DEFINE(NVHE_INIT_HCR_EL2, offsetof(struct kvm_nvhe_init_params, hcr_el2));
DEFINE(NVHE_INIT_VTTBR, offsetof(struct kvm_nvhe_init_params, vttbr));
DEFINE(NVHE_INIT_VTCR, offsetof(struct kvm_nvhe_init_params, vtcr));
DEFINE(NVHE_INIT_TMP, offsetof(struct kvm_nvhe_init_params, tmp));
#endif
#ifdef CONFIG_CPU_PM
DEFINE(CPU_CTX_SP, offsetof(struct cpu_suspend_ctx, sp));

View File

@ -997,6 +997,9 @@ static int kvm_vcpu_suspend(struct kvm_vcpu *vcpu)
static int check_vcpu_requests(struct kvm_vcpu *vcpu)
{
if (kvm_request_pending(vcpu)) {
if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
return -EIO;
if (kvm_check_request(KVM_REQ_SLEEP, vcpu))
kvm_vcpu_sleep(vcpu);
@ -1031,6 +1034,8 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
if (kvm_dirty_ring_check_request(vcpu))
return 0;
check_nested_vcpu_requests(vcpu);
}
return 1;

View File

@ -24,28 +24,25 @@
.align 11
SYM_CODE_START(__kvm_hyp_init)
ventry __invalid // Synchronous EL2t
ventry __invalid // IRQ EL2t
ventry __invalid // FIQ EL2t
ventry __invalid // Error EL2t
ventry . // Synchronous EL2t
ventry . // IRQ EL2t
ventry . // FIQ EL2t
ventry . // Error EL2t
ventry __invalid // Synchronous EL2h
ventry __invalid // IRQ EL2h
ventry __invalid // FIQ EL2h
ventry __invalid // Error EL2h
ventry . // Synchronous EL2h
ventry . // IRQ EL2h
ventry . // FIQ EL2h
ventry . // Error EL2h
ventry __do_hyp_init // Synchronous 64-bit EL1
ventry __invalid // IRQ 64-bit EL1
ventry __invalid // FIQ 64-bit EL1
ventry __invalid // Error 64-bit EL1
ventry . // IRQ 64-bit EL1
ventry . // FIQ 64-bit EL1
ventry . // Error 64-bit EL1
ventry __invalid // Synchronous 32-bit EL1
ventry __invalid // IRQ 32-bit EL1
ventry __invalid // FIQ 32-bit EL1
ventry __invalid // Error 32-bit EL1
__invalid:
b .
ventry . // Synchronous 32-bit EL1
ventry . // IRQ 32-bit EL1
ventry . // FIQ 32-bit EL1
ventry . // Error 32-bit EL1
/*
* Only uses x0..x3 so as to not clobber callee-saved SMCCC registers.
@ -76,6 +73,13 @@ __do_hyp_init:
eret
SYM_CODE_END(__kvm_hyp_init)
SYM_CODE_START_LOCAL(__kvm_init_el2_state)
/* Initialize EL2 CPU state to sane values. */
init_el2_state // Clobbers x0..x2
finalise_el2_state
ret
SYM_CODE_END(__kvm_init_el2_state)
/*
* Initialize the hypervisor in EL2.
*
@ -102,9 +106,12 @@ SYM_CODE_START_LOCAL(___kvm_hyp_init)
// TPIDR_EL2 is used to preserve x0 across the macro maze...
isb
msr tpidr_el2, x0
init_el2_state
finalise_el2_state
str lr, [x0, #NVHE_INIT_TMP]
bl __kvm_init_el2_state
mrs x0, tpidr_el2
ldr lr, [x0, #NVHE_INIT_TMP]
1:
ldr x1, [x0, #NVHE_INIT_TPIDR_EL2]
@ -199,9 +206,8 @@ SYM_CODE_START_LOCAL(__kvm_hyp_init_cpu)
2: msr SPsel, #1 // We want to use SP_EL{1,2}
/* Initialize EL2 CPU state to sane values. */
init_el2_state // Clobbers x0..x2
finalise_el2_state
bl __kvm_init_el2_state
__init_el2_nvhe_prepare_eret
/* Enable MMU, set vectors and stack. */

View File

@ -317,7 +317,7 @@ int kvm_smccc_call_handler(struct kvm_vcpu *vcpu)
* to the guest, and hide SSBS so that the
* guest stays protected.
*/
if (cpus_have_final_cap(ARM64_SSBS))
if (kvm_has_feat(vcpu->kvm, ID_AA64PFR1_EL1, SSBS, IMP))
break;
fallthrough;
case SPECTRE_UNAFFECTED:
@ -428,7 +428,7 @@ int kvm_arm_copy_fw_reg_indices(struct kvm_vcpu *vcpu, u64 __user *uindices)
* Convert the workaround level into an easy-to-compare number, where higher
* values mean better protection.
*/
static int get_kernel_wa_level(u64 regid)
static int get_kernel_wa_level(struct kvm_vcpu *vcpu, u64 regid)
{
switch (regid) {
case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1:
@ -449,7 +449,7 @@ static int get_kernel_wa_level(u64 regid)
* don't have any FW mitigation if SSBS is there at
* all times.
*/
if (cpus_have_final_cap(ARM64_SSBS))
if (kvm_has_feat(vcpu->kvm, ID_AA64PFR1_EL1, SSBS, IMP))
return KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_AVAIL;
fallthrough;
case SPECTRE_UNAFFECTED:
@ -486,7 +486,7 @@ int kvm_arm_get_fw_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg)
case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1:
case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2:
case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_3:
val = get_kernel_wa_level(reg->id) & KVM_REG_FEATURE_LEVEL_MASK;
val = get_kernel_wa_level(vcpu, reg->id) & KVM_REG_FEATURE_LEVEL_MASK;
break;
case KVM_REG_ARM_STD_BMAP:
val = READ_ONCE(smccc_feat->std_bmap);
@ -588,7 +588,7 @@ int kvm_arm_set_fw_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg)
if (val & ~KVM_REG_FEATURE_LEVEL_MASK)
return -EINVAL;
if (get_kernel_wa_level(reg->id) < val)
if (get_kernel_wa_level(vcpu, reg->id) < val)
return -EINVAL;
return 0;
@ -624,7 +624,7 @@ int kvm_arm_set_fw_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg)
* We can deal with NOT_AVAIL on NOT_REQUIRED, but not the
* other way around.
*/
if (get_kernel_wa_level(reg->id) < wa_level)
if (get_kernel_wa_level(vcpu, reg->id) < wa_level)
return -EINVAL;
return 0;

View File

@ -328,9 +328,10 @@ static void __unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64
may_block));
}
void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64 size)
void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start,
u64 size, bool may_block)
{
__unmap_stage2_range(mmu, start, size, true);
__unmap_stage2_range(mmu, start, size, may_block);
}
void kvm_stage2_flush_range(struct kvm_s2_mmu *mmu, phys_addr_t addr, phys_addr_t end)
@ -1015,7 +1016,7 @@ static void stage2_unmap_memslot(struct kvm *kvm,
if (!(vma->vm_flags & VM_PFNMAP)) {
gpa_t gpa = addr + (vm_start - memslot->userspace_addr);
kvm_stage2_unmap_range(&kvm->arch.mmu, gpa, vm_end - vm_start);
kvm_stage2_unmap_range(&kvm->arch.mmu, gpa, vm_end - vm_start, true);
}
hva = vm_end;
} while (hva < reg_end);
@ -1042,7 +1043,7 @@ void stage2_unmap_vm(struct kvm *kvm)
kvm_for_each_memslot(memslot, bkt, slots)
stage2_unmap_memslot(kvm, memslot);
kvm_nested_s2_unmap(kvm);
kvm_nested_s2_unmap(kvm, true);
write_unlock(&kvm->mmu_lock);
mmap_read_unlock(current->mm);
@ -1912,7 +1913,7 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
(range->end - range->start) << PAGE_SHIFT,
range->may_block);
kvm_nested_s2_unmap(kvm);
kvm_nested_s2_unmap(kvm, range->may_block);
return false;
}
@ -2179,8 +2180,8 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
phys_addr_t size = slot->npages << PAGE_SHIFT;
write_lock(&kvm->mmu_lock);
kvm_stage2_unmap_range(&kvm->arch.mmu, gpa, size);
kvm_nested_s2_unmap(kvm);
kvm_stage2_unmap_range(&kvm->arch.mmu, gpa, size, true);
kvm_nested_s2_unmap(kvm, true);
write_unlock(&kvm->mmu_lock);
}

View File

@ -632,9 +632,9 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu)
/* Set the scene for the next search */
kvm->arch.nested_mmus_next = (i + 1) % kvm->arch.nested_mmus_size;
/* Clear the old state */
/* Make sure we don't forget to do the laundry */
if (kvm_s2_mmu_valid(s2_mmu))
kvm_stage2_unmap_range(s2_mmu, 0, kvm_phys_size(s2_mmu));
s2_mmu->pending_unmap = true;
/*
* The virtual VMID (modulo CnP) will be used as a key when matching
@ -650,6 +650,16 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu)
out:
atomic_inc(&s2_mmu->refcnt);
/*
* Set the vCPU request to perform an unmap, even if the pending unmap
* originates from another vCPU. This guarantees that the MMU has been
* completely unmapped before any vCPU actually uses it, and allows
* multiple vCPUs to lend a hand with completing the unmap.
*/
if (s2_mmu->pending_unmap)
kvm_make_request(KVM_REQ_NESTED_S2_UNMAP, vcpu);
return s2_mmu;
}
@ -663,6 +673,13 @@ void kvm_init_nested_s2_mmu(struct kvm_s2_mmu *mmu)
void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
{
/*
* The vCPU kept its reference on the MMU after the last put, keep
* rolling with it.
*/
if (vcpu->arch.hw_mmu)
return;
if (is_hyp_ctxt(vcpu)) {
vcpu->arch.hw_mmu = &vcpu->kvm->arch.mmu;
} else {
@ -674,10 +691,18 @@ void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu)
{
if (kvm_is_nested_s2_mmu(vcpu->kvm, vcpu->arch.hw_mmu)) {
/*
* Keep a reference on the associated stage-2 MMU if the vCPU is
* scheduling out and not in WFI emulation, suggesting it is likely to
* reuse the MMU sometime soon.
*/
if (vcpu->scheduled_out && !vcpu_get_flag(vcpu, IN_WFI))
return;
if (kvm_is_nested_s2_mmu(vcpu->kvm, vcpu->arch.hw_mmu))
atomic_dec(&vcpu->arch.hw_mmu->refcnt);
vcpu->arch.hw_mmu = NULL;
}
vcpu->arch.hw_mmu = NULL;
}
/*
@ -730,7 +755,7 @@ void kvm_nested_s2_wp(struct kvm *kvm)
}
}
void kvm_nested_s2_unmap(struct kvm *kvm)
void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block)
{
int i;
@ -740,7 +765,7 @@ void kvm_nested_s2_unmap(struct kvm *kvm)
struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
if (kvm_s2_mmu_valid(mmu))
kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu));
kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), may_block);
}
}
@ -1184,3 +1209,17 @@ int kvm_init_nv_sysregs(struct kvm *kvm)
return 0;
}
void check_nested_vcpu_requests(struct kvm_vcpu *vcpu)
{
if (kvm_check_request(KVM_REQ_NESTED_S2_UNMAP, vcpu)) {
struct kvm_s2_mmu *mmu = vcpu->arch.hw_mmu;
write_lock(&vcpu->kvm->mmu_lock);
if (mmu->pending_unmap) {
kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), true);
mmu->pending_unmap = false;
}
write_unlock(&vcpu->kvm->mmu_lock);
}
}

View File

@ -1527,6 +1527,14 @@ static u64 __kvm_read_sanitised_id_reg(const struct kvm_vcpu *vcpu,
val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_MTE);
val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_SME);
val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_RNDR_trap);
val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_NMI);
val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_MTE_frac);
val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_GCS);
val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_THE);
val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_MTEX);
val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_DF2);
val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_PFAR);
break;
case SYS_ID_AA64PFR2_EL1:
/* We only expose FPMR */
@ -1550,7 +1558,8 @@ static u64 __kvm_read_sanitised_id_reg(const struct kvm_vcpu *vcpu,
val &= ~ID_AA64MMFR2_EL1_CCIDX_MASK;
break;
case SYS_ID_AA64MMFR3_EL1:
val &= ID_AA64MMFR3_EL1_TCRX | ID_AA64MMFR3_EL1_S1POE;
val &= ID_AA64MMFR3_EL1_TCRX | ID_AA64MMFR3_EL1_S1POE |
ID_AA64MMFR3_EL1_S1PIE;
break;
case SYS_ID_MMFR4_EL1:
val &= ~ARM64_FEATURE_MASK(ID_MMFR4_EL1_CCIDX);
@ -1985,7 +1994,7 @@ static u64 reset_clidr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
* one cache line.
*/
if (kvm_has_mte(vcpu->kvm))
clidr |= 2 << CLIDR_TTYPE_SHIFT(loc);
clidr |= 2ULL << CLIDR_TTYPE_SHIFT(loc);
__vcpu_sys_reg(vcpu, r->reg) = clidr;
@ -2376,7 +2385,19 @@ static const struct sys_reg_desc sys_reg_descs[] = {
ID_AA64PFR0_EL1_RAS |
ID_AA64PFR0_EL1_AdvSIMD |
ID_AA64PFR0_EL1_FP), },
ID_SANITISED(ID_AA64PFR1_EL1),
ID_WRITABLE(ID_AA64PFR1_EL1, ~(ID_AA64PFR1_EL1_PFAR |
ID_AA64PFR1_EL1_DF2 |
ID_AA64PFR1_EL1_MTEX |
ID_AA64PFR1_EL1_THE |
ID_AA64PFR1_EL1_GCS |
ID_AA64PFR1_EL1_MTE_frac |
ID_AA64PFR1_EL1_NMI |
ID_AA64PFR1_EL1_RNDR_trap |
ID_AA64PFR1_EL1_SME |
ID_AA64PFR1_EL1_RES0 |
ID_AA64PFR1_EL1_MPAM_frac |
ID_AA64PFR1_EL1_RAS_frac |
ID_AA64PFR1_EL1_MTE)),
ID_WRITABLE(ID_AA64PFR2_EL1, ID_AA64PFR2_EL1_FPMR),
ID_UNALLOCATED(4,3),
ID_WRITABLE(ID_AA64ZFR0_EL1, ~ID_AA64ZFR0_EL1_RES0),
@ -2390,7 +2411,21 @@ static const struct sys_reg_desc sys_reg_descs[] = {
.get_user = get_id_reg,
.set_user = set_id_aa64dfr0_el1,
.reset = read_sanitised_id_aa64dfr0_el1,
.val = ID_AA64DFR0_EL1_PMUVer_MASK |
/*
* Prior to FEAT_Debugv8.9, the architecture defines context-aware
* breakpoints (CTX_CMPs) as the highest numbered breakpoints (BRPs).
* KVM does not trap + emulate the breakpoint registers, and as such
* cannot support a layout that misaligns with the underlying hardware.
* While it may be possible to describe a subset that aligns with
* hardware, just prevent changes to BRPs and CTX_CMPs altogether for
* simplicity.
*
* See DDI0487K.a, section D2.8.3 Breakpoint types and linking
* of breakpoints for more details.
*/
.val = ID_AA64DFR0_EL1_DoubleLock_MASK |
ID_AA64DFR0_EL1_WRPs_MASK |
ID_AA64DFR0_EL1_PMUVer_MASK |
ID_AA64DFR0_EL1_DebugVer_MASK, },
ID_SANITISED(ID_AA64DFR1_EL1),
ID_UNALLOCATED(5,2),
@ -2433,6 +2468,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
ID_AA64MMFR2_EL1_NV |
ID_AA64MMFR2_EL1_CCIDX)),
ID_WRITABLE(ID_AA64MMFR3_EL1, (ID_AA64MMFR3_EL1_TCRX |
ID_AA64MMFR3_EL1_S1PIE |
ID_AA64MMFR3_EL1_S1POE)),
ID_SANITISED(ID_AA64MMFR4_EL1),
ID_UNALLOCATED(7,5),
@ -2903,7 +2939,7 @@ static bool handle_alle1is(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
* Drop all shadow S2s, resulting in S1/S2 TLBIs for each of the
* corresponding VMIDs.
*/
kvm_nested_s2_unmap(vcpu->kvm);
kvm_nested_s2_unmap(vcpu->kvm, true);
write_unlock(&vcpu->kvm->mmu_lock);
@ -2955,7 +2991,30 @@ union tlbi_info {
static void s2_mmu_unmap_range(struct kvm_s2_mmu *mmu,
const union tlbi_info *info)
{
kvm_stage2_unmap_range(mmu, info->range.start, info->range.size);
/*
* The unmap operation is allowed to drop the MMU lock and block, which
* means that @mmu could be used for a different context than the one
* currently being invalidated.
*
* This behavior is still safe, as:
*
* 1) The vCPU(s) that recycled the MMU are responsible for invalidating
* the entire MMU before reusing it, which still honors the intent
* of a TLBI.
*
* 2) Until the guest TLBI instruction is 'retired' (i.e. increment PC
* and ERET to the guest), other vCPUs are allowed to use stale
* translations.
*
* 3) Accidentally unmapping an unrelated MMU context is nonfatal, and
* at worst may cause more aborts for shadow stage-2 fills.
*
* Dropping the MMU lock also implies that shadow stage-2 fills could
* happen behind the back of the TLBI. This is still safe, though, as
* the L1 needs to put its stage-2 in a consistent state before doing
* the TLBI.
*/
kvm_stage2_unmap_range(mmu, info->range.start, info->range.size, true);
}
static bool handle_vmalls12e1is(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
@ -3050,7 +3109,11 @@ static void s2_mmu_unmap_ipa(struct kvm_s2_mmu *mmu,
max_size = compute_tlb_inval_range(mmu, info->ipa.addr);
base_addr &= ~(max_size - 1);
kvm_stage2_unmap_range(mmu, base_addr, max_size);
/*
* See comment in s2_mmu_unmap_range() for why this is allowed to
* reschedule.
*/
kvm_stage2_unmap_range(mmu, base_addr, max_size, true);
}
static bool handle_ipas2e1is(struct kvm_vcpu *vcpu, struct sys_reg_params *p,

View File

@ -417,8 +417,28 @@ static void __kvm_vgic_vcpu_destroy(struct kvm_vcpu *vcpu)
kfree(vgic_cpu->private_irqs);
vgic_cpu->private_irqs = NULL;
if (vcpu->kvm->arch.vgic.vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3)
if (vcpu->kvm->arch.vgic.vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3) {
/*
* If this vCPU is being destroyed because of a failed creation
* then unregister the redistributor to avoid leaving behind a
* dangling pointer to the vCPU struct.
*
* vCPUs that have been successfully created (i.e. added to
* kvm->vcpu_array) get unregistered in kvm_vgic_destroy(), as
* this function gets called while holding kvm->arch.config_lock
* in the VM teardown path and would otherwise introduce a lock
* inversion w.r.t. kvm->srcu.
*
* vCPUs that failed creation are torn down outside of the
* kvm->arch.config_lock and do not get unregistered in
* kvm_vgic_destroy(), meaning it is both safe and necessary to
* do so here.
*/
if (kvm_get_vcpu_by_id(vcpu->kvm, vcpu->vcpu_id) != vcpu)
vgic_unregister_redist_iodev(vcpu);
vgic_cpu->rd_iodev.base_addr = VGIC_ADDR_UNDEF;
}
}
void kvm_vgic_vcpu_destroy(struct kvm_vcpu *vcpu)
@ -524,22 +544,31 @@ int kvm_vgic_map_resources(struct kvm *kvm)
if (ret)
goto out;
dist->ready = true;
dist_base = dist->vgic_dist_base;
mutex_unlock(&kvm->arch.config_lock);
ret = vgic_register_dist_iodev(kvm, dist_base, type);
if (ret)
if (ret) {
kvm_err("Unable to register VGIC dist MMIO regions\n");
goto out_slots;
}
/*
* kvm_io_bus_register_dev() guarantees all readers see the new MMIO
* registration before returning through synchronize_srcu(), which also
* implies a full memory barrier. As such, marking the distributor as
* 'ready' here is guaranteed to be ordered after all vCPUs having seen
* a completely configured distributor.
*/
dist->ready = true;
goto out_slots;
out:
mutex_unlock(&kvm->arch.config_lock);
out_slots:
mutex_unlock(&kvm->slots_lock);
if (ret)
kvm_vgic_destroy(kvm);
kvm_vm_dead(kvm);
mutex_unlock(&kvm->slots_lock);
return ret;
}

View File

@ -236,7 +236,12 @@ static int vgic_set_common_attr(struct kvm_device *dev,
mutex_lock(&dev->kvm->arch.config_lock);
if (vgic_ready(dev->kvm) || dev->kvm->arch.vgic.nr_spis)
/*
* Either userspace has already configured NR_IRQS or
* the vgic has already been initialized and vgic_init()
* supplied a default amount of SPIs.
*/
if (dev->kvm->arch.vgic.nr_spis)
ret = -EBUSY;
else
dev->kvm->arch.vgic.nr_spis =

View File

@ -55,7 +55,7 @@ struct imsic {
/* IMSIC SW-file */
struct imsic_mrif *swfile;
phys_addr_t swfile_pa;
spinlock_t swfile_extirq_lock;
raw_spinlock_t swfile_extirq_lock;
};
#define imsic_vs_csr_read(__c) \
@ -622,7 +622,7 @@ static void imsic_swfile_extirq_update(struct kvm_vcpu *vcpu)
* interruptions between reading topei and updating pending status.
*/
spin_lock_irqsave(&imsic->swfile_extirq_lock, flags);
raw_spin_lock_irqsave(&imsic->swfile_extirq_lock, flags);
if (imsic_mrif_atomic_read(mrif, &mrif->eidelivery) &&
imsic_mrif_topei(mrif, imsic->nr_eix, imsic->nr_msis))
@ -630,7 +630,7 @@ static void imsic_swfile_extirq_update(struct kvm_vcpu *vcpu)
else
kvm_riscv_vcpu_unset_interrupt(vcpu, IRQ_VS_EXT);
spin_unlock_irqrestore(&imsic->swfile_extirq_lock, flags);
raw_spin_unlock_irqrestore(&imsic->swfile_extirq_lock, flags);
}
static void imsic_swfile_read(struct kvm_vcpu *vcpu, bool clear,
@ -1051,7 +1051,7 @@ int kvm_riscv_vcpu_aia_imsic_init(struct kvm_vcpu *vcpu)
}
imsic->swfile = page_to_virt(swfile_page);
imsic->swfile_pa = page_to_phys(swfile_page);
spin_lock_init(&imsic->swfile_extirq_lock);
raw_spin_lock_init(&imsic->swfile_extirq_lock);
/* Setup IO device */
kvm_iodevice_init(&imsic->iodev, &imsic_iodoev_ops);

View File

@ -37,6 +37,7 @@
#include <asm/apic.h>
#include <asm/apicdef.h>
#include <asm/hypervisor.h>
#include <asm/mtrr.h>
#include <asm/tlb.h>
#include <asm/cpuidle_haltpoll.h>
#include <asm/ptrace.h>
@ -980,6 +981,9 @@ static void __init kvm_init_platform(void)
}
kvmclock_init();
x86_platform.apic_post_init = kvm_apic_init;
/* Set WB as the default cache mode for SEV-SNP and TDX */
mtrr_overwrite_state(NULL, 0, MTRR_TYPE_WRBACK);
}
#if defined(CONFIG_AMD_MEM_ENCRYPT)

View File

@ -1556,6 +1556,17 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool flush = false;
/*
* To prevent races with vCPUs faulting in a gfn using stale data,
* zapping a gfn range must be protected by mmu_invalidate_in_progress
* (and mmu_invalidate_seq). The only exception is memslot deletion;
* in that case, SRCU synchronization ensures that SPTEs are zapped
* after all vCPUs have unlocked SRCU, guaranteeing that vCPUs see the
* invalid slot.
*/
lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
lockdep_is_held(&kvm->slots_lock));
if (kvm_memslots_have_rmaps(kvm))
flush = __kvm_rmap_zap_gfn_range(kvm, range->slot,
range->start, range->end,
@ -1884,14 +1895,10 @@ static bool sp_has_gptes(struct kvm_mmu_page *sp)
if (is_obsolete_sp((_kvm), (_sp))) { \
} else
#define for_each_gfn_valid_sp(_kvm, _sp, _gfn) \
#define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn) \
for_each_valid_sp(_kvm, _sp, \
&(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)]) \
if ((_sp)->gfn != (_gfn)) {} else
#define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn) \
for_each_gfn_valid_sp(_kvm, _sp, _gfn) \
if (!sp_has_gptes(_sp)) {} else
if ((_sp)->gfn != (_gfn) || !sp_has_gptes(_sp)) {} else
static bool kvm_sync_page_check(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
{
@ -7063,15 +7070,15 @@ static void kvm_mmu_zap_memslot_pages_and_flush(struct kvm *kvm,
/*
* Since accounting information is stored in struct kvm_arch_memory_slot,
* shadow pages deletion (e.g. unaccount_shadowed()) requires that all
* gfns with a shadow page have a corresponding memslot. Do so before
* the memslot goes away.
* all MMU pages that are shadowing guest PTEs must be zapped before the
* memslot is deleted, as freeing such pages after the memslot is freed
* will result in use-after-free, e.g. in unaccount_shadowed().
*/
for (i = 0; i < slot->npages; i++) {
struct kvm_mmu_page *sp;
gfn_t gfn = slot->base_gfn + i;
for_each_gfn_valid_sp(kvm, sp, gfn)
for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn)
kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {

View File

@ -63,8 +63,12 @@ static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index)
u64 pdpte;
int ret;
/*
* Note, nCR3 is "assumed" to be 32-byte aligned, i.e. the CPU ignores
* nCR3[4:0] when loading PDPTEs from memory.
*/
ret = kvm_vcpu_read_guest_page(vcpu, gpa_to_gfn(cr3), &pdpte,
offset_in_page(cr3) + index * 8, 8);
(cr3 & GENMASK(11, 5)) + index * 8, 8);
if (ret)
return 0;
return pdpte;

View File

@ -4888,9 +4888,6 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmx->hv_deadline_tsc = -1;
kvm_set_cr8(vcpu, 0);
vmx_segment_cache_clear(vmx);
kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
seg_setup(VCPU_SREG_CS);
vmcs_write16(GUEST_CS_SELECTOR, 0xf000);
vmcs_writel(GUEST_CS_BASE, 0xffff0000ul);
@ -4917,6 +4914,9 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmcs_writel(GUEST_IDTR_BASE, 0);
vmcs_write32(GUEST_IDTR_LIMIT, 0xffff);
vmx_segment_cache_clear(vmx);
kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
vmcs_write32(GUEST_INTERRUPTIBILITY_INFO, 0);
vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS, 0);

View File

@ -1313,8 +1313,6 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn);
kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
int kvm_vcpu_map(struct kvm_vcpu *vcpu, gpa_t gpa, struct kvm_host_map *map);
void kvm_vcpu_unmap(struct kvm_vcpu *vcpu, struct kvm_host_map *map, bool dirty);
unsigned long kvm_vcpu_gfn_to_hva(struct kvm_vcpu *vcpu, gfn_t gfn);

View File

@ -244,6 +244,7 @@ CFLAGS += -Wall -Wstrict-prototypes -Wuninitialized -O2 -g -std=gnu99 \
-fno-stack-protector -fno-PIE -I$(LINUX_TOOL_INCLUDE) \
-I$(LINUX_TOOL_ARCH_INCLUDE) -I$(LINUX_HDR_PATH) -Iinclude \
-I$(<D) -Iinclude/$(ARCH_DIR) -I ../rseq -I.. $(EXTRA_CFLAGS) \
-march=x86-64-v2 \
$(KHDR_INCLUDES)
ifeq ($(ARCH),s390)
CFLAGS += -march=z10

View File

@ -68,6 +68,8 @@ struct test_feature_reg {
}
static const struct reg_ftr_bits ftr_id_aa64dfr0_el1[] = {
S_REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64DFR0_EL1, DoubleLock, 0),
REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64DFR0_EL1, WRPs, 0),
S_REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64DFR0_EL1, PMUVer, 0),
REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64DFR0_EL1, DebugVer, ID_AA64DFR0_EL1_DebugVer_IMP),
REG_FTR_END,
@ -134,6 +136,13 @@ static const struct reg_ftr_bits ftr_id_aa64pfr0_el1[] = {
REG_FTR_END,
};
static const struct reg_ftr_bits ftr_id_aa64pfr1_el1[] = {
REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64PFR1_EL1, CSV2_frac, 0),
REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64PFR1_EL1, SSBS, ID_AA64PFR1_EL1_SSBS_NI),
REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64PFR1_EL1, BT, 0),
REG_FTR_END,
};
static const struct reg_ftr_bits ftr_id_aa64mmfr0_el1[] = {
REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64MMFR0_EL1, ECV, 0),
REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64MMFR0_EL1, EXS, 0),
@ -200,6 +209,7 @@ static struct test_feature_reg test_regs[] = {
TEST_REG(SYS_ID_AA64ISAR1_EL1, ftr_id_aa64isar1_el1),
TEST_REG(SYS_ID_AA64ISAR2_EL1, ftr_id_aa64isar2_el1),
TEST_REG(SYS_ID_AA64PFR0_EL1, ftr_id_aa64pfr0_el1),
TEST_REG(SYS_ID_AA64PFR1_EL1, ftr_id_aa64pfr1_el1),
TEST_REG(SYS_ID_AA64MMFR0_EL1, ftr_id_aa64mmfr0_el1),
TEST_REG(SYS_ID_AA64MMFR1_EL1, ftr_id_aa64mmfr1_el1),
TEST_REG(SYS_ID_AA64MMFR2_EL1, ftr_id_aa64mmfr2_el1),
@ -569,9 +579,9 @@ int main(void)
test_cnt = ARRAY_SIZE(ftr_id_aa64dfr0_el1) + ARRAY_SIZE(ftr_id_dfr0_el1) +
ARRAY_SIZE(ftr_id_aa64isar0_el1) + ARRAY_SIZE(ftr_id_aa64isar1_el1) +
ARRAY_SIZE(ftr_id_aa64isar2_el1) + ARRAY_SIZE(ftr_id_aa64pfr0_el1) +
ARRAY_SIZE(ftr_id_aa64mmfr0_el1) + ARRAY_SIZE(ftr_id_aa64mmfr1_el1) +
ARRAY_SIZE(ftr_id_aa64mmfr2_el1) + ARRAY_SIZE(ftr_id_aa64zfr0_el1) -
ARRAY_SIZE(test_regs) + 2;
ARRAY_SIZE(ftr_id_aa64pfr1_el1) + ARRAY_SIZE(ftr_id_aa64mmfr0_el1) +
ARRAY_SIZE(ftr_id_aa64mmfr1_el1) + ARRAY_SIZE(ftr_id_aa64mmfr2_el1) +
ARRAY_SIZE(ftr_id_aa64zfr0_el1) - ARRAY_SIZE(test_regs) + 2;
ksft_set_plan(test_cnt);

View File

@ -60,7 +60,7 @@ static bool is_cpuid_mangled(const struct kvm_cpuid_entry2 *entrie)
{
int i;
for (i = 0; i < sizeof(mangled_cpuids); i++) {
for (i = 0; i < ARRAY_SIZE(mangled_cpuids); i++) {
if (mangled_cpuids[i].function == entrie->function &&
mangled_cpuids[i].index == entrie->index)
return true;

View File

@ -3035,24 +3035,12 @@ kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gf
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn)
{
return gfn_to_pfn_memslot_atomic(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn);
}
EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_pfn_atomic);
kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
{
return gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn);
}
EXPORT_SYMBOL_GPL(gfn_to_pfn);
kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn)
{
return gfn_to_pfn_memslot(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn);
}
EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_pfn);
int gfn_to_page_many_atomic(struct kvm_memory_slot *slot, gfn_t gfn,
struct page **pages, int nr_pages)
{