linux

Author	SHA1	Message	Date
Marcelo Tosatti	0ba73cdadb	KVM: MMU: sync roots on mmu reload Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:20 +02:00
Marcelo Tosatti	e8bc217aef	KVM: MMU: mode specific sync_page Examine guest pagetable and bring the shadow back in sync. Caller is responsible for local TLB flush before re-entering guest mode. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:19 +02:00
Marcelo Tosatti	38187c830c	KVM: MMU: do not write-protect large mappings There is not much point in write protecting large mappings. This can only happen when a page is shadowed during the window between is_largepage_backed and mmu_lock acquision. Zap the entry instead, so the next pagefault will find a shadowed page via is_largepage_backed and fallback to 4k translations. Simplifies out of sync shadow. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:18 +02:00
Marcelo Tosatti	a378b4e64c	KVM: MMU: move local TLB flush to mmu_set_spte Since the sync page path can collapse flushes. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:17 +02:00
Marcelo Tosatti	1e73f9dd88	KVM: MMU: split mmu_set_spte Split the spte entry creation code into a new set_spte function. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:16 +02:00
Marcelo Tosatti	93a423e704	KVM: MMU: flush remote TLBs on large->normal entry overwrite It is necessary to flush all TLB's when a large spte entry is overwritten with a normal page directory pointer. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:15 +02:00
Gleb Natapov	af2152f545	KVM: don't enter guest after SIPI was received by a CPU The vcpu should process pending SIPI message before entering guest mode again. kvm_arch_vcpu_runnable() returns true if the vcpu is in SIPI state, so we can't call it here. Signed-off-by: Gleb Natapov <gleb@qumranet.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:09 +02:00
Harvey Harrison	2259e3a7a6	KVM: x86.c make kvm_load_realmode_segment static Noticed by sparse: arch/x86/kvm/x86.c:3591:5: warning: symbol 'kvm_load_realmode_segment' was not declared. Should it be static? Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:07 +02:00
Marcelo Tosatti	4c2155ce81	KVM: switch to get_user_pages_fast Convert gfn_to_pfn to use get_user_pages_fast, which can do lockless pagetable lookups on x86. Kernel compilation on 4-way guest is 3.7% faster on VMX. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:06 +02:00
Amit Shah	bfadaded0d	KVM: Device Assignment: Free device structures if IRQ allocation fails When an IRQ allocation fails, we free up the device structures and disable the device so that we can unregister the device in the userspace and not expose it to the guest at all. Signed-off-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2008-10-15 14:25:04 +02:00
Ben-Ami Yassour	62c476c7c7	KVM: Device Assignment with VT-d Based on a patch by: Kay, Allen M <allen.m.kay@intel.com> This patch enables PCI device assignment based on VT-d support. When a device is assigned to the guest, the guest memory is pinned and the mapping is updated in the VT-d IOMMU. [Amit: Expose KVM_CAP_IOMMU so we can check if an IOMMU is present and also control enable/disable from userspace] Signed-off-by: Kay, Allen M <allen.m.kay@intel.com> Signed-off-by: Weidong Han <weidong.han@intel.com> Signed-off-by: Ben-Ami Yassour <benami@il.ibm.com> Signed-off-by: Amit Shah <amit.shah@qumranet.com> Acked-by: Mark Gross <mgross@linux.intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 14:25:04 +02:00
Guillaume Thouvenin	aa3a816b6d	KVM: x86 emulator: Use DstAcc for 'and' For instruction 'and al,imm' we use DstAcc instead of doing the emulation directly into the instruction's opcode. Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:16:14 +02:00
Guillaume Thouvenin	8a9fee67fb	KVM: x86 emulator: Add cmp al, imm and cmp ax, imm instructions (ocodes 3c, 3d) Add decode entries for these opcodes; execution is already implemented. Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:16:14 +02:00
Guillaume Thouvenin	9c9fddd0e7	KVM: x86 emulator: Add DstAcc operand type Add DstAcc operand type. That means that there are 4 bits now for DstMask. "In the good old days cpus would have only one register that was able to fully participate in arithmetic operations, typically called A for Accumulator. The x86 retains this tradition by having special, shorter encodings for the A register (like the cmp opcode), and even some instructions that only operate on A (like mul). SrcAcc and DstAcc would accommodate these instructions by decoding A into the corresponding 'struct operand'." -- Avi Kivity Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:16:14 +02:00
Sheng Yang	defed7ed92	x86: Move FEATURE_CONTROL bits to msr-index.h For MSR_IA32_FEATURE_CONTROL is already there. Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:16:14 +02:00
Sheng Yang	9ea542facb	KVM: VMX: Rename IA32_FEATURE_CONTROL bits Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:16:14 +02:00
Avi Kivity	ef46f18ea0	KVM: x86 emulator: fix jmp r/m64 instruction jmp r/m64 doesn't require the rex.w prefix to indicate the operand size is 64 bits. Set the Stack attribute (even though it doesn't involve the stack, really) to indicate this. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:27 +02:00
Jan Kiszka	4b92fe0c9d	KVM: VMX: Cleanup stalled INTR_INFO read Commit 1c0f4f5011829dac96347b5f84ba37c2252e1e08 left a useless access of VM_ENTRY_INTR_INFO_FIELD in vmx_intr_assist behind. Clean this up. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:26 +02:00
Marcelo Tosatti	9c3e4aab5a	KVM: x86: unhalt vcpu0 on reset Since "KVM: x86: do not execute halted vcpus", HLT by vcpu0 before system reset by the IO thread will hang the guest. Mark vcpu as runnable in such case. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:26 +02:00
Mohammed Gamal	d19292e457	KVM: x86 emulator: Add call near absolute instruction (opcode 0xff/2) Add call near absolute instruction. Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:26 +02:00
Marcelo Tosatti	d76901750a	KVM: x86: do not execute halted vcpus Offline or uninitialized vcpu's can be executed if requested to perform userspace work. Follow Avi's suggestion to handle halted vcpu's in the main loop, simplifying kvm_emulate_halt(). Introduce a new vcpu->requests bit to indicate events that promote state from halted to running. Also standardize vcpu wake sites. Signed-off-by: Marcelo Tosatti <mtosatti <at> redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:26 +02:00
Mohammed Gamal	a6a3034cb9	KVM: x86 emulator: Add in/out instructions (opcodes 0xe4-0xe7, 0xec-0xef) The patch adds in/out instructions to the x86 emulator. The instruction was encountered while running the BIOS while using the invalid guest state emulation patch. Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:25 +02:00
Avi Kivity	fa89a81766	KVM: Add statistics for guest irq injections These can help show whether a guest is making progress or not. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:25 +02:00
Sheng Yang	d40a1ee485	KVM: MMU: Modify kvm_shadow_walk.entry to accept u64 addr EPT is 4 level by default in 32pae(48 bits), but the addr parameter of kvm_shadow_walk->entry() only accept unsigned long as virtual address, which is 32bit in 32pae. This result in SHADOW_PT_INDEX() overflow when try to fetch level 4 index. Fix it by extend kvm_shadow_walk->entry() to accept 64bit addr in parameter. Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:25 +02:00
Mohammed Gamal	fb4616f431	KVM: x86 emulator: Add std and cld instructions (opcodes 0xfc-0xfd) This adds the std and cld instructions to the emulator. Encountered while running the BIOS with invalid guest state emulation enabled. Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:25 +02:00
Joerg Roedel	a89c1ad270	KVM: add MC5_MISC msr read support Currently KVM implements MC0-MC4_MISC read support. When booting Linux this results in KVM warnings in the kernel log when the guest tries to read MC5_MISC. Fix this warnings with this patch. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:24 +02:00
Avi Kivity	48d1503949	KVM: SVM: No need to unprotect memory during event injection when using npt No memory is protected anyway. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:24 +02:00
Avi Kivity	3201b5d9f0	KVM: MMU: Fix setting the accessed bit on non-speculative sptes The accessed bit was accidentally turned on in a random flag word, rather than, the spte itself, which was lucky, since it used the non-EPT compatible PT_ACCESSED_MASK. Fix by turning the bit on in the spte and changing it to use the portable accessed mask. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:24 +02:00
Avi Kivity	171d595d3b	KVM: MMU: Flush tlbs after clearing write permission when accessing dirty log Otherwise, the cpu may allow writes to the tracked pages, and we lose some display bits or fail to migrate correctly. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:24 +02:00
Avi Kivity	2245a28fe2	KVM: MMU: Add locking around kvm_mmu_slot_remove_write_access() It was generally safe due to slots_lock being held for write, but it wasn't very nice. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:24 +02:00
Avi Kivity	bc2d429979	KVM: MMU: Account for npt/ept/realmode page faults Now that two-dimensional paging is becoming common, account for tdp page faults. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:23 +02:00
Mohammed Gamal	a5e2e82b8b	KVM: x86 emulator: Add mov r, imm instructions (opcodes 0xb0-0xbf) The emulator only supported one instance of mov r, imm instruction (opcode 0xb8), this adds the rest of these instructions. Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:23 +02:00
Avi Kivity	acee3c04e8	KVM: Allocate guest memory as MAP_PRIVATE, not MAP_SHARED There is no reason to share internal memory slots with fork()ed instances. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:23 +02:00
Avi Kivity	abb9e0b8e3	KVM: MMU: Convert the paging mode shadow walk to use the generic walker Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:23 +02:00
Avi Kivity	140754bc80	KVM: MMU: Convert direct maps to use the generic shadow walker Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:23 +02:00
Avi Kivity	3d000db568	KVM: MMU: Add generic shadow walker We currently walk the shadow page tables in two places: direct map (for real mode and two dimensional paging) and paging mode shadow. Since we anticipate requiring a third walk (for invlpg), it makes sense to have a generic facility for shadow walk. This patch adds such a shadow walker, walks the page tables and calls a method for every spte encountered. The method can examine the spte, modify it, or even instantiate it. The walk can be aborted by returning nonzero from the method. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:23 +02:00
Avi Kivity	6c41f428b7	KVM: MMU: Infer shadow root level in direct_map() In all cases the shadow root level is available in mmu.shadow_root_level, so there is no need to pass it as a parameter. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:22 +02:00
Avi Kivity	6e37d3dc3e	KVM: MMU: Unify direct map 4K and large page paths The two paths are equivalent except for one argument, which is already available. Merge the two codepaths. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:22 +02:00
Avi Kivity	135f8c2b07	KVM: MMU: Move SHADOW_PT_INDEX to mmu.c It is not specific to the paging mode, so can be made global (and reusable). Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:22 +02:00
Avi Kivity	6eb06cb286	KVM: x86 emulator: remove bad ByteOp specifier from NEG descriptor Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:22 +02:00
roel kluin	41afa02587	KVM: x86 emulator: remove duplicate SrcImm Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:21 +02:00
Avi Kivity	f4bbd9aaaa	KVM: Load real mode segments correctly Real mode segments to not reference the GDT or LDT; they simply compute base = selector * 16. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:21 +02:00
Avi Kivity	a16b20da87	KVM: VMX: Change segment dpl at reset to 3 This is more emulation friendly, if not 100% correct. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:21 +02:00
Avi Kivity	5706be0daf	KVM: VMX: Change cs reset state to be a data segment Real mode cs is a data segment, not a code segment. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:21 +02:00
Harvey Harrison	ee032c993e	KVM: make irq ack notifier functions static sparse says: arch/x86/kvm/x86.c:107:32: warning: symbol 'kvm_find_assigned_dev' was not declared. Should it be static? arch/x86/kvm/i8254.c:225:6: warning: symbol 'kvm_pit_ack_irq' was not declared. Should it be static? Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:21 +02:00
Amit Shah	29c8fa32c5	KVM: Use kvm_set_irq to inject interrupts ... instead of using the pic and ioapic variants Signed-off-by: Amit Shah <amit.shah@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:21 +02:00
Amit Shah	94c935a1ee	KVM: SVM: Fix typo Fix typo in as-yet unused macro definition. Signed-off-by: Amit Shah <amit.shah@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:20 +02:00
Mohammed Gamal	a89a8fb93b	KVM: VMX: Modify mode switching and vmentry functions This patch modifies mode switching and vmentry function in order to drive invalid guest state emulation. Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:20 +02:00
Mohammed Gamal	ea953ef0ca	KVM: VMX: Add invalid guest state handler This adds the invalid guest state handler function which invokes the x86 emulator until getting the guest to a VMX-friendly state. [avi: leave atomic context if scheduling] [guillaume: return to atomic context correctly] Signed-off-by: Laurent Vivier <laurent.vivier@bull.net> Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:20 +02:00
Mohammed Gamal	04fa4d3211	KVM: VMX: Add module parameter and emulation flag. The patch adds the module parameter required to enable emulating invalid guest state, as well as the emulation_required flag used to drive emulation whenever needed. Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:20 +02:00
Mohammed Gamal	648dfaa7df	KVM: VMX: Add Guest State Validity Checks This patch adds functions to check whether guest state is VMX compliant. Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:20 +02:00
Amit Shah	6762b7299a	KVM: Device assignment: Check for privileges before assigning irq Even though we don't share irqs at the moment, we should ensure regular user processes don't try to allocate system resources. We check for capability to access IO devices (CAP_SYS_RAWIO) before we request_irq on behalf of the guest. Noticed by Avi. Signed-off-by: Amit Shah <amit.shah@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:20 +02:00
Avi Kivity	dc7404cea3	KVM: Handle spurious acks for PIT interrupts Spurious acks can be generated, for example if the PIC is being reset. Handle those acks gracefully rather than flooding the log with warnings. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:19 +02:00
Marcelo Tosatti	85428ac7c3	KVM: fix i8259 reset irq acking The irq ack during pic reset has three problems: - Ignores slave/master PIC, using gsi 0-8 for both. - Generates an ACK even if the APIC is in control. - Depends upon IMR being clear, which is broken if the irq was masked at the time it was generated. The last one causes the BIOS to hang after the first reboot of Windows installation, since PIT interrupts stop. [avi: fix check whether pic interrupts are seen by cpu] Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:19 +02:00
Avi Kivity	ecfc79c700	KVM: VMX: Use interrupt queue for !irqchip_in_kernel Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:19 +02:00
Marcelo Tosatti	29415c37f0	KVM: set debug registers after "schedulable" section The vcpu thread can be preempted after the guest_debug_pre() callback, resulting in invalid debug registers on the new vcpu. Move it inside the non-preemptable section. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:19 +02:00
Sheng Yang	464d17c8b7	KVM: VMX: Clean up magic number 0x66 in init_rmode_tss Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:19 +02:00
Dave Hansen	6ad18fba05	KVM: Reduce stack usage in kvm_pv_mmu_op() We're in a hot path. We can't use kmalloc() because it might impact performance. So, we just stick the buffer that we need into the kvm_vcpu_arch structure. This is used very often, so it is not really a waste. We also have to move the buffer structure's definition to the arch-specific x86 kvm header. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:18 +02:00
Dave Hansen	b772ff362e	KVM: Reduce stack usage in kvm_arch_vcpu_ioctl() [sheng: fix KVM_GET_LAPIC using wrong size] Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:18 +02:00
Dave Hansen	f0d662759a	KVM: Reduce kvm stack usage in kvm_arch_vm_ioctl() On my machine with gcc 3.4, kvm uses ~2k of stack in a few select functions. This is mostly because gcc fails to notice that the different case: statements could have their stack usage combined. It overflows very nicely if interrupts happen during one of these large uses. This patch uses two methods for reducing stack usage. 1. dynamically allocate large objects instead of putting on the stack. 2. Use a union{} member for all of the case variables. This tricks gcc into combining them all into a single stack allocation. (There's also a comment on this) Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:18 +02:00
Ben-Ami Yassour	4d5c5d0fe8	KVM: pci device assignment Based on a patch from: Amit Shah <amit.shah@qumranet.com> This patch adds support for handling PCI devices that are assigned to the guest. The device to be assigned to the guest is registered in the host kernel and interrupt delivery is handled. If a device is already assigned, or the device driver for it is still loaded on the host, the device assignment is failed by conveying a -EBUSY reply to the userspace. Devices that share their interrupt line are not supported at the moment. By itself, this patch will not make devices work within the guest. The VT-d extension is required to enable the device to perform DMA. Another alternative is PVDMA. Signed-off-by: Amit Shah <amit.shah@qumranet.com> Signed-off-by: Ben-Ami Yassour <benami@il.ibm.com> Signed-off-by: Weidong Han <weidong.han@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:18 +02:00
Marcelo Tosatti	3cf57fed21	KVM: PIT: fix injection logic and count The PIT injection logic is problematic under the following cases: 1) If there is a higher priority vector to be delivered by the time kvm_pit_timer_intr_post is invoked ps->inject_pending won't be set. This opens the possibility for missing many PIT event injections (say if guest executes hlt at this point). 2) ps->inject_pending is racy with more than two vcpus. Since there's no locking around read/dec of pt->pending, two vcpu's can inject two interrupts for a single pt->pending count. Fix 1 by using an irq ack notifier: only reinject when the previous irq has been acked. Fix 2 with appropriate locking around manipulation of pending count and irq_ack by the injection / ack paths. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:17 +02:00
Marcelo Tosatti	f52447261b	KVM: irq ack notification Based on a patch from: Ben-Ami Yassour <benami@il.ibm.com> which was based on a patch from: Amit Shah <amit.shah@qumranet.com> Notify IRQ acking on PIC/APIC emulation. The previous patch missed two things: - Edge triggered interrupts on IOAPIC - PIC reset with IRR/ISR set should be equivalent to ack (LAPIC probably needs something similar). Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> CC: Amit Shah <amit.shah@qumranet.com> CC: Ben-Ami Yassour <benami@il.ibm.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:16 +02:00
Avi Kivity	564f15378f	KVM: Add irq ack notifier list This can be used by kvm subsystems that are interested in when interrupts are acked, for example time drift compensation. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:16 +02:00
Alexander Graf	b5e2fec0eb	KVM: Ignore DEBUGCTL MSRs with no effect Netware writes to DEBUGCTL and reads from the DEBUGCTL and LAST*IP MSRs without further checks and is really confused to receive a #GP during that. To make it happy we should just make them stubs, which is exactly what SVM already does. Writes to DEBUGCTL that are vendor-specific are resembled to behave as if the virtual CPU does not know them. Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:15 +02:00
Avi Kivity	313dbd49dc	KVM: VMX: Avoid vmwrite(HOST_RSP) when possible Usually HOST_RSP retains its value across guest entries. Take advantage of this and avoid a vmwrite() when this is so. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:15 +02:00
Avi Kivity	80e31d4f61	KVM: SVM: Unify register save/restore across 32 and 64 bit hosts Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:14 +02:00
Avi Kivity	c801949ddf	KVM: VMX: Unify register save/restore across 32 and 64 bit hosts Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:14 +02:00
Jan Kiszka	77ab6db0a1	KVM: VMX: Reinject real mode exception As we execute real mode guests in VM86 mode, exception have to be reinjected appropriately when the guest triggered them. For this purpose the patch adopts the real-mode injection pattern used in vmx_inject_irq to vmx_queue_exception, additionally taking care that the IP is set correctly for #BP exceptions. Furthermore it extends handle_rmode_exception to reinject all those exceptions that can be raised in real mode. This fixes the execution of himem.exe from FreeDOS and also makes its debug.com work properly. Note that guest debugging in real mode is broken now. This has to be fixed by the scheduled debugging infrastructure rework (will be done once base patches for QEMU have been accepted). Signed-off-by: Jan Kiszka <jan.kiszka@web.de> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:14 +02:00
Jan Kiszka	19bd8afdc4	KVM: Consolidate XX_VECTOR defines Signed-off-by: Jan Kiszka <jan.kiszka@web.de> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:14 +02:00
Avi Kivity	7edd0ce058	KVM: Consolidate PIC isr clearing into a function Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:14 +02:00
Mohammed Gamal	60bd83a125	KVM: VMX: Remove redundant check in handle_rmode_exception Since checking for vcpu->arch.rmode.active is already done whenever we call handle_rmode_exception(), checking it inside the function is redundant. Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:13 +02:00
Avi Kivity	f7d9238f5d	KVM: VMX: Move interrupt post-processing to vmx_complete_interrupts() Instead of looking at failed injections in the vm entry path, move processing to the exit path in vmx_complete_interrupts(). This simplifes the logic and removes any state that is hidden in vmx registers. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:13 +02:00
Avi Kivity	937a7eaef9	KVM: Add a pending interrupt queue Similar to the exception queue, this hold interrupts that have been accepted by the virtual processor core but not yet injected. Not yet used. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:13 +02:00
Avi Kivity	35920a3569	KVM: VMX: Fix pending exception processing The vmx code assumes that IDT-Vectoring can only be set when an exception is injected due to the exception in question. That's not true, however: if the exception is injected correctly, and later another exception occurs but its delivery is blocked due to a fault, then we will incorrectly assume the first exception was not delivered. Fix by unconditionally dequeuing the pending exception, and requeuing it (or the second exception) if we see it in the IDT-Vectoring field. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:13 +02:00
Avi Kivity	26eef70c3e	KVM: Clear exception queue before emulating an instruction If we're emulating an instruction, either it will succeed, in which case any previously queued exception will be spurious, or we will requeue the same exception. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:13 +02:00
Avi Kivity	668f612fa0	KVM: VMX: Move nmi injection failure processing to vm exit path Instead of processing nmi injection failure in the vm entry path, move it to the vm exit path (vm_complete_interrupts()). This separates nmi injection from nmi post-processing, and moves the nmi state from the VT state into vcpu state (new variable nmi_injected specifying an injection in progress). Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:13 +02:00
Avi Kivity	cf393f7566	KVM: Move NMI IRET fault processing to new vmx_complete_interrupts() Currently most interrupt exit processing is handled on the entry path, which is confusing. Move the NMI IRET fault processing to a new function, vmx_complete_interrupts(), which is called on the vmexit path. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:12 +02:00
Avi Kivity	5b5c6a5a60	KVM: MMU: Simplify kvm_mmu_zap_page() The twisty maze of conditionals can be reduced. [joerg: fix tlb flushing] Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:12 +02:00
Avi Kivity	31aa2b44af	KVM: MMU: Separate the code for unlinking a shadow page from its parents Place into own function, in preparation for further cleanups. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:12 +02:00
Amit Shah	867767a365	KVM: Introduce kvm_set_irq to inject interrupts in guests This function injects an interrupt into the guest given the kvm struct, the (guest) irq number and the interrupt level. Signed-off-by: Amit Shah <amit.shah@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:15:12 +02:00
Marcelo Tosatti	5fdbf9765b	KVM: x86: accessors for guest registers As suggested by Avi, introduce accessors to read/write guest registers. This simplifies the ->cache_regs/->decache_regs interface, and improves register caching which is important for VMX, where the cost of vmcs_read/vmcs_write is significant. [avi: fix warnings] Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:13:57 +02:00
Sheng Yang	ca60dfbb69	KVM: VMX: Rename misnamed msr bits MSR_IA32_FEATURE_LOCKED is just a bit in fact, which shouldn't be prefixed with MSR_. So is MSR_IA32_FEATURE_VMXON_ENABLED. Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-10-15 10:13:57 +02:00
Ingo Molnar	0afe2db213	Merge branch 'x86/unify-cpu-detect' into x86-v28-for-linus-phase4-D Conflicts: arch/x86/kernel/cpu/common.c arch/x86/kernel/signal_64.c include/asm-x86/cpufeature.h	2008-10-11 20:23:20 +02:00
Sheng Yang	534e38b447	KVM: VMX: Always return old for clear_flush_young() when using EPT As well as discard fake accessed bit and dirty bit of EPT. Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-09-11 11:48:19 +03:00
Joerg Roedel	e5eab0cede	KVM: SVM: fix guest global tlb flushes with NPT Accesses to CR4 are intercepted even with Nested Paging enabled. But the code does not check if the guest wants to do a global TLB flush. So this flush gets lost. This patch adds the check and the flush to svm_set_cr4. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-09-11 11:39:25 +03:00
Joerg Roedel	44874f8491	KVM: SVM: fix random segfaults with NPT enabled This patch introduces a guest TLB flush on every NPF exit in KVM. This fixes random segfaults and #UD exceptions in the guest seen under some workloads (e.g. long running compile workloads or tbench). A kernbench run with and without that fix showed that it has a slowdown lower than 0.5% Cc: stable@kernel.org Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-09-11 11:31:53 +03:00
Sheng Yang	315a6558f3	x86: move VMX MSRs to msr-index.h They are hardware specific MSRs, and we would use them in virtualization feature detection later. Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-10 14:00:55 +02:00
Arjan van de Ven	beb20d52d0	hrtimer: convert kvm to the new hrtimer apis In order to be able to do range hrtimers we need to use accessor functions to the "expire" member of the hrtimer struct. This patch converts KVM to these accessors. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2008-09-05 21:35:07 -07:00
Avi Kivity	cd5998ebfb	KVM: MMU: Fix torn shadow pte The shadow code assigns a pte directly in one place, which is nonatomic on i386 can can cause random memory references. Fix by using an atomic setter. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-08-25 17:24:27 +03:00
Avi Kivity	ed84862433	KVM: Advertise synchronized mmu support to userspace Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-29 12:34:02 +03:00
Andrea Arcangeli	e930bffe95	KVM: Synchronize guest physical memory map to host virtual memory map Synchronize changes to host virtual addresses which are part of a KVM memory slot to the KVM shadow mmu. This allows pte operations like swapping, page migration, and madvise() to transparently work with KVM. Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-29 12:33:53 +03:00
Andrea Arcangeli	604b38ac03	KVM: Allow browsing memslots with mmu_lock This allows reading memslots with only the mmu_lock hold for mmu notifiers that runs in atomic context and with mmu_lock held. Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-29 12:33:50 +03:00
Andrea Arcangeli	a1708ce8a3	KVM: Allow reading aliases with mmu_lock This allows the mmu notifier code to run unalias_gfn with only the mmu_lock held. Only alias writes need the mmu_lock held. Readers will either take the slots_lock in read mode or the mmu_lock. Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-29 12:33:40 +03:00
Andrea Arcangeli	cddb8a5c14	mmu-notifiers: core With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages. There are secondary MMUs (with secondary sptes and secondary tlbs) too. sptes in the kvm case are shadow pagetables, but when I say spte in mmu-notifier context, I mean "secondary pte". In GRU case there's no actual secondary pte and there's only a secondary tlb because the GRU secondary MMU has no knowledge about sptes and every secondary tlb miss event in the MMU always generates a page fault that has to be resolved by the CPU (this is not the case of KVM where the a secondary tlb miss will walk sptes in hardware and it will refill the secondary tlb transparently to software if the corresponding spte is present). The same way zap_page_range has to invalidate the pte before freeing the page, the spte (and secondary tlb) must also be invalidated before any page is freed and reused. Currently we take a page_count pin on every page mapped by sptes, but that means the pages can't be swapped whenever they're mapped by any spte because they're part of the guest working set. Furthermore a spte unmap event can immediately lead to a page to be freed when the pin is released (so requiring the same complex and relatively slow tlb_gather smp safe logic we have in zap_page_range and that can be avoided completely if the spte unmap event doesn't require an unpin of the page previously mapped in the secondary MMU). The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know when the VM is swapping or freeing or doing anything on the primary MMU so that the secondary MMU code can drop sptes before the pages are freed, avoiding all page pinning and allowing 100% reliable swapping of guest physical address space. Furthermore it avoids the code that teardown the mappings of the secondary MMU, to implement a logic like tlb_gather in zap_page_range that would require many IPI to flush other cpu tlbs, for each fixed number of spte unmapped. To make an example: if what happens on the primary MMU is a protection downgrade (from writeable to wrprotect) the secondary MMU mappings will be invalidated, and the next secondary-mmu-page-fault will call get_user_pages and trigger a do_wp_page through get_user_pages if it called get_user_pages with write=1, and it'll re-establishing an updated spte or secondary-tlb-mapping on the copied page. Or it will setup a readonly spte or readonly tlb mapping if it's a guest-read, if it calls get_user_pages with write=0. This is just an example. This allows to map any page pointed by any pte (and in turn visible in the primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or an full MMU with both sptes and secondary-tlb like the shadow-pagetable layer with kvm), or a remote DMA in software like XPMEM (hence needing of schedule in XPMEM code to send the invalidate to the remote node, while no need to schedule in kvm/gru as it's an immediate event like invalidating primary-mmu pte). At least for KVM without this patch it's impossible to swap guests reliably. And having this feature and removing the page pin allows several other optimizations that simplify life considerably. Dependencies: 1) mm_take_all_locks() to register the mmu notifier when the whole VM isn't doing anything with "mm". This allows mmu notifier users to keep track if the VM is in the middle of the invalidate_range_begin/end critical section with an atomic counter incraese in range_begin and decreased in range_end. No secondary MMU page fault is allowed to map any spte or secondary tlb reference, while the VM is in the middle of range_begin/end as any page returned by get_user_pages in that critical section could later immediately be freed without any further ->invalidate_page notification (invalidate_range_begin/end works on ranges and ->invalidate_page isn't called immediately before freeing the page). To stop all page freeing and pagetable overwrites the mmap_sem must be taken in write mode and all other anon_vma/i_mmap locks must be taken too. 2) It'd be a waste to add branches in the VM if nobody could possibly run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of mmu notifiers, but this already allows to compile a KVM external module against a kernel with mmu notifiers enabled and from the next pull from kvm.git we'll start using them. And GRU/XPMEM will also be able to continue the development by enabling KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n). This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n. The mmu_notifier_register call can fail because mm_take_all_locks may be interrupted by a signal and return -EINTR. Because mmu_notifier_reigster is used when a driver startup, a failure can be gracefully handled. Here an example of the change applied to kvm to register the mmu notifiers. Usually when a driver startups other allocations are required anyway and -ENOMEM failure paths exists already. struct kvm kvm_arch_create_vm(void) { struct kvm kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL); + int err; if (!kvm) return ERR_PTR(-ENOMEM); INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); + kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops; + err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm); + if (err) { + kfree(kvm); + return ERR_PTR(err); + } + return kvm; } mmu_notifier_unregister returns void and it's reliable. The patch also adds a few needed but missing includes that would prevent kernel to compile after these changes on non-x86 archs (x86 didn't need them by luck). [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix mm/filemap_xip.c build] [akpm@linux-foundation.org: fix mm/mmu_notifier.c build] Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Cc: Jack Steiner <steiner@sgi.com> Cc: Robin Holt <holt@sgi.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Kanoj Sarcar <kanojsarcar@yahoo.com> Cc: Roland Dreier <rdreier@cisco.com> Cc: Steve Wise <swise@opengridcomputing.com> Cc: Avi Kivity <avi@qumranet.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Anthony Liguori <aliguori@us.ibm.com> Cc: Chris Wright <chrisw@redhat.com> Cc: Marcelo Tosatti <marcelo@kvack.org> Cc: Eric Dumazet <dada1@cosmosbay.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Izik Eidus <izike@qumranet.com> Cc: Anthony Liguori <aliguori@us.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-28 16:30:21 -07:00
Sheng Yang	5fdbcb9dd1	KVM: VMX: Fix undefined beaviour of EPT after reload kvm-intel.ko As well as move set base/mask ptes to vmx_init(). Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-27 11:34:10 +03:00
Sheng Yang	5ec5726a16	KVM: VMX: Fix bypass_guest_pf enabling when disable EPT in module parameter Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-27 11:34:10 +03:00
Marcelo Tosatti	c93cd3a588	KVM: task switch: translate guest segment limit to virt-extension byte granular field If 'g' is one then limit is 4kb granular. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-27 11:34:10 +03:00
Avi Kivity	577bdc4966	KVM: Avoid instruction emulation when event delivery is pending When an event (such as an interrupt) is injected, and the stack is shadowed (and therefore write protected), the guest will exit. The current code will see that the stack is shadowed and emulate a few instructions, each time postponing the injection. Eventually the injection may succeed, but at that time the guest may be unwilling to accept the interrupt (for example, the TPR may have changed). This occurs every once in a while during a Windows 2008 boot. Fix by unshadowing the fault address if the fault was due to an event injection. Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-27 11:34:10 +03:00
Marcelo Tosatti	34198bf842	KVM: task switch: use seg regs provided by subarch instead of reading from GDT There is no guarantee that the old TSS descriptor in the GDT contains the proper base address. This is the case for Windows installation's reboot-via-triplefault. Use guest registers instead. Also translate the address properly. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-07-27 11:34:09 +03:00

1 2 3 4 5 ...

369 Commits