This paves the way for multiple architecture support. Note that while
ioapic.c could potentially be shared with ia64, it is also moved.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Add printk_ratelimit check in front of printk. This prevents spamming
of the message during 32-bit ubuntu 6.06server install. Previously, it
would hang during the partition formatting stage.
Signed-off-by: Ryan Harper <ryanh@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch moves kvm_vm_stat to x86.h, and every arch
can define its own kvm_vm_stat in $arch.h
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patches moves two fields round_robin_prev_vcpu and tss to kvm_arch.
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patches moves two fields vpid and vioapic to kvm_arch
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patches create kvm_arch to hold arch-specific kvm fileds
and moves fields naliases and aliases to kvm_arch.
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patches moves kvm_vcpu_stat to x86.h, so every
arch can define its own kvm_vcpu_stat structure.
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patches removes KVM_COMM macro, original it is hold
kvm_vcpu common fields.
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patches moves kvm_vcpu definition to kvm.h, and finally
kvm.h includes x86.h.
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Since these functions need to know the details of kvm or kvm_vcpu structure,
it can't be put in x86.h. Create mmu.h to hold them.
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Move all the architecture-specific fields in kvm_vcpu into a new struct
kvm_vcpu_arch.
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Emulate cmpxchg8b atomically on i386. This is required to avoid a guest
pte walker from seeing a splitted write.
[avi: make it compile]
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This lets SVM ignore writes of the value 0 to the performance counter control
registers. Thus enabling them will still fail in the guest, but a write of 0
which keeps them disabled is accepted. This is required to boot Windows
Vista 64bit.
[avi: avoid fall-thru in switch statement]
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Markus Rechberger <markus.rechberger@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch fixes a compile error of the LAPIC code with APIC debugging enabled.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Markus Rechberger <markus.rechberger@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
There is a race where VCPU0 is shadowing a pagetable entry while VCPU1
is updating it, which results in a stale shadow copy.
Fix that by comparing the contents of the cached guest pte with the
current guest pte after write-protecting the guest pagetable.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
With this patch KVM on SVM will exit to userspace if the guest writes to CR8
and the in-kernel APIC is disabled.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Markus Rechberger <markus.rechberger@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
In addition to removing some duplicated code, this also handles the unlikely
case of real-mode code updating a guest page table. This can happen when
one vcpu (in real mode) touches a second vcpu's (in protected mode) page
tables, or if a vcpu switches to real mode, touches page tables, and switches
back.
Signed-off-by: Avi Kivity <avi@qumranet.com>
As set_pte() no longer references either a gpte or the guest walker, we can
move it out of paging mode dependent code (which compiles twice and is
generally nasty).
Signed-off-by: Avi Kivity <avi@qumranet.com>
When we emulate a guest pte write, we fail to apply the correct inherited
permissions from the parent ptes. Now that we store inherited permissions
in the shadow page, we can use that to update the pte permissions correctly.
Signed-off-by: Avi Kivity <avi@qumranet.com>
While the page table walker correctly generates a guest page fault
if a guest tries to execute a non-executable page, the shadow code does
not mark it non-executable. This means that if a guest accesses an nx
page first with a read access, then subsequent code fetch accesses will
succeed.
Fix by setting the nx bit on shadow ptes.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The nx bit is awkwardly placed in the 63rd bit position; furthermore it
has a reversed meaning compared to the other bits, which means we can't use
a bitwise and to calculate compounded access masks.
So, we simplify things by creating a new 3-bit exec/write/user access word,
and doing all calculations in that.
Signed-off-by: Avi Kivity <avi@qumranet.com>
In preparation for multi-threaded guest pte walking, use cmpxchg()
when updating guest pte's. This guarantees that the assignment of the
dirty bit can't be lost if two CPU's are faulting the same address
simultaneously.
[avi: fix kunmap_atomic() parameters]
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Stack instructions are always 64-bit on 64-bit mode; many of the
emulated stack instructions did not take that into account. Fix by
adding a 'Stack' bitflag and setting the operand size appropriately
during the decode stage (except for 'push r/m', which is in a group
with a few other instructions, so it gets its own treatment).
This fixes random crashes on Vista x64.
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch adds code to emulate the access to the cr8 register to the x86
instruction emulator in kvm. This is needed on svm, where there is no
hardware decode for control register access.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Markus Rechberger <markus.rechberger@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
With apic in userspace, we must exit to userspace after a cr8 write in order
to update the tpr. But if the apic is in the kernel, the exit is unnecessary.
Noticed by Joerg Roedel.
Signed-off-by: Avi Kivity <avi@qumranet.com>
We prepare eflags for the emulated instruction, then clobber it with an 'andl'.
Fix by popping eflags as the last thing in the sequence.
Patch taken from Xen (16143:959b4b92b6bf)
Signed-off-by: Avi Kivity <avi@qumranet.com>
Instead of each subarch doing its own thing, add an API for queuing an
injection, and manage failed exception injection centerally (i.e., if
an inject failed due to a shadow page fault, we need to requeue it).
Signed-off-by: Avi Kivity <avi@qumranet.com>
This abstracts the detail of x86 hlt and INIT modes into a function.
Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Change
dest_Loest_Prio -> IOAPIC_LOWEST_PRIORITY
dest_Fixed -> IOAPIC_FIXED
the original names are x86 specific, while the ioapic code will be reused
for ia64.
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch replaces lapic structure with kvm_vcpu in ioapic.c, making ioapic
independent of the local apic, as required by ia64.
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch removes the KVM specific defines for MSR_EFER that were being used
in the svm support file and migrates all references to use instead the ones
from the kernel headers that are used everywhere else and that have the same
values.
Signed-off-by: Carlo Marcelo Arenas Belon <carenas@sajinet.com.pe>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Currently, make headers_check barfs due to <asm/kvm.h>, which <linux/kvm.h>
includes, not existing. Rather than add a zillion <asm/kvm.h>s, export kvm.h
only if the arch actually supports it.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Unify the special instruction switch with the regular instruction switch,
and the two byte special instruction switch with the regular two byte
instruction switch. That makes it much easier to find an instruction or
the place an instruction needs to be added in.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Currently rep processing is handled somewhere in the middle of instruction
processing. Move it to a sensible place.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Add emulation for the cmps instruction. This lets OpenBSD boot on kvm.
Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Previous patches have removed the dependency on cr2; we can now stop passing
it to the emulator and rename uses to 'memop'.
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Mark guest pages as accessed when removed from the shadow page tables for
better lru processing.
Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
cmps and scas instructions accept repeat prefixes F3 and F2. So in
order to emulate those prefixed instructions we need to be able to know
if prefixes are REP/REPE/REPZ or REPNE/REPNZ. Currently kvm doesn't make
this distinction. This patch introduces this distinction.
Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Non-x86 archs don't need this mechanism. Move it to arch, and
keep its interface in common.
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The state of SECONDARY_VM_EXEC_CONTROL shouldn't depend on in-kernel IRQ chip,
this patch fix this.
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The current cpuid management suffers from several problems, which inhibit
passing through the host feature set to the guest:
- No way to tell which features the host supports
While some features can be supported with no changes to kvm, others
need explicit support. That means kvm needs to vet the feature set
before it is passed to the guest.
- No support for indexed or stateful cpuid entries
Some cpuid entries depend on ecx as well as on eax, or on internal
state in the processor (running cpuid multiple times with the same
input returns different output). The current cpuid machinery only
supports keying on eax.
- No support for save/restore/migrate
The internal state above needs to be exposed to userspace so it can
be saved or migrated.
This patch adds extended cpuid support by means of three new ioctls:
- KVM_GET_SUPPORTED_CPUID: get all cpuid entries the host (and kvm)
supports
- KVM_SET_CPUID2: sets the vcpu's cpuid table
- KVM_GET_CPUID2: gets the vcpu's cpuid table, including hidden state
[avi: fix original KVM_SET_CPUID not removing nx on non-nx hosts as it did
before]
Signed-off-by: Dan Kenigsberg <danken@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
These are traditionally named 'page', but even more traditionally, that name
is reserved for variables that point to a 'struct page'. Rename them to 'sp'
(for "shadow page").
Signed-off-by: Avi Kivity <avi@qumranet.com>
Converting a frame number to an address is tricky since the data type changes
size. Introduce a function to do it. This fixes an actual bug when
accessing guest ptes.
Signed-off-by: Avi Kivity <avi@qumranet.com>
If all we're doing is increasing permissions on a pte (typical for demand
paging), then there's not need to flush remote tlbs. Worst case they'll
get a spurious page fault.
Signed-off-by: Avi Kivity <avi@qumranet.com>
I spent an hour worrying why I see so many guest page faults on FC6 i386.
Turns out bypass wasn't implemented for nonpae. Implement it so it doesn't
happen again.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Split kvm_arch_vcpu_create() into kvm_arch_vcpu_create() and
kvm_arch_vcpu_setup(), enabling preemption notification between the two.
This mean that we can now do vcpu_load() within kvm_arch_vcpu_setup().
Signed-off-by: Avi Kivity <avi@qumranet.com>
Moving !user_alloc case to kvm_arch to avoid unnecessary
code logic in non-x86 platform.
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Instead of incrementally changing the mmu cache size for every memory slot
operation, recalculate it from scratch. This is simpler and safer.
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Instead of fetching one byte at a time, prefetch 15 bytes (or until the next
page boundary) to avoid guest page table walks.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Theoretically used to acccess memory known to be ordinary RAM, it was
never implemented. It is questionable whether it is possible to implement
it correctly.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Improve dirty bit setting for pages that kvm release, until now every page
that we released we marked dirty, from now only pages that have potential
to get dirty we mark dirty.
Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
When we map a page, we check whether some other vcpu mapped it for us and if
so, bail out. But we should decrease the refcount on the page as we do so.
Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Use kvm_write_guest_page() with empty_zero_page, instead of doing
kmap and memset.
Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Ensure that segment.base == segment.selector << 4 when entering the real
mode on Intel so that the CPU will not bark at us. This fixes some old
protected mode demo from http://www.x86.org/articles/pmbasics/tspec_a1_doc.htm.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Move out kvm_mmu init and exit functionality from kvm_main.c
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Meanwhile keep the interface in common, and leave as more logic in common
as possible.
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Instead of having each architecture do it individually, we
do this in the arch-independent code (just x86 as of now).
[avi: add svm to the mix, which was added to mainline during the
2.6.24-rc process]
Signed-off-by: Amit Shah <amit.shah@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This is a little more accurate (since it counts actual reloads, not potential
reloads), and reverses the sense of the statistic to measure a bad event like
most of the other stats (e.g. we want to minimize all counters).
Signed-off-by: Avi Kivity <avi@qumranet.com>
Add two arch hooks to handle kvm_create_vm and kvm destroy_vm. Now, just
put io_bus init and destory in common.
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Move kvm_vcpu_ioctl_translate to arch, since mmu would be put under arch.
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
We pass vcpu, vmx->fail, and vmx->launched to assembly code, but all three
are fields within vmx. Consolidate by only passing in vmx and offsets for
the rest.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Make KVM_CHECK_EXTENSION code into a function, all archs can define its
capability independently.
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The current 'lods' and 'stos' is depending on incoming CR2 rather than decode
memory address from registers.
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Will be called once arch module registers itself.
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Move some includes to x86.c from kvm_main.c, since the related functions
have been moved to x86.c
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This changes kvm_write_guest_page/kvm_read_guest_page to use
copy_to_user/read_from_user, as a result we get better speed
and better dirty bit tracking.
Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Convert a guest frame number to the corresponding host virtual address.
Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Check for the "error hva", an address outside the user address space that
signals a bad gfn.
Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Add wbinvd VM Exit support to prepare for pass-through
device cache emulation and also enhance real time
responsiveness.
Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
If vmx fails to inject a real-mode interrupt while fetching the interrupt
redirection table, it fails to record this in the vectoring information
field. So we detect this condition and do it ourselves.
Signed-off-by: Avi Kivity <avi@qumranet.com>
We'll want to write to it in order to fix real-mode irq injection problems,
but it is a read-only field. Storing it in a variable solves that issue.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Instead of injecting real-mode interrupts by writing the interrupt frame into
guest memory, abuse vmx by injecting a software interrupt. We need to
pretend the software interrupt instruction had a length > 0, so we have to
adjust rip backward.
This lets us not to mess with writing guest memory, which is complex and also
sleeps.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Every write access to guest pages should be tracked.
Signed-off-by: Dor Laor <dor.laor@qumranet.com>
Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Instructions like 'inc reg' that have the register operand encoded
in the opcode are currently specially decoded. Extend
decode_register_operand() to handle that case, indicated by having
DstReg or SrcReg without ModRM.
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch moves implementation of the following functions from
kvm_main.c to x86.c:
free_pio_guest_pages, vcpu_find_pio_dev, pio_copy_data, complete_pio,
kernel_pio, pio_string_write, kvm_emulate_pio, kvm_emulate_pio_string
The function inject_gp, which was duplicated by yesterday's patch
series, is removed from kvm_main.c now because it is not needed anymore.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Acked-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch moves the following functions to from kvm_main.c to x86.c:
emulator_read/write_std, vcpu_find_pervcpu_dev, vcpu_find_mmio_dev,
emulator_read/write_emulated, emulator_write_phys,
emulator_write_emulated_onepage, emulator_cmpxchg_emulated,
get_setment_base, emulate_invlpg, emulate_clts, emulator_get/set_dr,
kvm_report_emulation_failure, emulate_instruction
The following data type is moved to x86.c:
struct x86_emulate_ops emulate_ops
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Acked-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch moves the implementation of the functions of kvm_get/set_msr,
kvm_get/set_msr_common, and set_efer from kvm_main.c to x86.c. The
definition of EFER_RESERVED_BITS is moved too.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Acked-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
KVM's nopage handler calls gfn_to_page() which acquires the mmap_sem when
calling out to get_user_pages(). nopage handlers are already invoked with the
mmap_sem held though. Introduce a __gfn_to_page() for use by the nopage
handler which requires the lock to already be held.
This was noticed by tglx.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch based on CR8/TPR patch, and enable the TPR shadow (FlexPriority)
for 32bit Windows. Since TPR is accessed very frequently by 32bit
Windows, especially SMP guest, with FlexPriority enabled, we saw significant
performance gain.
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch moves the definitions of CR0_RESERVED_BITS,
CR4_RESERVED_BITS, and CR8_RESERVED_BITS along with the following
functions from kvm_main.c to x86.c:
set_cr0(), set_cr3(), set_cr4(), set_cr8(), get_cr8(), lmsw(),
load_pdptrs()
The static function wrapper inject_gp is duplicated in kvm_main.c and
x86.c for now, the version in kvm_main.c should disappear once the last
user of it is gone too.
The function load_pdptrs is no longer static, and now defined in x86.h
for the time being, until the last user of it is gone from kvm_main.c.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch moves the implementation of get_apic_base and set_apic_base
from kvm_main.c to x86.c
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch moves the definition of segment_descriptor_64 for AMD64 and
EM64T from kvm_main.c to segment_descriptor.h. It also adds a proper
#ifndef...#define...#endif around that header file.
The implementation of segment_base is moved from kvm_main.c to x86.c.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch splits kvm_vm_ioctl into archtecture independent parts, and
x86 specific parts which go to kvm_arch_vcpu_ioctl in x86.c.
The patch is unchanged since last submission.
Common ioctls for all architectures are:
KVM_CREATE_VCPU, KVM_GET_DIRTY_LOG, KVM_SET_USER_MEMORY_REGION
x86 specific ioctls are:
KVM_SET_MEMORY_REGION,
KVM_GET/SET_NR_MMU_PAGES, KVM_SET_MEMORY_ALIAS, KVM_CREATE_IRQCHIP,
KVM_CREATE_IRQ_LINE, KVM_GET/SET_IRQCHIP
KVM_SET_TSS_ADDR
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Emulation may cause a shadow pte to be instantiated, which requires
memory resources. Make sure the caches are filled to avoid an oops.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The code that dispatches the page fault and emulates if we failed to map
is duplicated across vmx and svm. Merge it to simplify further bugfixing.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The 'mov abs' instruction family (opcodes 0xa0 - 0xa3) still depends on cr2
provided by the page fault handler. This is wrong for several reasons:
- if an instruction accessed misaligned data that crosses a page boundary,
and if the fault happened on the second page, cr2 will point at the
second page, not the data itself.
- if we're emulating in real mode, or due to a FlexPriority exit, there
is no cr2 generated.
So, this change adds decoding for this instruction form and drops reliance
on cr2.
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch lets GCC to determine which registers to save when we
switch to/from a VCPU in the case of AMD i386
* Original code saves following registers:
ebx, ecx, edx, esi, edi, ebp
* Patched code:
- informs GCC that we modify following registers
using the clobber description:
ebx, ecx, edx, esi, edi
- rbp is saved (pop/push) because GCC seems to ignore its use in the clobber
description.
Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch lets GCC to determine which registers to save when we
switch to/from a VCPU in the case of AMD x86_64.
* Original code saves following registers:
rbx, rcx, rdx, rsi, rdi, rbp,
r8, r9, r10, r11, r12, r13, r14, r15
* Patched code:
- informs GCC that we modify following registers
using the clobber description:
rbx, rcx, rdx, rsi, rdi
r8, r9, r10, r11, r12, r13, r14, r15
- rbp is saved (pop/push) because GCC seems to ignore its use in the clobber
description.
Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch lets GCC to determine which registers to save when we
switch to/from a VCPU in the case of intel i386.
* Original code saves following registers:
eax, ebx, ecx, edx, edi, esi, ebp (using popa)
* Patched code:
- informs GCC that we modify following registers
using the clobber description:
ebx, edi, rsi
- doesn't save eax because it is an output operand (vmx->fail)
- cannot put ecx in clobber description because it is an input operand,
but as we modify it and we want to keep its value (vcpu), we must
save it (pop/push)
- ebp is saved (pop/push) because GCC seems to ignore its use the clobber
description.
- edx is saved (pop/push) because it is reserved by GCC (REGPARM) and
cannot be put in the clobber description.
- line "mov (%%esp), %3 \n\t" has been removed because %3
is ecx and ecx is restored just after.
Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch lets GCC to determine which registers to save when we
switch to/from a VCPU in the case of intel x86_64.
* Original code saves following registers:
rax, rbx, rcx, rdx, rsi, rdi, rbp,
r8, r9, r10, r11, r12, r13, r14, r15
* Patched code:
- informs GCC that we modify following registers
using the clobber description:
rbx, rdi, rsi,
r8, r9, r10, r11, r12, r13, r14, r15
- doesn't save rax because it is an output operand (vmx->fail)
- cannot put rcx in clobber description because it is an input operand,
but as we modify it and we want to keep its value (vcpu), we must
save it (pop/push)
- rbp is saved (pop/push) because GCC seems to ignore its use in the clobber
description.
- rdx is saved (pop/push) because it is reserved by GCC (REGPARM) and
cannot be put in the clobber description.
- line "mov (%%rsp), %3 \n\t" has been removed because %3
is rcx and rcx is restored just after.
- line ASM_VMX_VMWRITE_RSP_RDX() is moved out of the ifdef/else/endif
Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Currently kvm has a wart in that it requires three extra pages for use
as a tss when emulating real mode on Intel. This patch moves the allocation
internally, only requiring userspace to tell us where in the physical address
space we can place the tss.
Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Reserve a few memory slots for kernel internal use. This is good for case
you have to register memory region and you want to be sure it was not
registered from userspace, and for case you want to register a memory region
that won't be seen from userspace.
Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Remove kvm memory slot allocation mechanism from the ioctl
and put it to exported function.
Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
kvm_vm_ioctl_set_memory_region() is able to remove memory in addition to
adding it. Therefore when using kernel swapping support for old userspaces,
we need to munmap the memory if the user request to remove it
Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Split guest reset code out of vmx_vcpu_setup(). Besides being cleaner, this
moves the realmode tss setup (which can sleep) outside vmx_vcpu_setup()
(which is executed with preemption enabled).
[izik: remove unused variable]
Signed-off-by: Avi Kivity <avi@qumranet.com>
First step to split kvm_vcpu. Currently, we just use an macro to define
the common fields in kvm_vcpu for all archs, and all archs need to define
its own kvm_vcpu struct.
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Allocate a userspace buffer for older userspaces. Also eliminate phys_mem
buffer. The memset() in kvmctl really kills initial memory usage but swapping
works even with old userspaces.
A side effect is that maximum guest side is reduced for older userspace on
i386.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
ppc and s390 offer the possibility to track process times precisely
by looking at cpu timer on every context switch, irq, softirq etc.
We can use that infrastructure as well for guest time accounting.
We need to account the used time before we change the state.
This patch adds a call to account_system_vtime to kvm_guest_enter
and kvm_guest exit. If CONFIG_VIRT_CPU_ACCOUNTING is not set,
account_system_vtime is defined in hardirq.h as an empty function,
which means this patch does not change the behaviour on other
platforms.
I compile tested this patch on x86 and function tested the patch on
s390.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This allows guest memory to be swapped. Pages which are currently mapped
via shadow page tables are pinned into memory, but all other pages can
be freely swapped.
The patch makes gfn_to_page() elevate the page's reference count, and
introduces kvm_release_page() that pairs with it.
Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
In case the page is not present in the guest memory map, return a dummy
page the guest can scribble on.
This simplifies error checking in its users.
Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The current kvm mmu only reverse maps writable translation. This is used
to write-protect a page in case it becomes a pagetable.
But with swapping support, we need a reverse mapping of read-only pages as
well: when we evict a page, we need to remove any mapping to it, whether
writable or not.
Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Instruction: cmc, clc, cli, sti
opcodes: 0xf5, 0xf8, 0xfa, 0xfb respectively.
[avi: fix reference to EFLG_IF which is not defined anywhere]
Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Simplify the walker level loop not to carry so much information from one
loop to the next. In addition to being complex, this made kmap_atomic()
critical sections difficult to manage.
As a result of this change, kmap_atomic() sections are limited to actually
touching the guest pte, which allows the other functions called from the
walker to do sleepy operations. This will happen when we enable swapping.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Beside the obvious goodness of making code more common, this prevents
a livelock with the next patch which moves interrupt injection out of the
critical section.
Signed-off-by: Avi Kivity <avi@qumranet.com>
If no apic is enabled in the bitmap of an interrupt delivery with delivery
mode of lowest priority, a warning should be reported rather than select
a fallback vcpu
Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Eddie (Yaozu) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch splits kvm_vcpu_ioctl into archtecture independent parts, and
x86 specific parts which go to kvm_arch_vcpu_ioctl in x86.c.
Common ioctls for all architectures are:
KVM_RUN, KVM_GET/SET_(S-)REGS, KVM_TRANSLATE, KVM_INTERRUPT,
KVM_DEBUG_GUEST, KVM_SET_SIGNAL_MASK, KVM_GET/SET_FPU
Note that some PPC chips don't have an FPU, so we might need an #ifdef
around KVM_GET/SET_FPU one day.
x86 specific ioctls are:
KVM_GET/SET_LAPIC, KVM_SET_CPUID, KVM_GET/SET_MSRS
An interresting aspect is vcpu_load/vcpu_put. We now have a common
vcpu_load/put which does the preemption stuff, and an architecture
specific kvm_arch_vcpu_load/put. In the x86 case, this one calls the
vmx/svm function defined in kvm_x86_ops.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Since the mmu uses different shadow pages for dirty large pages and clean
large pages, this allows the mmu to drop ptes that are now invalid.
Signed-off-by: Avi Kivity <avi@qumranet.com>
By forcing clean huge pages to be read-only, we have separate roles
for the shadow of a clean large page and the shadow of a dirty large
page. This is necessary because different ptes will be instantiated
for the two cases, even for read faults.
Signed-off-by: Avi Kivity <avi@qumranet.com>
This is more consistent with the accessed bit management, and makes the dirty
bit available earlier for other purposes.
Signed-off-by: Avi Kivity <avi@qumranet.com>
This time, the biggest change is gpa_to_hpa. The translation of GPA to HPA does
not depend on the VCPU state unlike GVA to GPA so there's no need to pass in
the kvm_vcpu.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Some of the MMU functions take a struct kvm_vcpu even though they affect all
VCPUs. This patch cleans up some of them to instead take a struct kvm. This
makes things a bit more clear.
The main thing that was confusing me was whether certain functions need to be
called on all VCPUs.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Instead of having the kernel allocate memory to the guest, let userspace
allocate it and pass the address to the kernel.
This is required for s390 support, but also enables features like memory
sharing and using hugetlbfs backed memory.
Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Since vcpu->apic is of the correct type, there's not need to cast.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Move kvm_create_lapic() into kvm_vcpu_init(), rather than having svm
and vmx do it. And make it return the error rather than a fairly
random -ENOMEM.
This also solves the problem that neither svm.c nor vmx.c actually
handles the error path properly.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Instead of the asymetry of kvm_free_apic, implement kvm_free_lapic().
And guess what? I found a minor bug: we don't need to hrtimer_cancel()
from kvm_main.c, because we do that in kvm_free_apic().
Also:
1) kvm_vcpu_uninit should be the reverse order from kvm_vcpu_init.
2) Don't set apic->regs_page to zero before freeing apic.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The user is now able to set how many mmu pages will be allocated to the guest.
Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
When kvm uses user-allocated pages in the future for the guest, we won't
be able to use page->private for rmap, since page->rmap is reserved for
the filesystem. So we move the rmap base pointers to the memory slot.
A side effect of this is that we need to store the gfn of each gpte in
the shadow pages, since the memory slot is addressed by gfn, instead of
hfn like struct page.
Signed-off-by: Izik Eidus <izik@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Now that smp_call_function_single() knows how to call a function on the
current cpu, there's no need to check explicitly.
Signed-off-by: Avi Kivity <avi@qumranet.com>