Fix the following warnings:
WARNING: arch/x86/kernel/built-in.o(.exit.text+0xf8): Section mismatch in reference from the function msr_exit() to the variable .cpuinit.data:msr_class_cpu_notifier
WARNING: arch/x86/kernel/built-in.o(.exit.text+0x158): Section mismatch in reference from the function cpuid_exit() to the variable .cpuinit.data:cpuid_class_cpu_notifier
WARNING: arch/x86/kernel/built-in.o(.exit.text+0x171): Section mismatch in reference from the function microcode_exit() to the variable .cpuinit.data:mc_cpu_notifier
In all three cases there were a function annotated __exit
that referenced a variable annotated __cpuinitdata.
The fix was to replace the annotation of the notifier
with __refdata to tell modpost that the reference to
a _cpuinit function in the notifier are OK.
The unregister call that references the notifier
variable will simple delete the function pointer
so there is no problem ignoring the reference.
Note: This looks like another case where __cpuinit
has been used as replacement for proper use
of CONFIG_HOTPLUG_CPU to decide what code are used for
HOTPLUG_CPU.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Silence the following warning:
WARNING: o-x86_64/arch/x86/kernel/built-in.o(.text+0x17cd3): Section mismatch in reference from the function remove_cpu_from_maps() to the variable .cpuinit.data:cpu_initialized
remove_cpu:maps() had a single user: __cpu_disable() so
mark it static and annotate it with __ref to silence the
warning from modpost.
_cpu_disable() has a single user in kernel/cpu.c:
=> take_cpu_down()
which again has a single user in the following call:
=> __stop_machine_run(take_cpu_down, &tcd_param, cpu);
Here a kthread is created.
So maybe the warning is correct and the right fix is to
remove the __cpuinitdata annotation of cpu_initialized?
Note: The analysis were disturbed by the fact that we had a variable
with the same name in cpu/common.c - but this is 32 bit only]
Note: Should smpboot_64 use cpu_clear()?
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
for bzImage, the vmlinux_64.lds still have s32 bit code, and startup_32
should be 0. fix the comment.
Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Doing a make randconfig I came across this error in the Makefile.
This patch makes a directory out of arch/x86/mach-default for
CONFIG_X86_RDC321X
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch fixes a bug of early_ioremap_reset(), which had been fixed
before by "convert the boot time page table to the kernels native
format" patch. But that patch has been reverted now.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Because in i386 early boot stage, boot_cpu_data may be not available,
which makes clflush_cach_range() into infinite loop, which is called
by change_page_attr(). This patch fixes this by setting
boot_cpu_data.x86_clflush_size in early_cpu_detect().
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch replaces __change_page_attr_set_clr() with
change_page_attr_set_clr() in change_page_attr_clear() to flush the
TLB/cache properly.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
arch/x86/kernel/cpu/intel_cacheinfo.c:355:7: warning: symbol 'i' shadows an earlier one
arch/x86/kernel/cpu/intel_cacheinfo.c:296:39: originally declared here
arch/x86/kernel/cpu/intel_cacheinfo.c:367:18: warning: incorrect type in argument 2 (different signedness)
arch/x86/kernel/cpu/intel_cacheinfo.c:367:18: expected unsigned int *eax
arch/x86/kernel/cpu/intel_cacheinfo.c:367:18: got int *
arch/x86/kernel/cpu/intel_cacheinfo.c:367:28: warning: incorrect type in argument 3 (different signedness)
arch/x86/kernel/cpu/intel_cacheinfo.c:367:28: expected unsigned int *ebx
arch/x86/kernel/cpu/intel_cacheinfo.c:367:28: got int *
arch/x86/kernel/cpu/intel_cacheinfo.c:367:38: warning: incorrect type in argument 4 (different signedness)
arch/x86/kernel/cpu/intel_cacheinfo.c:367:38: expected unsigned int *ecx
arch/x86/kernel/cpu/intel_cacheinfo.c:367:38: got int *
arch/x86/kernel/cpu/intel_cacheinfo.c:367:48: warning: incorrect type in argument 5 (different signedness)
arch/x86/kernel/cpu/intel_cacheinfo.c:367:48: expected unsigned int *edx
arch/x86/kernel/cpu/intel_cacheinfo.c:367:48: got int *
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86:
alpha: fix x86.git merge build error
ia64: on UP percpu variables are not small memory model
x86: fix arch/x86/kernel/test_nx.c modular build bug
s390: use generic percpu linux-2.6.git
POWERPC: use generic per cpu
ia64: use generic percpu
SPARC64: use generic percpu
percpu: change Kconfig to HAVE_SETUP_PER_CPU_AREA
modules: fold percpu_modcopy into module.c
x86: export copy_from_user_ll_nocache[_nozero]
x86: fix duplicated TIF on 64-bit
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: (27 commits)
lguest: use __PAGE_KERNEL instead of _PAGE_KERNEL
lguest: Use explicit includes rateher than indirect
lguest: get rid of lg variable assignments
lguest: change gpte_addr header
lguest: move changed bitmap to lg_cpu
lguest: move last_pages to lg_cpu
lguest: change last_guest to last_cpu
lguest: change spte_addr header
lguest: per-vcpu lguest pgdir management
lguest: make pending notifications per-vcpu
lguest: makes special fields be per-vcpu
lguest: per-vcpu lguest task management
lguest: replace lguest_arch with lg_cpu_arch.
lguest: make registers per-vcpu
lguest: make emulate_insn receive a vcpu struct.
lguest: map_switcher_in_guest() per-vcpu
lguest: per-vcpu interrupt processing.
lguest: per-vcpu lguest timers
lguest: make hypercalls use the vcpu struct
lguest: make write() operation smp aware
...
Manual conflict resolved (maybe even correctly, who knows) in
drivers/lguest/x86/core.c
Migrating the apic timer in the critical section is not very nice, and is
absolutely horrible with the real-time port. Move migration to the regular
vcpu execution path, triggered by a new bitflag.
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Avi Kivity <avi@qumranet.com>
When preparing to enter the guest, if an interrupt comes in while
preemption is disabled but interrupts are still enabled, we miss a
preemption point. Fix by explicitly checking whether we need to
reschedule.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Otherwise we re-initialize the mmu caches, which will fail since the
caches are already registered, which will cause us to deinitialize said caches.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Right now rmap_remove won't set the page as dirty if the shadow pte
pointed to this page had write access and then it became readonly.
This patches fixes that, by setting the page as dirty for spte changes from
write to readonly access.
Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
When executing a test program called "crashme", we found the KVM guest cannot
survive more than ten seconds, then encounterd kernel panic. The basic concept
of "crashme" is generating random assembly code and trying to execute it.
After some fixes on emulator insn validity judgment, we found it's hard to
get the current emulator handle the invalid instructions correctly, for the
#UD trap for hypercall patching caused troubles. The problem is, if the opcode
itself was OK, but combination of opcode and modrm_reg was invalid, and one
operand of the opcode was memory (SrcMem or DstMem), the emulator will fetch
the memory operand first rather than checking the validity, and may encounter
an error there. For example, ".byte 0xfe, 0x34, 0xcd" has this problem.
In the patch, we simply check that if the invalid opcode wasn't vmcall/vmmcall,
then return from emulate_instruction() and inject a #UD to guest. With the
patch, the guest had been running for more than 12 hours.
Signed-off-by: Feng (Eric) Liu <eric.e.liu@intel.com>
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
If some other cpu steals mmu pages between our check and an attempt to
allocate, we can run out of mmu pages. Fix by moving the check into the
same critical section as the allocation.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Convert the synchronization of the shadow handling to a separate mmu_lock
spinlock.
Also guard fetch() by mmap_sem in read-mode to protect against alias
and memslot changes.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Since gfn_to_page() is a sleeping function, and we want to make the core mmu
spinlocked, we need to pass the page from the walker context (which can sleep)
to the shadow context (which cannot).
[marcelo: avoid recursive locking of mmap_sem]
Signed-off-by: Avi Kivity <avi@qumranet.com>
In preparation for a mmu spinlock, add kvm_read_guest_atomic()
and use it in fetch() and prefetch_page().
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Do not hold kvm->lock mutex across the entire pagefault code,
only acquire it in places where it is necessary, such as mmu
hash list, active list, rmap and parent pte handling.
Allow concurrent guest walkers by switching walk_addr() to use
mmap_sem in read-mode.
And get rid of the lockless __gfn_to_page.
[avi: move kvm_mmu_pte_write() locking inside the function]
[avi: add locking for real mode]
[avi: fix cmpxchg locking]
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This adds a mechanism for exposing the virtual apic tpr to the guest, and a
protocol for letting the guest update the tpr without causing a vmexit if
conditions allow (e.g. there is no interrupt pending with a higher priority
than the new tpr).
Signed-off-by: Avi Kivity <avi@qumranet.com>
Add a facility to report on accesses to the local apic tpr even if the
local apic is emulated in the kernel. This is basically a hack that
allows userspace to patch Windows which tends to bang on the tpr a lot.
Signed-off-by: Avi Kivity <avi@qumranet.com>
This can help diagnosing what the guest is trying to do. In many cases
we can get away with partial emulation of msrs.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Host side TLB flush can be merged together if multiple
spte need to be write-protected.
Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Moving kvm_vcpu_kick() to x86.c. Since it should be
common for all archs, put its declarations in <linux/kvm_host.h>
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Move ioapic code to common, since IA64 also needs it.
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This paves the way for multiple architecture support. Note that while
ioapic.c could potentially be shared with ia64, it is also moved.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Currently, make headers_check barfs due to <asm/kvm.h>, which <linux/kvm.h>
includes, not existing. Rather than add a zillion <asm/kvm.h>s, export kvm.h
only if the arch actually supports it.
Signed-off-by: Avi Kivity <avi@qumranet.com>
memnode.map is s16 array because of nodeid is 16 bit now.
so need to increase the nodemap_size according to that bits.
Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
one early crash on one 8 node 256g machine:
Command line: console=uart8250,io,0x3f8,115200n8 initrd=kernel.org/mydisk11_x86_64.gz rw root=/dev/ram0 debug initcall_debug apic=debug acpi.debug_level=0x0000000f pci=routeirq ip=dhcp load_ramdisk=1 ramdisk_size=131072 BOOT_IMAGE=kernel.org/bzImage_2.6.25_k8.1
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009bc00 (usable)
BIOS-e820: 000000000009bc00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e6000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000dffe0000 (usable)
BIOS-e820: 00000000dffe0000 - 00000000dffee000 (ACPI data)
BIOS-e820: 00000000dffee000 - 00000000dffff050 (ACPI NVS)
BIOS-e820: 00000000dffff050 - 00000000e0000000 (reserved)
BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
BIOS-e820: 00000000ff700000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000004020000000 (usable)
Early serial console at I/O port 0x3f8 (options '115200n8')
console [uart0] enabled
end_pfn_map = 67239936
Kernel panic - not syncing: Duplicated early reservation d40000-e42000
Pid: 0, comm: swapper Not tainted 2.6.24-smp-g5a514e21-dirty #3
Call Trace:
[<ffffffff80221545>] lapic_get_maxlvt+0x0/0x10
[<ffffffff80221657>] clear_local_APIC+0x5/0xcf
[<ffffffff80221726>] disable_local_APIC+0x5/0x17
[<ffffffff8021fe16>] smp_send_stop+0x46/0x4c
[<ffffffff80235293>] panic+0x94/0x13e
[<ffffffff80bc3b03>] sctp_eps_proc_init+0x12/0x34
[<ffffffff80b9f1c5>] reserve_early+0x30/0x6c
[<ffffffff80803925>] init_memory_mapping+0x2cd/0x2dc
[<ffffffff80b9dc01>] setup_arch+0x21f/0x44e
[<ffffffff80b978be>] start_kernel+0x6f/0x2c7
[<ffffffff80b971cc>] _sinittext+0x1cc/0x1d3
it turns out there is overlap between pgtable and bss...
in System.map we have
ffffffff80d40420 b rsi_table
ffffffff80d40620 B krb5_seq_lock
ffffffff80d40628 b i.20437
ffffffff80d40630 b xprt_rdma_inline_write_padding
ffffffff80d40638 b sunrpc_table_header
ffffffff80d40640 b zero
ffffffff80d40644 b min_memreg
ffffffff80d40648 b rpcrdma_tk_lock_g
ffffffff80d40650 B sctp_assocs_id_lock
ffffffff80d40658 B proc_net_sctp
ffffffff80d40660 B sctp_assocs_id
ffffffff80d40680 B sysctl_sctp_mem
ffffffff80d40690 B sysctl_sctp_rmem
ffffffff80d406a0 B sysctl_sctp_wmem
ffffffff80d406b0 b sctp_ctl_socket
ffffffff80d406b8 b sctp_pf_inet6_specific
ffffffff80d406c0 b sctp_pf_inet_specific
ffffffff80d406c8 b sctp_af_v4_specific
ffffffff80d406d0 b sctp_af_v6_specific
ffffffff80d406d8 b sctp_rand.33270
ffffffff80d406dc b sctp_memory_pressure
ffffffff80d406e0 b sctp_sockets_allocated
ffffffff80d406e4 b sctp_memory_allocated
ffffffff80d406e8 b sctp_sysctl_header
ffffffff80d406f0 b zero
ffffffff80d406f4 A __bss_stop
ffffffff80d406f4 A _end
need to round up table_start to PAGE_SIZE.
also make the panic more informative.
Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This just adds the PCI IDs of AMD's family 10h and 11h CPU's northbridges to
k8topology discovery.
Signed-off-by: Joachim Deguara <joachim.deguara@amd.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Put appropriate pagetable update hooks in so that paravirt knows
what's going on in there.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use a standard list threaded through page->lru for maintaining the pgd
list on PAE. This is the same as 64-bit, and seems saner than using a
non-standard list via page->index.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds a new configuration option, which adds support for a new
early_param which gets checked in arch/x86/kernel/setup_{32,64}.c:setup_arch()
to decide wether OHCI-1394 FireWire controllers should be initialized and
enabled for physical DMA access to allow remote debugging of early problems
like issues ACPI or other subsystems which are executed very early.
If the config option is not enabled, no code is changed, and if the boot
paramenter is not given, no new code is executed, and independent of that,
all new code is freed after boot, so the config option can be even enabled
in standard, non-debug kernels.
With specialized tools, it is then possible to get debugging information
from machines which have no serial ports (notebooks) such as the printk
buffer contents, or any data which can be referenced from global pointers,
if it is stored below the 4GB limit and even memory dumps of of the physical
RAM region below the 4GB limit can be taken without any cooperation from the
CPU of the host, so the machine can be crashed early, it does not matter.
In the extreme, even kernel debuggers can be accessed in this way. I wrote
a small kgdb module and an accompanying gdb stub for FireWire which allows
to gdb to talk to kgdb using remote remory reads and writes over FireWire.
An version of the gdb stub fore FireWire is able to read all global data
from a system which is running a a normal kernel without any kernel debugger,
without any interruption or support of the system's CPU. That way, e.g. the
task struct and so on can be read and even manipulated when the physical DMA
access is granted.
A HOWTO is included in this patch, in Documentation/debugging-via-ohci1394.txt
and I've put a copy online at
ftp://ftp.suse.de/private/bk/firewire/docs/debugging-via-ohci1394.txt
It also has links to all the tools which are available to make use of it
another copy of it is online at:
ftp://ftp.suse.de/private/bk/firewire/kernel/ohci1394_dma_early-v2.diff
Signed-Off-By: Bernhard Kaindl <bk@suse.de>
Tested-By: Thomas Renninger <trenn@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In x86 PAE mode, stop treating pmds as a special case. Previously
they were always allocated and freed with the pgd. The modifies the
code to be the same as 64-bit mode, where they are allocated on
demand.
This is a step on the way to unifying 32/64-bit pagetable allocation
as much as possible.
There is a complicating wart, however. When you install a new
reference to a pmd in the pgd, the processor isn't guaranteed to see
it unless you reload cr3. Since reloading cr3 also has the
side-effect of flushing the tlb, this is an expense that we want to
avoid whereever possible.
This patch simply avoids reloading cr3 unless the update is to the
current pagetable. Later patches will optimise this further.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: William Irwin <wli@holomorphy.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The change from current to tsk in do_page_fault is safe as
this is set at the very beginning of the function.
Removes a likely() annotation from the 64-bit version, this
could have instead been added to 32-bit.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When changing a kernel page from RO->RW, it's OK to leave stale TLB
entries around, since doing a global flush is expensive and they pose
no security problem. They can, however, generate a spurious fault,
which we should catch and simply return from (which will have the
side-effect of reloading the TLB to the current PTE).
This can occur when running under Xen, because it frequently changes
kernel pages from RW->RO->RW to implement Xen's pagetable semantics.
It could also occur when using CONFIG_DEBUG_PAGEALLOC, since it avoids
doing a global TLB flush after changing page permissions.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
On !PAE 32-bit, _PAGE_NX will be 0, making is_prefetch always
return early. The test is sufficient on PAE as __supported_pte_mask
is updated in the same places as nx_enabled in init_32.c which also
takes disable_nx into account.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>