Ticket spinlocks have absolutely ghastly worst-case performance
characteristics in a virtual environment. If there is any contention
for physical CPUs (ie, there are more runnable vcpus than cpus), then
ticket locks can cause the system to end up spending 90+% of its time
spinning.
The problem is that (v)cpus waiting on a ticket spinlock will be
granted access to the lock in strict order they got their tickets. If
the hypervisor scheduler doesn't give the vcpus time in that order,
they will burn timeslices waiting for the scheduler to give the right
vcpu some time. In the worst case it could take O(n^2) vcpu scheduler
timeslices for everyone waiting on the lock to get it, not counting
new cpus trying to take the lock while the log-jam is sorted out.
These hooks allow a paravirt backend to replace the spinlock
implementation.
At the very least, this could revert the implementation back to the
old lock algorithm, which allows the next scheduled vcpu to take the
lock, and has basically fairly good performance.
It also allows the spinlocks to take advantages of the hypervisor
features to make locks more efficient (spin and block, for example).
The cost to native execution is an extra direct call when using a
spinlock function. There's no overhead if CONFIG_PARAVIRT is turned
off.
The lock structure is fixed at a single "unsigned int", initialized to
zero, but the spinlock implementation can use it as it wishes.
Thanks to Thomas Friebel's Xen Summit talk "Preventing Guests from
Spinning Around" for pointing out this problem.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <clameter@linux-foundation.org>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: Virtualization <virtualization@lists.linux-foundation.org>
Cc: Xen devel <xen-devel@lists.xensource.com>
Cc: Thomas Friebel <thomas.friebel@amd.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Exceptions using paranoidentry need to have their exception frames
adjusted explicitly.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Now that the vdso32 code can cope with both syscall and sysenter
missing for 32-bit compat processes, just disable the features without
disabling vdso altogether.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
AMD only supports "syscall" from 32-bit compat usermode.
Intel and Centaur(?) only support "sysenter" from 32-bit compat usermode.
Set the X86 feature bits accordingly, and set up the vdso in
accordance with those bits. On the offchance we run on in a 64-bit
environment which supports neither syscall nor sysenter from 32-bit
mode, then fall back to the int $0x80 vdso.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
fix:
arch/x86/xen/built-in.o: In function `xen_enable_syscall':
(.cpuinit.text+0xdb): undefined reference to `sysctl_vsyscall32'
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Old versions of Xen (3.1 and before) don't support sysenter or syscall
from 32-bit compat userspaces. If we can't set the appropriate
syscall callback, then disable the corresponding feature bit, which
will cause the vdso32 setup to fall back appropriately.
Linux assumes that syscall is always available to 32-bit userspace,
and installs it by default if sysenter isn't available. In that case,
we just disable vdso altogether, forcing userspace libc to fall back
to int $0x80.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit 033786969d1d1b5af12a32a19d3a760314d05329.
Suresh Siddha reported that this broke booting on his 2GB testbox.
Reported-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix:
arch/x86/xen/enlighten.c: In function 'xen_set_fixmap':
arch/x86/xen/enlighten.c:1127: error: 'FIX_KMAP_BEGIN' undeclared (first use in this function)
arch/x86/xen/enlighten.c:1127: error: (Each undeclared identifier is reported only once
arch/x86/xen/enlighten.c:1127: error: for each function it appears in.)
arch/x86/xen/enlighten.c:1127: error: 'FIX_KMAP_END' undeclared (first use in this function)
make[1]: *** [arch/x86/xen/enlighten.o] Error 1
make: *** [arch/x86/xen/enlighten.o] Error 2
FIX_KMAP_BEGIN is only available on HIGHMEM.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Allow Xen to be enabled on 64-bit.
Also extend domain size limit from 8 GB (on 32-bit) to 32 GB on 64-bit.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
64-bit uses MSRs for important things like the base for fs and
gs-prefixed addresses. It's more efficient to use a hypercall to
update these, rather than go via the trap and emulate path.
Other MSR writes are just passed through; in an unprivileged domain
they do nothing, but it might be useful later.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
64-bit userspace expects the vdso to be mapped at a specific fixed
address, which happens to be in the middle of the kernel address
space. Because we have split user and kernel pagetables, we need to
make special arrangements for the vsyscall mapping to appear in the
kernel part of the user pagetable.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We set up entrypoints for syscall and sysenter. sysenter is only used
for 32-bit compat processes, whereas syscall can be used in by both 32
and 64-bit processes.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Because the x86_64 architecture does not enforce segment limits, Xen
cannot protect itself with them as it does in 32-bit mode. Therefore,
to protect itself, it runs the guest kernel in ring 3. Since it also
runs the guest userspace in ring3, the guest kernel must maintain a
second pagetable for its userspace, which does not map kernel space.
Naturally, the guest kernel pagetables map both kernel and userspace.
The userspace pagetable is attached to the corresponding kernel
pagetable via the pgd's page->private field. It is allocated and
freed at the same time as the kernel pgd via the
paravirt_pgd_alloc/free hooks.
Fortunately, the user pagetable is almost entirely shared with the
kernel pagetable; the only difference is the pgd page itself. set_pgd
will populate all entries in the kernel pagetable, and also set the
corresponding user pgd entry if the address is less than
STACK_TOP_MAX.
The user pagetable must be pinned and unpinned with the kernel one,
but because the pagetables are aliased, pgd_walk() only needs to be
called on the kernel pagetable. The user pgd page is then
pinned/unpinned along with the kernel pgd page.
xen_write_cr3 must write both the kernel and user cr3s.
The init_mm.pgd pagetable never has a user pagetable allocated for it,
because it can never be used while running usermode.
One awkward area is that early in boot the page structures are not
available. No user pagetable can exist at that point, but it
complicates the logic to avoid looking at the page structure.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We need to do this, otherwise we can get a GPF on hypercall return
after TLS descriptor is cleared but %fs is still pointing to it.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Implement the failsafe callback, so that iret and segment register
load exceptions are reported to the kernel.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Point the boot params cmd_line_ptr to the domain-builder-provided
command line.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Rewrite pgd_walk to deal with 64-bit address spaces. There are two
notible features of 64-bit workspaces:
1. The physical address is only 48 bits wide, with the upper 16 bits
being sign extension; kernel addresses are negative, and userspace is
positive.
2. The Xen hypervisor mapping is at the negative-most address, just above
the sign-extension hole.
1. means that we can't easily use addresses when traversing the space,
since we must deal with sign extension. This rewrite expresses
everything in terms of pgd/pud/pmd indices, which means we don't need
to worry about the exact configuration of the virtual memory space.
This approach works equally well in 32-bit.
To deal with 2, assume the hole is between the uppermost userspace
address and PAGE_OFFSET. For 64-bit this skips the Xen mapping hole.
For 32-bit, the hole is zero-sized.
In all cases, the uppermost kernel address is FIXADDR_TOP.
A side-effect of this patch is that the upper boundary is actually
handled properly, exposing a long-standing bug in 32-bit, which failed
to pin kernel pmd page. The kernel pmd is not shared, and so must be
explicitly pinned, even though the kernel ptes are shared and don't
need pinning.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The x86_64 interrupt subsystem is oriented towards vectors, as opposed
to a flat irq space as it is in x86-32. This patch adds a simple
identity irq->vector mapping so that we can continue to feed irqs into
do_IRQ() and get a good result.
Ideally x86_32 will unify with the 64-bit code and use vectors too.
At that point we can move to mapping event channels to vectors, which
will allow us to economise on irqs (so per-cpu event channels can
share irqs, rather than having to allocte one per cpu, for example).
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use callback_op hypercall to register callbacks in a 32/64-bit
independent way (64-bit doesn't need a code segment, but that detail
is hidden in XEN_CALLBACK).
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
swapgs is a no-op under Xen, because the hypervisor makes sure the
right version of %gs is current when switching between user and kernel
modes. This means that the swapgs "implementation" can be inlined and
used when the stack is unsafe (usermode). Unfortunately, it means
that disabling patching will result in a non-booting kernel...
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Xen pushes two extra words containing the values of rcx and r11. This
pvop hook copies the words back into their appropriate registers, and
cleans them off the stack. This leaves the stack in native form, so
the normal handler can run unchanged.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make Xen's set_pte_mfn() use set_pte_vaddr rather than copying it.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We need to wait until the page structure is available to use the
proper pagetable page alloc/release operations, since they use struct
page to determine if a pagetable is pinned.
This happened to work in 32bit because nobody allocated new pagetable
pages in the interim between xen_pagetable_setup_done and
xen_post_allocator_init, but the 64-bit kenrel needs to allocate more
pagetable levels.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Someone's got to do it.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When building initial pagetables in 64-bit kernel the pud/pmd pointer may
be in ioremap/fixmap space, so we need to walk the pagetable to look up the
physical address.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
arbitrary_virt_to_machine can truncate a machine address if its above
4G. Cast the problem away.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Rearrange the pagetable initialization to share code with the 64-bit
kernel. Rather than deferring anything to pagetable_setup_start, just
set up an initial pagetable in swapper_pg_dir early at startup, and
create an additional 8MB of physical memory mappings. This matches
the native head_32.S mappings to a large degree, and allows the rest
of the pagetable setup to continue without much Xen vs. native
difference.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Early in boot, map a chunk of extra physical memory for use later on.
We need a pool of mapped pages to allocate further pages to construct
pagetables mapping all physical memory.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It also doesn't need the 32-bit hack version of set_pte for initial
pagetable construction, so just make it use the real thing.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Set up the initial pagetables to map the kernel mapping into the
physical mapping space. This makes __va() usable, since it requires
physical mappings.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Split xen-asm into 32- and 64-bit files, and implement the 64-bit
variants.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add the Xen entrypoint and ELF notes to head_64.S. Adapts xen-head.S
to compile either 32-bit or 64-bit.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
A number of random changes to make xen/smp.c compile in 64-bit mode.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>a
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As a stopgap until Mike Travis's x86-64 gs-based percpu patches are
ready, provide workaround functions for x86_read/write_percpu for
Xen's use.
Specifically, this means that we can't really make use of vcpu
placement, because we can't use a single gs-based memory access to get
to vcpu fields. So disable all that for now.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move all the smp_ops setup into smp.c, allowing a lot of things to
become static.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
x86_64 stores the active_mm in the pda, so fetch it from there.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We need extra pv_mmu_ops for 64-bit, to deal with the extra level of
pagetable.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Copy 64-bit definitions of various interface structures into place.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We need set_pte to work from a relatively early point, so enable it
from the start.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
add xen_timer_resume() hook.
Timer resume should be done after event channel is resumed.
add xen_arch_resume() hook when ipi becomes usable after resume.
After resume, some cpu specific resource must be reinitialized
on ia64 that can't be set by another cpu.
However available hooks is run once on only one cpu so that ipi has
to be used.
During stop_machine_run() ipi can't be used because interrupt is masked.
So add another hook after stop_machine_run().
Another approach might be use resume hook which is run by
device_resume(). However device_resume() may be executed on
suspend error recovery path.
So it is necessary to determine whether it is executed on real resume path
or error recovery path.
Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Print a backtrace if a multicall fails, to help with debugging.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This allows Xen's xen_cpu_up() to allocate a pda for the new CPU.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Update arch/x86's use of page-aligned variables. The change to
arch/x86/xen/mmu.c fixes an actual bug, but the rest are cleanups
and to set a precedent.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
process_64.c:__switch_to has some very old strange formatting, some of
it dating back to pre-git. Fix it up.
No functional changes.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Early fixmap will allocate its own L1 pagetable page for fixmap
mappings, so there's no need to preallocate one.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Call paravirt_pagetable_setup_{start,done}
These paravirt_ops functions were not being called on x86_64.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Using ACPI to find free address space allows us to find a gap for the
unallocated PCI resources or MMIO resources for hotplug devices within
the BIOS allowed PCI regions.
It works by evaluating the _CRS object under PCI0 looking for producer
resources. Then searches the e820 memory space for a gap within these
producer resources.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Dave Hansen reported a build error on 32bit which went unnoticed
as newer gcc versions seem to optimize unused static functions
away before compiling them.
Make vread_tsc() depend on CONFIG_X86_64
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
fix merge fallout:
arch/x86/pci/amd_bus.c: In function ‘enable_pci_io_ecs':
arch/x86/pci/amd_bus.c:581: error: too many arguments to function ‘on_each_cpu'
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'timers/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: add PCI ID for 6300ESB force hpet
x86: add another PCI ID for ICH6 force-hpet
kernel-paramaters: document pmtmr= command line option
acpi_pm clccksource: fix printk format warning
nohz: don't stop idle tick if softirqs are pending.
pmtmr: allow command line override of ioport
nohz: reduce jiffies polling overhead
hrtimer: Remove unused variables in ktime_divns()
hrtimer: remove warning in hres_timers_resume
posix-timers: print RT watchdog message
On Mon, 2008-04-21 at 18:54 -0400, Masami Hiramatsu wrote:
> Thank you for reporting.
>
> Actually, kprobes tries to fixup thread's flags in post_kprobe_handler
> (which is called from kprobe_exceptions_notify) by
> trace_hardirqs_fixup_flags(pt_regs->flags). However, even the irq flag
> is set in pt_regs->flags, true hardirq is still off until returning
> from do_debug. Thus, lockdep assumes that hardirq is off without annotation.
>
> IMHO, one possible solution is that fixing hardirq flags right after
> notify_die in do_debug instead of in post_kprobe_handler.
My reply to BZ 10489:
> [ 2.707509] Kprobe smoke test started
> [ 2.709300] ------------[ cut here ]------------
> [ 2.709420] WARNING: at kernel/lockdep.c:2658 check_flags+0x4d/0x12c()
> [ 2.709541] Modules linked in:
> [ 2.709588] Pid: 1, comm: swapper Not tainted 2.6.25.jml.057 #1
> [ 2.709588] [<c0126acc>] warn_on_slowpath+0x41/0x51
> [ 2.709588] [<c010bafc>] ? save_stack_trace+0x1d/0x3b
> [ 2.709588] [<c0140a83>] ? save_trace+0x37/0x89
> [ 2.709588] [<c011987d>] ? kernel_map_pages+0x103/0x11c
> [ 2.709588] [<c0109803>] ? native_sched_clock+0xca/0xea
> [ 2.709588] [<c0142958>] ? mark_held_locks+0x41/0x5c
> [ 2.709588] [<c0382580>] ? kprobe_exceptions_notify+0x322/0x3af
> [ 2.709588] [<c0142aff>] ? trace_hardirqs_on+0xf1/0x119
> [ 2.709588] [<c03825b3>] ? kprobe_exceptions_notify+0x355/0x3af
> [ 2.709588] [<c0140823>] check_flags+0x4d/0x12c
> [ 2.709588] [<c0143c9d>] lock_release+0x58/0x195
> [ 2.709588] [<c038347c>] ? __atomic_notifier_call_chain+0x0/0x80
> [ 2.709588] [<c03834d6>] __atomic_notifier_call_chain+0x5a/0x80
> [ 2.709588] [<c0383508>] atomic_notifier_call_chain+0xc/0xe
> [ 2.709588] [<c013b6d4>] notify_die+0x2d/0x2f
> [ 2.709588] [<c038168a>] do_debug+0x67/0xfe
> [ 2.709588] [<c0381287>] debug_stack_correct+0x27/0x30
> [ 2.709588] [<c01564c0>] ? kprobe_target+0x1/0x34
> [ 2.709588] [<c0156572>] ? init_test_probes+0x50/0x186
> [ 2.709588] [<c04fae48>] init_kprobes+0x85/0x8c
> [ 2.709588] [<c04e947b>] kernel_init+0x13d/0x298
> [ 2.709588] [<c04e933e>] ? kernel_init+0x0/0x298
> [ 2.709588] [<c04e933e>] ? kernel_init+0x0/0x298
> [ 2.709588] [<c0105ef7>] kernel_thread_helper+0x7/0x10
> [ 2.709588] =======================
> [ 2.709588] ---[ end trace 778e504de7e3b1e3 ]---
> [ 2.709588] possible reason: unannotated irqs-off.
> [ 2.709588] irq event stamp: 370065
> [ 2.709588] hardirqs last enabled at (370065): [<c0382580>] kprobe_exceptions_notify+0x322/0x3af
> [ 2.709588] hardirqs last disabled at (370064): [<c0381bb7>] do_int3+0x1d/0x7d
> [ 2.709588] softirqs last enabled at (370050): [<c012b464>] __do_softirq+0xfa/0x100
> [ 2.709588] softirqs last disabled at (370045): [<c0107438>] do_softirq+0x74/0xd9
> [ 2.714751] Kprobe smoke test passed successfully
how I love this stuff...
Ok, do_debug() is a trap, this can happen at any time regardless of the
machine's IRQ state. So the first thing we do is fix up the IRQ state.
Then we call this die notifier stuff; and return with messed up IRQ
state... YAY.
So, kprobes fudges it..
notify_die(DIE_DEBUG)
kprobe_exceptions_notify()
post_kprobe_handler()
modify regs->flags
trace_hardirqs_fixup_flags(regs->flags); <--- must be it
So what's the use of modifying flags if they're not meant to take effect
at some point.
/me tries to reproduce issue; enable kprobes test thingy && boot
OK, that reproduces..
So the below makes it work - but I'm not getting this code; at the time
I wrote that stuff I CC'ed each and every kprobe maintainer listed in
the usual places but got no reposonse - can some please explain this
stuff to me?
Are the saved flags only for the TF bit or are they made in full effect
later (and if so, where) ?
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-2.6.27' of git://git.infradead.org/users/dwmw2/firmware-2.6: (64 commits)
firmware: convert sb16_csp driver to use firmware loader exclusively
dsp56k: use request_firmware
edgeport-ti: use request_firmware()
edgeport: use request_firmware()
vicam: use request_firmware()
dabusb: use request_firmware()
cpia2: use request_firmware()
ip2: use request_firmware()
firmware: convert Ambassador ATM driver to request_firmware()
whiteheat: use request_firmware()
ti_usb_3410_5052: use request_firmware()
emi62: use request_firmware()
emi26: use request_firmware()
keyspan_pda: use request_firmware()
keyspan: use request_firmware()
ttusb-budget: use request_firmware()
kaweth: use request_firmware()
smctr: use request_firmware()
firmware: convert ymfpci driver to use firmware loader exclusively
firmware: convert maestro3 driver to use firmware loader exclusively
...
Fix up trivial conflicts with BKL removal in drivers/char/dsp56k.c and
drivers/char/ip2/ip2main.c manually.
* 'core/printk' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, generic: mark early_printk as asmlinkage
printk: export console_drivers
printk: remember the message level for multi-line output
printk: refactor processing of line severity tokens
printk: don't prefer unsuited consoles on registration
printk: clean up recursion check related static variables
namespacecheck: more kernel/printk.c fixes
namespacecheck: fix kernel printk.c
* 'tracing/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (228 commits)
ftrace: build fix for ftraced_suspend
ftrace: separate out the function enabled variable
ftrace: add ftrace_kill_atomic
ftrace: use current CPU for function startup
ftrace: start wakeup tracing after setting function tracer
ftrace: check proper config for preempt type
ftrace: trace schedule
ftrace: define function trace nop
ftrace: move sched_switch enable after markers
ftrace: prevent ftrace modifications while being kprobe'd, v2
fix "ftrace: store mcount address in rec->ip"
mmiotrace broken in linux-next (8-bit writes only)
ftrace: avoid modifying kprobe'd records
ftrace: freeze kprobe'd records
kprobes: enable clean usage of get_kprobe
ftrace: store mcount address in rec->ip
ftrace: build fix with gcc 4.3
namespacecheck: fixes
ftrace: fix "notrace" filtering priority
ftrace: fix printout
...
John Keller reports that PCI config space access is broken on machines
with more than one domain. conf1 accesses only work for domain 0, so make sure
we check the domain number in the raw routines before trying conf1.
Reported-by: John Keller <jpk@sgi.com>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Explain that we set up the descriptors for Big Real Mode, and why we
do so. In particular, one system that is known to fail without it is
the Lenovo X61.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The explanation for recent video BIOS suspend quirk failures is that
the VESA BIOS expects to be entered in Big Real Mode (*.limit = 0xffffffff)
instead of ordinary Real Mode (*.limit = 0xffff).
This patch changes the segment descriptors to Big Real Mode instead.
The segment descriptor registers (what Intel calls "segment cache") is
always active. The only thing that changes based on CR0.PE is how it is
*loaded* and the interpretation of the CS flags.
The segment descriptor registers contain of the following sub-registers:
selector (the "visible" part), base, limit and flags. In protected mode
or long mode, they are loaded from descriptors (or fs.base or gs.base can
be manipulated directly in long mode.) In real mode, the only thing
changed by a segment register load is the selector and the base, where the
base <- selector << 4. In particular, *the limit and the flags are not
changed*.
As far as the handling of the CS flags: a code segment cannot be writable
in protected mode, whereas it is "just another segment" in real mode, so
there is some kind of quirk that kicks in for this when CR0.PE <- 0. I'm
not sure if this is accomplished by actually changing the cs.flags register
or just changing the interpretation; it might be something that is
CPU-specific. In particular, the Transmeta CPUs had an explicit "CS is
writable if you're in real mode" override, so even if you had loaded CS
with an execute-only segment it'd be writable (but not readable!) on return
to real mode. I'm not at all sure if that is how other CPUs behave.
Signed-off-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
when try to make hpet_enable use io_remap instead fixmap got
ioremap: invalid physical address fed00000
------------[ cut here ]------------
WARNING: at arch/x86/mm/ioremap.c:161 __ioremap_caller+0x8c/0x2f3()
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.26-rc9-tip-01873-ga9827e7-dirty #358
Call Trace:
[<ffffffff8026615e>] warn_on_slowpath+0x6c/0xa7
[<ffffffff802e2313>] ? __slab_alloc+0x20a/0x3fb
[<ffffffff802d85c5>] ? mpol_new+0x88/0x17d
[<ffffffff8022a4f4>] ? mcount_call+0x5/0x31
[<ffffffff8022a4f4>] ? mcount_call+0x5/0x31
[<ffffffff8024b0d2>] __ioremap_caller+0x8c/0x2f3
[<ffffffff80e86dbd>] ? hpet_enable+0x39/0x241
[<ffffffff8022a4f4>] ? mcount_call+0x5/0x31
[<ffffffff8024b466>] ioremap_nocache+0x2a/0x40
[<ffffffff80e86dbd>] hpet_enable+0x39/0x241
[<ffffffff80e7a1f6>] hpet_time_init+0x21/0x4e
[<ffffffff80e730e9>] start_kernel+0x302/0x395
[<ffffffff80e722aa>] x86_64_start_reservations+0xb9/0xd4
[<ffffffff80e722fe>] ? x86_64_init_pda+0x39/0x4f
[<ffffffff80e72400>] x86_64_start_kernel+0xec/0x107
---[ end trace a7919e7f17c0a725 ]---
it seems for amd system that is set later...
try to move setting early in early_identify_cpu.
and remove same code for intel and centaur.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
only add direct mapping for aperture
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Strengthen the return type for the _node_to_cpumask_ptr to be
a const pointer. This adds compiler checking to insure that
node_to_cpumask_map[] is not changed inadvertently.
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: "akpm@linux-foundation.org" <akpm@linux-foundation.org>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Acked-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that IRQ2 is never made available to the I/O APIC, there is no need
to special-case it and mask as a workaround for broken systems. Actually,
because of the former, mask_IO_APIC_irq(2) is a no-op already.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
got this on a test-system:
calling numaq_tsc_disable+0x0/0x39
NUMAQ: disabling TSC
initcall numaq_tsc_disable+0x0/0x39 returned 0 after 0 msecs
that's because we should not be using arch_initcall to call numaq_tsc_disable.
need to call it in setup_arch before time_init()/tsc_init()
and call it in init_intel() to make the cpu feature bits right.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
end_user_pfn used to modify the meaning of the e820 maps.
Now that all e820 operations are cleaned up, unified, tightened up,
the e820 map always get updated to reality, we don't need to keep
this secondary mechanism anymore.
If you hit this commit in bisection it means something slipped through.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
optimization: try to merge the range with same page size in
init_memory_mapping, to get the best possible linear mappings set up.
thus when GBpages is not there, we could do 2M pages.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
tighten the boundary checks around max_low_pfn_mapped - dont overmap
nor undermap into holes.
also print out tseg for AMD cpus, for diagnostic purposes.
(this is an SMM area, and we split up any big mappings around that area)
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On three of the several paths in entry_64.S that call
do_notify_resume() on the way back to user mode, we fail to properly
check again for newly-arrived work that requires another call to
do_notify_resume() before going to user mode. These paths set the
mask to check only _TIF_NEED_RESCHED, but this is wrong. The other
paths that lead to do_notify_resume() do this correctly already, and
entry_32.S does it correctly in all cases.
All paths back to user mode have to check all the _TIF_WORK_MASK
flags at the last possible stage, with interrupts disabled.
Otherwise, we miss any flags (TIF_SIGPENDING for example) that were
set any time after we entered do_notify_resume(). More work flags
can be set (or left set) synchronously inside do_notify_resume(), as
TIF_SIGPENDING can be, or asynchronously by interrupts or other CPUs
(which then send an asynchronous interrupt).
There are many different scenarios that could hit this bug, most of
them races. The simplest one to demonstrate does not require any
race: when one signal has done handler setup at the check before
returning from a syscall, and there is another signal pending that
should be handled. The second signal's handler should interrupt the
first signal handler before it actually starts (so the interrupted PC
is still at the handler's entry point). Instead, it runs away until
the next kernel entry (next syscall, tick, etc).
This test behaves correctly on 32-bit kernels, and fails on 64-bit
(either 32-bit or 64-bit test binary). With this fix, it works.
#define _GNU_SOURCE
#include <stdio.h>
#include <signal.h>
#include <string.h>
#include <sys/ucontext.h>
#ifndef REG_RIP
#define REG_RIP REG_EIP
#endif
static sig_atomic_t hit1, hit2;
static void
handler (int sig, siginfo_t *info, void *ctx)
{
ucontext_t *uc = ctx;
if ((void *) uc->uc_mcontext.gregs[REG_RIP] == &handler)
{
if (sig == SIGUSR1)
hit1 = 1;
else
hit2 = 1;
}
printf ("%s at %#lx\n", strsignal (sig),
uc->uc_mcontext.gregs[REG_RIP]);
}
int
main (void)
{
struct sigaction sa;
sigset_t set;
sigemptyset (&sa.sa_mask);
sa.sa_flags = SA_SIGINFO;
sa.sa_sigaction = &handler;
if (sigaction (SIGUSR1, &sa, NULL)
|| sigaction (SIGUSR2, &sa, NULL))
return 2;
sigemptyset (&set);
sigaddset (&set, SIGUSR1);
sigaddset (&set, SIGUSR2);
if (sigprocmask (SIG_BLOCK, &set, NULL))
return 3;
printf ("main at %p, handler at %p\n", &main, &handler);
raise (SIGUSR1);
raise (SIGUSR2);
if (sigprocmask (SIG_UNBLOCK, &set, NULL))
return 4;
if (hit1 + hit2 == 1)
{
puts ("PASS");
return 0;
}
puts ("FAIL");
return 1;
}
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We have two conflicting DMA-based quirks in there for the same set of
boxes (HP nx6325 and nx6125) and one of them actually breaks my box.
So remove the extra code.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: =?iso-8859-1?q?T=F6r=F6k_Edwin?= <edwintorok@gmail.com>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I don't know, if this new code boots, but at least it
compiles. Someone should really test it.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Robert Richter <robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In the course of the recent unification of the NMI watchdog an assignment
to timer_ack to switch off unnecesary POLL commands to the 8259A in the
case of a watchdog failure has been accidentally removed. The statement
used to be limited to the 32-bit variation as since the rewrite of the
timer code it has been relevant for the 82489DX only. This change brings
it back.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There is no such entity as ISA IRQ2. The ACPI spec does not make it
explicitly clear, but does not preclude it either -- all it says is ISA
legacy interrupts are identity mapped by default (subject to overrides),
but it does not state whether IRQ2 exists or not. As a result if there is
no IRQ0 override, then IRQ2 is normally initialised as an ISA interrupt,
which implies an edge-triggered line, which is unmasked by default as this
is what we do for edge-triggered I/O APIC interrupts so as not to miss an
edge.
To the best of my knowledge it is useless, as IRQ2 has not been in use
since the PC/AT as back then it was taken by the 8259A cascade interrupt
to the slave, with the line position in the slot rerouted to newly-created
IRQ9. No device could thus make use of this line with the pair of 8259A
chips. Now in theory INTIN2 of the I/O APIC may be usable, but the
interrupt of the device wired to it would not be available in the PIC mode
at all, so I seriously doubt if anybody decided to reuse it for a regular
device.
However there are two common uses of INTIN2. One is for IRQ0, with an
ACPI interrupt override (or its equivalent in the MP table). But in this
case IRQ2 is gone entirely with INTIN0 left vacant. The other one is for
an 8959A ExtINTA cascade. In this case IRQ0 goes to INTIN0 and if ACPI is
used INTIN2 is assumed to be IRQ2 (there is no override and ACPI has no
way to report ExtINTA interrupts). This is where a problem happens.
The problem is INTIN2 is configured as a native APIC interrupt, with a
vector assigned and the mask cleared. And the line may indeed get active
and inject interrupts if the master 8959A has its timer interrupt enabled
(it might happen for other interrupts too, but they are normally masked in
the process of rerouting them to the I/O APIC). There are two cases where
it will happen:
* When the I/O APIC NMI watchdog is enabled. This is actually a misnomer
as the watchdog pulses are delivered through the 8259A to the LINT0
inputs of all the local APICs in the system. The implication is the
output of the master 8259A goes high and low repeatedly, signalling
interrupts to INTIN2 which is enabled too!
[The origin of the name is I think for a brief period during the
development we had a capability in our code to configure the watchdog to
use an I/O APIC input; that would be INTIN2 in this scenario.]
* When the native route of IRQ0 via INTIN0 fails for whatever reason -- as
it happens with the system considered here. In this scenario the timer
pulse is delivered through the 8259A to LINT0 input of the local APIC of
the bootstrap processor, quite similarly to how is done for the watchdog
described above. The result is, again, INTIN2 receives these pulses
too. Rafael's system used to escape this scenario, because an incorrect
IRQ0 override would occupy INTIN2 and prevent it from being unmasked.
My conclusion is IRQ2 should be excluded from configuration in all the
cases and the current exception for ACPI systems should be lifted. The
reason being the exception not only being useless, but harmful as well.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Unlike the 32-bit one, the 64-bit variation of the LVT0 setup code for
the "8259A Virtual Wire" through the local APIC timer configuration does
not fully configure the relevant irq_chip structure. Instead it relies on
the preceding I/O APIC code to have set it up, which does not happen if
the I/O APIC variants have not been tried.
The patch includes corresponding changes to the 32-bit variation too
which make them both the same, barring a small syntactic difference
involving sequence of functions in the source. That should work as an aid
with the upcoming merge.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
IRQ0 is edge-triggered, but the "8259A Virtual Wire" through the local
APIC configuration in the 32-bit version uses the "fasteoi" handler
suitable for level-triggered APIC interrupt. Rewrite code so that the
"edge" handler is used. The 64-bit version uses different code and is
unaffected.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The RING0_INT_FRAME macro defines a CFI_STARTPROC.
So we should really be using CFI_ENDPROC after it.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch removes the memset from the data structure initialization code and
allocate the structures with the __GFP_ZERO flag.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Cc: iommu@lists.linux-foundation.org
Cc: bhavna.sarathy@amd.com
Cc: robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch removes an unneeded initialization from the alloc_command_buffer
function and replaces a memset with __GFP_ZERO.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Cc: iommu@lists.linux-foundation.org
Cc: bhavna.sarathy@amd.com
Cc: robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use the X86_FEATURE_SYSENTER32 to remove hard-coded CPU vendor check.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add pseudo-feature bits to describe whether the CPU supports sysenter
and/or syscall from ia32-compat userspace. This removes a hardcoded
test in vdso32-setup.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Yinghai Lu reported crashes on 64-bit x86:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
IP: [<ffffffff80253b17>] hrtick_start_fair+0x89/0x173
[...]
And with a long session of debugging and a lot of difficulty, tracked it down
to this commit:
--------------->
8fbbc4b45c is first bad commit
commit 8fbbc4b45c
Author: Alok Kataria <akataria@vmware.com>
Date: Tue Jul 1 11:43:34 2008 -0700
x86: merge tsc_init and clocksource code
<--------------
The problem is that the TSC unification missed these Makefile rules
in arch/x86/kernel/Makefile:
# Do not profile debug and lowlevel utilities
CFLAGS_REMOVE_tsc_64.o = -pg
CFLAGS_REMOVE_tsc_32.o = -pg
...
CFLAGS_tsc_64.o := $(nostackp)
...
which rules make sure that various instrumentation and debugging
facilities are disabled for code that might end up in a VDSO - such as
the TSC code.
Reported-and-bisected-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Conflicts:
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As other IOMMUs do, this puts dummy pci_swiotlb_init() in swiotlb.h
and remove ifdef CONFIG_SWIOTLB in pci-dma.c.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Muli Ben-Yehuda <muli@il.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
asm-x86/calgary.h has dummy calgary_iommu_init() and detect_calgary()
in !CONFIG_CALGARY_IOMMU case. So we don't need ifdef
CONFIG_CALGARY_IOMMU in pci-dma.c.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Muli Ben-Yehuda <muli@il.ibm.com>
Cc: Alexis Bruemmer <alexisb@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Our way to handle gart_* functions for CONFIG_GART_IOMMU and
!CONFIG_GART_IOMMU cases is inconsistent.
We have some dummy gart_* functions in !CONFIG_GART_IOMMU case and
also use ifdef CONFIG_GART_IOMMU tricks in pci-dma.c to call some
gart_* functions in only CONFIG_GART_IOMMU case.
This patch removes ifdef CONFIG_GART_IOMMU in pci-dma.c and always use
dummy gart_* functions in iommu.h.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Muli Ben-Yehuda <muli@il.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
gart.h has only GART-specific stuff. Only GART code needs it. Other
IOMMU stuff should include iommu.h instead of gart.h.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Muli Ben-Yehuda <muli@il.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
when more than 4g memory is installed, don't map the big hole below 4g.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
save the SLIT, in case we are using fixmap to read it, and that fixmap
could be cleared by others.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
also let mem= to print out modified e820 map too
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Gas 2.15 complains about 32-bit registers being used in lea.
AS arch/x86/lib/copy_user_64.o
/local/scratch-2/jeremy/hg/xen/paravirt/linux/arch/x86/lib/copy_user_64.S: Assembler messages:
/local/scratch-2/jeremy/hg/xen/paravirt/linux/arch/x86/lib/copy_user_64.S:188: Error: `(%edx,%ecx,8)' is not a valid 64 bit base/index expression
/local/scratch-2/jeremy/hg/xen/paravirt/linux/arch/x86/lib/copy_user_64.S:257: Error: `(%edx,%ecx,8)' is not a valid 64 bit base/index expression
AS arch/x86/lib/copy_user_nocache_64.o
/local/scratch-2/jeremy/hg/xen/paravirt/linux/arch/x86/lib/copy_user_nocache_64.S: Assembler messages:
/local/scratch-2/jeremy/hg/xen/paravirt/linux/arch/x86/lib/copy_user_nocache_64.S:107: Error: `(%edx,%ecx,8)' is not a valid 64 bit base/index expression
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix:
arch/x86/pci/built-in.o: In function `pci_subsys_init':
visws.c:(.init.text+0xfc5): undefined reference to `pci_direct_conf1'
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix:
arch/x86/kernel/built-in.o: In function `visws_early_detect':
: undefined reference to `mach_get_smp_config_quirk'
arch/x86/kernel/built-in.o: In function `visws_early_detect':
: undefined reference to `mach_find_smp_config_quirk'
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Integration generated a duplicate call to use_tsc_delay.
Particularly, the one that is done before we check for general
tsc usability seems wrong.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix:
arch/x86/kernel/visws_quirks.c: In function ‘visws_early_detect’:
arch/x86/kernel/visws_quirks.c:293: error: ‘no_broadcast’ undeclared (first use in this function)
arch/x86/kernel/visws_quirks.c:293: error: (Each undeclared identifier is reported only once
arch/x86/kernel/visws_quirks.c:293: error: for each function it appears in.)
make[1]: *** [arch/x86/kernel/visws_quirks.o] Error 1
make: *** [arch/x86/kernel/visws_quirks.o] Error 2
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Patch d49c4288 (tip/x86/mpparse) introduced some changes in calling
subsys_init calls if CONFIG_X86_NUMAQ option is set. This patch
updates subsystem initalization according to this changes.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Robert Richter <robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
this is the big move: flip over VISWS to generic arch support.
From this commit on CONFIG_X86_VISWS is just another (default-disabled)
option that turns on certain quirks - no other complications.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
copy arch/x86/mach-visws/setup_visws.c, apic_visws.c and traps_visws.c
files to arch/x86/kernel/, in preparation of the switchover to a
non-subarch setup for VISWS.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
add early quirk support to the generic architecture code.
this allows VISWS to be supported by the generic code and allows us
to remove the VISWS subarch.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
move the include/asm-x86/mach-visws/ VISWS specific hardware
details include files into include/asm-x86/visws, to be used from
generic code.
No code changed.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
first step: make the VISWS subarch boot on a regular PC.
We take various shortcuts for that. We copy the generic arch setup file over
into the VISWS setup file.
This is the only step that is not expected to boot on a real VISWS.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add early quirks support.
In preparation of enabling the generic architecture to boot on a VISWS.
This will allow us to remove the VISWS subarch and all its complications.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
these two sub-architectures want PCI to be default-on, not default-off.
Reported-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When an interrupt is rerouted to a different I/O APIC pin the relevant
entry of the irq_2_pin list should get updated accordingly so that
operations are performed on the correct redirection entry.
This is already done by the 32-bit variation of the code and here is a
complementing 64-bit implementation. Should make someone's decision less
tough when merging the two. ;)
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit 90221a61a71b7ad659d8741cf1e404506b174982.
This too was just temporary diagnostics - not needed now that we've
got the final fix via:
| commit e2079c4386
| Author: Rafael J. Wysocki <rjw@sisk.pl>
| Date: Tue Jul 8 16:12:26 2008 +0200
|
| x86: fix C1E && nx6325 stability problem
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit a74a1cc3df0be89658bc735c8aed80c8392e2c15.
This was just temporary diagnostics commit - not needed now that we've
got the final fix.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
AMD_IOMMU should depend on IOMMU_HELPER since they are the IOMMU
helper functions. SWIOTLB requires IOMMU_HELPER so declaring that
AMD_IOMMU depends on SWIOTLB properly fixes the problems.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add ioremap_default(), which gives a sane mapping without worrying about
type conflicts.
Use it in /dev/mem read in place of ioremap(), as with ioremap(),
any mapping of the region (other than UC_MINUS) will cause a conflict
and failure of /dev/mem read.
Should address the vbetest failure reported at:
http://bugzilla.kernel.org/show_bug.cgi?id=11057
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
even on 64bit systems with less than 4G RAM, we can now use fixmap
to handle acpi SIT near end of ram.
change e820_end to e820_end_of_ram again?
or e820_ram_pfn?
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
and let 64-bit to fall back to use fixmap too.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix phys_pmd_init to make sure not to return bigger value than end.
also print out range split:1G/2M/4K in init_memory_mapping().
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
to avoid warning from find_low_pfn_range for high pages size etc
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
So far subsys_initcalls has been executed in this order depending on
the object order in the Makefile:
arch/x86/pci/visws.c:subsys_initcall(pcibios_init);
arch/x86/pci/numa.c:subsys_initcall(pci_numa_init);
arch/x86/pci/acpi.c:subsys_initcall(pci_acpi_init);
arch/x86/pci/legacy.c:subsys_initcall(pci_legacy_init);
arch/x86/pci/irq.c:subsys_initcall(pcibios_irq_init);
arch/x86/pci/common.c:subsys_initcall(pcibios_init);
This patch removes the ordering dependency. There is now only one
subsys_initcall function that contains subsystem initialization code
with a defined order.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This should be safe since mmconfig*.o and init.o do not contain
*initcalls with the same level as in other files.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix:
arch/x86/kernel/acpi/boot.c: In function ‘dmi_ignore_irq0_timer_override’:
arch/x86/kernel/acpi/boot.c:1443: error: implicit declaration of function ‘force_mask_ioapic_irq_2’
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The problems are that, with the ACPI vs timer overring issue _fixed_,
after using the box for some time (between several seconds and 1 hour, at
random) processes get very high CPU loads (once I've got X using 107% of
the CPU, for example) and the system becomes unresponsive, as though there
were interrupts lost or something similar.
Andreas Herrman reproduced similar problems:
> Ok, now I've reproduced the stability problem.
> - Using tip/master,
> - reverting e38502eb8aa82314d5ab0eba45f50e6790dadd88 and
> - applying your patch from this posting
> http://marc.info/?l=linux-kernel&m=121539354224562&w=4
>
> Starting X, firefox, gimp, tuxpaint and doing some drawing in tuxpaint
> results in a slow system. Drawing is almost not possible anymore --
> Selections of new colors, cursors etc. is performed with huge delay
> if it's performed at all.
>
> BTW, the code sets up timer IRQ as Virtual Wire IRQ:
>
> Jul 8 14:57:58 kodscha IO-APIC (apicid-pin) 2-22, 2-23 not connected.
> Jul 8 14:57:58 kodscha ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> Jul 8 14:57:58 kodscha ...trying to set up timer as Virtual Wire IRQ... works.
>
> and both INT0 and INT2 of IOAPIC are masked:
>
> Jul 8 14:57:58 kodscha NR Dst Mask Trig IRR Pol Stat Dmod Deli Vect:
> Jul 8 14:57:58 kodscha 00 000 1 0 0 0 0 0 0 00
> Jul 8 14:57:58 kodscha 01 003 0 0 0 0 0 1 1 31
> Jul 8 14:57:58 kodscha 02 003 1 0 0 0 0 0 0 30
>
> I've also seen strange CPU utilization -- with syslog-ng:
>
> top - 15:33:06 up 35 min, 4 users, load average: 1.70, 0.68, 0.37
> Tasks: 64 total, 4 running, 60 sleeping, 0 stopped, 0 zombie
> Cpu0 : 0.0%us,100.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Cpu1 : 6.4%us, 87.2%sy, 0.0%ni, 5.8%id, 0.0%wa, 0.6%hi, 0.0%si, 0.0%st
> Mem: 895384k total, 283568k used, 611816k free, 35492k buffers
> Swap: 1959920k total, 0k used, 1959920k free, 163044k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 4632 root 20 0 17216 800 580 S 104 0.1 0:34.22 syslog-ng
> 28505 root 20 0 205m 11m 4024 S 6 1.3 0:21.16 X
> 28518 root 20 0 56292 5652 4492 S 1 0.6 0:01.80 fluxbox
> 1 root 20 0 3724 608 508 S 0 0.1 0:00.36 init
>
> So far I have no clue why C1E-idle in conjunction with virtual wire
> mode causes this strange behaviour.
>
> ... and I start to think about the root cause of all this.
>
> I've performed similar tests under X with the IRQ0/INT0 configuration and
> I did not see above symptoms.
So lets fall back to the IRQ0/INT0 configuration on this box.
This basically restores the dont-use-the-lapic-timer exception mechanism
that was unconditional on this box prior commit 8750bf5 ("x86: add C1E
aware idle function").
Signed-off-by: Ingo Molnar <mingo@elte.hu>
One of the last IOMMU updates covered a bug in the AMD IOMMU code. The early
detection code does not succeed if the GART is already detected. This patch
fixes this.
Cc: Robert Richter <robert.richter@amd.com>
Cc: Bhavna Sarathy <Bhavna.Sarathy@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Cc: iommu@lists.linux-foundation.org
Cc: Joerg Roedel <joerg.roedel@amd.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Bhavna Sarathy <Bhavna.Sarathy@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
handle head and tail that are not aligned to big pages (2MB/1GB boundary).
with this patch, on system that support gbpages, change:
last_map_addr: 1080000000 end: 1078000000
to:
last_map_addr: 1078000000 end: 1078000000
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When system have 4g less ram installed, and acpi table sit
near end of ram, make max_pfn cover them too,
so 64bit kernel don't need to mess up fixmap.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: "Suresh Siddha" <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
prepare for overmapped patch
also printout last_map_addr together with end
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Select X86_WP_WORKS_OK for x86_64 too.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
putuser_32.S and putuser_64.S are merged into putuser.S.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In putuser_32.S and putuser_64.S, replace things like .quad, .long,
and explicit references to [r|e]ax for the apropriate macros
in asm/asm.h.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove them where unambiguous.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In putuser_64.S, do it the i386 way, and replace the code
in beginning and end of functions with macros, since it's
always the same thing. Save lines.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Instead of operating over a register we need to put back
into normal state afterwards (the memory position), just
sub from rbx, which is trashed anyway. We can save a few instructions.
Also, this is the i386 way.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is consistent with i386 usage.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Instead of clobbering r8, clobber rbx, which is the i386 way.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Clobber it in the inline asm macros, and let the compiler do this for us.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
getuser_32.S and getuser_64.S are merged into getuser.S.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Switch .long and .quad with _ASM_PTR in getuser*.S.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There are situations in which the architecture wants to use the
register that represents its word-size, whatever it is. For those,
introduce __ASM_REG in asm.h, along with the first users _ASM_AX
and _ASM_DX. They have users waiting for it, namely the getuser
functions.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The instructions access registers, so the size is unambiguous.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is for consistency with i386.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Instead of doing a sub after the addition, use the
offset directly at the memory operand of the mov instructions.
This is the way i386 do.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since the instructions refer to registers, they'll be able
to figure it out.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There's really no reason to clobber r8 or pass the address in rcx.
We can safely use only two registers (which we already have to touch anyway)
to do the job.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
delay_32.c, delay_64.c are now equal, and are integrated into delay.c.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
For x86_64, we can't just use %0, as it would
generate a mul against rdx, which is not really what we
want (note the ">> 32" in x86_64 version).
Using a u64 variable with a shift in i386 generates bad code,
so the solution is to explicitly use %%edx in inline assembly
for both.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This way we achieve the same code for both arches.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is for consistency with i386. We call use_tsc_delay()
at tsc initialization for x86_64, so we'll be always using it.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove the "l" from inline asm at arch/x86/lib/delay_32.c.
It is not needed.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
arch/x86/mm/pgtable_32.c:144: warning: 'fixmaps' defined but not used
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
arch/x86/kernel/smpboot.c: In function 'do_boot_cpu':
arch/x86/kernel/smpboot.c:943: warning: label 'restore_state' defined but not used
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- order of local variable declarations
- minor code changes
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Acked-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- local caching of smp_processor_id() in default_do_nmi()
- v2: do not split default_do_nmi over two lines
On Wed, Jul 02, 2008 at 08:12:20PM +0400, Cyrill Gorcunov wrote:
> | -static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
> | +static notrace __kprobes void
> | +default_do_nmi(struct pt_regs *regs)
> | [ ... ]
> | -asmlinkage notrace __kprobes void default_do_nmi(struct pt_regs *regs)
> | +asmlinkage notrace __kprobes void
> | +default_do_nmi(struct pt_regs *regs)
>
> Hi Alexander, good done, thanks! But why did you split default_do_nmi
> definition by two lines? I think it would be better to keep them as it
> was before, ie by a single line
>
> static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
Thanks! Here is the replacement patch with default_do_nmi left on
a single line.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Acked-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- if (cond) block -> if (!cond) goto end_of_block
- local caching of current
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Acked-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reorder headers and collect globals in traps_32.c and traps_64.c
Code size and data size are unaffected by the changes. Code
itself is changed due to different ordering of data and bss.
The bss segment changed size due to a change in the packing
of the variables.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Acked-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch does not change the generated object files.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Rename the paravirtualized calculate_cpu_khz to calibrate_tsc.
In all cases, we actually calibrate_tsc and use that as the cpu_khz value.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Signed-off-by: Dan Hecht <dhecht@vmware.com>
Cc: Dan Hecht <dhecht@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Unify the clocksource code.
Unify the tsc_init code.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Signed-off-by: Dan Hecht <dhecht@vmware.com>
Cc: Dan Hecht <dhecht@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Merge the tsc calibration code for the 32bit and 64bit kernel.
The paravirtualized calculate_cpu_khz for 64bit now points to the correct
tsc_calibrate code as in 32bit.
Original native_calculate_cpu_khz for 64 bit is now called as calibrate_cpu.
Also moved the recalibrate_cpu_khz function in the common file.
Note that this function is called only from powernow K7 cpu freq driver.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Signed-off-by: Dan Hecht <dhecht@vmware.com>
Cc: Dan Hecht <dhecht@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move the basic global variable definitions and sched_clock handling in the
common "tsc.c" file.
- Unify notsc kernel command line handling for 32 bit and 64bit.
- Functional changes for 64bit.
- "tsc_disabled" is updated if "notsc" is passed at boottime.
- Fallback to jiffies for sched_clock, incase notsc is passed on
commandline.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Signed-off-by: Dan Hecht <dhecht@vmware.com>
Cc: Dan Hecht <dhecht@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add lapic resource into kernel resource map and mark it as busy
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
CC: Maciej W. Rozycki <macro@linux-mips.org>
so when it is called after early_param, e820_saved get updated too.
esp for mpc update.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
seperate reserve_setup_data into e820_reserved_setup_data,
and reserve_early_setup_data.
So could use e820_reserved_setup_data to backup e820 with setup_data.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Bernhard Walle <bwalle@suse.de>
Cc: Ying Huang <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
so other path that will override memory_setup or
machine_specific_memory_setup could have e820_saved too.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch uses the /sys/firmware/memmap interface provided in the last patch
on the x86 architecture when E820 is used. The patch copies the E820
memory map very early, and registers the E820 map afterwards via
firmware_map_add_early().
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Acked-by: Greg KH <gregkh@suse.de>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: kexec@lists.infradead.org
Cc: yhlu.kernel@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
we already have the same srat handling interface for 32bit.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Rather than using _PAGE_GLOBAL - which not all CPUs support - to test
CPA, use one of the reserved-for-software-use PTE flags instead. This
allows CPA testing to work on CPUs which don't support PGD.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
From: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Older x86-32 processors do not support global mappings (PGD), so must
only use it if the processor supports it.
The _PAGE_KERNEL* flags always have _PAGE_KERNEL set, since logically
we always want it set.
This is OK even on processors which do not support PGD, since all
_PAGE flags are masked with __supported_pte_mask before being turned
into a real in-pagetable pte. On 32-bit systems, __supported_pte_mask
is initialized to not contain _PAGE_GLOBAL, and it is then added if
the CPU is found to support it.
The x86-32 code used to use __PAGE_KERNEL/__PAGE_KERNEL_EXEC for this
purpose, but they're now redundant and can be removed.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Consistently set _PAGE_GLOBAL in _PAGE_KERNEL flags. This makes 32-
and 64-bit code consistent, and removes some special cases where
__PAGE_KERNEL* did not have _PAGE_GLOBAL set, causing confusion as a
result of the inconsistencies.
This patch only affects x86-64, which generally always supports PGD.
The x86-32 patch is next.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When allocating a new pud, unconditionally populate the pgd (why did
we bother to create a new pud if we weren't going to populate it?).
This will only happen if the pgd slot was empty, since any existing
pud will be reused.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
call it right after we are done with MADT/mptable handling, instead of
doing that in setup_per_cpu_areas() later on...
this way for_possible_cpu() can be used early.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
when acpi=off, cpu_to_apicid is ready after get_smp_config
so need to move init_cpu_to_node after it.
otherwise, we will get wrong cpu->node mapping, and it will rely on
amd_detect_cmp() to correct it - but that is too late as
setup_per_cpu_data is already called before that so we will get
per_cpu_data on the wrong node.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
move out e820_register_active_regions from non numa zones_sizes_init()
and remove numa version zones_sizes_init().
and let 32 bit call remove_all_active_ranges() in setup_arch() directly
like 64-bit
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
so it has a more meaningful name.
also change it to static.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch removes the need of the crashkernel=...@offset parameter to define
a fixed offset for crashkernel reservation. That feature can be used together
with a relocatable kernel where the kexec-tools relocate the kernel and
get the actual offset from /proc/iomem.
The use case is a kernel where the .text+.data+.bss is after 16M physical
memory (debug kernel with lockdep on x86_64 can cause that) which caused a
major pain in autoconfiguration in our distribution.
Also, that patch unifies crashdump architectures a bit since IA64 has
that semantics from the very beginning of the kdump port.
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Cc: vgoyal@redhat.com
Cc: Bernhard Walle <bwalle@suse.de>
Cc: kexec@lists.infradead.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
e820_search_gap also take a end_addr parameter to limit search from
start_addr to end_addr.
Signed-off-by: AloK N Kataria <akataria@vmware.com>
Acked-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: "lenb@kernel.org" <lenb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* When CONFIG_DEBUG_PER_CPU_MAPS is set, the node passed to
node_to_cpumask and node_to_cpumask_ptr should be validated.
If invalid, then a dump_stack is performed and a zero cpumask
is returned.
v2: Slightly different version to remove a compiler warning.
v3: Redone to reflect moving setup.c -> setup_percpu.c
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: "akpm@linux-foundation.org" <akpm@linux-foundation.org>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo Molnar wrote:
> -tip auto-testing found pagetable corruption (CPA self-test failure):
>
> [ 32.956015] CPA self-test:
> [ 32.958822] 4k 2048 large 508 gb 0 x 2556[ffff880000000000-ffff88003fe00000] miss 0
> [ 32.964000] CPA ffff88001d54e000: bad pte 1d4000e3
> [ 32.968000] CPA ffff88001d54e000: unexpected level 2
> [ 32.972000] CPA ffff880022c5d000: bad pte 22c000e3
> [ 32.976000] CPA ffff880022c5d000: unexpected level 2
> [ 32.980000] CPA ffff8800200ce000: bad pte 200000e3
> [ 32.984000] CPA ffff8800200ce000: unexpected level 2
> [ 32.988000] CPA ffff8800210f0000: bad pte 210000e3
>
> config and full log can be found at:
>
> http://redhat.com/~mingo/misc/config-Mon_Jun_30_11_11_51_CEST_2008.bad
> http://redhat.com/~mingo/misc/log-Mon_Jun_30_11_11_51_CEST_2008.bad
Phew. OK, I've worked this out. Short version is that's it's a false
alarm, and there was no real failure here. Long version:
* I changed the code to create the physical mapping pagetables to
reuse any existing mapping rather than replace it. Specifically,
reusing an pud pointed to by the pgd caused this symptom to appear.
* The specific PUD being reused is the one created statically in
head_64.S, which creates an initial 1GB mapping.
* That mapping doesn't have _PAGE_GLOBAL set on it, due to the
inconsistency between __PAGE_* and PAGE_*.
* The CPA test attempts to clear _PAGE_GLOBAL, and then checks to
see that the resulting range is 1) shattered into 4k pages, and 2)
has no _PAGE_GLOBAL.
* However, since it didn't have _PAGE_GLOBAL on that range to start
with, change_page_attr_clear() had nothing to do, and didn't
bother shattering the range,
* resulting in the reported messages
The simple fix is to set _PAGE_GLOBAL in level2_ident_pgt.
An additional fix to make CPA testing more robust by using some other
pagetable bit (one of the unused available-to-software ones). This
would solve spurious CPA test warnings under Xen which uses _PAGE_GLOBAL
for its own purposes (ie, not under guest control).
Also, we should revisit the use of _PAGE_GLOBAL in asm-x86/pgtable.h,
and use it consistently, and drop MAKE_GLOBAL. The first time I
proposed it it caused breakages in the very early CPA code; with luck
that's all fixed now.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Mark McLoughlin <markmc@redhat.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
kva ram already mapped right after away, so don't need to get that for low ram.
avoid wasting one copy of pgdat.
also add node id in early_res name in case we get it from find_e820_area.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ying Huang would like setup_data to be reserved, but not included in the
no save range.
Here we try to modify the e820 table to reserve that range early.
also add that in early_res in case bootloader messes up with the ramdisk.
other solution would be
1. add early_res_to_highmem...
2. early_res_to_e820...
but they could reserve another type memory wrongly, if early_res has some
resource reserved early, and not needed later, but it is not removed from
early_res in time. Like the RAMDISK (already handled).
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: andi@firstfloor.org
Tested-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Looks like the setup.c unification missed the early_ioremap init from
the early_ioremap unification. Unconditionally call early_ioremap_init().
needed for "x86/paravirt: groundwork for 64-bit Xen support".
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Mark McLoughlin <markmc@redhat.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
change the enable_local_apic to static force_enable_local_apic for 32bit
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
use PMD_SHIFT to calculate boundary also adjust size for pre-allocated
table size
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
when 64bit resource is not enabled, we get:
arch/x86/kernel/e820.c: In function ‘e820_reserve_resources’:
arch/x86/kernel/e820.c:1217: warning: comparison is always false due to limited range of data type
because res->start/end is resource_t aka u32. it will overflow.
fix it with temp end of u64
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
some ram-end boundary only has page alignment, instead of 2M alignment.
v2: make init_memory_mapping more solid: start could be any value other than 0
v3: fix NON PAE by handling left over in kernel_physical_mapping
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
instead of calling it from trap_init()
also move init ioapic mapping out of apic_32.c
so 32 bit do same as 64 bit
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo Molnar wrote:
> that fixed the build but now we've got a boot crash with this config:
>
> time.c: Detected 2010.304 MHz processor.
> spurious 8259A interrupt: IRQ7.
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> IP: [<0000000000000000>]
> PGD 0
> Thread overran stack, or stack corrupted
> Oops: 0010 [1] SMP
> CPU 0
>
I don't know if this will fix this bug, but it's definitely a bugfix.
It was trashing random pages by overwriting them with pagetables...
Don't trash a large pmd's data when mapping physical memory.
This is a bugfix for "x86_64: adjust mapping of physical pagetables
to work with Xen".
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>
>>> It quickly broke the build in testing:
>>>
>>> include/asm/pgalloc.h: In function ‘paravirt_pgd_free':
>>> include/asm/pgalloc.h:14: error: parameter name omitted
>>> arch/x86/kernel/entry_64.S: In file included from
>>> arch/x86/kernel/traps_64.c:51:include/asm/pgalloc.h: In function
>>> ‘paravirt_pgd_free':
>>> include/asm/pgalloc.h:14: error: parameter name omitted
>>>
>>>
>> No, looks like my fault. The non-PARAVIRT version of
>> paravirt_pgd_free() is:
>>
>> static inline void paravirt_pgd_free(struct mm_struct *mm, pgd_t *) {}
>>
>> but C doesn't like missing parameter names, even if unused.
>>
>> This should fix it:
>>
>
> that fixed the build but now we've got a boot crash with this config:
>
> time.c: Detected 2010.304 MHz processor.
> spurious 8259A interrupt: IRQ7.
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> IP: [<0000000000000000>]
> PGD 0
> Thread overran stack, or stack corrupted
> Oops: 0010 [1] SMP
> CPU 0
>
> with:
>
> http://redhat.com/~mingo/misc/config-Thu_Jun_26_12_46_46_CEST_2008.bad
>
Use SWAPGS_UNSAFE_STACK in ia32entry.S in the places where the active
stack is the usermode stack.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
do that in init_memory_mapping
also remove one init_ohci1394_dma_on_all_controllers
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
asm-x86/paravirt.h already have protection with CONFIG_PARAVIRT inside
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch brings back limiting of the E820 map when a user-defined
E820 map is specified. While the behaviour of i386 (32 bit) was to limit
the E820 map (and /proc/iomem), the behaviour of x86-64 (64 bit) was not to
limit.
That patch limits the E820 map again for both x86 architectures.
Code was tested for compilation and booting on a 32 bit and 64 bit system.
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Acked-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: kexec@lists.infradead.org
Cc: vgoyal@redhat.com
Cc: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The patch "x86: introduce init_memory_mapping for 32bit" does not allocate
enough space for PTEs if the CPU does not implement PSE.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
64-bit Xen pushes a couple of extra words onto an exception frame.
Add a hook to deal with them.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In a 64-bit system, we need separate sysret/sysexit operations to
return to a 32-bit userspace.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citirx.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There's no need to combine restoring the user rsp within the sysret
pvop, so split it out. This makes the pvop's semantics closer to the
machine instruction.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citirx.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Don't conflate sysret and sysexit; they're different instructions with
different semantics, and may be in use at the same time (at least
within the same kernel, depending on whether its an Intel or AMD
system).
sysexit - just return to userspace, does no register restoration of
any kind; must explicitly atomically enable interrupts.
sysret - reloads flags from r11, so no need to explicitly enable
interrupts on 64-bit, responsible for restoring usermode %gs
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citirx.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is needed when the kernel is running on RING3, such as under Xen.
x86_64 has a weird feature that makes it #GP on iret when SS is a null
descriptor.
This need to be tested on bare metal to make sure it doesn't cause any
problems. AMD specs say SS is always ignored (except on iret?).
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We must do this because load_TLS() may need to clear %fs and %gs.
(e.g. under Xen).
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We must leave lazy mode before switching the %fs and %gs selectors.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We will need to set a pte on l3_user_pgt. Extract set_pte_vaddr_pud()
from set_pte_vaddr(), that will accept the l3 page table as parameter.
This change should be a no-op for existing code.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If PSE is not available, then fall back to 4k page mappings for the
vmemmap area.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This makes a few of changes to the construction of the initial
pagetables to work better with paravirt_ops/Xen. The main areas
are:
1. Support non-PSE mapping of memory, since Xen doesn't currently
allow 2M pages to be mapped in guests.
2. Make sure that the ioremap alias of all pages are dropped before
attaching the new page to the pagetable. This avoids having
writable aliases of pagetable pages.
3. Preserve existing pagetable entries, rather than overwriting. Its
possible that a fair amount of pagetable has already been constructed,
so reuse what's already in place rather than ignoring and overwriting it.
The algorithm relies on the invariant that any page which is part of
the kernel pagetable is itself mapped in the linear memory area. This
way, it can avoid using ioremap on a pagetable page.
The invariant holds because it maps memory from low to high addresses,
and also allocates memory from low to high. Each allocated page can
map at least 2M of address space, so the mapped area will always
progress much faster than the allocated area. It relies on the early
boot code mapping enough pages to get started.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Split x86_64_start_kernel() into two pieces:
The first essentially cleans up after head_64.S. It clears the
bss, zaps low identity mappings, sets up some early exception
handlers.
The second part preserves the boot data, reserves the kernel's
text/data/bss, pagetables and ramdisk, and then starts the kernel
proper.
This split is so that Xen can call the second part to do the set up it
needs done. It doesn't need any of the first part setups, because it
doesn't boot via head_64.S, and its redundant or actively damaging.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Set __PAGE_OFFSET to the most negative possible address +
16*PGDIR_SIZE. The gap is to allow a space for a hypervisor to fit.
The gap is more or less arbitrary, but it's what Xen needs.
When booting native, kernel/head_64.S has a set of compile-time
generated pagetables used at boot time. This patch removes their
absolutely hard-coded layout, and makes it parameterised on
__PAGE_OFFSET (and __START_KERNEL_map).
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Rather than just jumping to 0 when there's a missing operation, raise a BUG.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Jan Beulich points out that vmalloc_sync_all() assumes that the
kernel's pmd is always expected to be present in the pgd. The current
pgd construction code will add the pgd to the pgd_list before its pmds
have been pre-populated, thereby making it visible to
vmalloc_sync_all().
However, because pgd_prepopulate_pmd also does the allocation, it may
block and cannot be done under spinlock.
The solution is to preallocate the pmds out of the spinlock, then
populate them while holding the pgd_list lock.
This patch also pulls the pmd preallocation and mop-up functions out
to be common, assuming that the compiler will generate no code for
them when PREALLOCTED_PMDS is 0. Also, there's no need for pgd_ctor
to clear the pgd again, since it's allocated as a zeroed page.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add hooks which are called at pgd_alloc/free time. The pgd_alloc hook
may return an error code, which if non-zero, causes the pgd allocation
to be failed. The hooks may be used to allocate/free auxillary
per-pgd information.
also fix:
> * Ingo Molnar <mingo@elte.hu> wrote:
>
> include/asm/pgalloc.h: In function ‘paravirt_pgd_free':
> include/asm/pgalloc.h:14: error: parameter name omitted
> arch/x86/kernel/entry_64.S: In file included from
> arch/x86/kernel/traps_64.c:51:include/asm/pgalloc.h: In function ‘paravirt_pgd_free':
> include/asm/pgalloc.h:14: error: parameter name omitted
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
vmalloc_sync_all() is only called from register_die_notifier and
alloc_vm_area. Neither is on any performance-critical paths, so
vmalloc_sync_all() itself is not on any hot paths.
Given that the optimisations in vmalloc_sync_all add a fair amount of
code and complexity, and are fairly hard to evaluate for correctness,
it's better to just remove them to simplify the code rather than worry
about its absolute performance.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix:
In file included from arch/x86/kernel/setup.c:118:
include/asm/highmem.h:64: error: expected identifier or ‘(' before ‘do'
include/asm/highmem.h:64: error: expected identifier or ‘(' before ‘while'
include/asm/highmem.h:67: error: expected identifier or ‘(' before ‘do'
include/asm/highmem.h:67: error: expected identifier or ‘(' before ‘while'
Signed-off-by: Ingo Molnar <mingo@elte.hu>