There is no need to skip the instruction if the reason for a task switch
is a task gate in the IDT and access to it is caused by an external event.
The problem is currently solved only for VMX since there is no reliable
way to skip an instruction in SVM. We should emulate it instead.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
We will need it later in task_switch().
Code in handle_exception() is dead. is_external_interrupt(vect_info)
will always be false since idt_vectoring_info is zeroed in
vmx_complete_interrupts().
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
...with a more straightforward switch().
Also fix a bug where an NMI could be dropped on exit. This should never
happen in practice, though, since NMIs can only be injected, never triggered
internally by the guest like exceptions.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Bit 12 is undefined in any of the following cases:
If the VM exit sets the valid bit in the IDT-vectoring information field.
If the VM exit is due to a double fault.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The feature test now happens too early, before vmcs_config has completed initialization.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
A pte that is shadowed when the guest EFER.NXE=1 is not valid when
EFER.NXE=0; if bit 63 is set, the pte should cause a fault, and since the
shadow EFER always has NX enabled, this won't happen.
Fix by using a different shadow page table for different EFER.NXE bits. This
allows vcpus to run correctly with different values of EFER.NXE, and for
transitions on this bit to be handled correctly without requiring a full
flush.
Signed-off-by: Avi Kivity <avi@redhat.com>
Detect, indicate, and propagate page faults where reserved bits are set.
Take care to handle the different paging modes, each of which has different
sets of reserved bits.
[avi: fix pte reserved bits for efer.nxe=0]
Signed-off-by: Eddie Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The original one is for the code before refactoring.
Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
EXIT_QUALIFICATION and GUEST_LINEAR_ADDRESS are natural width, not 64-bit.
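For illustration only (not the patch itself), natural-width fields want the
unsigned-long accessor:

        /* natural-width fields: 32-bit on i386, 64-bit on x86_64 */
        unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
        unsigned long gla = vmcs_readl(GUEST_LINEAR_ADDRESS);
        /* vmcs_read64() is only appropriate for true 64-bit fields */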
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
kvm_vcpu_block() unhalts the vcpu on an interrupt/timer without checking
whether the interrupt window is actually open.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Currently timer events are processed before entering guest mode. Move this
to the main vcpu event loop, since timer events should be processed even while
the vcpu is halted. A timer may cause an interrupt/NMI to be injected, and only
then will the vcpu be unhalted.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The prioritized bit vector manipulation functions are useful in both vmx and
svm.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
We ignore writes to the performance counters and performance event
selector registers already. Kaspersky antivirus reads the eventsel
MSR causing it to crash with the current behaviour.
Return 0 as data when the eventsel registers are read to stop the
crash.
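A minimal sketch of the idea (the switch shape and variable are illustrative,
not the exact KVM MSR-read code):

        switch (msr) {
        case MSR_K7_EVNTSEL0:
        case MSR_K7_EVNTSEL1:
        case MSR_K7_EVNTSEL2:
        case MSR_K7_EVNTSEL3:
                /* performance counters are not virtualized: report 0 */
                data = 0;
                break;
        }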
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
free_mmu_pages() should only undo what alloc_mmu_pages() does.
Free mmu pages from the generic VM destruction function, kvm_destroy_vm().
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
After discussion with Marcelo, we decided to rework the device assignment
framework together. The old problem is that the kernel logic is unnecessarily
complex, so Marcelo suggested splitting it up in a more elegant way:
1. Split host IRQ assignment and guest IRQ assignment, and let userspace
determine the combination. Also discard the msi2intx parameter; userspace can
specify KVM_DEV_IRQ_HOST_MSI | KVM_DEV_IRQ_GUEST_INTX in assigned_irq->flags to
enable MSI to INTx conversion.
2. Split assigning and deassigning an IRQ. Introduce two new ioctls:
KVM_ASSIGN_DEV_IRQ and KVM_DEASSIGN_DEV_IRQ.
This patch also fixes the reversed _IOR vs _IOW in the definition (by
deprecating the old interface).
[avi: replace homemade bitcount() by hweight_long()]
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Fix these sparse warnings:
arch/x86/kvm/lapic.c:916:22: warning: symbol 'lapic_timer_ops' was not declared. Should it be static?
arch/x86/kvm/i8254.c:268:22: warning: symbol 'kpit_ops' was not declared. Should it be static?
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
The new way does not require additional loop over vcpus to calculate
the one with lowest priority as one is chosen during delivery bitmap
construction.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Use kvm_apic_match_dest() in kvm_get_intr_delivery_bitmask() instead
of duplicating the same code. Use kvm_get_intr_delivery_bitmask() in
apic_send_ipi() to figure out ipi destination instead of reimplementing
the logic.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Get rid of ioapic_inj_irq() and ioapic_inj_nmi() functions.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
There is no reason to update the shadow pte here because the guest pte
is only changed to dirty state.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Hide the internals of vcpu awakening / injection from the in-kernel
emulated timers. This makes future changes in this logic easier and
decreases the distance to more generic timer handling.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
We can infer elapsed time from hrtimer_expires_remaining.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Skip the test which checks if the PIT is properly routed when
using the IOAPIC, aimed at buggy hardware.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This issue just appeared in kvm-84 when running on 2.6.28.7 (x86-64)
with PREEMPT enabled.
We're getting syslog warnings like this many (but not all) times qemu
tells KVM to run the VCPU:
BUG: using smp_processor_id() in preemptible [00000000] code:
qemu-system-x86/28938
caller is kvm_arch_vcpu_ioctl_run+0x5d1/0xc70 [kvm]
Pid: 28938, comm: qemu-system-x86 2.6.28.7-mtyrel-64bit
Call Trace:
debug_smp_processor_id+0xf7/0x100
kvm_arch_vcpu_ioctl_run+0x5d1/0xc70 [kvm]
? __wake_up+0x4e/0x70
? wake_futex+0x27/0x40
kvm_vcpu_ioctl+0x2e9/0x5a0 [kvm]
enqueue_hrtimer+0x8a/0x110
_spin_unlock_irqrestore+0x27/0x50
vfs_ioctl+0x31/0xa0
do_vfs_ioctl+0x74/0x480
sys_futex+0xb4/0x140
sys_ioctl+0x99/0xa0
system_call_fastpath+0x16/0x1b
As it turns out, the call trace is messed up due to gcc's inlining, but
I isolated the problem anyway: kvm_write_guest_time() is being used in a
non-thread-safe manner on preemptable kernels.
Basically kvm_write_guest_time()'s body needs to be surrounded by
preempt_disable() and preempt_enable(), since the kernel won't let us
query any per-CPU data (indirectly using smp_processor_id()) without
preemption disabled. The attached patch fixes this issue by disabling
preemption inside kvm_write_guest_time().
[marcelo: surround only __get_cpu_var calls since the warning
is harmless]
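The resulting pattern is roughly the following (sketch only; cpu_tsc_khz is
assumed here to be the per-CPU variable being read):

        unsigned long this_tsc_khz;

        preempt_disable();
        /* no migration while reading per-CPU state via smp_processor_id() */
        this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
        preempt_enable();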
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch finally enables MSI-X.
What we need for MSI-X:
1. Intercept one page in the MMIO region of the device, so that we can get the
guest's desired MSI-X table and set up the real one. This is now done by the
guest and transferred to the kernel using the ioctls KVM_SET_MSIX_NR and
KVM_SET_MSIX_ENTRY.
2. Information for the incoming interrupt. One device can now have more than
one interrupt, and they are all handled by one workqueue structure, so we need
to identify them. The previous patch enabled gsi_msg_pending_bitmap to get this done.
3. A mapping from host IRQ to guest gsi, as well as from guest gsi to the real
MSI/MSI-X message address/data. We use the same entry number for host and guest
here, so it is easy to find the correlated guest gsi.
What we lack for now:
1. The PCI spec says nothing else can exist in the same page of the MMIO region
as the MSI-X table, except the pending bits. The patch ignores the pending bits
as a first step (so they are always 0 - no pending).
2. The PCI spec allows the MSI-X table to be changed dynamically. That means the
OS can enable MSI-X, then mask one MSI-X entry, modify it, and unmask it. The
patch doesn't support this, and Linux doesn't work this way either.
3. The patch doesn't implement MSI-X mask-all or mask-single-entry. I would
implement the former in drivers/pci/msi.c later; and for single entry, userspace
should be responsible for handling it.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
It will also be convenient when we extend the number of vcpus KVM supports in the future.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Would be used with bit ops, and would be easily extended if KVM_MAX_VCPUS is
increased.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Windows 2008 accesses this MSR often on context switch intensive workloads;
since we run in guest context with the guest MSR value loaded (so swapgs can
work correctly), we can simply disable interception of rdmsr/wrmsr for this
MSR.
A complication occurs since in legacy mode, we run with the host MSR value
loaded. In this case we enable interception. This means we need two MSR
bitmaps, one for legacy mode and one for long mode.
Signed-off-by: Avi Kivity <avi@redhat.com>
Correct some event and UMASK values according to Intel SDM,
in the Nehalem and Atom tables.
Signed-off-by: Yong Wang <yong.y.wang@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <20090609131553.GA12489@ywang-moblin2.bj.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Booting a 32-bit kernel on Magny-Cours results in the following panic:
...
Using APIC driver default
...
Overriding APIC driver with bigsmp
...
Getting VERSION: 80050010
Getting VERSION: 80050010
Getting ID: 10000000
Getting ID: ef000000
Getting LVT0: 700
Getting LVT1: 10000
Kernel panic - not syncing: Boot APIC ID in local APIC unexpected (16 vs 0)
Pid: 1, comm: swapper Not tainted 2.6.30-rcX #2
Call Trace:
[<c05194da>] ? panic+0x38/0xd3
[<c0743102>] ? native_smp_prepare_cpus+0x259/0x31f
[<c073b19d>] ? kernel_init+0x3e/0x141
[<c073b15f>] ? kernel_init+0x0/0x141
[<c020325f>] ? kernel_thread_helper+0x7/0x10
The reason is that default_get_apic_id() handled the extension of the
local APIC ID field only in the XAPIC case.
Thus for this AMD CPU, default_get_apic_id() returns 0 while
bigsmp_get_apic_id() returns 16, which leads to the kernel panic above.
This patch introduces a Linux-specific feature flag to indicate
support for the extended APIC ID (8 bits instead of 4 bits wide) and
sets the flag on AMD CPUs if applicable.
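For illustration, the patched ID extraction might look roughly like this (a
sketch modeled on default_get_apic_id(); the feature-flag name is an
assumption, not quoted from the patch):

        static inline unsigned int read_apic_id_8bit(unsigned long x)
        {
                unsigned int ver = GET_APIC_VERSION(apic_read(APIC_LVR));

                /* XAPIC (version >= 0x14) or the new AMD flag: full 8-bit ID */
                if (APIC_XAPIC(ver) || boot_cpu_has(X86_FEATURE_EXTD_APICID))
                        return (x >> 24) & 0xFF;

                return (x >> 24) & 0x0F;        /* legacy: only 4 bits */
        }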
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: <stable@kernel.org>
LKML-Reference: <20090608135509.GA12431@alberich.amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
These are defined as static cpumask_var_t so if MAXSMP is not used,
they are cleared already. Avoid surprises when MAXSMP is enabled.
Signed-off-by: Yinghai Lu <yinghai.lu@kernel.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
That prefix is already included in the DUMP_printk macro. So there is no
need to repeat it in the format string.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This fixes a bug with a device that could not be assigned to a KVM guest
because it is still assigned to a dma_ops protection domain.
[chrisw: simply remove WARN_ON(), will always fire since dev->driver
will be pci-stub]
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Handling this event causes device assignment in KVM to fail because the
device gets re-attached as soon as the pci-stub registers as the driver
for the device.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Fill in amd_hw_cache_event_id[] with the AMD CPU specific events,
for family 0x0f, 0x10 and 0x11.
There's apparently no distinction between load and store events, so
we only fill in the load events.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Using gcc 3.3.5, a "make allmodconfig" + "CONFIG_KVM=n" build
triggers this error:
arch/x86/mm/built-in.o(.init.text+0x43f7): In function `__change_page_attr':
arch/x86/mm/pageattr.c:114: undefined reference to `__udivdi3'
make: *** [.tmp_vmlinux1] Error 1
The culprit turned out to be a division in arch/x86/mm/memtest.c
For more info see this thread:
http://marc.info/?l=linux-kernel&m=124416232620683
The patch entirely removes the division that caused the build
error.
[ Impact: build fix with certain GCC versions ]
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: xiyou.wangcong@gmail.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@kernel.org>
LKML-Reference: <20090608170939.GB12431@alberich.amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix bug in the SGI UV macros that support systems with multiple
coherency domains. The macros used for referencing global MMR
(chipset registers) are failing to correctly "or" the NASID
(node identifier) bits that reside above M+N. These high bits
are supplied automatically by the chipset for memory accesses
coming from the processor socket.
However, the bits must be present for references to the special
global MMR space used to map chipset registers. (See uv_hub.h
for more details ...)
The bug results in references to invalid/incorrect nodes.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Cc: <stable@kernel.org>
LKML-Reference: <20090608154405.GA16395@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Standardize and tidy up all the messages we print during
perfcounter initialization.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fill in core2_hw_cache_event_id[] with the Atom model specific events.
The events can be used in all the tools via the -e (--event) parameter,
for example "-e l1-misses" or "-e l2-accesses" or "-e l2-write-misses".
( Note: these are straight from the Intel manuals - not tested yet.)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fill in core2_hw_cache_event_id[] with the Core2 model specific events.
The events can be used in all the tools via the -e (--event) parameter,
for example "-e l1-misses" or "-e l2-accesses" or "-e l2-write-misses".
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
vfree() does its own 'NULL' check, so no need for check before
calling it.
In v2, remove the stray newline.
[ Impact: cleanup ]
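The shape of the cleanup, for illustration (buf is a hypothetical pointer):

        /* before: redundant NULL check */
        if (buf)
                vfree(buf);

        /* after: vfree(NULL) is a no-op, so call it unconditionally */
        vfree(buf);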
Signed-off-by: Figo.zhang <figo1802@gmail.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
LKML-Reference: <1244385036.3402.11.camel@myhost>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This fixes a stack corruption panic or null dereference oops
due to a bad GS in resume_userspace() when returning from
sys_vm86() and calling lockdep_sys_exit().
This is only a problem when CONFIG_LOCKDEP and CONFIG_CC_STACKPROTECTOR
are enabled.
Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Cc: H. Peter Anvin <hpa@zytor.com>
LKML-Reference: <1244384628.2323.4.camel@bimbo>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo Molnar reported that read_apic is buggy nowadays:
[ 0.000000] Using APIC driver default
[ 0.000000] SMP: Allowing 1 CPUs, 0 hotplug CPUs
[ 0.000000] Local APIC disabled by BIOS -- you can enable it with "lapic"
[ 0.000000] APIC: disable apic facility
[ 0.000000] ------------[ cut here ]------------
[ 0.000000] WARNING: at arch/x86/kernel/apic/apic.c:254 native_apic_read_dummy+0x2d/0x3b()
[ 0.000000] Hardware name: HP OmniBook PC
Indeed, we still rely on the apic->read operation for SMP-compiled
kernels. Instead of disfiguring the SMP code with #ifdefs, we allow
apic->read to be called. To capture any unexpected results we check,
via WARN_ON_ONCE, that apic->read is called for a sane reason - but(!)
instead of an OR we should use an AND logical operation (thanks to
Yinghai for spotting the root of the problem).
Along with that, we could have a bad MP table, and we fix it in such a
way that SMP is not started and there are no complaints about a BIOS
bug if the APIC was just disabled via the command line.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <20090607124840.GD4547@lenovo>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The Dell Optiplex 360 hangs on reboot, just like the Optiplex 330, so
the same quirk is needed.
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Cc: Steve Conklin <steve.conklin@canonical.com>
Cc: Leann Ogasawara <leann.ogasawara@canonical.com>
Cc: <stable@kernel.org>
LKML-Reference: <200906051202.38311.jdelvare@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove model information, encoding/decoding and reduce bookkeeping.
This, besides removing a lot of code and cleaning it up, also
enables these features on many more CPUs than were enumerated before.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
LKML-Reference: <1244224637.8212.6.camel@ht.satnam>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Merge reason: This branch was on an -rc5 base so pull almost-2.6.30
to resync with the latest upstream fixes and make sure
the combination works fine.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6:
x86/pci: fix mmconfig detection with 32bit near 4g
PCI: use fixed-up device class when configuring device
Extend generic event enumeration with the PERF_TYPE_HW_CACHE
method.
This is a 3-dimensional space:
{ L1-D, L1-I, L2, ITLB, DTLB, BPU } x
{ load, store, prefetch } x
{ accesses, misses }
User-space passes in the 3 coordinates and the kernel provides
a counter. (if the hardware supports that type and if the
combination makes sense.)
Combinations that make no sense produce a -EINVAL.
Combinations that are not supported by the hardware produce -ENOTSUP.
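Conceptually, the three coordinates are packed into a single config value
along these lines (enum names and bit positions here are assumptions, shown
only to illustrate the encoding):

        /* one possible encoding of { unit, op, result } into one config: */
        __u64 config = PERF_COUNT_HW_CACHE_L1D                  /* which cache  */
                     | (PERF_COUNT_HW_CACHE_OP_READ     <<  8)  /* which access */
                     | (PERF_COUNT_HW_CACHE_RESULT_MISS << 16); /* hit or miss  */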
Extend the tools to deal with this, and rewrite the event symbol
parsing code with various popular aliases for the units and
access methods above. So 'l1-cache-miss' and 'l1d-read-ops' are
both valid aliases.
( x86 is supported for now, with the Nehalem event table filled in,
and with Core2 and Atom having placeholder tables. )
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Counter type is a frequently used value and we do a lot of
bit juggling by encoding and decoding it from attr->config.
Clean this up by creating a separate attr->type field.
Also clean up the various similarly complex user-space bits
all around counter attribute management.
The net improvement is significant, and it will be easier
to add a new major type (which is what triggered this cleanup).
(This changes the ABI, all tools are adapted.)
(PowerPC build-tested.)
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The current code to set up the GART as an IOMMU enables GART
translations before it removes the aperture from the kernel memory
map, sets the GART PTEs to UC, sets up the guard and scratch
pages, or does a wbinvd(). This leaves the possibility of cache
aliasing open and can cause system crashes.
Re-order the code so as to enable the GART translations only
after all safeguards are in place and the tlb has been flushed.
AMD has tested this patch on both Istanbul systems and 1st
generation Opteron systems with AGP enabled and seen no adverse
effects. Istanbul systems with HT Assist enabled sometimes
see MCE errors due to cache artifacts with the unmodified
code.
Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
Cc: <stable@kernel.org>
Cc: Joerg Roedel <joerg.roedel@amd.com>
Cc: akpm@linux-foundation.org
Cc: jbarnes@virtuousgeek.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The powernow-k8 driver checks to see that the Performance Control/Status
Registers are declared as FFH (functional fixed hardware) by the BIOS.
However, this check got broken in the commit:
0e64a0c982
[CPUFREQ] checkpatch cleanups for powernow-k8
Fix based on an original patch from Naga Chumbalkar.
Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Cc: Mark Langsdorf <mark.langsdorf@amd.com>
Signed-off-by: Dave Jones <davej@redhat.com>
In order to make arch_vma_name() work from inside
install_special_mapping() we need to set the context.vdso
before calling it.
( This is needed for performance counters to be able to track
this special executable area. )
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We don't set up the canary; let's disable stack protector on boot.c so
we can get into lguest_init, then set it up. As a side effect,
switch_to_new_gdt() sets up %fs for us properly too.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pascal reported and bisected a commit:
| x86/PCI: don't call e820_all_mapped with -1 in the mmconfig case
which broke one system.
ACPI: Using IOAPIC for interrupt routing
PCI: MCFG configuration 0: base f0000000 segment 0 buses 0 - 255
PCI: MCFG area at f0000000 reserved in ACPI motherboard resources
PCI: Using MMCONFIG for extended config space
it didn't have
PCI: updated MCFG configuration 0: base f0000000 segment 0 buses 0 - 63
anymore, and tried to use 0xf0000000 - 0xffffffff for mmconfig.
For 32-bit, mcfg_res->end could be 32-bit only (if 64-bit resources aren't used),
so use end - 1 to pass the value in mcfg->end to avoid overflow.
We don't need to worry about the e820 path; those values are always 64-bit.
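A toy illustration of the wraparound (not the PCI code itself):

        u32 base = 0xf0000000;
        u32 size = 256 << 20;                   /* 256 buses x 1 MB          */
        u32 end_exclusive = base + size;        /* wraps to 0 in 32 bits     */
        u32 end_inclusive = base + size - 1;    /* 0xffffffff, representable */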
Reported-by: Pascal Terjan <pterjan@mandriva.com>
Bisected-by: Pascal Terjan <pterjan@mandriva.com>
Tested-by: Pascal Terjan <pterjan@mandriva.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Throttling logic is broken and we can lock up with too small
hw sampling intervals.
Make the throttling code more robust: disable counters even
if we already disabled them.
( Also clean up whitespace damage I noticed while reading
various pieces of code related to throttling. )
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The UV tlb shootdown code has a serious initialization error.
An array of structures [32*8] is initialized as if it were [32].
The array is indexed by (cpu number on the blade)*8, so the short
initialization works for up to 4 cpus on a blade.
But above that, we provide an invalid opcode to the hub's
broadcast assist unit.
This patch changes the allocation of the array to use its symbolic
dimensions for better clarity. And initializes all 32*8 entries.
Shortened 'UV_ACTIVATION_DESCRIPTOR_SIZE' to 'UV_ADP_SIZE' per Ingo's
recommendation.
Tested on the UV simulator.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: <stable@kernel.org>
LKML-Reference: <E1M6lZR-0007kV-Aq@eag09.americas.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In alloc_coherent there is an omitted unlock on the path where mapping
fails. Add the unlock.
[ Impact: fix lock imbalance in alloc_coherent ]
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Remove the IRQ (non-NMI) handling bits as NMI will be used always.
Signed-off-by: Yong Wang <yong.y.wang@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090603051255.GA2791@ywang-moblin2.bj.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
git status complains of untracked (generated) files in arch/x86/boot..
# Untracked files:
# (use "git add <file>..." to include in what will be committed)
#
# ../../arch/x86/boot/compressed/mkpiggy
# ../../arch/x86/boot/compressed/piggy.S
# ../../arch/x86/boot/compressed/vmlinux.lds
# ../../arch/x86/boot/voffset.h
# ../../arch/x86/boot/zoffset.h
..so adjust .gitignore files accordingly.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The structure isn't hw only and when I read event, I think about those
things that fall out the other end. Rename the thing.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Kacur <jkacur@redhat.com>
Cc: Stephane Eranian <eranian@googlemail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Do as Power already does, emulate sample periods up to 2^63-1 by
composing them of smaller values limited by hardware capabilities.
Only once we wrap the software period do we generate an overflow
event.
Just 10 lines of new code.
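The core of the idea, as a sketch (names are hypothetical; this would run in
the overflow handler after the events counted by the hardware have been
subtracted from period_left):

        static u64 next_hw_period(s64 *period_left, u64 sample_period,
                                  u64 hw_max, int *overflow)
        {
                *overflow = 0;
                if (*period_left <= 0) {        /* software period elapsed */
                        *period_left += sample_period;
                        *overflow = 1;          /* report this one          */
                }
                /* program only what the hardware can count in one go */
                return min_t(u64, (u64)*period_left, hw_max);
        }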
Reported-by: Stephane Eranian <eranian@googlemail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
IRQ (non-NMI) sampling is not used anymore - remove the last few bits.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
A few renames:
s/irq_period/sample_period/
s/irq_freq/sample_freq/
s/PERF_RECORD_/PERF_SAMPLE_/
s/record_type/sample_type/
And change both the new sample_type and read_format to u64.
Reported-by: Stephane Eranian <eranian@googlemail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix the fact that the IOAPIC version number in the x86_64 code path always
gets assigned to 0, instead of the correct value.
Before the patch: (from "dmesg" output):
ACPI: IOAPIC (id[0x08] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 8, version 0, address 0xfec00000, GSI 0-23 <---
After the patch:
ACPI: IOAPIC (id[0x08] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23 <---
History:
io_apic_get_version() was compiled out of the x86_64 code path in the commit
f2c2cca3ac:
Author: Andi Kleen <ak@suse.de>
Date: Tue Sep 26 10:52:37 2006 +0200
[PATCH] Remove APIC version/cpu capability mpparse checking/printing
ACPI went to great trouble to get the APIC version and CPU capabilities
of different CPUs before passing them to the mpparser. But all
that data was used was to print it out. Actually it even faked some data
based on the boot cpu, not on the actual CPU being booted.
Remove all this code because it's not needed.
Cc: len.brown@intel.com
At the time, the IOAPIC version number was deliberately not printed
in the x86_64 code path. However, after the x86 and x86_64 files were
merged, the net result is that the IOAPIC version is printed incorrectly
in the x86_64 code path.
The patch below provides a fix. I have tested it with acpi, and with
acpi=off, and did not see any problems.
Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Acked-by: Yinghai Lu <yhlu.kernel@gmail.com>
LKML-Reference: <20090416014230.4885.94926.sendpatchset@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Merge reason: irq/numa didn't build because this commit:
2759c32: x86: don't call read_apic_id if !cpu_has_apic
Had a dependency on x86/cpufeature changes. Pull in that
(small) branch to fix the dependency.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Conflicts:
arch/mips/sibyte/bcm1480/irq.c
arch/mips/sibyte/sb1250/irq.c
Merge reason: we gathered a few conflicts plus update to latest upstream fixes.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Merge reason: merge almost-rc8 into perfcounters/core, which was -rc6
based - to pick up the latest upstream fixes.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13302
On x86 and x86-64, it is possible that page tables are shared between
shared mappings backed by hugetlbfs. As part of this,
page_table_shareable() checks a pair of vma->vm_flags and they must match
if they are to be shared. All VMA flags are taken into account, including
VM_LOCKED.
The problem is that VM_LOCKED is cleared on fork(). When a process with a
shared memory segment forks() to exec() a helper, there will be shared
VMAs with different flags. The impact is that the shared segment is
sometimes considered shareable and other times not, depending on what
process is checking.
What happens is that the segment page tables are being shared but the
count is inaccurate depending on the ordering of events. As the page
tables are freed with put_page(), bad pmd's are found when some of the
children exit. The hugepage counters also get corrupted and the Total and
Free count will no longer match even when all the hugepage-backed regions
are freed. This requires a reboot of the machine to "fix".
This patch addresses the problem by comparing all flags except VM_LOCKED
when deciding if page tables should be shared or not for hugetlbfs-backed
mappings.
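The comparison change amounts to masking the flag out on both sides, e.g.
(variable names illustrative):

        unsigned long vm_flags  = vma->vm_flags  & ~VM_LOCKED;
        unsigned long svm_flags = svma->vm_flags & ~VM_LOCKED;

        if (vm_flags != svm_flags)
                return 0;       /* page tables not shareable */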
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <starlight@binnacle.cx>
Cc: Eric B Munson <ebmunson@us.ibm.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Always use NMI for the performance-monitoring interrupt, as there could be
racy situations if we switch between irq and nmi mode frequently.
Signed-off-by: Yong Wang <yong.y.wang@intel.com>
LKML-Reference: <20090529052835.GA13657@ywang-moblin2.bj.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This will test the automatic aperture enlargement code. This is
important because only very few devices will ever trigger this code
path. So force it under CONFIG_IOMMU_STRESS.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Disabling the round-robin allocator results in reusing the same
dma-addresses again very quickly. This is a good test of whether the iotlb
flushing is working correctly.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch makes sure no reserved addresses are allocated in a dma_ops
domain when the aperture is increased dynamically.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Simplify the code a little bit by using the same unit for all address
space related state in the dma_ops domain structure.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch changes the AMD IOMMU address allocator to allow up to 32
aperture ranges per dma_ops domain.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The code will be required when the aperture size increases dynamically
in the extended address allocator.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch makes sure that no function required for suspend/resume of
AMD IOMMU driver is thrown away after boot.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Current hardware uses MSI instead of MSI-X, so this code is not necessary
and cannot be tested. The best thing is to drop this code.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch restructures the AMD IOMMU initialization code to initialize
all hardware registers with one single function call.
This is helpful for suspend/resume support.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch introduces the for_each_iommu and for_each_iommu_safe macros
to simplify the developer's life when having to iterate over all AMD
IOMMUs in the system.
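A plausible shape for these helpers (assuming the driver keeps its IOMMUs on
a global amd_iommu_list):

        #define for_each_iommu(iommu) \
                list_for_each_entry((iommu), &amd_iommu_list, list)

        #define for_each_iommu_safe(iommu, next) \
                list_for_each_entry_safe((iommu), (next), &amd_iommu_list, list)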
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Some drivers may use the dma api during ->remove which will
cause a protection domain to get reattached to a device. Delay the
detach until after the driver is completely unbound.
[ joro: added a little merge helper ]
[ Impact: fix too early device<->domain removal ]
Signed-off-by: Chris Wright <chrisw@redhat.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The bug never triggered. But it should be fixed to protect against
broken ACPI tables in the future.
[ Impact: protect against broken ivrs acpi table ]
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The devid parameter to set_dev_entry_from_acpi is the requester ID
rather than the device ID since it is used to index the IOMMU device
table. The handling of IVHD_DEV_ALIAS used to pass the device ID.
This patch fixes it to pass the requester ID.
[ Impact: fix setting the wrong req-id in acpi-table parsing ]
Signed-off-by: Neil Turton <nturton@solarflare.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The variable amd_iommu_last_bdf holds the maximum bdf of any device
controlled by an IOMMU, so the number of device entries needed is
amd_iommu_last_bdf+1. The function tbl_size used amd_iommu_last_bdf
instead. This would be a problem if the last device were a large
enough power of 2.
[ Impact: fix amd_iommu_last_bdf off-by-one error ]
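A simplified sketch of the corrected sizing (mirrors the description above,
not necessarily the exact driver code):

        static inline unsigned long tbl_size(int entry_size)
        {
                /* device ids run 0..amd_iommu_last_bdf inclusive */
                unsigned shift = PAGE_SHIFT +
                        get_order(((int)amd_iommu_last_bdf + 1) * entry_size);

                return 1UL << shift;
        }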
Signed-off-by: Neil Turton <nturton@solarflare.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This Kconfig option is intended to enable various code paths or
parameters in IOMMU implementations to stress test the code and/or the
hardware. This can also be done by disabling optimizations in the code
when this option is switched on.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Add information about device memory mapping requirements for the IOMMU
as described in the IVRS ACPI table to the kernel log if amd_iommu_dump
was specified on the kernel command line.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Add information about devices belonging to an IOMMU as described in the
IVRS ACPI table to the kernel log if amd_iommu_dump was specified on the
kernel command line.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Add information about IOMMU devices described in the IVRS ACPI table to
the kernel log if amd_iommu_dump was specified on the kernel command
line.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This kernel parameter will be useful to get some AMD IOMMU related
information in dmesg that is not necessary for the default user but may
be helpful in debug situations.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The *fence instructions were moved to vsyscall_64.c by commit
cb9e35dce9. But this breaks the
vDSO, because vread methods are also called from there.
Besides, the synchronization might be unnecessary for other
time sources than TSC.
[ Impact: fix potential time warp in VDSO ]
Signed-off-by: Petr Tesarik <ptesarik@suse.cz>
LKML-Reference: <9d0ea9ea0f866bdc1f4d76831221ae117f11ea67.1243241859.git.ptesarik@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@kernel.org>
instead of declaring one variant as an inline function...
because the other case is a variable
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4A13B344.7030307@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: avoid back to back on_each_cpu in cpa_flush_array
x86, relocs: ignore R_386_NONE in kernel relocation entries
Cleanup cpa_flush_array() to avoid back to back on_each_cpu() calls.
[ Impact: optimizes fix 0af48f42df ]
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Slightly modified by trenn@suse.de -> only do this on fam 10h and fam 11h.
Currently powernow-k8 determines CPU frequency from ACPI PSS objects, but
according to AMD family 11h BKDG this frequency is just a rounded value:
"CoreFreq (MHz) = The CPU COF specified by MSRC001_00[6B:64][CpuFid]
rounded to the nearest 100 Mhz."
As a consequence, powernow-k8 reports a wrong CPU frequency on some systems,
e.g. on Turion X2 Ultra:
powernow-k8: Found 1 AMD Turion(tm)X2 Ultra DualCore Mobile ZM-82
processors (2 cpu cores) (version 2.20.00)
powernow-k8: 0 : pstate 0 (2200 MHz)
powernow-k8: 1 : pstate 1 (1100 MHz)
powernow-k8: 2 : pstate 2 (600 MHz)
But this is wrong, as the frequency for Pstate2 is 550 MHz. x86info reports
it correctly:
#x86info -a |grep Pstate
...
Pstate-0: fid=e, did=0, vid=24 (2200MHz)
Pstate-1: fid=e, did=1, vid=30 (1100MHz)
Pstate-2: fid=e, did=2, vid=3c (550MHz) (current)
The solution is to determine the frequency directly from the Pstate MSRs
instead of using the rounded values from the ACPI table.
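For family 11h (the CPU in the example above) the BKDG gives
CoreCOF = 100 MHz * (CpuFid + 08h) / (2^CpuDid); a sketch of the conversion
(bit positions are assumptions to be checked against the BKDG, and family 10h
uses a different CpuFid offset):

        static u32 fam11h_pstate_khz(u64 pstate_msr)
        {
                u32 fid = pstate_msr & 0x3f;            /* CpuFid, bits 5:0 */
                u32 did = (pstate_msr >> 6) & 0x7;      /* CpuDid, bits 8:6 */

                return (100000 * (fid + 0x08)) >> did;  /* kHz */
        }

        /* fid=0xe, did=2 from the x86info dump above:
           2200000 >> 2 = 550000 kHz = 550 MHz */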
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Dave Jones <davej@redhat.com>
- Make the message shorter and easier to grep for
- Use printk_once instead of WARN_ONCE (functionality of these was mixed)
Signed-off-by: Thomas Renninger <trenn@suse.de>
Cc: Langsdorf, Mark <mark.langsdorf@amd.com>
Signed-off-by: Dave Jones <davej@redhat.com>
arch/x86/kernel/cpu/cpufreq/powernow-k7.c:172: warning: 'invalidate_entry' defined but not used
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Dave Jones <davej@redhat.com>
Some atom procs don't do freq scaling (such as the atom 330 on my own
littlefalls2 board). By adding the atom family here, we at least get
the benefit of passive cooling in a thermal emergency. Not sure how
to see that it's actually helping any, but the driver does bind and
claims it's functioning on my atom 330.
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Dave Jones <davej@redhat.com>
We have a debug check that detects stuck NMIs and returns with
the PMU disabled in the global ctrl MSR - but I managed to trigger
a situation where this was not enough to deassert the NMI.
So clear/reset the full PMU and keep the disable count balanced when
exiting from here. This way the box produces a debug warning but
stays up and is more debuggable.
[ Impact: in case of PMU related bugs, recover more gracefully ]
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
My Nehalem box locks up in certain situations (with an
always-asserted NMI causing a lockup) if the PMU LVT
entry is programmed between NMI and IRQ mode with a
high frequency.
Standardize exclusively on NMIs instead.
[ Impact: fix lockup ]
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
For relocatable 32bit kernels, boot/compressed/relocs.c processes
relocation entries in the kernel image and appends them to the kernel
image such that boot/compressed/head_32.S can relocate the kernel.
The kernel image is one statically linked object and only uses two
relocation types - R_386_PC32 and R_386_32; of the two, only the latter
needs massaging during kernel relocation and is thus handled by relocs.
R_386_PC32 is ignored and all other relocation types are considered
errors.
When the target of a relocation resides in a discarded section,
binutils doesn't throw away the relocation record but nullifies it by
changing it to R_386_NONE, which unfortunately makes relocs fail.
The problem was triggered by as-yet out-of-tree x86 stack unwind patches,
but given the binutils behavior, ignoring R_386_NONE is the right
thing to do.
The problem has been tracked down to binutils behavior by Jan Beulich.
[ Impact: fix build with certain binutils by ignoring R_386_NONE ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <4A1B8150.40702@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* 'kvm-updates/2.6.30' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: Fix PDPTR reloading on CR4 writes
KVM: Make paravirt tlb flush also reload the PAE PDPTRs
This reverts commit b68f1d2e7a.
It is causing problems (stuck/stuttering profiling) when mixed
NMI and non-NMI counters are used.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090525153931.703093461@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Introduce a generic per-counter interrupt throttle.
This uses the perf_counter_overflow() quick disable to throttle a specific
counter when it's going too fast, provided a pmu->unthrottle() method is
implemented which can undo the quick disable.
Power needs to implement both the quick disable and the unthrottle method.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090525153931.703093461@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove the x86-specific interrupt throttle.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090525153931.616671838@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Expose the INV and EDGE bits of the PMU to raw configs.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090525153931.494709027@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The processor is documented to reload the PDPTRs while in PAE mode if any
of the CR4 bits PSE, PGE, or PAE change. Linux relies on this
behaviour when zapping the low mappings of PAE kernels during boot.
The code already handled changes to CR4.PAE; augment it to also notice changes
to PSE and PGE.
This triggered while booting an F11 PAE kernel; the futex initialization code
runs before any CR3 reloads and writes to a NULL pointer; the futex subsystem
ended up uninitialized, killing PI futexes and pulseaudio which uses them.
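The check boils down to watching all three bits, roughly (sketch only):

        static bool cr4_needs_pdptr_reload(unsigned long old_cr4,
                                           unsigned long new_cr4)
        {
                unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE;

                /* any change to PAE, PSE or PGE while in PAE paging
                   must reload the PDPTRs from guest memory */
                return (old_cr4 ^ new_cr4) & pdptr_bits;
        }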
Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
The paravirt tlb flush may be used not only to flush TLBs, but also
to reload the four page-directory-pointer-table entries, as it is used
as a replacement for reloading CR3. Change the code to do the entire
CR3 reloading dance instead of simply flushing the TLB.
Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
The remap percpu allocator has a subtle bug when combined with page
attribute changing. The remap percpu allocator aliases PMD pages for the
first chunk, and as pageattr doesn't know about the alias it ends up
updating the page attributes of the original mapping only, thus leaving
the aliases in an inconsistent state, which might lead to subtle data
corruption. Please read the following thread for more information:
http://thread.gmane.org/gmane.linux.kernel/835783
The following is the proposed fix which teaches pageattr about percpu
aliases.
http://thread.gmane.org/gmane.linux.kernel/837157
However, the above changes are deemed too pervasive for upstream
inclusion for 2.6.30 release, so this patch essentially disables
the remap allocator for the time being.
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <4A1A0A27.4050301@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
cpa_flush_array() seems to prefer wbinvd() over clflush at the 4M threshold.
clflush needs to be done on only one CPU, as per the instruction definition.
wbinvd(), however, should be done on all CPUs.
[ Impact: fix missing flush which could cause data corruption ]
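A sketch of the broadcast pattern (illustrative; the surrounding cpa code is
not shown):

        static void do_wbinvd(void *unused)
        {
                wbinvd();       /* only flushes the executing CPU */
        }

        /* clflush is coherent, one CPU suffices; wbinvd must hit every CPU: */
        on_each_cpu(do_wbinvd, NULL, 1);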
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
wbinvd is supported on all CPUs 486 or later. But pageattr.c checks
x86_model >= 4 before wbinvd(), which looks like an oversight bug. It was
first introduced in one place by changeset
d7c8f21a8c and got copied over to a second
place in the same file later.
[ Impact: fix missing cache flush on early-model CPUs, potential data corruption ]
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Introduce "noxsave" boot parameter which will disable the cpu's xsave/xrstor
capabilities. Useful for debugging and working around xsave related issues.
[ Impact: make it possible to debug problems in the field ]
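One way such a parameter is typically wired up (a sketch; the handler name
is made up):

        static int __init x86_noxsave_setup(char *s)
        {
                setup_clear_cpu_cap(X86_FEATURE_XSAVE);
                return 1;
        }
        __setup("noxsave", x86_noxsave_setup);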
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Remove ACPI 3 E820 extended memory attributes support. At least one
vendor actively set all the flags to zero, but left ECX on return at
24. This bug may be present in other BIOSes.
The breakage functionally means the ACPI 3 flags are probably
completely useless, and that no OS any time soon is going to rely on
their existence. Therefore, drop support completely. We may want to
revisit this question in the future, if we find ourselves actually
needing the flags.
This reverts all or part of the following checkins:
cd670599b7c549e71d07
However, retain the part from the latter commit that copies e820 into
a temporary buffer; that is an unrelated BIOS workaround. Put in a
comment to explain that part.
See https://bugzilla.redhat.com/show_bug.cgi?id=499396 for some
additional information.
[ Impact: detect all memory on affected machines ]
Reported-by: Thomas J. Baker <tjb@unh.edu>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Acked-by: Len Brown <len.brown@intel.com>
Cc: Chuck Ebbert <cebbert@redhat.com>
Cc: Kyle McMartin <kmcmartin@redhat.com>
Cc: Matt Domsch <matt_domsch@dell.com>