Commit e45adf665a ("KVM: Introduce a new guest mapping API", 2019-01-31)
introduced a build failure on aarch64 defconfig:
$ make -j$(nproc) ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- O=out defconfig \
Image.gz
...
../arch/arm64/kvm/../../../virt/kvm/kvm_main.c:
In function '__kvm_map_gfn':
../arch/arm64/kvm/../../../virt/kvm/kvm_main.c:1763:9: error:
implicit declaration of function 'memremap'; did you mean 'memset_p'?
../arch/arm64/kvm/../../../virt/kvm/kvm_main.c:1763:46: error:
'MEMREMAP_WB' undeclared (first use in this function)
../arch/arm64/kvm/../../../virt/kvm/kvm_main.c:
In function 'kvm_vcpu_unmap':
../arch/arm64/kvm/../../../virt/kvm/kvm_main.c:1795:3: error:
implicit declaration of function 'memunmap'; did you mean 'vm_munmap'?
because these functions are declared in <linux/io.h> rather than <asm/io.h>,
and the former was being pulled in already on x86 but not on aarch64.
Reported-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- guest SVE support
- guest Pointer Authentication support
- Better discrimination of perf counters between host and guests
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAlzMM9kVHG1hcmMuenlu
Z2llckBhcm0uY29tAAoJECPQ0LrRPXpDEp8P/iqZvvZlLdlnWQwluWh237c28kAo
zELO0L7Wl+OJ66v2hzM+NPBi5kv/9pSv7AoKNLv3398YmKFt0n7yUB+MHi0BC9xi
ZEp4etCOiVcqcWWeDiAXLdR9OQlb7IDBDc56s4V9HQgK3sEb4u8aEJIy/nDBVniv
GVLMh1EOsrviIYso6UVxI1X7lPQevpCS0kv9/llhhzEj8QDxnQThjDuW3wrAyhQi
F9XNVjAMW8rft7vvok9cxT4v+TR1HgUajquoSrjXuonWHgKnC9tSH/dHILNK8Zij
5OApojGlZQrXIa5Sk3JOhGahVVY9Y+ewsw58J5bJxd0/xrKXnWk/Lann7NE+UcBf
RJMHfanIO/+JJRzHhagejK7pqnYXD1PWBwF8z3Hefs1IVw4eBvPBGuhIULJ6+eSP
+3JCwiOiwshG43gZlGmHcgvhPdeX4r/BlopWV9+0X/gAjcU1+3+ZG6J3jeAcC1Kx
i481dSzlZ7Ar7VWDCk7WgcmDvUwHXtxq0HbqzQjPBO04kkakjdPZZrZIX3+Qhlem
GpkPVb2z5h5KTk9Fx03ZXxPVdiOQh1UmNC8jlsYZPWcJVTLkySs7HWXZJe+WTs4Z
NLuen/eA4/NCon+UA6XdIG5Ddn/J39UuF1lCApHPHn576rwz+HmqpcN59XiU6y4h
XHIxzajFcXNpn802
=fjph
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-for-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm updates for 5.2
- guest SVE support
- guest Pointer Authentication support
- Better discrimination of perf counters between host and guests
Conflicts:
include/uapi/linux/kvm.h
- Fix a bug, fix a spelling mistake, remove some useless code.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJc2kTEAAoJEJ2a6ncsY3GfS88IAImcIlKXMvzSKtHFxGpRap17
9LTZs5MQAUZHVMFJXmrJLBgogtGxUw53aX53woeyerytZsoGU4+YzwgLhk4XBEzA
5Kt5ahlxu82sa2ThH1zyLlNWFXiTECgD5ErNTdavLbNlaKE8YG160+65/mSyixGz
vs5wLSYGv/37no1ay6PIZ3DtwqdrYq5nJbuG+ZsaamUHPJOGprqHqg0gaTJ877NZ
yQDUS7OVuEJ1pdUUK/elP+cnlqR9smaP5OUNsXYMHWJgPJMjc27/thBJy93iS1kk
/zKQ8AFmxqoaePnR7ymTbqurfFFHBiSavUmyWopSQppNHCf4DDE8XjLs9MXKez8=
=Lco4
-----END PGP SIGNATURE-----
Merge tag 'kvm-ppc-next-5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD
PPC KVM update for 5.2
* Support for guests to access the new POWER9 XIVE interrupt controller
hardware directly, reducing interrupt latency and overhead for guests.
* In-kernel implementation of the H_PAGE_INIT hypercall.
* Reduce memory usage of sparsely-populated IOMMU tables.
* Several bug fixes.
Second PPC KVM update for 5.2
* Fix a bug, fix a spelling mistake, remove some useless code.
There is no need to test for the device pointer validity when releasing
a KVM device. The file descriptor should identify it safely.
Fixes: 2bde9b3ec8 ("KVM: Introduce a 'release' method for KVM devices")
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The previous KVM_CAP_MANUAL_DIRTY_LOG_PROTECT has some problem which
blocks the correct usage from userspace. Obsolete the old one and
introduce a new capability bit for it.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Just imaging the case where num_pages < BITS_PER_LONG, then the loop
will be skipped while it shouldn't.
Signed-off-by: Peter Xu <peterx@redhat.com>
Fixes: 2a31b9db15
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_dirty_bitmap_bytes() will return the size of the dirty bitmap of
the memslot rather than the size of bitmap passed over from the ioctl.
Here for KVM_CLEAR_DIRTY_LOG we should only copy exactly the size of
bitmap that covers kvm_clear_dirty_log.num_pages.
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 2a31b9db15
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In KVM, specially for nested guests, there is a dominant pattern of:
=> map guest memory -> do_something -> unmap guest memory
In addition to all this unnecessarily noise in the code due to boiler plate
code, most of the time the mapping function does not properly handle memory
that is not backed by "struct page". This new guest mapping API encapsulate
most of this boiler plate code and also handles guest memory that is not
backed by "struct page".
The current implementation of this API is using memremap for memory that is
not backed by a "struct page" which would lead to a huge slow-down if it
was used for high-frequency mapping operations. The API does not have any
effect on current setups where guest memory is backed by a "struct page".
Further patches are going to also introduce a pfn-cache which would
significantly improve the performance of the memremap case.
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
is_dirty has been renamed to flush, but the comment for it is
outdated. And the description about @flush parameter for
kvm_clear_dirty_log_protect() is missing, add it in this patch
as well.
Signed-off-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If a memory slot's size is not a multiple of 64 pages (256K), then
the KVM_CLEAR_DIRTY_LOG API is unusable: clearing the final 64 pages
either requires the requested page range to go beyond memslot->npages,
or requires log->num_pages to be unaligned, and kvm_clear_dirty_log_protect
requires log->num_pages to be both in range and aligned.
To allow this case, allow log->num_pages not to be a multiple of 64 if
it ends exactly on the last page of the slot.
Reported-by: Peter Xu <peterx@redhat.com>
Fixes: 98938aa8ed ("KVM: validate userspace input in kvm_clear_dirty_log_protect()", 2019-01-02)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- VSIE crypto fixes
- new guest features for gen15
- disable halt polling for nested virtualization with overcommit
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJcxrJmAAoJEBF7vIC1phx8EsEP/2mIUbtY9OmVCZNHX43ds5Jr
WR51UA/cXQGzP1cqLrqIchjJ40J7KGYBqS+9MeOyUxX85HUvb5dGgUiIfDOmh8R7
YIHe3nkM0dcIRbeuSp48sA8rl817TNGSBg7GnUN+eaEvJ/U+WbLb1sry/0uZN6Tm
2iFkff+XgSeEfBmrlxiPVl5PGUxi6FtKQWDwhn+MRkvs4sdQBh1SBITMIrzMgDmQ
GMd5olfLp3AZZV2yniFvZM9TSWvKobCCH6IVF0/mBchxkqmdjQaKdSCRO6a1pLDh
8PVBN7i+yipLURUMBuDCMxGDBINJgvvXkThB8N9K6+CanUc8KCc7l0EimS93s3DB
FsutI/2mSFy/xJ4nk98VVp8WCbVftQLtyKUSytBiqCTSpg1gtFMMntCPAqlON4TV
xHOaAnJjF4Lhvfm0QrxQ22bAmuju6WIh5WKG8D+s7yqcn7GZeDUYdeftWiGNteaf
sJwX1Vq8H6iUac1mfp7UbfT+60UuiCkj/d9sY9eRBNlPPIX6V4UgZU4Xh8/rSMf3
qnN4RCBGIQqndUzRzaw7ZtAfNy5jBE1BABems49fy07kuPCzrg9tQqXlWxf/60Ad
QKqZ3Q/hb4ixYQJ7TAqQZmq1D3NL8w+V9MthcILmEGfMYF4BZKJV39ZigbttRIcN
ZuiS+8IfOWN1IXZ2zXL0
=mZyZ
-----END PGP SIGNATURE-----
Merge tag 'kvm-s390-next-5.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: s390: Features and fixes for 5.2
- VSIE crypto fixes
- new guest features for gen15
- disable halt polling for nested virtualization with overcommit
When a P9 sPAPR VM boots, the CAS negotiation process determines which
interrupt mode to use (XICS legacy or XIVE native) and invokes a
machine reset to activate the chosen mode.
To be able to switch from one interrupt mode to another, we introduce
the capability to release a KVM device without destroying the VM. The
KVM device interface is extended with a new 'release' method which is
called when the file descriptor of the device is closed.
Once 'release' is called, the 'destroy' method will not be called
anymore as the device is removed from the device list of the VM.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Some KVM devices will want to handle special mappings related to the
underlying HW. For instance, the XIVE interrupt controller of the
POWER9 processor has MMIO pages for thread interrupt management and
for interrupt source control that need to be exposed to the guest when
the OS has the required support.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
There are cases where halt polling is unwanted. For example when running
KVM on an over committed LPAR we rather want to give back the CPU to
neighbour LPARs instead of polling. Let us provide a callback that
allows architectures to disable polling.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
With VHE different exception levels are used between the host (EL2) and
guest (EL1) with a shared exception level for userpace (EL0). We can take
advantage of this and use the PMU's exception level filtering to avoid
enabling/disabling counters in the world-switch code. Instead we just
modify the counter type to include or exclude EL0 at vcpu_{load,put} time.
We also ensure that trapped PMU system register writes do not re-enable
EL0 when reconfiguring the backing perf events.
This approach completely avoids blackout windows seen with !VHE.
Suggested-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Andrew Murray <andrew.murray@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
The virt/arm core allocates a kvm_cpu_context_t percpu, at present this is
a typedef to kvm_cpu_context and is used to store host cpu context. The
kvm_cpu_context structure is also used elsewhere to hold vcpu context.
In order to use the percpu to hold additional future host information we
encapsulate kvm_cpu_context in a new structure and rename the typedef and
percpu to match.
Signed-off-by: Andrew Murray <andrew.murray@arm.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
When pointer authentication is supported, a guest may wish to use it.
This patch adds the necessary KVM infrastructure for this to work, with
a semi-lazy context switch of the pointer auth state.
Pointer authentication feature is only enabled when VHE is built
in the kernel and present in the CPU implementation so only VHE code
paths are modified.
When we schedule a vcpu, we disable guest usage of pointer
authentication instructions and accesses to the keys. While these are
disabled, we avoid context-switching the keys. When we trap the guest
trying to use pointer authentication functionality, we change to eagerly
context-switching the keys, and enable the feature. The next time the
vcpu is scheduled out/in, we start again. However the host key save is
optimized and implemented inside ptrauth instruction/register access
trap.
Pointer authentication consists of address authentication and generic
authentication, and CPUs in a system might have varied support for
either. Where support for either feature is not uniform, it is hidden
from guests via ID register emulation, as a result of the cpufeature
framework in the host.
Unfortunately, address authentication and generic authentication cannot
be trapped separately, as the architecture provides a single EL2 trap
covering both. If we wish to expose one without the other, we cannot
prevent a (badly-written) guest from intermittently using a feature
which is not uniformly supported (when scheduled on a physical CPU which
supports the relevant feature). Hence, this patch expects both type of
authentication to be present in a cpu.
This switch of key is done from guest enter/exit assembly as preparation
for the upcoming in-kernel pointer authentication support. Hence, these
key switching routines are not implemented in C code as they may cause
pointer authentication key signing error in some situations.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
[Only VHE, key switch in full assembly, vcpu_has_ptrauth checks
, save host key in ptrauth exception trap]
Signed-off-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
Reviewed-by: Julien Thierry <julien.thierry@arm.com>
Cc: Christoffer Dall <christoffer.dall@arm.com>
Cc: kvmarm@lists.cs.columbia.edu
[maz: various fixups]
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
The introduction of kvm_arm_init_arch_resources() looks like
premature factoring, since nothing else uses this hook yet and it
is not clear what will use it in the future.
For now, let's not pretend that this is a general thing:
This patch simply renames the function to kvm_arm_init_sve(),
retaining the arm stub version under the new name.
Suggested-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
All architectures except MIPS were defining it in the same way,
and memory slots are handled entirely by common code so there
is no point in keeping the definition per-architecture.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Some aspects of vcpu configuration may be too complex to be
completed inside KVM_ARM_VCPU_INIT. Thus, there may be a
requirement for userspace to do some additional configuration
before various other ioctls will work in a consistent way.
In particular this will be the case for SVE, where userspace will
need to negotiate the set of vector lengths to be made available to
the guest before the vcpu becomes fully usable.
In order to provide an explicit way for userspace to confirm that
it has finished setting up a particular vcpu feature, this patch
adds a new ioctl KVM_ARM_VCPU_FINALIZE.
When userspace has opted into a feature that requires finalization,
typically by means of a feature flag passed to KVM_ARM_VCPU_INIT, a
matching call to KVM_ARM_VCPU_FINALIZE is now required before
KVM_RUN or KVM_GET_REG_LIST is allowed. Individual features may
impose additional restrictions where appropriate.
No existing vcpu features are affected by this, so current
userspace implementations will continue to work exactly as before,
with no need to issue KVM_ARM_VCPU_FINALIZE.
As implemented in this patch, KVM_ARM_VCPU_FINALIZE is currently a
placeholder: no finalizable features exist yet, so ioctl is not
required and will always yield EINVAL. Subsequent patches will add
the finalization logic to make use of this ioctl for SVE.
No functional change for existing userspace.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Julien Thierry <julien.thierry@arm.com>
Tested-by: zhang.lei <zhang.lei@jp.fujitsu.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
This patch adds a kvm_arm_init_arch_resources() hook to perform
subarch-specific initialisation when starting up KVM.
This will be used in a subsequent patch for global SVE-related
setup on arm64.
No functional change.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Julien Thierry <julien.thierry@arm.com>
Tested-by: zhang.lei <zhang.lei@jp.fujitsu.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
- Fix THP handling in the presence of pre-existing PTEs
- Honor request for PTE mappings even when THPs are available
- GICv4 performance improvement
- Take the srcu lock when writing to guest-controlled ITS data structures
- Reset the virtual PMU in preemptible context
- Various cleanups
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAlycy1MVHG1hcmMuenlu
Z2llckBhcm0uY29tAAoJECPQ0LrRPXpDlJkQAK0mjfSOyZpeYX48XqE2Vnmr0HH2
bfwh63jv2apR4frlT9XiSV4xUEhM8n6mblnSSuXZlykR74ZxzrCb0QKROTe67jAQ
HxJrK+9JuKZE4VTKtpx+JIyr377NosY/cx2E87Agdof/+L2tqfjKAnI6MLsVIuca
H3F6SH2YVhcPcf0+6nKPUF5wacKlsDZfmrq3Hgk1MbgXG5UmFdjBFlcTUOFSumkK
E4hD93VHiQUDKvJL/qRN3p940c0lVOcDxK9Gu1AhmosvdLYEZJgtZeLBvysDH6Xl
xbq5EU5q/Gks/DEQqallaOiocREc3mZCUlF+pC1sLUHalEClEvr47hlXCLHz9RzA
yV0wW7Foo0oZb+zQyswrCZbAhpO7Vuqd00ghSgV6n6hTS2O7t9syiSe3QCvgCkoy
YZ+Eogx66m68bvbBqFrLwGHUlg+XSpw8ZZ5sh6d6iCVjSVp7ptbmxMgxuMQoJgVT
jl7WbXWYdZ2f3gYlKhtN+3IGxFIAUvFeSxFxJYnwWQPEcR6gY/lBCbwGTL5WT96F
K12V6egTau3QrG+hP+DYPwzWNPcC5oNlayOKNPlTuiwzp2PsSdBAMpEN4Kadl8m8
cMvvRhrcpmrTZdLLQATU9O9A75M1si1a9zaLuC1icYZwYeV593BnqvBBRwS9r/1S
ccg/6EkNpWZby3yd
=mpQX
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-fixes-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master
KVM/ARM fixes for 5.1
- Fix THP handling in the presence of pre-existing PTEs
- Honor request for PTE mappings even when THPs are available
- GICv4 performance improvement
- Take the srcu lock when writing to guest-controlled ITS data structures
- Reset the virtual PMU in preemptible context
- Various cleanups
The function irqfd_wakeup() has flags defined as __poll_t and then it
has additional flags which is used for irqflags.
Redefine the inner flags variable as iflags so it does not shadow the
outer flags.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM's API requires thats ioctls must be issued from the same process
that created the VM. In other words, userspace can play games with a
VM's file descriptors, e.g. fork(), SCM_RIGHTS, etc..., but only the
creator can do anything useful. Explicitly reject device ioctls that
are issued by a process other than the VM's creator, and update KVM's
API documentation to extend its requirements to device ioctls.
Fixes: 852b6d57dc ("kvm: add device control API")
Cc: <stable@vger.kernel.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Some comments in virt/kvm/arm/mmu.c are outdated. Update them to
reflect the current state of the code.
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
[maz: commit message tidy-up]
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Fix sparse warnings:
arch/arm64/kvm/../../../virt/kvm/arm/vgic/vgic-its.c:1732:5: warning:
symbol 'vgic_its_has_attr_regs' was not declared. Should it be static?
arch/arm64/kvm/../../../virt/kvm/arm/vgic/vgic-its.c:1753:5: warning:
symbol 'vgic_its_attr_regs_access' was not declared. Should it be static?
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
[maz: fixed subject]
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
We rely on the mmu_notifier call backs to handle the split/merge
of huge pages and thus we are guaranteed that, while creating a
block mapping, either the entire block is unmapped at stage2 or it
is missing permission.
However, we miss a case where the block mapping is split for dirty
logging case and then could later be made block mapping, if we cancel the
dirty logging. This not only creates inconsistent TLB entries for
the pages in the the block, but also leakes the table pages for
PMD level.
Handle this corner case for the huge mappings at stage2 by
unmapping the non-huge mapping for the block. This could potentially
release the upper level table. So we need to restart the table walk
once we unmap the range.
Fixes : ad361f093c ("KVM: ARM: Support hugetlbfs backed huge pages")
Reported-by: Zheng Xiang <zhengxiang9@huawei.com>
Cc: Zheng Xiang <zhengxiang9@huawei.com>
Cc: Zenghui Yu <yuzenghui@huawei.com>
Cc: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
commit 6794ad5443 ("KVM: arm/arm64: Fix unintended stage 2 PMD mappings")
made the checks to skip huge mappings, stricter. However it introduced
a bug where we still use huge mappings, ignoring the flag to
use PTE mappings, by not reseting the vma_pagesize to PAGE_SIZE.
Also, the checks do not cover the PUD huge pages, that was
under review during the same period. This patch fixes both
the issues.
Fixes : 6794ad5443 ("KVM: arm/arm64: Fix unintended stage 2 PMD mappings")
Reported-by: Zenghui Yu <yuzenghui@huawei.com>
Cc: Zenghui Yu <yuzenghui@huawei.com>
Cc: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
When halting a guest, QEMU flushes the virtual ITS caches, which
amounts to writing to the various tables that the guest has allocated.
When doing this, we fail to take the srcu lock, and the kernel
shouts loudly if running a lockdep kernel:
[ 69.680416] =============================
[ 69.680819] WARNING: suspicious RCU usage
[ 69.681526] 5.1.0-rc1-00008-g600025238f51-dirty #18 Not tainted
[ 69.682096] -----------------------------
[ 69.682501] ./include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage!
[ 69.683225]
[ 69.683225] other info that might help us debug this:
[ 69.683225]
[ 69.683975]
[ 69.683975] rcu_scheduler_active = 2, debug_locks = 1
[ 69.684598] 6 locks held by qemu-system-aar/4097:
[ 69.685059] #0: 0000000034196013 (&kvm->lock){+.+.}, at: vgic_its_set_attr+0x244/0x3a0
[ 69.686087] #1: 00000000f2ed935e (&its->its_lock){+.+.}, at: vgic_its_set_attr+0x250/0x3a0
[ 69.686919] #2: 000000005e71ea54 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
[ 69.687698] #3: 00000000c17e548d (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
[ 69.688475] #4: 00000000ba386017 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
[ 69.689978] #5: 00000000c2c3c335 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
[ 69.690729]
[ 69.690729] stack backtrace:
[ 69.691151] CPU: 2 PID: 4097 Comm: qemu-system-aar Not tainted 5.1.0-rc1-00008-g600025238f51-dirty #18
[ 69.691984] Hardware name: rockchip evb_rk3399/evb_rk3399, BIOS 2019.04-rc3-00124-g2feec69fb1 03/15/2019
[ 69.692831] Call trace:
[ 69.694072] lockdep_rcu_suspicious+0xcc/0x110
[ 69.694490] gfn_to_memslot+0x174/0x190
[ 69.694853] kvm_write_guest+0x50/0xb0
[ 69.695209] vgic_its_save_tables_v0+0x248/0x330
[ 69.695639] vgic_its_set_attr+0x298/0x3a0
[ 69.696024] kvm_device_ioctl_attr+0x9c/0xd8
[ 69.696424] kvm_device_ioctl+0x8c/0xf8
[ 69.696788] do_vfs_ioctl+0xc8/0x960
[ 69.697128] ksys_ioctl+0x8c/0xa0
[ 69.697445] __arm64_sys_ioctl+0x28/0x38
[ 69.697817] el0_svc_common+0xd8/0x138
[ 69.698173] el0_svc_handler+0x38/0x78
[ 69.698528] el0_svc+0x8/0xc
The fix is to obviously take the srcu lock, just like we do on the
read side of things since bf308242ab. One wonders why this wasn't
fixed at the same time, but hey...
Fixes: bf308242ab ("KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock")
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
The normal interrupt flow is not to enable the vgic when no virtual
interrupt is to be injected (i.e. the LRs are empty). But when a guest
is likely to use GICv4 for LPIs, we absolutely need to switch it on
at all times. Otherwise, VLPIs only get delivered when there is something
in the LRs, which doesn't happen very often.
Reported-by: Nianyao Tang <tangnianyao@huawei.com>
Tested-by: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
for 32-bit guests
s390: interrupt cleanup, introduction of the Guest Information Block,
preparation for processor subfunctions in cpu models
PPC: bug fixes and improvements, especially related to machine checks
and protection keys
x86: many, many cleanups, including removing a bunch of MMU code for
unnecessary optimizations; plus AVIC fixes.
Generic: memcg accounting
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJci+7XAAoJEL/70l94x66DUMkIAKvEefhceySHYiTpfefjLjIC
16RewgHa+9CO4Oo5iXiWd90fKxtXLXmxDQOS4VGzN0rxvLGRw/fyXIxL1MDOkaAO
l8SLSNuewY4XBUgISL3PMz123r18DAGOuy9mEcYU/IMesYD2F+wy5lJ17HIGq6X2
RpoF1p3qO1jfkPTKOob6Ixd4H5beJNPKpdth7LY3PJaVhDxgouj32fxnLnATVSnN
gENQ10fnt8BCjshRYW6Z2/9bF15JCkUFR1xdBW2/xh1oj+kvPqqqk2bEN1eVQzUy
2hT/XkwtpthqjSbX8NNavWRSFnOnbMLTRKQyIXmFVsM5VoSrwtiGsCFzBgcT++I=
=XIzU
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"ARM:
- some cleanups
- direct physical timer assignment
- cache sanitization for 32-bit guests
s390:
- interrupt cleanup
- introduction of the Guest Information Block
- preparation for processor subfunctions in cpu models
PPC:
- bug fixes and improvements, especially related to machine checks
and protection keys
x86:
- many, many cleanups, including removing a bunch of MMU code for
unnecessary optimizations
- AVIC fixes
Generic:
- memcg accounting"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (147 commits)
kvm: vmx: fix formatting of a comment
KVM: doc: Document the life cycle of a VM and its resources
MAINTAINERS: Add KVM selftests to existing KVM entry
Revert "KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range()"
KVM: PPC: Book3S: Add count cache flush parameters to kvmppc_get_cpu_char()
KVM: PPC: Fix compilation when KVM is not enabled
KVM: Minor cleanups for kvm_main.c
KVM: s390: add debug logging for cpu model subfunctions
KVM: s390: implement subfunction processor calls
arm64: KVM: Fix architecturally invalid reset value for FPEXC32_EL2
KVM: arm/arm64: Remove unused timer variable
KVM: PPC: Book3S: Improve KVM reference counting
KVM: PPC: Book3S HV: Fix build failure without IOMMU support
Revert "KVM: Eliminate extra function calls in kvm_get_dirty_log_protect()"
x86: kvmguest: use TSC clocksource if invariant TSC is exposed
KVM: Never start grow vCPU halt_poll_ns from value below halt_poll_ns_grow_start
KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameter
KVM: grow_halt_poll_ns() should never shrink vCPU halt_poll_ns
KVM: x86/mmu: Consolidate kvm_mmu_zap_all() and kvm_mmu_zap_mmio_sptes()
KVM: x86/mmu: WARN if zapping a MMIO spte results in zapping children
...
- Update the ACPICA code in the kernel to upstream revision 20190215
including ACPI 6.3 support and more:
* New predefined methods: _NBS, _NCH, _NIC, _NIH, and _NIG (Erik
Schmauss).
* Update of the PCC Identifier structure in PDTT (Erik Schmauss).
* Support for new Generic Affinity Structure subtable in SRAT
(Erik Schmauss).
* New PCC operation region support (Erik Schmauss).
* Support for GICC statistical profiling for MADT (Erik Schmauss).
* New Error Disconnect Recover notification support (Erik Schmauss).
* New PPTT Processor Structure Flags fields support (Erik Schmauss).
* ACPI 6.3 HMAT updates (Erik Schmauss).
* GTDT Revision 3 support (Erik Schmauss).
* Legacy module-level code (MLC) support removal (Erik Schmauss).
* Update/clarification of messages for control method failures
(Bob Moore).
* Warning on creation of a zero-length opregion (Bob Moore).
* acpiexec option to dump extra info for memory leaks (Bob Moore).
* More ACPI error to firmware error conversions (Bob Moore).
* Debugger fix (Bob Moore).
* Copyrights update (Bob Moore).
- Clean up sleep states support code in ACPICA (Christoph Hellwig).
- Rework in_nmi() handling in the APEI code and add suppor for the
ARM Software Delegated Exception Interface (SDEI) to it (James
Morse).
- Fix possible out-of-bounds accesses in BERT-related core (Ross
Lagerwall).
- Fix the APEI code parsing HEST that includes a Deferred Machine
Check subtable (Yazen Ghannam).
- Use DEFINE_DEBUGFS_ATTRIBUTE for APEI-related debugfs files
(YueHaibing).
- Switch the APEI ERST code to the new generic UUID API (Andy
Shevchenko).
- Update the MAINTAINERS entry for APEI (Borislav Petkov).
- Fix and clean up the ACPI EC driver (Rafael Wysocki, Zhang Rui).
- Fix DMI checks handling in the ACPI backlight driver and add the
"Lunch Box" chassis-type check to it (Hans de Goede).
- Add support for using ACPI table overrides included in built-in
initrd images (Shunyong Yang).
- Update ACPI device enumeration to treat the PWM2 device as "always
present" on Lenovo Yoga Book (Yauhen Kharuzhy).
- Fix up the enumeration of device objects with the PRP0001 device
ID (Andy Shevchenko).
- Clean up PPTT parsing error messages (John Garry).
- Clean up debugfs files creation handling (Greg Kroah-Hartman,
Rafael Wysocki).
- Clean up the ACPI DPTF Makefile (Masahiro Yamada).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJcfSIaAAoJEILEb/54YlRxvL8P/2oiG+u3tm3JahQ2tk9iiX3S
4yjYMB5Gmhua3w/t6tnRHHhy3pjjgI6xH5S7WB0VPTMp57E91EQihcbLJNFiJ1Jf
zjeZtWSmoxvcVwHAXq0DZHFMRK9Xgc/1ckzWNH/pwVlBSgaYazuLr6bwtZhtorci
eNWi82abWfAp6kAXjzJkcFbEp9+H6JzseewKcT8VAKn63KZizCEzxT0PuE9c54km
QnILVB9we0aGD2i0w2BRpbz99Wse0vnoUkBcrDw0LFHCaEQjfyAa94YFVQVrkE1Q
ynH26+yQanyzH00q/HWuH7N7YdcYMYT1CgZoIKR5XtJ+CbTc63VQez4csLOgOFMM
VEwmuv5SdRQ+tLCNFn71dxRheAttKI/nGBAZWMRTLQkp412IrQP4BtWw4wFM8SHZ
3G7eReR/bBeS4u1T5KR8CVVxchinDdwnTvqQII1uEniX80AmsHsQZxtU+JdPDp+w
N6gUE+lPF8e4iT+YsrWFMoNsJ9/MoXbSPQK1oYIcL0f5+PjFMxjTbA53wDiMHAhS
9AqVW1fdSPX0ImV3DuDqHph3ekAt26QHKxIA2xj5WTRWKf+29ijO2+5zU8isT7kI
RfGzpvsSYdvPyIRLUqc/Q3d5u/ElacAaaKJNT+6gUT4AkINAZJKQRiw2dWO1g82O
HVuSc5hRfnAJ5ALfCdIG
=r6fU
-----END PGP SIGNATURE-----
Merge tag 'acpi-5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI updates from Rafael Wysocki:
"These are ACPICA updates including ACPI 6.3 support among other
things, APEI updates including the ARM Software Delegated Exception
Interface (SDEI) support, ACPI EC driver fixes and cleanups and other
assorted improvements.
Specifics:
- Update the ACPICA code in the kernel to upstream revision 20190215
including ACPI 6.3 support and more:
* New predefined methods: _NBS, _NCH, _NIC, _NIH, and _NIG (Erik
Schmauss).
* Update of the PCC Identifier structure in PDTT (Erik Schmauss).
* Support for new Generic Affinity Structure subtable in SRAT
(Erik Schmauss).
* New PCC operation region support (Erik Schmauss).
* Support for GICC statistical profiling for MADT (Erik Schmauss).
* New Error Disconnect Recover notification support (Erik
Schmauss).
* New PPTT Processor Structure Flags fields support (Erik
Schmauss).
* ACPI 6.3 HMAT updates (Erik Schmauss).
* GTDT Revision 3 support (Erik Schmauss).
* Legacy module-level code (MLC) support removal (Erik Schmauss).
* Update/clarification of messages for control method failures
(Bob Moore).
* Warning on creation of a zero-length opregion (Bob Moore).
* acpiexec option to dump extra info for memory leaks (Bob Moore).
* More ACPI error to firmware error conversions (Bob Moore).
* Debugger fix (Bob Moore).
* Copyrights update (Bob Moore)
- Clean up sleep states support code in ACPICA (Christoph Hellwig)
- Rework in_nmi() handling in the APEI code and add suppor for the
ARM Software Delegated Exception Interface (SDEI) to it (James
Morse)
- Fix possible out-of-bounds accesses in BERT-related core (Ross
Lagerwall)
- Fix the APEI code parsing HEST that includes a Deferred Machine
Check subtable (Yazen Ghannam)
- Use DEFINE_DEBUGFS_ATTRIBUTE for APEI-related debugfs files
(YueHaibing)
- Switch the APEI ERST code to the new generic UUID API (Andy
Shevchenko)
- Update the MAINTAINERS entry for APEI (Borislav Petkov)
- Fix and clean up the ACPI EC driver (Rafael Wysocki, Zhang Rui)
- Fix DMI checks handling in the ACPI backlight driver and add the
"Lunch Box" chassis-type check to it (Hans de Goede)
- Add support for using ACPI table overrides included in built-in
initrd images (Shunyong Yang)
- Update ACPI device enumeration to treat the PWM2 device as "always
present" on Lenovo Yoga Book (Yauhen Kharuzhy)
- Fix up the enumeration of device objects with the PRP0001 device ID
(Andy Shevchenko)
- Clean up PPTT parsing error messages (John Garry)
- Clean up debugfs files creation handling (Greg Kroah-Hartman,
Rafael Wysocki)
- Clean up the ACPI DPTF Makefile (Masahiro Yamada)"
* tag 'acpi-5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (65 commits)
ACPI / bus: Respect PRP0001 when retrieving device match data
ACPICA: Update version to 20190215
ACPI/ACPICA: Trivial: fix spelling mistakes and fix whitespace formatting
ACPICA: ACPI 6.3: add GTDT Revision 3 support
ACPICA: ACPI 6.3: HMAT updates
ACPICA: ACPI 6.3: PPTT add additional fields in Processor Structure Flags
ACPICA: ACPI 6.3: add Error Disconnect Recover Notification value
ACPICA: ACPI 6.3: MADT: add support for statistical profiling in GICC
ACPICA: ACPI 6.3: add PCC operation region support for AML interpreter
efi: cper: Fix possible out-of-bounds access
ACPI: APEI: Fix possible out-of-bounds access to BERT region
ACPICA: ACPI 6.3: SRAT: add Generic Affinity Structure subtable
ACPICA: ACPI 6.3: Add Trigger order to PCC Identifier structure in PDTT
ACPICA: ACPI 6.3: Adding predefined methods _NBS, _NCH, _NIC, _NIH, and _NIG
ACPICA: Update/clarify messages for control method failures
ACPICA: Debugger: Fix possible fault with the "test objects" command
ACPICA: Interpreter: Emit warning for creation of a zero-length op region
ACPICA: Remove legacy module-level code support
ACPI / x86: Make PWM2 device always present at Lenovo Yoga Book
ACPI / video: Extend chassis-type detection with a "Lunch Box" check
..
* acpi-apei: (29 commits)
efi: cper: Fix possible out-of-bounds access
ACPI: APEI: Fix possible out-of-bounds access to BERT region
MAINTAINERS: Add James Morse to the list of APEI reviewers
ACPI / APEI: Add support for the SDEI GHES Notification type
firmware: arm_sdei: Add ACPI GHES registration helper
ACPI / APEI: Use separate fixmap pages for arm64 NMI-like notifications
ACPI / APEI: Only use queued estatus entry during in_nmi_queue_one_entry()
ACPI / APEI: Split ghes_read_estatus() to allow a peek at the CPER length
ACPI / APEI: Make GHES estatus header validation more user friendly
ACPI / APEI: Pass ghes and estatus separately to avoid a later copy
ACPI / APEI: Let the notification helper specify the fixmap slot
ACPI / APEI: Move locking to the notification helper
arm64: KVM/mm: Move SEA handling behind a single 'claim' interface
KVM: arm/arm64: Add kvm_ras.h to collect kvm specific RAS plumbing
ACPI / APEI: Switch NOTIFY_SEA to use the estatus queue
ACPI / APEI: Move NOTIFY_SEA between the estatus-queue and NOTIFY_NMI
ACPI / APEI: Don't allow ghes_ack_error() to mask earlier errors
ACPI / APEI: Generalise the estatus queue's notify code
ACPI / APEI: Don't update struct ghes' flags in read/clear estatus
ACPI / APEI: Remove spurious GHES_TO_CLEAR check
...
debugfs can now report an error code if something went wrong instead of
just NULL. So if the return value is to be used as a "real" dentry, it
needs to be checked if it is an error before dereferencing it.
This is now happening because of ff9fb72bc0 ("debugfs: return error
values, not NULL"). syzbot has found a way to trigger multiple debugfs
files attempting to be created, which fails, and then the error code
gets passed to dentry_path_raw() which obviously does not like it.
Reported-by: Eric Biggers <ebiggers@kernel.org>
Reported-and-tested-by: syzbot+7857962b4d45e602b8ad@syzkaller.appspotmail.com
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- A number of pre-nested code rework
- Direct physical timer assignment on VHE systems
- kvm_call_hyp type safety enforcement
- Set/Way cache sanitisation for 32bit guests
- Build system cleanups
- A bunch of janitorial fixes
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAlxwH3EVHG1hcmMuenlu
Z2llckBhcm0uY29tAAoJECPQ0LrRPXpD9j4P/i1lIUKfg/bxGw46wwoqF1MjOEPD
uDQ6irms65FfqFkUkIPaU1du51UcI9nwncUPeJh+E3g2wp2f5EXsAp7tksARfIWU
YCLLez5AiuYH6Otrs7YxLm8L/Sqqc3DacGWuyOamXmdWM9wZlv7F295Yfo5nX8zk
IhksfBQH4/KvOPxkzbY6yy1StKOreuXQuboecrcfUP0lxwaUcbqxHMuynP7DneCv
EHNo5TUjK975xH4jS/K61Ji9FmTlA/PgGqgn+EOw5KXGnKlphFBaTrzuE7vPLveR
XPV1VeNEuEitH/+qVhZr8k2Du+3kKqQA8Ikxv6SasYAnqyVFTPPPMtUEgZXTLpHa
6D4kIc+5jxgxF6Dyk3PKnjoNHPolCApj/uPCcTiD8dyY4smpJQ3+gxGiJkX68e92
EkJlBj0Hn4xgudHi9UWLP+eZHT+v3L8mvVLP9N9oqapwc0x4g9YqJVbJMyAnT5Pw
pLPSKTx9ApmyAEkdzRHjB89gG5cwjUzmx5BF7gASYSmTS9el9r5Kaaxx8zCmMt1R
gM1TF7rBrgyW5s+bsIBf2rqk+5WAxag+FmeCQwghuNc+uhNfboRgoJzx7qFTpxeX
KFS86QmQPRMWGR0klXgW1+hNOD8ACnqCOPGB+3d41ql3bgQ0otLItvj4RoA44JYG
0Guq7o9EZNUYqDzA
=iEs0
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-for-v5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-next
KVM/arm updates for Linux v5.1
- A number of pre-nested code rework
- Direct physical timer assignment on VHE systems
- kvm_call_hyp type safety enforcement
- Set/Way cache sanitisation for 32bit guests
- Build system cleanups
- A bunch of janitorial fixes
This patch contains two minor cleanups: firstly it puts exported symbol
for kvm_io_bus_write() by following the function definition; secondly it
removes a redundant blank line.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The 'timer' local variable became unused after commit bee038a674
("KVM: arm/arm64: Rework the timer code to use a timer_map").
Remove it to avoid [-Wunused-but-set-variable] warning.
Cc: Christoffer Dall <christoffer.dall@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Suzuki K Pouloze <suzuki.poulose@arm.com>
Reviewed-by: Julien Thierry <julien.thierry@arm.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
The value of "dirty_bitmap[i]" is already check before setting its value
to mask. The following check of "mask" is redundant. The check of "mask" was
introduced by commit 58d2930f4e ("KVM: Eliminate extra function calls in
kvm_get_dirty_log_protect()"), revert it.
Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
grow_halt_poll_ns() have a strange behaviour in case
(vcpu->halt_poll_ns != 0) &&
(vcpu->halt_poll_ns < halt_poll_ns_grow_start).
In this case, vcpu->halt_poll_ns will be multiplied by grow factor
(halt_poll_ns_grow) which will require several grow iteration in order
to reach a value bigger than halt_poll_ns_grow_start.
This means that growing vcpu->halt_poll_ns from value of 0 is slower
than growing it from a positive value less than halt_poll_ns_grow_start.
Which is misleading and inaccurate.
Fix issue by changing grow_halt_poll_ns() to set vcpu->halt_poll_ns
to halt_poll_ns_grow_start in any case that
(vcpu->halt_poll_ns < halt_poll_ns_grow_start).
Regardless if vcpu->halt_poll_ns is 0.
use READ_ONCE to get a consistent number for all cases.
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Nir Weiner <nir.weiner@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The hard-coded value 10000 in grow_halt_poll_ns() stands for the initial
start value when raising up vcpu->halt_poll_ns.
It actually sets the first timeout to the first polling session.
This value has significant effect on how tolerant we are to outliers.
On the standard case, higher value is better - we will spend more time
in the polling busyloop, handle events/interrupts faster and result
in better performance.
But on outliers it puts us in a busy loop that does nothing.
Even if the shrink factor is zero, we will still waste time on the first
iteration.
The optimal value changes between different workloads. It depends on
outliers rate and polling sessions length.
As this value has significant effect on the dynamic halt-polling
algorithm, it should be configurable and exposed.
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Nir Weiner <nir.weiner@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
grow_halt_poll_ns() have a strange behavior in case
(halt_poll_ns_grow == 0) && (vcpu->halt_poll_ns != 0).
In this case, vcpu->halt_pol_ns will be set to zero.
That results in shrinking instead of growing.
Fix issue by changing grow_halt_poll_ns() to not modify
vcpu->halt_poll_ns in case halt_poll_ns_grow is zero
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Nir Weiner <nir.weiner@oracle.com>
Suggested-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...now that KVM won't explode by moving it out of bit 0. Using bit 63
eliminates the need to jump over bit 0, e.g. when calculating a new
memslots generation or when propagating the memslots generation to an
MMIO spte.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
x86 captures a subset of the memslot generation (19 bits) in its MMIO
sptes so that it can expedite emulated MMIO handling by checking only
the releveant spte, i.e. doesn't need to do a full page fault walk.
Because the MMIO sptes capture only 19 bits (due to limited space in
the sptes), there is a non-zero probability that the MMIO generation
could wrap, e.g. after 500k memslot updates. Since normal usage is
extremely unlikely to result in 500k memslot updates, a hack was added
by commit 69c9ea93ea ("KVM: MMU: init kvm generation close to mmio
wrap-around value") to offset the MMIO generation in order to trigger
a wraparound, e.g. after 150 memslot updates.
When separate memslot generation sequences were assigned to each
address space, commit 00f034a12f ("KVM: do not bias the generation
number in kvm_current_mmio_generation") moved the offset logic into the
initialization of the memslot generation itself so that the per-address
space bit(s) were not dropped/corrupted by the MMIO shenanigans.
Remove the offset hack for three reasons:
- While it does exercise x86's kvm_mmu_invalidate_mmio_sptes(), simply
wrapping the generation doesn't actually test the interesting case
of having stale MMIO sptes with the new generation number, e.g. old
sptes with a generation number of 0.
- Triggering kvm_mmu_invalidate_mmio_sptes() prematurely makes its
performance rather important since the probability of invalidating
MMIO sptes jumps from "effectively never" to "fairly likely". This
limits what can be done in future patches, e.g. to simplify the
invalidation code, as doing so without proper caution could lead to
a noticeable performance regression.
- Forcing the memslots generation, which is a 64-bit number, to wrap
prevents KVM from assuming the memslots generation will never wrap.
This in turn prevents KVM from using an arbitrary bit for the
"update in-progress" flag, e.g. using bit 63 would immediately
collide with using a large value as the starting generation number.
The "update in-progress" flag is effectively forced into bit 0 so
that it's (subtly) taken into account when incrementing the
generation.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM uses bit 0 of the memslots generation as an "update in-progress"
flag, which is used by x86 to prevent caching MMIO access while the
memslots are changing. Although the intended behavior is flag-like,
e.g. MMIO sptes intentionally drop the in-progress bit so as to avoid
caching data from in-flux memslots, the implementation oftentimes treats
the bit as part of the generation number itself, e.g. incrementing the
generation increments twice, once to set the flag and once to clear it.
Prior to commit 4bd518f159 ("KVM: use separate generations for
each address space"), incorporating the "update in-progress" bit into
the generation number largely made sense, e.g. "real" generations are
even, "bogus" generations are odd, most code doesn't need to be aware of
the bit, etc...
Now that unique memslots generation numbers are assigned to each address
space, stealthing the in-progress status into the generation number
results in a wide variety of subtle code, e.g. kvm_create_vm() jumps
over bit 0 when initializing the memslots generation without any hint as
to why.
Explicitly define the flag and convert as much code as possible (which
isn't much) to actually treat it like a flag. This paves the way for
eventually using a different bit for "update in-progress" so that it can
be a flag in truth instead of a awkward extension to the generation
number.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_arch_memslots_updated() is at this point in time an x86-specific
hook for handling MMIO generation wraparound. x86 stashes 19 bits of
the memslots generation number in its MMIO sptes in order to avoid
full page fault walks for repeat faults on emulated MMIO addresses.
Because only 19 bits are used, wrapping the MMIO generation number is
possible, if unlikely. kvm_arch_memslots_updated() alerts x86 that
the generation has changed so that it can invalidate all MMIO sptes in
case the effective MMIO generation has wrapped so as to avoid using a
stale spte, e.g. a (very) old spte that was created with generation==0.
Given that the purpose of kvm_arch_memslots_updated() is to prevent
consuming stale entries, it needs to be called before the new generation
is propagated to memslots. Invalidating the MMIO sptes after updating
memslots means that there is a window where a vCPU could dereference
the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
spte that was created with (pre-wrap) generation==0.
Fixes: e59dbe09f8 ("KVM: Introduce kvm_arch_memslots_updated()")
Cc: <stable@vger.kernel.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are many KVM kernel memory allocations which are tied to the life of
the VM process and should be charged to the VM process's cgroup. If the
allocations aren't tied to the process, the OOM killer will not know
that killing the process will free the associated kernel memory.
Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
charged to the VM process's cgroup.
Tested:
Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
introduced no new failures.
Ran a kernel memory accounting test which creates a VM to touch
memory and then checks that the kernel memory allocated for the
process is within certain bounds.
With this patch we account for much more of the vmalloc and slab memory
allocated for the VM.
There remain a few allocations which should be charged to the VM's
cgroup but are not. In they include:
vcpu->run
kvm->coalesced_mmio_ring
There allocations are unaccounted in this patch because they are mapped
to userspace, and accounting them to a cgroup causes problems. This
should be addressed in a future patch.
Signed-off-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct foo {
int stuff;
void *entry[];
};
instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:
instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>