Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull spectre/meltdown updates from Thomas Gleixner:
 "The next round of updates related to melted spectrum:

   - The initial set of spectre V1 mitigations:

       - Array index speculation blocker and its usage for syscall,
         fdtable and the n180211 driver.

       - Speculation barrier and its usage in user access functions

   - Make indirect calls in KVM speculation safe

   - Blacklisting of known to be broken microcodes so IPBP/IBSR are not
     touched.

   - The initial IBPB support and its usage in context switch

   - The exposure of the new speculation MSRs to KVM guests.

   - A fix for a regression in x86/32 related to the cpu entry area

   - Proper whitelisting for known to be safe CPUs from the mitigations.

   - objtool fixes to deal proper with retpolines and alternatives

   - Exclude __init functions from retpolines which speeds up the boot
     process.

   - Removal of the syscall64 fast path and related cleanups and
     simplifications

   - Removal of the unpatched paravirt mode which is yet another source
     of indirect unproteced calls.

   - A new and undisputed version of the module mismatch warning

   - A couple of cleanup and correctness fixes all over the place

  Yet another step towards full mitigation. There are a few things still
  missing like the RBS underflow mitigation for Skylake and other small
  details, but that's being worked on.

  That said, I'm taking a belated christmas vacation for a week and hope
  that everything is magically solved when I'm back on Feb 12th"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
  KVM/SVM: Allow direct access to MSR_IA32_SPEC_CTRL
  KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIES
  KVM/x86: Add IBPB support
  KVM/x86: Update the reverse_cpuid list to include CPUID_7_EDX
  x86/speculation: Fix typo IBRS_ATT, which should be IBRS_ALL
  x86/pti: Mark constant arrays as __initconst
  x86/spectre: Simplify spectre_v2 command line parsing
  x86/retpoline: Avoid retpolines for built-in __init functions
  x86/kvm: Update spectre-v1 mitigation
  KVM: VMX: make MSR bitmaps per-VCPU
  x86/paravirt: Remove 'noreplace-paravirt' cmdline option
  x86/speculation: Use Indirect Branch Prediction Barrier in context switch
  x86/cpuid: Fix up "virtual" IBRS/IBPB/STIBP feature bits on Intel
  x86/spectre: Fix spelling mistake: "vunerable"-> "vulnerable"
  x86/spectre: Report get_user mitigation for spectre_v1
  nl80211: Sanitize array index in parse_txq_params
  vfs, fdtable: Prevent bounds-check bypass via speculative execution
  x86/syscall: Sanitize syscall table de-references under speculation
  x86/get_user: Use pointer masking to limit speculation
  ...
This commit is contained in:
Linus Torvalds 2018-02-04 11:45:55 -08:00
commit 35277995e1
38 changed files with 975 additions and 552 deletions

View File

@ -2758,8 +2758,6 @@
norandmaps Don't use address space randomization. Equivalent to
echo 0 > /proc/sys/kernel/randomize_va_space
noreplace-paravirt [X86,IA-64,PV_OPS] Don't patch paravirt_ops
noreplace-smp [X86-32,SMP] Don't replace SMP instructions
with UP alternatives

View File

@ -0,0 +1,90 @@
This document explains potential effects of speculation, and how undesirable
effects can be mitigated portably using common APIs.
===========
Speculation
===========
To improve performance and minimize average latencies, many contemporary CPUs
employ speculative execution techniques such as branch prediction, performing
work which may be discarded at a later stage.
Typically speculative execution cannot be observed from architectural state,
such as the contents of registers. However, in some cases it is possible to
observe its impact on microarchitectural state, such as the presence or
absence of data in caches. Such state may form side-channels which can be
observed to extract secret information.
For example, in the presence of branch prediction, it is possible for bounds
checks to be ignored by code which is speculatively executed. Consider the
following code:
int load_array(int *array, unsigned int index)
{
if (index >= MAX_ARRAY_ELEMS)
return 0;
else
return array[index];
}
Which, on arm64, may be compiled to an assembly sequence such as:
CMP <index>, #MAX_ARRAY_ELEMS
B.LT less
MOV <returnval>, #0
RET
less:
LDR <returnval>, [<array>, <index>]
RET
It is possible that a CPU mis-predicts the conditional branch, and
speculatively loads array[index], even if index >= MAX_ARRAY_ELEMS. This
value will subsequently be discarded, but the speculated load may affect
microarchitectural state which can be subsequently measured.
More complex sequences involving multiple dependent memory accesses may
result in sensitive information being leaked. Consider the following
code, building on the prior example:
int load_dependent_arrays(int *arr1, int *arr2, int index)
{
int val1, val2,
val1 = load_array(arr1, index);
val2 = load_array(arr2, val1);
return val2;
}
Under speculation, the first call to load_array() may return the value
of an out-of-bounds address, while the second call will influence
microarchitectural state dependent on this value. This may provide an
arbitrary read primitive.
====================================
Mitigating speculation side-channels
====================================
The kernel provides a generic API to ensure that bounds checks are
respected even under speculation. Architectures which are affected by
speculation-based side-channels are expected to implement these
primitives.
The array_index_nospec() helper in <linux/nospec.h> can be used to
prevent information from being leaked via side-channels.
A call to array_index_nospec(index, size) returns a sanitized index
value that is bounded to [0, size) even under cpu speculation
conditions.
This can be used to protect the earlier load_array() example:
int load_array(int *array, unsigned int index)
{
if (index >= MAX_ARRAY_ELEMS)
return 0;
else {
index = array_index_nospec(index, MAX_ARRAY_ELEMS);
return array[index];
}
}

View File

@ -21,6 +21,7 @@
#include <linux/export.h>
#include <linux/context_tracking.h>
#include <linux/user-return-notifier.h>
#include <linux/nospec.h>
#include <linux/uprobes.h>
#include <linux/livepatch.h>
#include <linux/syscalls.h>
@ -206,7 +207,7 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
* special case only applies after poking regs and before the
* very next return to user mode.
*/
current->thread.status &= ~(TS_COMPAT|TS_I386_REGS_POKED);
ti->status &= ~(TS_COMPAT|TS_I386_REGS_POKED);
#endif
user_enter_irqoff();
@ -282,7 +283,8 @@ __visible void do_syscall_64(struct pt_regs *regs)
* regs->orig_ax, which changes the behavior of some syscalls.
*/
if (likely((nr & __SYSCALL_MASK) < NR_syscalls)) {
regs->ax = sys_call_table[nr & __SYSCALL_MASK](
nr = array_index_nospec(nr & __SYSCALL_MASK, NR_syscalls);
regs->ax = sys_call_table[nr](
regs->di, regs->si, regs->dx,
regs->r10, regs->r8, regs->r9);
}
@ -304,7 +306,7 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
unsigned int nr = (unsigned int)regs->orig_ax;
#ifdef CONFIG_IA32_EMULATION
current->thread.status |= TS_COMPAT;
ti->status |= TS_COMPAT;
#endif
if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY) {
@ -318,6 +320,7 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
}
if (likely(nr < IA32_NR_syscalls)) {
nr = array_index_nospec(nr, IA32_NR_syscalls);
/*
* It's possible that a 32-bit syscall implementation
* takes a 64-bit parameter but nonetheless assumes that

View File

@ -236,91 +236,20 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
pushq %r9 /* pt_regs->r9 */
pushq %r10 /* pt_regs->r10 */
pushq %r11 /* pt_regs->r11 */
sub $(6*8), %rsp /* pt_regs->bp, bx, r12-15 not saved */
UNWIND_HINT_REGS extra=0
pushq %rbx /* pt_regs->rbx */
pushq %rbp /* pt_regs->rbp */
pushq %r12 /* pt_regs->r12 */
pushq %r13 /* pt_regs->r13 */
pushq %r14 /* pt_regs->r14 */
pushq %r15 /* pt_regs->r15 */
UNWIND_HINT_REGS
TRACE_IRQS_OFF
/*
* If we need to do entry work or if we guess we'll need to do
* exit work, go straight to the slow path.
*/
movq PER_CPU_VAR(current_task), %r11
testl $_TIF_WORK_SYSCALL_ENTRY|_TIF_ALLWORK_MASK, TASK_TI_flags(%r11)
jnz entry_SYSCALL64_slow_path
entry_SYSCALL_64_fastpath:
/*
* Easy case: enable interrupts and issue the syscall. If the syscall
* needs pt_regs, we'll call a stub that disables interrupts again
* and jumps to the slow path.
*/
TRACE_IRQS_ON
ENABLE_INTERRUPTS(CLBR_NONE)
#if __SYSCALL_MASK == ~0
cmpq $__NR_syscall_max, %rax
#else
andl $__SYSCALL_MASK, %eax
cmpl $__NR_syscall_max, %eax
#endif
ja 1f /* return -ENOSYS (already in pt_regs->ax) */
movq %r10, %rcx
/*
* This call instruction is handled specially in stub_ptregs_64.
* It might end up jumping to the slow path. If it jumps, RAX
* and all argument registers are clobbered.
*/
#ifdef CONFIG_RETPOLINE
movq sys_call_table(, %rax, 8), %rax
call __x86_indirect_thunk_rax
#else
call *sys_call_table(, %rax, 8)
#endif
.Lentry_SYSCALL_64_after_fastpath_call:
movq %rax, RAX(%rsp)
1:
/*
* If we get here, then we know that pt_regs is clean for SYSRET64.
* If we see that no exit work is required (which we are required
* to check with IRQs off), then we can go straight to SYSRET64.
*/
DISABLE_INTERRUPTS(CLBR_ANY)
TRACE_IRQS_OFF
movq PER_CPU_VAR(current_task), %r11
testl $_TIF_ALLWORK_MASK, TASK_TI_flags(%r11)
jnz 1f
LOCKDEP_SYS_EXIT
TRACE_IRQS_ON /* user mode is traced as IRQs on */
movq RIP(%rsp), %rcx
movq EFLAGS(%rsp), %r11
addq $6*8, %rsp /* skip extra regs -- they were preserved */
UNWIND_HINT_EMPTY
jmp .Lpop_c_regs_except_rcx_r11_and_sysret
1:
/*
* The fast path looked good when we started, but something changed
* along the way and we need to switch to the slow path. Calling
* raise(3) will trigger this, for example. IRQs are off.
*/
TRACE_IRQS_ON
ENABLE_INTERRUPTS(CLBR_ANY)
SAVE_EXTRA_REGS
movq %rsp, %rdi
call syscall_return_slowpath /* returns with IRQs disabled */
jmp return_from_SYSCALL_64
entry_SYSCALL64_slow_path:
/* IRQs are off. */
SAVE_EXTRA_REGS
movq %rsp, %rdi
call do_syscall_64 /* returns with IRQs disabled */
return_from_SYSCALL_64:
TRACE_IRQS_IRETQ /* we're about to change IF */
/*
@ -393,7 +322,6 @@ syscall_return_via_sysret:
/* rcx and r11 are already restored (see code above) */
UNWIND_HINT_EMPTY
POP_EXTRA_REGS
.Lpop_c_regs_except_rcx_r11_and_sysret:
popq %rsi /* skip r11 */
popq %r10
popq %r9
@ -424,47 +352,6 @@ syscall_return_via_sysret:
USERGS_SYSRET64
END(entry_SYSCALL_64)
ENTRY(stub_ptregs_64)
/*
* Syscalls marked as needing ptregs land here.
* If we are on the fast path, we need to save the extra regs,
* which we achieve by trying again on the slow path. If we are on
* the slow path, the extra regs are already saved.
*
* RAX stores a pointer to the C function implementing the syscall.
* IRQs are on.
*/
cmpq $.Lentry_SYSCALL_64_after_fastpath_call, (%rsp)
jne 1f
/*
* Called from fast path -- disable IRQs again, pop return address
* and jump to slow path
*/
DISABLE_INTERRUPTS(CLBR_ANY)
TRACE_IRQS_OFF
popq %rax
UNWIND_HINT_REGS extra=0
jmp entry_SYSCALL64_slow_path
1:
JMP_NOSPEC %rax /* Called from C */
END(stub_ptregs_64)
.macro ptregs_stub func
ENTRY(ptregs_\func)
UNWIND_HINT_FUNC
leaq \func(%rip), %rax
jmp stub_ptregs_64
END(ptregs_\func)
.endm
/* Instantiate ptregs_stub for each ptregs-using syscall */
#define __SYSCALL_64_QUAL_(sym)
#define __SYSCALL_64_QUAL_ptregs(sym) ptregs_stub sym
#define __SYSCALL_64(nr, sym, qual) __SYSCALL_64_QUAL_##qual(sym)
#include <asm/syscalls_64.h>
/*
* %rdi: prev task
* %rsi: next task

View File

@ -7,14 +7,11 @@
#include <asm/asm-offsets.h>
#include <asm/syscall.h>
#define __SYSCALL_64_QUAL_(sym) sym
#define __SYSCALL_64_QUAL_ptregs(sym) ptregs_##sym
#define __SYSCALL_64(nr, sym, qual) extern asmlinkage long __SYSCALL_64_QUAL_##qual(sym)(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long);
#define __SYSCALL_64(nr, sym, qual) extern asmlinkage long sym(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long);
#include <asm/syscalls_64.h>
#undef __SYSCALL_64
#define __SYSCALL_64(nr, sym, qual) [nr] = __SYSCALL_64_QUAL_##qual(sym),
#define __SYSCALL_64(nr, sym, qual) [nr] = sym,
extern long sys_ni_syscall(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long);

View File

@ -24,6 +24,34 @@
#define wmb() asm volatile("sfence" ::: "memory")
#endif
/**
* array_index_mask_nospec() - generate a mask that is ~0UL when the
* bounds check succeeds and 0 otherwise
* @index: array element index
* @size: number of elements in array
*
* Returns:
* 0 - (index < size)
*/
static inline unsigned long array_index_mask_nospec(unsigned long index,
unsigned long size)
{
unsigned long mask;
asm ("cmp %1,%2; sbb %0,%0;"
:"=r" (mask)
:"r"(size),"r" (index)
:"cc");
return mask;
}
/* Override the default implementation from linux/nospec.h. */
#define array_index_mask_nospec array_index_mask_nospec
/* Prevent speculative execution past this barrier. */
#define barrier_nospec() alternative_2("", "mfence", X86_FEATURE_MFENCE_RDTSC, \
"lfence", X86_FEATURE_LFENCE_RDTSC)
#ifdef CONFIG_X86_PPRO_FENCE
#define dma_rmb() rmb()
#else

View File

@ -137,8 +137,10 @@ enum fixed_addresses {
extern void reserve_top_address(unsigned long reserve);
#define FIXADDR_SIZE (__end_of_permanent_fixed_addresses << PAGE_SHIFT)
#define FIXADDR_START (FIXADDR_TOP - FIXADDR_SIZE)
#define FIXADDR_SIZE (__end_of_permanent_fixed_addresses << PAGE_SHIFT)
#define FIXADDR_START (FIXADDR_TOP - FIXADDR_SIZE)
#define FIXADDR_TOT_SIZE (__end_of_fixed_addresses << PAGE_SHIFT)
#define FIXADDR_TOT_START (FIXADDR_TOP - FIXADDR_TOT_SIZE)
extern int fixmaps_set;

View File

@ -214,8 +214,7 @@ static __always_inline unsigned long long rdtsc_ordered(void)
* that some other imaginary CPU is updating continuously with a
* time stamp.
*/
alternative_2("", "mfence", X86_FEATURE_MFENCE_RDTSC,
"lfence", X86_FEATURE_LFENCE_RDTSC);
barrier_nospec();
return rdtsc();
}

View File

@ -150,7 +150,7 @@ extern char __indirect_thunk_end[];
* On VMEXIT we must ensure that no RSB predictions learned in the guest
* can be followed in the host, by overwriting the RSB completely. Both
* retpoline and IBRS mitigations for Spectre v2 need this; only on future
* CPUs with IBRS_ATT *might* it be avoided.
* CPUs with IBRS_ALL *might* it be avoided.
*/
static inline void vmexit_fill_RSB(void)
{

View File

@ -44,8 +44,9 @@ extern bool __vmalloc_start_set; /* set once high_memory is set */
*/
#define CPU_ENTRY_AREA_PAGES (NR_CPUS * 40)
#define CPU_ENTRY_AREA_BASE \
((FIXADDR_START - PAGE_SIZE * (CPU_ENTRY_AREA_PAGES + 1)) & PMD_MASK)
#define CPU_ENTRY_AREA_BASE \
((FIXADDR_TOT_START - PAGE_SIZE * (CPU_ENTRY_AREA_PAGES + 1)) \
& PMD_MASK)
#define PKMAP_BASE \
((CPU_ENTRY_AREA_BASE - PAGE_SIZE) & PMD_MASK)

View File

@ -460,8 +460,6 @@ struct thread_struct {
unsigned short gsindex;
#endif
u32 status; /* thread synchronous flags */
#ifdef CONFIG_X86_64
unsigned long fsbase;
unsigned long gsbase;

View File

@ -60,7 +60,7 @@ static inline long syscall_get_error(struct task_struct *task,
* TS_COMPAT is set for 32-bit syscall entries and then
* remains set until we return to user mode.
*/
if (task->thread.status & (TS_COMPAT|TS_I386_REGS_POKED))
if (task->thread_info.status & (TS_COMPAT|TS_I386_REGS_POKED))
/*
* Sign-extend the value so (int)-EFOO becomes (long)-EFOO
* and will match correctly in comparisons.
@ -116,7 +116,7 @@ static inline void syscall_get_arguments(struct task_struct *task,
unsigned long *args)
{
# ifdef CONFIG_IA32_EMULATION
if (task->thread.status & TS_COMPAT)
if (task->thread_info.status & TS_COMPAT)
switch (i) {
case 0:
if (!n--) break;
@ -177,7 +177,7 @@ static inline void syscall_set_arguments(struct task_struct *task,
const unsigned long *args)
{
# ifdef CONFIG_IA32_EMULATION
if (task->thread.status & TS_COMPAT)
if (task->thread_info.status & TS_COMPAT)
switch (i) {
case 0:
if (!n--) break;

View File

@ -55,6 +55,7 @@ struct task_struct;
struct thread_info {
unsigned long flags; /* low level flags */
u32 status; /* thread synchronous flags */
};
#define INIT_THREAD_INFO(tsk) \
@ -219,7 +220,7 @@ static inline int arch_within_stack_frames(const void * const stack,
#define in_ia32_syscall() true
#else
#define in_ia32_syscall() (IS_ENABLED(CONFIG_IA32_EMULATION) && \
current->thread.status & TS_COMPAT)
current_thread_info()->status & TS_COMPAT)
#endif
/*

View File

@ -174,6 +174,8 @@ struct tlb_state {
struct mm_struct *loaded_mm;
u16 loaded_mm_asid;
u16 next_asid;
/* last user mm's ctx id */
u64 last_ctx_id;
/*
* We can be in one of several states:

View File

@ -124,6 +124,11 @@ extern int __get_user_bad(void);
#define __uaccess_begin() stac()
#define __uaccess_end() clac()
#define __uaccess_begin_nospec() \
({ \
stac(); \
barrier_nospec(); \
})
/*
* This is a type: either unsigned long, if the argument fits into
@ -445,7 +450,7 @@ do { \
({ \
int __gu_err; \
__inttype(*(ptr)) __gu_val; \
__uaccess_begin(); \
__uaccess_begin_nospec(); \
__get_user_size(__gu_val, (ptr), (size), __gu_err, -EFAULT); \
__uaccess_end(); \
(x) = (__force __typeof__(*(ptr)))__gu_val; \
@ -487,6 +492,10 @@ struct __large_struct { unsigned long buf[100]; };
__uaccess_begin(); \
barrier();
#define uaccess_try_nospec do { \
current->thread.uaccess_err = 0; \
__uaccess_begin_nospec(); \
#define uaccess_catch(err) \
__uaccess_end(); \
(err) |= (current->thread.uaccess_err ? -EFAULT : 0); \
@ -548,7 +557,7 @@ struct __large_struct { unsigned long buf[100]; };
* get_user_ex(...);
* } get_user_catch(err)
*/
#define get_user_try uaccess_try
#define get_user_try uaccess_try_nospec
#define get_user_catch(err) uaccess_catch(err)
#define get_user_ex(x, ptr) do { \
@ -582,7 +591,7 @@ extern void __cmpxchg_wrong_size(void)
__typeof__(ptr) __uval = (uval); \
__typeof__(*(ptr)) __old = (old); \
__typeof__(*(ptr)) __new = (new); \
__uaccess_begin(); \
__uaccess_begin_nospec(); \
switch (size) { \
case 1: \
{ \

View File

@ -29,21 +29,21 @@ raw_copy_from_user(void *to, const void __user *from, unsigned long n)
switch (n) {
case 1:
ret = 0;
__uaccess_begin();
__uaccess_begin_nospec();
__get_user_asm_nozero(*(u8 *)to, from, ret,
"b", "b", "=q", 1);
__uaccess_end();
return ret;
case 2:
ret = 0;
__uaccess_begin();
__uaccess_begin_nospec();
__get_user_asm_nozero(*(u16 *)to, from, ret,
"w", "w", "=r", 2);
__uaccess_end();
return ret;
case 4:
ret = 0;
__uaccess_begin();
__uaccess_begin_nospec();
__get_user_asm_nozero(*(u32 *)to, from, ret,
"l", "k", "=r", 4);
__uaccess_end();

View File

@ -55,31 +55,31 @@ raw_copy_from_user(void *dst, const void __user *src, unsigned long size)
return copy_user_generic(dst, (__force void *)src, size);
switch (size) {
case 1:
__uaccess_begin();
__uaccess_begin_nospec();
__get_user_asm_nozero(*(u8 *)dst, (u8 __user *)src,
ret, "b", "b", "=q", 1);
__uaccess_end();
return ret;
case 2:
__uaccess_begin();
__uaccess_begin_nospec();
__get_user_asm_nozero(*(u16 *)dst, (u16 __user *)src,
ret, "w", "w", "=r", 2);
__uaccess_end();
return ret;
case 4:
__uaccess_begin();
__uaccess_begin_nospec();
__get_user_asm_nozero(*(u32 *)dst, (u32 __user *)src,
ret, "l", "k", "=r", 4);
__uaccess_end();
return ret;
case 8:
__uaccess_begin();
__uaccess_begin_nospec();
__get_user_asm_nozero(*(u64 *)dst, (u64 __user *)src,
ret, "q", "", "=r", 8);
__uaccess_end();
return ret;
case 10:
__uaccess_begin();
__uaccess_begin_nospec();
__get_user_asm_nozero(*(u64 *)dst, (u64 __user *)src,
ret, "q", "", "=r", 10);
if (likely(!ret))
@ -89,7 +89,7 @@ raw_copy_from_user(void *dst, const void __user *src, unsigned long size)
__uaccess_end();
return ret;
case 16:
__uaccess_begin();
__uaccess_begin_nospec();
__get_user_asm_nozero(*(u64 *)dst, (u64 __user *)src,
ret, "q", "", "=r", 16);
if (likely(!ret))

View File

@ -46,17 +46,6 @@ static int __init setup_noreplace_smp(char *str)
}
__setup("noreplace-smp", setup_noreplace_smp);
#ifdef CONFIG_PARAVIRT
static int __initdata_or_module noreplace_paravirt = 0;
static int __init setup_noreplace_paravirt(char *str)
{
noreplace_paravirt = 1;
return 1;
}
__setup("noreplace-paravirt", setup_noreplace_paravirt);
#endif
#define DPRINTK(fmt, args...) \
do { \
if (debug_alternative) \
@ -599,9 +588,6 @@ void __init_or_module apply_paravirt(struct paravirt_patch_site *start,
struct paravirt_patch_site *p;
char insnbuf[MAX_PATCH_LEN];
if (noreplace_paravirt)
return;
for (p = start; p < end; p++) {
unsigned int used;

View File

@ -103,7 +103,7 @@ bool retpoline_module_ok(bool has_retpoline)
if (spectre_v2_enabled == SPECTRE_V2_NONE || has_retpoline)
return true;
pr_err("System may be vunerable to spectre v2\n");
pr_err("System may be vulnerable to spectre v2\n");
spectre_v2_bad_module = true;
return false;
}
@ -119,13 +119,13 @@ static inline const char *spectre_v2_module_string(void) { return ""; }
static void __init spec2_print_if_insecure(const char *reason)
{
if (boot_cpu_has_bug(X86_BUG_SPECTRE_V2))
pr_info("%s\n", reason);
pr_info("%s selected on command line.\n", reason);
}
static void __init spec2_print_if_secure(const char *reason)
{
if (!boot_cpu_has_bug(X86_BUG_SPECTRE_V2))
pr_info("%s\n", reason);
pr_info("%s selected on command line.\n", reason);
}
static inline bool retp_compiler(void)
@ -140,42 +140,68 @@ static inline bool match_option(const char *arg, int arglen, const char *opt)
return len == arglen && !strncmp(arg, opt, len);
}
static const struct {
const char *option;
enum spectre_v2_mitigation_cmd cmd;
bool secure;
} mitigation_options[] = {
{ "off", SPECTRE_V2_CMD_NONE, false },
{ "on", SPECTRE_V2_CMD_FORCE, true },
{ "retpoline", SPECTRE_V2_CMD_RETPOLINE, false },
{ "retpoline,amd", SPECTRE_V2_CMD_RETPOLINE_AMD, false },
{ "retpoline,generic", SPECTRE_V2_CMD_RETPOLINE_GENERIC, false },
{ "auto", SPECTRE_V2_CMD_AUTO, false },
};
static enum spectre_v2_mitigation_cmd __init spectre_v2_parse_cmdline(void)
{
char arg[20];
int ret;
int ret, i;
enum spectre_v2_mitigation_cmd cmd = SPECTRE_V2_CMD_AUTO;
ret = cmdline_find_option(boot_command_line, "spectre_v2", arg,
sizeof(arg));
if (ret > 0) {
if (match_option(arg, ret, "off")) {
goto disable;
} else if (match_option(arg, ret, "on")) {
spec2_print_if_secure("force enabled on command line.");
return SPECTRE_V2_CMD_FORCE;
} else if (match_option(arg, ret, "retpoline")) {
spec2_print_if_insecure("retpoline selected on command line.");
return SPECTRE_V2_CMD_RETPOLINE;
} else if (match_option(arg, ret, "retpoline,amd")) {
if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD) {
pr_err("retpoline,amd selected but CPU is not AMD. Switching to AUTO select\n");
return SPECTRE_V2_CMD_AUTO;
}
spec2_print_if_insecure("AMD retpoline selected on command line.");
return SPECTRE_V2_CMD_RETPOLINE_AMD;
} else if (match_option(arg, ret, "retpoline,generic")) {
spec2_print_if_insecure("generic retpoline selected on command line.");
return SPECTRE_V2_CMD_RETPOLINE_GENERIC;
} else if (match_option(arg, ret, "auto")) {
if (cmdline_find_option_bool(boot_command_line, "nospectre_v2"))
return SPECTRE_V2_CMD_NONE;
else {
ret = cmdline_find_option(boot_command_line, "spectre_v2", arg,
sizeof(arg));
if (ret < 0)
return SPECTRE_V2_CMD_AUTO;
for (i = 0; i < ARRAY_SIZE(mitigation_options); i++) {
if (!match_option(arg, ret, mitigation_options[i].option))
continue;
cmd = mitigation_options[i].cmd;
break;
}
if (i >= ARRAY_SIZE(mitigation_options)) {
pr_err("unknown option (%s). Switching to AUTO select\n",
mitigation_options[i].option);
return SPECTRE_V2_CMD_AUTO;
}
}
if (!cmdline_find_option_bool(boot_command_line, "nospectre_v2"))
if ((cmd == SPECTRE_V2_CMD_RETPOLINE ||
cmd == SPECTRE_V2_CMD_RETPOLINE_AMD ||
cmd == SPECTRE_V2_CMD_RETPOLINE_GENERIC) &&
!IS_ENABLED(CONFIG_RETPOLINE)) {
pr_err("%s selected but not compiled in. Switching to AUTO select\n",
mitigation_options[i].option);
return SPECTRE_V2_CMD_AUTO;
disable:
spec2_print_if_insecure("disabled on command line.");
return SPECTRE_V2_CMD_NONE;
}
if (cmd == SPECTRE_V2_CMD_RETPOLINE_AMD &&
boot_cpu_data.x86_vendor != X86_VENDOR_AMD) {
pr_err("retpoline,amd selected but CPU is not AMD. Switching to AUTO select\n");
return SPECTRE_V2_CMD_AUTO;
}
if (mitigation_options[i].secure)
spec2_print_if_secure(mitigation_options[i].option);
else
spec2_print_if_insecure(mitigation_options[i].option);
return cmd;
}
/* Check for Skylake-like CPUs (for RSB handling) */
@ -213,10 +239,10 @@ static void __init spectre_v2_select_mitigation(void)
return;
case SPECTRE_V2_CMD_FORCE:
/* FALLTRHU */
case SPECTRE_V2_CMD_AUTO:
goto retpoline_auto;
if (IS_ENABLED(CONFIG_RETPOLINE))
goto retpoline_auto;
break;
case SPECTRE_V2_CMD_RETPOLINE_AMD:
if (IS_ENABLED(CONFIG_RETPOLINE))
goto retpoline_amd;
@ -297,7 +323,7 @@ ssize_t cpu_show_spectre_v1(struct device *dev,
{
if (!boot_cpu_has_bug(X86_BUG_SPECTRE_V1))
return sprintf(buf, "Not affected\n");
return sprintf(buf, "Vulnerable\n");
return sprintf(buf, "Mitigation: __user pointer sanitization\n");
}
ssize_t cpu_show_spectre_v2(struct device *dev,

View File

@ -750,6 +750,26 @@ static void apply_forced_caps(struct cpuinfo_x86 *c)
}
}
static void init_speculation_control(struct cpuinfo_x86 *c)
{
/*
* The Intel SPEC_CTRL CPUID bit implies IBRS and IBPB support,
* and they also have a different bit for STIBP support. Also,
* a hypervisor might have set the individual AMD bits even on
* Intel CPUs, for finer-grained selection of what's available.
*
* We use the AMD bits in 0x8000_0008 EBX as the generic hardware
* features, which are visible in /proc/cpuinfo and used by the
* kernel. So set those accordingly from the Intel bits.
*/
if (cpu_has(c, X86_FEATURE_SPEC_CTRL)) {
set_cpu_cap(c, X86_FEATURE_IBRS);
set_cpu_cap(c, X86_FEATURE_IBPB);
}
if (cpu_has(c, X86_FEATURE_INTEL_STIBP))
set_cpu_cap(c, X86_FEATURE_STIBP);
}
void get_cpu_cap(struct cpuinfo_x86 *c)
{
u32 eax, ebx, ecx, edx;
@ -844,6 +864,7 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
c->x86_capability[CPUID_8000_000A_EDX] = cpuid_edx(0x8000000a);
init_scattered_cpuid_features(c);
init_speculation_control(c);
/*
* Clear/Set all flags overridden by options, after probe.
@ -879,7 +900,7 @@ static void identify_cpu_without_cpuid(struct cpuinfo_x86 *c)
#endif
}
static const __initdata struct x86_cpu_id cpu_no_speculation[] = {
static const __initconst struct x86_cpu_id cpu_no_speculation[] = {
{ X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_CEDARVIEW, X86_FEATURE_ANY },
{ X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_CLOVERVIEW, X86_FEATURE_ANY },
{ X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_LINCROFT, X86_FEATURE_ANY },
@ -892,7 +913,7 @@ static const __initdata struct x86_cpu_id cpu_no_speculation[] = {
{}
};
static const __initdata struct x86_cpu_id cpu_no_meltdown[] = {
static const __initconst struct x86_cpu_id cpu_no_meltdown[] = {
{ X86_VENDOR_AMD },
{}
};

View File

@ -175,28 +175,17 @@ static void early_init_intel(struct cpuinfo_x86 *c)
if (c->x86 >= 6 && !cpu_has(c, X86_FEATURE_IA64))
c->microcode = intel_get_microcode_revision();
/*
* The Intel SPEC_CTRL CPUID bit implies IBRS and IBPB support,
* and they also have a different bit for STIBP support. Also,
* a hypervisor might have set the individual AMD bits even on
* Intel CPUs, for finer-grained selection of what's available.
*/
if (cpu_has(c, X86_FEATURE_SPEC_CTRL)) {
set_cpu_cap(c, X86_FEATURE_IBRS);
set_cpu_cap(c, X86_FEATURE_IBPB);
}
if (cpu_has(c, X86_FEATURE_INTEL_STIBP))
set_cpu_cap(c, X86_FEATURE_STIBP);
/* Now if any of them are set, check the blacklist and clear the lot */
if ((cpu_has(c, X86_FEATURE_IBRS) || cpu_has(c, X86_FEATURE_IBPB) ||
if ((cpu_has(c, X86_FEATURE_SPEC_CTRL) ||
cpu_has(c, X86_FEATURE_INTEL_STIBP) ||
cpu_has(c, X86_FEATURE_IBRS) || cpu_has(c, X86_FEATURE_IBPB) ||
cpu_has(c, X86_FEATURE_STIBP)) && bad_spectre_microcode(c)) {
pr_warn("Intel Spectre v2 broken microcode detected; disabling Speculation Control\n");
clear_cpu_cap(c, X86_FEATURE_IBRS);
clear_cpu_cap(c, X86_FEATURE_IBPB);
clear_cpu_cap(c, X86_FEATURE_STIBP);
clear_cpu_cap(c, X86_FEATURE_SPEC_CTRL);
clear_cpu_cap(c, X86_FEATURE_INTEL_STIBP);
setup_clear_cpu_cap(X86_FEATURE_IBRS);
setup_clear_cpu_cap(X86_FEATURE_IBPB);
setup_clear_cpu_cap(X86_FEATURE_STIBP);
setup_clear_cpu_cap(X86_FEATURE_SPEC_CTRL);
setup_clear_cpu_cap(X86_FEATURE_INTEL_STIBP);
}
/*

View File

@ -557,7 +557,7 @@ static void __set_personality_x32(void)
* Pretend to come from a x32 execve.
*/
task_pt_regs(current)->orig_ax = __NR_x32_execve | __X32_SYSCALL_BIT;
current->thread.status &= ~TS_COMPAT;
current_thread_info()->status &= ~TS_COMPAT;
#endif
}
@ -571,7 +571,7 @@ static void __set_personality_ia32(void)
current->personality |= force_personality32;
/* Prepare the first "return" to user space */
task_pt_regs(current)->orig_ax = __NR_ia32_execve;
current->thread.status |= TS_COMPAT;
current_thread_info()->status |= TS_COMPAT;
#endif
}

View File

@ -935,7 +935,7 @@ static int putreg32(struct task_struct *child, unsigned regno, u32 value)
*/
regs->orig_ax = value;
if (syscall_get_nr(child, regs) >= 0)
child->thread.status |= TS_I386_REGS_POKED;
child->thread_info.status |= TS_I386_REGS_POKED;
break;
case offsetof(struct user32, regs.eflags):

View File

@ -787,7 +787,7 @@ static inline unsigned long get_nr_restart_syscall(const struct pt_regs *regs)
* than the tracee.
*/
#ifdef CONFIG_IA32_EMULATION
if (current->thread.status & (TS_COMPAT|TS_I386_REGS_POKED))
if (current_thread_info()->status & (TS_COMPAT|TS_I386_REGS_POKED))
return __NR_ia32_restart_syscall;
#endif
#ifdef CONFIG_X86_X32_ABI

View File

@ -67,9 +67,7 @@ u64 kvm_supported_xcr0(void)
#define F(x) bit(X86_FEATURE_##x)
/* These are scattered features in cpufeatures.h. */
#define KVM_CPUID_BIT_AVX512_4VNNIW 2
#define KVM_CPUID_BIT_AVX512_4FMAPS 3
/* For scattered features from cpufeatures.h; we currently expose none */
#define KF(x) bit(KVM_CPUID_BIT_##x)
int kvm_update_cpuid(struct kvm_vcpu *vcpu)
@ -367,6 +365,10 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
F(3DNOWPREFETCH) | F(OSVW) | 0 /* IBS */ | F(XOP) |
0 /* SKINIT, WDT, LWP */ | F(FMA4) | F(TBM);
/* cpuid 0x80000008.ebx */
const u32 kvm_cpuid_8000_0008_ebx_x86_features =
F(IBPB) | F(IBRS);
/* cpuid 0xC0000001.edx */
const u32 kvm_cpuid_C000_0001_edx_x86_features =
F(XSTORE) | F(XSTORE_EN) | F(XCRYPT) | F(XCRYPT_EN) |
@ -392,7 +394,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
/* cpuid 7.0.edx*/
const u32 kvm_cpuid_7_0_edx_x86_features =
KF(AVX512_4VNNIW) | KF(AVX512_4FMAPS);
F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
F(ARCH_CAPABILITIES);
/* all calls to cpuid_count() should be made on the same cpu */
get_cpu();
@ -477,7 +480,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
if (!tdp_enabled || !boot_cpu_has(X86_FEATURE_OSPKE))
entry->ecx &= ~F(PKU);
entry->edx &= kvm_cpuid_7_0_edx_x86_features;
entry->edx &= get_scattered_cpuid_leaf(7, 0, CPUID_EDX);
cpuid_mask(&entry->edx, CPUID_7_EDX);
} else {
entry->ebx = 0;
entry->ecx = 0;
@ -627,7 +630,14 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
if (!g_phys_as)
g_phys_as = phys_as;
entry->eax = g_phys_as | (virt_as << 8);
entry->ebx = entry->edx = 0;
entry->edx = 0;
/* IBRS and IBPB aren't necessarily present in hardware cpuid */
if (boot_cpu_has(X86_FEATURE_IBPB))
entry->ebx |= F(IBPB);
if (boot_cpu_has(X86_FEATURE_IBRS))
entry->ebx |= F(IBRS);
entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features;
cpuid_mask(&entry->ebx, CPUID_8000_0008_EBX);
break;
}
case 0x80000019:

View File

@ -54,6 +54,7 @@ static const struct cpuid_reg reverse_cpuid[] = {
[CPUID_8000_000A_EDX] = {0x8000000a, 0, CPUID_EDX},
[CPUID_7_ECX] = { 7, 0, CPUID_ECX},
[CPUID_8000_0007_EBX] = {0x80000007, 0, CPUID_EBX},
[CPUID_7_EDX] = { 7, 0, CPUID_EDX},
};
static __always_inline struct cpuid_reg x86_feature_cpuid(unsigned x86_feature)

View File

@ -184,6 +184,8 @@ struct vcpu_svm {
u64 gs_base;
} host;
u64 spec_ctrl;
u32 *msrpm;
ulong nmi_iret_rip;
@ -249,6 +251,8 @@ static const struct svm_direct_access_msrs {
{ .index = MSR_CSTAR, .always = true },
{ .index = MSR_SYSCALL_MASK, .always = true },
#endif
{ .index = MSR_IA32_SPEC_CTRL, .always = false },
{ .index = MSR_IA32_PRED_CMD, .always = false },
{ .index = MSR_IA32_LASTBRANCHFROMIP, .always = false },
{ .index = MSR_IA32_LASTBRANCHTOIP, .always = false },
{ .index = MSR_IA32_LASTINTFROMIP, .always = false },
@ -529,6 +533,7 @@ struct svm_cpu_data {
struct kvm_ldttss_desc *tss_desc;
struct page *save_area;
struct vmcb *current_vmcb;
};
static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data);
@ -880,6 +885,25 @@ static bool valid_msr_intercept(u32 index)
return false;
}
static bool msr_write_intercepted(struct kvm_vcpu *vcpu, unsigned msr)
{
u8 bit_write;
unsigned long tmp;
u32 offset;
u32 *msrpm;
msrpm = is_guest_mode(vcpu) ? to_svm(vcpu)->nested.msrpm:
to_svm(vcpu)->msrpm;
offset = svm_msrpm_offset(msr);
bit_write = 2 * (msr & 0x0f) + 1;
tmp = msrpm[offset];
BUG_ON(offset == MSR_INVALID);
return !!test_bit(bit_write, &tmp);
}
static void set_msr_interception(u32 *msrpm, unsigned msr,
int read, int write)
{
@ -1582,6 +1606,8 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
u32 dummy;
u32 eax = 1;
svm->spec_ctrl = 0;
if (!init_event) {
svm->vcpu.arch.apic_base = APIC_DEFAULT_PHYS_BASE |
MSR_IA32_APICBASE_ENABLE;
@ -1703,11 +1729,17 @@ static void svm_free_vcpu(struct kvm_vcpu *vcpu)
__free_pages(virt_to_page(svm->nested.msrpm), MSRPM_ALLOC_ORDER);
kvm_vcpu_uninit(vcpu);
kmem_cache_free(kvm_vcpu_cache, svm);
/*
* The vmcb page can be recycled, causing a false negative in
* svm_vcpu_load(). So do a full IBPB now.
*/
indirect_branch_prediction_barrier();
}
static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
struct svm_cpu_data *sd = per_cpu(svm_data, cpu);
int i;
if (unlikely(cpu != vcpu->cpu)) {
@ -1736,6 +1768,10 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
if (static_cpu_has(X86_FEATURE_RDTSCP))
wrmsrl(MSR_TSC_AUX, svm->tsc_aux);
if (sd->current_vmcb != svm->vmcb) {
sd->current_vmcb = svm->vmcb;
indirect_branch_prediction_barrier();
}
avic_vcpu_load(vcpu, cpu);
}
@ -3593,6 +3629,13 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_VM_CR:
msr_info->data = svm->nested.vm_cr_msr;
break;
case MSR_IA32_SPEC_CTRL:
if (!msr_info->host_initiated &&
!guest_cpuid_has(vcpu, X86_FEATURE_IBRS))
return 1;
msr_info->data = svm->spec_ctrl;
break;
case MSR_IA32_UCODE_REV:
msr_info->data = 0x01000065;
break;
@ -3684,6 +3727,49 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
case MSR_IA32_TSC:
kvm_write_tsc(vcpu, msr);
break;
case MSR_IA32_SPEC_CTRL:
if (!msr->host_initiated &&
!guest_cpuid_has(vcpu, X86_FEATURE_IBRS))
return 1;
/* The STIBP bit doesn't fault even if it's not advertised */
if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP))
return 1;
svm->spec_ctrl = data;
if (!data)
break;
/*
* For non-nested:
* When it's written (to non-zero) for the first time, pass
* it through.
*
* For nested:
* The handling of the MSR bitmap for L2 guests is done in
* nested_svm_vmrun_msrpm.
* We update the L1 MSR bit as well since it will end up
* touching the MSR anyway now.
*/
set_msr_interception(svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
break;
case MSR_IA32_PRED_CMD:
if (!msr->host_initiated &&
!guest_cpuid_has(vcpu, X86_FEATURE_IBPB))
return 1;
if (data & ~PRED_CMD_IBPB)
return 1;
if (!data)
break;
wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
if (is_guest_mode(vcpu))
break;
set_msr_interception(svm->msrpm, MSR_IA32_PRED_CMD, 0, 1);
break;
case MSR_STAR:
svm->vmcb->save.star = data;
break;
@ -4936,6 +5022,15 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
local_irq_enable();
/*
* If this vCPU has touched SPEC_CTRL, restore the guest's value if
* it's non-zero. Since vmentry is serialising on affected CPUs, there
* is no need to worry about the conditional branch over the wrmsr
* being speculatively taken.
*/
if (svm->spec_ctrl)
wrmsrl(MSR_IA32_SPEC_CTRL, svm->spec_ctrl);
asm volatile (
"push %%" _ASM_BP "; \n\t"
"mov %c[rbx](%[svm]), %%" _ASM_BX " \n\t"
@ -5028,6 +5123,27 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
#endif
);
/*
* We do not use IBRS in the kernel. If this vCPU has used the
* SPEC_CTRL MSR it may have left it on; save the value and
* turn it off. This is much more efficient than blindly adding
* it to the atomic save/restore list. Especially as the former
* (Saving guest MSRs on vmexit) doesn't even exist in KVM.
*
* For non-nested case:
* If the L01 MSR bitmap does not intercept the MSR, then we need to
* save it.
*
* For nested case:
* If the L02 MSR bitmap does not intercept the MSR, then we need to
* save it.
*/
if (!msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL))
rdmsrl(MSR_IA32_SPEC_CTRL, svm->spec_ctrl);
if (svm->spec_ctrl)
wrmsrl(MSR_IA32_SPEC_CTRL, 0);
/* Eliminate branch target predictions from guest mode */
vmexit_fill_RSB();

View File

@ -34,6 +34,7 @@
#include <linux/tboot.h>
#include <linux/hrtimer.h>
#include <linux/frame.h>
#include <linux/nospec.h>
#include "kvm_cache_regs.h"
#include "x86.h"
@ -111,6 +112,14 @@ static u64 __read_mostly host_xss;
static bool __read_mostly enable_pml = 1;
module_param_named(pml, enable_pml, bool, S_IRUGO);
#define MSR_TYPE_R 1
#define MSR_TYPE_W 2
#define MSR_TYPE_RW 3
#define MSR_BITMAP_MODE_X2APIC 1
#define MSR_BITMAP_MODE_X2APIC_APICV 2
#define MSR_BITMAP_MODE_LM 4
#define KVM_VMX_TSC_MULTIPLIER_MAX 0xffffffffffffffffULL
/* Guest_tsc -> host_tsc conversion requires 64-bit division. */
@ -185,7 +194,6 @@ module_param(ple_window_max, int, S_IRUGO);
extern const ulong vmx_return;
#define NR_AUTOLOAD_MSRS 8
#define VMCS02_POOL_SIZE 1
struct vmcs {
u32 revision_id;
@ -210,6 +218,7 @@ struct loaded_vmcs {
int soft_vnmi_blocked;
ktime_t entry_time;
s64 vnmi_blocked_time;
unsigned long *msr_bitmap;
struct list_head loaded_vmcss_on_cpu_link;
};
@ -226,7 +235,7 @@ struct shared_msr_entry {
* stored in guest memory specified by VMPTRLD, but is opaque to the guest,
* which must access it using VMREAD/VMWRITE/VMCLEAR instructions.
* More than one of these structures may exist, if L1 runs multiple L2 guests.
* nested_vmx_run() will use the data here to build a vmcs02: a VMCS for the
* nested_vmx_run() will use the data here to build the vmcs02: a VMCS for the
* underlying hardware which will be used to run L2.
* This structure is packed to ensure that its layout is identical across
* machines (necessary for live migration).
@ -409,13 +418,6 @@ struct __packed vmcs12 {
*/
#define VMCS12_SIZE 0x1000
/* Used to remember the last vmcs02 used for some recently used vmcs12s */
struct vmcs02_list {
struct list_head list;
gpa_t vmptr;
struct loaded_vmcs vmcs02;
};
/*
* The nested_vmx structure is part of vcpu_vmx, and holds information we need
* for correct emulation of VMX (i.e., nested VMX) on this vcpu.
@ -440,15 +442,15 @@ struct nested_vmx {
*/
bool sync_shadow_vmcs;
/* vmcs02_list cache of VMCSs recently used to run L2 guests */
struct list_head vmcs02_pool;
int vmcs02_num;
bool change_vmcs01_virtual_x2apic_mode;
/* L2 must run next, and mustn't decide to exit to L1. */
bool nested_run_pending;
struct loaded_vmcs vmcs02;
/*
* Guest pages referred to in vmcs02 with host-physical pointers, so
* we must keep them pinned while L2 runs.
* Guest pages referred to in the vmcs02 with host-physical
* pointers, so we must keep them pinned while L2 runs.
*/
struct page *apic_access_page;
struct page *virtual_apic_page;
@ -457,8 +459,6 @@ struct nested_vmx {
bool pi_pending;
u16 posted_intr_nv;
unsigned long *msr_bitmap;
struct hrtimer preemption_timer;
bool preemption_timer_expired;
@ -581,6 +581,7 @@ struct vcpu_vmx {
struct kvm_vcpu vcpu;
unsigned long host_rsp;
u8 fail;
u8 msr_bitmap_mode;
u32 exit_intr_info;
u32 idt_vectoring_info;
ulong rflags;
@ -592,6 +593,10 @@ struct vcpu_vmx {
u64 msr_host_kernel_gs_base;
u64 msr_guest_kernel_gs_base;
#endif
u64 arch_capabilities;
u64 spec_ctrl;
u32 vm_entry_controls_shadow;
u32 vm_exit_controls_shadow;
u32 secondary_exec_control;
@ -898,21 +903,18 @@ static const unsigned short vmcs_field_to_offset_table[] = {
static inline short vmcs_field_to_offset(unsigned long field)
{
BUILD_BUG_ON(ARRAY_SIZE(vmcs_field_to_offset_table) > SHRT_MAX);
const size_t size = ARRAY_SIZE(vmcs_field_to_offset_table);
unsigned short offset;
if (field >= ARRAY_SIZE(vmcs_field_to_offset_table))
BUILD_BUG_ON(size > SHRT_MAX);
if (field >= size)
return -ENOENT;
/*
* FIXME: Mitigation for CVE-2017-5753. To be replaced with a
* generic mechanism.
*/
asm("lfence");
if (vmcs_field_to_offset_table[field] == 0)
field = array_index_nospec(field, size);
offset = vmcs_field_to_offset_table[field];
if (offset == 0)
return -ENOENT;
return vmcs_field_to_offset_table[field];
return offset;
}
static inline struct vmcs12 *get_vmcs12(struct kvm_vcpu *vcpu)
@ -935,6 +937,9 @@ static bool vmx_get_nmi_mask(struct kvm_vcpu *vcpu);
static void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12,
u16 error_code);
static void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu);
static void __always_inline vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
u32 msr, int type);
static DEFINE_PER_CPU(struct vmcs *, vmxarea);
static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
@ -954,12 +959,6 @@ static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);
enum {
VMX_IO_BITMAP_A,
VMX_IO_BITMAP_B,
VMX_MSR_BITMAP_LEGACY,
VMX_MSR_BITMAP_LONGMODE,
VMX_MSR_BITMAP_LEGACY_X2APIC_APICV,
VMX_MSR_BITMAP_LONGMODE_X2APIC_APICV,
VMX_MSR_BITMAP_LEGACY_X2APIC,
VMX_MSR_BITMAP_LONGMODE_X2APIC,
VMX_VMREAD_BITMAP,
VMX_VMWRITE_BITMAP,
VMX_BITMAP_NR
@ -969,12 +968,6 @@ static unsigned long *vmx_bitmap[VMX_BITMAP_NR];
#define vmx_io_bitmap_a (vmx_bitmap[VMX_IO_BITMAP_A])
#define vmx_io_bitmap_b (vmx_bitmap[VMX_IO_BITMAP_B])
#define vmx_msr_bitmap_legacy (vmx_bitmap[VMX_MSR_BITMAP_LEGACY])
#define vmx_msr_bitmap_longmode (vmx_bitmap[VMX_MSR_BITMAP_LONGMODE])
#define vmx_msr_bitmap_legacy_x2apic_apicv (vmx_bitmap[VMX_MSR_BITMAP_LEGACY_X2APIC_APICV])
#define vmx_msr_bitmap_longmode_x2apic_apicv (vmx_bitmap[VMX_MSR_BITMAP_LONGMODE_X2APIC_APICV])
#define vmx_msr_bitmap_legacy_x2apic (vmx_bitmap[VMX_MSR_BITMAP_LEGACY_X2APIC])
#define vmx_msr_bitmap_longmode_x2apic (vmx_bitmap[VMX_MSR_BITMAP_LONGMODE_X2APIC])
#define vmx_vmread_bitmap (vmx_bitmap[VMX_VMREAD_BITMAP])
#define vmx_vmwrite_bitmap (vmx_bitmap[VMX_VMWRITE_BITMAP])
@ -1918,6 +1911,52 @@ static void update_exception_bitmap(struct kvm_vcpu *vcpu)
vmcs_write32(EXCEPTION_BITMAP, eb);
}
/*
* Check if MSR is intercepted for currently loaded MSR bitmap.
*/
static bool msr_write_intercepted(struct kvm_vcpu *vcpu, u32 msr)
{
unsigned long *msr_bitmap;
int f = sizeof(unsigned long);
if (!cpu_has_vmx_msr_bitmap())
return true;
msr_bitmap = to_vmx(vcpu)->loaded_vmcs->msr_bitmap;
if (msr <= 0x1fff) {
return !!test_bit(msr, msr_bitmap + 0x800 / f);
} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
msr &= 0x1fff;
return !!test_bit(msr, msr_bitmap + 0xc00 / f);
}
return true;
}
/*
* Check if MSR is intercepted for L01 MSR bitmap.
*/
static bool msr_write_intercepted_l01(struct kvm_vcpu *vcpu, u32 msr)
{
unsigned long *msr_bitmap;
int f = sizeof(unsigned long);
if (!cpu_has_vmx_msr_bitmap())
return true;
msr_bitmap = to_vmx(vcpu)->vmcs01.msr_bitmap;
if (msr <= 0x1fff) {
return !!test_bit(msr, msr_bitmap + 0x800 / f);
} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
msr &= 0x1fff;
return !!test_bit(msr, msr_bitmap + 0xc00 / f);
}
return true;
}
static void clear_atomic_switch_msr_special(struct vcpu_vmx *vmx,
unsigned long entry, unsigned long exit)
{
@ -2296,6 +2335,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
vmcs_load(vmx->loaded_vmcs->vmcs);
indirect_branch_prediction_barrier();
}
if (!already_loaded) {
@ -2572,36 +2612,6 @@ static void move_msr_up(struct vcpu_vmx *vmx, int from, int to)
vmx->guest_msrs[from] = tmp;
}
static void vmx_set_msr_bitmap(struct kvm_vcpu *vcpu)
{
unsigned long *msr_bitmap;
if (is_guest_mode(vcpu))
msr_bitmap = to_vmx(vcpu)->nested.msr_bitmap;
else if (cpu_has_secondary_exec_ctrls() &&
(vmcs_read32(SECONDARY_VM_EXEC_CONTROL) &
SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE)) {
if (enable_apicv && kvm_vcpu_apicv_active(vcpu)) {
if (is_long_mode(vcpu))
msr_bitmap = vmx_msr_bitmap_longmode_x2apic_apicv;
else
msr_bitmap = vmx_msr_bitmap_legacy_x2apic_apicv;
} else {
if (is_long_mode(vcpu))
msr_bitmap = vmx_msr_bitmap_longmode_x2apic;
else
msr_bitmap = vmx_msr_bitmap_legacy_x2apic;
}
} else {
if (is_long_mode(vcpu))
msr_bitmap = vmx_msr_bitmap_longmode;
else
msr_bitmap = vmx_msr_bitmap_legacy;
}
vmcs_write64(MSR_BITMAP, __pa(msr_bitmap));
}
/*
* Set up the vmcs to automatically save and restore system
* msrs. Don't touch the 64-bit msrs if the guest is in legacy
@ -2642,7 +2652,7 @@ static void setup_msrs(struct vcpu_vmx *vmx)
vmx->save_nmsrs = save_nmsrs;
if (cpu_has_vmx_msr_bitmap())
vmx_set_msr_bitmap(&vmx->vcpu);
vmx_update_msr_bitmap(&vmx->vcpu);
}
/*
@ -3276,6 +3286,20 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_TSC:
msr_info->data = guest_read_tsc(vcpu);
break;
case MSR_IA32_SPEC_CTRL:
if (!msr_info->host_initiated &&
!guest_cpuid_has(vcpu, X86_FEATURE_IBRS) &&
!guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
return 1;
msr_info->data = to_vmx(vcpu)->spec_ctrl;
break;
case MSR_IA32_ARCH_CAPABILITIES:
if (!msr_info->host_initiated &&
!guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES))
return 1;
msr_info->data = to_vmx(vcpu)->arch_capabilities;
break;
case MSR_IA32_SYSENTER_CS:
msr_info->data = vmcs_read32(GUEST_SYSENTER_CS);
break;
@ -3383,6 +3407,70 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_TSC:
kvm_write_tsc(vcpu, msr_info);
break;
case MSR_IA32_SPEC_CTRL:
if (!msr_info->host_initiated &&
!guest_cpuid_has(vcpu, X86_FEATURE_IBRS) &&
!guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
return 1;
/* The STIBP bit doesn't fault even if it's not advertised */
if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP))
return 1;
vmx->spec_ctrl = data;
if (!data)
break;
/*
* For non-nested:
* When it's written (to non-zero) for the first time, pass
* it through.
*
* For nested:
* The handling of the MSR bitmap for L2 guests is done in
* nested_vmx_merge_msr_bitmap. We should not touch the
* vmcs02.msr_bitmap here since it gets completely overwritten
* in the merging. We update the vmcs01 here for L1 as well
* since it will end up touching the MSR anyway now.
*/
vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap,
MSR_IA32_SPEC_CTRL,
MSR_TYPE_RW);
break;
case MSR_IA32_PRED_CMD:
if (!msr_info->host_initiated &&
!guest_cpuid_has(vcpu, X86_FEATURE_IBPB) &&
!guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
return 1;
if (data & ~PRED_CMD_IBPB)
return 1;
if (!data)
break;
wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
/*
* For non-nested:
* When it's written (to non-zero) for the first time, pass
* it through.
*
* For nested:
* The handling of the MSR bitmap for L2 guests is done in
* nested_vmx_merge_msr_bitmap. We should not touch the
* vmcs02.msr_bitmap here since it gets completely overwritten
* in the merging.
*/
vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap, MSR_IA32_PRED_CMD,
MSR_TYPE_W);
break;
case MSR_IA32_ARCH_CAPABILITIES:
if (!msr_info->host_initiated)
return 1;
vmx->arch_capabilities = data;
break;
case MSR_IA32_CR_PAT:
if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
@ -3837,11 +3925,6 @@ static struct vmcs *alloc_vmcs_cpu(int cpu)
return vmcs;
}
static struct vmcs *alloc_vmcs(void)
{
return alloc_vmcs_cpu(raw_smp_processor_id());
}
static void free_vmcs(struct vmcs *vmcs)
{
free_pages((unsigned long)vmcs, vmcs_config.order);
@ -3857,9 +3940,38 @@ static void free_loaded_vmcs(struct loaded_vmcs *loaded_vmcs)
loaded_vmcs_clear(loaded_vmcs);
free_vmcs(loaded_vmcs->vmcs);
loaded_vmcs->vmcs = NULL;
if (loaded_vmcs->msr_bitmap)
free_page((unsigned long)loaded_vmcs->msr_bitmap);
WARN_ON(loaded_vmcs->shadow_vmcs != NULL);
}
static struct vmcs *alloc_vmcs(void)
{
return alloc_vmcs_cpu(raw_smp_processor_id());
}
static int alloc_loaded_vmcs(struct loaded_vmcs *loaded_vmcs)
{
loaded_vmcs->vmcs = alloc_vmcs();
if (!loaded_vmcs->vmcs)
return -ENOMEM;
loaded_vmcs->shadow_vmcs = NULL;
loaded_vmcs_init(loaded_vmcs);
if (cpu_has_vmx_msr_bitmap()) {
loaded_vmcs->msr_bitmap = (unsigned long *)__get_free_page(GFP_KERNEL);
if (!loaded_vmcs->msr_bitmap)
goto out_vmcs;
memset(loaded_vmcs->msr_bitmap, 0xff, PAGE_SIZE);
}
return 0;
out_vmcs:
free_loaded_vmcs(loaded_vmcs);
return -ENOMEM;
}
static void free_kvm_area(void)
{
int cpu;
@ -4918,10 +5030,8 @@ static void free_vpid(int vpid)
spin_unlock(&vmx_vpid_lock);
}
#define MSR_TYPE_R 1
#define MSR_TYPE_W 2
static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
u32 msr, int type)
static void __always_inline vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
u32 msr, int type)
{
int f = sizeof(unsigned long);
@ -4955,6 +5065,50 @@ static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
}
}
static void __always_inline vmx_enable_intercept_for_msr(unsigned long *msr_bitmap,
u32 msr, int type)
{
int f = sizeof(unsigned long);
if (!cpu_has_vmx_msr_bitmap())
return;
/*
* See Intel PRM Vol. 3, 20.6.9 (MSR-Bitmap Address). Early manuals
* have the write-low and read-high bitmap offsets the wrong way round.
* We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
*/
if (msr <= 0x1fff) {
if (type & MSR_TYPE_R)
/* read-low */
__set_bit(msr, msr_bitmap + 0x000 / f);
if (type & MSR_TYPE_W)
/* write-low */
__set_bit(msr, msr_bitmap + 0x800 / f);
} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
msr &= 0x1fff;
if (type & MSR_TYPE_R)
/* read-high */
__set_bit(msr, msr_bitmap + 0x400 / f);
if (type & MSR_TYPE_W)
/* write-high */
__set_bit(msr, msr_bitmap + 0xc00 / f);
}
}
static void __always_inline vmx_set_intercept_for_msr(unsigned long *msr_bitmap,
u32 msr, int type, bool value)
{
if (value)
vmx_enable_intercept_for_msr(msr_bitmap, msr, type);
else
vmx_disable_intercept_for_msr(msr_bitmap, msr, type);
}
/*
* If a msr is allowed by L0, we should check whether it is allowed by L1.
* The corresponding bit will be cleared unless both of L0 and L1 allow it.
@ -5001,28 +5155,68 @@ static void nested_vmx_disable_intercept_for_msr(unsigned long *msr_bitmap_l1,
}
}
static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
static u8 vmx_msr_bitmap_mode(struct kvm_vcpu *vcpu)
{
if (!longmode_only)
__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
msr, MSR_TYPE_R | MSR_TYPE_W);
__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
msr, MSR_TYPE_R | MSR_TYPE_W);
u8 mode = 0;
if (cpu_has_secondary_exec_ctrls() &&
(vmcs_read32(SECONDARY_VM_EXEC_CONTROL) &
SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE)) {
mode |= MSR_BITMAP_MODE_X2APIC;
if (enable_apicv && kvm_vcpu_apicv_active(vcpu))
mode |= MSR_BITMAP_MODE_X2APIC_APICV;
}
if (is_long_mode(vcpu))
mode |= MSR_BITMAP_MODE_LM;
return mode;
}
static void vmx_disable_intercept_msr_x2apic(u32 msr, int type, bool apicv_active)
#define X2APIC_MSR(r) (APIC_BASE_MSR + ((r) >> 4))
static void vmx_update_msr_bitmap_x2apic(unsigned long *msr_bitmap,
u8 mode)
{
if (apicv_active) {
__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy_x2apic_apicv,
msr, type);
__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode_x2apic_apicv,
msr, type);
} else {
__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy_x2apic,
msr, type);
__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode_x2apic,
msr, type);
int msr;
for (msr = 0x800; msr <= 0x8ff; msr += BITS_PER_LONG) {
unsigned word = msr / BITS_PER_LONG;
msr_bitmap[word] = (mode & MSR_BITMAP_MODE_X2APIC_APICV) ? 0 : ~0;
msr_bitmap[word + (0x800 / sizeof(long))] = ~0;
}
if (mode & MSR_BITMAP_MODE_X2APIC) {
/*
* TPR reads and writes can be virtualized even if virtual interrupt
* delivery is not in use.
*/
vmx_disable_intercept_for_msr(msr_bitmap, X2APIC_MSR(APIC_TASKPRI), MSR_TYPE_RW);
if (mode & MSR_BITMAP_MODE_X2APIC_APICV) {
vmx_enable_intercept_for_msr(msr_bitmap, X2APIC_MSR(APIC_TMCCT), MSR_TYPE_R);
vmx_disable_intercept_for_msr(msr_bitmap, X2APIC_MSR(APIC_EOI), MSR_TYPE_W);
vmx_disable_intercept_for_msr(msr_bitmap, X2APIC_MSR(APIC_SELF_IPI), MSR_TYPE_W);
}
}
}
static void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
unsigned long *msr_bitmap = vmx->vmcs01.msr_bitmap;
u8 mode = vmx_msr_bitmap_mode(vcpu);
u8 changed = mode ^ vmx->msr_bitmap_mode;
if (!changed)
return;
vmx_set_intercept_for_msr(msr_bitmap, MSR_KERNEL_GS_BASE, MSR_TYPE_RW,
!(mode & MSR_BITMAP_MODE_LM));
if (changed & (MSR_BITMAP_MODE_X2APIC | MSR_BITMAP_MODE_X2APIC_APICV))
vmx_update_msr_bitmap_x2apic(msr_bitmap, mode);
vmx->msr_bitmap_mode = mode;
}
static bool vmx_get_enable_apicv(struct kvm_vcpu *vcpu)
@ -5274,7 +5468,7 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
}
if (cpu_has_vmx_msr_bitmap())
vmx_set_msr_bitmap(vcpu);
vmx_update_msr_bitmap(vcpu);
}
static u32 vmx_exec_control(struct vcpu_vmx *vmx)
@ -5461,7 +5655,7 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx)
vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap));
}
if (cpu_has_vmx_msr_bitmap())
vmcs_write64(MSR_BITMAP, __pa(vmx_msr_bitmap_legacy));
vmcs_write64(MSR_BITMAP, __pa(vmx->vmcs01.msr_bitmap));
vmcs_write64(VMCS_LINK_POINTER, -1ull); /* 22.3.1.5 */
@ -5539,6 +5733,8 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx)
++vmx->nmsrs;
}
if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
rdmsrl(MSR_IA32_ARCH_CAPABILITIES, vmx->arch_capabilities);
vm_exit_controls_init(vmx, vmcs_config.vmexit_ctrl);
@ -5567,6 +5763,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
u64 cr0;
vmx->rmode.vm86_active = 0;
vmx->spec_ctrl = 0;
vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
kvm_set_cr8(vcpu, 0);
@ -6744,7 +6941,7 @@ void vmx_enable_tdp(void)
static __init int hardware_setup(void)
{
int r = -ENOMEM, i, msr;
int r = -ENOMEM, i;
rdmsrl_safe(MSR_EFER, &host_efer);
@ -6764,9 +6961,6 @@ static __init int hardware_setup(void)
memset(vmx_io_bitmap_b, 0xff, PAGE_SIZE);
memset(vmx_msr_bitmap_legacy, 0xff, PAGE_SIZE);
memset(vmx_msr_bitmap_longmode, 0xff, PAGE_SIZE);
if (setup_vmcs_config(&vmcs_config) < 0) {
r = -EIO;
goto out;
@ -6835,42 +7029,8 @@ static __init int hardware_setup(void)
kvm_tsc_scaling_ratio_frac_bits = 48;
}
vmx_disable_intercept_for_msr(MSR_FS_BASE, false);
vmx_disable_intercept_for_msr(MSR_GS_BASE, false);
vmx_disable_intercept_for_msr(MSR_KERNEL_GS_BASE, true);
vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_CS, false);
vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_ESP, false);
vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_EIP, false);
memcpy(vmx_msr_bitmap_legacy_x2apic_apicv,
vmx_msr_bitmap_legacy, PAGE_SIZE);
memcpy(vmx_msr_bitmap_longmode_x2apic_apicv,
vmx_msr_bitmap_longmode, PAGE_SIZE);
memcpy(vmx_msr_bitmap_legacy_x2apic,
vmx_msr_bitmap_legacy, PAGE_SIZE);
memcpy(vmx_msr_bitmap_longmode_x2apic,
vmx_msr_bitmap_longmode, PAGE_SIZE);
set_bit(0, vmx_vpid_bitmap); /* 0 is reserved for host */
for (msr = 0x800; msr <= 0x8ff; msr++) {
if (msr == 0x839 /* TMCCT */)
continue;
vmx_disable_intercept_msr_x2apic(msr, MSR_TYPE_R, true);
}
/*
* TPR reads and writes can be virtualized even if virtual interrupt
* delivery is not in use.
*/
vmx_disable_intercept_msr_x2apic(0x808, MSR_TYPE_W, true);
vmx_disable_intercept_msr_x2apic(0x808, MSR_TYPE_R | MSR_TYPE_W, false);
/* EOI */
vmx_disable_intercept_msr_x2apic(0x80b, MSR_TYPE_W, true);
/* SELF-IPI */
vmx_disable_intercept_msr_x2apic(0x83f, MSR_TYPE_W, true);
if (enable_ept)
vmx_enable_tdp();
else
@ -6973,94 +7133,6 @@ static int handle_monitor(struct kvm_vcpu *vcpu)
return handle_nop(vcpu);
}
/*
* To run an L2 guest, we need a vmcs02 based on the L1-specified vmcs12.
* We could reuse a single VMCS for all the L2 guests, but we also want the
* option to allocate a separate vmcs02 for each separate loaded vmcs12 - this
* allows keeping them loaded on the processor, and in the future will allow
* optimizations where prepare_vmcs02 doesn't need to set all the fields on
* every entry if they never change.
* So we keep, in vmx->nested.vmcs02_pool, a cache of size VMCS02_POOL_SIZE
* (>=0) with a vmcs02 for each recently loaded vmcs12s, most recent first.
*
* The following functions allocate and free a vmcs02 in this pool.
*/
/* Get a VMCS from the pool to use as vmcs02 for the current vmcs12. */
static struct loaded_vmcs *nested_get_current_vmcs02(struct vcpu_vmx *vmx)
{
struct vmcs02_list *item;
list_for_each_entry(item, &vmx->nested.vmcs02_pool, list)
if (item->vmptr == vmx->nested.current_vmptr) {
list_move(&item->list, &vmx->nested.vmcs02_pool);
return &item->vmcs02;
}
if (vmx->nested.vmcs02_num >= max(VMCS02_POOL_SIZE, 1)) {
/* Recycle the least recently used VMCS. */
item = list_last_entry(&vmx->nested.vmcs02_pool,
struct vmcs02_list, list);
item->vmptr = vmx->nested.current_vmptr;
list_move(&item->list, &vmx->nested.vmcs02_pool);
return &item->vmcs02;
}
/* Create a new VMCS */
item = kzalloc(sizeof(struct vmcs02_list), GFP_KERNEL);
if (!item)
return NULL;
item->vmcs02.vmcs = alloc_vmcs();
item->vmcs02.shadow_vmcs = NULL;
if (!item->vmcs02.vmcs) {
kfree(item);
return NULL;
}
loaded_vmcs_init(&item->vmcs02);
item->vmptr = vmx->nested.current_vmptr;
list_add(&(item->list), &(vmx->nested.vmcs02_pool));
vmx->nested.vmcs02_num++;
return &item->vmcs02;
}
/* Free and remove from pool a vmcs02 saved for a vmcs12 (if there is one) */
static void nested_free_vmcs02(struct vcpu_vmx *vmx, gpa_t vmptr)
{
struct vmcs02_list *item;
list_for_each_entry(item, &vmx->nested.vmcs02_pool, list)
if (item->vmptr == vmptr) {
free_loaded_vmcs(&item->vmcs02);
list_del(&item->list);
kfree(item);
vmx->nested.vmcs02_num--;
return;
}
}
/*
* Free all VMCSs saved for this vcpu, except the one pointed by
* vmx->loaded_vmcs. We must be running L1, so vmx->loaded_vmcs
* must be &vmx->vmcs01.
*/
static void nested_free_all_saved_vmcss(struct vcpu_vmx *vmx)
{
struct vmcs02_list *item, *n;
WARN_ON(vmx->loaded_vmcs != &vmx->vmcs01);
list_for_each_entry_safe(item, n, &vmx->nested.vmcs02_pool, list) {
/*
* Something will leak if the above WARN triggers. Better than
* a use-after-free.
*/
if (vmx->loaded_vmcs == &item->vmcs02)
continue;
free_loaded_vmcs(&item->vmcs02);
list_del(&item->list);
kfree(item);
vmx->nested.vmcs02_num--;
}
}
/*
* The following 3 functions, nested_vmx_succeed()/failValid()/failInvalid(),
* set the success or error code of an emulated VMX instruction, as specified
@ -7241,13 +7313,11 @@ static int enter_vmx_operation(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct vmcs *shadow_vmcs;
int r;
if (cpu_has_vmx_msr_bitmap()) {
vmx->nested.msr_bitmap =
(unsigned long *)__get_free_page(GFP_KERNEL);
if (!vmx->nested.msr_bitmap)
goto out_msr_bitmap;
}
r = alloc_loaded_vmcs(&vmx->nested.vmcs02);
if (r < 0)
goto out_vmcs02;
vmx->nested.cached_vmcs12 = kmalloc(VMCS12_SIZE, GFP_KERNEL);
if (!vmx->nested.cached_vmcs12)
@ -7264,9 +7334,6 @@ static int enter_vmx_operation(struct kvm_vcpu *vcpu)
vmx->vmcs01.shadow_vmcs = shadow_vmcs;
}
INIT_LIST_HEAD(&(vmx->nested.vmcs02_pool));
vmx->nested.vmcs02_num = 0;
hrtimer_init(&vmx->nested.preemption_timer, CLOCK_MONOTONIC,
HRTIMER_MODE_REL_PINNED);
vmx->nested.preemption_timer.function = vmx_preemption_timer_fn;
@ -7278,9 +7345,9 @@ static int enter_vmx_operation(struct kvm_vcpu *vcpu)
kfree(vmx->nested.cached_vmcs12);
out_cached_vmcs12:
free_page((unsigned long)vmx->nested.msr_bitmap);
free_loaded_vmcs(&vmx->nested.vmcs02);
out_msr_bitmap:
out_vmcs02:
return -ENOMEM;
}
@ -7423,10 +7490,6 @@ static void free_nested(struct vcpu_vmx *vmx)
free_vpid(vmx->nested.vpid02);
vmx->nested.posted_intr_nv = -1;
vmx->nested.current_vmptr = -1ull;
if (vmx->nested.msr_bitmap) {
free_page((unsigned long)vmx->nested.msr_bitmap);
vmx->nested.msr_bitmap = NULL;
}
if (enable_shadow_vmcs) {
vmx_disable_shadow_vmcs(vmx);
vmcs_clear(vmx->vmcs01.shadow_vmcs);
@ -7434,7 +7497,7 @@ static void free_nested(struct vcpu_vmx *vmx)
vmx->vmcs01.shadow_vmcs = NULL;
}
kfree(vmx->nested.cached_vmcs12);
/* Unpin physical memory we referred to in current vmcs02 */
/* Unpin physical memory we referred to in the vmcs02 */
if (vmx->nested.apic_access_page) {
kvm_release_page_dirty(vmx->nested.apic_access_page);
vmx->nested.apic_access_page = NULL;
@ -7450,7 +7513,7 @@ static void free_nested(struct vcpu_vmx *vmx)
vmx->nested.pi_desc = NULL;
}
nested_free_all_saved_vmcss(vmx);
free_loaded_vmcs(&vmx->nested.vmcs02);
}
/* Emulate the VMXOFF instruction */
@ -7493,8 +7556,6 @@ static int handle_vmclear(struct kvm_vcpu *vcpu)
vmptr + offsetof(struct vmcs12, launch_state),
&zero, sizeof(zero));
nested_free_vmcs02(vmx, vmptr);
nested_vmx_succeed(vcpu);
return kvm_skip_emulated_instruction(vcpu);
}
@ -8406,10 +8467,11 @@ static bool nested_vmx_exit_reflected(struct kvm_vcpu *vcpu, u32 exit_reason)
/*
* The host physical addresses of some pages of guest memory
* are loaded into VMCS02 (e.g. L1's Virtual APIC Page). The CPU
* may write to these pages via their host physical address while
* L2 is running, bypassing any address-translation-based dirty
* tracking (e.g. EPT write protection).
* are loaded into the vmcs02 (e.g. vmcs12's Virtual APIC
* Page). The CPU may write to these pages via their host
* physical address while L2 is running, bypassing any
* address-translation-based dirty tracking (e.g. EPT write
* protection).
*
* Mark them dirty on every exit from L2 to prevent them from
* getting out of sync with dirty tracking.
@ -8943,7 +9005,7 @@ static void vmx_set_virtual_x2apic_mode(struct kvm_vcpu *vcpu, bool set)
}
vmcs_write32(SECONDARY_VM_EXEC_CONTROL, sec_exec_control);
vmx_set_msr_bitmap(vcpu);
vmx_update_msr_bitmap(vcpu);
}
static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu, hpa_t hpa)
@ -9373,6 +9435,15 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
vmx_arm_hv_timer(vcpu);
/*
* If this vCPU has touched SPEC_CTRL, restore the guest's value if
* it's non-zero. Since vmentry is serialising on affected CPUs, there
* is no need to worry about the conditional branch over the wrmsr
* being speculatively taken.
*/
if (vmx->spec_ctrl)
wrmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
vmx->__launched = vmx->loaded_vmcs->launched;
asm(
/* Store host registers */
@ -9491,6 +9562,27 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
#endif
);
/*
* We do not use IBRS in the kernel. If this vCPU has used the
* SPEC_CTRL MSR it may have left it on; save the value and
* turn it off. This is much more efficient than blindly adding
* it to the atomic save/restore list. Especially as the former
* (Saving guest MSRs on vmexit) doesn't even exist in KVM.
*
* For non-nested case:
* If the L01 MSR bitmap does not intercept the MSR, then we need to
* save it.
*
* For nested case:
* If the L02 MSR bitmap does not intercept the MSR, then we need to
* save it.
*/
if (!msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL))
rdmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
if (vmx->spec_ctrl)
wrmsrl(MSR_IA32_SPEC_CTRL, 0);
/* Eliminate branch target predictions from guest mode */
vmexit_fill_RSB();
@ -9604,6 +9696,7 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
{
int err;
struct vcpu_vmx *vmx = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL);
unsigned long *msr_bitmap;
int cpu;
if (!vmx)
@ -9636,13 +9729,20 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
if (!vmx->guest_msrs)
goto free_pml;
vmx->loaded_vmcs = &vmx->vmcs01;
vmx->loaded_vmcs->vmcs = alloc_vmcs();
vmx->loaded_vmcs->shadow_vmcs = NULL;
if (!vmx->loaded_vmcs->vmcs)
err = alloc_loaded_vmcs(&vmx->vmcs01);
if (err < 0)
goto free_msrs;
loaded_vmcs_init(vmx->loaded_vmcs);
msr_bitmap = vmx->vmcs01.msr_bitmap;
vmx_disable_intercept_for_msr(msr_bitmap, MSR_FS_BASE, MSR_TYPE_RW);
vmx_disable_intercept_for_msr(msr_bitmap, MSR_GS_BASE, MSR_TYPE_RW);
vmx_disable_intercept_for_msr(msr_bitmap, MSR_KERNEL_GS_BASE, MSR_TYPE_RW);
vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_SYSENTER_CS, MSR_TYPE_RW);
vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_SYSENTER_ESP, MSR_TYPE_RW);
vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_SYSENTER_EIP, MSR_TYPE_RW);
vmx->msr_bitmap_mode = 0;
vmx->loaded_vmcs = &vmx->vmcs01;
cpu = get_cpu();
vmx_vcpu_load(&vmx->vcpu, cpu);
vmx->vcpu.cpu = cpu;
@ -10105,10 +10205,25 @@ static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
int msr;
struct page *page;
unsigned long *msr_bitmap_l1;
unsigned long *msr_bitmap_l0 = to_vmx(vcpu)->nested.msr_bitmap;
unsigned long *msr_bitmap_l0 = to_vmx(vcpu)->nested.vmcs02.msr_bitmap;
/*
* pred_cmd & spec_ctrl are trying to verify two things:
*
* 1. L0 gave a permission to L1 to actually passthrough the MSR. This
* ensures that we do not accidentally generate an L02 MSR bitmap
* from the L12 MSR bitmap that is too permissive.
* 2. That L1 or L2s have actually used the MSR. This avoids
* unnecessarily merging of the bitmap if the MSR is unused. This
* works properly because we only update the L01 MSR bitmap lazily.
* So even if L0 should pass L1 these MSRs, the L01 bitmap is only
* updated to reflect this when L1 (or its L2s) actually write to
* the MSR.
*/
bool pred_cmd = msr_write_intercepted_l01(vcpu, MSR_IA32_PRED_CMD);
bool spec_ctrl = msr_write_intercepted_l01(vcpu, MSR_IA32_SPEC_CTRL);
/* This shortcut is ok because we support only x2APIC MSRs so far. */
if (!nested_cpu_has_virt_x2apic_mode(vmcs12))
if (!nested_cpu_has_virt_x2apic_mode(vmcs12) &&
!pred_cmd && !spec_ctrl)
return false;
page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->msr_bitmap);
@ -10141,6 +10256,19 @@ static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
MSR_TYPE_W);
}
}
if (spec_ctrl)
nested_vmx_disable_intercept_for_msr(
msr_bitmap_l1, msr_bitmap_l0,
MSR_IA32_SPEC_CTRL,
MSR_TYPE_R | MSR_TYPE_W);
if (pred_cmd)
nested_vmx_disable_intercept_for_msr(
msr_bitmap_l1, msr_bitmap_l0,
MSR_IA32_PRED_CMD,
MSR_TYPE_W);
kunmap(page);
kvm_release_page_clean(page);
@ -10682,6 +10810,9 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
if (kvm_has_tsc_control)
decache_tsc_multiplier(vmx);
if (cpu_has_vmx_msr_bitmap())
vmcs_write64(MSR_BITMAP, __pa(vmx->nested.vmcs02.msr_bitmap));
if (enable_vpid) {
/*
* There is no direct mapping between vpid02 and vpid12, the
@ -10903,20 +11034,15 @@ static int enter_vmx_non_root_mode(struct kvm_vcpu *vcpu, bool from_vmentry)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
struct loaded_vmcs *vmcs02;
u32 msr_entry_idx;
u32 exit_qual;
vmcs02 = nested_get_current_vmcs02(vmx);
if (!vmcs02)
return -ENOMEM;
enter_guest_mode(vcpu);
if (!(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS))
vmx->nested.vmcs01_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
vmx_switch_vmcs(vcpu, vmcs02);
vmx_switch_vmcs(vcpu, &vmx->nested.vmcs02);
vmx_segment_cache_clear(vmx);
if (prepare_vmcs02(vcpu, vmcs12, from_vmentry, &exit_qual)) {
@ -11485,7 +11611,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
if (cpu_has_vmx_msr_bitmap())
vmx_set_msr_bitmap(vcpu);
vmx_update_msr_bitmap(vcpu);
if (nested_vmx_load_msr(vcpu, vmcs12->vm_exit_msr_load_addr,
vmcs12->vm_exit_msr_load_count))
@ -11534,10 +11660,6 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
vm_exit_controls_reset_shadow(vmx);
vmx_segment_cache_clear(vmx);
/* if no vmcs02 cache requested, remove the one we used */
if (VMCS02_POOL_SIZE == 0)
nested_free_vmcs02(vmx, vmx->nested.current_vmptr);
/* Update any VMCS fields that might have changed while L2 ran */
vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.nr);

View File

@ -1009,6 +1009,7 @@ static u32 msrs_to_save[] = {
#endif
MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
MSR_IA32_FEATURE_CONTROL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
MSR_IA32_SPEC_CTRL, MSR_IA32_ARCH_CAPABILITIES
};
static unsigned num_msrs_to_save;

View File

@ -40,6 +40,8 @@ ENTRY(__get_user_1)
mov PER_CPU_VAR(current_task), %_ASM_DX
cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
jae bad_get_user
sbb %_ASM_DX, %_ASM_DX /* array_index_mask_nospec() */
and %_ASM_DX, %_ASM_AX
ASM_STAC
1: movzbl (%_ASM_AX),%edx
xor %eax,%eax
@ -54,6 +56,8 @@ ENTRY(__get_user_2)
mov PER_CPU_VAR(current_task), %_ASM_DX
cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
jae bad_get_user
sbb %_ASM_DX, %_ASM_DX /* array_index_mask_nospec() */
and %_ASM_DX, %_ASM_AX
ASM_STAC
2: movzwl -1(%_ASM_AX),%edx
xor %eax,%eax
@ -68,6 +72,8 @@ ENTRY(__get_user_4)
mov PER_CPU_VAR(current_task), %_ASM_DX
cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
jae bad_get_user
sbb %_ASM_DX, %_ASM_DX /* array_index_mask_nospec() */
and %_ASM_DX, %_ASM_AX
ASM_STAC
3: movl -3(%_ASM_AX),%edx
xor %eax,%eax
@ -83,6 +89,8 @@ ENTRY(__get_user_8)
mov PER_CPU_VAR(current_task), %_ASM_DX
cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
jae bad_get_user
sbb %_ASM_DX, %_ASM_DX /* array_index_mask_nospec() */
and %_ASM_DX, %_ASM_AX
ASM_STAC
4: movq -7(%_ASM_AX),%rdx
xor %eax,%eax
@ -94,6 +102,8 @@ ENTRY(__get_user_8)
mov PER_CPU_VAR(current_task), %_ASM_DX
cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
jae bad_get_user_8
sbb %_ASM_DX, %_ASM_DX /* array_index_mask_nospec() */
and %_ASM_DX, %_ASM_AX
ASM_STAC
4: movl -7(%_ASM_AX),%edx
5: movl -3(%_ASM_AX),%ecx

View File

@ -331,12 +331,12 @@ do { \
unsigned long __copy_user_ll(void *to, const void *from, unsigned long n)
{
stac();
__uaccess_begin_nospec();
if (movsl_is_ok(to, from, n))
__copy_user(to, from, n);
else
n = __copy_user_intel(to, from, n);
clac();
__uaccess_end();
return n;
}
EXPORT_SYMBOL(__copy_user_ll);
@ -344,7 +344,7 @@ EXPORT_SYMBOL(__copy_user_ll);
unsigned long __copy_from_user_ll_nocache_nozero(void *to, const void __user *from,
unsigned long n)
{
stac();
__uaccess_begin_nospec();
#ifdef CONFIG_X86_INTEL_USERCOPY
if (n > 64 && static_cpu_has(X86_FEATURE_XMM2))
n = __copy_user_intel_nocache(to, from, n);
@ -353,7 +353,7 @@ unsigned long __copy_from_user_ll_nocache_nozero(void *to, const void __user *fr
#else
__copy_user(to, from, n);
#endif
clac();
__uaccess_end();
return n;
}
EXPORT_SYMBOL(__copy_from_user_ll_nocache_nozero);

View File

@ -6,13 +6,14 @@
#include <linux/interrupt.h>
#include <linux/export.h>
#include <linux/cpu.h>
#include <linux/debugfs.h>
#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
#include <asm/nospec-branch.h>
#include <asm/cache.h>
#include <asm/apic.h>
#include <asm/uv/uv.h>
#include <linux/debugfs.h>
/*
* TLB flushing, formerly SMP-only
@ -247,6 +248,27 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
} else {
u16 new_asid;
bool need_flush;
u64 last_ctx_id = this_cpu_read(cpu_tlbstate.last_ctx_id);
/*
* Avoid user/user BTB poisoning by flushing the branch
* predictor when switching between processes. This stops
* one process from doing Spectre-v2 attacks on another.
*
* As an optimization, flush indirect branches only when
* switching into processes that disable dumping. This
* protects high value processes like gpg, without having
* too high performance overhead. IBPB is *expensive*!
*
* This will not flush branches when switching into kernel
* threads. It will also not flush if we switch to idle
* thread and back to the same process. It will flush if we
* switch to a different non-dumpable process.
*/
if (tsk && tsk->mm &&
tsk->mm->context.ctx_id != last_ctx_id &&
get_dumpable(tsk->mm) != SUID_DUMP_USER)
indirect_branch_prediction_barrier();
if (IS_ENABLED(CONFIG_VMAP_STACK)) {
/*
@ -292,6 +314,14 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
}
/*
* Record last user mm's context id, so we can avoid
* flushing branch buffer with IBPB if we switch back
* to the same user.
*/
if (next != &init_mm)
this_cpu_write(cpu_tlbstate.last_ctx_id, next->context.ctx_id);
this_cpu_write(cpu_tlbstate.loaded_mm, next);
this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
}
@ -369,6 +399,7 @@ void initialize_tlbstate_and_flush(void)
write_cr3(build_cr3(mm->pgd, 0));
/* Reinitialize tlbstate. */
this_cpu_write(cpu_tlbstate.last_ctx_id, mm->context.ctx_id);
this_cpu_write(cpu_tlbstate.loaded_mm_asid, 0);
this_cpu_write(cpu_tlbstate.next_asid, 1);
this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id);

View File

@ -10,6 +10,7 @@
#include <linux/compiler.h>
#include <linux/spinlock.h>
#include <linux/rcupdate.h>
#include <linux/nospec.h>
#include <linux/types.h>
#include <linux/init.h>
#include <linux/fs.h>
@ -82,8 +83,10 @@ static inline struct file *__fcheck_files(struct files_struct *files, unsigned i
{
struct fdtable *fdt = rcu_dereference_raw(files->fdt);
if (fd < fdt->max_fds)
if (fd < fdt->max_fds) {
fd = array_index_nospec(fd, fdt->max_fds);
return rcu_dereference_raw(fdt->fd[fd]);
}
return NULL;
}

View File

@ -5,6 +5,13 @@
#include <linux/compiler.h>
#include <linux/types.h>
/* Built-in __init functions needn't be compiled with retpoline */
#if defined(RETPOLINE) && !defined(MODULE)
#define __noretpoline __attribute__((indirect_branch("keep")))
#else
#define __noretpoline
#endif
/* These macros are used to mark some functions or
* initialized data (doesn't apply to uninitialized data)
* as `initialization' functions. The kernel can take this
@ -40,7 +47,7 @@
/* These are for everybody (although not all archs will actually
discard it in modules) */
#define __init __section(.init.text) __cold __latent_entropy
#define __init __section(.init.text) __cold __latent_entropy __noretpoline
#define __initdata __section(.init.data)
#define __initconst __section(.init.rodata)
#define __exitdata __section(.exit.data)

72
include/linux/nospec.h Normal file
View File

@ -0,0 +1,72 @@
// SPDX-License-Identifier: GPL-2.0
// Copyright(c) 2018 Linus Torvalds. All rights reserved.
// Copyright(c) 2018 Alexei Starovoitov. All rights reserved.
// Copyright(c) 2018 Intel Corporation. All rights reserved.
#ifndef _LINUX_NOSPEC_H
#define _LINUX_NOSPEC_H
/**
* array_index_mask_nospec() - generate a ~0 mask when index < size, 0 otherwise
* @index: array element index
* @size: number of elements in array
*
* When @index is out of bounds (@index >= @size), the sign bit will be
* set. Extend the sign bit to all bits and invert, giving a result of
* zero for an out of bounds index, or ~0 if within bounds [0, @size).
*/
#ifndef array_index_mask_nospec
static inline unsigned long array_index_mask_nospec(unsigned long index,
unsigned long size)
{
/*
* Warn developers about inappropriate array_index_nospec() usage.
*
* Even if the CPU speculates past the WARN_ONCE branch, the
* sign bit of @index is taken into account when generating the
* mask.
*
* This warning is compiled out when the compiler can infer that
* @index and @size are less than LONG_MAX.
*/
if (WARN_ONCE(index > LONG_MAX || size > LONG_MAX,
"array_index_nospec() limited to range of [0, LONG_MAX]\n"))
return 0;
/*
* Always calculate and emit the mask even if the compiler
* thinks the mask is not needed. The compiler does not take
* into account the value of @index under speculation.
*/
OPTIMIZER_HIDE_VAR(index);
return ~(long)(index | (size - 1UL - index)) >> (BITS_PER_LONG - 1);
}
#endif
/*
* array_index_nospec - sanitize an array index after a bounds check
*
* For a code sequence like:
*
* if (index < size) {
* index = array_index_nospec(index, size);
* val = array[index];
* }
*
* ...if the CPU speculates past the bounds check then
* array_index_nospec() will clamp the index within the range of [0,
* size).
*/
#define array_index_nospec(index, size) \
({ \
typeof(index) _i = (index); \
typeof(size) _s = (size); \
unsigned long _mask = array_index_mask_nospec(_i, _s); \
\
BUILD_BUG_ON(sizeof(_i) > sizeof(long)); \
BUILD_BUG_ON(sizeof(_s) > sizeof(long)); \
\
_i &= _mask; \
_i; \
})
#endif /* _LINUX_NOSPEC_H */

View File

@ -16,6 +16,7 @@
#include <linux/nl80211.h>
#include <linux/rtnetlink.h>
#include <linux/netlink.h>
#include <linux/nospec.h>
#include <linux/etherdevice.h>
#include <net/net_namespace.h>
#include <net/genetlink.h>
@ -2080,20 +2081,22 @@ static const struct nla_policy txq_params_policy[NL80211_TXQ_ATTR_MAX + 1] = {
static int parse_txq_params(struct nlattr *tb[],
struct ieee80211_txq_params *txq_params)
{
u8 ac;
if (!tb[NL80211_TXQ_ATTR_AC] || !tb[NL80211_TXQ_ATTR_TXOP] ||
!tb[NL80211_TXQ_ATTR_CWMIN] || !tb[NL80211_TXQ_ATTR_CWMAX] ||
!tb[NL80211_TXQ_ATTR_AIFS])
return -EINVAL;
txq_params->ac = nla_get_u8(tb[NL80211_TXQ_ATTR_AC]);
ac = nla_get_u8(tb[NL80211_TXQ_ATTR_AC]);
txq_params->txop = nla_get_u16(tb[NL80211_TXQ_ATTR_TXOP]);
txq_params->cwmin = nla_get_u16(tb[NL80211_TXQ_ATTR_CWMIN]);
txq_params->cwmax = nla_get_u16(tb[NL80211_TXQ_ATTR_CWMAX]);
txq_params->aifs = nla_get_u8(tb[NL80211_TXQ_ATTR_AIFS]);
if (txq_params->ac >= NL80211_NUM_ACS)
if (ac >= NL80211_NUM_ACS)
return -EINVAL;
txq_params->ac = array_index_nospec(ac, NL80211_NUM_ACS);
return 0;
}

View File

@ -544,18 +544,14 @@ static int add_call_destinations(struct objtool_file *file)
dest_off = insn->offset + insn->len + insn->immediate;
insn->call_dest = find_symbol_by_offset(insn->sec,
dest_off);
/*
* FIXME: Thanks to retpolines, it's now considered
* normal for a function to call within itself. So
* disable this warning for now.
*/
#if 0
if (!insn->call_dest) {
WARN_FUNC("can't find call dest symbol at offset 0x%lx",
insn->sec, insn->offset, dest_off);
if (!insn->call_dest && !insn->ignore) {
WARN_FUNC("unsupported intra-function call",
insn->sec, insn->offset);
WARN("If this is a retpoline, please patch it in with alternatives and annotate it with ANNOTATE_NOSPEC_ALTERNATIVE.");
return -1;
}
#endif
} else if (rela->sym->type == STT_SECTION) {
insn->call_dest = find_symbol_by_offset(rela->sym->sec,
rela->addend+4);
@ -599,7 +595,7 @@ static int handle_group_alt(struct objtool_file *file,
struct instruction *orig_insn,
struct instruction **new_insn)
{
struct instruction *last_orig_insn, *last_new_insn, *insn, *fake_jump;
struct instruction *last_orig_insn, *last_new_insn, *insn, *fake_jump = NULL;
unsigned long dest_off;
last_orig_insn = NULL;
@ -615,28 +611,30 @@ static int handle_group_alt(struct objtool_file *file,
last_orig_insn = insn;
}
if (!next_insn_same_sec(file, last_orig_insn)) {
WARN("%s: don't know how to handle alternatives at end of section",
special_alt->orig_sec->name);
return -1;
}
if (next_insn_same_sec(file, last_orig_insn)) {
fake_jump = malloc(sizeof(*fake_jump));
if (!fake_jump) {
WARN("malloc failed");
return -1;
}
memset(fake_jump, 0, sizeof(*fake_jump));
INIT_LIST_HEAD(&fake_jump->alts);
clear_insn_state(&fake_jump->state);
fake_jump = malloc(sizeof(*fake_jump));
if (!fake_jump) {
WARN("malloc failed");
return -1;
fake_jump->sec = special_alt->new_sec;
fake_jump->offset = -1;
fake_jump->type = INSN_JUMP_UNCONDITIONAL;
fake_jump->jump_dest = list_next_entry(last_orig_insn, list);
fake_jump->ignore = true;
}
memset(fake_jump, 0, sizeof(*fake_jump));
INIT_LIST_HEAD(&fake_jump->alts);
clear_insn_state(&fake_jump->state);
fake_jump->sec = special_alt->new_sec;
fake_jump->offset = -1;
fake_jump->type = INSN_JUMP_UNCONDITIONAL;
fake_jump->jump_dest = list_next_entry(last_orig_insn, list);
fake_jump->ignore = true;
if (!special_alt->new_len) {
if (!fake_jump) {
WARN("%s: empty alternative at end of section",
special_alt->orig_sec->name);
return -1;
}
*new_insn = fake_jump;
return 0;
}
@ -649,6 +647,8 @@ static int handle_group_alt(struct objtool_file *file,
last_new_insn = insn;
insn->ignore = orig_insn->ignore_alts;
if (insn->type != INSN_JUMP_CONDITIONAL &&
insn->type != INSN_JUMP_UNCONDITIONAL)
continue;
@ -657,8 +657,14 @@ static int handle_group_alt(struct objtool_file *file,
continue;
dest_off = insn->offset + insn->len + insn->immediate;
if (dest_off == special_alt->new_off + special_alt->new_len)
if (dest_off == special_alt->new_off + special_alt->new_len) {
if (!fake_jump) {
WARN("%s: alternative jump to end of section",
special_alt->orig_sec->name);
return -1;
}
insn->jump_dest = fake_jump;
}
if (!insn->jump_dest) {
WARN_FUNC("can't find alternative jump destination",
@ -673,7 +679,8 @@ static int handle_group_alt(struct objtool_file *file,
return -1;
}
list_add(&fake_jump->list, &last_new_insn->list);
if (fake_jump)
list_add(&fake_jump->list, &last_new_insn->list);
return 0;
}
@ -730,10 +737,6 @@ static int add_special_section_alts(struct objtool_file *file)
goto out;
}
/* Ignore retpoline alternatives. */
if (orig_insn->ignore_alts)
continue;
new_insn = NULL;
if (!special_alt->group || special_alt->new_len) {
new_insn = find_insn(file, special_alt->new_sec,
@ -1090,11 +1093,11 @@ static int decode_sections(struct objtool_file *file)
if (ret)
return ret;
ret = add_call_destinations(file);
ret = add_special_section_alts(file);
if (ret)
return ret;
ret = add_special_section_alts(file);
ret = add_call_destinations(file);
if (ret)
return ret;
@ -1721,10 +1724,12 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
insn->visited = true;
list_for_each_entry(alt, &insn->alts, list) {
ret = validate_branch(file, alt->insn, state);
if (ret)
return 1;
if (!insn->ignore_alts) {
list_for_each_entry(alt, &insn->alts, list) {
ret = validate_branch(file, alt->insn, state);
if (ret)
return 1;
}
}
switch (insn->type) {

View File

@ -98,6 +98,11 @@ static int create_orc_entry(struct section *u_sec, struct section *ip_relasec,
struct orc_entry *orc;
struct rela *rela;
if (!insn_sec->sym) {
WARN("missing symbol for section %s", insn_sec->name);
return -1;
}
/* populate ORC data */
orc = (struct orc_entry *)u_sec->data->d_buf + idx;
memcpy(orc, o, sizeof(*orc));