kernel_optimize_test/Documentation
Oza Pawandeep 7e9084b367 PCI/AER: Handle ERR_FATAL with removal and re-enumeration of devices
PCIe ERR_FATAL errors mean the Link is unreliable.  Components on the Link
may need to be reset to return to reliable operation (PCIe r4.0, sec
6.2.2).  We previously handled these errors much differently depending on
whether the platform supports Downstream Port Containment (DPC) (PCIe r4.0,
sec 6.2.10) or not.

The AER driver has historically logged the error details, called
driver-supplied pci_error_handlers callbacks, and reset the Link.  This
reset downstream devices, but did not remove them from the PCI subsystem,
re-enumerate them, or call their driver .remove() or .probe() methods.

DPC is different because the hardware automatically disables the Link when
it detects ERR_FATAL, which resets downstream devices.  There's no
opportunity for pci_error_handlers callbacks before resetting the Link.
The DPC driver removes affected devices (which calls their driver .remove()
methods), brings the Link back up, and re-enumerates (which calls driver
.probe() methods).

Align AER ERR_FATAL handling with DPC by resetting the Link in software,
skipping the driver pci_error_handlers callbacks, removing the devices from
the PCI subsystem, and re-enumerating.  The idea is that drivers and
devices should see the same behavior for ERR_FATAL events, regardless of
whether they're handled by AER or DPC.

Here are the basic ERR_FATAL recovery steps, showing the previous AER
behavior, the AER behavior after this patch, and the DPC behavior:

                          AER        AER      DPC
                          previous   new      behavior
                          --------   ---      --------
  Log error               yes        yes      yes (minimal)
  drv.error_detected()    yes        no       no
  Reset Link              yes        yes      yes
  drv.mmio_enabled()      yes        no       no
  drv.slot_reset()        yes        no       no
  drv.resume()            yes        no       no
  Remove PCI devices      no         yes      yes
    (calls drv.remove())
  Re-enumerate            no         yes      yes
    (calls drv.probe())

N.B. With DPC, the Link reset happens before the driver .remove() calls,
while with AER, the reset happens *after* the .remove() calls.  The goal is
to eventually do the reset before .remove() for AER as well.

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>
[bhelgaas: changelog, squash doc patch into this, remove unused
"result_data"]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
2018-05-17 16:44:13 -05:00
..
ABI
accelerators
accounting
acpi
admin-guide
aoe
arm
arm64
auxdisplay
backlight
block
blockdev
bpf
bus-devices
cdrom
cgroup-v1
cma
connector
console
core-api
cpu-freq
cpuidle
crypto
dev-tools
device-mapper
devicetree The large diff this time around is from the addition of a new clk driver 2018-04-13 15:51:06 -07:00
doc-guide
driver-api
driver-model
early-userspace
EDID
extcon
fault-injection
fb
features
filesystems Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs 2018-04-13 16:55:41 -07:00
firmware_class
fmc
fpga
gpio
gpu
hid
hwmon
i2c
ia64
ide
iio
infiniband
input
ioctl
isdn
kbuild
kdump
kernel-hacking
laptops
leds
lightnvm
livepatch
locking
m68k
maintainer
md
media
memory-devices
mic
mips
misc-devices
mmc
mtd
namespaces
netlabel
networking
nfc
nios2
nvdimm
nvmem
openrisc
parisc
PCI PCI/AER: Handle ERR_FATAL with removal and re-enumeration of devices 2018-05-17 16:44:13 -05:00
pcmcia
perf
phy
platform
power
powerpc
pps
process Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-04-15 16:12:35 -07:00
pti
ptp
rapidio
RCU
s390
scheduler
scsi
security
serial
sh
sound
sparc
sphinx
sphinx-static
spi
sysctl
target
thermal
timers
trace
translations
usb
userspace-api
virtual
vm
w1
watchdog
wimax
x86 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-04-15 16:12:35 -07:00
xtensa
.gitignore
00-INDEX
atomic_bitops.txt
atomic_t.txt
bcache.txt
bt8xxgpio.txt
btmrvl.txt
bus-virt-phys-mapping.txt
cachetlb.txt
cgroup-v2.txt
Changes
circular-buffers.txt
clearing-warn-once.txt
clk.txt
CodingStyle
conf.py
cpu-load.txt
cputopology.txt
crc32.txt
dcdbas.txt
debugging-modules.txt
debugging-via-ohci1394.txt
dell_rbu.txt
digsig.txt
DMA-API-HOWTO.txt
DMA-API.txt
DMA-attributes.txt
DMA-ISA-LPC.txt
docutils.conf
dontdiff
efi-stub.txt
eisa.txt
flexible-arrays.txt
futex-requeue-pi.txt
gcc-plugins.txt
highuid.txt
hw_random.txt
hwspinlock.txt
index.rst
intel_txt.txt
Intel-IOMMU.txt
io_ordering.txt
io-mapping.txt
iostats.txt
IPMI.txt
IRQ-affinity.txt
IRQ-domain.txt
IRQ.txt
irqflags-tracing.txt
isa.txt
isapnp.txt
kernel-per-CPU-kthreads.txt
kobject.txt
kprobes.txt
kref.txt
ldm.txt
lockup-watchdogs.txt
logo.gif
logo.txt
lsm.txt
lzo.txt
mailbox.txt
Makefile
memory-barriers.txt
memory-hotplug.txt
men-chameleon-bus.txt
nommu-mmap.txt
ntb.txt
numastat.txt
padata.txt
parport-lowlevel.txt
percpu-rw-semaphore.txt
phy.txt
pi-futex.txt
pnp.txt
preempt-locking.txt
pwm.txt
rbtree.txt
remoteproc.txt
rfkill.txt
robust-futex-ABI.txt
robust-futexes.txt
rpmsg.txt
rtc.txt
SAK.txt
sgi-ioc4.txt
siphash.txt
SM501.txt
smsc_ece1099.txt
speculation.txt
static-keys.txt
SubmittingPatches
svga.txt
switchtec.txt
sync_file.txt
tee.txt
this_cpu_ops.txt
unaligned-memory-access.txt
vfio-mediated-device.txt
vfio.txt
video-output.txt
xillybus.txt
xz.txt
zorro.txt