kernel_optimize_test/Documentation
Johannes Weiner 241994ed86 mm: memcontrol: default hierarchy interface for memory
Introduce the basic control files to account, partition, and limit
memory using cgroups in default hierarchy mode.

This interface versioning allows us to address fundamental design
issues in the existing memory cgroup interface, further explained
below.  The old interface will be maintained indefinitely, but a
clearer model and improved workload performance should encourage
existing users to switch over to the new one eventually.

The control files are thus:

  - memory.current shows the current consumption of the cgroup and its
    descendants, in bytes.

  - memory.low configures the lower end of the cgroup's expected
    memory consumption range.  The kernel considers memory below that
    boundary to be a reserve - the minimum that the workload needs in
    order to make forward progress - and generally avoids reclaiming
    it, unless there is an imminent risk of entering an OOM situation.

  - memory.high configures the upper end of the cgroup's expected
    memory consumption range.  A cgroup whose consumption grows beyond
    this threshold is forced into direct reclaim, to work off the
    excess and to throttle new allocations heavily, but is generally
    allowed to continue and the OOM killer is not invoked.

  - memory.max configures the hard maximum amount of memory that the
    cgroup is allowed to consume before the OOM killer is invoked.

  - memory.events shows event counters that indicate how often the
    cgroup was reclaimed while below memory.low, how often it was
    forced to reclaim excess beyond memory.high, how often it hit
    memory.max, and how often it entered OOM due to memory.max.  This
    allows users to identify configuration problems when observing a
    degradation in workload performance.  An overcommitted system will
    have an increased rate of low boundary breaches, whereas increased
    rates of high limit breaches, maximum hits, or even OOM situations
    will indicate internally overcommitted cgroups.

For existing users of memory cgroups, the following deviations from
the current interface are worth pointing out and explaining:

  - The original lower boundary, the soft limit, is defined as a limit
    that is per default unset.  As a result, the set of cgroups that
    global reclaim prefers is opt-in, rather than opt-out.  The costs
    for optimizing these mostly negative lookups are so high that the
    implementation, despite its enormous size, does not even provide
    the basic desirable behavior.  First off, the soft limit has no
    hierarchical meaning.  All configured groups are organized in a
    global rbtree and treated like equal peers, regardless where they
    are located in the hierarchy.  This makes subtree delegation
    impossible.  Second, the soft limit reclaim pass is so aggressive
    that it not just introduces high allocation latencies into the
    system, but also impacts system performance due to overreclaim, to
    the point where the feature becomes self-defeating.

    The memory.low boundary on the other hand is a top-down allocated
    reserve.  A cgroup enjoys reclaim protection when it and all its
    ancestors are below their low boundaries, which makes delegation
    of subtrees possible.  Secondly, new cgroups have no reserve per
    default and in the common case most cgroups are eligible for the
    preferred reclaim pass.  This allows the new low boundary to be
    efficiently implemented with just a minor addition to the generic
    reclaim code, without the need for out-of-band data structures and
    reclaim passes.  Because the generic reclaim code considers all
    cgroups except for the ones running low in the preferred first
    reclaim pass, overreclaim of individual groups is eliminated as
    well, resulting in much better overall workload performance.

  - The original high boundary, the hard limit, is defined as a strict
    limit that can not budge, even if the OOM killer has to be called.
    But this generally goes against the goal of making the most out of
    the available memory.  The memory consumption of workloads varies
    during runtime, and that requires users to overcommit.  But doing
    that with a strict upper limit requires either a fairly accurate
    prediction of the working set size or adding slack to the limit.
    Since working set size estimation is hard and error prone, and
    getting it wrong results in OOM kills, most users tend to err on
    the side of a looser limit and end up wasting precious resources.

    The memory.high boundary on the other hand can be set much more
    conservatively.  When hit, it throttles allocations by forcing
    them into direct reclaim to work off the excess, but it never
    invokes the OOM killer.  As a result, a high boundary that is
    chosen too aggressively will not terminate the processes, but
    instead it will lead to gradual performance degradation.  The user
    can monitor this and make corrections until the minimal memory
    footprint that still gives acceptable performance is found.

    In extreme cases, with many concurrent allocations and a complete
    breakdown of reclaim progress within the group, the high boundary
    can be exceeded.  But even then it's mostly better to satisfy the
    allocation from the slack available in other groups or the rest of
    the system than killing the group.  Otherwise, memory.max is there
    to limit this type of spillover and ultimately contain buggy or
    even malicious applications.

  - The original control file names are unwieldy and inconsistent in
    many different ways.  For example, the upper boundary hit count is
    exported in the memory.failcnt file, but an OOM event count has to
    be manually counted by listening to memory.oom_control events, and
    lower boundary / soft limit events have to be counted by first
    setting a threshold for that value and then counting those events.
    Also, usage and limit files encode their units in the filename.
    That makes the filenames very long, even though this is not
    information that a user needs to be reminded of every time they
    type out those names.

    To address these naming issues, as well as to signal clearly that
    the new interface carries a new configuration model, the naming
    conventions in it necessarily differ from the old interface.

  - The original limit files indicate the state of an unset limit with
    a very high number, and a configured limit can be unset by echoing
    -1 into those files.  But that very high number is implementation
    and architecture dependent and not very descriptive.  And while -1
    can be understood as an underflow into the highest possible value,
    -2 or -10M etc. do not work, so it's not inconsistent.

    memory.low, memory.high, and memory.max will use the string
    "infinity" to indicate and set the highest possible value.

[akpm@linux-foundation.org: use seq_puts() for basic strings]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 17:06:02 -08:00
..
ABI Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2015-02-11 09:32:08 -08:00
accounting Documentation: use subdir-y to avoid unnecessary built-in.o files 2014-09-26 11:02:55 +02:00
acpi ACPI / Documentation: add a missing '=' 2015-01-22 01:21:45 +01:00
aoe
arm Merge branch 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm 2014-12-12 15:26:48 -08:00
arm64 arm64: Emulate CP15 Barrier instructions 2014-11-20 16:34:48 +00:00
auxdisplay Documentation: use subdir-y to avoid unnecessary built-in.o files 2014-09-26 11:02:55 +02:00
backlight
blackfin Documentation: add makefiles for more targets 2014-09-26 11:02:56 +02:00
block Merge branch 'for-3.19/core' of git://git.kernel.dk/linux-block 2014-12-13 14:14:23 -08:00
blockdev zram: report maximum used memory 2014-10-09 22:26:02 -04:00
bus-devices
cdrom
cgroups mm: memcontrol: default hierarchy interface for memory 2015-02-11 17:06:02 -08:00
connector
console
cpu-freq intel_pstate: Add num_pstates to sysfs 2015-01-30 01:52:17 +01:00
cpuidle
cris
crypto crypto: doc - userspace interface spec 2014-11-13 22:31:38 +08:00
development-process Documentation: remove outdated references to the linux-next wiki 2014-10-28 09:06:11 -04:00
device-mapper dm cache policy mq: simplify ability to promote sequential IO to the cache 2014-11-10 15:25:30 -05:00
devicetree MMC core: 2015-02-11 10:56:48 -08:00
dmaengine Documentation: dmanegine: move dmatest.txt to dmaengine folder 2014-11-06 11:17:37 +05:30
DocBook media updates for v3.20-rc1 2015-02-11 08:45:40 -08:00
driver-model PCI changes for the v3.18 merge window: 2014-10-09 15:03:49 -04:00
dvb [media] get_dvb_firmware: Update firmware of ITEtech IT9135 2014-09-21 17:03:04 -03:00
early-userspace
EDID
extcon
fault-injection
fb
filesystems Merge branch 'akpm' (patches from Andrew) 2015-02-10 16:45:56 -08:00
firmware_class
fmc
frv
gpio This is the bulk of GPIO changes for the v3.19 series: 2014-12-14 14:05:05 -08:00
hid HID: uhid: update documentation 2014-08-25 03:28:09 -05:00
hwmon hwmon: (ina2xx) Add ina231 compatible string 2015-01-25 21:23:59 -08:00
i2c Documentation: i2c: Use PM ops instead of legacy suspend/resume 2014-12-04 19:09:03 +01:00
i2o
ia64 kvm: Documentation: remove ia64 2014-11-20 11:08:55 +01:00
ide
infiniband IB/mad: add new ioctl to ABI to support new registration options 2014-08-10 20:36:00 -07:00
input Docs changes for the 3.19 merge window 2014-12-12 14:42:48 -08:00
ioctl cxl: Add documentation for userspace APIs 2014-10-08 20:16:19 +11:00
isdn
ja_JP
kbuild Documentation: kbuild: Improve grammar 2014-08-19 10:02:56 +02:00
kdump kernel: add panic_on_warn 2014-12-10 17:41:10 -08:00
ko_KR
laptops Documentation: update .gitignore files 2014-09-26 11:02:59 +02:00
leds
locking locking/Documentation: Update code path 2015-01-16 09:09:21 +01:00
m68k
memory-devices
metag
mic Documentation: Build mic/mpssd only for x86_64 2014-12-05 11:18:36 -05:00
mips Documentation: au1xxx-ide.c has moved 2014-08-26 09:35:53 +02:00
misc-devices Documentation: use subdir-y to avoid unnecessary built-in.o files 2014-09-26 11:02:55 +02:00
mmc
mn10300
mtd
namespaces
netlabel
networking tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks 2015-02-08 01:03:12 -08:00
nfc
nios2 Documentation: Add documentation for Nios2 architecture 2014-12-08 12:56:06 +08:00
parisc
PCI
pcmcia Documentation: use subdir-y to avoid unnecessary built-in.o files 2014-09-26 11:02:55 +02:00
phy
platform
power PM / sleep: Mention async suspend in PM_TRACE documentation 2015-01-30 01:29:46 +01:00
powerpc cxl: Add documentation for userspace APIs 2014-10-08 20:16:19 +11:00
pps
prctl Documentation: Restrict TSC test code to x86 2014-10-28 08:46:27 -04:00
pti
ptp ptp: restore the makefile for building the test program. 2014-10-24 16:07:10 -04:00
rapidio rapidio/tsi721_dma: rework scatter-gather list handling 2014-08-08 15:57:24 -07:00
RCU Merge branches 'doc.2015.01.07a', 'fixes.2015.01.15a', 'preempt.2015.01.06a', 'srcu.2015.01.06a', 'stall.2015.01.16a' and 'torture.2015.01.11a' into HEAD 2015-01-15 23:34:34 -08:00
s390 s390/docs: Remove sections that are not related to s390 2014-11-18 18:22:59 +01:00
scheduler Documentation/scheduler/sched-deadline.txt: Add minimal main() appendix 2014-09-16 10:23:45 +02:00
scsi Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2014-12-12 10:08:06 -08:00
security Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity into next 2014-11-19 21:36:07 +11:00
serial serial: Fix locking for uart driver set_termios() method 2014-11-05 18:53:54 -08:00
sh
sound ALSA: hda - Add "eapd" model string for AD1986A codec 2014-12-10 14:00:13 +01:00
spi Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/doc 2014-10-07 21:14:57 -04:00
sysctl Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2015-02-10 20:01:30 -08:00
target Documentation/target: Update fabric_ops to latest code 2015-01-06 13:46:49 -08:00
thermal Documentation: thermal: document of_cpufreq_cooling_register() 2015-01-06 14:39:17 -04:00
timers Documentation: update .gitignore files 2014-09-26 11:02:59 +02:00
tpm
trace Char/Misc driver patches for 3.19-rc1 2014-12-14 16:43:47 -08:00
usb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2015-02-10 18:57:15 -08:00
vDSO vdso: don't require 64-bit math in standalone test 2014-10-25 10:53:44 -04:00
video4linux [media] Documentation/video4linux: remove obsolete text files 2015-01-29 19:16:30 -02:00
virtual Second round of changes for KVM for arm/arm64 for v3.19; fixes reboot 2014-12-15 13:06:40 +01:00
vm mm:add KPF_ZERO_PAGE flag for /proc/kpageflags 2015-02-11 17:06:00 -08:00
w1
watchdog Documentation: use subdir-y to avoid unnecessary built-in.o files 2014-09-26 11:02:55 +02:00
wimax
x86 x86, entry: Switch stacks on a paranoid entry from userspace 2015-01-02 10:22:45 -08:00
xtensa
zh_CN
00-INDEX locking/Documentation: Move locking related docs into Documentation/locking/ 2014-08-13 10:32:03 +02:00
applying-patches.txt Documentation: change "&" to "and" in Documentation/applying-patches.txt 2014-09-26 11:10:11 +02:00
assoc_array.txt
atomic_ops.txt documentation: Add atomic_long_t to atomic_ops.txt 2014-11-13 10:34:54 -08:00
bad_memory.txt
basic_profiling.txt
bcache.txt
binfmt_misc.txt binfmt_misc: touch up documentation a bit 2014-10-14 02:18:16 +02:00
braille-console.txt
bt8xxgpio.txt
btmrvl.txt
BUG-HUNTING
bus-virt-phys-mapping.txt
cachetlb.txt rmap: drop support of non-linear mappings 2015-02-10 14:30:31 -08:00
Changes Update old iproute2 and Xen Remus links 2014-12-09 13:38:13 -05:00
circular-buffers.txt
clk.txt clk: Change clk_ops->determine_rate to return a clk_hw as the best parent 2014-12-03 16:21:37 -08:00
coccinelle.txt
CodingStyle CodingStyle: add some more error handling guidelines 2014-12-02 08:55:32 -05:00
cpu-hotplug.txt
cpu-load.txt
cputopology.txt
crc32.txt
dcdbas.txt
debugging-modules.txt
debugging-via-ohci1394.txt
dell_rbu.txt
devices.txt
digsig.txt
DMA-API-HOWTO.txt Documentation: correct parameter error for dma_mapping_error 2014-09-26 11:22:29 +02:00
DMA-API.txt
DMA-attributes.txt
dma-buf-sharing.txt Documentation/dma-buf-sharing.txt: update API descriptions 2014-08-28 11:57:24 +05:30
DMA-ISA-LPC.txt
dontdiff
dynamic-debug-howto.txt
edac.txt
efi-stub.txt
eisa.txt
email-clients.txt Documentation/email-clients.txt: add info about Claws Mail 2014-12-02 11:55:29 -05:00
flexible-arrays.txt
futex-requeue-pi.txt doc: Fix misnamed FUTEX_CMP_REQUEUE_PI op constants 2015-01-19 12:05:32 +01:00
gcov.txt
highuid.txt
HOWTO Documentation: remove outdated references to the linux-next wiki 2014-10-28 09:06:11 -04:00
hsi.txt
hw_random.txt
hwspinlock.txt
init.txt
initrd.txt
intel_txt.txt
Intel-IOMMU.txt
io_ordering.txt
io-mapping.txt
iostats.txt
IPMI.txt ipmi: Add SMBus interface driver (SSIF) 2014-12-11 15:04:11 -06:00
IRQ-affinity.txt
IRQ-domain.txt irqdomain: Introduce new interfaces to support hierarchy irqdomains 2014-11-23 13:01:45 +01:00
IRQ.txt
irqflags-tracing.txt
isapnp.txt
java.txt
kernel-doc-nano-HOWTO.txt
kernel-docs.txt
kernel-parameters.txt Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2015-02-10 20:01:30 -08:00
kernel-per-CPU-kthreads.txt
kmemcheck.txt
kmemleak.txt Documentation: Add CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF case 2014-10-24 13:59:03 -04:00
kobject.txt kobject: grammar fix 2014-12-08 09:07:11 -05:00
kprobes.txt Documentation/kprobes: add s390 to list of supported architectures 2014-09-09 08:53:27 +02:00
kref.txt
kselftest.txt kselftest: Move the docs to the Documentation dir 2014-11-24 10:49:54 -07:00
ldm.txt
local_ops.txt percpu: update local_ops.txt to reflect this_cpu operations 2014-12-13 12:42:53 -08:00
lockup-watchdogs.txt lockup-watchdogs: Fix a typo 2014-08-26 09:35:52 +02:00
logo.gif
logo.txt
lzo.txt Documentation: lzo: document part of the encoding 2014-09-28 11:08:00 +02:00
magic-number.txt
mailbox.txt Documentation: Fix a typo in mailbox.txt 2014-11-03 11:54:50 -05:00
Makefile Documentation: add makefiles for more targets 2014-09-26 11:02:56 +02:00
ManagementStyle
md.txt
media-framework.txt
memory-barriers.txt documentation: Fix smp typo in memory-barriers.txt 2015-01-07 08:59:32 -08:00
memory-hotplug.txt memory-hotplug: add sysfs valid_zones attribute 2014-10-09 22:25:52 -04:00
module-signing.txt
mono.txt
nommu-mmap.txt
numastat.txt
oops-tracing.txt livepatch: kernel: add TAINT_LIVEPATCH 2014-12-22 15:40:48 +01:00
padata.txt
parport-lowlevel.txt
parport.txt
percpu-rw-semaphore.txt
phy.txt phy: improved lookup method 2014-11-21 19:48:50 +05:30
pi-futex.txt
pinctrl.txt pinctrl: clean up after enable refactoring 2014-09-04 10:05:07 +02:00
pnp.txt
preempt-locking.txt
printk-formats.txt lib/vsprintf: add %*pE[achnops] format specifier 2014-10-14 02:18:26 +02:00
pwm.txt
ramoops.txt pstore-ram: Allow optional mapping with pgprot_noncached 2014-12-11 13:38:31 -08:00
rbtree.txt
remoteproc.txt
rfkill.txt rfkill: document rfkill module parameters 2015-01-09 23:22:12 +01:00
robust-futex-ABI.txt
robust-futexes.txt
rpmsg.txt
rtc.txt
SAK.txt
SecurityBugs
serial-console.txt
sgi-ioc4.txt
SM501.txt
smsc_ece1099.txt
sparse.txt
stable_api_nonsense.txt
stable_kernel_rules.txt
static-keys.txt
SubmitChecklist
SubmittingDrivers
SubmittingPatches Documentation/SubmittingPatches: Reported-by tags and permission 2014-10-29 08:56:46 -04:00
svga.txt
sysfs-rules.txt Documentation/sysfs-rules.txt: Add device attribute error code documentation 2014-09-19 14:44:51 -07:00
sysrq.txt
this_cpu_ops.txt Docs: this_cpu_ops: remove redundant add forms 2014-09-26 11:03:00 +02:00
unaligned-memory-access.txt
unicode.txt
unshare.txt
vfio.txt
VGA-softcursor.txt
vgaarbiter.txt
video-output.txt
vme_api.txt
volatile-considered-harmful.txt
workqueue.txt
xillybus.txt xillybus: Move out of staging 2014-09-23 23:44:16 -07:00
xz.txt
zorro.txt