kernel_optimize_test

Author	SHA1	Message	Date
Davide Caratti	f67169fef8	net/sched: act_pedit: fix WARN() in the traffic path when configuring act_pedit rules, the number of keys is validated only on addition of a new entry. This is not sufficient to avoid hitting a WARN() in the traffic path: for example, it is possible to replace a valid entry with a new one having 0 extended keys, thus causing splats in dmesg like: pedit BUG: index 42 WARNING: CPU: 2 PID: 4054 at net/sched/act_pedit.c:410 tcf_pedit_act+0xc84/0x1200 [act_pedit] [...] RIP: 0010:tcf_pedit_act+0xc84/0x1200 [act_pedit] Code: 89 fa 48 c1 ea 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e ac 00 00 00 48 8b 44 24 10 48 c7 c7 a0 c4 e4 c0 8b 70 18 e8 1c 30 95 ea <0f> 0b e9 a0 fa ff ff e8 00 03 f5 ea e9 14 f4 ff ff 48 89 58 40 e9 RSP: 0018:ffff888077c9f320 EFLAGS: 00010286 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffac2983a2 RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff888053927bec RBP: dffffc0000000000 R08: ffffed100a726209 R09: ffffed100a726209 R10: 0000000000000001 R11: ffffed100a726208 R12: ffff88804beea780 R13: ffff888079a77400 R14: ffff88804beea780 R15: ffff888027ab2000 FS: 00007fdeec9bd740(0000) GS:ffff888053900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffdb3dfd000 CR3: 000000004adb4006 CR4: 00000000001606e0 Call Trace: tcf_action_exec+0x105/0x3f0 tcf_classify+0xf2/0x410 __dev_queue_xmit+0xcbf/0x2ae0 ip_finish_output2+0x711/0x1fb0 ip_output+0x1bf/0x4b0 ip_send_skb+0x37/0xa0 raw_sendmsg+0x180c/0x2430 sock_sendmsg+0xdb/0x110 __sys_sendto+0x257/0x2b0 __x64_sys_sendto+0xdd/0x1b0 do_syscall_64+0xa5/0x4e0 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fdeeb72e993 Code: 48 8b 0d e0 74 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 0d d6 2c 00 00 75 13 49 89 ca b8 2c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 4b cc 00 00 48 89 04 24 RSP: 002b:00007ffdb3de8a18 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 000055c81972b700 RCX: 00007fdeeb72e993 RDX: 0000000000000040 RSI: 000055c81972b700 RDI: 0000000000000003 RBP: 00007ffdb3dea130 R08: 000055c819728510 R09: 0000000000000010 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000040 R13: 000055c81972b6c0 R14: 000055c81972969c R15: 0000000000000080 Fix this moving the check on 'nkeys' earlier in tcf_pedit_init(), so that attempts to install rules having 0 keys are always rejected with -EINVAL. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-19 18:57:16 -08:00
Russell King	d9922c0e91	net: phylink: fix link mode modification in PHY mode Modifying the link settings via phylink_ethtool_ksettings_set() and phylink_ethtool_set_pauseparam() didn't always work as intended for PHY based setups, as calling phylink_mac_config() would result in the unresolved configuration being committed to the MAC, rather than the configuration with the speed and duplex setting. This would work fine if the update caused the link to renegotiate, but if no settings have changed, phylib won't trigger a renegotiation cycle, and the MAC will be left incorrectly configured. Avoid calling phylink_mac_config() unless we are using an inband mode in phylink_ethtool_ksettings_set(), and use phy_set_asym_pause() as introduced in 4.20 to set the PHY settings in phylink_ethtool_set_pauseparam(). Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-19 18:55:38 -08:00
Russell King	269a6b5f23	net: phylink: update documentation on create and destroy Update the documentation on phylink's create and destroy functions to explicitly state that the rtnl lock must not be held while calling these. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-19 18:55:02 -08:00
Corinna Vinschen	a0783cd0c8	r8169: disable TSO on a single version of RTL8168c to fix performance During performance testing, I found that one of my r8169 NICs suffered a major performance loss, a 8168c model. Running netperf's TCP_STREAM test didn't return the expected throughput of > 900 Mb/s, but rather only about 22 Mb/s. Strange enough, running the TCP_MAERTS and UDP_STREAM tests all returned with throughput > 900 Mb/s, as did TCP_STREAM with the other r8169 NICs I can test (either one of 8169s, 8168e, 8168f). Bisecting turned up commit `93681cd7d9`, "r8169: enable HW csum and TSO" as the culprit. I added my 8168c version, RTL_GIGA_MAC_VER_22, to the code special-casing the 8168evl as per the patch below. This fixed the performance problem for me. Fixes: `93681cd7d9` ("r8169: enable HW csum and TSO") Signed-off-by: Corinna Vinschen <vinschen@redhat.com> Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-19 16:41:11 -08:00
Zhu Yanjun	c9d55b62c9	MAINTAINERS: forcedeth: Change Zhu Yanjun's email address I prefer to use my personal email address for kernel related work. Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Acked-by: Rain River <rain.1986.08.12@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-19 16:40:31 -08:00
Ivan Khoronzhuk	b5a0faa357	taprio: don't reject same mqprio settings The taprio qdisc allows to set mqprio setting but only once. In case if mqprio settings are provided next time the error is returned as it's not allowed to change traffic class mapping in-flignt and that is normal. But if configuration is absolutely the same - no need to return error. It allows to provide same command couple times, changing only base time for instance, or changing only scheds maps, but leaving mqprio setting w/o modification. It more corresponds the message: "Changing the traffic mapping of a running schedule is not supported", so reject mqprio if it's really changed. Also corrected TC_BITMASK + 1 for consistency, as proposed. Fixes: `a3d43c0d56` ("taprio: Add support adding an admin schedule") Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Tested-by: Vladimir Oltean <olteanv@gmail.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-19 15:23:15 -08:00
Willem de Bruijn	d4ffb02dee	net/tls: enable sk_msg redirect to tls socket egress Bring back tls_sw_sendpage_locked. sk_msg redirection into a socket with TLS_TX takes the following path: tcp_bpf_sendmsg_redir tcp_bpf_push_locked tcp_bpf_push kernel_sendpage_locked sock->ops->sendpage_locked Also update the flags test in tls_sw_sendpage_locked to allow flag MSG_NO_SHARED_FRAGS. bpf_tcp_sendmsg sets this. Link: https://lore.kernel.org/netdev/CA+FuTSdaAawmZ2N8nfDDKu3XLpXBbMtcCT0q4FntDD2gn8ASUw@mail.gmail.com/T/#t Link: https://github.com/wdebruij/kerneltools/commits/icept.2 Fixes: `0608c69c9a` ("bpf: sk_msg, sock{map\|hash} redirect through ULP") Fixes: `f3de19af0f` ("Revert \"net/tls: remove unused function tls_sw_sendpage_locked\"") Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-19 15:03:02 -08:00
David Howells	c74386d50f	afs: Fix missing timeout reset In afs_wait_for_call_to_complete(), rather than immediately aborting an operation if a signal occurs, the code attempts to wait for it to complete, using a schedule timeout of 2*RTT (or min 2 jiffies) and a check that we're still receiving relevant packets from the server before we consider aborting the call. We may even ping the server to check on the status of the call. However, there's a missing timeout reset in the event that we do actually get a packet to process, such that if we then get a couple of short stalls, we then time out when progress is actually being made. Fix this by resetting the timeout any time we get something to process. If it's the failure of the call then the call state will get changed and we'll exit the loop shortly thereafter. A symptom of this is data fetches and stores failing with EINTR when they really shouldn't. Fixes: `bc5e3a546d` ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals") Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-11-19 14:36:38 -08:00
Adi Suresh	db96c2cb48	gve: fix dma sync bug where not all pages synced The previous commit had a bug where the last page in the memory range could not be synced. This change fixes the behavior so that all the required pages are synced. Fixes: `9cfeeb576d` ("gve: Fixes DMA synchronization") Signed-off-by: Adi Suresh <adisuresh@google.com> Reviewed-by: Catherine Sullivan <csully@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-19 12:58:18 -08:00
Matthew Auld	d43e24533d	drm/i915: make pool objects read-only For our current users we don't expect pool objects to be writable from the gpu. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Fixes: `4f7af1948a` ("drm/i915: Support ro ppgtt mapped cmdparser shadow buffers") Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Link: https://patchwork.freedesktop.org/patch/msgid/20191119150154.18249-1-matthew.auld@intel.com (cherry picked from commit `d18580b08b`) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2019-11-19 12:43:27 -08:00
Geert Uytterhoeven	fd8f64df95	mdio_bus: Fix init if CONFIG_RESET_CONTROLLER=n Commit `1d4639567d` ("mdio_bus: Fix PTR_ERR applied after initialization to constant") accidentally changed a check from -ENOTSUPP to -ENOSYS, causing failures if reset controller support is not enabled. E.g. on r7s72100/rskrza1: sh-eth e8203000.ethernet: MDIO init failed: -524 sh-eth: probe of e8203000.ethernet failed with error -524 Seen on r8a7740/armadillo, r7s72100/rskrza1, and r7s9210/rza2mevb. Fixes: `1d4639567d` ("mdio_bus: Fix PTR_ERR applied after initialization to constant") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Cc: YueHaibing <yuehaibing@huawei.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-11-19 08:34:15 -08:00
Sun Ke	dff10bbea4	nbd:fix memory leak in nbd_get_socket() Before returning NULL, put the sock first. Cc: stable@vger.kernel.org Fixes: `cf1b2326b7` ("nbd: verify socket is supported during setup") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Sun Ke <sunke32@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-19 09:23:26 -07:00
Laurent Vivier	d791cfcbf9	virtio_console: allocate inbufs in add_port() only if it is needed When we hot unplug a virtserialport and then try to hot plug again, it fails: (qemu) chardev-add socket,id=serial0,path=/tmp/serial0,server,nowait (qemu) device_add virtserialport,bus=virtio-serial0.0,nr=2,\ chardev=serial0,id=serial0,name=serial0 (qemu) device_del serial0 (qemu) device_add virtserialport,bus=virtio-serial0.0,nr=2,\ chardev=serial0,id=serial0,name=serial0 kernel error: virtio-ports vport2p2: Error allocating inbufs qemu error: virtio-serial-bus: Guest failure in adding port 2 for device \ virtio-serial0.0 This happens because buffers for the in_vq are allocated when the port is added but are not released when the port is unplugged. They are only released when virtconsole is removed (see `a7a69ec0d8`) To avoid the problem and to be symmetric, we could allocate all the buffers in init_vqs() as they are released in remove_vqs(), but it sounds like a waste of memory. Rather than that, this patch changes add_port() logic to ignore ENOSPC error in fill_queue(), which means queue has already been filled. Fixes: `a7a69ec0d8` ("virtio_console: free buffers after reset") Cc: mst@redhat.com Cc: stable@vger.kernel.org Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2019-11-19 05:13:49 -05:00
Halil Pasic	f7728002c1	virtio_ring: fix return code on DMA mapping fails Commit `780bc7903a` ("virtio_ring: Support DMA APIs") makes virtqueue_add() return -EIO when we fail to map our I/O buffers. This is a very realistic scenario for guests with encrypted memory, as swiotlb may run out of space, depending on it's size and the I/O load. The virtio-blk driver interprets -EIO form virtqueue_add() as an IO error, despite the fact that swiotlb full is in absence of bugs a recoverable condition. Let us change the return code to -ENOMEM, and make the block layer recover form these failures when virtio-blk encounters the condition described above. Cc: stable@vger.kernel.org Fixes: `780bc7903a` ("virtio_ring: Support DMA APIs") Signed-off-by: Halil Pasic <pasic@linux.ibm.com> Tested-by: Michael Mueller <mimu@linux.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2019-11-19 05:13:49 -05:00
Marek Behún	075e238d12	mdio_bus: fix mdio_register_device when RESET_CONTROLLER is disabled When CONFIG_RESET_CONTROLLER is disabled, the devm_reset_control_get_exclusive function returns -ENOTSUPP. This is not handled in subsequent check and then the mdio device fails to probe. When CONFIG_RESET_CONTROLLER is enabled, its code checks in OF for reset device, and since it is not present, returns -ENOENT. -ENOENT is handled. Add -ENOTSUPP also. This happened to me when upgrading kernel on Turris Omnia. You either have to enable CONFIG_RESET_CONTROLLER or use this patch. Signed-off-by: Marek Behún <marek.behun@nic.cz> Fixes: `71dd6c0dff` ("net: phy: add support for reset-controller") Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-18 17:59:32 -08:00
Marcelo Ricardo Leitner	ca749bbb10	net/ipv4: fix sysctl max for fib_multipath_hash_policy Commit `eec4844fae` ("proc/sysctl: add shared variables for range check") did: - .extra2 = &two, + .extra2 = SYSCTL_ONE, here, which doesn't seem to be intentional, given the changelog. This patch restores it to the previous, as the value of 2 still makes sense (used in fib_multipath_hash()). Fixes: `eec4844fae` ("proc/sysctl: add shared variables for range check") Cc: Matteo Croce <mcroce@redhat.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-18 17:25:36 -08:00
Chuhong Yuan	39c68b3fc2	phy: mdio-sun4i: add missed regulator_disable in remove The driver forgets to disable the regulator in remove like what is done in probe failure. Add the missed call to fix it. Signed-off-by: Chuhong Yuan <hslester96@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-18 17:22:40 -08:00
Tariq Toukan	2744bf4268	net/mlx4_en: Fix wrong limitation for number of TX rings XDP_TX rings should not be limited by max_num_tx_rings_p_up. To make sure total number of TX rings never exceed MAX_TX_RINGS, add similar check in mlx4_en_alloc_tx_queue_per_tc(), where a new value is assigned for num_up. Fixes: `7e1dc5e926` ("net/mlx4_en: Limit the number of TX rings") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-18 17:18:34 -08:00
Xin Long	4f0e97d070	net: sched: ensure opts_len <= IP_TUNNEL_OPTS_MAX in act_tunnel_key info->options_len is 'u8' type, and when opts_len with a value > IP_TUNNEL_OPTS_MAX, 'info->options_len = opts_len' will cast int to u8 and set a wrong value to info->options_len. Kernel crashed in my test when doing: # opts="0102:80:00800022" # for i in {1..99}; do opts="$opts,0102:80:00800022"; done # ip link add name geneve0 type geneve dstport 0 external # tc qdisc add dev eth0 ingress # tc filter add dev eth0 protocol ip parent ffff: \ flower indev eth0 ip_proto udp action tunnel_key \ set src_ip 10.0.99.192 dst_ip 10.0.99.193 \ dst_port 6081 id 11 geneve_opts $opts \ action mirred egress redirect dev geneve0 So we should do the similar check as cls_flower does, return error when opts_len > IP_TUNNEL_OPTS_MAX in tunnel_key_copy_opts(). Fixes: `0ed5269f9e` ("net/sched: add tunnel option support to act_tunnel_key") Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-18 17:17:07 -08:00
Petr Machata	1fc1657775	mlxsw: spectrum_router: Fix determining underlay for a GRE tunnel The helper mlxsw_sp_ipip_dev_ul_tb_id() determines the underlay VRF of a GRE tunnel. For a tunnel without a bound device, it uses the same VRF that the tunnel is in. However in Linux, a GRE tunnel without a bound device uses the main VRF as the underlay. Fix the function accordingly. mlxsw further assumed that moving a tunnel to a different VRF could cause conflict in local tunnel endpoint address, which cannot be offloaded. However, the only way that an underlay could be changed by moving the tunnel device itself is if the tunnel device does not have a bound device. But in that case the underlay is always the main VRF, so there is no opportunity to introduce a conflict by moving such device. Thus this check constitutes a dead code, and can be removed, which do. Fixes: `6ddb7426a7` ("mlxsw: spectrum_router: Introduce loopback RIFs") Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-18 17:09:31 -08:00
Aditya Pakki	60f5c4aaae	net: atm: Reduce the severity of logging in unlink_clip_vcc In case of errors in unlink_clip_vcc, the logging level is set to pr_crit but failures in clip_setentry are handled by pr_err(). The patch changes the severity consistent across invocations. Signed-off-by: Aditya Pakki <pakki001@umn.edu> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-18 17:08:20 -08:00
David Sterba	fa17ed069c	btrfs: drop bdev argument from submit_extent_page After previous patches removing bdev being passed around to set it to bio, it has become unused in submit_extent_page. So it now has "only" 13 parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 23:43:58 +01:00
David Sterba	a019e9e197	btrfs: remove extent_map::bdev We can now remove the bdev from extent_map. Previous patches made sure that bio_set_dev is correctly in all places and that we don't need to grab it from latest_bdev or pass it around inside the extent map. Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 23:43:44 +01:00
David Sterba	1a41802701	btrfs: drop bio_set_dev where not needed bio_set_dev sets a bdev to a bio and is not only setting a pointer bug also changing some state bits if there was a different bdev set before. This is one thing that's not needed. Another thing is that setting a bdev at bio allocation time is too early and actually does not work with plain redundancy profiles, where each time we submit a bio to a device, the bdev is set correctly. In many places the bio bdev is set to latest_bdev that seems to serve as a stub pointer "just to put something to bio". But we don't have to do that. Where do we know which bdev to set: * for regular IO: submit_stripe_bio that's called by btrfs_map_bio * repair IO: repair_io_failure, read or write from specific device * super block write (using buffer_heads but uses raw bdev) and barriers * scrub: this does not use all regular IO paths as it needs to reach all copies, verify and fixup eventually, and for that all bdev management is independent * raid56: rbio_add_io_page, for the RMW write * integrity-checker: does it's own low-level block tracking Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 23:39:30 +01:00
David Sterba	429aebc0a9	btrfs: get bdev directly from fs_devices in submit_extent_page This is preparatory patch to remove @bdev parameter from submit_extent_page. It can't be removed completely, because the cgroups need it for wbc when initializing the bio wbc_init_bio bio_associate_blkg_from_css dereference bdev->bi_disk->queue The bdev pointer is the same as latest_bdev, thus no functional change. We can retrieve it from fs_devices that's reachable through several dereferences. The local variable shadows the parameter, but that's only temporary. Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 23:38:46 +01:00
Chris Wilson	c0fa92ec89	drm/i915: Protect request peeking with RCU Since the execlists_active() is no longer protected by the engine->active.lock, we need to protect the request pointer with RCU to prevent it being freed as we evaluate whether or not we need to preempt. Fixes: `df40306902` ("drm/i915/execlists: Lift process_csb() out of the irq-off spinlock") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20191104090158.2959-2-chris@chris-wilson.co.uk (cherry picked from commit `7d14863525`) Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> (cherry picked from commit `8eb4704b12`) (cherry picked from commit 7e27238e149ce4f00d9cd801fe3aa0ea55e986a2) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2019-11-18 11:25:16 -08:00
Josef Bacik	3e1740993e	btrfs: record all roots for rename exchange on a subvol Testing with the new fsstress support for subvolumes uncovered a pretty bad problem with rename exchange on subvolumes. We're modifying two different subvolumes, but we only start the transaction on one of them, so the other one is not added to the dirty root list. This is caught by btrfs_cow_block() with a warning because the root has not been updated, however if we do not modify this root again we'll end up pointing at an invalid root because the root item is never updated. Fix this by making sure we add the destination root to the trans list, the same as we do with normal renames. This fixes the corruption. Fixes: `cdd1fedf82` ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT") CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 20:08:31 +01:00
Chris Wilson	2d691aeca4	drm/i915/userptr: Try to acquire the page lock around set_page_dirty() set_page_dirty says: For pages with a mapping this should be done under the page lock for the benefit of asynchronous memory errors who prefer a consistent dirty state. This rule can be broken in some special cases, but should be better not to. Under those rules, it is only safe for us to use the plain set_page_dirty calls for shmemfs/anonymous memory. Userptr may be used with real mappings and so needs to use the locked version (set_page_dirty_lock). However, following a try_to_unmap() we may want to remove the userptr and so call put_pages(). However, try_to_unmap() acquires the page lock and so we must avoid recursively locking the pages ourselves -- which means that we cannot safely acquire the lock around set_page_dirty(). Since we can't be sure of the lock, we have to risk skip dirtying the page, or else risk calling set_page_dirty() without a lock and so risk fs corruption. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203317 Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=112012 Fixes: `5cc9ed4b9a` ("drm/i915: Introduce mapping of user pages into video memory (userptr) ioctl") References: `cb6d7c7dc7` ("drm/i915/userptr: Acquire the page lock around set_page_dirty()") References: `505a8ec7e1` ("Revert "drm/i915/userptr: Acquire the page lock around set_page_dirty()"") References: `6dcc693bc5` ("ext4: warn when page is dirtied without buffers") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: stable@vger.kernel.org Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20191111133205.11590-1-chris@chris-wilson.co.uk (cherry picked from commit `0d4bbe3d40`) Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> (cherry picked from commit `cee7fb437e`) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2019-11-18 09:20:45 -08:00
Chris Wilson	add3eeed36	drm/i915/pmu: "Frequency" is reported as accumulated cycles We report "frequencies" (actual-frequency, requested-frequency) as the number of accumulated cycles so that the average frequency over that period may be determined by the user. This means the units we report to the user are Mcycles (or just M), not MHz. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: stable@vger.kernel.org Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20191109105356.5273-1-chris@chris-wilson.co.uk (cherry picked from commit `e88866ef02`) Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> (cherry picked from commit `a7d87b70d6`) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2019-11-18 09:20:38 -08:00
Ville Syrjälä	1aa4df7e41	drm/i915: Preload LUTs if the hw isn't currently using them The LUTs are single buffered so in order to program them without tearing we'd have to do it during vblank (actually to be 100% effective it has to happen between start of vblank and frame start). We have no proper mechanism for that at the moment so we just defer loading them after the vblank waits have happened. That is not quite sufficient (especially when committing multiple pipes whose vblanks don't line up) so the LUT load will often leak into the following frame causing tearing. However in case the hardware wasn't previously using the LUT we can preload it before setting the enable bit (which is double buffered so won't tear). Let's determine if we can do such preloading and make it happen. Slight variation between the hardware requires some platforms specifics in the checks. Hans is seeing ugly colored flash on VLV/CHV macchines (GPD win and Asus T100HA) when the gamma LUT gets loaded for the first time as the BIOS has left some junk in the LUT memory. v2: Deal with uapi vs. hw crtc state split s/GCM/CGM/ typo fix Cc: Hans de Goede <hdegoede@redhat.com> Fixes: `051a6d8d3c` ("drm/i915: Move LUT programming to happen after vblank waits") Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20191030190815.7359-1-ville.syrjala@linux.intel.com Tested-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> (cherry picked from commit `0ccc42a2fd`) Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> (cherry picked from commit `f77021372e`) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2019-11-18 09:20:27 -08:00
Ville Syrjälä	8ac495f624	drm/i915: Don't oops in dumb_create ioctl if we have no crtcs Make sure we have a crtc before probing its primary plane's max stride. Initially I thought we can't get this far without crtcs, but looks like we can via the dumb_create ioctl. Not sure if we shouldn't disable dumb buffer support entirely when we have no crtcs, but that would require some amount of work as the only thing currently being checked is dev->driver->dumb_create which we'd have to convert to some device specific dynamic thing. Cc: stable@vger.kernel.org Reported-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Fixes: `aa5ca8b742` ("drm/i915: Align dumb buffer stride to 4k to allow for gtt remapping") Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20191106172349.11987-1-ville.syrjala@linux.intel.com Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> (cherry picked from commit `baea9ffe64`) Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> (cherry picked from commit `aeec766133`) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2019-11-18 09:20:23 -08:00
Filipe Manana	042528f8d8	Btrfs: fix block group remaining RO forever after error during device replace When doing a device replace, while at scrub.c:scrub_enumerate_chunks(), we set the block group to RO mode and then wait for any ongoing writes into extents of the block group to complete. While doing that wait we overwrite the value of the variable 'ret' and can break out of the loop if an error happens without turning the block group back into RW mode. So what happens is the following: 1) btrfs_inc_block_group_ro() returns 0, meaning it set the block group to RO mode (its ->ro field set to 1 or incremented to some value > 1); 2) Then btrfs_wait_ordered_roots() returns a value > 0; 3) Then if either joining or committing the transaction fails, we break out of the loop wihtout calling btrfs_dec_block_group_ro(), leaving the block group in RO mode forever. To fix this, just remove the code that waits for ongoing writes to extents of the block group, since it's not needed because in the initial setup phase of a device replace operation, before starting to find all chunks and their extents, we set the target device for replace while holding fs_info->dev_replace->rwsem, which ensures that after releasing that semaphore, any writes into the source device are made to the target device as well (__btrfs_map_block() guarantees that). So while at scrub_enumerate_chunks() we only need to worry about finding and copying extents (from the source device to the target device) that were written before we started the device replace operation. Fixes: `f0e9b7d640` ("Btrfs: fix race setting block group readonly during device replace") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 18:07:55 +01:00
Qu Wenruo	b12de52896	btrfs: scrub: Don't check free space before marking a block group RO [BUG] When running btrfs/072 with only one online CPU, it has a pretty high chance to fail: btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg) - output mismatch (see xfstests-dev/results//btrfs/072.out.bad) --- tests/btrfs/072.out 2019-10-22 15:18:14.008965340 +0800 +++ /xfstests-dev/results//btrfs/072.out.bad 2019-11-14 15:56:45.877152240 +0800 @@ -1,2 +1,3 @@ QA output created by 072 Silence is golden +Scrub find errors in "-m dup -d single" test ... And with the following call trace: BTRFS info (device dm-5): scrub: started on devid 1 ------------[ cut here ]------------ BTRFS: Transaction aborted (error -27) WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs] CPU: 0 PID: 55087 Comm: btrfs Tainted: G W O 5.4.0-rc1-custom+ #13 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs] Call Trace: __btrfs_end_transaction+0xdb/0x310 [btrfs] btrfs_end_transaction+0x10/0x20 [btrfs] btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs] scrub_enumerate_chunks+0x264/0x940 [btrfs] btrfs_scrub_dev+0x45c/0x8f0 [btrfs] btrfs_ioctl+0x31a1/0x3fb0 [btrfs] do_vfs_ioctl+0x636/0xaa0 ksys_ioctl+0x67/0x90 __x64_sys_ioctl+0x43/0x50 do_syscall_64+0x79/0xe0 entry_SYSCALL_64_after_hwframe+0x49/0xbe ---[ end trace 166c865cec7688e7 ]--- [CAUSE] The error number -27 is -EFBIG, returned from the following call chain: btrfs_end_transaction() \|- __btrfs_end_transaction() \|- btrfs_create_pending_block_groups() \|- btrfs_finish_chunk_alloc() \|- btrfs_add_system_chunk() This happens because we have used up all space of btrfs_super_block::sys_chunk_array. The root cause is, we have the following bad loop of creating tons of system chunks: 1. The only SYSTEM chunk is being scrubbed It's very common to have only one SYSTEM chunk. 2. New SYSTEM bg will be allocated As btrfs_inc_block_group_ro() will check if we have enough space after marking current bg RO. If not, then allocate a new chunk. 3. New SYSTEM bg is still empty, will be reclaimed During the reclaim, we will mark it RO again. 4. That newly allocated empty SYSTEM bg get scrubbed We go back to step 2, as the bg is already mark RO but still not cleaned up yet. If the cleaner kthread doesn't get executed fast enough (e.g. only one CPU), then we will get more and more empty SYSTEM chunks, using up all the space of btrfs_super_block::sys_chunk_array. [FIX] Since scrub/dev-replace doesn't always need to allocate new extent, especially chunk tree extent, so we don't really need to do chunk pre-allocation. To break above spiral, here we introduce a new parameter to btrfs_inc_block_group(), @do_chunk_alloc, which indicates whether we need extra chunk pre-allocation. For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass @do_chunk_alloc=false. This should keep unnecessary empty chunks from popping up for scrub. Also, since there are two parameters for btrfs_inc_block_group_ro(), add more comment for it. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 18:07:55 +01:00
Johannes Thumshirn	7f0432d0d8	btrfs: change btrfs_fs_devices::rotating to bool struct btrfs_fs_devices::rotating currently is declared as an integer variable but only used as a boolean. Change the variable definition to bool and update to code touching it to set 'true' and 'false'. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:51 +01:00
Johannes Thumshirn	0395d84f8e	btrfs: change btrfs_fs_devices::seeding to bool struct btrfs_fs_devices::seeding currently is declared as an integer variable but only used as a boolean. Change the variable definition to bool and update to code touching it to set 'true' and 'false'. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:51 +01:00
David Sterba	32da5386d9	btrfs: rename btrfs_block_group_cache The type name is misleading, a single entry is named 'cache' while this normally means a collection of objects. Rename that everywhere. Also the identifier was quite long, making function prototypes harder to format. Suggested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:51 +01:00
Qu Wenruo	d49a2ddb15	btrfs: block-group: Reuse the item key from caller of read_one_block_group() For read_one_block_group(), its only caller has already got the item key to search next block group item. So we can use that key directly without doing our own convertion on stack. Also, since that key used in btrfs_read_block_groups() is vital for block group item search, add 'const' keyword for that parameter to prevent read_one_block_group() to modify it. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:50 +01:00
Qu Wenruo	ffb9e0f05f	btrfs: block-group: Refactor btrfs_read_block_groups() Refactor the work inside the loop of btrfs_read_block_groups() into one separate function, read_one_block_group(). This allows read_one_block_group to be reused for later BG_TREE feature. The refactor does the following extra fix: - Use btrfs_fs_incompat() to replace open-coded feature check Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:50 +01:00
David Sterba	d4e253bbbc	btrfs: document extent buffer locking Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:50 +01:00
David Sterba	a4477988cf	btrfs: access eb::blocking_writers according to ACCESS_ONCE policies A nice writeup of the LKMM (Linux Kernel Memory Model) rules for access once policies can be found here https://lwn.net/Articles/799218/#Access-Marking%20Policies . The locked and unlocked access to eb::blocking_writers should be annotated accordingly, following this: Writes: - locked write must use ONCE, may use plain read - unlocked write must use ONCE Reads: - unlocked read must use ONCE - locked read may use plain read iff not mixed with unlocked read - unlocked read then locked must use ONCE There's one difference on the assembly level, where btrfs_tree_read_lock_atomic and btrfs_try_tree_read_lock used the cached value and did not reevaluate it after taking the lock. This could have missed some opportunities to take the lock in case blocking writers changed between the calls, but the window is just a few instructions long. As this is in try-lock, the callers handle that. Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:50 +01:00
David Sterba	40d38f53d4	btrfs: set blocking_writers directly, no increment or decrement The increment and decrement was inherited from previous version that used atomics, switched in commit `06297d8cef` ("btrfs: switch extent_buffer blocking_writers from atomic to int"). The only possible values are 0 and 1 so we can set them directly. The generated assembly (gcc 9.x) did the direct value assignment in btrfs_set_lock_blocking_write (asm diff after change in `06297d8cef`): 5d: test %eax,%eax 5f: je 62 <btrfs_set_lock_blocking_write+0x22> 61: retq - 62: lock incl 0x44(%rdi) - 66: add $0x50,%rdi - 6a: jmpq 6f <btrfs_set_lock_blocking_write+0x2f> + 62: movl $0x1,0x44(%rdi) + 69: add $0x50,%rdi + 6d: jmpq 72 <btrfs_set_lock_blocking_write+0x32> The part in btrfs_tree_unlock did a decrement because BUG_ON(blockers > 1) is probably not a strong hint for the compiler, but otherwise the output looks safe: - lock decl 0x44(%rdi) + sub $0x1,%eax + mov %eax,0x44(%rdi) Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:50 +01:00
David Sterba	f5c2a52590	btrfs: merge blocking_writers branches in btrfs_tree_read_lock There are two ifs that use eb::blocking_writers. As this is a variable modified inside and outside of locks, we could minimize number of accesses to avoid problems with getting different results at different times. The access here is locked so this can only race with btrfs_tree_unlock that sets blocking_writers to 0 without lock and unsets the lock owner. The first branch is taken only if the same thread already holds the lock, the second if checks for blocking writers. Here we'd either unlock and wait, or proceed. Both are valid states of the locking protocol. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:50 +01:00
David Sterba	9c907446dc	btrfs: drop incompat bit for raid1c34 after last block group is gone When there are no raid1c3 or raid1c4 block groups left after balance (either convert or with other filters applied), remove the incompat bit. This is already done for RAID56, do the same for RAID1C34. Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:49 +01:00
David Sterba	cfbb825c76	btrfs: add incompat for raid1 with 3, 4 copies The new raid1c3 and raid1c4 profiles are backward incompatible and the name shall be 'raid1c34', the status can be found in the global supported features in /sys/fs/btrfs/features or in the per-filesystem directory. Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:49 +01:00
David Sterba	8d6fac0087	btrfs: add support for 4-copy replication (raid1c4) Add new block group profile to store 4 copies in a simliar way that current RAID1 does. The profile attributes and constraints are defined in the raid table and used by the same code that already handles the 2- and 3-copy RAID1. The minimum number of devices is 4, the maximum number of devices/chunks that can be lost/damaged is 3. There is no comparable traditional RAID level, the profile is added for future needs to accompany triple-parity and beyond. Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:49 +01:00
David Sterba	47e6f7423b	btrfs: add support for 3-copy replication (raid1c3) Add new block group profile to store 3 copies in a simliar way that current RAID1 does. The profile attributes and constraints are defined in the raid table and used by the same code that already handles the 2-copy RAID1. The minimum number of devices is 3, the maximum number of devices/chunks that can be lost/damaged is 2. Like RAID6 but with 33% space utilization. Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:49 +01:00
David Sterba	fac07d2b09	btrfs: sink write flags to cow_file_range_async In commit "Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios", cow_file_range_async gained wbc as a parameter and this makes passing write flags redundant. Set it inside the function and remove the parameter. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:49 +01:00
David Sterba	57e5ffeb87	btrfs: sink write_flags to __extent_writepage_io __extent_writepage reads write flags from wbc and passes both to __extent_writepage_io. This makes write_flags redundant and we can remove it. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:48 +01:00
Filipe Manana	fd0ddbe250	Btrfs: send, skip backreference walking for extents with many references Backreference walking, which is used by send to figure if it can issue clone operations instead of write operations, can be very slow and use too much memory when extents have many references. This change simply skips backreference walking when an extent has more than 64 references, in which case we fallback to a write operation instead of a clone operation. This limit is conservative and in practice I observed no signicant slowdown with up to 100 references and still low memory usage up to that limit. This is a temporary workaround until there are speedups in the backref walking code, and as such it does not attempt to add extra interfaces or knobs to tweak the threshold. Reported-by: Atemu <atemu.main@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:48 +01:00
Filipe Manana	11f2069c11	Btrfs: send, allow clone operations within the same file For send we currently skip clone operations when the source and destination files are the same. This is so because clone didn't support this case in its early days, but support for it was added back in May 2013 by commit `a96fbc7288` ("Btrfs: allow file data clone within a file"). This change adds support for it. Example: $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt/sdd $ xfs_io -f -c "pwrite -S 0xab -b 64K 0 64K" /mnt/sdd/foobar $ xfs_io -c "reflink /mnt/sdd/foobar 0 64K 64K" /mnt/sdd/foobar $ btrfs subvolume snapshot -r /mnt/sdd /mnt/sdd/snap $ mkfs.btrfs -f /dev/sde $ mount /dev/sde /mnt/sde $ btrfs send /mnt/sdd/snap \| btrfs receive /mnt/sde Without this change file foobar at the destination has a single 128Kb extent: $ filefrag -v /mnt/sde/snap/foobar Filesystem type is: 9123683e File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 31: 0.. 31: 32: last,unknown_loc,delalloc,eof /mnt/sde/snap/foobar: 1 extent found With this we get a single 64Kb extent that is shared at file offsets 0 and 64K, just like in the source filesystem: $ filefrag -v /mnt/sde/snap/foobar Filesystem type is: 9123683e File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 15: 3328.. 3343: 16: shared 1: 16.. 31: 3328.. 3343: 16: 3344: last,shared,eof /mnt/sde/snap/foobar: 2 extents found Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:48 +01:00

1 2 3 4 5 ...

873878 Commits