aboutsummaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)AuthorFilesLines
2021-10-12dm: fix mempool NULL pointer race when completing IOGravatar Jiazi Li 1-7/+10
dm_io_dec_pending() calls end_io_acct() first and will then dec md in-flight pending count. But if a task is swapping DM table at same time this can result in a crash due to mempool->elements being NULL: task1 task2 do_resume ->do_suspend ->dm_wait_for_completion bio_endio ->clone_endio ->dm_io_dec_pending ->end_io_acct ->wakeup task1 ->dm_swap_table ->__bind ->__bind_mempools ->bioset_exit ->mempool_exit ->free_io [ 67.330330] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 ...... [ 67.330494] pstate: 80400085 (Nzcv daIf +PAN -UAO) [ 67.330510] pc : mempool_free+0x70/0xa0 [ 67.330515] lr : mempool_free+0x4c/0xa0 [ 67.330520] sp : ffffff8008013b20 [ 67.330524] x29: ffffff8008013b20 x28: 0000000000000004 [ 67.330530] x27: ffffffa8c2ff40a0 x26: 00000000ffff1cc8 [ 67.330535] x25: 0000000000000000 x24: ffffffdada34c800 [ 67.330541] x23: 0000000000000000 x22: ffffffdada34c800 [ 67.330547] x21: 00000000ffff1cc8 x20: ffffffd9a1304d80 [ 67.330552] x19: ffffffdada34c970 x18: 000000b312625d9c [ 67.330558] x17: 00000000002dcfbf x16: 00000000000006dd [ 67.330563] x15: 000000000093b41e x14: 0000000000000010 [ 67.330569] x13: 0000000000007f7a x12: 0000000034155555 [ 67.330574] x11: 0000000000000001 x10: 0000000000000001 [ 67.330579] x9 : 0000000000000000 x8 : 0000000000000000 [ 67.330585] x7 : 0000000000000000 x6 : ffffff80148b5c1a [ 67.330590] x5 : ffffff8008013ae0 x4 : 0000000000000001 [ 67.330596] x3 : ffffff80080139c8 x2 : ffffff801083bab8 [ 67.330601] x1 : 0000000000000000 x0 : ffffffdada34c970 [ 67.330609] Call trace: [ 67.330616] mempool_free+0x70/0xa0 [ 67.330627] bio_put+0xf8/0x110 [ 67.330638] dec_pending+0x13c/0x230 [ 67.330644] clone_endio+0x90/0x180 [ 67.330649] bio_endio+0x198/0x1b8 [ 67.330655] dec_pending+0x190/0x230 [ 67.330660] clone_endio+0x90/0x180 [ 67.330665] bio_endio+0x198/0x1b8 [ 67.330673] blk_update_request+0x214/0x428 [ 67.330683] scsi_end_request+0x2c/0x300 [ 67.330688] scsi_io_completion+0xa0/0x710 [ 67.330695] scsi_finish_command+0xd8/0x110 [ 67.330700] scsi_softirq_done+0x114/0x148 [ 67.330708] blk_done_softirq+0x74/0xd0 [ 67.330716] __do_softirq+0x18c/0x374 [ 67.330724] irq_exit+0xb4/0xb8 [ 67.330732] __handle_domain_irq+0x84/0xc0 [ 67.330737] gic_handle_irq+0x148/0x1b0 [ 67.330744] el1_irq+0xe8/0x190 [ 67.330753] lpm_cpuidle_enter+0x4f8/0x538 [ 67.330759] cpuidle_enter_state+0x1fc/0x398 [ 67.330764] cpuidle_enter+0x18/0x20 [ 67.330772] do_idle+0x1b4/0x290 [ 67.330778] cpu_startup_entry+0x20/0x28 [ 67.330786] secondary_start_kernel+0x160/0x170 Fix this by: 1) Establishing pointers to 'struct dm_io' members in dm_io_dec_pending() so that they may be passed into end_io_acct() _after_ free_io() is called. 2) Moving end_io_acct() after free_io(). Cc: stable@vger.kernel.org Signed-off-by: Jiazi Li <lijiazi@xiaomi.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-10-12dm rq: don't queue request to blk-mq during DM suspendGravatar Ming Lei 1-0/+8
DM uses blk-mq's quiesce/unquiesce to stop/start device mapper queue. But blk-mq's unquiesce may come from outside events, such as elevator switch, updating nr_requests or others, and request may come during suspend, so simply ask for blk-mq to requeue it. Fixes one kernel panic issue when running updating nr_requests and dm-mpath suspend/resume stress test. Cc: stable@vger.kernel.org Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-10-12dm clone: make array 'descs' staticGravatar Colin Ian King 1-1/+1
Don't populate the read-only array descs on the stack but instead it static and add extra const. Also makes the object code smaller by 66 bytes: Before: text data bss dec hex filename 42382 11140 512 54034 d312 ./drivers/md/dm-clone-target.o After: text data bss dec hex filename 42220 11236 512 53968 d2d0 ./drivers/md/dm-clone-target.o (gcc version 11.2.0) Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-10-12dm verity: skip redundant verity_handle_err() on I/O errorsGravatar Akilesh Kailash 1-3/+12
Without FEC, dm-verity won't call verity_handle_err() when I/O fails, but with FEC enabled, it currently does even if an I/O error has occurred. If there is an I/O error and FEC correction fails, return the error instead of calling verity_handle_err() again. Suggested-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Akilesh Kailash <akailash@google.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-09-22md: fix a lock order reversal in md_allocGravatar Christoph Hellwig 1-5/+0
Commit b0140891a8cea3 ("md: Fix race when creating a new md device.") not only moved assigning mddev->gendisk before calling add_disk, which fixes the races described in the commit log, but also added a mddev->open_mutex critical section over add_disk and creation of the md kobj. Adding a kobject after add_disk is racy vs deleting the gendisk right after adding it, but md already prevents against that by holding a mddev->active reference. On the other hand taking this lock added a lock order reversal with what is not disk->open_mutex (used to be bdev->bd_mutex when the commit was added) for partition devices, which need that lock for the internal open for the partition scan, and a recent commit also takes it for non-partitioned devices, leading to further lockdep splatter. Fixes: b0140891a8ce ("md: Fix race when creating a new md device.") Fixes: d62633873590 ("block: support delayed holder registration") Reported-by: syzbot+fadc0aaf497e6a493b9f@syzkaller.appspotmail.com Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: syzbot+fadc0aaf497e6a493b9f@syzkaller.appspotmail.com Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Song Liu <songliubraving@fb.com>
2021-09-09Merge tag 'libnvdimm-for-5.15' of ↵Gravatar Linus Torvalds 2-8/+3
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: - Fix a race condition in the teardown path of raw mode pmem namespaces. - Cleanup the code that filesystems use to detect filesystem-dax capabilities of their underlying block device. * tag 'libnvdimm-for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: remove bdev_dax_supported xfs: factor out a xfs_buftarg_is_dax helper dax: stub out dax_supported for !CONFIG_FS_DAX dax: remove __generic_fsdax_supported dax: move the dax_read_lock() locking into dax_supported dax: mark dax_get_by_host static dm: use fs_dax_get_by_bdev instead of dax_get_by_host dax: stop using bdevname fsdax: improve the FS_DAX Kconfig description and help text libnvdimm/pmem: Fix crash triggered when I/O in-flight during unbind
2021-09-02Merge tag 'integrity-v5.15' of ↵Gravatar Linus Torvalds 1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity Pull integrity subsystem updates from Mimi Zohar: - Limit the allowed hash algorithms when writing security.ima xattrs or verifying them, based on the IMA policy and the configured hash algorithms. - Return the calculated "critical data" measurement hash and size to avoid code duplication. (Preparatory change for a proposed LSM.) - and a single patch to address a compiler warning. * tag 'integrity-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity: IMA: reject unknown hash algorithms in ima_get_hash_algo IMA: prevent SETXATTR_CHECK policy rules with unavailable algorithms IMA: introduce a new policy option func=SETXATTR_CHECK IMA: add a policy option to restrict xattr hash algorithms on appraisal IMA: add support to restrict the hash algorithms used for file appraisal IMA: block writes of the security.ima xattr with unsupported algorithms IMA: remove the dependency on CRYPTO_MD5 ima: Add digest and digest_len params to the functions to measure a buffer ima: Return int in the functions to measure a buffer ima: Introduce ima_get_current_hash_algo() IMA: remove -Wmissing-prototypes warning
2021-08-31Merge tag 'for-5.15/dm-changes' of ↵Gravatar Linus Torvalds 37-194/+1493
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Add DM infrastructure for IMA-based remote attestion. These changes are the basis for deploying DM-based storage in a "cloud" that must validate configurations end-users run to maintain trust. These DM changes allow supported DM targets' configurations to be measured via IMA. But the policy and enforcement (of which configurations are valid) is managed by something outside the kernel (e.g. Keylime). - Fix DM crypt scalability regression on systems with many cpus due to percpu_counter spinlock contention in crypt_page_alloc(). - Use in_hardirq() instead of deprecated in_irq() in DM crypt. - Add event counters to DM writecache to allow users to further assess how the writecache is performing. - Various code cleanup in DM writecache's main IO mapping function. * tag 'for-5.15/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm crypt: use in_hardirq() instead of deprecated in_irq() dm ima: update dm documentation for ima measurement support dm ima: update dm target attributes for ima measurements dm ima: add a warning in dm_init if duplicate ima events are not measured dm ima: prefix ima event name related to device mapper with dm_ dm ima: add version info to dm related events in ima log dm ima: prefix dm table hashes in ima log with hash algorithm dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc() dm: add documentation for IMA measurement support dm: update target status functions to support IMA measurement dm ima: measure data on device rename dm ima: measure data on table clear dm ima: measure data on device remove dm ima: measure data on device resume dm ima: measure data on table load dm writecache: add event counters dm writecache: report invalid return from writecache_map helpers dm writecache: further writecache_map() cleanup dm writecache: factor out writecache_map_remap_origin() dm writecache: split up writecache_map() to improve code readability
2021-08-30Merge tag 'for-5.15/drivers-2021-08-30' of git://git.kernel.dk/linux-blockGravatar Linus Torvalds 2-4/+29
Pull block driver updates from Jens Axboe: "Sitting on top of the core block changes, here are the driver changes for the 5.15 merge window: - NVMe updates via Christoph: - suspend improvements for devices with an HMB (Keith Busch) - handle double completions more gacefull (Sagi Grimberg) - cleanup the selects for the nvme core code a bit (Sagi Grimberg) - don't update queue count when failing to set io queues (Ruozhu Li) - various nvmet connect fixes (Amit Engel) - cleanup lightnvm leftovers (Keith Busch, me) - small cleanups (Colin Ian King, Hou Pu) - add tracing for the Set Features command (Hou Pu) - CMB sysfs cleanups (Keith Busch) - add a mutex_destroy call (Keith Busch) - remove lightnvm subsystem. It's served its purpose and ultimately led to zoned nvme support, we no longer need it (Christoph) - revert floppy O_NDELAY fix (Denis) - nbd fixes (Hou, Pavel, Baokun) - nbd locking fixes (Tetsuo) - nbd device removal fixes (Christoph) - raid10 rcu warning fix (Xiao) - raid1 write behind fix (Guoqing) - rnbd fixes (Gioh, Md Haris) - misc fixes (Colin)" * tag 'for-5.15/drivers-2021-08-30' of git://git.kernel.dk/linux-block: (42 commits) Revert "floppy: reintroduce O_NDELAY fix" raid1: ensure write behind bio has less than BIO_MAX_VECS sectors md/raid10: Remove unnecessary rcu_dereference in raid10_handle_discard nbd: remove nbd->destroy_complete nbd: only return usable devices from nbd_find_unused nbd: set nbd->index before releasing nbd_index_mutex nbd: prevent IDR lookups from finding partially initialized devices nbd: reset NBD to NULL when restarting in nbd_genl_connect nbd: add missing locking to the nbd_dev_add error path nvme: remove the unused NVME_NS_* enum nvme: remove nvm_ndev from ns nvme: Have NVME_FABRICS select NVME_CORE instead of transport drivers block: nbd: add sanity check for first_minor nvmet: check that host sqsize does not exceed ctrl MQES nvmet: avoid duplicate qid in connect cmd nvmet: pass back cntlid on successful completion nvme-rdma: don't update queue count when failing to set io queues nvme-tcp: don't update queue count when failing to set io queues nvme-tcp: pair send_mutex init with destroy nvme: allow user toggling hmb usage ...
2021-08-30Merge tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-blockGravatar Linus Torvalds 13-48/+41
Pull block updates from Jens Axboe: "Nothing major in here - lots of good cleanups and tech debt handling, which is also evident in the diffstats. In particular: - Add disk sequence numbers (Matteo) - Discard merge fix (Ming) - Relax disk zoned reporting restrictions (Niklas) - Bio error handling zoned leak fix (Pavel) - Start of proper add_disk() error handling (Luis, Christoph) - blk crypto fix (Eric) - Non-standard GPT location support (Dmitry) - IO priority improvements and cleanups (Damien)o - blk-throtl improvements (Chunguang) - diskstats_show() stack reduction (Abd-Alrhman) - Loop scheduler selection (Bart) - Switch block layer to use kmap_local_page() (Christoph) - Remove obsolete disk_name helper (Christoph) - block_device refcounting improvements (Christoph) - Ensure gendisk always has a request queue reference (Christoph) - Misc fixes/cleanups (Shaokun, Oliver, Guoqing)" * tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-block: (129 commits) sg: pass the device name to blk_trace_setup block, bfq: cleanup the repeated declaration blk-crypto: fix check for too-large dun_bytes blk-zoned: allow BLKREPORTZONE without CAP_SYS_ADMIN blk-zoned: allow zone management send operations without CAP_SYS_ADMIN block: mark blkdev_fsync static block: refine the disk_live check in del_gendisk mmc: sdhci-tegra: Enable MMC_CAP2_ALT_GPT_TEGRA mmc: block: Support alternative_gpt_sector() operation partitions/efi: Support non-standard GPT location block: Add alternative_gpt_sector() operation bio: fix page leak bio_add_hw_page failure block: remove CONFIG_DEBUG_BLOCK_EXT_DEVT block: remove a pointless call to MINOR() in device_add_disk null_blk: add error handling support for add_disk() virtio_blk: add error handling support for add_disk() block: add error handling for device_add_disk / add_disk block: return errors from disk_alloc_events block: return errors from blk_integrity_add block: call blk_register_queue earlier in device_add_disk ...
2021-08-28md/raid5: Replace deprecated CPU-hotplug functions.Gravatar Sebastian Andrzej Siewior 1-2/+2
The functions get_online_cpus() and put_online_cpus() have been deprecated during the CPU hotplug rework. They map directly to cpus_read_lock() and cpus_read_unlock(). Replace deprecated CPU-hotplug functions with the official version. The behavior remains unchanged. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20210803141621.780504-16-bigeasy@linutronix.de
2021-08-27raid1: ensure write behind bio has less than BIO_MAX_VECS sectorsGravatar Guoqing Jiang 1-0/+19
We can't split write behind bio with more than BIO_MAX_VECS sectors, otherwise the below call trace was triggered because we could allocate oversized write behind bio later. [ 8.097936] bvec_alloc+0x90/0xc0 [ 8.098934] bio_alloc_bioset+0x1b3/0x260 [ 8.099959] raid1_make_request+0x9ce/0xc50 [raid1] [ 8.100988] ? __bio_clone_fast+0xa8/0xe0 [ 8.102008] md_handle_request+0x158/0x1d0 [md_mod] [ 8.103050] md_submit_bio+0xcd/0x110 [md_mod] [ 8.104084] submit_bio_noacct+0x139/0x530 [ 8.105127] submit_bio+0x78/0x1d0 [ 8.106163] ext4_io_submit+0x48/0x60 [ext4] [ 8.107242] ext4_writepages+0x652/0x1170 [ext4] [ 8.108300] ? do_writepages+0x41/0x100 [ 8.109338] ? __ext4_mark_inode_dirty+0x240/0x240 [ext4] [ 8.110406] do_writepages+0x41/0x100 [ 8.111450] __filemap_fdatawrite_range+0xc5/0x100 [ 8.112513] file_write_and_wait_range+0x61/0xb0 [ 8.113564] ext4_sync_file+0x73/0x370 [ext4] [ 8.114607] __x64_sys_fsync+0x33/0x60 [ 8.115635] do_syscall_64+0x33/0x40 [ 8.116670] entry_SYSCALL_64_after_hwframe+0x44/0xae Thanks for the comment from Christoph. [1]. https://bugs.archlinux.org/task/70992 Cc: stable@vger.kernel.org # v5.12+ Reported-by: Jens Stutte <jens@chianterastutte.eu> Tested-by: Jens Stutte <jens@chianterastutte.eu> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <songliubraving@fb.com>
2021-08-26dax: move the dax_read_lock() locking into dax_supportedGravatar Christoph Hellwig 1-7/+2
Move the dax_read_lock/dax_read_unlock pair from the callers into dax_supported to make it a little easier to use. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://lore.kernel.org/r/20210826135510.6293-6-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-08-26dm: use fs_dax_get_by_bdev instead of dax_get_by_hostGravatar Christoph Hellwig 1-1/+1
There is no point in trying to finding the dax device if the DAX flag is not set on the queue as none of the users of the device mapper exported block devices could make use of the DAX capability. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20210826135510.6293-4-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-08-26md/raid10: Remove unnecessary rcu_dereference in raid10_handle_discardGravatar Xiao Ni 1-4/+10
We are seeing the following warning in raid10_handle_discard. [ 695.110751] ============================= [ 695.131439] WARNING: suspicious RCU usage [ 695.151389] 4.18.0-319.el8.x86_64+debug #1 Not tainted [ 695.174413] ----------------------------- [ 695.192603] drivers/md/raid10.c:1776 suspicious rcu_dereference_check() usage! [ 695.225107] other info that might help us debug this: [ 695.260940] rcu_scheduler_active = 2, debug_locks = 1 [ 695.290157] no locks held by mkfs.xfs/10186. In the first loop of function raid10_handle_discard. It already determines which disk need to handle discard request and add the rdev reference count rdev->nr_pending. So the conf->mirrors will not change until all bios come back from underlayer disks. It doesn't need to use rcu_dereference to get rdev. Cc: stable@vger.kernel.org Fixes: d30588b2731f ('md/raid10: improve raid10 discard request') Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <songliubraving@fb.com>
2021-08-20dm crypt: use in_hardirq() instead of deprecated in_irq()Gravatar Changbin Du 1-2/+2
Signed-off-by: Changbin Du <changbin.du@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-08-20dm ima: update dm target attributes for ima measurementsGravatar Tushar Sugandhi 3-9/+20
Certain DM targets ('integrity', 'multipath', 'verity') need to update the way their attributes are recorded in the ima log, so that the attestation servers can interpret the data correctly and decide if the devices meet the attestation requirements. For instance, the "mode=%c" attribute in the 'integrity' target is measured twice, the 'verity' target is missing the attribute "root_hash_sig_key_desc=%s", and the 'multipath' target needs to index the attributes properly. Update 'integrity' target to remove the duplicate measurement of the attribute "mode=%c". Add "root_hash_sig_key_desc=%s" attribute for the 'verity' target. Index various attributes in 'multipath' target. Also, add "nr_priority_groups=%u" attribute to 'multipath' target to record the number of priority groups. Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com> Suggested-by: Thore Sommer <public@thson.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-08-20dm ima: add a warning in dm_init if duplicate ima events are not measuredGravatar Tushar Sugandhi 1-3/+6
The end-users of DM devices/targets may remove and re-create the same device multiple times. IMA does not measure such duplicate events if the configuration CONFIG_IMA_DISABLE_HTABLE is set to 'n'. To avoid confusion, the end-users need some indication on the client if that configuration option is disabled. Add a one-time warning during dm_init() if CONFIG_IMA_DISABLE_HTABLE is set to 'n', to notify the end-users that duplicate events will not be measured in the ima log. Also cleanup some whitespace in dm_init(). Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-08-20dm ima: prefix ima event name related to device mapper with dm_Gravatar Tushar Sugandhi 1-9/+10
The event names for the DM events recorded in the ima log do not contain any information to indicate the events are part of the DM devices/targets. Prefix the event names for DM events with "dm_" to indicate that they are part of device-mapper. Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com> Suggested-by: Thore Sommer <public@thson.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-08-20dm ima: add version info to dm related events in ima logGravatar Tushar Sugandhi 2-12/+57
The DM events present in the ima log contain various attributes in the key=value format. The attributes' names/values may change in future, and new attributes may also get added. The attestation server needs some versioning to determine which attributes are supported and are expected in the ima log. Add version information to the DM events present in the ima log to help attestation servers to correctly process the attributes across different versions. Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com> Suggested-by: Mimi Zohar <zohar@linux.ibm.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-08-20dm ima: prefix dm table hashes in ima log with hash algorithmGravatar Tushar Sugandhi 2-3/+12
The active/inactive table hashes measured in the ima log do not contain the information about hash algorithm. This information is useful for the attestation servers to recreate the hashes and compare them with the ones present in the ima log to verify the table contents. Prefix the table hashes in various DM events in ima log with the hash algorithm used to compute those hashes. Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com> Suggested-by: Mimi Zohar <zohar@linux.ibm.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-08-18dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc()Gravatar Arne Welzel 1-1/+6
On systems with many cores using dm-crypt, heavy spinlock contention in percpu_counter_compare() can be observed when the page allocation limit for a given device is reached or close to be reached. This is due to percpu_counter_compare() taking a spinlock to compute an exact result on potentially many CPUs at the same time. Switch to non-exact comparison of allocated and allowed pages by using the value returned by percpu_counter_read_positive() to avoid taking the percpu_counter spinlock. This may over/under estimate the actual number of allocated pages by at most (batch-1) * num_online_cpus(). Currently, batch is bounded by 32. The system on which this issue was first observed has 256 CPUs and 512GB of RAM. With a 4k page size, this change may over/under estimate by 31MB. With ~10G (2%) allowed dm-crypt allocations, this seems an acceptable error. Certainly preferred over running into the spinlock contention. This behavior was reproduced on an EC2 c5.24xlarge instance with 96 CPUs and 192GB RAM as follows, but can be provoked on systems with less CPUs as well. * Disable swap * Tune vm settings to promote regular writeback $ echo 50 > /proc/sys/vm/dirty_expire_centisecs $ echo 25 > /proc/sys/vm/dirty_writeback_centisecs $ echo $((128 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes * Create 8 dmcrypt devices based on files on a tmpfs * Create and mount an ext4 filesystem on each crypt devices * Run stress-ng --hdd 8 within one of above filesystems Total %system usage collected from sysstat goes to ~35%. Write throughput on the underlying loop device is ~2GB/s. perf profiling an individual kworker kcryptd thread shows the following profile, indicating spinlock contention in percpu_counter_compare(): 99.98% 0.00% kworker/u193:46 [kernel.kallsyms] [k] ret_from_fork | --ret_from_fork kthread worker_thread | --99.92%--process_one_work | |--80.52%--kcryptd_crypt | | | |--62.58%--mempool_alloc | | | | | --62.24%--crypt_page_alloc | | | | | --61.51%--__percpu_counter_compare | | | | | --61.34%--__percpu_counter_sum | | | | | |--58.68%--_raw_spin_lock_irqsave | | | | | | | --58.30%--native_queued_spin_lock_slowpath | | | | | --0.69%--cpumask_next | | | | | --0.51%--_find_next_bit | | | |--10.61%--crypt_convert | | | | | |--6.05%--xts_crypt ... After applying this patch and running the same test, %system usage is lowered to ~7% and write throughput on the loop device increases to ~2.7GB/s. perf report shows mempool_alloc() as ~8% rather than ~62% in the profile and not hitting the percpu_counter() spinlock anymore. |--8.15%--mempool_alloc | | | |--3.93%--crypt_page_alloc | | | | | --3.75%--__alloc_pages | | | | | --3.62%--get_page_from_freelist | | | | | --3.22%--rmqueue_bulk | | | | | --2.59%--_raw_spin_lock | | | | | --2.57%--native_queued_spin_lock_slowpath | | | --3.05%--_raw_spin_lock_irqsave | | | --2.49%--native_queued_spin_lock_slowpath Suggested-by: DJ Gregor <dj@corelight.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Arne Welzel <arne.welzel@corelight.com> Fixes: 5059353df86e ("dm crypt: limit the number of allocated pages") Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-08-16bcache: use bvec_virtGravatar Christoph Hellwig 1-1/+1
Use bvec_virt instead of open coding it. Note that the existing code is fine despite ignoring bv_offset as the bio is known to contain exactly one page from the page allocator per bio_vec. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20210804095634.460779-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-16dm-integrity: use bvec_virtGravatar Christoph Hellwig 1-2/+2
Use bvec_virt instead of open coding it. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210804095634.460779-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-16dm-ebs: use bvec_virtGravatar Christoph Hellwig 1-1/+1
Use bvec_virt instead of open coding it. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210804095634.460779-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-16dm: make EBS depend on !HIGHMEMGravatar Christoph Hellwig 1-1/+1
__ebs_rw_bvec use page_address on the submitted bios data, and thus can't deal with highmem. Disable the target on highmem configs. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210804095634.460779-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-12block: remove GENHD_FL_UPGravatar Christoph Hellwig 1-3/+1
Just check inode_unhashed on the whole device bdev inode instead, and provide a helper to check for that information. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210809064028.1198327-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-12bcache: move the del_gendisk call out of bcache_device_freeGravatar Christoph Hellwig 1-6/+4
Let the callers call del_gendisk so that we can check if add_disk has been called properly for the cached device case instead of relying on the block layer internal GENHD_FL_UP flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20210809064028.1198327-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-12bcache: add proper error unwinding in bcache_device_initGravatar Christoph Hellwig 1-5/+11
Except for the IDA none of the allocations in bcache_device_init is unwound on error, fix that. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20210809064028.1198327-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-11block: move some macros to blkdev.hGravatar Guoqing Jiang 1-2/+0
Move them (PAGE_SECTORS_SHIFT, PAGE_SECTORS and SECTOR_MASK) to the generic header file to remove redundancy. Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Link: https://lore.kernel.org/r/20210721025315.1729118-1-guoqing.jiang@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-10dm: update target status functions to support IMA measurementGravatar Tushar Sugandhi 31-2/+324
For device mapper targets to take advantage of IMA's measurement capabilities, the status functions for the individual targets need to be updated to handle the status_type_t case for value STATUSTYPE_IMA. Update status functions for the following target types, to log their respective attributes to be measured using IMA. 01. cache 02. crypt 03. integrity 04. linear 05. mirror 06. multipath 07. raid 08. snapshot 09. striped 10. verity For rest of the targets, handle the STATUSTYPE_IMA case by setting the measurement buffer to NULL. For IMA to measure the data on a given system, the IMA policy on the system needs to be updated to have the following line, and the system needs to be restarted for the measurements to take effect. /etc/ima/ima-policy measure func=CRITICAL_DATA label=device-mapper template=ima-buf The measurements will be reflected in the IMA logs, which are located at: /sys/kernel/security/integrity/ima/ascii_runtime_measurements /sys/kernel/security/integrity/ima/binary_runtime_measurements These IMA logs can later be consumed by various attestation clients running on the system, and send them to external services for attesting the system. The DM target data measured by IMA subsystem can alternatively be queried from userspace by setting DM_IMA_MEASUREMENT_FLAG with DM_TABLE_STATUS_CMD. Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-08-10dm ima: measure data on device renameGravatar Tushar Sugandhi 3-0/+53
A given block device is identified by it's name and UUID. However, both these parameters can be renamed. For an external attestation service to correctly attest a given device, it needs to keep track of these rename events. Update the device data with the new values for IMA measurements. Measure both old and new device name/UUID parameters in the same IMA measurement event, so that the old and the new values can be connected later. Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-08-10dm ima: measure data on table clearGravatar Tushar Sugandhi 3-0/+97
For a given block device, an inactive table slot contains the parameters to configure the device with. The inactive table can be cleared multiple times, accidentally or maliciously, which may impact the functionality of the device, and compromise the system. Therefore it is important to measure and log the event when a table is cleared. Measure device parameters, and table hashes when the inactive table slot is cleared. Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-08-10dm ima: measure data on device removeGravatar Tushar Sugandhi 3-0/+124
Presence of an active block-device, configured with expected parameters, is important for an external attestation service to determine if a system meets the attestation requirements. Therefore it is important for DM to measure the device remove events. Measure device parameters and table hashes when the device is removed, using either remove or remove_all. Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-08-10dm ima: measure data on device resumeGravatar Tushar Sugandhi 3-2/+125
A given block device can load a table multiple times, with different input parameters, before eventually resuming it. Further, a device may be suspended and then resumed. The device may never resume after a table-load. Because of the above valid scenarios for a given device, it is important to measure and log the device resume event using IMA. Also, if the table is large, measuring it in clear-text each time the device changes state, will unnecessarily increase the size of IMA log. Since the table clear-text is already measured during table-load event, measuring the hash during resume should be sufficient to validate the table contents. Measure the device parameters, and hash of the active table, when the device is resumed. Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-08-10dm ima: measure data on table loadGravatar Tushar Sugandhi 6-1/+407
DM configures a block device with various target specific attributes passed to it as a table. DM loads the table, and calls each target’s respective constructors with the attributes as input parameters. Some of these attributes are critical to ensure the device meets certain security bar. Thus, IMA should measure these attributes, to ensure they are not tampered with, during the lifetime of the device. So that the external services can have high confidence in the configuration of the block-devices on a given system. Some devices may have large tables. And a given device may change its state (table-load, suspend, resume, rename, remove, table-clear etc.) many times. Measuring these attributes each time when the device changes its state will significantly increase the size of the IMA logs. Further, once configured, these attributes are not expected to change unless a new table is loaded, or a device is removed and recreated. Therefore the clear-text of the attributes should only be measured during table load, and the hash of the active/inactive table should be measured for the remaining device state changes. Export IMA function ima_measure_critical_data() to allow measurement of DM device parameters, as well as target specific attributes, during table load. Compute the hash of the inactive table and store it for measurements during future state change. If a load is called multiple times, update the inactive table hash with the hash of the latest populated table. So that the correct inactive table hash is measured when the device transitions to different states like resume, remove, rename, etc. Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com> Signed-off-by: Colin Ian King <colin.king@canonical.com> # leak fix Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-08-10dm writecache: add event countersGravatar Mikulas Patocka 1-3/+53
Add 10 counters for various events (hit, miss, etc) and export them in the status line (accessed from userspace with "dmsetup status"). Also add a message "clear_stats" that resets these counters. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-08-10dm writecache: report invalid return from writecache_map helpersGravatar Mikulas Patocka 1-1/+4
If some "writecache_map_*" function returns invalid state, it is a bug. So, we should report it and not fail silently. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-08-10dm writecache: further writecache_map() cleanupGravatar Mike Snitzer 1-32/+43
Factor out writecache_map_flush() and writecache_map_discard() from writecache_map(). Also eliminate the various goto labels in writecache_map(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-08-10dm writecache: factor out writecache_map_remap_origin()Gravatar Mike Snitzer 1-15/+15
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-08-10dm writecache: split up writecache_map() to improve code readabilityGravatar Mike Snitzer 1-151/+187
writecache_map() has grown too large and can be confusing to read given all the goto statements. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-08-09block: pass a gendisk to blk_queue_update_readaheadGravatar Christoph Hellwig 1-1/+1
.. and rename the function to disk_update_readahead. This is in preparation for moving the BDI from the request_queue to the gendisk. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210809141744.1203023-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09dm: delay registering the gendiskGravatar Christoph Hellwig 2-13/+11
device mapper is currently the only outlier that tries to call register_disk after add_disk, leading to fairly inconsistent state of these block layer data structures. Instead change device-mapper to just register the gendisk later now that the holder mechanism can cope with that. Note that this introduces a user visible change: the dm kobject is now only visible after the initial table has been loaded. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20210804094147.459763-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09dm: move setting md->type into dm_setup_md_queueGravatar Christoph Hellwig 2-6/+3
Move setting md->type from both callers into dm_setup_md_queue. This ensures that md->type is only set to a valid value after the queue has been fully setup, something we'll rely on future changes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20210804094147.459763-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09dm: cleanup cleanup_mapped_deviceGravatar Christoph Hellwig 1-5/+1
md->queue is now always set when md->disk is set, so simplify the conditionals a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20210804094147.459763-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09block: make the block holder code optionalGravatar Christoph Hellwig 2-0/+3
Move the block holder code into a separate file as it is not in any way related to the other block_dev.c code, and add a new selectable config option for it so that we don't have to build it without any remapped drivers selected. The Kconfig symbol contains a _DEPRECATED suffix to match the comments added in commit 49731baa41df ("block: restore multiple bd_link_disk_holder() support"). Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20210804094147.459763-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-07Merge tag 'block-5.14-2021-08-07' of git://git.kernel.dk/linux-blockGravatar Linus Torvalds 2-4/+2
Pull block fixes from Jens Axboe: "A few minor fixes: - Fix ldm kernel-doc warning (Bart) - Fix adding offset twice for DMA address in n64cart (Christoph) - Fix use-after-free in dasd path handling (Stefan) - Order kyber insert trace correctly (Vincent) - raid1 errored write handling fix (Wei) - Fix blk-iolatency queue get failure handling (Yu)" * tag 'block-5.14-2021-08-07' of git://git.kernel.dk/linux-block: kyber: make trace_block_rq call consistent with documentation block/partitions/ldm.c: Fix a kernel-doc warning blk-iolatency: error out if blk_get_queue() failed in iolatency_set_limit() n64cart: fix the dma address in n64cart_do_bvec s390/dasd: fix use after free in dasd path handling md/raid10: properly indicate failure when ending a failed write request
2021-08-04Merge branch 'md-fixes' of ↵Gravatar Jens Axboe 2-4/+2
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-5.14 * 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md/raid10: properly indicate failure when ending a failed write request
2021-08-02dm-writecache: use bvec_kmap_local instead of bvec_kmap_irqGravatar Christoph Hellwig 1-3/+2
There is no need to disable interrupts in bio_copy_block, and the local only mappings helps to avoid any sort of problems with stray writes into the bio data. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20210727055646.118787-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-23md/raid10: properly indicate failure when ending a failed write requestGravatar Wei Shuyu 2-4/+2
Similar to [1], this patch fixes the same bug in raid10. Also cleanup the comments. [1] commit 2417b9869b81 ("md/raid1: properly indicate failure when ending a failed write request") Cc: stable@vger.kernel.org Fixes: 7cee6d4e6035 ("md/raid10: end bio when the device faulty") Signed-off-by: Wei Shuyu <wsy@dogben.com> Acked-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>