aboutsummaryrefslogtreecommitdiff
path: root/io_uring
AgeCommit message (Collapse)AuthorFilesLines
2024-03-01io_uring/sqpoll: statistics of the true utilization of sq threadsGravatar Xiaobing Li 3-1/+24
Count the running time and actual IO processing time of the sqpoll thread, and output the statistical data to fdinfo. Variable description: "work_time" in the code represents the sum of the jiffies of the sq thread actually processing IO, that is, how many milliseconds it actually takes to process IO. "total_time" represents the total time that the sq thread has elapsed from the beginning of the loop to the current time point, that is, how many milliseconds it has spent in total. The test tool is fio, and its parameters are as follows: [global] ioengine=io_uring direct=1 group_reporting bs=128k norandommap=1 randrepeat=0 refill_buffers ramp_time=30s time_based runtime=1m clocksource=clock_gettime overwrite=1 log_avg_msec=1000 numjobs=1 [disk0] filename=/dev/nvme0n1 rw=read iodepth=16 hipri sqthread_poll=1 The test results are as follows: Every 2.0s: cat /proc/9230/fdinfo/6 | grep -E Sq SqMask: 0x3 SqHead: 3197153 SqTail: 3197153 CachedSqHead: 3197153 SqThread: 9231 SqThreadCpu: 11 SqTotalTime: 18099614 SqWorkTime: 16748316 The test results corresponding to different iodepths are as follows: |-----------|-------|-------|-------|------|-------| | iodepth | 1 | 4 | 8 | 16 | 64 | |-----------|-------|-------|-------|------|-------| |utilization| 2.9% | 8.8% | 10.9% | 92.9%| 84.4% | |-----------|-------|-------|-------|------|-------| | idle | 97.1% | 91.2% | 89.1% | 7.1% | 15.6% | |-----------|-------|-------|-------|------|-------| Signed-off-by: Xiaobing Li <xiaobing.li@samsung.com> Link: https://lore.kernel.org/r/20240228091251.543383-1-xiaobing.li@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-01io_uring/net: move recv/recvmsg flags out of retry loopGravatar Jens Axboe 1-7/+8
The flags don't change, just intialize them once rather than every loop for multishot. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27io_uring/kbuf: flag request if buffer pool is empty after buffer pickGravatar Jens Axboe 1-2/+8
Normally we do an extra roundtrip for retries even if the buffer pool has depleted, as we don't check that upfront. Rather than add this check, have the buffer selection methods mark the request with REQ_F_BL_EMPTY if the used buffer group is out of buffers after this selection. This is very cheap to do once we're all the way inside there anyway, and it gives the caller a chance to make better decisions on how to proceed. For example, recv/recvmsg multishot could check this flag when it decides whether to keep receiving or not. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27io_uring/net: improve the usercopy for sendmsg/recvmsgGravatar Jens Axboe 1-7/+22
We're spending a considerable amount of the sendmsg/recvmsg time just copying in the message header. And for provided buffers, the known single entry iovec. Be a bit smarter about it and enable/disable user access around our copying. In a test case that does both sendmsg and recvmsg, the runtime before this change (averaged over multiple runs, very stable times however): Kernel Time Diff ==================================== -git 4720 usec -git+commit 4311 usec -8.7% and looking at a profile diff, we see the following: 0.25% +9.33% [kernel.kallsyms] [k] _copy_from_user 4.47% -3.32% [kernel.kallsyms] [k] __io_msg_copy_hdr.constprop.0 where we drop more than 9% of _copy_from_user() time, and consequently add time to __io_msg_copy_hdr() where the copies are now attributed to, but with a net win of 6%. In comparison, the same test case with send/recv runs in 3745 usec, which is (expectedly) still quite a bit faster. But at least sendmsg/recvmsg is now only ~13% slower, where it was ~21% slower before. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27io_uring/net: move receive multishot out of the generic msghdr pathGravatar Jens Axboe 1-70/+91
Move the actual user_msghdr / compat_msghdr into the send and receive sides, respectively, so we can move the uaddr receive handling into its own handler, and ditto the multishot with buffer selection logic. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27io_uring/net: unify how recvmsg and sendmsg copy in the msghdrGravatar Jens Axboe 1-129/+142
For recvmsg, we roll our own since we support buffer selections. This isn't the case for sendmsg right now, but in preparation for doing so, make the recvmsg copy helpers generic so we can call them from the sendmsg side as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-15io_uring/napi: enable even with a timeout of 0Gravatar Jens Axboe 1-4/+5
1 usec is not as short as it used to be, and it makes sense to allow 0 for a busy poll timeout - this means just do one loop to check if we have anything available. Add a separate ->napi_enabled to check if napi has been enabled or not. While at it, move the writing of the ctx napi values after we've copied the old values back to userspace. This ensures that if the call fails, we'll be in the same state as we were before, rather than some indeterminate state. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-15io_uring: kill stale comment for io_cqring_overflow_kill()Gravatar Jens Axboe 1-1/+0
This function now deals only with discarding overflow entries on ring free and exit, and it no longer returns whether we successfully flushed all entries as there's no CQE posting involved anymore. Kill the outdated comment. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-14io_uring/sqpoll: use the correct check for pending task_workGravatar Jens Axboe 1-1/+8
A previous commit moved to using just the private task_work list for SQPOLL, but it neglected to update the check for whether we have pending task_work. Normally this is fine as we'll attempt to run it unconditionally, but if we race with going to sleep AND task_work being added, then we certainly need the right check here. Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-14io_uring: wake SQPOLL task when task_work is added to an empty queueGravatar Jens Axboe 1-1/+6
If there's no current work on the list, we still need to potentially wake the SQPOLL task if it is sleeping. This is ordered with the wait queue addition in sqpoll, which adds to the wait queue before checking for pending work items. Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-14io_uring/napi: ensure napi polling is aborted when work is availableGravatar Jens Axboe 3-12/+12
While testing io_uring NAPI with DEFER_TASKRUN, I ran into slowdowns and stalls in packet delivery. Turns out that while io_napi_busy_loop_should_end() aborts appropriately on regular task_work, it does not abort if we have local task_work pending. Move io_has_work() into the private io_uring.h header, and gate whether we should continue polling on that as well. This makes NAPI polling on send/receive work as designed with IORING_SETUP_DEFER_TASKRUN as well. Fixes: 8d0c12a80cde ("io-uring: add napi busy poll support") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-12io_uring: Don't include af_unix.h.Gravatar Kuniyuki Iwashima 4-3/+2
Changes to AF_UNIX trigger rebuild of io_uring, but io_uring does not use AF_UNIX anymore. Let's not include af_unix.h and instead include necessary headers. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240212234236.63714-1-kuniyu@amazon.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-09io_uring: add register/unregister napi functionGravatar Stefan Roesch 3-0/+76
This adds an api to register and unregister the napi for io-uring. If the arg value is specified when unregistering, the current napi setting for the busy poll timeout is copied into the user structure. If this is not required, NULL can be passed as the arg value. Signed-off-by: Stefan Roesch <shr@devkernel.io> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230608163839.2891748-7-shr@devkernel.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-09io-uring: add sqpoll support for napi busy pollGravatar Stefan Roesch 3-1/+33
This adds the sqpoll support to the io-uring napi. Signed-off-by: Stefan Roesch <shr@devkernel.io> Suggested-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/20230608163839.2891748-6-shr@devkernel.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-09io-uring: add napi busy poll supportGravatar Stefan Roesch 6-0/+359
This adds the napi busy polling support in io_uring.c. It adds a new napi_list to the io_ring_ctx structure. This list contains the list of napi_id's that are currently enabled for busy polling. The list is synchronized by the new napi_lock spin lock. The current default napi busy polling time is stored in napi_busy_poll_to. If napi busy polling is not enabled, the value is 0. In addition there is also a hash table. The hash table store the napi id and the pointer to the above list nodes. The hash table is used to speed up the lookup to the list elements. The hash table is synchronized with rcu. The NAPI_TIMEOUT is stored as a timeout to make sure that the time a napi entry is stored in the napi list is limited. The busy poll timeout is also stored as part of the io_wait_queue. This is necessary as for sq polling the poll interval needs to be adjusted and the napi callback allows only to pass in one value. This has been tested with two simple programs from the liburing library repository: the napi client and the napi server program. The client sends a request, which has a timestamp in its payload and the server replies with the same payload. The client calculates the roundtrip time and stores it to calculate the results. The client is running on host1 and the server is running on host 2 (in the same rack). The measured times below are roundtrip times. They are average times over 5 runs each. Each run measures 1 million roundtrips. no rx coal rx coal: frames=88,usecs=33 Default 57us 56us client_poll=100us 47us 46us server_poll=100us 51us 46us client_poll=100us+ 40us 40us server_poll=100us client_poll=100us+ 41us 39us server_poll=100us+ prefer napi busy poll on client client_poll=100us+ 41us 39us server_poll=100us+ prefer napi busy poll on server client_poll=100us+ 41us 39us server_poll=100us+ prefer napi busy poll on client + server Signed-off-by: Stefan Roesch <shr@devkernel.io> Suggested-by: Olivier Langlois <olivier@trillion01.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230608163839.2891748-5-shr@devkernel.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-09io-uring: move io_wait_queue definition to header fileGravatar Stefan Roesch 2-21/+22
This moves the definition of the io_wait_queue structure to the header file so it can be also used from other files. Signed-off-by: Stefan Roesch <shr@devkernel.io> Link: https://lore.kernel.org/r/20230608163839.2891748-4-shr@devkernel.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-09io_uring: add support for ftruncateGravatar Tony Solomonik 4-1/+63
Adds support for doing truncate through io_uring, eliminating the need for applications to roll their own thread pool or offload mechanism to be able to do non-blocking truncates. Signed-off-by: Tony Solomonik <tony.solomonik@gmail.com> Link: https://lore.kernel.org/r/20240202121724.17461-3-tony.solomonik@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: Simplify the allocation of slab cachesGravatar Kunwu Chan 1-3/+2
commit 0a31bd5f2bbb ("KMEM_CACHE(): simplify slab cache creation") introduces a new macro. Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Link: https://lore.kernel.org/r/20240130100247.81460-1-chentao@kylinos.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring/sqpoll: manage task_work privatelyGravatar Jens Axboe 3-17/+82
Decouple from task_work running, and cap the number of entries we process at the time. If we exceed that number, push remaining entries to a retry list that we'll process first next time. We cap the number of entries to process at 8, which is fairly random. We just want to get enough per-ctx batching here, while not processing endlessly. Since we manually run PF_IO_WORKER related task_work anyway as the task never exits to userspace, with this we no longer need to add an actual task_work item to the per-process list. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: pass in counter to handle_tw_list() rather than return itGravatar Jens Axboe 1-5/+3
No functional changes in this patch, just in preparation for returning something other than count from this helper. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: cleanup handle_tw_list() calling conventionGravatar Jens Axboe 1-16/+13
Now that we don't loop around task_work anymore, there's no point in maintaining the ring and locked state outside of handle_tw_list(). Get rid of passing in those pointers (and pointers to pointers) and just do the management internally in handle_tw_list(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring/poll: improve readability of poll reference decrementingGravatar Jens Axboe 1-2/+2
This overly long line is hard to read. Break it up by AND'ing the ref mask first, then perform the atomic_sub_return() with the value itself. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: remove unconditional looping in local task_work handlingGravatar Jens Axboe 1-15/+29
If we have a ton of notifications coming in, we can be looping in here for a long time. This can be problematic for various reasons, mostly because we can starve userspace. If the application is waiting on N events, then only re-run if we need more events. Fixes: c0e0d6ba25f1 ("io_uring: add IORING_SETUP_DEFER_TASKRUN") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: remove next io_kiocb fetch in task_work runningGravatar Jens Axboe 1-3/+0
We just reversed the task_work list and that will have touched requests as well, just get rid of this optimization as it should not make a difference anymore. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: handle traditional task_work in FIFO orderGravatar Jens Axboe 1-1/+1
For local task_work, which is used if IORING_SETUP_DEFER_TASKRUN is set, we reverse the order of the lockless list before processing the work. This is done to process items in the order in which they were queued, as the llist always adds to the head. Do the same for traditional task_work, so we have the same behavior for both types. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: remove 'loops' argument from trace_io_uring_task_work_run()Gravatar Jens Axboe 1-1/+1
We no longer loop in task_work handling, hence delete the argument from the tracepoint as it's always 1 and hence not very informative. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: remove looping around handling traditional task_workGravatar Jens Axboe 1-38/+7
A previous commit added looping around handling traditional task_work as an optimization, and while that may seem like a good idea, it's also possible to run into application starvation doing so. If the task_work generation is bursty, we can get very deep task_work queues, and we can end up looping in here for a very long time. One immediately observable problem with that is handling network traffic using provided buffers, where flooding incoming traffic and looping task_work handling will very quickly lead to buffer starvation as we keep running task_work rather than returning to the application so it can handle the associated CQEs and also provide buffers back. Fixes: 3a0c037b0e16 ("io_uring: batch task_work") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring/kbuf: cleanup passing back cflagsGravatar Jens Axboe 2-24/+31
We have various functions calculating the CQE cflags we need to pass back, but it's all the same everywhere. Make a number of the putting functions void, and just have the two main helps for this, io_put_kbuf() and io_put_kbuf_comp() calculate the actual mask and pass it back. While at it, cleanup how we put REQ_F_BUFFER_RING buffers. Before this change, we would call into __io_put_kbuf() only to go right back in to the header defined functions. As clearing this type of buffer is just re-assigning the buf_index and incrementing the head, this is very wasteful. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring/rw: remove dead file == NULL checkGravatar Jens Axboe 1-1/+1
Any read/write opcode has needs_file == true, which means that we would've failed the request long before reaching the issue stage if we didn't successfully assign a file. This check has been dead forever, and is really a leftover from generic code. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: cleanup io_req_complete_post()Gravatar Jens Axboe 1-4/+4
Move the ctx declaration and assignment up to be generally available in the function, as we use req->ctx at the top anyway. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: mark the need to lock/unlock the ring as unlikelyGravatar Jens Axboe 1-2/+2
Any of the fast paths will already have this locked, this helper only exists to deal with io-wq invoking request issue where we do not have the ctx->uring_lock held already. This means that any common or fast path will already have this locked, mark it as such. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: add io_file_can_poll() helperGravatar Jens Axboe 5-6/+18
This adds a flag to avoid dipping dereferencing file and then f_op to figure out if the file has a poll handler defined or not. We generally call this at least twice for networked workloads, and if using ring provided buffers, we do it on every buffer selection. Particularly the latter is troublesome, as it's otherwise a very fast operation. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring/cancel: don't default to setting req->work.cancel_seqGravatar Jens Axboe 4-8/+12
Just leave it unset by default, avoiding dipping into the last cacheline (which is otherwise untouched) for the fast path of using poll to drive networked traffic. Add a flag that tells us if the sequence is valid or not, and then we can defer actually assigning the flag and sequence until someone runs cancelations. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: expand main struct io_kiocb flags to 64-bitsGravatar Jens Axboe 2-5/+6
We're out of space here, and none of the flags are easily reclaimable. Bump it to 64-bits and re-arrange the struct a bit to avoid gaps. Add a specific bitwise type for the request flags, io_request_flags_t. This will help catch violations of casting this value to a smaller type on 32-bit archs, like unsigned int. This creates a hole in the io_kiocb, so move nr_tw up and rsrc_node down to retain needing only cacheline 0 and 1 for non-polled opcodes. No functional changes intended in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-06io_uring: use file_mnt_idmap helperGravatar Alexander Mikhalitsyn 1-1/+1
Let's use file_mnt_idmap() as we do that across the tree. No functional impact. Cc: Christian Brauner <brauner@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: io-uring@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Link: https://lore.kernel.org/r/20240129180024.219766-2-aleksandr.mikhalitsyn@canonical.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-01io_uring/net: fix sr->len for IORING_OP_RECV with MSG_WAITALL and buffersGravatar Jens Axboe 1-0/+1
If we use IORING_OP_RECV with provided buffers and pass in '0' as the length of the request, the length is retrieved from the selected buffer. If MSG_WAITALL is also set and we get a short receive, then we may hit the retry path which decrements sr->len and increments the buffer for a retry. However, the length is still zero at this point, which means that sr->len now becomes huge and import_ubuf() will cap it to MAX_RW_COUNT and subsequently return -EFAULT for the range as a whole. Fix this by always assigning sr->len once the buffer has been selected. Cc: stable@vger.kernel.org Fixes: 7ba89d2af17a ("io_uring: ensure recv and recvmsg handle MSG_WAITALL correctly") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-29io_uring/net: limit inline multishot retriesGravatar Jens Axboe 1-3/+20
If we have multiple clients and some/all are flooding the receives to such an extent that we can retry a LOT handling multishot receives, then we can be starving some clients and hence serving traffic in an imbalanced fashion. Limit multishot retry attempts to some arbitrary value, whose only purpose serves to ensure that we don't keep serving a single connection for way too long. We default to 32 retries, which should be more than enough to provide fairness, yet not so small that we'll spend too much time requeuing rather than handling traffic. Cc: stable@vger.kernel.org Depends-on: 704ea888d646 ("io_uring/poll: add requeue return code from poll multishot handling") Depends-on: 1e5d765a82f ("io_uring/net: un-indent mshot retry path in io_recv_finish()") Depends-on: e84b01a880f6 ("io_uring/poll: move poll execution helpers higher up") Fixes: b3fdea6ecb55 ("io_uring: multishot recv") Fixes: 9bb66906f23e ("io_uring: support multishot in recvmsg") Link: https://github.com/axboe/liburing/issues/1043 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-29io_uring/poll: add requeue return code from poll multishot handlingGravatar Jens Axboe 2-2/+15
Since our poll handling is edge triggered, multishot handlers retry internally until they know that no more data is available. In preparation for limiting these retries, add an internal return code, IOU_REQUEUE, which can be used to inform the poll backend about the handler wanting to retry, but that this should happen through a normal task_work requeue rather than keep hammering on the issue side for this one request. No functional changes in this patch, nobody is using this return code just yet. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-29io_uring/net: un-indent mshot retry path in io_recv_finish()Gravatar Jens Axboe 1-16/+20
In preparation for putting some retry logic in there, have the done path just skip straight to the end rather than have too much nesting in here. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-29io_uring/poll: move poll execution helpers higher upGravatar Jens Axboe 1-20/+20
In preparation for calling __io_poll_execute() higher up, move the functions to avoid forward declarations. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-28io_uring/rw: ensure poll based multishot read retries appropriatelyGravatar Jens Axboe 2-1/+18
io_read_mshot() always relies on poll triggering retries, and this works fine as long as we do a retry per size of the buffer being read. The buffer size is given by the size of the buffer(s) in the given buffer group ID. But if we're reading less than what is available, then we don't always get to read everything that is available. For example, if the buffers available are 32 bytes and we have 64 bytes to read, then we'll correctly read the first 32 bytes and then wait for another poll trigger before we attempt the next read. This next poll trigger may never happen, in which case we just sit forever and never make progress, or it may trigger at some point in the future, and now we're just delivering the available data much later than we should have. io_read_mshot() could do retries itself, but that is wasteful as we'll be going through all of __io_read() again, and most likely in vain. Rather than do that, bump our poll reference count and have io_poll_check_events() do one more loop and check with vfs_poll() if we have more data to read. If we do, io_read_mshot() will get invoked again directly and we'll read the next chunk. io_poll_multishot_retry() must only get called from inside io_poll_issue(), which is our multishot retry handler, as we know we already "own" the request at this point. Cc: stable@vger.kernel.org Link: https://github.com/axboe/liburing/issues/1041 Fixes: fc68fcda0491 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-23io_uring: enable audit and restrict cred override for IORING_OP_FIXED_FD_INSTALLGravatar Paul Moore 2-1/+4
We need to correct some aspects of the IORING_OP_FIXED_FD_INSTALL command to take into account the security implications of making an io_uring-private file descriptor generally accessible to a userspace task. The first change in this patch is to enable auditing of the FD_INSTALL operation as installing a file descriptor into a task's file descriptor table is a security relevant operation and something that admins/users may want to audit. The second change is to disable the io_uring credential override functionality, also known as io_uring "personalities", in the FD_INSTALL command. The credential override in FD_INSTALL is particularly problematic as it affects the credentials used in the security_file_receive() LSM hook. If a task were to request a credential override via REQ_F_CREDS on a FD_INSTALL operation, the LSM would incorrectly check to see if the overridden credentials of the io_uring were able to "receive" the file as opposed to the task's credentials. After discussions upstream, it's difficult to imagine a use case where we would want to allow a credential override on a FD_INSTALL operation so we are simply going to block REQ_F_CREDS on IORING_OP_FIXED_FD_INSTALL operations. Fixes: dc18b89ab113 ("io_uring/openclose: add support for IORING_OP_FIXED_FD_INSTALL") Signed-off-by: Paul Moore <paul@paul-moore.com> Link: https://lore.kernel.org/r/20240123215501.289566-2-paul@paul-moore.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-18Merge tag 'for-6.8/io_uring-2024-01-18' of git://git.kernel.dk/linuxGravatar Linus Torvalds 4-47/+86
Pull io_uring fixes from Jens Axboe: "Nothing major in here, just a few fixes and cleanups that arrived after the initial merge window pull request got finalized, as well as a fix for a patch that got merged earlier" * tag 'for-6.8/io_uring-2024-01-18' of git://git.kernel.dk/linux: io_uring: combine cq_wait_nr checks io_uring: clean *local_work_add var naming io_uring: clean up local tw add-wait sync io_uring: adjust defer tw counting io_uring/register: guard compat syscall with CONFIG_COMPAT io_uring/rsrc: improve code generation for fixed file assignment io_uring/rw: cleanup io_rw_done()
2024-01-17Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmGravatar Linus Torvalds 1-1/+2
Pull kvm updates from Paolo Bonzini: "Generic: - Use memdup_array_user() to harden against overflow. - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures. - Clean up Kconfigs that all KVM architectures were selecting - New functionality around "guest_memfd", a new userspace API that creates an anonymous file and returns a file descriptor that refers to it. guest_memfd files are bound to their owning virtual machine, cannot be mapped, read, or written by userspace, and cannot be resized. guest_memfd files do however support PUNCH_HOLE, which can be used to switch a memory area between guest_memfd and regular anonymous memory. - New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify per-page attributes for a given page of guest memory; right now the only attribute is whether the guest expects to access memory via guest_memfd or not, which in Confidential SVMs backed by SEV-SNP, TDX or ARM64 pKVM is checked by firmware or hypervisor that guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in the case of pKVM). x86: - Support for "software-protected VMs" that can use the new guest_memfd and page attributes infrastructure. This is mostly useful for testing, since there is no pKVM-like infrastructure to provide a meaningfully reduced TCB. - Fix a relatively benign off-by-one error when splitting huge pages during CLEAR_DIRTY_LOG. - Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE. - Use more generic lockdep assertions in paths that don't actually care about whether the caller is a reader or a writer. - let Xen guests opt out of having PV clock reported as "based on a stable TSC", because some of them don't expect the "TSC stable" bit (added to the pvclock ABI by KVM, but never set by Xen) to be set. - Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL. - Advertise flush-by-ASID support for nSVM unconditionally, as KVM always flushes on nested transitions, i.e. always satisfies flush requests. This allows running bleeding edge versions of VMware Workstation on top of KVM. - Sanity check that the CPU supports flush-by-ASID when enabling SEV support. - On AMD machines with vNMI, always rely on hardware instead of intercepting IRET in some cases to detect unmasking of NMIs - Support for virtualizing Linear Address Masking (LAM) - Fix a variety of vPMU bugs where KVM fail to stop/reset counters and other state prior to refreshing the vPMU model. - Fix a double-overflow PMU bug by tracking emulated counter events using a dedicated field instead of snapshotting the "previous" counter. If the hardware PMC count triggers overflow that is recognized in the same VM-Exit that KVM manually bumps an event count, KVM would pend PMIs for both the hardware-triggered overflow and for KVM-triggered overflow. - Turn off KVM_WERROR by default for all configs so that it's not inadvertantly enabled by non-KVM developers, which can be problematic for subsystems that require no regressions for W=1 builds. - Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL "features". - Don't force a masterclock update when a vCPU synchronizes to the current TSC generation, as updating the masterclock can cause kvmclock's time to "jump" unexpectedly, e.g. when userspace hotplugs a pre-created vCPU. - Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths, partly as a super minor optimization, but mostly to make KVM play nice with position independent executable builds. - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on CONFIG_HYPERV as a minor optimization, and to self-document the code. - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation" at build time. ARM64: - LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base granule sizes. Branch shared with the arm64 tree. - Large Fine-Grained Trap rework, bringing some sanity to the feature, although there is more to come. This comes with a prefix branch shared with the arm64 tree. - Some additional Nested Virtualization groundwork, mostly introducing the NV2 VNCR support and retargetting the NV support to that version of the architecture. - A small set of vgic fixes and associated cleanups. Loongarch: - Optimization for memslot hugepage checking - Cleanup and fix some HW/SW timer issues - Add LSX/LASX (128bit/256bit SIMD) support RISC-V: - KVM_GET_REG_LIST improvement for vector registers - Generate ISA extension reg_list using macros in get-reg-list selftest - Support for reporting steal time along with selftest s390: - Bugfixes Selftests: - Fix an annoying goof where the NX hugepage test prints out garbage instead of the magic token needed to run the test. - Fix build errors when a header is delete/moved due to a missing flag in the Makefile. - Detect if KVM bugged/killed a selftest's VM and print out a helpful message instead of complaining that a random ioctl() failed. - Annotate the guest printf/assert helpers with __printf(), and fix the various bugs that were lurking due to lack of said annotation" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits) x86/kvm: Do not try to disable kvmclock if it was not enabled KVM: x86: add missing "depends on KVM" KVM: fix direction of dependency on MMU notifiers KVM: introduce CONFIG_KVM_COMMON KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache RISC-V: KVM: selftests: Add get-reg-list test for STA registers RISC-V: KVM: selftests: Add steal_time test support RISC-V: KVM: selftests: Add guest_sbi_probe_extension RISC-V: KVM: selftests: Move sbi_ecall to processor.c RISC-V: KVM: Implement SBI STA extension RISC-V: KVM: Add support for SBI STA registers RISC-V: KVM: Add support for SBI extension registers RISC-V: KVM: Add SBI STA info to vcpu_arch RISC-V: KVM: Add steal-update vcpu request RISC-V: KVM: Add SBI STA extension skeleton RISC-V: paravirt: Implement steal-time support RISC-V: Add SBI STA extension definitions RISC-V: paravirt: Add skeleton for pv-time support RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr() ...
2024-01-17io_uring: combine cq_wait_nr checksGravatar Pavel Begunkov 1-7/+27
Instead of explicitly checking ->cq_wait_nr for whether there are waiting, which is currently represented by 0, we can store there a large value and the nr_tw will automatically filter out those cases. Add a named constant for that and for the wake up bias value. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/38def30282654d980673976cd42fde9bab19b297.1705438669.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-17io_uring: clean *local_work_add var namingGravatar Pavel Begunkov 1-7/+7
if (!first) { ... } While it reads as do something if it's not the first entry, it does exactly the opposite because "first" here is a pointer to the first entry. Remove the confusion by renaming it into "head". Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3b8be483b52f58a524185bb88694b8a268e7e85d.1705438669.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-17io_uring: clean up local tw add-wait syncGravatar Pavel Begunkov 1-2/+8
Kill a smp_mb__after_atomic() right before wake_up, it's useless, and add a comment explaining implicit barriers from cmpxchg and synchronsation around ->cq_wait_nr with the waiter. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3007f3c2d53c72b61de56919ef56b53158b8276f.1705438669.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-17io_uring: adjust defer tw countingGravatar Pavel Begunkov 1-1/+1
The UINT_MAX work item counting bias in io_req_local_work_add() in case of !IOU_F_TWQ_LAZY_WAKE works in a sense that we will not miss a wake up, however it's still eerie. In particular, if we add a lazy work item after a non-lazy one, we'll increment it and get nr_tw==0, and subsequent adds may try to unnecessarily wake up the task, which is though not so likely to happen in real workloads. Half the bias, it's still large enough to be larger than any valid ->cq_wait_nr, which is limited by IORING_MAX_CQ_ENTRIES, but further have a good enough of space before it overflows. Fixes: 8751d15426a31 ("io_uring: reduce scheduling due to tw") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/108b971e958deaf7048342930c341ba90f75d806.1705438669.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-17io_uring/register: guard compat syscall with CONFIG_COMPATGravatar Jens Axboe 1-3/+5
Add compat.h include to avoid a potential build issue: io_uring/register.c:281:6: error: call to undeclared function 'in_compat_syscall'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration] if (in_compat_syscall()) { ^ 1 warning generated. io_uring/register.c:282:9: error: call to undeclared function 'compat_get_bitmap'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration] ret = compat_get_bitmap(cpumask_bits(new_mask), ^ Fixes: c43203154d8a ("io_uring/register: move io_uring_register(2) related code to register.c") Reported-by: Manu Bretelle <chantra@meta.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-11Merge tag 'for-6.8/io_uring-2024-01-08' of git://git.kernel.dk/linuxGravatar Linus Torvalds 15-850/+752
Pull io_uring updates from Jens Axboe: "Mostly just come fixes and cleanups, but one feature as well. In detail: - Harden the check for handling IOPOLL based on return (Pavel) - Various minor optimizations (Pavel) - Drop remnants of SCM_RIGHTS fd passing support, now that it's no longer supported since 6.7 (me) - Fix for a case where bytes_done wasn't initialized properly on a failure condition for read/write requests (me) - Move the register related code to a separate file (me) - Add support for returning the provided ring buffer head (me) - Add support for adding a direct descriptor to the normal file table (me, Christian Brauner) - Fix for ensuring pending task_work for a ring with DEFER_TASKRUN is run even if we timeout waiting (me)" * tag 'for-6.8/io_uring-2024-01-08' of git://git.kernel.dk/linux: io_uring: ensure local task_work is run on wait timeout io_uring/kbuf: add method for returning provided buffer ring head io_uring/rw: ensure io->bytes_done is always initialized io_uring: drop any code related to SCM_RIGHTS io_uring/unix: drop usage of io_uring socket io_uring/register: move io_uring_register(2) related code to register.c io_uring/openclose: add support for IORING_OP_FIXED_FD_INSTALL io_uring/cmd: inline io_uring_cmd_get_task io_uring/cmd: inline io_uring_cmd_do_in_task_lazy io_uring: split out cmd api into a separate header io_uring: optimise ltimeout for inline execution io_uring: don't check iopoll if request completes