aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-03-08io_uring: refactor DEFER_TASKRUN multishot checksGravatar Pavel Begunkov 3-23/+20
We disallow DEFER_TASKRUN multishots from running by io-wq, which is checked by individual opcodes in the issue path. We can consolidate all it in io_wq_submit_work() at the same time moving the checks out of the hot path. Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e492f0f11588bb5aa11d7d24e6f53b7c7628afdb.1709905727.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08io_uring: fix mshot io-wq checksGravatar Pavel Begunkov 1-1/+1
When checking for concurrent CQE posting, we're not only interested in requests running from the poll handler but also strayed requests ended up in normal io-wq execution. We're disallowing multishots in general from io-wq, not only when they came in a certain way. Cc: stable@vger.kernel.org Fixes: 17add5cea2bba ("io_uring: force multishot CQEs into task context") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d8c5b36a39258036f93301cd60d3cd295e40653d.1709905727.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08io_uring/net: add io_req_msg_cleanup() helperGravatar Jens Axboe 1-12/+15
For the fast inline path, we manually recycle the io_async_msghdr and free the iovec, and then clear the REQ_F_NEED_CLEANUP flag to avoid that needing doing in the slower path. We already do that in 2 spots, and in preparation for adding more, add a helper and use it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08io_uring/net: simplify msghd->msg_inq checkingGravatar Jens Axboe 1-2/+2
Just check for larger than zero rather than check for non-zero and not -1. This is easier to read, and also protects against any errants < 0 values that aren't -1. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08io_uring/kbuf: rename REQ_F_PARTIAL_IO to REQ_F_BL_NO_RECYCLEGravatar Jens Axboe 5-35/+16
We only use the flag for this purpose, so rename it accordingly. This further prevents various other use cases of it, keeping it clean and consistent. Then we can also check it in one spot, when it's being attempted recycled, and remove some dead code in io_kbuf_recycle_ring(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_ioGravatar Jens Axboe 1-5/+7
Ensure that prep handlers always initialize sr->done_io before any potential failure conditions, and with that, we now it's always been set even for the failure case. With that, we don't need to use the REQ_F_PARTIAL_IO flag to gate on that. Additionally, we should not overwrite req->cqe.res unless sr->done_io is actually positive. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-07io_uring/net: correctly handle multishot recvmsg retry setupGravatar Jens Axboe 1-1/+2
If we loop for multishot receive on the initial attempt, and then abort later on to wait for more, we miss a case where we should be copying the io_async_msghdr from the stack to stable storage. This leads to the next retry potentially failing, if the application had the msghdr on the stack. Cc: stable@vger.kernel.org Fixes: 9bb66906f23e ("io_uring: support multishot in recvmsg") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-07io_uring/net: clear REQ_F_BL_EMPTY in the multishot retry handlerGravatar Jens Axboe 1-0/+1
This flag should not be persistent across retries, so ensure we clear it before potentially attemting a retry. Fixes: c3f9109dbc9e ("io_uring/kbuf: flag request if buffer pool is empty after buffer pick") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-07io_uring: fix io_queue_proc modifying req->flagsGravatar Pavel Begunkov 1-8/+11
With multiple poll entries __io_queue_proc() might be running in parallel with poll handlers and possibly task_work, we should not be carelessly modifying req->flags there. io_poll_double_prepare() handles a similar case with locking but it's much easier to move it into __io_arm_poll_handler(). Cc: stable@vger.kernel.org Fixes: 595e52284d24a ("io_uring/poll: don't enable lazy wake for POLLEXCLUSIVE") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/455cc49e38cf32026fa1b49670be8c162c2cb583.1709834755.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-07io_uring: fix mshot read defer taskrun cqe postingGravatar Pavel Begunkov 1-0/+2
We can't post CQEs from io-wq with DEFER_TASKRUN set, normal completions are handled but aux should be explicitly disallowed by opcode handlers. Cc: stable@vger.kernel.org Fixes: fc68fcda04910 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6fb7cba6f5366da25f4d3eb95273f062309d97fa.1709740837.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-04io_uring/net: fix overflow check in io_recvmsg_mshot_prep()Gravatar Dan Carpenter 1-2/+2
The "controllen" variable is type size_t (unsigned long). Casting it to int could lead to an integer underflow. The check_add_overflow() function considers the type of the destination which is type int. If we add two positive values and the result cannot fit in an integer then that's counted as an overflow. However, if we cast "controllen" to an int and it turns negative, then negative values *can* fit into an int type so there is no overflow. Good: 100 + (unsigned long)-4 = 96 <-- overflow Bad: 100 + (int)-4 = 96 <-- no overflow I deleted the cast of the sizeof() as well. That's not a bug but the cast is unnecessary. Fixes: 9b0fc3c054ff ("io_uring: fix types in io_recvmsg_multishot_overflow") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/138bd2e2-ede8-4bcc-aa7b-f3d9de167a37@moroto.mountain Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-04io_uring/net: correct the type of variableGravatar Muhammad Usama Anjum 1-1/+1
The namelen is of type int. It shouldn't be made size_t which is unsigned. The signed number is needed for error checking before use. Fixes: c55978024d12 ("io_uring/net: move receive multishot out of the generic msghdr path") Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Link: https://lore.kernel.org/r/20240301144349.2807544-1-usama.anjum@collabora.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-01io_uring/sqpoll: statistics of the true utilization of sq threadsGravatar Xiaobing Li 3-1/+24
Count the running time and actual IO processing time of the sqpoll thread, and output the statistical data to fdinfo. Variable description: "work_time" in the code represents the sum of the jiffies of the sq thread actually processing IO, that is, how many milliseconds it actually takes to process IO. "total_time" represents the total time that the sq thread has elapsed from the beginning of the loop to the current time point, that is, how many milliseconds it has spent in total. The test tool is fio, and its parameters are as follows: [global] ioengine=io_uring direct=1 group_reporting bs=128k norandommap=1 randrepeat=0 refill_buffers ramp_time=30s time_based runtime=1m clocksource=clock_gettime overwrite=1 log_avg_msec=1000 numjobs=1 [disk0] filename=/dev/nvme0n1 rw=read iodepth=16 hipri sqthread_poll=1 The test results are as follows: Every 2.0s: cat /proc/9230/fdinfo/6 | grep -E Sq SqMask: 0x3 SqHead: 3197153 SqTail: 3197153 CachedSqHead: 3197153 SqThread: 9231 SqThreadCpu: 11 SqTotalTime: 18099614 SqWorkTime: 16748316 The test results corresponding to different iodepths are as follows: |-----------|-------|-------|-------|------|-------| | iodepth | 1 | 4 | 8 | 16 | 64 | |-----------|-------|-------|-------|------|-------| |utilization| 2.9% | 8.8% | 10.9% | 92.9%| 84.4% | |-----------|-------|-------|-------|------|-------| | idle | 97.1% | 91.2% | 89.1% | 7.1% | 15.6% | |-----------|-------|-------|-------|------|-------| Signed-off-by: Xiaobing Li <xiaobing.li@samsung.com> Link: https://lore.kernel.org/r/20240228091251.543383-1-xiaobing.li@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-01io_uring/net: move recv/recvmsg flags out of retry loopGravatar Jens Axboe 1-7/+8
The flags don't change, just intialize them once rather than every loop for multishot. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27io_uring/kbuf: flag request if buffer pool is empty after buffer pickGravatar Jens Axboe 2-2/+11
Normally we do an extra roundtrip for retries even if the buffer pool has depleted, as we don't check that upfront. Rather than add this check, have the buffer selection methods mark the request with REQ_F_BL_EMPTY if the used buffer group is out of buffers after this selection. This is very cheap to do once we're all the way inside there anyway, and it gives the caller a chance to make better decisions on how to proceed. For example, recv/recvmsg multishot could check this flag when it decides whether to keep receiving or not. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27io_uring/net: improve the usercopy for sendmsg/recvmsgGravatar Jens Axboe 1-7/+22
We're spending a considerable amount of the sendmsg/recvmsg time just copying in the message header. And for provided buffers, the known single entry iovec. Be a bit smarter about it and enable/disable user access around our copying. In a test case that does both sendmsg and recvmsg, the runtime before this change (averaged over multiple runs, very stable times however): Kernel Time Diff ==================================== -git 4720 usec -git+commit 4311 usec -8.7% and looking at a profile diff, we see the following: 0.25% +9.33% [kernel.kallsyms] [k] _copy_from_user 4.47% -3.32% [kernel.kallsyms] [k] __io_msg_copy_hdr.constprop.0 where we drop more than 9% of _copy_from_user() time, and consequently add time to __io_msg_copy_hdr() where the copies are now attributed to, but with a net win of 6%. In comparison, the same test case with send/recv runs in 3745 usec, which is (expectedly) still quite a bit faster. But at least sendmsg/recvmsg is now only ~13% slower, where it was ~21% slower before. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27io_uring/net: move receive multishot out of the generic msghdr pathGravatar Jens Axboe 1-70/+91
Move the actual user_msghdr / compat_msghdr into the send and receive sides, respectively, so we can move the uaddr receive handling into its own handler, and ditto the multishot with buffer selection logic. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27io_uring/net: unify how recvmsg and sendmsg copy in the msghdrGravatar Jens Axboe 1-129/+142
For recvmsg, we roll our own since we support buffer selections. This isn't the case for sendmsg right now, but in preparation for doing so, make the recvmsg copy helpers generic so we can call them from the sendmsg side as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-15io_uring/napi: enable even with a timeout of 0Gravatar Jens Axboe 2-4/+6
1 usec is not as short as it used to be, and it makes sense to allow 0 for a busy poll timeout - this means just do one loop to check if we have anything available. Add a separate ->napi_enabled to check if napi has been enabled or not. While at it, move the writing of the ctx napi values after we've copied the old values back to userspace. This ensures that if the call fails, we'll be in the same state as we were before, rather than some indeterminate state. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-15io_uring: kill stale comment for io_cqring_overflow_kill()Gravatar Jens Axboe 1-1/+0
This function now deals only with discarding overflow entries on ring free and exit, and it no longer returns whether we successfully flushed all entries as there's no CQE posting involved anymore. Kill the outdated comment. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-14io_uring/sqpoll: use the correct check for pending task_workGravatar Jens Axboe 1-1/+8
A previous commit moved to using just the private task_work list for SQPOLL, but it neglected to update the check for whether we have pending task_work. Normally this is fine as we'll attempt to run it unconditionally, but if we race with going to sleep AND task_work being added, then we certainly need the right check here. Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-14io_uring: wake SQPOLL task when task_work is added to an empty queueGravatar Jens Axboe 1-1/+6
If there's no current work on the list, we still need to potentially wake the SQPOLL task if it is sleeping. This is ordered with the wait queue addition in sqpoll, which adds to the wait queue before checking for pending work items. Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-14io_uring/napi: ensure napi polling is aborted when work is availableGravatar Jens Axboe 3-12/+12
While testing io_uring NAPI with DEFER_TASKRUN, I ran into slowdowns and stalls in packet delivery. Turns out that while io_napi_busy_loop_should_end() aborts appropriately on regular task_work, it does not abort if we have local task_work pending. Move io_has_work() into the private io_uring.h header, and gate whether we should continue polling on that as well. This makes NAPI polling on send/receive work as designed with IORING_SETUP_DEFER_TASKRUN as well. Fixes: 8d0c12a80cde ("io-uring: add napi busy poll support") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-12io_uring: Don't include af_unix.h.Gravatar Kuniyuki Iwashima 4-3/+2
Changes to AF_UNIX trigger rebuild of io_uring, but io_uring does not use AF_UNIX anymore. Let's not include af_unix.h and instead include necessary headers. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240212234236.63714-1-kuniyu@amazon.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-09io_uring: add register/unregister napi functionGravatar Stefan Roesch 4-0/+88
This adds an api to register and unregister the napi for io-uring. If the arg value is specified when unregistering, the current napi setting for the busy poll timeout is copied into the user structure. If this is not required, NULL can be passed as the arg value. Signed-off-by: Stefan Roesch <shr@devkernel.io> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230608163839.2891748-7-shr@devkernel.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-09io-uring: add sqpoll support for napi busy pollGravatar Stefan Roesch 3-1/+33
This adds the sqpoll support to the io-uring napi. Signed-off-by: Stefan Roesch <shr@devkernel.io> Suggested-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/20230608163839.2891748-6-shr@devkernel.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-09io-uring: add napi busy poll supportGravatar Stefan Roesch 7-1/+373
This adds the napi busy polling support in io_uring.c. It adds a new napi_list to the io_ring_ctx structure. This list contains the list of napi_id's that are currently enabled for busy polling. The list is synchronized by the new napi_lock spin lock. The current default napi busy polling time is stored in napi_busy_poll_to. If napi busy polling is not enabled, the value is 0. In addition there is also a hash table. The hash table store the napi id and the pointer to the above list nodes. The hash table is used to speed up the lookup to the list elements. The hash table is synchronized with rcu. The NAPI_TIMEOUT is stored as a timeout to make sure that the time a napi entry is stored in the napi list is limited. The busy poll timeout is also stored as part of the io_wait_queue. This is necessary as for sq polling the poll interval needs to be adjusted and the napi callback allows only to pass in one value. This has been tested with two simple programs from the liburing library repository: the napi client and the napi server program. The client sends a request, which has a timestamp in its payload and the server replies with the same payload. The client calculates the roundtrip time and stores it to calculate the results. The client is running on host1 and the server is running on host 2 (in the same rack). The measured times below are roundtrip times. They are average times over 5 runs each. Each run measures 1 million roundtrips. no rx coal rx coal: frames=88,usecs=33 Default 57us 56us client_poll=100us 47us 46us server_poll=100us 51us 46us client_poll=100us+ 40us 40us server_poll=100us client_poll=100us+ 41us 39us server_poll=100us+ prefer napi busy poll on client client_poll=100us+ 41us 39us server_poll=100us+ prefer napi busy poll on server client_poll=100us+ 41us 39us server_poll=100us+ prefer napi busy poll on client + server Signed-off-by: Stefan Roesch <shr@devkernel.io> Suggested-by: Olivier Langlois <olivier@trillion01.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230608163839.2891748-5-shr@devkernel.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-09io-uring: move io_wait_queue definition to header fileGravatar Stefan Roesch 2-21/+22
This moves the definition of the io_wait_queue structure to the header file so it can be also used from other files. Signed-off-by: Stefan Roesch <shr@devkernel.io> Link: https://lore.kernel.org/r/20230608163839.2891748-4-shr@devkernel.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-09Merge branch 'for-io_uring-add-napi-busy-polling-support' of ↵Gravatar Jens Axboe 2-14/+47
git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux into for-6.9/io_uring Pull netdev side of the io_uring napi support. * 'for-io_uring-add-napi-busy-polling-support' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux: net: add napi_busy_loop_rcu() net: split off __napi_busy_poll from napi_busy_poll
2024-02-09net: add napi_busy_loop_rcu()Gravatar Stefan Roesch 2-0/+19
This adds the napi_busy_loop_rcu() function. This function assumes that the calling function is already holding the rcu read lock and napi_busy_loop() does not need to take the rcu read lock. Add a NAPI_F_NO_SCHED flag, which tells __napi_busy_loop() to abort if we need to reschedule rather than drop the RCU read lock and reschedule. Signed-off-by: Stefan Roesch <shr@devkernel.io> Link: https://lore.kernel.org/r/20230608163839.2891748-3-shr@devkernel.io Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-09net: split off __napi_busy_poll from napi_busy_pollGravatar Stefan Roesch 1-14/+28
This splits off the key part of the napi_busy_poll function into its own function, __napi_busy_poll, and changes the prefer_busy_poll bool to be flag based to allow passing in more flags in the future. This is done in preparation for an additional napi_busy_poll() function, that doesn't take the rcu_read_lock(). The new function is introduced in the next patch. Signed-off-by: Stefan Roesch <shr@devkernel.io> Link: https://lore.kernel.org/r/20230608163839.2891748-2-shr@devkernel.io Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-09io_uring: add support for ftruncateGravatar Tony Solomonik 5-1/+64
Adds support for doing truncate through io_uring, eliminating the need for applications to roll their own thread pool or offload mechanism to be able to do non-blocking truncates. Signed-off-by: Tony Solomonik <tony.solomonik@gmail.com> Link: https://lore.kernel.org/r/20240202121724.17461-3-tony.solomonik@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-09Add do_ftruncate that truncates a struct fileGravatar Tony Solomonik 2-25/+29
do_sys_ftruncate receives a file descriptor, fgets the struct file, and finally actually truncates the file. do_ftruncate allows for passing in a file directly, with the caller already holding a reference to it. Signed-off-by: Tony Solomonik <tony.solomonik@gmail.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20240202121724.17461-2-tony.solomonik@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: Simplify the allocation of slab cachesGravatar Kunwu Chan 1-3/+2
commit 0a31bd5f2bbb ("KMEM_CACHE(): simplify slab cache creation") introduces a new macro. Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Link: https://lore.kernel.org/r/20240130100247.81460-1-chentao@kylinos.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: re-arrange struct io_ring_ctx to reduce paddingGravatar Jens Axboe 1-15/+16
Nothing major here, just moving a few things around to reduce the padding. This reduces the size on a non-debug kernel from 1536 to 1472 bytes, saving a full cacheline. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring/sqpoll: manage task_work privatelyGravatar Jens Axboe 3-17/+82
Decouple from task_work running, and cap the number of entries we process at the time. If we exceed that number, push remaining entries to a retry list that we'll process first next time. We cap the number of entries to process at 8, which is fairly random. We just want to get enough per-ctx batching here, while not processing endlessly. Since we manually run PF_IO_WORKER related task_work anyway as the task never exits to userspace, with this we no longer need to add an actual task_work item to the per-process list. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: pass in counter to handle_tw_list() rather than return itGravatar Jens Axboe 1-5/+3
No functional changes in this patch, just in preparation for returning something other than count from this helper. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: cleanup handle_tw_list() calling conventionGravatar Jens Axboe 1-16/+13
Now that we don't loop around task_work anymore, there's no point in maintaining the ring and locked state outside of handle_tw_list(). Get rid of passing in those pointers (and pointers to pointers) and just do the management internally in handle_tw_list(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring/poll: improve readability of poll reference decrementingGravatar Jens Axboe 1-2/+2
This overly long line is hard to read. Break it up by AND'ing the ref mask first, then perform the atomic_sub_return() with the value itself. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: remove unconditional looping in local task_work handlingGravatar Jens Axboe 1-15/+29
If we have a ton of notifications coming in, we can be looping in here for a long time. This can be problematic for various reasons, mostly because we can starve userspace. If the application is waiting on N events, then only re-run if we need more events. Fixes: c0e0d6ba25f1 ("io_uring: add IORING_SETUP_DEFER_TASKRUN") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: remove next io_kiocb fetch in task_work runningGravatar Jens Axboe 1-3/+0
We just reversed the task_work list and that will have touched requests as well, just get rid of this optimization as it should not make a difference anymore. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: handle traditional task_work in FIFO orderGravatar Jens Axboe 1-1/+1
For local task_work, which is used if IORING_SETUP_DEFER_TASKRUN is set, we reverse the order of the lockless list before processing the work. This is done to process items in the order in which they were queued, as the llist always adds to the head. Do the same for traditional task_work, so we have the same behavior for both types. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: remove 'loops' argument from trace_io_uring_task_work_run()Gravatar Jens Axboe 2-8/+4
We no longer loop in task_work handling, hence delete the argument from the tracepoint as it's always 1 and hence not very informative. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: remove looping around handling traditional task_workGravatar Jens Axboe 1-38/+7
A previous commit added looping around handling traditional task_work as an optimization, and while that may seem like a good idea, it's also possible to run into application starvation doing so. If the task_work generation is bursty, we can get very deep task_work queues, and we can end up looping in here for a very long time. One immediately observable problem with that is handling network traffic using provided buffers, where flooding incoming traffic and looping task_work handling will very quickly lead to buffer starvation as we keep running task_work rather than returning to the application so it can handle the associated CQEs and also provide buffers back. Fixes: 3a0c037b0e16 ("io_uring: batch task_work") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring/kbuf: cleanup passing back cflagsGravatar Jens Axboe 2-24/+31
We have various functions calculating the CQE cflags we need to pass back, but it's all the same everywhere. Make a number of the putting functions void, and just have the two main helps for this, io_put_kbuf() and io_put_kbuf_comp() calculate the actual mask and pass it back. While at it, cleanup how we put REQ_F_BUFFER_RING buffers. Before this change, we would call into __io_put_kbuf() only to go right back in to the header defined functions. As clearing this type of buffer is just re-assigning the buf_index and incrementing the head, this is very wasteful. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring/rw: remove dead file == NULL checkGravatar Jens Axboe 1-1/+1
Any read/write opcode has needs_file == true, which means that we would've failed the request long before reaching the issue stage if we didn't successfully assign a file. This check has been dead forever, and is really a leftover from generic code. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: cleanup io_req_complete_post()Gravatar Jens Axboe 1-4/+4
Move the ctx declaration and assignment up to be generally available in the function, as we use req->ctx at the top anyway. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: mark the need to lock/unlock the ring as unlikelyGravatar Jens Axboe 1-2/+2
Any of the fast paths will already have this locked, this helper only exists to deal with io-wq invoking request issue where we do not have the ctx->uring_lock held already. This means that any common or fast path will already have this locked, mark it as such. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: add io_file_can_poll() helperGravatar Jens Axboe 6-6/+21
This adds a flag to avoid dipping dereferencing file and then f_op to figure out if the file has a poll handler defined or not. We generally call this at least twice for networked workloads, and if using ring provided buffers, we do it on every buffer selection. Particularly the latter is troublesome, as it's otherwise a very fast operation. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring/cancel: don't default to setting req->work.cancel_seqGravatar Jens Axboe 5-8/+15
Just leave it unset by default, avoiding dipping into the last cacheline (which is otherwise untouched) for the fast path of using poll to drive networked traffic. Add a flag that tells us if the sequence is valid or not, and then we can defer actually assigning the flag and sequence until someone runs cancelations. Signed-off-by: Jens Axboe <axboe@kernel.dk>