path: root/io_uring/io_uring.h
2023-11-01  Merge tag 'for-6.7/io_uring-2023-10-30' of git://git.kernel.dk/linux  (Linus Torvalds; 1 file, -0/+1)

Pull io_uring updates from Jens Axboe:
 "This contains the core io_uring updates, of which there are not many, and adds support for using WAITID through io_uring and hence not needing to block on these kinds of events. Outside of that, tweaks to the legacy provided buffer handling and some cleanups related to cancelations for uring_cmd support"

* tag 'for-6.7/io_uring-2023-10-30' of git://git.kernel.dk/linux:
  io_uring/poll: use IOU_F_TWQ_LAZY_WAKE for wakeups
  io_uring/kbuf: Use slab for struct io_buffer objects
  io_uring/kbuf: Allow the full buffer id space for provided buffers
  io_uring/kbuf: Fix check of BID wrapping in provided buffers
  io_uring/rsrc: cleanup io_pin_pages()
  io_uring: cancelable uring_cmd
  io_uring: retain top 8bits of uring_cmd flags for kernel internal use
  io_uring: add IORING_OP_WAITID support
  exit: add internal include file with helpers
  exit: add kernel_waitid_prepare() helper
  exit: move core of do_wait() into helper
  exit: abstract out should_wake helper for child_wait_callback()
  io_uring/rw: add support for IORING_OP_READ_MULTISHOT
  io_uring/rw: mark readv/writev as vectored in the opcode definition
  io_uring/rw: split io_read() into a helper

2023-10-05  io_uring/kbuf: Use slab for struct io_buffer objects  (Gabriel Krisman Bertazi; 1 file, -0/+1)

The allocation of struct io_buffer for metadata of provided buffers is done through a custom allocator that directly gets pages and fragments them. But slab would do just fine, as this is not a hot path (in fact, it is a deprecated feature) and, by keeping a custom allocator implementation, we lose benefits like tracking, poisoning and sanitizers. Finally, the custom code is more complex and requires keeping the list of pages in struct ctx for no good reason.

This patch cleans this path up and just uses slab.

I microbenchmarked it by forcing the allocation of a large number of objects with the least number of io_uring commands possible (keeping nbufs=USHRT_MAX), with and without the patch. There is a slight increase in time spent in the allocation with slab, of course, but even when allocating to the point of system resource exhaustion, which is not very realistic and happened around 1/2 billion provided buffers for me, it wasn't a significant hit in system time. Especially if we think of a real-world scenario: an application doing register/unregister of provided buffers will hit ctx->io_buffers_cache more often than actually going to slab.

Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20231005000531.30800-4-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-10-03  io_uring: ensure io_lockdep_assert_cq_locked() handles disabled rings  (Jens Axboe; 1 file, -14/+27)

io_lockdep_assert_cq_locked() checks that locking is correctly done when a CQE is posted. If the ring is set up in a disabled state with IORING_SETUP_R_DISABLED, then ctx->submitter_task isn't assigned until the ring is later enabled. We generally don't post CQEs in this state, as no SQEs can be submitted. However, it is possible to generate a CQE if tagged resources are being updated. If this happens and PROVE_LOCKING is enabled, then the locking check helper will dereference ctx->submitter_task, which hasn't been set yet.

Fix up io_lockdep_assert_cq_locked() to handle this case correctly. While at it, convert it to a static inline as well, so that generated line offsets will actually reflect which condition failed, rather than just the line offset for io_lockdep_assert_cq_locked() itself.

Reported-and-tested-by: syzbot+efc45d4e7ba6ab4ef1eb@syzkaller.appspotmail.com
Fixes: f26cc9593581 ("io_uring: lockdep annotate CQ locking")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

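[Editor's sketch] A simplified picture of the resulting static inline helper; the invariants follow the commit text (IOPOLL rings under ->uring_lock, non-task_complete rings under ->completion_lock), while the exact branch structure is an assumption:

    static inline void io_lockdep_assert_cq_locked(struct io_ring_ctx *ctx)
    {
    #if defined(CONFIG_PROVE_LOCKING)
            lockdep_assert(in_task());

            if (ctx->flags & IORING_SETUP_IOPOLL) {
                    lockdep_assert_held(&ctx->uring_lock);
            } else if (!ctx->task_complete) {
                    lockdep_assert_held(&ctx->completion_lock);
            } else if (ctx->submitter_task) {
                    /*
                     * ->submitter_task may be NULL for IORING_SETUP_R_DISABLED
                     * rings that were never enabled; tagged resource updates
                     * can still post CQEs then, so only assert once it is set.
                     */
                    lockdep_assert(current == ctx->submitter_task);
            }
    #endif
    }
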
2023-08-24  io_uring: force inline io_fill_cqe_req  (Pavel Begunkov; 1 file, -1/+2)

There are only 2 callers of io_fill_cqe_req left, and one of them is extremely hot. Force inline the function.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ffce4fc5e3521966def848a4d930586dfe33ae11.1692916914.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-08-24  io_uring: merge iopoll and normal completion paths  (Pavel Begunkov; 1 file, -1/+1)

io_do_iopoll() and io_submit_flush_completions() are pretty similar: both fill CQEs and then free a list of requests. Don't duplicate it; make iopoll use __io_submit_flush_completions(), which also helps with inlining and other optimisations. For that, we need to first find all completed iopoll requests, splice them from the iopoll list, and then pass the list down. This adds one extra list traversal, which should be fine as requests will stay hot in cache.

CQ locking is already conditional; introduce ->lockless_cq and skip locking for IOPOLL as it's protected by ->uring_lock. We also add a wakeup optimisation for IOPOLL to __io_cq_unlock_post(), so it works just like io_cqring_ev_posted_iopoll().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3840473f5e8a960de35b77292026691880f6bdbc.1692916914.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-08-24  io_uring: optimise extra io_get_cqe null check  (Pavel Begunkov; 1 file, -11/+9)

If the cached cqe check passes in io_get_cqe*(), it already means that the cqe we return is valid and non-zero; however, the compiler is unable to optimise null checks like in io_fill_cqe_req(). Do a bit of trickery: return a success/fail boolean from io_get_cqe*() and store the cqe in the cqe parameter. That makes it do the right thing, erasing the check together with the introduced indirection.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/322ea4d3377d3d4efd8ae90ab8ed28a99f518210.1692916914.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

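[Editor's sketch] The shape of the change, hedged: the real helpers carry more arguments, and io_cqe_cache_refill() plus the cached/sentinel field names are assumptions used for illustration:

    /* before: return a cqe pointer that callers must NULL-check */
    static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx);

    /* after: report success via the return value, hand back the cqe */
    static inline bool io_get_cqe(struct io_ring_ctx *ctx,
                                  struct io_uring_cqe **ret)
    {
            if (likely(ctx->cqe_cached < ctx->cqe_sentinel)) {
                    *ret = ctx->cqe_cached++;
                    return true;
            }
            return io_cqe_cache_refill(ctx, ret);  /* slow path, assumed name */
    }
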
2023-08-24  io_uring: refactor __io_get_cqe()  (Pavel Begunkov; 1 file, -11/+12)

Make __io_get_cqe simpler by not grabbing the cqe from the refilled cache, but letting io_get_cqe() do it for us. That's cleaner and removes some duplication.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/74dc8fdf2657e438b2e05e1d478a3596924604e9.1692916914.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-08-24  io_uring: simplify big_cqe handling  (Pavel Begunkov; 1 file, -12/+3)

Don't keep big_cqe bits of req in a union with hash_node; find a separate space for it. It's a bit safer, but also, if we keep it always initialised, we can get rid of the ugly REQ_F_CQE32_INIT handling.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/447aa1b2968978c99e655ba88db536e903df0fe9.1692916914.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-08-24  io_uring: improve cqe !tracing hot path  (Pavel Begunkov; 1 file, -4/+5)

While looking at io_fill_cqe_req()'s asm I stumbled on our trace points turning into the chunk below:

    trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
                            req->cqe.res, req->cqe.flags,
                            req->extra1, req->extra2);

    io_uring/io_uring.c:898: trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
          movq 232(%rbx), %rdi    # req_44(D)->big_cqe.extra2, _5
          movq 224(%rbx), %rdx    # req_44(D)->big_cqe.extra1, _6
          movl 84(%rbx), %r9d     # req_44(D)->cqe.D.81184.flags, _7
          movl 80(%rbx), %r8d     # req_44(D)->cqe.res, _8
          movq 72(%rbx), %rcx     # req_44(D)->cqe.user_data, _9
          movq 88(%rbx), %rsi     # req_44(D)->ctx, _10
    ./arch/x86/include/asm/jump_label.h:27: asm_volatile_goto("1:"
    1:  jmp .L1772    # objtool NOPs this
    ...

It does a jump_label for actual tracing, but those 6 moves will stay there in the hottest io_uring path. As an optimisation, add a trace_io_uring_complete_enabled() check, which also uses jump_labels and tricks the compiler into behaving. It removes the junk without changing anything else in the hot path.

Note: apparently it's not only me noticing it, and people are also working around it. We should remove the check when it's solved generically or rework tracing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/555d8312644b3776f4be7e23f9b92943875c4bc7.1692916914.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

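[Editor's sketch] The guarded call site then looks roughly like this, reusing the arguments from the quoted snippet; trace_<name>_enabled() is the standard static-key accessor the tracepoint infrastructure generates:

    if (trace_io_uring_complete_enabled())
            trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
                                    req->cqe.res, req->cqe.flags,
                                    req->extra1, req->extra2);
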
2023-08-11  io_uring: never overflow io_aux_cqe  (Pavel Begunkov; 1 file, -2/+1)

Now all callers of io_aux_cqe() set allow_overflow to false; remove the parameter and don't allow overflowing auxiliary multishot cqes.

When the CQ is full, the function's callers, and all multishot requests in general, are expected to complete the request. That prevents indefinite in-background growth of the overflow list and lets the userspace handle the backlog at its own pace. Resubmitting a request should also be faster than accounting a bunch of overflows, so it should be better for perf when it happens, but a well behaving userspace should be trying to avoid overflows in any case.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bb20d14d708ea174721e58bb53786b0521e4dd6d.1691757663.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-08-11  io_uring: remove return from io_req_cqe_overflow()  (Pavel Begunkov; 1 file, -1/+1)

Nobody checks io_req_cqe_overflow()'s return, make it return void.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8f2029ad0c22f73451664172d834372608ee0a77.1691757663.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-08-11  io_uring: open code io_fill_cqe_req()  (Pavel Begunkov; 1 file, -10/+1)

io_fill_cqe_req() is only called from one place, open code it, and rename __io_fill_cqe_req().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f432ce75bb1c94cadf0bd2add4d6aa510bd1fb36.1691757663.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-08-10  io_uring: have io_file_put() take an io_kiocb rather than the file  (Jens Axboe; 1 file, -3/+3)

No functional changes in this patch, just a prep patch for needing the request in io_file_put().

Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-08-09  io_uring: fix false positive KASAN warnings  (Pavel Begunkov; 1 file, -1/+0)

io_req_local_work_add() peeks into the work list, which can be executed in the meanwhile. It's completely fine without KASAN, as we're in an RCU read section and it's SLAB_TYPESAFE_BY_RCU. With KASAN though, it may trigger a false positive warning because internal io_uring caches are sanitised.

Remove sanitisation from the io_uring request cache for now.

Cc: stable@vger.kernel.org
Fixes: 8751d15426a31 ("io_uring: reduce scheduling due to tw")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c6fbf7a82a341e66a0007c76eefd9d57f2d3ba51.1691541473.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-06-23  io_uring: make io_cq_unlock_post static  (Pavel Begunkov; 1 file, -2/+0)

io_cq_unlock_post() is exclusively used in io_uring/io_uring.c, mark it static and don't expose it to other files.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3dc8127dda4514e1dd24bb32035faac887c5fa37.1687518903.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-06-23  io_uring: remove IOU_F_TWQ_FORCE_NORMAL  (Pavel Begunkov; 1 file, -4/+1)

Extract a function for non-local task_work_add, and use it directly from io_move_task_work_from_local(). Now we don't use IOU_F_TWQ_FORCE_NORMAL and it can be killed.

As a small positive side effect we don't grab task->io_uring in io_req_normal_work_add anymore, which is not needed for io_req_local_work_add().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2e55571e8ff2927ae3cc12da606d204e2485525b.1687518903.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-06-20  io_uring: remove io_req_ffs_set  (Christoph Hellwig; 1 file, -5/+0)

Just checking the flag directly makes it a lot more obvious what is going on here.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230620113235.920399-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-06-07  io_uring: cleanup io_aux_cqe() API  (Jens Axboe; 1 file, -1/+1)

Everybody is passing in the request, so get rid of the io_ring_ctx and explicit user_data pass-in. Both the ctx and user_data can be deduced from the request at hand.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-05-16  io_uring: Add io_uring_setup flag to pre-register ring fd and never install it  (Josh Triplett; 1 file, -0/+3)

With IORING_REGISTER_USE_REGISTERED_RING, an application can register the ring fd and use it via registered index rather than installed fd. This allows using a registered ring for everything *except* the initial mmap.

With IORING_SETUP_NO_MMAP, io_uring_setup uses buffers allocated by the user, rather than requiring a subsequent mmap.

The combination of the two allows a user to operate *entirely* via a registered ring fd, making it unnecessary to ever install the fd in the first place. So, add a flag IORING_SETUP_REGISTERED_FD_ONLY to make io_uring_setup register the fd and return a registered index, without installing the fd.

This allows an application to avoid touching the fd table at all, and allows a library to never even momentarily install a file descriptor.

This splits out an io_ring_add_registered_file helper from io_ring_add_registered_fd, for use by io_uring_setup.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Link: https://lore.kernel.org/r/bc8f431bada371c183b95a83399628b605e978a3.1682699803.git.josh@joshtriplett.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

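[Editor's sketch] The helper split, hedged: io_ring_add_registered_file() takes an already-resolved struct file, so io_uring_setup can call it directly without an installed fd. The loop body below is a simplified assumption about its shape:

    static int io_ring_add_registered_file(struct io_uring_task *tctx,
                                           struct file *file,
                                           int start, int end)
    {
            int offset;

            for (offset = start; offset < end; offset++) {
                    offset = array_index_nospec(offset, IO_RINGFD_REG_MAX);
                    if (!tctx->registered_rings[offset]) {
                            /* take a slot in the per-task ring table */
                            tctx->registered_rings[offset] = file;
                            return offset;
                    }
            }
            return -EBUSY;
    }
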
2023-05-04  io_uring: Create a helper to return the SQE size  (Breno Leitao; 1 file, -0/+10)

Create a simple helper that returns the size of the SQE. The SQE can have two sizes, depending on the flags. If the IORING_SETUP_SQE128 flag is set, then return a double SQE size, otherwise return sizeof(struct io_uring_sqe) (64 bytes).

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20230504121856.904491-2-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

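[Editor's sketch] The helper boils down to a single flag test, per the description:

    static inline size_t uring_sqe_size(struct io_ring_ctx *ctx)
    {
            if (ctx->flags & IORING_SETUP_SQE128)
                    return 2 * sizeof(struct io_uring_sqe);
            return sizeof(struct io_uring_sqe);
    }
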
2023-04-12  io_uring: add irq lockdep checks  (Pavel Begunkov; 1 file, -0/+2)

We don't post CQEs from the IRQ context, add a check catching that.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f23f7a24dbe8027b3d37873fece2b6488f878b31.1681210788.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-04-06  io_uring: reduce scheduling due to tw  (Pavel Begunkov; 1 file, -0/+9)

Every task_work will try to wake the task to be executed, which causes excessive scheduling and additional overhead. For some tw it's justified, but others won't do much but post a single CQE.

When a task waits for multiple cqes, every such task_work will wake it up. Instead, the task may give a hint about how many cqes it waits for; io_req_local_work_add() will compare against it and skip wake ups if #cqes + #tw is not enough to satisfy the waiting condition. Task_work that uses the optimisation should be simple enough and never post more than one CQE. It's also ignored for non-DEFER_TASKRUN rings.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d2b77e99d1e86624d8a69f7037d764b739dcd225.1680782017.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

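[Editor's sketch] Conceptually, the wake-up in io_req_local_work_add() becomes a comparison against the waiter's hint. This is a simplified assumption: the real code also handles memory ordering and the TASKRUN flag, and the field/count names here follow the description rather than the exact diff:

    /* nr_tw: CQEs-to-come, i.e. task_work items queued so far incl. this one */
    unsigned nr_wait = atomic_read(&ctx->cq_wait_nr);

    /* not enough queued work to satisfy the waiter yet: skip the wake-up */
    if (nr_tw < nr_wait)
            return;
    wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
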
2023-04-06  io_uring: add tw add flags  (Pavel Begunkov; 1 file, -2/+7)

We pass 'allow_local' into io_req_task_work_add() but will need more flags. Replace it with a flags bit field and name the existing allow_local flag.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4c0f01e7ef4e6feebfb199093cc995af7a19befa.1680782017.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-04-06  io_uring: refactor io_cqring_wake()  (Pavel Begunkov; 1 file, -9/+2)

Instead of smp_mb() + __io_cqring_wake() in __io_cq_unlock_post_flush(), use the equivalent io_cqring_wake(). With that we can clean it up further and remove __io_cqring_wake().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/662ee5d898168ac206be06038525e97b64072a46.1680782017.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-04-03  io_uring: cap io_sqring_entries() at SQ ring size  (Jens Axboe; 1 file, -1/+3)

We already do this manually for the !SQPOLL case, do it in general and we can also dump the ugly min3() in io_submit_sqes().

Signed-off-by: Jens Axboe <axboe@kernel.dk>

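[Editor's sketch] The capped helper, per the description; the smp_load_acquire pairs with the tail update on the submission side:

    static inline unsigned int io_sqring_entries(struct io_ring_ctx *ctx)
    {
            /* make sure SQ entries aren't read before the tail */
            unsigned int entries = smp_load_acquire(&ctx->rings->sq.tail) -
                                   ctx->cached_sq_head;

            /* clamp at the ring size, removing the min3() at the call site */
            return min(entries, ctx->sq_entries);
    }
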
2023-04-03  io_uring: encapsulate task_work state  (Pavel Begunkov; 1 file, -7/+7)

For task works we're passing around a bool pointer for whether the current ring is locked or not; let's wrap it in a structure. That will make it more opaque, preventing abuse, and will also help us pass more info in the future if needed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1ecec9483d58696e248d1bfd52cf62b04442df1d.1679931367.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-02-06  io_uring: mark task TASK_RUNNING before handling resume/task work  (Jens Axboe; 1 file, -1/+3)

Just like for task_work, set the task mode to TASK_RUNNING before doing any potential resume work. We're not holding any locks at this point, but we may have already set the task state to TASK_INTERRUPTIBLE in preparation for going to sleep waiting for events. Ensure that we set it back to TASK_RUNNING if we have work to process, to avoid warnings on calling blocking operations with !TASK_RUNNING.

Fixes: b5d3ae202fbf ("io_uring: handle TIF_NOTIFY_RESUME when checking for task_work")
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/oe-lkp/202302062208.24d3e563-oliver.sang@intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

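[Editor's sketch] After the fix, the notify-resume handling in io_run_task_work() is roughly the following; resume_user_mode_work() takes the user pt_regs, which io_uring has none of here, hence NULL:

    if (test_thread_flag(TIF_NOTIFY_RESUME)) {
            /* may block below; don't sleep-wait with a non-running state set */
            __set_current_state(TASK_RUNNING);
            resume_user_mode_work(NULL);
    }
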
2023-01-29  io_uring: refactor req allocation  (Pavel Begunkov; 1 file, -8/+11)

Follow the io_get_sqe pattern returning the result via a pointer and hide request cache refill inside io_alloc_req().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8c37c2e8a3cb5e4cd6a8ae3b91371227a92708a6.1674484266.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-01-29  io_uring: Enable KASAN for request cache  (Breno Leitao; 1 file, -3/+8)

Every io_uring request is represented by struct io_kiocb, which is cached locally by io_uring (not SLAB/SLUB) in the list called submit_state.freelist. This patch simply enables KASAN for this free list.

This list is initially created by KMEM_CACHE, but later managed by io_uring. This patch basically poisons the objects that are not in use (i.e., on the free list), and unpoisons an object when it is allocated/removed from the list. Touching these poisoned objects while on the freelist will cause a KASAN warning.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

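[Editor's sketch] The poison/unpoison pairing around the freelist, hedged: req_cachep is the io_kiocb kmem_cache, while the helper names and exact ordering shown are simplified assumptions:

    static inline void io_req_add_to_cache(struct io_kiocb *req,
                                           struct io_ring_ctx *ctx)
    {
            /* link first, then poison: the object must not be touched after */
            wq_stack_add_head(&req->comp_list, &ctx->submit_state.free_list);
            kasan_poison_object_data(req_cachep, req);
    }

    static inline struct io_kiocb *io_extract_req(struct io_ring_ctx *ctx)
    {
            struct io_kiocb *req;

            req = container_of(ctx->submit_state.free_list.next,
                               struct io_kiocb, comp_list);
            /* unpoison before wq_stack_extract() reads req->comp_list.next */
            kasan_unpoison_object_data(req_cachep, req);
            wq_stack_extract(&ctx->submit_state.free_list);
            return req;
    }
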
2023-01-29  io_uring: handle TIF_NOTIFY_RESUME when checking for task_work  (Jens Axboe; 1 file, -0/+8)

If TIF_NOTIFY_RESUME is set, then we need to call resume_user_mode_work() for PF_IO_WORKER threads. They never return to usermode, hence never get a chance to process any items that are marked by this flag. Most notably this includes the final put of files, but also any throttling markers set by block cgroups.

Cc: stable@vger.kernel.org # 5.10+
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-01-29  io_uring: don't export io_put_task()  (Pavel Begunkov; 1 file, -10/+0)

io_put_task() is only used in io_uring.c, so enclose it there together with __io_put_task().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/43c7f9227e2ab215f1a6069dadbc5382bed346fe.1673887636.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-01-29  io_uring: add lazy poll_wq activation  (Pavel Begunkov; 1 file, -4/+3)

Even though io_poll_wq_wake()'s waitqueue_active reuses a barrier we do for another waitqueue, it's not going to be the case in the future and so we want to have a fast path for it when the ring has never been polled.

Move poll_wq wake ups into __io_commit_cqring_flush() using a new flag called ->poll_activated. The idea behind the flag is to set it when the ring was polled for the first time. This requires additional sync to not miss events, which is done here by using task_work for ->task_complete rings, and by default enabling the flag for all other types of rings.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/060785e8e9137a920b232c0c7f575b131af19cac.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-01-29  io_uring: separate wq for ring polling  (Pavel Begunkov; 1 file, -0/+9)

Don't use ->cq_wait for ring polling but add a separate wait queue for it. We need it for following patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dea0be0bf990503443c5c6c337fc66824af7d590.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-01-29  io_uring: move io_run_local_work_locked  (Pavel Begunkov; 1 file, -17/+0)

io_run_local_work_locked() is only used in io_uring.c, move it there. With that we can also make __io_run_local_work() static.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/91757bcb33e5774e49fed6f2b6e058630608119b.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-01-29  io_uring: mark io_run_local_work static  (Pavel Begunkov; 1 file, -1/+0)

io_run_local_work is enclosed in io_uring.c, we don't need to export it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b477fb81f5e77044f724a06fe245d5c078659364.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-01-29  io_uring: move defer tw task checks  (Pavel Begunkov; 1 file, -0/+5)

Most places that want to run local tw explicitly and in advance check if they are allowed to do so. Don't rely on a similar check in __io_run_local_work(); leave it as a just-in-case warning and make sure callers check capabilities themselves.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/990fe0e8e70fd4d57e43625e5ce8fba584821d1a.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-01-29  io_uring: kill io_run_task_work_ctx  (Pavel Begunkov; 1 file, -20/+0)

There is only one user of io_run_task_work_ctx(), inline it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/40953c65f7c88fb00cdc4d870ca5d5319fb3d7ea.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-01-29  io_uring: rearrange defer list checks  (Pavel Begunkov; 1 file, -1/+1)

There should be nothing in the ->work_llist for non DEFER_TASKRUN rings, so we can skip flag checks and test the list emptiness directly. Also move it out of io_run_local_work() for inlining.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/331d63fd15ca79b35b95c82a82d9246110686392.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-01-03  io_uring: lockdep annotate CQ locking  (Pavel Begunkov; 1 file, -0/+15)

Locking around CQE posting is complex and depends on options the ring is created with; add more thorough lockdep annotations checking all invariants.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/aa3770b4eacae3915d782cc2ab2f395a99b4b232.1672795976.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-12-17  io_uring: don't use TIF_NOTIFY_SIGNAL to test for availability of task_work  (Jens Axboe; 1 file, -2/+1)

Use task_work_pending() as a better test for whether we have task_work or not; TIF_NOTIFY_SIGNAL is only valid if any of the task_work items had been queued with TWA_SIGNAL as the notification mechanism. Hence task_work_pending() is a more reliable check.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-12-14  io_uring: revise completion_lock locking  (Pavel Begunkov; 1 file, -11/+0)

io_kill_timeouts() doesn't post any events but queues everything to task_work. Locking there is needed for protecting traversal of linked requests; we should grab completion_lock directly instead of using the io_cq_[un]lock helpers. Same goes for __io_req_find_next_prep().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/88e75d481a65dc295cb59722bb1cf76402d1c06b.1670002973.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-12-07  io_uring: skip spinlocking for ->task_complete  (Pavel Begunkov; 1 file, -1/+9)

->task_complete was added to serialise CQE posting by doing it from the task context only (or the fallback wq when the task is dead), and now we can use that to avoid taking ->completion_lock while filling CQ entries. The patch skips spinlocking only in two spots, __io_submit_flush_completions() and flushing in io_aux_cqe; it's safer and covers all cases we care about. Extra care is taken to force taking the lock while queueing overflow entries.

It fundamentally relies on SINGLE_ISSUER to have only one task posting events. It also needs to take into account overflowed CQEs, flushing of which happens in the cq wait path, and so this implementation also needs DEFER_TASKRUN to limit waiters. For the same reason we disable it for SQPOLL, and for IOPOLL as it won't benefit from it in any case. The DEFER_TASKRUN, SQPOLL and IOPOLL requirements may be relaxed in the future.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2a8c91fd82cfcdcc1d2e5bac7051fe2c183bda73.1670384893.git.asml.silence@gmail.com
[axboe: modify to apply]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

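[Editor's sketch] The two spots use conditional lock helpers along these lines; lockdep annotations and the forced locking for overflow entries are omitted, and the helper names are assumptions:

    static inline void __io_cq_lock(struct io_ring_ctx *ctx)
    {
            /* single-task posting via ->task_complete needs no spinlock */
            if (!ctx->task_complete)
                    spin_lock(&ctx->completion_lock);
    }

    static inline void __io_cq_unlock(struct io_ring_ctx *ctx)
    {
            if (!ctx->task_complete)
                    spin_unlock(&ctx->completion_lock);
    }
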
2022-11-30  io_uring: don't raw spin unlock to match cq_lock  (Pavel Begunkov; 1 file, -0/+5)

There is one newly added place where we lock the ring with io_cq_lock() but the unlocking is hand-coded, calling spin_unlock directly. It's ugly and troublesome in the long run. Make it consistent with the other completion locking.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4ca4f0564492b90214a190cd5b2a6c76522de138.1669821213.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-11-29  Merge branch 'for-6.2/io_uring' into for-6.2/io_uring-next  (Jens Axboe; 1 file, -14/+29)

* for-6.2/io_uring: (41 commits)
  io_uring: keep unlock_post inlined in hot path
  io_uring: don't use complete_post in kbuf
  io_uring: spelling fix
  io_uring: remove io_req_complete_post_tw
  io_uring: allow multishot polled reqs to defer completion
  io_uring: remove overflow param from io_post_aux_cqe
  io_uring: add lockdep assertion in io_fill_cqe_aux
  io_uring: make io_fill_cqe_aux static
  io_uring: add io_aux_cqe which allows deferred completion
  io_uring: allow defer completion for aux posted cqes
  io_uring: defer all io_req_complete_failed
  io_uring: always lock in io_apoll_task_func
  io_uring: remove iopoll spinlock
  io_uring: iopoll protect complete_post
  io_uring: inline __io_req_complete_put()
  io_uring: remove io_req_tw_post_queue
  io_uring: use io_req_task_complete() in timeout
  io_uring: hold locks for io_req_complete_failed
  io_uring: add completion locking for iopoll
  io_uring: kill io_cqring_ev_posted() and __io_cq_unlock_post()
  ...

2022-11-25  io_uring: clear TIF_NOTIFY_SIGNAL if set and task_work not available  (Jens Axboe; 1 file, -2/+7)

With how task_work is added and signaled, we can have TIF_NOTIFY_SIGNAL set and no task_work pending as it got run in a previous loop. Treat TIF_NOTIFY_SIGNAL like get_signal(): always clear it if set, regardless of whether or not task_work is pending to run.

Cc: stable@vger.kernel.org
Fixes: 46a525e199e4 ("io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

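[Editor's sketch] io_run_task_work() then unconditionally clears the flag when set, like get_signal() does; a sketch consistent with the description:

    static inline bool io_run_task_work(void)
    {
            /*
             * Always check-and-clear the task_work notification signal;
             * with how signaling works for task_work, it can be set with
             * nothing left to run.
             */
            if (test_thread_flag(TIF_NOTIFY_SIGNAL))
                    clear_notify_signal();
            if (task_work_pending(current)) {
                    __set_current_state(TASK_RUNNING);
                    task_work_run();
                    return true;
            }
            return false;
    }
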
2022-11-25  io_uring: remove io_req_complete_post_tw  (Dylan Yudaken; 1 file, -7/+0)

It's only used in one place. Inline it.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221125103412.1425305-2-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-11-25  io_uring: remove overflow param from io_post_aux_cqe  (Dylan Yudaken; 1 file, -2/+1)

The only call sites which would not allow overflow are also call sites which would use io_aux_cqe, as they care about ordering. So remove this parameter from io_post_aux_cqe.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-9-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-11-25  io_uring: make io_fill_cqe_aux static  (Dylan Yudaken; 1 file, -2/+0)

This is only used in io_uring.c.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-7-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-11-25  io_uring: add io_aux_cqe which allows deferred completion  (Dylan Yudaken; 1 file, -0/+2)

Use the just introduced deferred post cqe completion state when possible in io_aux_cqe. If not possible, fall back to io_post_aux_cqe.

This introduces a complication because of allow_overflow. For deferred completions we cannot know, without locking the completion_lock, if it will overflow (and even if we locked it, another post could sneak in and cause this cqe to be in overflow). However, since overflow protection is mostly a best-effort defence in depth to prevent infinite loops of CQEs for poll, just checking the overflow bit is going to be good enough and will result in at most 16 (array size of deferred cqes) overflows.

Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-6-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-11-25  io_uring: defer all io_req_complete_failed  (Dylan Yudaken; 1 file, -1/+1)

All failures happen under lock now, and can be deferred. To be consistent when the failure has happened after some multishot cqe has been deferred (and to keep ordering), always defer failures.

To make this obvious at the caller (and to help prevent a future bug), rename io_req_complete_failed to io_req_defer_failed.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-4-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
