aboutsummaryrefslogtreecommitdiff
path: root/include/trace
diff options
context:
space:
mode:
authorGravatar David Howells <dhowells@redhat.com> 2024-03-18 16:52:05 +0000
committerGravatar David Howells <dhowells@redhat.com> 2024-05-01 18:07:36 +0100
commit288ace2f57c9d06dd2e42bd80d03747d879a4068 (patch)
tree09b6e75dfb7b1120ce04771c6629bd5525ae446b /include/trace
parentnetfs: Switch to using unsigned long long rather than loff_t (diff)
downloadlinux-288ace2f57c9d06dd2e42bd80d03747d879a4068.tar.gz
linux-288ace2f57c9d06dd2e42bd80d03747d879a4068.tar.bz2
linux-288ace2f57c9d06dd2e42bd80d03747d879a4068.zip
netfs: New writeback implementation
The current netfslib writeback implementation creates writeback requests of contiguous folio data and then separately tiles subrequests over the space twice, once for the server and once for the cache. This creates a few issues: (1) Every time there's a discontiguity or a change between writing to only one destination or writing to both, it must create a new request. This makes it harder to do vectored writes. (2) The folios don't have the writeback mark removed until the end of the request - and a request could be hundreds of megabytes. (3) In future, I want to support a larger cache granularity, which will require aggregation of some folios that contain unmodified data (which only need to go to the cache) and some which contain modifications (which need to be uploaded and stored to the cache) - but, currently, these are treated as discontiguous. There's also a move to get everyone to use writeback_iter() to extract writable folios from the pagecache. That said, currently writeback_iter() has some issues that make it less than ideal: (1) there's no way to cancel the iteration, even if you find a "temporary" error that means the current folio and all subsequent folios are going to fail; (2) there's no way to filter the folios being written back - something that will impact Ceph with it's ordered snap system; (3) and if you get a folio you can't immediately deal with (say you need to flush the preceding writes), you are left with a folio hanging in the locked state for the duration, when really we should unlock it and relock it later. In this new implementation, I use writeback_iter() to pump folios, progressively creating two parallel, but separate streams and cleaning up the finished folios as the subrequests complete. Either or both streams can contain gaps, and the subrequests in each stream can be of variable size, don't need to align with each other and don't need to align with the folios. Indeed, subrequests can cross folio boundaries, may cover several folios or a folio may be spanned by multiple folios, e.g.: +---+---+-----+-----+---+----------+ Folios: | | | | | | | +---+---+-----+-----+---+----------+ +------+------+ +----+----+ Upload: | | |.....| | | +------+------+ +----+----+ +------+------+------+------+------+ Cache: | | | | | | +------+------+------+------+------+ The progressive subrequest construction permits the algorithm to be preparing both the next upload to the server and the next write to the cache whilst the previous ones are already in progress. Throttling can be applied to control the rate of production of subrequests - and, in any case, we probably want to write them to the server in ascending order, particularly if the file will be extended. Content crypto can also be prepared at the same time as the subrequests and run asynchronously, with the prepped requests being stalled until the crypto catches up with them. This might also be useful for transport crypto, but that happens at a lower layer, so probably would be harder to pull off. The algorithm is split into three parts: (1) The issuer. This walks through the data, packaging it up, encrypting it and creating subrequests. The part of this that generates subrequests only deals with file positions and spans and so is usable for DIO/unbuffered writes as well as buffered writes. (2) The collector. This asynchronously collects completed subrequests, unlocks folios, frees crypto buffers and performs any retries. This runs in a work queue so that the issuer can return to the caller for writeback (so that the VM can have its kswapd thread back) or async writes. (3) The retryer. This pauses the issuer, waits for all outstanding subrequests to complete and then goes through the failed subrequests to reissue them. This may involve reprepping them (with cifs, the credits must be renegotiated, and a subrequest may need splitting), and doing RMW for content crypto if there's a conflicting change on the server. [!] Note that some of the functions are prefixed with "new_" to avoid clashes with existing functions. These will be renamed in a later patch that cuts over to the new algorithm. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: Eric Van Hensbergen <ericvh@kernel.org> cc: Latchesar Ionkov <lucho@ionkov.net> cc: Dominique Martinet <asmadeus@codewreck.org> cc: Christian Schoenebeck <linux_oss@crudebyte.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org
Diffstat (limited to 'include/trace')
-rw-r--r--include/trace/events/netfs.h232
1 files changed, 229 insertions, 3 deletions
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index 7126d2ea459c..e7700172ae7e 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -44,14 +44,18 @@
#define netfs_rreq_traces \
EM(netfs_rreq_trace_assess, "ASSESS ") \
EM(netfs_rreq_trace_copy, "COPY ") \
+ EM(netfs_rreq_trace_collect, "COLLECT") \
EM(netfs_rreq_trace_done, "DONE ") \
EM(netfs_rreq_trace_free, "FREE ") \
EM(netfs_rreq_trace_redirty, "REDIRTY") \
EM(netfs_rreq_trace_resubmit, "RESUBMT") \
+ EM(netfs_rreq_trace_set_pause, "PAUSE ") \
EM(netfs_rreq_trace_unlock, "UNLOCK ") \
EM(netfs_rreq_trace_unmark, "UNMARK ") \
EM(netfs_rreq_trace_wait_ip, "WAIT-IP") \
+ EM(netfs_rreq_trace_wait_pause, "WT-PAUS") \
EM(netfs_rreq_trace_wake_ip, "WAKE-IP") \
+ EM(netfs_rreq_trace_unpause, "UNPAUSE") \
E_(netfs_rreq_trace_write_done, "WR-DONE")
#define netfs_sreq_sources \
@@ -64,11 +68,15 @@
E_(NETFS_INVALID_WRITE, "INVL")
#define netfs_sreq_traces \
+ EM(netfs_sreq_trace_discard, "DSCRD") \
EM(netfs_sreq_trace_download_instead, "RDOWN") \
+ EM(netfs_sreq_trace_fail, "FAIL ") \
EM(netfs_sreq_trace_free, "FREE ") \
EM(netfs_sreq_trace_limited, "LIMIT") \
EM(netfs_sreq_trace_prepare, "PREP ") \
+ EM(netfs_sreq_trace_prep_failed, "PRPFL") \
EM(netfs_sreq_trace_resubmit_short, "SHORT") \
+ EM(netfs_sreq_trace_retry, "RETRY") \
EM(netfs_sreq_trace_submit, "SUBMT") \
EM(netfs_sreq_trace_terminated, "TERM ") \
EM(netfs_sreq_trace_write, "WRITE") \
@@ -88,6 +96,7 @@
#define netfs_rreq_ref_traces \
EM(netfs_rreq_trace_get_for_outstanding,"GET OUTSTND") \
EM(netfs_rreq_trace_get_subreq, "GET SUBREQ ") \
+ EM(netfs_rreq_trace_get_work, "GET WORK ") \
EM(netfs_rreq_trace_put_complete, "PUT COMPLT ") \
EM(netfs_rreq_trace_put_discard, "PUT DISCARD") \
EM(netfs_rreq_trace_put_failed, "PUT FAILED ") \
@@ -95,6 +104,8 @@
EM(netfs_rreq_trace_put_return, "PUT RETURN ") \
EM(netfs_rreq_trace_put_subreq, "PUT SUBREQ ") \
EM(netfs_rreq_trace_put_work, "PUT WORK ") \
+ EM(netfs_rreq_trace_put_work_complete, "PUT WORK CP") \
+ EM(netfs_rreq_trace_put_work_nq, "PUT WORK NQ") \
EM(netfs_rreq_trace_see_work, "SEE WORK ") \
E_(netfs_rreq_trace_new, "NEW ")
@@ -103,11 +114,14 @@
EM(netfs_sreq_trace_get_resubmit, "GET RESUBMIT") \
EM(netfs_sreq_trace_get_short_read, "GET SHORTRD") \
EM(netfs_sreq_trace_new, "NEW ") \
+ EM(netfs_sreq_trace_put_cancel, "PUT CANCEL ") \
EM(netfs_sreq_trace_put_clear, "PUT CLEAR ") \
EM(netfs_sreq_trace_put_discard, "PUT DISCARD") \
+ EM(netfs_sreq_trace_put_done, "PUT DONE ") \
EM(netfs_sreq_trace_put_failed, "PUT FAILED ") \
EM(netfs_sreq_trace_put_merged, "PUT MERGED ") \
EM(netfs_sreq_trace_put_no_copy, "PUT NO COPY") \
+ EM(netfs_sreq_trace_put_oom, "PUT OOM ") \
EM(netfs_sreq_trace_put_wip, "PUT WIP ") \
EM(netfs_sreq_trace_put_work, "PUT WORK ") \
E_(netfs_sreq_trace_put_terminated, "PUT TERM ")
@@ -124,7 +138,9 @@
EM(netfs_streaming_filled_page, "mod-streamw-f") \
EM(netfs_streaming_cont_filled_page, "mod-streamw-f+") \
/* The rest are for writeback */ \
+ EM(netfs_folio_trace_cancel_copy, "cancel-copy") \
EM(netfs_folio_trace_clear, "clear") \
+ EM(netfs_folio_trace_clear_cc, "clear-cc") \
EM(netfs_folio_trace_clear_s, "clear-s") \
EM(netfs_folio_trace_clear_g, "clear-g") \
EM(netfs_folio_trace_copy, "copy") \
@@ -133,16 +149,26 @@
EM(netfs_folio_trace_end_copy, "end-copy") \
EM(netfs_folio_trace_filled_gaps, "filled-gaps") \
EM(netfs_folio_trace_kill, "kill") \
+ EM(netfs_folio_trace_kill_cc, "kill-cc") \
+ EM(netfs_folio_trace_kill_g, "kill-g") \
+ EM(netfs_folio_trace_kill_s, "kill-s") \
EM(netfs_folio_trace_mkwrite, "mkwrite") \
EM(netfs_folio_trace_mkwrite_plus, "mkwrite+") \
+ EM(netfs_folio_trace_not_under_wback, "!wback") \
EM(netfs_folio_trace_read_gaps, "read-gaps") \
EM(netfs_folio_trace_redirty, "redirty") \
EM(netfs_folio_trace_redirtied, "redirtied") \
EM(netfs_folio_trace_store, "store") \
+ EM(netfs_folio_trace_store_copy, "store-copy") \
EM(netfs_folio_trace_store_plus, "store+") \
EM(netfs_folio_trace_wthru, "wthru") \
E_(netfs_folio_trace_wthru_plus, "wthru+")
+#define netfs_collect_contig_traces \
+ EM(netfs_contig_trace_collect, "Collect") \
+ EM(netfs_contig_trace_jump, "-->JUMP-->") \
+ E_(netfs_contig_trace_unlock, "Unlock")
+
#ifndef __NETFS_DECLARE_TRACE_ENUMS_ONCE_ONLY
#define __NETFS_DECLARE_TRACE_ENUMS_ONCE_ONLY
@@ -159,6 +185,7 @@ enum netfs_failure { netfs_failures } __mode(byte);
enum netfs_rreq_ref_trace { netfs_rreq_ref_traces } __mode(byte);
enum netfs_sreq_ref_trace { netfs_sreq_ref_traces } __mode(byte);
enum netfs_folio_trace { netfs_folio_traces } __mode(byte);
+enum netfs_collect_contig_trace { netfs_collect_contig_traces } __mode(byte);
#endif
@@ -180,6 +207,7 @@ netfs_failures;
netfs_rreq_ref_traces;
netfs_sreq_ref_traces;
netfs_folio_traces;
+netfs_collect_contig_traces;
/*
* Now redefine the EM() and E_() macros to map the enums to the strings that
@@ -413,16 +441,18 @@ TRACE_EVENT(netfs_write_iter,
__field(unsigned long long, start )
__field(size_t, len )
__field(unsigned int, flags )
+ __field(unsigned int, ino )
),
TP_fast_assign(
__entry->start = iocb->ki_pos;
__entry->len = iov_iter_count(from);
+ __entry->ino = iocb->ki_filp->f_inode->i_ino;
__entry->flags = iocb->ki_flags;
),
- TP_printk("WRITE-ITER s=%llx l=%zx f=%x",
- __entry->start, __entry->len, __entry->flags)
+ TP_printk("WRITE-ITER i=%x s=%llx l=%zx f=%x",
+ __entry->ino, __entry->start, __entry->len, __entry->flags)
);
TRACE_EVENT(netfs_write,
@@ -434,6 +464,7 @@ TRACE_EVENT(netfs_write,
TP_STRUCT__entry(
__field(unsigned int, wreq )
__field(unsigned int, cookie )
+ __field(unsigned int, ino )
__field(enum netfs_write_trace, what )
__field(unsigned long long, start )
__field(unsigned long long, len )
@@ -444,18 +475,213 @@ TRACE_EVENT(netfs_write,
struct fscache_cookie *__cookie = netfs_i_cookie(__ctx);
__entry->wreq = wreq->debug_id;
__entry->cookie = __cookie ? __cookie->debug_id : 0;
+ __entry->ino = wreq->inode->i_ino;
__entry->what = what;
__entry->start = wreq->start;
__entry->len = wreq->len;
),
- TP_printk("R=%08x %s c=%08x by=%llx-%llx",
+ TP_printk("R=%08x %s c=%08x i=%x by=%llx-%llx",
__entry->wreq,
__print_symbolic(__entry->what, netfs_write_traces),
__entry->cookie,
+ __entry->ino,
__entry->start, __entry->start + __entry->len - 1)
);
+TRACE_EVENT(netfs_collect,
+ TP_PROTO(const struct netfs_io_request *wreq),
+
+ TP_ARGS(wreq),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, wreq )
+ __field(unsigned int, len )
+ __field(unsigned long long, transferred )
+ __field(unsigned long long, start )
+ ),
+
+ TP_fast_assign(
+ __entry->wreq = wreq->debug_id;
+ __entry->start = wreq->start;
+ __entry->len = wreq->len;
+ __entry->transferred = wreq->transferred;
+ ),
+
+ TP_printk("R=%08x s=%llx-%llx",
+ __entry->wreq,
+ __entry->start + __entry->transferred,
+ __entry->start + __entry->len)
+ );
+
+TRACE_EVENT(netfs_collect_contig,
+ TP_PROTO(const struct netfs_io_request *wreq, unsigned long long to,
+ enum netfs_collect_contig_trace type),
+
+ TP_ARGS(wreq, to, type),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, wreq)
+ __field(enum netfs_collect_contig_trace, type)
+ __field(unsigned long long, contiguity)
+ __field(unsigned long long, to)
+ ),
+
+ TP_fast_assign(
+ __entry->wreq = wreq->debug_id;
+ __entry->type = type;
+ __entry->contiguity = wreq->contiguity;
+ __entry->to = to;
+ ),
+
+ TP_printk("R=%08x %llx -> %llx %s",
+ __entry->wreq,
+ __entry->contiguity,
+ __entry->to,
+ __print_symbolic(__entry->type, netfs_collect_contig_traces))
+ );
+
+TRACE_EVENT(netfs_collect_sreq,
+ TP_PROTO(const struct netfs_io_request *wreq,
+ const struct netfs_io_subrequest *subreq),
+
+ TP_ARGS(wreq, subreq),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, wreq )
+ __field(unsigned int, subreq )
+ __field(unsigned int, stream )
+ __field(unsigned int, len )
+ __field(unsigned int, transferred )
+ __field(unsigned long long, start )
+ ),
+
+ TP_fast_assign(
+ __entry->wreq = wreq->debug_id;
+ __entry->subreq = subreq->debug_index;
+ __entry->stream = subreq->stream_nr;
+ __entry->start = subreq->start;
+ __entry->len = subreq->len;
+ __entry->transferred = subreq->transferred;
+ ),
+
+ TP_printk("R=%08x[%u:%02x] s=%llx t=%x/%x",
+ __entry->wreq, __entry->stream, __entry->subreq,
+ __entry->start, __entry->transferred, __entry->len)
+ );
+
+TRACE_EVENT(netfs_collect_folio,
+ TP_PROTO(const struct netfs_io_request *wreq,
+ const struct folio *folio,
+ unsigned long long fend,
+ unsigned long long collected_to),
+
+ TP_ARGS(wreq, folio, fend, collected_to),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, wreq )
+ __field(unsigned long, index )
+ __field(unsigned long long, fend )
+ __field(unsigned long long, cleaned_to )
+ __field(unsigned long long, collected_to )
+ ),
+
+ TP_fast_assign(
+ __entry->wreq = wreq->debug_id;
+ __entry->index = folio->index;
+ __entry->fend = fend;
+ __entry->cleaned_to = wreq->cleaned_to;
+ __entry->collected_to = collected_to;
+ ),
+
+ TP_printk("R=%08x ix=%05lx r=%llx-%llx t=%llx/%llx",
+ __entry->wreq, __entry->index,
+ (unsigned long long)__entry->index * PAGE_SIZE, __entry->fend,
+ __entry->cleaned_to, __entry->collected_to)
+ );
+
+TRACE_EVENT(netfs_collect_state,
+ TP_PROTO(const struct netfs_io_request *wreq,
+ unsigned long long collected_to,
+ unsigned int notes),
+
+ TP_ARGS(wreq, collected_to, notes),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, wreq )
+ __field(unsigned int, notes )
+ __field(unsigned long long, collected_to )
+ __field(unsigned long long, cleaned_to )
+ __field(unsigned long long, contiguity )
+ ),
+
+ TP_fast_assign(
+ __entry->wreq = wreq->debug_id;
+ __entry->notes = notes;
+ __entry->collected_to = collected_to;
+ __entry->cleaned_to = wreq->cleaned_to;
+ __entry->contiguity = wreq->contiguity;
+ ),
+
+ TP_printk("R=%08x cto=%llx fto=%llx ctg=%llx n=%x",
+ __entry->wreq, __entry->collected_to,
+ __entry->cleaned_to, __entry->contiguity,
+ __entry->notes)
+ );
+
+TRACE_EVENT(netfs_collect_gap,
+ TP_PROTO(const struct netfs_io_request *wreq,
+ const struct netfs_io_stream *stream,
+ unsigned long long jump_to, char type),
+
+ TP_ARGS(wreq, stream, jump_to, type),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, wreq)
+ __field(unsigned char, stream)
+ __field(unsigned char, type)
+ __field(unsigned long long, from)
+ __field(unsigned long long, to)
+ ),
+
+ TP_fast_assign(
+ __entry->wreq = wreq->debug_id;
+ __entry->stream = stream->stream_nr;
+ __entry->from = stream->collected_to;
+ __entry->to = jump_to;
+ __entry->type = type;
+ ),
+
+ TP_printk("R=%08x[%x:] %llx->%llx %c",
+ __entry->wreq, __entry->stream,
+ __entry->from, __entry->to, __entry->type)
+ );
+
+TRACE_EVENT(netfs_collect_stream,
+ TP_PROTO(const struct netfs_io_request *wreq,
+ const struct netfs_io_stream *stream),
+
+ TP_ARGS(wreq, stream),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, wreq)
+ __field(unsigned char, stream)
+ __field(unsigned long long, collected_to)
+ __field(unsigned long long, front)
+ ),
+
+ TP_fast_assign(
+ __entry->wreq = wreq->debug_id;
+ __entry->stream = stream->stream_nr;
+ __entry->collected_to = stream->collected_to;
+ __entry->front = stream->front ? stream->front->start : UINT_MAX;
+ ),
+
+ TP_printk("R=%08x[%x:] cto=%llx frn=%llx",
+ __entry->wreq, __entry->stream,
+ __entry->collected_to, __entry->front)
+ );
+
#undef EM
#undef E_
#endif /* _TRACE_NETFS_H */