path: root/include/net
Age  Commit message  (Author)  [files changed, -deletions/+insertions]
2024-03-01  ipv6: annotate data-races around idev->cnf.ignore_routes_with_linkdown  (Eric Dumazet)  [1 file, -1/+1]
idev->cnf.ignore_routes_with_linkdown can be used without any locks, add appropriate annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-01  ipv6: annotate data-races around cnf.forwarding  (Eric Dumazet)  [1 file, -3/+5]
idev->cnf.forwarding and net->ipv6.devconf_all->forwarding might be read locklessly, add appropriate READ_ONCE() and WRITE_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
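As a rough illustration of the annotation pattern these ipv6 devconf commits apply (the helper names below are invented for the example, not taken from the patches):

```c
#include <linux/compiler.h>
#include <linux/ipv6.h>

/* Writer side (e.g. the sysctl/netlink path, under RTNL): publish the
 * new value with WRITE_ONCE() so lockless readers see a whole, untorn word. */
static void demo_set_forwarding(struct ipv6_devconf *cnf, int val)
{
	WRITE_ONCE(cnf->forwarding, val);
}

/* Reader side (fast path, no lock held): pair the write with READ_ONCE()
 * so the compiler cannot tear or re-load the access, and tools like KCSAN
 * stop flagging the lockless read as a data race. */
static int demo_get_forwarding(const struct ipv6_devconf *cnf)
{
	return READ_ONCE(cnf->forwarding);
}
```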
2024-03-01  ipv6: annotate data-races around cnf.mtu6  (Eric Dumazet)  [1 file, -1/+1]
idev->cnf.mtu6 might be read locklessly, add appropriate READ_ONCE() and WRITE_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-29  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)  [1 file, -0/+1]
Cross-merge networking fixes after downstream PR. Conflicts: net/mptcp/protocol.c adf1bb78dab5 ("mptcp: fix snd_wnd initialization for passive socket") 9426ce476a70 ("mptcp: annotate lockless access for RX path fields") https://lore.kernel.org/all/20240228103048.19255709@canb.auug.org.au/ Adjacent changes: drivers/dpll/dpll_core.c 0d60d8df6f49 ("dpll: rely on rcu for netdev_dpll_pin()") e7f8df0e81bf ("dpll: move xa_erase() call in to match dpll_pin_alloc() error path order") drivers/net/veth.c 1ce7d306ea63 ("veth: try harder when allocating queue memory") 0bef512012b1 ("net: add netdev_lockdep_set_classes() to virtual drivers") drivers/net/wireless/intel/iwlwifi/mvm/d3.c 8c9bef26e98b ("wifi: iwlwifi: mvm: d3: implement suspend with MLO") 78f65fbf421a ("wifi: iwlwifi: mvm: ensure offloading TID queue exists") net/wireless/nl80211.c f78c1375339a ("wifi: nl80211: reject iftype change with mesh ID change") 414532d8aa89 ("wifi: cfg80211: use IEEE80211_MAX_MESH_ID_LEN appropriately") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-28  net: ioam6: multicast event  (Justin Iurman)  [1 file, -0/+4]
Add a multicast group to the ioam6 generic netlink family and provide ioam6_event() to send an ioam6 event to the multicast group. Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-28  tcp: make the dropreason really work when calling tcp_rcv_state_process()  (Jason Xing)  [1 file, -2/+2]
Update the three callers, covering both IPv4 and IPv6, so that the dropreason mechanism actually takes effect. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-28  tcp: add dropreasons in tcp_rcv_state_process()  (Jason Xing)  [1 file, -1/+1]
This patch equips tcp_rcv_state_process() with more drop reasons; they do not take effect yet, which a later patch will address. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-28  tcp: introduce dropreasons in receive path  (Jason Xing)  [1 file, -1/+14]
Later patches can use these more accurate reasons to recognise and pinpoint the cause of a drop. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-28  tcp: add a dropreason definitions and prepare for cookie check  (Jason Xing)  [1 file, -1/+10]
Add one drop reason to detect skbs dropped at the hook points in the cookie check, and extend NO_SOCKET to cover two more cases that can be used later. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
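For context, the consumer side of these drop reasons looks roughly like the sketch below; the specific reason value is a stand-in, since the TCP-specific names come from this series itself:

```c
#include <linux/skbuff.h>
#include <net/dropreason.h>

static void demo_drop_no_socket(struct sk_buff *skb)
{
	/* kfree_skb_reason() feeds the reason into the kfree_skb tracepoint
	 * and drop monitor, so tooling can report *why* a packet was dropped
	 * rather than seeing a bare kfree_skb(). */
	kfree_skb_reason(skb, SKB_DROP_REASON_NO_SOCKET);
}
```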
2024-02-28  net: make SK_MEMORY_PCPU_RESERV tunable  (Adam Li)  [1 file, -2/+3]
This patch adds /proc/sys/net/core/mem_pcpu_rsv sysctl file, to make SK_MEMORY_PCPU_RESERV tunable. Commit 3cd3399dd7a8 ("net: implement per-cpu reserves for memory_allocated") introduced per-cpu forward alloc cache: "Implement a per-cpu cache of +1/-1 MB, to reduce number of changes to sk->sk_prot->memory_allocated, which would otherwise be cause of false sharing." sk_prot->memory_allocated points to global atomic variable: atomic_long_t tcp_memory_allocated ____cacheline_aligned_in_smp; If increasing the per-cpu cache size from 1MB to e.g. 16MB, changes to sk->sk_prot->memory_allocated can be further reduced. Performance may be improved on system with many cores. Signed-off-by: Adam Li <adamli@os.amperecomputing.com> Reviewed-by: Christoph Lameter (Ampere) <cl@linux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
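A simplified, illustrative sketch of the per-cpu reserve being tuned here; all names are invented for the example, the real logic lives in the kernel's sk_memory_allocated_* helpers:

```c
#include <linux/percpu.h>
#include <linux/atomic.h>
#include <linux/kernel.h>

static DEFINE_PER_CPU(int, demo_fw_alloc);	/* per-cpu batched delta */
static atomic_long_t demo_memory_allocated;	/* shared, cache-hot atomic */

static void demo_memory_allocated_add(int pages, int reserve)
{
	int delta;

	preempt_disable();
	delta = this_cpu_add_return(demo_fw_alloc, pages);
	/* Only touch the shared atomic once the local batch exceeds the
	 * reserve; a larger reserve (mem_pcpu_rsv) means fewer updates
	 * to the globally shared counter, and less false sharing. */
	if (abs(delta) > reserve) {
		atomic_long_add(delta, &demo_memory_allocated);
		this_cpu_sub(demo_fw_alloc, delta);
	}
	preempt_enable();
}
```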
2024-02-26  ipv6: anycast: complete RCU handling of struct ifacaddr6  (Eric Dumazet)  [1 file, -2/+2]
struct ifacaddr6 objects are already freed after an RCU grace period. Add the __rcu qualifier to the aca_next pointer and to idev->ac_list, and add the relevant rcu_assign_pointer() and dereference accessors. ipv6_chk_acast_dev() no longer needs to acquire idev->lock. /proc/net/anycast6 is now purely RCU protected; it no longer acquires idev->lock. Similarly, in6_dump_addrs() can use RCU protection to iterate through anycast addresses. It was relying on a mixture of RCU and RTNL, but the next patches will get rid of RTNL there. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20240223201054.220534-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
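A hedged sketch of the resulting RCU pattern, using simplified, illustrative types (the real fields are aca_next and idev->ac_list):

```c
#include <linux/rcupdate.h>
#include <net/ipv6.h>

struct demo_aca {
	struct in6_addr		aca_addr;
	struct demo_aca __rcu	*aca_next;
};

/* Lookup no longer needs idev->lock: walk the chain under RCU only.
 * Writers still publish new entries with rcu_assign_pointer(). */
static bool demo_chk_acast(struct demo_aca __rcu **list,
			   const struct in6_addr *addr)
{
	struct demo_aca *aca;
	bool found = false;

	rcu_read_lock();
	for (aca = rcu_dereference(*list); aca;
	     aca = rcu_dereference(aca->aca_next)) {
		if (ipv6_addr_equal(&aca->aca_addr, addr)) {
			found = true;
			break;
		}
	}
	rcu_read_unlock();
	return found;
}
```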
2024-02-26  nexthop: allow nexthop_mpath_fill_node() to be called without RTNL  (Eric Dumazet)  [1 file, -1/+1]
nexthop_mpath_fill_node() will be potentially called from contexts holding rcu_lock instead of RTNL. Suggested-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/all/ZdZDWVdjMaQkXBgW@shredder/ Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-26  inet: allow ip_valid_fib_dump_req() to be called with RTNL or RCU  (Eric Dumazet)  [1 file, -0/+1]
Add a new field into struct fib_dump_filter, to let callers tell if they use RTNL locking or RCU. This is used in the following patch, when inet_dump_fib() no longer holds RTNL. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-26  rtnetlink: add RTNL_FLAG_DUMP_UNLOCKED flag  (Eric Dumazet)  [1 file, -0/+1]
Similarly to RTNL_FLAG_DOIT_UNLOCKED, this new flag allows dump operations registered via rtnl_register() or rtnl_register_module() to opt-out from RTNL protection. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
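Roughly how a dump handler would opt out of RTNL with the new flag; the handler and registration below are illustrative, not taken from the patch:

```c
#include <net/rtnetlink.h>
#include <linux/rtnetlink.h>

/* Runs without RTNL when registered with RTNL_FLAG_DUMP_UNLOCKED;
 * the handler must then provide its own protection, typically RCU. */
static int demo_fib_dump(struct sk_buff *skb, struct netlink_callback *cb)
{
	return skb->len;
}

static void __init demo_rtnl_init(void)
{
	rtnl_register(PF_INET, RTM_GETROUTE, NULL, demo_fib_dump,
		      RTNL_FLAG_DUMP_UNLOCKED);
}
```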
2024-02-23  genetlink: make info in GENL_REQ_ATTR_CHECK() const  (Jakub Kicinski)  [1 file, -1/+1]
Make the local variable in GENL_REQ_ATTR_CHECK() const. genl_info_dump() returns a const pointer, so the macro is currently hard to use in genl dumps. Link: https://lore.kernel.org/r/20240222222819.156320-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-22  net: mctp: take ownership of skb in mctp_local_output  (Jeremy Kerr)  [1 file, -0/+1]
Currently, mctp_local_output only takes ownership of skb on success, and we may leak an skb if mctp_local_output fails in specific states; the skb ownership isn't transferred until the actual output routing occurs. Instead, make mctp_local_output free the skb on all error paths up to the route action, so it always consumes the passed skb. Fixes: 833ef3b91de6 ("mctp: Populate socket implementation") Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240220081053.1439104-1-jk@codeconstruct.com.au Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-22  Merge tag 'nf-next-24-02-21' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next  (Jakub Kicinski)  [1 file, -1/+0]
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== netfilter updates for net-next 1. Prefer KMEM_CACHE() macro to create kmem caches, from Kunwu Chan. Patches 2 and 3 consolidate nf_log NULL checks and introduces extra boundary checks on family and type to make it clear that no out of bounds access will happen. No in-tree user currently passes such values, but thats not clear from looking at the function. From Pablo Neira Ayuso. Patch 4, also from Pablo, gets rid of unneeded conditional in nft_osf init function. Patch 5, from myself, fixes erroneous Kconfig dependencies that came in an earlier net-next pull request. This should get rid of the xtables related build failure reports. Patches 6 to 10 are an update to nftables' concatenated-ranges set type to speed up element insertions. This series also compacts a few data structures and cleans up a few oddities such as reliance on ZERO_SIZE_PTR when asking to allocate a set with no elements. From myself. Patches 11 moves the nf_reinject function from the netfilter core (vmlinux) into the nfnetlink_queue backend, the only location where this is called from. Also from myself. Patch 12, from Kees Cook, switches xtables' compat layer to use unsafe_memcpy because xt_entry_target cannot easily get converted to a real flexible array (its UAPI and used inside other structs). * tag 'nf-next-24-02-21' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: x_tables: Use unsafe_memcpy() for 0-sized destination netfilter: move nf_reinject into nfnetlink_queue modules netfilter: nft_set_pipapo: use GFP_KERNEL for insertions netfilter: nft_set_pipapo: speed up bulk element insertions netfilter: nft_set_pipapo: shrink data structures netfilter: nft_set_pipapo: do not rely on ZERO_SIZE_PTR netfilter: nft_set_pipapo: constify lookup fn args where possible netfilter: xtables: fix up kconfig dependencies netfilter: nft_osf: simplify init path netfilter: nf_log: validate nf_logger_find_get() netfilter: nf_log: consolidate check for NULL logger in lookup function netfilter: expect: Simplify the allocation of slab caches in nf_conntrack_expect_init ==================== Link: https://lore.kernel.org/r/20240221112637.5396-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-22  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)  [3 files, -2/+5]
Cross-merge networking fixes after downstream PR. Conflicts: net/ipv4/udp.c f796feabb9f5 ("udp: add local "peek offset enabled" flag") 56667da7399e ("net: implement lockless setsockopt(SO_PEEK_OFF)") Adjacent changes: net/unix/garbage.c aa82ac51d633 ("af_unix: Drop oob_skb ref before purging queue in GC.") 11498715f266 ("af_unix: Remove io_uring code for GC.") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-22  Merge tag 'wireless-next-2024-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next  (Jakub Kicinski)  [2 files, -1/+10]
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle Valo says: ==================== wireless-next patches for v6.9 The third "new features" pull request for v6.9. This is a quick followup to send commit 04edb5dc68f4 ("wifi: ath12k: Fix uninitialized use of ret in ath12k_mac_allocate()") to fix the ath12k clang warning introduced in the previous pull request. We also have support for QCA2066 in ath11k, several new features in ath12k and few other changes in drivers. In stack it's mostly cleanup and refactoring. Major changes: ath12k * firmware-2.bin support * support having multiple identical PCI devices (firmware needs to have ATH12K_FW_FEATURE_MULTI_QRTR_ID) * QCN9274: support split-PHY devices * WCN7850: enable Power Save Mode in station mode * WCN7850: P2P support ath11k: * QCA6390 & WCN6855: support 2 concurrent station interfaces * QCA2066 support iwlwifi * mvm: support wider-bandwidth OFDMA * bump firmware API to 90 for BZ/SC devices brcmfmac * DMI nvram filename quirk for ACEPC W5 Pro * tag 'wireless-next-2024-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (75 commits) wifi: wilc1000: revert reset line logic flip wifi: brcmfmac: Add DMI nvram filename quirk for ACEPC W5 Pro wifi: rtlwifi: set initial values for unexpected cases of USB endpoint priority wifi: rtl8xxxu: check vif before using in rtl8xxxu_tx() wifi: rtlwifi: rtl8192cu: Fix TX aggregation wifi: wilc1000: remove AKM suite be32 conversion for external auth request wifi: nl80211: refactor parsing CSA offsets wifi: nl80211: force WLAN_AKM_SUITE_SAE in big endian in NL80211_CMD_EXTERNAL_AUTH wifi: iwlwifi: load b0 version of ucode for HR1/HR2 wifi: iwlwifi: handle per-phy statistics from fw wifi: iwlwifi: iwl-fh.h: fix kernel-doc issues wifi: iwlwifi: api: fix kernel-doc reference wifi: iwlwifi: mvm: unlock mvm if there is no primary link wifi: iwlwifi: bump FW API to 90 for BZ/SC devices wifi: iwlwifi: mvm: support PHY context version 6 wifi: iwlwifi: mvm: partially support PHY context version 6 wifi: iwlwifi: mvm: support wider-bandwidth OFDMA wifi: cfg80211: use ML element parsing helpers wifi: mac80211: align ieee80211_mle_get_bss_param_ch_cnt() wifi: cfg80211: refactor RNR parsing ... ==================== Link: https://lore.kernel.org/r/20240222105205.CEC54C433F1@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-22  net: mctp: separate key correlation across nets  (Jeremy Kerr)  [1 file, -0/+2]
Currently, we lookup sk_keys from the entire struct net_namespace, which may contain multiple MCTP net IDs. In those cases we want to distinguish between endpoints with the same EID but different net ID. Add the net ID data to the struct mctp_sk_key, populate on add and filter on this during route lookup. For the ioctl interface, we use a default net of MCTP_INITIAL_DEFAULT_NET (ie., what will be in use for single-net configurations), but we'll extend the ioctl interface to provide net-specific tag allocation in an upcoming change. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-02-22  net: mctp: avoid confusion over local/peer dest/source addresses  (Jeremy Kerr)  [1 file, -2/+2]
We have a double-swap of local and peer addresses in mctp_alloc_local_tag; the arguments in both call sites are swapped, but there is also a swap in the implementation of alloc_local_tag. This is opaque because we're using source/dest address references, which don't match the local/peer semantics. Avoid this confusion by naming the arguments as 'local' and 'peer', and remove the double swap. The calling order now matches mctp_key_alloc. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-02-22  netfilter: nft_flow_offload: reset dst in route object after setting up flow  (Pablo Neira Ayuso)  [1 file, -1/+1]
dst is transferred to the flow object, route object does not own it anymore. Reset dst in route object, otherwise if flow_offload_add() fails, error path releases dst twice, leading to a refcount underflow. Fixes: a3c90f7a2323 ("netfilter: nf_tables: flow offload expression") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-02-21  wifi: cfg80211: clean up cfg80211_inform_bss_frame_data()  (Johannes Berg)  [1 file, -0/+2]
Make cfg80211_inform_bss_frame_data() call the existing cfg80211_inform_bss_data() after parsing the frame in the appropriate way, so we have less code duplication. This required introducing a new CFG80211_BSS_FTYPE_S1G_BEACON, but that can be used by other drivers as well. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240216135047.874aed1eff5f.Ib7d88d126eec50c64763251a78cb432bb5df14df@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-02-21  wifi: mac80211: make associated BSS pointer visible to the driver  (Miri Korenblit)  [1 file, -0/+5]
Some drivers need the data in it, so move it to the link conf, which is exposed to the driver. Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240206164849.6fe9782b87b4.Ifbffef638f07ca7f5c2b27f40d2cf2942d21de0b@changeid [remove bss pointer from internal struct, update docs] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-02-21  wifi: mac80211: check beacon countdown is complete on per link basis  (Aditya Kumar Singh)  [1 file, -1/+3]
Currently, the function that checks whether the beacon countdown is complete uses deflink to fetch the beacon and check the counter. However, with MLO, the counter needs to be checked for the beacon of a particular link. Add support for using link_id in order to fetch the beacon from that particular link's data. Signed-off-by: Aditya Kumar Singh <quic_adisi@quicinc.com> Link: https://msgid.link/20240216144621.514385-2-quic_adisi@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-02-21  Merge tag 'wireless-next-2024-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next  (David S. Miller)  [2 files, -57/+161]
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle Valo says: ==================== wireless-next patches for v6.9 The second "new features" pull request for v6.9. Lots of iwlwifi and stack changes this time. And naturally smaller changes to other drivers. We also twice merged wireless into wireless-next to avoid conflicts between the trees. Major changes: stack * mac80211: negotiated TTLM request support * SPP A-MSDU support * mac80211: wider bandwidth OFDMA config support iwlwifi * kunit tests * bump FW API to 89 for AX/BZ/SC devices * enable SPP A-MSDUs * support for new devices ath12k * refactoring in preparation for Multi-Link Operation (MLO) support * 1024 Block Ack window size support * provide firmware wmi logs via a trace event ath11k * 36 bit DMA mask support * support 6 GHz station power modes: Low Power Indoor (LPI), Standard Power) SP and Very Low Power (VLP) rtl8xxxu * TP-Link TL-WN823N V2 support ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-21  netfilter: move nf_reinject into nfnetlink_queue modules  (Florian Westphal)  [1 file, -1/+0]
No need to keep this in the core, move it to the nfnetlink_queue module. nf_reroute is moved too, there were no other callers. Signed-off-by: Florian Westphal <fw@strlen.de>
2024-02-20  net: reorganize "struct sock" fields  (Eric Dumazet)  [1 file, -47/+61]
Last major reorg happened in commit 9115e8cd2a0c ("net: reorganize struct sock for better data locality") Since then, many changes have been done. Before SO_PEEK_OFF support is added to TCP, we need to move sk_peek_off to a better location. It is time to make another pass, and add six groups, without explicit alignment. - sock_write_rx (following sk_refcnt) read-write fields in rx path. - sock_read_rx read-mostly fields in rx path. - sock_read_rxtx read-mostly fields in both rx and tx paths. - sock_write_rxtx read-write fields in both rx and tx paths. - sock_write_tx read-write fields in tx paths. - sock_read_tx read-mostly fields in tx paths. Results on TCP_RR benchmarks seem to show a gain (4 to 5 %). It is possible UDP needs a change, because sk_peek_off shares a cache line with sk_receive_queue. If this the case, we can exchange roles of sk->sk_receive and up->reader_queue queues. After this change, we have the following layout: struct sock { struct sock_common __sk_common; /* 0 0x88 */ /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */ __u8 __cacheline_group_begin__sock_write_rx[0]; /* 0x88 0 */ atomic_t sk_drops; /* 0x88 0x4 */ __s32 sk_peek_off; /* 0x8c 0x4 */ struct sk_buff_head sk_error_queue; /* 0x90 0x18 */ struct sk_buff_head sk_receive_queue; /* 0xa8 0x18 */ /* --- cacheline 3 boundary (192 bytes) --- */ struct { atomic_t rmem_alloc; /* 0xc0 0x4 */ int len; /* 0xc4 0x4 */ struct sk_buff * head; /* 0xc8 0x8 */ struct sk_buff * tail; /* 0xd0 0x8 */ } sk_backlog; /* 0xc0 0x18 */ struct { atomic_t rmem_alloc; /* 0 0x4 */ int len; /* 0x4 0x4 */ struct sk_buff * head; /* 0x8 0x8 */ struct sk_buff * tail; /* 0x10 0x8 */ /* size: 24, cachelines: 1, members: 4 */ /* last cacheline: 24 bytes */ }; __u8 __cacheline_group_end__sock_write_rx[0]; /* 0xd8 0 */ __u8 __cacheline_group_begin__sock_read_rx[0]; /* 0xd8 0 */ rcu * sk_rx_dst; /* 0xd8 0x8 */ int sk_rx_dst_ifindex; /* 0xe0 0x4 */ u32 sk_rx_dst_cookie; /* 0xe4 0x4 */ unsigned int sk_ll_usec; /* 0xe8 0x4 */ unsigned int sk_napi_id; /* 0xec 0x4 */ u16 sk_busy_poll_budget; /* 0xf0 0x2 */ u8 sk_prefer_busy_poll; /* 0xf2 0x1 */ u8 sk_userlocks; /* 0xf3 0x1 */ int sk_rcvbuf; /* 0xf4 0x4 */ rcu * sk_filter; /* 0xf8 0x8 */ /* --- cacheline 4 boundary (256 bytes) --- */ union { rcu * sk_wq; /* 0x100 0x8 */ struct socket_wq * sk_wq_raw; /* 0x100 0x8 */ }; /* 0x100 0x8 */ union { rcu * sk_wq; /* 0 0x8 */ struct socket_wq * sk_wq_raw; /* 0 0x8 */ }; void (*sk_data_ready)(struct sock *); /* 0x108 0x8 */ long sk_rcvtimeo; /* 0x110 0x8 */ int sk_rcvlowat; /* 0x118 0x4 */ __u8 __cacheline_group_end__sock_read_rx[0]; /* 0x11c 0 */ __u8 __cacheline_group_begin__sock_read_rxtx[0]; /* 0x11c 0 */ int sk_err; /* 0x11c 0x4 */ struct socket * sk_socket; /* 0x120 0x8 */ struct mem_cgroup * sk_memcg; /* 0x128 0x8 */ rcu * sk_policy[2]; /* 0x130 0x10 */ /* --- cacheline 5 boundary (320 bytes) --- */ __u8 __cacheline_group_end__sock_read_rxtx[0]; /* 0x140 0 */ __u8 __cacheline_group_begin__sock_write_rxtx[0]; /* 0x140 0 */ socket_lock_t sk_lock; /* 0x140 0x20 */ u32 sk_reserved_mem; /* 0x160 0x4 */ int sk_forward_alloc; /* 0x164 0x4 */ u32 sk_tsflags; /* 0x168 0x4 */ __u8 __cacheline_group_end__sock_write_rxtx[0]; /* 0x16c 0 */ __u8 __cacheline_group_begin__sock_write_tx[0]; /* 0x16c 0 */ int sk_write_pending; /* 0x16c 0x4 */ atomic_t sk_omem_alloc; /* 0x170 0x4 */ int sk_sndbuf; /* 0x174 0x4 */ int sk_wmem_queued; /* 0x178 0x4 */ refcount_t sk_wmem_alloc; /* 0x17c 0x4 */ /* --- cacheline 6 boundary (384 bytes) --- */ unsigned long sk_tsq_flags; /* 
0x180 0x8 */ union { struct sk_buff * sk_send_head; /* 0x188 0x8 */ struct rb_root tcp_rtx_queue; /* 0x188 0x8 */ }; /* 0x188 0x8 */ union { struct sk_buff * sk_send_head; /* 0 0x8 */ struct rb_root tcp_rtx_queue; /* 0 0x8 */ }; struct sk_buff_head sk_write_queue; /* 0x190 0x18 */ u32 sk_dst_pending_confirm; /* 0x1a8 0x4 */ u32 sk_pacing_status; /* 0x1ac 0x4 */ struct page_frag sk_frag; /* 0x1b0 0x10 */ /* --- cacheline 7 boundary (448 bytes) --- */ struct timer_list sk_timer; /* 0x1c0 0x28 */ /* XXX last struct has 4 bytes of padding */ unsigned long sk_pacing_rate; /* 0x1e8 0x8 */ atomic_t sk_zckey; /* 0x1f0 0x4 */ atomic_t sk_tskey; /* 0x1f4 0x4 */ __u8 __cacheline_group_end__sock_write_tx[0]; /* 0x1f8 0 */ __u8 __cacheline_group_begin__sock_read_tx[0]; /* 0x1f8 0 */ unsigned long sk_max_pacing_rate; /* 0x1f8 0x8 */ /* --- cacheline 8 boundary (512 bytes) --- */ long sk_sndtimeo; /* 0x200 0x8 */ u32 sk_priority; /* 0x208 0x4 */ u32 sk_mark; /* 0x20c 0x4 */ rcu * sk_dst_cache; /* 0x210 0x8 */ netdev_features_t sk_route_caps; /* 0x218 0x8 */ u16 sk_gso_type; /* 0x220 0x2 */ u16 sk_gso_max_segs; /* 0x222 0x2 */ unsigned int sk_gso_max_size; /* 0x224 0x4 */ gfp_t sk_allocation; /* 0x228 0x4 */ u32 sk_txhash; /* 0x22c 0x4 */ u8 sk_pacing_shift; /* 0x230 0x1 */ bool sk_use_task_frag; /* 0x231 0x1 */ __u8 __cacheline_group_end__sock_read_tx[0]; /* 0x232 0 */ u8 sk_gso_disabled:1; /* 0x232: 0 0x1 */ u8 sk_kern_sock:1; /* 0x232:0x1 0x1 */ u8 sk_no_check_tx:1; /* 0x232:0x2 0x1 */ u8 sk_no_check_rx:1; /* 0x232:0x3 0x1 */ /* XXX 4 bits hole, try to pack */ u8 sk_shutdown; /* 0x233 0x1 */ u16 sk_type; /* 0x234 0x2 */ u16 sk_protocol; /* 0x236 0x2 */ unsigned long sk_lingertime; /* 0x238 0x8 */ /* --- cacheline 9 boundary (576 bytes) --- */ struct proto * sk_prot_creator; /* 0x240 0x8 */ rwlock_t sk_callback_lock; /* 0x248 0x8 */ int sk_err_soft; /* 0x250 0x4 */ u32 sk_ack_backlog; /* 0x254 0x4 */ u32 sk_max_ack_backlog; /* 0x258 0x4 */ kuid_t sk_uid; /* 0x25c 0x4 */ spinlock_t sk_peer_lock; /* 0x260 0x4 */ int sk_bind_phc; /* 0x264 0x4 */ struct pid * sk_peer_pid; /* 0x268 0x8 */ const struct cred * sk_peer_cred; /* 0x270 0x8 */ ktime_t sk_stamp; /* 0x278 0x8 */ /* --- cacheline 10 boundary (640 bytes) --- */ int sk_disconnects; /* 0x280 0x4 */ u8 sk_txrehash; /* 0x284 0x1 */ u8 sk_clockid; /* 0x285 0x1 */ u8 sk_txtime_deadline_mode:1; /* 0x286: 0 0x1 */ u8 sk_txtime_report_errors:1; /* 0x286:0x1 0x1 */ u8 sk_txtime_unused:6; /* 0x286:0x2 0x1 */ /* XXX 1 byte hole, try to pack */ void * sk_user_data; /* 0x288 0x8 */ void * sk_security; /* 0x290 0x8 */ struct sock_cgroup_data sk_cgrp_data; /* 0x298 0x8 */ void (*sk_state_change)(struct sock *); /* 0x2a0 0x8 */ void (*sk_write_space)(struct sock *); /* 0x2a8 0x8 */ void (*sk_error_report)(struct sock *); /* 0x2b0 0x8 */ int (*sk_backlog_rcv)(struct sock *, struct sk_buff *); /* 0x2b8 0x8 */ /* --- cacheline 11 boundary (704 bytes) --- */ void (*sk_destruct)(struct sock *); /* 0x2c0 0x8 */ rcu * sk_reuseport_cb; /* 0x2c8 0x8 */ rcu * sk_bpf_storage; /* 0x2d0 0x8 */ struct callback_head sk_rcu __attribute__((__aligned__(8))); /* 0x2d8 0x10 */ netns_tracker ns_tracker; /* 0x2e8 0x8 */ /* size: 752, cachelines: 12, members: 105 */ /* sum members: 749, holes: 1, sum holes: 1 */ /* sum bitfield members: 12 bits, bit holes: 1, sum bit holes: 4 bits */ /* paddings: 1, sum paddings: 4 */ /* forced alignments: 1 */ /* last cacheline: 48 bytes */ }; Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: 
https://lore.kernel.org/r/20240216162006.2342759-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-02-20  net: introduce abstraction for network memory  (Mina Almasry)  [1 file, -0/+41]
Add the netmem_ref type, an abstraction for network memory. To add support for new memory types to the net stack, we must first abstract the current memory type. Currently parts of the net stack use struct page directly: - page_pool - drivers - skb_frag_t Originally the plan was to reuse struct page* for the new memory types, and to set the LSB on the page* to indicate it's not really a page. However, for compiler type checking we need to introduce a new type. netmem_ref is introduced to abstract the underlying memory type. Currently it's a no-op abstraction that is always a struct page underneath. In parallel there is an undergoing effort to add support for devmem to the net stack: https://lore.kernel.org/netdev/20231208005250.2910004-1-almasrymina@google.com/ netmem_ref can be pointers to different underlying memory types, and the low bits are set to indicate the memory type. Helpers are provided to convert netmem pointers to the underlying memory type (currently only struct page). In the devmem series helpers are provided so that calling code can use netmem without worrying about the underlying memory type unless absolutely necessary. Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
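The core of the abstraction is roughly the following shape (paraphrased here as a sketch; include/net/netmem.h holds the authoritative definitions):

```c
#include <linux/types.h>
#include <linux/mm_types.h>

/* An opaque handle for network memory; today it always wraps a page. */
typedef unsigned long __bitwise netmem_ref;

static inline struct page *netmem_to_page(netmem_ref netmem)
{
	/* Currently a no-op conversion: the low bits are reserved so that
	 * future memory types (e.g. devmem) can be tagged and told apart
	 * without every caller having to care. */
	return (__force struct page *)netmem;
}

static inline netmem_ref page_to_netmem(struct page *page)
{
	return (__force netmem_ref)page;
}
```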
2024-02-19  net: page_pool: fix recycle stats for system page_pool allocator  (Lorenzo Bianconi)  [1 file, -2/+3]
Use global percpu page_pool_recycle_stats counter for system page_pool allocator instead of allocating a separate percpu variable for each (also percpu) page pool instance. Reviewed-by: Toke Hoiland-Jorgensen <toke@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/87f572425e98faea3da45f76c3c68815c01a20ee.1708075412.git.lorenzo@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-19  page_pool: disable direct recycling based on pool->cpuid on destroy  (Alexander Lobakin)  [1 file, -5/+0]
Now that direct recycling is performed based on pool->cpuid when set, memory leaks are possible: 1. A pool is destroyed. 2. Alloc cache is emptied (it's done only once). 3. pool->cpuid is still set. 4. napi_pp_put_page() does direct recycling based on pool->cpuid. 5. Now alloc cache is not empty, but it won't ever be freed. In order to avoid that, rewrite pool->cpuid to -1 when unlinking NAPI to make sure no direct recycling will be possible after emptying the cache. This involves a bit of overhead as pool->cpuid now must be accessed via READ_ONCE() to avoid partial reads. Rename page_pool_unlink_napi() -> page_pool_disable_direct_recycling() to reflect what it actually does and unexport it. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20240215113905.96817-1-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-18  mptcp: fix lockless access in subflow ULP diag  (Paolo Abeni)  [1 file, -1/+1]
Since the introduction of the subflow ULP diag interface, the dump callback has accessed all the subflow data locklessly. We need either to annotate all the read and write operations accordingly, or to acquire the subflow socket lock. Let's do the latter, even if slower, to avoid diffstat havoc. Fixes: 5147dfb50832 ("mptcp: allow dumping subflow context to userspace") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-16  net: bridge: switchdev: Skip MDB replays of deferred events on offload  (Tobias Waldekranz)  [1 file, -0/+3]
Before this change, generation of the list of MDB events to replay would race against the creation of new group memberships, either from the IGMP/MLD snooping logic or from user configuration. While new memberships are immediately visible to walkers of br->mdb_list, the notification of their existence to switchdev event subscribers is deferred until a later point in time. So if a replay list was generated during a time that overlapped with such a window, it would also contain a replay of the not-yet-delivered event. The driver would thus receive two copies of what the bridge internally considered to be one single event. On destruction of the bridge, only a single membership deletion event was therefore sent. As a consequence of this, drivers which reference count memberships (at least DSA), would be left with orphan groups in their hardware database when the bridge was destroyed. This is only an issue when replaying additions. While deletion events may still be pending on the deferred queue, they will already have been removed from br->mdb_list, so no duplicates can be generated in that scenario. To a user this meant that old group memberships, from a bridge in which a port was previously attached, could be reanimated (in hardware) when the port joined a new bridge, without the new bridge's knowledge. For example, on an mv88e6xxx system, create a snooping bridge and immediately add a port to it: root@infix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && \ > ip link set dev x3 up master br0 And then destroy the bridge: root@infix-06-0b-00:~$ ip link del dev br0 root@infix-06-0b-00:~$ mvls atu ADDRESS FID STATE Q F 0 1 2 3 4 5 6 7 8 9 a DEV:0 Marvell 88E6393X 33:33:00:00:00:6a 1 static - - 0 . . . . . . . . . . 33:33:ff:87:e4:3f 1 static - - 0 . . . . . . . . . . ff:ff:ff:ff:ff:ff 1 static - - 0 1 2 3 4 5 6 7 8 9 a root@infix-06-0b-00:~$ The two IPv6 groups remain in the hardware database because the port (x3) is notified of the host's membership twice: once via the original event and once via a replay. Since only a single delete notification is sent, the count remains at 1 when the bridge is destroyed. Then add the same port (or another port belonging to the same hardware domain) to a new bridge, this time with snooping disabled: root@infix-06-0b-00:~$ ip link add dev br1 up type bridge mcast_snooping 0 && \ > ip link set dev x3 up master br1 All multicast, including the two IPv6 groups from br0, should now be flooded, according to the policy of br1. But instead the old memberships are still active in the hardware database, causing the switch to only forward traffic to those groups towards the CPU (port 0). Eliminate the race in two steps: 1. Grab the write-side lock of the MDB while generating the replay list. This prevents new memberships from showing up while we are generating the replay list. But it leaves the scenario in which a deferred event was already generated, but not delivered, before we grabbed the lock. Therefore: 2. Make sure that no deferred version of a replay event is already enqueued to the switchdev deferred queue, before adding it to the replay list, when replaying additions. Fixes: 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries") Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-15  Merge branch 'for-thermal-genetlink-family-bind-unbind-callbacks'  (Jakub Kicinski)  [1 file, -0/+4]
Stanislaw Gruszka says: ==================== thermal/netlink/intel_hfi: Enable HFI feature only when required The patchset introduces a new genetlink family bind/unbind callbacks and thermal/netlink notifications, which allow drivers to send netlink multicast events based on the presence of actual user-space consumers. This functionality optimizes resource usage by allowing disabling of features when not needed. v1: https://lore.kernel.org/linux-pm/20240131120535.933424-1-stanislaw.gruszka@linux.intel.com// v2: https://lore.kernel.org/linux-pm/20240206133605.1518373-1-stanislaw.gruszka@linux.intel.com/ v3: https://lore.kernel.org/linux-pm/20240209120625.1775017-1-stanislaw.gruszka@linux.intel.com/ ==================== Link: https://lore.kernel.org/r/20240212161615.161935-1-stanislaw.gruszka@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-15  genetlink: Add per family bind/unbind callbacks  (Stanislaw Gruszka)  [1 file, -0/+4]
Add genetlink family bind()/unbind() callbacks, invoked when a multicast group is added to or removed from a netlink client socket via the setsockopt() or bind() syscall. They can be used to track whether consumers of netlink multicast messages emerge or disappear. Thus, a client implementing the callbacks can now send events only when there are active consumers, preventing unnecessary work when none exist. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20240212161615.161935-2-stanislaw.gruszka@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
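A hedged sketch of how a family might use the new callbacks to track listeners; the family name and counter are invented, and the callback signatures are assumed to take the multicast group index:

```c
#include <net/genetlink.h>
#include <linux/atomic.h>

static atomic_t demo_listeners = ATOMIC_INIT(0);

static int demo_genl_bind(int mcgrp)
{
	atomic_inc(&demo_listeners);	/* a consumer joined the group */
	return 0;
}

static void demo_genl_unbind(int mcgrp)
{
	atomic_dec(&demo_listeners);	/* a consumer left */
}

static struct genl_family demo_genl_family __ro_after_init = {
	.name	 = "demo_family",
	.version = 1,
	.bind	 = demo_genl_bind,
	.unbind	 = demo_genl_unbind,
	/* .ops, .mcgrps, etc. omitted for brevity */
};

/* Event producers can then skip expensive work when nobody listens:
 * if (!atomic_read(&demo_listeners)) return; */
```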
2024-02-15  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)  [1 file, -5/+0]
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: net/core/dev.c 9f30831390ed ("net: add rcu safety to rtnl_prop_list_size()") 723de3ebef03 ("net: free altname using an RCU callback") net/unix/garbage.c 11498715f266 ("af_unix: Remove io_uring code for GC.") 25236c91b5ab ("af_unix: Fix task hung while purging oob_skb in GC.") drivers/net/ethernet/renesas/ravb_main.c ed4adc07207d ("net: ravb: Count packets instead of descriptors in GbEth RX path" ) c2da9408579d ("ravb: Add Rx checksum offload support for GbEth") net/mptcp/protocol.c bdd70eb68913 ("mptcp: drop the push_pending field") 28e5c1380506 ("mptcp: annotate lockless accesses around read-mostly fields") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-15  net: ipv6/addrconf: introduce a regen_min_advance sysctl  (Alex Henrie)  [1 file, -2/+3]
In RFC 8981, REGEN_ADVANCE cannot be less than 2 seconds, and the RFC does not permit the creation of temporary addresses with lifetimes shorter than that: > When processing a Router Advertisement with a > Prefix Information option carrying a prefix for the purposes of > address autoconfiguration (i.e., the A bit is set), the host MUST > perform the following steps: > 5. A temporary address is created only if this calculated preferred > lifetime is greater than REGEN_ADVANCE time units. However, some users want to change their IPv6 address as frequently as possible regardless of the RFC's arbitrary minimum lifetime. For the benefit of those users, add a regen_min_advance sysctl parameter that can be set to below or above 2 seconds. Link: https://datatracker.ietf.org/doc/html/rfc8981 Signed-off-by: Alex Henrie <alexhenrie24@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-02-15  wifi: cfg80211: use IEEE80211_MAX_MESH_ID_LEN appropriately  (Johannes Berg)  [1 file, -1/+1]
Even if that's the same as IEEE80211_MAX_SSID_LEN, we really should just use IEEE80211_MAX_MESH_ID_LEN for mesh, rather than having the BUILD_BUG_ON()s. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-02-13  net: add generic percpu page_pool allocator  (Lorenzo Bianconi)  [1 file, -0/+3]
Introduce a generic percpu page_pool allocator. Moreover, add page_pool_create_percpu() and a cpuid field in the page_pool struct in order to recycle the page into the page_pool "hot" cache if napi_pp_put_page() is running on the same cpu. This is a preliminary patch to add xdp multi-buff support for xdp running in generic mode. Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Toke Hoiland-Jorgensen <toke@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/80bc4285228b6f4220cd03de1999d86e46e3fcbd.1707729884.git.lorenzo@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
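An illustrative use of the new allocator, assuming the signature introduced by this patch (page_pool_params plus the owning cpuid); the wrapper name and parameters are invented:

```c
#include <net/page_pool/types.h>
#include <net/page_pool/helpers.h>
#include <linux/numa.h>

static struct page_pool *demo_create_cpu_pool(int cpu)
{
	struct page_pool_params params = {
		.order		= 0,
		.pool_size	= 256,
		.nid		= NUMA_NO_NODE,
	};

	/* Binding the pool to a cpuid lets napi_pp_put_page() recycle
	 * straight into the per-CPU "hot" cache when the page is freed
	 * on that same CPU. */
	return page_pool_create_percpu(&params, cpu);
}
```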
2024-02-12  ipv4: Set the routing scope properly in ip_route_output_ports().  (Guillaume Nault)  [1 file, -5/+2]
Set scope automatically in ip_route_output_ports() (using the socket SOCK_LOCALROUTE flag). This way, callers don't have to overload the tos with the RTO_ONLINK flag, like RT_CONN_FLAGS() does. For callers that don't pass a struct sock, this doesn't change anything as the scope is still set to RT_SCOPE_UNIVERSE when sk is NULL. Callers that passed a struct sock and used RT_CONN_FLAGS(sk) or RT_CONN_FLAGS_TOS(sk, tos) for the tos are modified to use ip_sock_tos(sk) and RT_TOS(tos) respectively, as overloading tos with the RTO_ONLINK flag now becomes unnecessary. In drivers/net/amt.c, all ip_route_output_ports() calls use a 0 tos parameter, ignoring the SOCK_LOCALROUTE flag of the socket. But the sk parameter is a kernel socket, which doesn't have any configuration path for setting SOCK_LOCALROUTE anyway. Therefore, ip_route_output_ports() will continue to initialise scope with RT_SCOPE_UNIVERSE and amt.c doesn't need to be modified. Also, remove RT_CONN_FLAGS() and RT_CONN_FLAGS_TOS() from route.h as these macros are now unused. The objective is to eventually remove RTO_ONLINK entirely to allow converting ->flowi4_tos to dscp_t. This will ensure proper isolation between the DSCP and ECN bits, thus minimising the risk of introducing bugs where TOS values interfere with ECN. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/dacfd2ab40685e20959ab7b53c427595ba229e7d.1707496938.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
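A rough before/after sketch of a typical caller, per the description above; the wrapper is invented and the exact argument list of ip_route_output_ports() is assumed:

```c
#include <net/route.h>
#include <net/ip.h>

static struct rtable *demo_connect_route(struct net *net, struct flowi4 *fl4,
					 struct sock *sk, __be32 daddr,
					 __be32 saddr, __be16 dport,
					 __be16 sport)
{
	/* Before this change, callers passed RT_CONN_FLAGS(sk) here, i.e.
	 * the TOS overloaded with RTO_ONLINK when SOCK_LOCALROUTE was set.
	 * Now the plain socket TOS is enough; the routing scope is derived
	 * from SOCK_LOCALROUTE inside ip_route_output_ports() itself. */
	return ip_route_output_ports(net, fl4, sk, daddr, saddr,
				     dport, sport, IPPROTO_TCP,
				     ip_sock_tos(sk), sk->sk_bound_dev_if);
}
```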
2024-02-12  wifi: cfg80211: report unprotected deauth/disassoc in wowlan  (Shaul Triebitz)  [1 file, -1/+4]
Add another wakeup reason, unprot_deauth_disassoc, to cfg80211_wowlan_wakeup. It is set to true if the wakeup was due to an unprotected deauth or disassoc frame in MFP; in that case, report WOWLAN_TRIG_UNPROTECTED_DEAUTH_DISASSOC. Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240206164849.a3d739850d03.I8f52a21c4f36d1af1f8068bed79e2f9cbf8289ef@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-02-12  wifi: cfg80211: optionally support monitor on disabled channels  (Johannes Berg)  [1 file, -0/+4]
If the hardware supports a disabled channel, it may in some cases be possible to use monitor mode (without any transmit) on it when it's otherwise disabled. Add a new channel flag IEEE80211_CHAN_CAN_MONITOR that makes it possible for a driver to indicate such a thing. Make it per channel so drivers could have a choice with it, perhaps it's only possible on some channels, perhaps some channels are not supported at all, but still there and marked disabled. In _nl80211_parse_chandef() simplify the code and check only for an unknown channel, _cfg80211_chandef_usable() will later check for IEEE80211_CHAN_DISABLED anyway. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240206164849.87fad3a21a09.I9116b2fdc2e2c9fd59a9273a64db7fcb41fc0328@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-02-12  wifi: cfg80211: rename UHB to 6 GHz  (Johannes Berg)  [1 file, -5/+5]
UHB stands for "Ultra High Band", but this term doesn't really exist in the spec. Rename all occurrences to "6 GHz", but keep a few defines for userspace API compatibility. Link: https://msgid.link/20240206164849.c9cfb9400839.I153db3b951934a1d84409c17fbe1f1d1782543fa@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-02-12  wifi: cfg80211: add support for link id attribute in NL80211_CMD_DEL_STATION  (Aditya Kumar Singh)  [1 file, -0/+4]
Currently, whenever the NL80211_CMD_DEL_STATION command is called without a MAC address, all stations present on that interface are flushed. However, with MLO there is a need to flush only those stations that are using a particular link of the AP MLD interface. For example, 2 GHz and 5 GHz links are part of an AP MLD, and the following stations are connected to this interface: 1. One non-EHT STA on the 2 GHz link. 2. One non-EHT STA on the 5 GHz link. 3. One Multi-Link STA having 2 GHz and 5 GHz as active links. If NL80211_CMD_DEL_STATION is currently issued on the 2 GHz link without a MAC address, it flushes all station entries. However, flushing station entry #2 at least is not desirable since it is connected to the 5 GHz link alone. Hence, add an option to pass a link ID in the command so that, if a link ID is passed, only stations using that link are flushed and others are not. After this, station entries #1 and #3 alone would be flushed and #2 will remain as it is. Signed-off-by: Aditya Kumar Singh <quic_adisi@quicinc.com> Link: https://msgid.link/20240205162952.1697646-2-quic_adisi@quicinc.com [clarify documentation] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-02-12  wifi: mac80211: remove gfp parameter from ieee80211_obss_color_collision_notify  (Lorenzo Bianconi)  [1 file, -2/+1]
Get rid of gfp parameter from ieee80211_obss_color_collision_notify since it is no longer used. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Reviewed-by: Jeff Johnson <quic_jjohnson@quicinc.com> Acked-by: Jeff Johnson <quic_jjohnson@quicinc.com> Link: https://msgid.link/f91e1c78896408ac556586ba8c99e4e389aeba02.1707389901.git.lorenzo@kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-02-12  net/ipv6: Remove expired routes with a separated list of routes.  (Kui-Feng Lee)  [1 file, -1/+45]
FIB6 GC walks trees of fib6_tables to remove expired routes. Walking a tree can be expensive if the number of routes in a table is big, even if most of them are permanent. Checking routes on a separate list of routes that have an expiration avoids this potential issue. Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-12  net/ipv6: set expires in rt6_add_dflt_router().  (Kui-Feng Lee)  [1 file, -1/+2]
Pass the duration of a lifetime (in seconds) to the function rt6_add_dflt_router() so that it can properly set the expiration time. The function ndisc_router_discovery() is the only one that calls rt6_add_dflt_router(), and it will later set the expiration time for the route created by rt6_add_dflt_router(). However, there is a gap of time between calling rt6_add_dflt_router() and setting the expiration time in ndisc_router_discovery(). During this period, there is a possibility that a new route may be removed from the routing table. By setting the correct expiration time in rt6_add_dflt_router(), we can prevent this from happening. The reason for setting RTF_EXPIRES in rt6_add_dflt_router() is to start the Garbage Collection (GC) timer, as it only activates when a route with RTF_EXPIRES is added to a table. Suggested-by: David Ahern <dsahern@kernel.org> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-10  tls: fix race between async notify and socket close  (Jakub Kicinski)  [1 file, -5/+0]
The submitting thread (one which called recvmsg/sendmsg) may exit as soon as the async crypto handler calls complete() so any code past that point risks touching already freed data. Try to avoid the locking and extra flags altogether. Have the main thread hold an extra reference, this way we can depend solely on the atomic ref counter for synchronization. Don't futz with reiniting the completion, either, we are now tightly controlling when completion fires. Reported-by: valis <sec@valis.email> Fixes: 0cada33241d9 ("net/tls: fix race condition causing kernel panic") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-09  Merge branch 'for-io_uring-add-napi-busy-polling-support'  (Jakub Kicinski)  [1 file, -0/+4]
Merge netdev bits of io_uring busy polling support. Jens Axboe says: ==================== io_uring: add napi busy polling support I finally got around to testing this patchset in its current form, and results look fine to me. It Works. Using the basic ping/pong test that's part of the liburing addition, without enabling NAPI I get: Stock settings, no NAPI, 100k packets: rtt(us) min/avg/max/mdev = 31.730/37.006/87.960/0.497 and with -t10 -b enabled: rtt(us) min/avg/max/mdev = 23.250/29.795/63.511/1.203 In short, this patchset enables per io_uring NAPI enablement, rather than need to enable that globally. This allows targeted NAPI usage with io_uring. Here's Stefan's v15 posting, which predates this one: https://lore.kernel.org/io-uring/20230608163839.2891748-1-shr@devkernel.io/ ==================== Link: https://lore.kernel.org/r/20240206163422.646218-1-axboe@kernel.dk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-09  net: add napi_busy_loop_rcu()  (Stefan Roesch)  [1 file, -0/+4]
This adds the napi_busy_loop_rcu() function. This function assumes that the calling function is already holding the rcu read lock and napi_busy_loop() does not need to take the rcu read lock. Add a NAPI_F_NO_SCHED flag, which tells __napi_busy_loop() to abort if we need to reschedule rather than drop the RCU read lock and reschedule. Signed-off-by: Stefan Roesch <shr@devkernel.io> Link: https://lore.kernel.org/r/20230608163839.2891748-3-shr@devkernel.io Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
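A hedged usage sketch: the signature is assumed to mirror napi_busy_loop(), and the timeout callback is invented for the example:

```c
#include <net/busy_poll.h>
#include <linux/jiffies.h>
#include <linux/rcupdate.h>

static bool demo_loop_end(void *arg, unsigned long start_time)
{
	return time_after(jiffies, (unsigned long)arg);	/* stop after deadline */
}

static void demo_busy_poll(unsigned int napi_id)
{
	unsigned long deadline = jiffies + usecs_to_jiffies(50);

	/* The caller already holds the RCU read lock, so the _rcu variant
	 * never drops it; instead of rescheduling it bails out early
	 * (the NAPI_F_NO_SCHED behaviour described above). */
	rcu_read_lock();
	napi_busy_loop_rcu(napi_id, demo_loop_end, (void *)deadline,
			   false, 16);
	rcu_read_unlock();
}
```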