aboutsummaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)AuthorFilesLines
2022-06-29bpf: expose bpf_{g,s}etsockopt to lsm cgroupGravatar Stanislav Fomichev 1-0/+2
I don't see how to make it nice without introducing btf id lists for the hooks where these helpers are allowed. Some LSM hooks work on the locked sockets, some are triggering early and don't grab any locks, so have two lists for now: 1. LSM hooks which trigger under socket lock - minority of the hooks, but ideal case for us, we can expose existing BTF-based helpers 2. LSM hooks which trigger without socket lock, but they trigger early in the socket creation path where it should be safe to do setsockopt without any locks 3. The rest are prohibited. I'm thinking that this use-case might be a good gateway to sleeping lsm cgroup hooks in the future. We can either expose lock/unlock operations (and add tracking to the verifier) or have another set of bpf_setsockopt wrapper that grab the locks and might sleep. Reviewed-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220628174314.1216643-7-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29bpf: implement BPF_PROG_QUERY for BPF_LSM_CGROUPGravatar Stanislav Fomichev 1-0/+3
We have two options: 1. Treat all BPF_LSM_CGROUP the same, regardless of attach_btf_id 2. Treat BPF_LSM_CGROUP+attach_btf_id as a separate hook point I was doing (2) in the original patch, but switching to (1) here: * bpf_prog_query returns all attached BPF_LSM_CGROUP programs regardless of attach_btf_id * attach_btf_id is exported via bpf_prog_info Reviewed-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220628174314.1216643-6-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29bpf: minimize number of allocated lsm slots per programGravatar Stanislav Fomichev 3-8/+10
Previous patch adds 1:1 mapping between all 211 LSM hooks and bpf_cgroup program array. Instead of reserving a slot per possible hook, reserve 10 slots per cgroup for lsm programs. Those slots are dynamically allocated on demand and reclaimed. struct cgroup_bpf { struct bpf_prog_array * effective[33]; /* 0 264 */ /* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */ struct hlist_head progs[33]; /* 264 264 */ /* --- cacheline 8 boundary (512 bytes) was 16 bytes ago --- */ u8 flags[33]; /* 528 33 */ /* XXX 7 bytes hole, try to pack */ struct list_head storages; /* 568 16 */ /* --- cacheline 9 boundary (576 bytes) was 8 bytes ago --- */ struct bpf_prog_array * inactive; /* 584 8 */ struct percpu_ref refcnt; /* 592 16 */ struct work_struct release_work; /* 608 72 */ /* size: 680, cachelines: 11, members: 7 */ /* sum members: 673, holes: 1, sum holes: 7 */ /* last cacheline: 40 bytes */ }; Reviewed-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220628174314.1216643-5-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29bpf: per-cgroup lsm flavorGravatar Stanislav Fomichev 6-1/+55
Allow attaching to lsm hooks in the cgroup context. Attaching to per-cgroup LSM works exactly like attaching to other per-cgroup hooks. New BPF_LSM_CGROUP is added to trigger new mode; the actual lsm hook we attach to is signaled via existing attach_btf_id. For the hooks that have 'struct socket' or 'struct sock' as its first argument, we use the cgroup associated with that socket. For the rest, we use 'current' cgroup (this is all on default hierarchy == v2 only). Note that for some hooks that work on 'struct sock' we still take the cgroup from 'current' because some of them work on the socket that hasn't been properly initialized yet. Behind the scenes, we allocate a shim program that is attached to the trampoline and runs cgroup effective BPF programs array. This shim has some rudimentary ref counting and can be shared between several programs attaching to the same lsm hook from different cgroups. Note that this patch bloats cgroup size because we add 211 cgroup_bpf_attach_type(s) for simplicity sake. This will be addressed in the subsequent patch. Also note that we only add non-sleepable flavor for now. To enable sleepable use-cases, bpf_prog_run_array_cg has to grab trace rcu, shim programs have to be freed via trace rcu, cgroup_bpf.effective should be also trace-rcu-managed + maybe some other changes that I'm not aware of. Reviewed-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220628174314.1216643-4-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29bpf: convert cgroup_bpf.progs to hlistGravatar Stanislav Fomichev 2-3/+3
This lets us reclaim some space to be used by new cgroup lsm slots. Before: struct cgroup_bpf { struct bpf_prog_array * effective[23]; /* 0 184 */ /* --- cacheline 2 boundary (128 bytes) was 56 bytes ago --- */ struct list_head progs[23]; /* 184 368 */ /* --- cacheline 8 boundary (512 bytes) was 40 bytes ago --- */ u32 flags[23]; /* 552 92 */ /* XXX 4 bytes hole, try to pack */ /* --- cacheline 10 boundary (640 bytes) was 8 bytes ago --- */ struct list_head storages; /* 648 16 */ struct bpf_prog_array * inactive; /* 664 8 */ struct percpu_ref refcnt; /* 672 16 */ struct work_struct release_work; /* 688 32 */ /* size: 720, cachelines: 12, members: 7 */ /* sum members: 716, holes: 1, sum holes: 4 */ /* last cacheline: 16 bytes */ }; After: struct cgroup_bpf { struct bpf_prog_array * effective[23]; /* 0 184 */ /* --- cacheline 2 boundary (128 bytes) was 56 bytes ago --- */ struct hlist_head progs[23]; /* 184 184 */ /* --- cacheline 5 boundary (320 bytes) was 48 bytes ago --- */ u8 flags[23]; /* 368 23 */ /* XXX 1 byte hole, try to pack */ /* --- cacheline 6 boundary (384 bytes) was 8 bytes ago --- */ struct list_head storages; /* 392 16 */ struct bpf_prog_array * inactive; /* 408 8 */ struct percpu_ref refcnt; /* 416 16 */ struct work_struct release_work; /* 432 72 */ /* size: 504, cachelines: 8, members: 7 */ /* sum members: 503, holes: 1, sum holes: 1 */ /* last cacheline: 56 bytes */ }; Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220628174314.1216643-3-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29bpf: add bpf_func_t and trampoline helpersGravatar Stanislav Fomichev 1-6/+5
I'll be adding lsm cgroup specific helpers that grab trampoline mutex. No functional changes. Reviewed-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220628174314.1216643-2-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-21bpf, x64: Add predicate for bpf2bpf with tailcalls support in JITGravatar Tony Ambardar 1-0/+1
The BPF core/verifier is hard-coded to permit mixing bpf2bpf and tail calls for only x86-64. Change the logic to instead rely on a new weak function 'bool bpf_jit_supports_subprog_tailcalls(void)', which a capable JIT backend can override. Update the x86-64 eBPF JIT to reflect this. Signed-off-by: Tony Ambardar <Tony.Ambardar@gmail.com> [jakub: drop MIPS bits and tweak patch subject] Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220617105735.733938-2-jakub@cloudflare.com
2022-06-20bpf: Inline calls to bpf_loop when callback is knownGravatar Eduard Zingerman 2-0/+15
Calls to `bpf_loop` are replaced with direct loops to avoid indirection. E.g. the following: bpf_loop(10, foo, NULL, 0); Is replaced by equivalent of the following: for (int i = 0; i < 10; ++i) foo(i, NULL); This transformation could be applied when: - callback is known and does not change during program execution; - flags passed to `bpf_loop` are always zero. Inlining logic works as follows: - During execution simulation function `update_loop_inline_state` tracks the following information for each `bpf_loop` call instruction: - is callback known and constant? - are flags constant and zero? - Function `optimize_bpf_loop` increases stack depth for functions where `bpf_loop` calls can be inlined and invokes `inline_bpf_loop` to apply the inlining. The additional stack space is used to spill registers R6, R7 and R8. These registers are used as loop counter, loop maximal bound and callback context parameter; Measurements using `benchs/run_bench_bpf_loop.sh` inside QEMU / KVM on i7-4710HQ CPU show a drop in latency from 14 ns/op to 2 ns/op. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/r/20220620235344.569325-4-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-20net: Introduce a new proto_ops ->read_skb()Gravatar Cong Wang 3-4/+6
Currently both splice() and sockmap use ->read_sock() to read skb from receive queue, but for sockmap we only read one entire skb at a time, so ->read_sock() is too conservative to use. Introduce a new proto_ops ->read_skb() which supports this sematic, with this we can finally pass the ownership of skb to recv actors. For non-TCP protocols, all ->read_sock() can be simply converted to ->read_skb(). Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220615162014.89193-3-xiyou.wangcong@gmail.com
2022-06-20tcp: Introduce tcp_read_skb()Gravatar Cong Wang 1-0/+2
This patch inroduces tcp_read_skb() based on tcp_read_sock(), a preparation for the next patch which actually introduces a new sock ops. TCP is special here, because it has tcp_read_sock() which is mainly used by splice(). tcp_read_sock() supports partial read and arbitrary offset, neither of them is needed for sockmap. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220615162014.89193-2-xiyou.wangcong@gmail.com
2022-06-19net: mii: add mii_bmcr_encode_fixed()Gravatar Russell King (Oracle) 1-0/+35
Add a function to encode a fixed speed/duplex to a BMCR value. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-19raw: convert raw sockets to RCUGravatar Eric Dumazet 2-1/+11
Using rwlock in networking code is extremely risky. writers can starve if enough readers are constantly grabing the rwlock. I thought rwlock were at fault and sent this patch: https://lkml.org/lkml/2022/6/17/272 But Peter and Linus essentially told me rwlock had to be unfair. We need to get rid of rwlock in networking code. Without this fix, following script triggers soft lockups: for i in {1..48} do ping -f -n -q 127.0.0.1 & sleep 0.1 done Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-19raw: use more conventional iteratorsGravatar Eric Dumazet 2-6/+5
In order to prepare the following patch, I change raw v4 & v6 code to use more conventional iterators. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-19net: dsa: felix: update base time of time-aware shaper when adjusting PTP timeGravatar Xiaoliang Yang 1-0/+7
When adjusting the PTP clock, the base time of the TAS configuration will become unreliable. We need reset the TAS configuration by using a new base time. For example, if the driver gets a base time 0 of Qbv configuration from user, and current time is 20000. The driver will set the TAS base time to be 20000. After the PTP clock adjustment, the current time becomes 10000. If the TAS base time is still 20000, it will be a future time, and TAS entry list will stop running. Another example, if the current time becomes to be 10000000 after PTP clock adjust, a large time offset can cause the hardware to hang. This patch introduces a tas_clock_adjust() function to reset the TAS module by using a new base time after the PTP clock adjustment. This can avoid issues above. Due to PTP clock adjustment can occur at any time, it may conflict with the TAS configuration. We introduce a new TAS lock to serialize the access to the TAS registers. Signed-off-by: Xiaoliang Yang <xiaoliang.yang_1@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-17Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextGravatar Jakub Kicinski 8-45/+231
Daniel Borkmann says: ==================== pull-request: bpf-next 2022-06-17 We've added 72 non-merge commits during the last 15 day(s) which contain a total of 92 files changed, 4582 insertions(+), 834 deletions(-). The main changes are: 1) Add 64 bit enum value support to BTF, from Yonghong Song. 2) Implement support for sleepable BPF uprobe programs, from Delyan Kratunov. 3) Add new BPF helpers to issue and check TCP SYN cookies without binding to a socket especially useful in synproxy scenarios, from Maxim Mikityanskiy. 4) Fix libbpf's internal USDT address translation logic for shared libraries as well as uprobe's symbol file offset calculation, from Andrii Nakryiko. 5) Extend libbpf to provide an API for textual representation of the various map/prog/attach/link types and use it in bpftool, from Daniel Müller. 6) Provide BTF line info for RV64 and RV32 JITs, and fix a put_user bug in the core seen in 32 bit when storing BPF function addresses, from Pu Lehui. 7) Fix libbpf's BTF pointer size guessing by adding a list of various aliases for 'long' types, from Douglas Raillard. 8) Fix bpftool to readd setting rlimit since probing for memcg-based accounting has been unreliable and caused a regression on COS, from Quentin Monnet. 9) Fix UAF in BPF cgroup's effective program computation triggered upon BPF link detachment, from Tadeusz Struk. 10) Fix bpftool build bootstrapping during cross compilation which was pointing to the wrong AR process, from Shahab Vahedi. 11) Fix logic bug in libbpf's is_pow_of_2 implementation, from Yuze Chi. 12) BPF hash map optimization to avoid grabbing spinlocks of all CPUs when there is no free element. Also add a benchmark as reproducer, from Feng Zhou. 13) Fix bpftool's codegen to bail out when there's no BTF, from Michael Mullin. 14) Various minor cleanup and improvements all over the place. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (72 commits) bpf: Fix bpf_skc_lookup comment wrt. return type bpf: Fix non-static bpf_func_proto struct definitions selftests/bpf: Don't force lld on non-x86 architectures selftests/bpf: Add selftests for raw syncookie helpers in TC mode bpf: Allow the new syncookie helpers to work with SKBs selftests/bpf: Add selftests for raw syncookie helpers bpf: Add helpers to issue and check SYN cookies in XDP bpf: Allow helpers to accept pointers with a fixed size bpf: Fix documentation of th_len in bpf_tcp_{gen,check}_syncookie selftests/bpf: add tests for sleepable (uk)probes libbpf: add support for sleepable uprobe programs bpf: allow sleepable uprobe programs to attach bpf: implement sleepable uprobes by chaining gps bpf: move bpf_prog to bpf.h libbpf: Fix internal USDT address translation logic for shared libraries samples/bpf: Check detach prog exist or not in xdp_fwd selftests/bpf: Avoid skipping certain subtests selftests/bpf: Fix test_varlen verification failure with latest llvm bpftool: Do not check return value from libbpf_set_strict_mode() Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK" ... ==================== Link: https://lore.kernel.org/r/20220617220836.7373-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-17bpf: Fix non-static bpf_func_proto struct definitionsGravatar Joanne Koong 1-3/+0
This patch does two things: 1) Marks the dynptr bpf_func_proto structs that were added in [1] as static, as pointed out by the kernel test robot in [2]. 2) There are some bpf_func_proto structs marked as extern which can instead be statically defined. [1] https://lore.kernel.org/bpf/20220523210712.3641569-1-joannelkoong@gmail.com/ [2] https://lore.kernel.org/bpf/62ab89f2.Pko7sI08RAKdF8R6%25lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220616225407.1878436-1-joannelkoong@gmail.com
2022-06-17net: pcs: xpcs: add CL37 1000BASE-X AN supportGravatar Ong Boon Leong 1-0/+1
For CL37 1000BASE-X AN, DW xPCS does not support C22 method but offers C45 vendor-specific MII MMD for programming. We also add the ability to disable Autoneg (through ethtool for certain network switch that supports 1000BASE-X (1000Mbps and Full-Duplex) but not Autoneg capability. v4: Fixes to comment from Russell King. Thanks! https://patchwork.kernel.org/comment/24894239/ Make xpcs_modify_changed() as private, change to use mdiodev_modify_changed() for cleaner code. v3: Fixes to issues spotted by Russell King. Thanks! https://patchwork.kernel.org/comment/24890210/ Use phylink_mii_c22_pcs_decode_state(), remove unnecessary interrupt clearing and skip speed & duplex setting if AN is enabled. v2: Fixes to issues spotted by Russell King in v1. Thanks! https://patchwork.kernel.org/comment/24826650/ Use phylink_mii_c22_pcs_encode_advertisement() and implement C45 MII ADV handling since IP only support C45 access. Tested-by: Emilio Riva <emilio.riva@ericsson.com> Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-17net: make xpcs_do_config to accept advertising for pcs-xpcs and sja1105Gravatar Ong Boon Leong 1-1/+1
xpcs_config() has 'advertising' input that is required for C37 1000BASE-X AN in later patch series. So, we prepare xpcs_do_config() for it. For sja1105, xpcs_do_config() is used for xpcs configuration without depending on advertising input, so set to NULL. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-16bpf: Add helpers to issue and check SYN cookies in XDPGravatar Maxim Mikityanskiy 2-0/+79
The new helpers bpf_tcp_raw_{gen,check}_syncookie_ipv{4,6} allow an XDP program to generate SYN cookies in response to TCP SYN packets and to check those cookies upon receiving the first ACK packet (the final packet of the TCP handshake). Unlike bpf_tcp_{gen,check}_syncookie these new helpers don't need a listening socket on the local machine, which allows to use them together with synproxy to accelerate SYN cookie generation. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20220615134847.3753567-4-maximmi@nvidia.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-16bpf: Allow helpers to accept pointers with a fixed sizeGravatar Maxim Mikityanskiy 1-0/+13
Before this commit, the BPF verifier required ARG_PTR_TO_MEM arguments to be followed by ARG_CONST_SIZE holding the size of the memory region. The helpers had to check that size in runtime. There are cases where the size expected by a helper is a compile-time constant. Checking it in runtime is an unnecessary overhead and waste of BPF registers. This commit allows helpers to accept pointers to memory without the corresponding ARG_CONST_SIZE, given that they define the memory region size in struct bpf_func_proto and use ARG_PTR_TO_FIXED_SIZE_MEM type. arg_size is unionized with arg_btf_id to reduce the kernel image size, and it's valid because they are used by different argument types. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20220615134847.3753567-3-maximmi@nvidia.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-16bpf: Fix documentation of th_len in bpf_tcp_{gen,check}_syncookieGravatar Maxim Mikityanskiy 1-4/+6
bpf_tcp_gen_syncookie expects the full length of the TCP header (with all options), and bpf_tcp_check_syncookie accepts lengths bigger than sizeof(struct tcphdr). Fix the documentation that says these lengths should be exactly sizeof(struct tcphdr). While at it, fix a typo in the name of struct ipv6hdr. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20220615134847.3753567-2-maximmi@nvidia.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-16linux/phy.h: add phydev_err_probe() wrapper for dev_err_probe()Gravatar Rasmus Villemoes 1-0/+3
The dev_err_probe() function is quite useful to avoid boilerplate related to -EPROBE_DEFER handling. Add a phydev_err_probe() helper to simplify making use of that from phy drivers which otherwise use the phydev_* helpers. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netGravatar Jakub Kicinski 15-118/+103
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-16bpf: implement sleepable uprobes by chaining gpsGravatar Delyan Kratunov 1-0/+52
uprobes work by raising a trap, setting a task flag from within the interrupt handler, and processing the actual work for the uprobe on the way back to userspace. As a result, uprobe handlers already execute in a might_fault/_sleep context. The primary obstacle to sleepable bpf uprobe programs is therefore on the bpf side. Namely, the bpf_prog_array attached to the uprobe is protected by normal rcu. In order for uprobe bpf programs to become sleepable, it has to be protected by the tasks_trace rcu flavor instead (and kfree() called after a corresponding grace period). Therefore, the free path for bpf_prog_array now chains a tasks_trace and normal grace periods one after the other. Users who iterate under tasks_trace read section would be safe, as would users who iterate under normal read sections (from non-sleepable locations). The downside is that the tasks_trace latency affects all perf_event-attached bpf programs (and not just uprobe ones). This is deemed safe given the possible attach rates for kprobe/uprobe/tp programs. Separately, non-sleepable programs need access to dynamically sized rcu-protected maps, so bpf_run_prog_array_sleepables now conditionally takes an rcu read section, in addition to the overarching tasks_trace section. Signed-off-by: Delyan Kratunov <delyank@fb.com> Link: https://lore.kernel.org/r/ce844d62a2fd0443b08c5ab02e95bc7149f9aeb1.1655248076.git.delyank@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-16bpf: move bpf_prog to bpf.hGravatar Delyan Kratunov 2-34/+36
In order to add a version of bpf_prog_run_array which accesses the bpf_prog->aux member, bpf_prog needs to be more than a forward declaration inside bpf.h. Given that filter.h already includes bpf.h, this merely reorders the type declarations for filter.h users. bpf.h users now have access to bpf_prog internals. Signed-off-by: Delyan Kratunov <delyank@fb.com> Link: https://lore.kernel.org/r/3ed7824e3948f22d84583649ccac0ff0d38b6b58.1655248076.git.delyank@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-16Merge tag 'net-5.19-rc3' of ↵Gravatar Linus Torvalds 3-84/+1
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Mostly driver fixes. Current release - regressions: - Revert "net: Add a second bind table hashed by port and address", needs more work - amd-xgbe: use platform_irq_count(), static setup of IRQ resources had been removed from DT core - dts: at91: ksz9477_evb: add phy-mode to fix port/phy validation Current release - new code bugs: - hns3: modify the ring param print info Previous releases - always broken: - axienet: make the 64b addressable DMA depends on 64b architectures - iavf: fix issue with MAC address of VF shown as zero - ice: fix PTP TX timestamp offset calculation - usb: ax88179_178a needs FLAG_SEND_ZLP Misc: - document some net.sctp.* sysctls" * tag 'net-5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (31 commits) net: axienet: add missing error return code in axienet_probe() Revert "net: Add a second bind table hashed by port and address" net: ax25: Fix deadlock caused by skb_recv_datagram in ax25_recvmsg net: usb: ax88179_178a needs FLAG_SEND_ZLP MAINTAINERS: add include/dt-bindings/net to NETWORKING DRIVERS ARM: dts: at91: ksz9477_evb: fix port/phy validation net: bgmac: Fix an erroneous kfree() in bgmac_remove() ice: Fix memory corruption in VF driver ice: Fix queue config fail handling ice: Sync VLAN filtering features for DVM ice: Fix PTP TX timestamp offset calculation mlxsw: spectrum_cnt: Reorder counter pools docs: networking: phy: Fix a typo amd-xgbe: Use platform_irq_count() octeontx2-vf: Add support for adaptive interrupt coalescing xilinx: Fix build on x86. net: axienet: Use iowrite64 to write all 64b descriptor pointers net: axienet: make the 64b addresable DMA depends on 64b archectures net: hns3: fix tm port shapping of fibre port is incorrect after driver initialization net: hns3: fix PF rss size initialization bug ...
2022-06-16Revert "net: Add a second bind table hashed by port and address"Gravatar Joanne Koong 3-84/+1
This reverts: commit d5a42de8bdbe ("net: Add a second bind table hashed by port and address") commit 538aaf9b2383 ("selftests: Add test for timing a bind request to a port with a populated bhash entry") Link: https://lore.kernel.org/netdev/20220520001834.2247810-1-kuba@kernel.org/ There are a few things that need to be fixed here: * Updating bhash2 in cases where the socket's rcv saddr changes * Adding bhash2 hashbucket locks Links to syzbot reports: https://lore.kernel.org/netdev/00000000000022208805e0df247a@google.com/ https://lore.kernel.org/netdev/0000000000003f33bc05dfaf44fe@google.com/ Fixes: d5a42de8bdbe ("net: Add a second bind table hashed by port and address") Reported-by: syzbot+015d756bbd1f8b5c8f09@syzkaller.appspotmail.com Reported-by: syzbot+98fd2d1422063b0f8c44@syzkaller.appspotmail.com Reported-by: syzbot+0a847a982613c6438fba@syzkaller.appspotmail.com Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://lore.kernel.org/r/20220615193213.2419568-1-joannelkoong@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-15Merge tag 'hardening-v5.19-rc3' of ↵Gravatar Linus Torvalds 1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening fixes from Kees Cook: - Correctly handle vm_map areas in hardened usercopy (Matthew Wilcox) - Adjust CFI RCU usage to avoid boot splats with cpuidle (Sami Tolvanen) * tag 'hardening-v5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: usercopy: Make usercopy resilient against ridiculously large copies usercopy: Cast pointer to an integer once usercopy: Handle vm_map_ram() areas cfi: Fix __cfi_slowpath_diag RCU usage with cpuidle
2022-06-14Merge branch 'mlx5-next' of ↵Gravatar Jakub Kicinski 5-32/+171
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Saeed Mahameed says: ==================== mlx5-next: updates 2022-06-14 1) Updated HW bits and definitions for upcoming features 1.1) vport debug counters 1.2) flow meter 1.3) Execute ASO action for flow entry 1.4) enhanced CQE compression 2) Add ICM header-modify-pattern RDMA API Leon Says ========= SW steering manipulates packet's header using "modifying header" actions. Many of these actions do the same operation, but use different data each time. Currently we create and keep every one of these actions, which use expensive and limited resources. Now we introduce a new mechanism - pattern and argument, which splits a modifying action into two parts: 1. action pattern: contains the operations to be applied on packet's header, mainly set/add/copy of fields in the packet 2. action data/argument: contains the data to be used by each operation in the pattern. This way we reuse same patterns with different arguments to create new modifying actions, and since many actions share the same operations, we end up creating a small number of patterns that we keep in a dedicated cache. These modify header patterns are implemented as new type of ICM memory, so the following kernel patch series add the support for this new ICM type. ========== * 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: net/mlx5: Add bits and fields to support enhanced CQE compression net/mlx5: Remove not used MLX5_CAP_BITS_RW_MASK net/mlx5: group fdb cleanup to single function net/mlx5: Add support EXECUTE_ASO action for flow entry net/mlx5: Add HW definitions of vport debug counters net/mlx5: Add IFC bits and enums for flow meter RDMA/mlx5: Support handling of modify-header pattern ICM area net/mlx5: Manage ICM of type modify-header pattern net/mlx5: Introduce header-modify-pattern ICM properties ==================== Link: https://lore.kernel.org/r/20220614184028.51548-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-14netfs: fix up netfs_inode_init() docbook commentGravatar Linus Torvalds 1-1/+1
Commit e81fb4198e27 ("netfs: Further cleanups after struct netfs_inode wrapper introduced") changed the argument types and names, and actually updated the comment too (although that was thanks to David Howells, not me: my original patch only changed the code). But the comment fixup didn't go quite far enough, and didn't change the argument name in the comment, resulting in include/linux/netfs.h:314: warning: Function parameter or member 'ctx' not described in 'netfs_inode_init' include/linux/netfs.h:314: warning: Excess function parameter 'inode' description in 'netfs_inode_init' during htmldoc generation. Fixes: e81fb4198e27 ("netfs: Further cleanups after struct netfs_inode wrapper introduced") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-06-14bpf: Fix spelling in bpf_verifier.hGravatar Hongyi Lu 1-1/+1
Minor spelling fix spotted in bpf_verifier.h. Spelling is no big deal, but it is still an improvement when reading through the code. Signed-off-by: Hongyi Lu <jwnhy0@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220613211633.58647-1-jwnhy0@gmail.com
2022-06-14Merge tag 'x86-bugs-2022-06-01' of ↵Gravatar Linus Torvalds 1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 MMIO stale data fixes from Thomas Gleixner: "Yet another hw vulnerability with a software mitigation: Processor MMIO Stale Data. They are a class of MMIO-related weaknesses which can expose stale data by propagating it into core fill buffers. Data which can then be leaked using the usual speculative execution methods. Mitigations include this set along with microcode updates and are similar to MDS and TAA vulnerabilities: VERW now clears those buffers too" * tag 'x86-bugs-2022-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/speculation/mmio: Print SMT warning KVM: x86/speculation: Disable Fill buffer clear within guests x86/speculation/mmio: Reuse SRBDS mitigation for SBDS x86/speculation/srbds: Update SRBDS mitigation selection x86/speculation/mmio: Add sysfs reporting for Processor MMIO Stale Data x86/speculation/mmio: Enable CPU Fill buffer clearing on idle x86/bugs: Group MDS, TAA & Processor MMIO Stale Data mitigations x86/speculation/mmio: Add mitigation for Processor MMIO Stale Data x86/speculation: Add a common function for MD_CLEAR mitigation update x86/speculation/mmio: Enumerate Processor MMIO Stale Data bug Documentation: Add documentation for Processor MMIO Stale Data
2022-06-13net/mlx5: Add bits and fields to support enhanced CQE compressionGravatar Ofer Levi 2-3/+20
Expose ifc bits and add needed structure fields and methods to support enhanced CQE compression feature. The enhanced CQE compression feature improves cpu utiliziation with better packet latency from nic to host. Signed-off-by: Ofer Levi <oferle@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-06-13net/mlx5: Remove not used MLX5_CAP_BITS_RW_MASKGravatar Shay Drory 1-19/+0
Remove not used MLX5_CAP_BITS_RW_MASK. While at it, remove CAP_MASK, MLX5_CAP_OFF_CMDIF_CSUM and MLX5_DEV_CAP_FLAG_*, since MLX5_CAP_BITS_RW_MASK was their only user. Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-06-13net/mlx5: Add support EXECUTE_ASO action for flow entryGravatar Jianbo Liu 1-0/+14
Attach flow meter to FTE with object id and index. Use metadata register C5 to store the packet color meter result. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-06-13net/mlx5: Add HW definitions of vport debug countersGravatar Saeed Mahameed 1-4/+19
total_q_under_processor_handle - number of queues in error state due to an async error or errored command. send_queue_priority_update_flow - number of QP/SQ priority/SL update events. cq_overrun - number of times CQ entered an error state due to an overflow. async_eq_overrun -number of time an EQ mapped to async events was overrun. comp_eq_overrun - number of time an EQ mapped to completion events was overrun. quota_exceeded_command - number of commands issued and failed due to quota exceeded. invalid_command - number of commands issued and failed dues to any reason other than quota exceeded. Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-06-13net/mlx5: Add IFC bits and enums for flow meterGravatar Jianbo Liu 2-4/+111
Add/extend structure layouts and defines for flow meter. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Ariel Levkovich <lariel@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-06-13RDMA/mlx5: Support handling of modify-header pattern ICM areaGravatar Yevgeny Kliteynik 1-0/+1
Add support for allocate/deallocate and registering MR of the new type of ICM area. Support exists only for devices that support sw_owner_v2. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-06-13net/mlx5: Manage ICM of type modify-header patternGravatar Yevgeny Kliteynik 1-0/+1
Added support for managing new type of ICM for devices that support sw_owner_v2. Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-06-13net/mlx5: Introduce header-modify-pattern ICM propertiesGravatar Yevgeny Kliteynik 1-2/+5
Added new fields for device memory capabilities, in order to support creation of ICM memory for modify header patterns. Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-06-13usercopy: Handle vm_map_ram() areasGravatar Matthew Wilcox (Oracle) 1-0/+1
vmalloc does not allocate a vm_struct for vm_map_ram() areas. That causes us to deny usercopies from those areas. This affects XFS which uses vm_map_ram() for its directories. Fix this by calling find_vmap_area() instead of find_vm_area(). Fixes: 0aef499f3172 ("mm/usercopy: Detect vmalloc overruns") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Tested-by: Zorro Lang <zlang@redhat.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220612213227.3881769-2-willy@infradead.org
2022-06-13net: make __sys_accept4_file() staticGravatar Yajun Deng 1-4/+0
__sys_accept4_file() isn't used outside of the file, make it static. As the same time, move file_flags and nofile parameters into __sys_accept4_file(). Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-13tcp: sk_forced_mem_schedule() optimizationGravatar Eric Dumazet 1-2/+1
sk_memory_allocated_add() has three callers, and returns to them @memory_allocated. sk_forced_mem_schedule() is one of them, and ignores the returned value. Change sk_memory_allocated_add() to return void. Change sock_reserve_memory() and __sk_mem_raise_allocated() to call sk_memory_allocated(). This removes one cache line miss [1] for RPC workloads, as first skbs in TCP write queue and receive queue go through sk_forced_mem_schedule(). [1] Cache line holding tcp_memory_allocated. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-12Merge tag 'wq-for-5.19-rc1-fixes' of ↵Gravatar Linus Torvalds 2-13/+61
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: "Tetsuo's patch to trigger build warnings if system-wide wq's are flushed along with a TP type update and trivial comment update" * tag 'wq-for-5.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Switch to new kerneldoc syntax for named variable macro argument workqueue: Fix type of cpu in trace event workqueue: Wrap flush_workqueue() using a macro
2022-06-12Merge tag 'random-5.19-rc2-for-linus' of ↵Gravatar Linus Torvalds 2-3/+2
git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull random number generator fixes from Jason Donenfeld: - A fix for a 5.19 regression for a case in which early device tree initializes the RNG, which flips a static branch. On most plaforms, jump labels aren't initialized until much later, so this caused splats. On a few mailing list threads, we cooked up easy fixes for arm64, arm32, and risc-v. But then things looked slightly more involved for xtensa, powerpc, arc, and mips. And at that point, when we're patching 7 architectures in a place before the console is even available, it seems like the cost/risk just wasn't worth it. So random.c works around it now by checking the already exported `static_key_initialized` boolean, as though somebody already ran into this issue in the past. I'm not super jazzed about that; it'd be prettier to not have to complicate downstream code. But I suppose it's practical. - A few small code nits and adding a missing __init annotation. - A change to the default config values to use the cpu and bootloader's seeds for initializing the RNG earlier. This brings them into line with what all the distros do (Fedora/RHEL, Debian, Ubuntu, Gentoo, Arch, NixOS, Alpine, SUSE, and Void... at least), and moreover will now give us test coverage in various test beds that might have caught the above device tree bug earlier. - A change to WireGuard CI's configuration to increase test coverage around the RNG. - A documentation comment fix to unrelated maintainerless CRC code that I was asked to take, I guess because it has to do with polynomials (which the RNG thankfully no longer uses). * tag 'random-5.19-rc2-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: wireguard: selftests: use maximum cpu features and allow rng seeding random: remove rng_has_arch_random() random: credit cpu and bootloader seeds by default random: do not use jump labels before they are initialized random: account for arch randomness in bits random: mark bootloader randomness code as __init random: avoid checking crng_ready() twice in random_init() crc-itu-t: fix typo in CRC ITU-T polynomial comment
2022-06-11workqueue: Switch to new kerneldoc syntax for named variable macro argumentGravatar Jonathan Neuschäfer 1-1/+1
The syntax without dots is available since commit 43756e347f21 ("scripts/kernel-doc: Add support for named variable macro arguments"). The same HTML output is produced with and without this patch. Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-06-11Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostGravatar Linus Torvalds 1-2/+3
Pull virtio fixes from Michael Tsirkin: "Fixes all over the place, most notably fixes for latent bugs in drivers that got exposed by suppressing interrupts before DRIVER_OK, which in turn has been done by 8b4ec69d7e09 ("virtio: harden vring IRQ")" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: um: virt-pci: set device ready in probe() vdpa: make get_vq_group and set_group_asid optional virtio: Fix all occurences of the "the the" typo vduse: Fix NULL pointer dereference on sysfs access vringh: Fix loop descriptors check in the indirect cases vdpa/mlx5: clean up indenting in handle_ctrl_vlan() vdpa/mlx5: fix error code for deleting vlan virtio-mmio: fix missing put_device() when vm_cmdline_parent registration failed vdpa/mlx5: Fix syntax errors in comments virtio-rng: make device ready before making request
2022-06-10Merge tag 'nfsd-5.19-1' of ↵Gravatar Linus Torvalds 1-1/+15
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: "Notable changes: - There is now a backup maintainer for NFSD Notable fixes: - Prevent array overruns in svc_rdma_build_writes() - Prevent buffer overruns when encoding NFSv3 READDIR results - Fix a potential UAF in nfsd_file_put()" * tag 'nfsd-5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: SUNRPC: Remove pointer type casts from xdr_get_next_encode_buffer() SUNRPC: Clean up xdr_get_next_encode_buffer() SUNRPC: Clean up xdr_commit_encode() SUNRPC: Optimize xdr_reserve_space() SUNRPC: Fix the calculation of xdr->end in xdr_get_next_encode_buffer() SUNRPC: Trap RDMA segment overflows NFSD: Fix potential use-after-free in nfsd_file_put() MAINTAINERS: reciprocal co-maintainership for file locking and nfsd
2022-06-10Merge tag 'for-5.19/dm-fixes-2' of ↵Gravatar Linus Torvalds 1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - Fix DM core's bioset initialization so that blk integrity pool is properly setup. Remove now unused bioset_init_from_src. - Fix DM zoned hang from locking imbalance due to needless check in clone_endio(). * tag 'for-5.19/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: fix zoned locking imbalance due to needless check in clone_endio block: remove bioset_init_from_src dm: fix bio_set allocation
2022-06-10net: keep sk->sk_forward_alloc as small as possibleGravatar Eric Dumazet 1-27/+2
Currently, tcp_memory_allocated can hit tcp_mem[] limits quite fast. Each TCP socket can forward allocate up to 2 MB of memory, even after flow became less active. 10,000 sockets can have reserved 20 GB of memory, and we have no shrinker in place to reclaim that. Instead of trying to reclaim the extra allocations in some places, just keep sk->sk_forward_alloc values as small as possible. This should not impact performance too much now we have per-cpu reserves: Changes to tcp_memory_allocated should not be too frequent. For sockets not using SO_RESERVE_MEM: - idle sockets (no packets in tx/rx queues) have zero forward alloc. - non idle sockets have a forward alloc smaller than one page. Note: - Removal of SK_RECLAIM_CHUNK and SK_RECLAIM_THRESHOLD is left to MPTCP maintainers as a follow up. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>