
Domain hash table #1

Open
l0kod opened this issue Jan 18, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@l0kod
Member

l0kod commented Jan 18, 2024

To keep things simple, a Landlock domain is currently a landlock_ruleset struct. This data structure includes fields that are useless for a domain, and a red-black tree that is neither necessary nor optimized for a read-only data structure.

To both improve performance and test coverage (by removing some WARN_ON_ONCE checks), we should create a dedicated domain data type, mainly consisting of a hash table. We should be able to use hash_long().

We should first create a tool to benchmark domain use and measure the performance improvement related to this new data type, see #24.

@gnoack

gnoack commented Apr 12, 2024

Bitwise lookup tables are potentially also an option for some more advanced features:
https://lore.kernel.org/all/[email protected]/

@gnoack

gnoack commented Apr 12, 2024

I have been wondering about the design of this in the past as well. The existing approach seems focused on keeping worst-case behaviour in check, but most Landlock rulesets I have seen so far are actually pretty simple. It makes me wonder whether we should also take average-case performance into account, separately from worst-case performance.

In my mind, a more "obvious" implementation approach for these domains would be to create one kernel-internal ruleset structure with these RB-trees, and just link it to its parent ruleset instead of merging them.

As rulesets are always created in a nested fashion, this will form a tree of rulesets and we can use reference counting for memory management.

  • It becomes cheaper to enforce rulesets (merge is just setting the "parent" link)
  • The access right lookup would need to traverse to the top of the linked tree of linked rulesets
    • (See below for why I think this is not as bad as it sounds)
    • The nodes can store access_mask_t again, and we'd need fewer loops for bit-by-bit work
  • A maximum nesting depth could still be enforced to avoid performance problems.
  • The implementation would be simpler. (That alone might lead to savings.)

I suspect that this might work well in the average case because, as far as I have seen, the stacked rulesets in realistic scenarios seldom refer to the same inodes. So if, during a merge operation, these inode entries don't actually collapse across the merged rulesets, we just end up with multiple RB-trees holding the same overall number of entries as the single big one built by the existing implementation.

If we move to hash tables instead of RB-trees, linear scanning can be a valid performance optimization for small hash table sizes.

There are too many variables here to reason about it with confidence; I would have a hard time proving that this approach is better without the benchmarks from #24 :)

Has this or a similar approach been discussed before for Landlock, maybe in one of the earlier patch sets?

@l0kod
Member Author

l0kod commented Apr 12, 2024

Bitwise lookup tables are potentially also an option for some more advanced features: https://lore.kernel.org/all/[email protected]/

Indeed

I have been wondering about the design of this in the past as well. The existing approach seems focused on keeping worst-case behaviour in check, but most Landlock rulesets I have seen so far are actually pretty simple. It makes me wonder whether we should also take average-case performance into account, separately from worst-case performance.

Sure, any ideas?

In my mind, a more "obvious" implementation approach for these domains would be to create one kernel-internal ruleset structure with these RB-trees, and just link it to its parent ruleset instead of merging them.

As rulesets are always created in a nested fashion, this will form a tree of rulesets and we can use reference counting for memory management.

I'm not sure I follow.

My rationale for merging rulesets was to get good locality for the rules of the same domain. It takes more memory (by duplicating data), but I don't think that is a big issue because rulesets are usually not very big, and quick lookup is more valuable.

The landlock_hierarchy tree is used to keep track of domain hierarchies.

Has this or a similar approach been discussed before for Landlock, maybe in one of the earlier patch sets?

Not much. It was just mentioned that a hash table might be a good idea.

@sm1ling-knight

@l0kod, what do you think about removing all layer-related logic for rules with non-hierarchical keys? That is, layers are currently required only to check access for some path (when we do a pathwalk through all parent inodes, collecting access bits for each layer). For example, there is no need to keep layers for net rules; we could keep a single access bitmask for each port.

@l0kod
Member Author

l0kod commented Apr 19, 2024

@l0kod, what do you think about removing all layer-related logic for rules with non-hierarchical keys? That is, layers are currently required only to check access for some path (when we do a pathwalk through all parent inodes, collecting access bits for each layer). For example, there is no need to keep layers for net rules; we could keep a single access bitmask for each port.

This is indeed not required to check a network port access, but we need to store access rights per layer to identify the layer that denied an access request. This is required for audit support, so that the logs can include the level of the layer that denied a request.

l0kod pushed a commit that referenced this issue Sep 9, 2024
commit af8e119 upstream.

Re-enumerating full-speed devices after a failed address device command
can trigger a NULL pointer dereference.

Full-speed devices may need to reconfigure the endpoint 0 Max Packet Size
value during enumeration. USB core calls usb_ep0_reinit() in this case,
which ends up calling xhci_configure_endpoint().

On Panther Point xHC, the xhci_configure_endpoint() function will
additionally check and reserve bandwidth in software. Other hosts do
this in hardware.

If the xHC address device command fails, then a new xhci_virt_device
structure is allocated as part of re-enabling the slot, but the bandwidth
table pointers are not set up properly there.
This triggers the NULL pointer dereference the next time usb_ep0_reinit()
is called and xhci_configure_endpoint() tries to check and reserve
bandwidth.

[46710.713538] usb 3-1: new full-speed USB device number 5 using xhci_hcd
[46710.713699] usb 3-1: Device not responding to setup address.
[46710.917684] usb 3-1: Device not responding to setup address.
[46711.125536] usb 3-1: device not accepting address 5, error -71
[46711.125594] BUG: kernel NULL pointer dereference, address: 0000000000000008
[46711.125600] #PF: supervisor read access in kernel mode
[46711.125603] #PF: error_code(0x0000) - not-present page
[46711.125606] PGD 0 P4D 0
[46711.125610] Oops: Oops: 0000 [#1] PREEMPT SMP PTI
[46711.125615] CPU: 1 PID: 25760 Comm: kworker/1:2 Not tainted 6.10.3_2 #1
[46711.125620] Hardware name: Gigabyte Technology Co., Ltd.
[46711.125623] Workqueue: usb_hub_wq hub_event [usbcore]
[46711.125668] RIP: 0010:xhci_reserve_bandwidth (drivers/usb/host/xhci.c

Fix this by making sure bandwidth table pointers are set up correctly
after a failed address device command, and additionally by avoiding
checking for bandwidth in cases like this where no actual endpoints are
added or removed, i.e. only context for default control endpoint 0 is
evaluated.

Reported-by: Karel Balej <[email protected]>
Closes: https://lore.kernel.org/linux-usb/[email protected]/
Cc: [email protected]
Fixes: 651aaf3 ("usb: xhci: Handle USB transaction error on address command")
Signed-off-by: Mathias Nyman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
l0kod pushed a commit that referenced this issue Sep 9, 2024
[ Upstream commit f8cde98 ]

We shouldn't set real_dev to NULL because packets can be in transit and
xfrm might call xdo_dev_offload_ok() in parallel. All callbacks assume
real_dev is set.

 Example trace:
 kernel: BUG: unable to handle page fault for address: 0000000000001030
 kernel: bond0: (slave eni0np1): making interface the new active one
 kernel: #PF: supervisor write access in kernel mode
 kernel: #PF: error_code(0x0002) - not-present page
 kernel: PGD 0 P4D 0
 kernel: Oops: 0002 [#1] PREEMPT SMP
 kernel: CPU: 4 PID: 2237 Comm: ping Not tainted 6.7.7+ #12
 kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
 kernel: RIP: 0010:nsim_ipsec_offload_ok+0xc/0x20 [netdevsim]
 kernel: bond0: (slave eni0np1): bond_ipsec_add_sa_all: failed to add SA
 kernel: Code: e0 0f 0b 48 83 7f 38 00 74 de 0f 0b 48 8b 47 08 48 8b 37 48 8b 78 40 e9 b2 e5 9a d7 66 90 0f 1f 44 00 00 48 8b 86 80 02 00 00 <83> 80 30 10 00 00 01 b8 01 00 00 00 c3 0f 1f 80 00 00 00 00 0f 1f
 kernel: bond0: (slave eni0np1): making interface the new active one
 kernel: RSP: 0018:ffffabde81553b98 EFLAGS: 00010246
 kernel: bond0: (slave eni0np1): bond_ipsec_add_sa_all: failed to add SA
 kernel:
 kernel: RAX: 0000000000000000 RBX: ffff9eb404e74900 RCX: ffff9eb403d97c60
 kernel: RDX: ffffffffc090de10 RSI: ffff9eb404e74900 RDI: ffff9eb3c5de9e00
 kernel: RBP: ffff9eb3c0a42000 R08: 0000000000000010 R09: 0000000000000014
 kernel: R10: 7974203030303030 R11: 3030303030303030 R12: 0000000000000000
 kernel: R13: ffff9eb3c5de9e00 R14: ffffabde81553cc8 R15: ffff9eb404c53000
 kernel: FS:  00007f2a77a3ad00(0000) GS:ffff9eb43bd00000(0000) knlGS:0000000000000000
 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 kernel: CR2: 0000000000001030 CR3: 00000001122ab000 CR4: 0000000000350ef0
 kernel: bond0: (slave eni0np1): making interface the new active one
 kernel: Call Trace:
 kernel:  <TASK>
 kernel:  ? __die+0x1f/0x60
 kernel: bond0: (slave eni0np1): bond_ipsec_add_sa_all: failed to add SA
 kernel:  ? page_fault_oops+0x142/0x4c0
 kernel:  ? do_user_addr_fault+0x65/0x670
 kernel:  ? kvm_read_and_reset_apf_flags+0x3b/0x50
 kernel: bond0: (slave eni0np1): making interface the new active one
 kernel:  ? exc_page_fault+0x7b/0x180
 kernel:  ? asm_exc_page_fault+0x22/0x30
 kernel:  ? nsim_bpf_uninit+0x50/0x50 [netdevsim]
 kernel: bond0: (slave eni0np1): bond_ipsec_add_sa_all: failed to add SA
 kernel:  ? nsim_ipsec_offload_ok+0xc/0x20 [netdevsim]
 kernel: bond0: (slave eni0np1): making interface the new active one
 kernel:  bond_ipsec_offload_ok+0x7b/0x90 [bonding]
 kernel:  xfrm_output+0x61/0x3b0
 kernel: bond0: (slave eni0np1): bond_ipsec_add_sa_all: failed to add SA
 kernel:  ip_push_pending_frames+0x56/0x80

Fixes: 18cb261 ("bonding: support hardware encryption offload to slaves")
Signed-off-by: Nikolay Aleksandrov <[email protected]>
Reviewed-by: Hangbin Liu <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
l0kod pushed a commit that referenced this issue Sep 9, 2024
commit 5d92c7c upstream.

If the ata_port_alloc() call in ata_host_alloc() fails,
ata_host_release() will get called.

However, the code in ata_host_release() tries to free ata_port struct
members unconditionally, which can lead to the following:

BUG: unable to handle page fault for address: 0000000000003990
PGD 0 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 10 PID: 594 Comm: (udev-worker) Not tainted 6.10.0-rc5 torvalds#44
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
RIP: 0010:ata_host_release.cold+0x2f/0x6e [libata]
Code: e4 4d 63 f4 44 89 e2 48 c7 c6 90 ad 32 c0 48 c7 c7 d0 70 33 c0 49 83 c6 0e 41
RSP: 0018:ffffc90000ebb968 EFLAGS: 00010246
RAX: 0000000000000041 RBX: ffff88810fb52e78 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff88813b3218c0 RDI: ffff88813b3218c0
RBP: ffff88810fb52e40 R08: 0000000000000000 R09: 6c65725f74736f68
R10: ffffc90000ebb738 R11: 73692033203a746e R12: 0000000000000004
R13: 0000000000000000 R14: 0000000000000011 R15: 0000000000000006
FS:  00007f6cc55b9980(0000) GS:ffff88813b300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000003990 CR3: 00000001122a2000 CR4: 0000000000750ef0
PKRU: 55555554
Call Trace:
 <TASK>
 ? __die_body.cold+0x19/0x27
 ? page_fault_oops+0x15a/0x2f0
 ? exc_page_fault+0x7e/0x180
 ? asm_exc_page_fault+0x26/0x30
 ? ata_host_release.cold+0x2f/0x6e [libata]
 ? ata_host_release.cold+0x2f/0x6e [libata]
 release_nodes+0x35/0xb0
 devres_release_group+0x113/0x140
 ata_host_alloc+0xed/0x120 [libata]
 ata_host_alloc_pinfo+0x14/0xa0 [libata]
 ahci_init_one+0x6c9/0xd20 [ahci]

Do not access ata_port struct members unconditionally.

Fixes: 633273a ("libata-pmp: hook PMP support and enable it")
Cc: [email protected]
Reviewed-by: Damien Le Moal <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: John Garry <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Niklas Cassel <[email protected]>
Signed-off-by: Oleksandr Tymoshenko <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>