Commit 5236f57e authored by Christoph Paasch, committed by Jakub Kicinski

net: Make nexthop-dumps scale linearly with the number of nexthops



When we have a (very) large number of nexthops, they do not fit within a
single message. rtm_dump_walk_nexthops() thus will be called repeatedly
and ctx->idx is used to avoid dumping the same nexthops again.

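Currently, we avoid dumping the same nexthops again by walking the
nexthop rb-tree from the left-most node until we find a node whose id
is >= s_idx. Every invocation thus re-visits all nodes that were already
dumped, so the total cost of a dump grows roughly quadratically with the
number of nexthops.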

Instead of skipping linearly, descend the tree from the root directly to
the first nexthop that should be dumped (the left-most node whose id is
>= s_idx). This finds the relevant node in O(log(n)).
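
As an illustration only, the descent described above can be sketched in
userspace C. This is not the kernel code: struct toy_node, toy_build() and
toy_lower_bound() are hypothetical names, and a plain balanced binary search
tree stands in for the kernel rb-tree. The optional step counter only shows
that the search visits about one node per tree level.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a nexthop linked into a search tree by id. */
struct toy_node {
	unsigned int id;
	struct toy_node *left;
	struct toy_node *right;
};

/* Build a balanced search tree over pool[lo..hi). The ids in pool must
 * already be sorted in ascending order.
 */
static struct toy_node *toy_build(struct toy_node *pool, size_t lo, size_t hi)
{
	size_t mid;

	assert(pool != NULL);
	if (lo >= hi)
		return NULL;

	mid = lo + (hi - lo) / 2;
	pool[mid].left = toy_build(pool, lo, mid);
	pool[mid].right = toy_build(pool, mid + 1, hi);
	return &pool[mid];
}

/* Return the left-most node whose id is >= s_idx, or NULL if there is
 * none. If steps is non-NULL, it is incremented once per visited node.
 */
static struct toy_node *toy_lower_bound(struct toy_node *root,
					unsigned int s_idx,
					unsigned int *steps)
{
	struct toy_node *best = NULL;

	while (root) {
		if (steps)
			(*steps)++;

		if (root->id < s_idx) {
			/* Everything on the left is smaller as well. */
			root = root->right;
		} else {
			/* Candidate found, but a smaller id that is still
			 * >= s_idx may exist on the left side.
			 */
			best = root;
			root = root->left;
		}
	}

	return best;
}
```

With the sorted ids 2, 4, ..., 14, a lookup for s_idx = 7 returns the node
with id 8 after visiting three nodes, and a lookup for s_idx = 15 returns
NULL, which in the dump corresponds to having nothing left to report.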

With this, the dump time drops from ~21s to ~1.2s with ~1M nexthops and
from ~1m51s to ~2.8s with ~2M nexthops:

Before:
=======

--> ~1M nexthops:
$ time ~/libnl/src/nl-nh-list | wc -l
1050624

real	0m21.080s
user	0m0.666s
sys	0m20.384s

--> ~2M nexthops:
$ time ~/libnl/src/nl-nh-list | wc -l
2101248

real	1m51.649s
user	0m1.540s
sys	1m49.908s

After:
======

--> ~1M nexthops:
$ time ~/libnl/src/nl-nh-list | wc -l
1050624

real	0m1.157s
user	0m0.926s
sys	0m0.259s

--> ~2M nexthops:
$ time ~/libnl/src/nl-nh-list | wc -l
2101248

real	0m2.763s
user	0m2.042s
sys	0m0.776s

Signed-off-by: Christoph Paasch <cpaasch@openai.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250816-nexthop_dump-v2-1-491da3462118@openai.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
parent 51992f99
+33 −3
@@ -3511,12 +3511,42 @@ static int rtm_dump_walk_nexthops(struct sk_buff *skb,
 	int err;

 	s_idx = ctx->idx;
-	for (node = rb_first(root); node; node = rb_next(node)) {
+
+	/* If this is not the first invocation, ctx->idx will contain the id of
+	 * the last nexthop we processed. Instead of starting from the very
+	 * first element of the red/black tree again and linearly skipping the
+	 * (potentially large) set of nodes with an id smaller than s_idx, walk
+	 * the tree and find the left-most node whose id is >= s_idx.  This
+	 * provides an efficient O(log n) starting point for the dump
+	 * continuation.
+	 */
+	if (s_idx != 0) {
+		struct rb_node *tmp = root->rb_node;
+
+		node = NULL;
+		while (tmp) {
+			struct nexthop *nh;
+
+			nh = rb_entry(tmp, struct nexthop, rb_node);
+			if (nh->id < s_idx) {
+				tmp = tmp->rb_right;
+			} else {
+				/* Track current candidate and keep looking on
+				 * the left side to find the left-most
+				 * (smallest id) that is still >= s_idx.
+				 */
+				node = tmp;
+				tmp = tmp->rb_left;
+			}
+		}
+	} else {
+		node = rb_first(root);
+	}
+
+	for (; node; node = rb_next(node)) {
 		struct nexthop *nh;

 		nh = rb_entry(node, struct nexthop, rb_node);
-		if (nh->id < s_idx)
-			continue;

 		ctx->idx = nh->id;
 		err = nh_cb(skb, cb, nh, data);