路由查询函数fib_table_lookup,参数中指定了查询操作需要的路由表tb,流信息flp,以及标志位fib_flags,返回结果res。其中流结构flp成员目的地址daddr为查询所需的主key值。fib_flags支持两个lookup标志:FIB_LOOKUP_NOREF和FIB_LOOKUP_IGNORE_LINKSTATE。
首先,取得trie的首节点pn,接下来由其第一个子节点n(其cindex=0)开始遍历trie树。
/* should be called with rcu_read_lock */
int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp, struct fib_result *res, int fib_flags)
{
struct trie *t = (struct trie *) tb->tb_data;
#ifdef CONFIG_IP_FIB_TRIE_STATS
struct trie_use_stats __percpu *stats = t->stats;
#endif
const t_key key = ntohl(flp->daddr);
struct key_vector *n, *pn;
struct fib_alias *fa;
unsigned long index;
t_key cindex;
pn = t->kv;
cindex = 0;
n = get_child_rcu(pn, cindex);
if (!n) {
trace_fib_table_lookup(tb->tb_id, flp, NULL, -EAGAIN);
return -EAGAIN;
}
步骤一,首先获取键值key(目的地址)在节点n中的索引值。get_cindex函数在获取索引时,隐含的进行了前缀对比,如果索引index大于等于节点n可处理的位数量,表明前缀不相等(参见get_cindex中的异或操作),否则,key值与n节点的前缀相等,index为子节点索引值。接下来,如果节点n为叶子节点,即为需要的节点,路由查找完成。
/* Step 1: Travel to the longest prefix match in the trie */
for (;;) {
index = get_cindex(key, n);
/* This bit of code is a bit tricky but it combines multiple
* checks into a single check. The prefix consists of the
* prefix plus zeros for the "bits" in the prefix. The index
* is the difference between the key and this value. From
* this we can actually derive several pieces of data.
* if (index >= (1ul << bits))
* we have a mismatch in skip bits and failed
* else
* we know the value is cindex
*
* This check is safe even if bits == KEYLENGTH due to the
* fact that we can only allocate a node with 32 bits if a
* long is greater than 32 bits.
*/
if (index >= (1ul << n->bits))
break;
/* we have found a leaf. Prefixes have already been compared */
if (IS_LEAF(n))
goto found;
如下get_cindex宏定义,如果两者的前缀相同,结果为索引值。否则,一定大于等于n节点的最大索引值。
#define get_cindex(key, kv) (((key) ^ (kv)->key) >> (kv)->pos)
在节点n为中间节点的情况下,如果其自身的后缀长度slen大于其位置pos,表明其并不是最长前缀匹配的情况(可能走在错误的路上),之后可能会再次回到此节点,所以记录下其位置,节省处理周期。此时,键值key匹配了中间节点n的前缀,继续取出键值key在n中的索引指向的子节点,再次遍历此子节点。
如果新的子节点为空,表明未能完成最长前缀匹配,需要向trie树根回溯,调整到backtrace处理,这是就需要用到pn父节点了。
/* only record pn and cindex if we are going to be chopping
* bits later. Otherwise we are just wasting cycles.
*/
if (n->slen > n->pos) {
pn = n;
cindex = index;
}
n = get_child_rcu(n, index);
if (unlikely(!n))
goto backtrace;
}
如下所示路由表,查找目的地址(key值)为1.1.2.3的路由,首先key=1.1.2.3与1.0.0.0节点比较,前14个bits是相同的,两者匹配,但此节点为中间节点,非叶子节点,继续查找。第15bit是不相同的,节点1.0.0.0处理数据位bit[15,16],key对应子节点索引bit[15,16]=10b,即子节点1.1.0.0/23。
节点1.1.0.0/23与key=1.1.2.3的前22bits是相同的,两者匹配,但此节点也不是叶子节点,继续查找。两者第23bit不相同,节点1.1.0.0处理bit[23,24],key=1.1.2.3对应的bit[23,24]=10b,即子节点索引为2,匹配到空节点。子节点1.1.0.0索引为0,子节点1.1.1.0索引为1,需要由backtrace段代码处理。
另外,在遍历节点1.1.0.0/23时,由于其后缀长度slen为16(节点及其子节点最大后缀长度),大于其节点位置pos(32-23=9),所以,将父节点pn更新为节点1.1.0.0/23,并且记录n在父节点中的索引cindex。
# ip route add 1.1.1.0/24 via 192.168.2.2 table 10
# ip route add 1.1.0.0/16 via 192.168.2.3 table 10
# ip route add 1.0.0.0/8 via 192.168.2.4 table 10
#
# cat /proc/net/fib_trie
Id 10:
+-- 1.0.0.0/15 2
|-- 1.0.0.0
/8 universe UNICAST 192.168.2.4
+-- 1.1.0.0/23 2
|-- 1.1.0.0
/16 universe UNICAST 192.168.2.3
|-- 1.1.1.0
/24 universe UNICAST 192.168.2.2
步骤二,将上一步找到的节点n的子节点指针保存到cptr,如果键值key与节点n的前缀不匹配,或者在前缀匹配的情况下,节点n的后缀等于其节点位置pos,即节点n及其之后的子节点的后缀长度都要大于等于n节点的位置pos,相反的前缀长度在逐渐变短。按照最长匹配原则,以上两种情况都需要执行backtrace。
如果节点n为叶子节点,跳出循环。
/* Step 2: Sort out leaves and begin backtracing for longest prefix */
for (;;) {
/* record the pointer where our next node pointer is stored */
struct key_vector __rcu **cptr = n->tnode;
/* This test verifies that none of the bits that differ
* between the key and the prefix exist in the region of
* the lsb and higher in the prefix.
*/
if (unlikely(prefix_mismatch(key, n)) || (n->slen == n->pos))
goto backtrace;
/* exit out and process leaf */
if (unlikely(IS_LEAF(n)))
break;
在节点n为中间节点的情况下,取出其子节点,也由n表示。如果当前的子节点索引不为零,将cindex的最低一位为1的位(Least Significant Bit)置零。例如对于节点bits为3的情况下,子节点索引范围[0 - 7],当cindex为6时(110b),去掉LSB就是100b,即新的cindex为4。取得此时cindex对应的新子节点,如果其不为空,到以上的for循环开始,验证此节点是否匹配键值key。否则,如果新的子节点为空,继续去掉cindex的下一个LSB。
如果cindex等于零时,表明已遍历所有的子节点,需要向trie树上部继续(backtrace情况),如果父节点为trie树的根,返回EAGAIN的错误。否则,取得父节点pn在其父节点(n节点的祖父节点)中的索引,将pn替换为其父节点,继续在trie树的这一级进行查找。
/* Don't bother recording parent info. Since we are in
* prefix match mode we will have to come back to wherever
* we started this traversal anyway
*/
while ((n = rcu_dereference(*cptr)) == NULL) {
backtrace:
/* If we are at cindex 0 there are no more bits for
* us to strip at this level so we must ascend back
* up one level to see if there are any more bits to be stripped there.
*/
while (!cindex) {
t_key pkey = pn->key;
/* If we don't have a parent then there is nothing
* for us to do as we do not have any further nodes to parse.
*/
if (IS_TRIE(pn)) {
trace_fib_table_lookup(tb->tb_id, flp, NULL, -EAGAIN);
return -EAGAIN;
}
/* Get Child's index */
pn = node_parent_rcu(pn);
cindex = get_index(pkey, pn);
}
/* strip the least significant bit from the cindex */
cindex &= cindex - 1;
cptr = &pn->tnode[cindex]; /* grab pointer for next child node */
}
}
一种情况是,要查找的键值key与节点n的前缀完全相同,两者异或为零。另外,两者不完全相等,但是在节点n的前缀所表示的数值范围内,两者是完全相同的。后一种情况下,键值key的长度大于节点n的前缀长度。
static inline t_key prefix_mismatch(t_key key, struct key_vector *n)
{
t_key prefix = n->key;
return (key ^ prefix) & (prefix | -prefix);
}
如下所示路由表,键值key等于1.1.2.3,在步骤一时未能找了匹配的节点,从backtrace开始,把cindex的LSB剥离,索引值由10b变为00b,即索引0,即节点1.1.0.0/16,由于索引0对应的子节点存在,退出while循环。到步骤二开始的for循环进行处理,由于新节点的前缀与key匹配,并且新节点为叶子节点,即为我们要找的路由节点,网关为192.168.2.3。
# cat /proc/net/fib_trie
Id 10:
+-- 1.0.0.0/15 2
|-- 1.0.0.0
/8 universe UNICAST 192.168.2.4
+-- 1.1.0.0/23 2
|-- 1.1.0.0
/16 universe UNICAST 192.168.2.3
|-- 1.1.1.0
/24 universe UNICAST 192.168.2.2
步骤三,此处找到了匹配的节点n,遍历节点n的fib_alias路由别名链表,首先,键值key与节点n不同的最高位(Most Significant Bit)要小于当前路由的后缀表示的位数,例如,对于8位后缀(24位前缀)的情况,index要小于256(1<<8),即节点n的key值与查找键值key的至少前24位完全相同。
另外,如果指定了tos值,遍历的fib_alias的tos值也需要相等;fib_alias对应的fib_info有效,并且fib_info的scope值小于指定的值(scope为距离的一种表示,距离要近)。
如果fib_alias的类型为RTN_BLACKHOLE,RTN_UNREACHABLE,RTN_PROHIBIT等类型,err将小于零,返回错误。
found:
/* this line carries forward the xor from earlier in the function */
index = key ^ n->key;
/* Step 3: Process the leaf, if that fails fall back to backtracing */
hlist_for_each_entry_rcu(fa, &n->leaf, fa_list) {
struct fib_info *fi = fa->fa_info;
struct fib_nh_common *nhc;
int nhsel, err;
if ((BITS_PER_LONG > KEYLENGTH) || (fa->fa_slen < KEYLENGTH)) {
if (index >= (1ul << fa->fa_slen))
continue;
}
if (fa->fa_tos && fa->fa_tos != flp->flowi4_tos)
continue;
if (fi->fib_dead)
continue;
if (fa->fa_info->fib_scope < flp->flowi4_scope)
continue;
fib_alias_accessed(fa);
err = fib_props[fa->fa_type].error;
if (unlikely(err < 0)) {
out_reject:
#ifdef CONFIG_IP_FIB_TRIE_STATS
this_cpu_inc(stats->semantic_match_passed);
#endif
trace_fib_table_lookup(tb->tb_id, flp, NULL, err);
return err;
}
以下检测fib_info的下一跳信息,如果下一跳处于不可用状态(RTNH_F_DEAD),继续遍历fib_alias链表的下一个。如果fib_info的下一跳引用有值,检查下一跳是否为blackhole,,为真的话由以上的out_reject段代码处理。否则,在nexthop引用中查找合适的下一跳。没有找到合适的下一跳,将再次执行backtrace流程。
if (fi->fib_flags & RTNH_F_DEAD)
continue;
if (unlikely(fi->nh)) {
if (nexthop_is_blackhole(fi->nh)) {
err = fib_props[RTN_BLACKHOLE].error;
goto out_reject;
}
nhc = nexthop_get_nhc_lookup(fi->nh, fib_flags, flp, &nhsel);
if (nhc)
goto set_result;
goto miss;
}
对于多径路由,一是路由本身配置了多个下一跳,或者路由的下一跳引用是一个多路径的组。函数fib_lookup_good_nhc判断每个下一跳的可用性,最后,将路由查询结果赋值给res结构返回调用者。否则,在没有找到合适的下一跳的情况下,将再次执行backtrace流程。
for (nhsel = 0; nhsel < fib_info_num_path(fi); nhsel++) {
nhc = fib_info_nhc(fi, nhsel);
if (!fib_lookup_good_nhc(nhc, fib_flags, flp))
continue;
set_result:
if (!(fib_flags & FIB_LOOKUP_NOREF))
refcount_inc(&fi->fib_clntref);
res->prefix = htonl(n->key);
res->prefixlen = KEYLENGTH - fa->fa_slen;
res->nh_sel = nhsel;
res->nhc = nhc;
res->type = fa->fa_type;
res->scope = fi->fib_scope;
res->fi = fi;
res->table = tb;
res->fa_head = &n->leaf;
#ifdef CONFIG_IP_FIB_TRIE_STATS
this_cpu_inc(stats->semantic_match_passed);
#endif
trace_fib_table_lookup(tb->tb_id, flp, nhc, err);
return err;
}
}
miss:
#ifdef CONFIG_IP_FIB_TRIE_STATS
this_cpu_inc(stats->semantic_match_miss);
#endif
goto backtrace;
内核版本 5.10
|