我在之前已经写过两篇关于netfilter的文章:
Linux netfilter hook源码分析(基于内核代码版本4.18.0-80)_yg@hunter的博客-CSDN博客
Linux下使用Netfilter框架编写内核模块_yg@hunter的博客-CSDN博客
本文来详细分析linux netfilter在网络层的实现细节,本文代码分析是基于Linux内核版本4.18.0-80(也即centos8.0系统上)。
我画了一张linux内核协议栈网络层netfilter(iptables)的全景图,里面涉及内容比较多,我下面会详细讲解。
?
INGRESS入口钩子是在 Linux 内核 4.2 中添加的。 与其他 netfilter 挂钩不同,入口挂钩附加到特定的网络接口。 可以使用带有 ingress 钩子的 nftables 来强制实施甚至在prerouting之前生效的非常早的过滤策略。
请注意,在这个非常早期的阶段,碎片化的数据报尚未重新组装, 例如匹配 ip saddr 和 daddr 适用于所有 ip 数据包,但匹配传输层的头部(如 udp dport)仅适用于未分段的数据包或第一个片段,所以入口钩子提供了一种替代 tc 入口过滤的方法,但是仍然需要 tc 进行流量整形。
Netfilter/iptables由table和chain以及一些规则组成。
目录
1、iptables的链(chain)
⑴、五个链(chain)及对应钩子
①网络数据包的三种流转路径
②源码中网络层的5个hook的定义
2、iptables的表
⑵源码中IP层的表的定义
①raw表
②mangle表
③nat表
④filter表
3、Netfilter在网络层安装的5个hook点
⑴、NF_INET_PRE_ROUTING
①net\ipv4\ip_input.c
②net\ipv4\xfrm4_input.c
⑵、NF_INET_LOCAL_IN
⑶、NF_INET_FORWARD
①net\ipv4\ip_forward.c
②net\ipv4\ipmr.c
⑷、NF_INET_LOCAL_OUT
①net\ipv4\ip_output.c
②net\ipv4\raw.c
⑸、NF_INET_POST_ROUTING
1、iptables的链(chain)
netfilter在网络层安装了5个钩子,对应5个链,还可以通过编写内核模块来扩展这5个链的功能。
⑴、五个链(chain)及对应钩子
-
PREROUTING -->?NF_INET_PRE_ROUTING -
INPUT? -->?NF_INET_LOCAL_IN -
FORWARD -->?NF_INET_FORWARD -
OUTPUT-->?NF_INET_LOCAL_OUT -
POSTROUTING -->?NF_INET_POST_ROUTING
下图展示了网络层五条链的位置:
①网络数据包的三种流转路径
-
从网络流入本机:PREROUTING --> INPUT-->localhost -
从本机应用发出:localhost-->OUTPUT--> POSTROUTING -
经本机转发:PREROUTING --> FORWARD --> POSTROUTIN
②源码中网络层的5个hook的定义
include\uapi\linux\netfilter_ipv4.h
/* IP Hooks */
/* After promisc drops, checksum checks. */
#define NF_IP_PRE_ROUTING 0
/* If the packet is destined for this box. */
#define NF_IP_LOCAL_IN 1
/* If the packet is destined for another interface. */
#define NF_IP_FORWARD 2
/* Packets coming from a local process. */
#define NF_IP_LOCAL_OUT 3
/* Packets about to hit the wire. */
#define NF_IP_POST_ROUTING 4
#define NF_IP_NUMHOOKS 5
#endif /* ! __KERNEL__ */
在include\uapi\linux\netfilter.h中有对应的hook点定义:
enum nf_inet_hooks {
NF_INET_PRE_ROUTING,
NF_INET_LOCAL_IN,
NF_INET_FORWARD,
NF_INET_LOCAL_OUT,
NF_INET_POST_ROUTING,
NF_INET_NUMHOOKS
};
注:在4.2及以上版本内核中又增加了一个hook点NF_NETDEV_INGRESS:
enum nf_dev_hooks {
NF_NETDEV_INGRESS,
NF_NETDEV_NUMHOOKS
};
为 NFPROTO_INET 系列添加了 NF_INET_INGRESS 伪钩子。 这是将这个新钩子映射到现有的 NFPROTO_NETDEV 和 NF_NETDEV_INGRESS 钩子。 该钩子不保证数据包仅是 inet,用户必须明确过滤掉非 ip 流量。 这种基础结构使得在 nf_tables 中支持这个新钩子变得更容易。
2、iptables的表
⑴五张表(table)
-
raw:关闭启用的连接跟踪机制,加快封包穿越防火墙速度 -
mangle:修改数据标记位规则表 -
nat:network address translation 地址转换规则表 -
filter:过滤规则表,根据预定义的规则过滤符合条件的数据包,默认表 -
security:用于强制访问控制(MAC)网络规则,由Linux安全模块(如SELinux)实现
下图展示了五张表分布在对应链上:
?
⑵源码中IP层的表的定义
netfilter中的表的定义
include\linux\netfilter\x_tables.h
struct xt_table {
struct list_head list;
/* What hooks you will enter on */
unsigned int valid_hooks;
/* Man behind the curtain... */
struct xt_table_info *private;
/* Set this to THIS_MODULE if you are a module, otherwise NULL */
struct module *me;
u_int8_t af; /* address/protocol family */
int priority; /* hook order */
/* called when table is needed in the given netns */
int (*table_init)(struct net *net);
/* A unique name... */
const char name[XT_TABLE_MAXNAMELEN];
};
网络层各hook点的优先级
数值越低优先级越高:
include\uapi\linux\netfilter_ipv4.h
enum nf_ip_hook_priorities {
NF_IP_PRI_FIRST = INT_MIN,
NF_IP_PRI_RAW_BEFORE_DEFRAG = -450,
NF_IP_PRI_CONNTRACK_DEFRAG = -400,
NF_IP_PRI_RAW = -300,
NF_IP_PRI_SELINUX_FIRST = -225,
NF_IP_PRI_CONNTRACK = -200,
NF_IP_PRI_MANGLE = -150,
NF_IP_PRI_NAT_DST = -100,
NF_IP_PRI_FILTER = 0,
NF_IP_PRI_SECURITY = 50,
NF_IP_PRI_NAT_SRC = 100,
NF_IP_PRI_SELINUX_LAST = 225,
NF_IP_PRI_CONNTRACK_HELPER = 300,
NF_IP_PRI_CONNTRACK_CONFIRM = INT_MAX,
NF_IP_PRI_LAST = INT_MAX,
};
下面我们看下netfilter/iptables的这几张表的在内核源码中的定义。
①raw表
?源码里RAW_VALID_HOOKS宏可以看出raw表只有NF_INET_PRE_ROUTING、NF_INET_LOCAL_OUT链有效。
②mangle表
源码中valid_hooks参数可以看出mangle表对NF_INET_PRE_ROUTING、NF_INET_LOCAL_IN、NF_INET_FORWARD、NF_INET_LOCAL_OUT、NF_INET_POST_ROUTING五条链都有效。
③nat表
valid_hooks变量可以看出nat表只有NF_INET_PRE_ROUTING、NF_INET_POST_ROUTING、NF_INET_LOCAL_OUT、NF_INET_LOCAL_IN四条链有效。
④filter表
源码中valid_hooks参数可以看出filter表对NF_INET_LOCAL_IN、NF_INET_FORWARD、NF_INET_LOCAL_OUT三条链有效。
网络层的五张表在内核中对应了五个内核模块:
3、Netfilter在网络层安装的5个hook点
下面我们看下网络层的各个hook点安装的位置:
⑴、NF_INET_PRE_ROUTING
它是所有传入数据包到达的第一个hook点,它是在路由子系统中执行查找之前。这个钩子在 IPv4 的 ip_rcv() 方法中,在 IPv6 的 ipv6_rcv() 方法中。
/*
* Main IP Receive routine.
*/
int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev)
{
const struct iphdr *iph;
u32 len;
/* When the interface is in promisc. mode, drop all the crap
* that it receives, do not try to analyse it.
*/
if (skb->pkt_type == PACKET_OTHERHOST)
goto drop;
.......
iph = ip_hdr(skb);
/*
* RFC1122: 3.2.1.2 MUST silently discard any IP frame that fails the checksum.
*
* Is the datagram acceptable?
*
* 1. Length at least the size of an ip header
* 2. Version of 4
* 3. Checksums correctly. [Speed optimisation for later, skip loopback checksums]
* 4. Doesn't have a bogus length
*/
if (iph->ihl < 5 || iph->version != 4)
goto inhdr_error;
.......
return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, NULL, skb,
dev, NULL,
ip_rcv_finish);
.......
out:
return NET_RX_DROP;
}
int xfrm4_transport_finish(struct sk_buff *skb, int async)
{
struct iphdr *iph = ip_hdr(skb);
iph->protocol = XFRM_MODE_SKB_CB(skb)->protocol;
#ifndef CONFIG_NETFILTER
if (!async)
return -iph->protocol;
#endif
__skb_push(skb, skb->data - skb_network_header(skb));
iph->tot_len = htons(skb->len);
ip_send_check(iph);
NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, NULL, skb,
skb->dev, NULL,
xfrm4_rcv_encap_finish);
return 0;
}
⑵、NF_INET_LOCAL_IN
这个钩子在 IPv4 的 ip_local_deliver() 方法中,在 IPv6 的 ip6_input() 方法中。所有路由到本地主机的数据包都会到达此hook点,它是在首先通过 NF_INET_PRE_ROUTING hook点并在路由子系统中执行查找之后进到这里。
net\ipv4\ip_input.c
/*
* Deliver IP Packets to the higher protocol layers.
*/
int ip_local_deliver(struct sk_buff *skb)
{
/*
* Reassemble IP fragments.
*/
if (ip_is_fragment(ip_hdr(skb))) {
if (ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER))
return 0;
}
return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, NULL, skb,
skb->dev, NULL,
ip_local_deliver_finish);
}
⑶、NF_INET_FORWARD
①net\ipv4\ip_forward.c
int ip_forward(struct sk_buff *skb)
{
u32 mtu;
struct iphdr *iph; /* Our header */
struct rtable *rt; /* Route we use */
struct ip_options *opt = &(IPCB(skb)->opt);
.......
skb_forward_csum(skb);
/*
* According to the RFC, we must first decrease the TTL field. If
* that reaches zero, we must reply an ICMP control message telling
* that the packet's lifetime expired.
*/
if (ip_hdr(skb)->ttl <= 1)
goto too_many_hops;
if (!xfrm4_route_forward(skb))
goto drop;
rt = skb_rtable(skb);
if (opt->is_strictroute && rt->rt_uses_gateway)
goto sr_failed;
IPCB(skb)->flags |= IPSKB_FORWARDED;
mtu = ip_dst_mtu_maybe_forward(&rt->dst, true);
if (!ip_may_fragment(skb) && ip_exceeds_mtu(skb, mtu)) {
IP_INC_STATS(dev_net(rt->dst.dev), IPSTATS_MIB_FRAGFAILS);
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
htonl(mtu));
goto drop;
}
/* We are about to mangle packet. Copy it! */
if (skb_cow(skb, LL_RESERVED_SPACE(rt->dst.dev)+rt->dst.header_len))
goto drop;
iph = ip_hdr(skb);
/* Decrease ttl after skb cow done */
ip_decrease_ttl(iph);
/*
* We now generate an ICMP HOST REDIRECT giving the route
* we calculated.
*/
if (IPCB(skb)->flags & IPSKB_DOREDIRECT && !opt->srr &&
!skb_sec_path(skb))
ip_rt_send_redirect(skb);
skb->priority = rt_tos2priority(iph->tos);
return NF_HOOK(NFPROTO_IPV4, NF_INET_FORWARD, NULL, skb,
skb->dev, rt->dst.dev, ip_forward_finish);
sr_failed:
/*
* Strict routing permits no gatewaying
*/
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_SR_FAILED, 0);
goto drop;
too_many_hops:
/* Tell the sender its packet died... */
IP_INC_STATS_BH(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_INHDRERRORS);
icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0);
drop:
kfree_skb(skb);
return NET_RX_DROP;
}
②net\ipv4\ipmr.c
/* Processing handlers for ipmr_forward */
static void ipmr_queue_xmit(struct net *net, struct mr_table *mrt,
int in_vifi, struct sk_buff *skb,
struct mfc_cache *c, int vifi)
{
const struct iphdr *iph = ip_hdr(skb);
struct vif_device *vif = &mrt->vif_table[vifi];
struct net_device *dev;
struct rtable *rt;
struct flowi4 fl4;
int encap = 0;
if (vif->dev == NULL)
goto out_free;
if (vif->flags & VIFF_REGISTER) {
vif->pkt_out++;
vif->bytes_out += skb->len;
vif->dev->stats.tx_bytes += skb->len;
vif->dev->stats.tx_packets++;
ipmr_cache_report(mrt, skb, vifi, IGMPMSG_WHOLEPKT);
goto out_free;
}
if (ipmr_forward_offloaded(skb, mrt, in_vifi, vifi))
goto out_free;
if (vif->flags & VIFF_TUNNEL) {
rt = ip_route_output_ports(net, &fl4, NULL,
vif->remote, vif->local,
0, 0,
IPPROTO_IPIP,
RT_TOS(iph->tos), vif->link);
if (IS_ERR(rt))
goto out_free;
encap = sizeof(struct iphdr);
} else {
rt = ip_route_output_ports(net, &fl4, NULL, iph->daddr, 0,
0, 0,
IPPROTO_IPIP,
RT_TOS(iph->tos), vif->link);
if (IS_ERR(rt))
goto out_free;
}
dev = rt->dst.dev;
if (skb->len+encap > dst_mtu(&rt->dst) && (ntohs(iph->frag_off) & IP_DF)) {
/* Do not fragment multicasts. Alas, IPv4 does not
* allow to send ICMP, so that packets will disappear
* to blackhole.
*/
IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
ip_rt_put(rt);
goto out_free;
}
encap += LL_RESERVED_SPACE(dev) + rt->dst.header_len;
if (skb_cow(skb, encap)) {
ip_rt_put(rt);
goto out_free;
}
vif->pkt_out++;
vif->bytes_out += skb->len;
skb_dst_drop(skb);
skb_dst_set(skb, &rt->dst);
ip_decrease_ttl(ip_hdr(skb));
/* FIXME: forward and output firewalls used to be called here.
* What do we do with netfilter? -- RR
*/
if (vif->flags & VIFF_TUNNEL) {
ip_encap(net, skb, vif->local, vif->remote);
/* FIXME: extra output firewall step used to be here. --RR */
vif->dev->stats.tx_packets++;
vif->dev->stats.tx_bytes += skb->len;
}
IPCB(skb)->flags |= IPSKB_FORWARDED;
/* RFC1584 teaches, that DVMRP/PIM router must deliver packets locally
* not only before forwarding, but after forwarding on all output
* interfaces. It is clear, if mrouter runs a multicasting
* program, it should receive packets not depending to what interface
* program is joined.
* If we will not make it, the program will have to join on all
* interfaces. On the other hand, multihoming host (or router, but
* not mrouter) cannot join to more than one interface - it will
* result in receiving multiple packets.
*/
NF_HOOK(NFPROTO_IPV4, NF_INET_FORWARD, NULL, skb,
skb->dev, dev,
ipmr_forward_finish);
return;
out_free:
kfree_skb(skb);
}
⑷、NF_INET_LOCAL_OUT
①net\ipv4\ip_output.c
static int __ip_local_out_sk(struct sock *sk, struct sk_buff *skb)
{
struct iphdr *iph = ip_hdr(skb);
iph->tot_len = htons(skb->len);
ip_send_check(iph);
skb->protocol = htons(ETH_P_IP);
return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT, sk, skb, NULL,
skb_dst(skb)->dev, dst_output_sk);
}
int __ip_local_out(struct sk_buff *skb)
{
return __ip_local_out_sk(skb->sk, skb);
}
int ip_local_out_sk(struct sock *sk, struct sk_buff *skb)
{
int err;
err = __ip_local_out(skb);
if (likely(err == 1))
err = dst_output_sk(sk, skb);
return err;
}
EXPORT_SYMBOL_GPL(ip_local_out_sk);
②net\ipv4\raw.c
static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
void *from, size_t length,
struct rtable **rtp,
unsigned int flags)
{
struct inet_sock *inet = inet_sk(sk);
struct net *net = sock_net(sk);
struct iphdr *iph;
struct sk_buff *skb;
unsigned int iphlen;
int err;
struct rtable *rt = *rtp;
int hlen, tlen;
if (length > rt->dst.dev->mtu) {
ip_local_error(sk, EMSGSIZE, fl4->daddr, inet->inet_dport,
rt->dst.dev->mtu);
return -EMSGSIZE;
}
if (flags&MSG_PROBE)
goto out;
hlen = LL_RESERVED_SPACE(rt->dst.dev);
tlen = rt->dst.dev->needed_tailroom;
skb = sock_alloc_send_skb(sk,
length + hlen + tlen + 15,
flags & MSG_DONTWAIT, &err);
if (skb == NULL)
goto error;
skb_reserve(skb, hlen);
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
skb_dst_set(skb, &rt->dst);
*rtp = NULL;
skb_reset_network_header(skb);
iph = ip_hdr(skb);
skb_put(skb, length);
skb->ip_summed = CHECKSUM_NONE;
if (flags & MSG_CONFIRM)
skb_set_dst_pending_confirm(skb, 1);
skb->transport_header = skb->network_header;
err = -EFAULT;
if (memcpy_fromiovecend((void *)iph, from, 0, length))
goto error_free;
iphlen = iph->ihl * 4;
/*
* We don't want to modify the ip header, but we do need to
* be sure that it won't cause problems later along the network
* stack. Specifically we want to make sure that iph->ihl is a
* sane value. If ihl points beyond the length of the buffer passed
* in, reject the frame as invalid
*/
err = -EINVAL;
if (iphlen > length)
goto error_free;
if (iphlen >= sizeof(*iph)) {
if (!iph->saddr)
iph->saddr = fl4->saddr;
iph->check = 0;
iph->tot_len = htons(length);
if (!iph->id)
ip_select_ident(net, skb, NULL);
iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
}
if (iph->protocol == IPPROTO_ICMP)
icmp_out_count(net, ((struct icmphdr *)
skb_transport_header(skb))->type);
err = NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_OUT, sk, skb,
NULL, rt->dst.dev, dst_output_sk);
if (err > 0)
err = net_xmit_errno(err);
if (err)
goto error;
out:
return 0;
error_free:
kfree_skb(skb);
error:
IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
if (err == -ENOBUFS && !inet->recverr)
err = 0;
return err;
}
⑸、NF_INET_POST_ROUTING
net\ipv4\ip_output.c
int ip_mc_output(struct sock *sk, struct sk_buff *skb)
{
struct rtable *rt = skb_rtable(skb);
struct net_device *dev = rt->dst.dev;
/*
* If the indicated interface is up and running, send the packet.
*/
IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUT, skb->len);
skb->dev = dev;
skb->protocol = htons(ETH_P_IP);
/*
* Multicasts are looped back for other local users
*/
if (rt->rt_flags&RTCF_MULTICAST) {
if (sk_mc_loop(sk)
#ifdef CONFIG_IP_MROUTE
/* Small optimization: do not loopback not local frames,
which returned after forwarding; they will be dropped
by ip_mr_input in any case.
Note, that local frames are looped back to be delivered
to local recipients.
This check is duplicated in ip_mr_input at the moment.
*/
&&
((rt->rt_flags & RTCF_LOCAL) ||
!(IPCB(skb)->flags & IPSKB_FORWARDED))
#endif
) {
struct sk_buff *newskb = skb_clone(skb, GFP_ATOMIC);
if (newskb)
NF_HOOK(NFPROTO_IPV4, NF_INET_POST_ROUTING,
sk, newskb, NULL, newskb->dev,
dev_loopback_xmit);
}
/* Multicasts with ttl 0 must not go beyond the host */
if (ip_hdr(skb)->ttl == 0) {
kfree_skb(skb);
return 0;
}
}
if (rt->rt_flags&RTCF_BROADCAST) {
struct sk_buff *newskb = skb_clone(skb, GFP_ATOMIC);
if (newskb)
NF_HOOK(NFPROTO_IPV4, NF_INET_POST_ROUTING, sk, newskb,
NULL, newskb->dev, dev_loopback_xmit);
}
return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING, sk, skb, NULL,
skb->dev, ip_finish_output,
!(IPCB(skb)->flags & IPSKB_REROUTED));
}
以上我们看到xfrm中也有安装相关hook点,这里引用官方资料介绍下什么是xfrm:
xfrm is an IP framework for transforming packets (such as encrypting their payloads). This framework is used to implement the IPsec protocol suite (with the state object operating on the Security Association Database, and the policy object operating on the Security Policy Database). It is also used for the IP Payload Compression Protocol and features of Mobile IPv6.
简单来说,xfrm就是IP层的一个框架,用于封装实现IPSec协议。
到此,我们基于源码分析介绍完了Netfilter在网络层的实现。
|