Overview
A Kubernetes Service can be implemented by kube-proxy in either iptables or ipvs mode. The iptables implementation is gradually being displaced: ipvs, with its flexible load-balancing policies and higher efficiency, can take over the service load-balancing job entirely. netfilter is the kernel's packet-filtering framework, and both iptables and ipvs are built on top of it, each providing different functionality: iptables covers packet filtering, packet modification, NAT and even load balancing, while ipvs is dedicated to load balancing. The two can also be combined — iptables rules can match traffic that ipvs is handling (the ipvs match extension) — which reflects how much broader iptables' functionality and generality are.
The rest of this article introduces, in order, the rough workings of netfilter, the tables and chains of iptables, and how ipvs works, and finally uses examples from a k8s cluster to show how the iptables and ipvs modes differ in what they look like on a node.
Netfilter
The official description of netfilter can be found at Netfilter.
Here is my understanding of netfilter. It runs in kernel space, hooked into the key points of protocol-stack processing where it can act on packets: every packet entering the host passes PREROUTING, every packet sent from the application layer passes the OUTPUT chain, and packets not destined for this host pass the FORWARD chain (when ip_forward is enabled).
The code below shows how the stack enters netfilter processing.
// Entry point where the IP layer receives packets from the network (PRE_ROUTING)
NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
net, NULL, skb, dev, NULL,
ip_rcv_finish);
// Handing the packet up to the local transport layer (LOCAL_IN)
NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN,
net, NULL, skb, skb->dev, NULL,
ip_local_deliver_finish);
// Exit point where the packet is passed down towards the device (POST_ROUTING)
NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
net, sk, skb, NULL, dev,
ip_finish_output,
!(IPCB(skb)->flags & IPSKB_REROUTED));
// Definition of NF_HOOK; okfn is the callback invoked once the hooks let the packet through
static inline int
NF_HOOK(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk, struct sk_buff *skb,
struct net_device *in, struct net_device *out,
int (*okfn)(struct net *, struct sock *, struct sk_buff *))
{
int ret = nf_hook(pf, hook, net, sk, skb, in, out, okfn);
if (ret == 1)
ret = okfn(net, sk, skb);
return ret;
}
// A rough view of which hook points exist for each protocol family.
struct netns_nf {
const struct nf_logger __rcu *nf_loggers[NFPROTO_NUMPROTO];
struct nf_hook_entries __rcu *hooks_ipv4[NF_INET_NUMHOOKS];
struct nf_hook_entries __rcu *hooks_ipv6[NF_INET_NUMHOOKS];
struct nf_hook_entries __rcu *hooks_arp[NF_ARP_NUMHOOKS];
struct nf_hook_entries __rcu *hooks_bridge[NF_INET_NUMHOOKS];
struct nf_hook_entries __rcu *hooks_decnet[NF_DN_NUMHOOKS];
};
As the excerpts above show, the NF_HOOK family of functions (there are other similar entry points) is the API-like interface netfilter exposes to the stack, and one can guess that inside it the rule lists of every table registered on the current chain are evaluated. For example, if the chain is LOCAL_IN, the mangle, filter and nat tables are traversed before the okfn callback is finally invoked.
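For a rough picture of what happens behind NF_HOOK, here is a simplified sketch modeled on the kernel's nf_hook_slow (NF_QUEUE handling, RCU and the per-entry priv argument are glossed over, so read it as an illustration rather than the exact source): each hook entry registered on the chain is called in priority order, and its verdict decides whether the packet keeps going.
// Simplified sketch (not verbatim kernel source) of the traversal behind
// nf_hook: walk the nf_hook_entries registered for this chain in priority
// order; any NF_DROP frees the packet, and only if every hook accepts does
// NF_HOOK go on to call okfn.
static int nf_hook_traverse_sketch(struct sk_buff *skb,
                                   struct nf_hook_state *state,
                                   const struct nf_hook_entries *e)
{
        unsigned int i;
        int verdict;

        for (i = 0; i < e->num_hook_entries; i++) {
                verdict = nf_hook_entry_hookfn(&e->hooks[i], skb, state);
                switch (verdict & NF_VERDICT_MASK) {
                case NF_ACCEPT:         /* this hook is happy, ask the next one */
                        break;
                case NF_DROP:           /* drop: free the skb, okfn is never called */
                        kfree_skb(skb);
                        return -EPERM;
                default:                /* NF_QUEUE / NF_STOLEN etc. omitted here */
                        return 0;
                }
        }
        return 1;                       /* all hooks accepted; the caller invokes okfn */
}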
iptables tables
Take the nat table as an example; the other tables are much the same. nf_nat_ipv4_ops contains several hook functions, sitting at several hook points of the netfilter framework; when a packet reaches one of those hook points, the corresponding function is executed.
For instance, when an incoming packet reaches the PRE_ROUTING chain and gets to the nat table, the hook function iptable_nat_do_chain is invoked, which then matches the packet against the rules in the table.
// iptable_nat_do_chain calls ipt_do_table to do the actual work
static const struct nf_hook_ops nf_nat_ipv4_ops[] = {
// Only the two common entries are shown; nat also hooks LOCAL_IN and LOCAL_OUT.
{
.hook = iptable_nat_do_chain,
.pf = NFPROTO_IPV4,
.hooknum = NF_INET_PRE_ROUTING,
.priority = NF_IP_PRI_NAT_DST,
},
{
.hook = iptable_nat_do_chain,
.pf = NFPROTO_IPV4,
.hooknum = NF_INET_POST_ROUTING,
.priority = NF_IP_PRI_NAT_SRC,
},
};
static int ipt_nat_register_lookups(struct net *net)
{
// Register the hook functions into this netns' netfilter state (excerpt; the loop over nf_nat_ipv4_ops is omitted).
ret = nf_nat_l3proto_ipv4_register_fn(net, &nf_nat_ipv4_ops[i]);
}
// Invoke a registered hook function; for these tables it resolves to iptable_filter_hook or iptable_nat_do_chain below.
static inline int
nf_hook_entry_hookfn(const struct nf_hook_entry *entry, struct sk_buff *skb,
struct nf_hook_state *state)
{
return entry->hook(entry->priv, skb, state);
}
....
/*
The other tables follow the same pattern:
filter : iptable_filter_hook -> ipt_do_table
nat : iptable_nat_do_chain -> ipt_do_table
raw : iptable_raw_hook -> ipt_do_table
*/
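In recent kernels each of these per-table hook functions is a thin wrapper that simply hands the packet to ipt_do_table together with the table owned by the current network namespace; roughly (paraphrased from memory of net/ipv4/netfilter/iptable_nat.c, so the exact field names may differ between kernel versions):
static unsigned int iptable_nat_do_chain(void *priv,
                                         struct sk_buff *skb,
                                         const struct nf_hook_state *state)
{
        /* look up this netns' nat table and run its rule chains */
        return ipt_do_table(skb, state, state->net->ipv4.nat_table);
}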
iptables rules
An iptables rule describes what should happen to a packet: how to match it, and what to do with it once matched. For example: drop packets whose source address is 192.168.0.1; masquerade all packets leaving the host; mark packets from the 192.168.3.0/24 subnet with 0x4000; and so on.
The code shows that a packet is run through netfilter's rules sequentially, so a large number of iptables rules is bound to hurt the kernel's packet-processing performance.
unsigned int
ipt_do_table(struct sk_buff *skb,
const struct nf_hook_state *state,
struct xt_table *table) {
struct ipt_entry *e;
e = get_entry(table_base, private->hook_entry[hook]);
acpar.match->match(skb, &acpar);
t = ipt_get_target_c(e);
// The target function here is the packet-handling routine behind REDIRECT, DNAT, SNAT, set-mark and the like.
verdict = t->u.kernel.target->target(skb, &acpar);
}
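Filling in a little of what the excerpt above elides, the body of ipt_do_table is essentially a linear scan over the rules of the chain; the sketch below is heavily simplified (counters and jumps into user-defined chains are left out) but shows why very long rule lists cost the kernel time on every packet.
// Sketch only: rules are laid out back to back and evaluated in order
// until one both matches the packet and returns a final verdict.
e = get_entry(table_base, private->hook_entry[hook]);
do {
        const struct xt_entry_target *t;

        if (!ip_packet_match(ip, indev, outdev, &e->ip, acpar.fragoff)) {
                e = ipt_next_entry(e);          /* no match: fall through to the next rule */
                continue;
        }
        t = ipt_get_target_c(e);
        verdict = t->u.kernel.target->target(skb, &acpar);
} while (verdict == XT_CONTINUE && !acpar.hotdrop);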
// Registration of the NAT targets
static struct xt_target xt_nat_target_reg[] __read_mostly = {
{
.name = "SNAT",
.revision = 0,
.checkentry = xt_nat_checkentry_v0,
.destroy = xt_nat_destroy,
.target = xt_snat_target_v0,
.targetsize = sizeof(struct nf_nat_ipv4_multi_range_compat),
.family = NFPROTO_IPV4,
.table = "nat",
.hooks = (1 << NF_INET_POST_ROUTING) |
(1 << NF_INET_LOCAL_IN),
.me = THIS_MODULE,
},
{
.name = "DNAT",
.revision = 0,
.checkentry = xt_nat_checkentry_v0,
.destroy = xt_nat_destroy,
.target = xt_dnat_target_v0,
.targetsize = sizeof(struct nf_nat_ipv4_multi_range_compat),
.family = NFPROTO_IPV4,
.table = "nat",
.hooks = (1 << NF_INET_PRE_ROUTING) |
(1 << NF_INET_LOCAL_OUT),
.me = THIS_MODULE,
},
{
.name = "SNAT",
.revision = 1,
.checkentry = xt_nat_checkentry,
.destroy = xt_nat_destroy,
.target = xt_snat_target_v1,
.targetsize = sizeof(struct nf_nat_range),
.table = "nat",
.hooks = (1 << NF_INET_POST_ROUTING) |
(1 << NF_INET_LOCAL_IN),
.me = THIS_MODULE,
},
{
.name = "DNAT",
.revision = 1,
.checkentry = xt_nat_checkentry,
.destroy = xt_nat_destroy,
.target = xt_dnat_target_v1,
.targetsize = sizeof(struct nf_nat_range),
.table = "nat",
.hooks = (1 << NF_INET_PRE_ROUTING) |
(1 << NF_INET_LOCAL_OUT),
.me = THIS_MODULE,
},
// .....
}
How MARK tagging works in iptables
In a k8s environment, listing the local rules with iptables will often show entries containing MARK; that is iptables' MARK feature. The usual pattern is: early in packet processing, match the packets of interest and tag them; later in packet processing, act on the packets carrying that tag.
So how is this implemented, and where is the tag actually stored?
// xt_register_target registers the handler for a target such as REJECT or MARK.
// skb->mark is the "Generic packet mark", i.e. the sk_buff itself has a field that records the mark value.
static struct xt_target mark_tg_reg __read_mostly = {
.name = "MARK",
.revision = 2,
.family = NFPROTO_UNSPEC,
.target = mark_tg,
.targetsize = sizeof(struct xt_mark_tginfo2),
.me = THIS_MODULE,
};
static unsigned int
mark_tg(struct sk_buff *skb, const struct xt_action_param *par)
{
const struct xt_mark_tginfo2 *info = par->targinfo;
skb->mark = (skb->mark & ~info->mask) ^ info->mark;
return XT_CONTINUE;
}
/* Registration hooks for targets. */
int xt_register_target(struct xt_target *target)
{
u_int8_t af = target->family;
mutex_lock(&xt[af].mutex);
list_add(&target->list, &xt[af].target);
mutex_unlock(&xt[af].mutex);
return 0;
}
ipt_do_table ->
t->u.kernel.target->target(skb, &acpar)
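That covers the tagging half. The matching half (rules of the "-m mark --mark 0x4000/0x4000" style) is just as small: the xt_mark match module simply compares the field stored in the sk_buff. Quoted from memory of net/netfilter/xt_mark.c, so treat the details as approximate:
static bool
mark_mt(const struct sk_buff *skb, struct xt_action_param *par)
{
        const struct xt_mark_mtinfo1 *info = par->matchinfo;

        /* the "label" is nothing more than this 32-bit field on the skb */
        return ((skb->mark & info->mask) == info->mark) ^ info->invert;
}
Because the mark lives only in the in-kernel sk_buff, it never appears on the wire and is gone once the packet leaves the host.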
ipvs
ipvs is also built on netfilter, and it registers hook functions at the LOCAL_IN and LOCAL_OUT hook points, so why does the iptables command show none of this? Because iptables only lists the rules configured through the iptables command at each hook point; those rules are aggregated behind one hook function per table registered with netfilter, while ipvs registers its own functions directly with netfilter.
Since ipvs can take over the load-balancing role from iptables, the rest of this section focuses on how ipvs is implemented.
With ipvs enabled, kube-proxy creates a dummy interface called kube-ipvs0 and binds every Service address in the cluster to it. When a pod sends a request to a Service address, the packet is accepted locally and triggers the ipvs hook on LOCAL_IN; ipvs picks a backend server according to the load-balancing policy, rewrites the Service address to the chosen pod IP, and pushes the packet through LOCAL_OUT. From there on it follows the normal cross-host pod-to-pod path.
The code below also shows where ipvs sits relative to the iptables tables: its LOCAL_IN hooks register at priority NF_IP_PRI_NAT_SRC - 1/-2, i.e. they run only after the mangle and filter tables have already processed the packet.
static const struct nf_hook_ops ip_vs_ops[] = {
/* After packet filtering, change source only for VS/NAT */
{
.hook = ip_vs_reply4,
.pf = NFPROTO_IPV4,
.hooknum = NF_INET_LOCAL_IN,
.priority = NF_IP_PRI_NAT_SRC - 2,
},
/* After packet filtering, forward packet through VS/DR, VS/TUN,
* or VS/NAT(change destination), so that filtering rules can be
* applied to IPVS. */
{
.hook = ip_vs_remote_request4,
.pf = NFPROTO_IPV4,
.hooknum = NF_INET_LOCAL_IN,
.priority = NF_IP_PRI_NAT_SRC - 1,
},
/* Before ip_vs_in, change source only for VS/NAT */
{
.hook = ip_vs_local_reply4,
.pf = NFPROTO_IPV4,
.hooknum = NF_INET_LOCAL_OUT,
.priority = NF_IP_PRI_NAT_DST + 1,
},
/* After mangle, schedule and forward local requests */
{
.hook = ip_vs_local_request4,
.pf = NFPROTO_IPV4,
.hooknum = NF_INET_LOCAL_OUT,
.priority = NF_IP_PRI_NAT_DST + 2,
},
// ....
};
nf_register_net_hooks(net, ip_vs_ops, ARRAY_SIZE(ip_vs_ops));
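// For reference, an excerpt of enum nf_ip_hook_priorities from
// include/uapi/linux/netfilter_ipv4.h (values quoted from memory):
// smaller numbers run earlier at a given hook point, so a priority of
// NF_IP_PRI_NAT_SRC - 1/-2 on LOCAL_IN places ipvs after the mangle and
// filter tables but just before source NAT.
enum nf_ip_hook_priorities {
        NF_IP_PRI_RAW     = -300,
        NF_IP_PRI_MANGLE  = -150,
        NF_IP_PRI_NAT_DST = -100,    /* iptables nat table, DNAT side */
        NF_IP_PRI_FILTER  =    0,    /* iptables filter table */
        NF_IP_PRI_NAT_SRC =  100,    /* iptables nat table, SNAT side */
        /* ... */
};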
// Each forwarding mode is bound to a different transmit function.
static inline void ip_vs_bind_xmit(struct ip_vs_conn *cp)
{
switch (IP_VS_FWD_METHOD(cp)) {
case IP_VS_CONN_F_MASQ:
cp->packet_xmit = ip_vs_nat_xmit;
break;
case IP_VS_CONN_F_TUNNEL:
#ifdef CONFIG_IP_VS_IPV6
if (cp->daf == AF_INET6)
cp->packet_xmit = ip_vs_tunnel_xmit_v6;
else
#endif
cp->packet_xmit = ip_vs_tunnel_xmit;
break;
case IP_VS_CONN_F_DROUTE:
cp->packet_xmit = ip_vs_dr_xmit;
break;
case IP_VS_CONN_F_LOCALNODE:
cp->packet_xmit = ip_vs_null_xmit;
break;
case IP_VS_CONN_F_BYPASS:
cp->packet_xmit = ip_vs_bypass_xmit;
break;
}
}
// When a packet comes in through LOCAL_IN, the call chain is:
ip_vs_remote_request4
=> ip_vs_in
=> cp->packet_xmit
// which hands the packet over to LOCAL_OUT for further processing:
static inline int ip_vs_nat_send_or_cont(int pf, struct sk_buff *skb,
struct ip_vs_conn *cp, int local) {
NF_HOOK(pf, NF_INET_LOCAL_OUT, cp->ipvs->net, NULL, skb,
NULL, skb_dst(skb)->dev, dst_output);
}
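The "pick a backend according to the load-balancing policy" step mentioned earlier happens inside ip_vs_in, which delegates to the scheduler bound to the virtual service. Below is a minimal sketch of round-robin selection, loosely modeled on net/netfilter/ipvs/ip_vs_rr.c (locking, weighting beyond the availability check and the wrap-around second pass are omitted), so treat it as an illustration of the idea rather than the real scheduler:
static struct ip_vs_dest *rr_schedule_sketch(struct ip_vs_service *svc)
{
        /* sched_data remembers where the previous pick stopped */
        struct ip_vs_dest *dest = list_entry(svc->sched_data,
                                             struct ip_vs_dest, n_list);

        list_for_each_entry_continue(dest, &svc->destinations, n_list) {
                if ((dest->flags & IP_VS_DEST_F_AVAILABLE) &&
                    atomic_read(&dest->weight) > 0) {
                        svc->sched_data = &dest->n_list;  /* resume here next time */
                        return dest;       /* this real server gets the new connection */
                }
        }
        return NULL;                       /* no available destination in this pass */
}
Each entry on svc->destinations corresponds to one of the real servers that ipvsadm -L prints for a virtual service.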
kube-proxy in iptables mode
[root@master-9 net]# iptables -L KUBE-SERVICES -t nat -n
Chain KUBE-SERVICES (2 references)
target prot opt source destination
KUBE-SVC-NPX46M4PTMTKRN6Y tcp -- 0.0.0.0/0 10.96.0.1 /* default/kubernetes:https cluster IP */ tcp dpt:443
KUBE-SVC-P4Q3KNUAWJVP4ILH tcp -- 0.0.0.0/0 10.96.0.131 /* default/nginx:http cluster IP */ tcp dpt:80
KUBE-SVC-TCOU7JCQXEZGVUNU udp -- 0.0.0.0/0 10.96.0.10 /* kube-system/kube-dns:dns cluster IP */ udp dpt:53
KUBE-SVC-I24EZXP75AX5E7TU tcp -- 0.0.0.0/0 10.96.0.199 /* calico-apiserver/calico-api:apiserver cluster IP */ tcp dpt:443
KUBE-SVC-ERIFXISQEP7F7OF4 tcp -- 0.0.0.0/0 10.96.0.10 /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:53
KUBE-SVC-JD5MR3NA4I4DYORP tcp -- 0.0.0.0/0 10.96.0.10 /* kube-system/kube-dns:metrics cluster IP */ tcp dpt:9153
KUBE-SVC-KQVGIOWQAVNMB2ZL tcp -- 0.0.0.0/0 10.96.0.220 /* calico-system/calico-kube-controllers-metrics:metrics-port cluster IP */ tcp dpt:9094
KUBE-SVC-RK657RLKDNVNU64O tcp -- 0.0.0.0/0 10.96.0.246 /* calico-system/calico-typha:calico-typha cluster IP */ tcp dpt:5473
KUBE-NODEPORTS all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL
[root@master-9 net]# iptables -L KUBE-SVC-P4Q3KNUAWJVP4ILH -t nat -n
Chain KUBE-SVC-P4Q3KNUAWJVP4ILH (1 references)
target prot opt source destination
KUBE-MARK-MASQ tcp -- !10.244.0.0/24 10.96.0.131 /* default/nginx:http cluster IP */ tcp dpt:80
KUBE-SEP-5IN3N7CMZK6ATMGU all -- 0.0.0.0/0 0.0.0.0/0 /* default/nginx:http */ statistic mode random probability 0.10000000009
KUBE-SEP-HLNRLNS5YZR3HUCE all -- 0.0.0.0/0 0.0.0.0/0 /* default/nginx:http */ statistic mode random probability 0.11111111101
KUBE-SEP-ATAKOMWYNQ36NI3T all -- 0.0.0.0/0 0.0.0.0/0 /* default/nginx:http */ statistic mode random probability 0.12500000000
KUBE-SEP-BHAOEVLY2MXTCNVF all -- 0.0.0.0/0 0.0.0.0/0 /* default/nginx:http */ statistic mode random probability 0.14285714272
KUBE-SEP-PJXLHWLF6ASQ35HU all -- 0.0.0.0/0 0.0.0.0/0 /* default/nginx:http */ statistic mode random probability 0.16666666651
KUBE-SEP-G7DLGXRAERZMKSWC all -- 0.0.0.0/0 0.0.0.0/0 /* default/nginx:http */ statistic mode random probability 0.20000000019
KUBE-SEP-MUV3XIL573AOQ3RO all -- 0.0.0.0/0 0.0.0.0/0 /* default/nginx:http */ statistic mode random probability 0.25000000000
KUBE-SEP-24LCKPV3WIWIN6LO all -- 0.0.0.0/0 0.0.0.0/0 /* default/nginx:http */ statistic mode random probability 0.33333333349
KUBE-SEP-CXJ2YZHIBRQ4BYKV all -- 0.0.0.0/0 0.0.0.0/0 /* default/nginx:http */ statistic mode random probability 0.50000000000
KUBE-SEP-AG44B2ZFINL2G42M all -- 0.0.0.0/0 0.0.0.0/0 /* default/nginx:http */
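The probabilities in this chain are not arbitrary: with ten endpoints behind default/nginx:http, the first KUBE-SEP rule takes 1/10 of the traffic, the second takes 1/9 of what is left, and so on, so every endpoint ends up with an equal 1/10 share; the last rule carries no probability because it simply catches whatever remains. A throwaway check (my own userspace program, not kube-proxy code):
#include <stdio.h>

/* Rule i (0-based) of an N-endpoint service chain matches the remaining
 * traffic with probability 1/(N-i); print them for N=10 and compare with
 * the statistic-mode probabilities listed above. */
int main(void)
{
        const int n = 10;

        for (int i = 0; i < n - 1; i++)
                printf("rule %d: probability %.11f\n", i, 1.0 / (n - i));
        printf("rule %d: no probability, matches the rest\n", n - 1);
        return 0;
}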
kube-proxy in ipvs mode
[root@10 vs]# iptables -L KUBE-SERVICES -t nat -n
Chain KUBE-SERVICES (2 references)
target prot opt source destination
KUBE-MARK-MASQ all -- !10.244.0.0/16 0.0.0.0/0 /* Kubernetes service cluster ip + port for masquerade purpose */ match-set KUBE-CLUSTER-IP dst,dst
KUBE-NODE-PORT all -- 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
ACCEPT all -- 0.0.0.0/0 0.0.0.0/0 match-set KUBE-CLUSTER-IP dst,dst
[root@10 yaml]# ipvsadm -L
IP Virtual Server version 1.2.1 (size=4096)
TCP 10.10.101.91-slave:http rr
-> 10.244.114.29:http Masq 1 0 0
-> 10.244.186.20:http Masq 1 0 0
-> 10.244.186.21:http Masq 1 0 0
-> 10.244.186.22:http Masq 1 0 0
-> 10.244.186.23:http Masq 1 0 0
-> 10.244.186.24:http Masq 1 0 0
-> 10.244.188.15:http Masq 1 0 0
-> 10.244.188.17:http Masq 1 0 0
-> 10.244.188.18:http Masq 1 0 0
-> 10.244.188.19:http Masq 1 0 0
-> 10.244.188.20:http Masq 1 0 0
-> 10.244.218.17:http Masq 1 0 0
-> 10.244.218.18:http Masq 1 0 0
-> 10.244.218.19:http Masq 1 0 0
-> 10.244.218.20:http Masq 1 0 0
NodePort services in ipvs mode
Although kube-proxy opens a listening socket for the NodePort on the host, that socket only reserves the port; the actual forwarding is done by the kernel's ipvs rules, so access keeps working even if kube-proxy is stopped.
[root@10 yaml]# ss -lpn |grep 30000
tcp LISTEN 0 32768 *:30000 *:* users:(("kube-proxy",pid=1842,fd=10))
tcp LISTEN 0 32768 :::30000 :::* users:(("kube-proxy",pid=1842,fd=14))
TCP 10.10.101.91:31001 rr
-> 10.244.11.78:80 Masq 1 0 0
-> 10.244.11.81:80 Masq 1 0 0
-> 10.244.12.211:80 Masq 1 0 0
-> 10.244.13.17:80 Masq 1 0 0
-> 10.244.13.81:80 Masq 1 0 0
// The kube-ipvs0 dummy interface on the host; note that its state is even DOWN.
[root@10 yaml]# ip a s kube-ipvs0
5: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN
link/ether 6a:fa:a8:c2:62:8c brd ff:ff:ff:ff:ff:ff
inet 10.96.37.82/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.96.0.1/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.96.0.10/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.96.241.158/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.96.164.59/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet6 2001:db8:42:1::ab46/128 scope global
valid_lft forever preferred_lft forever
inet6 2001:db8:42:1::2021/128 scope global
valid_lft forever preferred_lft forever