This article walks through deploying MetalLB v0.12.1 as the LoadBalancer implementation on a vanilla Kubernetes cluster, covering both of MetalLB's deployment modes: Layer 2 and BGP. Since the principles and configuration behind BGP are fairly involved, only a simple BGP setup is covered here.
The Kubernetes cluster used in this article is v1.23.6, deployed on CentOS 7 with docker and flannel. I have written before about Kubernetes fundamentals and cluster setup; readers who need that background can refer to those earlier posts.
1. How It Works
1.1 Overview
Before we start, let's look at how MetalLB works.
MetalLB hooks into your Kubernetes cluster, and provides a network load-balancer implementation. In short, it allows you to create Kubernetes services of type LoadBalancer in clusters that don't run on a cloud provider, and thus cannot simply hook into paid products to provide load balancers. It has two features that work together to provide this service: address allocation, and external announcement.
MetalLB is a concrete LoadBalancer implementation for Kubernetes clusters, used mainly to expose in-cluster services to clients outside the cluster. It lets us create services of type LoadBalancer in a Kubernetes cluster without relying on a cloud provider's load balancer.
It consists of two workloads that together provide this service: address allocation and external announcement, which correspond to the controller and the speaker deployed in the cluster.
1.2 Address Allocation
Address allocation is easy to understand: we first hand MetalLB a range of IP addresses, and it then assigns an IP to each LoadBalancer service based on the service's configuration. From the official documentation we know that the LoadBalancer IP can either be specified manually or assigned automatically by MetalLB; we can also configure multiple IP ranges in MetalLB's configmap and decide per range whether automatic assignment is enabled, as in the sketch below.
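A minimal sketch of what that looks like in the v0.12 configmap format (the pool names and address ranges here are made-up examples, not the values used later in this article):

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: production
      protocol: layer2
      addresses:
      - 192.168.144.0/24
    - name: staging
      protocol: layer2
      auto-assign: false   # IPs from this pool are only used when a service asks for one explicitly
      addresses:
      - 192.168.145.10-192.168.145.50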
Address allocation is implemented by the controller, which runs as a deployment; it watches the state of services in the cluster and assigns IPs to them.
1.3 External Announcement
External announcement is about advertising the EXTERNAL-IP of a LoadBalancer service to the network so that clients can actually reach that IP. MetalLB implements this with three protocols: ARP, NDP, and BGP. ARP and NDP correspond to Layer 2 mode for IPv4 and IPv6 respectively, while the BGP routing protocol corresponds to BGP mode.
External announcement is handled by the speaker, which runs as a daemonset; it sends ARP/NDP packets on the network, or establishes sessions with BGP routers and sends BGP announcements.
1.4 A Note on Networking
Neither Layer 2 mode nor BGP mode uses the Linux network stack to hold the VIP, which means we cannot use tools such as the ip command to pinpoint the node that owns the VIP or the corresponding routes; instead, the IP simply shows up on the kube-ipvs0 interface of every node. Both modes are also only responsible for steering traffic for the VIP to some node; how the request then reaches a pod, and by what scheduling rule, is handled by kube-proxy.
The two modes have different pros, cons, and limitations; let's deploy both first and then analyze them.
2宛瞄、準(zhǔn)備工作
2.1 系統(tǒng)要求
在開始部署MetalLB之前,我們需要確定部署環(huán)境能夠滿足最低要求:
- 一個(gè)k8s集群交胚,要求版本不低于1.13.0玄括,且沒有負(fù)載均衡器相關(guān)插件
- k8s集群上的CNI組件和MetalLB兼容
- 預(yù)留一段IPv4地址給MetalLB作為L(zhǎng)oadBalance的VIP使用
- 如果使用的是MetalLB的BGP模式,還需要路由器支持BGP協(xié)議
- 如果使用的是MetalLB的Layer2模式叫编,因?yàn)槭褂昧?a target="_blank">memberlist算法來(lái)實(shí)現(xiàn)選主虱歪,因此需要確保各個(gè)k8s節(jié)點(diǎn)之間的7946端口可達(dá)(包括TCP和UDP協(xié)議),當(dāng)然也可以根據(jù)自己的需求配置為其他端口
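A minimal sketch of opening that port with firewalld on CentOS 7 (run on every node; adapt to whatever firewall you actually use):

# Open memberlist's default port 7946 for both TCP and UDP (firewalld example)
$ firewall-cmd --permanent --add-port=7946/tcp
$ firewall-cmd --permanent --add-port=7946/udp
$ firewall-cmd --reload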
2.2 CNI Compatibility
The MetalLB project publishes a compatibility matrix for the mainstream CNIs. Since MetalLB mostly relies on Kubernetes' built-in kube-proxy for traffic forwarding, compatibility with most CNIs is quite good.
CNI | Compatibility | Main issues
---|---|---
Calico | Mostly (see known issues) | Mainly BGP-mode compatibility, but the community provides workarounds
Canal | Yes | -
Cilium | Yes | -
Flannel | Yes | -
Kube-ovn | Yes | -
Kube-router | Mostly (see known issues) | The built-in external BGP peering mode is not supported
Weave Net | Mostly (see known issues) | externalTrafficPolicy: Local support depends on the version
As the matrix shows, most combinations work fine; where compatibility problems do occur, the root cause is almost always a conflict with BGP. In fact, BGP-related compatibility issues exist in practically every open-source Kubernetes load balancer.
2.3 Cloud Provider Compatibility
MetalLB's official list shows poor compatibility with most cloud providers. The reason is simple: most cloud environments cannot run the BGP protocol, and whether the more universal Layer 2 mode works depends on each provider's network environment, so compatibility cannot be guaranteed.
The short version is: cloud providers expose proprietary APIs instead of standard protocols to control their network layer, and MetalLB doesn’t work with those APIs.
當(dāng)然如果使用了云廠商的服務(wù),最好的方案是直接使用云廠商提供的LoadBalance
服務(wù)泊柬。
3椎镣、Layer2 mode
3.1 部署環(huán)境
本次MetalLB
的部署環(huán)境為基于docker
和flannel
部署的1.23.6
版本的k8s集群
IP | Hostname
---|---
10.31.8.1 | tiny-flannel-master-8-1.k8s.tcinternal
10.31.8.11 | tiny-flannel-worker-8-11.k8s.tcinternal
10.31.8.12 | tiny-flannel-worker-8-12.k8s.tcinternal
10.8.64.0/18 | podSubnet
10.8.0.0/18 | serviceSubnet
10.31.8.100-10.31.8.200 | MetalLB IPpool
3.2 Configuring ARP Parameters
Layer 2 mode requires enabling strictARP in the cluster's IPVS configuration. Once it is on, kube-proxy stops answering ARP requests arriving on interfaces other than kube-ipvs0, and MetalLB takes over that job.
Enabling strict ARP is effectively the same as setting arp_ignore to 1 and arp_announce to 2, i.e. strict ARP behavior; this is the same principle as the real-server configuration in LVS DR mode, explained in an earlier article.
strict ARP configure arp_ignore and arp_announce to avoid answering ARP queries from kube-ipvs0 interface
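For comparison only, the equivalent host-level settings on an LVS DR-mode real server would look roughly like the sysctl sketch below; kube-proxy applies its own logic, so there is no need to set these by hand for MetalLB:

# Kernel parameters typically used on LVS DR-mode real servers (illustration only)
$ sysctl -w net.ipv4.conf.all.arp_ignore=1     # only answer ARP for addresses configured on the incoming interface
$ sysctl -w net.ipv4.conf.all.arp_announce=2   # always pick the best local source address for ARP announcements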
# Check the current strictARP setting in kube-proxy
$ kubectl get configmap -n kube-system kube-proxy -o yaml | grep strictARP
      strictARP: false

# Option 1: edit the configmap by hand and set strictARP to true
$ kubectl edit configmap -n kube-system kube-proxy
configmap/kube-proxy edited

# Option 2: change it with sed and preview the diff first
$ kubectl get configmap kube-proxy -n kube-system -o yaml | sed -e "s/strictARP: false/strictARP: true/" | kubectl diff -f - -n kube-system

# Once the diff looks right, apply the change
$ kubectl get configmap kube-proxy -n kube-system -o yaml | sed -e "s/strictARP: false/strictARP: true/" | kubectl apply -f - -n kube-system

# Restart kube-proxy so the change takes effect
$ kubectl rollout restart ds kube-proxy -n kube-system

# Confirm the new setting
$ kubectl get configmap -n kube-system kube-proxy -o yaml | grep strictARP
      strictARP: true
3.3 Deploying MetalLB
Deploying MetalLB is also very simple. The project offers three installation methods: plain manifest files (YAML), Helm 3, and Kustomize. We stick with the manifest files here.
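For reference only, a Helm-based install is sketched below; the chart repository and release name are my assumption from the upstream docs and are not used anywhere else in this article:

# Hypothetical Helm 3 install; this walkthrough uses the plain manifests instead
$ helm repo add metallb https://metallb.github.io/metallb
$ helm repo update
$ helm install metallb metallb/metallb --namespace metallb-system --create-namespace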
To keep the steps short, most official tutorials just tell you to kubectl apply a YAML straight from a URL. That is quick and convenient, but it leaves you without a local copy and makes later modifications awkward, so I prefer to download the YAML files, keep them locally, and deploy from there.
# Download the two v0.12.1 manifests
$ wget https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/namespace.yaml
$ wget https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/metallb.yaml

# If you want frr to handle BGP routing, download these two instead
$ wget https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/namespace.yaml
$ wget https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/metallb-frr.yaml
After downloading the official YAML files, we prepare the configmap ahead of time. There is a reference file on GitHub; Layer 2 mode needs very little configuration, so only the most basic parameters are defined here:
- protocol is set to layer2
- addresses can be given either as a CIDR block (198.51.100.0/24) or as a first-last IP range (192.168.0.150-192.168.0.200); here we pick a range in the same subnet as the k8s nodes
$ cat > configmap-metallb.yaml <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 10.31.8.100-10.31.8.200
EOF
The deployment itself then breaks down into three steps:
- create the namespace
- deploy the deployment and the daemonset
- apply the configmap
# Create the namespace
$ kubectl apply -f namespace.yaml
namespace/metallb-system created
$ kubectl get ns
NAME STATUS AGE
default Active 8d
kube-node-lease Active 8d
kube-public Active 8d
kube-system Active 8d
metallb-system Active 8s
nginx-quic Active 8d
# Deploy the deployment and daemonset, plus the other resources they need
$ kubectl apply -f metallb.yaml
Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
podsecuritypolicy.policy/controller created
podsecuritypolicy.policy/speaker created
serviceaccount/controller created
serviceaccount/speaker created
clusterrole.rbac.authorization.k8s.io/metallb-system:controller created
clusterrole.rbac.authorization.k8s.io/metallb-system:speaker created
role.rbac.authorization.k8s.io/config-watcher created
role.rbac.authorization.k8s.io/pod-lister created
role.rbac.authorization.k8s.io/controller created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:controller created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:speaker created
rolebinding.rbac.authorization.k8s.io/config-watcher created
rolebinding.rbac.authorization.k8s.io/pod-lister created
rolebinding.rbac.authorization.k8s.io/controller created
daemonset.apps/speaker created
deployment.apps/controller created
# This mainly creates the controller deployment, which watches service state
$ kubectl get deploy -n metallb-system
NAME READY UP-TO-DATE AVAILABLE AGE
controller 1/1 1 1 86s
# The speaker is a daemonset running on every node, negotiating VIPs and sending/receiving ARP and NDP packets
$ kubectl get ds -n metallb-system
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
speaker 3 3 3 3 3 kubernetes.io/os=linux 64s
$ kubectl get pod -n metallb-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
controller-57fd9c5bb-svtjw 1/1 Running 0 117s 10.8.65.4 tiny-flannel-worker-8-11.k8s.tcinternal <none> <none>
speaker-bf79q 1/1 Running 0 117s 10.31.8.11 tiny-flannel-worker-8-11.k8s.tcinternal <none> <none>
speaker-fl5l8 1/1 Running 0 117s 10.31.8.12 tiny-flannel-worker-8-12.k8s.tcinternal <none> <none>
speaker-nw2fm 1/1 Running 0 117s 10.31.8.1 tiny-flannel-master-8-1.k8s.tcinternal <none> <none>
$ kubectl apply -f configmap-metallb.yaml
configmap/config created
3.4 Deploying a Test Service
We define our own service for testing. The test image is nginx-based and by default returns the requesting client's IP and port.
$ cat > nginx-quic-lb.yaml <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: nginx-quic
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-lb
  namespace: nginx-quic
spec:
  selector:
    matchLabels:
      app: nginx-lb
  replicas: 4
  template:
    metadata:
      labels:
        app: nginx-lb
    spec:
      containers:
      - name: nginx-lb
        image: tinychen777/nginx-quic:latest
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-lb-service
  namespace: nginx-quic
spec:
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  selector:
    app: nginx-lb
  ports:
  - protocol: TCP
    port: 80 # match for service access port
    targetPort: 80 # match for pod access port
  type: LoadBalancer
  loadBalancerIP: 10.31.8.100
EOF
Note that in the configuration above we set the service's type field to LoadBalancer and explicitly set loadBalancerIP to 10.31.8.100.
Note: not every LoadBalancer implementation allows a loadBalancerIP to be specified. If the implementation supports the field, the load balancer is created with the user-requested loadBalancerIP. If the field is not set, an ephemeral IP is assigned to the load balancer. If loadBalancerIP is set but the implementation does not support the feature, the value is ignored.
# Create the test service and check the result
$ kubectl apply -f nginx-quic-lb.yaml
namespace/nginx-quic created
deployment.apps/nginx-lb created
service/nginx-lb-service created
Check the service: TYPE has changed to LoadBalancer, and EXTERNAL-IP shows the 10.31.8.100 we specified.
# Check the service; TYPE is now LoadBalancer
$ kubectl get svc -n nginx-quic
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx-lb-service LoadBalancer 10.8.32.221 10.31.8.100 80:30181/TCP 25h
Looking at the full nginx-lb-service object on the cluster, we can see the ClusterIP, the LoadBalancer VIP, the nodePort, and the traffic-policy settings:
$ kubectl get svc -n nginx-quic nginx-lb-service -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"name":"nginx-lb-service","namespace":"nginx-quic"},"spec":{"externalTrafficPolicy":"Cluster","internalTrafficPolicy":"Cluster","loadBalancerIP":"10.31.8.100","ports":[{"port":80,"protocol":"TCP","targetPort":80}],"selector":{"app":"nginx-lb"},"type":"LoadBalancer"}}
  creationTimestamp: "2022-05-16T06:01:23Z"
  name: nginx-lb-service
  namespace: nginx-quic
  resourceVersion: "1165135"
  uid: f547842e-4547-4d01-abbc-89ac8b059a2a
spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 10.8.32.221
  clusterIPs:
  - 10.8.32.221
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  loadBalancerIP: 10.31.8.100
  ports:
  - nodePort: 30181
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx-lb
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - ip: 10.31.8.100
Check the IPVS rules: there are forwarding rules for the ClusterIP, the LoadBalancer VIP, and the nodePort. By default, creating a LoadBalancer also creates a nodePort service:
$ ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 172.17.0.1:30181 rr
-> 10.8.65.15:80 Masq 1 0 0
-> 10.8.65.16:80 Masq 1 0 0
-> 10.8.66.12:80 Masq 1 0 0
-> 10.8.66.13:80 Masq 1 0 0
TCP 10.8.32.221:80 rr
-> 10.8.65.15:80 Masq 1 0 0
-> 10.8.65.16:80 Masq 1 0 0
-> 10.8.66.12:80 Masq 1 0 0
-> 10.8.66.13:80 Masq 1 0 0
TCP 10.8.64.0:30181 rr
-> 10.8.65.15:80 Masq 1 0 0
-> 10.8.65.16:80 Masq 1 0 0
-> 10.8.66.12:80 Masq 1 0 0
-> 10.8.66.13:80 Masq 1 0 0
TCP 10.8.64.1:30181 rr
-> 10.8.65.15:80 Masq 1 0 0
-> 10.8.65.16:80 Masq 1 0 0
-> 10.8.66.12:80 Masq 1 0 0
-> 10.8.66.13:80 Masq 1 0 0
TCP 10.31.8.1:30181 rr
-> 10.8.65.15:80 Masq 1 0 0
-> 10.8.65.16:80 Masq 1 0 0
-> 10.8.66.12:80 Masq 1 0 0
-> 10.8.66.13:80 Masq 1 0 0
TCP 10.31.8.100:80 rr
-> 10.8.65.15:80 Masq 1 0 0
-> 10.8.65.16:80 Masq 1 0 0
-> 10.8.66.12:80 Masq 1 0 0
-> 10.8.66.13:80 Masq 1 0 0
Use curl to confirm the service responds:
$ curl 10.31.8.100:80
10.8.64.0:60854
$ curl 10.8.1.166:80
10.8.64.0:2562
$ curl 10.31.8.1:30974
10.8.64.0:1635
$ curl 10.31.8.100:80
10.8.64.0:60656
3.5 About the VIP
On every k8s node, the LoadBalancer VIP can be seen on the kube-ipvs0 interface:
$ ip addr show kube-ipvs0
5: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
link/ether 4e:ba:e8:25:cf:17 brd ff:ff:ff:ff:ff:ff
inet 10.8.0.1/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.8.0.10/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.8.32.221/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.31.8.100/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
Pinpointing which node currently holds the VIP is more of a hassle. One approach is to take a machine on the same Layer 2 network as the cluster, look up the VIP in its ARP table, and then map the MAC address back to a node IP; that tells us which node the VIP lives on.
$ arp -a | grep 10.31.8.100
? (10.31.8.100) at 52:54:00:5c:9c:97 [ether] on eth0
$ arp -a | grep 52:54:00:5c:9c:97
tiny-flannel-worker-8-12.k8s.tcinternal (10.31.8.12) at 52:54:00:5c:9c:97 [ether] on eth0
? (10.31.8.100) at 52:54:00:5c:9c:97 [ether] on eth0
$ ip a | grep 52:54:00:5c:9c:97
link/ether 52:54:00:5c:9c:97 brd ff:ff:ff:ff:ff:ff
Alternatively, we can check the speaker pod logs for the records of this service IP being announced:
$ kubectl logs -f -n metallb-system speaker-fl5l8
{"caller":"level.go:63","event":"serviceAnnounced","ips":["10.31.8.100"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"layer2","service":"nginx-quic/nginx-lb-service","ts":"2022-05-16T06:11:34.099204376Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["10.31.8.100"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"layer2","service":"nginx-quic/nginx-lb-service","ts":"2022-05-16T06:12:09.527334808Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["10.31.8.100"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"layer2","service":"nginx-quic/nginx-lb-service","ts":"2022-05-16T06:12:09.547734268Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["10.31.8.100"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"layer2","service":"nginx-quic/nginx-lb-service","ts":"2022-05-16T06:12:34.267651651Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["10.31.8.100"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"layer2","service":"nginx-quic/nginx-lb-service","ts":"2022-05-16T06:12:34.286130424Z"}
3.6 About the nodePort
Careful readers will have noticed that when we create a LoadBalancer service, Kubernetes by default also creates a nodePort for us. This behavior is controlled by the allocateLoadBalancerNodePorts field of the Service, which defaults to true.
Different load-balancer implementations work differently: some rely on the nodePort to forward traffic, others forward requests straight to the pods. MetalLB forwards request traffic directly to the pods via kube-proxy, so if we want to disable the nodePort we can set spec.allocateLoadBalancerNodePorts to false in the service, and no nodePort will be allocated when the svc is created.
Be aware, though, that if you flip an existing service from true to false, Kubernetes does not clean up the IPVS rules for the old nodePort; they have to be removed by hand, roughly as sketched below.
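A minimal sketch of that manual cleanup (the addresses and the nodePort 30181 are just the values from the earlier example; always check your own ipvsadm output first):

# Find the stale nodePort entries left behind on a node
$ ipvsadm -ln | grep 30181

# Delete the stale virtual services, one per listed address
$ ipvsadm -D -t 10.31.8.1:30181
$ ipvsadm -D -t 172.17.0.1:30181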
Let's redefine and recreate the svc:

apiVersion: v1
kind: Service
metadata:
  name: nginx-lb-service
  namespace: nginx-quic
spec:
  allocateLoadBalancerNodePorts: false
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  selector:
    app: nginx-lb
  ports:
  - protocol: TCP
    port: 80 # match for service access port
    targetPort: 80 # match for pod access port
  type: LoadBalancer
  loadBalancerIP: 10.31.8.100
Checking the svc and the IPVS rules again, the nodePort-related configuration is gone:
$ ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 10.8.62.180:80 rr
-> 10.8.65.18:80 Masq 1 0 0
-> 10.8.65.19:80 Masq 1 0 0
-> 10.8.66.14:80 Masq 1 0 0
-> 10.8.66.15:80 Masq 1 0 0
TCP 10.31.8.100:80 rr
-> 10.8.65.18:80 Masq 1 0 0
-> 10.8.65.19:80 Masq 1 0 0
-> 10.8.66.14:80 Masq 1 0 0
-> 10.8.66.15:80 Masq 1 0 0
$ kubectl get svc -n nginx-quic
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx-lb-service LoadBalancer 10.8.62.180 10.31.8.100 80/TCP 23s
If you change spec.allocateLoadBalancerNodePorts on an existing service from true to false, the existing nodePort is not removed automatically, so it is best to plan this parameter when the service is first created:
$ kubectl get svc -n nginx-quic nginx-lb-service -o yaml | egrep " allocateLoadBalancerNodePorts: "
allocateLoadBalancerNodePorts: false
$ kubectl get svc -n nginx-quic
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx-lb-service LoadBalancer 10.8.62.180 10.31.8.100 80:31405/TCP 85m
4侧漓、BGP mode
4.1 網(wǎng)絡(luò)拓?fù)?/h2>
測(cè)試環(huán)境的網(wǎng)絡(luò)拓?fù)浞浅5暮?jiǎn)單锅尘,MetalLB的網(wǎng)段為了和前面Layer2模式進(jìn)行區(qū)分,更換為10.9.0.0/16
,具體信息如下
IP | Hostname
---|---
10.31.8.1 | tiny-flannel-master-8-1.k8s.tcinternal
10.31.8.11 | tiny-flannel-worker-8-11.k8s.tcinternal
10.31.8.12 | tiny-flannel-worker-8-12.k8s.tcinternal
10.31.254.251 | OpenWrt
10.9.0.0/16 | MetalLB BGP IPpool
The three k8s nodes are directly connected to an OpenWrt router. OpenWrt acts as the nodes' gateway and also runs BGP, routing requests for MetalLB's VIPs to the individual k8s nodes.
Before configuring anything, we need to give the router and the k8s side each a private AS number; the AS number allocation described on Wikipedia is a useful reference. Here the router uses AS 64512 and MetalLB uses AS 64513.
4.2 Installing the Routing Software
Taking a typical home OpenWrt router as the example, we first install the quagga packages; if your OpenWrt build ships the frr modules, frr is the recommended choice.
If you are on another Linux distribution (such as CentOS or Debian), use frr directly.
Install quagga on OpenWrt with opkg:
$ opkg update
$ opkg install quagga quagga-zebra quagga-bgpd quagga-vtysh
If your OpenWrt version is recent enough, the frr packages can be installed with opkg as well:
$ opkg update
$ opkg install frr frr-babeld frr-bfdd frr-bgpd frr-eigrpd frr-fabricd frr-isisd frr-ldpd frr-libfrr frr-nhrpd frr-ospf6d frr-ospfd frr-pbrd frr-pimd frr-ripd frr-ripngd frr-staticd frr-vrrpd frr-vtysh frr-watchfrr frr-zebra
If you go with frr, remember to enable the bgpd daemon in its configuration and then restart frr:
$ sed -i 's/bgpd=no/bgpd=yes/g' /etc/frr/daemons
$ /etc/init.d/frr restart
4.3 Configuring BGP on the Router
The configuration below uses frr as the example; with quagga you would likewise either use vtysh or edit the config file directly, and the two differ very little.
Check that the daemons are listening on ports 2601 and 2605:
root@OpenWrt:~# netstat -ntlup | egrep "zebra|bgpd"
tcp 0 0 0.0.0.0:2601 0.0.0.0:* LISTEN 3018/zebra
tcp 0 0 0.0.0.0:2605 0.0.0.0:* LISTEN 3037/bgpd
Port 179, used by BGP, is not being listened on yet because nothing has been configured. We can configure either interactively with vtysh, or by editing the config file and restarting the service.
Typing vtysh drops you straight into the vtysh configuration shell (similar to virsh for KVM virtualization); note how the prompt changes:
root@OpenWrt:~# vtysh
Hello, this is Quagga (version 1.2.4).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
OpenWrt#
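For completeness, the same BGP setup could be entered interactively; a rough sketch (standard vtysh syntax, mirroring the neighbor list in the config files below) would be:

OpenWrt# configure terminal
OpenWrt(config)# router bgp 64512
OpenWrt(config-router)# bgp router-id 10.31.254.251
OpenWrt(config-router)# neighbor 10.31.8.1 remote-as 64513
OpenWrt(config-router)# neighbor 10.31.8.11 remote-as 64513
OpenWrt(config-router)# neighbor 10.31.8.12 remote-as 64513
OpenWrt(config-router)# end
OpenWrt# write memory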
Configuring at the command line is tedious, though, so we can instead edit the config file and restart the service.
For quagga the BGP config file defaults to /etc/quagga/bgpd.conf; the path may differ across distributions and installation methods.
$ cat /etc/quagga/bgpd.conf
!
! Zebra configuration saved from vty
! 2022/05/19 11:01:35
!
password zebra
!
router bgp 64512
bgp router-id 10.31.254.251
neighbor 10.31.8.1 remote-as 64513
neighbor 10.31.8.1 description 10-31-8-1
neighbor 10.31.8.11 remote-as 64513
neighbor 10.31.8.11 description 10-31-8-11
neighbor 10.31.8.12 remote-as 64513
neighbor 10.31.8.12 description 10-31-8-12
maximum-paths 3
!
address-family ipv6
exit-address-family
exit
!
access-list vty permit 127.0.0.0/8
access-list vty deny any
!
line vty
access-class vty
!
If you are using frr, the configuration differs slightly; the file to edit is /etc/frr/frr.conf, and again the path may vary by distribution and installation method.
$ cat /etc/frr/frr.conf
frr version 8.2.2
frr defaults traditional
hostname tiny-openwrt-plus
!
password zebra
!
router bgp 64512
bgp router-id 10.31.254.251
no bgp ebgp-requires-policy
neighbor 10.31.8.1 remote-as 64513
neighbor 10.31.8.1 description 10-31-8-1
neighbor 10.31.8.11 remote-as 64513
neighbor 10.31.8.11 description 10-31-8-11
neighbor 10.31.8.12 remote-as 64513
neighbor 10.31.8.12 description 10-31-8-12
!
address-family ipv4 unicast
exit-address-family
exit
!
access-list vty seq 5 permit 127.0.0.0/8
access-list vty seq 10 deny any
!
line vty
access-class vty
exit
!
After the configuration is in place, restart the service:
# Restart frr
$ /etc/init.d/frr restart

# Restart quagga
$ /etc/init.d/quagga restart
After the restart, enter vtysh and check the BGP status:
tiny-openwrt-plus# show ip bgp summary
IPv4 Unicast Summary (VRF default):
BGP router identifier 10.31.254.251, local AS number 64512 vrf-id 0
BGP table version 0
RIB entries 0, using 0 bytes of memory
Peers 3, using 2149 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc
10.31.8.1 4 64513 0 0 0 0 0 never Active 0 10-31-8-1
10.31.8.11 4 64513 0 0 0 0 0 never Active 0 10-31-8-11
10.31.8.12 4 64513 0 0 0 0 0 never Active 0 10-31-8-12
Total number of neighbors 3
這時(shí)候再查看路由器的監(jiān)聽端口寸痢,可以看到BGP已經(jīng)跑起來(lái)了
$ netstat -ntlup | egrep "zebra|bgpd"
tcp 0 0 127.0.0.1:2605 0.0.0.0:* LISTEN 31625/bgpd
tcp 0 0 127.0.0.1:2601 0.0.0.0:* LISTEN 31618/zebra
tcp 0 0 0.0.0.0:179 0.0.0.0:* LISTEN 31625/bgpd
tcp 0 0 :::179 :::* LISTEN 31625/bgpd
4.4 Configuring MetalLB for BGP
First we modify the configmap:
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    peers:
    - peer-address: 10.31.254.251
      peer-port: 179
      peer-asn: 64512
      my-asn: 64513
    address-pools:
    - name: default
      protocol: bgp
      addresses:
      - 10.9.0.0/16
After editing, we re-apply the configmap and check MetalLB's state:
$ kubectl apply -f configmap-metal.yaml
configmap/config configured
$ kubectl get cm -n metallb-system config -o yaml
apiVersion: v1
data:
  config: |
    peers:
    - peer-address: 10.31.254.251
      peer-port: 179
      peer-asn: 64512
      my-asn: 64513
    address-pools:
    - name: default
      protocol: bgp
      addresses:
      - 10.9.0.0/16
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"config":"peers:\n- peer-address: 10.31.254.251\n peer-port: 179\n peer-asn: 64512\n my-asn: 64513\naddress-pools:\n- name: default\n protocol: bgp\n addresses:\n - 10.9.0.0/16\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"config","namespace":"metallb-system"}}
  creationTimestamp: "2022-05-16T04:37:54Z"
  name: config
  namespace: metallb-system
  resourceVersion: "1412854"
  uid: 6d94ca36-93fe-4ea2-9407-96882ad8e35c
The router now shows BGP sessions established with all three k8s nodes:
tiny-openwrt-plus# show ip bgp summary
IPv4 Unicast Summary (VRF default):
BGP router identifier 10.31.254.251, local AS number 64512 vrf-id 0
BGP table version 3
RIB entries 5, using 920 bytes of memory
Peers 3, using 2149 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc
10.31.8.1 4 64513 6 4 0 0 0 00:00:45 3 3 10-31-8-1
10.31.8.11 4 64513 6 4 0 0 0 00:00:45 3 3 10-31-8-11
10.31.8.12 4 64513 6 4 0 0 0 00:00:45 3 3 10-31-8-12
Total number of neighbors 3
If the BGP session from one of the nodes fails to come up, restarting the speaker on that node will retry the connection:
$ kubectl delete po speaker-fl5l8 -n metallb-system
4.5 Configuring the Service
Once the configmap change takes effect, the EXTERNAL-IP of the existing service is not reassigned:
$ kubectl get svc -n nginx-quic
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx-lb-service LoadBalancer 10.8.4.92 10.31.8.100 80/TCP 18h
We can restart the controller so that it re-allocates an EXTERNAL-IP for our service:
$ kubectl delete po -n metallb-system controller-57fd9c5bb-svtjw
pod "controller-57fd9c5bb-svtjw" deleted
After the restart, check the svc again. If the LoadBalancer VIP had been assigned automatically (i.e. no loadBalancerIP field), the service would simply get a new IP and keep running; but this service's loadBalancerIP was manually set to 10.31.8.100 earlier, which is no longer in the pool, so the EXTERNAL-IP now shows <pending>.
$ kubectl get svc -n nginx-quic
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx-lb-service LoadBalancer 10.8.4.92 <pending> 80/TCP 18h
Change the loadBalancerIP to 10.9.1.1, and the service comes back up (one way to apply the change is sketched after the output below):
$ kubectl get svc -n nginx-quic
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx-lb-service LoadBalancer 10.8.4.92 10.9.1.1 80/TCP 18h
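For reference, a minimal sketch of applying that change by patching the live service rather than editing and re-applying the full manifest:

$ kubectl patch svc nginx-lb-service -n nginx-quic -p '{"spec":{"loadBalancerIP":"10.9.1.1"}}'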
The controller logs show what happened:
$ kubectl logs controller-57fd9c5bb-d6jsl -n metallb-system
{"branch":"HEAD","caller":"level.go:63","commit":"v0.12.1","goversion":"gc / go1.16.14 / amd64","level":"info","msg":"MetalLB controller starting version 0.12.1 (commit v0.12.1, branch HEAD)","ts":"2022-05-18T03:45:45.440872105Z","version":"0.12.1"}
{"caller":"level.go:63","configmap":"metallb-system/config","event":"configLoaded","level":"info","msg":"config (re)loaded","ts":"2022-05-18T03:45:45.610395481Z"}
{"caller":"level.go:63","error":"[\"10.31.8.100\"] is not allowed in config","event":"clearAssignment","level":"info","msg":"current IP not allowed by config, clearing","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.611009691Z"}
{"caller":"level.go:63","event":"clearAssignment","level":"info","msg":"user requested a different IP than the one currently assigned","reason":"differentIPRequested","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.611062419Z"}
{"caller":"level.go:63","error":"controller not synced","level":"error","msg":"controller not synced yet, cannot allocate IP; will retry after sync","op":"allocateIPs","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.611080525Z"}
{"caller":"level.go:63","event":"stateSynced","level":"info","msg":"controller synced, can allocate IPs now","ts":"2022-05-18T03:45:45.611117023Z"}
{"caller":"level.go:63","error":"[\"10.31.8.100\"] is not allowed in config","event":"clearAssignment","level":"info","msg":"current IP not allowed by config, clearing","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.617013146Z"}
{"caller":"level.go:63","event":"clearAssignment","level":"info","msg":"user requested a different IP than the one currently assigned","reason":"differentIPRequested","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.617089367Z"}
{"caller":"level.go:63","error":"[\"10.31.8.100\"] is not allowed in config","level":"error","msg":"IP allocation failed","op":"allocateIPs","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.617122976Z"}
{"caller":"level.go:63","event":"serviceUpdated","level":"info","msg":"updated service object","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.626039403Z"}
{"caller":"level.go:63","error":"[\"10.31.8.100\"] is not allowed in config","level":"error","msg":"IP allocation failed","op":"allocateIPs","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.626361986Z"}
{"caller":"level.go:63","event":"ipAllocated","ip":["10.9.1.1"],"level":"info","msg":"IP address assigned by controller","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:47:19.943434144Z"}
The speaker logs show the BGP session with the router being established, the errors about the now-invalid loadBalancerIP 10.31.8.100, and the BGP route being advertised for loadBalancerIP 10.9.1.1:
$ kubectl logs -n metallb-system speaker-bf79q
{"caller":"level.go:63","configmap":"metallb-system/config","event":"peerAdded","level":"info","msg":"peer configured, starting BGP session","peer":"10.31.254.251","ts":"2022-05-18T03:41:55.046091105Z"}
{"caller":"level.go:63","configmap":"metallb-system/config","event":"configLoaded","level":"info","msg":"config (re)loaded","ts":"2022-05-18T03:41:55.046268735Z"}
{"caller":"level.go:63","error":"assigned IP not allowed by config","ips":["10.31.8.100"],"level":"error","msg":"IP allocated by controller not allowed by config","op":"setBalancer","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:41:55.051955069Z"}
struct { Version uint8; ASN16 uint16; HoldTime uint16; RouterID uint32; OptsLen uint8 }{Version:0x4, ASN16:0xfc00, HoldTime:0xb4, RouterID:0xa1ffefd, OptsLen:0x1e}
{"caller":"level.go:63","event":"sessionUp","level":"info","localASN":64513,"msg":"BGP session established","peer":"10.31.254.251:179","peerASN":64512,"ts":"2022-05-18T03:41:55.052734174Z"}
{"caller":"level.go:63","level":"info","msg":"triggering discovery","op":"memberDiscovery","ts":"2022-05-18T03:42:40.183574415Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"10.31.8.12","node event":"NodeLeave","node name":"tiny-flannel-worker-8-12.k8s.tcinternal","ts":"2022-05-18T03:44:03.649494062Z"}
{"caller":"level.go:63","error":"assigned IP not allowed by config","ips":["10.31.8.100"],"level":"error","msg":"IP allocated by controller not allowed by config","op":"setBalancer","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:44:03.655003303Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"10.31.8.12","node event":"NodeJoin","node name":"tiny-flannel-worker-8-12.k8s.tcinternal","ts":"2022-05-18T03:44:06.247929645Z"}
{"caller":"level.go:63","error":"assigned IP not allowed by config","ips":["10.31.8.100"],"level":"error","msg":"IP allocated by controller not allowed by config","op":"setBalancer","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:44:06.25369106Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["10.9.1.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:47:19.953729779Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["10.9.1.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:47:19.953912236Z"}
Test from any machine outside the cluster:
$ curl -v 10.9.1.1
* About to connect() to 10.9.1.1 port 80 (#0)
* Trying 10.9.1.1...
* Connected to 10.9.1.1 (10.9.1.1) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.9.1.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: nginx
< Date: Wed, 18 May 2022 04:17:41 GMT
< Content-Type: text/plain
< Content-Length: 16
< Connection: keep-alive
<
10.8.64.0:43939
* Connection #0 to host 10.9.1.1 left intact
4.6 Checking ECMP
Looking at the routes on the router, there is now a /32 route for 10.9.1.1 whose next hop lists multiple IPs, which means ECMP has been enabled successfully.
tiny-openwrt-plus# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
f - OpenFabric,
> - selected route, * - FIB route, q - queued, r - rejected, b - backup
t - trapped, o - offload failure
K>* 0.0.0.0/0 [0/0] via 10.31.254.254, eth0, 00:04:52
B>* 10.9.1.1/32 [20/0] via 10.31.8.1, eth0, weight 1, 00:01:40
* via 10.31.8.11, eth0, weight 1, 00:01:40
* via 10.31.8.12, eth0, weight 1, 00:01:40
C>* 10.31.0.0/16 is directly connected, eth0, 00:04:52
我們?cè)賱?chuàng)建多幾個(gè)服務(wù)進(jìn)行測(cè)試
# kubectl get svc -n nginx-quic
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx-lb-service LoadBalancer 10.8.4.92 10.9.1.1 80/TCP 23h
nginx-lb2-service LoadBalancer 10.8.10.48 10.9.1.2 80/TCP 64m
nginx-lb3-service LoadBalancer 10.8.6.116 10.9.1.3 80/TCP 64m
Then check the router's state again:
tiny-openwrt-plus# show ip bgp
BGP table version is 3, local router ID is 10.31.254.251, vrf id 0
Default local pref 100, local AS 64512
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found
Network Next Hop Metric LocPrf Weight Path
*= 10.9.1.1/32 10.31.8.12 0 64513 ?
*> 10.31.8.1 0 64513 ?
*= 10.31.8.11 0 64513 ?
*= 10.9.1.2/32 10.31.8.12 0 64513 ?
*> 10.31.8.1 0 64513 ?
*= 10.31.8.11 0 64513 ?
*= 10.9.1.3/32 10.31.8.12 0 64513 ?
*> 10.31.8.1 0 64513 ?
*= 10.31.8.11 0 64513 ?
Displayed 3 routes and 9 total paths
tiny-openwrt-plus# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
f - OpenFabric,
> - selected route, * - FIB route, q - queued, r - rejected, b - backup
t - trapped, o - offload failure
K>* 0.0.0.0/0 [0/0] via 10.31.254.254, eth0, 00:06:12
B>* 10.9.1.1/32 [20/0] via 10.31.8.1, eth0, weight 1, 00:03:00
* via 10.31.8.11, eth0, weight 1, 00:03:00
* via 10.31.8.12, eth0, weight 1, 00:03:00
B>* 10.9.1.2/32 [20/0] via 10.31.8.1, eth0, weight 1, 00:03:00
* via 10.31.8.11, eth0, weight 1, 00:03:00
* via 10.31.8.12, eth0, weight 1, 00:03:00
B>* 10.9.1.3/32 [20/0] via 10.31.8.1, eth0, weight 1, 00:03:00
* via 10.31.8.11, eth0, weight 1, 00:03:00
* via 10.31.8.12, eth0, weight 1, 00:03:00
C>* 10.31.0.0/16 is directly connected, eth0, 00:06:12
Only when the routing table shows multiple next-hop IPs for our LoadBalancer IPs is ECMP actually in effect; otherwise, go back and check that the BGP configuration is correct.
5陈醒、總結(jié)
5.1 Layer2 mode優(yōu)缺點(diǎn)
優(yōu)點(diǎn):
- 通用性強(qiáng)惕橙,對(duì)比BGP模式不需要BGP路由器支持,幾乎可以適用于任何網(wǎng)絡(luò)環(huán)境钉跷;當(dāng)然云廠商的網(wǎng)絡(luò)環(huán)境例外
缺點(diǎn):
- 所有的流量都會(huì)在同一個(gè)節(jié)點(diǎn)上弥鹦,該節(jié)點(diǎn)的容易成為流量的瓶頸
- 當(dāng)VIP所在節(jié)點(diǎn)宕機(jī)之后,需要較長(zhǎng)時(shí)間進(jìn)行故障轉(zhuǎn)移(一般在10s)爷辙,這主要是因?yàn)镸etalLB使用了memberlist來(lái)進(jìn)行選主彬坏,當(dāng)VIP所在節(jié)點(diǎn)宕機(jī)之后重新選主的時(shí)間要比傳統(tǒng)的keepalived使用的vrrp協(xié)議要更長(zhǎng)
- 難以定位VIP所在節(jié)點(diǎn),MetalLB并沒有提供一個(gè)簡(jiǎn)單直觀的方式讓我們查看到底哪一個(gè)節(jié)點(diǎn)是VIP所屬節(jié)點(diǎn)膝晾,基本只能通過(guò)抓包或者查看pod日志來(lái)確定栓始,當(dāng)集群規(guī)模變大的時(shí)候這會(huì)變得非常的麻煩
改進(jìn)方案:
- 有條件的可以考慮使用BGP模式
- 既不能用BGP模式也不能接受Layer2模式的,基本和目前主流的三個(gè)開源負(fù)載均衡器無(wú)緣了(三者都是Layer2模式和BGP模式且原理類似血当,優(yōu)缺點(diǎn)相同)
5.2 BGP Mode: Pros and Cons
BGP mode's pros and cons are almost the mirror image of Layer 2 mode's.
Pros:
- No single point of failure: with ECMP enabled, every node in the cluster receives request traffic and takes part in load balancing and forwarding
Cons:
- Demanding prerequisites: it needs a router that supports BGP, and the configuration is more complex
- ECMP failover is not particularly graceful; how serious this is depends on the ECMP algorithm in use. When cluster nodes change and BGP sessions go up or down, all connections are re-hashed (using 3-tuple or 5-tuple hashing), which can disrupt some services
The hash values used in routers are usually not stable, so whenever the size of the backend set changes (for example, when a node's BGP session goes down), existing connections are effectively re-hashed at random, meaning most existing connections suddenly get forwarded to a different backend that likely knows nothing about them.
Possible improvements:
MetalLB suggests several mitigations, listed here for reference:
- Use a more stable ECMP algorithm to limit the impact on existing connections when the backend set changes, such as "resilient ECMP" or "resilient LAG"
- Pin the service to specific nodes to limit the blast radius
- Make changes during low-traffic periods
- Split the workload across two services with different LoadBalancer IPs and use DNS to shift traffic between them
- Add transparent, user-invisible retry logic on the client side
- Put an ingress layer behind the LoadBalancer for more graceful failover (though not every service can sit behind an ingress)
- Accept that there will be occasional bursts of reset connections; for low-availability internal services, this may be acceptable as-is
5.3 MetalLB Overall
Here I try to summarize some objective facts; whether each one counts as a pro or a con may differ from person to person:
- It has been open source for a relatively long time (compared with other cloud-native load balancers) and has a decent community base and mindshare, but the project is still in beta
- Deployment is quick and simple, with few parameters to configure by default, so it is easy to get started
- The official documentation is thin, covering only basic configuration and explanation; to go deeper you may need to read the source code
- Advanced management and configuration are inconvenient; getting a precise picture of a service's current state can be troublesome
- Configmap changes do not take effect very gracefully; in many cases we have to restart pods by hand
All in all, as an open-source load balancer still in beta, MetalLB fills a real gap in this space and has clearly influenced later open-source projects of the same kind. From a production standpoint, though, my impression is that it is currently more "available and workable" than genuinely pleasant to use. That is understandable given that MetalLB started as a personal open-source project and only recently gained a dedicated organization to maintain it; I hope it keeps getting better.