k8s Series 06 - Load Balancers: MetalLB

This article deploys MetalLB v0.12.1 on a vanilla Kubernetes cluster as the cluster's LoadBalancer implementation, covering the two deployment options: MetalLB's Layer2 mode and BGP mode. Since the principles and configuration of BGP are fairly involved, only a simple BGP setup is covered here.

The Kubernetes cluster used in this article is v1.23.6, deployed on CentOS 7 with docker and flannel. I have previously written about Kubernetes basics and several cluster setup approaches; feel free to check those posts if you need them.

1. How It Works

1.1 Overview

Before we start, let's take a quick look at how MetalLB works.

MetalLB hooks into your Kubernetes cluster, and provides a network load-balancer implementation. In short, it allows you to create Kubernetes services of type LoadBalancer in clusters that don’t run on a cloud provider, and thus cannot simply hook into paid products to provide load balancers.

It has two features that work together to provide this service: address allocation, and external announcement.

MetalLB is a concrete implementation of the LoadBalancer concept for Kubernetes clusters, used mainly to expose services inside the cluster to external clients. With MetalLB we can create services of type LoadBalancer in a Kubernetes cluster without relying on a cloud provider's load balancer.

It provides this service through two cooperating mechanisms: address allocation and external announcement, which correspond to the controller and the speaker workloads deployed in Kubernetes.

1.2 Address Allocation

Address allocation is fairly easy to understand: we first give MetalLB a range of IP addresses, and it then assigns an IP to each LoadBalancer service according to the service's configuration. From the official documentation we know that the LoadBalancer IP can either be specified manually or assigned automatically by MetalLB; we can also configure multiple IP ranges in MetalLB's configmap and decide, per range, whether automatic assignment is enabled.

Address allocation is implemented by the controller, which is deployed as a deployment; it watches the state of services in the cluster and assigns IPs to them.
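
As an illustration, a configmap with two pools where only one of them hands out IPs automatically could look roughly like this (a sketch; the pool names and ranges below are made up for illustration, the real configuration used in this article comes later):

$ cat > configmap-metallb-multipool.yaml <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 10.31.8.100-10.31.8.150
    - name: reserved
      protocol: layer2
      # services only get an IP from this pool when they request it explicitly
      auto-assign: false
      addresses:
      - 10.31.8.151-10.31.8.200
EOF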

1.3 External Announcement

External announcement is about advertising the EXTERNAL-IP of a LoadBalancer-type service to the network, so that clients can actually reach that IP. MetalLB implements this in three ways: ARP, NDP and BGP. ARP and NDP correspond to Layer2 mode for IPv4 and IPv6 respectively, while the BGP routing protocol corresponds to BGP mode. External announcement is handled by the speaker, which is deployed as a daemonset; it sends ARP/NDP packets on the network, or establishes sessions with BGP routers and advertises BGP routes.

1.4 A Note on Networking

Neither Layer2 mode nor BGP mode uses the Linux network stack to hold the VIP, which means tools such as the ip command cannot tell us exactly which node owns the VIP or what the corresponding routes are; what we can see instead is the IP bound to the kube-ipvs0 interface on every node. Also, both modes only steer requests for the VIP to a particular node; how the request then reaches a pod, and which scheduling policy is used, is handled entirely by kube-proxy.

The two modes each have their own pros, cons and limitations; let's deploy both first and then analyse them.

2宛瞄、準(zhǔn)備工作

2.1 系統(tǒng)要求

在開始部署MetalLB之前,我們需要確定部署環(huán)境能夠滿足最低要求:

  • A Kubernetes cluster of version 1.13.0 or later, with no other load-balancer plugin installed
  • A CNI plugin on the cluster that is compatible with MetalLB
  • A reserved range of IPv4 addresses for MetalLB to hand out as LoadBalancer VIPs
  • For BGP mode, one or more routers that support the BGP protocol
  • For Layer2 mode, since leader election is implemented with memberlist, port 7946 must be reachable between all Kubernetes nodes (both TCP and UDP); the port can be changed to suit your needs (see the example after this list)
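
For example, on the CentOS 7 nodes used here, opening the memberlist port with firewalld might look like this (a sketch; it assumes firewalld is in use and the default port 7946 is kept):

# open the memberlist port on every k8s node
$ firewall-cmd --permanent --add-port=7946/tcp
$ firewall-cmd --permanent --add-port=7946/udp
$ firewall-cmd --reload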

2.2 CNI Compatibility

MetalLB documents its compatibility with the mainstream CNI plugins. Since MetalLB mostly relies on the cluster's built-in kube-proxy for traffic forwarding, it gets along with most CNIs quite well.

CNI           Compatibility               Main issue
Calico        Mostly (see known issues)   Mainly BGP-mode compatibility; the community provides workarounds
Canal         Yes                         -
Cilium        Yes                         -
Flannel       Yes                         -
Kube-ovn      Yes                         -
Kube-router   Mostly (see known issues)   The built-in external BGP peering mode is not supported
Weave Net     Mostly (see known issues)   Support for externalTrafficPolicy: Local depends on the version

From this table it is easy to see that most combinations work fine, and that where compatibility problems do exist they mostly come from conflicts with BGP. In fact, BGP-related compatibility issues affect almost every open-source Kubernetes load balancer.

2.3 Cloud Provider Compatibility

The official compatibility list shows that MetalLB works poorly with most cloud providers. The reason is simple: most cloud environments cannot run BGP, and the more generic Layer2 mode cannot be guaranteed to work either, because every provider's network environment is different.

The short version is: cloud providers expose proprietary APIs instead of standard protocols to control their network layer, and MetalLB doesn’t work with those APIs.

Of course, if you are running on a cloud provider, the best option is simply to use the provider's own LoadBalancer service.

3椎镣、Layer2 mode

3.1 部署環(huán)境

本次MetalLB的部署環(huán)境為基于dockerflannel部署的1.23.6版本的k8s集群

IP Hostname
10.31.8.1 tiny-flannel-master-8-1.k8s.tcinternal
10.31.8.11 tiny-flannel-worker-8-11.k8s.tcinternal
10.31.8.12 tiny-flannel-worker-8-12.k8s.tcinternal
10.8.64.0/18 podSubnet
10.8.0.0/18 serviceSubnet
10.31.8.100-10.31.8.200 MetalLB IPpool

3.2 Configuring ARP Parameters

Deploying Layer2 mode requires enabling strictARP in the cluster's IPVS configuration. Once it is enabled, kube-proxy stops answering ARP requests for the addresses bound to the kube-ipvs0 interface, and MetalLB takes over that job.

Enabling strict ARP is equivalent to setting arp_ignore to 1 and arp_announce to 2, i.e. enforcing strict ARP behaviour. The principle is the same as the real-server configuration in LVS DR mode; see the explanation in my earlier article.

strict ARP configure arp_ignore and arp_announce to avoid answering ARP queries from kube-ipvs0 interface
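
For comparison, the manual LVS DR-mode style settings that the analogy above refers to would look roughly like this (shown only to illustrate what strictARP does; on the Kubernetes nodes you should let kube-proxy manage it rather than setting these by hand):

# equivalent sysctl settings, for illustration only
$ sysctl -w net.ipv4.conf.all.arp_ignore=1
$ sysctl -w net.ipv4.conf.all.arp_announce=2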

# Check the current strictARP setting in kube-proxy
$ kubectl get configmap -n kube-system kube-proxy -o yaml | grep strictARP
      strictARP: false

# Manually edit the configmap and set strictARP to true
$ kubectl edit configmap -n kube-system kube-proxy
configmap/kube-proxy edited

# Or change it non-interactively and preview the diff first
$ kubectl get configmap kube-proxy -n kube-system -o yaml | sed -e "s/strictARP: false/strictARP: true/" | kubectl diff -f - -n kube-system

# Once the diff looks right, apply the change
$ kubectl get configmap kube-proxy -n kube-system -o yaml | sed -e "s/strictARP: false/strictARP: true/" | kubectl apply -f - -n kube-system

# Restart kube-proxy so the change takes effect
$ kubectl rollout restart ds kube-proxy -n kube-system

# Confirm the change is in place
$ kubectl get configmap -n kube-system kube-proxy -o yaml | grep strictARP
      strictARP: true

3.3 Deploying MetalLB

Deploying MetalLB itself is also very simple. The project offers three installation methods: plain manifest files (yaml), Helm 3 and Kustomize. Here we stick with the manifest files.

Most official tutorials simplify the steps by having you kubectl apply a yaml straight from a URL. That is quick and convenient, but it leaves no local copy and makes later modifications awkward, so I prefer to download the yaml files and keep them locally before applying them.
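
For reference, a Helm 3 based installation would look roughly like the following sketch (not used in this article; the chart options may differ between MetalLB versions):

$ helm repo add metallb https://metallb.github.io/metallb
$ helm repo update
$ helm install metallb metallb/metallb --namespace metallb-system --create-namespace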

# Download the two v0.12.1 deployment manifests
$ wget https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/namespace.yaml
$ wget https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/metallb.yaml

# If you use frr for BGP routing, download these two manifests instead
$ wget https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/namespace.yaml
$ wget https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/metallb-frr.yaml

After downloading the official yaml files, we prepare the configmap ahead of time; a reference file is provided on GitHub. Layer2 mode needs very little configuration, so we only define the most basic parameters here:

  • protocol is set to layer2
  • addresses can be given either as a CIDR block (198.51.100.0/24) or as a start-end range (192.168.0.150-192.168.0.200); here we use a range in the same subnet as the Kubernetes nodes
$ cat > configmap-metallb.yaml <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 10.31.8.100-10.31.8.200
EOF

Now we can start the deployment, which breaks down into three steps:

  1. Create the namespace
  2. Deploy the deployment and daemonset
  3. Apply the configmap

# Create the namespace
$ kubectl apply -f namespace.yaml
namespace/metallb-system created
$ kubectl get ns
NAME              STATUS   AGE
default           Active   8d
kube-node-lease   Active   8d
kube-public       Active   8d
kube-system       Active   8d
metallb-system    Active   8s
nginx-quic        Active   8d

# Deploy the deployment and daemonset, along with the other required resources
$ kubectl apply -f metallb.yaml
Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
podsecuritypolicy.policy/controller created
podsecuritypolicy.policy/speaker created
serviceaccount/controller created
serviceaccount/speaker created
clusterrole.rbac.authorization.k8s.io/metallb-system:controller created
clusterrole.rbac.authorization.k8s.io/metallb-system:speaker created
role.rbac.authorization.k8s.io/config-watcher created
role.rbac.authorization.k8s.io/pod-lister created
role.rbac.authorization.k8s.io/controller created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:controller created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:speaker created
rolebinding.rbac.authorization.k8s.io/config-watcher created
rolebinding.rbac.authorization.k8s.io/pod-lister created
rolebinding.rbac.authorization.k8s.io/controller created
daemonset.apps/speaker created
deployment.apps/controller created

# The controller deployment watches the state of services
$ kubectl get deploy -n metallb-system
NAME         READY   UP-TO-DATE   AVAILABLE   AGE
controller   1/1     1            1           86s
# The speaker daemonset runs on every node to negotiate the VIPs and send/receive ARP and NDP packets
$ kubectl get ds -n metallb-system
NAME      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
speaker   3         3         3       3            3           kubernetes.io/os=linux   64s
$ kubectl get pod -n metallb-system -o wide
NAME                         READY   STATUS    RESTARTS   AGE    IP           NODE                                      NOMINATED NODE   READINESS GATES
controller-57fd9c5bb-svtjw   1/1     Running   0          117s   10.8.65.4    tiny-flannel-worker-8-11.k8s.tcinternal   <none>           <none>
speaker-bf79q                1/1     Running   0          117s   10.31.8.11   tiny-flannel-worker-8-11.k8s.tcinternal   <none>           <none>
speaker-fl5l8                1/1     Running   0          117s   10.31.8.12   tiny-flannel-worker-8-12.k8s.tcinternal   <none>           <none>
speaker-nw2fm                1/1     Running   0          117s   10.31.8.1    tiny-flannel-master-8-1.k8s.tcinternal    <none>           <none>

      
# Apply the configmap
$ kubectl apply -f configmap-metallb.yaml
configmap/config created

3.4 Deploying a Test Service

We define our own service for testing. The test image is nginx-based and by default returns the requesting client's IP and port.

$ cat > nginx-quic-lb.yaml <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: nginx-quic

---

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-lb
  namespace: nginx-quic
spec:
  selector:
    matchLabels:
      app: nginx-lb
  replicas: 4
  template:
    metadata:
      labels:
        app: nginx-lb
    spec:
      containers:
      - name: nginx-lb
        image: tinychen777/nginx-quic:latest
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 80

---

apiVersion: v1
kind: Service
metadata:
  name: nginx-lb-service
  namespace: nginx-quic
spec:
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  selector:
    app: nginx-lb
  ports:
  - protocol: TCP
    port: 80 # match for service access port
    targetPort: 80 # match for pod access port
  type: LoadBalancer
  loadBalancerIP: 10.31.8.100
EOF

Note that in the configuration above the service's type field is set to LoadBalancer and loadBalancerIP is set to 10.31.8.100.

Note: not every LoadBalancer implementation allows the loadBalancerIP field to be set.

If the implementation supports the field, the load balancer is created with the user-specified loadBalancerIP.

If the loadBalancerIP field is not set, the load balancer is assigned an ephemeral IP.

If loadBalancerIP is set but the implementation does not support the feature, the value is ignored.

# Create the test service and check the result
$ kubectl apply -f nginx-quic-lb.yaml
namespace/nginx-quic created
deployment.apps/nginx-lb created
service/nginx-lb-service created

Checking the service status, TYPE has now changed to LoadBalancer and EXTERNAL-IP shows the 10.31.8.100 we specified.

# Check the service status; TYPE has changed to LoadBalancer
$ kubectl get svc -n nginx-quic
NAME               TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)        AGE
nginx-lb-service   LoadBalancer   10.8.32.221   10.31.8.100   80:30181/TCP   25h

Looking at the full nginx-lb-service object, we can see the ClusterIP, the LoadBalancer VIP, the nodePort, and the traffic-policy settings:

$ kubectl get svc -n nginx-quic nginx-lb-service -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"name":"nginx-lb-service","namespace":"nginx-quic"},"spec":{"externalTrafficPolicy":"Cluster","internalTrafficPolicy":"Cluster","loadBalancerIP":"10.31.8.100","ports":[{"port":80,"protocol":"TCP","targetPort":80}],"selector":{"app":"nginx-lb"},"type":"LoadBalancer"}}
  creationTimestamp: "2022-05-16T06:01:23Z"
  name: nginx-lb-service
  namespace: nginx-quic
  resourceVersion: "1165135"
  uid: f547842e-4547-4d01-abbc-89ac8b059a2a
spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 10.8.32.221
  clusterIPs:
  - 10.8.32.221
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  loadBalancerIP: 10.31.8.100
  ports:
  - nodePort: 30181
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx-lb
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - ip: 10.31.8.100

Checking the IPVS rules, we can see forwarding rules for the ClusterIP, the LoadBalancer VIP and the nodePort; by default a nodePort service is also created along with the LoadBalancer:

$ ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  172.17.0.1:30181 rr
  -> 10.8.65.15:80                Masq    1      0          0
  -> 10.8.65.16:80                Masq    1      0          0
  -> 10.8.66.12:80                Masq    1      0          0
  -> 10.8.66.13:80                Masq    1      0          0
TCP  10.8.32.221:80 rr
  -> 10.8.65.15:80                Masq    1      0          0
  -> 10.8.65.16:80                Masq    1      0          0
  -> 10.8.66.12:80                Masq    1      0          0
  -> 10.8.66.13:80                Masq    1      0          0
TCP  10.8.64.0:30181 rr
  -> 10.8.65.15:80                Masq    1      0          0
  -> 10.8.65.16:80                Masq    1      0          0
  -> 10.8.66.12:80                Masq    1      0          0
  -> 10.8.66.13:80                Masq    1      0          0
TCP  10.8.64.1:30181 rr
  -> 10.8.65.15:80                Masq    1      0          0
  -> 10.8.65.16:80                Masq    1      0          0
  -> 10.8.66.12:80                Masq    1      0          0
  -> 10.8.66.13:80                Masq    1      0          0
TCP  10.31.8.1:30181 rr
  -> 10.8.65.15:80                Masq    1      0          0
  -> 10.8.65.16:80                Masq    1      0          0
  -> 10.8.66.12:80                Masq    1      0          0
  -> 10.8.66.13:80                Masq    1      0          0
TCP  10.31.8.100:80 rr
  -> 10.8.65.15:80                Masq    1      0          0
  -> 10.8.65.16:80                Masq    1      0          0
  -> 10.8.66.12:80                Masq    1      0          0
  -> 10.8.66.13:80                Masq    1      0          0

Use curl to verify the service works:

$ curl 10.31.8.100:80
10.8.64.0:60854
$ curl 10.8.1.166:80
10.8.64.0:2562
$ curl 10.31.8.1:30974
10.8.64.0:1635
$ curl 10.31.8.100:80
10.8.64.0:60656

3.5 About the VIP

The LoadBalancer VIP can be seen on the kube-ipvs0 interface of every Kubernetes node:

$ ip addr show kube-ipvs0
5: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
    link/ether 4e:ba:e8:25:cf:17 brd ff:ff:ff:ff:ff:ff
    inet 10.8.0.1/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.8.0.10/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.8.32.221/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.31.8.100/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever

Working out which node currently owns the VIP is more awkward. One approach is to find a machine in the same Layer2 network as the cluster, check its ARP table, and then use the MAC address to find the node IP it belongs to; that tells us which node the VIP lives on.

$ arp -a | grep 10.31.8.100
? (10.31.8.100) at 52:54:00:5c:9c:97 [ether] on eth0

$ arp -a | grep 52:54:00:5c:9c:97
tiny-flannel-worker-8-12.k8s.tcinternal (10.31.8.12) at 52:54:00:5c:9c:97 [ether] on eth0
? (10.31.8.100) at 52:54:00:5c:9c:97 [ether] on eth0

$ ip a | grep 52:54:00:5c:9c:97
    link/ether 52:54:00:5c:9c:97 brd ff:ff:ff:ff:ff:ff
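
Another option is to capture ARP traffic for the VIP from a machine in the same Layer2 segment and see which MAC answers (a sketch; the interface name eth0 is an assumption about your environment):

# watch ARP traffic for the VIP; the answering MAC identifies the owning node
$ tcpdump -n -i eth0 arp and host 10.31.8.100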

Alternatively, we can check the speaker pod logs and find the records of the service IP being announced:

$ kubectl logs -f -n metallb-system speaker-fl5l8
{"caller":"level.go:63","event":"serviceAnnounced","ips":["10.31.8.100"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"layer2","service":"nginx-quic/nginx-lb-service","ts":"2022-05-16T06:11:34.099204376Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["10.31.8.100"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"layer2","service":"nginx-quic/nginx-lb-service","ts":"2022-05-16T06:12:09.527334808Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["10.31.8.100"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"layer2","service":"nginx-quic/nginx-lb-service","ts":"2022-05-16T06:12:09.547734268Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["10.31.8.100"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"layer2","service":"nginx-quic/nginx-lb-service","ts":"2022-05-16T06:12:34.267651651Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["10.31.8.100"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"layer2","service":"nginx-quic/nginx-lb-service","ts":"2022-05-16T06:12:34.286130424Z"}

3.6 About the nodePort

Careful readers will have noticed that when we create a LoadBalancer service, Kubernetes by default also creates a nodePort for it. This behaviour is controlled by the allocateLoadBalancerNodePorts field in the Service spec, which defaults to true.

Different LoadBalancer implementations work differently: some rely on the nodePort to forward traffic, while others forward requests directly to the pods. MetalLB forwards the traffic straight to the pods via kube-proxy, so if we want to disable the nodePort we can set spec.allocateLoadBalancerNodePorts to false in the service, and no nodePort will be allocated when the service is created.

Note, however, that if you change an existing service and turn the nodePort off (from true to false), Kubernetes does not clean up the existing IPVS rules automatically; we have to delete them by hand, as sketched below.
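
A minimal sketch of that manual cleanup, using the nodePort virtual services from the earlier ipvsadm output as an example (the addresses and port come from this article's environment, and the commands have to be run on every node):

# remove the leftover nodePort virtual services by hand
$ ipvsadm -D -t 10.31.8.1:30181
$ ipvsadm -D -t 172.17.0.1:30181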

Let's redefine and recreate the service:

apiVersion: v1
kind: Service
metadata:
  name: nginx-lb-service
  namespace: nginx-quic
spec:
  allocateLoadBalancerNodePorts: false
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  selector:
    app: nginx-lb
  ports:
  - protocol: TCP
    port: 80 # match for service access port
    targetPort: 80 # match for pod access port
  type: LoadBalancer
  loadBalancerIP: 10.31.8.100

Checking the service status and the IPVS rules again, all nodePort-related configuration is gone:

$ ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.8.62.180:80 rr
  -> 10.8.65.18:80                Masq    1      0          0
  -> 10.8.65.19:80                Masq    1      0          0
  -> 10.8.66.14:80                Masq    1      0          0
  -> 10.8.66.15:80                Masq    1      0          0
TCP  10.31.8.100:80 rr
  -> 10.8.65.18:80                Masq    1      0          0
  -> 10.8.65.19:80                Masq    1      0          0
  -> 10.8.66.14:80                Masq    1      0          0
  -> 10.8.66.15:80                Masq    1      0          0

$ kubectl get svc -n nginx-quic
NAME               TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)   AGE
nginx-lb-service   LoadBalancer   10.8.62.180   10.31.8.100   80/TCP    23s

If an existing service's spec.allocateLoadBalancerNodePorts is changed from true to false, the existing nodePort is not removed automatically, so it is best to plan these parameters when the service is first created:

$ kubectl get svc -n nginx-quic nginx-lb-service -o yaml | egrep " allocateLoadBalancerNodePorts: "
  allocateLoadBalancerNodePorts: false
$ kubectl get svc -n nginx-quic
NAME               TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)        AGE
nginx-lb-service   LoadBalancer   10.8.62.180   10.31.8.100   80:31405/TCP   85m

4侧漓、BGP mode

4.1 網(wǎng)絡(luò)拓?fù)?/h2>

測(cè)試環(huán)境的網(wǎng)絡(luò)拓?fù)浞浅5暮?jiǎn)單锅尘,MetalLB的網(wǎng)段為了和前面Layer2模式進(jìn)行區(qū)分,更換為10.9.0.0/16,具體信息如下

IP Hostname
10.31.8.1 tiny-flannel-master-8-1.k8s.tcinternal
10.31.8.11 tiny-flannel-worker-8-11.k8s.tcinternal
10.31.8.12 tiny-flannel-worker-8-12.k8s.tcinternal
10.31.254.251 OpenWrt
10.9.0.0/16 MetalLB BGP IPpool

The three Kubernetes nodes are directly connected to an OpenWrt router. OpenWrt acts as the nodes' gateway and also runs BGP, routing requests for the MetalLB VIPs to the individual Kubernetes nodes.

Before configuring anything we need to assign private AS numbers to the router and to the Kubernetes nodes; the private AS ranges documented on Wikipedia can be used as a reference. Here the router uses AS 64512 and MetalLB uses AS 64513.

4.2 Installing the Routing Software

Taking a typical home OpenWrt router as an example, we first install the quagga packages; if your OpenWrt build ships the frr packages, frr is the recommended choice.

If you are on another Linux distribution (such as CentOS or Debian), using frr directly is recommended.

First, install quagga on OpenWrt with opkg:

$ opkg update 
$ opkg install quagga quagga-zebra quagga-bgpd quagga-vtysh

If your OpenWrt version is recent enough, the frr packages can be installed directly with opkg as well:

$ opkg update 
$ opkg install frr frr-babeld frr-bfdd frr-bgpd frr-eigrpd frr-fabricd frr-isisd frr-ldpd frr-libfrr frr-nhrpd frr-ospf6d frr-ospfd frr-pbrd frr-pimd frr-ripd frr-ripngd frr-staticd frr-vrrpd frr-vtysh frr-watchfrr frr-zebra

If you use frr, remember to enable the bgpd daemon in the configuration and then restart frr:

$ sed -i 's/bgpd=no/bgpd=yes/g' /etc/frr/daemons
$ /etc/init.d/frr restart

4.3 Configuring BGP on the Router

The configuration below uses frr as an example; with quagga the process is essentially the same, either through vtysh or by editing the configuration file directly.

Check that the daemons are listening on ports 2601 and 2605:

root@OpenWrt:~# netstat -ntlup | egrep "zebra|bgpd"
tcp        0      0 0.0.0.0:2601            0.0.0.0:*               LISTEN      3018/zebra
tcp        0      0 0.0.0.0:2605            0.0.0.0:*               LISTEN      3037/bgpd

Port 179, used by BGP, is not being listened on yet because we have not configured BGP. We can configure it either through vtysh or by editing the configuration file and restarting the service.

Typing vtysh on the command line drops you into the vtysh configuration shell (similar to virsh in KVM virtualisation); note how the prompt changes:

root@OpenWrt:~# vtysh

Hello, this is Quagga (version 1.2.4).
Copyright 1996-2005 Kunihiro Ishiguro, et al.

OpenWrt#

Configuring everything on the command line is tedious, however, so we can also edit the configuration file directly and then restart the service.

For quagga the BGP configuration file is /etc/quagga/bgpd.conf by default; the path may differ between distributions and installation methods.

$ cat /etc/quagga/bgpd.conf
!
! Zebra configuration saved from vty
!   2022/05/19 11:01:35
!
password zebra
!
router bgp 64512
 bgp router-id 10.31.254.251
 neighbor 10.31.8.1 remote-as 64513
 neighbor 10.31.8.1 description 10-31-8-1
 neighbor 10.31.8.11 remote-as 64513
 neighbor 10.31.8.11 description 10-31-8-11
 neighbor 10.31.8.12 remote-as 64513
 neighbor 10.31.8.12 description 10-31-8-12
 maximum-paths 3
!
 address-family ipv6
 exit-address-family
 exit
!
access-list vty permit 127.0.0.0/8
access-list vty deny any
!
line vty
 access-class vty
!

If you use frr instead, the configuration is a little different: the file to edit is /etc/frr/frr.conf; again, the exact path may differ between distributions and installation methods.

$ cat /etc/frr/frr.conf
frr version 8.2.2
frr defaults traditional
hostname tiny-openwrt-plus
!
password zebra
!
router bgp 64512
 bgp router-id 10.31.254.251
 no bgp ebgp-requires-policy
 neighbor 10.31.8.1 remote-as 64513
 neighbor 10.31.8.1 description 10-31-8-1
 neighbor 10.31.8.11 remote-as 64513
 neighbor 10.31.8.11 description 10-31-8-11
 neighbor 10.31.8.12 remote-as 64513
 neighbor 10.31.8.12 description 10-31-8-12
 !
 address-family ipv4 unicast
 exit-address-family
exit
!
access-list vty seq 5 permit 127.0.0.0/8
access-list vty seq 10 deny any
!
line vty
 access-class vty
exit
!

After the configuration is done, restart the service:

# Command to restart frr
$ /etc/init.d/frr restart
# Command to restart quagga
$ /etc/init.d/quagga restart

After the restart, enter vtysh and check the BGP status:

tiny-openwrt-plus# show ip bgp summary

IPv4 Unicast Summary (VRF default):
BGP router identifier 10.31.254.251, local AS number 64512 vrf-id 0
BGP table version 0
RIB entries 0, using 0 bytes of memory
Peers 3, using 2149 KiB of memory

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
10.31.8.1       4      64513         0         0        0    0    0    never       Active        0 10-31-8-1
10.31.8.11      4      64513         0         0        0    0    0    never       Active        0 10-31-8-11
10.31.8.12      4      64513         0         0        0    0    0    never       Active        0 10-31-8-12

Total number of neighbors 3

Checking the router's listening ports again, BGP is now up and running:

$ netstat -ntlup | egrep "zebra|bgpd"
tcp        0      0 127.0.0.1:2605          0.0.0.0:*               LISTEN      31625/bgpd
tcp        0      0 127.0.0.1:2601          0.0.0.0:*               LISTEN      31618/zebra
tcp        0      0 0.0.0.0:179             0.0.0.0:*               LISTEN      31625/bgpd
tcp        0      0 :::179                  :::*                    LISTEN      31625/bgpd

4.4 Configuring MetalLB for BGP

First we modify the configmap:

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    peers:
    - peer-address: 10.31.254.251
      peer-port: 179
      peer-asn: 64512
      my-asn: 64513
    address-pools:
    - name: default
      protocol: bgp
      addresses:
      - 10.9.0.0/16

After the change, re-apply the configmap and check MetalLB's state:

$ kubectl apply -f configmap-metal.yaml
configmap/config configured

$ kubectl get cm -n metallb-system config -o yaml
apiVersion: v1
data:
  config: |
    peers:
    - peer-address: 10.31.254.251
      peer-port: 179
      peer-asn: 64512
      my-asn: 64513
    address-pools:
    - name: default
      protocol: bgp
      addresses:
      - 10.9.0.0/16
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"config":"peers:\n- peer-address: 10.31.254.251\n  peer-port: 179\n  peer-asn: 64512\n  my-asn: 64513\naddress-pools:\n- name: default\n  protocol: bgp\n  addresses:\n  - 10.9.0.0/16\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"config","namespace":"metallb-system"}}
  creationTimestamp: "2022-05-16T04:37:54Z"
  name: config
  namespace: metallb-system
  resourceVersion: "1412854"
  uid: 6d94ca36-93fe-4ea2-9407-96882ad8e35c

The router now shows that BGP sessions have been established with the three Kubernetes nodes:

tiny-openwrt-plus# show ip bgp summary

IPv4 Unicast Summary (VRF default):
BGP router identifier 10.31.254.251, local AS number 64512 vrf-id 0
BGP table version 3
RIB entries 5, using 920 bytes of memory
Peers 3, using 2149 KiB of memory

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
10.31.8.1       4      64513         6         4        0    0    0 00:00:45            3        3 10-31-8-1
10.31.8.11      4      64513         6         4        0    0    0 00:00:45            3        3 10-31-8-11
10.31.8.12      4      64513         6         4        0    0    0 00:00:45            3        3 10-31-8-12

Total number of neighbors 3

If a node fails to establish its BGP session, restarting the speaker pod on that node will make it retry:

$ kubectl delete po speaker-fl5l8 -n metallb-system

4.5 Configuring the Service

Once the configmap change takes effect, the EXTERNAL-IP of an existing service is not reassigned automatically:

$ kubectl get svc -n nginx-quic
NAME               TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
nginx-lb-service   LoadBalancer   10.8.4.92    10.31.8.100   80/TCP    18h

At this point we can restart the controller so that it reassigns an EXTERNAL-IP to the service:

$ kubectl delete po -n metallb-system controller-57fd9c5bb-svtjw
pod "controller-57fd9c5bb-svtjw" deleted

After the restart we check the service again. If the LoadBalancer VIP had been allocated automatically (i.e. no loadBalancerIP was specified), the service should by now have received a new IP and be working normally. Our service, however, had loadBalancerIP manually set to 10.31.8.100 earlier, so its EXTERNAL-IP is now stuck in pending:

$ kubectl get svc -n nginx-quic
NAME               TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
nginx-lb-service   LoadBalancer   10.8.4.92    <pending>     80/TCP    18h

We now change loadBalancerIP to 10.9.1.1; after the change the service comes up normally, as shown below.
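
One way to make that change without re-applying the whole manifest is kubectl patch (a sketch; editing the yaml and running kubectl apply again works just as well):

# point the service at an IP inside the new BGP pool
$ kubectl patch svc nginx-lb-service -n nginx-quic -p '{"spec":{"loadBalancerIP":"10.9.1.1"}}'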

$ kubectl get svc -n nginx-quic
NAME               TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
nginx-lb-service   LoadBalancer   10.8.4.92    10.9.1.1      80/TCP    18h

The controller logs show what happened:

$ kubectl logs controller-57fd9c5bb-d6jsl -n metallb-system
{"branch":"HEAD","caller":"level.go:63","commit":"v0.12.1","goversion":"gc / go1.16.14 / amd64","level":"info","msg":"MetalLB controller starting version 0.12.1 (commit v0.12.1, branch HEAD)","ts":"2022-05-18T03:45:45.440872105Z","version":"0.12.1"}
{"caller":"level.go:63","configmap":"metallb-system/config","event":"configLoaded","level":"info","msg":"config (re)loaded","ts":"2022-05-18T03:45:45.610395481Z"}
{"caller":"level.go:63","error":"[\"10.31.8.100\"] is not allowed in config","event":"clearAssignment","level":"info","msg":"current IP not allowed by config, clearing","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.611009691Z"}
{"caller":"level.go:63","event":"clearAssignment","level":"info","msg":"user requested a different IP than the one currently assigned","reason":"differentIPRequested","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.611062419Z"}
{"caller":"level.go:63","error":"controller not synced","level":"error","msg":"controller not synced yet, cannot allocate IP; will retry after sync","op":"allocateIPs","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.611080525Z"}
{"caller":"level.go:63","event":"stateSynced","level":"info","msg":"controller synced, can allocate IPs now","ts":"2022-05-18T03:45:45.611117023Z"}
{"caller":"level.go:63","error":"[\"10.31.8.100\"] is not allowed in config","event":"clearAssignment","level":"info","msg":"current IP not allowed by config, clearing","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.617013146Z"}
{"caller":"level.go:63","event":"clearAssignment","level":"info","msg":"user requested a different IP than the one currently assigned","reason":"differentIPRequested","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.617089367Z"}
{"caller":"level.go:63","error":"[\"10.31.8.100\"] is not allowed in config","level":"error","msg":"IP allocation failed","op":"allocateIPs","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.617122976Z"}
{"caller":"level.go:63","event":"serviceUpdated","level":"info","msg":"updated service object","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.626039403Z"}
{"caller":"level.go:63","error":"[\"10.31.8.100\"] is not allowed in config","level":"error","msg":"IP allocation failed","op":"allocateIPs","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.626361986Z"}
{"caller":"level.go:63","event":"ipAllocated","ip":["10.9.1.1"],"level":"info","msg":"IP address assigned by controller","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:47:19.943434144Z"}

The speaker logs show the BGP session with the router being established, errors about the non-conforming loadBalancerIP 10.31.8.100, and the BGP route being advertised for loadBalancerIP 10.9.1.1:

$ kubectl logs -n metallb-system speaker-bf79q

{"caller":"level.go:63","configmap":"metallb-system/config","event":"peerAdded","level":"info","msg":"peer configured, starting BGP session","peer":"10.31.254.251","ts":"2022-05-18T03:41:55.046091105Z"}
{"caller":"level.go:63","configmap":"metallb-system/config","event":"configLoaded","level":"info","msg":"config (re)loaded","ts":"2022-05-18T03:41:55.046268735Z"}
{"caller":"level.go:63","error":"assigned IP not allowed by config","ips":["10.31.8.100"],"level":"error","msg":"IP allocated by controller not allowed by config","op":"setBalancer","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:41:55.051955069Z"}
struct { Version uint8; ASN16 uint16; HoldTime uint16; RouterID uint32; OptsLen uint8 }{Version:0x4, ASN16:0xfc00, HoldTime:0xb4, RouterID:0xa1ffefd, OptsLen:0x1e}
{"caller":"level.go:63","event":"sessionUp","level":"info","localASN":64513,"msg":"BGP session established","peer":"10.31.254.251:179","peerASN":64512,"ts":"2022-05-18T03:41:55.052734174Z"}
{"caller":"level.go:63","level":"info","msg":"triggering discovery","op":"memberDiscovery","ts":"2022-05-18T03:42:40.183574415Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"10.31.8.12","node event":"NodeLeave","node name":"tiny-flannel-worker-8-12.k8s.tcinternal","ts":"2022-05-18T03:44:03.649494062Z"}
{"caller":"level.go:63","error":"assigned IP not allowed by config","ips":["10.31.8.100"],"level":"error","msg":"IP allocated by controller not allowed by config","op":"setBalancer","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:44:03.655003303Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"10.31.8.12","node event":"NodeJoin","node name":"tiny-flannel-worker-8-12.k8s.tcinternal","ts":"2022-05-18T03:44:06.247929645Z"}
{"caller":"level.go:63","error":"assigned IP not allowed by config","ips":["10.31.8.100"],"level":"error","msg":"IP allocated by controller not allowed by config","op":"setBalancer","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:44:06.25369106Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["10.9.1.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:47:19.953729779Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["10.9.1.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:47:19.953912236Z"}

我們?cè)诩和獾娜我庖粋€(gè)機(jī)器進(jìn)行測(cè)試

$ curl -v 10.9.1.1
* About to connect() to 10.9.1.1 port 80 (#0)
*   Trying 10.9.1.1...
* Connected to 10.9.1.1 (10.9.1.1) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.9.1.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: nginx
< Date: Wed, 18 May 2022 04:17:41 GMT
< Content-Type: text/plain
< Content-Length: 16
< Connection: keep-alive
<
10.8.64.0:43939
* Connection #0 to host 10.9.1.1 left intact

4.6 Checking ECMP

Looking at the routing table on the router, there is now a route for 10.9.1.1/32 whose next hop consists of multiple IPs, which means ECMP has been enabled successfully.

tiny-openwrt-plus# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

K>* 0.0.0.0/0 [0/0] via 10.31.254.254, eth0, 00:04:52
B>* 10.9.1.1/32 [20/0] via 10.31.8.1, eth0, weight 1, 00:01:40
  *                    via 10.31.8.11, eth0, weight 1, 00:01:40
  *                    via 10.31.8.12, eth0, weight 1, 00:01:40
C>* 10.31.0.0/16 is directly connected, eth0, 00:04:52

我們?cè)賱?chuàng)建多幾個(gè)服務(wù)進(jìn)行測(cè)試

# kubectl get svc -n nginx-quic
NAME                TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
nginx-lb-service    LoadBalancer   10.8.4.92    10.9.1.1      80/TCP    23h
nginx-lb2-service   LoadBalancer   10.8.10.48   10.9.1.2      80/TCP    64m
nginx-lb3-service   LoadBalancer   10.8.6.116   10.9.1.3      80/TCP    64m

And check the router's state again:

tiny-openwrt-plus# show ip bgp
BGP table version is 3, local router ID is 10.31.254.251, vrf id 0
Default local pref 100, local AS 64512
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

   Network          Next Hop            Metric LocPrf Weight Path
*= 10.9.1.1/32      10.31.8.12                             0 64513 ?
*>                  10.31.8.1                              0 64513 ?
*=                  10.31.8.11                             0 64513 ?
*= 10.9.1.2/32      10.31.8.12                             0 64513 ?
*>                  10.31.8.1                              0 64513 ?
*=                  10.31.8.11                             0 64513 ?
*= 10.9.1.3/32      10.31.8.12                             0 64513 ?
*>                  10.31.8.1                              0 64513 ?
*=                  10.31.8.11                             0 64513 ?

Displayed  3 routes and 9 total paths


tiny-openwrt-plus# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

K>* 0.0.0.0/0 [0/0] via 10.31.254.254, eth0, 00:06:12
B>* 10.9.1.1/32 [20/0] via 10.31.8.1, eth0, weight 1, 00:03:00
  *                    via 10.31.8.11, eth0, weight 1, 00:03:00
  *                    via 10.31.8.12, eth0, weight 1, 00:03:00
B>* 10.9.1.2/32 [20/0] via 10.31.8.1, eth0, weight 1, 00:03:00
  *                    via 10.31.8.11, eth0, weight 1, 00:03:00
  *                    via 10.31.8.12, eth0, weight 1, 00:03:00
B>* 10.9.1.3/32 [20/0] via 10.31.8.1, eth0, weight 1, 00:03:00
  *                    via 10.31.8.11, eth0, weight 1, 00:03:00
  *                    via 10.31.8.12, eth0, weight 1, 00:03:00
C>* 10.31.0.0/16 is directly connected, eth0, 00:06:12

Only when the routing table shows multiple next-hop IPs for our LoadBalancer IPs is ECMP working correctly; otherwise, double-check the BGP configuration.

5陈醒、總結(jié)

5.1 Layer2 mode優(yōu)缺點(diǎn)

優(yōu)點(diǎn):

  • 通用性強(qiáng)惕橙,對(duì)比BGP模式不需要BGP路由器支持,幾乎可以適用于任何網(wǎng)絡(luò)環(huán)境钉跷;當(dāng)然云廠商的網(wǎng)絡(luò)環(huán)境例外

缺點(diǎn):

  • 所有的流量都會(huì)在同一個(gè)節(jié)點(diǎn)上弥鹦,該節(jié)點(diǎn)的容易成為流量的瓶頸
  • 當(dāng)VIP所在節(jié)點(diǎn)宕機(jī)之后,需要較長(zhǎng)時(shí)間進(jìn)行故障轉(zhuǎn)移(一般在10s)爷辙,這主要是因?yàn)镸etalLB使用了memberlist來(lái)進(jìn)行選主彬坏,當(dāng)VIP所在節(jié)點(diǎn)宕機(jī)之后重新選主的時(shí)間要比傳統(tǒng)的keepalived使用的vrrp協(xié)議要更長(zhǎng)
  • 難以定位VIP所在節(jié)點(diǎn),MetalLB并沒有提供一個(gè)簡(jiǎn)單直觀的方式讓我們查看到底哪一個(gè)節(jié)點(diǎn)是VIP所屬節(jié)點(diǎn)膝晾,基本只能通過(guò)抓包或者查看pod日志來(lái)確定栓始,當(dāng)集群規(guī)模變大的時(shí)候這會(huì)變得非常的麻煩

改進(jìn)方案:

  • 有條件的可以考慮使用BGP模式
  • 既不能用BGP模式也不能接受Layer2模式的,基本和目前主流的三個(gè)開源負(fù)載均衡器無(wú)緣了(三者都是Layer2模式和BGP模式且原理類似血当,優(yōu)缺點(diǎn)相同)

5.2 BGP Mode: Pros and Cons

The pros and cons of BGP mode are almost the mirror image of Layer2 mode.

Pros:

  • No single point of failure: with ECMP enabled, every node in the cluster receives request traffic and takes part in load balancing and forwarding

Cons:

  • The requirements are strict: a router that supports BGP is needed, and the configuration is more complex
  • ECMP failover is not particularly graceful, and how severe this is depends on the ECMP algorithm in use; when a change in the cluster's nodes causes BGP sessions to change, all connections are rehashed (using 3-tuple or 5-tuple hashing), which can disrupt some services

The hashes used in routers are usually not stable, so whenever the size of the backend set changes (for example, when a node's BGP session goes down), existing connections are effectively rehashed at random, which means most existing connections suddenly get forwarded to a different backend that has no relation to the previous one and knows nothing about the connection's state.

Possible improvements:

MetalLB suggests several mitigations, listed here for reference:

  • Use a more stable ECMP algorithm, such as "resilient ECMP" or "resilient LAG", to reduce the impact on existing connections when the backend set changes
  • Pin the service to specific nodes to limit the potential impact
  • Make changes during off-peak hours
  • Split the workload across two services with different LoadBalancer IPs and use DNS to shift traffic between them
  • Add transparent, user-invisible retry logic on the client side
  • Put an ingress layer behind the LoadBalancer to achieve more graceful failover (though not every service can be fronted by an ingress)
  • Accept reality... (Accept that there will be occasional bursts of reset connections. For low-availability internal services, this may be acceptable as-is.)

5.3 MetalLB Overall: Pros and Cons

Here I try to summarise the facts as objectively as possible; whether each point counts as a pro or a con may vary from person to person:

  • It has been open source for a long time (by cloud-native load balancer standards) and has a reasonable community base and mindshare, but the project is still in beta
  • Deployment is quick and simple, with few parameters to configure by default, so it is easy to get started
  • The official documentation is sparse, covering only basic configuration and explanations; for a deeper understanding you may need to read the source code
  • Advanced management and configuration are inconvenient; getting an exact picture of a service's current state can be troublesome
  • Configmap changes do not take effect particularly gracefully; in many cases we have to restart the pods by hand (see the sketch below)
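
A minimal sketch of restarting both MetalLB workloads after a configmap change (deleting the pods, as done earlier in this article, works just as well):

# force the controller and speakers to reload the new configmap
$ kubectl rollout restart deployment controller -n metallb-system
$ kubectl rollout restart daemonset speaker -n metallb-system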

All in all, MetalLB, as an open-source load balancer still in beta, fills a real gap in this space and has clearly influenced later open-source projects of the same kind. From the point of view of actually running it in production, though, my impression is that it is currently more "available and workable" than genuinely pleasant to use. Considering that MetalLB started out as a personal open-source project and only recently came under the stewardship of a dedicated organisation, that is understandable; I hope it keeps developing and improving.
