This problem bothered me for two days. Along the way I went through plenty of stackoverflow/google/baidu results (all the same advice: not enough memory / upgrade / reinstall). Today I finally worked it out, so I'm writing it down here in the hope of giving anyone with the same experience a lead.
Let's look at the concrete symptom first. After the cluster was set up, kubectl version reported an error. Appending --v 9 for verbose logs shows that the client side is fine and the TCP connection to the server is established, but there is no normal response from the server side: the TLS handshake times out.
[root@***-24-69-3 ~]# kubectl version --v 9
I0511 09:49:55.099313 2329027 loader.go:372] Config loaded from file: /etc/kubernetes/admin.conf
I0511 09:49:55.099762 2329027 round_trippers.go:466] curl -v -XGET -H "Accept: application/json, */*" -H "User-Agent: kubectl/v1.23.6 (linux/amd64) kubernetes/ad33385" 'https://***.24.69.222:6443/version?timeout=32s'
I0511 09:49:55.100226 2329027 round_trippers.go:510] HTTP Trace: Dial to tcp:***.24.69.222:6443 succeed
I0511 09:50:05.100639 2329027 round_trippers.go:570] HTTP Statistics: DNSLookup 0 ms Dial 0 ms TLSHandshake 10000 ms Duration 10000 ms
I0511 09:50:05.100654 2329027 round_trippers.go:577] Response Headers:
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.6", GitCommit:"ad3338546da947756e8a88aa6822e9c11e7eac22", GitTreeState:"clean", BuildDate:"2022-04-14T08:49:13Z", GoVersion:"go1.17.9", Compiler:"gc", Platform:"linux/amd64"}
I0511 09:50:05.100728 2329027 helpers.go:237] Connection error: Get https://***.24.69.222:6443/version?timeout=32s: net/http: TLS handshake timeout
F0511 09:50:05.100742 2329027 helpers.go:118] Unable to connect to the server: net/http: TLS handshake timeout
goroutine 1 [running]:
k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0x1)
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1038 +0x8a
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).output(0x3080040, 0x3, 0x0, 0xc00053a230, 0x2, {0x25f2ec7, 0x10}, 0xc00010c400, 0x0)
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:987 +0x5fd
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).printDepth(0xc0004386c0, 0x40, 0x0, {0x0, 0x0}, 0x2a, {0xc00011eb20, 0x1, 0x1})
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:735 +0x1ae
k8s.io/kubernetes/vendor/k8s.io/klog/v2.FatalDepth(...)
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1518
k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/util.fatal({0xc0004386c0, 0x40}, 0xc0001fa120)
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/util/helpers.go:96 +0xc5
k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/util.checkErr({0x1fed760, 0xc0001fa120}, 0x1e797d0)
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/util/helpers.go:191 +0x7d7
k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/util.CheckErr(...)
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/util/helpers.go:118
k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/version.NewCmdVersion.func1(0xc000aecf00, {0xc00043b460, 0x0, 0x2})
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/version/version.go:79 +0xd1
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute(0xc000aecf00, {0xc00043b420, 0x2, 0x2})
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:860 +0x5f8
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc000395680)
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:974 +0x3bc
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute(...)
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:902
k8s.io/kubernetes/vendor/k8s.io/component-base/cli.run(0xc000395680)
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/component-base/cli/run.go:146 +0x325
k8s.io/kubernetes/vendor/k8s.io/component-base/cli.RunNoErrOutput(...)
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/component-base/cli/run.go:84
main.main()
_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubectl/kubectl.go:30 +0x1e
goroutine 6 [chan receive]:
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).flushDaemon(0x0)
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1181 +0x6a
created by k8s.io/kubernetes/vendor/k8s.io/klog/v2.init.0
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:420 +0xfb
goroutine 51 [select]:
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0, {0x1febb40, 0xc000568000}, 0x1, 0xc000138360)
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:167 +0x13b
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0, 0x12a05f200, 0x0, 0x0, 0x0)
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x89
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Until(...)
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Forever(0x0, 0x0)
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:81 +0x28
created by k8s.io/kubernetes/vendor/k8s.io/component-base/logs.InitLogs
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/component-base/logs/logs.go:179 +0x85
# Running curl directly hits the same problem
[root@***-24-69-3 ~]# curl -v -k -H "Accept: application/json, */*" -H "User-Agent: kubectl/v1.23.6 (linux/amd64) kubernetes/ad33385" 'https://***.24.69.222:6443/version?timeout=32s'
* About to connect() to ***.24.69.222 port 6443 (#0)
* Trying ***.24.69.222...
* Connected to ***.24.69.222 (***.24.69.222) port 6443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
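A quick way to confirm that the stall is inside the TLS handshake itself rather than at the TCP level is to bound the probe with a timeout and use openssl. The commands below are a suggested check, not part of the original session; the VIP and port are taken from the output above.

# TCP connect should succeed, but with the fault present no ServerHello /
# certificate ever comes back, so kill the probe after 15 seconds.
timeout 15 openssl s_client -connect ***.24.69.222:6443 < /dev/null
# curl can be bounded the same way; it aborts with a handshake/timeout error
# instead of hanging at the NSS/TLS stage as in the session above.
curl -vk --max-time 15 'https://***.24.69.222:6443/version'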
Afterwards I restored the faulty nodes by a rather crude method, and kept one of them around for troubleshooting.
Observing from a healthy node shows that the faulty node's services are actually all fine, so we can be fairly sure the CNI plugin Flannel is serving normally. From here we go back to kubectl's HTTPS problem.
[root@***-24-69-2 ~]# kubectl get no
NAME STATUS ROLES AGE VERSION
***-24-69-2.*** Ready control-plane,master 4d19h v1.23.6
***-24-69-3.*** Ready control-plane,master 4d19h v1.23.6
***-24-69-30.*** Ready <none> 35m v1.23.6
***-24-69-31.*** Ready <none> 15m v1.23.6
***-24-69-32.*** Ready <none> 30s v1.23.6
***-24-69-4.*** Ready control-plane,master 4d19h v1.23.6
***-24-69-5.*** Ready <none> 23h v1.23.6
***-24-69-6.*** Ready <none> 22h v1.23.6
# On 24-69-3, flannel looks normal
[root@***-24-69-3 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: em1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
link/ether 38:68:dd:4f:e7:58 brd ff:ff:ff:ff:ff:ff
3: em2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
link/ether 38:68:dd:4f:e7:58 brd ff:ff:ff:ff:ff:ff
4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 38:68:dd:4f:e7:58 brd ff:ff:ff:ff:ff:ff
6: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:9b:fa:a8:00 brd ff:ff:ff:ff:ff:ff
inet ***.17.0.1/16 brd ***.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
13: bond0.169@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 38:68:dd:4f:e7:58 brd ff:ff:ff:ff:ff:ff
inet ***.24.69.3/24 brd ***.24.69.255 scope global bond0.169
valid_lft forever preferred_lft forever
14: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
link/ether 6a:89:1a:8a:92:1f brd ff:ff:ff:ff:ff:ff
inet 10.244.1.0/32 scope global flannel.1
valid_lft forever preferred_lft forever
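For extra confirmation from a healthy node that the CNI side of the faulty node is fine, you can also check that the flannel pod scheduled on it is Running. This is only a sketch: the namespace depends on how flannel was installed (kube-system here, kube-flannel in newer manifests).

# Run from a working control-plane node; adjust the namespace to your install.
kubectl get pods -n kube-system -o wide | grep -i flannel | grep 24-69-3
kubectl get node ***-24-69-3.*** -o wide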
At this point it helps to review how HTTPS establishes a secure channel (the following borrows from the book 《圖解HTTP》); a way to watch this on the wire is sketched after the list.
1. The client sends a Client Hello message to start SSL communication with the server; it lists the SSL versions the client supports and its cipher suites (encryption algorithms and key lengths).
2. If the server can do SSL, it replies with a Server Hello message, which likewise carries the SSL version and the chosen cipher suite.
3. The server then sends a Certificate message, handing its public-key certificate to the client.
4. The server sends Server Hello Done to tell the client that this part of the handshake is finished.
5. After this first phase, the client responds with a Client Key Exchange message containing a random string called the Pre-master secret, encrypted with the server's public key; this string is important, since it becomes the shared key for the rest of the communication.
6. The client then sends a Change Cipher Spec message, telling the server that everything after this point will be encrypted with keys derived from the Pre-master secret.
7. The client sends a Finished message containing a checksum over all the messages exchanged so far; whether the server can decrypt it correctly decides whether the handshake negotiation succeeded.
8. The server likewise sends a Change Cipher Spec message.
9. The server likewise sends a Finished message.
10. Once the server and client have exchanged their Finished messages, the SSL connection is established and the traffic is protected by SSL; application-layer communication follows, i.e. the HTTP request is sent.
11. Application-layer communication continues: the HTTP response is sent.
12. Finally the client closes the connection by sending a close_notify alert, after which the TCP connection is torn down with the four-way handshake.
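Mapped onto the kubectl trace above: the TCP dial succeeds, but the exchange never gets past step 1 — the Client Hello goes out and no Server Hello (step 2) ever comes back, so the client gives up with a TLS handshake timeout. A packet capture on the faulty node makes this visible on the wire; the interface name comes from the ip a output above, and this is a suggested check rather than part of the original session.

# With the fault present you would expect to see the TCP handshake and the
# outgoing ClientHello towards the VIP, but no ServerHello / Certificate reply.
tcpdump -nn -i bond0.169 host ***.24.69.222 and tcp port 6443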
So it is now clear that when the client sends its Client Hello to start the SSL exchange, it never receives the Server Hello from the server. The odd part is that the server-side IP .24.69.222 can still be pinged. To clarify: .24.69.222 is a virtual IP carried by one of the Master nodes, and the other, healthy nodes complete HTTPS communication through that VIP without any problem. At this point the problem is already fairly suggestive (especially for those who, like me, deployed the HA cluster with keepalived + lvs as the load balancer).
[root@***-24-69-3 bin]# ping ***.24.69.222
PING ***.24.69.222 (***.24.69.222) 56(84) bytes of data.
64 bytes from ***.24.69.222: icmp_seq=1 ttl=64 time=0.140 ms
64 bytes from ***.24.69.222: icmp_seq=2 ttl=64 time=0.151 ms
64 bytes from ***.24.69.222: icmp_seq=3 ttl=64 time=0.084 ms
^C
--- ***.24.69.222 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.084/0.125/0.151/0.029 ms
Here we need some background on IP addresses versus hardware addresses, and on ARP:
The physical (hardware) address is used by the data link layer and the physical layer, while the IP address is used by the network layer and the layers above it; it is a logical address (logical because it is implemented in software).
When data is sent, it travels down from the upper layers to the lower layers before going onto the link. Once an IP datagram carrying IP addresses is handed to the data link layer, it is encapsulated into a MAC frame, and the source and destination addresses used while the frame is in transit are both hardware addresses, written in the MAC frame header.
This is where ARP, the Address Resolution Protocol, comes in.
When we know the IP address of a machine (a host or a router), we still need to find its corresponding hardware address; ARP exists to solve exactly this problem.
ARP keeps a mapping table from IP addresses to hardware addresses in the host's ARP cache, and this table is updated dynamically (entries are added, refreshed, and expire over time).
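(For reference, the same cache can also be read with the iproute2 tooling; arp -e is used below, and ip neigh shows the identical information.)

# List the neighbour (ARP) cache, or only the entry for the VIP.
ip neigh show
ip neigh show to ***.24.69.222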
Now let's look at the MAC address stored in the ARP cache and see whether something is wrong.
# Check the ARP cache on the faulty node
[root@***-24-69-3 bin]# arp -e
Address HWtype HWaddress Flags Mask Iface
***.24.69.222 ether b4:05:5d:7d:89:3a C bond0.169
......
# Check the corresponding NIC on the node that holds .222
# And here it is: the two MAC addresses do not match
[root@***-24-69-2 ~]# ip a
......
5: bond0.169@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 38:68:dd:4f:e7:e8 brd ff:ff:ff:ff:ff:ff
inet ***.24.69.2/24 brd ***.24.69.255 scope global bond0.169
valid_lft forever preferred_lft forever
inet ***.24.69.222/24 brd ***.24.69.255 scope global secondary bond0.169:1
valid_lft forever preferred_lft forever
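If you want to pin down which physical host the stale MAC b4:05:5d:7d:89:3a actually belongs to, a quick sweep of the nodes' interfaces will tell you; the loop below is only a sketch (the host list and passwordless SSH are assumptions).

# Find which node owns the MAC the faulty node still had cached.
for h in ***-24-69-2 ***-24-69-4 ***-24-69-5; do
    echo "== $h =="
    ssh "$h" "ip -o link | grep -i b4:05:5d:7d:89:3a"
done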
Problem located: the faulty node's ARP cache still maps the VIP to a stale MAC address, so its frames were being delivered to the wrong physical host. The fix is simple: delete the stale entry and let ARP re-resolve.
[root@***-24-69-3 ~]# arp -d ***.24.69.222
[root@***-24-69-3 ~]# ping ***.24.69.222
[root@***-24-69-3 ~]# arp -e
Address HWtype HWaddress Flags Mask Iface
***.24.69.222 ether 38:68:dd:4f:e7:e8 C bond0.169
[root@***-24-69-3 ~]# kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.6", GitCommit:"ad3338546da947756e8a88aa6822e9c11e7eac22", GitTreeState:"clean", BuildDate:"2022-04-14T08:49:13Z", GoVersion:"go1.17.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.6", GitCommit:"ad3338546da947756e8a88aa6822e9c11e7eac22", GitTreeState:"clean", BuildDate:"2022-04-14T08:43:11Z", GoVersion:"go1.17.9", Compiler:"gc", Platform:"linux/amd64"}
OK, fixed; the Server Version responds normally again.
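The root cause was a stale ARP entry left behind after the VIP moved between masters. If keepalived manages the VIP, it is worth making sure it announces the address with gratuitous ARP on failover (the vrrp_instance options such as garp_master_delay / garp_master_repeat control this). The commands below are only a sketch of a manual workaround, assuming the iputils version of arping.

# On the master that currently holds the VIP, broadcast a gratuitous ARP so
# that neighbours refresh their caches right after a failover.
arping -U -I bond0.169 -c 3 ***.24.69.222
# On a node with a suspect cache, drop the entry and let it re-resolve,
# which is exactly what the arp -d / ping sequence above did.
ip neigh flush to ***.24.69.222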
References:
《計算機網絡(第6版)》, 謝希仁
《圖解HTTP》, 上野宣