Preface
Recently our company's k8s cluster ran into a problem: every kubectl command failed with the error below. This post records how I traced the problem and how I fixed it; I hope it helps:
The connection to the server 192.168.100.170:6443 was refused - did you specify the right host or port?
Tracing the Problem
Many readers have probably hit this error before. 6443 is the default port of the k8s APIServer; a refused connection means the APIServer itself is down or a firewall is blocking the port. First, check whether anything is still listening on that port:
netstat -pnlt | grep 6443
The command returns nothing, so the APIServer is not serving at all. Let's look at its logs. In a cluster built with kubeadm, the APIServer runs inside Docker, so first find the corresponding container. Remember the -a flag, because the container is probably no longer in a running state:
docker ps -a | grep apiserver
# output
f40d97ee2be6 40a63db91ef8 "kube-apiserver --au…" 2 minutes ago Exited (255) 2 minutes ago k8s_kube-apiserver_kube-apiserver-master1_kube-system_7beef975d93d634ecee05282d3d3a9ac_718
4b866fe71e33 registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.1 "/pause" 2 days ago Up 2 days k8s_POD_kube-apiserver-master1_kube-system_7beef975d93d634ecee05282d3d3a9ac_0
Two containers show up, and the kube-apiserver container is already in the Exited state. Note the pause container below it: it only bootstraps the APIServer pod and is not the container actually running the service, so it has no useful logs; make sure you don't pass the wrong container id. Now check the APIServer's logs:
docker logs -f f40d97ee2be6
# output
I1230 01:39:42.942786 1 server.go:557] external host was not specified, using 192.168.100.171
I1230 01:39:42.942924 1 server.go:146] Version: v1.13.1
I1230 01:39:43.325424 1 plugins.go:158] Loaded 8 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,Priority,DefaultTolerationSeconds,DefaultStorageClass,MutatingAdmissionWebhook.
I1230 01:39:43.325451 1 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
I1230 01:39:43.326327 1 plugins.go:158] Loaded 8 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,Priority,DefaultTolerationSeconds,DefaultStorageClass,MutatingAdmissionWebhook.
I1230 01:39:43.326340 1 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
F1230 01:40:03.328865 1 storage_decorator.go:57] Unable to create storage backend: config (&{ /registry [https://127.0.0.1:2379] /etc/kubernetes/pki/apiserver-etcd-client.key /etc/kubernetes/pki/apiserver-etcd-client.crt /etc/kubernetes/pki/etcd/ca.crt true 0xc0004bd440 <nil> 5m0s 1m0s}), err (dial tcp 127.0.0.1:2379: connect: connection refused)
The last line shows that the APIServer failed to create its storage backend and therefore could not start. Since k8s uses etcd as its storage, let's check the etcd logs next.
Note that my etcd also runs in Docker. If you run etcd as a systemd service instead, check it with systemctl status etcd (or journalctl -u etcd for the full log). Here is the Docker way:
# find the etcd container; note that etcd also has a corresponding pause container
docker ps -a | grep etcd
# output
1b8b522ee4e8 3cab8e1b9802 "etcd --advertise-cl…" 7 minutes ago Exited (2) 6 minutes ago k8s_etcd_etcd-master1_kube-system_1051dec0649f2b816946cb1fea184325_942
c9440543462e registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.1 "/pause" 2 days ago Up 2 days k8s_POD_etcd-master1_kube-system_1051dec0649f2b816946cb1fea184325_0
# view the etcd logs
docker logs -f 1b8b522ee4e8
# output
2019-12-30 01:43:44.075758 I | raft: 92b79bbe6bd2706a is starting a new election at term 165711
2019-12-30 01:43:44.075806 I | raft: 92b79bbe6bd2706a became candidate at term 165712
2019-12-30 01:43:44.075819 I | raft: 92b79bbe6bd2706a received MsgVoteResp from 92b79bbe6bd2706a at term 165712
2019-12-30 01:43:44.075832 I | raft: 92b79bbe6bd2706a [logterm: 82723, index: 84358879] sent MsgVote request to a25634eca298ea33 at term 165712
2019-12-30 01:43:44.075844 I | raft: 92b79bbe6bd2706a [logterm: 82723, index: 84358879] sent MsgVote request to 645060e8e879847c at term 165712
2019-12-30 01:43:45.075783 I | raft: 92b79bbe6bd2706a is starting a new election at term 165712
2019-12-30 01:43:45.075818 I | raft: 92b79bbe6bd2706a became candidate at term 165713
2019-12-30 01:43:45.075830 I | raft: 92b79bbe6bd2706a received MsgVoteResp from 92b79bbe6bd2706a at term 165713
2019-12-30 01:43:45.075840 I | raft: 92b79bbe6bd2706a [logterm: 82723, index: 84358879] sent MsgVote request to 645060e8e879847c at term 165713
2019-12-30 01:43:45.075849 I | raft: 92b79bbe6bd2706a [logterm: 82723, index: 84358879] sent MsgVote request to a25634eca298ea33 at term 165713
2019-12-30 01:43:45.928418 E | etcdserver: publish error: etcdserver: request timed out
2019-12-30 01:43:46.363974 I | etcdmain: rejected connection from "192.168.100.181:35914" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2019-12-30 01:43:46.364006 I | etcdmain: rejected connection from "192.168.100.181:35912" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2019-12-30 01:43:46.477058 I | etcdmain: rejected connection from "192.168.100.181:35946" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2019-12-30 01:43:46.483326 I | etcdmain: rejected connection from "192.168.100.181:35944" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2019-12-30 01:43:46.575790 I | raft: 92b79bbe6bd2706a is starting a new election at term 165713
2019-12-30 01:43:46.575818 I | raft: 92b79bbe6bd2706a became candidate at term 165714
2019-12-30 01:43:46.575829 I | raft: 92b79bbe6bd2706a received MsgVoteResp from 92b79bbe6bd2706a at term 165714
2019-12-30 01:43:46.575839 I | raft: 92b79bbe6bd2706a [logterm: 82723, index: 84358879] sent MsgVote request to 645060e8e879847c at term 165714
2019-12-30 01:43:46.575848 I | raft: 92b79bbe6bd2706a [logterm: 82723, index: 84358879] sent MsgVote request to a25634eca298ea33 at term 165714
2019-12-30 01:43:46.595828 I | etcdmain: rejected connection from "192.168.100.181:35962" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2019-12-30 01:43:46.597536 I | etcdmain: rejected connection from "192.168.100.181:35964" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2019-12-30 01:43:46.709028 I | etcdmain: rejected connection from "192.168.100.181:35970" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2019-12-30 01:43:46.714243 I | etcdmain: rejected connection from "192.168.100.181:35972" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2019-12-30 01:43:46.928411 W | rafthttp: health check for peer a25634eca298ea33 could not connect: dial tcp 192.168.100.191:2380: getsockopt: connection refused
...
You can see that etcd kept looping through the errors above until it timed out and exited. The key error buried in there is error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid". Anyone who maintains k8s clusters regularly will recognize this one: a certificate has expired again.
This cluster has three masters: 171, 181 and 191. The error messages show the certificate verification failures come from requests to 181, so let's log in to 181 to confirm:
# enter the k8s certificate directory
cd /etc/kubernetes/pki
# check the certificate's validity period
openssl x509 -in etcd/server.crt -noout -text | grep ' Not '
# output
Not Before: Dec 26 08:12:11 2018 GMT
Not After : Dec 26 08:12:11 2019 GMT
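Checking each file by hand gets tedious when a cluster has a dozen certificates. The helper below is my own sketch, not part of the original fix: it walks a pki directory and prints how many days each certificate has left. It assumes GNU date and openssl are available; the NOW override exists only so the date arithmetic can be tested with fixed dates:

```shell
#!/bin/sh
# days_left END_DATE — whole days between $NOW (defaults to the current time)
# and END_DATE, where END_DATE is the string openssl prints after "notAfter=".
# Assumes GNU date.
days_left() {
    end=$(date -d "$1" +%s)
    ref=$(date -d "${NOW:-now}" +%s)
    echo $(( (end - ref) / 86400 ))
}

# check_certs [DIR] — report expiry date and remaining days for every .crt in DIR.
check_certs() {
    for crt in $(find "${1:-/etc/kubernetes/pki}" -name '*.crt'); do
        expiry=$(openssl x509 -in "$crt" -noout -enddate | cut -d= -f2)
        echo "$crt: expires $expiry ($(days_left "$expiry") days left)"
    done
}
```

Run check_certs on each master; anything at or below zero days is your culprit.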
After checking, the k8s certificates themselves were fine, but all the etcd certificates had expired. (For an overview of the certificates k8s needs, see this article.) Now let's fix the problem:
Fixing the Problem
Note: because of k8s version differences, this part may not match your setup exactly. The versions I'm using are:
root@master1:~# kubelet --version
Kubernetes v1.13.1
root@master1:~# kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.1", GitCommit:"eec55b9ba98609a46fee712359c7b5b365bdd920", GitTreeState:"clean", BuildDate:"2018-12-13T10:36:44Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}
If your version differs a lot, search for a solution matching it; there are plenty out there. Run the commands below together with -h first to check their options. Warning: the following operations will stop services, so proceed with caution:
Back up the original files
cd /etc
cp -r kubernetes kubernetes.bak
Regenerate the certificates
Regenerating the certificates requires the configuration file used when the cluster was initialized. My kubeadm.yaml looks like this:
kind: ClusterConfiguration
apiVersion: kubeadm.k8s.io/v1beta1
controlPlaneEndpoint: "192.168.100.170:6443"
apiServer:
  certSANs:
  - master1
  - master2
  - master3
  - 192.168.100.170
  - 192.168.100.171
  - 192.168.100.181
  - 192.168.100.191
Here 192.168.100.170 is the VIP, and 171, 181 and 191 correspond to master1, master2 and master3 respectively. Now use the config file to re-sign the certificates; run this on every control-plane node:
kubeadm init phase certs all --config=kubeadm.yaml
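If, as in my case, only the etcd certificates have expired and you'd rather not touch the rest, kubeadm also exposes per-certificate phases. The following is a sketch under the assumption that your kubeadm (v1.13 here) supports these phase subcommands; move the expired files out of the way first, because kubeadm skips certificates that already exist:

```shell
cd /etc/kubernetes/pki
# kubeadm will not overwrite existing certs, so back up and remove the expired ones first
mkdir -p /root/etcd-certs.bak
mv etcd/server.* etcd/peer.* etcd/healthcheck-client.* apiserver-etcd-client.* /root/etcd-certs.bak/

# regenerate only the etcd-related certificates
kubeadm init phase certs etcd-server --config=kubeadm.yaml
kubeadm init phase certs etcd-peer --config=kubeadm.yaml
kubeadm init phase certs etcd-healthcheck-client --config=kubeadm.yaml
kubeadm init phase certs apiserver-etcd-client --config=kubeadm.yaml
```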
Regenerate the kubeconfig files
kubeadm init phase kubeconfig all --config kubeadm.yaml
This command also needs to run once on every control-plane node. The regenerated configuration files are:
- admin.conf
- controller-manager.conf
- kubelet.conf
- scheduler.conf
Restart k8s on the control-plane nodes
Restart the etcd, apiserver, controller-manager and scheduler containers. Normally kubectl will work again at this point; remember to run kubectl get nodes to check the node status.
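In a kubeadm cluster these components run as static pods managed by kubelet, so the restart can be done by restarting their containers directly. A sketch of what I mean, assuming Docker is the container runtime (run on each control-plane node):

```shell
# restart the control-plane containers; kubelet manages them as static pods
docker ps --format '{{.ID}} {{.Names}}' \
  | grep -E 'k8s_(etcd|kube-apiserver|kube-controller-manager|kube-scheduler)_' \
  | awk '{print $1}' | xargs -r docker restart

# then verify the APIServer is answering again
kubectl get nodes
```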
Regenerate the worker-node configuration
If the previous step still shows worker nodes as NotReady, their credentials need to be regenerated too. This happens when the root certificate was also replaced, which invalidates the worker nodes' certificates as well. Simply back up and remove the certificates below, then restart kubelet:
mv /var/lib/kubelet/pki /var/lib/kubelet/pki.bak
systemctl daemon-reload && systemctl restart kubelet
If that doesn't work, copy /etc/kubernetes/pki/ca.crt from a control-plane node to the same directory on the worker node and start kubelet again. After about three minutes the worker node's status on the control-plane node should change to Ready.
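The copy step can be scripted from a control-plane node. In this sketch, worker1 is a hypothetical hostname standing in for your worker node, and root SSH access is assumed:

```shell
# run on a control-plane node; "worker1" is a placeholder for your worker's hostname
scp /etc/kubernetes/pki/ca.crt root@worker1:/etc/kubernetes/pki/ca.crt
ssh root@worker1 'mv /var/lib/kubelet/pki /var/lib/kubelet/pki.bak && systemctl restart kubelet'
```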
Summary
k8s's one-year certificate lifetime is definitely a trap, even if the intent of pushing users onto newer versions is good. If your cluster is still healthy but you have never renewed its certificates, check their expiry dates now; once they expire, it's too late.