Failure Overview
We received a report that kubelet's port 10250 could not be reached. Investigation showed that some Nodes were NotReady and a large number of Pods were Terminating. At that point the whole K8S cluster was effectively out of control and the business could not run normally: old Pods kept being deleted, new Pods could hardly be scheduled, and some Pods could not be deleted at all. Part of the stress-test workload stopped working.
Failure Analysis
為什么出現(xiàn)很多NotReady细疚?
Checking the Node status, several Nodes were already NotReady:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
op-k8s1-pm Ready <none> 66d v1.17.3
op-k8s10-pm Ready <none> 46h v1.17.3
op-k8s11-pm Ready <none> 46h v1.17.3
op-k8s2-pm NotReady <none> 66d v1.17.3
op-k8s3-pm NotReady <none> 66d v1.17.3
op-k8s4-pm NotReady <none> 66d v1.17.3
op-k8s5-pm NotReady <none> 66d v1.17.3
op-k8s6-pm NotReady <none> 66d v1.17.3
...
op-k8smaster3-pm Ready master 69d v1.17.3
Below is the resource-usage investigation on op-k8s2-pm:
free -g
total used free shared buff/cache available
Mem: 250 242 1 0 7 3
Swap: 0 0 0
uptime
18:10:11 up 70 days, 8 min, 2 users, load average: 733.31, 616.92, 625.68
ps aux|grep java |grep -v tini |wc -l
91
top
top - 18:11:24 up 70 days, 10 min, 2 users, load average: 579.80, 607.49, 622.64
Tasks: 1069 total, 3 running, 688 sleeping, 0 stopped, 0 zombie
%Cpu(s): 9.4 us, 7.0 sy, 0.0 ni, 0.2 id, 81.5 wa, 0.4 hi, 1.5 si, 0.0 st
KiB Mem : 26275092+total, 984572 free, 25881926+used, 2947100 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 370160 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
973481 nfsnobo+ 20 0 126808 28788 0 S 403.5 0.0 709:41.85 node_exporter
957773 1004 20 0 16.6g 3.0g 0 S 100.0 1.2 3:02.27 java
277 root 20 0 0 0 0 R 99.1 0.0 52:00.89 kswapd0
278 root 20 0 0 0 0 R 99.1 0.0 100:21.86 kswapd1
895608 root 20 0 1706728 3600 2336 D 7.1 0.0 3:04.87 journalctl
874 root 20 0 4570848 127852 0 S 4.4 0.0 8:16.75 kubelet
11115 maintain 20 0 165172 1760 0 R 2.7 0.0 0:00.21 top
965470 1004 20 0 17.3g 2.8g 0 S 2.7 1.1 1:59.22 java
9838 root 20 0 0 0 0 I 1.8 0.0 0:04.95 kworker/u98:0-f
952613 1004 20 0 19.7g 2.8g 0 S 1.8 1.1 1:51.01 java
954967 1004 20 0 13.6g 3.0g 0 S 1.8 1.2 3:00.73 java
The top output shows the node is starved of memory: kswapd0/kswapd1 are pegged near 100% CPU and 81.5% of CPU time is spent in iowait, which explains the load average of over 700. By this point op-k8s2-pm already had a large number of Terminating Pods:
kubectl get pod -owide --all-namespaces |grep op-k8s2-pm
test tutor-episode--venv-stress-fudao-6c68ff7f89-w49tv 1/1 Terminating 0 23h 10.1.4.56 op-k8s2-pm <none> <none>
test tutor-es-lesson--venv-stress-fudao-69f67c4dc4-r56m4 1/1 Terminating 0 23h 10.1.4.93 op-k8s2-pm <none> <none>
test tutor-faculty--venv-stress-fudao-7f44fbdcd5-dzcxq 1/1 Terminating 0 23h 10.1.4.45 op-k8s2-pm <none> <none>
...
test tutor-oauth--venv-stress-fudao-5989489c9d-jtzgg 1/1 Terminating 0 23h 10.1.4.78 op-k8s2-pm <none> <none>
The kubelet log was full of "use of closed network connection" errors. Since K8S uses long-lived HTTP/2 connections by default, this error means kubelet's connection to the Apiserver was broken; netstat on the Apiserver also showed no trace of the connection from port 45688.
Nov 05 21:58:35 op-k8s2-pm kubelet[105611]: E1105 21:58:35.276562 105611 reflector.go:153] k8s.io/client-go/informers/factory.go:135: Failed to list *v1beta1.RuntimeClass: Get https://10.2.2.2:443/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0: write tcp 10.2.2.7:45688->10.2.2.2:443: use of closed network connection
Nov 05 21:58:35 op-k8s2-pm kubelet[105611]: I1105 21:58:35.346898 105611 config.go:100] Looking for [api file], have seen map[]
Nov 05 21:58:35 op-k8s2-pm kubelet[105611]: I1105 21:58:35.446871 105611 config.go:100] Looking for [api file], have seen map[]
At this point it was fairly clear that resource exhaustion on soho-op-k8s2-pm prevented kubelet from talking to the Apiserver, which ultimately made the node NotReady. The other affected Nodes showed essentially the same picture.
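When kubelet stops posting status, the node controller marks the node's Ready condition as Unknown, which kubectl renders as NotReady. As a rough sketch (values are illustrative, not captured from this incident), the condition in kubectl get node op-k8s2-pm -o yaml looks something like:
status:
  conditions:
  - type: Ready
    status: "Unknown"    # displayed as NotReady by kubectl get nodes
    reason: NodeStatusUnknown
    message: Kubelet stopped posting node status.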
為什么突然出現(xiàn)這么大量的資源爭搶秧倾?
tutor-lesson-activity--venv-stress-fudao 16/16 16 16 12h28m
tutor-lesson-renew--venv-stress-fudao 9/10 9 32 12h45m
tutor-live-data-check 9/32 9 32 12h10m
tutor-oauth--venv-stress-fudao 16/32 16 32 12h19m
tutor-pepl--venv-stress-fudao 10/32 10 32 12h
tutor-profile--venv-stress-fudao 7/32 7 32 12h
tutor-recommend--venv-stress-fudao 32/32 32 32 12h
...
tutor-student-lesson--venv-stress-fudao 24/24 24 24 12h
apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: "1"
yfd_deploy_version: 875f2832-1e81-11eb-ba04-00163e0a5041
labels:
project: tutor-recommend
name: tutor-recommend--venv-stress-fudao
namespace: test
spec:
replicas: 32
selector:
matchLabels:
project: tutor-recommend
template:
metadata:
labels:
        project: tutor-recommend
spec:
containers:
resources:
limits:
cpu: "4"
memory: 8G
requests:
cpu: 500m
memory: 512M
After asking around, we learned that the stress-test team wanted to use the K8S test environment to simulate production for load testing, but the test cluster has far fewer nodes than production.
Conclusion: many Deployments had their replica counts raised sharply, and the resources were heavily oversold: each Pod requests only 500m CPU / 512M memory yet is allowed limits of 4 CPU / 8G, i.e. 8x the CPU and roughly 16x the memory the scheduler accounts for. The cluster therefore ran far more workload than it could actually serve, the resulting resource contention kept crushing kubelet, and the Nodes went NotReady.
為什么出現(xiàn)很多Terminating?
Eviction had not been triggered (an evicted Pod shows the status Evicted), yet these Pods were Terminating. During the investigation we also saw new Pods being created continuously while old Pods kept moving to Terminating.
為什么沒有觸發(fā)Evict呢苦掘?經(jīng)過排查發(fā)現(xiàn)磁盤沒有壓力换帜,內(nèi)存的必須要小于100M才會觸發(fā)Evict。以下的kubelet的配置就可能看出沒有觸發(fā)Evict
evictionHard:
imagefs.available: 1%
memory.available: 100Mi
nodefs.inodesFree: 1%
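These thresholds are part of the kubelet's KubeletConfiguration. As a minimal sketch of where the fragment above lives (assuming the common config path /var/lib/kubelet/config.yaml and omitting unrelated fields):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  imagefs.available: 1%
  memory.available: 100Mi    # hard eviction fires only below 100Mi of available memory
  nodefs.inodesFree: 1%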
Note 1: eviction is hard to control. Once disk or memory pressure shows up it is better to handle it manually; automatic handling rarely fixes the underlying problem, and it easily flaps: evicting one Pod relieves the node for a moment, the pressure comes back shortly afterwards, and eviction is triggered over and over.
Note 2: we plan to disable eviction completely in the future.
Checking the kube-controller-manager log revealed the following:
endpoints_controller.go:590] Pod is out of service: test/tutor-episode--venv-stress-fudao-6c68ff7f89-h8wwz
taint_manager.go:105] NoExecuteTaintManager is deleting Pod: test/tutor-episode--venv-stress-fudao-6c68ff7f89-h8wwz
request.go:565] Throttling request took 1.148923424s, request: DELETE:https://10.2.2.2:443/api/v1/namespaces/test/pods/tutor-episode--venv-stress-fudao-6c68ff7f89-h8wwz
disruption.go:457] No PodDisruptionBudgets found for pod tutor-episode--venv-stress-fudao-6c68ff7f89-h8wwz, PodDisruptionBudget controller will avoid syncing
endpoints_controller.go:420] Pod is being deleted test/tutor-episode--venv-stress-fudao-6c68ff7f89-h8wwz
controller_utils.go:911] Ignoring inactive pod test/tutor-episode--venv-stress-fudao-6c68ff7f89-h8wwz in state Running, deletion time 2020-11-05 10:06:27 +0000 UTC
So the Pods were being deleted by the NoExecuteTaintManager controller. Tracing through the code:
kubernetes/pkg/controller/nodelifecycle/scheduler/taint_manager.go
handleNodeUpdate
func (tc *NoExecuteTaintManager) handleNodeUpdate(nodeUpdate nodeUpdateItem) {
node, err := tc.getNode(nodeUpdate.nodeName)
if err != nil {
if apierrors.IsNotFound(err) {
// Delete
klog.V(4).Infof("Noticed node deletion: %#v", nodeUpdate.nodeName)
tc.taintedNodesLock.Lock()
defer tc.taintedNodesLock.Unlock()
delete(tc.taintedNodes, nodeUpdate.nodeName)
return
}
utilruntime.HandleError(fmt.Errorf("cannot get node %s: %v", nodeUpdate.nodeName, err))
return
}
// Create or Update
klog.V(4).Infof("Noticed node update: %#v", nodeUpdate)
taints := getNoExecuteTaints(node.Spec.Taints)
func() {
tc.taintedNodesLock.Lock()
defer tc.taintedNodesLock.Unlock()
klog.V(4).Infof("Updating known taints on node %v: %v", node.Name, taints)
if len(taints) == 0 {
delete(tc.taintedNodes, node.Name)
} else {
tc.taintedNodes[node.Name] = taints
}
}()
// This is critical that we update tc.taintedNodes before we call getPodsAssignedToNode:
// getPodsAssignedToNode can be delayed as long as all future updates to pods will call
// tc.PodUpdated which will use tc.taintedNodes to potentially delete delayed pods.
pods, err := tc.getPodsAssignedToNode(node.Name)
if err != nil {
klog.Errorf(err.Error())
return
}
if len(pods) == 0 {
return
}
// Short circuit, to make this controller a bit faster.
if len(taints) == 0 {
klog.V(4).Infof("All taints were removed from the Node %v. Cancelling all evictions...", node.Name)
for i := range pods {
tc.cancelWorkWithEvent(types.NamespacedName{Namespace: pods[i].Namespace, Name: pods[i].Name})
}
return
}
now := time.Now()
for _, pod := range pods {
podNamespacedName := types.NamespacedName{Namespace: pod.Namespace, Name: pod.Name}
tc.processPodOnNode(podNamespacedName, node.Name, pod.Spec.Tolerations, taints, now)
}
}
processPodOnNode
func (tc *NoExecuteTaintManager) processPodOnNode(
podNamespacedName types.NamespacedName,
nodeName string,
tolerations []v1.Toleration,
taints []v1.Taint,
now time.Time,
) {
if len(taints) == 0 {
tc.cancelWorkWithEvent(podNamespacedName)
}
allTolerated, usedTolerations := v1helper.GetMatchingTolerations(taints, tolerations)
if !allTolerated {
klog.V(2).Infof("Not all taints are tolerated after update for Pod %v on %v", podNamespacedName.String(), nodeName)
// We're canceling scheduled work (if any), as we're going to delete the Pod right away.
tc.cancelWorkWithEvent(podNamespacedName)
tc.taintEvictionQueue.AddWork(NewWorkArgs(podNamespacedName.Name, podNamespacedName.Namespace), time.Now(), time.Now())
return
}
minTolerationTime := getMinTolerationTime(usedTolerations)
// getMinTolerationTime returns negative value to denote infinite toleration.
if minTolerationTime < 0 {
klog.V(4).Infof("New tolerations for %v tolerate forever. Scheduled deletion won't be cancelled if already scheduled.", podNamespacedName.String())
return
}
startTime := now
triggerTime := startTime.Add(minTolerationTime)
scheduledEviction := tc.taintEvictionQueue.GetWorkerUnsafe(podNamespacedName.String())
if scheduledEviction != nil {
startTime = scheduledEviction.CreatedAt
if startTime.Add(minTolerationTime).Before(triggerTime) {
return
}
tc.cancelWorkWithEvent(podNamespacedName)
}
tc.taintEvictionQueue.AddWork(NewWorkArgs(podNamespacedName.Name, podNamespacedName.Namespace), startTime, triggerTime)
}
The NodeLifecycleController watches all Nodes. Once a Node carries a NoExecute taint, it checks every Pod on that Node for a matching toleration: Pods that do not tolerate the taint are deleted immediately, while Pods that do tolerate it are scheduled for deletion after their tolerationSeconds expires (a negative value, i.e. no tolerationSeconds set, means tolerate forever and the Pod is never deleted).
The kubernetes documentation describes it as follows (see https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/):
NoExecute
Normally, if a taint with effect NoExecute is added to a node, then any pods that do not tolerate the taint will be evicted immediately, and pods that do tolerate the taint will never be evicted. However, a toleration with NoExecute effect can specify an optional tolerationSeconds field that dictates how long the pod will stay bound to the node after the taint is added.
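For reference, this is roughly the taint the node lifecycle controller places on a Node that stops being Ready (a sketch of the Node spec fragment; the timestamp is illustrative, and an unreachable Node gets node.kubernetes.io/unreachable instead):
spec:
  taints:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    timeAdded: "2020-11-05T10:01:27Z"   # illustrative timestamp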
kubectl get pods -n test tutor-recommend--venv-stress-fudao-5d89fc7dd5-vcp8h -ojson |jq '.spec.tolerations'
[
{
"effect": "NoExecute",
"key": "node.kubernetes.io/not-ready",
"operator": "Exists",
"tolerationSeconds": 300
},
{
"effect": "NoExecute",
"key": "node.kubernetes.io/unreachable",
"operator": "Exists",
"tolerationSeconds": 300
}
]
Sure enough, the Pod carries not-ready and unreachable tolerations with tolerationSeconds: 300, meaning it is deleted 300 seconds (5 minutes) after the Node is tainted. But who sets these tolerations, and when?
The code shows they are injected by the DefaultTolerationSeconds admission controller when the Pod is created. From v1.13 onward this behavior is enabled by default and replaces the --pod-eviction-timeout flag as the mechanism that implements this kind of eviction.
kubernetes/plugin/pkg/admission/defaulttolerationseconds/admission.go
Admit
// Admit makes an admission decision based on the request attributes
func (p *Plugin) Admit(ctx context.Context, attributes admission.Attributes, o admission.ObjectInterfaces) (err error) {
if attributes.GetResource().GroupResource() != api.Resource("pods") {
return nil
}
if len(attributes.GetSubresource()) > 0 {
// only run the checks below on pods proper and not subresources
return nil
}
pod, ok := attributes.GetObject().(*api.Pod)
if !ok {
return errors.NewBadRequest(fmt.Sprintf("expected *api.Pod but got %T", attributes.GetObject()))
}
tolerations := pod.Spec.Tolerations
toleratesNodeNotReady := false
toleratesNodeUnreachable := false
for _, toleration := range tolerations {
if (toleration.Key == v1.TaintNodeNotReady || len(toleration.Key) == 0) &&
(toleration.Effect == api.TaintEffectNoExecute || len(toleration.Effect) == 0) {
toleratesNodeNotReady = true
}
if (toleration.Key == v1.TaintNodeUnreachable || len(toleration.Key) == 0) &&
(toleration.Effect == api.TaintEffectNoExecute || len(toleration.Effect) == 0) {
toleratesNodeUnreachable = true
}
}
if !toleratesNodeNotReady {
pod.Spec.Tolerations = append(pod.Spec.Tolerations, notReadyToleration)
}
if !toleratesNodeUnreachable {
pod.Spec.Tolerations = append(pod.Spec.Tolerations, unreachableToleration)
}
return nil
}
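The 300-second values behind notReadyToleration and unreachableToleration come from the kube-apiserver flags --default-not-ready-toleration-seconds and --default-unreachable-toleration-seconds (both default to 300). A sketch of where they could be tuned, assuming a kubeadm-style static Pod manifest at /etc/kubernetes/manifests/kube-apiserver.yaml and using 600 purely as an example value:
# fragment of the kube-apiserver static Pod manifest (illustrative)
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    - --default-not-ready-toleration-seconds=600     # default 300
    - --default-unreachable-toleration-seconds=600   # default 300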
Conclusion
The large number of Terminating Pods was caused by the NoExecute taint-based eviction that kicks in after a Node goes NotReady. This behavior is fine to keep for Deployments, but for StatefulSets we recommend opting out of it; a sketch of how a pod template can do that follows.
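As the Admit code above shows, the default tolerations are only appended when the Pod does not already tolerate not-ready / unreachable, so a workload can opt out by declaring its own. A sketch of a StatefulSet pod-template fragment, assuming the goal is to tolerate a NotReady Node indefinitely (omitting tolerationSeconds means tolerate forever, so the taint manager never schedules a deletion):
# pod template spec fragment (sketch)
tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists                 # no tolerationSeconds: tolerate forever
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists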