[Solved] k8s CronJob.spec.failedJobsHistoryLimit does not seem to take effect

Tags

kubernetes, CronJob, pod

Background

As the YAML below shows, .spec.failedJobsHistoryLimit is explicitly set to 1, yet 7 Pods in the Error state were still left behind:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: mycronjob
  namespace: prod
  labels:
    task: processor
spec:
  failedJobsHistoryLimit: 1
  successfulJobsHistoryLimit: 3
……

kubectl get pod -n prod -l task=processor
NAME                      READY   STATUS   RESTARTS   AGE
mycronjob-16043364027mpp   0/1     Error    0          9h
mycronjob-16043364098q8q   0/1     Error    0          9h
mycronjob-160433640hc2ch   0/1     Error    0          9h
mycronjob-160433640nrdqb   0/1     Error    0          9h
mycronjob-160433640r49cq   0/1     Error    0          8h
mycronjob-160433640tnfvw   0/1     Error    0          9h
mycronjob-160433640vhdsc   0/1     Error    0          9h

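So the question is: why does CronJob.spec.successfulJobsHistoryLimit take effect while CronJob.spec.failedJobsHistoryLimit apparently does not?

Analysis

Before digging into this, we first need to be clear about what a CronJob actually does.
Official description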

A CronJob creates Jobs on a repeating schedule.

One CronJob object is like one line of a crontab (cron table) file. It runs a job periodically on a given schedule, written in Cron format.

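From this definition it is easy to see that a CronJob manages Jobs, and that the Job is what actually creates Pods. So to find out why CronJob.spec.failedJobsHistoryLimit seems to have no effect, we need to look at the configuration of the Jobs that the CronJob creates on schedule.
Run the command: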

kubectl get job -n prod -l task=processor -o yaml

Output:

apiVersion: v1
items:
- apiVersion: batch/v1
  kind: Job
  metadata:
    labels:
      task: processor
    name: processor-1604336400
    namespace: prod
    ownerReferences:
    - apiVersion: batch/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: CronJob
      name: processor
  spec:
    backoffLimit: 6
    completions: 1
    parallelism: 1
  status:
    conditions:
    - message: Job has reached the specified backoff limit
      reason: BackoffLimitExceeded
      type: Failed

Note the spec.backoffLimit setting. The official explanation is:

There are situations where you want to fail a Job after some amount of retries due to a logical error in configuration etc. To do so, set .spec.backoffLimit to specify the number of retries before considering a Job as failed. The back-off limit is set by default to 6. Failed Pods associated with the Job are recreated by the Job controller with an exponential back-off delay (10s, 20s, 40s ...) capped at six minutes. The back-off count is reset when a Job's Pod is deleted or successful without any other Pods for the Job failing around that time.

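In other words, if a Pod created by a Job fails, the Job by default retries up to 6 times, creating a new Pod each time. One initial Pod plus 6 retries gives 7 failed Pods for a single Job, which matches the 7 Error Pods listed earlier. If we do not want that many retries, we can change .spec.backoffLimit.
With that, the cause of the problem should be clear.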

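Summary

The CronJob creates Jobs and, according to our configuration, keeps at most 1 failed Job and 3 successful Jobs in its history. However, the point at which a Job counts as failed is determined by Job.spec.backoffLimit. CronJob.spec.failedJobsHistoryLimit therefore only limits the number of Jobs, which can be checked with the command kubectl get job -n prod -l task=processor. To limit the final number of failed Pods, you also have to control Job.spec.backoffLimit (in a CronJob manifest, this is .spec.jobTemplate.spec.backoffLimit).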
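
To make the relationship between the two settings concrete, here is a minimal sketch of a CronJob manifest that sets both the history limits and the retry limit. It is not the author's actual manifest: the schedule, the backoffLimit value, the restartPolicy, the container name, and the image are all assumptions chosen for illustration.

```yaml
# Sketch only: fields marked "assumption" are illustrative and not from the original manifest.
apiVersion: batch/v1beta1          # same API version as the manifest above; newer clusters use batch/v1
kind: CronJob
metadata:
  name: mycronjob
  namespace: prod
spec:
  schedule: "0 * * * *"            # assumption: hypothetical schedule
  failedJobsHistoryLimit: 1        # keep at most 1 failed Job in history
  successfulJobsHistoryLimit: 3    # keep at most 3 successful Jobs in history
  jobTemplate:
    spec:
      backoffLimit: 2              # assumption: allow 2 retries, so roughly 3 failed Pods per failed Job
      template:
        spec:
          restartPolicy: Never     # assumption: a failed Pod is replaced by a new Pod, not restarted in place
          containers:
          - name: processor        # assumption: hypothetical container name
            image: example.com/processor:latest   # assumption: hypothetical image
```

With a configuration along these lines, the number of Error Pods left behind should be bounded by roughly the number of retained failed Jobs multiplied by (backoffLimit + 1), rather than by failedJobsHistoryLimit alone.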

References

Running Automated Tasks with a CronJob
Jobs
Pod Lifecycle

Food for thought

If CronJob.spec.failedJobsHistoryLimit is set to 2 and Job.spec.backoffLimit is set to 5, what is the maximum number of Pods in the Error state that will be retained?
