Tags
kubernetes, CronJob, Pod
Background
As the YAML below shows, .spec.failedJobsHistoryLimit is explicitly set to 1, yet 7 Pods in the Error state were still left behind:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: mycronjob
  namespace: prod
  labels:
    task: processor
spec:
  failedJobsHistoryLimit: 1
  successfulJobsHistoryLimit: 3
  ……
kubectl get pod -n prod -l task=processor
NAME                       READY   STATUS   RESTARTS   AGE
mycronjob-16043364027mpp   0/1     Error    0          9h
mycronjob-16043364098q8q   0/1     Error    0          9h
mycronjob-160433640hc2ch   0/1     Error    0          9h
mycronjob-160433640nrdqb   0/1     Error    0          9h
mycronjob-160433640r49cq   0/1     Error    0          8h
mycronjob-160433640tnfvw   0/1     Error    0          9h
mycronjob-160433640vhdsc   0/1     Error    0          9h
So the question is: why does CronJob.spec.successfulJobsHistoryLimit take effect, while CronJob.spec.failedJobsHistoryLimit appears not to?
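Analysis
Before we can understand this problem, we first need to be clear about what a CronJob actually does.
From the official documentation: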
A CronJob creates Jobs on a repeating schedule.
One CronJob object is like one line of a crontab (cron table) file. It runs a job periodically on a given schedule, written in Cron format.
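From this definition it is easy to see that a CronJob manages Jobs, and that the Jobs are what actually create the Pods. So to find out why CronJob.spec.failedJobsHistoryLimit seems to be ignored, we have to look at the configuration of the Jobs that the CronJob creates on schedule.
Run: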
kubectl get job -n prod -l task=processor -o yaml
This returns:
apiVersion: v1
items:
- apiVersion: batch/v1
  kind: Job
  metadata:
    labels:
      task: processor
    name: processor-1604336400
    namespace: prod
    ownerReferences:
    - apiVersion: batch/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: CronJob
      name: processor
  spec:
    backoffLimit: 6
    completions: 1
    parallelism: 1
  status:
    conditions:
    - message: Job has reached the specified backoff limit
      reason: BackoffLimitExceeded
      type: Failed
Pay attention to the spec.backoffLimit setting. The official documentation explains it as follows:
There are situations where you want to fail a Job after some amount of retries due to a logical error in configuration etc. To do so, set .spec.backoffLimit to specify the number of retries before considering a Job as failed. The back-off limit is set by default to 6. Failed Pods associated with the Job are recreated by the Job controller with an exponential back-off delay (10s, 20s, 40s ...) capped at six minutes. The back-off count is reset when a Job's Pod is deleted or successful without any other Pods for the Job failing around that time.
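In other words, if a Pod created by a Job fails, then by default the Job retries up to 6 times, creating a new Pod each time. Counting the first attempt, a single failed Job can therefore leave up to 7 failed Pods behind, which matches the 7 Error Pods listed above. If we do not want that many retries, we can change .spec.backoffLimit.
By now it should be clear where the problem lies.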
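The arithmetic behind the 7 Error Pods can be sketched in a few lines of Python. This is a rough model, not the Job controller's actual logic: it assumes the Pod template uses restartPolicy: Never, so that every retry creates a new Pod instead of restarting a container in place, and both function names are made up for illustration.

```python
def max_failed_pods_per_job(backoff_limit: int) -> int:
    """Rough upper bound on the failed Pods a single Job leaves behind.

    Assumption: restartPolicy is Never, so the Job makes one initial
    attempt plus `backoff_limit` retries, each in a new Pod.
    """
    if backoff_limit < 0:
        raise ValueError("backoff_limit must be >= 0")
    return backoff_limit + 1


def max_retained_failed_pods(failed_jobs_history_limit: int,
                             backoff_limit: int) -> int:
    """Rough upper bound on the failed Pods owned by the retained failed Jobs.

    The CronJob keeps at most `failed_jobs_history_limit` failed Jobs,
    and each of them may own up to `backoff_limit + 1` failed Pods.
    """
    if failed_jobs_history_limit < 0:
        raise ValueError("failed_jobs_history_limit must be >= 0")
    return failed_jobs_history_limit * max_failed_pods_per_job(backoff_limit)


if __name__ == "__main__":
    # Values from this article: failedJobsHistoryLimit=1, backoffLimit=6.
    print(max_failed_pods_per_job(6))
    print(max_retained_failed_pods(1, 6))
```

The real number can be lower than this bound, for example when a Job is still running or when its Pods are cleaned up by other means.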
Summary
The CronJob creates Jobs and, according to our configuration, keeps at most 1 failed Job and 3 successful Jobs in its history. When a Job counts as failed, however, is decided by Job.spec.backoffLimit. CronJob.spec.failedJobsHistoryLimit therefore only limits the number of Jobs, which you can check with kubectl get job -n prod -l task=processor. To limit the final number of failed Pods, you have to adjust Job.spec.backoffLimit.
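Inside a CronJob, the Job's backoffLimit lives under .spec.jobTemplate.spec.backoffLimit. As a sketch, the Python snippet below builds a JSON merge patch for that field and prints a matching kubectl patch command for the mycronjob CronJob from this article. The value 2 is an arbitrary example, not a recommendation, and the helper names build_backoff_patch and kubectl_patch_command are hypothetical.

```python
import json
import shlex


def build_backoff_patch(backoff_limit: int) -> dict:
    """Build a merge patch that sets the Job template's backoffLimit."""
    if backoff_limit < 0:
        raise ValueError("backoff_limit must be >= 0")
    return {"spec": {"jobTemplate": {"spec": {"backoffLimit": backoff_limit}}}}


def kubectl_patch_command(name: str, namespace: str, backoff_limit: int) -> str:
    """Render a shell-quoted `kubectl patch` command for the given CronJob."""
    patch = json.dumps(build_backoff_patch(backoff_limit), separators=(",", ":"))
    args = [
        "kubectl", "patch", "cronjob", name,
        "-n", namespace,
        "--type", "merge",
        "-p", patch,
    ]
    # shlex.quote wraps the JSON in single quotes so the shell passes it intact.
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    # Example value only: allow 2 retries instead of the default 6.
    print(kubectl_patch_command("mycronjob", "prod", 2))
```

A change like this should only affect Jobs created after the patch is applied; Jobs that already exist keep the backoffLimit they were created with.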
References
Running Automated Tasks with a CronJob
Jobs
Pod Lifecycle
Exercise
If CronJob.spec.failedJobsHistoryLimit is set to 2 and Job.spec.backoffLimit is set to 5, what is the maximum number of Pods in the Error state that will be retained?