Introduction
Pod priority and Preemption
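When Kubernetes schedules pods onto nodes, you can assign each pod a Priority so that pods have different levels of importance. The scheduler then schedules higher-priority pods first, and if a node does not have enough resources for a pending pod, it triggers preemption: lower-priority pods are evicted to make room.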
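The ordering described above can be illustrated with a small sketch. This is a toy model and not kube-scheduler's implementation: `SchedulingQueue` is a hypothetical class, the third pod is an assumed example, and the FIFO tie-break among equal priorities is an assumption of the sketch.

```python
import heapq
import itertools


class SchedulingQueue:
    """Toy queue: pops the highest-priority pod first, FIFO among equals."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # arrival order, used as tie-breaker

    def add(self, name, priority):
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), name))

    def pop(self):
        _neg_priority, _seq, name = heapq.heappop(self._heap)
        return name

    def __len__(self):
        return len(self._heap)


queue = SchedulingQueue()
queue.add("nginx-deploy-low", 10000)     # value used for low-priority below
queue.add("nginx-deploy-high", 1000000)  # value used for high-priority below
queue.add("no-priority-class", 0)        # assumed pod without a PriorityClass

order = [queue.pop() for _ in range(len(queue))]
print(order)
```

Although nginx-deploy-low is added first, nginx-deploy-high is popped first because its priority value is larger.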
Enabling Pod priority and Preemption
- Enabled by default in version 1.11 and later, and stable since 1.14.
- In versions before 1.11, start kube-scheduler with --feature-gates=PodPriority=true to enable it.
Example
Create the PriorityClass objects
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for Test pods only."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 10000
globalDefault: false
description: "This priority class should be used for Test pods only."
The YAML above defines two priority classes, high-priority and low-priority, with values 1000000 and 10000 respectively.
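How a pod's priorityClassName turns into an integer priority can be sketched roughly as follows. This is a simplified model of the documented behaviour (a named class supplies its value, an unknown class name causes the pod to be rejected, and a pod without a class gets the globalDefault class's value or 0 if there is none). `resolve_priority` and `PriorityClassNotFound` are hypothetical names, not Kubernetes APIs.

```python
class PriorityClassNotFound(Exception):
    """Raised when a pod names a PriorityClass that does not exist."""


def resolve_priority(priority_class_name, priority_classes):
    """Return the integer priority a pod would get in this toy model.

    priority_classes maps name -> {"value": int, "globalDefault": bool}.
    """
    if priority_class_name:
        if priority_class_name not in priority_classes:
            raise PriorityClassNotFound(priority_class_name)
        return priority_classes[priority_class_name]["value"]
    # No class named: fall back to the globalDefault class, if one exists.
    for spec in priority_classes.values():
        if spec.get("globalDefault"):
            return spec["value"]
    return 0  # no globalDefault class: the priority is zero


# The two classes defined in the YAML above.
classes = {
    "high-priority": {"value": 1000000, "globalDefault": False},
    "low-priority": {"value": 10000, "globalDefault": False},
}

high_value = resolve_priority("high-priority", classes)
low_value = resolve_priority("low-priority", classes)
unset_value = resolve_priority(None, classes)
print(high_value, low_value, unset_value)
```

Because neither class sets globalDefault to true, a pod without priorityClassName ends up with priority 0 in this model.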
Create the Deployments
apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: nginx-deploy-high
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 1 # tells deployment to run 1 pod matching the template
  template:
    metadata:
      labels:
        app: nginx
    spec:
      hostNetwork: true
      priorityClassName: high-priority
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 8088
---
apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: nginx-deploy-low
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 1 # tells deployment to run 1 pod matching the template
  template:
    metadata:
      labels:
        app: nginx
    spec:
      hostNetwork: true
      priorityClassName: low-priority
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 8088
kubectl create -f ./nginx-deploy-low-priority.yaml
kubectl create -f ./nginx-deploy-high.yaml
About to try and schedule pod prometheus/nginx-deploy-high-76b56d5cc5-vpfjn
I1225 15:10:09.753527 1 scheduler.go:456] Attempting to schedule pod: prometheus/nginx-deploy-high-76b56d5cc5-vpfjn
I1225 15:10:09.753643 1 generic_scheduler.go:648] since alwaysCheckAllPredicates has not been set, the predicate evaluation is short circuited and there are chances of other predicates failing as well.
I1225 15:10:09.753696 1 factory.go:665] Unable to schedule prometheus/nginx-deploy-high-76b56d5cc5-vpfjn: no fit: 0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports.; waiting
I1225 15:10:09.753741 1 factory.go:736] Updating pod condition for prometheus/nginx-deploy-high-76b56d5cc5-vpfjn to (PodScheduled==False, Reason=Unschedulable)
I1225 15:10:09.755568 1 generic_scheduler.go:318] Pod prometheus/nginx-deploy-high-76b56d5cc5-vpfjn is not eligible for more preemption.
I1225 15:10:09.755726 1 scheduling_queue.gkube
I1225 15:10:11.729743 1 generic_scheduler.go:1147] Node host108752172 is a potential node for preemption.
I1225 15:10:11.729916 1 generic_scheduler.go:648] since alwaysCheckAllPredicates has not been set, the predicate evaluation is short circuited and there are chances of other predicates failing as well.
I1225 15:10:11.730407 1 cache.go:309] Finished binding for pod ac27a286-4272-47a6-8677-735b23e981fa. Can be expired.
I1225 15:10:11.730627 1 scheduler.go:593] pod prometheus/nginx-deploy-high-76b56d5cc5-vpfjn is bound successfully on node host108752172, 1 nodes evaluated, 1 nodes were found feasible
I1225 15:10:12.066208 1 leaderelection.go:276] successfully renewed lease kube-system/kube-scheduler
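To pick the preemption-related entries out of a longer scheduler log, a small filter along these lines may help. It is only a sketch: the regular expression assumes the usual klog header layout (severity letter, month and day, time, process or thread id, file:line, message), and the keyword list is an assumption chosen to match the messages shown above.

```python
import re

# Assumed klog header layout: Lmmdd hh:mm:ss.uuuuuu id file:line] message
KLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<thread_id>\d+) "
    r"(?P<source>[^\s:]+:\d+)\] (?P<message>.*)$"
)

# Assumed keywords, taken from the log messages in this article.
KEYWORDS = ("Unable to schedule", "preemption", "bound successfully")


def preemption_events(lines):
    """Return (time, source, message) for lines that mention a keyword."""
    events = []
    for line in lines:
        match = KLOG_LINE.match(line.strip())
        if match is None:
            continue  # skip truncated or non-klog lines
        message = match.group("message")
        if any(keyword in message for keyword in KEYWORDS):
            events.append((match.group("time"), match.group("source"), message))
    return events


sample = [
    "I1225 15:10:09.753696 1 factory.go:665] Unable to schedule prometheus/nginx-deploy-high-76b56d5cc5-vpfjn: no fit: 0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports.; waiting",
    "I1225 15:10:09.755726 1 scheduling_queue.gkube",
    "I1225 15:10:11.729743 1 generic_scheduler.go:1147] Node host108752172 is a potential node for preemption.",
    "I1225 15:10:11.730627 1 scheduler.go:593] pod prometheus/nginx-deploy-high-76b56d5cc5-vpfjn is bound successfully on node host108752172, 1 nodes evaluated, 1 nodes were found feasible",
    "I1225 15:10:12.066208 1 leaderelection.go:276] successfully renewed lease kube-system/kube-scheduler",
]

events = preemption_events(sample)
for stamp, source, message in events:
    print(stamp, source, message[:60])
```

The truncated line and the lease-renewal line are dropped, leaving the failed attempt, the preemption candidate, and the successful binding.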
Analysis
Two deployments were created with kubectl above: nginx-deploy-low first, then nginx-deploy-high. The log shows that when the scheduler tried to place nginx-deploy-high-76b56d5cc5-vpfjn, it found that port 8088 was already taken by the nginx-deploy-low pod. Both pods set hostNetwork: true, so they compete for the same port on the single node. Because nginx-deploy-high-76b56d5cc5-vpfjn has a higher Priority value than the low pod, the scheduler marks the node as a preemption candidate ("Node host108752172 is a potential node for preemption."). The running nginx-deploy-low pod is then evicted, and the replacement pod created by its Deployment stays Pending because the port is no longer free, while the nginx-deploy-high pod becomes Running.
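The decision described in this analysis can be sketched as a toy model. It is not the scheduler's real algorithm: `Pod` and `find_victims` are hypothetical names, only host-port conflicts are considered, and the rule that only strictly lower-priority pods may be evicted is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    priority: int
    host_ports: frozenset


def find_victims(incoming, running):
    """Return the pods to evict so `incoming` fits, or None if it cannot.

    Toy rule: only host-port conflicts are considered, and only pods with
    a strictly lower priority than `incoming` may be evicted.
    """
    conflicting = [p for p in running if p.host_ports & incoming.host_ports]
    if any(p.priority >= incoming.priority for p in conflicting):
        return None  # a conflicting pod cannot be preempted
    return conflicting


low = Pod("nginx-deploy-low", 10000, frozenset({8088}))
high = Pod("nginx-deploy-high", 1000000, frozenset({8088}))

# The high-priority pod arrives while the low-priority pod holds port 8088.
victims = find_victims(high, [low])
print([p.name for p in victims])

# The reverse case: a low-priority pod cannot displace the high-priority one.
blocked = find_victims(low, [high])
print(blocked)
```

In the first call the low-priority pod is returned as the victim, which matches the log above. In the second call nothing can be evicted, so the low-priority pod would simply stay Pending.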
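Summary
- If two pods are waiting in the scheduling queue, one with a higher priority and one with a lower priority, the scheduler schedules the higher-priority pod first. This case is not demonstrated here because it is hard to reproduce in the test environment.
- If the scheduler finds that resources are insufficient, it preempts the resources of lower-priority pods and gives them to the higher-priority pod.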