K8S Advanced Scheduling: Affinity and Anti-Affinity
References:
Inter-pod topological affinity and anti-affinity
what
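- Affinity: applications A and B interact frequently, so it makes sense to use affinity to place them as close together as possible, even on the same node, to reduce the performance cost of network communication.
- Anti-affinity: when an application is deployed with multiple replicas, anti-affinity spreads the instances across different nodes to improve HA.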
node
Node affinity constrains the scheduler to place pods according to node labels.
Consider the following scenario:
There are two zones, az1 and az2, and we want the pod to be deployed only in az1.
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - az1
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: another-node-label-key
            operator: In
            values:
            - another-node-label-value
  containers:
  - name: with-node-affinity
    image: k8s.gcr.io/pause:2.0
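Two types:
- requiredDuringSchedulingIgnoredDuringExecution: hard. The rule is strictly enforced: the pod is scheduled only onto a node that satisfies it, otherwise it is not scheduled. It is evaluated in the predicates (filtering) phase, so a pod is never placed on a node that violates a hard rule.
- preferredDuringSchedulingIgnoredDuringExecution: soft. The rule is best effort: the scheduler prefers nodes that satisfy it. It is evaluated in the priorities (scoring) phase.
The suffix IgnoredDuringExecution means that if labels change so that a running pod no longer satisfies the rule, the pod ignores the change and keeps running.
- requiredDuringSchedulingRequiredDuringExecution: not implemented yet. It is the same as the hard type above except for the suffix, which means that if labels change, pods that no longer satisfy the rule would be evicted.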
Note: the supported operators are In, NotIn, Exists, DoesNotExist, Gt and Lt. NotIn and DoesNotExist can be used to implement node anti-affinity.
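As an illustration of the operators, the sketch below uses NotIn to keep a pod out of a zone and Gt for a numeric comparison. The pod name and the `example.com/cpu-cores` label are hypothetical; only the zone key comes from the example above.

```yaml
# Sketch only: the pod name and the example.com/cpu-cores label are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: avoid-az2
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          # Node anti-affinity: reject nodes whose zone label is az2.
          # NotIn also matches nodes that do not carry this label at all.
          - key: kubernetes.io/e2e-az-name
            operator: NotIn
            values:
            - az2
          # Gt/Lt take exactly one value and compare it as an integer.
          - key: example.com/cpu-cores
            operator: Gt
            values:
            - "8"
  containers:
  - name: avoid-az2
    image: k8s.gcr.io/pause:2.0
```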
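Note: weight ranges from 1 to 100. It feeds into the scheduler's priorities (scoring) phase: every node that matches the preference has this weight added to its score, and the pod is finally bound to the node with the highest total score.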
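A minimal sketch of how weights combine, assuming a hypothetical `example.com/disktype` label. A node matching both preferences gets 100 added from this rule, a node matching only the first gets 80, only the second 20, and a node matching neither gets 0. The result is then combined with the scheduler's other scoring functions.

```yaml
# Sketch only: fragment of a pod spec; example.com/disktype is a hypothetical label.
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 80                 # strong preference for SSD nodes
      preference:
        matchExpressions:
        - key: example.com/disktype
          operator: In
          values:
          - ssd
    - weight: 20                 # weaker preference for zone az1
      preference:
        matchExpressions:
        - key: kubernetes.io/e2e-az-name
          operator: In
          values:
          - az1
```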
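Restrictions
- If both nodeSelector and nodeAffinity are specified, the node must satisfy both for the pod to be scheduled onto it.
- If nodeAffinity has multiple nodeSelectorTerms, satisfying any one of them is enough.
- If a nodeSelectorTerms entry has multiple matchExpressions, all of them must be satisfied.
- Because of IgnoredDuringExecution, changing labels does not affect pods that are already running.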
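The AND/OR rules above can be sketched in one manifest. The pod name and the `example.com/*` labels are hypothetical. A node qualifies if it has `example.com/disktype=ssd` and, in addition, satisfies either term 1 (az1 with a GPU label) or term 2 (az2).

```yaml
# Sketch only: pod name and example.com/* labels are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: combined-constraints
spec:
  nodeSelector:
    example.com/disktype: ssd    # must hold in addition to nodeAffinity
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        # Term 1: both expressions must match (AND).
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - az1
          - key: example.com/gpu
            operator: Exists
        # Term 2: alternative to term 1 (terms are ORed).
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - az2
  containers:
  - name: combined-constraints
    image: k8s.gcr.io/pause:2.0
```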
In short, node affinity is similar to nodeSelector and is an extension of it.
Inter-pod
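In K8S, we can decide which node a pod is scheduled to based on the labels of the pods already running on the nodes.
For example: may this pod be scheduled onto X (affinity: yes, anti-affinity: no)? X is already running some pods, and the scheduler has to check whether those pods satisfy rule Y.
- Rule Y is a LabelSelector.
- X is a logical topology concept such as node, rack, az or region. It is expressed with topologyKey, which names a node label key; the label's value identifies the concrete topology domain.
Commonly used built-in node labels: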
kubernetes.io/hostname
failure-domain.beta.kubernetes.io/zone
failure-domain.beta.kubernetes.io/region
beta.kubernetes.io/instance-type
beta.kubernetes.io/os
beta.kubernetes.io/arch
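To make the topology idea concrete, here is a sketch of the labels a node might carry; the node name and label values are hypothetical. With topologyKey set to the zone label, this node and every other node labelled `az1` form one topology domain; with the hostname label, each node is its own domain.

```yaml
# Sketch only: node name and label values are hypothetical.
apiVersion: v1
kind: Node
metadata:
  name: node-1
  labels:
    kubernetes.io/hostname: node-1
    failure-domain.beta.kubernetes.io/zone: az1
    failure-domain.beta.kubernetes.io/region: region-1
```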
Note: this feature requires a large amount of computation and has a noticeable performance cost for the scheduler.
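apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: failure-domain.beta.kubernetes.io/zone
  # containers added so the manifest is complete; the image is an assumption
  # reused from the node affinity example.
  containers:
  - name: with-pod-affinity
    image: k8s.gcr.io/pause:2.0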
Note: the valid operators are In, NotIn, Exists and DoesNotExist.
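For comparison with the hard rule above, a soft inter-pod rule wraps the same term in `podAffinityTerm` and adds a `weight`. The fragment below is a sketch; the `security: S2` label value is an assumption chosen for illustration.

```yaml
# Sketch only: fragment of a pod spec; the S2 label value is an assumption.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S2
        # Prefer nodes that do not already run a pod labelled security=S2.
        topologyKey: kubernetes.io/hostname
```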
Restrictions on topologyKey:
- For affinity and for hard anti-affinity, an empty topologyKey is not allowed.
- For hard anti-affinity, the LimitPodHardAntiAffinityTopology admission controller restricts topologyKey to kubernetes.io/hostname.
- For soft anti-affinity, an empty topologyKey is interpreted as the combination of kubernetes.io/hostname, failure-domain.beta.kubernetes.io/zone and failure-domain.beta.kubernetes.io/region.
example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: store
  replicas: 3
  template:
    metadata:
      labels:
        app: store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: redis-server
        image: redis:3.2-alpine
This deploys 3 redis instances and, to improve HA, never places two of them on the same node.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  selector:
    matchLabels:
      app: web-store
  replicas: 3
  template:
    metadata:
      labels:
        app: web-store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-store
            topologyKey: "kubernetes.io/hostname"
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web-app
        image: nginx:1.12-alpine
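This deploys three web instances. To improve HA, no two of them share a node; and to keep traffic to redis local, each one is placed on a node that already runs a redis instance (the podAffinity rule here is hard, so this is required rather than merely preferred).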
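Symmetry
Consider two applications, S1 and S2, with a strict requirement that an S1 pod must never run on the same node as an S2 pod. Setting hard anti-affinity on S1 alone is not enough; S2 needs the corresponding hard anti-affinity as well. In other words, scheduling an S1 pod must check that the node has no S2 pod, and scheduling an S2 pod must check that the node has no S1 pod. Consider the two orderings:
- S2 is scheduled first, then S1: the anti-affinity rule is satisfied.
- S1 is scheduled first, then S2: S1's anti-affinity rule is violated, because S2 has no anti-affinity rule of its own and can therefore be placed in the same topology as S1 at schedule time.
This is symmetry: once S1 declares a hard anti-affinity rule against S2, S2 must symmetrically declare a hard anti-affinity rule against S1 for scheduling to behave as expected.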
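The symmetric setup described above could be written as the following pair of manifests. This is a sketch: the `app: s1` / `app: s2` labels, the pod names and the pause image are assumptions chosen for illustration.

```yaml
# Sketch only: names, labels and image are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: s1
  labels:
    app: s1
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: s2              # S1 refuses nodes that run S2
        topologyKey: kubernetes.io/hostname
  containers:
  - name: s1
    image: k8s.gcr.io/pause:2.0
---
apiVersion: v1
kind: Pod
metadata:
  name: s2
  labels:
    app: s2
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: s1              # mirror rule: S2 refuses nodes that run S1
        topologyKey: kubernetes.io/hostname
  containers:
  - name: s2
    image: k8s.gcr.io/pause:2.0
```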
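Note:
- Anti-affinity (soft and hard) is symmetric, as the example above shows.
- Hard affinity is not symmetric. For example, if S1 declares affinity to S2, scheduling S2 does not require the node to already run S1; there is only an implicit preference that a node running S1 is better.
- Soft affinity is symmetric. I do not fully understand this point yet, so it is left open.
Note:
The symmetry problem for hard anti-affinity is already handled in the scheduler code: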
kubernetes\pkg\scheduler\algorithm\predicates\predicates.go
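// InterPodAffinityMatches checks if a pod can be scheduled on the specified node with pod affinity/anti-affinity configuration.
func (c *PodAffinityChecker) InterPodAffinityMatches(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	// Abridged excerpt: only the two checks are shown; error handling and return values are omitted.
	// 1. Would placing the pod here break the anti-affinity of pods that already exist?
	c.satisfiesExistingPodsAntiAffinity(pod, meta, nodeInfo)
	// 2. Are the pod's own affinity/anti-affinity rules satisfied on this node?
	c.satisfiesPodsAffinityAntiAffinity(pod, meta, nodeInfo, affinity)
}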
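- The first call checks whether the pod would break the anti-affinity of pods that are already running (read from the cache), using those pods' RequiredDuringSchedulingIgnoredDuringExecution anti-affinity rules.
- The second call checks whether the pod's own affinity/anti-affinity rules are satisfied; only the hard rules are considered.
P.S. This is why hard rules are applied in the predicates phase, while soft rules are applied in the priorities (scoring) phase.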
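Special notes
Don't co-locate pods of this service with any other pods including pods of this service:
{LabelSelector: empty, TopologyKey: "node"}
In an anti-affinity rule, an empty selector matches every pod, so the pod is kept apart from all other pods. Because hard rules are processed in the predicates phase, a pod cannot be scheduled at all if the only node that satisfies its hard affinity fails some other predicate, for example because it lacks resources. Whether to use hard or soft rules therefore has to be decided from the business requirements.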
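A sketch of both options as pod spec fragments. It assumes `kubernetes.io/hostname` as the concrete topologyKey in place of the abstract "node" above, and that no `namespaces` field is set, in which case the selector applies to pods in the pod's own namespace. The first fragment is the strict form; the second is a soft variant that still prefers an empty node but does not block scheduling when none is available.

```yaml
# Sketch only: fragments of a pod spec, shown as two alternatives.
# Hard: never share a node with any other pod (selected by the empty selector).
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector: {}          # empty selector matches every pod
      topologyKey: kubernetes.io/hostname
---
# Soft: prefer a node without other pods, but fall back if there is none.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector: {}
        topologyKey: kubernetes.io/hostname
```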
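If no node runs a target pod that matches the affinity rule, a soft rule simply has no effect on scoring, while a hard rule normally leaves the pod unschedulable. The scheduler makes one exception: when the pod's own labels match its affinity selector (the first pod of a group with affinity to itself), the affinity rule is ignored so that the group can start.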