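The Kubernetes component responsible for pod scheduling is kube-scheduler, which assigns each Pod to a specific Node. kube-scheduler watches the Pod list through the interface exposed by the API Server, picks up Pods that are waiting to be scheduled, filters and scores the Nodes according to a series of predicate (filtering) and priority (scoring) policies, and then binds the Pod to the highest-scoring Node. The binding information is written to etcd.
The kubelet on each node watches kube-apiserver, picks up the binding events produced by kube-scheduler, fetches the pod manifest, pulls the images, and starts the containers.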
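Scheduling policies
Kubernetes scheduling policies are divided into Predicates (filtering policies) and Priorities (scoring policies), and the scheduling process has two steps:
Predicates: these are mandatory rules. The scheduler iterates over all Nodes and keeps only those that satisfy the configured predicates. If no Node satisfies them, the Pod stays pending until some Node does.
Priorities: the Nodes that passed the first step are scored and ranked according to the priority policies, and the best one is selected.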
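The two steps above can be sketched in a few lines of Go. This is a simplified illustration, not the scheduler's real code: the `Node`, `Pod`, `Predicate` and `Priority` types, the `schedule` function, and the two example policies (`fitsCPU`, `mostFreeCPU`) are all hypothetical names invented for this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// Node and Pod are simplified, hypothetical types for this sketch.
type Node struct {
	Name    string
	FreeCPU int64 // millicores not yet requested by other pods
}

type Pod struct {
	Name string
	CPU  int64 // millicores requested
}

// Predicate is a hard filter: returning false removes the node.
type Predicate func(pod Pod, node Node) bool

// Priority scores a feasible node from 0 to 10; higher is better.
type Priority func(pod Pod, node Node) int

// ErrNoFit signals that every node was filtered out, so the pod stays pending.
var ErrNoFit = errors.New("no node satisfies all predicates")

// passesAll reports whether the node satisfies every predicate.
func passesAll(pod Pod, n Node, preds []Predicate) bool {
	for _, p := range preds {
		if !p(pod, n) {
			return false
		}
	}
	return true
}

// schedule filters nodes with the predicates, then returns the best-scoring one.
func schedule(pod Pod, nodes []Node, preds []Predicate, prios []Priority) (string, error) {
	// Step 1: keep only the nodes that pass every predicate.
	var feasible []Node
	for _, n := range nodes {
		if passesAll(pod, n, preds) {
			feasible = append(feasible, n)
		}
	}
	if len(feasible) == 0 {
		return "", ErrNoFit
	}
	// Step 2: sum the priority scores and pick the highest total.
	best, bestScore := "", -1
	for _, n := range feasible {
		score := 0
		for _, prio := range prios {
			score += prio(pod, n)
		}
		if score > bestScore {
			best, bestScore = n.Name, score
		}
	}
	return best, nil
}

// fitsCPU is an example predicate: the node needs enough unrequested CPU.
func fitsCPU(pod Pod, n Node) bool { return pod.CPU <= n.FreeCPU }

// mostFreeCPU is an example priority: more CPU left after placement scores higher.
func mostFreeCPU(pod Pod, n Node) int {
	if n.FreeCPU == 0 {
		return 0
	}
	return int((n.FreeCPU - pod.CPU) * 10 / n.FreeCPU)
}

func main() {
	nodes := []Node{{"node-a", 500}, {"node-b", 2000}, {"node-c", 4000}}
	name, err := schedule(Pod{"web", 1000}, nodes, []Predicate{fitsCPU}, []Priority{mostFreeCPU})
	fmt.Println(name, err)
}
```

In this run node-a is filtered out because it cannot fit the request, and node-c wins because it keeps the largest share of its CPU free after placement.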
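- Source locations:
The predicates package contains all predicate policies supported by k8s
The priorities package contains all priority policies supported by k8s
The defaults package under the algorithmprovider package contains the default predicate and priority policies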
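Predicates (filtering policies)
Kubernetes v1.7 supports the following 15 Predicates policies:
- MatchNodeSelector: checks whether the Node's labels match the Pod's spec.nodeSelector
- PodFitsResources: checks whether the host's resources (CPU and memory) can satisfy the Pod. The check is based on the resources already requested by the Pods on the node (their requests), not on actual usage
- PodFitsHostPorts: checks whether a HostPort required by any container in the Pod is already taken by another container. If a required HostPort is unavailable, the Pod cannot be scheduled onto this host
- HostName: checks whether the host name matches the NodeName specified by the Pod
- NoDiskConflict: checks, based on pod.spec.volumes, whether there is a volume conflict on this host. If the host has already mounted a volume, other Pods that use the same volume cannot be scheduled onto it. The exact rules differ between storage backends
- NoVolumeZoneConflict: checks whether deploying the Pod on this host would conflict with the zone restrictions of its volumes
- PodToleratesNodeTaints: ensures that the tolerations defined on the pod tolerate the taints defined on the node
- CheckNodeMemoryPressure: checks whether the pod can be scheduled onto a node that is reporting memory pressure
- CheckNodeDiskPressure: checks whether the pod can be scheduled onto a node that is reporting disk pressure
- MaxEBSVolumeCount: ensures that the number of attached EBS volumes does not exceed the configured maximum, 39 by default
- MaxGCEPDVolumeCount: ensures that the number of attached GCE PD volumes does not exceed the configured maximum, 16 by default
- MaxAzureDiskVolumeCount: ensures that the number of attached Azure Disk volumes does not exceed the configured maximum, 16 by default
- MatchInterPodAffinity: checks whether the pod satisfies the affinity and anti-affinity rules with respect to other pods
- GeneralPredicates: a composite check that bundles the basic fit predicates (PodFitsResources, PodFitsHostPorts, HostName and MatchNodeSelector)
- NoVolumeNodeConflict: checks whether deploying the Pod on this host would conflict with the node restrictions of its volumes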
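To make the shape of a predicate concrete, here is a minimal sketch of two of the checks listed above, PodFitsResources and PodFitsHostPorts. The `PodSpec` and `NodeState` types and both function bodies are simplified assumptions for illustration; the real predicates in the predicates package operate on the scheduler's own cached node information and handle more resource types.

```go
package main

import "fmt"

// PodSpec is a hypothetical, reduced view of what a pod asks for.
type PodSpec struct {
	CPURequest int64 // millicores
	MemRequest int64 // MiB
	HostPorts  []int
}

// NodeState is a hypothetical view of a node and the pods already bound to it.
type NodeState struct {
	AllocatableCPU int64 // millicores
	AllocatableMem int64 // MiB
	Pods           []PodSpec
}

// podFitsResources compares requests, not actual usage: the sum of the
// requests already on the node plus the new pod must fit in the allocatable amount.
func podFitsResources(pod PodSpec, node NodeState) bool {
	var cpu, mem int64
	for _, p := range node.Pods {
		cpu += p.CPURequest
		mem += p.MemRequest
	}
	return cpu+pod.CPURequest <= node.AllocatableCPU &&
		mem+pod.MemRequest <= node.AllocatableMem
}

// podFitsHostPorts rejects the node if any host port the pod wants is already in use.
func podFitsHostPorts(pod PodSpec, node NodeState) bool {
	used := make(map[int]bool)
	for _, p := range node.Pods {
		for _, port := range p.HostPorts {
			used[port] = true
		}
	}
	for _, port := range pod.HostPorts {
		if used[port] {
			return false
		}
	}
	return true
}

func main() {
	node := NodeState{
		AllocatableCPU: 4000,
		AllocatableMem: 8192,
		Pods:           []PodSpec{{CPURequest: 3000, MemRequest: 4096, HostPorts: []int{8080}}},
	}
	pod := PodSpec{CPURequest: 500, MemRequest: 1024, HostPorts: []int{8080}}
	fmt.Println("fits resources:", podFitsResources(pod, node))
	fmt.Println("fits host ports:", podFitsHostPorts(pod, node))
}
```

Here the pod fits by resources but is rejected by the host-port check, which is enough to exclude the node, since every predicate must pass.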
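Priorities (scoring policies)
Kubernetes v1.7 supports the following Priorities policies:
- EqualPriority: all nodes get the same priority
- ImageLocalityPriority: scores a host by whether it already holds the images the Pod needs. If none of the required images are present the score is 0; if they are present, the larger the images, the higher the score
- LeastRequestedPriority: computes the share of the node's CPU and memory that would be requested by Pods, and prefers the node with the smallest share. Score formula:
(cpu((capacity - sum(requested)) * 10 / capacity) + memory((capacity - sum(requested)) * 10 / capacity)) / 2
- BalancedResourceAllocation: prefers the node whose resource usage (CPU and memory) is most balanced. Score formula:
10 - abs(totalCpu/cpuNodeCapacity - totalMemory/memoryNodeCapacity) * 10
- SelectorSpreadPriority: counts the Pods on each Node that belong to the same Service or ReplicaSet as the Pod being scheduled. The fewer such Pods, the higher the score
- NodePreferAvoidPodsPriority: evaluates the node annotation scheduler.alpha.kubernetes.io/preferAvoidPods. Its weight is set to 10000 so that it overrides the other policies
- NodeAffinityPriority: node affinity policy, with two kinds of selector: requiredDuringSchedulingIgnoredDuringExecution (the selected host must satisfy all of the Pod's rules for hosts) and preferredDuringSchedulingIgnoredDuringExecution (the scheduler tries to satisfy all NodeSelector requirements but does not guarantee it)
- TaintTolerationPriority: the scoring counterpart of the PodToleratesNodeTaints predicate. It prefers nodes with fewer taints that the Pod does not tolerate (PreferNoSchedule taints)
- InterPodAffinityPriority: pod affinity policy, similar to NodeAffinityPriority, with two kinds of selector: requiredDuringSchedulingIgnoredDuringExecution (the selected host must satisfy all of the Pod's affinity rules) and preferredDuringSchedulingIgnoredDuringExecution (the scheduler tries to satisfy the rules but does not guarantee it)
- MostRequestedPriority: suited to clusters that scale dynamically. It prefers the host with the highest utilization, so that when the cluster scales down, idle machines are freed up and can be shut down.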
Default policies
Default predicates
func defaultPredicates() sets.String {
    predSet := sets.NewString(
        factory.RegisterFitPredicateFactory(
            "NoVolumeZoneConflict",
            func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
                return predicates.NewVolumeZonePredicate(args.PVInfo, args.PVCInfo)
            },
        ),
        factory.RegisterFitPredicateFactory(
            "MaxEBSVolumeCount",
            func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
                // TODO: allow for generically parameterized scheduler predicates, because this is a bit ugly
                maxVols := getMaxVols(aws.DefaultMaxEBSVolumes)
                return predicates.NewMaxPDVolumeCountPredicate(predicates.EBSVolumeFilter, maxVols, args.PVInfo, args.PVCInfo)
            },
        ),
        factory.RegisterFitPredicateFactory(
            "MaxGCEPDVolumeCount",
            func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
                // TODO: allow for generically parameterized scheduler predicates, because this is a bit ugly
                maxVols := getMaxVols(DefaultMaxGCEPDVolumes)
                return predicates.NewMaxPDVolumeCountPredicate(predicates.GCEPDVolumeFilter, maxVols, args.PVInfo, args.PVCInfo)
            },
        ),
        factory.RegisterFitPredicateFactory(
            "MaxAzureDiskVolumeCount",
            func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
                // TODO: allow for generically parameterized scheduler predicates, because this is a bit ugly
                maxVols := getMaxVols(DefaultMaxAzureDiskVolumes)
                return predicates.NewMaxPDVolumeCountPredicate(predicates.AzureDiskVolumeFilter, maxVols, args.PVInfo, args.PVCInfo)
            },
        ),
        factory.RegisterFitPredicateFactory(
            predicates.MatchInterPodAffinity,
            func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
                return predicates.NewPodAffinityPredicate(args.NodeInfo, args.PodLister)
            },
        ),
        factory.RegisterFitPredicate("NoDiskConflict", predicates.NoDiskConflict),
        factory.RegisterFitPredicate("GeneralPredicates", predicates.GeneralPredicates),
        factory.RegisterFitPredicate("CheckNodeMemoryPressure", predicates.CheckNodeMemoryPressurePredicate),
        factory.RegisterFitPredicate("CheckNodeDiskPressure", predicates.CheckNodeDiskPressurePredicate),
        factory.RegisterFitPredicateFactory(
            "NoVolumeNodeConflict",
            func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
                return predicates.NewVolumeNodePredicate(args.PVInfo, args.PVCInfo, nil)
            },
        ),
    )
    if utilfeature.DefaultFeatureGate.Enabled(features.TaintNodesByCondition) {
        predSet.Insert(factory.RegisterMandatoryFitPredicate("PodToleratesNodeTaints", predicates.PodToleratesNodeTaints))
        glog.Warningf("TaintNodesByCondition is enabled, PodToleratesNodeTaints predicate is mandatory")
    } else {
        predSet.Insert(factory.RegisterMandatoryFitPredicate("CheckNodeCondition", predicates.CheckNodeConditionPredicate))
        predSet.Insert(factory.RegisterFitPredicate("PodToleratesNodeTaints", predicates.PodToleratesNodeTaints))
    }
    return predSet
}
Default priorities
func defaultPriorities() sets.String {
    return sets.NewString(
        factory.RegisterPriorityConfigFactory(
            "SelectorSpreadPriority",
            factory.PriorityConfigFactory{
                Function: func(args factory.PluginFactoryArgs) algorithm.PriorityFunction {
                    return priorities.NewSelectorSpreadPriority(args.ServiceLister, args.ControllerLister, args.ReplicaSetLister, args.StatefulSetLister)
                },
                Weight: 1,
            },
        ),
        factory.RegisterPriorityConfigFactory(
            "InterPodAffinityPriority",
            factory.PriorityConfigFactory{
                Function: func(args factory.PluginFactoryArgs) algorithm.PriorityFunction {
                    return priorities.NewInterPodAffinityPriority(args.NodeInfo, args.NodeLister, args.PodLister, args.HardPodAffinitySymmetricWeight)
                },
                Weight: 1,
            },
        ),
        factory.RegisterPriorityFunction2("LeastRequestedPriority", priorities.LeastRequestedPriorityMap, nil, 1),
        factory.RegisterPriorityFunction2("BalancedResourceAllocation", priorities.BalancedResourceAllocationMap, nil, 1),
        factory.RegisterPriorityFunction2("NodePreferAvoidPodsPriority", priorities.CalculateNodePreferAvoidPodsPriorityMap, nil, 10000),
        factory.RegisterPriorityFunction2("NodeAffinityPriority", priorities.CalculateNodeAffinityPriorityMap, priorities.CalculateNodeAffinityPriorityReduce, 1),
        factory.RegisterPriorityFunction2("TaintTolerationPriority", priorities.ComputeTaintTolerationPriorityMap, priorities.ComputeTaintTolerationPriorityReduce, 1),
    )
}
Policies registered by default but not loaded
Predicates
// Registers predicates and priorities that are not enabled by default, but user can pick when creating his
// own set of priorities/predicates.
factory.RegisterFitPredicate("PodFitsPorts", predicates.PodFitsHostPorts)
factory.RegisterFitPredicate("PodFitsHostPorts", predicates.PodFitsHostPorts)
factory.RegisterFitPredicate("PodFitsResources", predicates.PodFitsResources)
factory.RegisterFitPredicate("HostName", predicates.PodFitsHost)
factory.RegisterFitPredicate("MatchNodeSelector", predicates.PodMatchNodeSelector)
Priorities
factory.RegisterPriorityFunction2("EqualPriority", core.EqualPriorityMap, nil, 1)
factory.RegisterPriorityFunction2("ImageLocalityPriority", priorities.ImageLocalityPriorityMap, nil, 1)
factory.RegisterPriorityFunction2("MostRequestedPriority", priorities.MostRequestedPriorityMap, nil, 1)