k8s scheduling algorithms

  1. Predicate (filtering) algorithms: filter the candidate nodes
  2. Priority (scoring) algorithms: score the filtered nodes

ps. scheduler_algorithm
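
To make the two phases concrete, here is a minimal, self-contained sketch of the filter-then-score flow. The Node/Pod/Predicate/Priority types and the schedule function are simplified stand-ins invented for this illustration, not the real scheduler API.

package main

import "fmt"

// Node and Pod are hypothetical, simplified stand-ins for the real API objects.
type Node struct{ Name string }
type Pod struct{ Name string }

// A predicate filters nodes; a priority scores the nodes that survive filtering.
type Predicate func(pod Pod, node Node) bool
type Priority func(pod Pod, node Node) int

// schedule runs the two phases: filter with all predicates, then score with all
// priorities and pick the node with the highest total score.
func schedule(pod Pod, nodes []Node, predicates []Predicate, priorities []Priority) (Node, bool) {
    var feasible []Node
    for _, n := range nodes {
        ok := true
        for _, p := range predicates {
            if !p(pod, n) {
                ok = false
                break
            }
        }
        if ok {
            feasible = append(feasible, n)
        }
    }
    if len(feasible) == 0 {
        return Node{}, false
    }
    best, bestScore := feasible[0], -1
    for _, n := range feasible {
        score := 0
        for _, pr := range priorities {
            score += pr(pod, n)
        }
        if score > bestScore {
            best, bestScore = n, score
        }
    }
    return best, true
}

func main() {
    nodes := []Node{{"node-a"}, {"node-b"}}
    pod := Pod{"demo"}
    picked, ok := schedule(pod, nodes,
        []Predicate{func(Pod, Node) bool { return true }},
        []Priority{func(_ Pod, n Node) int { return len(n.Name) }})
    fmt.Println(picked.Name, ok)
}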

Predicates (filtering)

Method signature

func Predicates(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {}
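
For illustration, a predicate of roughly this shape could look like the sketch below. It re-implements the PodFitsHost idea with simplified stand-in types instead of the real algorithm/schedulercache packages, so it only sketches how the signature is used and is not the actual source.

package main

import "fmt"

// Simplified stand-ins for the real API types.
type Pod struct {
    Name     string
    NodeName string // pod.Spec.NodeName in the real API
}

type NodeInfo struct {
    NodeName string
}

type PredicateFailureReason string

// podFitsHost mimics the PodFitsHost predicate: if the pod pins itself to a
// node via spec.nodeName, only that node passes; otherwise every node passes.
func podFitsHost(pod *Pod, nodeInfo *NodeInfo) (bool, []PredicateFailureReason, error) {
    if pod.NodeName == "" || pod.NodeName == nodeInfo.NodeName {
        return true, nil, nil
    }
    return false, []PredicateFailureReason{"pod spec.nodeName does not match this node"}, nil
}

func main() {
    pod := &Pod{Name: "demo", NodeName: "node-a"}
    fit, reasons, _ := podFitsHost(pod, &NodeInfo{NodeName: "node-b"})
    fmt.Println(fit, reasons) // false [pod spec.nodeName does not match this node]
}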

There are 20 predicate checks in total.

ps. Rather than translating them poorly, the original code comments are quoted directly below. Source: kubernetes-master\pkg\scheduler\algorithm\predicates\predicates.go

volume

  • NoDiskConflict (important): evaluates if a pod can fit due to the volumes it requests, and those that are already mounted. If there is already a volume mounted on that node, another pod that uses the same volume can't be scheduled there (a simplified sketch follows this list).
  • NewMaxPDVolumeCountPredicate (important): creates a predicate which evaluates whether a pod can fit based on the number of volumes which match a filter that it requests, and those that are already present.
    The predicate looks for both volumes used directly, as well as PVC volumes that are backed by relevant volume types, counts the number of unique volumes, and rejects the new pod if it would place the total count over the maximum.
  • NewVolumeZonePredicate (important): evaluates if a pod can fit due to the volumes it requests, given that some volumes may have zone scheduling constraints. The requirement is that any volume zone-labels must match the equivalent zone-labels on the node. It is OK for the node to have more zone-label constraints (for example, a hypothetical replicated volume might allow region-wide access).
    Currently this is only supported with PersistentVolumeClaims, and looks to the labels only on the bound PersistentVolume.
    Working with volumes declared inline in the pod specification (i.e. not using a PersistentVolume) is likely to be harder, as it would require determining the zone of a volume during scheduling, and that is likely to require calling out to the cloud provider. It seems that we are moving away from inline volume declarations anyway.
  • NewVolumeBindingPredicate: evaluates if a pod can fit due to the volumes it requests, for both bound and unbound PVCs.
    For PVCs that are bound, then it checks that the corresponding PV's node affinity is satisfied by the given node.
    For PVCs that are unbound, it tries to find available PVs that can satisfy the PVC requirements and that the PV node affinity is satisfied by the given node.
    The predicate returns true if all bound PVCs have compatible PVs with the node, and if all unbound PVCs can be matched with an available and node-compatible PV.
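
As a rough illustration of the NoDiskConflict idea, the sketch below treats a volume as a single ID and rejects a pod whose requested volume is already mounted on the node. The types are invented stand-ins; the real predicate applies per-plugin rules (GCE PD, AWS EBS, RBD, ISCSI, etc.).

package main

import "fmt"

// Simplified stand-ins: a volume is identified by a single ID (for example a
// GCE PD name or an RBD image); real conflict rules differ per volume plugin.
type Volume struct{ ID string }
type Pod struct{ Volumes []Volume }
type NodeInfo struct{ Pods []*Pod }

// noDiskConflict rejects the pod if any volume it requests is already mounted
// by a pod running on the node.
func noDiskConflict(pod *Pod, node *NodeInfo) bool {
    inUse := map[string]bool{}
    for _, existing := range node.Pods {
        for _, v := range existing.Volumes {
            inUse[v.ID] = true
        }
    }
    for _, v := range pod.Volumes {
        if inUse[v.ID] {
            return false
        }
    }
    return true
}

func main() {
    node := &NodeInfo{Pods: []*Pod{{Volumes: []Volume{{ID: "pd-1"}}}}}
    fmt.Println(noDiskConflict(&Pod{Volumes: []Volume{{ID: "pd-1"}}}, node)) // false: conflict
    fmt.Println(noDiskConflict(&Pod{Volumes: []Volume{{ID: "pd-2"}}}, node)) // true: no conflict
}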

pod

  • PodFitsResources (important): checks if a node has sufficient resources, such as cpu, memory, gpu, opaque int resources etc. to run a pod.
  • PodMatchNodeSelector (important): checks if a pod's node selector matches the node's labels.
  • PodFitsHost (important): checks if a pod's spec node name matches the current node.
  • CheckNodeLabelPresence: checks whether all of the specified labels exist on a node, regardless of their value.
    If "presence" is false, then it returns false if any of the requested labels matches any of the node's labels, otherwise it returns true.
    If "presence" is true, then it returns false if any of the requested labels does not match any of the node's labels, otherwise it returns true.
    Consider the case where nodes are placed in regions/zones/racks and these are identified by labels.
    In some cases, it is required that only nodes that are part of ANY of the defined regions/zones/racks be selected.
    Alternately, eliminating nodes that have a certain label, regardless of value, is also useful. A node may have a label with "retiring" as the key and the date as the value, and it may be desirable to avoid scheduling new pods on this node.
  • checkServiceAffinity: a predicate which matches nodes in such a way as to force that ServiceAffinity.labels are homogeneous for pods that are scheduled to a node (i.e. it returns true IFF this pod can be added to this node such that all other pods in the same service are running on nodes with the exact same ServiceAffinity.label values).
    For example: if the first pod of a service was scheduled to a node with label "region=foo",
    all subsequent pods belonging to the same service will be scheduled on nodes with the same "region=foo" label.
  • PodFitsHostPorts (important): checks if a node has free ports for the requested pod ports (a simplified sketch follows this list).
  • GeneralPredicates: checks whether noncriticalPredicates and EssentialPredicates pass. noncriticalPredicates are the predicates that only non-critical pods need; at present this is just PodFitsResources.
  • EssentialPredicates: the predicates that all pods, including critical pods, need: PodFitsHost, PodFitsHostPorts and PodMatchNodeSelector.
  • InterPodAffinityMatches: checks if a pod can be scheduled on the specified node with pod affinity/anti-affinity configuration.
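
A minimal sketch of the PodFitsHostPorts check, assuming a host port can be reduced to a plain integer (the real predicate also distinguishes protocol and host IP):

package main

import "fmt"

// Simplified stand-ins: the real predicate walks pod.Spec.Containers[*].Ports
// and the ports already claimed by pods on the node.
type Pod struct{ HostPorts []int }
type NodeInfo struct{ UsedHostPorts map[int]bool }

// podFitsHostPorts mimics PodFitsHostPorts: the pod fits only if none of the
// host ports it asks for is already taken on the node.
func podFitsHostPorts(pod *Pod, node *NodeInfo) bool {
    for _, p := range pod.HostPorts {
        if node.UsedHostPorts[p] {
            return false
        }
    }
    return true
}

func main() {
    node := &NodeInfo{UsedHostPorts: map[int]bool{80: true}}
    fmt.Println(podFitsHostPorts(&Pod{HostPorts: []int{80}}, node))   // false: port 80 already taken
    fmt.Println(podFitsHostPorts(&Pod{HostPorts: []int{8080}}, node)) // true: port 8080 free
}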

node

ps. Use kubectl describe no {node-name} to view a node's status:

  • CheckNodeUnschedulablePredicate: checks if a pod can be scheduled on a node whose spec is marked Unschedulable, i.e. it checks the node's unschedulable field.
  • PodToleratesNodeTaints: checks if a pod's tolerations can tolerate the node's taints (the node taint mechanism).
  • PodToleratesNodeNoExecuteTaints: checks if a pod's tolerations can tolerate the node's NoExecute taints.
  • CheckNodeMemoryPressurePredicate (important): checks if a pod can be scheduled on a node reporting a memory pressure condition (a simplified sketch of the pressure checks follows this list).
  • CheckNodeDiskPressurePredicate (important): checks if a pod can be scheduled on a node reporting a disk pressure condition.
  • CheckNodePIDPressurePredicate: checks if a pod can be scheduled on a node reporting a pid pressure condition.
  • CheckNodeConditionPredicate: checks if a pod can be scheduled on a node reporting out of disk, network unavailable and not ready condition. Only node conditions are accounted in this predicate.
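
The pressure predicates can be pictured as a check against the node's reported conditions. The sketch below is a simplification with invented types; in particular, the real memory-pressure predicate is more lenient than shown here and only rejects best-effort pods that do not tolerate the condition.

package main

import "fmt"

// Simplified stand-ins for node conditions as shown by `kubectl describe node`.
type NodeConditionType string

const (
    MemoryPressure NodeConditionType = "MemoryPressure"
    DiskPressure   NodeConditionType = "DiskPressure"
    PIDPressure    NodeConditionType = "PIDPressure"
    Ready          NodeConditionType = "Ready"
)

type Node struct {
    // Conditions maps a condition type to its status ("True"/"False"/"Unknown").
    Conditions map[NodeConditionType]string
}

// checkNodePressure is a rough sketch of the pressure predicates above: a node
// that reports the given pressure condition as "True" is filtered out.
func checkNodePressure(node *Node, cond NodeConditionType) bool {
    return node.Conditions[cond] != "True"
}

func main() {
    node := &Node{Conditions: map[NodeConditionType]string{DiskPressure: "True", Ready: "True"}}
    fmt.Println(checkNodePressure(node, DiskPressure))   // false: filtered out
    fmt.Println(checkNodePressure(node, MemoryPressure)) // true: passes
}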

Priorities (scoring)

ps. Source: kubernetes-master\pkg\scheduler\algorithm\priorities

ResourceAllocationPriority

// ResourceAllocationPriority contains information to calculate resource allocation priority.
type ResourceAllocationPriority struct {
    Name   string
    scorer func(requested, allocable *schedulercache.Resource, includeVolumes bool, requestedVolumes int, allocatableVolumes int) int64
}

// PriorityMap priorities nodes according to the resource allocations on the node.
// It will use `scorer` function to calculate the score.
func (r *ResourceAllocationPriority) PriorityMap(
    pod *v1.Pod,
    meta interface{},
    nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error)

  • balancedResourceScorer (important): favors nodes with a balanced resource usage rate.
    It should NOT be used alone, and MUST be used together with LeastRequestedPriority. It calculates the difference between the cpu and memory fraction of capacity, and prioritizes the host based on how close the two metrics are to each other.
    Formula: 10 - variance(cpuFraction, memoryFraction, volumeFraction) * 10
    Prefers the node whose usage is most balanced across resources (a simplified sketch of these scorers follows this list).
  • leastResourceScorer (important): favors nodes with fewer requested resources. It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes based on the minimum of the average of the fraction of requested to capacity.
    Formula: (cpu((capacity - sum(requested)) * 10 / capacity) + memory((capacity - sum(requested)) * 10 / capacity)) / 2
    Prefers the most idle node.
  • mostResourceScorer: favors nodes with the most requested resources. It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes based on the maximum of the average of the fraction of requested to capacity.
    Formula: (cpu(10 * sum(requested) / capacity) + memory(10 * sum(requested) / capacity)) / 2
    Tries to fill up a node's resources as much as possible (bin packing).
  • requested_to_capacity_ratio: assigns 1.0 to resource when all capacity is available and 0.0 when requested amount is equal to capacity.
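
The three scorers can be written out directly from the formulas above. The sketch below uses plain int64 values (millicores for CPU, bytes for memory), ignores the volume term, and uses the simpler two-resource form of the balanced score (10 - |cpuFraction - memoryFraction| * 10) instead of the full variance formula; it is a re-derivation for illustration, not the actual source.

package main

import (
    "fmt"
    "math"
)

// Resource is a simplified stand-in holding resource amounts,
// e.g. CPU in millicores and memory in bytes.
type Resource struct {
    MilliCPU int64
    Memory   int64
}

// fraction returns requested/capacity clamped to [0,1]; 1 when capacity is 0.
func fraction(requested, capacity int64) float64 {
    if capacity == 0 {
        return 1
    }
    f := float64(requested) / float64(capacity)
    if f > 1 {
        f = 1
    }
    return f
}

// leastRequestedScore: (capacity - requested) * 10 / capacity, averaged over CPU and memory.
func leastRequestedScore(requested, allocatable Resource) int64 {
    cpu := (1 - fraction(requested.MilliCPU, allocatable.MilliCPU)) * 10
    mem := (1 - fraction(requested.Memory, allocatable.Memory)) * 10
    return int64((cpu + mem) / 2)
}

// mostRequestedScore: requested * 10 / capacity, averaged over CPU and memory.
func mostRequestedScore(requested, allocatable Resource) int64 {
    cpu := fraction(requested.MilliCPU, allocatable.MilliCPU) * 10
    mem := fraction(requested.Memory, allocatable.Memory) * 10
    return int64((cpu + mem) / 2)
}

// balancedResourceScore: 10 - |cpuFraction - memoryFraction| * 10
// (the two-resource special case of the variance formula).
func balancedResourceScore(requested, allocatable Resource) int64 {
    cpu := fraction(requested.MilliCPU, allocatable.MilliCPU)
    mem := fraction(requested.Memory, allocatable.Memory)
    return int64((1 - math.Abs(cpu-mem)) * 10)
}

func main() {
    req := Resource{MilliCPU: 1000, Memory: 2 << 30}
    alloc := Resource{MilliCPU: 4000, Memory: 8 << 30}
    // Both fractions are 0.25, so: least=7, most=2, balanced=10.
    fmt.Println(leastRequestedScore(req, alloc), mostRequestedScore(req, alloc), balancedResourceScore(req, alloc))
}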

image_locality (important)

favors nodes that already have the requested pod's container images. It will detect whether the requested images are present on a node, and then calculate a score ranging from 0 to 10 based on the total size of those images (a simplified sketch follows the list below).

  • If none of the images are present, this node will be given the lowest priority.
  • If some of the images are present on a node, the larger their sizes' sum, the higher the node's priority.
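
A minimal sketch of the size-to-score mapping; the min/max thresholds here are illustrative assumptions, not the exact constants used by the scheduler.

package main

import "fmt"

const (
    minImageSize int64 = 23 * 1024 * 1024   // below this, locality is ignored (illustrative threshold)
    maxImageSize int64 = 1000 * 1024 * 1024 // above this, the score is capped (illustrative threshold)
)

// imageLocalityScore maps the total size of the requested images already
// present on the node to a 0-10 score: nothing present scores 0, and the
// larger the sum of present image sizes, the higher the score.
func imageLocalityScore(sumSize int64) int64 {
    if sumSize < minImageSize {
        return 0
    }
    if sumSize > maxImageSize {
        return 10
    }
    return 10 * (sumSize - minImageSize) / (maxImageSize - minImageSize)
}

func main() {
    fmt.Println(imageLocalityScore(0))                  // 0: no image present
    fmt.Println(imageLocalityScore(500 * 1024 * 1024))  // roughly the middle of the range
    fmt.Println(imageLocalityScore(2000 * 1024 * 1024)) // 10: capped
}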

interpod_affinity (important)

computes a sum by iterating through the elements of weightedPodAffinityTerm and adding "weight" to the sum if the corresponding PodAffinityTerm is satisfied for that node; the node(s) with the highest sum are the most preferred.
Symmetry needs to be considered for preferredDuringSchedulingIgnoredDuringExecution from podAffinity & podAntiAffinity,
and symmetry needs to be considered for hard requirements from podAffinity.

node_affinity (important)

scores nodes according to the scheduling preferences indicated in PreferredDuringSchedulingIgnoredDuringExecution. Each time a node matches a preferredSchedulingTerm, it gets preferredSchedulingTerm.Weight added to its score. Thus, the more preferredSchedulingTerms the node satisfies and the higher the weights of the satisfied terms, the higher the node's score. A simplified sketch follows.
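
A minimal sketch of the weight summing, assuming each preferred term can be reduced to a single key=value label requirement (the real terms carry full node selector expressions):

package main

import "fmt"

// Simplified stand-in for a preferredSchedulingTerm: a weight plus a label
// selector reduced to a single key=value requirement for the sketch.
type PreferredSchedulingTerm struct {
    Weight     int32
    LabelKey   string
    LabelValue string
}

// nodeAffinityScore sums the weights of all preferred terms the node's labels
// satisfy; the scheduler later normalizes these sums to the 0-10 range.
func nodeAffinityScore(nodeLabels map[string]string, terms []PreferredSchedulingTerm) int32 {
    var score int32
    for _, t := range terms {
        if nodeLabels[t.LabelKey] == t.LabelValue {
            score += t.Weight
        }
    }
    return score
}

func main() {
    terms := []PreferredSchedulingTerm{
        {Weight: 80, LabelKey: "disktype", LabelValue: "ssd"},
        {Weight: 20, LabelKey: "zone", LabelValue: "cn-north-1a"},
    }
    fmt.Println(nodeAffinityScore(map[string]string{"disktype": "ssd"}, terms)) // 80
}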

node_label

checks whether a particular label exists on a node or not, regardless of its value.
If presence is true, prioritizes nodes that have the specified label, regardless of value.
If presence is false, prioritizes nodes that do not have the specified label.

node_prefer_avoid_pods

prioritizes nodes according to the node annotation "scheduler.alpha.kubernetes.io/preferAvoidPods".

selector_spreading (important)

  • SelectorSpreadPriority: spreads pods across hosts, considering pods belonging to the same service, RC, RS or StatefulSet (a simplified sketch follows this list).
    When a pod is scheduled, it looks for services, RCs, RSs and StatefulSets that match the pod, then finds existing pods that match those selectors.
    It favors nodes that have fewer existing matching pods.
    i.e. it pushes the scheduler towards a node where there's the smallest number of pods which match the same service, RC, RS or StatefulSet selectors as the pod being scheduled.
  • ServiceAntiAffinityPriority: spreads pods by minimizing the number of pods belonging to the same service on a given machine.
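
A minimal sketch of the spreading score, assuming the count of matching pods per node has already been computed (the real priority also factors in zone-level spreading):

package main

import "fmt"

// selectorSpreadScores gives each node a 0-10 score that is higher when the
// node runs fewer pods matching the same service/RC/RS/StatefulSet selectors;
// matchCounts maps node name -> number of matching pods already on that node.
func selectorSpreadScores(matchCounts map[string]int) map[string]int {
    maxCount := 0
    for _, c := range matchCounts {
        if c > maxCount {
            maxCount = c
        }
    }
    scores := make(map[string]int, len(matchCounts))
    for node, c := range matchCounts {
        if maxCount == 0 {
            scores[node] = 10 // no matching pods anywhere: every node is equally good
            continue
        }
        scores[node] = 10 * (maxCount - c) / maxCount
    }
    return scores
}

func main() {
    fmt.Println(selectorSpreadScores(map[string]int{"node-a": 3, "node-b": 1, "node-c": 0}))
    // node-a: 0, node-b: 6, node-c: 10
}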

taint_toleration

prepares the priority list for all nodes based on the number of intolerable taints on each node. See taint-and-toleration for details. A simplified sketch follows.
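
A minimal sketch of counting intolerable taints, assuming taints and tolerations can be reduced to key/value/effect triples and that only PreferNoSchedule taints contribute to the score (hard NoSchedule/NoExecute taints are handled by the predicates above):

package main

import "fmt"

// Simplified stand-ins: a taint or toleration is reduced to key=value plus an
// effect; the real matching also handles operators such as Exists.
type Taint struct{ Key, Value, Effect string }
type Toleration struct{ Key, Value, Effect string }

// tolerates reports whether a single toleration covers a single taint.
func tolerates(tol Toleration, taint Taint) bool {
    return tol.Key == taint.Key && tol.Value == taint.Value &&
        (tol.Effect == "" || tol.Effect == taint.Effect)
}

// intolerableTaints counts the node's PreferNoSchedule taints that the pod does
// not tolerate; the priority then favors nodes where this count is lowest.
func intolerableTaints(taints []Taint, tolerations []Toleration) int {
    count := 0
    for _, taint := range taints {
        if taint.Effect != "PreferNoSchedule" {
            continue // hard effects are filtered out by the taint predicates
        }
        tolerated := false
        for _, tol := range tolerations {
            if tolerates(tol, taint) {
                tolerated = true
                break
            }
        }
        if !tolerated {
            count++
        }
    }
    return count
}

func main() {
    taints := []Taint{{Key: "dedicated", Value: "gpu", Effect: "PreferNoSchedule"}}
    fmt.Println(intolerableTaints(taints, nil))                                            // 1
    fmt.Println(intolerableTaints(taints, []Toleration{{Key: "dedicated", Value: "gpu"}})) // 0
}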
