1. Preface
Please credit the original source when reposting; respect the author's work!
Source code: https://github.com/nicktming/kubernetes
Branch: tming-v1.13 (based on v1.13)
This article analyzes how kube-scheduler achieves high availability. In k8s, kube-scheduler's high availability is implemented with leaderElection; see [k8s源碼分析][client-go] k8s選舉leaderelection (分布式資源鎖實現(xiàn)) for the details. For schedulers sharing the same schedulerName, no matter how many instances are started, only one of them can be the leader, and only that leader actually provides service; the other contenders simply keep waiting.
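To make the mechanism concrete, here is a minimal, self-contained sketch that competes for an endpoints lock the same way kube-scheduler does. It is not kube-scheduler code: the lock name kube-system/leader-demo, the component name and the kubeconfig handling are made up for this example, and the API calls follow the v1.13-era client-go this series is based on.

package main

import (
	"context"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/uuid"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/client-go/tools/record"
	"k8s.io/klog"
)

func main() {
	// Kubeconfig path taken from $KUBECONFIG (assumption for this sketch).
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Identity: hostname plus a uuid, the same scheme kube-scheduler uses.
	hostname, _ := os.Hostname()
	id := hostname + "_" + string(uuid.NewUUID())

	// The endpoints lock records an event on leadership changes, so wire up a recorder.
	broadcaster := record.NewBroadcaster()
	recorder := broadcaster.NewRecorder(scheme.Scheme, corev1.EventSource{Component: "leader-demo"})

	// The lock object: an endpoints named leader-demo in kube-system,
	// analogous to kube-system/kube-scheduler used by kube-scheduler.
	lock, err := resourcelock.New(resourcelock.EndpointsResourceLock,
		"kube-system", "leader-demo",
		client.CoreV1(),
		resourcelock.ResourceLockConfig{Identity: id, EventRecorder: recorder})
	if err != nil {
		klog.Fatal(err)
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Only the current leader gets here; kube-scheduler runs sched.Run() at this point.
				klog.Infof("%s: I am the leader, doing work", id)
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				klog.Infof("%s: lost leadership", id)
			},
		},
	})
}

If you start this program twice (or on two machines), only one instance logs "I am the leader"; kill it and the other takes over, which is exactly the behaviour demonstrated in the example below.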
2. Example
For setting up a k8s environment, see k8s源碼編譯以及二進制安裝(用于源碼開發(fā)調(diào)試版).
2.1 Initial state
Because the scheduler on master(172.21.0.16) started first, it is the leader. Although a scheduler is installed on both machines, only the leader provides service; the other one (the scheduler on worker(172.21.0.12)) just waits and never actually runs its own logic.
These two schedulers compete for the same resource, kube-system/kube-scheduler, that is, the endpoints named kube-scheduler in the kube-system namespace shown below.
[figure: example1.png]
[root@master kubectl]# ./kubectl get endpoints -n kube-system
NAME                      ENDPOINTS   AGE
kube-controller-manager   <none>      42h
kube-scheduler            <none>      42h
[root@master kubectl]#
[root@master kubectl]# ./kubectl get endpoints kube-scheduler -o yaml -n kube-system
apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"master_74cc3de4-f0be-11e9-9232-525400d54f7e","leaseDurationSeconds":15,"acquireTime":"2019-10-17T09:14:19Z","renewTime":"2019-10-17T09:41:41Z","leaderTransitions":5}'
  creationTimestamp: "2019-10-15T14:56:55Z"
  name: kube-scheduler
  namespace: kube-system
  resourceVersion: "59633"
  selfLink: /api/v1/namespaces/kube-system/endpoints/kube-scheduler
  uid: 0786d7b7-ef5c-11e9-af01-525400d54f7e
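The same record can be read programmatically: the leader information is just JSON stored in the control-plane.alpha.kubernetes.io/leader annotation of that endpoints object, and client-go exports both the annotation key and the record type. Below is a small sketch, not part of kube-scheduler; the calls follow the v1.13-era client-go, where Get does not take a context.

package main

import (
	"encoding/json"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	// Kubeconfig path taken from $KUBECONFIG (assumption for this sketch).
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The lock object kube-scheduler competes for: kube-system/kube-scheduler.
	ep, err := client.CoreV1().Endpoints("kube-system").Get("kube-scheduler", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// LeaderElectionRecordAnnotationKey == "control-plane.alpha.kubernetes.io/leader"
	raw := ep.Annotations[resourcelock.LeaderElectionRecordAnnotationKey]

	var record resourcelock.LeaderElectionRecord
	if err := json.Unmarshal([]byte(raw), &record); err != nil {
		panic(err)
	}
	fmt.Printf("holder: %s, acquired: %v, renewed: %v, transitions: %d\n",
		record.HolderIdentity, record.AcquireTime, record.RenewTime, record.LeaderTransitions)
}

This prints the same data kubectl shows above, just decoded from the annotation.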
2.2 Shutting down the leader
Now shut down the kube-scheduler on master(172.21.0.16). Only one scheduler is left, so the one on worker(172.21.0.12) becomes the new leader and starts providing service.
[figure: example2.png]
Looking at the endpoints in k8s again, holderIdentity has changed from master_74cc3de4-f0be-11e9-9232-525400d54f7e to worker_f6134651-f0bf-11e9-a387-5254009b5271.
[root@master kubectl]# ./kubectl get endpoints kube-scheduler -o yaml -n kube-system
apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"worker_f6134651-f0bf-11e9-a387-5254009b5271","leaseDurationSeconds":15,"acquireTime":"2019-10-17T09:42:11Z","renewTime":"2019-10-17T09:42:13Z","leaderTransitions":6}'
  creationTimestamp: "2019-10-15T14:56:55Z"
  name: kube-scheduler
  namespace: kube-system
  resourceVersion: "59667"
  selfLink: /api/v1/namespaces/kube-system/endpoints/kube-scheduler
  uid: 0786d7b7-ef5c-11e9-af01-525400d54f7e
[root@master kubectl]#
The log of the scheduler on worker(172.21.0.12) contains successfully acquired lease kube-system/kube-scheduler: it had been retrying ever since it started, and it could only take over once the old leader stopped renewing the lease.
[root@worker scheduler]# cat config.txt
./kube-scheduler --master=http://172.21.0.16:8080
[root@worker scheduler]# ./kube-scheduler --master=http://172.21.0.16:8080
...
I1017 17:24:47.941202 32277 leaderelection.go:205] attempting to acquire leader lease kube-system/kube-scheduler...
I1017 17:42:11.815383 32277 leaderelection.go:214] successfully acquired lease kube-system/kube-scheduler
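Why doesn't the standby take over immediately? The lock is lease-based: the standby retries every RetryPeriod (2s by default) but may only grab the lock after the old holder has stopped renewing it for longer than LeaseDuration (15s by default). The following toy, purely in-process simulation is a sketch of that idea only: it does not use client-go or talk to any cluster, and the names and millisecond durations are made up, scaled down from the real defaults.

package main

import (
	"fmt"
	"sync"
	"time"
)

const (
	leaseDuration = 150 * time.Millisecond // stands in for the real 15s
	retryPeriod   = 20 * time.Millisecond  // stands in for the real 2s
)

type lease struct {
	mu        sync.Mutex
	holder    string
	renewTime time.Time
}

// tryAcquireOrRenew succeeds if the candidate already holds the lease, the lease
// is free, or the current holder has not renewed within leaseDuration.
func (l *lease) tryAcquireOrRenew(id string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.holder == id || l.holder == "" || time.Since(l.renewTime) > leaseDuration {
		l.holder = id
		l.renewTime = time.Now()
		return true
	}
	return false
}

// runCandidate keeps trying to acquire (or renew) the lease every retryPeriod.
func runCandidate(l *lease, id string, stop <-chan struct{}) {
	leading := false
	for {
		select {
		case <-stop:
			fmt.Printf("%s: shutting down\n", id)
			return
		case <-time.After(retryPeriod):
			if l.tryAcquireOrRenew(id) && !leading {
				fmt.Printf("%s: successfully acquired lease\n", id)
				leading = true
			}
		}
	}
}

func main() {
	l := &lease{}
	stopMaster := make(chan struct{})
	stopWorker := make(chan struct{})

	go runCandidate(l, "master", stopMaster)
	time.Sleep(50 * time.Millisecond) // master starts first and becomes leader
	go runCandidate(l, "worker", stopWorker)

	time.Sleep(300 * time.Millisecond)
	close(stopMaster) // "kill" the master's scheduler, as in 2.2

	time.Sleep(500 * time.Millisecond) // worker takes over after the lease expires
	close(stopWorker)
	time.Sleep(50 * time.Millisecond)
}

Running it prints master acquiring the lease first, then worker acquiring it roughly one (scaled) LeaseDuration after the master is stopped, mirroring the log above.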
2.3 Starting a custom scheduler
Now I start a my-scheduler on the master(172.21.0.16) node. For how to start a custom scheduler, see [k8s源碼分析][kube-scheduler]scheduler之自定義調(diào)度器(1).
[figure: example3.png]
[root@master kubectl]# ./kubectl get endpoints -n kube-system
NAME                      ENDPOINTS   AGE
kube-controller-manager   <none>      42h
kube-scheduler            <none>      42h
my-scheduler              <none>      7s
[root@master kubectl]# ./kubectl get endpoints my-scheduler -o yaml -n kube-system
apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"master_1dd3cdbe-f0c3-11e9-985f-525400d54f7e","leaseDurationSeconds":15,"acquireTime":"2019-10-17T09:47:23Z","renewTime":"2019-10-17T09:47:45Z","leaderTransitions":0}'
  creationTimestamp: "2019-10-17T09:47:23Z"
  name: my-scheduler
  namespace: kube-system
  resourceVersion: "60119"
  selfLink: /api/v1/namespaces/kube-system/endpoints/my-scheduler
  uid: 1e6d5569-f0c3-11e9-b23b-525400d54f7e
[root@master kubectl]#
You can see a new my-scheduler entry among the endpoints. However, because these two schedulers compete for different resources, each is the leader of its own resource and both provide service: my-scheduler assigns nodes to pods with schedulerName=my-scheduler, while default-scheduler assigns nodes to pods that use the default scheduler.
[root@master kubectl]# cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - name: podtest
    image: nginx
    ports:
    - containerPort: 80
[root@master kubectl]# cat pod-scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-schduler
spec:
  schedulerName: my-scheduler
  containers:
  - name: podtest-scheduler
    image: nginx
    ports:
    - containerPort: 80
[root@master kubectl]# ./kubectl get pods
No resources found.
[root@master kubectl]# ./kubectl apply -f pod.yaml
pod/test created
[root@master kubectl]# ./kubectl apply -f pod-scheduler.yaml
pod/test-schduler created
[root@master kubectl]# ./kubectl get pods
NAME            READY   STATUS    RESTARTS   AGE
test            1/1     Running   0          3m3s
test-schduler   1/1     Running   0          2m55s
[root@master kubectl]# ./kubectl get pod test-schduler -o yaml | grep schedulerName
{"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"test-schduler","namespace":"default"},"spec":{"containers":[{"image":"nginx","name":"podtest-scheduler","ports":[{"containerPort":80}]}],"schedulerName":"my-scheduler"}}
schedulerName: my-scheduler
[root@master kubectl]# ./kubectl get pod test -o yaml | grep schedulerName
schedulerName: default-scheduler
[root@master kubectl]#
If you need to run multiple schedulers on the same machine, you also have to change the healthz and metrics ports. I won't test that here, since the result above is already clear: default-scheduler and my-scheduler each schedule their own pods.
3. Source code analysis
3.1 Structs and default values
This is the configuration related to leaderElection. The fields worth noting:
LockObjectNamespace: the namespace of the lock object.
LockObjectName: the name of the lock object.
ResourceLock: the type of resource used as the lock; leaderElection currently supports three resource types: endpoints, configmaps and leases.
LeaderElect: whether high availability (leader election) is enabled.
// pkg/scheduler/apis/config/types.go
type KubeSchedulerConfiguration struct {
    ...
    LeaderElection KubeSchedulerLeaderElectionConfiguration
    ...
}

type KubeSchedulerLeaderElectionConfiguration struct {
    apiserverconfig.LeaderElectionConfiguration
    // LockObjectNamespace defines the namespace of the lock object
    LockObjectNamespace string
    // LockObjectName defines the lock object name
    LockObjectName string
}

type LeaderElectionConfiguration struct {
    LeaderElect   bool
    LeaseDuration metav1.Duration
    RenewDeadline metav1.Duration
    RetryPeriod   metav1.Duration
    ResourceLock  string
}
These settings can be provided directly through the configuration file, as already shown in [k8s源碼分析][kube-scheduler]scheduler之自定義調(diào)度器(2). But even when none of them are configured, high availability is still enabled, and as the example above shows, the default scheduler still creates the kube-system/kube-scheduler endpoints.
So let's look at the system defaults for these settings.
// pkg/scheduler/apis/config/v1alpha1/defaults.go
func SetDefaults_KubeSchedulerConfiguration(obj *kubescedulerconfigv1alpha1.KubeSchedulerConfiguration) {
    ...
    if len(obj.LeaderElection.LockObjectNamespace) == 0 {
        // obj.LeaderElection.LockObjectNamespace = kube-system
        obj.LeaderElection.LockObjectNamespace = kubescedulerconfigv1alpha1.SchedulerDefaultLockObjectNamespace
    }
    if len(obj.LeaderElection.LockObjectName) == 0 {
        // obj.LeaderElection.LockObjectName = kube-scheduler
        obj.LeaderElection.LockObjectName = kubescedulerconfigv1alpha1.SchedulerDefaultLockObjectName
    }
    ...
}
// k8s.io/apiserver/pkg/apis/config/v1alpha1/defaults.go
func RecommendedDefaultLeaderElectionConfiguration(obj *LeaderElectionConfiguration) {
    zero := metav1.Duration{}
    if obj.LeaseDuration == zero {
        obj.LeaseDuration = metav1.Duration{Duration: 15 * time.Second}
    }
    if obj.RenewDeadline == zero {
        obj.RenewDeadline = metav1.Duration{Duration: 10 * time.Second}
    }
    if obj.RetryPeriod == zero {
        obj.RetryPeriod = metav1.Duration{Duration: 2 * time.Second}
    }
    if obj.ResourceLock == "" {
        obj.ResourceLock = EndpointsResourceLock
    }
    if obj.LeaderElect == nil {
        obj.LeaderElect = utilpointer.BoolPtr(true)
    }
}
So when nothing is configured, the effective defaults are:
LockObjectNamespace = "kube-system"
LockObjectName = "kube-scheduler"
ResourceLock = "endpoints"
LeaderElect = true
3.2 Flow
The startup flow has already been analyzed in [k8s源碼分析][kube-scheduler]scheduler之啟動run(1), so here we only look at the parts related to leaderElection.
// cmd/kube-scheduler/app/options/options.go
func (o *Options) Config() (*schedulerappconfig.Config, error) {
    ...
    // Set up leader election if enabled.
    var leaderElectionConfig *leaderelection.LeaderElectionConfig
    // LeaderElect defaults to true, so unless the user explicitly sets it to false
    // this branch is taken; in other words kube-scheduler enables HA by default.
    if c.ComponentConfig.LeaderElection.LeaderElect {
        leaderElectionConfig, err = makeLeaderElectionConfig(c.ComponentConfig.LeaderElection, leaderElectionClient, recorder)
        if err != nil {
            return nil, err
        }
    }
    ...
    c.LeaderElection = leaderElectionConfig
    ...
}
func makeLeaderElectionConfig(config kubeschedulerconfig.KubeSchedulerLeaderElectionConfiguration, client clientset.Interface, recorder record.EventRecorder) (*leaderelection.LeaderElectionConfig, error) {
    hostname, err := os.Hostname()
    if err != nil {
        return nil, fmt.Errorf("unable to get hostname: %v", err)
    }
    // add a uniquifier so that two processes on the same host don't accidentally both become active
    id := hostname + "_" + string(uuid.NewUUID())

    rl, err := resourcelock.New(config.ResourceLock,
        config.LockObjectNamespace,
        config.LockObjectName,
        client.CoreV1(),
        resourcelock.ResourceLockConfig{
            Identity:      id,
            EventRecorder: recorder,
        })
    if err != nil {
        return nil, fmt.Errorf("couldn't create resource lock: %v", err)
    }

    return &leaderelection.LeaderElectionConfig{
        Lock:          rl,
        LeaseDuration: config.LeaseDuration.Duration,
        RenewDeadline: config.RenewDeadline.Duration,
        RetryPeriod:   config.RetryPeriod.Duration,
        WatchDog:      leaderelection.NewLeaderHealthzAdaptor(time.Second * 20),
        Name:          "kube-scheduler",
    }, nil
}
Here you can see that id is the hostname concatenated with a uuid, and that a LeaderElectionConfig object is built from it. All of this was analyzed in detail in [k8s源碼分析][client-go] k8s選舉leaderelection (分布式資源鎖實現(xiàn)).
Finally, let's look at how it is run.
// cmd/kube-scheduler/app/server.go
func Run(cc schedulerserverconfig.CompletedConfig, stopCh <-chan struct{}) error {
    ...
    // Prepare a reusable runCommand function.
    run := func(ctx context.Context) {
        sched.Run()
        <-ctx.Done()
    }

    ctx, cancel := context.WithCancel(context.TODO()) // TODO once Run() accepts a context, it should be used here
    defer cancel()

    go func() {
        select {
        case <-stopCh:
            cancel()
        case <-ctx.Done():
        }
    }()

    // If leader election is enabled, runCommand via LeaderElector until done and exit.
    // This is where high availability starts.
    if cc.LeaderElection != nil {
        cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
            // run is invoked once this instance becomes the leader
            OnStartedLeading: run,
            OnStoppedLeading: func() {
                utilruntime.HandleError(fmt.Errorf("lost master"))
            },
        }
        leaderElector, err := leaderelection.NewLeaderElector(*cc.LeaderElection)
        if err != nil {
            return fmt.Errorf("couldn't create leader elector: %v", err)
        }

        leaderElector.Run(ctx)

        return fmt.Errorf("lost lease")
    }

    // Leader election is disabled, so runCommand inline until done.
    run(ctx)
    return fmt.Errorf("finished without leader elect")
}
If leader election is enabled, the callbacks are configured first so that run is invoked once this client acquires the leadership; then a leaderElector instance is created and its Run method is called to compete for the leadership.
All of this has already been covered in detail in [k8s源碼分析][client-go] k8s選舉leaderelection (分布式資源鎖實現(xiàn)), so I won't go over it again here.