Job is the fundamental object for high-performance workloads, and this document describes how a Job is defined in Volcano. The definition of Job is similar to that of a Kubernetes Job, e.g. it has Status and Spec; the main features of Job are described below.
Multiple Pod Template
Since most high-performance workloads include different types of tasks, e.g. TensorFlow (PS/worker) or Spark (driver/executor), Job supports multiple pod templates through taskSpecs, defined as follows. Policies will be described in the Error Handling section.
// JobSpec describes how the job execution will look like and when it will actually run
type JobSpec struct {
	...
	// Tasks specifies the task specification of Job
	// +optional
	Tasks []TaskSpec `json:"tasks,omitempty" protobuf:"bytes,5,opt,name=tasks"`
}

// TaskSpec specifies the task specification of Job
type TaskSpec struct {
	// Name specifies the name of task
	Name string `json:"name,omitempty" protobuf:"bytes,1,opt,name=name"`

	// Replicas specifies the replicas of this TaskSpec in Job
	Replicas int32 `json:"replicas,omitempty" protobuf:"bytes,2,opt,name=replicas"`

	// Specifies the pod that will be created for this TaskSpec
	// when executing a Job
	Template v1.PodTemplateSpec `json:"template,omitempty" protobuf:"bytes,3,opt,name=template"`

	// Specifies the lifecycle of tasks
	// +optional
	Policies []LifecyclePolicy `json:"policies,omitempty" protobuf:"bytes,4,opt,name=policies"`
}
JobController creates pods based on the templates and replicas in spec.tasks; the controlled OwnerReference of each pod is set to the Job. The following is a YAML example with multiple pod templates.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tf-job
spec:
  tasks:
  - name: "ps"
    replicas: 2
    template:
      spec:
        containers:
        - name: ps
          image: ps-img
  - name: "worker"
    replicas: 5
    template:
      spec:
        containers:
        - name: worker
          image: worker-img
Job Input/Output
Most high-performance workloads read and write data; the following shows how a Job declares its input/output.
type VolumeSpec struct {
	MountPath string `json:"mountPath" protobuf:"bytes,1,opt,name=mountPath"`

	// VolumeClaimName defines the name of the PVC
	// +optional
	VolumeClaimName string `json:"volumeClaimName,omitempty" protobuf:"bytes,2,opt,name=volumeClaimName"`

	// VolumeClaim defines the PVC used by the VolumeSpec.
	// +optional
	VolumeClaim *PersistentVolumeClaim `json:"claim,omitempty" protobuf:"bytes,3,opt,name=claim"`
}

type JobSpec struct {
	...
	// The volumes to mount on the Job
	// +optional
	Volumes []VolumeSpec `json:"volumes,omitempty" protobuf:"bytes,1,opt,name=volumes"`
}
The Volumes of a Job can be nil, which means users manage the data themselves. If *VolumeSpec.volumeClaim is nil, and *VolumeSpec.volumeClaimName is nil or does not match an existing PersistentVolumeClaim, an emptyDir volume is used for each Task/Pod.
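As a minimal sketch of how these fields fit together (the job name io-job, the PVC name my-pvc, and the mount path below are hypothetical), a Job mounting an existing PVC could look like:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: io-job
spec:
  volumes:
  # Mounts the existing PVC "my-pvc"; if no claim were given and the
  # name did not match a PVC, an emptyDir would be used per Task/Pod.
  - mountPath: "/data"
    volumeClaimName: "my-pvc"
  tasks:
  - name: "worker"
    replicas: 2
    template:
      spec:
        containers:
        - name: worker
          image: worker-img
```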
Conditions and Phases
The following phases are introduced to give a simple, high-level summary of where the Job is in its lifecycle; the conditions array, together with the reason and message fields, contains more detail about the job's status.
type JobPhase string

const (
	// Pending is the phase that job is pending in the queue, waiting for scheduling decision
	Pending JobPhase = "Pending"
	// Aborting is the phase that job is aborted, waiting for releasing pods
	Aborting JobPhase = "Aborting"
	// Aborted is the phase that job is aborted by user or error handling
	Aborted JobPhase = "Aborted"
	// Running is the phase that minimal available tasks of Job are running
	Running JobPhase = "Running"
	// Restarting is the phase that the Job is restarted, waiting for pod releasing and recreating
	Restarting JobPhase = "Restarting"
	// Completed is the phase that all tasks of Job are completed successfully
	Completed JobPhase = "Completed"
	// Terminating is the phase that the Job is terminated, waiting for releasing pods
	Terminating JobPhase = "Terminating"
	// Terminated is the phase that the job finished unexpectedly, e.g. because of events
	Terminated JobPhase = "Terminated"
)
// JobState contains details for the current state of the job.
type JobState struct {
	// The phase of Job.
	// +optional
	Phase JobPhase `json:"phase,omitempty" protobuf:"bytes,1,opt,name=phase"`

	// Unique, one-word, CamelCase reason for the phase's last transition.
	// +optional
	Reason string `json:"reason,omitempty" protobuf:"bytes,2,opt,name=reason"`

	// Human-readable message indicating details about last transition.
	// +optional
	Message string `json:"message,omitempty" protobuf:"bytes,3,opt,name=message"`
}

// JobStatus represents the current state of a Job
type JobStatus struct {
	// Current state of Job.
	State JobState `json:"state,omitempty" protobuf:"bytes,1,opt,name=state"`
	...
}
The following table shows the available transitions between phases; the phase cannot transition to the target phase if the cell is empty. Restarting, Aborting and Terminating are temporary states introduced to avoid race conditions: e.g. a TerminateJobAction produces several PodEvictedEvents, and these should not be handled again.
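As a minimal sketch of this distinction (the helper name isTransient and the plain string phases are illustrative, not part of the API), an event handler could ignore repeated events while the job sits in one of the temporary states:

```go
package main

import "fmt"

type JobPhase string

// isTransient reports whether a phase is one of the temporary states
// (Restarting, Aborting, Terminating) during which repeated events,
// e.g. PodEvicted caused by the ongoing action itself, should be ignored.
func isTransient(p JobPhase) bool {
	switch p {
	case "Restarting", "Aborting", "Terminating":
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(isTransient("Restarting")) // true
	fmt.Println(isTransient("Running"))    // false
}
```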
Error Handling
After a Job is created, related events occur, e.g. Pod succeeded or Pod failed. LifecyclePolicy therefore handles the different events based on the user's configuration.
// Event is the type of Event related to the Job
type Event string

const (
	// AllEvents means all events
	AllEvents Event = "*"
	// PodFailedEvent is triggered if a Pod failed
	PodFailedEvent Event = "PodFailed"
	// PodEvictedEvent is triggered if a Pod was deleted
	PodEvictedEvent Event = "PodEvicted"
	// JobUnknownEvent covers several events that can lead the job to 'Unknown',
	// e.g. a task being unschedulable: triggered when some pods can't be
	// scheduled while others are already running in the gang-scheduling case.
	JobUnknownEvent Event = "Unknown"
	// OutOfSyncEvent is triggered if the Pod/Job was updated
	OutOfSyncEvent Event = "OutOfSync"
	// CommandIssuedEvent is triggered if a command is issued by the user
	CommandIssuedEvent Event = "CommandIssued"
	// TaskCompletedEvent is triggered if the 'Replicas' amount of pods in one task succeeded
	TaskCompletedEvent Event = "TaskCompleted"
)
// Action is the type of event handling
type Action string

const (
	// AbortJobAction if this action is set, the whole job will be aborted:
	// all Pods of the Job will be evicted, and no Pod will be recreated
	AbortJobAction Action = "AbortJob"
	// RestartJobAction if this action is set, the whole job will be restarted
	RestartJobAction Action = "RestartJob"
	// TerminateJobAction if this action is set, the whole job will be terminated
	// and can not be resumed: all Pods of the Job will be evicted, and no Pod will be recreated.
	TerminateJobAction Action = "TerminateJob"
	// CompleteJobAction if this action is set, the unfinished pods will be killed and the job completed.
	CompleteJobAction Action = "CompleteJob"
	// ResumeJobAction is the action to resume an aborted job.
	ResumeJobAction Action = "ResumeJob"
	// SyncJobAction is the action to sync Job/Pod status.
	SyncJobAction Action = "SyncJob"
)
// LifecyclePolicy specifies the lifecycle and error handling of task and job.
type LifecyclePolicy struct {
	Event   Event            `json:"event,omitempty" protobuf:"bytes,1,opt,name=event"`
	Action  Action           `json:"action,omitempty" protobuf:"bytes,2,opt,name=action"`
	Timeout *metav1.Duration `json:"timeout,omitempty" protobuf:"bytes,3,opt,name=timeout"`
}
Both JobSpec and TaskSpec include lifecycle policies: the policies in JobSpec are the defaults when those in TaskSpec are empty, and the policies in TaskSpec override the defaults.
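This fallback can be sketched as follows; effectivePolicies is a hypothetical helper, not part of the Volcano API, and the types are trimmed stand-ins for the structs in this document:

```go
package main

import "fmt"

type Event string
type Action string

// LifecyclePolicy trimmed to the fields needed here.
type LifecyclePolicy struct {
	Event  Event
	Action Action
}

// effectivePolicies returns the task-level policies when present and
// falls back to the job-level defaults otherwise.
func effectivePolicies(jobPolicies, taskPolicies []LifecyclePolicy) []LifecyclePolicy {
	if len(taskPolicies) > 0 {
		return taskPolicies
	}
	return jobPolicies
}

func main() {
	jobDefaults := []LifecyclePolicy{{Event: "*", Action: "RestartJob"}}
	driver := []LifecyclePolicy{{Event: "PodFailed", Action: "AbortJob"}}

	// The driver task overrides the defaults; a task with no
	// policies inherits the job-level ones.
	fmt.Println(effectivePolicies(jobDefaults, driver)[0].Action) // AbortJob
	fmt.Println(effectivePolicies(jobDefaults, nil)[0].Action)    // RestartJob
}
```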
// JobSpec describes how the job execution will look like and when it will actually run
type JobSpec struct {
	...
	// Specifies the default lifecycle of tasks
	// +optional
	Policies []LifecyclePolicy `json:"policies,omitempty" protobuf:"bytes,5,opt,name=policies"`

	// Tasks specifies the task specification of Job
	// +optional
	Tasks []TaskSpec `json:"tasks,omitempty" protobuf:"bytes,6,opt,name=tasks"`
}

// TaskSpec specifies the task specification of Job
type TaskSpec struct {
	...
	// Specifies the lifecycle of tasks
	// +optional
	Policies []LifecyclePolicy `json:"policies,omitempty" protobuf:"bytes,4,opt,name=policies"`
}
The following demonstrates the usage of LifecyclePolicy for jobs and tasks. For ML workloads, the job should be restarted if any task fails or is evicted. To simplify the configuration, the job-level LifecyclePolicy is set as follows; if no LifecyclePolicy is set on a task, all tasks use the policies in spec.policies.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tf-job
spec:
  # On any event here, restart the whole job.
  policies:
  - event: "*"
    action: RestartJob
  tasks:
  - name: "ps"
    replicas: 1
    template:
      spec:
        containers:
        - name: ps
          image: ps-img
  - name: "worker"
    replicas: 5
    template:
      spec:
        containers:
        - name: worker
          image: worker-img
  ...
Some big data frameworks, e.g. Spark, have different requirements: if the driver task fails the whole job should be restarted, while a failed executor task only needs to be restarted itself. The OnFailure restartPolicy is set for the executor, and RestartJob is set in the driver's spec.tasks.policies, as shown below.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: spark-job
spec:
  tasks:
  - name: "driver"
    replicas: 1
    policies:
    - event: "*"
      action: RestartJob
    template:
      spec:
        containers:
        - name: driver
          image: driver-img
  - name: "executor"
    replicas: 5
    template:
      spec:
        containers:
        - name: executor
          image: executor-img
        restartPolicy: OnFailure
Features Interaction
Admission Controller
The following validations must be included to make sure the Job runs as expected:
- spec.minAvailable <= sum(spec.taskSpecs.replicas)
- no duplicated name in the spec.taskSpecs array
- no duplicated event handler in a LifecyclePolicy array, for both job policies and task policies
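The first two checks can be sketched as follows; validate is a hypothetical function using trimmed stand-ins for the API types in this document, and the duplicate-event-handler check is omitted for brevity:

```go
package main

import "fmt"

// Trimmed stand-ins for the API types in this document.
type TaskSpec struct {
	Name     string
	Replicas int32
}

type JobSpec struct {
	MinAvailable int32
	Tasks        []TaskSpec
}

// validate rejects duplicated task names and a minAvailable larger
// than the total replica count.
func validate(spec JobSpec) error {
	var total int32
	seen := map[string]bool{}
	for _, t := range spec.Tasks {
		if seen[t.Name] {
			return fmt.Errorf("duplicated task name %q", t.Name)
		}
		seen[t.Name] = true
		total += t.Replicas
	}
	if spec.MinAvailable > total {
		return fmt.Errorf("minAvailable (%d) exceeds total replicas (%d)", spec.MinAvailable, total)
	}
	return nil
}

func main() {
	ok := JobSpec{MinAvailable: 6, Tasks: []TaskSpec{{Name: "ps", Replicas: 1}, {Name: "worker", Replicas: 5}}}
	bad := JobSpec{MinAvailable: 7, Tasks: []TaskSpec{{Name: "ps", Replicas: 1}, {Name: "worker", Replicas: 5}}}
	fmt.Println(validate(ok) == nil)  // true
	fmt.Println(validate(bad) == nil) // false
}
```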
CoScheduling
CoScheduling, also known as gang scheduling: spec.minAvailable specifies how many pods will be scheduled together, and the default value of spec.minAvailable is the sum of spec.tasks.replicas. If spec.minAvailable > sum(spec.tasks.replicas), job creation is rejected; if spec.minAvailable < sum(spec.tasks.replicas), the pods in spec.tasks are created randomly. Refer to the Task Priority within Job section on how to create tasks in order.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tf-job
spec:
  # minAvailable to run job
  minAvailable: 6
  tasks:
  - name: "ps"
    replicas: 1
    template:
      spec:
        containers:
        - name: "ps"
          image: "ps-img"
  - name: "worker"
    replicas: 5
    template:
      spec:
        containers:
        - name: "worker"
          image: "worker-img"
Task Priority within Job
In addition to multiple pod templates, the priority of each task may be different. PriorityClass in the PodTemplate is used to define the priority of a task within the job. The following is an example of running a Spark job with 1 driver and 5 executors; the driver's priority is master-pri, which is higher than that of normal pods. As spec.minAvailable is 3, the scheduler will make sure one driver with 2 executors is scheduled if there are not enough resources.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: spark-job
spec:
  minAvailable: 3
  tasks:
  - name: "driver"
    replicas: 1
    template:
      spec:
        priorityClassName: "master-pri"
        containers:
        - name: driver
          image: driver-img
  - name: "executor"
    replicas: 5
    template:
      spec:
        containers:
        - name: executor
          image: executor-img
Note: although high-priority pods are scheduled first, there is still a race condition between kubelets, so lower-priority pods may start up first. Job/task dependencies are used to handle this kind of race condition.
Resource sharing between Jobs
By default spec.minAvailable is set to the sum of spec.tasks.replicas. If it is set to a smaller value, the pods beyond spec.minAvailable share resources between jobs.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: spark-job
spec:
  minAvailable: 3
  tasks:
  - name: "driver"
    replicas: 1
    template:
      spec:
        priorityClassName: "master-pri"
        containers:
        - name: driver
          image: driver-img
  - name: "executor"
    replicas: 5
    template:
      spec:
        containers:
        - name: executor
          image: executor-img
Plugins for Job
Many jobs of AI frameworks, e.g. TensorFlow, MPI and MXNet, need to set environment variables, let pods communicate with each other, and sign in via ssh without a password. Job API plugins are provided so that users can better focus on their core business. There are currently three plugins; every plugin has parameters, and defaults are used if none are provided.
env: sets VK_TASK_INDEX in each container, an index that gives the container its identity.
svc: creates a Service and *.host entries to enable pods to communicate.
ssh: enables passwordless ssh sign-in, e.g. for the mpirun or mpiexec commands.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mpi-job
spec:
  minAvailable: 2
  schedulerName: scheduler
  policies:
  - event: PodEvicted
    action: RestartJob
  plugins:
    ssh: []
    env: []
    svc: []
  tasks:
  - replicas: 1
    name: mpimaster
    template:
      spec:
        containers:
        - image: mpi-image
          name: mpimaster
  - replicas: 2
    name: mpiworker
    template:
      spec:
        containers:
        - image: mpi-image
          name: mpiworker