A First Look at Volcano Job

volcano

References:

job-api

Error Handling

// Event is the type of Event related to the Job
type Event string

const (
    // AllEvents means all events
    AllEvents Event = "*"
    // PodFailedEvent is triggered if a Pod failed
    PodFailedEvent Event = "PodFailed"
    // PodEvictedEvent is triggered if a Pod was deleted
    PodEvictedEvent Event = "PodEvicted"
    // JobUnknownEvent covers the events that can lead to job 'Unknown':
    // 1. Task Unschedulable, which is triggered when some pods
    //    can't be scheduled while others are already running in the gang-scheduling case.
    JobUnknownEvent Event = "Unknown"

    // OutOfSyncEvent is triggered if a Pod/Job was updated
    OutOfSyncEvent Event = "OutOfSync"
    // CommandIssuedEvent is triggered if a command is raised by the user
    CommandIssuedEvent Event = "CommandIssued"
    // TaskCompletedEvent is triggered if the 'Replicas' number of pods in one task have succeeded
    TaskCompletedEvent Event = "TaskCompleted"
)

// Action is the type of event handling
type Action string

const (
    // AbortJobAction if this action is set, the whole job will be aborted:
    // all Pod of Job will be evicted, and no Pod will be recreated
    AbortJobAction Action = "AbortJob"
    // RestartJobAction if this action is set, the whole job will be restarted
    RestartJobAction Action = "RestartJob"
    // TerminateJobAction if this action is set, the whole job will be terminated
    // and cannot be resumed: all Pods of the Job will be evicted, and no Pod will be recreated.
    TerminateJobAction Action = "TerminateJob"
    // CompleteJobAction if this action is set, the unfinished pods will be killed, job completed.
    CompleteJobAction Action = "CompleteJob"

    // ResumeJobAction is the action to resume an aborted job.
    ResumeJobAction Action = "ResumeJob"
    // SyncJobAction is the action to sync Job/Pod status.
    SyncJobAction Action = "SyncJob"
)

// LifecyclePolicy specifies the lifecycle and error handling of task and job.
type LifecyclePolicy struct {
    Event   Event            `json:"event,omitempty" protobuf:"bytes,1,opt,name=event"`
    Action  Action           `json:"action,omitempty" protobuf:"bytes,2,opt,name=action"`
    Timeout *metav1.Duration `json:"timeout,omitempty" protobuf:"bytes,3,opt,name=timeout"`
}

LifecyclePolicy lets you handle different Job events in a targeted way. For example, in distributed training, when a single task fails, the whole Job can be restarted.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tf-job
spec:
  # If any event here, restart the whole job.
  policies:
  - event: "*"
    action: RestartJob
  tasks:
  - name: "ps"
    replicas: 1
    template:
      spec:
        containers:
        - name: ps
          image: ps-img
  - name: "worker"
    replicas: 5
    template:  
      spec: 
        containers:
        - name: worker
          image: worker-img
  ...

An action can also be specified for each individual task.
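To make the interplay between job-level and task-level policies concrete, here is a minimal Go sketch of how an action might be picked for an event. The `matchPolicy` and `resolveAction` helpers are hypothetical, the types are trimmed copies of the API excerpt above, and the lookup order (task policies first, then job policies, then falling back to `SyncJob`) is an assumption made for illustration, not a statement of what the Volcano controller does.

```go
package main

import "fmt"

// Event and Action mirror the string types from the API excerpt.
type Event string
type Action string

const (
	AllEvents       Event = "*"
	PodFailedEvent  Event = "PodFailed"
	PodEvictedEvent Event = "PodEvicted"

	AbortJobAction   Action = "AbortJob"
	RestartJobAction Action = "RestartJob"
	SyncJobAction    Action = "SyncJob"
)

// LifecyclePolicy is reduced to the two fields this sketch needs.
type LifecyclePolicy struct {
	Event  Event
	Action Action
}

// matchPolicy (hypothetical helper) returns the action of the first policy
// whose event equals ev or is the "*" wildcard.
func matchPolicy(policies []LifecyclePolicy, ev Event) (Action, bool) {
	for _, p := range policies {
		if p.Event == ev || p.Event == AllEvents {
			return p.Action, true
		}
	}
	return "", false
}

// resolveAction (hypothetical helper) checks task-level policies first, then
// job-level policies, and falls back to SyncJob. This order is an assumption.
func resolveAction(taskPolicies, jobPolicies []LifecyclePolicy, ev Event) Action {
	if a, ok := matchPolicy(taskPolicies, ev); ok {
		return a
	}
	if a, ok := matchPolicy(jobPolicies, ev); ok {
		return a
	}
	return SyncJobAction
}

func main() {
	jobPolicies := []LifecyclePolicy{{Event: AllEvents, Action: RestartJobAction}}
	workerPolicies := []LifecyclePolicy{{Event: PodFailedEvent, Action: AbortJobAction}}

	fmt.Println(resolveAction(workerPolicies, jobPolicies, PodFailedEvent))  // AbortJob
	fmt.Println(resolveAction(workerPolicies, jobPolicies, PodEvictedEvent)) // RestartJob
	fmt.Println(resolveAction(nil, nil, PodFailedEvent))                     // SyncJob
}
```

Under these assumptions, a worker task with its own `PodFailed` policy aborts the Job on failure, while any other event still falls through to the job-wide `"*"` rule.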

Gang-scheduling

This part is taken straight from the kube-batch design. spec.minAvailable is the number of pods that must be scheduled together as a gang. Its default value is the sum of spec.tasks.replicas. If spec.minAvailable > sum(spec.tasks.replicas), the Job cannot be created. If spec.minAvailable < sum(spec.tasks.replicas), pods are created according to task priority, or in random order.

Task priority
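The defaulting and validation rules for spec.minAvailable can be sketched in Go as follows. The `JobSpec` and `TaskSpec` types here are simplified stand-ins rather than the real Volcano API types, `applyMinAvailable` is a hypothetical helper, and treating a zero value as "not set" is an assumption made for this example.

```go
package main

import "fmt"

// TaskSpec and JobSpec are simplified stand-ins for the Volcano API types.
type TaskSpec struct {
	Name     string
	Replicas int32
}

type JobSpec struct {
	MinAvailable int32
	Tasks        []TaskSpec
}

// totalReplicas sums the replicas of every task in the job.
func totalReplicas(spec JobSpec) int32 {
	var total int32
	for _, t := range spec.Tasks {
		total += t.Replicas
	}
	return total
}

// applyMinAvailable (hypothetical helper) defaults MinAvailable to the replica
// sum when it is 0 (assumed to mean "unset") and rejects values above the sum.
func applyMinAvailable(spec *JobSpec) error {
	total := totalReplicas(*spec)
	if spec.MinAvailable == 0 {
		spec.MinAvailable = total
		return nil
	}
	if spec.MinAvailable > total {
		return fmt.Errorf("minAvailable %d exceeds total replicas %d", spec.MinAvailable, total)
	}
	return nil
}

func main() {
	tasks := []TaskSpec{{Name: "driver", Replicas: 1}, {Name: "executor", Replicas: 5}}

	unset := JobSpec{Tasks: tasks}
	fmt.Println(applyMinAvailable(&unset), unset.MinAvailable) // <nil> 6

	partial := JobSpec{MinAvailable: 3, Tasks: tasks}
	fmt.Println(applyMinAvailable(&partial), partial.MinAvailable) // <nil> 3

	tooMany := JobSpec{MinAvailable: 7, Tasks: tasks}
	fmt.Println(applyMinAvailable(&tooMany)) // minAvailable 7 exceeds total replicas 6
}
```

The middle case, where minAvailable is 3 of 6 replicas, is the one in which the choice of which pods to create first comes down to task priority.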

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: spark-job
spec:
  minAvailable: 3
  tasks:
  - name: "driver"
    replicas: 1
    template:
      spec:
        priorityClassName: "master-pri" # high priority
        containers:
        - name: driver
          image: driver-img
  - name: "executor"
    replicas: 5
    template:
      spec: 
        containers:
        - name: executor
          image: executor-img

The scheduler can only guarantee that the high-priority pods within a Job are scheduled first.

Job plugins

  plugins:
    ssh: []
    env: []
    svc: []

Three plugins are currently provided to help users run AI jobs:

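  • env: currently only adds VK_TASK_INDEX; later versions will presumably at least make use of the array values
  • svc: creates a Service for the Job and maintains a hosts file so that pods can reach each other easily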
func generateHost(job *vkv1.Job) map[string]string {
   data := make(map[string]string, len(job.Spec.Tasks))

   for _, ts := range job.Spec.Tasks {
      hosts := make([]string, 0, ts.Replicas)

      for i := 0; i < int(ts.Replicas); i++ {
         hostName := ts.Template.Spec.Hostname
         subdomain := ts.Template.Spec.Subdomain
         if len(hostName) == 0 {
            hostName = vkhelpers.MakePodName(job.Name, ts.Name, i)
         }
         if len(subdomain) == 0 {
            subdomain = job.Name
         }
         hosts = append(hosts, hostName+"."+subdomain)
         if len(ts.Template.Spec.Hostname) != 0 {
            break
         }
      }

      key := fmt.Sprintf(ConfigMapTaskHostFmt, ts.Name)
      data[key] = strings.Join(hosts, "\n")
   }

   return data
}
  • ssh: passwordless SSH communication between pods
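To see what the svc plugin's generateHost function produces, the following self-contained Go sketch reimplements its logic with stand-in types so it can be run outside the Volcano code base. The `task` struct, the pod name format `"%s-%s-%d"` (standing in for vkhelpers.MakePodName) and the key format `"%s.host"` (standing in for ConfigMapTaskHostFmt) are all assumptions made for illustration; check the Volcano source for the actual values.

```go
package main

import (
	"fmt"
	"strings"
)

// Assumed stand-ins for vkhelpers.MakePodName and ConfigMapTaskHostFmt.
const (
	podNameFmt           = "%s-%s-%d"
	configMapTaskHostFmt = "%s.host"
)

// task is a simplified stand-in for a Volcano task spec; Hostname and
// Subdomain correspond to ts.Template.Spec.Hostname / Subdomain.
type task struct {
	Name      string
	Replicas  int32
	Hostname  string
	Subdomain string
}

// generateHost builds one hosts entry per task: a newline-separated list of
// "<hostname>.<subdomain>" names, following the logic of the original function.
func generateHost(jobName string, tasks []task) map[string]string {
	data := make(map[string]string, len(tasks))
	for _, ts := range tasks {
		hosts := make([]string, 0, ts.Replicas)
		for i := 0; i < int(ts.Replicas); i++ {
			hostName, subdomain := ts.Hostname, ts.Subdomain
			if hostName == "" {
				hostName = fmt.Sprintf(podNameFmt, jobName, ts.Name, i)
			}
			if subdomain == "" {
				subdomain = jobName
			}
			hosts = append(hosts, hostName+"."+subdomain)
			// An explicit hostname is shared by all replicas, so list it once.
			if ts.Hostname != "" {
				break
			}
		}
		data[fmt.Sprintf(configMapTaskHostFmt, ts.Name)] = strings.Join(hosts, "\n")
	}
	return data
}

func main() {
	data := generateHost("tf-job", []task{
		{Name: "ps", Replicas: 1},
		{Name: "worker", Replicas: 2},
	})
	fmt.Println(data["ps.host"])     // tf-job-ps-0.tf-job
	fmt.Println(data["worker.host"]) // tf-job-worker-0.tf-job, then tf-job-worker-1.tf-job
}
```

With the assumed formats, each task gets one entry listing its pods' DNS-style names, which is what lets, for example, workers look up the ps pods by reading a file.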