Volcano
Reference:
Error Handling
// Event is the type of Event related to the Job
type Event string

const (
	// AllEvents means all event
	AllEvents Event = "*"
	// PodFailedEvent is triggered if Pod was failed
	PodFailedEvent Event = "PodFailed"
	// PodEvictedEvent is triggered if Pod was deleted
	PodEvictedEvent Event = "PodEvicted"
	// These below are several events can lead to job 'Unknown'
	// 1. Task Unschedulable, this is triggered when part of
	//    pods can't be scheduled while some are already running in gang-scheduling case.
	JobUnknownEvent Event = "Unknown"
	// OutOfSyncEvent is triggered if Pod/Job were updated
	OutOfSyncEvent Event = "OutOfSync"
	// CommandIssuedEvent is triggered if a command is raised by user
	CommandIssuedEvent Event = "CommandIssued"
	// TaskCompletedEvent is triggered if the 'Replicas' amount of pods in one task are succeed
	TaskCompletedEvent Event = "TaskCompleted"
)

// Action is the type of event handling
type Action string

const (
	// AbortJobAction if this action is set, the whole job will be aborted:
	// all Pod of Job will be evicted, and no Pod will be recreated
	AbortJobAction Action = "AbortJob"
	// RestartJobAction if this action is set, the whole job will be restarted
	RestartJobAction Action = "RestartJob"
	// TerminateJobAction if this action is set, the whole job will be terminated
	// and can not be resumed: all Pod of Job will be evicted, and no Pod will be recreated.
	TerminateJobAction Action = "TerminateJob"
	// CompleteJobAction if this action is set, the unfinished pods will be killed, job completed.
	CompleteJobAction Action = "CompleteJob"
	// ResumeJobAction is the action to resume an aborted job.
	ResumeJobAction Action = "ResumeJob"
	// SyncJobAction is the action to sync Job/Pod status.
	SyncJobAction Action = "SyncJob"
)

// LifecyclePolicy specifies the lifecycle and error handling of task and job.
type LifecyclePolicy struct {
	Event   Event            `json:"event,omitempty" protobuf:"bytes,1,opt,name=event"`
	Action  Action           `json:"action,omitempty" protobuf:"bytes,2,opt,name=action"`
	Timeout *metav1.Duration `json:"timeout,omitempty" protobuf:"bytes,3,opt,name=timeout"`
}
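LifecyclePolicy lets a Job react to each kind of Job event with a specific action. For example, in distributed training, the failure of a single task can trigger a restart of the whole Job.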
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tf-job
spec:
  # If any event here, restart the whole job.
  policies:
  - event: "*"
    action: RestartJob
  tasks:
  - name: "ps"
    replicas: 1
    template:
      spec:
        containers:
        - name: ps
          image: ps-img
  - name: "worker"
    replicas: 5
    template:
      spec:
        containers:
        - name: worker
          image: worker-img
...
An action can also be specified for each individual task.
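To make the interaction between job-level and task-level policies concrete, here is a minimal sketch of how an event could be mapped to an action. It is not Volcano's controller code: resolveAction is a hypothetical helper, and both the "task-level policies are checked before job-level ones" order and the SyncJob fallback are assumptions made for illustration.

```go
package main

import "fmt"

type Event string
type Action string

const (
	AllEvents       Event = "*"
	PodFailedEvent  Event = "PodFailed"
	PodEvictedEvent Event = "PodEvicted"

	AbortJobAction   Action = "AbortJob"
	RestartJobAction Action = "RestartJob"
	SyncJobAction    Action = "SyncJob"
)

// LifecyclePolicy is trimmed to the two fields this sketch needs.
type LifecyclePolicy struct {
	Event  Event
	Action Action
}

// resolveAction is a hypothetical helper. It scans task-level policies
// first, then job-level policies, and returns the action of the first
// policy whose event matches exactly or is the "*" wildcard.
// Falling back to SyncJob when nothing matches is an assumption.
func resolveAction(ev Event, taskPolicies, jobPolicies []LifecyclePolicy) Action {
	for _, policies := range [][]LifecyclePolicy{taskPolicies, jobPolicies} {
		for _, p := range policies {
			if p.Event == ev || p.Event == AllEvents {
				return p.Action
			}
		}
	}
	return SyncJobAction
}

func main() {
	jobPolicies := []LifecyclePolicy{{Event: AllEvents, Action: RestartJobAction}}
	workerPolicies := []LifecyclePolicy{{Event: PodFailedEvent, Action: AbortJobAction}}

	fmt.Println(resolveAction(PodFailedEvent, workerPolicies, jobPolicies))  // AbortJob
	fmt.Println(resolveAction(PodEvictedEvent, workerPolicies, jobPolicies)) // RestartJob
	fmt.Println(resolveAction(PodEvictedEvent, nil, nil))                    // SyncJob
}
```

With this ordering, a task can override the job-wide "*" rule for one event while every other event still falls through to the job-level policy.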
Gang-scheduling
This part is taken entirely from the kube-batch design. spec.minAvailable is the number of pods that must be scheduled together as a gang. Its default value is the sum of spec.tasks.replicas over all tasks. If spec.minAvailable > sum(spec.tasks.replicas), the Job cannot be created. If spec.minAvailable < sum(spec.tasks.replicas), pods are created according to task priority, or in random order.
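The defaulting and validation rules for spec.minAvailable can be sketched as follows. The types and the applyMinAvailable helper are hypothetical simplifications, not Volcano's admission code, and treating 0 as "unset" is an assumption of this sketch.

```go
package main

import "fmt"

// TaskSpec and JobSpec are simplified stand-ins for the Volcano Job API types.
type TaskSpec struct {
	Name     string
	Replicas int32
}

type JobSpec struct {
	MinAvailable int32
	Tasks        []TaskSpec
}

// totalReplicas sums the replicas of every task in the job.
func totalReplicas(spec JobSpec) int32 {
	var total int32
	for _, t := range spec.Tasks {
		total += t.Replicas
	}
	return total
}

// applyMinAvailable (hypothetical) defaults an unset MinAvailable to the
// total replica count and rejects values that can never be satisfied.
func applyMinAvailable(spec *JobSpec) error {
	total := totalReplicas(*spec)
	switch {
	case spec.MinAvailable < 0:
		return fmt.Errorf("minAvailable %d must not be negative", spec.MinAvailable)
	case spec.MinAvailable == 0:
		spec.MinAvailable = total // assumption: 0 means "not set"
	case spec.MinAvailable > total:
		return fmt.Errorf("minAvailable %d exceeds total replicas %d", spec.MinAvailable, total)
	}
	return nil
}

func main() {
	tasks := []TaskSpec{{Name: "driver", Replicas: 1}, {Name: "executor", Replicas: 5}}

	unset := JobSpec{Tasks: tasks}
	fmt.Println(applyMinAvailable(&unset), unset.MinAvailable) // <nil> 6

	tooMany := JobSpec{MinAvailable: 7, Tasks: tasks}
	fmt.Println(applyMinAvailable(&tooMany))
}
```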
Task priority
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: spark-job
spec:
  minAvailable: 3
  tasks:
  - name: "driver"
    replicas: 1
    template:
      spec:
        priorityClassName: "master-pri" # high priority
        containers:
        - name: driver
          image: driver-img
  - name: "executor"
    replicas: 5
    template:
      spec:
        containers:
        - name: executor
          image: executor-img
The scheduler can only guarantee that the high-priority pods within a Job are scheduled first.
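As a rough illustration of "high-priority pods first", the sketch below orders tasks by priority and fills the minAvailable slots from the top. It is not the scheduler's actual algorithm: the Task type, the numeric Priority field, and pickGang are all hypothetical, standing in for whatever value the priority class resolves to.

```go
package main

import (
	"fmt"
	"sort"
)

// Task is a simplified, hypothetical view of a job task.
type Task struct {
	Name     string
	Replicas int
	Priority int // assumed numeric value of the task's priority class
}

// pickGang returns how many pods of each task fall into the first
// minAvailable slots when tasks are visited from highest to lowest priority.
func pickGang(tasks []Task, minAvailable int) map[string]int {
	ordered := append([]Task(nil), tasks...) // copy so the caller's slice is not reordered
	sort.SliceStable(ordered, func(i, j int) bool {
		return ordered[i].Priority > ordered[j].Priority
	})

	picked := make(map[string]int, len(ordered))
	remaining := minAvailable
	for _, t := range ordered {
		if remaining <= 0 {
			break
		}
		n := t.Replicas
		if n > remaining {
			n = remaining
		}
		picked[t.Name] = n
		remaining -= n
	}
	return picked
}

func main() {
	tasks := []Task{
		{Name: "executor", Replicas: 5, Priority: 0},
		{Name: "driver", Replicas: 1, Priority: 100},
	}
	fmt.Println(pickGang(tasks, 3)) // map[driver:1 executor:2]
}
```

For the spark-job example, the single driver pod is always part of the gang and the two remaining slots go to executors.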
Job plugins
plugins:
  ssh: []
  env: []
  svc: []
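Three plugins are currently provided to help users run AI jobs:
- env: currently only adds VK_TASK_INDEX; later versions will presumably at least make use of the array values.
- svc: creates a Service for the Job and maintains a hosts file so that pods can reach each other easily.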
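// generateHost builds one entry per task: the key is the task's host-file name,
// the value is a newline-separated list of pod hosts.
func generateHost(job *vkv1.Job) map[string]string {
	data := make(map[string]string, len(job.Spec.Tasks))
	for _, ts := range job.Spec.Tasks {
		hosts := make([]string, 0, ts.Replicas)
		for i := 0; i < int(ts.Replicas); i++ {
			hostName := ts.Template.Spec.Hostname
			subdomain := ts.Template.Spec.Subdomain
			// Fall back to the generated pod name when the template sets no hostname.
			if len(hostName) == 0 {
				hostName = vkhelpers.MakePodName(job.Name, ts.Name, i)
			}
			// Fall back to the job name as the subdomain.
			if len(subdomain) == 0 {
				subdomain = job.Name
			}
			hosts = append(hosts, hostName+"."+subdomain)
			// An explicit hostname is shared by every replica, so one entry is enough.
			if len(ts.Template.Spec.Hostname) != 0 {
				break
			}
		}
		key := fmt.Sprintf(ConfigMapTaskHostFmt, ts.Name)
		data[key] = strings.Join(hosts, "\n")
	}
	return data
}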
- ssh: passwordless SSH communication between pods.
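To see what the svc plugin's generateHost would produce for a concrete job, here is a self-contained sketch of the same logic with simplified types. The Task struct and hostsFor are hypothetical; the pod-name format "%s-%s-%d" and the key format "%s.host" are assumptions standing in for vkhelpers.MakePodName and ConfigMapTaskHostFmt, whose definitions are not shown above.

```go
package main

import (
	"fmt"
	"strings"
)

// Task is a simplified stand-in for a Volcano task spec.
type Task struct {
	Name      string
	Replicas  int
	Hostname  string // optional, from the pod template
	Subdomain string // optional, from the pod template
}

const (
	podNameFmt  = "%s-%s-%d" // assumed: <job>-<task>-<index>
	taskHostFmt = "%s.host"  // assumed key format for each task's host list
)

// hostsFor mirrors the structure of generateHost on the simplified types.
func hostsFor(jobName string, tasks []Task) map[string]string {
	data := make(map[string]string, len(tasks))
	for _, ts := range tasks {
		hosts := make([]string, 0, ts.Replicas)
		for i := 0; i < ts.Replicas; i++ {
			hostName, subdomain := ts.Hostname, ts.Subdomain
			if hostName == "" {
				hostName = fmt.Sprintf(podNameFmt, jobName, ts.Name, i)
			}
			if subdomain == "" {
				subdomain = jobName
			}
			hosts = append(hosts, hostName+"."+subdomain)
			if ts.Hostname != "" {
				break // an explicit hostname yields a single entry
			}
		}
		data[fmt.Sprintf(taskHostFmt, ts.Name)] = strings.Join(hosts, "\n")
	}
	return data
}

func main() {
	data := hostsFor("tf-job", []Task{
		{Name: "ps", Replicas: 1},
		{Name: "worker", Replicas: 2},
	})
	fmt.Println(data["ps.host"])
	fmt.Println(data["worker.host"])
}
```

Under these assumptions each pod can read its peers' addresses from the per-task host list, one host per line.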