在上面的文章中绘闷,我成功運(yùn)行了pipelines的簡單實(shí)例橡庞。這個(gè)簡單的例子沒有文件的操作,但是這肯定不符合我們的要求印蔗,所以接下來介紹如何運(yùn)行官網(wǎng)的ML 例子扒最。
這次試用的例子是:KubeFlow pipeline using TFX OSS components
準(zhǔn)備工作
由于這個(gè)例子使用的鏡像,文件都是某歌的华嘹,所以我們要想辦法把他弄到自己服務(wù)器能pull到的地方吧趣。
鏡像:大家可以在dockerhub上搜索‘zoux’。所有相關(guān)的鏡像耙厚,我都上傳到dockerhub上去了强挫。
文件:https://github.com/zoux86/kubeflow/tree/master/taxi-cab-classification
運(yùn)行代碼
代碼片段1-定義文件路徑和鏡像名稱
EXPERIMENT_NAME = 'TFX16'
OUTPUT_DIR = './output'
PROJECT_NAME = 'TFX16'
TRAIN_DATA = '/home/zoux/data/taxi-cab-classification/train.csv'
EVAL_DATA = '/home/zoux/data/taxi-cab-classification/eval.csv'
HIDDEN_LAYER_SIZE = '1500'
STEPS = 3000
DATAFLOW_TFDV_IMAGE = '192.168.14.66:5000/ml-pipeline-dataflow-tfdv:v1'
DATAFLOW_TFT_IMAGE = '192.168.14.66:5000/ml-pipeline-dataflow-tft:v1'
DATAFLOW_TFMA_IMAGE = '192.168.14.66:5000/ml-pipeline-dataflow-tfma:v1'
DATAFLOW_TF_PREDICT_IMAGE = '192.168.14.66:5000/ml-pipeline-dataflow-tf-predict:v1'
KUBEFLOW_TF_TRAINER_IMAGE = '192.168.14.66:5000/ml-pipeline-kubeflow-tf-trainer:v1'
KUBEFLOW_DEPLOYER_IMAGE = '192.168.14.66:5000/ml-pipeline-kubeflow-deployer:v1'
注意:
(1)這里我使用的都是集群本地倉庫中的鏡像,這些鏡像我都是提前下載好颜曾,push到倉庫中去的纠拔。
(2)我的鏡像是自己定義的V1版本,這是我自己build的一個(gè)新鏡像泛豪。和原來某歌的鏡像不同在于稠诲,我在鏡像中創(chuàng)建了一個(gè)文件夾'/home/zoux/data/taxi-cab-classification/',并將所有需要的文件都放在了這個(gè)文件夾中,這樣每個(gè)啟動(dòng)的每個(gè)容器都可以在自己的鏡像中使用所需要的文件诡曙。
代碼片段2-導(dǎo)入python sdk
import kfp
from kfp import compiler
import kfp.dsl as dsl
import kfp.notebook
import kfp.gcp as gcp
client = kfp.Client()
from kubernetes import client as k8s_client
exp = client.create_experiment(name=EXPERIMENT_NAME)
注意如果保存說找不到kfp模塊臀叙,需要在jupyter中再下載一次python SDK。
!pip3 install https://storage.googleapis.com/ml-pipeline/release/0.1.3/kfp.tar.gz --upgrade
代碼片段3-創(chuàng)建pipelines
mport kfp.dsl as dsl
# Below are a list of helper functions to wrap the components to provide a simpler interface for pipeline function.
def dataflow_tf_data_validation_op(inference_data: 'GcsUri', validation_data: 'GcsUri', column_names: 'GcsUri[text/json]', key_columns, project: 'GcpProject', mode, validation_output: 'GcsUri[Directory]', step_name='validation'):
return dsl.ContainerOp(
name = step_name,
image = DATAFLOW_TFDV_IMAGE,
arguments = [
'--csv-data-for-inference', inference_data,
'--csv-data-to-validate', validation_data,
'--column-names', column_names,
'--key-columns', key_columns,
'--project', project,
'--mode', mode,
'--output', validation_output,
],
file_outputs = {
'schema': '/schema.txt',
}
)
def dataflow_tf_transform_op(train_data: 'GcsUri', evaluation_data: 'GcsUri', schema: 'GcsUri[text/json]', project: 'GcpProject', preprocess_mode, preprocess_module: 'GcsUri[text/code/python]', transform_output: 'GcsUri[Directory]', step_name='preprocess'):
return dsl.ContainerOp(
name = step_name,
image = DATAFLOW_TFT_IMAGE,
arguments = [
'--train', train_data,
'--eval', evaluation_data,
'--schema', schema,
'--project', project,
'--mode', preprocess_mode,
'--preprocessing-module', preprocess_module,
'--output', transform_output,
],
file_outputs = {'transformed': '/output.txt'}
)
def tf_train_op(transformed_data_dir, schema: 'GcsUri[text/json]', learning_rate: float, hidden_layer_size: int, steps: int, target: str, preprocess_module: 'GcsUri[text/code/python]', training_output: 'GcsUri[Directory]', step_name='training', use_gpu=False):
tf_train_op = dsl.ContainerOp(
name = step_name,
image = KUBEFLOW_TF_TRAINER_IMAGE,
arguments = [
'--transformed-data-dir', transformed_data_dir,
'--schema', schema,
'--learning-rate', learning_rate,
'--hidden-layer-size', hidden_layer_size,
'--steps', steps,
'--target', target,
'--preprocessing-module', preprocess_module,
'--job-dir', training_output,
],
file_outputs = {'train': '/output.txt'}
)
return tf_train_op
def dataflow_tf_model_analyze_op(model: 'TensorFlow model', evaluation_data: 'GcsUri', schema: 'GcsUri[text/json]', project: 'GcpProject', analyze_mode, analyze_slice_column, analysis_output: 'GcsUri', step_name='analysis'):
return dsl.ContainerOp(
name = step_name,
image = DATAFLOW_TFMA_IMAGE,
arguments = [
'--model', model,
'--eval', evaluation_data,
'--schema', schema,
'--project', project,
'--mode', analyze_mode,
'--slice-columns', analyze_slice_column,
'--output', analysis_output,
],
file_outputs = {'analysis': '/output.txt'}
)
def dataflow_tf_predict_op(evaluation_data: 'GcsUri', schema: 'GcsUri[text/json]', target: str, model: 'TensorFlow model', predict_mode, project: 'GcpProject', prediction_output: 'GcsUri', step_name='prediction'):
return dsl.ContainerOp(
name = step_name,
image = DATAFLOW_TF_PREDICT_IMAGE,
arguments = [
'--data', evaluation_data,
'--schema', schema,
'--target', target,
'--model', model,
'--mode', predict_mode,
'--project', project,
'--output', prediction_output,
],
file_outputs = {'prediction': '/output.txt'}
)
def kubeflow_deploy_op(model: 'TensorFlow model', tf_server_name, step_name='deploy'):
return dsl.ContainerOp(
name = step_name,
image = KUBEFLOW_DEPLOYER_IMAGE,
arguments = [
'--model-path', model,
'--server-name', tf_server_name
]
)
# The pipeline definition
@dsl.pipeline(
name='TFX Taxi Cab Classification Pipeline Example',
description='Example pipeline that does classification with model analysis based on a public BigQuery dataset.'
)
def taxi_cab_classification(
output,
project,
column_names=dsl.PipelineParam(name='column-names', value='/home/zoux/data/taxi-cab-classification/column-names.json'),
key_columns=dsl.PipelineParam(name='key-columns', value='trip_start_timestamp'),
train=dsl.PipelineParam(name='train', value=TRAIN_DATA),
evaluation=dsl.PipelineParam(name='evaluation', value=EVAL_DATA),
validation_mode=dsl.PipelineParam(name='validation-mode', value='local'),
preprocess_mode=dsl.PipelineParam(name='preprocess-mode', value='local'),
preprocess_module: dsl.PipelineParam=dsl.PipelineParam(name='preprocess-module', value='/home/zoux/data/taxi-cab-classification/preprocessing.py'),
target=dsl.PipelineParam(name='target', value='tips'),
learning_rate=dsl.PipelineParam(name='learning-rate', value=0.1),
hidden_layer_size=dsl.PipelineParam(name='hidden-layer-size', value=HIDDEN_LAYER_SIZE),
steps=dsl.PipelineParam(name='steps', value=STEPS),
predict_mode=dsl.PipelineParam(name='predict-mode', value='local'),
analyze_mode=dsl.PipelineParam(name='analyze-mode', value='local'),
analyze_slice_column=dsl.PipelineParam(name='analyze-slice-column', value='trip_start_hour')):
# set the flag to use GPU trainer
use_gpu = False
validation_output = '/nfs-pv/tfx-pv'
transform_output = '/nfs-pv/tfx-pv'
training_output = '/nfs-pv/tfx-pv'
analysis_output = '/nfs-pv/tfx-pv'
prediction_output = '/nfs-pv/tfx-pv'
tf_server_name = 'taxi-cab-classification-model-{{workflow.name}}'
validation = dataflow_tf_data_validation_op(train, evaluation, column_names, key_columns, project, validation_mode, validation_output).add_volume(k8s_client.V1Volume(
name='tfx-pv',
nfs=k8s_client.V1NFSVolumeSource(path='/nfs-pv/tfx-pv',server='192.168.14.66'))).add_volume_mount(
k8s_client.V1VolumeMount(mount_path='/nfs-pv/tfx-pv',name='tfx-pv'))
preprocess = dataflow_tf_transform_op(train, evaluation, validation.outputs['schema'], project, preprocess_mode, preprocess_module, transform_output).add_volume(k8s_client.V1Volume(
name='tfx-pv',
nfs=k8s_client.V1NFSVolumeSource(path='/nfs-pv/tfx-pv',server='192.168.14.66'))).add_volume_mount(
k8s_client.V1VolumeMount(mount_path='/nfs-pv/tfx-pv',name='tfx-pv'))
training = tf_train_op(preprocess.output, validation.outputs['schema'], learning_rate, hidden_layer_size, steps, target, preprocess_module, training_output, use_gpu=use_gpu).add_volume(k8s_client.V1Volume(
name='tfx-pv',
nfs=k8s_client.V1NFSVolumeSource(path='/nfs-pv/tfx-pv',server='192.168.14.66'))).add_volume_mount(
k8s_client.V1VolumeMount(mount_path='/nfs-pv/tfx-pv',name='tfx-pv'))
analysis = dataflow_tf_model_analyze_op(training.output, evaluation, validation.outputs['schema'], project, analyze_mode, analyze_slice_column, analysis_output).add_volume(k8s_client.V1Volume(
name='tfx-pv',
nfs=k8s_client.V1NFSVolumeSource(path='/nfs-pv/tfx-pv',server='192.168.14.66'))).add_volume_mount(
k8s_client.V1VolumeMount(mount_path='/nfs-pv/tfx-pv',name='tfx-pv'))
prediction = dataflow_tf_predict_op(evaluation, validation.outputs['schema'], target, training.output, predict_mode, project, prediction_output).add_volume(k8s_client.V1Volume(
name='tfx-pv',
nfs=k8s_client.V1NFSVolumeSource(path='/nfs-pv/tfx-pv',server='192.168.14.66'))).add_volume_mount(
k8s_client.V1VolumeMount(mount_path='/nfs-pv/tfx-pv',name='tfx-pv'))
deploy = kubeflow_deploy_op(training.output, tf_server_name).add_volume(k8s_client.V1Volume(
name='tfx-pv',
nfs=k8s_client.V1NFSVolumeSource(path='/nfs-pv/tfx-pv',server='192.168.14.66'))).add_volume_mount(
k8s_client.V1VolumeMount(mount_path='/nfs-pv/tfx-pv',name='tfx-pv'))
這里需要注意价卤,這個(gè)和官網(wǎng)給的例子是不同的劝萤,細(xì)心一點(diǎn)就可以發(fā)現(xiàn)這里沒有
“apply(gcp.use_gcp_secret('user-gcp-sa'))”。
這是因?yàn)樯厦婺蔷湓挼淖饔檬鞘褂胓s的掛載卷慎璧。我們國內(nèi)服務(wù)器是訪問不到的床嫌。所以我們得使用自己的掛載卷。
說的明白一點(diǎn)就是胸私,我們運(yùn)作的是一個(gè)圖厌处,圖中每一個(gè)節(jié)點(diǎn)是一個(gè)鏡像。pipelines中每一個(gè)節(jié)點(diǎn)運(yùn)行完岁疼,pod就會(huì)消失阔涉,所以該節(jié)點(diǎn)的數(shù)據(jù)也會(huì)丟失。所以我們必須讓所有節(jié)點(diǎn)都能訪問一個(gè)永遠(yuǎn)存在的目錄捷绒。這是我們就需要為所有節(jié)點(diǎn)掛載持久存儲(chǔ)卷瑰排。這里我使用的是NFS分布式存儲(chǔ)系統(tǒng)來當(dāng)存儲(chǔ)卷。
關(guān)于如何使用NFS暖侨,前面的文章已經(jīng)有過說明椭住,這里不再累贅。這里直接說明如何讓所有的容器都關(guān)聯(lián)一個(gè)NFS PV存儲(chǔ)卷字逗。
(1)創(chuàng)建pv,pvc
pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
name: tfx-pv
spec:
capacity:
storage: 5Gi
accessModes:
- ReadWriteOnce
nfs:
server: 192.168.14.66
path: /nfs-pv/tfx-pv
pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
name: tfxclaim
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 5Gi
創(chuàng)建好pv,pvc后京郑。運(yùn)行上面的代碼就能跑TFX實(shí)例了显押。
結(jié)果如下:
這里顯示最后幾個(gè)還有問題是因?yàn)樽詈蟮姆?wù)和分析也是在某歌平臺(tái)上的,我就沒修改了傻挂。
到這里至少就知道pipelines能在自己的集群上跑起來。接下來就是自己寫鏡像挖息,跑自己的代碼了金拒。
路漫漫其修遠(yuǎn)兮 吾將上下而求索