A pipeline here means a computing pipeline, that is, assembly-line style processing. The Pipelines API makes it easy to implement a data workflow: data source -> feature transformation -> model training -> prediction
Common pipeline components
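- Transformer: an abstraction that covers both feature transformation and prediction (a fitted model is itself a Transformer)
- Estimator: an abstraction that is fit on training data, for example the logistic regression algorithm (which applies regression to solve a classification problem)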
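The difference between the two components can be sketched in a few lines of plain Python. This is a toy model of the contract only, not the library's API: the class names `Scaler`, `MeanThresholdEstimator`, and `MeanModel`, the helper functions, and the sample rows are all hypothetical. The idea it illustrates is that a Transformer only has `transform`, while an Estimator has `fit`, which returns a fitted model that is again a Transformer.

```python
from typing import Dict, List

Row = Dict[str, float]


class Scaler:
    """Toy Transformer (hypothetical): adds a 'scaled' column, learns nothing."""

    def __init__(self, factor: float) -> None:
        self.factor = factor

    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**r, "scaled": r["x"] * self.factor} for r in rows]


class MeanModel:
    """Toy fitted model (hypothetical): a Transformer that adds 'prediction'."""

    def __init__(self, cutoff: float) -> None:
        self.cutoff = cutoff

    def transform(self, rows: List[Row]) -> List[Row]:
        return [
            {**r, "prediction": 1.0 if r["scaled"] > self.cutoff else 0.0}
            for r in rows
        ]


class MeanThresholdEstimator:
    """Toy Estimator (hypothetical): fit() learns a cutoff and returns a model."""

    def fit(self, rows: List[Row]) -> MeanModel:
        cutoff = sum(r["scaled"] for r in rows) / len(rows)
        return MeanModel(cutoff)


def fit_pipeline(stages, rows):
    """Walk the stages in order: fit Estimators, then transform the data."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):  # Estimator: replace it with its fitted model
            stage = stage.fit(rows)
        rows = stage.transform(rows)  # feed the output to the next stage
        fitted.append(stage)
    return fitted


def run_pipeline(fitted, rows):
    """Apply every fitted stage in order to new data."""
    for stage in fitted:
        rows = stage.transform(rows)
    return rows


train = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}, {"x": 6.0}]
fitted = fit_pipeline([Scaler(10.0), MeanThresholdEstimator()], train)
predictions = run_pipeline(fitted, [{"x": 1.0}, {"x": 5.0}])
```

After fitting, the pipeline holds only Transformers, which is why the same object can be reused on new data for prediction.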
Logistic regression
- aggregationDepth: suggested depth for treeAggregate (>= 2) (default: 2)
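- elasticNetParam: elastic-net mixing ratio between the two regularization penalties, L1 (Lasso) and L2 (Ridge), in [0, 1]; 0 gives a pure L2 penalty and 1 a pure L1 penalty. L1 makes the coefficients sparse, L2 shrinks them to limit overfitting (default: 0.0)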
- family: label distribution used by the model: auto, binomial, or multinomial (default: auto)
- featuresCol: name of the features column (default: features)
- fitIntercept: whether to fit an intercept term (default: true)
- labelCol: name of the label column (default: label)
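- lowerBoundsOnCoefficients: lower bounds on the coefficients when fitting under bound-constrained optimization (undefined)
- lowerBoundsOnIntercepts: lower bounds on the intercepts when fitting under bound-constrained optimization (undefined)
- maxIter: maximum number of iterations (>= 0) (default: 100)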
- predictionCol: name of the prediction column (default: prediction)
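- probabilityCol: name of the column for predicted class conditional probabilities (default: probability)
- rawPredictionCol: name of the raw prediction (confidence) column (default: rawPrediction)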
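- regParam: regularization strength, used mainly to prevent overfitting. If the dataset is small and the feature dimension is high, overfitting is likely, so consider increasing this value (>= 0) (default: 0.0)
- standardization: whether to standardize the training features before fitting (default: true)
- threshold: decision threshold for binary classification, in [0, 1] (default: 0.5)
- thresholds: per-class thresholds for multiclass classification (undefined)
- tol: convergence tolerance for the iterative algorithm (>= 0) (default: 1.0E-6)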
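- upperBoundsOnCoefficients: upper bounds on the coefficients when fitting under bound-constrained optimization (undefined)
- upperBoundsOnIntercepts: upper bounds on the intercepts when fitting under bound-constrained optimization (undefined)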
- weightCol: name of the column holding per-row instance weights; when unset, every row is weighted 1.0
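To make regParam, elasticNetParam, and threshold more concrete, here is a minimal pure-Python sketch of how they enter the computation. It assumes the elastic-net penalty form regParam * (elasticNetParam * ||w||_1 + (1 - elasticNetParam) / 2 * ||w||_2^2) and a binary decision rule that predicts class 1 when the probability exceeds the threshold. The function names and the sample weights are hypothetical and only illustrate the arithmetic, not the library's implementation.

```python
import math
from typing import Sequence


def elastic_net_penalty(
    coefficients: Sequence[float], reg_param: float, elastic_net_param: float
) -> float:
    """Penalty added to the training loss (assumed elastic-net form)."""
    l1 = sum(abs(w) for w in coefficients)       # Lasso term, promotes sparsity
    l2_squared = sum(w * w for w in coefficients)  # Ridge term, shrinks weights
    return reg_param * (
        elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2_squared
    )


def predict_binary(
    coefficients: Sequence[float],
    intercept: float,
    features: Sequence[float],
    threshold: float = 0.5,
) -> float:
    """Logistic probability of class 1, then compare it with the threshold."""
    margin = intercept + sum(w * x for w, x in zip(coefficients, features))
    probability = 1.0 / (1.0 + math.exp(-margin))
    return 1.0 if probability > threshold else 0.0


weights = [3.0, -4.0]  # hypothetical fitted coefficients

# elasticNetParam = 0.0 leaves only the L2 (Ridge) term.
ridge_only = elastic_net_penalty(weights, reg_param=0.5, elastic_net_param=0.0)

# elasticNetParam = 1.0 leaves only the L1 (Lasso) term.
lasso_only = elastic_net_penalty(weights, reg_param=0.5, elastic_net_param=1.0)

# The same row flips class when the threshold is raised above its probability.
default_cut = predict_binary(weights, 0.0, [1.0, 0.5])
strict_cut = predict_binary(weights, 0.0, [1.0, 0.5], threshold=0.8)
```

Raising the threshold trades recall for precision on the positive class, while raising regParam increases the penalty on large coefficients regardless of the mixing ratio.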