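Introduction
Isolation Forest (IsolationForest) scales well to large datasets. Typical applications include attack detection and traffic-anomaly detection in network security, and fraud detection at financial institutions.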
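The two main steps of IsolationForest
1. Draw subsamples from the training set and build an iTree from each one.
2. Pass each test point through every iTree in the iForest and record its path length, then use the anomaly score formula to compute an anomaly score for each test point.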
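The scoring step can be sketched in a few lines. The post does not spell out the formula, so this sketch assumes the one from the original Isolation Forest paper (Liu, Ting and Zhou, 2008): s(x, n) = 2^(-E[h(x)] / c(n)), where E[h(x)] is the mean path length of x over all iTrees and c(n) is the average path length of an unsuccessful search in a binary search tree built on n points. The function names are illustrative and not part of any library.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def average_path_length(n):
    """c(n): normalising constant for a subsample of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """s(x, n) = 2 ** (-E[h(x)] / c(n)); assumes n >= 2."""
    return 2.0 ** (-mean_path_length / average_path_length(n))


c = average_path_length(256)
for h in (2.0, c, 2 * c):
    print(f"mean path length {h:6.2f} -> score {anomaly_score(h, 256):.3f}")
```

A short path gives a score close to 1 (likely an anomaly), a path equal to c(n) gives exactly 0.5, and a long path pushes the score toward 0.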
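Modeling assumptions of IsolationForest
1. Anomalies make up only a small fraction of the data.
2. The feature values of anomalies differ markedly from those of normal points.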
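These two assumptions are what make isolation work: a rare point that sits far from the rest tends to be separated by very few random splits. The toy sketch below illustrates this in one dimension. It is a simplification for intuition only (a real iTree also picks a random feature and is built on a subsample), and `isolation_depth` is a hypothetical helper, not a library function.

```python
import random


def isolation_depth(values, target, rng, max_depth=64):
    """Count the random splits needed until `target` is alone in its partition."""
    current = list(values)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:  # only duplicates remain, so no further split is possible
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target.
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth


rng = random.Random(0)
normal = [rng.gauss(0.0, 1.0) for _ in range(200)]
outlier = 8.0                              # far from the normal points
inlier = sorted(normal)[len(normal) // 2]  # a point in the middle of the data
data = normal + [outlier]

trials = 200
avg_outlier = sum(isolation_depth(data, outlier, rng) for _ in range(trials)) / trials
avg_inlier = sum(isolation_depth(data, inlier, rng) for _ in range(trials)) / trials
print(f"average depth to isolate the outlier: {avg_outlier:.2f}")
print(f"average depth to isolate the inlier:  {avg_inlier:.2f}")
```

The outlier should need noticeably fewer splits on average than the inlier, which is the path-length signal the forest relies on.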
The algorithm needs only two parameters
1. The number of trees (100 is usually enough).
2. The subsample size (256 is usually enough).
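In scikit-learn these two settings correspond to the `n_estimators` and `max_samples` arguments of `IsolationForest`. The snippet below is a minimal sketch on synthetic data (the data and the seed are made up for illustration) that sets both explicitly and checks what the fitted model actually used.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.randn(1000, 2)  # synthetic data, for illustration only

# 100 trees, each built on a subsample of 256 points.
clf = IsolationForest(n_estimators=100, max_samples=256, random_state=0)
clf.fit(X)

print("number of trees:", len(clf.estimators_))
print("subsample size per tree:", clf.max_samples_)
```

Leaving both arguments at their defaults gives the same configuration here, since scikit-learn defaults to 100 trees and `max_samples="auto"`, which means min(256, n_samples).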
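Notes on the model:
1. The model predicts either 1 or -1, where 1 means normal and -1 means anomaly.
2. When anomalies are so rare that the training set contains only normal data, training is still feasible, but prediction quality will be lower (as in the scikit-learn example later in this post).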
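To make the 1 / -1 convention concrete, here is a small sketch that trains only on normal-looking synthetic points and then scores two new points. The data is invented for illustration. In scikit-learn, `predict` returns -1 exactly where `decision_function` is negative, so the continuous score can be inspected alongside the label.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(200, 2)           # normal data only, clustered near the origin
X_new = np.array([[0.0, 0.0], [4.0, 4.0]])  # one central point, one far-away point

clf = IsolationForest(random_state=0).fit(X_train)
labels = clf.predict(X_new)            # 1 = normal, -1 = anomaly
scores = clf.decision_function(X_new)  # negative values are treated as anomalies

for point, label, score in zip(X_new, labels, scores):
    print(point, "label:", label, "score:", round(float(score), 3))
```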
Example:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
rng = np.random.RandomState(42)
# Generate the training set: two Gaussian clusters centered at (2, 2) and (-2, -2)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
# Generate some new regular observations from the same two clusters
X = 0.3 * rng.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
# Generate some new abnormal observations, spread uniformly over [-4, 4] x [-4, 4]
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))
#X_outliers.max()
# Fit the model
clf = IsolationForest(max_samples=100, random_state=rng)
clf.fit(X_train)
# Predict labels: 1 = normal, -1 = anomaly
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)  # ideally mostly -1 (anomalies)
# Plot the decision function together with the three sets of points
xx, yy = np.meshgrid(np.linspace(-5, 5, 50), np.linspace(-5, 5, 50))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])  # score on the grid; lower = more abnormal
Z = Z.reshape(xx.shape)
plt.title("IsolationForest")
plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r)
b1 = plt.scatter(X_train[:, 0], X_train[:, 1], c='white',
                 s=20, edgecolor='k')
b2 = plt.scatter(X_test[:, 0], X_test[:, 1], c='green',
                 s=20, edgecolor='k')
c = plt.scatter(X_outliers[:, 0], X_outliers[:, 1], c='red',
                s=20, edgecolor='k')
plt.axis('tight')
plt.xlim((-5, 5))
plt.ylim((-5, 5))
plt.legend([b1, b2, c],
           ["training observations",
            "new regular observations", "new abnormal observations"],
           loc="upper left")
plt.show()
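The plot shows the result visually. A numeric summary can help too. The sketch below rebuilds the same three datasets and reports the share of points flagged as anomalies (-1) in each. It uses a fixed integer seed of 42 for the model, which is an assumption made here for reproducibility, so the exact numbers may differ from a run of the example above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
X = 0.3 * rng.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))

clf = IsolationForest(max_samples=100, random_state=42).fit(X_train)

# Share of points labelled -1 (anomaly) in each dataset.
frac = {
    name: float(np.mean(clf.predict(data) == -1))
    for name, data in [("train", X_train), ("test", X_test), ("outliers", X_outliers)]
}
for name, value in frac.items():
    print(f"{name:9s} flagged as anomalies: {value:.0%}")
```

Because the model never saw a true anomaly during training, some regular points are usually flagged and some uniform points that land inside a cluster are not, which is the loss in quality mentioned in the notes above.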