reference: [Two excellent books, plus the libsvm source code and a paper on parameter tuning.]
[1] http://files2.syncfusion.com/Downloads/Ebooks/support_vector_machines_succinctly.pdf
[2] An Introduction to Support Vector Machines and Other Kernel-based Learning Methods
[3] https://pan.baidu.com/share/link?shareid=262520779&uk=1378138793 (solid material)
First of all, SVM solves the classification problem in supervised learning. There are two cases, depending on whether the data is linearly separable. If it is not, a kernel is introduced: the idea is to first apply a transformation into another space, where the problem becomes separable.
For a linearly separable supervised classification problem, what is SVM's goal? To find the optimal separating hyperplane which maximizes the margin of the training data.
Why make maximizing the margin the goal? Because such a hyperplane correctly classifies the training data, and because it is the one which will generalize better with unseen data.
What exactly is the "margin" here? Margins are categorized in two ways: geometric vs. functional margin, and soft vs. hard margin. The geometric margin is the point-to-line distance in 2D, and in 3D it is the familiar point-to-plane distance. The functional margin can be pictured in 2D as how far points lie off the line (in 3D, off the plane); equivalently, it is the geometric margin with the normalizing denominator removed, i.e. not normalized. (This is also why SVM cannot measure by the functional margin: without normalization, maximizing it would be meaningless.) Soft vs. hard is about noise: with a hard margin, there will never be any data point inside the margin; if the data is noisy, we need a soft margin classifier.
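A quick numeric sketch of the distinction (the values of w, b, x below are assumptions for illustration): rescaling (w, b) inflates the functional margin at will, while the geometric margin, normalized by ||w||, stays put.

import numpy as np

# toy hyperplane and point (assumed values, only for illustration)
w, b = np.array([2.0, 1.0]), -2.0
x, y = np.array([3.0, 4.0]), 1.0

functional_margin = y * (np.dot(w, x) + b)                # scale-dependent: 8.0
geometric_margin = functional_margin / np.linalg.norm(w)  # normalized: ~3.58
print(functional_margin, geometric_margin)

# rescale (w, b) by 10: the functional margin becomes 80, the geometric margin is unchanged
print(y * (np.dot(10 * w, x) + 10 * b),
      y * (np.dot(10 * w, x) + 10 * b) / np.linalg.norm(10 * w))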
-
Following the goal above, suppose the hyperplane's equation is W·X = 0. A doubt may arise here:
- Why do we use the hyperplane equation W·X instead of y = ax + b? --> because the vector W will always be normal to the hyperplane
-
To clarify the steps to take:
- 1. Take a dataset. 2. Select two hyperplanes that separate the data, with no other points between them. 3. Maximize the distance between these two hyperplanes.
-
Turning these steps into a mathematical procedure:
Set up two hyperplanes, WX + b = -θ and WX + b = +θ. Why do we need two hyperplanes here? The idea is that the optimal hyperplane is assumed to lie midway between these two; once we find the two hyperplanes, we have found the optimal separating hyperplane.
The value of θ is irrelevant, so set it directly to 1, giving WX + b = -1 and WX + b = +1. Be clear here about what relation W and b have: does b depend on W, or is it independent? Clearly it is independent; picture it as two variables, W and b, both of which we must find so as to maximize the margin.
Two constraints must be satisfied: 1. anything with WX + b >= 1 is class 1; 2. anything with WX + b <= -1 is class -1. --> This restriction ensures there are no other points between the two hyperplanes.
Writing the two constraints as a single expression: y(wx + b) >= 1
How do we maximize the margin? For this question we currently know two things: one, the two hyperplanes WX + b = -1 and WX + b = +1; two, that there is a point x on the hyperplane WX + b = -1. Moving from x a distance m along the unit normal w/||w|| must land on the other hyperplane, giving:
w · (x + m · w/||w||) + b = 1
Expanding, and using w·x + b = -1 together with w·w = ||w||^2, this simplifies to m = 2/||w||.
The formula we obtain means: if ||w|| were unconstrained, we could make m arbitrarily large; so maximizing the margin amounts to minimizing ||w||.
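A minimal numeric check of m = 2/||w||, with an assumed toy hyperplane: start from a point on WX + b = -1, move m along the unit normal, and verify that we land on WX + b = +1.

import numpy as np

w, b = np.array([3.0, 4.0]), 1.0            # assumed toy hyperplane, ||w|| = 5
m = 2 / np.linalg.norm(w)                   # claimed margin width: 0.4

x_low = -2 * w / np.dot(w, w)               # a point satisfying w.x + b = -1
x_high = x_low + m * w / np.linalg.norm(w)  # move m along the unit normal

print(np.dot(w, x_low) + b)                 # -1.0
print(np.dot(w, x_high) + b)                # +1.0: exactly on the other hyperplane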
-
Now we naturally face an optimization problem: with every point subject to y(wx + b) >= 1, how do we minimize ||w||? First let us introduce theorem 1, which consists of two conditions; when both are satisfied, we can say we have found a local minimum of a scalar function.
-
Let f be a mapping from a set σ (whose elements are vectors) to the real numbers (whose elements are values), continuous and twice differentiable at x. Two concepts are involved here:
- gradient: a generalization of the usual concept of derivative of a function in one dimension to a function in several dimensions (the gradient of a function is a vector containing each of its partial derivatives). Note the symbol is the nabla, the inverted triangle in the figure.
- scalar function: a scalar-valued function is a function that takes one or more values but returns a single value. In our case f is a scalar-valued function.
positive definite:
A symmetric matrix A is called positive definite if x?Ax > 0 for all nonzero real n-dimensional vectors x.
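A small numpy sketch of this definition (the matrix A is an assumption for illustration), together with the equivalent all-eigenvalues-positive test:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                   # assumed symmetric toy matrix

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 2))                 # a few random nonzero vectors
print([float(x @ A @ x) for x in xs])        # all strictly positive
print(np.all(np.linalg.eigvalsh(A) > 0))     # True: eigenvalues 1 and 3 are > 0
-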
The four conditions in theorem 2 are equivalent, so the other three can also be used to decide positive definiteness. Here we choose leading principal minors to test whether the Hessian is positive definite, which involves three concepts:
- Minors: delete all values in some row and some column, then compute the determinant of what remains (remove the ith line and the jth column).
- Principal minors: the deleted row and column numbers coincide (remove the ith line and the jth column with i = j).
- Leading principal minor: the leading principal minor of A of order k is the minor of order k obtained by deleting the last n?k rows and columns. (The figure marks with a triangle symbol which rows and columns are deleted.) For an example, see the figure "Leading principal minor".
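Computing leading principal minors is mechanical; here is a minimal numpy sketch (the Hessian of f(x, y) = x^2 + y^2 is an assumed example, not the figure's matrix):

import numpy as np

def leading_principal_minors(A):
    # determinants of the top-left k x k submatrices, k = 1..n
    n = A.shape[0]
    return [np.linalg.det(A[:k, :k]) for k in range(1, n + 1)]

H = np.array([[2.0, 0.0],
              [0.0, 2.0]])            # Hessian of f(x, y) = x^2 + y^2
print(leading_principal_minors(H))    # [2.0, 4.0]: all > 0, so H is positive definite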
Having obtained a local minimum, how can we find a global minimum? Two steps: 1. find all the local minima; 2. take the smallest one; it is the global minimum. Another route is to check whether our f is convex; if it is, then we are sure its local minimum is a global minimum.
Theorem: a local minimum of a convex function is a global minimum. This in turn involves the definitions of convex function and convex set.
What is a convex function? A function is convex if you can trace a line between two of its points without crossing the function line.
A function is convex if its epigraph (the set of points on or above the graph of the function) is a convex set. In Euclidean space, a convex set is the region such that, for every pair of points within the region, every point on the straight line segment that joins the pair of points is also within the region.
For an example, see the figure "convex set".
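A tiny numeric illustration of the convex-set definition, using the unit disk as an assumed example: every point on a segment joining two points of the disk stays inside it.

import numpy as np

def in_unit_disk(p):
    return np.linalg.norm(p) <= 1.0

p1, p2 = np.array([0.8, 0.0]), np.array([0.0, 0.9])   # two points inside the disk
segment = [t * p1 + (1 - t) * p2 for t in np.linspace(0, 1, 50)]
print(all(in_unit_disk(p) for p in segment))          # True: the disk is convex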
- Likewise the Hessian can be used to decide convexity, but here we must check whether it is positive semidefinite, and semidefiniteness also has three equivalent conditions. See theorem 3.
More generally, a continuous, twice differentiable function of several variables is convex on a convex set if and only if its Hessian matrix is positive semidefinite on the interior of the convex set. The difference here is that we need to check all the principal minors, not only the leading principal minors.
- What does "on the interior of the convex set" mean? By definition, the domain of a convex function is a convex set, so "a function is convex on a convex set" means it is convex on its domain; "interior" simply means the open region, excluding the boundary.
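Checking all principal minors, as theorem 3 requires for semidefiniteness, is a small extension of the sketch above (the Hessian of f(x, y) = (x - y)^2 is an assumed example: convex, but only positive semidefinite):

import numpy as np
from itertools import combinations

def principal_minors(A):
    # determinants of all submatrices keeping the same set of row and column indices
    n = A.shape[0]
    return [np.linalg.det(A[np.ix_(idx, idx)])
            for k in range(1, n + 1)
            for idx in combinations(range(n), k)]

H = np.array([[2.0, -2.0],
              [-2.0, 2.0]])    # Hessian of f(x, y) = (x - y)^2
print(all(m >= -1e-12 for m in principal_minors(H)))  # True: positive semidefinite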
- Other ways to prove that a function is convex
- Now we come to solving the optimization problem for a convex function. This involves the concept of duality. According to Wikipedia:
Duality: duality means that optimization problems may be viewed from either of two perspectives, the primal problem or the dual problem (the duality principle). The solution to the dual problem provides a lower bound to the solution of the primal (minimization) problem.
It gives the minimum a lower bound. Among the lower bounds there is one special value, the infimum (i.e. the greatest lower bound). As a complement, in the other direction:
The same logic applies with the relation "greater than or equal", and we have the concepts of upper bound and supremum.
- Duality: when seeking a minimum, we solve the corresponding maximization problem; the maximum obtained will be <= the minimum, and the difference between the two is the duality gap. Correspondingly, when seeking a maximum we solve the corresponding minimization problem, and the minimum obtained will be >= the maximum, i.e. an upper bound. When the duality gap is positive we say that weak duality holds; when it is 0, strong duality holds.
- Lagrange multipliers:
In mathematical optimization, the method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to equality constraints.
-
How do we represent a 3D plot in the 2D plane? Contour lines. Two key points: 1. the z value is constant along a line: for each point on a line, the function returns the same value; 2. color acts as a marker: the darker the area is, the smaller the value of the function is.
-
The gradient can then be visualized as a vector field, with arrows pointing in the direction in which the function grows. What is the relation to the contour-line plot? In a contour plot, gradient vectors are very easy to draw: 1. they are perpendicular to the contour lines; 2. they point in the direction of increase.
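A minimal matplotlib sketch of both pictures at once, using f(x, y) = x^2 + y^2 (the objective that appears just below): contour lines plus the gradient field grad f = (2x, 2y) drawn with quiver.

import numpy as np
import matplotlib.pyplot as plt

xs, ys = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-2, 2, 200))
plt.contour(xs, ys, xs**2 + ys**2, levels=10)   # constant-value lines of f

# gradient field on a coarser grid: arrows are perpendicular to the contour
# lines and point toward increasing f
gx, gy = np.meshgrid(np.linspace(-2, 2, 15), np.linspace(-2, 2, 15))
plt.quiver(gx, gy, 2 * gx, 2 * gy)              # grad f = (2x, 2y)
plt.gca().set_aspect('equal')
plt.show()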
-
Draw the constraint function and the objective function as contour lines in one figure, along with the gradient vectors of both; the optimal point can then be found. In the figure, the objective function is x^2 + y^2 and the constraint is y = 1 - x.
∇f(x, y) = λ ∇g(x, y)
λ is the Lagrange multiplier. As the figure shows, we get the optimal solution when the two gradient vectors are parallel, regardless of whether they point the same way, and all the more regardless of whether they have equal length: multiplying by λ means they need not be equal in length. That is, we seek the x, y at which ∇L(x, y, λ) = 0. Now we need to write out L and solve.
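For the figure's example this can be solved symbolically; a short sympy sketch (the objective and constraint are the ones named above):

import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)
f = x**2 + y**2          # the objective from the figure
g = x + y - 1            # the constraint y = 1 - x, rewritten as g(x, y) = 0
L = f - lam * g          # the Lagrangian

# grad L = 0 recovers both the parallel-gradient condition and the constraint
print(sp.solve([sp.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True))
# [{lambda: 1, x: 1/2, y: 1/2}]
-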
Since we want the minimum of f(w) = (1/2)||w||^2, the λ multiplying each constraint must be taken nonnegative.
-
Another problem now arises when solving ∇L(x, y, λ) = 0: "the problem can only be solved analytically when the number of examples is small" (Tyson Smith, 2004), i.e. only when the number of constraint functions is small, so that there are few λ's, can we solve analytically. So we will once again rewrite the problem using the duality principle --> we need to minimize with respect to w and b, and to maximize with respect to a, at the same time (in the previous step we only needed to minimize).
Three questions must be made clear here. 1. The Lagrange multiplier method as given is for equality constraints, but here we have an inequality-constrained problem; if we still solve it with Lagrange multipliers, what conditions must be met? (KKT) 2. As said before, duality can be strong or weak; only when it is strong is the gap 0, and only then may we take the maximum of the dual problem as the minimum of the primal. So, does strong duality hold here? (Strong duality, i.e. Slater's condition below.) 3. Perhaps you still wonder why we may take the min over w, b and the max over a (the transformation from the Lagrangian to the dual problem).
-
We still need to introduce three pieces of theory: 1. the duality principle; 2. Slater's condition; 3. KKT.
First, take the partial derivatives of L with respect to w and b and set them to 0 (two equations here), then substitute the two equations back into L, eliminating w and b and leaving only the variable a; this yields L(a). The problem is thereby transformed into the Wolfe dual problem.
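Written out in the document's notation (this is the standard Wolfe dual derivation, and exactly the function the CVXOPT code below maximizes):

?L/?w = 0 --> w = Σ_i a_i y_i x_i
?L/?b = 0 --> Σ_i a_i y_i = 0
L(a) = Σ_i a_i - (1/2) Σ_i Σ_j a_i a_j y_i y_j (x_i · x_j)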
-
Hold on, yet another problem appears here.
Traditionally the Wolfe dual Lagrangian problem is constrained by the gradients being equal to zero. In theory, we should add the constraints ?L/?w = 0 and ?L/?b = 0. However, we only added the latter, because it is necessary for removing b from the function. However, we can solve the problem without the constraint ?L/?w = 0.
It may be unclear why the problem can still be solved without adding the constraint ?L/?w = 0; let us set that question aside for now.
-
Karush-Kuhn-Tucker conditions :first-order necessary conditions for a solution of an optimization problem to be optimal
Besides KKT, some regularity conditions must also hold, one of which is Slater's condition. Because the primal problem we are trying to solve is a convex problem, the KKT conditions are also sufficient for the point to be primal and dual optimal, and there is zero duality gap.
In other words: as long as the problem is convex and KKT is satisfied, we may say the result obtained is the optimal solution of the primal or dual problem, because Slater's condition is then certainly satisfied and the gap is 0. For SVM, if a result satisfies KKT, it is the optimal solution. (Detailed proof in [2], ch. 5.)
-
To recap: 1. stationarity means the partial derivatives are 0, i.e. a stationary point; with no constraint functions the gradient itself is 0, with constraints it is the gradient of the Lagrangian that is 0. 2. primal feasibility is the constraint of the primal problem. 3. dual feasibility is the constraint used when we solve L via duality theory. 4. complementary slackness means that either a = 0 or y(wx + b) - 1 = 0. This last one relates to support vectors:
Support vectors are examples having a positive Lagrange multiplier. They are the ones for which the constraint y(wx+b)-1>=0 is active. (We say the constraint is active when y(wx+b)-1=0 )
這里价卤,是否會疑惑為什么不能同時為0劝萤?為什么multiplier一定是正數(shù)?在KKT中慎璧,我們只選取支持向量床嫌,即將不等號約束改為等號約束,其他的點不考慮胸私。
"Solving the SVM problem is equivalent to finding a solution to the KKT conditions." (Burges, 1998)
Now that we have L(a), differentiating and solving gives a. Back-substituting into the zero-partial-derivative formula then gives w. Then, from the primal problem's constraint y(wx + b) - 1 >= 0, compute b.
-
Use a QP solver to solve the dual problem: the Python package CVXOPT. Convert the formula we need to solve in wolfe dual.jpg into the form CVXOPT supports, shown below. This introduces a Gram matrix: the matrix of all possible inner products of X.
There is a problem in that figure: the last term of the minimize part should be q.T * a; see the code, where we need q = cvxopt.matrix(-1 * np.ones(m)).
The code:

import numpy as np
import cvxopt
import cvxopt.solvers

X, y = ...  # obtain the data here; for cvxopt, y must be an array of +1.0/-1.0 floats
m = X.shape[0]  # number of examples

# Gram matrix: all pairwise inner products of X
K = np.array([np.dot(X[i], X[j]) for j in range(m) for i in range(m)]).reshape((m, m))
P = cvxopt.matrix(np.outer(y, y) * K)
q = cvxopt.matrix(-1 * np.ones(m))

# equality constraint: sum_i a_i * y_i = 0
A = cvxopt.matrix(y, (1, m))
b = cvxopt.matrix(0.0)

# inequality constraints: a_i >= 0, written as -a_i <= 0
G = cvxopt.matrix(np.diag(-1 * np.ones(m)))
h = cvxopt.matrix(np.zeros(m))

# solve the QP
solution = cvxopt.solvers.qp(P, q, G, h, A, b)

# Lagrange multipliers
multipliers = np.ravel(solution['x'])

# support vectors: the examples with a strictly positive multiplier
has_positive_multiplier = multipliers > 1e-7
sv_multipliers = multipliers[has_positive_multiplier]
support_vectors = X[has_positive_multiplier]
support_vectors_y = y[has_positive_multiplier]

# compute w and b
def compute_w(multipliers, X, y):
    return np.sum([multipliers[i] * y[i] * X[i] for i in range(len(y))], axis=0)

def compute_b(w, X, y):
    # average of y_i - w . x_i over the given (support vector) examples
    return np.sum([y[i] - np.dot(w, X[i]) for i in range(len(X))]) / len(X)

w = compute_w(multipliers, X, y)
w_from_sv = compute_w(sv_multipliers, support_vectors, support_vectors_y)
b = compute_b(w, support_vectors, support_vectors_y)
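As a sanity check, one can plug a tiny linearly separable dataset into the script above (the values below are assumptions for illustration) and classify new points by the sign of the decision function:

# assumed toy data to use for X, y in the script above
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [0.0, 2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def predict(x_new, w, b):
    # which side of the separating hyperplane the point falls on
    return np.sign(np.dot(w, x_new) + b)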
We saw that the original optimization problem can be rewritten using a Lagrangian function. Then, thanks to duality theory, we transformed the Lagrangian problem into the Wolfe dual problem. We eventually used the package CVXOPT to solve the Wolfe dual.
- Why do we need to transform the Lagrangian problem into the Wolfe dual?
Still to be added here: the concepts of the duality principle and Slater's condition.