Numerical optimizers for Logistic Regression

This post is reproduced in full from Numerical optimizers for Logistic Regression.
The formulas on the original page do not render, so the whole post is reposted here.

In this post I compare several implementations of Logistic Regression. The task was to implement a Logistic Regression model using standard optimization tools from scipy.optimize and compare them against state-of-the-art implementations such as LIBLINEAR.

In this blog post I'll write down all the implementation details of this model, in the hope that not only the conclusions but also the process would be useful for future comparisons and benchmarks.

Function evaluation

We consider the case in which the decision function is an affine function, i.e., f(x) = \langle x, w \rangle + c where w and c are parameters to estimate. The loss function for the \ell_2-regularized logistic regression, i.e. the function to be minimized is

\mathcal{L}(w, c, \lambda, X, y) = - \frac{1}{n}\sum_{i=1}^n \log(\phi(y_i (\langle X_i, w \rangle + c))) + \frac{\lambda}{2} w^T w

where \phi(t) = 1 / (1 + \exp(-t)) is the logistic function, \frac{\lambda}{2} w^T w is the regularization term and X, y are the input data, with X \in \mathbb{R}^{n \times p} and y \in \{-1, 1\}^n. However, this formulation is not great from a practical standpoint: for sufficiently negative values of t, say t = -1000, \exp(-t) will overflow, assigning the loss an (erroneous) value of +\infty. For this reason [1], we evaluate \log(\phi(t)) as

\log(\phi(t)) = \begin{cases} - \log(1 + \exp(-t)) \text{ if } t > 0 \\ t - \log(1 + \exp(t)) \text{ if } t \leq 0\\ \end{cases}
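
As a quick sanity check (this snippet is not part of the original post; the helper names log_phi_naive and log_phi_stable are mine), the direct formula overflows for large negative t, while the piecewise evaluation stays finite:

import numpy as np

def log_phi_naive(t):
    # direct formula: np.exp(-t) overflows to inf for large negative t,
    # which turns the loss into +inf
    return -np.log(1. + np.exp(-t))

def log_phi_stable(t):
    # piecewise formula above, written without branching by using |t|
    return np.where(t > 0, 0., t) - np.log1p(np.exp(-np.abs(t)))

t = np.array([-1000., -100., 0., 100.])
print(log_phi_naive(t))   # -inf for t = -1000 (overflow warning); the t = 100 entry rounds to -0.
print(log_phi_stable(t))  # finite everywhere: [-1000., -100., -0.693..., -3.72e-44]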

The gradient of the loss function is given by

\begin{aligned} \nabla_w \mathcal{L} &= \frac{1}{n}\sum_{i=1}^n y_i X_i (\phi(y_i (\langle X_i, w \rangle + c)) - 1) + \lambda w \\ \nabla_c \mathcal{L} &= \frac{1}{n}\sum_{i=1}^n y_i (\phi(y_i (\langle X_i, w \rangle + c)) - 1) \end{aligned}

Similarly, the logistic function \phi used here can be computed in a more stable way using the formula

\phi(t) = \begin{cases} 1 / (1 + \exp(-t)) \text{ if } t > 0 \\ \exp(t) / (1 + \exp(t)) \text{ if } t \leq 0\\ \end{cases}

Finally, we will also need the Hessian for some second-order methods, which is given by

\nabla_w^2 \mathcal{L} = \frac{1}{n} X^T D X + \lambda I

where I is the identity matrix and D is a diagonal matrix given by D_{ii} = \phi(y_i (\langle X_i, w \rangle + c))(1 - \phi(y_i (\langle X_i, w \rangle + c))).

In Python, these functions can be written as

import numpy as np

def phi(t):
    # logistic function, returns 1 / (1 + exp(-t))
    idx = t > 0
    out = np.empty(t.size, dtype=np.float64)
    out[idx] = 1. / (1 + np.exp(-t[idx]))
    exp_t = np.exp(t[~idx])
    out[~idx] = exp_t / (1. + exp_t)
    return out

def loss(x0, X, y, alpha):
    # logistic loss: -(1/n) * Sum{log(phi(y_i * z_i))} + (alpha/2) * ||w||^2
    w, c = x0[:X.shape[1]], x0[-1]
    z = X.dot(w) + c
    yz = y * z
    idx = yz > 0
    out = np.zeros_like(yz)
    out[idx] = np.log(1 + np.exp(-yz[idx]))
    out[~idx] = (-yz[~idx] + np.log(1 + np.exp(yz[~idx])))
    out = out.sum() / X.shape[0] + .5 * alpha * w.dot(w)
    return out

def gradient(x0, X, y, alpha):
    # gradient of the logistic loss
    w, c = x0[:X.shape[1]], x0[-1]
    z = X.dot(w) + c
    z = phi(y * z)
    z0 = (z - 1) * y
    grad_w = X.T.dot(z0) / X.shape[0] + alpha * w
    grad_c = z0.sum() / X.shape[0]
    return np.concatenate((grad_w, [grad_c]))
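
The post gives the Hessian formula but no code for it. A minimal sketch consistent with the formula above might look as follows (the function name hessian is mine, not from the original post; the 1/n scaling matches the one used in loss and gradient):

def hessian(x0, X, y, alpha):
    # Hessian of the loss with respect to w: (1/n) X^T D X + alpha * I,
    # where D_ii = phi(y_i * z_i) * (1 - phi(y_i * z_i))
    w, c = x0[:X.shape[1]], x0[-1]
    d = phi(y * (X.dot(w) + c))
    d = d * (1 - d)
    return X.T.dot(X * d[:, np.newaxis]) / X.shape[0] + alpha * np.eye(X.shape[1])

The gradient can also be sanity-checked against a finite-difference approximation, for instance:

from scipy.optimize import check_grad

rng = np.random.RandomState(0)
X_small, y_small = rng.randn(20, 5), np.sign(rng.randn(20))
err = check_grad(loss, gradient, rng.randn(6), X_small, y_small, 1.)
print(err)  # should be small, roughly on the order of 1e-6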

Benchmarks

I tried several methods to estimate this \ell_2-regularized logistic regression. One of them, Conjugate Gradient, is a first-order method (that is, it only makes use of the gradient and not of the Hessian), while all the others use second-order information, either through quasi-Newton approximations or truncated Newton steps. The methods I tested are:

  • CG = Conjugate Gradient as implemented in scipy.optimize.fmin_cg
  • TNC = Truncated Newton as implemented in scipy.optimize.fmin_tnc
  • BFGS = Broyden–Fletcher–Goldfarb–Shanno method, as implemented in scipy.optimize.fmin_bfgs.
  • L-BFGS = Limited-memory BFGS as implemented in scipy.optimize.fmin_l_bfgs_b. Contrary to the BFGS algorithm, which is written in Python, this one wraps a C implementation.
  • Trust Region = Trust Region Newton method [1]. This is the solver used by LIBLINEAR, which I've wrapped to accept any Python function in the package pytron.

To ensure the most accurate results across implementations, all timings were collected by callback functions that were called by the algorithm on each iteration. Finally, I plot the maximum absolute value of the gradient (i.e., the infinity norm of the gradient) against time.
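
The original benchmark code is not reproduced in the post; below is a rough sketch of how such a callback-based timing loop could look with the scipy solvers, using the loss and gradient functions defined above (the names benchmark, cb and trace are mine, and the exact bookkeeping in the original scripts may differ):

import time
from scipy import optimize

def benchmark(fmin, x0, X, y, alpha):
    # run one solver, recording (elapsed time, ||grad||_inf) at every iteration
    trace = []
    start = time.time()

    def cb(xk):
        g = gradient(xk, X, y, alpha)
        trace.append((time.time() - start, np.abs(g).max()))

    fmin(loss, x0, fprime=gradient, args=(X, y, alpha), callback=cb)
    return trace

In recent SciPy versions the same call works for optimize.fmin_cg, optimize.fmin_bfgs, optimize.fmin_l_bfgs_b and optimize.fmin_tnc, since they all accept fprime, args and a per-iteration callback.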

The synthetic data used in the benchmarks was generated as described in [2]: the design matrix X is Gaussian noise, the vector of coefficients w is also drawn from a Gaussian distribution, and the explained variable y is generated as y = \text{sign}(X w). The matrix X is then perturbed by adding Gaussian noise with covariance 0.8. The number of samples and features was fixed to 10^4 and 10^3, respectively. The penalization parameter \lambda was fixed to 1.
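
For reference, here is a hedged reconstruction of that data-generation step, reusing the benchmark helper sketched above (the random seed and the exact noise scaling used in the original benchmark are not stated, so the values below are illustrative):

n_samples, n_features = 10 ** 4, 10 ** 3
rng = np.random.RandomState(0)

X = rng.randn(n_samples, n_features)                   # Gaussian design matrix
w_true = rng.randn(n_features)                         # Gaussian coefficients
y = np.sign(X.dot(w_true))                             # labels in {-1, 1}
X += np.sqrt(0.8) * rng.randn(n_samples, n_features)   # Gaussian perturbation, variance 0.8

x0 = np.zeros(n_features + 1)                          # initial (w, c)
trace = benchmark(optimize.fmin_l_bfgs_b, x0, X, y, 1.)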

In this setting variables are typically uncorrelated and most solvers perform decently:

[Figure: Benchmark Logistic — uncorrelated design]

Here, the Trust Region and L-BFGS solvers perform almost equally well, with Conjugate Gradient and Truncated Newton falling slightly behind. I was surprised by the difference between BFGS and L-BFGS; I would have thought that when memory is not an issue both algorithms would perform similarly.

To make things more interesting, we now make the design slightly more correlated. We do so by adding a constant term of 1 to the matrix X, and also appending a column vector of ones to this matrix to account for the intercept. These are the results:

[Figure: Benchmark Logistic — mildly correlated design]

Here, we already see that second-order methods dominate over first-order methods (well, except for BFGS), with Trust Region clearly dominating the picture but with TNC not far behind.

Finally, if we force the matrix to be even more correlated (we add 10. to the design matrix X), then we have:

[Figure: Benchmark Logistic — strongly correlated design]

Here, the Trust Region method has the same timing as before, but all other methods have become substantially worse. The Trust Region method, unlike the others, is surprisingly robust to correlated designs.

To sum up, the Trust Region method performs extremely well for optimizing the Logistic Regression model under different conditionings of the design matrix. The LIBLINEAR software uses this solver and thus has similar performance, with the sole exception that the evaluation of the logistic function and its derivatives is done in C++ instead of Python. In practice, however, due to the small number of iterations of this solver I haven't seen any significant difference.
