Machine Learning: Classification 2

A quick review of logistic regression
Logistic regression models the conditional distribution of the response Y given the predictors X by applying the logistic function to a linear combination of the predictors, which is equivalent to modeling the log odds as a linear function:
\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1X_1 + \beta_2X_2+...+\beta_pX_p

Logistic regression also makes several assumptions:

  • The predictors are linearly associated with the log odds
  • The response is a binary variable with only two classes

To estimate the unknown parameters of the function, \beta_0, \beta_1, ..., \beta_p, maximum likelihood is used instead of least squares. We aim to find the coefficients that maximize the likelihood of the observed classes, after the straight line has been converted into an s-shaped probability curve via the logistic function of the log odds.
\ell(\beta_0,\beta_1,...,\beta_p) = \prod_{i:\,y_i=1} p(x_i)\prod_{i':\,y_{i'}=0} \big(1-p(x_{i'})\big)
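As a concrete illustration of this fitting procedure, here is a minimal sketch (not part of the original article) using scikit-learn's LogisticRegression on synthetic data; the dataset and all parameter values are made up purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical two-class data with p = 3 predictors
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Fit beta_0, ..., beta_p by maximum likelihood
clf = LogisticRegression().fit(X, y)
print("intercept (beta_0):", clf.intercept_)
print("coefficients (beta_1..beta_p):", clf.coef_)

# Predicted probabilities P(Y = 1 | X) for the first few observations
print(clf.predict_proba(X[:3])[:, 1])
```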

Sometimes we may need to classify a response with more than two classes; in other words, we need to perform multiclass classification. Although the two-class logistic regression model does have multiple-class extensions, in practice they tend not to be used very often, because another method, Linear Discriminant Analysis, is more popular for multiple-class classification.

Linear Discriminant Analysis and Quadratic Discriminant Analysis

Agenda

  • LDA
  • QDA
  • How to choose between LDA and QDA (variance/bias trade off)

LDA

What is Linear Discriminant Analysis?
Unlike logistic regression, in which we directly model the conditional distribution of the response given the predictors X, LDA models this conditional distribution indirectly by using Bayes' theorem. In other words, the method separately models the distribution of the predictors X within each class of Y, and then uses Bayes' theorem to flip these around into estimates of Pr(Y=k|X=x). When these class distributions are assumed to be normal, the resulting model turns out to be quite similar to logistic regression.

How does Bayes' theorem work in LDA?
Before explaining the use of Bayes' theorem in LDA, let me introduce some terminology first:

Prior Probability
Prior probability, in Bayesian statistical inference, is the probability of an event before new information or data is collected. It is the best rational assessment of the probability of an outcome based on current knowledge, before an experiment is performed.

Posterior Probability
The prior probability needs to be revised when new information or data becomes available; the revised probability is the posterior probability. In statistical terms, it is the probability of event A occurring given that event B has occurred.

Bayes Theorem
Bayes' theorem describes the probability of an event in light of new information related to that event, i.e. its posterior probability.
P(A) = \text{the prior probability of A occurring}\\ P(A|B) = \frac{P(A\cap B)}{P(B)} = \frac{P(A)P(B|A)}{P(B)} = \text{the posterior probability of A given that B has occurred}\\ P(B|A) = \frac{P(A\cap B)}{P(A)} = \text{the conditional probability of B given that A has occurred}

Now suppose that we want to classify an observation into one of K classes, where K \geq 2. Let \pi_k = Pr(Y=k) represent the prior probability that a randomly chosen observation comes from the kth class, and let f_k(x) = Pr(X=x|Y=k) denote the density of X for an observation from the kth class. We have:
Pr(Y=k|X=x) = \frac{Pr(X=x \cap Y=k)}{Pr(X=x)} = \frac{Pr(Y=k)Pr(X=x|Y=k)}{Pr(X=x)} = \frac{Pr(Y=k)Pr(X=x|Y=k)}{\sum_{l=1}^{K} Pr(Y=l)\, Pr(X=x|Y=l)}\\ \because \text{we denote } f_k(x) = Pr(X=x|Y=k) \text{ and } \pi_k = Pr(Y=k)\\ \therefore Pr(Y=k|X=x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K}\pi_l f_l(x)}

  • Pr(Y=k) is the prior probability
  • Pr(Y=k|X=x) is the posterior probability
  • Pr(Y=k|X=x) = \frac{Pr(X=x \cap Y=k)}{Pr(X=x)} = \frac{Pr(Y=k)Pr(X=x|Y=k)}{Pr(X=x)} is Bayes' theorem
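To make the flip from class densities to posterior probabilities concrete, here is a minimal numeric sketch for two classes with Gaussian f_k; all means, variances, and priors below are hypothetical numbers chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities f_k(x) and priors pi_k
mu = np.array([0.0, 2.0])      # class means
sigma = 1.0                    # common standard deviation (LDA-style)
pi = np.array([0.6, 0.4])      # prior probabilities Pr(Y = k)

x0 = 1.5                       # a new observation to classify

# Bayes' theorem: posterior_k = pi_k * f_k(x0) / sum_l pi_l * f_l(x0)
f = norm.pdf(x0, loc=mu, scale=sigma)
posterior = pi * f / np.sum(pi * f)

print("posterior probabilities:", posterior)
print("predicted class:", np.argmax(posterior))
```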

Estimating parameters of LDA

First, let's look at the discriminant function of LDA (for the case of p = 1 predictor):
\hat{\delta}_k (x) = x\cdot\frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat{\mu}_k^2}{2\hat{\sigma}^2}+\log(\hat{\pi}_k)
We do not need to memorize this formula. All we need to know is that it is a linear function of x, and that its unknown parameters are \mu_k and \sigma (plus the prior \pi_k). This linearity in x is why the method is called Linear.
Based on the formula, we can also read off the assumption of Linear Discriminant Analysis: LDA assumes that the observations in each class come from a Gaussian distribution with a class-specific mean \mu_k and a common variance \sigma^2.

The function looks a little complicated because of all the Greek letters. It may be clearer when written like this:
\log(\frac{p(x)}{1-p(x)}) =c_0 + c_1x\\ c_0 \text{ and } c_1 \text{ are functions of } \mu_1,\mu_2, \text{and } \sigma^2.
In fact, as noted above, LDA is simply another, indirect way of modeling the distribution of the response Y, so it is very similar to logistic regression; the two mainly differ in the fitting procedure. LDA still ends up modeling the relationship between the log odds and the predictors, but LDA estimates \mu_k, \sigma^2, and \pi_k from the sample, while logistic regression estimates \beta_0, ..., \beta_p by maximum likelihood.
For the two-class case, the log odds can therefore be understood as the difference of the two discriminant scores:
\log(\frac{p_1(x)}{1-p_1(x)}) = c_0 + c_1x = \hat{\delta}_1(x) - \hat{\delta}_2(x)

  • c_0 corresponds to -\frac{\hat{\mu}_1^2-\hat{\mu}_2^2}{2\hat{\sigma}^2}+\log(\hat{\pi}_1/\hat{\pi}_2)
  • c_1 corresponds to \frac{\hat{\mu}_1-\hat{\mu}_2}{\hat{\sigma}^2}

Estimating the means, variance, and priors from the sample is straightforward: \hat{\mu}_k is the average of the training observations in class k, \hat{\sigma}^2 is the pooled variance of the observations around their class means, and \hat{\pi}_k is the fraction of training observations in class k. We then assign a new observation to the class for which the plug-in discriminant score \hat{\delta}_k(x) is largest; a small numeric sketch follows.
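Here is a minimal sketch, using numpy and entirely made-up 1-D training data, of how these plug-in estimates and discriminant scores could be computed by hand.

```python
import numpy as np

# Hypothetical 1-D training data for two classes
x = np.array([1.2, 0.8, 1.5, 0.9, 3.9, 4.2, 3.6, 4.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

classes = np.unique(y)
n, K = len(x), len(classes)

# Plug-in estimates used by LDA
mu = np.array([x[y == k].mean() for k in classes])     # class means
pi = np.array([(y == k).mean() for k in classes])      # class priors
sigma2 = sum(((x[y == k] - m) ** 2).sum()              # pooled variance
             for k, m in zip(classes, mu)) / (n - K)

def delta(x0):
    """Discriminant scores delta_k(x0), one per class."""
    return x0 * mu / sigma2 - mu ** 2 / (2 * sigma2) + np.log(pi)

x0 = 2.0
print("scores:", delta(x0), "-> predicted class:", classes[np.argmax(delta(x0))])
```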

OK, now that we know a fair amount about LDA, let me compare it with logistic regression.

Logistic regression Versus LDA

Consider the two-class setting with p = 1 predictor, and let p_1(x) and p_2(x) = 1-p_1(x) be the probabilities that x belongs to class 1 and class 2, respectively.

  • Logistic regression can be written as
    \log(\frac{p(x)}{1-p(x)}) = \log(\frac{p_1(x)}{p_2(x)}) = \beta_0+\beta_1x
  • For LDA, the model is
    \log(\frac{p(x)}{1-p(x)}) = \log(\frac{p_1(x)}{p_2(x)})=c_0 + c_1x
    Therefore, since both models are linear in x, LDA and logistic regression both produce linear decision boundaries.

On the other hand, there is also a significant difference between the two approaches: they use different methods for estimating the parameters.
Logistic regression estimates its parameters by maximum likelihood, while linear discriminant analysis estimates its parameters, c_0 and c_1, from the estimated means and variance of an assumed normal distribution.

When to use LDA and when to use logistic regression?
When the true decision boundary is linear, we can choose between the two approaches as follows:

  • If the observations come from a Gaussian distribution with a class-specific mean and a common covariance matrix in each class, LDA can provide improvements over logistic regression
  • If the normality assumption is not met, logistic regression tends to perform better than LDA
  • If the sample is very small, or the response has more than two classes, LDA is also the more common choice (a small comparison sketch follows this list)
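As a rough empirical illustration of the first point, here is a minimal sketch comparing scikit-learn's LinearDiscriminantAnalysis and LogisticRegression on synthetic Gaussian data that satisfies the LDA assumptions; the data-generating parameters are invented for illustration, and on such small simulations the two methods usually score very similarly.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Two Gaussian classes with different means and a common covariance (LDA's assumption)
cov = np.array([[1.0, 0.3], [0.3, 1.0]])
X = np.vstack([rng.multivariate_normal([0, 0], cov, 200),
               rng.multivariate_normal([1.5, 1.5], cov, 200)])
y = np.repeat([0, 1], 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("LDA", LinearDiscriminantAnalysis()),
                    ("Logistic regression", LogisticRegression())]:
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))
```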

Quadratic Discriminant Analysis

LDA assumes that observations within each class are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix that is common to all K classes. In contrast, quadratic discriminant analysis assumes that each class has its own covariance matrix: for example, the correlation between the predictors might be 0.5 within class 1 but -0.5 within class 2. The functional form of QDA is quadratic instead of linear, which is why the method is called quadratic.

The function of QDA

[Figure: the QDA discriminant function \delta_k(x)]
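For reference (the original image is not reproduced here), the QDA discriminant in the multivariate case is usually written in the following standard form, where \mu_k and \Sigma_k are the class-specific mean vector and covariance matrix:
\delta_k(x) = -\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) - \frac{1}{2}\log|\Sigma_k| + \log\pi_k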

Again, we do not need to memorize the complex equation; what matters is that it is a quadratic function of x, and that its unknown parameters are the class-specific \mu_k and \Sigma_k (plus the prior \pi_k).

Summary
So based on this, we can summarize the similarities and differences between LDA and QDA.

  • LDA's discriminant is a linear function of x, giving a linear decision boundary, while QDA's is a quadratic function of x, giving a quadratic decision boundary
  • LDA assumes that every class shares the same covariance matrix, while QDA assumes that each class has its own covariance matrix
  • LDA and QDA both assume that the observations in each class come from a Gaussian distribution

In which scenarios should we use LDA, and in which should we use QDA?
The answer depends on one core concept that is omnipresent throughout the study and practice of machine learning: the variance/bias tradeoff.

  • If the K classes share a common covariance matrix, the true decision boundary is linear and LDA's linear form is exactly right. In this setting, LDA is a less flexible classifier than QDA and thus has lower variance, so LDA can improve prediction performance. In addition, if there are very few training observations, LDA is also the better option because reducing variance is the priority. QDA would only fit a model much more flexible than necessary, and its substantially higher variance would not be offset by any decrease in bias.
  • If the training set is very large, or if the assumption of a common covariance matrix does not hold, QDA is recommended. With a large training set, variance shrinks as more samples come in and is no longer the main concern, so we should pay attention to bias, and a more flexible model is preferable. (A small simulation sketch follows this list.)
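The following minimal simulation sketch illustrates this trade-off under invented data-generating parameters: when the two classes genuinely share a covariance matrix, LDA tends to do at least as well as QDA, and when each class has a very different covariance matrix, QDA tends to win; exact numbers will vary from run to run.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

def make_data(cov0, cov1, n=300):
    """Two Gaussian classes with the given class covariance matrices."""
    X = np.vstack([rng.multivariate_normal([0, 0], cov0, n),
                   rng.multivariate_normal([2, 2], cov1, n)])
    y = np.repeat([0, 1], n)
    return train_test_split(X, y, random_state=0)

shared = np.eye(2)
different = np.array([[1.0, 0.8], [0.8, 1.0]])

for label, cov1 in [("shared covariance", shared), ("different covariances", different)]:
    X_tr, X_te, y_tr, y_te = make_data(shared, cov1)
    for name, model in [("LDA", LinearDiscriminantAnalysis()),
                        ("QDA", QuadraticDiscriminantAnalysis())]:
        print(label, "-", name, "accuracy:", model.fit(X_tr, y_tr).score(X_te, y_te))
```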

Given that the variance/bias tradeoff lies at the heart of machine learning and statistical learning, let me introduce it. Once we grasp its essence, we basically know how to choose among various machine learning algorithms.

Variance/Bias tradeoff

We know that the performance of a model can be evaluated by the MSE (mean squared error), given by:
MSE = \frac{1}{n}\sum^n_{i=1}\big(y_i-\hat{f}(x_i)\big)^2
We want the MSE to be as small as possible, because a small MSE means that the predicted values are close to the true values.
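For completeness, the formula is nothing more than the following few lines of numpy (with made-up example values):

```python
import numpy as np

y_true = np.array([3.0, 1.5, 4.0, 2.2])   # hypothetical observed responses y_i
y_pred = np.array([2.8, 1.9, 3.5, 2.0])   # hypothetical fitted values f_hat(x_i)

mse = np.mean((y_true - y_pred) ** 2)     # (1/n) * sum of squared errors
print(mse)
```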

Also, we know that the evaluation of a model should be performed on the test set rather than the training set; that is to say, we ultimately want a model with the minimum test MSE.
Now, the expected test MSE can be decomposed into the sum of three parts:
\text{the variance of }\hat{f}(x_0), \text{ the squared bias of }\hat{f}(x_0), \text{ and the variance of the error term } \epsilon.\\ E\big(y_0-\hat{f}(x_0)\big)^2 = Var(\hat{f}(x_0))+[\text{Bias}(\hat{f}(x_0))]^2+Var(\epsilon)

The equation tells us that in order to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves low variance and low bias.
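This decomposition can be made tangible with a small simulation: repeatedly draw training sets from a known true function, fit models of low and high flexibility, and measure the variance and squared bias of the fitted values at a single point x_0. Everything below (the true function, the noise level, the polynomial degrees) is an assumption chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
f_true = lambda x: np.sin(2 * x)          # assumed true regression function
x0, sigma_eps, n_reps = 1.0, 0.3, 500     # evaluation point, noise sd, number of training sets

for degree in [1, 6]:                     # low- vs high-flexibility polynomial fits
    preds = []
    for _ in range(n_reps):
        x = rng.uniform(0, 3, 30)                          # a fresh training set
        y = f_true(x) + rng.normal(0, sigma_eps, 30)
        coefs = np.polyfit(x, y, degree)                   # fit polynomial of given degree
        preds.append(np.polyval(coefs, x0))                # prediction f_hat(x0)
    preds = np.array(preds)
    variance = preds.var()
    bias_sq = (preds.mean() - f_true(x0)) ** 2
    print(f"degree {degree}: variance={variance:.4f}, bias^2={bias_sq:.4f}, "
          f"var+bias^2+Var(eps)={variance + bias_sq + sigma_eps**2:.4f}")
```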

Let me borrow an example figure from An Introduction to Statistical Learning for explanation.

[Figure from An Introduction to Statistical Learning: simulated data with fitted curves of varying flexibility (left-hand panel referenced below)]

First, what is variance?
Variance refers to the amount by which \hat{f} would change if we estimated it using a different training data set. Since the training data are used to fit the statistical learning method, different training data sets will result in a different \hat{f}. Ideally, the estimate of f should not vary too much between training sets; however, if a method has high variance, then a small change in the training data can cause large changes in \hat{f}. Take the green curve in the left-hand panel of the figure as an example: it follows the data very closely, which means that the fitted values are very close to the observed values. So this curve has low bias but high variance.
Why does it have high variance? Because changing even a few data points may cause the estimate of f to change considerably, which we can see from how wildly the slope of the green curve varies.

What is bias?
Bias refers to the error that is introduced by approximating a real-life problem with a much simpler model. For example, linear regression assumes that there is a linear relationship between X and Y. It is unlikely that any real-life problem truly has such a simple relationship, so performing linear regression will inevitably produce predicted values that are some distance from the true values, that is, some bias in \hat{f}.

Now, with the knowledge of variance and bias, what is the variance/bias tradeoff?
Good test set performance of a statistical learning method requires both low variance and low bias. This is referred to as a trade-off because it is easy to obtain a method with extremely low bias but high variance (for example, a curve that passes through every single observation) or with very low variance but high bias (for example, a horizontal line); the challenge lies in finding a method for which both are low at the same time.

A general rule about v/b tradeoff
As flexibility increases, variance increases but bias decreases, and the relative rate of change of the two terms determines whether the overall test MSE goes up or down. As we first increase the flexibility of a statistical method, the bias tends to decrease faster than the variance increases, so the overall test MSE decreases. However, at some point, increasing the flexibility further has little impact on the bias but significantly increases the variance of the model, and the overall test MSE starts to increase. This is why a simple linear regression model sometimes performs better than highly flexible methods such as KNN and decision trees.
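This U-shape is easy to reproduce empirically. Below is a minimal sketch, on made-up data, that uses K-nearest-neighbors regression and treats a smaller number of neighbors as higher flexibility; the true function, noise level, and grid of k values are all assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
f_true = lambda x: np.sin(2 * x)

# Hypothetical training and test sets drawn from the same true function
x_train = rng.uniform(0, 3, 80);  y_train = f_true(x_train) + rng.normal(0, 0.3, 80)
x_test = rng.uniform(0, 3, 200);  y_test = f_true(x_test) + rng.normal(0, 0.3, 200)

# Smaller k = more flexible fit; watch train MSE fall while test MSE is U-shaped
for k in [50, 20, 10, 5, 2, 1]:
    knn = KNeighborsRegressor(n_neighbors=k).fit(x_train.reshape(-1, 1), y_train)
    train_mse = mean_squared_error(y_train, knn.predict(x_train.reshape(-1, 1)))
    test_mse = mean_squared_error(y_test, knn.predict(x_test.reshape(-1, 1)))
    print(f"k={k:>2}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```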

Exercise
OK, now that we know the concept of the variance/bias tradeoff, let me post one exercise about it.

  • Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.
  • Explain why each curve has the shape you have drawn.

Solution

[Figure: sketch of the five curves (squared bias, variance, training error, test error, and Bayes error) plotted against flexibility]

  • Training error. The training error curve decreases monotonically, because the fitted curve follows the data points more and more closely as flexibility increases, giving an ever closer fit to the training data.
  • Test error. The test error is a U-shaped curve: it decreases at first, but starts to increase significantly after a particular point. The reason is that the test error falls as flexibility increases, but at some point additional flexibility has little impact on bias while significantly increasing the variance. At that stage we run into overfitting: the training MSE is small but the test MSE is large.
  • Bias and variance. As a general rule, as flexibility increases, bias decreases and variance increases. Variance refers to the amount by which \hat{f} would change if we estimated it using a different training set; bias refers to the error introduced by approximating a real-life problem with a simpler model. As flexibility increases, highly flexible learning methods follow the data very closely, so they have low bias but high variance, because a small change in the data points can cause substantial changes in the estimate of f.
  • Bayes error. This is a horizontal line lying below the test MSE curve, because the expected test MSE is always greater than Var(\epsilon), the irreducible error.

Next article

  • The comparison of various classifiers
  • Confusion matrix for evaluation of the performance of classifiers