CS540 Intro to AI
Linear Models in Machine Learning
Lecturer: Xiaojin Zhu, jerryzhu@cs.wisc.edu

We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression, and logistic regression for classification.

1 Linear Regression

1.1 The 1D case without regularization (Ordinary Least Squares)

x: input variable (also called independent, predictor, or explanatory variable)
y: output variable (also called dependent or response variable)

A scatter plot shows $(x_i, y_i)$ for $i = 1, \ldots, n$ as dots.

Model assumption: $Y = \beta_0 + \beta_1 x + \epsilon$ where $\epsilon \sim N(0, \sigma^2)$. On a test point $x$,
\[
E(Y \mid x) = E(\beta_0 + \beta_1 x + \epsilon) = \beta_0 + \beta_1 x + E(\epsilon) = \beta_0 + \beta_1 x,
\qquad
\mathrm{Var}(Y \mid x) = \sigma^2.
\]
Given data, the maximum likelihood estimate equals the least squares estimate
\[
(\hat\beta_0, \hat\beta_1) = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^n (\beta_0 + \beta_1 x_i - y_i)^2.
\]
The objective is convex. Setting the gradient to zero gives
\[
\hat\beta_1 = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2},
\qquad
\hat\beta_0 = \bar y - \hat\beta_1 \bar x,
\]
where $\bar x = (\sum_i x_i)/n$ and $\bar y = (\sum_i y_i)/n$. This is known as ordinary least squares (OLS).

Let $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$. The residual is $y_i - \hat y_i$. The residual sum of squares is $\sum_i (y_i - \hat y_i)^2$, and an unbiased estimate of $\sigma^2$ is
\[
\hat\sigma^2 = \frac{\sum_i (y_i - \hat y_i)^2}{n - 2}.
\]
The denominator is $n - 2$ because $\hat\beta_0, \hat\beta_1$ used up two degrees of freedom.

Note the best constant fit would have been $f(x) = \bar y$, whose residual sum of squares is $\sum_i (y_i - \bar y)^2$. The coefficient of determination $r^2$ is a measure of how much better a non-constant line fit is:
\[
r^2 = 1 - \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2}.
\]
$r^2 \in [0, 1]$, with 1 being desirable.
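As a concrete illustration, the following is a minimal NumPy sketch of the closed-form 1D OLS fit above. The synthetic data and the variable names (beta0_hat, beta1_hat, etc.) are mine, not part of the notes.

```python
import numpy as np

# Toy data: y is roughly 2 + 3x plus Gaussian noise (illustrative values only).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=50)
n = len(x)

x_bar, y_bar = x.mean(), y.mean()

# Closed-form OLS estimates from the formulas above.
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Residual sum of squares, unbiased noise-variance estimate, and r^2.
y_hat = beta0_hat + beta1_hat * x
rss = np.sum((y - y_hat) ** 2)
sigma2_hat = rss / (n - 2)
r2 = 1.0 - rss / np.sum((y - y_bar) ** 2)

print(beta0_hat, beta1_hat, sigma2_hat, r2)
```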
1.2 Ridge regression

Input variables $x = (x_0 = 1, x_1, \ldots, x_p)^\top \in \mathbb{R}^{p+1}$, output variable $y \in \mathbb{R}$. The model is
\[
y = x^\top \beta + \epsilon, \qquad \epsilon \sim N(0, \sigma^2).
\]
The model is linear in $\beta$; the $x_i$'s can be nonlinear functions of the input. Examples: $p$th-degree polynomial regression
\[
y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_p x^p + \epsilon,
\]
and the second order model
\[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_1 x_2 + \beta_5 x_2^2 + \epsilon.
\]
In general $y = \beta_0 + \sum_j \beta_j \phi_j(x) + \epsilon$, where the $\phi_j(\cdot)$'s are arbitrary feature functions of $x$. An unbiased estimate of $\sigma^2$ is
\[
\hat\sigma^2 = \frac{\sum_i (y_i - \hat y_i)^2}{n - (p+1)}.
\]
It will be convenient to arrange the input in a matrix $X_{n \times (p+1)}$, where each row is a feature vector, and a label vector $y_{n \times 1}$. The linear regression model is then
\[
y = X\beta + \epsilon,
\]
where $\epsilon$ is a vector, too. Without regularization, the ordinary least squares (OLS) method finds the parameter by solving
\[
\min_\beta \|y - X\beta\|^2.
\]
Rewrite the objective as
\[
(y - X\beta)^\top (y - X\beta) = y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X \beta.
\]
Take the gradient with respect to $\beta$ and set it to the zero vector:
\[
-2 X^\top y + 2 X^\top X \beta = 0.
\]
Assuming $X^\top X$ is invertible, we obtain the solution
\[
\beta = (X^\top X)^{-1} X^\top y.
\]
OLS is certainly not going to work, however, in the high-dimensional setting when $p > n$, since $X^\top X$ is rank deficient and not invertible.

More frequently used in practice for its stability, ridge regression solves a regularized optimization problem instead:
\[
\min_\beta \|y - X\beta\|^2 + \lambda \|\beta\|^2. \tag{1}
\]
Rewrite the objective as
\[
y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X \beta + \lambda \beta^\top \beta.
\]
Take the gradient with respect to $\beta$ and set it to the zero vector:
\[
-2 X^\top y + 2 X^\top X \beta + 2\lambda\beta = 0
\quad\Longleftrightarrow\quad
-2 X^\top y + 2 (X^\top X + \lambda I) \beta = 0.
\]
The solution is
\[
\beta = (X^\top X + \lambda I)^{-1} X^\top y.
\]
Note $(X^\top X + \lambda I)$ is always invertible when $\lambda > 0$. With an appropriate $\lambda$, ridge regression can have smaller mean squared error than OLS.
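As a quick illustration, here is a minimal NumPy sketch of the matrix OLS and ridge solutions above. The synthetic data is my own, and in practice one solves the linear systems rather than forming explicit matrix inverses.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
beta_true = np.array([1.0, -2.0, 0.5, 3.0])                # includes the intercept beta_0
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])  # prepend the constant feature x0 = 1
y = X @ beta_true + rng.normal(0.0, 0.1, size=n)

# OLS: beta = (X^T X)^{-1} X^T y, computed via a linear solve.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: beta = (X^T X + lambda I)^{-1} X^T y; well-defined even when p > n.
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p + 1), X.T @ y)

print(beta_ols)
print(beta_ridge)
```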
2 Logistic Regression

Consider binary classification with $y \in \{-1, 1\}$, where each example is represented by a feature vector $x$. The intuition is to first map $x$ to a real number, such that a large positive number means that $x$ is likely to be positive ($y = 1$), and a negative number means $x$ is likely to be negative ($y = -1$). This is done just like in linear regression:
\[
\theta^\top x. \tag{2}
\]
The next step is to squash the range down to $[0, 1]$ so one can interpret it as the label probability. This is done via the logistic function:
\[
p(y = 1 \mid x) = \sigma(\theta^\top x) = \frac{1}{1 + \exp(-\theta^\top x)}. \tag{3}
\]
For binary classification with the $\{-1, 1\}$ class encoding, one can unify the definitions of $p(y = 1 \mid x)$ and $p(y = -1 \mid x)$ with
\[
p(y \mid x) = \frac{1}{1 + \exp(-y\, \theta^\top x)}. \tag{4}
\]
Logistic regression can be easily generalized to multiple classes. Let there be $K$ classes. Each class has its own parameter $\theta_k$.¹ The probability is defined via the softmax function
\[
p(y = k \mid x) = \frac{\exp(\theta_k^\top x)}{\sum_{i=1}^K \exp(\theta_i^\top x)}. \tag{5}
\]
If a predicted class label (instead of a probability) is desired, simply predict the class with the largest probability $p(y = k \mid x)$. Logistic regression produces linear decision boundaries between classes. We will focus on binary classification in the rest of this note.

¹ Strictly speaking, one needs only $K - 1$ parameter vectors.
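The following NumPy sketch evaluates the probabilities in (3)–(5). The function names and the max-subtraction trick in the softmax (for numerical stability) are my own additions, not part of the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_binary(theta, x, y):
    """p(y | x) for y in {-1, +1}, as in Eq. (4)."""
    return sigmoid(y * (theta @ x))

def p_multiclass(thetas, x):
    """Softmax probabilities p(y = k | x) of Eq. (5); thetas has one row per class."""
    scores = thetas @ x
    scores = scores - scores.max()    # numerical stabilization; does not change the result
    e = np.exp(scores)
    return e / e.sum()

theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, -0.2])        # first entry is the constant feature x0 = 1
print(p_binary(theta, x, +1), p_binary(theta, x, -1))   # the two probabilities sum to 1

thetas = np.array([[0.5, -1.0, 2.0],
                   [0.0,  0.7, -0.3],
                   [1.0,  0.0,  0.1]])
probs = p_multiclass(thetas, x)
print(probs, probs.argmax())          # predicted class = arg max_k p(y = k | x)
```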
2.1 Training

Training (i.e., finding the parameter $\theta$) can be done by maximizing the conditional log likelihood of the training data $\{(x_i, y_i)\}_{i=1}^n$:
\[
\max_\theta \sum_{i=1}^n \log p(y_i \mid x_i, \theta). \tag{6}
\]
However, when the training data is linearly separable, two bad things happen: 1. $\|\theta\|$ goes to infinity; 2. there are an infinite number of MLEs. To see this, note that any step function (a sigmoid with $\|\theta\| = \infty$) whose threshold is in the gap between the two classes is an MLE.

One way to avoid this is to incorporate a prior on $\theta$ in the form of a zero-mean Gaussian with covariance $\lambda^{-1} I$,
\[
\theta \sim N(0, \lambda^{-1} I), \tag{7}
\]
and seek the MAP estimate. This is essentially smoothing, since large values of $\|\theta\|$ will be penalized more. That is, we seek the $\theta$ that maximizes
\[
\log p(\theta) + \sum_{i=1}^n \log p(y_i \mid x_i, \theta) \tag{8}
\]
\[
= -\frac{\lambda}{2} \|\theta\|^2 + \sum_{i=1}^n \log p(y_i \mid x_i, \theta) + \text{constant} \tag{9}
\]
\[
= -\frac{\lambda}{2} \|\theta\|^2 - \sum_{i=1}^n \log\left(1 + \exp(-y_i \theta^\top x_i)\right) + \text{constant}. \tag{10}
\]
Equivalently, one minimizes the $\ell_2$-regularized negative log likelihood loss
\[
\min_\theta \frac{\lambda}{2} \|\theta\|^2 + \sum_{i=1}^n \log\left(1 + \exp(-y_i \theta^\top x_i)\right). \tag{11}
\]
This is a convex function, so there is a unique global minimum of $\theta$. Unfortunately there is no closed-form solution. One typically solves the optimization problem with Newton-Raphson iterations, also known as iteratively reweighted least squares for logistic regression.
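Below is a small Newton-Raphson sketch for minimizing the regularized objective (11). It assumes the rows of X are feature vectors with a leading constant 1 and labels in {-1, +1}; the function name and the toy data are my own, and this is meant only to illustrate the update, not to be a production IRLS implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_map(X, y, lam=1.0, iters=20):
    """Newton-Raphson for min_theta  lam/2 ||theta||^2 + sum_i log(1 + exp(-y_i theta^T x_i))."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        margins = y * (X @ theta)                       # y_i theta^T x_i
        # Gradient: lam * theta - sum_i sigma(-margin_i) y_i x_i
        grad = lam * theta - X.T @ (sigmoid(-margins) * y)
        # Hessian: lam * I + sum_i s_i (1 - s_i) x_i x_i^T, with s_i = sigma(theta^T x_i)
        s = sigmoid(X @ theta)
        H = lam * np.eye(d) + X.T @ (X * (s * (1 - s))[:, None])
        theta = theta - np.linalg.solve(H, grad)        # Newton step
    return theta

# Tiny made-up example.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
y = np.where(X[:, 1] + 0.5 * X[:, 2] + 0.3 * rng.normal(size=200) > 0, 1, -1)
print(fit_logistic_map(X, y, lam=0.1))
```

Because the objective (11) is smooth and convex, this Newton iteration typically converges in a handful of steps.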
3 Stochastic Gradient Descent

In anticipation of more complex, non-convex learners, we present a simple training algorithm that works for both linear regression (1) and logistic regression (11). Observe that both models can be written as
\[
\min_\theta \sum_{i=1}^n \ell(x_i, y_i, \theta) + \frac{\lambda}{2} \|\theta\|^2, \tag{12}
\]
where $\ell(x_i, y_i, \theta)$ is the loss function. The gradient descent algorithm starts at an arbitrary $\theta_0$, and repeats the following update until convergence:
\[
\theta_{t+1} = \theta_t - \eta\, \nabla_\theta \left( \sum_{i=1}^n \ell(x_i, y_i, \theta) + \frac{\lambda}{2} \|\theta\|^2 \right) \tag{13}
\]
\[
= \theta_t - \eta \sum_{i=1}^n \nabla_\theta \ell(x_i, y_i, \theta) - \eta \lambda \theta, \tag{14}
\]
where $\eta$ is a small positive number known as the step size. Each iteration of gradient descent needs to go over all $n$ training points to compute the gradient, which may be too expensive. Note that we can write the update equivalently as
\[
\theta_{t+1} = \theta_t - n\eta\, \mathbb{E}\left[\nabla_\theta \ell(x, y, \theta)\right] - \eta \lambda \theta, \tag{15}
\]
where the expectation is with respect to the empirical distribution on the training set. Stochastic Gradient Descent (SGD) instead performs the following: at each iteration, draw $(\tilde x, \tilde y)$ from the empirical distribution, i.e., select a training point uniformly at random, and perform the update
\[
\theta_{t+1} = \theta_t - n\eta\, \nabla_\theta \ell(\tilde x, \tilde y, \theta) - \eta \lambda \theta. \tag{16}
\]
The update, in expectation, agrees with (15). Note the above may differ from other texts by the coefficient $n$. This is an artifact of us not dividing the sum of losses by $n$, and amounts to a rescaling of $\lambda$.

For ridge regression,
\[
\ell(x, y, \theta) = (\theta^\top x - y)^2.
\]
Therefore,
\[
\nabla_\theta \ell(x, y, \theta) = 2 (\theta^\top x - y)\, x.
\]
For logistic regression,
\[
\ell(x, y, \theta) = \log\left(1 + \exp(-y\, \theta^\top x)\right).
\]
Therefore,
\[
\nabla_\theta \ell(x, y, \theta) = \frac{\exp(-y\, \theta^\top x)}{1 + \exp(-y\, \theta^\top x)} (-y x).
\]
SGD can be further stabilized by the following (see the sketch at the end of this section):

1. Instead of taking the last iteration's parameter vector $\theta_T$, take the average of all parameters over time:
\[
\bar\theta = \frac{1}{T} \sum_{t=1}^T \theta_t.
\]
You will see that, for reasonably large $T$, the average parameter achieves an MSE objective that is very close to the minimum. It will still fluctuate a little, as one should expect, but overall it is a lot more stable.

2. Alternatively (or in conjunction), you may diminish the step size $\eta$ as you go. One common scheme is to let the step size in iteration $t$ be $\eta_t = \eta_0 / t$, where $\eta_0$ is the original step size. As $t = 1, 2, \ldots$ increases, $\eta_t$ decreases. This will also stabilize SGD.

Both methods are motivated by theoretical analysis of SGD.
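To make the section concrete, here is a NumPy sketch of the SGD update (16) with both stabilization tricks: parameter averaging and the decaying step size $\eta_t = \eta_0/t$. The loss gradients match the expressions above, while the toy data, the step-size constant, and the function names are my own illustrative choices.

```python
import numpy as np

def grad_ridge(x, y, theta):
    # gradient of (theta^T x - y)^2
    return 2.0 * (theta @ x - y) * x

def grad_logistic(x, y, theta):
    # gradient of log(1 + exp(-y theta^T x)), with y in {-1, +1}
    m = np.exp(-y * (theta @ x))
    return m / (1.0 + m) * (-y * x)

def sgd(X, y, grad_loss, lam=0.1, eta0=1e-3, T=50000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    theta_sum = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)                   # draw a training point uniformly at random
        eta = eta0 / t                        # decaying step size eta_t = eta_0 / t
        theta = theta - n * eta * grad_loss(X[i], y[i], theta) - eta * lam * theta
        theta_sum += theta
    return theta, theta_sum / T               # last iterate and the time-averaged parameter

# Toy ridge-regression example (made-up data).
rng = np.random.default_rng(1)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.normal(size=200)
theta_last, theta_avg = sgd(X, y, grad_ridge)
print(theta_last)
print(theta_avg)
```

The same loop works for logistic regression by passing grad_logistic with labels in {-1, +1}.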