Introduction to Machine Learning: Regression. Computer Science, Tel-Aviv University, 2013-14
Classification. Input X: real valued, vectors over the reals, discrete values (0, 1, 2, ...), or other structures (e.g., strings, graphs, etc.). Output Y: discrete (0, 1, 2, ...).
Regression. Input X: real valued, vectors over the reals, or discrete values (0, 1, 2, ...). Output Y: real valued, or vectors over the reals.
Examples of regression: weight + height → cholesterol level; age + gender → time spent in front of the TV; past choices of a user → "Netflix score"; profile of a job (user, machine, time) → memory usage of the submitted process.
Linear Regression. Input: a set of points $(x_i, y_i)$. Assume there is a linear relation between y and x: $y \approx a x + b$. Find a, b by solving $\min_{a,b} \sum_i (y_i - a x_i - b)^2$.
Regression: Minimize the Residuals. For the line $y = a x + b$, the residual of a point $(x_i, y_i)$ is $r_i = y_i - a x_i - b$; least squares minimizes $\sum_i r_i^2$.
Likelihood Formulation. Model assumptions: $y_i = a \cdot x_i + b + \epsilon_i$, where the noise terms $\epsilon_i \sim N(0, \sigma^2)$ are independent. Input: the observed pairs $(x_i, y_i)$. The likelihood of the data is $\prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - a \cdot x_i - b)^2}{2\sigma^2}\right)$, which is maximized when $\sum_i (y_i - a \cdot x_i - b)^2$ is minimized. Maximum likelihood under this model coincides with least squares.
Note: we can add another variable $x_{d+1} = 1$ and set $a_{d+1} = b$. Therefore, without loss of generality, we can drop the intercept and write the model as $y_i = a \cdot x_i + \epsilon_i$.
Matrix Notation. Let X be the $n \times d$ matrix whose i-th row is $x_i$, and let y be the vector $(y_1, \dots, y_n)$. The least-squares problem becomes $\min_a \|y - Xa\|^2$.
The Normal Equations. Setting the gradient of $\|y - Xa\|^2$ to zero gives $X^T X a = X^T y$. If $X^T X$ is invertible, the solution is $a = (X^T X)^{-1} X^T y$.
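A small NumPy sketch of the normal equations, including the column of ones from the note above; the synthetic data and names are illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
true_a = np.array([1.5, -2.0, 0.5])
y = X @ true_a + 3.0 + rng.normal(scale=0.1, size=n)   # intercept b = 3.0

# Absorb the intercept: append a constant column x_{d+1} = 1.
X1 = np.hstack([X, np.ones((n, 1))])

# Normal equations: (X^T X) a = X^T y. Prefer solve() over an explicit inverse.
a_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)
print(a_hat)   # last entry estimates the intercept
```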
Functions over n Dimensions. For a function $f(x_1, \dots, x_n)$, the gradient is the vector of partial derivatives: $\nabla f = \left(\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}\right)$. In one dimension this is simply the derivative.
Gradient Descent. Goal: minimize a function f. Algorithm: 1. Start from a point $x^0 = (x_1^0, \dots, x_n^0)$. 2. Compute $u = \nabla f(x_1^i, \dots, x_n^i)$. 3. Update $x^{i+1} = x^i - \eta u$ for a step size $\eta > 0$. 4. Return to (2), unless converged.
Gradient Descent for Least Squares. The gradient of $\|y - Xa\|^2$ is $2 X^T (Xa - y)$, so the gradient descent iteration is $a \leftarrow a - \eta X^T (Xa - y)$. Advantage: simple, efficient.
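A minimal sketch of this batch gradient-descent iteration for least squares; the step size, iteration count, and synthetic data are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
a_true = np.array([1.0, -1.0, 2.0, 0.0])
y = X @ a_true + rng.normal(scale=0.1, size=500)

a = np.zeros(4)
eta = 1e-3                           # step size (assumed value)
for _ in range(2000):
    grad = 2 * X.T @ (X @ a - y)     # gradient of ||y - Xa||^2
    a -= eta * grad
print(a)
```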
Online Least Squares. Process the examples one at a time; after seeing $(x_i, y_i)$, apply the online update step $a \leftarrow a + \eta\,(y_i - a \cdot x_i)\,x_i$ (a stochastic-gradient step on the squared error of a single example). Advantage: efficient, similar to the perceptron update.
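A sketch of the online update on the same kind of synthetic data; the learning rate and number of passes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 4))
a_true = np.array([0.5, 1.0, -1.5, 2.0])
y = X @ a_true + rng.normal(scale=0.1, size=1000)

a = np.zeros(4)
eta = 0.01                              # learning rate (assumed value)
for _ in range(5):                      # a few passes over the data
    for x_i, y_i in zip(X, y):
        a += eta * (y_i - a @ x_i) * x_i    # online least-squares update
print(a)
```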
Singularity Issues. Solving the normal equations directly is not very efficient, since we need to invert a matrix. The solution is unique if $X^T X$ is invertible. If $X^T X$ is singular, we have an infinite number of solutions to the equations. Which solution minimizing $\|y - Xa\|^2$ should we pick?
The Singular Case. [Figure: illustration of the singular case.]
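The slides' figures are not reproduced here; as an illustration of the singular case (a sketch with assumed data, not from the slides), the example below builds a design matrix with duplicated columns, so $X^T X$ is singular, and uses `numpy.linalg.lstsq`, which returns the minimum-norm solution among the infinitely many minimizers.

```python
import numpy as np

rng = np.random.default_rng(4)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1])            # two identical (collinear) columns
y = 3.0 * x1 + rng.normal(scale=0.1, size=100)

print(np.linalg.matrix_rank(X.T @ X))    # rank 1: X^T X is singular
a, *_ = np.linalg.lstsq(X, y, rcond=None)
print(a)    # ~[1.5, 1.5]: the minimum-norm minimizer of ||y - Xa||^2
```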
Risk of Overfitting. Say we have a very large number of variables (d is large). When the number of variables is large we usually have collinear variables, and therefore $X^T X$ is singular. Even if $X^T X$ is non-singular, there is a risk of over-fitting: for instance, if d = n we can typically fit the training data exactly (zero residuals) while predicting poorly on new data. Intuitively, we want a small number of variables to explain y.
Regularization. Let λ be a regularization parameter. Ideally, we would like only a few non-zero coefficients, i.e., to solve $\min_a \|y - Xa\|^2 + \lambda \|a\|_0$, where $\|a\|_0$ is the number of non-zero entries of a. This is a hard problem (NP-hard).
Shrinkage Methods. Lasso regression: $\min_a \|y - Xa\|^2 + \lambda \sum_{j=1}^{d} |a_j|$. Ridge regression: $\min_a \|y - Xa\|^2 + \lambda \sum_{j=1}^{d} a_j^2$.
Ridge Regression. The objective $\|y - Xa\|^2 + \lambda \|a\|^2$ is differentiable; setting its gradient to zero gives $(X^T X + \lambda I)\,a = X^T y$, so $a = (X^T X + \lambda I)^{-1} X^T y$. For λ > 0 the matrix $X^T X + \lambda I$ is positive definite and therefore nonsingular, so the solution always exists and is unique.
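A minimal NumPy sketch of the ridge closed-form solution; λ and the data are assumed example values.

```python
import numpy as np

rng = np.random.default_rng(5)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=100)])  # nearly collinear
y = 3.0 * x1 + rng.normal(scale=0.1, size=100)

lam = 1.0                                    # regularization parameter (assumed)
d = X.shape[1]
a_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(a_ridge)      # well-defined even though X^T X is nearly singular
```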
Ridge Regression, Bayesian View. Model: $y = \sum_j a_j X_j + \epsilon$ with $\epsilon \sim N(0, \sigma^2)$. Prior on a: $a_i \sim N(0, \tau^2)$, independently for each coordinate.
Ridge Regression, Bayesian View. With this prior on a,
$\log \mathrm{Posterior}(a \mid \sigma, \tau, \text{data}) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - a \cdot x_i)^2 - \frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\tau^2}\sum_{i=1}^{d} a_i^2 - \frac{d}{2}\log(2\pi\tau^2).$
Maximizing the posterior is equivalent to Ridge with $\lambda = \sigma^2 / \tau^2$.
Lasso Regression. Objective: $\min_a \|y - Xa\|^2 + \lambda \sum_{j=1}^{d} |a_j|$. The $\ell_1$ penalty is not differentiable at zero, so there is no closed-form solution. The problem is equivalent to the following quadratic program: write $a_j = a_j^+ - a_j^-$ with $a_j^+, a_j^- \ge 0$, and minimize $\|y - X(a^+ - a^-)\|^2 + \lambda \sum_j (a_j^+ + a_j^-)$ subject to $a^+ \ge 0$, $a^- \ge 0$.
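As an illustration of the positive/negative split (a sketch with assumed data, using SciPy's general-purpose bounded optimizer rather than a dedicated QP solver), the code below solves the lasso through the formulation above and recovers a sparse coefficient vector.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
n, d = 200, 10
X = rng.normal(size=(n, d))
a_true = np.zeros(d)
a_true[:3] = [2.0, -1.0, 0.5]            # only 3 relevant variables
y = X @ a_true + rng.normal(scale=0.1, size=n)
lam = 20.0                               # regularization parameter (assumed)

def objective(z):
    # z stacks the positive and negative parts: z = [a_plus, a_minus].
    a_pos, a_neg = z[:d], z[d:]
    a = a_pos - a_neg
    return np.sum((y - X @ a) ** 2) + lam * np.sum(a_pos + a_neg)

res = minimize(objective, x0=np.zeros(2 * d), method="L-BFGS-B",
               bounds=[(0, None)] * (2 * d))
a_lasso = res.x[:d] - res.x[d:]
print(np.round(a_lasso, 3))    # most entries are (numerically) zero
```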
Lasso Regression, Bayesian View. Prior on a: a Laplace (double-exponential) prior, $P(a_i) = \frac{\lambda}{4\sigma^2}\exp\!\left(-\frac{\lambda}{2\sigma^2}|a_i|\right)$. Then
$\log \mathrm{Posterior}(a \mid \text{Data}, \sigma, \lambda) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - a \cdot x_i)^2 - \frac{n}{2}\log(2\pi\sigma^2) + \sum_{i=1}^{d}\left[\log\!\left(\frac{\lambda}{4\sigma^2}\right) - \frac{\lambda}{2\sigma^2}|a_i|\right].$
Maximizing the posterior is equivalent to Lasso with parameter λ.
Lasso vs. Ridge. [Figure: Laplace vs. Normal prior densities (mean 0, variance 1).]
An Equivalent Formulation. Lasso: $\min_a \|y - Xa\|^2$ subject to $\sum_j |a_j| \le t$. Ridge: $\min_a \|y - Xa\|^2$ subject to $\sum_j a_j^2 \le t$. Claim: for every λ in the penalized form there is a bound t that produces the same solution, and vice versa.
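To make the contrast between the two penalties concrete (an illustrative sketch with assumed data, using scikit-learn, which is not mentioned in the slides), the comparison below fits both on the same sparse problem: ridge shrinks all coefficients toward zero, while the lasso sets many of them exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 10))
a_true = np.zeros(10)
a_true[:3] = [2.0, -1.0, 0.5]
y = X @ a_true + rng.normal(scale=0.1, size=200)

# alpha plays the role of lambda (up to scaling conventions).
ridge = Ridge(alpha=100.0, fit_intercept=False).fit(X, y)
lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
print("ridge:", np.round(ridge.coef_, 3))   # shrunk, but not exactly zero
print("lasso:", np.round(lasso.coef_, 3))   # mostly exact zeros
```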
Breaking Linearity. We can fit non-linear functions of the input, e.g., $y = \sum_j a_j \phi_j(x)$ for fixed basis functions $\phi_j$ (powers of x, for example). This can be solved using the usual linear regression by plugging in, instead of X, the matrix whose (i, j) entry is $\phi_j(x_i)$.
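A sketch of this trick with polynomial features (the degree and data are assumed for the example): build the expanded design matrix and reuse ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(-2, 2, size=200)
y = 1.0 - 2.0 * x + 0.5 * x ** 3 + rng.normal(scale=0.1, size=200)

# Expanded design matrix: columns 1, x, x^2, x^3 (degree 3 assumed).
X_poly = np.vander(x, N=4, increasing=True)

a, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(np.round(a, 2))   # approximately [1.0, -2.0, 0.0, 0.5]
```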
Regression for Classification. Input X: real valued, vectors over the reals, discrete values (0, 1, 2, ...), or other structures (e.g., strings, graphs, etc.). Output Y: discrete (0 or 1). We would like to treat the probability Pr(Y | X) as a linear function of X. Problem: Pr(Y | X) must be bounded in [0, 1].
Logistic Regression. Model: $\Pr(Y = 1 \mid X = x) = \frac{1}{1 + e^{-a \cdot x}}$, the logistic (sigmoid) function, which maps $a \cdot x$ into [0, 1]. [Figure: the logistic function.]
Logistic Regression. Given training data $(x_i, y_i)$ with $y_i \in \{0, 1\}$, we can write down the log-likelihood: $\ell(a) = \sum_i \left[ y_i\,(a \cdot x_i) - \log\!\left(1 + e^{a \cdot x_i}\right) \right]$. The first term is linear in a and the second is concave, so $\ell$ is concave; there is a unique solution, which can be found using gradient descent (ascent).
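A minimal sketch of maximizing this log-likelihood by gradient ascent (step size, iteration count, and synthetic data are assumptions); the gradient is $\sum_i (y_i - \sigma(a \cdot x_i))\,x_i$, where σ is the logistic function.

```python
import numpy as np

rng = np.random.default_rng(9)
n, d = 500, 3
X = rng.normal(size=(n, d))
a_true = np.array([1.0, -2.0, 0.5])
p = 1.0 / (1.0 + np.exp(-(X @ a_true)))
y = (rng.uniform(size=n) < p).astype(float)   # Bernoulli labels

a = np.zeros(d)
eta = 0.01                                    # step size (assumed value)
for _ in range(2000):
    p_hat = 1.0 / (1.0 + np.exp(-(X @ a)))
    grad = X.T @ (y - p_hat)                  # gradient of the log-likelihood
    a += eta * grad                           # ascent step
print(np.round(a, 2))    # roughly recovers a_true
```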