Introduction to Machine Learning: Regression. Computer Science, Tel-Aviv University, 2013-14
Classification. Input X: real valued, vectors over the reals, discrete values (0, 1, 2, ...), or other structures (e.g., strings, graphs, etc.). Output Y: discrete (0, 1, 2, ...).
Regression. Input X: real valued, vectors over the reals, or discrete values (0, 1, 2, ...). Output Y: real valued, or vectors over the reals.
Examples of regression: weight + height → cholesterol level; age + gender → time spent in front of the TV; past choices of a user → "Netflix score"; profile of a job (user, machine, time) → memory usage of the submitted process.
Linear Regression. Input: a set of points $(x_i, y_i)$. Assume there is a linear relation between y and x: $y \approx a x + b$. Find a, b by solving $\min_{a,b} \sum_i (y_i - a x_i - b)^2$.
Regression: Minimize the Residuals. For the line $y = a x + b$, the residual of a point $(x_i, y_i)$ is $r_i = y_i - a x_i - b$; least squares minimizes $\sum_i r_i^2$.
Likelihood Formulation. Model assumptions: $y_i = a \cdot x_i + b + \epsilon_i$, where the noise terms $\epsilon_i \sim N(0, \sigma^2)$ are independent. Input: the observed pairs $(x_i, y_i)$. The likelihood of the data is $\prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - a \cdot x_i - b)^2}{2\sigma^2}\right)$, which is maximized when $\sum_i (y_i - a \cdot x_i - b)^2$ is minimized. Maximum likelihood under this model coincides with least squares.
Note: we can add another variable $x_{d+1} = 1$ and set $a_{d+1} = b$. Therefore, without loss of generality, we can drop the intercept and write the model as $y_i = a \cdot x_i + \epsilon_i$.
Matrix Notation. Let X be the $n \times d$ matrix whose i-th row is $x_i$, and let y be the vector $(y_1, \dots, y_n)$. The least-squares problem becomes $\min_a \|y - Xa\|^2$.
The Normal Equations. Setting the gradient of $\|y - Xa\|^2$ to zero gives $X^T X a = X^T y$. If $X^T X$ is invertible, the solution is $a = (X^T X)^{-1} X^T y$.
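A small NumPy sketch of the normal equations, including the column of ones from the note above; the synthetic data and names are illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
true_a = np.array([1.5, -2.0, 0.5])
y = X @ true_a + 3.0 + rng.normal(scale=0.1, size=n)   # intercept b = 3.0

# Absorb the intercept: append a constant column x_{d+1} = 1.
X1 = np.hstack([X, np.ones((n, 1))])

# Normal equations: (X^T X) a = X^T y. Prefer solve() over an explicit inverse.
a_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)
print(a_hat)   # last entry estimates the intercept
```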
Functions over n Dimensions. For a function $f(x_1, \dots, x_n)$, the gradient is the vector of partial derivatives: $\nabla f = \left(\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}\right)$. In one dimension this is simply the derivative.
Gradient Descent. Goal: minimize a function f. Algorithm: 1. Start from a point $x^0 = (x_1^0, \dots, x_n^0)$. 2. Compute $u = \nabla f(x_1^i, \dots, x_n^i)$. 3. Update $x^{i+1} = x^i - \eta u$ for a step size $\eta > 0$. 4. Return to (2), unless converged.
Gradient Descent for Least Squares. The gradient of $\|y - Xa\|^2$ is $2 X^T (Xa - y)$, so the gradient descent iteration is $a \leftarrow a - \eta X^T (Xa - y)$. Advantage: simple, efficient.
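A minimal sketch of this batch gradient-descent iteration for least squares; the step size, iteration count, and synthetic data are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
a_true = np.array([1.0, -1.0, 2.0, 0.0])
y = X @ a_true + rng.normal(scale=0.1, size=500)

a = np.zeros(4)
eta = 1e-3                           # step size (assumed value)
for _ in range(2000):
    grad = 2 * X.T @ (X @ a - y)     # gradient of ||y - Xa||^2
    a -= eta * grad
print(a)
```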
Online Least Squares. Process the examples one at a time; after seeing $(x_i, y_i)$, apply the online update step $a \leftarrow a + \eta\,(y_i - a \cdot x_i)\,x_i$ (a stochastic-gradient step on the squared error of a single example). Advantage: efficient, similar to the perceptron update.
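A sketch of the online update on the same kind of synthetic data; the learning rate and number of passes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 4))
a_true = np.array([0.5, 1.0, -1.5, 2.0])
y = X @ a_true + rng.normal(scale=0.1, size=1000)

a = np.zeros(4)
eta = 0.01                              # learning rate (assumed value)
for _ in range(5):                      # a few passes over the data
    for x_i, y_i in zip(X, y):
        a += eta * (y_i - a @ x_i) * x_i    # online least-squares update
print(a)
```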
Singularity Issues. Solving the normal equations directly is not very efficient, since we need to invert a matrix. The solution is unique if $X^T X$ is invertible. If $X^T X$ is singular, we have an infinite number of solutions to the equations. Which solution minimizing $\|y - Xa\|^2$ should we pick?
The Singular Case. [Figure: illustration of the singular case.]
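The slides' figures are not reproduced here; as an illustration of the singular case (a sketch with assumed data, not from the slides), the example below builds a design matrix with duplicated columns, so $X^T X$ is singular, and uses `numpy.linalg.lstsq`, which returns the minimum-norm solution among the infinitely many minimizers.

```python
import numpy as np

rng = np.random.default_rng(4)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1])            # two identical (collinear) columns
y = 3.0 * x1 + rng.normal(scale=0.1, size=100)

print(np.linalg.matrix_rank(X.T @ X))    # rank 1: X^T X is singular
a, *_ = np.linalg.lstsq(X, y, rcond=None)
print(a)    # ~[1.5, 1.5]: the minimum-norm minimizer of ||y - Xa||^2
```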
Risk of Overfitting. Say we have a very large number of variables (d is large). When the number of variables is large we usually have collinear variables, and therefore $X^T X$ is singular. Even if $X^T X$ is non-singular, there is a risk of over-fitting: for instance, if d = n we can typically fit the training data exactly (zero residuals) while predicting poorly on new data. Intuitively, we want a small number of variables to explain y.
Regularization. Let λ be a regularization parameter. Ideally, we would like only a few non-zero coefficients, i.e., to solve $\min_a \|y - Xa\|^2 + \lambda \|a\|_0$, where $\|a\|_0$ is the number of non-zero entries of a. This is a hard problem (NP-hard).
Shrinkage Methods. Lasso regression: $\min_a \|y - Xa\|^2 + \lambda \sum_{j=1}^{d} |a_j|$. Ridge regression: $\min_a \|y - Xa\|^2 + \lambda \sum_{j=1}^{d} a_j^2$.
Ridge Regression. The objective $\|y - Xa\|^2 + \lambda \|a\|^2$ is differentiable; setting its gradient to zero gives $(X^T X + \lambda I)\,a = X^T y$, so $a = (X^T X + \lambda I)^{-1} X^T y$. For λ > 0 the matrix $X^T X + \lambda I$ is positive definite and therefore nonsingular, so the solution always exists and is unique.
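A minimal NumPy sketch of the ridge closed-form solution; λ and the data are assumed example values.

```python
import numpy as np

rng = np.random.default_rng(5)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=100)])  # nearly collinear
y = 3.0 * x1 + rng.normal(scale=0.1, size=100)

lam = 1.0                                    # regularization parameter (assumed)
d = X.shape[1]
a_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(a_ridge)      # well-defined even though X^T X is nearly singular
```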
Ridge Regression, Bayesian View. Model: $y = \sum_j a_j X_j + \epsilon$ with $\epsilon \sim N(0, \sigma^2)$. Prior on a: $a_i \sim N(0, \tau^2)$, independently for each coordinate.
Ridge Regression, Bayesian View. With this prior on a,
$\log \mathrm{Posterior}(a \mid \sigma, \tau, \text{data}) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - a \cdot x_i)^2 - \frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\tau^2}\sum_{i=1}^{d} a_i^2 - \frac{d}{2}\log(2\pi\tau^2).$
Maximizing the posterior is equivalent to Ridge with $\lambda = \sigma^2 / \tau^2$.
Lasso Regression. Objective: $\min_a \|y - Xa\|^2 + \lambda \sum_{j=1}^{d} |a_j|$. The $\ell_1$ penalty is not differentiable at zero, so there is no closed-form solution. The problem is equivalent to the following quadratic program: write $a_j = a_j^+ - a_j^-$ with $a_j^+, a_j^- \ge 0$, and minimize $\|y - X(a^+ - a^-)\|^2 + \lambda \sum_j (a_j^+ + a_j^-)$ subject to $a^+ \ge 0$, $a^- \ge 0$.
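As an illustration of the positive/negative split (a sketch with assumed data, using SciPy's general-purpose bounded optimizer rather than a dedicated QP solver), the code below solves the lasso through the formulation above and recovers a sparse coefficient vector.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
n, d = 200, 10
X = rng.normal(size=(n, d))
a_true = np.zeros(d)
a_true[:3] = [2.0, -1.0, 0.5]            # only 3 relevant variables
y = X @ a_true + rng.normal(scale=0.1, size=n)
lam = 20.0                               # regularization parameter (assumed)

def objective(z):
    # z stacks the positive and negative parts: z = [a_plus, a_minus].
    a_pos, a_neg = z[:d], z[d:]
    a = a_pos - a_neg
    return np.sum((y - X @ a) ** 2) + lam * np.sum(a_pos + a_neg)

res = minimize(objective, x0=np.zeros(2 * d), method="L-BFGS-B",
               bounds=[(0, None)] * (2 * d))
a_lasso = res.x[:d] - res.x[d:]
print(np.round(a_lasso, 3))    # most entries are (numerically) zero
```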
Lasso Regression, Bayesian View. Prior on a: a Laplace (double-exponential) prior, $P(a_i) = \frac{\lambda}{4\sigma^2}\exp\!\left(-\frac{\lambda}{2\sigma^2}|a_i|\right)$. Then
$\log \mathrm{Posterior}(a \mid \text{Data}, \sigma, \lambda) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - a \cdot x_i)^2 - \frac{n}{2}\log(2\pi\sigma^2) + \sum_{i=1}^{d}\left[\log\!\left(\frac{\lambda}{4\sigma^2}\right) - \frac{\lambda}{2\sigma^2}|a_i|\right].$
Maximizing the posterior is equivalent to Lasso with parameter λ.
Lasso vs. Ridge. [Figure: Laplace vs. Normal prior densities (mean 0, variance 1).]
An Equivalent Formulation. Lasso: $\min_a \|y - Xa\|^2$ subject to $\sum_j |a_j| \le t$. Ridge: $\min_a \|y - Xa\|^2$ subject to $\sum_j a_j^2 \le t$. Claim: for every λ in the penalized form there is a bound t that produces the same solution, and vice versa.
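To make the contrast between the two penalties concrete (an illustrative sketch with assumed data, using scikit-learn, which is not mentioned in the slides), the comparison below fits both on the same sparse problem: ridge shrinks all coefficients toward zero, while the lasso sets many of them exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 10))
a_true = np.zeros(10)
a_true[:3] = [2.0, -1.0, 0.5]
y = X @ a_true + rng.normal(scale=0.1, size=200)

# alpha plays the role of lambda (up to scaling conventions).
ridge = Ridge(alpha=100.0, fit_intercept=False).fit(X, y)
lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
print("ridge:", np.round(ridge.coef_, 3))   # shrunk, but not exactly zero
print("lasso:", np.round(lasso.coef_, 3))   # mostly exact zeros
```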
Breaking Linearity. We can fit non-linear functions of the input, e.g., $y = \sum_j a_j \phi_j(x)$ for fixed basis functions $\phi_j$ (powers of x, for example). This can be solved using the usual linear regression by plugging in, instead of X, the matrix whose (i, j) entry is $\phi_j(x_i)$.
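A sketch of this trick with polynomial features (the degree and data are assumed for the example): build the expanded design matrix and reuse ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(-2, 2, size=200)
y = 1.0 - 2.0 * x + 0.5 * x ** 3 + rng.normal(scale=0.1, size=200)

# Expanded design matrix: columns 1, x, x^2, x^3 (degree 3 assumed).
X_poly = np.vander(x, N=4, increasing=True)

a, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(np.round(a, 2))   # approximately [1.0, -2.0, 0.0, 0.5]
```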
Regression for Classification. Input X: real valued, vectors over the reals, discrete values (0, 1, 2, ...), or other structures (e.g., strings, graphs, etc.). Output Y: discrete (0 or 1). We would like to treat the probability Pr(Y | X) as a linear function of X. Problem: Pr(Y | X) must be bounded in [0, 1].
Logistic Regression. Model: $\Pr(Y = 1 \mid X = x) = \frac{1}{1 + e^{-a \cdot x}}$, the logistic (sigmoid) function, which maps $a \cdot x$ into [0, 1]. [Figure: the logistic function.]
Logistic Regression. Given training data $(x_i, y_i)$ with $y_i \in \{0, 1\}$, we can write down the log-likelihood: $\ell(a) = \sum_i \left[ y_i\,(a \cdot x_i) - \log\!\left(1 + e^{a \cdot x_i}\right) \right]$. The first term is linear in a and the second is concave, so $\ell$ is concave; there is a unique solution, which can be found using gradient descent (ascent).
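A minimal sketch of maximizing this log-likelihood by gradient ascent (step size, iteration count, and synthetic data are assumptions); the gradient is $\sum_i (y_i - \sigma(a \cdot x_i))\,x_i$, where σ is the logistic function.

```python
import numpy as np

rng = np.random.default_rng(9)
n, d = 500, 3
X = rng.normal(size=(n, d))
a_true = np.array([1.0, -2.0, 0.5])
p = 1.0 / (1.0 + np.exp(-(X @ a_true)))
y = (rng.uniform(size=n) < p).astype(float)   # Bernoulli labels

a = np.zeros(d)
eta = 0.01                                    # step size (assumed value)
for _ in range(2000):
    p_hat = 1.0 / (1.0 + np.exp(-(X @ a)))
    grad = X.T @ (y - p_hat)                  # gradient of the log-likelihood
    a += eta * grad                           # ascent step
print(np.round(a, 2))    # roughly recovers a_true
```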