STA141C: Big Data & High Performance Statistical Computing
Lecture 4: ML Models (Overview)
Cho-Jui Hsieh, UC Davis
April 17, 2017
Outline
Linear regression
Ridge regression
Logistic regression
Other finite-sum models
Linear Regression
Regression
Input: training data $x_1, x_2, \dots, x_n \in \mathbb{R}^d$ and corresponding outputs $y_1, y_2, \dots, y_n \in \mathbb{R}$
Training: compute a function $f$ such that $f(x_i) \approx y_i$ for all $i$
Prediction: given a testing sample $\tilde{x}$, predict the output as $f(\tilde{x})$
Examples:
  Income, number of children → Consumer spending
  Processes, memory → Power consumption
  Financial reports → Risk
  Atmospheric conditions → Precipitation
Linear Regression
Assume $f(\cdot)$ is a linear function parameterized by $w \in \mathbb{R}^d$: $f(x) = w^T x$
Training: compute the model $w$ such that $w^T x_i \approx y_i$ for all $i$
Equivalent to solving
$$w^* = \operatorname*{argmin}_{w \in \mathbb{R}^d} \sum_{i=1}^{n} (w^T x_i - y_i)^2$$
Prediction: given a testing sample $\tilde{x}$, the prediction value is $w^T \tilde{x}$
Linear Regression: probability interpretation
Assume the data is generated from the probability model:
$$y_i = w^T x_i + \epsilon_i, \quad \epsilon_i \sim N(0, 1)$$
Maximum likelihood estimator:
$$\begin{aligned}
w^* &= \operatorname*{argmax}_{w} \ \log P(y_1, \dots, y_n \mid x_1, \dots, x_n, w) \\
&= \operatorname*{argmax}_{w} \ \sum_{i=1}^{n} \log P(y_i \mid x_i, w) \\
&= \operatorname*{argmax}_{w} \ \sum_{i=1}^{n} \log\Big( \tfrac{1}{\sqrt{2\pi}} \, e^{-(w^T x_i - y_i)^2/2} \Big) \\
&= \operatorname*{argmax}_{w} \ \sum_{i=1}^{n} -\tfrac{1}{2}(w^T x_i - y_i)^2 + \text{constant} \\
&= \operatorname*{argmin}_{w} \ \sum_{i=1}^{n} (w^T x_i - y_i)^2
\end{aligned}$$
Linear Regression: matrix form
Linear regression: $w^* = \operatorname*{argmin}_{w \in \mathbb{R}^d} \sum_{i=1}^{n} (w^T x_i - y_i)^2$
Matrix form: let $X \in \mathbb{R}^{n \times d}$ be the matrix whose $i$-th row is $x_i^T$, and let $y = [y_1, \dots, y_n]^T$; then linear regression can be written as
$$w^* = \operatorname*{argmin}_{w \in \mathbb{R}^d} \|Xw - y\|_2^2$$
Solving Linear Regression
Minimize the sum of squared errors $J(w)$:
$$J(w) = \tfrac{1}{2}\|Xw - y\|_2^2 = \tfrac{1}{2}(Xw - y)^T(Xw - y) = \tfrac{1}{2} w^T X^T X w - y^T X w + \tfrac{1}{2} y^T y$$
Derivative: $\nabla_w J(w) = X^T X w - X^T y$
Setting the derivative to zero gives the normal equation: $X^T X w = X^T y$
Therefore $w^* = (X^T X)^{-1} X^T y$, but $X^T X$ may be non-invertible. (Will be discussed in a few classes.)
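As an aside (not from the slides), here is a minimal NumPy sketch of the least-squares solution: it solves the normal equation directly and, as a more numerically stable alternative, calls np.linalg.lstsq. The data sizes and noise level are illustrative assumptions.

```python
import numpy as np

# Synthetic data: n samples, d features (illustrative sizes, not from the lecture)
rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Normal equation: solve X^T X w = X^T y (assumes X^T X is invertible)
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# More numerically stable route: least squares via SVD
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))  # both agree when X has full column rank
```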
Regularized Linear Regression
Overfitting
Overfitting: the model has low training error but high prediction error.
Using too many features can lead to overfitting.
Regularization to Avoid Overfitting
Enforce the solution to have a small $\ell_2$-norm:
$$\operatorname*{argmin}_{w} \ \sum_{i=1}^{n} (w^T x_i - y_i)^2 \quad \text{s.t.} \quad \|w\|^2 \le K$$
Equivalent to the following problem for some $\lambda$:
$$\operatorname*{argmin}_{w} \ \sum_{i=1}^{n} (w^T x_i - y_i)^2 + \lambda \|w\|^2$$
Regularized Linear Regression
Regularized linear regression:
$$\operatorname*{argmin}_{w} \ \|Xw - y\|^2 + R(w)$$
$R(w)$: regularization term
Ridge regression ($\ell_2$ regularization):
$$\operatorname*{argmin}_{w} \ \|Xw - y\|^2 + \lambda \|w\|^2$$
Ridge Regression
Ridge regression:
$$\operatorname*{argmin}_{w \in \mathbb{R}^d} \ \underbrace{\tfrac{1}{2}\|Xw - y\|^2 + \tfrac{\lambda}{2}\|w\|^2}_{J(w)}$$
Closed-form solution: the optimal solution $w^*$ satisfies $\nabla J(w^*) = 0$:
$$X^T X w - X^T y + \lambda w = 0 \quad \Rightarrow \quad (X^T X + \lambda I)\, w = X^T y$$
Optimal solution: $w^* = (X^T X + \lambda I)^{-1} X^T y$
The inverse always exists because $X^T X + \lambda I$ is positive definite.
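A small illustration (again not part of the slides): the ridge closed form in NumPy. The value of lam, standing for $\lambda$, is an arbitrary choice for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 20
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

lam = 1.0  # regularization parameter lambda (illustrative value)

# Ridge closed form: w* = (X^T X + lambda I)^{-1} X^T y.
# X^T X + lambda I is positive definite, so the linear solve never fails.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```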
Time Complexity
When $X$ is dense:
  The closed-form solution requires $O(nd^2 + d^3)$ time
  Efficient if $d$ is very small
  Runs forever when $d > 100{,}000$
Typical case for big data applications: $X \in \mathbb{R}^{n \times d}$ is sparse, with large $n$ and large $d$
How can we solve the problem? Iterative algorithms for optimization (next class); see the sketch below.
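As a rough preview of the iterative approach covered next class, the sketch below solves the ridge normal equations with conjugate gradient from scipy.sparse, never forming the $d \times d$ matrix $X^T X$ explicitly; each iteration only needs sparse matrix-vector products with $X$. The sizes, density, and choice of solver are illustrative assumptions, not the lecture's method.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n, d = 100_000, 50_000            # large n and d; a dense d x d matrix X^T X would not fit in memory
X = sp.random(n, d, density=1e-4, format="csr", random_state=0)
y = rng.standard_normal(n)
lam = 1.0

# (X^T X + lambda I) v computed implicitly via two sparse products with X
A = LinearOperator((d, d), matvec=lambda v: X.T @ (X @ v) + lam * v)

w, info = cg(A, X.T @ y, maxiter=100)  # conjugate gradient on the ridge normal equations
```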
Logistic Regression
Binary Classification
Input: training data $x_1, x_2, \dots, x_n \in \mathbb{R}^d$ and corresponding outputs $y_1, y_2, \dots, y_n \in \{+1, -1\}$
Training: compute a function $f$ such that $\operatorname{sign}(f(x_i)) \approx y_i$ for all $i$
Prediction: given a testing sample $\tilde{x}$, predict the output as $\operatorname{sign}(f(\tilde{x}))$
Logistic Regression
Assume a linear scoring function: $s = f(x) = w^T x$
Logistic hypothesis:
$$P(y = +1 \mid x) = \theta(w^T x), \quad \text{where } \theta(s) = \frac{e^s}{1 + e^s} = \frac{1}{1 + e^{-s}}$$
Error measure: likelihood
Likelihood of $D = (x_1, y_1), \dots, (x_N, y_N)$: $\prod_{n=1}^{N} P(y_n \mid x_n)$
$$P(y \mid x) = \begin{cases} \theta(w^T x) & \text{for } y = +1 \\ 1 - \theta(w^T x) = \theta(-w^T x) & \text{for } y = -1 \end{cases}$$
so $P(y \mid x) = \theta(y\, w^T x)$.
Likelihood: $\prod_{n=1}^{N} P(y_n \mid x_n) = \prod_{n=1}^{N} \theta(y_n w^T x_n)$
Maximizing the likelihood
Find $w$ to maximize the likelihood!
$$\begin{aligned}
\max_{w} \ \prod_{n=1}^{N} \theta(y_n w^T x_n)
&\;\Leftrightarrow\; \max_{w} \ \log\Big(\prod_{n=1}^{N} \theta(y_n w^T x_n)\Big) \\
&\;\Leftrightarrow\; \min_{w} \ -\log\Big(\prod_{n=1}^{N} \theta(y_n w^T x_n)\Big) \\
&\;\Leftrightarrow\; \min_{w} \ -\sum_{n=1}^{N} \log\big(\theta(y_n w^T x_n)\big) \\
&\;\Leftrightarrow\; \min_{w} \ \sum_{n=1}^{N} \log\Big(\frac{1}{\theta(y_n w^T x_n)}\Big) \\
&\;\Leftrightarrow\; \min_{w} \ \sum_{n=1}^{N} \log\big(1 + e^{-y_n w^T x_n}\big)
\end{aligned}$$
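A short NumPy sketch (not in the slides) of this objective: the function below evaluates the negative log-likelihood $\sum_n \log(1 + e^{-y_n w^T x_n})$ and its gradient, using np.logaddexp and scipy.special.expit for numerical stability.

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid theta(s) = 1 / (1 + e^{-s})

def logistic_loss(w, X, y):
    """Negative log-likelihood sum_n log(1 + exp(-y_n * w^T x_n)) and its gradient.

    X: (N, d) data matrix, y: (N,) labels in {+1, -1}, w: (d,) parameter vector.
    """
    margins = y * (X @ w)                       # y_n * w^T x_n
    loss = np.logaddexp(0.0, -margins).sum()    # stable log(1 + e^{-margin})
    grad = -X.T @ (y * (1.0 - expit(margins)))  # gradient of the loss above w.r.t. w
    return loss, grad
```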
Empirical Risk Minimization (linear)
Most linear ML algorithms follow
$$\min_{w} \ \frac{1}{N} \sum_{n=1}^{N} \text{loss}(w^T x_n, y_n)$$
Linear regression: $\text{loss}(w^T x_n, y_n) = (w^T x_n - y_n)^2$
Logistic regression: $\text{loss}(w^T x_n, y_n) = \log(1 + e^{-y_n w^T x_n})$
Empirical Risk Minimization (general)
Assume $f_W(x)$ is the decision function to be learned ($W$ denotes the parameters of the function)
General empirical risk minimization:
$$\min_{W} \ \frac{1}{N} \sum_{n=1}^{N} \text{loss}(f_W(x_n), y_n)$$
Example: neural network ($f_W(\cdot)$ is the network)
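To make the finite-sum structure concrete, here is a tiny sketch (my own illustration, not the lecture's code) that evaluates the empirical risk for any pluggable decision function and per-sample loss, together with the two linear losses from the previous slide.

```python
import numpy as np

def empirical_risk(predict, loss, params, X, y):
    """Finite-sum objective (1/N) * sum_n loss(f_W(x_n), y_n)."""
    return np.mean([loss(predict(params, x), yn) for x, yn in zip(X, y)])

# The linear decision function and the two losses from the slides
linear   = lambda w, x: w @ x
squared  = lambda s, yn: (s - yn) ** 2               # linear regression loss
logistic = lambda s, yn: np.logaddexp(0.0, -yn * s)  # logistic regression loss
```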
Coming up: Optimization
Questions?