ECS289: Scalable Machine Learning

1 ECS289: Scalable Machine Learning
Cho-Jui Hsieh
UC Davis
Sept 27, 2015

2 Outline
Linear regression
Ridge regression and Lasso
Time complexity (closed form solution)
Iterative solvers

3 Regression
Input: training data x_1, x_2, ..., x_n ∈ R^d and corresponding outputs y_1, y_2, ..., y_n ∈ R
Training: compute a function f such that f(x_i) ≈ y_i for all i
Prediction: given a testing sample x̃, predict the output as f(x̃)
Examples:
  Income, number of children → Consumer spending
  Processes, memory → Power consumption
  Financial reports → Risk
  Atmospheric conditions → Precipitation

4 Linear Regression
Assume f(·) is a linear function parameterized by w ∈ R^d: f(x) = w^T x
Training: compute the model w ∈ R^d such that w^T x_i ≈ y_i for all i
Prediction: given a testing sample x̃, the prediction value is w^T x̃
How to find w?
  w* = argmin_{w ∈ R^d} Σ_{i=1}^n (w^T x_i − y_i)^2

5 Linear Regression: probability interpretation
Assume the data is generated from the probability model
  y_i = w^T x_i + ɛ_i,   ɛ_i ~ N(0, 1)
Maximum likelihood estimator:
  w* = argmax_w log P(y_1, ..., y_n | x_1, ..., x_n, w)
     = argmax_w Σ_{i=1}^n log P(y_i | x_i, w)
     = argmax_w Σ_{i=1}^n log( (1/√(2π)) e^{−(w^T x_i − y_i)^2 / 2} )
     = argmax_w −(1/2) Σ_{i=1}^n (w^T x_i − y_i)^2 + constant
     = argmin_w Σ_{i=1}^n (w^T x_i − y_i)^2

6 Linear Regression: written in matrix form
Linear regression: w* = argmin_{w ∈ R^d} Σ_{i=1}^n (w^T x_i − y_i)^2
Matrix form: let X ∈ R^{n×d} be the matrix whose i-th row is x_i^T, and let y = [y_1, ..., y_n]^T.
Then linear regression can be written as
  w* = argmin_{w ∈ R^d} ‖Xw − y‖_2^2
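The least-squares problem above can be solved directly with standard numerical libraries. The snippet below is a minimal illustration using NumPy on synthetic data; the data sizes, noise level, and variable names are made up for the example and are not part of the lecture.

```python
import numpy as np

# Synthetic data: n = 100 samples, d = 5 features (illustrative values only)
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Least-squares fit: minimizes ||X w - y||_2^2
w_hat, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)

# Prediction for a new sample x_new is simply w_hat^T x_new
x_new = rng.standard_normal(d)
y_pred = w_hat @ x_new
```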

7 Data Structure

8 Dense Matrix vs Sparse Matrix
Any matrix X ∈ R^{m×n} can be stored as dense or sparse
Dense matrix: most entries in X are nonzero (mn space)
Sparse matrix: only a few entries in X are nonzero (O(nnz) space)

9 Dense Matrix Operations
Stored in column-major format.
Let A ∈ R^{m×n}, B ∈ R^{m×n}, s ∈ R:
  A + B, sA, A^T: mn operations
Let A ∈ R^{m×n}, b ∈ R^{n×1}:
  Ab: mn operations

10 Dense Matrix Operations Matrix-matrix multiplication: what is the time complexity of computing AB?

11 Dense Matrix Operations
Assume A, B ∈ R^{n×n}; what is the time complexity of computing AB?
Naive implementation: O(n^3)
Theoretical best: O(n^2.xxx) (but slower than the naive implementation in practice)
Best way in practice: use BLAS (Basic Linear Algebra Subprograms)

12 Dense Matrix Operations
BLAS matrix product: O(mnk) for computing AB where A ∈ R^{m×k}, B ∈ R^{k×n}
Computes the matrix product block by block to minimize the cache miss rate
Can be called from C, Fortran; can be used in MATLAB, R, Python, ...
Three levels of BLAS operations:
  1. Level 1: vector operations, e.g., y = αx + y
  2. Level 2: matrix-vector operations, e.g., y = αAx + βy
  3. Level 3: matrix-matrix operations, e.g., C = αAB + βC
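To make the three levels concrete, here is a small sketch using the low-level BLAS wrappers that SciPy exposes in scipy.linalg.blas; the matrix sizes and scalars are arbitrary, and ordinary NumPy expressions such as A @ B dispatch to the same BLAS routines internally.

```python
import numpy as np
from scipy.linalg import blas  # low-level BLAS wrappers shipped with SciPy

rng = np.random.default_rng(0)
m, k, n = 200, 300, 150
alpha, beta = 2.0, 0.5
A = np.asfortranarray(rng.standard_normal((m, k)))  # BLAS favors column-major storage
B = np.asfortranarray(rng.standard_normal((k, n)))
C = np.asfortranarray(rng.standard_normal((m, n)))
x = rng.standard_normal(k)
y = rng.standard_normal(m)

# Level 1 (vector-vector): y <- alpha*x + y, O(m) work
y1 = blas.daxpy(rng.standard_normal(m), y.copy(), a=alpha)

# Level 2 (matrix-vector): y <- alpha*A*x + beta*y, O(mk) work
y2 = blas.dgemv(alpha, A, x, beta=beta, y=y.copy())

# Level 3 (matrix-matrix): C <- alpha*A*B + beta*C, O(mkn) work
C3 = blas.dgemm(alpha, A, B, beta=beta, c=C.copy())
```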

13 BLAS

14 Sparse Matrix Operations
Widely used formats: Compressed Sparse Column (CSC), Compressed Sparse Row (CSR), ...
CSC: three arrays for storing an m×n matrix with nnz nonzeros
  1. val (nnz real numbers): the values of the nonzero elements
  2. row_ind (nnz integers): the row indices corresponding to the values
  3. col_ptr (n+1 integers): the list of val indices where each column starts

15 Sparse Matrix Operations
Widely used formats: Compressed Sparse Column (CSC), Compressed Sparse Row (CSR), ...
CSR: three arrays for storing an m×n matrix with nnz nonzeros
  1. val (nnz real numbers): the values of the nonzero elements
  2. col_ind (nnz integers): the column indices corresponding to the values
  3. row_ptr (m+1 integers): the list of val indices where each row starts
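The three arrays can be inspected directly with SciPy's sparse matrix classes, where they are named data, indices, and indptr; the small matrix below is just an illustration.

```python
import numpy as np
from scipy import sparse

# A small 3x4 example matrix with 5 nonzeros
A_dense = np.array([[1.0, 0.0, 2.0, 0.0],
                    [0.0, 0.0, 3.0, 0.0],
                    [4.0, 5.0, 0.0, 0.0]])

A_csr = sparse.csr_matrix(A_dense)
print(A_csr.data)     # val:     [1. 2. 3. 4. 5.] (stored row by row)
print(A_csr.indices)  # col_ind: [0 2 2 0 1]
print(A_csr.indptr)   # row_ptr: [0 2 3 5]  (m+1 entries)

A_csc = sparse.csc_matrix(A_dense)
print(A_csc.data)     # val, stored column by column
print(A_csc.indices)  # row_ind
print(A_csc.indptr)   # col_ptr  (n+1 entries)
```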

16 Sparse Matrix Operations
If A ∈ R^{m×n} (sparse), B ∈ R^{m×n} (sparse or dense), s ∈ R:
  A + B, sA, A^T: O(nnz) operations
If A ∈ R^{m×n} (sparse), b ∈ R^{n×1}:
  Ab: O(nnz) operations
If A ∈ R^{m×k} (sparse), B ∈ R^{k×n} (dense):
  AB: O(nnz · n) operations (use sparse BLAS)
If A ∈ R^{m×k} (sparse), B ∈ R^{k×n} (sparse):
  AB: O(nnz(A) · nnz(B) / k) operations on average, O(nnz(A) · n) in the worst case
  The resulting matrix will be much denser

17 Closed Form Solution

18 Solving Linear Regression
Minimize the sum of squared errors J(w):
  J(w) = (1/2) ‖Xw − y‖_2^2
       = (1/2) (Xw − y)^T (Xw − y)
       = (1/2) w^T X^T X w − y^T X w + (1/2) y^T y
Derivative:
  ∇_w J(w) = X^T X w − X^T y
Setting the derivative equal to zero gives the normal equation
  X^T X w = X^T y
Therefore, w* = (X^T X)^{-1} X^T y

19 Solving Linear Regression
Same derivation as the previous slide: the normal equation X^T X w = X^T y gives
  w* = (X^T X)^{-1} X^T y
but X^T X may be non-invertible...

20 Solving Linear Regression
Normal equation: X^T X w = X^T y
If X^T X is invertible (typically when # samples > # features):
  w* = (X^T X)^{-1} X^T y
If X^T X is low-rank (typically when # features > # samples): infinitely many solutions (why?)
Least-norm solution: w* = X^† y, where X^† is the pseudo-inverse
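A quick way to see the least-norm solution numerically is through NumPy's pseudo-inverse; the sizes below are arbitrary and chosen so that d > n, which makes X^T X singular.

```python
import numpy as np

rng = np.random.default_rng(1)

# Underdetermined case: more features than samples (d > n), so X^T X is singular
n, d = 20, 50
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Least-norm solution via the pseudo-inverse: w = pinv(X) @ y
w_pinv = np.linalg.pinv(X) @ y

# np.linalg.lstsq (SVD-based) returns the same minimum-norm least-squares solution
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_pinv, w_lstsq))    # True up to numerical tolerance
print(np.linalg.norm(X @ w_pinv - y))  # ~0: the training data is fitted exactly
```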

21 Regularized Linear Regression

22 Overfitting Overfitting: the model has low training error but high prediction error. Using too many features can lead to overfitting

23 Regularization to Avoid Overfitting
Enforce the solution to have a small L2-norm:
  argmin_w Σ_{i=1}^n (w^T x_i − y_i)^2   s.t. ‖w‖^2 ≤ K
Equivalent to the following problem for some λ:
  argmin_w Σ_{i=1}^n (w^T x_i − y_i)^2 + λ ‖w‖^2

24 Avoid Overfitting by Controlling Model Complexity (*)
In the following, we derive a bound on the generalization (prediction) error.
Training and testing data are generated from the same distribution: (x_i, y_i) ~ D
Generalization error (expected prediction error):
  R(f) := E_{(x,y)~D} [(f(x) − y)^2]
Best model (minimizing prediction error): f* := argmin_{f ∈ F} R(f)
Empirical error (error on the training data):
  R̂(f) := (1/n) Σ_{i=1}^n (f(x_i) − y_i)^2
Our estimator: f̂ := argmin_{f ∈ F} R̂(f)
F: a class of functions (e.g., all linear functions {f : f(x) = w^T x})

25 Avoid Overfitting by Controlling Model Complexity (*)
Generalization error of f̂:
  R(f̂) = R(f*) + (R̂(f̂) − R̂(f*)) + (R(f̂) − R̂(f̂)) + (R̂(f*) − R(f*))
       ≤ R(f*) + 2 max_{f ∈ F} |R̂(f) − R(f)|
where f* = argmin_{f ∈ F} R(f)
(the second term is ≤ 0 because f̂ minimizes R̂, and each of the last two terms is at most max_{f ∈ F} |R̂(f) − R(f)|).
Overfitting is due to a large max_{f ∈ F} |R̂(f) − R(f)|
How to make this term smaller?
  1. Have more samples (larger n)
  2. Make F a smaller set (regularization)

26 Avoid Overfitting by Controlling Model Complexity (*)
Generalization error of the estimator:
  R(f̂) ≤ R(f*) + 2 max_{f ∈ F} |R̂(f) − R(f)|
Control F to be a subset of linear functions:
  1. Ridge regression: F = {f(x) = w^T x : ‖w‖_2 ≤ K}, where K is a constant
  2. Lasso: F = {f(x) = w^T x : ‖w‖_1 ≤ K}, where K is a constant
If K is large → overfitting (max_{f ∈ F} |R̂(f) − R(f)| is large)
If K is small → underfitting (R(f*) is large because F does not cover the best solution)

27 Regularized Linear Regression
Regularized linear regression (R(w) is the regularization term):
  argmin_w ‖Xw − y‖^2 + R(w)
Ridge regression (ℓ2 regularization):
  argmin_w ‖Xw − y‖^2 + λ ‖w‖^2
Lasso (ℓ1 regularization):
  argmin_w ‖Xw − y‖^2 + λ ‖w‖_1
Note that ‖w‖_1 = Σ_{i=1}^d |w_i|

28 Regularization
Lasso: the solution is sparse, but there is no closed-form solution
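As an illustration of that sparsity, the sketch below fits both models with scikit-learn (assuming it is installed); the data, the alpha values, and the exact coefficient counts are made up for the example, but Lasso typically zeroes out many coefficients while ridge does not.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n, d = 100, 30
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:5] = rng.standard_normal(5)          # only 5 features are truly relevant
y = X @ w_true + 0.1 * rng.standard_normal(n)

ridge = Ridge(alpha=1.0).fit(X, y)           # alpha plays the role of lambda
lasso = Lasso(alpha=0.1).fit(X, y)

print(np.sum(np.abs(ridge.coef_) > 1e-6))    # usually all 30 coefficients are nonzero
print(np.sum(np.abs(lasso.coef_) > 1e-6))    # usually only a handful are nonzero
```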

29 Ridge Regression
Ridge regression:
  argmin_{w ∈ R^d} J(w),   J(w) = (1/2) ‖Xw − y‖^2 + (λ/2) ‖w‖^2
Closed-form solution: the optimal solution w* satisfies ∇J(w*) = 0:
  X^T X w − X^T y + λ w = 0
  (X^T X + λ I) w = X^T y
Optimal solution: w* = (X^T X + λ I)^{-1} X^T y
The inverse always exists because X^T X + λ I is positive definite
What is the computational complexity?
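A direct translation of this closed form into NumPy might look like the following sketch; ridge_closed_form and the test data are hypothetical names chosen for the example.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Closed-form ridge regression: solve (X^T X + lam*I) w = X^T y."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)   # O(n d^2) to form X^T X
    b = X.T @ y                     # O(n d)
    return np.linalg.solve(A, b)    # O(d^3) to solve the d x d system

# Example usage on random data
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = rng.standard_normal(200)
w = ridge_closed_form(X, y, lam=1.0)
```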

30 Time Complexity (Closed Form Solution)

31 Time Complexity (closed form solution of ridge regression)
Goal: compute (X^T X + λ I)^{-1} (X^T y)
Step 1: compute A = X^T X + λ I and b = X^T y
  O(nd^2 + nd) if X is dense
Step 2: several options, each with O(d^3) time complexity
  1. Compute A^{-1}, and then compute w = A^{-1} b
  2. Directly solve the linear system Aw = b (calling the underlying LAPACK routine, e.g., Cholesky factorization)
Which one is better?
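The two options can be compared in a few lines; this sketch uses NumPy's general solver and SciPy's Cholesky routines (cho_factor/cho_solve), which exploit that A is symmetric positive definite. The problem sizes are arbitrary.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
n, d, lam = 5000, 300, 1.0
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
A = X.T @ X + lam * np.eye(d)
b = X.T @ y

# Option 1: form the explicit inverse (more flops, typically less accurate)
w1 = np.linalg.inv(A) @ b

# Option 2: solve the linear system directly (LAPACK general solver)
w2 = np.linalg.solve(A, b)

# Option 2': Cholesky factorization, since A is symmetric positive definite
w3 = cho_solve(cho_factor(A), b)

print(np.allclose(w1, w2), np.allclose(w2, w3))
```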

32 Time Complexity (closed form solution)
Time complexity for step 2: O(d^3)

33 More on Linear System Solvers
Can be done by a linear system solver: compute w such that Aw = y
Different libraries can be used:
  Default LAPACK
  Intel Math Kernel Library (Intel MKL)
  AMD Core Math Library (ACML)
Parallel numerical linear algebra packages are also available today:
  ScaLAPACK (SCAlable Linear Algebra PACKage): linear algebra routines for distributed-memory machines
  PLAPACK (Parallel Linear Algebra PACKage): dense linear algebra routines
  PLASMA (Parallel Linear Algebra Software for Multicore Architectures): combines state-of-the-art parallel algorithms and scheduling for optimized solution of linear systems, eigenvalue problems, ...
However, solving Aw = y with A ∈ R^{d×d} still takes O(d^3) time

34 Time Complexity
When X is dense:
  The closed-form solution requires O(nd^2 + d^3) operations
  Efficient if d is very small
  Runs forever when d > 100,000
Typical case for big data applications: X ∈ R^{n×d} is sparse, with large n and large d
How can we solve the problem?

35 Iterative Solver (Coordinate Descent)

36 Optimization: iterative solvers
Iterative solver: generate a sequence of (improving) approximate solutions to a problem:
  w_0 (initial point) → w_1 → w_2 → ... → w*

37 Convergence of the Iterative Algorithm
Convergence properties:
  Will the algorithm converge to a point?
  Will it converge to optimal solution(s)?
  What is the convergence rate (how fast does the sequence converge)?
We will only state existing convergence results in this class; we will focus on the computational complexity per iteration.

38 Coordinate Descent
Solve the optimization problem argmin_w J(w)
Update one variable at a time
Obtains a model with reasonable performance after only a few iterations
Randomized coordinate descent: pick a random coordinate to update at each iteration

39 Coordinate Descent
Input: X ∈ R^{n×d}, y ∈ R^n, initial w^(0)
Output: solution w*
Set t = 0
While not converged:
  Randomly choose a variable w_j
  Compute the optimal update to w_j by solving δ* = argmin_δ [J(w + δ e_j) − J(w)]
  Update w_j: w_j ← w_j + δ*
  t ← t + 1

40 Coordinate Descent
(Same coordinate descent algorithm as the previous slide.)
Q: What is the exact CD rule for ridge regression?

41 Coordinate Descent
(Same coordinate descent algorithm as the previous slide.)
Q: What is the exact CD rule for ridge regression?
A:
  δ* = −( Σ_{i=1}^n X_ij (x_i^T w − y_i) + λ w_j ) / ( λ + Σ_{i=1}^n X_ij^2 )
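For a dense X, a direct but naive implementation of this rule (recomputing the residual from scratch at every update) could look like the sketch below; ridge_cd_naive and the sanity check against the closed-form solution are illustrative names and data, not part of the lecture.

```python
import numpy as np

def ridge_cd_naive(X, y, lam, n_epochs=50, seed=0):
    """Randomized coordinate descent for J(w) = 0.5*||Xw - y||^2 + 0.5*lam*||w||^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs * d):            # one "epoch" = d coordinate updates
        j = rng.integers(d)                  # pick a random coordinate
        r = X @ w - y                        # residual r_i = x_i^T w - y_i (recomputed each time)
        delta = -(X[:, j] @ r + lam * w[j]) / (lam + X[:, j] @ X[:, j])
        w[j] += delta
    return w

# Sanity check against the closed-form ridge solution
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 10))
y = rng.standard_normal(100)
lam = 1.0
w_cd = ridge_cd_naive(X, y, lam)
w_exact = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)
print(np.max(np.abs(w_cd - w_exact)))        # should be small after enough epochs
```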

42 Coordinate Descent (convergence)
If J(w) is strongly convex, randomized coordinate descent converges to the global optimum with a global linear convergence rate.
For ridge regression, J(w) is strongly convex because ∇²J(w) = X^T X + λ I is positive definite.
We will show more details in the next class.

43 Time Complexity
For updating w_j, we need to compute
  δ* = −( Σ_{i=1}^n X_ij (x_i^T w − y_i) + λ w_j ) / ( λ + Σ_{i=1}^n X_ij^2 )
What is the computational complexity?

44 Time Complexity
For updating w_j, we need to compute
  δ* = −( Σ_{i=1}^n X_ij (x_i^T w − y_i) + λ w_j ) / ( λ + Σ_{i=1}^n X_ij^2 )
Assume X ∈ R^{n×d} is a sparse matrix and each column of X has n_j nonzero entries (Σ_j n_j = nnz(X))
A naive approach: for each coordinate update (w_j),
  1. Compute r_i := x_i^T w − y_i for all i: O(nnz(X)) operations
  2. Compute h_j := Σ_{i=1}^n X_ij^2: O(n_j) operations
  3. Compute δ* = −( Σ_{i=1}^n r_i X_ij + λ w_j ) / ( λ + h_j ): O(n_j) operations
  4. w_j ← w_j + δ*

45 Time Complexity
For updating w_j, we need to compute
  δ* = −( Σ_{i=1}^n X_ij (x_i^T w − y_i) + λ w_j ) / ( λ + Σ_{i=1}^n X_ij^2 )
Assume X ∈ R^{n×d} is a sparse matrix and each column of X has n_j nonzero entries (Σ_j n_j = nnz(X))
Precompute h_j := Σ_{i=1}^n X_ij^2 for all j = 1, ..., d: O(nnz(X)) operations
For each coordinate update (w_j):
  1. Compute r_i := x_i^T w − y_i for all i: O(nnz(X)) operations
  2. Compute δ* = −( Σ_{i=1}^n r_i X_ij + λ w_j ) / ( λ + h_j ): O(n_j) operations
  3. w_j ← w_j + δ*

46 Time Complexity
For updating w_j, we need to compute
  δ* = −( Σ_{i=1}^n X_ij (x_i^T w − y_i) + λ w_j ) / ( λ + Σ_{i=1}^n X_ij^2 )
Assume X ∈ R^{n×d} is a sparse matrix and each column of X has n_j nonzero entries (Σ_j n_j = nnz(X))
Precompute h_j := Σ_{i=1}^n X_ij^2 for all j = 1, ..., d: O(nnz(X)) operations
Precompute r_i := x_i^T w − y_i for all i: O(nnz(X)) operations
For each coordinate update (w_j):
  1. Compute δ* = −( Σ_{i=1}^n r_i X_ij + λ w_j ) / ( λ + h_j ): O(n_j) operations
  2. For all i with X_ij ≠ 0, update r_i ← r_i + δ* X_ij: O(n_j) operations
  3. w_j ← w_j + δ*
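Here is a hedged sketch of this residual-maintaining scheme for a sparse X stored in SciPy's CSC format (so column j can be scanned in O(n_j)); ridge_cd_sparse and the random test matrix are made-up names for the example.

```python
import numpy as np
from scipy import sparse

def ridge_cd_sparse(X_csc, y, lam, n_epochs=50, seed=0):
    """Randomized CD for ridge regression with a maintained residual vector.
    X_csc: scipy.sparse CSC matrix, so column j is accessed in O(n_j)."""
    rng = np.random.default_rng(seed)
    n, d = X_csc.shape
    w = np.zeros(d)
    r = -y.astype(float)                         # residual r = X w - y, with w = 0 initially
    # Precompute h_j = sum_i X_ij^2 for all j: O(nnz(X))
    h = np.asarray(X_csc.multiply(X_csc).sum(axis=0)).ravel()
    for _ in range(n_epochs * d):
        j = rng.integers(d)
        start, end = X_csc.indptr[j], X_csc.indptr[j + 1]
        rows = X_csc.indices[start:end]          # indices i with X_ij != 0
        vals = X_csc.data[start:end]             # the corresponding values X_ij
        # delta = -(sum_i r_i X_ij + lam w_j) / (lam + h_j): O(n_j)
        delta = -(vals @ r[rows] + lam * w[j]) / (lam + h[j])
        w[j] += delta
        r[rows] += delta * vals                  # maintain r = X w - y: O(n_j)
    return w

# Example usage on a random sparse design matrix
rng = np.random.default_rng(1)
X = sparse.random(500, 50, density=0.05, format="csc", random_state=1)
y = rng.standard_normal(500)
w = ridge_cd_sparse(X, y, lam=1.0)
```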

47 Time Complexity
On average, O(n̄) operations per coordinate update, where n̄ = nnz(X)/d
T outer iterations, each consisting of d coordinate updates: O(T · nnz(X)) operations in total
Closed-form solution: O(nnz(X) · d + d^3) operations
The approach can easily be extended to solve the Lasso problem

48 Other Iterative Solvers
Other iterative solvers for ridge regression:
  Gradient descent
  Stochastic gradient descent
  ...

49 Coming Up
Next class: introduction to optimization
Questions?
