ECS289: Scalable Machine Learning

1 ECS289: Scalable Machine Learning
Cho-Jui Hsieh, UC Davis
Sept 27, 2015

2 Outline
  Linear regression
  Ridge regression and Lasso
  Time complexity (closed form solution)
  Iterative solvers

3 Regression
Input: training data $x_1, x_2, \dots, x_n \in \mathbb{R}^d$ and corresponding outputs $y_1, y_2, \dots, y_n \in \mathbb{R}$
Training: compute a function $f$ such that $f(x_i) \approx y_i$ for all $i$
Prediction: given a testing sample $\tilde{x}$, predict the output as $f(\tilde{x})$
Examples:
  Income, number of children → Consumer spending
  Processes, memory → Power consumption
  Financial reports → Risk
  Atmospheric conditions → Precipitation

4 Linear Regression
Assume $f(\cdot)$ is a linear function parameterized by $w \in \mathbb{R}^d$: $f(x) = w^T x$
Training: compute the model $w \in \mathbb{R}^d$ such that $w^T x_i \approx y_i$ for all $i$
Prediction: given a testing sample $\tilde{x}$, the prediction value is $w^T \tilde{x}$
How to find $w$?
$$w^* = \arg\min_{w \in \mathbb{R}^d} \sum_{i=1}^n (w^T x_i - y_i)^2$$

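5 Linear Regression: probability interpretation
Assume the data is generated from the probability model:
$$y_i = w^T x_i + \epsilon_i, \quad \epsilon_i \sim N(0, 1)$$
Maximum likelihood estimator:
$$\begin{aligned}
w^* &= \arg\max_w \log P(y_1, \dots, y_n \mid x_1, \dots, x_n, w) \\
&= \arg\max_w \sum_{i=1}^n \log P(y_i \mid x_i, w) \\
&= \arg\max_w \sum_{i=1}^n \log\left( \frac{1}{\sqrt{2\pi}} e^{-(w^T x_i - y_i)^2 / 2} \right) \\
&= \arg\max_w \; -\frac{1}{2} \sum_{i=1}^n (w^T x_i - y_i)^2 + \text{constant} \\
&= \arg\min_w \sum_{i=1}^n (w^T x_i - y_i)^2
\end{aligned}$$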

6 Linear Regression: written in matrix form
Linear regression:
$$w^* = \arg\min_{w \in \mathbb{R}^d} \sum_{i=1}^n (w^T x_i - y_i)^2$$
Matrix form: let $X \in \mathbb{R}^{n \times d}$ be the matrix whose $i$-th row is $x_i$, and let $y = [y_1, \dots, y_n]^T$. Then linear regression can be written as
$$w^* = \arg\min_{w \in \mathbb{R}^d} \|Xw - y\|_2^2$$
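
As a quick sanity check of the matrix form, the sketch below compares the per-sample sum with the norm of the residual vector on small random data. It assumes NumPy is available; the function names and the random data are illustrative only and do not come from the slides.

```python
import numpy as np

def squared_error_sum(X, y, w):
    """Sum over samples of (w^T x_i - y_i)^2, written as an explicit loop."""
    return sum((float(np.dot(w, X[i])) - float(y[i])) ** 2 for i in range(X.shape[0]))

def squared_error_matrix(X, y, w):
    """The same objective in matrix form: ||Xw - y||_2^2."""
    r = X @ w - y          # residual vector, one entry per sample
    return float(r @ r)

# Small illustrative problem: n = 5 samples, d = 3 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
y = rng.standard_normal(5)
w = rng.standard_normal(3)

loop_value = squared_error_sum(X, y, w)
matrix_value = squared_error_matrix(X, y, w)
```

Both values agree up to floating-point rounding, which is why the rest of the analysis can work with $X$ and $y$ directly.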

7 Data Structure

8 Dense Matrix vs Sparse Matrix
Any matrix $X \in \mathbb{R}^{m \times n}$ can be stored as dense or sparse
Dense matrix: most entries in $X$ are nonzero ($mn$ space)
Sparse matrix: only a few entries in $X$ are nonzero ($O(\mathrm{nnz})$ space)

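9 Dense Matrix Operations
Column format.
Let $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{m \times n}$, $s \in \mathbb{R}$:
  $A + B$, $sA$, $A^T$: $mn$ operations
Let $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^{n \times 1}$:
  $Ab$: $mn$ operations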

11 Dense Matrix Operations
Assume $A, B \in \mathbb{R}^{n \times n}$. What is the time complexity of computing $AB$?
Naive implementation: $O(n^3)$
Theoretical best: sub-cubic, roughly $O(n^{2.37})$ for the asymptotically fastest known algorithms (but slower than the naive implementation in practice)
Best way in practice: use BLAS (Basic Linear Algebra Subprograms)

12 Dense Matrix Operations
BLAS matrix product: $O(mnk)$ for computing $AB$ where $A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$
Compute the matrix product block by block to minimize the cache miss rate
Can be called from C and Fortran; can be used in MATLAB, R, Python, ...
Three levels of BLAS operations:
1. Level 1: vector operations, e.g., $y = \alpha x + y$
2. Level 2: matrix-vector operations, e.g., $y = \alpha A x + \beta y$
3. Level 3: matrix-matrix operations, e.g., $C = \alpha A B + \beta C$
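
To make the $O(mnk)$ count concrete, here is a minimal sketch of the textbook triple loop next to NumPy's `@` operator, which for floating-point arrays typically dispatches to an optimized BLAS routine (this depends on how NumPy was built). The helper `naive_matmul` and the test matrices are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def naive_matmul(A, B):
    """Textbook triple loop: m * n * k multiply-adds for (m x k) times (k x n)."""
    m, k = A.shape
    k2, n = B.shape
    if k != k2:
        raise ValueError("inner dimensions must match")
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))
C = rng.standard_normal((4, 5))
alpha, beta = 2.0, 0.5

product_naive = naive_matmul(A, B)      # explicit loops
product_fast = A @ B                    # library matrix product
level3 = alpha * (A @ B) + beta * C     # Level 3 style update: alpha*A*B + beta*C
```

Both versions perform the same arithmetic; the library call is usually far faster on large matrices because of blocking and cache reuse.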


14 Sparse Matrix Operations
Widely-used formats: Compressed Sparse Column (CSC), Compressed Sparse Row (CSR), ...
CSC: three arrays for storing an $m \times n$ matrix with nnz nonzeros
1. val (nnz real numbers): the values of the nonzero elements
2. row_ind (nnz integers): the row indices corresponding to the values
3. col_ptr ($n + 1$ integers): the list of value indexes where each column starts
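
The three CSC arrays can be illustrated with a small pure-Python sketch that builds them from a dense list of rows. The function name `dense_to_csc` and the 3 x 3 example matrix are hypothetical, chosen only to show the layout.

```python
def dense_to_csc(dense):
    """Build (val, row_ind, col_ptr) for a dense matrix given as a list of rows."""
    m = len(dense)
    n = len(dense[0]) if m else 0
    val, row_ind, col_ptr = [], [], [0]
    for j in range(n):                 # walk the matrix column by column
        for i in range(m):
            if dense[i][j] != 0:
                val.append(dense[i][j])
                row_ind.append(i)
        col_ptr.append(len(val))       # index where the next column starts
    return val, row_ind, col_ptr

dense = [[1, 0, 2],
         [0, 0, 3],
         [4, 5, 0]]
val, row_ind, col_ptr = dense_to_csc(dense)
# val     -> [1, 4, 5, 2, 3]
# row_ind -> [0, 2, 2, 0, 1]
# col_ptr -> [0, 2, 3, 5]
```

The nonzeros of column $j$ are `val[col_ptr[j]:col_ptr[j + 1]]`, so accessing one column costs time proportional to its number of nonzeros.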

15 Sparse Matrix Operations
Widely-used formats: Compressed Sparse Column (CSC), Compressed Sparse Row (CSR), ...
CSR: three arrays for storing an $m \times n$ matrix with nnz nonzeros
1. val (nnz real numbers): the values of the nonzero elements
2. col_ind (nnz integers): the column indices corresponding to the values
3. row_ptr ($m + 1$ integers): the list of value indexes where each row starts
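
A matrix-vector product is a natural way to see how the CSR arrays are traversed. The sketch below is pure Python; `csr_matvec` is a hypothetical helper, and the arrays encode the illustrative matrix [[1, 0, 2], [0, 0, 3], [4, 5, 0]].

```python
def csr_matvec(val, col_ind, row_ptr, b):
    """Compute A @ b for a CSR matrix, touching each stored nonzero exactly once."""
    m = len(row_ptr) - 1
    out = [0.0] * m
    for i in range(m):
        s = 0.0
        # Nonzeros of row i live in positions row_ptr[i] .. row_ptr[i + 1] - 1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += val[k] * b[col_ind[k]]
        out[i] = s
    return out

# CSR arrays for the 3 x 3 matrix [[1, 0, 2], [0, 0, 3], [4, 5, 0]].
val = [1.0, 2.0, 3.0, 4.0, 5.0]
col_ind = [0, 2, 2, 0, 1]
row_ptr = [0, 2, 3, 5]

result = csr_matvec(val, col_ind, row_ptr, [1.0, 1.0, 1.0])
# result -> [3.0, 3.0, 9.0]
```

The inner loop runs once per stored nonzero, so the product costs $O(\mathrm{nnz})$ operations rather than $mn$.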

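16 Sparse Matrix Operations
If $A \in \mathbb{R}^{m \times n}$ (sparse), $B \in \mathbb{R}^{m \times n}$ (sparse or dense), $s \in \mathbb{R}$:
  $A + B$, $sA$, $A^T$: $O(\mathrm{nnz})$ operations
If $A \in \mathbb{R}^{m \times n}$ (sparse), $b \in \mathbb{R}^{n \times 1}$:
  $Ab$: $O(\mathrm{nnz})$ operations
If $A \in \mathbb{R}^{m \times k}$ (sparse), $B \in \mathbb{R}^{k \times n}$ (dense):
  $AB$: $O(\mathrm{nnz}(A)\, n)$ operations (use sparse BLAS)
If $A \in \mathbb{R}^{m \times k}$ (sparse), $B \in \mathbb{R}^{k \times n}$ (sparse):
  $AB$: $O(\mathrm{nnz}(A)\, \mathrm{nnz}(B) / k)$ on average
  $AB$: $O(\mathrm{nnz}(A)\, n)$ in the worst case
  The resulting matrix will be much denser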

17 Closed Form Solution

19 Solving Linear Regression
Minimize the sum of squared errors $J(w)$:
$$J(w) = \frac{1}{2} \|Xw - y\|_2^2 = \frac{1}{2} (Xw - y)^T (Xw - y) = \frac{1}{2} w^T X^T X w - y^T X w + \frac{1}{2} y^T y$$
Derivative: $\nabla_w J(w) = X^T X w - X^T y$
Setting the derivative equal to zero gives the normal equation:
$$X^T X w^* = X^T y$$
Therefore, $w^* = (X^T X)^{-1} X^T y$, but $X^T X$ may be non-invertible...
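
The gradient formula can be checked numerically. The following sketch, assuming NumPy and $J(w) = \frac{1}{2}\|Xw - y\|_2^2$, compares $X^T X w - X^T y$ with a central finite-difference approximation; all names and the random data are illustrative.

```python
import numpy as np

def J(w, X, y):
    """Least-squares objective: 0.5 * ||Xw - y||^2."""
    r = X @ w - y
    return 0.5 * float(r @ r)

def grad_J(w, X, y):
    """Analytic gradient: X^T X w - X^T y."""
    return X.T @ (X @ w) - X.T @ y

def numerical_grad(f, w, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(w)
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
y = rng.standard_normal(6)
w = rng.standard_normal(3)

g_exact = grad_J(w, X, y)
g_num = numerical_grad(lambda v: J(v, X, y), w)
```

The two gradients should match to several digits; a mismatch would point to a sign or transpose error in the derivation.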

20 Solving Linear Regression
Normal equation: $X^T X w^* = X^T y$
If $X^T X$ is invertible (typically when # samples > # features): $w^* = (X^T X)^{-1} X^T y$
If $X^T X$ is low-rank (typically when # features > # samples): infinite number of solutions (why?)
  Least norm solution: $w^* = X^{\dagger} y$, where $X^{\dagger}$ is the pseudo-inverse
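
The "why?" can be illustrated numerically: when there are more features than samples, adding any null-space vector of $X$ to a solution gives another solution. The sketch below assumes NumPy and uses `np.linalg.pinv` for the pseudo-inverse; the data and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))   # n = 3 samples, d = 5 features, so X^T X is low-rank
y = rng.standard_normal(3)

X_pinv = np.linalg.pinv(X)
w_min = X_pinv @ y                # least-norm solution

# Project a random vector onto the null space of X: X @ null_vec is (numerically) zero.
v = rng.standard_normal(5)
null_vec = v - X_pinv @ (X @ v)

w_other = w_min + null_vec        # another exact solution, with a larger norm
```

Both `w_min` and `w_other` fit the training data exactly, which is the source of the infinitely many solutions; the pseudo-inverse picks the one with the smallest norm.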

21 Regularized Linear Regression

22 Overfitting Overfitting: the model has low training error but high prediction error. Using too many features can lead to overfitting

23 Regularization to Avoid Overfitting
Enforce the solution to have low L2-norm:
$$\arg\min_w \sum_{i=1}^n (w^T x_i - y_i)^2 \quad \text{s.t. } \|w\|^2 \le K$$
Equivalent to the following problem with some $\lambda$:
$$\arg\min_w \sum_{i=1}^n (w^T x_i - y_i)^2 + \lambda \|w\|^2$$

24 Avoid Overfitting by Controlling Model Complexity (*)
In the following, we derive a bound for the generalization (prediction) error
Training & testing data are generated from the same distribution: $(x_i, y_i) \sim D$
Generalization error (expectation of prediction error): $R(f) := E_{(x,y) \sim D}[(f(x) - y)^2]$
Best model (minimizing prediction error): $f^* := \arg\min_{f \in F} R(f)$
Empirical error (error on the training data): $\hat{R}(f) := \frac{1}{n} \sum_{i=1}^n (f(x_i) - y_i)^2$
Our estimator: $\hat{f} := \arg\min_{f \in F} \hat{R}(f)$
$F$: a class of functions (e.g., all linear functions $\{f : f(x) = w^T x\}$)

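25 Avoid Overfitting by Controlling Model Complexity (*)
Generalization error of $\hat{f}$:
$$\begin{aligned}
R(\hat{f}) &= R(f^*) + (\hat{R}(\hat{f}) - \hat{R}(f^*)) + (R(\hat{f}) - \hat{R}(\hat{f})) + (\hat{R}(f^*) - R(f^*)) \\
&\le R(f^*) + 2 \max_{f \in F} |\hat{R}(f) - R(f)|
\end{aligned}$$
where $f^* = \arg\min_{f \in F} R(f)$. The first bracket is at most zero because $\hat{f}$ minimizes $\hat{R}$ over $F$, and each of the other two brackets is at most $\max_{f \in F} |\hat{R}(f) - R(f)|$.
Overfitting is due to a large $\max_{f \in F} |\hat{R}(f) - R(f)|$
How to make this term smaller?
1. Have more samples (larger $n$)
2. Make $F$ a smaller set (regularization)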

26 Avoid Overfitting by Controlling Model Complexity (*)
$$\underbrace{R(\hat{f})}_{\text{generalization error of the estimator}} \le R(f^*) + 2 \max_{f \in F} |\hat{R}(f) - R(f)|$$
Control $F$ to be a subset of linear functions:
1. Ridge regression: $F = \{f(x) = w^T x \mid \|w\|_2 \le K\}$ where $K$ is a constant
2. Lasso: $F = \{f(x) = w^T x \mid \|w\|_1 \le K\}$ where $K$ is a constant
If $K$ is large ⇒ overfitting ($\max_{f \in F} |\hat{R}(f) - R(f)|$ is large)
If $K$ is small ⇒ underfitting ($R(f^*)$ is large because $F$ does not cover the best solution)

27 Regularized Linear Regression
Regularized linear regression:
$$\arg\min_w \|Xw - y\|^2 + R(w)$$
$R(w)$: regularization
Ridge regression ($\ell_2$ regularization):
$$\arg\min_w \|Xw - y\|^2 + \lambda \|w\|^2$$
Lasso ($\ell_1$ regularization):
$$\arg\min_w \|Xw - y\|^2 + \lambda \|w\|_1$$
Note that $\|w\|_1 = \sum_{i=1}^d |w_i|$

28 Regularization Lasso: the solution is sparse, but no closed form solution

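29 Ridge Regression
Ridge regression:
$$\arg\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{2} \|Xw - y\|^2 + \frac{\lambda}{2} \|w\|^2}_{J(w)}$$
Closed form solution: the optimal solution $w^*$ satisfies $\nabla J(w^*) = 0$:
$$X^T X w^* - X^T y + \lambda w^* = 0 \;\Longrightarrow\; (X^T X + \lambda I) w^* = X^T y$$
Optimal solution: $w^* = (X^T X + \lambda I)^{-1} X^T y$
The inverse always exists because $X^T X + \lambda I$ is positive definite (for $\lambda > 0$)
What's the computational complexity?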
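
A minimal sketch of the closed form, assuming NumPy and the objective $J(w) = \frac{1}{2}\|Xw - y\|^2 + \frac{\lambda}{2}\|w\|^2$. The function name `ridge_closed_form` and the data are illustrative; the example deliberately uses more features than samples to show that the system is still solvable when $\lambda > 0$.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for the ridge solution."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    b = X.T @ y
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))   # fewer samples than features: X^T X alone is singular
y = rng.standard_normal(4)
lam = 0.1

w_ridge = ridge_closed_form(X, y, lam)

# Gradient of J at the solution; it should vanish up to rounding error.
grad_at_solution = X.T @ (X @ w_ridge - y) + lam * w_ridge
```

Checking that the gradient is numerically zero confirms that the returned vector satisfies the optimality condition above.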

30 Time Complexity (Closed Form Solution)

31 Time Complexity (closed form solution of ridge regression)
Goal: compute $(X^T X + \lambda I)^{-1} (X^T y)$
Step 1: compute $A = X^T X + \lambda I$ and $b = X^T y$: $O(nd^2 + nd)$ if $X$ is dense
Step 2: several options with $O(d^3)$ time complexity
1. Compute $A^{-1}$, and then compute $w^* = A^{-1} b$
2. Directly solve the linear system $Aw = b$ (calling the underlying LAPACK routine, e.g., Cholesky factorization)
Which one is better?
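
The two options can be compared on a small problem. The sketch below, assuming NumPy, solves $Aw = b$ through a Cholesky factorization followed by two triangular solves, and checks the result against the explicit inverse. The substitution helpers are written by hand for illustration; production code would normally call a LAPACK-backed solver instead. Solving the system directly is generally preferred because it needs fewer operations and tends to be more accurate than forming $A^{-1}$.

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L z = b for lower-triangular L."""
    n = L.shape[0]
    z = np.zeros(n)
    for i in range(n):
        z[i] = (b[i] - L[i, :i] @ z[:i]) / L[i, i]
    return z

def back_substitution(U, z):
    """Solve U w = z for upper-triangular U."""
    n = U.shape[0]
    w = np.zeros(n)
    for i in range(n - 1, -1, -1):
        w[i] = (z[i] - U[i, i + 1:] @ w[i + 1:]) / U[i, i]
    return w

def solve_spd_cholesky(A, b):
    """Solve A w = b for symmetric positive definite A via A = L L^T."""
    L = np.linalg.cholesky(A)           # lower-triangular factor
    z = forward_substitution(L, b)      # L z = b
    return back_substitution(L.T, z)    # L^T w = z

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
y = rng.standard_normal(20)
lam = 1.0
A = X.T @ X + lam * np.eye(5)
b = X.T @ y

w_chol = solve_spd_cholesky(A, b)       # option 2: factorize and solve
w_inv = np.linalg.inv(A) @ b            # option 1: explicit inverse
```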

32 Time Complexity (closed form solution)
Time complexity for step 2: $O(d^3)$

33 More on Linear System Solvers
Can be done by a linear system solver: computing $w$ such that $Aw = b$
Different libraries can be used:
  Default LAPACK
  Intel Math Kernel Library (Intel MKL)
  AMD Core Math Library (ACML)
Parallel numerical linear algebra packages are available today:
  ScaLAPACK (Scalable Linear Algebra PACKage): linear algebra routines for distributed memory machines
  PLAPACK (Parallel Linear Algebra PACKage): dense linear algebra routines
  PLASMA (Parallel Linear Algebra Software for Multicore Architectures): combines state-of-the-art solutions in parallel algorithms and scheduling for optimized solutions of linear systems, eigenvalue problems, ...
However, solving $Aw = b$ still takes $O(d^3)$ time when $A \in \mathbb{R}^{d \times d}$

34 Time Complexity
When $X$ is dense, the closed form solution requires $O(nd^2 + d^3)$ operations
  Efficient if $d$ is very small
  Impractically slow when $d > 100{,}000$
Typical case for big data applications: $X \in \mathbb{R}^{n \times d}$ is sparse with large $n$ and large $d$
How can we solve the problem?

35 Iterative Solver (Coordinate Descent)

36 Optimization: iterative solver
Iterative solver: generate a sequence of (improving) approximate solutions for a problem.
$$w^0 \text{ (initial point)} \to w^1 \to w^2 \to \cdots \to w^*$$

37 Convergence of the iterative algorithm
Convergence properties:
  Will the algorithm converge to a point?
  Will it converge to optimal solution(s)?
  What's the convergence rate (how fast will the sequence converge)?
In this class we only state known convergence properties; the focus is on computational complexity per iteration

38 Coordinate Descent
Solve the optimization problem: $\arg\min_w J(w)$
Update one variable at a time
Obtain a model with reasonable performance in a few iterations
Randomized coordinate descent: pick a random coordinate to update at each iteration

41 Coordinate Descent
Input: $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$, initial $w^{(0)}$
Output: solution $w^*$
  $t = 0$
  while not converged do
    Randomly choose a variable $w_j$
    Compute the optimal update on $w_j$ by solving $\delta^* = \arg\min_{\delta} J(w + \delta e_j) - J(w)$
    Update $w_j$: $w_j \leftarrow w_j + \delta^*$
    $t \leftarrow t + 1$
  end while
Q: What is the exact CD rule for ridge regression?
$$\delta^* = -\frac{\sum_{i=1}^n X_{ij} (x_i^T w - y_i) + \lambda w_j}{\lambda + \sum_{i=1}^n X_{ij}^2}$$
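
One way to see where this rule comes from, assuming the ridge objective $J(w) = \frac{1}{2}\|Xw - y\|^2 + \frac{\lambda}{2}\|w\|^2$: restrict $J$ to the single coordinate $j$ and minimize the resulting one-dimensional quadratic in $\delta$.

$$\begin{aligned}
g(\delta) &:= J(w + \delta e_j) = \frac{1}{2} \sum_{i=1}^n (x_i^T w - y_i + \delta X_{ij})^2 + \frac{\lambda}{2} (w_j + \delta)^2 + \frac{\lambda}{2} \sum_{k \ne j} w_k^2 \\
g'(\delta) &= \sum_{i=1}^n X_{ij} (x_i^T w - y_i + \delta X_{ij}) + \lambda (w_j + \delta) = 0 \\
\Longrightarrow \quad \delta^* &= -\frac{\sum_{i=1}^n X_{ij} (x_i^T w - y_i) + \lambda w_j}{\lambda + \sum_{i=1}^n X_{ij}^2}
\end{aligned}$$

Since $g$ is a convex quadratic in $\delta$ (its second derivative is $\lambda + \sum_i X_{ij}^2 > 0$ for $\lambda > 0$), this stationary point is the exact minimizer along coordinate $j$.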

42 Coordinate Descent (convergence)
If $J(w)$ is strongly convex, randomized coordinate descent converges to the global optimum with a global linear convergence rate.
For ridge regression, $J(w)$ is strongly convex because $\nabla^2 J(w) = X^T X + \lambda I$, which is positive definite when $\lambda > 0$
More details in the next class

43 Time Complexity
For updating $w_j$, we need to compute
$$\delta^* = -\frac{\sum_{i=1}^n X_{ij} (x_i^T w - y_i) + \lambda w_j}{\lambda + \sum_{i=1}^n X_{ij}^2}$$
What's the computational complexity?

44 Time Complexity
For updating $w_j$, we need to compute
$$\delta^* = -\frac{\sum_{i=1}^n X_{ij} (x_i^T w - y_i) + \lambda w_j}{\lambda + \sum_{i=1}^n X_{ij}^2}$$
Assume $X \in \mathbb{R}^{n \times d}$ is a sparse matrix; column $j$ of $X$ has $n_j$ nonzero entries ($\sum_j n_j = \mathrm{nnz}(X)$)
A naive approach: for each coordinate update ($w_j$):
1. Compute $r_i := x_i^T w - y_i$ for all $i$: $O(\mathrm{nnz}(X))$ operations
2. Compute $h_j := \sum_{i=1}^n X_{ij}^2$: $O(n_j)$ operations
3. Compute $\delta^* = -\frac{\sum_{i=1}^n r_i X_{ij} + \lambda w_j}{\lambda + h_j}$: $O(n_j)$ operations
4. $w_j \leftarrow w_j + \delta^*$

45 Time Complexity
For updating $w_j$, we need to compute
$$\delta^* = -\frac{\sum_{i=1}^n X_{ij} (x_i^T w - y_i) + \lambda w_j}{\lambda + \sum_{i=1}^n X_{ij}^2}$$
Assume $X \in \mathbb{R}^{n \times d}$ is a sparse matrix; column $j$ of $X$ has $n_j$ nonzero entries ($\sum_j n_j = \mathrm{nnz}(X)$)
Precompute $h_j := \sum_{i=1}^n X_{ij}^2$ for all $j = 1, \dots, d$: $O(\mathrm{nnz}(X))$ operations
For each coordinate update ($w_j$):
1. Compute $r_i := x_i^T w - y_i$ for all $i$: $O(\mathrm{nnz}(X))$ operations
2. Compute $\delta^* = -\frac{\sum_{i=1}^n r_i X_{ij} + \lambda w_j}{\lambda + h_j}$: $O(n_j)$ operations
3. $w_j \leftarrow w_j + \delta^*$

46 Time Complexity
For updating $w_j$, we need to compute
$$\delta^* = -\frac{\sum_{i=1}^n X_{ij} (x_i^T w - y_i) + \lambda w_j}{\lambda + \sum_{i=1}^n X_{ij}^2}$$
Assume $X \in \mathbb{R}^{n \times d}$ is a sparse matrix; column $j$ of $X$ has $n_j$ nonzero entries ($\sum_j n_j = \mathrm{nnz}(X)$)
Precompute $h_j := \sum_{i=1}^n X_{ij}^2$ for all $j = 1, \dots, d$: $O(\mathrm{nnz}(X))$ operations
Precompute $r_i := x_i^T w - y_i$ for all $i$: $O(\mathrm{nnz}(X))$ operations
For each coordinate update ($w_j$):
1. Compute $\delta^* = -\frac{\sum_{i=1}^n r_i X_{ij} + \lambda w_j}{\lambda + h_j}$: $O(n_j)$ operations
2. For all $i = 1, 2, \dots, n$, update $r_i \leftarrow r_i + \delta^* X_{ij}$: $O(n_j)$ operations
3. $w_j \leftarrow w_j + \delta^*$
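
The residual-maintenance scheme can be sketched as follows, assuming NumPy and the ridge objective $J(w) = \frac{1}{2}\|Xw - y\|^2 + \frac{\lambda}{2}\|w\|^2$. For brevity the sketch stores $X$ as a dense array, so each update touches all $n$ entries of a column; with a CSC matrix the same two lines would touch only the $n_j$ nonzeros. The function name, the iteration count, and the data are illustrative assumptions.

```python
import numpy as np

def ridge_coordinate_descent(X, y, lam, n_outer=100, seed=0):
    """Randomized coordinate descent for ridge regression with a maintained residual."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    h = (X ** 2).sum(axis=0)    # h_j = sum_i X_ij^2, precomputed once
    r = X @ w - y               # r_i = x_i^T w - y_i, kept up to date below
    for _ in range(n_outer):
        for _ in range(d):      # one outer iteration = d coordinate updates
            j = rng.integers(0, d)
            delta = -(X[:, j] @ r + lam * w[j]) / (lam + h[j])
            r += delta * X[:, j]    # keep the residual consistent with the new w
            w[j] += delta
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))
y = rng.standard_normal(50)
lam = 1.0

w_cd = ridge_coordinate_descent(X, y, lam)
w_exact = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
```

On this small, well-conditioned problem the iterates should agree with the closed form solution to high precision, which is a convenient correctness check for the update rule.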

47 Time Complexity
On average $O(\bar{n})$ per coordinate update, where $\bar{n} = \mathrm{nnz}(X) / d$
$T$ outer iterations, each consisting of $d$ coordinate updates: $T \cdot O(\mathrm{nnz}(X))$ operations
Closed form solution: $O(\mathrm{nnz}(X)\, d + d^3)$ operations
Can be easily extended to solve the Lasso problem

48 Other Iterative Solvers
Other iterative solvers for ridge regression:
  Gradient descent
  Stochastic gradient descent
  ...

49 Coming up
Next class: intro to optimization
Questions?

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 4: ML Models (Overview) Cho-Jui Hsieh UC Davis April 17, 2017 Outline Linear regression Ridge regression Logistic regression Other finite-sum

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 11, 2016 Paper presentations and final project proposal Send me the names of your group member (2 or 3 students) before October 15 (this Friday)

More information

Coordinate Update Algorithm Short Course The Package TMAC

Coordinate Update Algorithm Short Course The Package TMAC Coordinate Update Algorithm Short Course The Package TMAC Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 16 TMAC: A Toolbox of Async-Parallel, Coordinate, Splitting, and Stochastic Methods C++11 multi-threading

More information

A Quick Tour of Linear Algebra and Optimization for Machine Learning

A Quick Tour of Linear Algebra and Optimization for Machine Learning A Quick Tour of Linear Algebra and Optimization for Machine Learning Masoud Farivar January 8, 2015 1 / 28 Outline of Part I: Review of Basic Linear Algebra Matrices and Vectors Matrix Multiplication Operators

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: Naïve Bayes

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Numerical Linear Algebra Background Cho-Jui Hsieh UC Davis May 15, 2018 Linear Algebra Background Vectors A vector has a direction and a magnitude

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 5: Numerical Linear Algebra Cho-Jui Hsieh UC Davis April 20, 2017 Linear Algebra Background Vectors A vector has a direction and a magnitude

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f

More information

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features

More information

Big Data Analytics. Lucas Rego Drumond

Big Data Analytics. Lucas Rego Drumond Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Predictive Models Predictive Models 1 / 34 Outline

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features

More information

MLCC 2017 Regularization Networks I: Linear Models

MLCC 2017 Regularization Networks I: Linear Models MLCC 2017 Regularization Networks I: Linear Models Lorenzo Rosasco UNIGE-MIT-IIT June 27, 2017 About this class We introduce a class of learning algorithms based on Tikhonov regularization We study computational

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass

More information

Some notes on efficient computing and setting up high performance computing environments

Some notes on efficient computing and setting up high performance computing environments Some notes on efficient computing and setting up high performance computing environments Andrew O. Finley Department of Forestry, Michigan State University, Lansing, Michigan. April 17, 2017 1 Efficient

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 3: Linear Models I (LFD 3.2, 3.3) Cho-Jui Hsieh UC Davis Jan 17, 2018 Linear Regression (LFD 3.2) Regression Classification: Customer record Yes/No Regression: predicting

More information

SCMA292 Mathematical Modeling : Machine Learning. Krikamol Muandet. Department of Mathematics Faculty of Science, Mahidol University.

SCMA292 Mathematical Modeling : Machine Learning. Krikamol Muandet. Department of Mathematics Faculty of Science, Mahidol University. SCMA292 Mathematical Modeling : Machine Learning Krikamol Muandet Department of Mathematics Faculty of Science, Mahidol University February 9, 2016 Outline Quick Recap of Least Square Ridge Regression

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 18, 2016 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Linear Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA Chandola@UB CSE 474/574 1

More information

A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization)

A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) Schodinger equation: Hψ = Eψ Choose a basis set of wave functions Two cases: Orthonormal

More information

CSC321 Lecture 2: Linear Regression

CSC321 Lecture 2: Linear Regression CSC32 Lecture 2: Linear Regression Roger Grosse Roger Grosse CSC32 Lecture 2: Linear Regression / 26 Overview First learning algorithm of the course: linear regression Task: predict scalar-valued targets,

More information

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression CSC2515 Winter 2015 Introduction to Machine Learning Lecture 2: Linear regression All lecture slides will be available as.pdf on the course website:

More information

ORIE 4741: Learning with Big Messy Data. Proximal Gradient Method

ORIE 4741: Learning with Big Messy Data. Proximal Gradient Method ORIE 4741: Learning with Big Messy Data Proximal Gradient Method Professor Udell Operations Research and Information Engineering Cornell November 13, 2017 1 / 31 Announcements Be a TA for CS/ORIE 1380:

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

More information

l 1 and l 2 Regularization

l 1 and l 2 Regularization David Rosenberg New York University February 5, 2015 David Rosenberg (New York University) DS-GA 1003 February 5, 2015 1 / 32 Tikhonov and Ivanov Regularization Hypothesis Spaces We ve spoken vaguely about

More information

Symmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano

Symmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano Symmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano Introduction Introduction We wanted to parallelize a serial algorithm for the pivoted Cholesky factorization

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information KYOTO UNIVERSITY Statistical Machine Learning Theory Sparsity Hisashi Kashima DEPARTMENT OF INTELLIGENCE SCIENCE AND TECHNOLOGY 1 KYOTO UNIVERSITY Topics:

More information

Linear Regression. CSL603 - Fall 2017 Narayanan C Krishnan

Linear Regression. CSL603 - Fall 2017 Narayanan C Krishnan Linear Regression CSL603 - Fall 2017 Narayanan C Krishnan Outline Univariate regression Multivariate regression Probabilistic view of regression Loss functions Bias-Variance analysis Regularization

More information

Linear Regression. CSL465/603 - Fall 2016 Narayanan C Krishnan

Linear Regression. CSL465/603 - Fall 2016 Narayanan C Krishnan Linear Regression CSL465/603 - Fall 2016 Narayanan C Krishnan Outline Univariate regression Multivariate regression Probabilistic view of regression Loss functions Bias-Variance analysis

More information

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

Regularized Least Squares

Regularized Least Squares Regularized Least Squares Charlie Frogner 1 MIT 2011 1 Slides mostly stolen from Ryan Rifkin (Google). Summary In RLS, the Tikhonov minimization problem boils down to solving a linear system (and this

More information

Evaluation and Benchmarking of Highly Scalable Parallel Numerical Libraries

Evaluation and Benchmarking of Highly Scalable Parallel Numerical Libraries Evaluation and Benchmarking of Highly Scalable Parallel Numerical Libraries Christos Theodosiou ( User and Application Support Scientific Computing Centre @ AUTH Presentation Outline

More information

Motivation Subgradient Method Stochastic Subgradient Method. Convex Optimization. Lecture 15 - Gradient Descent in Machine Learning

Motivation Subgradient Method Stochastic Subgradient Method. Convex Optimization. Lecture 15 - Gradient Descent in Machine Learning Convex Optimization Lecture 15 - Gradient Descent in Machine Learning Instructor: Yuanzhang Xiao University of Hawaii at Manoa Fall 2017 1 / 21 Today s Lecture 1 Motivation 2 Subgradient Method 3 Stochastic

More information

This ensures that we walk downhill. For fixed λ not even this may be the case.

This ensures that we walk downhill. For fixed λ not even this may be the case. Gradient Descent Objective Function Some differentiable function f : R n R. Gradient Descent Start with some x 0, i = 0 and learning rate λ repeat x i+1 = x i λ f(x i ) until f(x i+1 ) ɛ Line Search Variant

More information

Linear Regression. Aarti Singh. Machine Learning / Sept 27, 2010

Linear Regression. Aarti Singh. Machine Learning / Sept 27, 2010 Linear Regression Aarti Singh Machine Learning 10-701/15-781 Sept 27, 2010 Discrete to Continuous Labels Classification Sports Science News Anemic cell Healthy cell Regression X = Document Y = Topic X

More information

Behavioral Data Mining. Lecture 7 Linear and Logistic Regression

Behavioral Data Mining. Lecture 7 Linear and Logistic Regression Behavioral Data Mining Lecture 7 Linear and Logistic Regression Outline Linear Regression Regularization Logistic Regression Stochastic Gradient Fast Stochastic Methods Performance tips Linear Regression

More information

Linear Models in Machine Learning

Linear Models in Machine Learning CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,

More information

Midterm exam CS 189/289, Fall 2015

Midterm exam CS 189/289, Fall 2015 Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points

More information

Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3 Supervised by Serge PETITON 2

Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3 Supervised by Serge PETITON 2 1 / 23 Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption Langshi CHEN 1,2,3 Supervised by Serge PETITON 2 Maison de la Simulation Lille 1 University CNRS March 18, 2013

More information

Convex Optimization Algorithms for Machine Learning in 10 Slides

Convex Optimization Algorithms for Machine Learning in 10 Slides Convex Optimization Algorithms for Machine Learning in 10 Slides Presenter: Jul. 15. 2015 Outline 1 Quadratic Problem Linear System 2 Smooth Problem Newton-CG 3 Composite Problem Proximal-Newton-CD 4 Non-smooth,

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 6 Structured and Low Rank Matrices Section 6.3 Numerical Optimization Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at

More information

Is the test error unbiased for these programs?

Is the test error unbiased for these programs? Is the test error unbiased for these programs? Xtrain avg N o Preprocessing by de meaning using whole TEST set 2017 Kevin Jamieson 1 Is the test error unbiased for this program? e Stott see non for f x

More information

SPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics


More information

Machine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012

Machine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012 Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University September 20, 2012 Today: Logistic regression Generative/Discriminative classifiers Readings: (see class website)

More information

Machine Learning. Regularization and Feature Selection. Fabio Vandin November 13, 2017
