Subsampling for Ridge Regression via Regularized Volume Sampling

Slide 1: Subsampling for Ridge Regression via Regularized Volume Sampling. Michał Dereziński and Manfred Warmuth, University of California at Santa Cruz. AISTATS 2018.

Slide 2: Linear regression. [Figure: data points with labels y plotted against inputs x.]

Slide 3: Optimal solution. $w^* = \operatorname{argmin}_w \sum_i (x_i w - y_i)^2$. [Figure: least-squares line fitted through the data.]
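To ground the notation, here is a minimal sketch of the least-squares objective on this slide, using the 1-d closed form. Python/NumPy is assumed and the data are made up for illustration; this is not code from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=50)             # inputs x_i
y = 1.5 * x + rng.normal(0, 0.3, size=50)   # noisy labels y_i

# w* = argmin_w sum_i (x_i w - y_i)^2 has this closed form in 1-d
w_star = np.dot(x, y) / np.dot(x, x)
print(w_star)
```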

Slides 4-5: How many labels are needed to get close to the optimum? All $x_i$ are given, but the labels $y_i$ are unknown. Guess how many are needed? [Figure: unlabeled points along the x-axis.]

Slide 6: Answer: one label. [Figure: a single labeled point $(x_i, y_i)$ suffices.]

Slides 7-10: Which one? Picking $x_{\max}$ (the point furthest from 0) is bad; in fact, any deterministic choice is bad. Good: draw one label $y_i$ with probability $P(i) = x_i^2 / \|x\|^2$ and use $\hat{w}_i = y_i / x_i$. Then
$$\mathbb{E}_i \sum_j (x_j \hat{w}_i - x_j w^*)^2 = \sum_j (x_j w^* - y_j)^2,$$
and the estimator is unbiased:
$$\mathbb{E}_i[\hat{w}_i] = \sum_i \underbrace{\frac{x_i^2}{\|x\|^2}}_{P(i)} \underbrace{\frac{y_i}{x_i}}_{\hat{w}_i} = w^*.$$
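Both claims on this slide can be checked numerically on a tiny 1-d example. A small sketch, with Python/NumPy assumed and invented data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
x = rng.uniform(-2, 2, size=n)
y = 1.5 * x + rng.normal(0, 0.5, size=n)

w_star = np.dot(x, y) / np.dot(x, x)      # full-data least squares
p = x**2 / np.dot(x, x)                   # P(i) proportional to x_i^2
w_hat = y / x                             # one-label estimators w_i = y_i / x_i

print(np.dot(p, w_hat), w_star)           # unbiasedness: the two agree

lhs = np.dot(p, [np.sum((x * wi - x * w_star) ** 2) for wi in w_hat])
rhs = np.sum((x * w_star - y) ** 2)
print(lhs, rhs)                           # expected excess loss equals full-data loss
```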

Slides 11-13: Generalization: volume sampling. Given $n$ points $x_i \in \mathbb{R}^d$, choose a subset $S$ of $d$ indices, get labels $y_i \in \mathbb{R}$ for $i \in S$, and let $w^*(S)$ be the least-squares solution for the $d$ examples $\{(x_i, y_i) : i \in S\}$. Key prior result [DW17]: for any $(x_1, y_1), \ldots, (x_n, y_n)$,
$$\mathbb{E}_S \sum_j (x_j^\top w^*(S) - x_j^\top w^*)^2 = d \sum_j (x_j^\top w^* - y_j)^2$$
when $S$ is chosen with probability proportional to the squared volume of the parallelepiped spanned by $\{x_i : i \in S\}$. Unbiasedness: $\mathbb{E}_S[w^*(S)] = w^*$. Open problem: replace the factor $d$ with a small $\epsilon$, using a small sample of size $|S| \geq d$. [DW17] Dereziński and Warmuth, NIPS 2017.
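A brute-force check of the size-$d$ volume sampling identities on these slides, enumerating all subsets of a tiny synthetic problem. Python/NumPy is assumed; this is an exhaustive verification sketch, not the sampling algorithm from the paper.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
d, n = 2, 6
X = rng.normal(size=(d, n))                        # columns x_i in R^d
y = X.T @ rng.normal(size=d) + rng.normal(0, 0.3, size=n)

w_star, *_ = np.linalg.lstsq(X.T, y, rcond=None)   # full-data least squares

# Volume sampling over size-d subsets: P(S) proportional to det(X_S^T X_S)
subsets = [list(S) for S in combinations(range(n), d)]
weights = np.array([np.linalg.det(X[:, S].T @ X[:, S]) for S in subsets])
probs = weights / weights.sum()

# Least-squares solution on each size-d subset solves X_S^T w = y_S
w_S = [np.linalg.solve(X[:, S].T, y[S]) for S in subsets]

# Unbiasedness: E_S[w*(S)] = w*
print(sum(p * w for p, w in zip(probs, w_S)), w_star)

# Expected excess square loss equals d times the full-data square loss
excess = [np.sum((X.T @ (w - w_star)) ** 2) for w in w_S]
print(np.dot(probs, excess), d * np.sum((X.T @ w_star - y) ** 2))
```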

Slide 14: Two types of volume sampling. Let $X \in \mathbb{R}^{d \times n}$ be full rank, $n \geq d$, and let $X_S = [\,x_i\;x_j\,]$ denote the submatrix of the columns selected by $S$ (here $S = \{i, j\}$). Distribution over all $s$-element subsets $S$:
$$P(S) \propto \begin{cases} \det(X_S^\top X_S) & \text{if } s \leq d \quad \text{[DRVW06]} \\ \det(X_S X_S^\top) & \text{if } s \geq d \quad \text{[AB13]} \end{cases}$$
Here $\det(X_S^\top X_S)$ is the squared volume of the parallelepiped $P(x_i, x_j)$ spanned by the selected columns. [DRVW06] Deshpande et al., SODA 2006. [AB13] Avron and Boutsidis, JMAA 2013.
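A quick numerical illustration of which of the two determinants is nonzero in each regime (Python/NumPy assumed; the matrix is random and only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 10
X = rng.normal(size=(d, n))

for s in (2, 3, 5):                      # s < d, s = d, s > d
    XS = X[:, :s]                        # a d x s submatrix of columns
    vol2 = np.linalg.det(XS.T @ XS)      # s x s Gram determinant: used when s <= d
    dual = np.linalg.det(XS @ XS.T)      # d x d determinant: used when s >= d
    print(s, round(vol2, 4), round(dual, 4))
```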

Slides 15-16: Prior work: efficient volume sampling algorithms.
Sampling sets of size $s \leq d$:
1. Deshpande and Rademacher (FOCS 2010): $O(snd^\omega \log d)$
2. Guruswami and Sinop (SODA 2012): $O(snd^2)$
Sampling sets of size $s \geq d$:
1. Li et al. (NIPS 2017), approximate MCMC: $\tilde{O}(s^3 n d^2)$
2. DW (NIPS 2017): $O(n^2 d)$
We give an $O(nd^2)$ algorithm for $s \geq d$.

Slides 17-19: Our contributions: subsampling for linear regression.
1. Statistical
   1.1 Dimension-free error bounds (under random noise assumptions): $\text{Error} = O\!\left(\frac{\text{degrees of freedom}}{\text{subset size } s}\right)$
   1.2 Lower bounds: volume sampling achieves near-optimal error bounds; no i.i.d. sampling works as well for small sample sizes $s$
2. Computational
   2.1 Regularized volume sampling
   2.2 Faster sampling algorithm: $O(nd^2)$ for any $s \geq d$

Slides 20-21: Statistical model: linear labels plus bounded noise. Fixed design matrix $X \in \mathbb{R}^{d \times n}$, linear model $w^* \in \mathbb{R}^d$:
$$y = X^\top w^* + \xi, \quad \text{where } \mathbb{E}[\xi] = 0 \text{ and } \mathrm{Var}[\xi] \preceq \sigma^2 I.$$
Goal: minimize the Mean Squared Prediction Error using few labels:
$$\mathrm{MSPE}(\hat{w}) = \mathbb{E}\left[\frac{1}{n} \sum_{i=1}^n (x_i^\top \hat{w} - x_i^\top w^*)^2\right].$$
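A minimal Monte Carlo sketch of the MSPE under this fixed-design model. Python/NumPy is assumed; the design, noise level, and the full-data ridge estimator used here are illustrative choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, sigma, lam = 5, 100, 0.5, 1.0
X = rng.normal(size=(d, n))                  # fixed design, columns x_i
w_star = rng.normal(size=d)

def full_data_ridge(X, y):
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)

def mspe(estimator, trials=2000):
    """Monte Carlo estimate of E[(1/n) * sum_i (x_i^T w_hat - x_i^T w*)^2]."""
    errs = []
    for _ in range(trials):
        y = X.T @ w_star + rng.normal(0, sigma, size=n)   # fresh noise each trial
        w_hat = estimator(X, y)
        errs.append(np.mean((X.T @ (w_hat - w_star)) ** 2))
    return np.mean(errs)

print(mspe(full_data_ridge))
```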

Slides 22-23: Dimension-free error bound. Ridge estimator:
$$\hat{w}_\lambda(S) = \operatorname{argmin}_w \|X_S^\top w - y_S\|^2 + \lambda \|w\|^2 = (X_S X_S^\top + \lambda I)^{-1} X_S y_S.$$
Main theorem: if $\lambda \leq \sigma^2 / \|w^*\|^2$, then there is a distribution over subsets $S$ of size $s$ such that
$$\mathrm{MSPE}(\hat{w}_\lambda(S)) \leq \frac{\sigma^2 d_\lambda}{s - d_\lambda + 1} = O\!\left(\frac{\sigma^2 d_\lambda}{s}\right),$$
where $d_\lambda = \sum_{i=1}^d \frac{\lambda_i}{\lambda_i + \lambda} < d$, with $\{\lambda_i\}$ the eigenvalues of $XX^\top$; $d_\lambda$ is the degrees of freedom (statistical dimension) of $X$ given $\lambda$. Note: under decaying eigenvalues, $d_\lambda \ll d$.
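A short sketch of the two quantities in the theorem, the subset ridge estimator and $d_\lambda$. Python/NumPy is assumed, and the helper names (ridge_on_subset, degrees_of_freedom) and the decaying-spectrum design are made up for this example.

```python
import numpy as np

def ridge_on_subset(X, y, S, lam):
    """Subset ridge estimator: w_hat = (X_S X_S^T + lam I)^{-1} X_S y_S."""
    XS, yS = X[:, S], y[S]
    return np.linalg.solve(XS @ XS.T + lam * np.eye(X.shape[0]), XS @ yS)

def degrees_of_freedom(X, lam):
    """d_lambda = sum_i lambda_i / (lambda_i + lam), eigenvalues of X X^T."""
    eigs = np.linalg.eigvalsh(X @ X.T)
    return float(np.sum(eigs / (eigs + lam)))

rng = np.random.default_rng(4)
d, n, lam = 5, 50, 1.0
X = rng.normal(size=(d, n)) * np.linspace(2.0, 0.2, d)[:, None]   # decaying spectrum
y = X.T @ rng.normal(size=d) + rng.normal(0, 0.5, size=n)

print(degrees_of_freedom(X, lam), d)               # d_lambda < d
print(ridge_on_subset(X, y, list(range(10)), lam))
```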

Slides 24-25: Regularized volume sampling. Naive idea: replace $P(S) \propto \det(X_S X_S^\top)$ with $P(S) \propto \det(X_S X_S^\top + \lambda I)$. Problem: it is not clear how to sample from this efficiently. Intuition: there is no simple extension of the Cauchy-Binet formula
$$\sum_{S : |S| = s} \det(X_S X_S^\top) = \binom{n - d}{s - d} \det(XX^\top), \qquad \sum_{S : |S| = s} \det(X_S X_S^\top + \lambda I) = \;???$$
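The unregularized identity can be verified by brute force on a tiny instance. A sketch, with Python/NumPy assumed:

```python
import numpy as np
from math import comb
from itertools import combinations

rng = np.random.default_rng(5)
d, n, s = 2, 6, 4
X = rng.normal(size=(d, n))

# Sum det(X_S X_S^T) over all subsets of size s (with s >= d)
total = sum(np.linalg.det(X[:, list(S)] @ X[:, list(S)].T)
            for S in combinations(range(n), s))
print(total, comb(n - d, s - d) * np.linalg.det(X @ X.T))   # the two sides agree
```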

Slides 26-28: Reverse iterative sampling. Start with $S = \{1, \ldots, n\}$; sample an index $i \in S$; move to the set $S_{-i} = S \setminus \{i\}$; repeat until the desired size is reached (the set shrinks through sizes $n, n-1, \ldots, s+1, s$). The transition probability is
$$P(S_{-i} \mid S) \propto \frac{\det(X_{S_{-i}} X_{S_{-i}}^\top + \lambda I)}{\det(X_S X_S^\top + \lambda I)} = 1 - x_i^\top (X_S X_S^\top + \lambda I)^{-1} x_i.$$
Note: the resulting distribution is not the same as $P(S) \propto \det(X_S X_S^\top + \lambda I)$ (unless $\lambda = 0$).
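A minimal sketch of this reverse-iterative procedure (Python/NumPy assumed). For clarity it recomputes the inverse at every step, so it is slower than the algorithms discussed later; it is not the authors' code.

```python
import numpy as np

def regularized_volume_sample(X, s, lam, rng):
    """Shrink S from {1..n} down to size s, removing index i with probability
    proportional to 1 - x_i^T (X_S X_S^T + lam I)^{-1} x_i."""
    d, n = X.shape
    S = list(range(n))
    while len(S) > s:
        XS = X[:, S]
        A_inv = np.linalg.inv(XS @ XS.T + lam * np.eye(d))
        h = np.array([1.0 - X[:, i] @ A_inv @ X[:, i] for i in S])
        drop = rng.choice(len(S), p=h / h.sum())
        S.pop(drop)
    return S

rng = np.random.default_rng(6)
X = rng.normal(size=(3, 20))
print(regularized_volume_sample(X, s=5, lam=0.1, rng=rng))
```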

Slides 29-30: Key expectation bound.
Lemma: for $\lambda$-regularized volume sampling of a set $S$ of size $s$,
$$\mathbb{E}_S\left[(X_S X_S^\top + \lambda I)^{-1}\right] \preceq \frac{n - d_\lambda + 1}{s - d_\lambda + 1}\, (XX^\top + \lambda I)^{-1}$$
(for $\lambda = 0$, this was shown in [DW17]).
Proof of main theorem: bias-variance decomposition for fixed $S$:
$$\mathrm{MSPE}(\hat{w}_\lambda(S) \mid S) = (\text{bias})^2 + \text{variance} \leq \frac{\sigma^2}{n} \operatorname{tr}\!\left(X^\top (X_S X_S^\top + \lambda I)^{-1} X\right).$$
Take the expectation over $S$ and apply the lemma.
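To make the last step explicit, here is a worked expansion of how the lemma yields the stated bound, using $\operatorname{tr}\!\big(X^\top (XX^\top+\lambda I)^{-1} X\big) = \sum_i \frac{\lambda_i}{\lambda_i+\lambda} = d_\lambda$ and $\frac{n-d_\lambda+1}{n} \le 1$ (a reconstruction of the argument, not text from the slides):

```latex
\mathbb{E}_S\,\mathrm{MSPE}(\hat{w}_\lambda(S)\mid S)
  \;\le\; \frac{\sigma^2}{n}\,\operatorname{tr}\!\Big(X^\top\,
          \mathbb{E}_S\big[(X_S X_S^\top+\lambda I)^{-1}\big]\,X\Big)
  \;\le\; \frac{\sigma^2}{n}\cdot\frac{n-d_\lambda+1}{s-d_\lambda+1}\,
          \operatorname{tr}\!\big(X^\top (XX^\top+\lambda I)^{-1} X\big)
  \;=\; \frac{\sigma^2\,(n-d_\lambda+1)\, d_\lambda}{n\,(s-d_\lambda+1)}
  \;\le\; \frac{\sigma^2 d_\lambda}{s-d_\lambda+1}.
```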

Slides 31-34: Volume sampling vs leverage score sampling. Leverage score sampling: examples selected i.i.d. with probability proportional to $x_i^\top (XX^\top)^{-1} x_i$. How many labels are needed for $\mathrm{MSPE}(\hat{w}) \leq \sigma^2$? ($\sigma^2$ = label noise)
1. Volume sampling: $2 d_\lambda$ labels.
2. Any i.i.d. sampling: $\Omega(d_\lambda \ln d_\lambda)$ labels (lower bound).
Our volume sampling is as fast as computing exact leverage scores.
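For reference, a sketch of the i.i.d. baseline being compared against: computing leverage scores and sampling from them. Python/NumPy is assumed and the helper name leverage_scores is made up here.

```python
import numpy as np

def leverage_scores(X):
    """Leverage score of column i: x_i^T (X X^T)^{-1} x_i. They sum to d."""
    M_inv = np.linalg.inv(X @ X.T)
    return np.einsum('ij,jk,ki->i', X.T, M_inv, X)

rng = np.random.default_rng(7)
X = rng.normal(size=(3, 12))
scores = leverage_scores(X)
print(scores.sum())                           # equals d = 3
probs = scores / scores.sum()                 # i.i.d. sampling distribution
print(rng.choice(X.shape[1], size=6, replace=True, p=probs))
```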

Slide 35: Volume sampling vs leverage scores on real data. [Figure: empirical comparison of the two methods on real datasets.]

Slide 36: Simple algorithm for volume sampling. Start with $S = \{1, \ldots, n\}$; sample an index $i \in S$; move to $S_{-i} = S \setminus \{i\}$; repeat until the desired size. Simple algorithm: update the distribution $P(S_{-i} \mid S)$ at every step. Runtime: $\underbrace{(n - s)}_{\text{steps}} \times \underbrace{O(nd)}_{\text{update}} = O(n^2 d)$. Problem: quadratic dependence on $n$.

Slides 37-38: Faster algorithm via rejection sampling. Recall: $P(S_{-i} \mid S) \propto 1 - x_i^\top (X_S X_S^\top + \lambda I)^{-1} x_i$. Idea: rejection sampling from the distribution $P(S_{-i} \mid S)$:
1. Sample $i$ uniformly from the set $S$,
2. Compute $h_i = 1 - x_i^\top (X_S X_S^\top + \lambda I)^{-1} x_i$ (one trial),
3. With probability $1 - h_i$ reject and go back to step 1.
We show: the number of trials per step is constant w.h.p. Runtime: $\underbrace{(n - s)}_{\text{steps}} \times \underbrace{O(1)}_{\text{trials per step}} \times \underbrace{O(d^2)}_{\text{compute } h_i} = O(nd^2)$. Result: linear dependence on $n$.
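A simplified sketch of the rejection-sampling idea (Python/NumPy assumed). This is an illustrative take on the approach, not the authors' FastRegVol implementation; it keeps the inverse up to date with Sherman-Morrison downdates so that each accepted step costs $O(d^2)$.

```python
import numpy as np

def fast_reg_vol(X, s, lam, rng):
    """Reverse iterative regularized volume sampling via rejection sampling.
    Maintains A_inv = (X_S X_S^T + lam I)^{-1} with rank-one downdates."""
    d, n = X.shape
    S = list(range(n))
    A_inv = np.linalg.inv(X @ X.T + lam * np.eye(d))
    while len(S) > s:
        while True:
            j = rng.integers(len(S))              # 1. sample i uniformly from S
            xi = X[:, S[j]]
            h = 1.0 - xi @ A_inv @ xi             # 2. acceptance probability h_i
            if rng.random() < h:                  # 3. reject with probability 1 - h_i
                break
        # Remove the accepted index; downdate the inverse of A - x_i x_i^T
        A_inv = A_inv + np.outer(A_inv @ xi, xi @ A_inv) / h
        S.pop(j)
    return S

rng = np.random.default_rng(8)
X = rng.normal(size=(4, 50))
print(fast_reg_vol(X, s=10, lam=0.5, rng=rng))
```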

Slide 39: Faster volume sampling on real data. RegVol: simple algorithm; FastRegVol: rejection sampling. [Figure: runtime in seconds vs. data size n on the cadata, MSD, cpusmall, and abalone datasets, comparing RegVol and FastRegVol.]

Slides 40-42: Conclusions.
1. Dimension-free error bounds for subsampled regression.
2. Optimal subsampling must be joint.
3. Why pick volume sampling over leverage score sampling?
   3.1 Better error bounds for small sample sizes,
   3.2 Nearly the same computational efficiency.

Slide 43: Thank you.
