Subsampling for Ridge Regression via Regularized Volume Sampling
Michał Dereziński and Manfred Warmuth
University of California at Santa Cruz
AISTATS 2018
Linear regression
[plot: data points in the (x, y) plane]
Optimal solution
w* = argmin_w Σ_i (x_i w − y_i)²
[plot: fitted line through the data]
How many labels are needed to get close to the optimum?
- All x_i are given
- But the labels y_i are unknown
Guess how many are needed?
Answer: one label
[plot: a single highlighted point x_i]
Which one?
- x_max (the point furthest from 0) is bad
- any deterministic choice is bad

Good: draw 1 label y_i with probability P(i) = x_i² / Σ_k x_k², and estimate w(i) = y_i / x_i. Then

  E_i [ Σ_j (x_j w(i) − x_j w*)² ] = Σ_j (x_j w* − y_j)²,   and   E_i [ w(i) ] = w*.
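As a quick numerical illustration of this one-dimensional claim (a sketch added here, not part of the original slides), the following checks unbiasedness of the single-label estimator by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)                       # 1-d inputs x_i
y = 2.0 * x + rng.normal(scale=0.5, size=n)  # noisy linear labels y_i

w_star = (x @ y) / (x @ x)        # full-data least-squares optimum w*
p = x**2 / (x**2).sum()           # P(i) proportional to x_i^2

# Monte Carlo estimate of E_i[w(i)] with w(i) = y_i / x_i
idx = rng.choice(n, size=100_000, p=p)
print(w_star, (y[idx] / x[idx]).mean())  # the two values nearly agree
```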
Generalization: Volume Sampling
- Given n points x_i ∈ R^d
- Choose a subset S of d indices
- Get labels y_i ∈ R for i ∈ S
- Find w*(S): the solution for the d examples {(x_i, y_i) : i ∈ S}

Key prior result [DW17]: for any (x_1, y_1), ..., (x_n, y_n),

  E_S [ Σ_j (x_j^T w*(S) − x_j^T w*)² ] = d Σ_j (x_j^T w* − y_j)²

when S is chosen ∝ the squared volume of the parallelepiped spanned by the {x_i : i ∈ S}.

Unbiasedness: E_S [ w*(S) ] = w*

Open problem: replace the factor d with 1 + ε, using a small sample S of size ≥ d. A numerical check of the key result appears below.

[DW17] Dereziński and Warmuth. NIPS 2017.
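For small n and d the unbiasedness in [DW17] can be verified exactly by enumerating all d-subsets; a minimal sketch (assuming, as above, that the columns of X are the points x_i):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, d = 8, 3
X = rng.normal(size=(d, n))                          # columns are the x_i
y = X.T @ rng.normal(size=d) + rng.normal(scale=0.1, size=n)

w_star = np.linalg.lstsq(X.T, y, rcond=None)[0]      # full-data optimum w*

# Volume sampling over d-subsets: P(S) proportional to det(X_S)^2
subsets = [list(S) for S in combinations(range(n), d)]
weights = np.array([np.linalg.det(X[:, S]) ** 2 for S in subsets])
P = weights / weights.sum()

# Exact expectation of w*(S), solving X_S^T w = y_S for each subset
E_w = sum(p * np.linalg.solve(X[:, S].T, y[S]) for p, S in zip(P, subsets))
print(np.allclose(E_w, w_star))                      # True: E_S[w*(S)] = w*
```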
Two types of volume sampling
Let X ∈ R^{d×n} be full rank, n ≥ d, and let X_S denote the submatrix of the columns {x_i : i ∈ S}.
Distribution over all s-element subsets S:
- P(S) ∝ det(X_S^T X_S) if s ≤ d   [DRVW06]
- P(S) ∝ det(X_S X_S^T) if s ≥ d   [AB13]
det(X_S^T X_S) = squared volume of the parallelepiped P(x_i, ..., x_j) spanned by the {x_i : i ∈ S}.

[DRVW06] Deshpande et al. SODA 2006.
[AB13] Avron and Boutsidis. JMAA 2013.
Prior work: efficient volume sampling algorithms
Sampling sets of size s ≤ d:
1. Deshpande and Rademacher (FOCS 10): O(s n d^ω log d)
2. Guruswami and Sinop (SODA 12): O(s n d²)
Sampling sets of size s ≥ d:
1. Li et al. (NIPS 17) (approximate, MCMC): Õ(s³ n d²)
2. DW (NIPS 17): O(n² d)
We give an O(n d²) algorithm for s ≥ d.
Our contributions: subsampling for linear regression
1. Statistical
   1.1 Dimension-free error bounds (under random-noise assumptions):
       Error = O(degrees of freedom / subset size s)
   1.2 Lower bounds:
       - volume sampling achieves near-optimal error bounds
       - no i.i.d. sampling works as well for small sample size s
2. Computational
   2.1 Regularized volume sampling
   2.2 Faster sampling algorithm: O(n d²) for any s ≥ d
Statistical model: linear labels plus bounded noise
Fixed design matrix X ∈ R^{d×n}; linear model w* ∈ R^d:

  y = X^T w* + ξ,   where E[ξ] = 0 and Var[ξ] ⪯ σ² I.

Goal: minimize the Mean Squared Prediction Error using few labels:

  MSPE(ŵ) = E [ (1/n) Σ_{i=1}^n (x_i^T ŵ − x_i^T w*)² ]
Dimension-free error bound
Ridge estimator: ŵ_λ(S) = argmin_w ‖X_S^T w − y_S‖² + λ‖w‖² = (X_S X_S^T + λI)^{-1} X_S y_S

Main theorem: If λ ≤ σ²/‖w*‖², then there is a distribution over subsets S of size s such that

  MSPE(ŵ_λ(S)) ≤ σ² d_λ / (s − d_λ + 1) = O(σ² d_λ / s),

where d_λ = Σ_{i=1}^d λ_i / (λ_i + λ) < d, and {λ_i} are the eigenvalues of XX^T.
d_λ is the degrees of freedom (statistical dimension) of X, given λ.
Note: under decaying eigenvalues, d_λ ≪ d.
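The two quantities in the theorem are straightforward to compute; a minimal sketch in the notation above (helper names are our own, not from the talk):

```python
import numpy as np

def ridge_estimator(X_S, y_S, lam):
    """Ridge solution (X_S X_S^T + lam I)^{-1} X_S y_S for columns X_S (d x s)."""
    d = X_S.shape[0]
    return np.linalg.solve(X_S @ X_S.T + lam * np.eye(d), X_S @ y_S)

def degrees_of_freedom(X, lam):
    """d_lam = sum_i lam_i / (lam_i + lam), over eigenvalues lam_i of X X^T."""
    eig = np.linalg.eigvalsh(X @ X.T)
    return float((eig / (eig + lam)).sum())
```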
Regularized volume sampling
Naive idea: replace P(S) ∝ det(X_S^T X_S) with P(S) ∝ det(X_S X_S^T + λI).
Problem: not clear how to sample efficiently.
Intuition: no simple extension of the Cauchy–Binet formula:

  Σ_{S : |S|=s} det(X_S X_S^T) = (n−d choose s−d) · det(XX^T),   but   Σ_{S : |S|=s} det(X_S X_S^T + λI) = ???
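The unregularized identity is easy to confirm numerically for small n; a quick sketch:

```python
import numpy as np
from itertools import combinations
from math import comb

rng = np.random.default_rng(2)
n, d, s = 7, 3, 5
X = rng.normal(size=(d, n))

total = sum(np.linalg.det(X[:, list(S)] @ X[:, list(S)].T)
            for S in combinations(range(n), s))
print(np.isclose(total, comb(n - d, s - d) * np.linalg.det(X @ X.T)))  # True
```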
Reverse iterative sampling
{1..n} ⊇ ··· ⊇ S → S−i ⊇ ··· (sizes n, n−1, ..., s+1, s)
- Start with S = {1..n}
- Sample an index i ∈ S
- Go to the set S−i = S ∖ {i}
- Repeat until the desired size

  P(S−i | S) ∝ det(X_{S−i} X_{S−i}^T + λI) / det(X_S X_S^T + λI) = 1 − x_i^T (X_S X_S^T + λI)^{-1} x_i

Note: this is not the same as P(S) ∝ det(X_S X_S^T + λI) (unless λ = 0).
Key Expectation Bound
Lemma: For λ-regularized volume sampling of a set S of size s,

  E_S [ (X_S X_S^T + λI)^{-1} ] ⪯ (n − d_λ + 1)/(s − d_λ + 1) · (XX^T + λI)^{-1}

(for λ = 0, this was shown in [DW17]).

Proof of main theorem: bias–variance decomposition for a fixed S:

  MSPE(ŵ_λ(S) | S) = (bias)² + variance ≤ (σ²/n) tr(X^T (X_S X_S^T + λI)^{-1} X).

Take the expectation over S and apply the lemma.
Volume sampling vs leverage score sampling
Leverage score sampling: examples selected i.i.d. with probability ∝ x_i^T (XX^T)^{-1} x_i.
How many labels are needed for MSPE(ŵ) ≤ σ²? (σ² is the label noise.)
1. Volume sampling: 2 d_λ labels
2. Any i.i.d. sampling: Ω(d_λ ln d_λ) labels (lower bound)
Our volume sampling is as fast as computing exact leverage scores.
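For reference, exact leverage scores (and i.i.d. sampling from them) take one d×d inverse; a minimal sketch, with function names of our own choosing:

```python
import numpy as np

def leverage_scores(X):
    """l_i = x_i^T (X X^T)^{-1} x_i for each column x_i; the l_i sum to d."""
    G_inv = np.linalg.inv(X @ X.T)
    return np.einsum('ji,jk,ki->i', X, G_inv, X)

def leverage_sample(X, s, rng=None):
    """Select s indices i.i.d. (with replacement) proportional to leverage."""
    rng = rng or np.random.default_rng()
    l = leverage_scores(X)
    return rng.choice(X.shape[1], size=s, p=l / l.sum())
```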
Volume sampling vs leverage scores on real data
[experimental plots comparing the two methods on real datasets]
Simple algorithm for volume sampling
Run the reverse iterative chain {1..n} → ··· → S directly: start with S = {1..n}; sample an index i ∈ S; go to S−i = S ∖ {i}; repeat until the desired size, updating the distribution P(S−i | S) at every step.
Runtime: (n − s) steps × O(n d) per update = O(n² d).
Problem: quadratic dependence on n.
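A direct implementation of this simple scheme might look as follows (a sketch: it recomputes the inverse from scratch each step for clarity, whereas the talk's O(n²d) version maintains the distribution with O(nd) updates per step):

```python
import numpy as np

def reg_vol_sample(X, s, lam, rng=None):
    """Naive lambda-regularized volume sampling by reverse iterative removal."""
    rng = rng or np.random.default_rng()
    d, n = X.shape
    S = list(range(n))
    while len(S) > s:
        A_inv = np.linalg.inv(X[:, S] @ X[:, S].T + lam * np.eye(d))
        # P(S - i | S) proportional to 1 - x_i^T (X_S X_S^T + lam I)^{-1} x_i
        h = 1.0 - np.einsum('ji,jk,ki->i', X[:, S], A_inv, X[:, S])
        i = rng.choice(len(S), p=h / h.sum())
        S.pop(i)
    return S
```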
Faster algorithm via rejection sampling
Recall: P(S−i | S) ∝ 1 − x_i^T (X_S X_S^T + λI)^{-1} x_i.
Idea: rejection sampling from the distribution P(S−i | S):
1. Sample i uniformly from the set S,
2. Compute h_i = 1 − x_i^T (X_S X_S^T + λI)^{-1} x_i,   (steps 1–3 form one trial)
3. With probability 1 − h_i, reject and go back to 1.
We show: the number of trials per step is constant w.h.p.
Runtime: (n − s) steps × O(1) trials per step × O(d²) to compute h_i = O(n d²).
Result: linear dependence on n.
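A sketch of the rejection-sampling variant, maintaining the inverse with a Sherman–Morrison downdate (this follows the three steps above; the exact FastRegVol implementation in the paper may differ in details):

```python
import numpy as np

def fast_reg_vol_sample(X, s, lam, rng=None):
    """Rejection-sampling regularized volume sampling: O(d^2) work per trial."""
    rng = rng or np.random.default_rng()
    d, n = X.shape
    S = list(range(n))
    A_inv = np.linalg.inv(X @ X.T + lam * np.eye(d))
    while len(S) > s:
        while True:
            i = S[rng.integers(len(S))]           # 1. uniform candidate from S
            h = 1.0 - X[:, i] @ A_inv @ X[:, i]   # 2. acceptance probability h_i
            if rng.random() < h:                  # 3. otherwise reject, retry
                break
        S.remove(i)
        # Sherman-Morrison downdate after removing column x_i from X_S:
        # (A - x x^T)^{-1} = A^{-1} + A^{-1} x x^T A^{-1} / (1 - x^T A^{-1} x)
        v = A_inv @ X[:, i]
        A_inv += np.outer(v, v) / h
    return S
```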
Faster volume sampling on real data
RegVol: simple algorithm; FastRegVol: rejection sampling.
[plots: running time (seconds) vs data size n on the cadata, MSD, cpusmall, and abalone datasets]
Conclusions
1. Dimension-free error bounds for subsampled regression.
2. Optimal subsampling must be joint.
3. Why pick volume sampling over leverage score sampling?
   3.1 Better error bounds for small sample sizes,
   3.2 Nearly the same computational efficiency.
Thank you