COMPRESSED AND PENALIZED LINEAR REGRESSION
1 COMPRESSED AND PENALIZED LINEAR REGRESSION. Daniel J. McDonald, Indiana University, Bloomington. mypage.iu.edu/~dajmcdon. 2 June
2 OBLIGATORY DATA IS BIG SLIDE
Modern statistical applications (genomics, neural image analysis, text analysis, weather prediction) have large numbers of covariates p. They also frequently have lots of observations n. We need algorithms that can handle these kinds of data sets and have good statistical properties.
3 LESSON OF THE TALK
Many statistical methods use (perhaps implicitly) a singular value decomposition (SVD) to solve an optimization problem. The SVD is computationally expensive. We want to understand the statistical properties of some approximations which speed up computation and save storage. Spoiler: sometimes the approximations actually improve the statistical properties.
4 CORE TECHNIQUES
Suppose we have a matrix X ∈ R^(n×p) and a vector Y ∈ R^n.
LEAST SQUARES: β̂ = argmin_β ‖Xβ − Y‖₂²
ℓ₂ REGULARIZATION: β̂(λ) = argmin_β ‖Xβ − Y‖₂² + λ‖β‖₂²
5 CORE TECHNIQUES
If X fits into RAM, there exist excellent algorithms in LAPACK that are
- double precision,
- very stable,
- cubic complexity, O(np² + p³), with small constants,
- but require extensive random access to the matrix.
There is a lot of interest in finding and analyzing techniques that extend these approaches to large(r) problems.
6 OUT-OF-CORE TECHNIQUES
If X is too large to manipulate in RAM, there are other approaches:
- (stochastic) gradient descent
- conjugate gradient
- rank-one QR updates
- Krylov subspace methods
7 OUT-OF-CORE TECHNIQUES
Many techniques focus on randomized compression. This is sometimes known as sketching or preconditioning.
- Rokhlin and Tygert (2008), A fast randomized algorithm for overdetermined linear least-squares regression.
- Drineas, Mahoney, et al. (2011), Faster least squares approximation.
- Woodruff (2014), Sketching as a Tool for Numerical Linear Algebra.
- Wang, Lee, Mahdavi, Kolar, and Srebro (2016), Sketching meets random projection in the dual.
- Ma, Mahoney, and Yu (2015), A statistical perspective on algorithmic leveraging.
- Pilanci and Wainwright, multiple papers.
- Others.
8 COMPRESSION
BASIC IDEA: Choose some matrix Q ∈ R^(q×n) and use QX (and QY) in the optimization instead of X and Y.
Finding QX for arbitrary Q and X takes O(qnp) computations. This can be expensive. To get this approach to work, we need some structure on Q.
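To make the basic idea concrete, here is a minimal sketch (my own illustration, not code from the talk) of compressed least squares with a dense Gaussian Q; the sizes n, p, q and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 10_000, 50, 500                      # compress n observations down to q

X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
Y = X @ beta + rng.standard_normal(n)

Q = rng.standard_normal((q, n)) / np.sqrt(q)   # dense Gaussian sketching matrix
QX, QY = Q @ X, Q @ Y                          # forming QX costs O(qnp)

beta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]    # solve the original problem
beta_cmp = np.linalg.lstsq(QX, QY, rcond=None)[0]  # solve the compressed problem
print(np.linalg.norm(beta_cmp - beta_ols))         # distance between the two solutions
```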
9 THE Q MATRIX
- Gaussian: well-behaved distribution and eas(ier) theory, but a dense matrix.
- Fast Johnson-Lindenstrauss methods, e.g. a randomized Hadamard (or Fourier) transformation: allows O(np log(p)) computations.
- Q = τπ for π a permutation of I and τ = [I_q 0]: QX means reading q (random) rows.
- Sparse Bernoulli: i.i.d. Q_ij = +1 with probability 1/(2s), 0 with probability 1 − 1/s, and −1 with probability 1/(2s). This means QX takes O(qnp/s) computations on average (a small sketch follows below).
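As one concrete example, the sparse Bernoulli Q can be generated and applied as follows. This is a small sketch of my own (built densely here purely for illustration, which forfeits the storage savings), with q, n, and s chosen arbitrarily.

```python
import numpy as np
from scipy import sparse

def sparse_bernoulli_sketch(q, n, s, rng):
    """q x n matrix with i.i.d. entries: +1 or -1 with probability 1/(2s) each, 0 otherwise."""
    nonzero = rng.random((q, n)) < 1.0 / s          # about a fraction 1/s of entries survive
    signs = rng.choice([-1.0, 1.0], size=(q, n))    # fair random signs
    return sparse.csr_matrix(nonzero * signs)

rng = np.random.default_rng(1)
Q = sparse_bernoulli_sketch(q=500, n=10_000, s=20, rng=rng)
print(Q.nnz / (Q.shape[0] * Q.shape[1]))            # about 1/s nonzeros, so QX is cheap
```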
10 TYPICAL RESULTS
The general philosophy: find an approximation algorithm whose solution is as close as possible to the solution of the original problem.
For OLS, typical results produce a β̃ such that
‖Xβ̃ − Y‖₂² ≤ (1 + ɛ)‖Xβ̂ − Y‖₂²,
‖X(β̃ − β̂)‖₂² ≤ ɛ‖Xβ̂‖₂²,
‖β̃ − β̂‖₂² ≤ ɛ‖β̂‖₂².
Here, β̃ should be easier to compute than β̂ = argmin_β ‖Xβ − Y‖₂².
11 COLLABORATOR & GRANT SUPPORT
Collaborator: Darren Homrighausen, Department of Statistics, Southern Methodist University.
Grant support: NSF, Institute for New Economic Thinking.
12 An alternate perspective
13 RISK
Form a loss function ℓ : Θ × Θ → R₊. The quality of an estimator θ̂ is given by its risk
R(θ̂) = E[ℓ(θ̂, θ)].
We could use ℓ₂ estimation risk, R(θ̂) = E‖θ̂ − θ‖², or excess ℓ₂ prediction risk,
R(θ̂) = E‖Y − Xθ̂‖² = E‖Y − Xθ + Xθ − Xθ̂‖² = constant + E‖X(θ − θ̂)‖².
14 RISK DECOMPOSITION
For an approximation θ̃ of θ̂,
E‖θ̃ − θ‖² = E‖(θ̃ − θ̂) + (θ̂ − Eθ̂) + (Eθ̂ − θ)‖² ≈ Approx. error² + Variance + Bias².
Previous analyses focus only on the approximation error (with expectation over the algorithm, not the data).
15 BIAS-VARIANCE TRADEOFF
[Figure: MSE, variance, and squared bias as functions of model complexity.]
The typical result compares to the zero-bias estimator, which is assumed to have small variance. We examine E‖θ̃ − θ‖², where the expectation is over everything random. The closest similar analysis is Ma, Mahoney, and Yu (JMLR, 2015).
16 COMPRESSED REGRESSION
Let Q ∈ R^(q×n). Call
β̂_FC = argmin_β ‖Q(Xβ − Y)‖₂²
the fully compressed least squares estimator. This is a common way to solve least squares problems that are very large or poorly conditioned. The numerical/theoretical properties generally depend on Q, q, and X.
17 FAMILY OF 4
1. Full compression:
β̂_FC = argmin_β ‖QY − QXβ‖₂² = argmin_β ‖QXβ‖₂² − 2Y⊤Q⊤QXβ = (X⊤Q⊤QX)⁻¹X⊤Q⊤QY
2. Partial compression (also called "Hessian sketching"):
β̂_PC = argmin_β ‖QXβ‖₂² − 2Y⊤Xβ = (X⊤Q⊤QX)⁻¹X⊤Y
18 WE ALSO COMBINE THESE
Write B = [β̂_FC β̂_PC] and W = XB.
3. Linear combination compression: α̂_lin = argmin_α ‖Wα − Y‖₂², and β̂_lin = B α̂_lin.
4. Convex combination compression: α̂_con = argmin_{α ≥ 0, Σᵢ αᵢ = 1} ‖Wα − Y‖₂², and β̂_con = B α̂_con.
These are simple to calculate given FC and PC (see the sketch below).
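Here is a minimal sketch (my own illustration, not the authors' implementation) of all four estimators, each computed with an ℓ₂ penalty lam as the talk does throughout. With only two columns in W, the convex combination reduces to a one-dimensional quadratic, which can be solved in closed form and clipped to [0, 1].

```python
import numpy as np

def family_of_four(X, Y, Q, lam):
    p = X.shape[1]
    QX, QY = Q @ X, Q @ Y
    A = QX.T @ QX + lam * np.eye(p)              # X'Q'QX + lam * I
    beta_fc = np.linalg.solve(A, QX.T @ QY)      # 1. full compression
    beta_pc = np.linalg.solve(A, X.T @ Y)        # 2. partial compression ("Hessian sketching")

    B = np.column_stack([beta_fc, beta_pc])
    W = X @ B
    # 3. linear combination: unconstrained least squares in alpha
    alpha_lin = np.linalg.lstsq(W, Y, rcond=None)[0]
    # 4. convex combination: a * beta_fc + (1 - a) * beta_pc with a in [0, 1]
    d, r = W[:, 0] - W[:, 1], Y - W[:, 1]
    a = float(np.clip(d @ r / (d @ d), 0.0, 1.0))
    return {"FC": beta_fc, "PC": beta_pc,
            "linear": B @ alpha_lin, "convex": a * beta_fc + (1 - a) * beta_pc}
```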
19 WHY THESE?
It turns out that FC is (approximately) unbiased, and therefore worse than OLS (it has high variance). On the other hand, PC is biased, and empirical results demonstrate low variance. A combination should give better statistical properties. We do everything with an ℓ₂ penalty.
20 Evidence from simulations
21 SIMULATION SETUP
Draw X_i ~ MVN(0, (1 − ρ)I_p + ρ11⊤); we use ρ ∈ {0.2, 0.8}.
Draw β ~ N(0, τ²I_p).
Draw Y_i = X_i⊤β + ɛ_i with ɛ_i ~ N(0, σ²).
22 BAYES ESTIMATOR
For this model, the optimal estimator (in MSE) is
β̂_B = (X⊤X + λ*I_p)⁻¹X⊤Y, with λ* = σ²/(nτ²).
This is the posterior mode under the conjugate normal prior. It is also the ridge regression estimator for a particular λ.
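A minimal sketch of the simulation draw and the corresponding Bayes/ridge estimator. The seed and n = 5000 are my own assumptions here, and I treat the squared-error loss as averaged over n, so the slide's λ* enters the normal equations as nλ*.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, rho = 5000, 50, 0.8
tau2, sigma = np.pi / 2, 50.0                          # tau^2 = pi/2 gives E|beta_i| = 1

Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))  # equicorrelated design covariance
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
beta = rng.normal(0.0, np.sqrt(tau2), size=p)
Y = X @ beta + rng.normal(0.0, sigma, size=n)

lam_star = sigma**2 / (n * tau2)                       # lambda* from the slide
# Posterior mode / ridge solution: (X'X + n*lambda* I)^{-1} X'Y = (X'X + (sigma^2/tau^2) I)^{-1} X'Y
beta_bayes = np.linalg.solve(X.T @ X + n * lam_star * np.eye(p), X.T @ Y)
```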
23 GOLDILOCKS
With p < n:
1. If λ* is too big, we will tend to shrink all coefficients to 0. This problem is too hard.
2. If λ* is too small, OLS will be very close to the optimal estimator. This problem is too easy.
3. Need τ², σ² just right.
Take τ² = π/2; this implies E|β_i| = 1 (convenient). Take n big but computable. Take σ = 50, so log(λ*) ≈ −1.14 (reasonable). Take p ∈ {50, 100, 250, 500}.
24 [Figure: log₁₀(estimation risk) vs. log(λ) for the convex, linear, FC, PC, OLS, and Bayes estimators; ρ = 0.8, p = 50.]
25 [Figure: log₁₀(estimation risk) vs. log(λ) for the convex, linear, FC, PC, OLS, and Bayes estimators; ρ = 0.2, p = 500.]
26 [Figure: fraction of replications in which each method is best ("% best method") as a function of p, in panels by q (500, 1000, ...) and ρ (0.2, ...); methods: convex, linear, FC, PC, OLS, ridge.]
27 SECOND VERSE...
In that case, ridge was optimal. We did it again with β_i ≡ 1.
28 [Figure: log₁₀(estimation risk) vs. log(λ) for the convex, linear, FC, PC, OLS, and ridge estimators; ρ = 0.8, p = 50.]
29 [Figure: log₁₀(estimation risk) vs. log(λ) for the convex, linear, FC, PC, OLS, and ridge estimators; ρ = 0.2, p = 500.]
30 [Figure: fraction of replications in which each method is best ("% best method") as a function of p, in panels by q (500, 1000, ...) and ρ (0.2, ...); methods: convex, linear, FC, PC, OLS, ridge.]
31 SELECTING TUNING PARAMETERS
We use GCV with the degrees of freedom:
GCV(λ) = ‖Xβ̂(λ) − Y‖₂² / (1 − df/n)².
df is easy for full or partial compression. For the other cases, we do a dumb approximation: a weighted average using α̂. This is easy to calculate for a range of λ without extra computations.
We also calculated the divergence to derive Stein's Unbiased Risk Estimate. After tedious algebra, it had odd behavior in practice (it can be huge, or negative!?!).
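As a concrete reference point, here is a minimal sketch of GCV for plain ridge regression, where df(λ) has the standard closed form Σᵢ dᵢ²/(dᵢ² + λ); the compressed estimators in the talk instead use the df approximations described above. The function name and the omission of a 1/n factor (which does not change the minimizer) are my own choices.

```python
import numpy as np

def gcv_ridge(X, Y, lambdas):
    """GCV(lam) = ||X beta_hat(lam) - Y||^2 / (1 - df/n)^2 over a grid of lam values."""
    n = X.shape[0]
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    UtY = U.T @ Y
    scores = []
    for lam in lambdas:
        shrink = d**2 / (d**2 + lam)       # shrinkage factors on the singular directions
        df = shrink.sum()                  # effective degrees of freedom
        fitted = U @ (shrink * UtY)        # X @ beta_hat(lam)
        rss = np.sum((Y - fitted) ** 2)
        scores.append(rss / (1.0 - df / n) ** 2)
    return np.array(scores)

# Example use: lambdas = np.logspace(-3, 3, 50); best = lambdas[np.argmin(gcv_ridge(X, Y, lambdas))]
```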
32 [Figure: λ̂_GCV vs. λ̂_test, and log(test error at the GCV choice / minimum test error), for each compression method (convex, linear, FC, PC); n = 5000, p = 50, ρ = 0.8, β ~ N(0, π/2).]
33 Theoretical results (sketch)
34 STANDARD RIDGE RESULTS
Theorem. Write the SVD X = UDV⊤ with singular values d_i. Then
bias²(β̂_ridge(λ) | X) = λ² β⊤V(D² + λI_p)⁻²V⊤β,
tr(V[β̂_ridge(λ) | X]) = σ² Σᵢ₌₁ᵖ dᵢ²/(dᵢ² + λ)².
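These two quantities are easy to evaluate numerically from the SVD; a minimal sketch follows, where beta_star stands for the true coefficient vector (assumed known, as in a simulation).

```python
import numpy as np

def ridge_bias_variance(X, beta_star, sigma2, lam):
    """Squared bias and trace of the variance of the ridge estimator, conditional on X."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    Vb = Vt @ beta_star                                    # V' beta
    bias2 = lam**2 * np.sum(Vb**2 / (d**2 + lam) ** 2)     # lam^2 * b'V (D^2 + lam I)^{-2} V'b
    trvar = sigma2 * np.sum(d**2 / (d**2 + lam) ** 2)      # sigma^2 * sum_i d_i^2 / (d_i^2 + lam)^2
    return bias2, trvar
```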
35 WHAT'S THE TRICK?
Similar results are hard for compressed regression. All the estimators depend (at least) on (X⊤Q⊤QX + λI_p)⁻¹.
We derived properties of Q⊤Q:
E[(s/q) Q⊤Q] = I_n,
V[vec((s/q) Q⊤Q)] = ((s − 3)/q) diag(vec(I_n)) + (1/q)(I_{n²} + K_nn).
So the technique is to do a Taylor expansion around (s/q) Q⊤Q = I_n.
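A quick numerical sanity check (my own, with a small n so the n × n matrix fits in memory) that E[(s/q) Q⊤Q] = I_n for the sparse Bernoulli sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n, q, s, reps = 200, 400, 5, 500
acc = np.zeros((n, n))
for _ in range(reps):
    nonzero = rng.random((q, n)) < 1.0 / s
    Q = nonzero * rng.choice([-1.0, 1.0], size=(q, n))   # sparse Bernoulli entries in {-1, 0, +1}
    acc += (s / q) * Q.T @ Q
acc /= reps
print(np.abs(acc - np.eye(n)).max())   # should be small and shrink as reps grows
```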
36 MAIN RESULT
Theorem.
bias²[β̂_FC | X] = λ² β⊤V(D² + λI_p)⁻²V⊤β + o_p(1),
tr(V[β̂_FC | X]) = σ² Σᵢ₌₁ᵖ dᵢ²/(dᵢ² + λ)² + additional variance terms of order 1/q involving (s − 3), M, H, and β + o_p(1).
Note: M = (X⊤X + λI_p)⁻¹X⊤ and H = XM (the hat matrix for ridge regression).
37 Applications
38 RNA-SEQ
- Short-read RNA sequence data.
- We use a (Poisson) linear model to predict read counts based on the surrounding nucleotides.
- 8 different tissues.
- The data are publicly available, and the preprocessing is already done.
39 RESULTS
Test error averaged over 10 replications. On each replication, tuning parameters were chosen with GCV. At the best tuning parameter, the test error was computed, and then these were averaged. For each dataset, we randomly chose 75% of the data as training on each replication.
40 [Figure: log of each method's test error relative to the OLS error, for datasets b1, b2, b3, g1, g2, w1, w2, ... and each q; methods: convex, linear, FC, PC, ridge.]
41 TAKE-AWAY MESSAGE
- One choice of q results in data reductions between 74% and 93%; the other gives reductions between 48% and 84%.
- Ridge and ordinary least squares give equivalent test set performance (differing by less than 0.001%). This is an easy problem.
- Full compression is the worst.
(Skip galaxies)
42 CURRENT APPLICATION
This image and the next are from the Sloan Digital Sky Survey.
43 SDSS
- Repeatedly photograph regions of the sky.
- Collect information on piles of objects.
- Around 200 million galaxy spectra at the moment.
- Each spectrum contains around 5000 wavelengths.
- Want to predict the redshift from the spectrum.
44 A FEW GALAXIES [Figure: flux vs. wavelength λ (Å) for a few galaxy spectra.]
45 SAME GALAXIES AFTER SMOOTHING [Figure: flux vs. wavelength λ (Å) for the same galaxies after smoothing.]
46 5000 REDSHIFTS [Figure: the redshifts of 5000 galaxies.]
47 [Figure: log of each method's test error relative to the OLS error on the SDSS data, for q = 1000 and one other q; methods: convex, linear, FC, PC, ridge.]
48 Conclusions
49 CONCLUSIONS
- Lots of people are looking at approximations to standard statistical methods (OLS, PCA, etc.). They characterize the approximation error.
- We have been looking at whether we can actually benefit from the approximation.
- Today we looked at compressed regression.
50 ONGOING AND FUTURE WORK
- We want to generalize our results here to other penalties, and also to generalized linear models.
- We're also looking at how compression interacts with PCA.
- Combine compression and random projection.
- Connections to sufficient dimension reduction?
- Iterative versions?