Linear Regression

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning). Examples: height, gender, weight → shoe size; audio features → song year; processes, memory → power consumption; historical financials → future stock price; and many more.

Linear Least Squares Regression. Example: predicting shoe size from height, gender, and weight. For each observation we have a feature vector x and a label y, with x = [x_1, x_2, x_3]. We assume a linear mapping between features and label: y ≈ w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3.

Linear Least Squares Regression. Example: predicting shoe size from height, gender, and weight. We can augment the feature vector to incorporate the offset: x = [1, x_1, x_2, x_3]. We can then rewrite the linear mapping as a scalar (dot) product: ŷ = Σ_{i=0}^{3} w_i x_i = w^T x.
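As an illustration (not from the slides), a minimal NumPy sketch of the augmented dot-product prediction; the feature values and weights below are made up for the example:

```python
import numpy as np

# Hypothetical feature vector (height, gender, weight) and learned weights.
x = np.array([72.0, 1.0, 175.0])
w = np.array([2.5, 0.08, 0.4, 0.02])   # w[0] is the offset / intercept

# Augment the feature vector with a leading 1 so the offset folds into the dot product.
x_aug = np.concatenate(([1.0], x))

y_hat = w @ x_aug                       # y_hat = sum_i w_i * x_i = w^T x
print(y_hat)
```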

Why a Linear Mapping? It is simple, it often works well in practice, and we can introduce complexity via feature extraction.

1D Example. Goal: find the line of best fit, where the x coordinate is the feature and the y coordinate is the label: ŷ = w_0 + w_1 x, with w_0 the intercept (offset) and w_1 the slope.

Evaluating Predictions. We can measure closeness between label and prediction. Shoe size: better to be off by one size than by five sizes. Song year prediction: better to be off by a year than by 20 years. What is an appropriate evaluation metric or loss function? Absolute loss: |y − ŷ|. Squared loss: (y − ŷ)^2, which has nice mathematical properties.

How Can We Learn the Model (w)? Assume we have n training points, where (x^(i), y^(i)) denotes the i-th point. Recall two earlier points: the linear assumption, ŷ = w^T x, and the squared loss, (y − ŷ)^2. Idea: find the w that minimizes the squared loss over the training points: min_w Σ_{i=1}^{n} (w^T x^(i) − y^(i))^2.

Given n training points with d features, we define: X ∈ R^{n×d}, the matrix storing the points; y ∈ R^n, the real-valued labels; ŷ ∈ R^n, the predicted labels, where ŷ = Xw; and w ∈ R^d, the regression parameters / model to learn. Least Squares Regression: learn the mapping (w) from features to labels that minimizes the residual sum of squares: min_w ||Xw − y||_2^2. This is equivalent to min_w Σ_{i=1}^{n} (w^T x^(i) − y^(i))^2 by the definition of the Euclidean norm.

Find the solution by setting the derivative to zero. In 1D: f(w) = ||wx − y||_2^2 = Σ_{i=1}^{n} (w x^(i) − y^(i))^2, so df/dw (w) = 2 Σ_{i=1}^{n} (w x^(i) − y^(i)) x^(i) = 2 (w x^T x − x^T y) = 0, which gives w = (x^T x)^{-1} x^T y. Least Squares Regression: learn the mapping (w) from features to labels that minimizes the residual sum of squares: min_w ||Xw − y||_2^2. Closed-form solution: w = (X^T X)^{-1} X^T y (if the inverse exists).
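As a sketch (my addition, not part of the slides), the closed-form solution can be computed with NumPy on made-up synthetic data; solving the normal equations with np.linalg.solve is preferred over forming an explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Closed-form least squares: w = (X^T X)^{-1} X^T y.
# Solving the linear system is numerically safer than inverting X^T X explicitly.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)   # should be close to w_true
```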

Overfitting and Generalization. We want good predictions on new data, i.e., generalization. Least squares regression minimizes training error and could overfit. Simpler models are more likely to generalize (Occam's razor). Can we change the problem to penalize model complexity? Intuitively, models with smaller weights are simpler.

Given n training points with d features, we define: X ∈ R^{n×d}, the matrix storing the points; y ∈ R^n, the real-valued labels; ŷ ∈ R^n, the predicted labels, where ŷ = Xw; and w ∈ R^d, the regression parameters / model to learn. Ridge Regression: learn the mapping (w) that minimizes the residual sum of squares (training error) together with a regularization term (model complexity): min_w ||Xw − y||_2^2 + λ ||w||_2^2, where the free parameter λ trades off between training error and model complexity. Closed-form solution: w = (X^T X + λ I_d)^{-1} X^T y.
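A minimal sketch (my addition, continuing the synthetic setup above) of the ridge closed-form solution; lam is an assumed value of the free parameter λ:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge regression: w = (X^T X + lam * I_d)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Example usage with made-up data.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(ridge_closed_form(X, y, lam=0.1))
```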

Millionsong Regression Pipeline

Supervised Learning Pipeline: Obtain Raw Data → Split Data → Feature Extraction → Supervised Learning → Evaluation → Predict. [Pipeline diagram: the full dataset is split into a training set and a test set; training yields a model, the test set yields an accuracy estimate, and new entities are fed to the model for prediction.]

Obtain Raw Data. Goal: predict a song's release year from audio features. Raw data: Millionsong Dataset from the UCI ML Repository; western, commercial tracks from 1980-2014; 12 timbre averages (features) and release year (label).

Split Data: train on the training set, evaluate with the test set. The test set simulates unobserved data; the test error tells us whether we've generalized well.

Feature Extraction: quadratic features. Compute pairwise feature interactions, which capture the covariance of the initial timbre features and lead to a non-linear model relative to the raw features.

Given 2-dimensional data, the quadratic features are: x = [x_1, x_2] ⇒ Φ(x) = [x_1^2, x_1 x_2, x_2 x_1, x_2^2], and z = [z_1, z_2] ⇒ Φ(z) = [z_1^2, z_1 z_2, z_2 z_1, z_2^2]. More succinctly: Φ'(x) = [x_1^2, √2 x_1 x_2, x_2^2] and Φ'(z) = [z_1^2, √2 z_1 z_2, z_2^2]. The inner products are equivalent: Φ(x)^T Φ(z) = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = Φ'(x)^T Φ'(z).
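A quick sketch (my addition) that checks the inner-product equivalence numerically for made-up 2-D vectors:

```python
import numpy as np

def phi(v):
    """Full quadratic expansion [v1^2, v1*v2, v2*v1, v2^2]."""
    return np.array([v[0]**2, v[0]*v[1], v[1]*v[0], v[1]**2])

def phi_succinct(v):
    """Succinct expansion [v1^2, sqrt(2)*v1*v2, v2^2] with the same inner product."""
    return np.array([v[0]**2, np.sqrt(2)*v[0]*v[1], v[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z))                     # full expansion
print(phi_succinct(x) @ phi_succinct(z))   # same value
print((x @ z)**2)                          # also equal, by expanding (x^T z)^2
```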

Supervised Learning: Least Squares Regression. Learn a mapping from entities to continuous labels given a training set: audio features → song year.

Given n training points with d features, we define: X ∈ R^{n×d}, the matrix storing the points; y ∈ R^n, the real-valued labels; ŷ ∈ R^n, the predicted labels, where ŷ = Xw; and w ∈ R^d, the regression parameters / model to learn. Ridge Regression: learn the mapping (w) that minimizes the residual sum of squares (training error) together with a regularization term (model complexity): min_w ||Xw − y||_2^2 + λ ||w||_2^2. Closed-form solution: w = (X^T X + λ I_d)^{-1} X^T y.

Ridge Regression: learn the mapping (w) that minimizes the residual sum of squares (training error) together with a regularization term (model complexity): min_w ||Xw − y||_2^2 + λ ||w||_2^2, where the free parameter λ trades off between training error and model complexity. How do we choose a good value for this free parameter? Most methods have free parameters / hyperparameters to tune. First thought: search over multiple values and evaluate each on the test set. But the goal of the test set is to simulate unobserved data, and we may overfit if we use it to choose hyperparameters. Second thought: create another hold-out dataset (a validation set) for this search.

Evaluation (Part 1): Hyperparameter tuning. Training set: train various models. Validation set: evaluate various models (e.g., via grid search). Test set: evaluate the final model's accuracy. [Pipeline diagram, now with the full dataset split into training, validation, and test sets.]

Grid Search: exhaustively search through the hyperparameter space. Define and discretize the search space (on a linear or log scale) and evaluate points via validation error. [Figure: candidate values of the regularization parameter λ on a log scale: 10^-8, 10^-6, 10^-4, 10^-2, 1.]

Grid Search (continued): with two hyperparameters, the discretized search space becomes a 2-D grid, and every grid point is evaluated via validation error. [Figure: 2-D grid with the regularization parameter λ (10^-8 to 1) on one axis and a second hyperparameter on the other.]

Evaluating Predictions. How can we compare labels and predictions for n validation points? Least squares optimization involves the squared loss, (y − ŷ)^2, so it seems reasonable to use the mean squared error (MSE): MSE = (1/n) Σ_{i=1}^{n} (ŷ^(i) − y^(i))^2. But MSE's unit of measurement is the square of the quantity being measured, e.g., squared years for song-year prediction. It is more natural to use the root-mean-square error (RMSE), i.e., √MSE.
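To tie the last few slides together, here is a hedged sketch (my own; the data, split sizes, and λ grid are made up, with the grid matching the values shown on the slide) of grid search over the ridge regularization parameter using validation RMSE:

```python
import numpy as np

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def rmse(y, y_hat):
    return np.sqrt(np.mean((y_hat - y) ** 2))

rng = np.random.default_rng(2)
n, d = 300, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)

# Train / validation split (a test set would be held out separately).
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

# Grid of candidate values on a log scale.
grid = [1e-8, 1e-6, 1e-4, 1e-2, 1.0]
scores = {lam: rmse(y_val, X_val @ ridge(X_train, y_train, lam)) for lam in grid}
best_lam = min(scores, key=scores.get)
print(scores, best_lam)
```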

Evaluation (Part 2): Evaluate the final model. Training set: train various models. Validation set: evaluate various models. Test set: evaluate the final model's accuracy.

Predict: the final model can then be used to make predictions on future observations, e.g., new songs.

Distributed ML: Computation and Storage

Challenge: Scalability. Classic ML techniques are not always suitable for modern datasets. Data grows faster than Moore's Law [IDC report; Kathy Yelick, LBNL]. [Figure: growth from 2010 to 2015 of overall data, particle accelerator data, and DNA sequencer data, all outpacing Moore's Law.] Hence: Data + Machine Learning + Distributed Computing.

Least Squares Regression: learn the mapping (w) from features to labels that minimizes the residual sum of squares: min_w ||Xw − y||_2^2. Closed-form solution: w = (X^T X)^{-1} X^T y (if the inverse exists). How do we solve this computationally? The computational profile is similar for ridge regression.

Computing the Closed-Form Solution, w = (X^T X)^{-1} X^T y. Computation: O(nd^2 + d^3) operations, counting arithmetic operations (+, −, ×, /). Computational bottlenecks: the matrix multiply X^T X takes O(nd^2) operations and the matrix inverse takes O(d^3) operations. Other methods (Cholesky, QR, SVD) have the same complexity.

Storage Requirements, w = (X^T X)^{-1} X^T y. Computation: O(nd^2 + d^3) operations. Storage: O(nd + d^2) floats, storing each value as an 8-byte float. Storage bottlenecks: X^T X and its inverse take O(d^2) floats, and X takes O(nd) floats.

Big n and Small d, w = (X^T X)^{-1} X^T y. Computation: O(nd^2 + d^3) operations. Storage: O(nd + d^2) floats. Assume O(d^3) computation and O(d^2) storage are feasible on a single machine; then storing X and computing X^T X are the bottlenecks. We can distribute storage and computation: store the data points (rows of X) across machines and compute X^T X as a sum of outer products.

Matrix Multiplication via Inner Products. Each entry of the output matrix is the inner product of a row of the first matrix with a column of the second: [9 3 5; 4 1 2] × [1 2; 3 5; 2 3] = [28 48; 11 19], e.g., 9·1 + 3·3 + 5·2 = 28.

Matrix Multiplication via Outer Products. The output matrix is the sum of the outer products of the i-th column of the first matrix with the i-th row of the second: [9 3 5; 4 1 2] × [1 2; 3 5; 2 3] = [9 18; 4 8] + [9 15; 3 5] + [10 15; 4 6] = [28 48; 11 19].
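A small NumPy sketch (my addition) verifying the two views of matrix multiplication on the slide's example matrices:

```python
import numpy as np

A = np.array([[9, 3, 5],
              [4, 1, 2]])
B = np.array([[1, 2],
              [3, 5],
              [2, 3]])

# Inner-product view: entry (i, j) is the dot product of row i of A and column j of B.
inner = np.array([[A[i] @ B[:, j] for j in range(B.shape[1])] for i in range(A.shape[0])])

# Outer-product view: sum over k of the outer product of column k of A and row k of B.
outer = sum(np.outer(A[:, k], B[k]) for k in range(A.shape[1]))

print(inner)   # [[28 48] [11 19]]
print(outer)   # same result
assert np.array_equal(inner, A @ B) and np.array_equal(outer, A @ B)
```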

Computing X^T X as a sum of outer products: X^T X = Σ_{i=1}^{n} x^(i) (x^(i))^T, where x^(i) is the i-th row of X. Example with n = 6 and 3 workers: the rows of X are stored across the workers (O(nd) distributed storage); in the map step each worker computes the outer products of its rows (O(nd^2) distributed computation, O(d^2) local storage); in the reduce step the partial sums are added and the result inverted (O(d^3) local computation, O(d^2) local storage).
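A hedged sketch (my own; the partitioning is simulated in-process rather than on a cluster, and X^T y is accumulated the same way even though the slide focuses on X^T X) of computing the closed-form solution from per-worker outer-product sums:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, n_workers = 6, 4, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Simulate distributing the rows of X across workers.
partitions = np.array_split(np.arange(n), n_workers)

# map: each worker sums the outer products of its local rows (a d x d matrix).
partials_XtX = [sum(np.outer(X[i], X[i]) for i in idx) for idx in partitions]
partials_Xty = [sum(X[i] * y[i] for i in idx) for idx in partitions]

# reduce: add the partial sums, then solve the small d x d system locally.
XtX = sum(partials_XtX)
Xty = sum(partials_Xty)
w = np.linalg.solve(XtX, Xty)

assert np.allclose(XtX, X.T @ X)
print(w)
```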

Distributed ML: Computation and Storage, Part II

Big n and Small d (recap), w = (X^T X)^{-1} X^T y. Computation: O(nd^2 + d^3) operations. Storage: O(nd + d^2) floats. Assume O(d^3) computation and O(d^2) storage are feasible on a single machine. We can distribute storage and computation: store the data points (rows of X) across machines and compute X^T X as a sum of outer products.

Big n and Big d, w = (X^T X)^{-1} X^T y. Computation: O(nd^2 + d^3) operations. Storage: O(nd + d^2) floats. As before, storing X and computing X^T X are bottlenecks. Now, storing and operating on X^T X is also a bottleneck, and we can't easily distribute it!

Big n and Big d, w = (X^T X)^{-1} X^T y. As before, storing X and computing X^T X are bottlenecks; now storing and operating on X^T X is also a bottleneck, and we can't easily distribute it. First rule of thumb: computation and storage should be linear in n and d.

Big n and Big d. We need methods that are linear in time and space. One idea: exploit sparsity. Explicit sparsity can provide orders-of-magnitude storage and computational gains, and sparse data is prevalent: text processing (bag-of-words, n-grams), collaborative filtering (ratings matrix), graphs (adjacency matrix), categorical features (one-hot encoding), and genomics (SNPs, variant calling). Example: the dense vector [1., 0., 0., 0., 0., 0., 3.] can be stored sparsely as size: 7, indices: [0, 6], values: [1., 3.].
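A minimal sketch (my addition) of the (size, indices, values) sparse representation shown on the slide; libraries such as scipy.sparse implement the same idea far more efficiently, and this toy class is only for illustration:

```python
import numpy as np

class SparseVector:
    """Store only the non-zero entries of a vector as (size, indices, values)."""
    def __init__(self, dense):
        dense = np.asarray(dense, dtype=float)
        self.size = dense.shape[0]
        self.indices = np.flatnonzero(dense)
        self.values = dense[self.indices]

    def dot(self, w):
        # Only the non-zero entries contribute to the dot product.
        return float(self.values @ np.asarray(w)[self.indices])

dense = [1., 0., 0., 0., 0., 0., 3.]
sv = SparseVector(dense)
print(sv.size, sv.indices, sv.values)   # 7 [0 6] [1. 3.]
print(sv.dot(np.arange(7)))             # 1*0 + 3*6 = 18.0
```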

Big n and Big d. We need methods that are linear in time and space. One idea: exploit sparsity. Explicit sparsity can provide orders-of-magnitude storage and computational gains. A latent sparsity assumption can also be used to reduce dimension, e.g., PCA or low-rank approximation (unsupervised learning). [Figure: an n × d matrix approximated by the product of an n × r matrix and an r × d matrix with r ≪ d.]
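As an illustration (not from the slides), a short sketch of a rank-r approximation via the truncated SVD; the matrix and rank are made up:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, r = 50, 20, 3

# Build a matrix that is approximately rank r, plus a little noise.
A = rng.normal(size=(n, r)) @ rng.normal(size=(r, d)) + 0.01 * rng.normal(size=(n, d))

# Truncated SVD: keep only the top-r singular values/vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r]   # rank-r reconstruction

# Storage drops from n*d floats to roughly r*(n + d) floats.
print(np.linalg.norm(A - A_r) / np.linalg.norm(A))  # small relative error
```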

Big n and Big d. We need methods that are linear in time and space. One idea: exploit sparsity, either explicitly (orders-of-magnitude storage and computational gains) or via a latent sparsity assumption that reduces dimension, e.g., PCA or low-rank approximation (unsupervised learning). Another idea: use different algorithms. Gradient descent is an iterative algorithm that requires O(nd) computation and O(d) local storage per iteration.
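A hedged sketch (my own; the step size and iteration count are arbitrary choices) of full-batch gradient descent for the least squares objective, costing O(nd) per iteration:

```python
import numpy as np

def gradient_descent_ls(X, y, step=0.01, iters=500):
    """Minimize ||Xw - y||_2^2 by gradient descent; the gradient is 2 X^T (Xw - y)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        grad = 2 * X.T @ (X @ w - y)   # O(nd) work, O(d) extra storage
        w -= (step / n) * grad         # scaling the step by n keeps it stable here
    return w

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -1.0, 2.0, 0.5]) + 0.1 * rng.normal(size=200)
print(gradient_descent_ls(X, y))
```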

Closed-Form Solution for Big n and Big d. Example with n = 6 and 3 workers: O(nd) distributed storage for the rows of X; map: O(nd^2) distributed computation and O(d^2) local storage to form the outer products; reduce: O(d^3) local computation ((·)^{-1}) and O(d^2) local storage.

Gradient Descent for Big n and Big d. Example with n = 6 and 3 workers: O(nd) distributed storage for the rows of X; map: each worker computes the gradient contribution of its local points, requiring O(nd) distributed computation and O(d) local storage (down from O(nd^2) and O(d^2) for the closed form); reduce: the partial gradients are summed and the parameter vector updated, requiring O(d) local computation and O(d) local storage (down from O(d^3) and O(d^2)).
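To close the section, a hedged sketch (my addition; the partitioning is simulated in-process and the step size is arbitrary) of map/reduce-style gradient descent updates, where each worker contributes an O(d) partial gradient:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, n_workers = 6, 4, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = np.zeros(d)
step = 0.1

# Simulate distributing the rows of X across workers.
partitions = np.array_split(np.arange(n), n_workers)

for _ in range(100):
    # map: each worker computes the gradient contribution of its local points,
    # a d-dimensional vector, using O(n_local * d) work and O(d) storage.
    partial_grads = [2 * X[idx].T @ (X[idx] @ w - y[idx]) for idx in partitions]
    # reduce: sum the d-dimensional partial gradients and take a step.
    grad = sum(partial_grads)
    w -= (step / n) * grad

print(w)
```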