Linear Regression
Karl Stratos
June 18, 2018
1 / 25

The Regression Problem
Problem. Find a desired input-output mapping f : X → R where the output is a real value.
Example: x = (an image of the surrounding environment) ⇒ y = 0.1. How much should I turn my handle, given the environment?
Today's focus: a data-driven approach to regression
2 / 25

Overview
Approaches to the Regression Problem
  Not Data-Driven
  Data-Driven: Nonparametric
  Data-Driven: Parametric
Linear Regression (a Parametric Approach)
  Model and Objective
  Parameter Estimation
  Generalization to Multi-Dimensional Input
  Polynomial Regression
3 / 25

Running Example: Predict Weight from Height
Suppose we want a regression model f : X → R that predicts weight (in pounds) from height (in inches). What is the input space? X = R.
Naive approach: stipulate rules.
  If x ∈ [0, 30), then predict y = 50.
  If x ∈ [30, 60), then predict y = 80.
  If x ∈ [60, 70), then predict y = 150.
  If x ≥ 70, then predict y = 200.
Pro: immediately programmable.
Cons: uninformed, requires labor-intensive, domain-specific rule engineering. There is no learning from data (on the machine's part).
4 / 25
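
To make the "immediately programmable" point concrete, here is a minimal Python sketch of the four stipulated rules above (the function name is ours, purely illustrative):

```python
def predict_weight(height_in_inches: float) -> float:
    """Predict weight (pounds) from height (inches) using the stipulated rules."""
    if height_in_inches < 30:
        return 50.0
    elif height_in_inches < 60:
        return 80.0
    elif height_in_inches < 70:
        return 150.0
    else:  # height >= 70
        return 200.0

print(predict_weight(65))  # 150.0
```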

Before We Move on to Data-Driven Approaches
Rule-based solutions can go surprisingly far.
ELIZA: a conversation program from the 1960s
5 / 25

Data
A set of n height-weight pairs (x^(1), y^(1)), ..., (x^(n), y^(n)) ∈ R × R.
Q. How can we use this data to obtain a weight predictor?
7 / 25

Simple Data-Specific Rules
Store all n data points in a dictionary D with D(x^(i)) = y^(i).
1. Predict by memorization ("rote learning"):
   f(x) = D(x) if x ∈ D, and f(x) = ? otherwise
2. Or, slightly better, predict by nearest-neighbor search:
   f(x) = D( arg min_{x^(i): 1 ≤ i ≤ n} |x − x^(i)| )
8 / 25
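
A minimal Python sketch of both data-specific rules, using a made-up toy dataset (the heights and weights below are illustrative, not from the slides):

```python
heights = [60.0, 65.0, 70.0, 72.0]      # x^(1) ... x^(n)
weights = [120.0, 140.0, 170.0, 185.0]  # y^(1) ... y^(n)
D = dict(zip(heights, weights))         # D(x^(i)) = y^(i)

def predict_by_memorization(x):
    """Rote learning: return the stored label, undefined ("?") for unseen inputs."""
    return D.get(x)  # None plays the role of "?"

def predict_by_nearest_neighbor(x):
    """Return the label of the closest stored input."""
    nearest = min(D, key=lambda xi: abs(x - xi))
    return D[nearest]

print(predict_by_memorization(65.0))      # 140.0
print(predict_by_nearest_neighbor(66.3))  # 140.0 (nearest stored height is 65.0)
```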

Nonparametric Models
These are the simplest instances of nonparametric models. It just means that the model doesn't have any associated parameters before seeing the data.
Pro: adapts to the data without assuming anything about the given problem, achieving better coverage with more data.
Cons:
  Not scalable: we need to store the entire dataset.
  Issues with "overfitting": the model is excessively dependent on the data, so generalizing to new instances can be difficult.
9 / 25

Parametric Models
Dominant approach in machine learning. Assumes a fixed form of model f_θ defined by a set of parameters θ.
The parameters θ are learned from data S = {(x^(i), y^(i))}_{i=1}^n by optimizing a data-dependent objective or loss function J_S(θ) ∈ R.
Optimizing J_S with respect to the parameters θ is the learning problem:
  θ* = arg min_θ J_S(θ)
Today: focus on one of the simplest parametric models, called linear regression.
11 / 25

Linear Regression Model
Model parameter: w ∈ R
Model definition: f_w(x) := wx
  Defines a line with slope w
Goal: learn w from data S = {(x^(i), y^(i))}_{i=1}^n
  Need a data-dependent objective function J_S(w)
13 / 25

Least Squares Objective
Least squares objective: minimize
  J_S^LS(w) := Σ_{i=1}^n (y^(i) − w x^(i))²
Idea: fit a line on the training data by reducing the sum of squared residuals.
14 / 25

The Learning Problem
Solve for the scalar
  w_S^LS = arg min_{w ∈ R} Σ_{i=1}^n (y^(i) − w x^(i))²   (the sum being minimized is J_S^LS(w))
The objective J_S^LS(w) is strongly convex in w (unless all x^(i) = 0), thus the global minimum is uniquely achieved by the w_S^LS satisfying
  dJ_S^LS(w)/dw |_{w = w_S^LS} = 0
Solving this equation yields the closed-form expression:
  w_S^LS = Σ_{i=1}^n x^(i) y^(i) / Σ_{i=1}^n (x^(i))²
16 / 25
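
A minimal sketch of this 1-D closed form on made-up toy data (the numbers are illustrative only):

```python
xs = [60.0, 65.0, 70.0, 72.0]      # heights
ys = [120.0, 140.0, 170.0, 185.0]  # weights

# w = (sum of x*y) / (sum of x^2), the closed-form least squares slope
w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print(w)         # learned slope
print(w * 66.0)  # prediction for a new height of 66 inches
```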

Linear Regression with Multi-Dimensional Input
Input x ∈ R^d is now a vector of d features x_1, ..., x_d ∈ R.
  x_1 = 65 (height), x_2 = 29 (age), x_3 = 1 (male indicator), x_4 = 0 (female indicator) ⇒ y = 140 (pounds)
Model: w ∈ R^d defining
  f_w(x) := w · x = w^T x = ⟨w, x⟩ = w_1 x_1 + ... + w_d x_d
Least squares objective: exactly the same (assume n ≥ d):
  J_S^LS(w) = Σ_{i=1}^n (y^(i) − w · x^(i))²   where each y^(i) ∈ R and w · x^(i) ∈ R
18 / 25

The Learning Problem
Solve for the vector
  w_S^LS = arg min_{w ∈ R^d} Σ_{i=1}^n (y^(i) − w · x^(i))² = arg min_{w ∈ R^d} ‖y − Xw‖²₂
where y ∈ R^n has entries y_i = y^(i) and X ∈ R^{n×d} has rows x^(i).
‖y − Xw‖²₂ is strongly convex in w (unless rank(X) < d), thus the global minimum is uniquely achieved by the w_S^LS satisfying
  ∇_w ‖y − Xw‖²₂ |_{w = w_S^LS} = 0_{d×1}
Solving this system yields the closed-form expression:
  w_S^LS = (X^T X)^{-1} X^T y = X^+ y
19 / 25
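
A minimal NumPy sketch of this closed form, assuming X has full column rank; the rows of X (height, age, male, female) and the values in y are made up for illustration:

```python
import numpy as np

X = np.array([[65.0, 29.0, 1.0, 0.0],
              [70.0, 35.0, 1.0, 0.0],
              [62.0, 41.0, 0.0, 1.0],
              [68.0, 22.0, 0.0, 1.0],
              [72.0, 50.0, 1.0, 0.0]])
y = np.array([140.0, 175.0, 130.0, 150.0, 190.0])

w_closed = np.linalg.inv(X.T @ X) @ X.T @ y  # literal formula (X^T X)^{-1} X^T y
w_pinv = np.linalg.pinv(X) @ y               # pseudoinverse form X^+ y, numerically safer
print(w_closed, w_pinv)                      # the two agree when X has full column rank
print(X @ w_closed)                          # fitted values on the training data
```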

Fitting a Polynomial: 1-Dimensional Input
In linear regression with scalar input, we learn the slope w ∈ R of a line such that y ≈ wx.
In polynomial regression, we learn the coefficients w_1, ..., w_p, w_{p+1} ∈ R of a polynomial of degree p such that
  y ≈ w_1 x^p + ... + w_p x + w_{p+1}   (the last term is the bias term)
How? Upon receiving input x, apply polynomial feature expansion to calculate a new representation of x:
  (x^p, ..., x, 1)
Then follow with linear regression on this (p + 1)-dimensional input.
21 / 25
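
A minimal NumPy sketch of polynomial feature expansion followed by the same closed-form fit; the degree p and the toy data are arbitrary illustrative choices:

```python
import numpy as np

def expand(x, p):
    """Map a scalar x to the feature vector (x^p, ..., x, 1)."""
    return np.array([x ** k for k in range(p, -1, -1)])

xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([1.0, 0.5, 2.0, 5.5, 11.0])

p = 2
X = np.stack([expand(x, p) for x in xs])  # n x (p+1) design matrix
w = np.linalg.pinv(X) @ ys                # same least squares closed form as before
print(w)                                  # coefficients (w_1, ..., w_p, bias)
print(expand(2.5, p) @ w)                 # prediction at a new input
```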

Degree of Polynomial = Model Complexity
p = 0: fit a bias term
p = 1: fit a slope and a bias term (i.e., an affine function)
p = 2: learn a quadratic function
...
https://machinelearningac.wordpress.com/2011/09/15/model-selection-and-the-triple-tradeoff/
22 / 25

Polynomial Regression with Multi-Dimensional Input
Example: p = 2 maps x = (x_1, ..., x_d) to the feature vector
  (x_1², ..., x_d², x_1 x_2, ..., x_{d−1} x_d, x_1, ..., x_d, 1)
In general: the time to calculate the feature expansion, O(d^p), is exponential in p.
23 / 25
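
A minimal sketch of a degree-p monomial expansion for d-dimensional input, illustrating how quickly the number of features grows with d and p (the helper name is ours):

```python
from itertools import combinations_with_replacement

def poly_features(x, p):
    """All monomials of degree p down to 0 in the entries of x (degree 0 is the constant 1)."""
    feats = []
    for k in range(p, -1, -1):
        for idx in combinations_with_replacement(range(len(x)), k):
            prod = 1.0
            for i in idx:
                prod *= x[i]
            feats.append(prod)
    return feats

x = [2.0, 3.0, 5.0]
print(poly_features(x, 2))                # x1^2, x1*x2, ..., x3^2, x1, x2, x3, 1
print(len(poly_features([0.0] * 20, 5)))  # 53130 features for d = 20, p = 5
```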

Summary
Regression is the problem of learning a real-valued mapping f : X → R.
A linear regressor is one of the simplest parametric models: it uses a parameter vector w ∈ R^d to define f_w(x) = w · x.
Fitting a linear regressor on a dataset with the least squares objective is so easy that it has a closed-form solution.
Polynomial regression: feature expansion followed by linear regression.
24 / 25

Last Remarks
What if we have a model/objective such that training doesn't have a closed-form solution?
Instead of manually dictating the input representation (e.g., a polynomial of degree 3), can we automatically learn a good representation function φ(x) as part of optimization?
We will answer these questions later in the course (hint: gradient descent, neural networks).
25 / 25
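
As a preview of the hint above, here is a minimal sketch of gradient descent on the same least squares objective; the step size and iteration count are arbitrary illustrative choices, not prescriptions from the lecture:

```python
import numpy as np

def gradient_descent_ls(X, y, lr=0.05, steps=2000):
    """Minimize the averaged squared error by repeated gradient steps."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / n  # gradient of (1/n) * ||y - Xw||^2
        w -= lr * grad
    return w

X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]])  # second column = bias feature
y = np.array([3.1, 4.9, 7.2, 8.8])
print(gradient_descent_ls(X, y))  # approaches the closed-form least squares solution
```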