Linear Regression
Karl Stratos
June 18, 2018
The Regression Problem

Problem. Find a desired input-output mapping $f : \mathcal{X} \to \mathbb{R}$ where the output is a real value.

Example (figure omitted): input $x$ = the current driving environment $\Rightarrow$ output $y = 0.1$. How much should I turn my steering wheel, given the environment?

Today's focus: a data-driven approach to regression.
Overview

Approaches to the Regression Problem
- Not Data-Driven
- Data-Driven: Nonparametric
- Data-Driven: Parametric

Linear Regression (a Parametric Approach)
- Model and Objective
- Parameter Estimation
- Generalization to Multi-Dimensional Input
- Polynomial Regression
Running Example: Predict Weight from Height

Suppose we want a regression model $f : \mathcal{X} \to \mathbb{R}$ that predicts weight (in pounds) from height (in inches). What is the input space? $\mathcal{X} = \mathbb{R}$.

Naive approach: stipulate rules.
- If $x \in [0, 30)$, then predict $y = 50$.
- If $x \in [30, 60)$, then predict $y = 80$.
- If $x \in [60, 70)$, then predict $y = 150$.
- If $x \geq 70$, then predict $y = 200$.

Pro: Immediately programmable.

Cons: Uninformed; requires labor-intensive, domain-specific rule engineering. There is no learning from data (on the machine's part).
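The rules above are indeed immediately programmable; a minimal sketch (the function name is mine, the thresholds are the rules stated above):

```python
def predict_weight(height):
    """Naive rule-based weight predictor: pounds from inches."""
    if height < 30:
        return 50
    elif height < 60:
        return 80
    elif height < 70:
        return 150
    else:
        return 200

print(predict_weight(65))  # 150
```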
Before We Move on to Data-Driven Approaches

Rule-based solutions can go surprisingly far. ELIZA: a conversation program from the 60s.
Data

A set of $n$ height-weight pairs $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)}) \in \mathbb{R} \times \mathbb{R}$.

Q. How can we use this data to obtain a weight predictor?
Simple Data-Specific Rules

Store all $n$ data points in a dictionary $D(x^{(i)}) = y^{(i)}$.

1. Predict by memorization ("rote learning"):
$$f(x) = \begin{cases} D(x) & \text{if } x \in D \\ ? & \text{otherwise} \end{cases}$$

2. Or, slightly better, predict by nearest-neighbor search:
$$f(x) = D\bigl(x^{(i^*)}\bigr) \quad \text{where} \quad i^* = \arg\min_{i \in \{1, \ldots, n\}} \bigl|x - x^{(i)}\bigr|$$
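A minimal sketch of both rules in Python (the helper names and data values are mine, for illustration only):

```python
def fit(data):
    """Store all training pairs in a dictionary D: x^(i) -> y^(i)."""
    return {x: y for x, y in data}

def predict_memorize(D, x):
    # Rote learning: defined only on inputs seen during training.
    return D.get(x)  # None plays the role of "?" above

def predict_nearest_neighbor(D, x):
    # Predict the stored label of the closest training input.
    nearest = min(D, key=lambda xi: abs(xi - x))
    return D[nearest]

D = fit([(30, 60), (60, 120), (70, 180)])
print(predict_memorize(D, 64))          # None: 64 was never seen
print(predict_nearest_neighbor(D, 64))  # 120: the nearest stored height is 60
```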
Nonparametric Models

These are the simplest instances of nonparametric models: the model has no associated parameters before seeing the data.

Pro: Adapts to the data without assuming anything about the given problem, achieving better coverage with more data.

Cons:
- Not scalable: the entire dataset must be stored.
- Issues with overfitting: the model is excessively dependent on the data, so generalizing to new instances can be difficult.
Parametric Models

The dominant approach in machine learning. Assumes a fixed form of model $f_\theta$ defined by a set of parameters $\theta$.

The parameters $\theta$ are learned from data $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^n$ by optimizing a data-dependent objective or loss function $J_S(\theta) \in \mathbb{R}$.

Optimizing $J_S$ with respect to the parameters $\theta$ is the learning problem:
$$\theta^* = \arg\min_\theta J_S(\theta)$$

Today: focus on the simplest parametric model, called linear regression.
Linear Regression Model

Model parameter: $w \in \mathbb{R}$.

Model definition: $f_w(x) := wx$. This defines a line with slope $w$ (through the origin).

Goal: learn $w$ from data $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^n$. We need a data-dependent objective function $J_S(w)$.
Least Squares Objective

Least squares objective: minimize
$$J_S^{\mathrm{LS}}(w) := \sum_{i=1}^n \bigl(y^{(i)} - w x^{(i)}\bigr)^2$$

Idea: fit a line to the training data by reducing the sum of squared residuals.
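As a concrete sketch, the objective is a few lines of Python (the data values here are made up):

```python
def least_squares_objective(w, xs, ys):
    """J_S(w) = sum of squared residuals for the model f_w(x) = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))

xs, ys = [60, 65, 70], [110, 140, 170]      # hypothetical heights and weights
print(least_squares_objective(2.0, xs, ys))  # 1100.0 for the slope w = 2
```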
The Learning Problem

Solve for the scalar
$$w_S^{\mathrm{LS}} = \arg\min_{w \in \mathbb{R}} \underbrace{\sum_{i=1}^n \bigl(y^{(i)} - w x^{(i)}\bigr)^2}_{J_S^{\mathrm{LS}}(w)}$$

The objective $J_S^{\mathrm{LS}}(w)$ is strongly convex in $w$ (unless all $x^{(i)} = 0$), so the global minimum is uniquely achieved by the $w_S^{\mathrm{LS}}$ satisfying
$$\left.\frac{\partial J_S^{\mathrm{LS}}(w)}{\partial w}\right|_{w = w_S^{\mathrm{LS}}} = 0$$

Solving this equation yields the closed-form expression:
$$w_S^{\mathrm{LS}} = \frac{\sum_{i=1}^n x^{(i)} y^{(i)}}{\sum_{i=1}^n (x^{(i)})^2}$$
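The closed form translates directly into code; a sketch on the same hypothetical data as above:

```python
def fit_least_squares_1d(xs, ys):
    """Closed-form least squares slope: w = sum(x*y) / sum(x^2)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

xs, ys = [60, 65, 70], [110, 140, 170]
w = fit_least_squares_1d(xs, ys)
print(w)       # ~2.169: the best-fit slope through the origin
print(w * 68)  # ~147.5: predicted weight for a 68-inch person
```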
Linear Regression with Multi-Dimensional Input

Input $x \in \mathbb{R}^d$ is now a vector of $d$ features $x_1, \ldots, x_d \in \mathbb{R}$.

Example: $x_1 = 65$ (height), $x_2 = 29$ (age), $x_3 = 1$ (male indicator), $x_4 = 0$ (female indicator) $\Rightarrow$ $y = 140$ (pounds).

Model: $w \in \mathbb{R}^d$ defining
$$f_w(x) := w \cdot x = w^\top x = \langle w, x \rangle = w_1 x_1 + \cdots + w_d x_d$$

Least squares objective: exactly the same (assume $n \geq d$!):
$$J_S^{\mathrm{LS}}(w) = \sum_{i=1}^n \Bigl(\underbrace{y^{(i)}}_{\in \mathbb{R}} - \underbrace{w \cdot x^{(i)}}_{\in \mathbb{R}}\Bigr)^2$$
The Learning Problem

Solve for the vector
$$w_S^{\mathrm{LS}} = \arg\min_{w \in \mathbb{R}^d} \sum_{i=1}^n \bigl(y^{(i)} - w \cdot x^{(i)}\bigr)^2 = \arg\min_{w \in \mathbb{R}^d} \|y - Xw\|_2^2$$
where $y \in \mathbb{R}^n$ has entries $y_i = y^{(i)}$ and $X \in \mathbb{R}^{n \times d}$ has rows $x^{(i)}$.

$\|y - Xw\|_2^2$ is strongly convex in $w$ (unless $\mathrm{rank}(X) < d$), so the global minimum is uniquely achieved by the $w_S^{\mathrm{LS}}$ satisfying
$$\left.\nabla_w \|y - Xw\|_2^2\right|_{w = w_S^{\mathrm{LS}}} = 0_{d \times 1}$$

Solving this system yields the closed-form expression:
$$w_S^{\mathrm{LS}} = (X^\top X)^{-1} X^\top y = X^+ y$$
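A sketch with NumPy, using made-up data for the four features from the earlier example. `np.linalg.lstsq` computes the pseudoinverse solution $X^+ y$ and is preferable in practice to forming $(X^\top X)^{-1}$ explicitly:

```python
import numpy as np

# Each row of X is one input x^(i) with d = 4 features:
# (height, age, male indicator, female indicator). Values are made up.
X = np.array([[65.0, 29.0, 1.0, 0.0],
              [70.0, 35.0, 1.0, 0.0],
              [72.0, 31.0, 1.0, 0.0],
              [62.0, 22.0, 0.0, 1.0],
              [68.0, 40.0, 0.0, 1.0]])
y = np.array([140.0, 180.0, 175.0, 120.0, 150.0])

# Closed form via the normal equations: solve (X^T X) w = X^T y.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)
# Same solution via the pseudoinverse, more numerically stable.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))  # True when rank(X) = d
print(X @ w_lstsq)                     # fitted values, approximating y
```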
Fitting a Polynomial: 1-Dimensional Input

In linear regression with scalar input, we learn the slope $w \in \mathbb{R}$ of a line such that $y \approx wx$.

In polynomial regression, we learn the coefficients $w_1, \ldots, w_p, w_{p+1} \in \mathbb{R}$ of a polynomial of degree $p$ such that
$$y \approx w_1 x^p + \cdots + w_p x + \underbrace{w_{p+1}}_{\text{bias term}}$$

How? Upon receiving input $x$, apply polynomial feature expansion to calculate a new representation of $x$:
$$x \mapsto \begin{bmatrix} x^p \\ \vdots \\ x \\ 1 \end{bmatrix}$$
Then follow with linear regression on the $(p+1)$-dimensional input (see the sketch below).
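A sketch of the whole pipeline for scalar input (the data here is made up so that the true function is a degree-2 polynomial):

```python
import numpy as np

def poly_features(x, p):
    """Map scalar x to [x^p, ..., x, 1], a (p+1)-dimensional representation."""
    return np.array([x ** k for k in range(p, -1, -1)], dtype=float)

# Polynomial regression = feature expansion + linear regression.
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([1.0, 0.0, 3.0, 10.0, 21.0])         # generated by y = 2x^2 - 3x + 1
X = np.stack([poly_features(x, 2) for x in xs])    # n x (p+1) design matrix
w, *_ = np.linalg.lstsq(X, ys, rcond=None)
print(w)  # ~[2, -3, 1]: coefficients w_1, ..., w_{p+1}, highest degree first
```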
Degree of Polynomial = Model Complexity

- $p = 0$: Fit a bias term (a constant function).
- $p = 1$: Fit a slope and a bias term (i.e., an affine function).
- $p = 2$: Learn a quadratic function.
- ...

(Figure: fits of increasing model complexity; source: https://machinelearningac.wordpress.com/2011/09/15/model-selection-and-the-triple-tradeoff/)
Polynomial Regression with Multi-Dimensional Input

Example: $p = 2$. The feature expansion maps
$$x \mapsto \begin{bmatrix} x_1^2 \\ \vdots \\ x_d^2 \\ x_1 x_2 \\ \vdots \\ x_{d-1} x_d \\ x_1 \\ \vdots \\ x_d \\ 1 \end{bmatrix}$$

In general, the time to calculate the feature expansion, $O(d^p)$, is exponential in $p$.
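A sketch of the multi-dimensional expansion, enumerating every monomial of degree at most $p$ (the function name is mine):

```python
from itertools import combinations_with_replacement
from math import prod

def poly_features_multidim(x, p):
    """All monomials of degree <= p in the entries of x, plus the bias 1.
    The number of features grows like O(d^p) for fixed p."""
    feats = [1.0]  # bias term: the unique degree-0 monomial
    for degree in range(1, p + 1):
        for idx in combinations_with_replacement(range(len(x)), degree):
            feats.append(prod(x[i] for i in idx))
    return feats

print(poly_features_multidim([2.0, 3.0], p=2))
# [1.0, 2.0, 3.0, 4.0, 6.0, 9.0] = [1, x1, x2, x1^2, x1*x2, x2^2]
```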
Summary

- Regression is the problem of learning a real-valued mapping $f : \mathcal{X} \to \mathbb{R}$.
- A linear regressor is the simplest parametric model: a parameter vector $w \in \mathbb{R}^d$ defines $f_w(x) = w \cdot x$.
- Fitting a linear regressor to a dataset with the least squares objective is so easy that it has a closed-form solution.
- Polynomial regression: feature expansion followed by linear regression.
Last Remarks

- What if we have a model/objective such that training doesn't have a closed-form solution?
- Instead of manually dictating the input representation (e.g., a polynomial of degree 3), can we automatically learn a good representation function $\varphi(x)$ as part of optimization?

We will answer these questions later in the course (hint: gradient descent, neural networks).