

Now that your calculus concepts are more or less fresh in your head, we start with a simple machine learning algorithm: linear regression. Those who've used Microsoft Excel to build plots have certainly used linear regression when fitting a line onto their data. We'll start with a simple example using a single-feature dataset. Recall that a single feature means one variable: for example, using a variable gas-mileage to predict the price of a car, or perhaps the square-footage of a house to predict its price.

SECTION 3.1 SIMPLE LINEAR REGRESSION

Linear regression is a statistical method used to model a linear relationship between a dependent variable and one or more independent variables. Like I mentioned above, a dependent variable would be the price of a car; the independent variable would be the gas-mileage. Now, whether or not there exists any true hidden relationship is not important yet. It is up to us to fit a linear model the best we can and evaluate from there.

[Table of Data]

[Plot of Data]

Recall the equation of a line: y = mx + b. If we are doing a simple linear regression task, we have only one independent variable. When fitting a line onto a plot similar to Figure 1, think about which values you want to control in order to make the best fit. The y and x in the line equation are simply variables that serve as output and input in our model, so those can't be optimized. What about m and b? Well, m controls the slope and b controls the y-intercept. Maybe if we configure the slope and y-intercept to the right setting, we can get a decent fit. That is essentially what linear regression does.

Now, to generalize, we will replace m and b with w_1 and w_0. Since we're building a model f(x) that takes in input x, we formulate our line equation as such:

    f(x) = w_1 x + w_0

Great, now what? Well, we do have previous data we're fitting onto, and we want the best-fit line to be as close as possible to each point. This means for a particular data point (x_i, y_i), we'd want y_i to be as close as possible to the value of our model at that point, f(x_i). The residual, or error of our model at a particular point, is then

    r_i = y_i − f(x_i)

As our model fits the data points better, the residuals across the data points should be minimized.
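To make the model and residual concrete, here is a minimal NumPy sketch. The toy numbers and variable names below are invented purely for illustration and are not part of the text's dataset.

    import numpy as np

    # Toy single-feature dataset (invented values): x could be gas-mileage, y the car price.
    x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
    y = np.array([32.0, 27.0, 21.0, 17.0, 11.0])

    def f(x, w1, w0):
        # Simple linear model f(x) = w1 * x + w0
        return w1 * x + w0

    # Residuals r_i = y_i - f(x_i) for an arbitrary (not yet optimized) guess of the parameters.
    residuals = y - f(x, w1=-0.5, w0=36.0)
    print(residuals)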

[INSERT PLOT]

Hence, it goes without proving that we need to minimize the sum of the residuals, since they represent the total error of our model against the data points:

    Error = (y_1 − f(x_1))² + (y_2 − f(x_2))² + ⋯ + (y_N − f(x_N))² = Σ_{i=1}^{N} (y_i − f(x_i))²

By averaging the error over the number of data points N, we obtain a mean error for our model:

    Mean Error = (1/N) Σ_{i=1}^{N} (y_i − f(x_i))²

The residual is squared in order to ensure that every term in the error sum is positive. There are actually additional reasons why we square the residual that are discussed later on and in the practice problems. The entire quantity is also multiplied by ½ for reasons of mathematical convenience you'll see soon. Recall our discussion of mean squared error from Chapter 2:

    Mean Squared Error = (1/2N) Σ_{i=1}^{N} (y_i − f(x_i))²

We're not doing anything different here. Now we use the term min to denote a minimization operation:

    min (1/2N) Σ_{i=1}^{N} (y_i − f(x_i))²

Well, x_i and y_i are the data points, so we can't modify those, and N is just the number of data points; it's also a constant. The only things we can modify in our model are w_1 and w_0. Since our model is parameterized by w_1 and w_0, so is the error function, sometimes called the cost function. We denote the cost function as J(w_0, w_1) and our model as f(x; w_0, w_1). The minimization problem is rewritten and sufficiently defined as such:

    min_{w_0, w_1} J(w_0, w_1) = min_{w_0, w_1} (1/2N) Σ_{i=1}^{N} (y_i − f(x_i; w_0, w_1))²
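Continuing the toy sketch from earlier (it assumes the x and y arrays and the model f defined above), the cost function translates directly from the formula:

    def cost(w1, w0, x, y):
        # J(w0, w1) = (1 / 2N) * sum_i (y_i - f(x_i))^2
        n = len(x)
        return np.sum((y - f(x, w1, w0)) ** 2) / (2 * n)

    print(cost(-0.5, 36.0, x, y))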

The obvious strategy coming out of Chapter 2 is to use calculus. Taking the derivative of J(w_0, w_1) can help us find the optimal values for w_0 and w_1 such that the cost function is minimized. Notice that there are two parameters: the slope w_1 and the y-intercept w_0, also known as the bias term. When we take the derivative of a function, we do it with respect to one variable. Since we have two parameters to solve for, we will take ∂J(w_0, w_1)/∂w_1 first and then ∂J(w_0, w_1)/∂w_0 as an example exercise. Then, since the derivative of a function is zero at its local/global minimum, we set ∂J(w_0, w_1)/∂w_1 = 0 and ∂J(w_0, w_1)/∂w_0 = 0 and solve. Since it's just basic derivatives and arithmetic, I'll leave the algebra as an exercise problem at the end of the chapter. Setting each partial derivative to zero,

    ∂J(w_0, w_1)/∂w_1 = −(1/N) Σ_{i=1}^{N} (y_i − w_1 x_i − w_0) x_i = 0
    ∂J(w_0, w_1)/∂w_0 = −(1/N) Σ_{i=1}^{N} (y_i − w_1 x_i − w_0) = 0

and solving the pair of equations, we get the following value for each parameter (with x̄ and ȳ denoting the means of the x_i and y_i):

    w_1 = Σ_{i=1}^{N} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{N} (x_i − x̄)²
    w_0 = ȳ − w_1 x̄
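As a sketch of how these closed-form expressions look in code, again continuing the toy example from above (the helper name fit_simple is made up for illustration):

    def fit_simple(x, y):
        # Closed-form least-squares estimates: slope w1 and intercept (bias) w0.
        x_bar, y_bar = np.mean(x), np.mean(y)
        w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
        w0 = y_bar - w1 * x_bar
        return w1, w0

    w1, w0 = fit_simple(x, y)
    print(w1, w0, cost(w1, w0, x, y))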

[INSERT 2D TOPOLOGICAL PLOT OF THE ERROR FUNCTION]

And that's it! We've found the best fit. Wait! Our task was to minimize, correct? How are we certain that we weren't solving for the maximum? The derivative of a function is zero at both a local maximum and a local minimum. There is a way to check whether we're dealing with a convex function that has a minimum versus a concave function with a maximum: we take the second derivative! If the second derivative is positive, we're dealing with a convex-shaped function, and if the second derivative is negative, then it's concave-shaped. So let's take the second derivatives to verify we indeed found the minimum of the cost function.

Taking the second derivative with respect to each parameter:

    ∂²J(w_0, w_1)/∂w_1² = (1/N) Σ_{i=1}^{N} x_i² > 0
    ∂²J(w_0, w_1)/∂w_0² = (1/N) Σ_{i=1}^{N} 1 = 1 > 0

Since the second derivatives of the cost with respect to both parameters are strictly positive, the function is convex and the optimal parameter values minimize the error.
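A quick numerical sanity check of the same claim, continuing the toy sketch from above: because the cost surface is a convex bowl, nudging either fitted parameter away from its optimal value should never decrease the cost.

    # Perturb each parameter around the fitted values; the cost should never drop below the minimum.
    best = cost(w1, w0, x, y)
    for dw in (-1.0, -0.1, 0.1, 1.0):
        assert cost(w1 + dw, w0, x, y) >= best
        assert cost(w1, w0 + dw, x, y) >= best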

But we're not done yet: how good is our model? What sort of metrics are there to evaluate it?

SECTION 3.2 EXPLAINED VARIANCE AND COEFFICIENT OF DETERMINATION

One way to view the (x, y) coordinates on a plot is to see that the y values vary from one another depending on the x value paired with each of them in the data. Hence, finding the right linear regression model f(x) amounts to explaining this variance in y. In statistics, variance quantifies the spread of a particular variable from its average. The variance, often denoted σ², is essentially the averaged sum of squared deviations of each value y_i (i = 1..N) from the mean ȳ:

    σ²_total = (1/N) Σ_{i=1}^{N} (y_i − ȳ)²

It's the job of our model to explain this variance well. So how do we go about calculating the explained variance? Lucky for us, we know how to calculate the unexplained variance: it's just the mean squared error! (dropping the ½ for generalization and consistency)

    σ²_error = (1/N) Σ_{i=1}^{N} (y_i − f(x_i; w_0, w_1))²

Think about it. If the explained variance is how well the linear regression model explains the total variance σ²_total, then the unexplained variance is how much the linear regression model fails to explain the data. The fraction of variance unexplained (FVU) is then simply:

    FVU = σ²_error / σ²_total

It should go without proving that the fraction of variance explained is simply 1 − FVU. This value is commonly known as the coefficient of determination, or R². This statistical measure is commonly used to determine the goodness of our regression fit. An R² value of 1 indicates our line fits the data perfectly. What would an R² value of 0 indicate? (Q)
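In code, R² follows directly from the two variances above (continuing the toy sketch, with w1 and w0 taken from the fitted model):

    def r_squared(x, y, w1, w0):
        # R^2 = 1 - FVU = 1 - (unexplained variance / total variance)
        var_error = np.mean((y - f(x, w1, w0)) ** 2)   # sigma^2_error (no 1/2 factor)
        var_total = np.mean((y - np.mean(y)) ** 2)     # sigma^2_total
        return 1.0 - var_error / var_total

    print(r_squared(x, y, w1, w0))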

SECTION 3.3 MULTIPLE LINEAR REGRESSION PREVIEW

Earlier we gave an example of a one-feature regression problem: gas-mileage to predict car prices. In most problems, there are multiple features to factor in. This means our model isn't just a simple line anymore for which we only need to find a slope and a y-intercept. The previous problem worked out nicely to look like a y = mx + b line since we were dealing with just one feature variable. Say we're given data and want to predict a person's height. Features include: age, arm length, father's height, and mother's height. Our starting model could look something like this:

    f(x) = w_1 x_age + w_2 x_arm + w_3 x_father + w_4 x_mother + w_0

The bold x represents a feature vector (recall from Chapter 1). The feature vector is essentially an array of our features stored in a (1 × N) matrix, given N features. The mathematics from here onwards takes a bit of a leap. We will cover multiple linear regression as a special topic in the next chapter. With multiple features, the problem becomes multidimensional and requires us to utilize linear algebra for the first time. That chapter will lay out the foundations of linear algebra and probability necessary for the remainder of this text.

[CH placed on hold, sorry :(]
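Until that chapter is ready, here is a rough numerical preview of what such a model looks like. All feature values and weights below are invented purely for illustration; how to actually fit them is the subject of the next chapter.

    # A multi-feature linear model is a dot product between a weight vector and a
    # feature vector, plus a bias term w0.
    features = np.array([12.0, 60.0, 178.0, 165.0])   # age, arm length (cm), father's and mother's heights (cm)
    weights  = np.array([1.2, 0.3, 0.25, 0.25])       # hypothetical w1..w4
    bias = 20.0                                       # hypothetical w0

    predicted_height = np.dot(weights, features) + bias
    print(predicted_height)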