Now that your calculus concepts are more or less fresh in your head, we start with a simple machine learning algorithm: linear regression. Those who've used Microsoft Excel to build plots have certainly used linear regression when fitting a line onto their data. We'll start with a simple example using a single-feature dataset. Recall that single feature means one variable: for example, using a variable gas-mileage to predict the price of a car, or perhaps the square footage of a house to predict its price.

SECTION 3.1 SIMPLE LINEAR REGRESSION

Linear regression is a statistical method used to model a linear relationship between a dependent variable and independent variables. Like I mentioned above, a dependent variable would be the price of a car; the independent variable would be the gas-mileage. Now, whether or not there exists any true hidden relationship is not important yet. It is up to us to fit a linear model the best we can and evaluate from there.

[Table of Data]

[Plot of Data]

Recall the formula for the equation of a line: y = mx + b. Well, if we are doing a simple linear regression task, we have only one independent variable. When fitting a line onto a plot similar to the figure above, think about which value you want to control in order to make the best fit. Well, y and x in the line equation are simply variables that serve as input and output in our model, so those can't be optimized. What about m and b? Well, m controls the slope, and b controls the y-intercept. Maybe if we configure the slope and y-intercept to the right setting, we can get a decent fit. That is essentially what linear regression does. Now, to generalize, we will replace m and b with w_1 and w_0. Since we're building a model f(x) that takes in input x, we formulate our line equation as such:

f(x) = w_1 x + w_0.

Great, now what? Well, we do have previous data we're fitting onto, and we want the best-fit line to be as close as possible to each point.
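To make the model concrete, here is a minimal sketch of f(x) = w_1 x + w_0 in Python. The function name and the sample slope/intercept values are illustrative assumptions, not fitted parameters:

```python
def f(x, w1, w0):
    """Simple linear model: predict y from input x with slope w1 and intercept w0."""
    return w1 * x + w0

# Hand-picked parameters for illustration only (not fitted to any data)
print(f(3.0, w1=2.0, w0=1.0))  # 7.0
```

Everything that follows in this chapter is about choosing w1 and w0 automatically instead of by hand.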
This means for a particular data point (x_i, y_i), we'd want y_i to be as close as possible to the value of our model at that point, f(x_i). The residual, or error, of our model at a particular point is then r_i = y_i − f(x_i). As our model fits the data points better, our residuals across each data point should be minimized.
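As a quick illustration (the helper name and toy data below are made up), the residuals r_i fall straight out of the definition:

```python
def residuals(xs, ys, w1, w0):
    # r_i = y_i - f(x_i) for the line f(x) = w1*x + w0
    return [y - (w1 * x + w0) for x, y in zip(xs, ys)]

# Toy data scattered loosely around the line y = 2x + 1
xs = [1.0, 2.0, 3.0]
ys = [3.1, 4.9, 7.2]
print(residuals(xs, ys, w1=2.0, w0=1.0))
```

A good fit makes these values small in magnitude; a poor fit leaves them large.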
[INSERT PLOT]

Hence, it goes without proving that we need to minimize the sum of the residuals, since they represent the total error of our model against the data (i.e., the data points):

Error = (y_1 − f(x_1)) + (y_2 − f(x_2)) + … + (y_N − f(x_N)) = Σ_{i=1}^{N} (y_i − f(x_i)).

By averaging the error over the number of data points, we can achieve a mean error of our model:

Mean Error = (1/N) Σ_{i=1}^{N} (y_i − f(x_i)).

The residual is squared in order to ensure positive error summation. There are actually other additional reasons why we square the residual that are discussed later on and in the practice problems. The entire quantity is also multiplied by ½ for mathematically convenient reasons you'll see soon. Recall our discussion of mean squared error from Chapter 2:

Mean Squared Error = (1/2N) Σ_{i=1}^{N} (y_i − f(x_i))².

We're not doing anything different here. Now we use the term min to denote a minimization operation:

min (1/2N) Σ_{i=1}^{N} (y_i − f(x_i))².

Well, the x_i and y_i are the data points, so we can't modify those. N is just the number of data points; it's also a constant. The only things we can modify in our model are w_1 and w_0. Since our model is parameterized by w_1 and w_0, so is the error function, sometimes called the cost function. We denote the cost function as J(w_0, w_1), and our model as f(x; w_0, w_1). The minimization problem is rewritten and sufficiently defined as such:

min_{w_0, w_1} J(w_0, w_1) = min_{w_0, w_1} (1/2N) Σ_{i=1}^{N} (y_i − f(x_i; w_0, w_1))².
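The cost function translates almost directly into code. This sketch (the function name is an assumption) uses the same 1/(2N) scaling discussed above:

```python
def cost(xs, ys, w1, w0):
    """Cost J(w0, w1): mean squared error with the 1/(2N) convention."""
    n = len(xs)
    return sum((y - (w1 * x + w0)) ** 2 for x, y in zip(xs, ys)) / (2 * n)

# On data lying exactly on y = 2x + 1, the cost is zero
print(cost([1.0, 2.0, 3.0], [3.0, 5.0, 7.0], w1=2.0, w0=1.0))  # 0.0
```

Any other choice of (w1, w0) on this data produces a strictly larger cost, which is exactly what the minimization problem exploits.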
The obvious strategy coming out of Chapter 2 is to use calculus. Taking the derivative of J(w_0, w_1) can help us find the optimal values for w_0 and w_1 such that the cost function is minimized. Notice, there are two parameters: the slope w_1 and the y-intercept w_0, also known as the bias term. When we take the derivative of a function, we do it with respect to one variable. Since we have two parameters to solve for, we will compute ∂J(w_0, w_1)/∂w_1 first and then ∂J(w_0, w_1)/∂w_0. Then, since the derivative of a function is zero at its local/global minimum, we set ∂J(w_0, w_1)/∂w_1 = 0 and ∂J(w_0, w_1)/∂w_0 = 0 and solve. Since it's just basic derivatives and arithmetic, I'll leave the full derivation as an exercise problem at the end of the chapter. For each parameter we get the following values. Setting

∂J(w_0, w_1)/∂w_1 = −(1/N) Σ_{i=1}^{N} x_i (y_i − w_1 x_i − w_0) = 0

and solving yields the slope:

w_1 = Σ_{i=1}^{N} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{N} (x_i − x̄)²,

where x̄ = (1/N) Σ x_i and ȳ = (1/N) Σ y_i are the sample means. Next we turn to ∂J(w_0, w_1)/∂w_0.
Setting

∂J(w_0, w_1)/∂w_0 = −(1/N) Σ_{i=1}^{N} (y_i − w_1 x_i − w_0) = 0

and solving yields the intercept:

w_0 = ȳ − w_1 x̄,

which tells us the best-fit line passes through the point of means (x̄, ȳ).

[INSERT PLOT OF ERROR FUNCTION IN 2D TOPOLOGICAL VIEW]

And that's it! We've found the best fit. Wait! Our task was to minimize, correct? How are we certain that we weren't solving for the maximum? The derivative of a function is zero at both a local maximum and a local minimum. There is another way to ensure we're dealing with a convex function that has a minimum versus a concave function with a maximum. That is, we take the second derivative! If the second derivative is positive, we're dealing with a convex-shaped function, and if the second derivative is negative, then it's concave-shaped. So let's take the second derivative to verify we indeed found the minimum of the cost function.
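In code, the two closed-form parameter formulas look like this. A minimal sketch; the function name is an assumption, and the formulas are the ones solved for above:

```python
def fit_simple(xs, ys):
    """Closed-form simple linear regression: returns (w1, w0)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: from setting dJ/dw1 = 0
    w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    # Intercept: from setting dJ/dw0 = 0; the line passes through (x_bar, y_bar)
    w0 = y_bar - w1 * x_bar
    return w1, w0

# Data lying exactly on y = 2x + 1 recovers the true parameters
print(fit_simple([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]))  # (2.0, 1.0)
```

No iteration or learning rate is needed here: with one feature, the minimizer has an exact algebraic solution.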
Differentiating once more:

∂²J(w_0, w_1)/∂w_1² = (1/N) Σ_{i=1}^{N} x_i² > 0,

∂²J(w_0, w_1)/∂w_0² = 1 > 0.

Since the second derivatives of the cost with respect to both parameters are strictly positive, the function is convex and the optimal parameter values minimize the error. But we're not done yet: how good is our model? What sort of metrics are there to evaluate our model?

SECTION 3.2 EXPLAINED VARIANCE AND COEFFICIENT OF DETERMINATION

One way to view (x, y) coordinates on a plot is to see that the y values vary from each other depending on the x value paired with each in the data. Hence, finding the right linear regression model f(x) is to explain this variance in y. In statistics, variance quantifies the spread of a particular
variable from its average. The variance, often denoted as σ², is essentially the averaged sum of squared deviations of each value y_i (i = 1, …, N) from its mean ȳ:

σ²_total = (1/N) Σ_{i=1}^{N} (y_i − ȳ)².

It's the job of our model to explain the variance well. So how do we go about calculating the explained variance? Lucky for us, we know how to calculate the unexplained variance; it's just the mean squared error! (dropping the ½ for generalization and consistency)

σ²_error = (1/N) Σ_{i=1}^{N} (y_i − f(x_i; w_0, w_1))².

Think about it. If the explained variance is how well the linear regression model explains the total variance σ²_total, then the unexplained variance is how much the linear regression model fails to explain the data. The fraction of variance unexplained (FVU) is then simply:

FVU = σ²_error / σ²_total.

It should go without proving that the fraction of variance explained is simply 1 − FVU. This value is commonly known as the coefficient of determination, or R². This statistical measure is commonly used to determine the goodness of our regression fit. An R² value of 1 indicates our line fits the data perfectly. What would an R² value of 0 indicate? (Q)

SECTION 3.3 MULTIPLE LINEAR REGRESSION PREVIEW

Before, we gave an example of a one-feature-variable regression problem: gas-mileage to predict car prices. In most problems, there are multiple features to factor in. This means our model isn't just a simple line anymore for which we need to find a slope and y-intercept. The previous problem worked out nicely to look like a y = mx + b line since we were dealing with just one feature variable. Say we're given data and want to predict a person's height. Features include: age, arm length, father's height, and mother's height. Our starting model could look something like this:

f(x) = w_1 x_age + w_2 x_arm + w_3 x_father + w_4 x_mother + w_0.

The bold x represents a feature vector (recall from an earlier chapter). The feature vector is essentially an array of our features stored in an (n × 1) matrix given n features. The mathematics from here
and onwards takes a bit of a leap. We will cover multiple linear regression as a special topic in the next chapter. With multiple features, the problem becomes multidimensional and requires us to utilize linear algebra for the first time. The next chapter will lay out the foundations of linear algebra and probability necessary for the remainder of this text.

[CH placed on hold, sorry]
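As a small preview of where this is headed, the multi-feature model is just a dot product between a weight vector and a feature vector, plus the bias. The function name, feature values, and weights below are all illustrative assumptions, not fitted values:

```python
def f_multi(x, w, w0):
    """Multiple linear regression model: dot(w, x) + w0."""
    assert len(x) == len(w), "one weight per feature"
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

# Hypothetical feature vector: [age, arm length, father's height, mother's height]
x = [12.0, 55.0, 180.0, 165.0]
w = [1.5, 0.2, 0.3, 0.3]   # made-up weights, not fitted
print(f_multi(x, w, w0=10.0))
```

The next chapter's linear algebra makes this compact: the sum-of-products here is exactly the vector inner product wᵀx.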