18.6 Regression and Classification with Linear Models


The hypothesis space of linear functions of continuous-valued inputs has been used for hundreds of years. A univariate linear function (a straight line) with input x and output y has the form y = w_1 x + w_0, where w_0 and w_1 are real-valued coefficients to be learned. Let w be the vector [w_0, w_1] and define h_w(x) = w_1 x + w_0. The task of finding the h_w that best fits the data is called linear regression. To fit a line to the data, all we have to do is find the values of the weights [w_0, w_1] that minimize the empirical loss.

It is traditional (going back to Gauss) to use the squared loss function L_2, summed over all the training examples:

    Loss(h_w) = Σ_j L_2(y_j, h_w(x_j)) = Σ_j (y_j − h_w(x_j))^2 = Σ_j (y_j − (w_1 x_j + w_0))^2

We would like to find w* = argmin_w Loss(h_w). The sum Σ_j (y_j − (w_1 x_j + w_0))^2 is minimized when its partial derivatives with respect to w_0 and w_1 are zero:

    ∂/∂w_0 Σ_j (y_j − (w_1 x_j + w_0))^2 = 0   and   ∂/∂w_1 Σ_j (y_j − (w_1 x_j + w_0))^2 = 0

These equations have a unique solution:

    w_1 = (N Σ_j x_j y_j − (Σ_j x_j)(Σ_j y_j)) / (N Σ_j x_j^2 − (Σ_j x_j)^2)   and   w_0 = (Σ_j y_j − w_1 Σ_j x_j) / N

The weight space defined by w_0 and w_1 is convex. This is true for every linear regression problem with an L_2 loss function, and it implies that there are no local minima.
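As a concrete illustration, here is a minimal NumPy sketch that fits a univariate line with these closed-form equations; the data values are invented for the example:

    import numpy as np

    # Toy training data (illustrative values only).
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
    N = len(x)

    # Closed-form least-squares solution for y = w1*x + w0.
    w1 = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x**2) - np.sum(x)**2)
    w0 = (np.sum(y) - w1 * np.sum(x)) / N

    print(w0, w1)   # fitted intercept and slope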

To go beyond linear models, we will need to face the fact that the equations defining minimum loss will often have no closed-form solution. Instead, we face a general optimization search in continuous weight space. As we already know, such problems can be addressed by a hill-climbing algorithm that follows the gradient of the function to be optimized. Because we are trying to minimize the loss, we will use gradient descent: choose any starting point in weight space and then move to a neighboring point that is downhill, repeating until we converge on the minimum possible loss.

    w ← any point in the parameter space
    loop until convergence do
        for each w_i in w do
            w_i ← w_i − α ∂Loss(w)/∂w_i

The step-size parameter α is usually called the learning rate when we are trying to minimize loss in a learning problem. It can be a fixed constant, or it can decay over time as the learning proceeds. For univariate regression, the loss function is a quadratic function, so the partial derivative will be a linear function.
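Translated directly into code, the loop looks like the following sketch; the placeholder quadratic loss, its gradient, the learning rate 0.1, and the fixed iteration budget are all arbitrary choices for illustration:

    import numpy as np

    def loss(w):
        # Placeholder convex loss; any differentiable Loss(w) could be used here.
        return np.sum((w - np.array([3.0, -2.0]))**2)

    def grad(w):
        # Gradient of the placeholder loss above.
        return 2 * (w - np.array([3.0, -2.0]))

    w = np.zeros(2)          # any starting point in parameter space
    alpha = 0.1              # learning rate (step size)
    for _ in range(1000):    # "loop until convergence" approximated by a fixed budget
        w = w - alpha * grad(w)

    print(w, loss(w))        # w approaches [3, -2], where the loss is minimal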

Let us consider the case of only one training example, (x, y):

    ∂Loss(w)/∂w_i = ∂(y − h_w(x))^2 / ∂w_i
                  = 2 (y − h_w(x)) ∂(y − h_w(x)) / ∂w_i
                  = 2 (y − h_w(x)) ∂(y − (w_1 x + w_0)) / ∂w_i

Applying this to both w_0 and w_1 we get:

    ∂Loss(w)/∂w_0 = −2 (y − h_w(x))   and   ∂Loss(w)/∂w_1 = −2 (y − h_w(x)) x

Plugging these values back into the gradient descent update rule (and folding the constant 2 into the learning rate α), we get the following learning rules for the weights:

    w_0 ← w_0 + α (y − h_w(x))   and   w_1 ← w_1 + α (y − h_w(x)) x

Intuitively: if h_w(x) > y, the output of the hypothesis is too large, so reduce w_0 a bit; also reduce w_1 if x was a positive input, but increase w_1 if x was a negative input. For N training examples, the derivative of a sum is the sum of the derivatives, and we have

    w_0 ← w_0 + α Σ_j (y_j − h_w(x_j))   and   w_1 ← w_1 + α Σ_j (y_j − h_w(x_j)) x_j
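A minimal batch-gradient-descent version of these update rules, reusing the toy data from the earlier sketch and an illustrative learning rate of 0.01:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # same toy data as before
    y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

    w0, w1 = 0.0, 0.0
    alpha = 0.01                               # learning rate (illustrative)
    for _ in range(5000):                      # fixed number of passes instead of a convergence test
        err = y - (w1 * x + w0)                # y_j - h_w(x_j) for all examples
        w0 += alpha * np.sum(err)              # batch update for the intercept
        w1 += alpha * np.sum(err * x)          # batch update for the slope

    print(w0, w1)                              # approaches the closed-form solution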

Multivariate linear regression

In multivariate linear regression each example x_j is an n-element vector, and our hypothesis space is the set of functions of the form

    h_w(x_j) = w_0 + w_1 x_{j,1} + ... + w_n x_{j,n} = w_0 + Σ_i w_i x_{j,i}

To put w_0 on a par with the other weights, we can invent a dummy input attribute x_{j,0} that is defined to be always equal to 1. Then h is simply the dot product of the weights and the input vector (or, equivalently, the matrix product of the transpose of the weight vector and the input vector):

    h_w(x_j) = w · x_j = w^T x_j = Σ_i w_i x_{j,i}

The best vector of weights, w*, minimizes the squared-error loss over the examples:

    w* = argmin_w Σ_j L_2(y_j, w · x_j)

Very much as in the univariate case, gradient descent will reach the (unique) minimum of the loss function; the update equation for each weight w_i is

    w_i ← w_i + α Σ_j x_{j,i} (y_j − h_w(x_j))

It is also possible to solve analytically for the w that minimizes loss. Let y be the vector of outputs for the training examples, and X the data matrix, i.e., the matrix of inputs with one n-dimensional example per row. Then the solution

    w* = (X^T X)^(-1) X^T y

minimizes the squared error. With multivariate linear regression in high-dimensional spaces it is possible that some dimension that is actually irrelevant appears by chance to be useful, resulting in overfitting.
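For illustration, one way to compute the analytic solution with NumPy; the small random data set is invented for the example, and the normal equations are solved directly rather than forming the inverse explicitly:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))               # 20 examples, n = 3 input attributes (made-up data)
    y = X @ np.array([1.5, -2.0, 0.5]) + 4.0   # underlying linear relation plus intercept
    X1 = np.hstack([np.ones((20, 1)), X])      # prepend the dummy attribute x_{j,0} = 1

    # Normal-equation solution w* = (X^T X)^(-1) X^T y
    w = np.linalg.solve(X1.T @ X1, X1.T @ y)
    print(w)                                   # approximately [4.0, 1.5, -2.0, 0.5]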

Thus, regularization of multivariate linear functions to avoid overfitting is common. In regularization we minimize the total cost of a hypothesis, counting both the empirical loss and the complexity of the hypothesis:

    Cost(h) = EmpLoss(h) + λ Complexity(h)

For linear functions the complexity can be specified as a function of the weights. We can consider a family of regularization functions:

    Complexity(h_w) = L_q(w) = Σ_i |w_i|^q

With q = 1 we have L_1 regularization, which minimizes the sum of the absolute values; with q = 2, L_2 regularization minimizes the sum of squares. Loss and regularization functions need not be used in pairs: you could use L_2 loss with L_1 regularization, or vice versa. Which regularization function to pick depends on the specific problem. L_1 regularization has an important advantage: it tends to produce sparse models, often setting many weights to zero and effectively declaring the corresponding attributes to be irrelevant. Hypotheses that discard attributes can be easier for a human to understand, and may be less likely to overfit. The number of examples required to find a good h is linear in the number of irrelevant features for L_2 regularization, but only logarithmic with L_1 regularization.
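A regularized cost of this form is easy to evaluate; the sketch below uses the squared loss, a made-up λ of 0.1, and an L_q weight penalty selectable through q:

    import numpy as np

    def regularized_cost(w, X1, y, lam=0.1, q=1):
        # EmpLoss: squared error of the linear hypothesis h_w(x) = w . x
        emp_loss = np.sum((y - X1 @ w)**2)
        # Complexity: L_q penalty on the weights, sum_i |w_i|^q
        complexity = np.sum(np.abs(w)**q)
        return emp_loss + lam * complexity

Minimizing this cost with q = 1 tends to drive many weights exactly to zero, whereas q = 2 merely shrinks them toward zero.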

Linear classifiers with a hard threshold

Linear functions can be used to do classification by finding a decision boundary (a linear separator), a line, or a surface in higher dimensions, that separates the two classes (if the data are linearly separable):

    h_w(x) = 1 if w · x ≥ 0 and 0 otherwise

We can think of h as passing the linear function w · x through a threshold function:

    h_w(x) = Threshold(w · x), where Threshold(z) = 1 if z ≥ 0 and 0 otherwise

For regression, minimizing the loss could be attacked both through a closed-form solution and by gradient descent in weight space. Here we can do neither of those things, because the gradient is zero almost everywhere in weight space, except at those points where w · x = 0, and at those points the gradient is undefined.
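A hard-threshold classifier is only a couple of lines of code; the weights and example below are invented just to show the call:

    import numpy as np

    def threshold(z):
        # Threshold(z) = 1 if z >= 0 and 0 otherwise
        return 1 if z >= 0 else 0

    def h(w, x):
        # Pass the linear function w . x through the threshold
        return threshold(np.dot(w, x))

    w = np.array([-1.0, 0.5, 0.5])      # illustrative weights (w_0 is the dummy-input weight)
    x = np.array([1.0, 1.2, 0.7])       # example with the dummy attribute x_0 = 1
    print(h(w, x))                      # -> 0 or 1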

There is, however, a simple weight update rule that converges to a solution, i.e., to a linear separator that classifies the data perfectly, provided the data are linearly separable. For a single example (x, y), we have the perceptron learning rule

    w_i ← w_i + α (y − h_w(x)) x_i

which is essentially identical to the update rule for linear regression. Because we are considering a 0/1 classification problem, however, the behavior is somewhat different. Both the true value y and the hypothesis output h_w(x) are either 0 or 1:

- If y = h_w(x), the output is correct, and the weights are not changed.
- If y = 1 but h_w(x) = 0, then w_i is increased when the corresponding input x_i is positive and decreased when x_i is negative. This makes sense, because we want to make w · x bigger so that h_w(x) outputs a 1.
- If y = 0 but h_w(x) = 1, then w_i is decreased when the corresponding input x_i is positive and increased when x_i is negative. This makes sense, because we want to make w · x smaller so that h_w(x) outputs a 0.
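A compact sketch of perceptron training with this rule, using a fixed illustrative learning rate and a tiny linearly separable data set invented for the example:

    import numpy as np

    def perceptron_train(X1, y, alpha=0.1, epochs=100):
        # X1: examples with dummy attribute x_0 = 1 in the first column; y: 0/1 labels.
        w = np.zeros(X1.shape[1])
        for _ in range(epochs):
            for xj, yj in zip(X1, y):                     # one example at a time
                pred = 1 if np.dot(w, xj) >= 0 else 0     # h_w(x) = Threshold(w . x)
                w += alpha * (yj - pred) * xj             # perceptron learning rule
        return w

    # Tiny linearly separable data set (label is 1 when x_1 + x_2 > 1.5).
    X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
    y = np.array([0, 0, 0, 1, 1])
    X1 = np.hstack([np.ones((len(X), 1)), X])             # prepend the dummy input
    print(perceptron_train(X1, y))

Presenting the examples in random order, with a learning rate that decays over time, corresponds to the stochastic variant discussed next.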

Typically the learning rule is applied one example at a time, choosing examples at random (as in stochastic gradient descent). With a fixed learning rate, the perceptron rule may not converge to a stable solution; however, if α decays as O(1/t), where t is the iteration number, then the perceptron learning rule converges to a minimum-error solution when examples are presented in a random sequence. If the data points are not linearly separable, the learning rule may fail to converge, and finding the minimum-error solution is NP-hard.

Artificial Neural Networks

Neuroscience has hypothesized that mental activity consists primarily of electrochemical activity in networks of brain cells called neurons. This led McCulloch and Pitts to devise their mathematical model of the neuron as early as 1943. Roughly speaking, such a unit fires when a linear combination of its inputs exceeds some (hard or soft) threshold; hence, it implements a linear classifier. A neural network is just a collection of such units connected together. The properties of the network are determined by its topology and the properties of the neurons. Names for the research field include connectionism, parallel distributed processing, neural computation, and computational neuroscience.

[Figure: a single unit. Input links deliver activations a_i, each weighted by w_{i,j}; the input function sums them, the activation function g is applied, and the result a_j is sent along the output links. A dummy input a_0 = 1 with weight w_{0,j} plays the role of the intercept.]

Neural network structures

Neural networks are composed of nodes or units connected by directed links. A link from unit i to unit j serves to propagate the activation a_i from i to j. Each link also has a weight w_{i,j} associated with it, which determines the strength and sign of the connection. Each unit has a dummy input a_0 = 1 with an associated weight w_{0,j}, just as in the linear regression model. Each unit j first computes a weighted sum of its inputs:

    in_j = Σ_{i=0..n} w_{i,j} a_i

Then the unit applies an activation function g to this sum to derive the output:

    a_j = g(in_j) = g(Σ_{i=0..n} w_{i,j} a_i)

The activation function g is typically either a hard threshold, in which case the unit is called a perceptron, or a logistic function, in which case the term sigmoid unit is sometimes used. Both of these nonlinear activation functions ensure the important property that the entire network of units can represent a nonlinear function.

There are two fundamentally distinct ways to connect individual neurons together to form a network. A feed-forward network has connections only in one direction, i.e., it forms a directed acyclic graph: every node receives input from upstream nodes and delivers output to downstream nodes, and there are no loops. A feed-forward network represents a function of its current input; it has no internal state other than the weights themselves. A recurrent network, by contrast, feeds its outputs back into its own inputs. This means that the activation levels of the network form a dynamical system that may reach a stable state or exhibit oscillations or even chaotic behavior.
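A single unit's computation, with either activation, is a one-liner; the weights and activations in this sketch are invented for illustration:

    import numpy as np

    def unit_output(w, a, g):
        # in_j = sum_i w_{i,j} * a_i, then a_j = g(in_j)
        return g(np.dot(w, a))

    logistic = lambda z: 1.0 / (1.0 + np.exp(-z))     # soft threshold (sigmoid unit)
    hard = lambda z: 1.0 if z >= 0 else 0.0           # hard threshold (perceptron)

    a = np.array([1.0, 0.3, -0.8])     # activations, including the dummy input a_0 = 1
    w = np.array([0.2, 1.5, 0.4])      # illustrative weights w_{i,j}
    print(unit_output(w, a, logistic), unit_output(w, a, hard))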

Moreover, the response of a recurrent network to a given input depends on its initial state, which may in turn depend on previous inputs. Hence, recurrent networks can support short-term memory; they are interesting models of the brain, but more difficult to understand. Feed-forward networks are usually arranged in layers, such that each unit receives input only from units in the immediately preceding layer. Multilayer networks have one or more layers of hidden units that are not connected to the outputs of the network. A network with all the inputs connected directly to the outputs is called a single-layer neural network, or a perceptron network.

Let us think of a feed-forward neural network as a function h_w(x) parametrized by the weights w. We can express the output as a function of the inputs and the weights, and as long as we can calculate the derivatives of such an expression with respect to the weights, we can use the gradient-descent loss-minimization method to train the network. Because the function represented by a network can be highly nonlinear, composed as it is of nested nonlinear soft-threshold functions, we can view neural networks as a tool for doing nonlinear regression.
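To make the nesting of nonlinear functions concrete, here is a forward pass for a tiny network with one hidden layer of logistic units and a single logistic output unit; the layer sizes and random weights are arbitrary choices for this sketch:

    import numpy as np

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, W_hidden, W_out):
        # Prepend the dummy input a_0 = 1, then apply the hidden layer.
        a_in = np.concatenate(([1.0], x))
        a_hidden = logistic(W_hidden @ a_in)          # activations of the hidden units
        # Prepend a dummy activation for the output unit's bias weight.
        a_hidden = np.concatenate(([1.0], a_hidden))
        return logistic(W_out @ a_hidden)             # network output h_w(x)

    rng = np.random.default_rng(1)
    W_hidden = rng.normal(size=(4, 3))    # 4 hidden units, 2 inputs + dummy
    W_out = rng.normal(size=(5,))         # 1 output unit, 4 hidden activations + dummy
    print(forward(np.array([0.5, -1.0]), W_hidden, W_out))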

With a single, sufficiently large hidden layer, it is possible to represent any continuous function of the inputs with arbitrary accuracy; with two layers, even discontinuous functions can be represented. Unfortunately, for any particular network structure, it is hard to characterize exactly which functions can be represented and which cannot. To train a multilayer network we can back-propagate the error from the output layer to the hidden layers; the back-propagation process emerges directly from a derivation of the overall error gradient.

Nonparametric models

Linear regression and neural networks are examples of parametric models: independent of the number of training examples, they estimate a fixed set of parameters w to define the hypothesis h_w(x). An example of a nonparametric model is instance-based learning, where all training examples are retained and used to predict the next example. Instead of simply doing table lookup, which does not generalize well, we can use nearest-neighbor models: given a query x_q, find the k examples that are nearest to it. Let NN(k, x_q) denote the set of k nearest neighbors. To do classification, first find NN(k, x_q), then take the plurality (majority) vote of the neighbors.

To avoid ties, k is always chosen to be an odd number. To do regression, we can take the mean or median of the k neighbors, or we can solve a linear regression problem on the neighbors. Nonparametric methods are still subject to underfitting and overfitting, just like parametric methods. Since we are looking for nearest neighbors, we need a distance metric: how do we measure the distance from a query point x_q to an example point x_j? Typically distances are measured with a Minkowski distance, or L_p norm:

    L_p(x_j, x_q) = (Σ_i |x_{j,i} − x_{q,i}|^p)^(1/p)

With p = 2 this is Euclidean distance,

    L_2(x_j, x_q) = sqrt(Σ_i (x_{j,i} − x_{q,i})^2)

and with p = 1 it is Manhattan distance,

    L_1(x_j, x_q) = Σ_i |x_{j,i} − x_{q,i}|
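Putting the distance metric and the plurality vote together gives a minimal k-nearest-neighbor classifier; the choice k = 3, the default p = 2, and the tiny two-class data set are illustrative only:

    import numpy as np
    from collections import Counter

    def minkowski(a, b, p=2):
        # L_p distance: (sum_i |a_i - b_i|^p)^(1/p)
        return np.sum(np.abs(a - b)**p) ** (1.0 / p)

    def knn_classify(X, y, x_q, k=3, p=2):
        # Find NN(k, x_q) and return the plurality vote of their labels.
        dists = [minkowski(xj, x_q, p) for xj in X]
        nearest = np.argsort(dists)[:k]
        return Counter(y[i] for i in nearest).most_common(1)[0][0]

    # Tiny made-up data set with two classes.
    X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
    y = np.array([0, 0, 1, 1, 1])
    print(knn_classify(X, y, np.array([0.95, 0.9])))    # -> 1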

Linear classifiers (again)

Let f(x) be a linear function of its argument vector x = (x_1, ..., x_m):

    f(x) = w · x + b = Σ_i w_i x_i + b

Input x is classified as positive if f(x) ≥ 0, and otherwise x is classified as negative:

    h(x) = sign(f(x)) = +1 if f(x) ≥ 0, and −1 otherwise

The hyperplane determined by the equation w · x + b = 0 divides the input space into two halfspaces. If for an example (x, y) and hypothesis h it holds that y h(x) > 0, then the example has been correctly classified; the quantity y f(x) = y (w · x + b) is the margin of the example with respect to h.
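A short sketch computing f(x), the predicted class, and the margin for a couple of invented examples and an invented hyperplane:

    import numpy as np

    def f(w, b, x):
        return np.dot(w, x) + b              # linear function w . x + b

    def classify(w, b, x):
        return 1 if f(w, b, x) >= 0 else -1  # sign of f(x)

    w, b = np.array([1.0, -1.0]), -0.5       # illustrative hyperplane w . x + b = 0
    for x, y in [(np.array([2.0, 0.5]), 1), (np.array([0.0, 1.0]), -1)]:
        margin = y * f(w, b, x)              # positive margin <=> correctly classified
        print(classify(w, b, x), margin)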

Regression and Classification" with Linear Models" CMPSCI 383 Nov 15, 2011!

Regression and Classification with Linear Models CMPSCI 383 Nov 15, 2011! Regression and Classification" with Linear Models" CMPSCI 383 Nov 15, 2011! 1 Todayʼs topics" Learning from Examples: brief review! Univariate Linear Regression! Batch gradient descent! Stochastic gradient

More information

Artificial Neural Networks" and Nonparametric Methods" CMPSCI 383 Nov 17, 2011!

Artificial Neural Networks and Nonparametric Methods CMPSCI 383 Nov 17, 2011! Artificial Neural Networks" and Nonparametric Methods" CMPSCI 383 Nov 17, 2011! 1 Todayʼs lecture" How the brain works (!)! Artificial neural networks! Perceptrons! Multilayer feed-forward networks! Error

More information

Linear classification with logistic regression

Linear classification with logistic regression Section 8.6. Regression and Classification with Linear Models 725 Proportion correct.9.7 Proportion correct.9.7 2 3 4 5 6 7 2 4 6 8 2 4 6 8 Number of weight updates Number of weight updates Number of weight

More information

CSC242: Intro to AI. Lecture 21

CSC242: Intro to AI. Lecture 21 CSC242: Intro to AI Lecture 21 Administrivia Project 4 (homeworks 18 & 19) due Mon Apr 16 11:59PM Posters Apr 24 and 26 You need an idea! You need to present it nicely on 2-wide by 4-high landscape pages

More information

Artificial Intelligence Roman Barták

Artificial Intelligence Roman Barták Artificial Intelligence Roman Barták Department of Theoretical Computer Science and Mathematical Logic Introduction We will describe agents that can improve their behavior through diligent study of their

More information

Neural networks. Chapter 19, Sections 1 5 1

Neural networks. Chapter 19, Sections 1 5 1 Neural networks Chapter 19, Sections 1 5 Chapter 19, Sections 1 5 1 Outline Brains Neural networks Perceptrons Multilayer perceptrons Applications of neural networks Chapter 19, Sections 1 5 2 Brains 10

More information

Sections 18.6 and 18.7 Artificial Neural Networks

Sections 18.6 and 18.7 Artificial Neural Networks Sections 18.6 and 18.7 Artificial Neural Networks CS4811 - Artificial Intelligence Nilufer Onder Department of Computer Science Michigan Technological University Outline The brain vs artifical neural networks

More information

Neural networks. Chapter 20. Chapter 20 1

Neural networks. Chapter 20. Chapter 20 1 Neural networks Chapter 20 Chapter 20 1 Outline Brains Neural networks Perceptrons Multilayer networks Applications of neural networks Chapter 20 2 Brains 10 11 neurons of > 20 types, 10 14 synapses, 1ms

More information

Sections 18.6 and 18.7 Analysis of Artificial Neural Networks

Sections 18.6 and 18.7 Analysis of Artificial Neural Networks Sections 18.6 and 18.7 Analysis of Artificial Neural Networks CS4811 - Artificial Intelligence Nilufer Onder Department of Computer Science Michigan Technological University Outline Univariate regression

More information

Learning from Examples

Learning from Examples Learning from Examples Data fitting Decision trees Cross validation Computational learning theory Linear classifiers Neural networks Nonparametric methods: nearest neighbor Support vector machines Ensemble

More information

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD WHAT IS A NEURAL NETWORK? The simplest definition of a neural network, more properly referred to as an 'artificial' neural network (ANN), is provided

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Jeff Clune Assistant Professor Evolving Artificial Intelligence Laboratory Announcements Be making progress on your projects! Three Types of Learning Unsupervised Supervised Reinforcement

More information

Lecture 4: Perceptrons and Multilayer Perceptrons

Lecture 4: Perceptrons and Multilayer Perceptrons Lecture 4: Perceptrons and Multilayer Perceptrons Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning Perceptrons, Artificial Neuronal Networks Lecture 4: Perceptrons

More information

Sections 18.6 and 18.7 Artificial Neural Networks

Sections 18.6 and 18.7 Artificial Neural Networks Sections 18.6 and 18.7 Artificial Neural Networks CS4811 - Artificial Intelligence Nilufer Onder Department of Computer Science Michigan Technological University Outline The brain vs. artifical neural

More information

Neural networks. Chapter 20, Section 5 1

Neural networks. Chapter 20, Section 5 1 Neural networks Chapter 20, Section 5 Chapter 20, Section 5 Outline Brains Neural networks Perceptrons Multilayer perceptrons Applications of neural networks Chapter 20, Section 5 2 Brains 0 neurons of

More information

18.9 SUPPORT VECTOR MACHINES

18.9 SUPPORT VECTOR MACHINES 744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information

AI Programming CS F-20 Neural Networks

AI Programming CS F-20 Neural Networks AI Programming CS662-2008F-20 Neural Networks David Galles Department of Computer Science University of San Francisco 20-0: Symbolic AI Most of this class has been focused on Symbolic AI Focus or symbols

More information

CS:4420 Artificial Intelligence

CS:4420 Artificial Intelligence CS:4420 Artificial Intelligence Spring 2018 Neural Networks Cesare Tinelli The University of Iowa Copyright 2004 18, Cesare Tinelli and Stuart Russell a a These notes were originally developed by Stuart

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Neural Networks. Chapter 18, Section 7. TB Artificial Intelligence. Slides from AIMA 1/ 21

Neural Networks. Chapter 18, Section 7. TB Artificial Intelligence. Slides from AIMA   1/ 21 Neural Networks Chapter 8, Section 7 TB Artificial Intelligence Slides from AIMA http://aima.cs.berkeley.edu / 2 Outline Brains Neural networks Perceptrons Multilayer perceptrons Applications of neural

More information

Feedforward Neural Nets and Backpropagation

Feedforward Neural Nets and Backpropagation Feedforward Neural Nets and Backpropagation Julie Nutini University of British Columbia MLRG September 28 th, 2016 1 / 23 Supervised Learning Roadmap Supervised Learning: Assume that we are given the features

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.5. Spring 2010 Instructor: Dr. Masoud Yaghini Outline How the Brain Works Artificial Neural Networks Simple Computing Elements Feed-Forward Networks Perceptrons (Single-layer,

More information

CSC321 Lecture 4: Learning a Classifier

CSC321 Lecture 4: Learning a Classifier CSC321 Lecture 4: Learning a Classifier Roger Grosse Roger Grosse CSC321 Lecture 4: Learning a Classifier 1 / 28 Overview Last time: binary classification, perceptron algorithm Limitations of the perceptron

More information

Introduction to Neural Networks

Introduction to Neural Networks Introduction to Neural Networks What are (Artificial) Neural Networks? Models of the brain and nervous system Highly parallel Process information much more like the brain than a serial computer Learning

More information

COMP 551 Applied Machine Learning Lecture 14: Neural Networks

COMP 551 Applied Machine Learning Lecture 14: Neural Networks COMP 551 Applied Machine Learning Lecture 14: Neural Networks Instructor: Ryan Lowe (ryan.lowe@mail.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted,

More information

22c145-Fall 01: Neural Networks. Neural Networks. Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1

22c145-Fall 01: Neural Networks. Neural Networks. Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1 Neural Networks Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1 Brains as Computational Devices Brains advantages with respect to digital computers: Massively parallel Fault-tolerant Reliable

More information

Last updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

Last updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition Last updated: Oct 22, 2012 LINEAR CLASSIFIERS Problems 2 Please do Problem 8.3 in the textbook. We will discuss this in class. Classification: Problem Statement 3 In regression, we are modeling the relationship

More information

Artificial Neural Networks

Artificial Neural Networks Artificial Neural Networks 鮑興國 Ph.D. National Taiwan University of Science and Technology Outline Perceptrons Gradient descent Multi-layer networks Backpropagation Hidden layer representations Examples

More information

2015 Todd Neller. A.I.M.A. text figures 1995 Prentice Hall. Used by permission. Neural Networks. Todd W. Neller

2015 Todd Neller. A.I.M.A. text figures 1995 Prentice Hall. Used by permission. Neural Networks. Todd W. Neller 2015 Todd Neller. A.I.M.A. text figures 1995 Prentice Hall. Used by permission. Neural Networks Todd W. Neller Machine Learning Learning is such an important part of what we consider "intelligence" that

More information

Machine Learning and Data Mining. Linear classification. Kalev Kask

Machine Learning and Data Mining. Linear classification. Kalev Kask Machine Learning and Data Mining Linear classification Kalev Kask Supervised learning Notation Features x Targets y Predictions ŷ = f(x ; q) Parameters q Program ( Learner ) Learning algorithm Change q

More information

9 Classification. 9.1 Linear Classifiers

9 Classification. 9.1 Linear Classifiers 9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Neural networks Daniel Hennes 21.01.2018 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Logistic regression Neural networks Perceptron

More information

Artificial neural networks

Artificial neural networks Artificial neural networks Chapter 8, Section 7 Artificial Intelligence, spring 203, Peter Ljunglöf; based on AIMA Slides c Stuart Russel and Peter Norvig, 2004 Chapter 8, Section 7 Outline Brains Neural

More information

ECE521 Lectures 9 Fully Connected Neural Networks

ECE521 Lectures 9 Fully Connected Neural Networks ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance

More information

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal

More information

CSC321 Lecture 5: Multilayer Perceptrons

CSC321 Lecture 5: Multilayer Perceptrons CSC321 Lecture 5: Multilayer Perceptrons Roger Grosse Roger Grosse CSC321 Lecture 5: Multilayer Perceptrons 1 / 21 Overview Recall the simple neuron-like unit: y output output bias i'th weight w 1 w2 w3

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

Introduction to Natural Computation. Lecture 9. Multilayer Perceptrons and Backpropagation. Peter Lewis

Introduction to Natural Computation. Lecture 9. Multilayer Perceptrons and Backpropagation. Peter Lewis Introduction to Natural Computation Lecture 9 Multilayer Perceptrons and Backpropagation Peter Lewis 1 / 25 Overview of the Lecture Why multilayer perceptrons? Some applications of multilayer perceptrons.

More information

1 What a Neural Network Computes

1 What a Neural Network Computes Neural Networks 1 What a Neural Network Computes To begin with, we will discuss fully connected feed-forward neural networks, also known as multilayer perceptrons. A feedforward neural network consists

More information

Error Functions & Linear Regression (1)

Error Functions & Linear Regression (1) Error Functions & Linear Regression (1) John Kelleher & Brian Mac Namee Machine Learning @ DIT Overview 1 Introduction Overview 2 Univariate Linear Regression Linear Regression Analytical Solution Gradient

More information

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs) Multilayer Neural Networks (sometimes called Multilayer Perceptrons or MLPs) Linear separability Hyperplane In 2D: w x + w 2 x 2 + w 0 = 0 Feature x 2 = w w 2 x w 0 w 2 Feature 2 A perceptron can separate

More information

In the Name of God. Lecture 11: Single Layer Perceptrons

In the Name of God. Lecture 11: Single Layer Perceptrons 1 In the Name of God Lecture 11: Single Layer Perceptrons Perceptron: architecture We consider the architecture: feed-forward NN with one layer It is sufficient to study single layer perceptrons with just

More information

Nonlinear Classification

Nonlinear Classification Nonlinear Classification INFO-4604, Applied Machine Learning University of Colorado Boulder October 5-10, 2017 Prof. Michael Paul Linear Classification Most classifiers we ve seen use linear functions

More information

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs) Multilayer Neural Networks (sometimes called Multilayer Perceptrons or MLPs) Linear separability Hyperplane In 2D: w 1 x 1 + w 2 x 2 + w 0 = 0 Feature 1 x 2 = w 1 w 2 x 1 w 0 w 2 Feature 2 A perceptron

More information

Lecture 9: Large Margin Classifiers. Linear Support Vector Machines

Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Perceptrons Definition Perceptron learning rule Convergence Margin & max margin classifiers (Linear) support vector machines Formulation

More information

CSC321 Lecture 4: Learning a Classifier

CSC321 Lecture 4: Learning a Classifier CSC321 Lecture 4: Learning a Classifier Roger Grosse Roger Grosse CSC321 Lecture 4: Learning a Classifier 1 / 31 Overview Last time: binary classification, perceptron algorithm Limitations of the perceptron

More information

Neural Networks biological neuron artificial neuron 1

Neural Networks biological neuron artificial neuron 1 Neural Networks biological neuron artificial neuron 1 A two-layer neural network Output layer (activation represents classification) Weighted connections Hidden layer ( internal representation ) Input

More information

Lecture 7 Artificial neural networks: Supervised learning

Lecture 7 Artificial neural networks: Supervised learning Lecture 7 Artificial neural networks: Supervised learning Introduction, or how the brain works The neuron as a simple computing element The perceptron Multilayer neural networks Accelerated learning in

More information

Neural Networks DWML, /25

Neural Networks DWML, /25 DWML, 2007 /25 Neural networks: Biological and artificial Consider humans: Neuron switching time 0.00 second Number of neurons 0 0 Connections per neuron 0 4-0 5 Scene recognition time 0. sec 00 inference

More information

Machine Learning. Neural Networks

Machine Learning. Neural Networks Machine Learning Neural Networks Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 Biological Analogy Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 THE

More information

Lecture 6. Regression

Lecture 6. Regression Lecture 6. Regression Prof. Alan Yuille Summer 2014 Outline 1. Introduction to Regression 2. Binary Regression 3. Linear Regression; Polynomial Regression 4. Non-linear Regression; Multilayer Perceptron

More information

Neural Networks, Computation Graphs. CMSC 470 Marine Carpuat

Neural Networks, Computation Graphs. CMSC 470 Marine Carpuat Neural Networks, Computation Graphs CMSC 470 Marine Carpuat Binary Classification with a Multi-layer Perceptron φ A = 1 φ site = 1 φ located = 1 φ Maizuru = 1 φ, = 2 φ in = 1 φ Kyoto = 1 φ priest = 0 φ

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Neural Nets Supervised learning

Neural Nets Supervised learning 6.034 Artificial Intelligence Big idea: Learning as acquiring a function on feature vectors Background Nearest Neighbors Identification Trees Neural Nets Neural Nets Supervised learning y s(z) w w 0 w

More information

Multilayer Perceptron

Multilayer Perceptron Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Single Perceptron 3 Boolean Function Learning 4

More information

Last update: October 26, Neural networks. CMSC 421: Section Dana Nau

Last update: October 26, Neural networks. CMSC 421: Section Dana Nau Last update: October 26, 207 Neural networks CMSC 42: Section 8.7 Dana Nau Outline Applications of neural networks Brains Neural network units Perceptrons Multilayer perceptrons 2 Example Applications

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

CS 4700: Foundations of Artificial Intelligence

CS 4700: Foundations of Artificial Intelligence CS 4700: Foundations of Artificial Intelligence Prof. Bart Selman selman@cs.cornell.edu Machine Learning: Neural Networks R&N 18.7 Intro & perceptron learning 1 2 Neuron: How the brain works # neurons

More information

LECTURE NOTE #NEW 6 PROF. ALAN YUILLE

LECTURE NOTE #NEW 6 PROF. ALAN YUILLE LECTURE NOTE #NEW 6 PROF. ALAN YUILLE 1. Introduction to Regression Now consider learning the conditional distribution p(y x). This is often easier than learning the likelihood function p(x y) and the

More information

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run

More information

The Perceptron algorithm

The Perceptron algorithm The Perceptron algorithm Tirgul 3 November 2016 Agnostic PAC Learnability A hypothesis class H is agnostic PAC learnable if there exists a function m H : 0,1 2 N and a learning algorithm with the following

More information

AN INTRODUCTION TO NEURAL NETWORKS. Scott Kuindersma November 12, 2009

AN INTRODUCTION TO NEURAL NETWORKS. Scott Kuindersma November 12, 2009 AN INTRODUCTION TO NEURAL NETWORKS Scott Kuindersma November 12, 2009 SUPERVISED LEARNING We are given some training data: We must learn a function If y is discrete, we call it classification If it is

More information

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann (Feed-Forward) Neural Networks 2016-12-06 Dr. Hajira Jabeen, Prof. Jens Lehmann Outline In the previous lectures we have learned about tensors and factorization methods. RESCAL is a bilinear model for

More information

Error Functions & Linear Regression (2)

Error Functions & Linear Regression (2) Error Functions & Linear Regression (2) John Kelleher & Brian Mac Namee Machine Learning @ DIT Overview 1 Introduction Overview 2 Linear Classifiers Threshold Function Perceptron Learning Rule Training/Learning

More information

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms Neural networks Comments Assignment 3 code released implement classification algorithms use kernels for census dataset Thought questions 3 due this week Mini-project: hopefully you have started 2 Example:

More information

Artificial Neural Networks. Part 2

Artificial Neural Networks. Part 2 Artificial Neural Netorks Part Artificial Neuron Model Folloing simplified model of real neurons is also knon as a Threshold Logic Unit x McCullouch-Pitts neuron (943) x x n n Body of neuron f out Biological

More information

LECTURE # - NEURAL COMPUTATION, Feb 04, Linear Regression. x 1 θ 1 output... θ M x M. Assumes a functional form

LECTURE # - NEURAL COMPUTATION, Feb 04, Linear Regression. x 1 θ 1 output... θ M x M. Assumes a functional form LECTURE # - EURAL COPUTATIO, Feb 4, 4 Linear Regression Assumes a functional form f (, θ) = θ θ θ K θ (Eq) where = (,, ) are the attributes and θ = (θ, θ, θ ) are the function parameters Eample: f (, θ)

More information

Logistic Regression & Neural Networks

Logistic Regression & Neural Networks Logistic Regression & Neural Networks CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Graham Neubig, Jacob Eisenstein Logistic Regression Perceptron & Probabilities What if we want a probability

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Course 395: Machine Learning - Lectures

Course 395: Machine Learning - Lectures Course 395: Machine Learning - Lectures Lecture 1-2: Concept Learning (M. Pantic) Lecture 3-4: Decision Trees & CBC Intro (M. Pantic & S. Petridis) Lecture 5-6: Evaluating Hypotheses (S. Petridis) Lecture

More information

Automatic Differentiation and Neural Networks

Automatic Differentiation and Neural Networks Statistical Machine Learning Notes 7 Automatic Differentiation and Neural Networks Instructor: Justin Domke 1 Introduction The name neural network is sometimes used to refer to many things (e.g. Hopfield

More information

Artificial Neural Network : Training

Artificial Neural Network : Training Artificial Neural Networ : Training Debasis Samanta IIT Kharagpur debasis.samanta.iitgp@gmail.com 06.04.2018 Debasis Samanta (IIT Kharagpur) Soft Computing Applications 06.04.2018 1 / 49 Learning of neural

More information

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

More information

Supervised Learning. George Konidaris

Supervised Learning. George Konidaris Supervised Learning George Konidaris gdk@cs.brown.edu Fall 2017 Machine Learning Subfield of AI concerned with learning from data. Broadly, using: Experience To Improve Performance On Some Task (Tom Mitchell,

More information

COMS 4771 Introduction to Machine Learning. Nakul Verma

COMS 4771 Introduction to Machine Learning. Nakul Verma COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW1 due next lecture Project details are available decide on the group and topic by Thursday Last time Generative vs. Discriminative

More information

Machine Learning and Data Mining. Multi-layer Perceptrons & Neural Networks: Basics. Prof. Alexander Ihler

Machine Learning and Data Mining. Multi-layer Perceptrons & Neural Networks: Basics. Prof. Alexander Ihler + Machine Learning and Data Mining Multi-layer Perceptrons & Neural Networks: Basics Prof. Alexander Ihler Linear Classifiers (Perceptrons) Linear Classifiers a linear classifier is a mapping which partitions

More information

Machine Learning (CSE 446): Neural Networks

Machine Learning (CSE 446): Neural Networks Machine Learning (CSE 446): Neural Networks Noah Smith c 2017 University of Washington nasmith@cs.washington.edu November 6, 2017 1 / 22 Admin No Wednesday office hours for Noah; no lecture Friday. 2 /

More information

Learning and Neural Networks

Learning and Neural Networks Artificial Intelligence Learning and Neural Networks Readings: Chapter 19 & 20.5 of Russell & Norvig Example: A Feed-forward Network w 13 I 1 H 3 w 35 w 14 O 5 I 2 w 23 w 24 H 4 w 45 a 5 = g 5 (W 3,5 a

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data January 17, 2006 Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Multi-Layer Perceptrons Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole

More information

Linear Regression, Neural Networks, etc.

Linear Regression, Neural Networks, etc. Linear Regression, Neural Networks, etc. Gradient Descent Many machine learning problems can be cast as optimization problems Define a function that corresponds to learning error. (More on this later)

More information

Neural Networks and Deep Learning

Neural Networks and Deep Learning Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

More information

Single layer NN. Neuron Model

Single layer NN. Neuron Model Single layer NN We consider the simple architecture consisting of just one neuron. Generalization to a single layer with more neurons as illustrated below is easy because: M M The output units are independent

More information

Advanced statistical methods for data analysis Lecture 2

Advanced statistical methods for data analysis Lecture 2 Advanced statistical methods for data analysis Lecture 2 RHUL Physics www.pp.rhul.ac.uk/~cowan Universität Mainz Klausurtagung des GK Eichtheorien exp. Tests... Bullay/Mosel 15 17 September, 2008 1 Outline

More information

Lecture 4: Feed Forward Neural Networks

Lecture 4: Feed Forward Neural Networks Lecture 4: Feed Forward Neural Networks Dr. Roman V Belavkin Middlesex University BIS4435 Biological neurons and the brain A Model of A Single Neuron Neurons as data-driven models Neural Networks Training

More information

Artificial Neural Networks

Artificial Neural Networks Introduction ANN in Action Final Observations Application: Poverty Detection Artificial Neural Networks Alvaro J. Riascos Villegas University of los Andes and Quantil July 6 2018 Artificial Neural Networks

More information

Linear discriminant functions

Linear discriminant functions Andrea Passerini passerini@disi.unitn.it Machine Learning Discriminative learning Discriminative vs generative Generative learning assumes knowledge of the distribution governing the data Discriminative

More information

INTRODUCTION TO ARTIFICIAL INTELLIGENCE

INTRODUCTION TO ARTIFICIAL INTELLIGENCE v=1 v= 1 v= 1 v= 1 v= 1 v=1 optima 2) 3) 5) 6) 7) 8) 9) 12) 11) 13) INTRDUCTIN T ARTIFICIAL INTELLIGENCE DATA15001 EPISDE 8: NEURAL NETWRKS TDAY S MENU 1. NEURAL CMPUTATIN 2. FEEDFRWARD NETWRKS (PERCEPTRN)

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Statistical NLP for the Web

Statistical NLP for the Web Statistical NLP for the Web Neural Networks, Deep Belief Networks Sameer Maskey Week 8, October 24, 2012 *some slides from Andrew Rosenberg Announcements Please ask HW2 related questions in courseworks

More information

CSC 411 Lecture 10: Neural Networks

CSC 411 Lecture 10: Neural Networks CSC 411 Lecture 10: Neural Networks Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla University of Toronto UofT CSC 411: 10-Neural Networks 1 / 35 Inspiration: The Brain Our brain has 10 11

More information

ECE521 Lecture 7/8. Logistic Regression

ECE521 Lecture 7/8. Logistic Regression ECE521 Lecture 7/8 Logistic Regression Outline Logistic regression (Continue) A single neuron Learning neural networks Multi-class classification 2 Logistic regression The output of a logistic regression

More information

Computational statistics

Computational statistics Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial

More information

SGD and Deep Learning

SGD and Deep Learning SGD and Deep Learning Subgradients Lets make the gradient cheating more formal. Recall that the gradient is the slope of the tangent. f(w 1 )+rf(w 1 ) (w w 1 ) Non differentiable case? w 1 Subgradients

More information

Radial-Basis Function Networks

Radial-Basis Function Networks Radial-Basis Function etworks A function is radial () if its output depends on (is a nonincreasing function of) the distance of the input from a given stored vector. s represent local receptors, as illustrated

More information

Lecture 3 Feedforward Networks and Backpropagation

Lecture 3 Feedforward Networks and Backpropagation Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression

More information

17 Neural Networks NEURAL NETWORKS. x XOR 1. x Jonathan Richard Shewchuk

17 Neural Networks NEURAL NETWORKS. x XOR 1. x Jonathan Richard Shewchuk 94 Jonathan Richard Shewchuk 7 Neural Networks NEURAL NETWORKS Can do both classification & regression. [They tie together several ideas from the course: perceptrons, logistic regression, ensembles of

More information

Neural Networks. Nicholas Ruozzi University of Texas at Dallas

Neural Networks. Nicholas Ruozzi University of Texas at Dallas Neural Networks Nicholas Ruozzi University of Texas at Dallas Handwritten Digit Recognition Given a collection of handwritten digits and their corresponding labels, we d like to be able to correctly classify

More information

Multilayer Perceptron

Multilayer Perceptron Aprendizagem Automática Multilayer Perceptron Ludwig Krippahl Aprendizagem Automática Summary Perceptron and linear discrimination Multilayer Perceptron, nonlinear discrimination Backpropagation and training

More information

Artificial Neural Networks

Artificial Neural Networks Artificial Neural Networks Math 596 Project Summary Fall 2016 Jarod Hart 1 Overview Artificial Neural Networks (ANNs) are machine learning algorithms that imitate neural function. There is a vast theory

More information