LECTURE # - NEURAL COMPUTATION, Feb 04: Linear Regression


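Linear Regression

Linear regression assumes a functional form

$$f(x, \theta) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_M x_M \qquad \text{(Eq. 1)}$$

where $x = (x_1, x_2, \dots, x_M)$ are the attributes and $\theta = (\theta_0, \theta_1, \dots, \theta_M)$ are the function parameters.

Example: $f(x, \theta)$ is a sum of four terms, each the product of one parameter and a fixed expression in the attributes [...], where $x = (x_1, x_2)$ are the attributes and $\theta = (\theta_0, \theta_1, \theta_2, \theta_3)$ are the function parameters.

Note that the function $f(x, \theta)$ from the example is linear in the parameters. We can easily transform it into a function of the form (Eq. 1) by introducing one new attribute for each expression in $x_1, x_2$ that multiplies a parameter.

Linear regression is suitable for problems where the functional form $f(x, \theta)$ is known with sufficient certainty.

Learning goal: find the $\theta$ that minimizes the MSE. The MSE is a function of the parameters $\theta$, so the problem of minimizing it can be solved by standard methods of unconstrained optimization.

Linear regression can be represented by a functional form:

$$f(x; \theta) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_M x_M = \sum_{j=0}^{M} \theta_j x_j$$

Note: $x_0$ is a dummy attribute whose value is a constant equal to 1.

Linear regression can also be represented in a graphic form:

[Figure: inputs $x_1, \dots, x_M$, weighted by $\theta_1, \dots, \theta_M$, feeding a single output node.]

Goal: minimize the Mean Square Error (MSE):

$$MSE = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - f(x_i; \theta) \right)^2$$

The MSE is a quadratic function of the parameters $\theta$. It is a convex function, so any local minimum is also a global minimum.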
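As a small illustration of the definitions above, the following sketch evaluates the linear model with the dummy attribute $x_0 = 1$ and computes the MSE. The toy data and the helper names (`add_dummy_attribute`, `predict`, `mse`) are assumptions made for this example, not part of the lecture.

```python
import numpy as np

def add_dummy_attribute(X_raw):
    """Prepend the constant attribute x_0 = 1 to every data point."""
    n_points = X_raw.shape[0]
    return np.hstack([np.ones((n_points, 1)), X_raw])

def predict(X, theta):
    """Linear model f(x; theta) = sum_j theta_j * x_j, one value per row of X."""
    return X @ theta

def mse(X, y, theta):
    """Mean square error, averaged over the N data points."""
    residuals = y - predict(X, theta)
    return np.mean(residuals ** 2)

# Toy data (made up for illustration): y = 1 + 2 * x_1 exactly.
X_raw = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
X = add_dummy_attribute(X_raw)

print(mse(X, y, np.array([1.0, 2.0])))  # 0.0, the parameters that generated y
print(mse(X, y, np.array([0.0, 0.0])))  # 21.0, the mean of 1, 9, 25, 49
```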

Math Background: Unconstrained optimization

Problem: given a function $g(x)$, find its minimum, i.e.

$$x^* = \arg\min_{x \in R^n} g(x)$$

Necessary condition for a point $x^*$ to be a minimum:
$g'(x^*) = 0$, for scalars;
$\nabla g(x^*) = 0$, for vectors (Jacobian, $J$).

Note: if $x = [x_1, x_2, \dots, x_n]$, then

$$\nabla g(x) = \left[ \frac{\partial g(x)}{\partial x_1}, \dots, \frac{\partial g(x)}{\partial x_n} \right]$$

What is the sufficient condition for a point $x^*$ to be a minimum?

Example: [Figure: curve of $g(x)$ with marked points A to G.] In the figure, several points satisfy the necessary condition $g'(x^*) = 0$: point A is the global maximum, point B is the global minimum, point C is a local maximum, point D is a local minimum, and points E, F, G are saddle points.

Sufficient condition for a point $x^*$ that satisfies the necessary condition to be a (local) minimum:
$g''(x^*) > 0$, for scalars;
$\nabla^2 g(x^*)$ is positive definite, for vectors (Hessian, $H$).

Note: the Hessian is defined as $H = \{h_{ij}\}$, with

$$h_{ij} = \frac{\partial^2 g(x)}{\partial x_i \, \partial x_j}$$
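The two conditions can be checked numerically on a small example. The sketch below uses a hypothetical quadratic $g(x) = \frac{1}{2} x^T H x - b^T x$ (chosen for this illustration only), whose gradient is $Hx - b$ and whose Hessian is the constant matrix $H$.

```python
import numpy as np

# Hypothetical quadratic g(x) = 0.5 * x^T H x - b^T x (not from the lecture).
H = np.array([[2.0, 1.0],
              [1.0, 4.0]])   # symmetric, so it is also the Hessian of g
b = np.array([1.0, 0.0])

def g(x):
    return 0.5 * x @ H @ x - b @ x

def grad_g(x):
    return H @ x - b

# Necessary condition: the gradient vanishes at x*, i.e. H x* = b.
x_star = np.linalg.solve(H, b)
print(grad_g(x_star))        # numerically zero

# Sufficient condition: the Hessian is positive definite,
# i.e. all eigenvalues of the symmetric matrix H are positive.
eigenvalues = np.linalg.eigvalsh(H)
print(eigenvalues > 0)
```

For a general (non-quadratic) function the Hessian depends on $x$ and has to be evaluated at the candidate point $x^*$.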

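MSE Solution

Because the MSE is convex, the condition

$$\frac{\partial MSE}{\partial \theta_j} = 0, \quad j = 0, 1, \dots, M$$

is sufficient as well as necessary. Therefore, find $\theta_j$ such that

$$\frac{\partial MSE}{\partial \theta_j} = -\frac{2}{N} \sum_{i=1}^{N} \left( y_i - \sum_{k=0}^{M} \theta_k x_{ik} \right) x_{ij} = 0$$

There are $M+1$ linear equations with $M+1$ unknown variables, so we can get a closed-form solution.

Special case: if some attribute is a linear combination of others, there is no unique solution.

$$\frac{\partial MSE}{\partial \theta_j} = 0 \;\Rightarrow\; \sum_{i=1}^{N} y_i x_{ij} = \sum_{k=0}^{M} \theta_k \sum_{i=1}^{N} x_{ik} x_{ij}, \quad j = 0, 1, \dots, M$$

or, in matrix form,

$$X^T Y = X^T X \theta$$

where:
$X_{[N \times (M+1)]} = \{x_{ij}\}$, $i = 1{:}N$, $j = 0{:}M$ ($x_{ij}$ is the $j$th attribute of the $i$th data point),
$Y_{[N \times 1]} = \{y_i\}$, $i = 1{:}N$,
$\theta_{[(M+1) \times 1]} = \{\theta_j\}$, $j = 0{:}M$.

Note: $D = [X \; Y]$, i.e., $[X \; Y]$ is what we defined previously as the data set.

The optimal parameter choice is then

$$\theta = (X^T X)^{-1} X^T Y$$

which is a closed-form solution.

Note: the above solution exists if $X^T X$ is invertible, i.e. if its rank equals $M+1$, i.e. if no attribute is a linear combination of others (in Matlab, use the function rank).

Note: using matrix derivatives we can do the optimization in a more elegant way by defining

$$MSE = \frac{1}{N} (Y - X\theta)^T (Y - X\theta)$$

$$\nabla_\theta MSE = -\frac{2}{N} X^T (Y - X\theta) = 0_{[(M+1) \times 1]} \;\Rightarrow\; \theta = (X^T X)^{-1} X^T Y$$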
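A minimal NumPy sketch of the closed-form solution follows. The synthetic data, the random seed, and the variable names are assumptions made for illustration. The code solves the normal equations $X^T X \theta = X^T Y$ with `np.linalg.solve` rather than forming the inverse explicitly, which is a common numerical choice and gives the same $\theta$ when $X^T X$ is invertible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data set (illustrative only): N points, M attributes plus x_0 = 1.
N, M = 50, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, M))])
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true + 0.01 * rng.normal(size=N)

# A unique solution exists only if rank(X^T X) = M + 1.
rank = np.linalg.matrix_rank(X.T @ X)
print("rank:", rank, "required:", M + 1)

# Solve the normal equations X^T X theta = X^T Y.
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta_hat)

# At the solution the gradient term X^T (Y - X theta) is numerically zero.
print(X.T @ (Y - X @ theta_hat))
```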

Nonlinear Regression

Question: what if we know that $f(x; \theta)$ is a nonlinear parametric function? For example: $f(x; \theta) = [\dots]$, a function of $\theta_0, \theta_1, \theta_2$ that is nonlinear in the parameters.

Solution: minimize

$$MSE = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - f(x_i; \theta) \right)^2$$

Start from the necessary condition for a minimum:

$$\frac{\partial MSE}{\partial \theta_j} = -\frac{2}{N} \sum_{i=1}^{N} \left( y_i - f(x_i; \theta) \right) \frac{\partial f(x_i; \theta)}{\partial \theta_j} = 0$$

Again we have to solve a system of equations with as many unknowns as there are parameters, but now the equations are nonlinear, and a closed-form solution is not easy to derive.

Math Background: Unconstrained Optimization

Problem: given $f(x)$, find its minimum.

Popular solution: use the gradient descent algorithm.

Idea: the gradient of $f(x)$ at the minimum is the zero vector. So,
1. start from an initial guess $x^0$;
2. calculate the gradient $\nabla f(x^0)$;
3. move in the direction opposite to the gradient, i.e., generate a new guess as $x^1 = x^0 - \alpha \nabla f(x^0)$, where $\alpha$ is a properly selected constant;
4. repeat this process until convergence to the minimum.
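The four steps of gradient descent can be sketched as follows. The test function, the step size `alpha`, and the stopping rule (stop when the update becomes smaller than `tol`) are assumptions chosen for this illustration; the lecture only specifies the update $x^{k+1} = x^k - \alpha \nabla f(x^k)$.

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Repeat x <- x - alpha * grad(x) until the step is smaller than tol."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = alpha * grad(x)
        x = x - step                      # move against the gradient
        if np.linalg.norm(step) < tol:    # assumed convergence test
            break
    return x

# Hypothetical test function f(x) = (x_1 - 3)^2 + 2 * (x_2 + 1)^2,
# whose unique minimum is at (3, -1).
def grad_f(x):
    return np.array([2.0 * (x[0] - 3.0), 4.0 * (x[1] + 1.0)])

x_min = gradient_descent(grad_f, x0=[0.0, 0.0])
print(x_min)  # close to [3, -1]
```

If `alpha` is too large the iterates can diverge, and if it is too small convergence is slow, which is why the text asks for a properly selected constant.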

Two problems with the gradient descent algorithm:
1. It may converge to a local minimum. The simplest way to reduce this risk is to repeat the procedure starting from multiple initial guesses.
2. Convergence to a minimum may be slow. There are a number of algorithms providing faster convergence (e.g. conjugate gradient; second-order methods such as Newton or quasi-Newton; nonderivative methods).

Back to solving nonlinear regression using the gradient descent procedure:
Step 1: Start from an initial guess for the parameters, $\theta^0$.
Step 2: Update the parameters as $\theta^{k+1} = \theta^k - \alpha \nabla_\theta MSE(\theta^k)$.

Special case: for linear prediction the update step would be

$$\theta^{k+1} = \theta^k + \alpha X^T (Y - X\theta^k)$$

where the constant factor $2/N$ of the gradient has been absorbed into $\alpha$.
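For the linear special case, gradient descent should arrive at the same parameters as the closed-form solution $\theta = (X^T X)^{-1} X^T Y$. The sketch below checks this on synthetic data. The data, the step size, and the fixed iteration count are assumptions for illustration, and the gradient is written with its $2/N$ factor instead of absorbing it into $\alpha$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic linear data (illustrative only), with dummy attribute x_0 = 1.
N = 100
X = np.hstack([np.ones((N, 1)), rng.uniform(-1.0, 1.0, size=(N, 2))])
Y = X @ np.array([0.5, 2.0, -1.0]) + 0.05 * rng.normal(size=N)

def mse_gradient(theta):
    """Gradient of MSE = (1/N) * ||Y - X theta||^2 with respect to theta."""
    return -(2.0 / N) * X.T @ (Y - X @ theta)

theta = np.zeros(3)      # Step 1: initial guess
alpha = 0.1              # assumed step size
for _ in range(5000):    # Step 2, repeated a fixed number of times
    theta = theta - alpha * mse_gradient(theta)

theta_closed_form = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta)
print(theta_closed_form)
```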

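Neural Networks

Recall: for linear regression, we can represent the predictor as a network:

[Figure: inputs $x_1, \dots, x_m$ with weights $\theta_1, \dots, \theta_m$ (and $\theta_0$ on the dummy input) feeding a summation node with output $o$.]

$$o = \sum_{j=0}^{m} \theta_j x_j$$

And logistic regression can be represented by:

[Figure: the same network, with the weighted sum passed through a function $g(z)$ to produce the output $o$.]

Neural networks generalize this idea. The first step in their definition requires the definition of a neuron. We will see that a neural network is constructed from a collection of neurons.

This is a neuron:

[Figure: inputs $x_1, \dots, x_m$ with weights $\omega_1, \dots, \omega_m$ and a bias $\omega_0$, feeding a summation node followed by $g(z)$, with output $o$.]

It is standard to use $\omega$ to represent the weights in a neural network. The bias $\omega_0$ plays a role analogous to the constant term in a regression model. It is an important component of a neuron, but we will generally omit it from the diagrams.

The type of a neuron is determined by the form of the function $g(z)$. Some examples include:

1. $g(z) = z$: equivalent to linear regression. This is a linear neuron.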

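2. $g(z) = \dfrac{1}{1 + e^{-z}}$: similar to logistic regression. This is a sigmoid neuron.

3. $g(z) = 1$ if $z > 0$ and $g(z) = 0$ if $z \le 0$ (some texts, including Mitchell, use $-1$ instead of $0$). This is a perceptron.

Refer to Mitchell, Machine Learning, chapter 4, for good coverage of neural networks.

Example 1: A perceptron with two inputs divides the plane into two parts with a straight line (in higher dimensions, it divides the input space with a hyperplane):

[Figure: a straight line separating two classes of points in the plane.]

Example 2: There are some training sets that cannot be separated by a perceptron:

[Figure: points labeled + and - arranged so that no single straight line separates the two classes.]

We will often apply a simplified visual notation to represent neurons, by omitting the summation node and the function $g(z)$, and simply display a neuron as:

[Figure: a neuron drawn as a single node.]

The field of Neural Networks is a large one, so we will restrict our attention to a particular class of NNs, namely,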
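The three neuron types differ only in the function $g(z)$ applied to the weighted sum. A minimal sketch, in which the function names, the 0/1 output convention for the perceptron, and the example weights are assumptions made for illustration:

```python
import numpy as np

def linear(z):
    """Linear neuron: g(z) = z."""
    return z

def sigmoid(z):
    """Sigmoid neuron: g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def step(z):
    """Perceptron: 1 if z > 0, else 0 (assumed output convention)."""
    return np.where(z > 0, 1.0, 0.0)

def neuron(x, w, w0, g):
    """Output o = g(w0 + sum_j w_j * x_j), where w0 is the bias."""
    return g(w0 + np.dot(w, x))

# Hypothetical perceptron whose decision boundary is the line x_1 + x_2 = 1.
w, w0 = np.array([1.0, 1.0]), -1.0
print(neuron(np.array([2.0, 2.0]), w, w0, step))     # 1.0, above the line
print(neuron(np.array([0.0, 0.0]), w, w0, step))     # 0.0, below the line
print(neuron(np.array([0.5, 0.5]), w, w0, sigmoid))  # 0.5, exactly on the line
```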

Feedforward Neural Networks

In many situations, we will view a neural network as a black box, having a set of inputs and a single output. (Actually, NNs are not restricted to a single output. In a classification setting, for instance, we might have one output for each class.) So the simple picture looks like:

[Figure: inputs $x_1, \dots, x_m$ entering a box labeled NN, with a single output $o$.]

In a multilayer feedforward NN, the details inside the black box look like this:

[Figure: inputs $x_1, \dots, x_m$, followed by one or more hidden layers and an output layer producing $o$.]

In this layout, all neurons are either hidden neurons or output neurons. Note that there are many choices available here: How many hidden layers are there? What type are the neurons (perceptrons, sigmoid, etc.)? How many neurons are there per layer?

Our goal is to find a way to determine the optimal weights to be assigned to the branches. To get a sense of what is involved, consider the simpler case of a multilayer feedforward NN with a single hidden layer:

[Figure: inputs $x_1, \dots, x_m$ connected by weights $\omega_{ji}$ to hidden neurons with outputs $z_1, \dots, z_K$, which are connected by weights $\omega_1, \dots, \omega_K$ to the output $o$.]

In this diagram, note that (i) the terms $\{z_1, \dots, z_K\}$ represent the intermediate outputs coming from the hidden layer, and (ii) the weights in the input layer are labeled with the output index first (e.g., $\omega_{km}$ is the weight on the branch from $x_m$ to the $k$-th neuron in the hidden layer); this looks backward but turns out to be convenient.

It is clear that this network represents a function that maps the inputs to the output and that depends on the values of the weights. In other words, a neural network is a parametric function of its inputs. In this case, if $o = f(x, \omega)$, then we can state:

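Goal: find optimal weights $\omega$ that minimize

$$MSE = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - f(x_i, \omega) \right)^2$$

For simplicity, suppose that the nonlinear functions in all neurons are the same function $g$. Then (omitting the bias terms, as in the diagrams)

$$o = g\left( \sum_{j=1}^{K} \omega_j z_j \right) = g\left( \sum_{j=1}^{K} \omega_j \, g\left( \sum_{i=1}^{m} \omega_{ji} x_i \right) \right)$$

To minimize the MSE, we can apply the gradient descent algorithm. In the course of this algorithm, it will be necessary to compute terms like:

$$\frac{\partial MSE}{\partial \omega_{ji}} = \frac{\partial MSE}{\partial o} \cdot \frac{\partial o}{\partial \omega_{ji}} = \; ???$$

It's clear that these terms will get complicated, and that we need to figure out how to solve for the weights efficiently. We'll see the solution shortly in the backpropagation algorithm.

First, we note that there are two variations of the gradient descent algorithm:

o The batch version, which is the version that we've already seen: define $MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - f(x_i, \omega))^2$, and let the iteration step of the algorithm be $\omega^{k+1} = \omega^k - \alpha \nabla_\omega MSE$. This is called batch because each update of the weights comes from looking at all of the data points.

o The stochastic version, which will be used to explain the backpropagation algorithm: in this case an update of the weights occurs for every data point. Define $mse = (y - f(x, \omega))^2$ for a single data point $(x, y)$, and let the iteration step be $\omega^{k+1} = \omega^k - \alpha \nabla_\omega mse$.

We can motivate the backpropagation algorithm on neural networks by first considering a simpler setting. Consider a function defined by $z = G(f_1(x), f_2(x), \dots, f_n(x))$, which can be represented graphically as:

[Figure: the input $x$ feeds $f_1, \dots, f_n$, which produce intermediate outputs $y_1, \dots, y_n$; these feed $G$, which produces $z$.]

Suppose we want to compute $\partial z / \partial x$. Intuitively, if $x$ changes by a small amount, each of the intermediate outputs $y_i$ changes accordingly, and these changes all contribute to a change in $z$. This is made precise in the chain rule:

$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y_1} \frac{\partial y_1}{\partial x} + \frac{\partial z}{\partial y_2} \frac{\partial y_2}{\partial x} + \dots + \frac{\partial z}{\partial y_n} \frac{\partial y_n}{\partial x} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i} \frac{\partial y_i}{\partial x}$$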
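The nested formula for $o$ maps directly onto two matrix-vector products. The sketch below is a minimal forward pass for a network with one hidden layer of sigmoid neurons. It assumes bias terms are omitted and stores the hidden-layer weights in a matrix `W_hidden` whose entry `[j, i]` is $\omega_{ji}$, following the output-index-first convention. The names and example weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hidden, w_out):
    """Single-hidden-layer network with the same g (sigmoid) in every neuron.

    W_hidden[j, i] is the weight from input x_i to hidden neuron j;
    w_out[j] is the weight from hidden neuron j to the output neuron.
    """
    z = sigmoid(W_hidden @ x)   # intermediate outputs z_1, ..., z_K
    o = sigmoid(w_out @ z)      # network output o = f(x, w)
    return z, o

# Hypothetical weights: m = 3 inputs, K = 2 hidden neurons.
x = np.array([1.0, 0.0, -1.0])
W_hidden = np.array([[0.5, -0.3, 0.5],
                     [1.0, 2.0, 0.0]])
w_out = np.array([1.0, -1.0])

z, o = forward(x, W_hidden, w_out)
print(z)  # first hidden neuron has net input 0, so z[0] = 0.5
print(o)

# Stochastic-version error for one data point (x, y), with y assumed to be 1.
y = 1.0
print((y - o) ** 2)
```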

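This pattern is similar when computing partial derivatives of mse with respect to the various weights, though the details are slightly more involved.

The Backpropagation Algorithm

Consider this small piece of a NN:

[Figure: neuron $j$ with output $o_j$ feeds neuron $k$ through the weight $\omega_{kj}$; neuron $k$ has net input $net_k$ and output $o_k = g(net_k)$, which feeds each downstream neuron $l$ through the weight $\omega_{lk}$; neuron $l$ has net input $net_l$ and output $o_l = g(net_l)$.]

To perform the update of weights in the stochastic version of the gradient descent algorithm, we need to calculate all partial derivatives of mse with respect to the weights. One application of the chain rule gives:

$$\frac{\partial mse}{\partial \omega_{kj}} = \frac{\partial mse}{\partial net_k} \cdot \frac{\partial net_k}{\partial \omega_{kj}} = (1) \cdot (2)$$

To compute (2), we observe that $net_k = \sum_i o_i \omega_{ki}$, so

$$\frac{\partial net_k}{\partial \omega_{kj}} = o_j$$

To compute (1), use the chain rule to yield

$$\frac{\partial mse}{\partial net_k} = \sum_{l} \frac{\partial mse}{\partial net_l} \cdot \frac{\partial net_l}{\partial net_k}$$

where the summation is over all neurons $l \in downstream(k)$. If we define $\delta_k = \dfrac{\partial mse}{\partial net_k}$, then this result translates to

$$\delta_k = \sum_{l} \delta_l \frac{\partial net_l}{\partial net_k}$$

But since $net_l = \sum_i \omega_{li} \, g(net_i)$, we see that (assuming all neurons are sigmoid: $g(z) = 1 / (1 + e^{-z})$):

$$\frac{\partial net_l}{\partial net_k} = \omega_{lk} \, g'(net_k) = \omega_{lk} \, g(net_k) (1 - g(net_k)) = \omega_{lk} \, o_k (1 - o_k)$$

What we've shown is that the computation of $\dfrac{\partial mse}{\partial \omega_{kj}}$ can be performed by knowing the values of $\delta_l$ for all neurons $l$ downstream from $k$. The backpropagation algorithm consists of computing the values $\delta_l$ starting from the output layer and proceeding backward through the network. In this way, all desired partial derivatives can be computed efficiently (and in a manner that lends itself well to a matrix implementation).

The only thing left to specify is how to start off the algorithm (i.e., what to do at the output neuron):

[Figure: the output neuron, with net input $net$, function $g$, and output $o$.]

Since $mse = (y - o)^2$, we have

$$\delta = \frac{\partial mse}{\partial net} = \frac{\partial mse}{\partial o} \cdot \frac{\partial o}{\partial net} = -2 (y - o) \, o (1 - o)$$

And the algorithm (minus a lot of computational details) is complete.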
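The derivation can be turned into a short sketch for a network with one hidden layer of sigmoid neurons and a sigmoid output. It assumes $mse = (y - o)^2$ for a single data point and omits bias terms. The function names, the weight layout (`W_hidden[j, i]` is the weight from input $i$ to hidden neuron $j$), and the finite-difference check are assumptions added for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hidden, w_out):
    z = sigmoid(W_hidden @ x)   # hidden outputs
    o = sigmoid(w_out @ z)      # network output
    return z, o

def backprop(x, y, W_hidden, w_out):
    """Gradients of mse = (y - o)^2 with respect to all weights."""
    z, o = forward(x, W_hidden, w_out)
    # Output neuron: delta = d mse / d net = -2 (y - o) o (1 - o).
    delta_out = -2.0 * (y - o) * o * (1.0 - o)
    # Hidden neuron j: delta_j = delta_out * w_out[j] * z_j (1 - z_j),
    # since the output neuron is its only downstream neuron.
    delta_hidden = delta_out * w_out * z * (1.0 - z)
    # d mse / d w = delta of the receiving neuron times the sending output.
    grad_w_out = delta_out * z
    grad_W_hidden = np.outer(delta_hidden, x)
    return grad_W_hidden, grad_w_out

def numerical_grad_w_out(x, y, W_hidden, w_out, eps=1e-6):
    """Central finite differences, used only to sanity-check backprop."""
    grad = np.zeros_like(w_out)
    for j in range(w_out.size):
        w_plus, w_minus = w_out.copy(), w_out.copy()
        w_plus[j] += eps
        w_minus[j] -= eps
        mse_plus = (y - forward(x, W_hidden, w_plus)[1]) ** 2
        mse_minus = (y - forward(x, W_hidden, w_minus)[1]) ** 2
        grad[j] = (mse_plus - mse_minus) / (2.0 * eps)
    return grad

# Hypothetical data point and random weights: 3 inputs, 4 hidden neurons.
rng = np.random.default_rng(0)
x, y = np.array([0.5, -1.0, 2.0]), 1.0
W_hidden, w_out = rng.normal(size=(4, 3)), rng.normal(size=4)

grad_W_hidden, grad_w_out = backprop(x, y, W_hidden, w_out)
print(np.max(np.abs(grad_w_out - numerical_grad_w_out(x, y, W_hidden, w_out))))
```

One stochastic gradient descent step would then be `w_out -= alpha * grad_w_out` and `W_hidden -= alpha * grad_W_hidden` for a chosen step size `alpha`.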

Main (theoretical) results on NNs

Given a large enough NN with sigmoid neurons:

o It can approximate an arbitrary continuous function (this can actually be done with a NN with two hidden layers).

o It can learn an arbitrary Boolean function (and one hidden layer is sufficient).

Problems with NNs

The primary problem is that there are likely to be an extremely large number of local minima (though this difficulty is mitigated by the fact that there is a good chance that a local minimum will provide a good, though still suboptimal, solution). Two approaches to dealing with this problem are (i) variations of gradient descent (e.g., gradient descent with an added momentum component, which may allow the algorithm to move past local minima to, at the very least, better local minima); and (ii) the strategy of training the NN from several different starting points and choosing the best resulting solution.
