Neural Networks: Derivation
Compiled by Alvin Wan from Professor Jitendra Malik's lecture

This type of computation is called deep learning and is the most popular method for many problems, such as computer vision and image processing.

1 Model

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.

1.1 Single-Layer Neural Network

Let us call the inputs $x_1, x_2, \ldots, x_n$. We have several nodes $v_1, v_2, \ldots, v_n$, which each receive a linear combination of the inputs, $w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$. The convention for each edge weight is to list the index of the source in the first position and the index of the destination in the second. For example, the edge linking input 1 to node 2 would be $w_{12}$. The total input for node 2 would be

$$S_2 = \sum_{i=1}^{n} w_{i2} x_i$$

There are a variety of ways to translate $S_2$ into a decision,

$$x_2 = g(S_2)$$

We call $g$ the activation function. There are several common options for activation functions:

1. Logistic: $g(z) = \frac{1}{1 + e^{-z}}$

2. ReLU (rectified linear unit): $g(z) = \begin{cases} 0 & \text{for } z \le 0 \\ z & \text{for } z > 0 \end{cases}$

In a single-layer neural network, if $g$ is logistic, we would simply have logistic regression.
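To make the notation concrete, here is a minimal sketch of a single node's computation in NumPy; the inputs and weights are made-up values for illustration.

```python
# A minimal sketch of one node's computation: S = sum_i w_i x_i, then g(S).
# The inputs and weights below are made-up values for illustration.
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))      # 1 / (1 + e^{-z})

def relu(z):
    return np.maximum(0.0, z)            # 0 for z <= 0, z for z > 0

x = np.array([0.5, -1.0, 2.0])           # inputs x_1 ... x_n
w = np.array([0.1, 0.4, -0.3])           # weights into one node
S = np.dot(w, x)                         # total input S to the node
print(logistic(S), relu(S))              # the node's output under each activation
```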

1.2 Multi-Layer Neural Networks

The set of inputs could also be the output from another stage of the neural network. Now, we will consider a three-layer neural network. We call the inputs $x$, the second layer $v$ and the third layer $o$, where all activation functions $g$ are the same. We can compute the outputs using the following:

$$o_i = g\left(\sum_j w_{ji}\, g\left(\sum_k w_{kj} x_k\right)\right)$$

Suppose the first stage and the second stage are both linear. Is this effective? Of course not. This is effectively one layer of linear combinations. As a result, it is important that you have some non-linearity. With that said, even the simple kink of a ReLU at zero is sufficient.

2 Training

What does training a neural network mean? It means finding a $w$ such that the output $o$ is as close as possible to $y$, the desired output: the values for a regression problem or the labels for a classification problem. Here is our 3-step approach:

1. Define the loss function $L(w)$.

2. Compute $\nabla_w L$.

3. Update $w_{\text{new}} \leftarrow w_{\text{old}} - \eta \nabla_w L$.

Computing the gradient with respect to $w$ means computing partial derivatives. In other words, we compute $\frac{\partial L}{\partial w_{jk}}$ for all edges using the chain rule. We want to know: how can we tweak each edge weight so that our loss is minimized?
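Here is a hedged sketch of the 3-step loop, with a toy quadratic loss standing in for $L(w)$ so the gradient is easy to write down; the names and values are illustrative.

```python
# Sketch of the update rule w_new <- w_old - eta * grad_L(w).
# grad_L is a placeholder for whatever computes the gradient of the loss;
# here the toy loss is L(w) = ||w||^2 / 2, whose gradient is w itself.
import numpy as np

def gradient_descent(w, grad_L, eta=0.1, steps=100):
    for _ in range(steps):
        w = w - eta * grad_L(w)          # step 3: move against the gradient
    return w

w0 = np.array([3.0, -2.0])
print(gradient_descent(w0, grad_L=lambda w: w))   # approaches the minimizer at 0
```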

2.1 Single-Layer Neural Networks

For binary classification, an effective loss function is the cross entropy. For regression, use squared error. Note that the following comes from the maximum likelihood estimate for the logistic model:

$$L = -\sum_i \left( y_i \log o_i + (1 - y_i) \log (1 - o_i) \right)$$

We can model the activation function $g$ as a sigmoid, $g(z) = \frac{1}{1 + e^{-z}}$. Finding $w$ reduces to logistic regression, so we can use stochastic gradient descent.

2.2 Two-Layer Neural Network

Note that this generalizes to multiple layers. We can compute the gradient with respect to all the weights, from the input to the hidden layer and from the hidden layer to the output layer. The vector we get may be massive; it is linear in size with respect to the number of weights. We can use stochastic gradient descent. Despite the fact that it only finds local minima, this is sufficient for most applications. Computing gradients the naive way is quadratic in the number of weights; backpropagation is a trick that allows us to compute the gradient in linear time. We can also add a regularization term to improve performance.
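Circling back to the single-layer case of Section 2.1: differentiating the cross entropy through the sigmoid gives the familiar logistic-regression gradient $(o - y)x$ for one example. Here is a sketch of one stochastic gradient step; the example, step size, and loop count are illustrative.

```python
# One stochastic gradient step for a single sigmoid node with cross-entropy loss.
# Differentiating the loss through the sigmoid gives dL/dw = (o - y) * x.
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(o, y):
    return -(y * np.log(o) + (1 - y) * np.log(1 - o))

def sgd_step(w, x, y, eta=0.5):
    o = logistic(np.dot(w, x))           # forward pass for one example
    grad = (o - y) * x                   # gradient of the cross entropy w.r.t. w
    return w - eta * grad

w = np.zeros(3)
x, y = np.array([1.0, 0.5, -0.5]), 1.0   # one illustrative training example
for _ in range(10):
    w = sgd_step(w, x, y)
print(cross_entropy(logistic(np.dot(w, x)), y))   # loss shrinks as w improves
```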

Here is the simple idea. The following is an invocation of $g$ with many arguments, one of which is $w_{jk}$:

$$o = g(\ldots, w_{jk}, \ldots, x)$$

We can compute the following perturbed version:

$$o' = g(\ldots, w_{jk} + \Delta w_{jk}, \ldots, x)$$

Then, we can compute the losses $L(o)$ and $L(o')$. We now have a numerical approximation for the gradient:

$$\frac{\partial L}{\partial w_{jk}} \approx \frac{L(o') - L(o)}{\Delta w_{jk}}$$

Evaluating the entire network like this is what is called a forward pass, and its cost is linear in the number of weights. Given $n$ weights, the approximation is therefore $O(n)$ for a single weight, but repeating it for all $n$ weights costs $O(n^2)$, which is extremely expensive.
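A sketch of this naive finite-difference approach follows; loss_fn stands in for a full forward pass plus loss, and the loop over all weights is exactly what makes the total cost quadratic.

```python
# Naive numerical gradient: perturb each weight, rerun the forward pass,
# and take the difference quotient. One extra forward pass per weight => O(n^2) total.
import numpy as np

def numerical_gradient(loss_fn, w, delta=1e-5):
    grad = np.zeros_like(w)
    base = loss_fn(w)                     # one forward pass for the reference loss
    for k in range(w.size):               # one extra forward pass per weight
        w_perturbed = w.copy()
        w_perturbed[k] += delta
        grad[k] = (loss_fn(w_perturbed) - base) / delta
    return grad

# Toy check with L(w) = ||w||^2 / 2, whose true gradient is w.
w = np.array([1.0, -2.0, 0.5])
print(numerical_gradient(lambda v: 0.5 * np.sum(v ** 2), w))
```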

3 Backpropagation Algorithm

First, let us consider the big picture. We can see that computation can be, and should be, shared, so we will try to minimize work in this fashion. Consider an input layer, a hidden layer, and an output layer. We will compute a variable called $\delta$ at the output layer. Using those deltas, we will compute the deltas for the hidden layer, and finally the deltas at the input layer. This is why the algorithm is called backpropagation: we move from the outputs to the inputs. We can consider this a form of induction, where the output layer is the base case.

3.1 Chain Rule

Let us move on to finer details and develop a more precise structure. Recall the chain rule from calculus. Consider the following:

$$f = x^2 z, \qquad x = 2y, \qquad z = \sin y$$

To compute $\frac{\partial f}{\partial y}$, we apply the chain rule. Intuitively, a small change in $y$ causes a change in $z$, which changes $f$. Alternatively, a small change in $y$ directly affects $x$, which also changes $f$. There are, in effect, two paths from $y$ to $f$. The chain rule tells us that

$$\frac{\partial f}{\partial y} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial y} + \frac{\partial f}{\partial z}\frac{\partial z}{\partial y}$$

3.2 Notation

To distinguish between weights in different layers, we will use superscripts. $w_{ij}^{(l)}$ is the weight on the link from node $i$ in layer $l-1$ to node $j$ in layer $l$. Note that $l$ is the layer of the target node. Let $L$ be the index of the highest layer, so if we have $L = 10$, then $l$ ranges from 1 to 10. $d^{(l)}$ will denote the number of nodes in layer $l$. The notation is fairly confusing, but we will try to follow the aforementioned conventions.

To compute $x_j^{(l)}$ for some node $j$ in layer $l$, we have the following. We simply take the equation from before and add superscripts to denote layers:

$$S_j^{(l)} = \sum_{i=1}^{d^{(l-1)}} w_{ij}^{(l)} x_i^{(l-1)}, \qquad x_j^{(l)} = g(S_j^{(l)})$$
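Under this layered notation, a forward pass can cache each $S^{(l)}$ and $x^{(l)}$, which is exactly what the delta computations below will reuse. Here is a sketch, with a list of weight matrices W[l][i, j] standing in for $w_{ij}^{(l)}$ and illustrative shapes and values.

```python
# Forward pass under the layered notation: S^(l) = sum_i w_ij^(l) x_i^(l-1), x^(l) = g(S^(l)).
# W is a list of weight matrices with W[l][i, j] playing the role of w_ij^(l).
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, W, g=relu):
    xs, Ss = [x], []                     # x^(0) is the input layer
    for W_l in W:
        S = W_l.T @ xs[-1]               # S_j^(l) = sum_i w_ij^(l) x_i^(l-1)
        Ss.append(S)
        xs.append(g(S))                  # x_j^(l) = g(S_j^(l))
    return xs, Ss

# Illustrative shapes: 3 inputs -> 4 hidden nodes -> 1 output.
rng = np.random.default_rng(0)
W = [rng.normal(size=(3, 4)), rng.normal(size=(4, 1))]
xs, Ss = forward(np.array([0.5, -1.0, 2.0]), W)
```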

3.3 Defining δ

We will use $e$ for the error, which is equivalent to the loss. Let us first define the error at a specific node:

$$\delta_j^{(l)} = \frac{\partial e}{\partial S_j^{(l)}}$$

Using these deltas, we can then compute the gradients. Here is how. Recall that we are looking for the gradient of the error with respect to a weight. The weight $w_{ij}^{(l)}$ affects $S_j^{(l)}$, and $S_j^{(l)}$ then affects $x_j^{(l)}$. We can also say the reverse: the error at $x_j^{(l)}$ is a function of the error in $S_j^{(l)}$, which is in turn a function of the error in $w_{ij}^{(l)}$. More formally, we can apply the chain rule to re-express the derivative of $e$ with respect to the weight:

$$\frac{\partial e}{\partial w_{ij}^{(l)}} = \frac{\partial e}{\partial S_j^{(l)}} \frac{\partial S_j^{(l)}}{\partial w_{ij}^{(l)}}$$

We can now use our delta notation. In fact, note that the first term is simply our delta:

$$\frac{\partial e}{\partial w_{ij}^{(l)}} = \delta_j^{(l)} \frac{\partial S_j^{(l)}}{\partial w_{ij}^{(l)}}$$

$S_j^{(l)}$ is a linear combination of all the incoming weights, but we are taking the derivative with respect to a specific $w_{ij}^{(l)}$. As a result, we have that $\frac{\partial S_j^{(l)}}{\partial w_{ij}^{(l)}} = x_i^{(l-1)}$, so

$$\frac{\partial e}{\partial w_{ij}^{(l)}} = \delta_j^{(l)} x_i^{(l-1)}$$

If we can compute deltas at every node in the neural network, we have an extremely simple formula for the gradient of every weight. As a result, we have achieved our goal, which was to compute the gradient of the error with respect to each weight. We will now compute the gradient of the error with respect to each node.
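Before moving on, note that in matrix form this weight-gradient formula is just an outer product of the previous layer's activations with the deltas; a small sketch with illustrative values:

```python
# Gradient of the error w.r.t. one layer's weights: de/dw_ij^(l) = delta_j^(l) * x_i^(l-1).
# As a matrix, that is the outer product of the previous layer's activations and the deltas.
import numpy as np

x_prev = np.array([0.5, -1.0, 2.0])       # x^(l-1), illustrative values
delta = np.array([0.1, -0.2, 0.05, 0.3])  # delta^(l), illustrative values
grad_W = np.outer(x_prev, delta)          # grad_W[i, j] = x_i^(l-1) * delta_j^(l)
print(grad_W.shape)                       # (3, 4), matching the layer's weight matrix
```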

3.4 Base Case

We will now build the full algorithm using induction. First, run the entire network, so that we have the values $g(S^{(L)})$. We compute $\delta$ in the final layer to start:

$$\delta_j^{(L)} = \frac{\partial e}{\partial S_j^{(L)}}$$

Given the following error function, we can plug in and take the derivative to compute $\delta$:

$$e = \frac{1}{2}\left(g(S_j^{(L)}) - y_j\right)^2$$

We compute the deltas for our output layer first. Apply the chain rule, and note that $y_j$ is constant with respect to $w$:

$$\delta_j^{(L)} = \frac{\partial}{\partial S_j^{(L)}}\left[\frac{1}{2}\left(g(S_j^{(L)}) - y_j\right)^2\right] = \left(g(S_j^{(L)}) - y_j\right) g'(S_j^{(L)})$$

Suppose $g$ is the ReLU. Then we have the following possible values for $g'(S_j^{(L)})$:

$$g'(S_j^{(L)}) = \begin{cases} 1 & \text{if } S_j^{(L)} > 0 \\ 0 & \text{otherwise} \end{cases}$$
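A small sketch of this base case in code, assuming squared error and the ReLU activation as above; the values are illustrative.

```python
# Output-layer delta for squared error: delta^(L) = (g(S^(L)) - y) * g'(S^(L)).
import numpy as np

def relu_grad(S):
    return (S > 0).astype(float)         # 1 where S > 0, else 0

def output_delta(S_L, y):
    out = np.maximum(0.0, S_L)           # g(S^(L)) with the ReLU
    return (out - y) * relu_grad(S_L)

print(output_delta(S_L=np.array([0.7, -0.2]), y=np.array([1.0, 0.0])))
```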

3.5 Inductive Step

We compute $\delta$ for an intermediate layer, taking an identical definition of $\delta$. Per the inductive hypothesis, assume that all $\delta_j^{(l)}$ have been computed:

$$\delta_i^{(l-1)} = \frac{\partial e}{\partial S_i^{(l-1)}}$$

We examine $x_i^{(l-1)}$. Which quantities does it affect? The answer: it may affect any node in the layer above it, $x_j^{(l)}$. Specifically, $x_i^{(l-1)}$ affects $S_j^{(l)}$, which in turn affects $x_j^{(l)}$. This is, again, expressed formally using the chain rule:

$$\delta_i^{(l-1)} = \sum_j \frac{\partial e}{\partial S_j^{(l)}} \frac{\partial S_j^{(l)}}{\partial x_i^{(l-1)}} \frac{\partial x_i^{(l-1)}}{\partial S_i^{(l-1)}}$$

Let us now simplify this expression. The first term is in fact $\delta_j^{(l)}$, by definition:

$$= \sum_j \delta_j^{(l)} \frac{\partial S_j^{(l)}}{\partial x_i^{(l-1)}} \frac{\partial x_i^{(l-1)}}{\partial S_i^{(l-1)}}$$

Recall that $S_j^{(l)}$ is a linear combination of weights and nodes, so the second term is simply the weight $w_{ij}^{(l)}$:

$$= \sum_j \delta_j^{(l)} w_{ij}^{(l)} \frac{\partial x_i^{(l-1)}}{\partial S_i^{(l-1)}}$$

Finally, the last term can be re-expressed using the activation function $g$:

$$\delta_i^{(l-1)} = \left(\sum_j \delta_j^{(l)} w_{ij}^{(l)}\right) g'(S_i^{(l-1)})$$
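A sketch of this one-layer-back recursion, where delta_l holds $\delta^{(l)}$, W_l the matrix of weights $w_{ij}^{(l)}$, and S_prev the $S^{(l-1)}$ cached during the forward pass; values are illustrative.

```python
# Inductive step: delta_i^(l-1) = (sum_j w_ij^(l) delta_j^(l)) * g'(S_i^(l-1)).
import numpy as np

def relu_grad(S):
    return (S > 0).astype(float)

def hidden_delta(delta_l, W_l, S_prev):
    return (W_l @ delta_l) * relu_grad(S_prev)   # propagate deltas one layer back

delta_l = np.array([0.5, -0.1])                  # illustrative deltas in layer l
W_l = np.ones((3, 2)) * 0.2                      # illustrative weights w_ij^(l)
S_prev = np.array([1.0, -0.5, 0.3])              # cached S^(l-1)
print(hidden_delta(delta_l, W_l, S_prev))
```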

3.6 Full Algorithm

Note that this algorithm is linear in the number of weights in the network, not linear in the number of nodes; the summation derived above demonstrates this.

1. We compute $\delta$ at the output layer.

2. We compute $\delta$ at the intermediate layers, working backwards.

3. Once this has concluded, we can use the following to compute the derivative with respect to any weight $w$:

$$\frac{\partial e}{\partial w_{ij}^{(l)}} = \delta_j^{(l)} x_i^{(l-1)}$$

This concludes the introduction to neural networks and their derivation.
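As a closing illustration, the pieces above can be strung together into an end-to-end sketch of the algorithm: a forward pass that caches $S^{(l)}$ and $x^{(l)}$, the base case at the output, the inductive step moving backwards, and the gradient formula for every weight. The network shape, weights, and learning rate below are illustrative, and biases and regularization are omitted.

```python
# End-to-end sketch of backpropagation as derived above, with ReLU activations
# at every layer and squared error at the output. W[l][i, j] plays the role of
# w_ij for the corresponding layer.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(S):
    return (S > 0).astype(float)

def forward(x, W):
    """Forward pass, caching S^(l) and x^(l) for every layer."""
    xs, Ss = [x], []
    for W_l in W:
        S = W_l.T @ xs[-1]               # S_j^(l) = sum_i w_ij^(l) x_i^(l-1)
        Ss.append(S)
        xs.append(relu(S))               # x^(l) = g(S^(l))
    return xs, Ss

def backprop(x, y, W):
    """Return de/dW for every layer, where e = 1/2 ||x^(L) - y||^2."""
    xs, Ss = forward(x, W)
    # Base case: delta^(L) = (g(S^(L)) - y) * g'(S^(L)).
    delta = (xs[-1] - y) * relu_grad(Ss[-1])
    grads = [None] * len(W)
    for l in reversed(range(len(W))):
        grads[l] = np.outer(xs[l], delta)            # de/dw_ij = x_i * delta_j
        if l > 0:
            # Inductive step: delta^(l-1) = (W^(l) delta^(l)) * g'(S^(l-1)).
            delta = (W[l] @ delta) * relu_grad(Ss[l - 1])
    return grads

# Illustrative usage: 3 inputs -> 4 hidden -> 1 output, one gradient-descent step.
rng = np.random.default_rng(0)
W = [rng.normal(scale=0.5, size=(3, 4)), rng.normal(scale=0.5, size=(4, 1))]
x, y = np.array([0.5, -1.0, 2.0]), np.array([1.0])
grads = backprop(x, y, W)
W = [W_l - 0.01 * g_l for W_l, g_l in zip(W, grads)]  # w_new <- w_old - eta * grad
```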