Week 5: Neural Networks


Instructor: Sergey Levine

1 Neural Networks Summary

In the previous lecture, we saw how we can construct neural networks by extending logistic regression. Neural networks consist of multiple layers of computation. At each layer, we have a vector of values called activations, which we'll denote $h^{(l)}$ for the activations at layer $l$. By convention, we'll say that $h^{(0)} = x$ (the input), and for the last layer $L$, we have $h^{(L)}$ be the output, which could be the probability of a label (we'll discuss regression and other types of outputs later). We can express the activations in the neural net recursively as

$$h^{(l)} = \sigma(W^{(l)} h^{(l-1)} + b^{(l)}),$$

where $\sigma(z)$ is a nonlinearity. Last lecture, we saw a common choice of nonlinearity, the sigmoid (or logistic function):

$$\sigma(z) = \frac{1}{1 + \exp(-z)}.$$

In a neural network, we must choose the number of layers $L$ and the size of each layer (the size of $h^{(0)}$ is the number of input attributes and the size of $h^{(L)}$ is the number of outputs; more on that later). If we have a network with three layers of weights (corresponding to two hidden layers), we can explicitly write it as:

$$h^{(3)} = \sigma(W^{(3)} \sigma(W^{(2)} \sigma(W^{(1)} h^{(0)} + b^{(1)}) + b^{(2)}) + b^{(3)}).$$

Note that there is a little bit of confusion in terminology: this network has three layers of weights, given by $W^{(1)}$, $W^{(2)}$, and $W^{(3)}$, but two layers of hidden activations $h^{(1)}$ and $h^{(2)}$, since $h^{(0)}$ is simply $x$ (and therefore not "hidden"), and $h^{(3)}$ is the output. Traditionally, "hidden layer" in a neural network refers to a layer of hidden activations (or "neurons"), but more recent terminology typically refers to the weights as layers. They are off by one, so don't get confused!

Now, let's check our understanding:

What is the data?

Answer. Just like in logistic regression, we have an input vector $x$, which can be either continuous real-valued, binary, or categorical. Just like with logistic regression, categorical values are often converted into one-hot encodings: if some feature takes on $M$ values, we might add $M$ entries to $x$, where all but one is zero. This avoids the need to impose an ordering on that variable. The output $y$ for now is categorical (just like in logistic regression), though we'll discuss how we can have real-valued $y$ as well later.

What defines (parameterizes) the hypothesis?

Answer. To define a neural network, we have to first choose the number and size of the layers. This is a design decision (so technically, the number and size of layers are hyperparameters). We choose these the same way we choose the features $h(x)$ to use for logistic regression, the amount of regularization, etc.: we use our intuition and, when in doubt, validate against the validation set. Once we've figured out the number of layers and their size, the neural network is fully defined by the weights $\theta = \{W^{(1)}, b^{(1)}, \dots, W^{(L)}, b^{(L)}\}$. It is precisely these weights that our learning algorithm needs to optimize.

What is the objective?

Answer. Neural networks are conditional (or "discriminative") models. Just like logistic regression and linear regression, neural networks optimize the conditional log-likelihood, given by

$$\mathcal{L}(\theta) = \sum_{i=1}^{N} \log p(y_i \mid x_i, \theta).$$

In the case of a binary label $y$ and a three-layer network (two hidden layers), this is given by

$$\mathcal{L}(\theta) = \sum_{i=1}^{N} \begin{cases} \log\left(1 - \sigma(W^{(3)} \sigma(W^{(2)} \sigma(W^{(1)} x_i + b^{(1)}) + b^{(2)}) + b^{(3)})\right) & \text{if } y_i = 0 \\ \log \sigma(W^{(3)} \sigma(W^{(2)} \sigma(W^{(1)} x_i + b^{(1)}) + b^{(2)}) + b^{(3)}) & \text{if } y_i = 1 \end{cases}$$

What is the algorithm?

Answer. Just like with logistic regression, we'll use gradient ascent to optimize the neural network parameters $\theta$. However, we have many more parameters now, and computing the gradient becomes a lot more difficult. Furthermore, unlike with logistic regression, the neural network objective is not convex, because the weights $W^{(l)}$ and $b^{(l)}$ have complex nonlinear effects on the output.
That means that gradient ascent can get stuck in local optima when optimizing neural networks, and we cannot in general guarantee a globally optimal solution. In practice, especially for smaller networks, it is often not hard to find good enough solutions with gradient ascent that are not globally optimal but still perform well. For very large and deep networks, we often have to think about more sophisticated optimization algorithms (more on this later).

To summarize, the gradient ascent algorithm simply consists of repeatedly applying the following operation:

$$\theta^{(j+1)} \leftarrow \theta^{(j)} + \alpha \nabla \mathcal{L}(\theta^{(j)}).$$

Note that $\nabla \mathcal{L}(\theta^{(j)})$ here is a huge vector that consists of the concatenation of the gradient with respect to each weight matrix and bias vector. Assume that each weight matrix $W^{(l)}$ has $M_l$ rows and $M_{l-1}$ columns; then $\nabla \mathcal{L}(\theta)$ can be written as:

$$\nabla \mathcal{L}(\theta) = \begin{pmatrix} \frac{d\mathcal{L}}{dW^{(1)}_{1,1}} \\ \frac{d\mathcal{L}}{dW^{(1)}_{2,1}} \\ \vdots \\ \frac{d\mathcal{L}}{dW^{(1)}_{M_1,1}} \\ \frac{d\mathcal{L}}{dW^{(1)}_{1,2}} \\ \vdots \\ \frac{d\mathcal{L}}{dW^{(1)}_{M_1,2}} \\ \vdots \\ \frac{d\mathcal{L}}{dW^{(1)}_{M_1,M_0}} \\ \frac{d\mathcal{L}}{db^{(1)}_{1}} \\ \vdots \\ \frac{d\mathcal{L}}{db^{(1)}_{M_1}} \\ \frac{d\mathcal{L}}{dW^{(2)}_{1,1}} \\ \vdots \\ \frac{d\mathcal{L}}{dW^{(2)}_{M_2,1}} \\ \vdots \\ \frac{d\mathcal{L}}{dW^{(2)}_{M_2,M_1}} \\ \frac{d\mathcal{L}}{db^{(2)}_{1}} \\ \vdots \\ \frac{d\mathcal{L}}{db^{(2)}_{M_2}} \\ \vdots \\ \frac{d\mathcal{L}}{dW^{(L)}_{M_L,M_{L-1}}} \\ \vdots \\ \frac{d\mathcal{L}}{db^{(L)}_{M_L}} \end{pmatrix}$$

In practice, it can often be more convenient when implementing gradient ascent for neural networks to simply compute the gradient with respect to each matrix $W^{(l)}$ and vector $b^{(l)}$ and increment them individually according to the gradient ascent rule (which has exactly the same effect). For example, in an object-oriented framework, each matrix and vector can be its own object that knows how to compute its own gradient and apply gradient ascent, or else concatenate its gradient to the huge full gradient vector. In the next section, we'll discuss how we can compute the gradient with respect to each weight matrix $W^{(l)}$.

2 Backpropagation: base case

To evaluate the output of a neural network, we recursively evaluate each layer as follows:

$$h^{(l)} = \sigma(W^{(l)} h^{(l-1)} + b^{(l)}).$$

We will also introduce $z^{(l)} = W^{(l)} h^{(l-1)} + b^{(l)}$, such that $h^{(l)} = \sigma(z^{(l)})$. Sometimes, you'll see $h$ referred to as "post-synaptic" and $z$ as "pre-synaptic." This recursive evaluation of a neural network is referred to as forward propagation, because we are propagating the activations from the input forward to the output. To compute derivatives, we use the chain rule to differentiate the objective with respect to the parameters of each layer. This algorithm proceeds from the end of the neural network back down to the first layer, and is therefore referred to as backward propagation or backpropagation.

First, let's revisit the chain rule. If we have the composition of two functions $f(g(x))$, and we want $\frac{df}{dx}$, we can evaluate it as:

$$\frac{df}{dx} = \frac{df}{dg} \frac{dg}{dx}.$$

For example, if $f(y) = ay$ and $g(x) = bx$, then $\frac{df}{dx} = ab$. The same idea applies in the multivariate case. Let's say that we have a vector $x$, and $f(y) = Ay$, and $g(x) = Bx$. Then

$$\frac{df}{dx} = \frac{df}{dg} \frac{dg}{dx} = AB, \qquad \left(\frac{df}{dx}\right)_{i,j} = \sum_{k} A_{i,k} B_{k,j}.$$

Next, let's see how we can compute the gradient for a single-layer neural network, which simply corresponds to logistic regression. Although the output of this network is 1D in the binary case, we'll still do the math for the multivariate case, assuming there are $M_1$ outputs (it just happens that $M_1 = 1$). This will make things more convenient later. The likelihood is given by

$$\mathcal{L}(\theta) = \sum_{i=1}^{N} \log p(y_i \mid x_i, W^{(1)}, b^{(1)}).$$

In the binary case, we simply have $\log p(y \mid x, W^{(1)}, b^{(1)}) = \log(h^{(1)})$ if $y = 1$ and $\log p(y \mid x, W^{(1)}, b^{(1)}) = \log(1 - h^{(1)})$ otherwise.
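The forward propagation recursion and the binary log-likelihood for a single datapoint can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names `forward` and `log_likelihood`, the layer sizes, and the random initialization are assumptions made for the example and do not come from the lecture.

```python
import numpy as np

def sigmoid(z):
    """Logistic nonlinearity: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward propagation through all layers.

    Returns the pre-activations z^(1..L) and the activations h^(0..L),
    where hs[0] is the input x and hs[-1] is the network output.
    """
    hs, zs = [x], []
    for W, b in zip(weights, biases):
        z = W @ hs[-1] + b        # z^(l) = W^(l) h^(l-1) + b^(l)
        zs.append(z)
        hs.append(sigmoid(z))     # h^(l) = sigma(z^(l))
    return zs, hs

def log_likelihood(h_out, y):
    """Binary case, one datapoint: log(h) if y == 1, log(1 - h) otherwise."""
    h = h_out.item()
    return np.log(h) if y == 1 else np.log(1.0 - h)

# Hypothetical sizes: M_0 = 3 inputs, M_1 = 4 hidden units, M_2 = 1 output.
rng = np.random.default_rng(0)
sizes = [3, 4, 1]
weights = [rng.normal(size=(m_out, m_in))
           for m_in, m_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m_out) for m_out in sizes[1:]]

x = np.array([0.5, -1.0, 2.0])
zs, hs = forward(x, weights, biases)
print("output h^(L):", hs[-1], "log-likelihood for y = 1:", log_likelihood(hs[-1], 1))
```

Each weight matrix has shape $M_l \times M_{l-1}$, matching the convention used for the gradient vector earlier, so `W @ h` maps the activations of layer $l-1$ to the pre-activations of layer $l$.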
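If we assume for now that there is only one datapoint (we'll see what happens with multiple datapoints later), we just have

$$\frac{d\mathcal{L}}{dh^{(1)}} = \begin{cases} -\frac{1}{1 - h^{(1)}} & \text{if } y = 0 \\ \frac{1}{h^{(1)}} & \text{if } y = 1 \end{cases}$$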

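We know that $\mathcal{L}(h^{(1)}) = \mathcal{L}(\sigma(z^{(1)})) = \mathcal{L}(\sigma(W^{(1)} x + b^{(1)}))$, and therefore we can use the chain rule to get

$$\frac{d\mathcal{L}}{dz^{(1)}} = \frac{dh^{(1)}}{dz^{(1)}} \frac{d\mathcal{L}}{dh^{(1)}}.$$

We already know that $\frac{d\mathcal{L}}{dh^{(1)}}$ is simply $\frac{1}{h^{(1)}}$ or $-\frac{1}{1 - h^{(1)}}$. We know that

$$h^{(1)} = \sigma(z^{(1)}) = \frac{1}{1 + \exp(-z^{(1)})},$$

and therefore

$$\frac{dh^{(1)}}{dz^{(1)}} = \sigma(z^{(1)})(1 - \sigma(z^{(1)})).$$

To see why this is true, note that

$$\frac{dh^{(1)}}{dz^{(1)}} = \frac{\exp(-z^{(1)})}{(1 + \exp(-z^{(1)}))^2} = \frac{1}{1 + \exp(-z^{(1)})} \cdot \frac{\exp(-z^{(1)})}{1 + \exp(-z^{(1)})},$$

and

$$1 - \sigma(z) = \frac{1 + \exp(-z)}{1 + \exp(-z)} - \frac{1}{1 + \exp(-z)} = \frac{\exp(-z)}{1 + \exp(-z)}.$$

Note that entry $(i, j)$ in $\frac{dh^{(1)}}{dz^{(1)}}$ is $0$ if $i \neq j$, so we only need to pointwise multiply each $\frac{d\mathcal{L}}{dh_i^{(1)}}$ by $\sigma(z_i^{(1)})(1 - \sigma(z_i^{(1)}))$. Now we just need to evaluate

$$\frac{d\mathcal{L}}{dW^{(1)}} = \frac{dz^{(1)}}{dW^{(1)}} \frac{d\mathcal{L}}{dz^{(1)}}, \qquad \frac{d\mathcal{L}}{db^{(1)}} = \frac{dz^{(1)}}{db^{(1)}} \frac{d\mathcal{L}}{dz^{(1)}}.$$

We know that $z^{(1)} = W^{(1)} x + b^{(1)}$. Let's start with the bias. We simply have

$$\frac{d\mathcal{L}}{db_i^{(1)}} = \sum_{j=1}^{M_1} \frac{dz_j^{(1)}}{db_i^{(1)}} \frac{d\mathcal{L}}{dz_j^{(1)}} = \frac{d\mathcal{L}}{dz_i^{(1)}},$$

since $b_i^{(1)}$ only affects $z_i^{(1)}$, and not any other $z_j^{(1)}$ where $j \neq i$. Evaluating the derivative with respect to the weight matrix $W^{(1)}$ is a bit more complex. We have

$$z_i^{(1)} = \sum_{j=1}^{M_0} W_{i,j}^{(1)} x_j + b_i^{(1)},$$

and therefore

$$\frac{d\mathcal{L}}{dW_{i,j}^{(1)}} = \sum_{k=1}^{M_1} \frac{dz_k^{(1)}}{dW_{i,j}^{(1)}} \frac{d\mathcal{L}}{dz_k^{(1)}} = \frac{d\mathcal{L}}{dz_i^{(1)}} x_j,$$

since we can see in the sum that only $z_i^{(1)}$ depends on $W_{i,j}^{(1)}$, through the term $W_{i,j}^{(1)} x_j$ for $j$ from $1$ to $M_0$. We can also express this in matrix notation as

$$\frac{d\mathcal{L}}{dW^{(1)}} = \frac{d\mathcal{L}}{dz^{(1)}} x^T.$$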
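The base-case formulas can be checked numerically. The sketch below computes $\frac{d\mathcal{L}}{dh^{(1)}}$, $\frac{d\mathcal{L}}{dz^{(1)}}$, $\frac{d\mathcal{L}}{db^{(1)}}$, and $\frac{d\mathcal{L}}{dW^{(1)}}$ for one datapoint and compares one entry against a central finite difference. It assumes, purely for illustration, $M_1 = 2$ outputs that each carry their own binary label; the names `single_layer_grads` and `log_lik` and all numeric values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_lik(W, b, x, y):
    """Sum over outputs of log(h_i) if y_i == 1, else log(1 - h_i)."""
    h = sigmoid(W @ x + b)
    return float(np.sum(np.where(y == 1, np.log(h), np.log(1.0 - h))))

def single_layer_grads(W, b, x, y):
    """Gradients of the log-likelihood for a single-layer network."""
    z = W @ x + b
    h = sigmoid(z)
    dL_dh = np.where(y == 1, 1.0 / h, -1.0 / (1.0 - h))  # dL/dh
    dL_dz = h * (1.0 - h) * dL_dh      # pointwise multiply by sigma(z)(1 - sigma(z))
    dL_db = dL_dz                      # b_i only affects z_i
    dL_dW = np.outer(dL_dz, x)         # matrix form: dL/dz x^T
    return dL_dW, dL_db

rng = np.random.default_rng(1)
W = rng.normal(size=(2, 3))
b = rng.normal(size=2)
x = np.array([0.2, -0.4, 1.5])
y = np.array([1, 0])

dL_dW, dL_db = single_layer_grads(W, b, x, y)

# Central finite difference on the entry W[0, 1] as a sanity check.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (log_lik(W_plus, b, x, y) - log_lik(W_minus, b, x, y)) / (2 * eps)
print("analytic:", dL_dW[0, 1], "numeric:", numeric)
```

Multiplying the two factors out shows that $\frac{d\mathcal{L}}{dz^{(1)}}$ simplifies to $y - h^{(1)}$ in this binary setting, which is the familiar logistic regression gradient.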

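3 Backpropagation: recursive case intro

Now, let's say that we have a multilayer neural network. The last layer simply looks like logistic regression, except that now instead of $x$, we have the activations in the second-to-last layer $h^{(L-1)}$. We therefore have:

$$\frac{d\mathcal{L}}{db^{(L)}} = \frac{d\mathcal{L}}{dz^{(L)}}, \qquad \frac{d\mathcal{L}}{dW^{(L)}} = \frac{d\mathcal{L}}{dz^{(L)}} (h^{(L-1)})^T.$$

But how can we get the derivatives with respect to the weights in the preceding layers? Well, we note that the previous layers only affect $\mathcal{L}$ via $h^{(L-1)}$, so we simply need to know $\frac{d\mathcal{L}}{dh^{(L-1)}}$. We know that

$$z^{(L)} = W^{(L)} h^{(L-1)} + b^{(L)}, \qquad z_i^{(L)} = \sum_{j=1}^{M_{L-1}} W_{i,j}^{(L)} h_j^{(L-1)} + b_i^{(L)},$$

and therefore we can derive

$$\frac{d\mathcal{L}}{dh_j^{(L-1)}} = \sum_{i=1}^{M_L} \frac{dz_i^{(L)}}{dh_j^{(L-1)}} \frac{d\mathcal{L}}{dz_i^{(L)}} = \sum_{i=1}^{M_L} W_{i,j}^{(L)} \frac{d\mathcal{L}}{dz_i^{(L)}}.$$

Since $W_{i,j}^{(L)} = \left((W^{(L)})^T\right)_{j,i}$, in matrix notation this simply becomes

$$\frac{d\mathcal{L}}{dh^{(L-1)}} = (W^{(L)})^T \frac{d\mathcal{L}}{dz^{(L)}}.$$

Now, we can simply proceed recursively, and use $\frac{d\mathcal{L}}{dh^{(L-1)}}$ in place of $\frac{d\mathcal{L}}{dh^{(L)}}$ to compute the derivatives with respect to $W^{(L-1)}$ and $b^{(L-1)}$.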
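Putting the base case and the recursive step together gives a complete backward pass. The following is a minimal sketch for a single datapoint with a binary label, followed by one gradient ascent update applied to each matrix and vector individually. The function names, the layer sizes `[3, 4, 4, 1]`, the random initialization, and the step size `alpha = 0.01` are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Return the activations h^(0), ..., h^(L)."""
    hs = [x]
    for W, b in zip(weights, biases):
        hs.append(sigmoid(W @ hs[-1] + b))
    return hs

def log_lik(x, y, weights, biases):
    """Binary log-likelihood of one datapoint (scalar network output)."""
    h = forward(x, weights, biases)[-1].item()
    return np.log(h) if y == 1 else np.log(1.0 - h)

def backprop(x, y, weights, biases):
    """Gradients of the log-likelihood w.r.t. every W^(l) and b^(l)."""
    hs = forward(x, weights, biases)
    h_out = hs[-1]
    # Base case: dL/dh^(L) for a binary label.
    dL_dh = 1.0 / h_out if y == 1 else -1.0 / (1.0 - h_out)
    grads_W, grads_b = [], []
    for l in reversed(range(len(weights))):
        h = hs[l + 1]                             # activations of this layer
        dL_dz = h * (1.0 - h) * dL_dh             # through the sigmoid
        grads_b.append(dL_dz)                     # dL/db = dL/dz
        grads_W.append(np.outer(dL_dz, hs[l]))    # dL/dW = dL/dz (h^(l-1))^T
        dL_dh = weights[l].T @ dL_dz              # recursive step: W^T dL/dz
    return grads_W[::-1], grads_b[::-1]

# Hypothetical network with three layers of weights (two hidden layers).
rng = np.random.default_rng(2)
sizes = [3, 4, 4, 1]
weights = [rng.normal(size=(m_out, m_in))
           for m_in, m_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m_out) for m_out in sizes[1:]]
x, y = np.array([0.5, -1.0, 2.0]), 1

grads_W, grads_b = backprop(x, y, weights, biases)

# One gradient ascent step, applied to each parameter individually.
alpha = 0.01
new_weights = [W + alpha * dW for W, dW in zip(weights, grads_W)]
new_biases = [b + alpha * db for b, db in zip(biases, grads_b)]
print("log-likelihood before:", log_lik(x, y, weights, biases))
print("log-likelihood after: ", log_lik(x, y, new_weights, new_biases))
```

Updating each matrix and vector separately has the same effect as adding $\alpha \nabla \mathcal{L}(\theta)$ to the concatenated parameter vector, and it avoids ever building that vector explicitly. For a sufficiently small step size the log-likelihood should not decrease, although the value `0.01` used here is only a guess and not a recommendation.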