Notes on Neural Networks

Paulo Eduardo Rauber, 2015

1 Artificial neurons

Consider the data set $D = \{(x_i, y_i) \mid i \in \{1, \ldots, n\}, x_i \in \mathbb{R}^m, y_i \in \mathbb{R}^d\}$. The task of supervised learning consists of finding a function $f: \mathbb{R}^m \to \mathbb{R}^d$ that is able to generalize from the examples in $D$. Supervised learning has been successful at image classification, natural language processing, and many other tasks.

Some types of artificial neural networks can be seen as a function $f: \mathbb{R}^m \to \mathbb{R}^d$ computed by a network of artificial neurons. The particular way in which the neurons are connected defines the architecture of the network.

Artificial neurons are simplified models of real neurons. The objective of these models is not to simulate the behavior of real neurons, but to replicate some of their high-level functionality.

Consider an input vector $x = (x_1, \ldots, x_m) \in \mathbb{R}^m$ to an artificial neuron. We define a weight vector $w = (w_1, \ldots, w_m) \in \mathbb{R}^m$, a bias $b \in \mathbb{R}$, and a threshold $\theta \in \mathbb{R}$. Using these definitions, the following expressions give the outputs of different models of artificial neurons:

Linear neurons: $b + \sum_i x_i w_i = b + x \cdot w$;

Rectified linear neurons: $\max(0, b + x \cdot w)$;

Binary thresholded neurons: $1$ if $b + x \cdot w \geq \theta$, $0$ otherwise;

Sigmoid neurons: $\frac{1}{1 + e^{-(b + x \cdot w)}}$;

Stochastic binary neurons (sampling): $P(Y = 1 \mid x; w, b) = \frac{1}{1 + e^{-(b + x \cdot w)}}$.

The bias term $b$ is the negative intercept of the affine hyperplane $H = \{(x_1, \ldots, x_m) \mid x_1 w_1 + \ldots + x_m w_m + b = 0\}$. As a consequence, binary thresholded neurons separate $\mathbb{R}^m$ into two half-spaces by an affine hyperplane. Sigmoid neurons perform an analogous, but smooth, assignment of the input $x$ to one of the half-spaces.

Consider binary thresholded, sigmoid, or stochastic binary neurons. Intuitively, a high bias indicates that a neuron is easy to excite (output a positive value); a low bias indicates the opposite. If each input element $x_i$ is interpreted as a feature, a positive weight $w_i$ indicates that the feature excites the neuron, and a negative weight $w_i$ indicates that the feature inhibits the neuron.

Supervised learning in artificial neural networks is performed by updating the weight vectors of the artificial neurons to minimize a metric of prediction error on the training data set. When applied to discriminate between $d$ classes, the artificial neural network may output $d$ real values, which correspond to the confidence that the input belongs to each class.
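As a concrete illustration, the following NumPy sketch computes the output of each neuron model above; the function names and the example values are arbitrary choices, not part of the original formulation.

```python
import numpy as np

def linear(x, w, b):
    """Linear neuron: b + x.w"""
    return b + np.dot(x, w)

def relu(x, w, b):
    """Rectified linear neuron: max(0, b + x.w)"""
    return max(0.0, b + np.dot(x, w))

def binary_threshold(x, w, b, theta=0.0):
    """Binary thresholded neuron: 1 if b + x.w >= theta, else 0"""
    return 1.0 if b + np.dot(x, w) >= theta else 0.0

def sigmoid_neuron(x, w, b):
    """Sigmoid neuron: 1 / (1 + exp(-(b + x.w)))"""
    return 1.0 / (1.0 + np.exp(-(b + np.dot(x, w))))

def stochastic_binary(x, w, b, rng=None):
    """Stochastic binary neuron: outputs 1 with probability sigmoid(b + x.w)."""
    rng = np.random.default_rng(0) if rng is None else rng
    return float(rng.random() < sigmoid_neuron(x, w, b))

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
b = 0.2
print(linear(x, w, b), relu(x, w, b), binary_threshold(x, w, b), sigmoid_neuron(x, w, b))
```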

2 Feedforward neural networks

This section describes feedforward neural networks, an important class of artificial neural networks. Consider the data set $D = \{(x_i, y_i) \mid i \in \{1, \ldots, n\}, x_i \in \mathbb{R}^m, y_i \in \mathbb{R}^d\}$. Let $L$ represent the number of layers in the network and $n^{(l)}$ represent the number of neurons in layer $l$, with $n^{(L)} = d$. We will refer to a neuron in layer $l$ by a corresponding number between $1$ and $n^{(l)}$. The neurons in the first layer are also called input units, the neurons in the output (last) layer are called output units, and the other neurons are called hidden units. Networks with more than 3 layers are called deep networks.

Let $w^{(l)}_{jk} \in \mathbb{R}$ represent the weight reaching neuron $j$ in layer $l$ from neuron $k$ in layer $l-1$. The order of the indices is counterintuitive, but makes the presentation simpler. Furthermore, let $b^{(l)}_j \in \mathbb{R}$ represent the bias for neuron $j$ in layer $l$. Consider a layer $l$, for $1 < l \leq L$, and a neuron $j$, for $1 \leq j \leq n^{(l)}$. The weighted input to neuron $j$ in layer $l$ is defined as

$$z^{(l)}_j = b^{(l)}_j + \sum_{k=1}^{n^{(l-1)}} w^{(l)}_{jk} a^{(l-1)}_k,$$

where the activation of neuron $j$ in layer $l > 1$ is defined as $a^{(l)}_j = \sigma(z^{(l)}_j)$, and $\sigma$ is a differentiable activation function. For example, consider the sigmoid activation function, defined by $\sigma(z) = \frac{1}{1 + e^{-z}}$. In this case, it is easy to show that $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ for every $z$.

It will be useful to define vectors (and a matrix) that represent quantities associated to each neuron in a given layer. The weighted input vector for layer $l > 1$ is defined as $z^{(l)} = (z^{(l)}_1, \ldots, z^{(l)}_{n^{(l)}})$, and the activation vector for layer $l$ is defined as $a^{(l)} = (a^{(l)}_1, \ldots, a^{(l)}_{n^{(l)}})$. Furthermore, we define the bias vectors $b^{(l)} = (b^{(l)}_1, \ldots, b^{(l)}_{n^{(l)}})$ and the weight matrices $W^{(l)}$, of dimension $n^{(l)} \times n^{(l-1)}$, as $(W^{(l)})_{jk} = w^{(l)}_{jk}$. Using these definitions, the output of each layer $l > 1$ can be written as

$$a^{(l)} = \sigma(W^{(l)} a^{(l-1)} + b^{(l)}),$$

where the activation function is applied element-wise. Given these definitions, the output of the feedforward neural network $f$ when $a^{(1)} = x$ is defined as the activation vector of the output layer: $f(x) = a^{(L)}$. This completes the definition of the feedforward neural network model. It is possible to show that a feedforward neural network with a single hidden layer of sigmoid neurons can approximate any real-valued continuous function defined on a compact subset of $\mathbb{R}^m$.

Let the cost $C$ be a differentiable function of the weights and biases. For example, consider the (halved) mean squared cost $C$ given by

$$C = \frac{1}{n} \sum_{(x,y) \in D} c, \qquad c = \frac{1}{2} \lVert a^{(L)} - y \rVert^2,$$

where $a^{(1)} = x$. It is easy to show that $\frac{\partial c}{\partial a^{(L)}_j} = a^{(L)}_j - y_j$. We define the task of learning the parameters (weights and biases) of a feedforward neural network as finding parameters that minimize $C$ for a given training data set $D$. The fact that the cost can be written as an average of costs for each element of the data set will be crucial to the proposed optimization procedure. The procedure requires the computation of partial derivatives of the cost with respect to weights and biases, which are usually computed by a technique called backpropagation.
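A minimal NumPy sketch of the forward pass $a^{(l)} = \sigma(W^{(l)} a^{(l-1)} + b^{(l)})$ and of the halved mean squared cost follows, assuming sigmoid activations in every layer; the layer sizes and random parameters are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Compute a^(L) for a feedforward network.
    weights[l] has shape (n_(l+1), n_l); biases[l] has shape (n_(l+1),)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)          # a^(l) = sigma(W^(l) a^(l-1) + b^(l))
    return a

def halved_mse(a, y):
    """c = 0.5 * ||a^(L) - y||^2 for a single observation."""
    return 0.5 * np.sum((a - y) ** 2)

# Tiny example: m = 3 inputs, one hidden layer with 4 neurons, d = 2 outputs.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.normal(0, 1, (sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
biases = [rng.normal(0, 1, sizes[l + 1]) for l in range(len(sizes) - 1)]
x, y = rng.normal(size=3), np.array([1.0, 0.0])
a_L = forward(x, weights, biases)
print(a_L, halved_mse(a_L, y))
```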

Let the error of neuron $j$ in layer $l$ for a fixed $(x, y) \in D$ be defined as $\delta^{(l)}_j = \frac{\partial c}{\partial z^{(l)}_j}$. The error vector for the neurons in layer $l$ is denoted by $\delta^{(l)} = (\delta^{(l)}_1, \ldots, \delta^{(l)}_{n^{(l)}})$. We also let $\nabla_a c = (\frac{\partial c}{\partial a^{(L)}_1}, \ldots, \frac{\partial c}{\partial a^{(L)}_{n^{(L)}}})$ denote the gradient of $c$ with respect to $a^{(L)}$.

Backpropagation is a method for computing the partial derivatives of the cost function of a feedforward artificial neural network with respect to its parameters. The method is based solely on the following six statements, which we will demonstrate shortly:

$$\delta^{(L)} = \nabla_a c \odot \sigma'(z^{(L)}), \tag{1}$$
$$\delta^{(l)} = \left( (W^{(l+1)})^T \delta^{(l+1)} \right) \odot \sigma'(z^{(l)}), \tag{2}$$
$$\frac{\partial c}{\partial b^{(l)}_j} = \delta^{(l)}_j, \tag{3}$$
$$\frac{\partial c}{\partial w^{(l)}_{jk}} = a^{(l-1)}_k \delta^{(l)}_j, \tag{4}$$
$$\frac{\partial C}{\partial b^{(l)}_j} = \frac{1}{n} \sum_{(x,y) \in D} \frac{\partial c}{\partial b^{(l)}_j}, \tag{5}$$
$$\frac{\partial C}{\partial w^{(l)}_{jk}} = \frac{1}{n} \sum_{(x,y) \in D} \frac{\partial c}{\partial w^{(l)}_{jk}}, \tag{6}$$

where $\odot$ denotes element-wise multiplication. Notice how every quantity on the right side can be computed easily from our definitions by starting with the errors in the output layer for every observation. That is why this method is called backpropagation.

The statement $\delta^{(L)} = \nabla_a c \odot \sigma'(z^{(L)})$ is equivalent to $\delta^{(L)}_j = \frac{\partial c}{\partial a^{(L)}_j} \sigma'(z^{(L)}_j)$ for $1 \leq j \leq n^{(L)}$. By definition, $\delta^{(L)}_j = \frac{\partial c}{\partial z^{(L)}_j}$. First notice that $a^{(L)}_j$ is a differentiable function of $z^{(L)}_j$ and $c$ is a differentiable function of $a^{(L)}_1, \ldots, a^{(L)}_{n^{(L)}}$. Because $a^{(L)}_j$ and $a^{(L)}_k$ do not directly affect each other, $z^{(L)}_j$ only affects $c$ through $a^{(L)}_j$. Therefore,

$$\delta^{(L)}_j = \frac{\partial c}{\partial z^{(L)}_j} = \frac{\partial c}{\partial a^{(L)}_j} \frac{\partial a^{(L)}_j}{\partial z^{(L)}_j} = \frac{\partial c}{\partial a^{(L)}_j} \sigma'(z^{(L)}_j),$$

where the second equality comes from the chain rule and the third equality from the fact that $\frac{\partial a^{(L)}_j}{\partial z^{(L)}_j} = \sigma'(z^{(L)}_j)$. This completes the proof.

The statement $\delta^{(l)} = ((W^{(l+1)})^T \delta^{(l+1)}) \odot \sigma'(z^{(l)})$ is equivalent to

$$\delta^{(l)}_j = \sigma'(z^{(l)}_j) \sum_{k} w^{(l+1)}_{kj} \delta^{(l+1)}_k$$

for $1 < l < L$ and $1 \leq j \leq n^{(l)}$. Because $z^{(l+1)}_k$ is a differentiable function of $z^{(l)}_1, \ldots, z^{(l)}_{n^{(l)}}$ and $c$ is a differentiable function of $z^{(l+1)}_1, \ldots, z^{(l+1)}_{n^{(l+1)}}$, by definition,

$$\delta^{(l)}_j = \frac{\partial c}{\partial z^{(l)}_j} = \sum_k \frac{\partial c}{\partial z^{(l+1)}_k} \frac{\partial z^{(l+1)}_k}{\partial z^{(l)}_j} = \sum_k \delta^{(l+1)}_k \frac{\partial z^{(l+1)}_k}{\partial z^{(l)}_j}.$$

Furthermore, since

$$z^{(l+1)}_k = b^{(l+1)}_k + \sum_i w^{(l+1)}_{ki} a^{(l)}_i = b^{(l+1)}_k + \sum_i w^{(l+1)}_{ki} \sigma(z^{(l)}_i),$$

we have

$$\frac{\partial z^{(l+1)}_k}{\partial z^{(l)}_j} = w^{(l+1)}_{kj} \sigma'(z^{(l)}_j).$$

This gives

$$\delta^{(l)}_j = \sum_k \delta^{(l+1)}_k w^{(l+1)}_{kj} \sigma'(z^{(l)}_j) = \sigma'(z^{(l)}_j) \sum_k w^{(l+1)}_{kj} \delta^{(l+1)}_k,$$

which completes the proof.

Consider the statement $\frac{\partial c}{\partial b^{(l)}_j} = \delta^{(l)}_j$. Because $z^{(l)}_k$ is a differentiable function of $b^{(l)}_j$ and $c$ is a differentiable function of $z^{(l)}_1, \ldots, z^{(l)}_{n^{(l)}}$,

$$\frac{\partial c}{\partial b^{(l)}_j} = \sum_k \frac{\partial c}{\partial z^{(l)}_k} \frac{\partial z^{(l)}_k}{\partial b^{(l)}_j} = \delta^{(l)}_j,$$

since $\frac{\partial z^{(l)}_k}{\partial b^{(l)}_j} = 0$ for $k \neq j$ and $1$ otherwise. This completes the proof.

Similarly, consider the statement $\frac{\partial c}{\partial w^{(l)}_{jk}} = a^{(l-1)}_k \delta^{(l)}_j$. Because $z^{(l)}_j$ is a differentiable function of $w^{(l)}_{j1}, \ldots, w^{(l)}_{j n^{(l-1)}}$, $c$ is a differentiable function of $z^{(l)}_1, \ldots, z^{(l)}_{n^{(l)}}$, and $\frac{\partial z^{(l)}_i}{\partial w^{(l)}_{jk}} = 0$ if $i \neq j$, by definition

$$\frac{\partial c}{\partial w^{(l)}_{jk}} = \frac{\partial c}{\partial z^{(l)}_j} \frac{\partial z^{(l)}_j}{\partial w^{(l)}_{jk}} = \delta^{(l)}_j a^{(l-1)}_k,$$

as we wanted to show, since $z^{(l)}_j = b^{(l)}_j + \sum_i w^{(l)}_{ji} a^{(l-1)}_i$ gives $\frac{\partial z^{(l)}_j}{\partial w^{(l)}_{jk}} = a^{(l-1)}_k$.

The statements

$$\frac{\partial C}{\partial b^{(l)}_j} = \frac{1}{n} \sum_{(x,y) \in D} \frac{\partial c}{\partial b^{(l)}_j} \qquad \text{and} \qquad \frac{\partial C}{\partial w^{(l)}_{jk}} = \frac{1}{n} \sum_{(x,y) \in D} \frac{\partial c}{\partial w^{(l)}_{jk}}$$

follow easily from the definition of $C$ and $c$ and their differentiability with respect to the parameters. This completes the proof of the six statements that allow the computation of the gradient of $C$ with respect to the parameters of the network through backpropagation. The statements are valid for other activation and cost functions, as we will show in the next section. In each case, it is important to check whether the assumptions made in the proofs apply. Intuitively, for each observation, backpropagation considers the effect of a small increase $\Delta w$ of a parameter $w$ in the network. This change affects every subsequent neuron on a path to the output, and ultimately changes the cost $c$ by a small $\Delta c$.

Consider a differentiable function $f: \mathbb{R}^n \to \mathbb{R}$ over variables $x = (x_1, \ldots, x_n)$. Suppose we are interested in finding a global minimum of $f$. Because the gradient $\nabla f(a)$ gives the direction of maximum local increase in $f$ at $a$, it would be natural to start at a point $a_0$, which could be chosen at random, and visit the sequence of points given by $a_t = a_{t-1} - \eta \nabla f(a_{t-1})$, where the learning rate $\eta \in \mathbb{R}$ is a small positive constant. This technique is called gradient descent, and can be used as a heuristic to find the parameters that minimize the cost $C$ of the network. Gradient descent is not guaranteed to converge. Even if it converges, the point at convergence may be a saddle point or a poor local minimum. The choice of $\eta$ considerably affects the success of gradient descent.

In a given iteration $t$ of gradient descent, instead of computing $\frac{\partial C}{\partial w^{(l)}_{jk}}$ and $\frac{\partial C}{\partial b^{(l)}_j}$ as averages derived from a computation involving all $(x, y) \in D$, it is also possible to consider only a subset $D' \subseteq D$ of randomly chosen observations. The data set $D$ may also be partitioned randomly into subsets called batches, which are considered in sequence. In this case, another random partition is created once every subset has been used. This procedure, called mini-batch stochastic gradient descent, is widely used due to its efficiency. Intuitively, the procedure makes faster decisions based on sampling. Regardless of these choices, a sequence of iterations that considers all observations in the data set is called an epoch.
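The following sketch implements statements (1)-(4) for a network with sigmoid activations and the halved mean squared cost, and uses statements (5)-(6) to perform one mini-batch stochastic gradient descent update. The function names and the tiny example network are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, weights, biases):
    """Gradients of c = 0.5 * ||a^(L) - y||^2 for a single observation,
    following statements (1)-(4)."""
    # Forward pass, storing weighted inputs z^(l) and activations a^(l).
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # Statement (1): delta^(L) = grad_a c (*) sigma'(z^(L)).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_W = [np.zeros_like(W) for W in weights]
    grad_b = [np.zeros_like(b) for b in biases]
    grad_W[-1] = np.outer(delta, activations[-2])    # statement (4)
    grad_b[-1] = delta                               # statement (3)
    # Statement (2): propagate the error backwards through the layers.
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grad_W[-l] = np.outer(delta, activations[-l - 1])
        grad_b[-l] = delta
    return grad_W, grad_b

def sgd_step(batch, weights, biases, eta):
    """One mini-batch update; statements (5)-(6) average over the batch."""
    sum_W = [np.zeros_like(W) for W in weights]
    sum_b = [np.zeros_like(b) for b in biases]
    for x, y in batch:
        gW, gb = backprop(x, y, weights, biases)
        sum_W = [s + g for s, g in zip(sum_W, gW)]
        sum_b = [s + g for s, g in zip(sum_b, gb)]
    n = len(batch)
    weights[:] = [W - (eta / n) * g for W, g in zip(weights, sum_W)]
    biases[:] = [b - (eta / n) * g for b, g in zip(biases, sum_b)]

def mean_cost(batch, weights, biases):
    total = 0.0
    for x, y in batch:
        a = x
        for W, b in zip(weights, biases):
            a = sigmoid(W @ a + b)
        total += 0.5 * np.sum((a - y) ** 2)
    return total / len(batch)

rng = np.random.default_rng(1)
sizes = [3, 4, 2]
weights = [rng.normal(0, 1, (sizes[i + 1], sizes[i])) for i in range(2)]
biases = [rng.normal(0, 1, sizes[i + 1]) for i in range(2)]
batch = [(rng.normal(size=3), np.array([1.0, 0.0])) for _ in range(8)]
print(mean_cost(batch, weights, biases))
sgd_step(batch, weights, biases, eta=0.5)
print(mean_cost(batch, weights, biases))   # the cost typically decreases after the update
```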

The basic choices involved in learning the parameters of a feedforward neural network using stochastic gradient descent include at least the number of hidden layers, the number of neurons in each hidden layer, the size of the mini-batches, the number of epochs, and the learning rate $\eta$. However, more choices will be necessary for the enhancements proposed in the next sections.

3 Cost functions for feedforward neural networks

The previous section introduced feedforward neural networks. For simplicity, the presentation mentioned only sigmoid activation functions and the (halved) mean squared cost function. This section describes better alternatives for these functions.

Consider the weighted input $z^{(L)}_j$ to neuron $j$ in the output layer $L$. If $|z^{(L)}_j|$ is large, the corresponding neuron is said to be saturated. In that case, if $\sigma$ is the sigmoid function, then $\sigma'(z^{(L)}_j)$ is close to zero. If the cost function is the mean squared cost, the first backpropagation statement gives $\delta^{(L)}_j = (a^{(L)}_j - y_j) \sigma'(z^{(L)}_j)$. Therefore, in this case, $\delta^{(L)}_j$ is close to zero even if the target output $y_j$ is very different from $a^{(L)}_j$. Another consequence is that the partial derivatives of the cost with respect to $w^{(L)}_{jk}$ and $b^{(L)}_j$, for all $k$, will also be close to zero. This is expected from the mean squared cost, because a small change to the output, which would require a significant change to the parameters of a saturated neuron, would not improve the cost significantly. This has the undesired effect of slow learning for saturated neurons in the output layer. Since changing the learning rate would impact learning on the whole network, that is not an elegant solution.

Instead of considering the (halved) mean squared cost, suppose we considered a cost function $c$ for a single observation such that $\frac{\partial c}{\partial a^{(L)}_j} \sigma'(z^{(L)}_j) = a^{(L)}_j - y_j$. The first backpropagation statement would give $\delta^{(L)}_j = a^{(L)}_j - y_j$, which would avoid slow learning for saturated neurons in the output layer. That is precisely why sigmoid output neurons are used together with the cross-entropy cost function.

Consider the data set $D = \{(x_i, y_i) \mid i \in \{1, \ldots, n\}, x_i \in \mathbb{R}^m, y_i \in \mathbb{R}^d\}$. The cross-entropy cost function $c$ for a single pair $(x, y) \in D$ is defined as

$$c = -\sum_{i=1}^d \left[ y_i \ln a^{(L)}_i + (1 - y_i) \ln (1 - a^{(L)}_i) \right]$$

when $a^{(1)} = x$. Notice that this cost function requires $a^{(L)}_i \in (0, 1)$ for all $i$. Intuitively, the cost is close to zero when $y_i \approx a^{(L)}_i$ and tends to infinity when $a^{(L)}_i$ is very different from $y_i$.

Consider the partial derivative of the cost $c$ with respect to $a^{(L)}_j$, which is given by

$$\frac{\partial c}{\partial a^{(L)}_j} = -\frac{y_j}{a^{(L)}_j} + \frac{1 - y_j}{1 - a^{(L)}_j} = \frac{a^{(L)}_j - y_j}{a^{(L)}_j (1 - a^{(L)}_j)}.$$

If $a^{(L)}_j = \sigma(z^{(L)}_j)$ and $\sigma$ is the sigmoid function, then $a^{(L)}_j (1 - a^{(L)}_j) = \sigma'(z^{(L)}_j)$ and $\frac{\partial c}{\partial a^{(L)}_j} \sigma'(z^{(L)}_j) = a^{(L)}_j - y_j$. By the first backpropagation statement, $\delta^{(L)}_j = a^{(L)}_j - y_j$, as we wanted to show.

As previously mentioned, the cross-entropy cost function avoids the slowdown in learning caused by saturated sigmoid output neurons. However, that does not necessarily solve the analogous problem for the previous layers. In comparison with the previous section, the cross-entropy cost with sigmoid output neurons only affects the computation of $\nabla_a c$.
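A small numerical illustration of this point: with the cross-entropy cost, the output error is simply $a^{(L)}_j - y_j$, whereas the mean squared cost multiplies it by $\sigma'(z^{(L)}_j)$, which is tiny for a saturated neuron. The helper names and values below are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(a, y):
    """c = -sum_i [ y_i ln a_i + (1 - y_i) ln(1 - a_i) ] for one observation."""
    return -np.sum(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

def output_delta_cross_entropy(z, y):
    """delta^(L) for sigmoid outputs with the cross-entropy cost: a^(L) - y.
    The sigma'(z^(L)) factor cancels, so saturated output neurons still learn."""
    return sigmoid(z) - y

# Compare with the mean squared cost delta, (a - y) * sigma'(z), at a saturated neuron.
z = np.array([8.0])          # strongly saturated: sigmoid(8) is about 0.9997
y = np.array([0.0])          # but the target is 0
a = sigmoid(z)
mse_delta = (a - y) * a * (1.0 - a)
print(output_delta_cross_entropy(z, y), mse_delta)   # roughly 1.0 versus roughly 3e-4
print(cross_entropy(a, y))                           # large cost, as expected
```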

It is also common to use a different activation function in the output layer. Consider a network where the activation function for neuron $j$ in the output layer $L$ is the identity, that is, $a^{(L)}_j = z^{(L)}_j$. Consider again the cost per pair $(x, y) \in D$ given by the (halved) mean squared cost $c = \frac{1}{2} \lVert a^{(L)} - y \rVert^2$ when $a^{(1)} = x$. This change in the activation function for the last layer only affects the first backpropagation statement. From the definition of $\delta^{(L)}_j$, it is easy to show that

$$\delta^{(L)}_j = \frac{\partial c}{\partial z^{(L)}_j} = a^{(L)}_j - y_j.$$

The last equality follows from $\frac{\partial a^{(L)}_j}{\partial z^{(L)}_j} = 1$. Therefore, the (halved) mean squared cost does not slow learning for saturated linear output neurons. This may be an appropriate choice for regression problems, which consist of predicting real-valued outputs. In comparison with the previous section, the (halved) mean squared cost with linear output neurons only affects the first statement of backpropagation, which should be substituted by $\delta^{(L)} = \nabla_a c = a^{(L)} - y$.

Consider a data set $D = \{(x_i, y_i) \mid i \in \{1, \ldots, n\}, x_i \in \mathbb{R}^m, y_i \in \mathbb{R}^d\}$ for a $d$-class classification problem. Concretely, for every pair $(x, y) \in D$, $y_j = 1$ if $x$ belongs to class $j$ and $y_j = 0$ otherwise. The learning task can be restated as finding the parameters $\theta$ that maximize the likelihood $L(\theta : D)$ of observing $y_i$ given $x_i$ for every $i$ in a conditional probability distribution $P$ parameterized by $\theta \in \Theta$, assuming the sample elements in $D$ are identically distributed and independent given any $\theta$. The likelihood $L(\theta : D)$ is defined as

$$L(\theta : D) = \prod_{i=1}^n P(Y = y_i \mid X = x_i : \theta).$$

This is equivalent to finding the parameters $\theta$ that minimize the negative log-likelihood $l(\theta : D)$, given by

$$l(\theta : D) = -\sum_{i=1}^n \ln P(Y = y_i \mid X = x_i : \theta).$$

Suppose we could interpret the output $a^{(L)}_j$ of neuron $j$ in the output layer $L$ of a feedforward neural network as the probability of the input $a^{(1)} = x$ belonging to class $j$. The negative log-likelihood cost function $c$ for a pair $(x, y) \in D$ could be written as

$$c = -\sum_{i=1}^d y_i \ln a^{(L)}_i,$$

where $a^{(1)} = x$. Notice that this cost function requires $a^{(L)}_i > 0$.

The softmax output layer is used precisely so that the output of the network can be interpreted as a parameterized conditional probability distribution $P(Y \mid X : \theta)$. A feedforward neural network has a softmax output layer when

$$a^{(L)}_j = \frac{e^{z^{(L)}_j}}{\sum_{k=1}^d e^{z^{(L)}_k}}$$

for every neuron $1 \leq j \leq d$ in the output layer. It is easy to see that $\sum_{j=1}^d a^{(L)}_j = 1$ and $a^{(L)}_j > 0$ for all $j$. Given input $a^{(1)} = x$, the activation $a^{(L)}_j$ can be interpreted as $P(Y = e_j \mid X = x : \theta)$, where $e_j$ is the standard basis vector in direction $j$ and $\theta$ represents the parameters of the network.

The softmax activation function is significantly different from those presented so far. Notice that $a^{(L)}_j$ depends on $z^{(L)}_i$ for all $i$. In fact, this completely invalidates the first backpropagation statement. However, we will now show that, for the negative log-likelihood cost, $\delta^{(L)}_j = a^{(L)}_j - y_j$.
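Before the derivation, the claim can be checked numerically by comparing a finite-difference approximation of $\frac{\partial c}{\partial z^{(L)}_j}$ with $a^{(L)}_j - y_j$; the values below are arbitrary.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))      # subtract the maximum for numerical stability
    return e / np.sum(e)

def nll(z, y):
    """Negative log-likelihood cost c = -sum_i y_i ln a_i with a = softmax(z)."""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([1.5, -0.3, 0.8, 2.0])
y = np.array([0.0, 0.0, 1.0, 0.0])     # class 3 in one-hot form

# Central finite-difference estimate of dc/dz_j versus the claimed a_j - y_j.
eps = 1e-6
numeric = np.array([(nll(z + eps * e, y) - nll(z - eps * e, y)) / (2 * eps)
                    for e in np.eye(len(z))])
print(numeric)
print(softmax(z) - y)    # the two vectors should agree to high precision
```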

By definition, $\delta^{(L)}_j = \frac{\partial c}{\partial z^{(L)}_j}$. Because $c$ is a differentiable function of $a^{(L)}_1, \ldots, a^{(L)}_d$ and each $a^{(L)}_k$ is a differentiable function of $z^{(L)}_j$, the chain rule gives

$$\delta^{(L)}_j = \sum_{k=1}^d \frac{\partial c}{\partial a^{(L)}_k} \frac{\partial a^{(L)}_k}{\partial z^{(L)}_j} = -\sum_{k=1}^d \frac{y_k}{a^{(L)}_k} \frac{\partial a^{(L)}_k}{\partial z^{(L)}_j}.$$

Because the softmax output layer will be applied solely to a classification task, $y_k$ is $1$ for a single $k$ and zero otherwise. Consider the derivative of the activation of neuron $k$ with respect to the weighted input to neuron $j$, which is given by

$$\frac{\partial a^{(L)}_k}{\partial z^{(L)}_j} = \frac{\partial}{\partial z^{(L)}_j} \frac{e^{z^{(L)}_k}}{\sum_{i=1}^d e^{z^{(L)}_i}} = \frac{\left( \frac{\partial}{\partial z^{(L)}_j} e^{z^{(L)}_k} \right) \sum_{i=1}^d e^{z^{(L)}_i} - e^{z^{(L)}_k} e^{z^{(L)}_j}}{\left( \sum_{i=1}^d e^{z^{(L)}_i} \right)^2},$$

where $\frac{\partial}{\partial z^{(L)}_j} e^{z^{(L)}_k}$ equals $e^{z^{(L)}_k}$ if $j = k$ and $0$ otherwise. By separating the fraction above into two and using the definition of the activations,

$$\frac{\partial a^{(L)}_k}{\partial z^{(L)}_j} = \begin{cases} a^{(L)}_k (1 - a^{(L)}_j) & \text{if } j = k, \\ -a^{(L)}_k a^{(L)}_j & \text{otherwise.} \end{cases}$$

Using this expression and the fact that $\sum_k y_k = 1$,

$$\delta^{(L)}_j = -y_j (1 - a^{(L)}_j) + \sum_{k \neq j} y_k a^{(L)}_j = a^{(L)}_j \sum_{k=1}^d y_k - y_j = a^{(L)}_j - y_j,$$

as we wanted to show. In comparison with the previous section, the softmax output layer with the negative log-likelihood cost only affects the first statement of backpropagation, which should be substituted by $\delta^{(L)} = a^{(L)} - y$. This combination of cost function and output activation is a principled and elegant formulation of the classification task, and is often used in practice.

4 Improving feedforward neural networks

This section begins by describing techniques to deal with overfitting. In a feedforward neural network, learning consists of finding parameters (weights and biases) that minimize a cost function defined on the available data. In this context, the network is also called a model for the data. A model is said to overfit when it has a low cost over the available data but generalizes poorly to unseen data.

The most common way to diagnose overfitting is to perform cross-validation. In cross-validation, the available data is partitioned into two sets. One of these sets is used for learning (training set), and the other set is used for evaluation (test set). This is also how model efficacy is evaluated.
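A minimal sketch of a random training/validation/test split and of an early-stopping rule based on a validation accuracy curve; the split fractions, the patience value, and the synthetic validation curve are arbitrary assumptions used only for illustration.

```python
import numpy as np

def split_indices(n, val_frac=0.2, test_frac=0.2, seed=0):
    """Random training/validation/test split of n observation indices."""
    idx = np.random.default_rng(seed).permutation(n)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    return idx[n_test + n_val:], idx[n_test:n_test + n_val], idx[:n_test]

def early_stopping_epoch(val_accuracy_per_epoch, patience=10):
    """Return the epoch at which training would be halted: the first epoch at
    which validation accuracy has not improved for `patience` epochs."""
    best, best_epoch = -np.inf, 0
    for epoch, acc in enumerate(val_accuracy_per_epoch):
        if acc > best:
            best, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_accuracy_per_epoch) - 1

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))    # 600 200 200

# A made-up validation curve that improves, peaks, and then degrades (overfitting).
curve = [0.5 + 0.4 * (1 - np.exp(-e / 10)) - 0.002 * max(0, e - 30) for e in range(100)]
print(early_stopping_epoch(curve, patience=10))
```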

However, feedforward neural networks also have hyperparameters. These are the parameters that must be defined even before learning occurs, which include the learning rate, the number of layers, the number of neurons in each layer, the number of epochs, the mini-batch size, etc. Evaluating generalization efficacy with hyperparameters chosen according to their efficacy on the test set is a methodological mistake. The hyperparameters may also overfit the test data, which would result in overly optimistic evaluations.

The available data is therefore usually split into three sets: training, validation and testing. The validation set is used to compare the results of different hyperparameters, which are used for learning on the training set. The best hyperparameters on the validation set are used for learning on both the training and validation sets. Finally, the model is evaluated on the test set, which gives a generalization efficacy estimate. This process may be repeated several times, using schemes such as k-fold cross-validation or bootstrapping, which we do not detail here.

In the case of feedforward neural networks, it is common to create graphs comparing cost and accuracy on training and validation data as epochs pass. This may help in diagnosing overfitting. For example, the accuracy on the training data may become perfect after a number of epochs, while the accuracy on the validation data becomes worse. Early stopping consists of halting the learning procedure when validation accuracy does not improve significantly after a given number of epochs.

Choosing hyperparameters for a feedforward neural network that achieve good generalization on the validation set can be a very challenging task. Because the number of hyperparameters is very large, manual fine-tuning is often preferred to schemes such as grid search.

Regularization is an important technique to prevent overfitting. Intuitively, regularization penalizes large weights in the network, which could make the output very sensitive to small changes in the input. We define the L2-regularized cost function $C$ for a data set $D = \{(x_i, y_i) \mid i \in \{1, \ldots, n\}, x_i \in \mathbb{R}^m, y_i \in \mathbb{R}^d\}$ as

$$C = \frac{1}{n} \sum_{(x,y) \in D} c + \frac{\lambda}{2n} \sum_{w \in W} w^2,$$

where $c$ is the cost for the pair $(x, y)$ when $a^{(1)} = x$, $W$ is the set of all weights in the network, and $\lambda \geq 0$ is the regularization factor. Notice how the second term increases the overall cost of having large weights when the first term is fixed. Intuitively, in the case of an extremely large $\lambda$, the network would be rewarded for effectively removing most weights. It is important to note that regularization is not incorporated into the cost $c$ for a single observation, because that would invalidate the fourth property of backpropagation.

It is easy to show that the partial derivative of the cost $C$ with respect to a bias $b^{(l)}_j$ is

$$\frac{\partial C}{\partial b^{(l)}_j} = \frac{1}{n} \sum_{(x,y) \in D} \frac{\partial c}{\partial b^{(l)}_j},$$

which is the same as before. The reason for not penalizing the biases is mostly empirical. The partial derivative of the cost $C$ with respect to a weight $w^{(l)}_{jk}$ is

$$\frac{\partial C}{\partial w^{(l)}_{jk}} = \frac{1}{n} \sum_{(x,y) \in D} \frac{\partial c}{\partial w^{(l)}_{jk}} + \frac{\lambda}{n} w^{(l)}_{jk}.$$

When using gradient descent to minimize $C$ with respect to the parameters of the network, the update rule for $w^{(l)}_{jk}$ can be written as

$$w^{(l)}_{jk} \leftarrow w^{(l)}_{jk} - \eta \frac{\partial C}{\partial w^{(l)}_{jk}} = \left( 1 - \frac{\eta \lambda}{n} \right) w^{(l)}_{jk} - \frac{\eta}{n} \sum_{(x,y) \in D} \frac{\partial c}{\partial w^{(l)}_{jk}}.$$

Notice that the last rule is very similar to gradient descent on a cost function without regularization, except for the factor $0 < 1 - \frac{\eta \lambda}{n} < 1$ (under sensible choices of $\eta$ and $\lambda$) multiplying $w^{(l)}_{jk}$. This regularization method is also called weight decay. When performing stochastic gradient descent, the decay factor remains $1 - \frac{\eta \lambda}{n}$ for every update, irrespective of the batch size.
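A sketch of the weight decay update for a single weight matrix; the argument grad_c_W stands for the average of $\frac{\partial c}{\partial w}$ over the data (or over a mini-batch), and the numerical values are arbitrary.

```python
import numpy as np

def weight_decay_step(W, grad_c_W, eta, lam, n):
    """Gradient descent step on the L2-regularized cost:
    W <- (1 - eta*lam/n) * W - eta * (average gradient of c over the data)."""
    return (1.0 - eta * lam / n) * W - eta * grad_c_W

rng = np.random.default_rng(0)
W = rng.normal(0, 1, (4, 3))
grad_c_W = rng.normal(0, 0.1, (4, 3))
W_new = weight_decay_step(W, grad_c_W, eta=0.5, lam=5.0, n=1000)
print(np.linalg.norm(W), np.linalg.norm(W_new))   # the decay factor pulls the weights toward zero
```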

Another way to reduce overfitting is to use more training data. Intuitively, this is useful to constrain the freedom in the choice of parameters, which is one of the major causes of overfitting. Because it is not always easy to obtain more data, one particularly useful idea is to expand the available data by transforming input vectors in ways that mimic expected variations. For example, if the input vectors correspond to images, transformations such as translation, scaling, and rotation often should not affect classification.

Previously, we mentioned that stochastic gradient descent may begin at a random assignment of the weights and biases. Consider neuron $j$ in layer $l$, and let $p = n^{(l-1)}$ denote the number of neurons in layer $l-1$, which is also the number of inputs to neuron $j$. Suppose the weights reaching neuron $j$ were sampled according to a Gaussian distribution with mean $0$ and standard deviation $\sigma$, such that $w^{(l)}_{jk} \sim \mathcal{N}(0, \sigma^2)$ for all $k$. By the properties of Gaussian-distributed random variables, $\sum_k w^{(l)}_{jk} \sim \mathcal{N}(0, p\sigma^2)$. If the weights were sampled from a standard Gaussian distribution $\mathcal{N}(0, 1)$, then $\sum_k w^{(l)}_{jk} \sim \mathcal{N}(0, p)$. If the number of inputs $p$ to neuron $j$ were large, the neuron could easily become saturated. As already mentioned, saturated neurons slow learning, because small changes to their parameters do not significantly affect their outputs. In contrast, consider $w^{(l)}_{jk} \sim \mathcal{N}(0, \frac{1}{p})$. In this case, $\sum_k w^{(l)}_{jk} \sim \mathcal{N}(0, 1)$, which helps to avoid early saturation and improves overall learning success. This is a common recommendation for weight initialization in feedforward neural networks. The bias $b^{(l)}_j$ for neuron $j$ can be sampled from $\mathcal{N}(0, 1)$.

Consider a feedforward neural network with a single sigmoid neuron in each layer and a cross-entropy cost function. In this case, it is easy to show that the error $\delta^{(l)}$ of the neuron at layer $l$ for a single pair $(x, y)$ is

$$\delta^{(l)} = (a^{(L)} - y) \prod_{i=l}^{L-1} \sigma'(z^{(i)}) \, w^{(i+1)}$$

for every $l > 1$. If the weights in the network are not very large, a single saturated neuron could make $\delta^{(l)}$ very small. Thus, the partial derivatives of the cost with respect to the parameters would also be small in layer $l$, which would compromise learning by gradient descent in all layers previous to and including $l$. An analogous problem occurs in more general networks. Because $c$ can be seen as a function of the weighted inputs $z^{(l)}_j$ for all neurons $j$ in a given layer $l$, this issue is characterized by $\nabla_{z^{(l)}} c \approx 0$, which is why this learning problem is called vanishing gradients. When the weights in the network are large, exploding (large) gradients could also become a problem. However, large weights are often followed by saturated neurons, which makes that problem less common. It may also be important to scale the input vector elements to have zero mean and unit variance across the data set to avoid early saturation.

It can be shown that deep feedforward neural networks ($L > 3$ layers) can approximate more complicated functions with fewer parameters than shallower networks. However, in practice, vanishing gradients make it difficult to obtain better efficacy with deep feedforward neural networks trained by gradient descent. We now describe two techniques considered important in the context of deep feedforward neural networks: momentum and dropout.

The momentum technique is a common heuristic for training deep artificial neural networks. In momentum-based stochastic gradient descent, each parameter $w$ in the network (weight or bias) has a corresponding velocity $v$. The velocity is defined by $v_0 = 0$ and

$$v_t = \mu v_{t-1} - \eta \frac{\partial C}{\partial w},$$

where $v_t$ and $w_t$ correspond, respectively, to $v$ and $w$ at iteration $t$ of stochastic gradient descent, and the partial derivative is evaluated at $w_{t-1}$. At each iteration, the parameter $w$ is updated by the rule $w_t = w_{t-1} + v_t$. Intuitively, the momentum technique remembers the velocity of each parameter, allowing larger updates when the direction of decrease in cost is consistent over many iterations. The parameter $0 \leq \mu \leq 1$ controls the effect of the previous velocity on the next velocity, and $1 - \mu$ is commonly interpreted as a coefficient of friction. If $\mu = 0$, the technique is equivalent to stochastic gradient descent.
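A sketch of the momentum update applied to a simple quadratic function (instead of a network cost), which makes its effect easy to observe; the matrix, learning rate and $\mu$ are arbitrary.

```python
import numpy as np

def momentum_step(w, v, grad, eta=0.1, mu=0.9):
    """Momentum-based update: v <- mu*v - eta*grad, then w <- w + v."""
    v = mu * v - eta * grad
    return w + v, v

# Minimize f(w) = 0.5 * w^T A w, whose gradient is A w.
A = np.diag([1.0, 50.0])
w, v = np.array([10.0, 1.0]), np.zeros(2)
for _ in range(300):
    w, v = momentum_step(w, v, A @ w, eta=0.01, mu=0.9)
print(w)   # both coordinates approach the minimum at the origin
```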

Dropout is another heuristic for training deep artificial neural networks. At every iteration of stochastic gradient descent, half of the hidden neurons are removed at random. In most implementations, this can be accomplished by forcing the outputs of the corresponding neurons to be zero. The modified network is applied as usual to the observations in a mini-batch, and backpropagation follows as if the network had not been changed. The resulting partial derivatives are used to update the parameters of the neurons that were not removed. After training is finished, the weights incoming from hidden neurons are halved. This heuristic is believed to make the network robust to the absence of particular features, which might be particular to the training data. Dropout is considered related to regularization, since it also tries to reduce overfitting.

There are many heuristics for implementing feedforward neural networks that will not be described in detail in this text. They include Hessian optimization, input whitening, rmsprop, and adaptive learning rates. Although we focused most of the discussion on sigmoid neurons, rectified linear neurons have achieved superior results in important benchmarks.

5 Convolutional neural networks

Convolutional neural networks were first developed for image classification, although they have also been applied to other tasks. This section focuses on image classification using convolutional neural networks.

A two-dimensional image is a function $f: \mathbb{Z}^2 \to \mathbb{R}^c$. If $f(a) = (f_1(a), \ldots, f_c(a))$, then $f_i$ is also called channel $i$. An element $a \in \mathbb{Z}^2$ is called a pixel, and $f(a)$ is the value of pixel $a$. A window $W \subset \mathbb{Z}^2$ is a finite set $W = \{s_1, \ldots, S_1\} \times \{s_2, \ldots, S_2\}$ that corresponds to a rectangle in the image domain. The size of this window $W$ is denoted $h \times w$, where $h = S_1 - s_1 + 1$ and $w = S_2 - s_2 + 1$. Because the domain $D$ of the images of interest is usually a window, it is possible to represent an image $f$ by a vector $x \in \mathbb{R}^{c|D|}$. In this vector, there is a scalar value $f_i(a)$ corresponding to each channel $i$ of each pixel $a \in D$.

Consider a data set $D = \{(x_i, y_i) \mid i \in \{1, \ldots, n\}, x_i \in \mathbb{R}^m, y_i \in \mathbb{R}^d\}$, where each vector $x_i$ corresponds to a distinct image $f: \mathbb{Z}^2 \to \mathbb{R}^c$, all images are defined on the same window $D'$, and $m = c|D'|$. The task of image classification consists of assigning a class label to a given vector $x$ based on generalization from $D$. A convolutional neural network is particularly well suited for image classification because it exploits the spatial relationships between pixels (their organization in $\mathbb{Z}^2$).

Similarly to feedforward neural networks, a convolutional neural network is also a parameterized function, and the parameters are usually learned by stochastic gradient descent on a cost function defined on the training set. In contrast to feedforward neural networks, there are three main types of layers in a convolutional neural network: convolutional layers, pooling layers, and fully connected layers.

A convolutional layer receives an input image $f$ and outputs an image $o$. A convolutional layer is composed solely of artificial neurons. Each artificial neuron $j$ in a convolutional layer $l$ receives as input the values in a window $W = \{s_1, \ldots, S_1\} \times \{s_2, \ldots, S_2\} \subseteq D$ of size $h \times w$, where $D$ is the domain of $f$. The weighted input $z^{(l)}_j$ of that neuron is given by

$$z^{(l)}_j = b^{(l)}_j + \sum_{i=1}^{c} \sum_{r=s_1}^{S_1} \sum_{s=s_2}^{S_2} w^{(l)}_{j,i,r,s} \, a^{(l-1)}_{i,r,s}.$$

In the equation above, $a^{(l-1)}_{i,r,s} = f_i(r, s)$ is the value of pixel $(r, s)$ in channel $i$ of the input image. Also, $b^{(l)}_j$ is the bias of neuron $j$, and $w^{(l)}_{j,i,r,s}$ is the weight that neuron $j$ in layer $l$ associates to $f_i(r, s)$. The activation function for a convolutional layer is usually rectified linear, so $a^{(l)}_j = \max(0, z^{(l)}_j)$. The definition of $z^{(l)}_j$ is similar to the definition of the weighted input for a neuron in a feedforward neural network. The only difference is that a neuron in a convolutional layer is not necessarily connected to the activations of all neurons in the previous layer, but only to the activations in a particular $h \times w$ window $W$. Each neuron in a convolutional layer has $chw$ weights and a single bias.

A neuron in a convolutional layer is replicated (through parameter sharing) for all windows of size $h \times w$ in the domain $D$ whose centers are offset by pre-defined steps. These steps are the horizontal and vertical strides. The activations corresponding to a neuron replicated in this way correspond to the values in a single channel of the output image $o$ (appropriately arranged in $\mathbb{Z}^2$). Thus, an output image $o: \mathbb{Z}^2 \to \mathbb{R}$ is obtained by replicating a neuron over the whole domain of the input image. The total number of free parameters for each such replicated neuron in a convolutional layer is only $chw + 1$. If the parameters in a convolutional layer were not shared by the replicated neurons, the number of parameters would be $M(chw + 1)$, where $M$ is the number of windows of size $h \times w$ that fit into $f$ (for the given strides).

The weighted outputs (minus the bias) of the replicated neurons correspond to an output channel that is analogous to the discrete (multichannel) convolution of the input $f$ with a particular image $g$. The values of $g$ correspond to the (shared) weights of the replicated neurons (appropriately arranged in $\mathbb{Z}^2$). This assumes that the horizontal and vertical strides are $1$, and that the domain of the resulting image is always restricted to the window domain of $f$.
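A naive NumPy sketch of the convolutional layer just described, with strides of $1$, no padding, and rectified linear activations; the image and filter sizes are arbitrary.

```python
import numpy as np

def conv_layer(f, filters, biases):
    """Naive convolutional layer with stride 1 and no padding.
    f: input image, shape (c, H, W).
    filters: shared weights, shape (u, c, h, w), giving u output channels.
    biases: shape (u,).
    Returns o with shape (u, H - h + 1, W - w + 1), with ReLU activations."""
    c, H, W = f.shape
    u, _, h, w = filters.shape
    o = np.zeros((u, H - h + 1, W - w + 1))
    for j in range(u):                       # each replicated neuron / filter
        for r in range(H - h + 1):
            for s in range(W - w + 1):
                window = f[:, r:r + h, s:s + w]
                z = biases[j] + np.sum(filters[j] * window)   # weighted input
                o[j, r, s] = max(0.0, z)                      # rectified linear
    return o

rng = np.random.default_rng(0)
image = rng.normal(size=(3, 8, 8))              # c = 3 channels, 8x8 pixels
filters = rng.normal(size=(4, 3, 3, 3)) * 0.1   # 4 filters of size 3x3
print(conv_layer(image, filters, np.zeros(4)).shape)   # (4, 6, 6)
```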

In other words, each channel $o_u$ in the output $o$ of a convolutional layer corresponds to a convolution with an image $g_u$, which is also called a filter. This is the origin of the name convolutional network. Therefore, to define a convolutional layer, it is enough to specify the size of the filters (window size), the number of filters (number of channels in the output image), and the horizontal and vertical strides (which are usually $1$). Each channel in the output of a convolutional layer can also be seen as the response of the input image to a particular (learned) filter. Based on this interpretation, each channel in the output image is also called an activation map.

Backpropagation can be adapted to compute the partial derivative of the cost with respect to every parameter in a convolutional layer. The fact that a single weight affects the output of several neurons must be taken into account. We omit the details of backpropagation for convolutional neural networks in this text.

A pooling layer receives an input image $f: \mathbb{Z}^2 \to \mathbb{R}^c$ and outputs an image $o: \mathbb{Z}^2 \to \mathbb{R}^c$. A pooling layer reduces the size of the window domain $D$ of $f$ by an operation that acts independently on each channel. A common pooling technique is max-pooling. In max-pooling, the maximum value of channel $f_i$ in a particular window of size $h \times w$ corresponds to an output value in channel $o_i$. To define a max-pooling layer, it is enough to specify the size of these windows and the strides (which usually match the window dimensions). The objective of reducing the spatial domain of the image is to achieve similar results to using comparatively larger convolutional filters in the next layers. This supposedly allows the detection of higher-level features in the input image with a reduced number of parameters. It is also believed that max-pooling improves the invariance of the classification to translations of the original image. In practice, a sequence of alternating convolutional and max-pooling layers has obtained excellent results in many image classification tasks. Backpropagation can also be performed through max-pooling layers.

In summary, a max-pooling layer receives an input image $f: \mathbb{Z}^2 \to \mathbb{R}^c$ and outputs an image $o: \mathbb{Z}^2 \to \mathbb{R}^c$ defined by

$$o_i(r, s) = \max_{a \in W_{r,s}} f_i(a),$$

where $i \in \{1, \ldots, c\}$, $(r, s) \in \mathbb{Z}^2$, $D$ is the window domain of $f$, and $W_{r,s} \subseteq D$ is the input window corresponding to output pixel $(r, s)$.
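A naive NumPy sketch of max-pooling with windows of size $h \times w$ and matching strides; the window size and the example are arbitrary.

```python
import numpy as np

def max_pool(f, h=2, w=2):
    """Max-pooling with window h x w and matching strides, per channel.
    f has shape (c, H, W); H and W are assumed to be multiples of h and w."""
    c, H, W = f.shape
    o = np.zeros((c, H // h, W // w))
    for i in range(c):
        for r in range(0, H, h):
            for s in range(0, W, w):
                o[i, r // h, s // w] = np.max(f[i, r:r + h, s:s + w])
    return o

x = np.arange(16, dtype=float).reshape(1, 4, 4)
print(max_pool(x))   # the 2x2 block maxima: 5, 7, 13, 15
```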
A fully connected layer receives an input image $f: \mathbb{Z}^2 \to \mathbb{R}^c$ (or an input vector $x$) and outputs a vector $o$. A fully connected layer is precisely analogous to a layer in a feedforward neural network, and can only be succeeded by other fully connected layers. The final layer in a convolutional neural network is always a fully connected layer with $d$ neurons, which is responsible for representing the classification. Backpropagation in fully connected layers is analogous to backpropagation in feedforward networks.

The task of detection corresponds to delimiting the spatial extent of particular objects of interest within an image. Consider a convolutional neural network trained on images with a window domain $D$. The input to the first fully connected layer is an image with a window domain $W$ of size $h \times w$, which depends on the number of pooling layers present in the network. This image with domain $W$ is ultimately mapped (through a forward pass through the fully connected layers) to a vector $o \in \mathbb{R}^d$, which represents the confidence that the input image belongs to each class. Notice that it is possible to apply the same network, up to the first fully connected layer, to an input image with a much larger window domain $D_L$. This results in a larger input image with domain $W_L$ for the first fully connected layer. It is possible to partition the window $W_L$ into smaller windows $W_l$ of size $h \times w$, which matches the size required by the first fully connected layer. Therefore, each smaller window $W_l$ of this larger input image can be associated to an output vector $o_l \in \mathbb{R}^d$, which represents the confidence that the input image contains a particular object in window $W_l$. As a consequence, the original larger image is partitioned into rectangular windows for which the confidence that each window contains a particular object is known. This partition can be represented by an image $g: \mathbb{Z}^2 \to \mathbb{R}^d$, called a class map. Precisely the same class map would be obtained by exhaustively applying the convolutional network to each window of size $h \times w$ that fits into the larger input image, although the computational cost would be significantly higher.

In summary, for the purpose of detection, the convolutional neural network can be trained with fixed-size images of the objects of interest. Detection is performed by resizing the input image so that the expected size of the objects of interest matches the size observed during training. The resized image is received as input by the adapted network, which partitions the input to the fully connected layers appropriately. The corresponding class map can be post-processed to obtain rectangular boundaries containing objects, or confidences for the presence of a given object.

The choice of hyperparameters for convolutional neural networks (types of layers, number of layers, filters per layer, filter sizes, strides) is crucial to achieve good performance. Deep convolutional neural networks are usually trained on immense data sets, requiring (non-trivial) efficient implementations, which include all the improvements mentioned in the previous section. After training a convolutional neural network for a particular data set, it is possible to re-use the parameters of the network (up to its last fully connected layer) as a starting point for another classification task. This technique decouples representation learning from a specific image classification problem, and has been very successful in practice.

6 Recurrent neural networks

Let $A^+$ denote the set of non-empty sequences of elements of the set $A$, and let $|X|$ denote the length of a sequence $X \in A^+$. We denote the $t$-th element of sequence $X$ by $X_t$. Consider the data set $D = \{(X_i, Y_i) \mid i \in \{1, \ldots, n\}, X_i \in (\mathbb{R}^m)^+, Y_i \in (\mathbb{R}^d)^+\}$. Furthermore, let $|X| = |Y|$ for every $(X, Y) \in D$. In other words, the data set $D$ is composed of pairs $(X, Y)$ of sequences of the same length. Each element of the two sequences is a real vector, but $X_t$ and $Y_t$ do not necessarily have the same dimension.

The task of sequence element classification consists of finding a function $f: (\mathbb{R}^m)^+ \to (\mathbb{R}^d)^+$ that is able to generalize from the examples in $D$. This task is more general than sequence classification, which consists of finding a function $f: (\mathbb{R}^m)^+ \to \mathbb{R}^d$. It is not straightforward to apply the models presented so far to sequence element classification.

Recurrent neural networks compose a particular class of artificial neural networks that can be applied to sequence element classification. We start by introducing a simple recurrent neural network with a single hidden layer. Many definitions are analogous to those presented for feedforward neural networks, and will be omitted when there is no ambiguity.

Let $n^{(l)}$ be the number of neurons in layer $l$, with $n^{(1)} = m$ and $n^{(3)} = d$. Let $b^{(l)}_j \in \mathbb{R}$ represent the bias for neuron $j$ in layer $l > 1$, and $w^{(l)}_{jk} \in \mathbb{R}$ represent the weight reaching neuron $j$ in layer $l > 1$ from neuron $k$ in layer $l-1$. Furthermore, let $\omega_{jk} \in \mathbb{R}$ represent the weight reaching neuron $j$ in layer $2$ from neuron $k$ in layer $2$ at the previous time step.

The input activation at time $t$ for $(X, Y) \in D$ is defined as $a^{(1)}_t = X_t$. The weighted input to neuron $j$ in layer $2$ at time step $t$ is defined as

$$z^{(2)}_{t,j} = b^{(2)}_j + \sum_k w^{(2)}_{jk} a^{(1)}_{t,k} + \sum_k \omega_{jk} a^{(2)}_{t-1,k},$$

where $a^{(2)}_0$ may be zero, although it could also be learned. The activation of neuron $j$ in a layer $l > 1$ at time $t$ is defined as $a^{(l)}_{t,j} = \sigma(z^{(l)}_{t,j})$, where $\sigma$ is a differentiable activation function. For concreteness, we will consider sigmoid activation functions. The weighted input to neuron $j$ in layer $3$ is defined as

$$z^{(3)}_{t,j} = b^{(3)}_j + \sum_k w^{(3)}_{jk} a^{(2)}_{t,k},$$

as in a feedforward neural network. The output of the recurrent neural network $f$ on input $X = a^{(1)}_1, \ldots, a^{(1)}_T$ is the sequence $a^{(3)}_1, \ldots, a^{(3)}_T$. This completes the definition of a recurrent neural network with a single recurrent hidden layer.

Intuitively, the sequence $X$ is presented to the network element by element. The network behaves similarly to a single hidden layer feedforward neural network, except for the fact that the output activation $a^{(2)}_{t-1}$ of the hidden layer at time $t-1$ possibly affects the weighted input $z^{(2)}_t$ of the hidden layer at time $t$. Intuitively, an ideal recurrent neural network would be capable of representing a sequence $X_{1:t}$ by its hidden layer activation $a^{(2)}_t$, to allow correct classification of $X_t$.

Once again, we define the task of learning the parameters (weights and biases) of such a network as finding parameters that minimize a cost function $C = \frac{1}{n} \sum_{(X,Y) \in D} c$. Several choices of $c$ are possible, as discussed in a previous section. Concretely, consider the case of a cross-entropy cost function $c$ for sequence element classification, defined as

$$c = -\frac{1}{T} \sum_{t=1}^T \sum_{i=1}^d \left[ Y_{t,i} \ln a^{(3)}_{t,i} + (1 - Y_{t,i}) \ln (1 - a^{(3)}_{t,i}) \right],$$

where $X = a^{(1)}_1, \ldots, a^{(1)}_T$ is the input sequence and $a^{(3)}_1, \ldots, a^{(3)}_T$ is the output sequence. As mentioned in a previous section, this cost function requires that every $Y_t$ be a standard basis vector, which represents the fact that every $X_t$ belongs to a single class.
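A minimal NumPy sketch of the forward pass of this recurrent network and of the sequence element cross-entropy cost, assuming sigmoid activations; the dimensions and random parameters are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(X, W2, b2, Omega, W3, b3):
    """Forward pass of the single-hidden-layer recurrent network.
    X: sequence, shape (T, m).  W2: (n2, m), Omega: (n2, n2), W3: (d, n2).
    Returns the output sequence a^(3), shape (T, d)."""
    T = X.shape[0]
    a2 = np.zeros(W2.shape[0])              # a_0^(2) taken as zero
    outputs = []
    for t in range(T):
        z2 = b2 + W2 @ X[t] + Omega @ a2    # recurrent weighted input
        a2 = sigmoid(z2)
        z3 = b3 + W3 @ a2
        outputs.append(sigmoid(z3))
    return np.array(outputs)

def sequence_cross_entropy(A3, Y):
    """c = -(1/T) sum_t sum_i [Y ln a + (1 - Y) ln(1 - a)] for one sequence pair."""
    T = Y.shape[0]
    return -np.sum(Y * np.log(A3) + (1 - Y) * np.log(1 - A3)) / T

rng = np.random.default_rng(0)
m, n2, d, T = 3, 5, 2, 4
W2 = rng.normal(0, 0.5, (n2, m))
Omega = rng.normal(0, 0.5, (n2, n2))
W3 = rng.normal(0, 0.5, (d, n2))
b2, b3 = np.zeros(n2), np.zeros(d)
X = rng.normal(size=(T, m))
Y = np.eye(d)[rng.integers(0, d, T)]        # one class per sequence element
print(sequence_cross_entropy(rnn_forward(X, W2, b2, Omega, W3, b3), Y))
```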

The error of neuron $j$ in layer $l$ at time step $t$ is defined as $\delta^{(l)}_{t,j} = \frac{\partial c}{\partial z^{(l)}_{t,j}}$. Backpropagation through time is a method for computing the partial derivatives of the cost function of a recurrent neural network with respect to its parameters. In the case of a recurrent neural network with a single hidden layer, the method depends solely on the following statements, which will be demonstrated shortly:

$$\delta^{(3)}_{t,j} = \frac{\partial c}{\partial a^{(3)}_{t,j}} \sigma'(z^{(3)}_{t,j}), \tag{1}$$
$$\delta^{(2)}_{t,j} = \sigma'(z^{(2)}_{t,j}) \left[ \sum_k w^{(3)}_{kj} \delta^{(3)}_{t,k} + \sum_k \omega_{kj} \delta^{(2)}_{t+1,k} \right], \tag{2}$$
$$\frac{\partial c}{\partial b^{(l)}_j} = \sum_{t=1}^T \delta^{(l)}_{t,j}, \tag{3}$$
$$\frac{\partial c}{\partial w^{(l)}_{jk}} = \sum_{t=1}^T \delta^{(l)}_{t,j} a^{(l-1)}_{t,k}, \tag{4}$$
$$\frac{\partial c}{\partial \omega_{jk}} = \sum_{t=1}^T \delta^{(2)}_{t,j} a^{(2)}_{t-1,k}, \tag{5}$$
$$\frac{\partial C}{\partial b^{(l)}_j} = \frac{1}{n} \sum_{(X,Y) \in D} \frac{\partial c}{\partial b^{(l)}_j}, \tag{6}$$
$$\frac{\partial C}{\partial w^{(l)}_{jk}} = \frac{1}{n} \sum_{(X,Y) \in D} \frac{\partial c}{\partial w^{(l)}_{jk}}, \tag{7}$$
$$\frac{\partial C}{\partial \omega_{jk}} = \frac{1}{n} \sum_{(X,Y) \in D} \frac{\partial c}{\partial \omega_{jk}}, \tag{8}$$

where $\delta^{(2)}_{T+1,j}$ is zero. Notice how the error $\delta^{(2)}_{t,j}$ depends on the error $\delta^{(2)}_{t+1,k}$ for all $k$. Every quantity on the right side can be computed easily from our definitions by starting from $\delta^{(3)}_{T,j}$, which can only be computed after the network observes the entire sequence. That is why the method is called backpropagation through time.

A single hidden layer recurrent neural network can be unfolded into a feedforward network, which may be useful to interpret backpropagation through time. For each pair $(X, Y) \in D$, let $X = a^{(1)}_1, \ldots, a^{(1)}_T$, and consider a feedforward network with $mT$ input neurons, $dT$ output neurons, and $T$ hidden layers with $n^{(2)}$ neurons each. Input $a^{(1)}_t$ is connected solely to the hidden neurons in layer $t$, and the hidden neurons in layer $t$ are connected to the output neurons that correspond to $a^{(3)}_t$. Furthermore, the hidden neurons in layer $t < T$ are connected to the hidden neurons in layer $t+1$. The parameters in the network are shared both between corresponding hidden neurons in distinct layers and between corresponding output neurons. An analogous unfolding can be applied to any recurrent neural network architecture. Notice that the unfolded recurrent network is deep for any sequence with more than one element, which indicates that the model may be difficult to train for long sequences.

The proofs of the backpropagation through time statements presented below were derived independently by the author of this text. Although the results agree with the literature, the details should be inspected carefully.

Consider the statement $\delta^{(3)}_{t,j} = \frac{\partial c}{\partial a^{(3)}_{t,j}} \sigma'(z^{(3)}_{t,j})$. Because $c$ is differentiable with respect to $a^{(3)}_{t,j}$, and $z^{(3)}_{t,j}$ only affects $c$ through $a^{(3)}_{t,j}$,

$$\delta^{(3)}_{t,j} = \frac{\partial c}{\partial z^{(3)}_{t,j}} = \frac{\partial c}{\partial a^{(3)}_{t,j}} \frac{\partial a^{(3)}_{t,j}}{\partial z^{(3)}_{t,j}} = \frac{\partial c}{\partial a^{(3)}_{t,j}} \sigma'(z^{(3)}_{t,j}),$$

as we wanted to show.

Furthermore, if we consider a cross-entropy cost function for sequence element classification and a sigmoid activation function,

$$\frac{\partial c}{\partial a^{(3)}_{t,j}} = \frac{\partial}{\partial a^{(3)}_{t,j}} \left( -\frac{1}{T} \sum_{t'=1}^T \sum_{i=1}^d \left[ Y_{t',i} \ln a^{(3)}_{t',i} + (1 - Y_{t',i}) \ln (1 - a^{(3)}_{t',i}) \right] \right) = \frac{1}{T} \cdot \frac{a^{(3)}_{t,j} - Y_{t,j}}{a^{(3)}_{t,j} (1 - a^{(3)}_{t,j})},$$

which gives

$$\delta^{(3)}_{t,j} = \frac{a^{(3)}_{t,j} - Y_{t,j}}{T}.$$

In the case of a softmax output layer, it is no longer true that $z^{(3)}_{t,j}$ only affects $c$ through $a^{(3)}_{t,j}$. However, for a negative log-likelihood cost function $c$ for sequence element classification, it is still true that $\delta^{(3)}_{t,j} = \frac{a^{(3)}_{t,j} - Y_{t,j}}{T}$. The proof is analogous to that presented in a previous section.

Consider the statement

$$\delta^{(2)}_{t,j} = \sigma'(z^{(2)}_{t,j}) \left[ \sum_k w^{(3)}_{kj} \delta^{(3)}_{t,k} + \sum_k \omega_{kj} \delta^{(2)}_{t+1,k} \right],$$

which we will prove for every $t \leq T$. As already mentioned, we define $\delta^{(2)}_{T+1,j} = 0$. Because $z^{(2)}_{t,j}$ only affects $c$ through $z^{(3)}_{t,1}, \ldots, z^{(3)}_{t,n^{(3)}}$ and through $z^{(2)}_{t+1,1}, \ldots, z^{(2)}_{t+1,n^{(2)}}$, the chain rule gives

$$\delta^{(2)}_{t,j} = \frac{\partial c}{\partial z^{(2)}_{t,j}} = \sum_k \frac{\partial c}{\partial z^{(3)}_{t,k}} \frac{\partial z^{(3)}_{t,k}}{\partial z^{(2)}_{t,j}} + \sum_k \frac{\partial c}{\partial z^{(2)}_{t+1,k}} \frac{\partial z^{(2)}_{t+1,k}}{\partial z^{(2)}_{t,j}} = \sum_k \delta^{(3)}_{t,k} \frac{\partial z^{(3)}_{t,k}}{\partial z^{(2)}_{t,j}} + \sum_k \delta^{(2)}_{t+1,k} \frac{\partial z^{(2)}_{t+1,k}}{\partial z^{(2)}_{t,j}}.$$

From the definition of $z^{(3)}_{t,k} = b^{(3)}_k + \sum_i w^{(3)}_{ki} a^{(2)}_{t,i}$,

$$\frac{\partial z^{(3)}_{t,k}}{\partial z^{(2)}_{t,j}} = w^{(3)}_{kj} \sigma'(z^{(2)}_{t,j}).$$

From the definition of $z^{(2)}_{t+1,k} = b^{(2)}_k + \sum_i w^{(2)}_{ki} a^{(1)}_{t+1,i} + \sum_i \omega_{ki} a^{(2)}_{t,i}$,

$$\frac{\partial z^{(2)}_{t+1,k}}{\partial z^{(2)}_{t,j}} = \omega_{kj} \sigma'(z^{(2)}_{t,j}).$$

Going back,

$$\delta^{(2)}_{t,j} = \sum_k \delta^{(3)}_{t,k} w^{(3)}_{kj} \sigma'(z^{(2)}_{t,j}) + \sum_k \delta^{(2)}_{t+1,k} \omega_{kj} \sigma'(z^{(2)}_{t,j}) = \sigma'(z^{(2)}_{t,j}) \left[ \sum_k w^{(3)}_{kj} \delta^{(3)}_{t,k} + \sum_k \omega_{kj} \delta^{(2)}_{t+1,k} \right],$$

as we wanted to show.

Consider the statement $\frac{\partial c}{\partial b^{(l)}_j} = \sum_t \delta^{(l)}_{t,j}$, which we prove independently for $l = 3$ and $l = 2$. Because $b^{(3)}_j$ only affects $c$ through $z^{(3)}_{1,j}, \ldots, z^{(3)}_{T,j}$, the chain rule gives

$$\frac{\partial c}{\partial b^{(3)}_j} = \sum_{t=1}^T \frac{\partial c}{\partial z^{(3)}_{t,j}} \frac{\partial z^{(3)}_{t,j}}{\partial b^{(3)}_j} = \sum_{t=1}^T \delta^{(3)}_{t,j},$$

which completes the proof for $l = 3$. Although $b^{(2)}_j$ only affects $c$ through $z^{(2)}_{1,j}, \ldots, z^{(2)}_{T,j}$, it is also true that $z^{(2)}_{t,j}$ may affect $z^{(2)}_{t+1,k}$ for any $k$ and $t < T$. As a consequence, the chain rule does not apply analogously to the previous case. Consider, as an artifice, the introduction of independent biases $b^{(2)}_{t,j} = b^{(2)}_j$ for each time step $t$. Because $b^{(2)}_j$ may only affect $c$ through the new biases,

$$\frac{\partial c}{\partial b^{(2)}_j} = \sum_{t=1}^T \frac{\partial c}{\partial b^{(2)}_{t,j}} \frac{\partial b^{(2)}_{t,j}}{\partial b^{(2)}_j} = \sum_{t=1}^T \frac{\partial c}{\partial b^{(2)}_{t,j}}.$$

Because each bias $b^{(2)}_{t,j}$ only affects $c$ through $z^{(2)}_{t,j}$, we have

$$\frac{\partial c}{\partial b^{(2)}_{t,j}} = \frac{\partial c}{\partial z^{(2)}_{t,j}} \frac{\partial z^{(2)}_{t,j}}{\partial b^{(2)}_{t,j}} = \delta^{(2)}_{t,j},$$

as we wanted to show.

Consider the statement $\frac{\partial c}{\partial w^{(l)}_{jk}} = \sum_t \delta^{(l)}_{t,j} a^{(l-1)}_{t,k}$, which we prove independently for $l = 3$ and $l = 2$. Because $w^{(3)}_{jk}$ only affects $c$ through $z^{(3)}_{1,j}, \ldots, z^{(3)}_{T,j}$, the chain rule gives

$$\frac{\partial c}{\partial w^{(3)}_{jk}} = \sum_{t=1}^T \frac{\partial c}{\partial z^{(3)}_{t,j}} \frac{\partial z^{(3)}_{t,j}}{\partial w^{(3)}_{jk}} = \sum_{t=1}^T \delta^{(3)}_{t,j} a^{(2)}_{t,k},$$

which completes the proof for $l = 3$. Once again, $w^{(2)}_{jk}$ only affects $c$ through $z^{(2)}_{1,j}, \ldots, z^{(2)}_{T,j}$, but $z^{(2)}_{t,j}$ may affect $z^{(2)}_{t+1,i}$ for any $i$ and $t < T$, and so the chain rule does not apply analogously to the previous case. As an artifice, we introduce independent weights $w^{(2)}_{t,jk} = w^{(2)}_{jk}$ for each time step $t$, which gives

$$\frac{\partial c}{\partial w^{(2)}_{jk}} = \sum_{t=1}^T \frac{\partial c}{\partial w^{(2)}_{t,jk}}.$$

Because $w^{(2)}_{t,jk}$ only affects $c$ through $z^{(2)}_{t,j}$,

$$\frac{\partial c}{\partial w^{(2)}_{t,jk}} = \frac{\partial c}{\partial z^{(2)}_{t,j}} \frac{\partial z^{(2)}_{t,j}}{\partial w^{(2)}_{t,jk}} = \delta^{(2)}_{t,j} a^{(1)}_{t,k},$$

which concludes the proof for $l = 2$.

Consider the statement $\frac{\partial c}{\partial \omega_{jk}} = \sum_t \delta^{(2)}_{t,j} a^{(2)}_{t-1,k}$.

This statement also requires the artifice of introducing independent recurrent weights $\omega_{t,jk} = \omega_{jk}$ for each time step $t$. The chain rule gives

$$\frac{\partial c}{\partial \omega_{jk}} = \sum_{t=1}^T \frac{\partial c}{\partial \omega_{t,jk}} \frac{\partial \omega_{t,jk}}{\partial \omega_{jk}} = \sum_{t=1}^T \frac{\partial c}{\partial \omega_{t,jk}}.$$

Because $\omega_{t,jk}$ refers to the weight for time step $t$, it only affects $c$ through $z^{(2)}_{t,j}$. The chain rule then gives

$$\frac{\partial c}{\partial \omega_{t,jk}} = \frac{\partial c}{\partial z^{(2)}_{t,j}} \frac{\partial z^{(2)}_{t,j}}{\partial \omega_{t,jk}} = \delta^{(2)}_{t,j} a^{(2)}_{t-1,k},$$

as we wanted to show. The three last backpropagation through time statements follow easily from the definition of the cost $C$ as the average of the cost $c$ over every pair $(X, Y) \in D$.

This completes the definition of the partial derivatives of the cost function $C$ with respect to every parameter in a single hidden layer recurrent neural network. Stochastic gradient descent can be employed to minimize $C$, with the usual lack of guarantees. In practice, online gradient descent (mini-batches of size one) is commonly used. The improvements suggested in previous sections, such as momentum-based gradient descent and regularization, can also be employed. Finding the partial derivatives of the cost with respect to the parameters becomes very involved for more complicated network architectures. In those cases, it is highly advisable to implement recurrent neural networks using automatic differentiation algorithms.

7 Long short-term memory networks

Long short-term memory networks compose an important class of recurrent neural networks. These networks were developed to deal with exploding and vanishing gradients, which may compromise learning in the recurrent networks presented in the previous section. Long short-term memory networks have been shown to be very effective in practice.

Consider the sequence element classification data set $D = \{(X_i, Y_i) \mid i \in \{1, \ldots, n\}, X_i \in (\mathbb{R}^m)^+, Y_i \in (\mathbb{R}^d)^+\}$. Furthermore, let $|X| = |Y|$ for every $(X, Y) \in D$. We will present long short-term memory networks with a single hidden layer composed of single-celled memory blocks. Many definitions will be analogous to those presented for single hidden layer recurrent neural networks, and will be omitted.

Let $n^{(1)} = m$ and $n^{(3)} = d$, and let $n^{(2)}$ denote the number of single-celled memory blocks in the hidden layer. The input activation of the network at time $t$ for $(X, Y) \in D$ is defined as $a^{(1)}_t = X_t$. Similarly to a neuron in the hidden layer of a typical recurrent neural network, memory block $j$ also receives the vectors $a^{(1)}_t$ and $a^{(2)}_{t-1}$ at time step $t$, and outputs a scalar $a^{(2)}_{t,j}$. Long short-term memory networks can be unfolded in time in an analogous manner to typical recurrent networks. However, the computations performed in a memory block are considerably more involved than those in a recurrent artificial neuron.

A single-celled memory block $j$ is composed of four modules: a cell, an input gate $I$, a forget gate $F$, and an output gate $O$. The weighted input $z_{t,j}$ to the cell in memory block $j$ is defined as

$$z_{t,j} = b_j + \sum_k w_{jk} a^{(1)}_{t,k} + \sum_k \omega_{jk} a^{(2)}_{t-1,k},$$

where $a^{(2)}_0$ may be zero. This is the analogue of the weighted input for neuron $j$ in the hidden layer of a typical recurrent network. The activation $s_{t,j}$ of the cell in memory block $j$ is defined as

$$s_{t,j} = a^F_{t,j} s_{t-1,j} + a^I_{t,j} g(z_{t,j}),$$

where $s_{0,j}$ may be zero and $g$ is a differentiable activation function. The terms $a^F_{t,j}$ and $a^I_{t,j}$ correspond to the activations of the forget and input gates, respectively, and will be defined shortly.

Because each of these two scalars is usually between $0$ and $1$, they control how much the previous activation of the cell and the current weighted input to the cell affect its current activation.

The weighted input $z^G_{t,j}$ of a gate $G \in \{I, F, O\}$ in memory block $j$ is defined as

$$z^G_{t,j} = b^G_j + \psi^G_j s_{t-1,j} + \sum_k w^G_{jk} a^{(1)}_{t,k} + \sum_k \omega^G_{jk} a^{(2)}_{t-1,k},$$

where $\psi^G_j \in \mathbb{R}$ is the so-called peephole weight to the cell activation at the previous time step. The activation $a^G_{t,j}$ of a gate $G$ in memory block $j$ is defined as $a^G_{t,j} = f(z^G_{t,j})$, where $f$ is a differentiable activation function, typically the sigmoid function. Notice that each gate $G$ in memory block $j$ has its own parameters and behaves similarly to a typical recurrent neuron.

Finally, the output activation $a^{(2)}_{t,j}$ of memory block $j$ is defined as

$$a^{(2)}_{t,j} = a^O_{t,j} \, h(s_{t,j}),$$

where $h$ is a differentiable activation function. Intuitively, the activation of the output gate controls how much the current activation of the cell affects the output of the memory block.

Intuitively, a memory block can be interpreted as a parameterized circuit. By training the network, a memory block may learn when to store, output and erase its memory (cell activation), given the current input activation of the network and the previous activations of the memory blocks. Each memory block has $4(n^{(1)} + n^{(2)}) + 7$ parameters (weights and biases).

The weighted input to neuron $j$ in layer $3$ is defined as $z^{(3)}_{t,j} = b^{(3)}_j + \sum_k w^{(3)}_{jk} a^{(2)}_{t,k}$, as in a feedforward neural network. The activation of neuron $j$ in layer $3$ is $a^{(3)}_{t,j} = \sigma(z^{(3)}_{t,j})$. The output of the network $f$ on input $X = a^{(1)}_1, \ldots, a^{(1)}_T$ is the sequence $a^{(3)}_1, \ldots, a^{(3)}_T$. This completes the definition of a long short-term memory network with a single hidden layer of single-celled memory blocks.

The intuition behind long short-term memory networks is very similar to that for typical recurrent neural networks. The sequence $X$ is presented to the network element by element. An ideal long short-term memory network would be capable of representing a sequence $X_{1:t}$ by the activations of its memory blocks $a^{(2)}_t$ and cells $s_t$, to allow correct classification of $X_t$.

As usual, learning consists of finding parameters that minimize a cost function $C = \frac{1}{n} \sum_{(X,Y) \in D} c$. For concreteness, consider the case of a cross-entropy cost function $c$ for sequence element classification, defined as

$$c = -\frac{1}{T} \sum_{t=1}^T \sum_{i=1}^d \left[ Y_{t,i} \ln a^{(3)}_{t,i} + (1 - Y_{t,i}) \ln (1 - a^{(3)}_{t,i}) \right],$$

where $X = a^{(1)}_1, \ldots, a^{(1)}_T$ is the input sequence and $a^{(3)}_1, \ldots, a^{(3)}_T$ is the output sequence. This cost function requires that every $Y_t$ be a standard basis vector.
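A minimal NumPy sketch of the forward pass of a hidden layer of single-celled memory blocks as defined above; here $f$ is the sigmoid, and $g$ and $h$ are taken to be $\tanh$ (an assumption, since the text leaves them generic). The parameter names and sizes are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(X, params, g=np.tanh, h=np.tanh):
    """Forward pass of a hidden layer of single-celled memory blocks.
    X: sequence of shape (T, m).  params holds, for the cell input, 'W',
    'Omega', 'b', and for each gate G in {'I', 'F', 'O'}, 'W_G', 'Omega_G',
    'b_G' and the peephole weights 'psi_G'.  Returns the block outputs (T, n2)."""
    T = X.shape[0]
    n2 = params["b"].shape[0]
    a2, s = np.zeros(n2), np.zeros(n2)       # a_0^(2) and s_0 taken as zero
    outputs = []
    for t in range(T):
        gates = {}
        for G in ("I", "F", "O"):            # gate activations, with peepholes to s_{t-1}
            zG = (params["b_" + G] + params["psi_" + G] * s
                  + params["W_" + G] @ X[t] + params["Omega_" + G] @ a2)
            gates[G] = sigmoid(zG)
        z = params["b"] + params["W"] @ X[t] + params["Omega"] @ a2
        s = gates["F"] * s + gates["I"] * g(z)     # cell activation s_t
        a2 = gates["O"] * h(s)                     # block output a_t^(2)
        outputs.append(a2)
    return np.array(outputs)

rng = np.random.default_rng(0)
m, n2, T = 3, 4, 5
params = {"W": rng.normal(0, 0.5, (n2, m)), "Omega": rng.normal(0, 0.5, (n2, n2)),
          "b": np.zeros(n2)}
for G in ("I", "F", "O"):
    params["W_" + G] = rng.normal(0, 0.5, (n2, m))
    params["Omega_" + G] = rng.normal(0, 0.5, (n2, n2))
    params["b_" + G] = np.zeros(n2)
    params["psi_" + G] = rng.normal(0, 0.5, n2)
print(lstm_forward(rng.normal(size=(T, m)), params).shape)   # (5, 4)
```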

We will now present the backpropagation through time statements that allow the computation of the partial derivatives of the cost function with respect to every parameter in a long short-term memory network. As usual, these partial derivatives can be used for gradient descent. We start by defining auxiliary variables. The error $\delta^{(l)}_{t,j}$ of neuron or block $j$ in layer $l$ at time step $t$ is defined as $\delta^{(l)}_{t,j} = \frac{\partial c}{\partial z^{(l)}_{t,j}}$; for a memory block, we write $\delta_{t,j} = \frac{\partial c}{\partial z_{t,j}}$ for the error of its cell input. The error $\delta^G_{t,j}$ of gate $G$ in layer $2$ at time step $t$ is defined as $\delta^G_{t,j} = \frac{\partial c}{\partial z^G_{t,j}}$. The activation error $\varepsilon^a_{t,j}$ of memory block $j$ in layer $2$ at time step $t$ is defined as $\varepsilon^a_{t,j} = \frac{\partial c}{\partial a^{(2)}_{t,j}}$. The cell activation error $\varepsilon^s_{t,j}$ of memory block $j$ in layer $2$ at time step $t$ is defined as $\varepsilon^s_{t,j} = \frac{\partial c}{\partial s_{t,j}}$.

The following statements summarize error computation by backpropagation through time in long short-term memory networks:

$$\delta^{(3)}_{t,j} = \frac{\partial c}{\partial a^{(3)}_{t,j}} \sigma'(z^{(3)}_{t,j}), \tag{1}$$
$$\delta^O_{t,j} = f'(z^O_{t,j}) \, h(s_{t,j}) \, \varepsilon^a_{t,j}, \tag{2}$$
$$\delta_{t,j} = a^I_{t,j} \, g'(z_{t,j}) \, \varepsilon^s_{t,j}, \tag{3}$$
$$\delta^F_{t,j} = f'(z^F_{t,j}) \, s_{t-1,j} \, \varepsilon^s_{t,j}, \tag{4}$$
$$\delta^I_{t,j} = f'(z^I_{t,j}) \, g(z_{t,j}) \, \varepsilon^s_{t,j}, \tag{5}$$
$$\varepsilon^s_{t,j} = a^O_{t,j} \, h'(s_{t,j}) \, \varepsilon^a_{t,j} + a^F_{t+1,j} \, \varepsilon^s_{t+1,j} + \sum_{G \in \{I,F,O\}} \psi^G_j \, \delta^G_{t+1,j}, \tag{6}$$
$$\varepsilon^a_{t,j} = \sum_k w^{(3)}_{kj} \delta^{(3)}_{t,k} + \sum_k \omega_{kj} \delta_{t+1,k} + \sum_{G \in \{I,F,O\}} \sum_k \omega^G_{kj} \delta^G_{t+1,k}, \tag{7}$$

where $\delta_{T+1,j} = \delta^G_{T+1,j} = \varepsilon^s_{T+1,j} = 0$ and $G$ is either $I$, $F$, or $O$. Once again, every quantity can be computed easily by starting from $\delta^{(3)}_{T,j}$, which can only be computed after the network has observed the entire sequence.

These errors can be used to compute the partial derivatives of the cost $C$ with respect to every parameter in the network using the following statements:

$$\frac{\partial c}{\partial b^G_j} = \sum_{t=1}^T \delta^G_{t,j}, \tag{8}$$
$$\frac{\partial c}{\partial w^G_{jk}} = \sum_{t=1}^T \delta^G_{t,j} a^{(1)}_{t,k}, \tag{9}$$
$$\frac{\partial c}{\partial \omega^G_{jk}} = \sum_{t=1}^T \delta^G_{t,j} a^{(2)}_{t-1,k}, \tag{10}$$
$$\frac{\partial c}{\partial \psi^G_j} = \sum_{t=1}^T \delta^G_{t,j} s_{t-1,j}, \tag{11}$$
$$\frac{\partial c}{\partial b^{(l)}_j} = \sum_{t=1}^T \delta^{(l)}_{t,j}, \tag{12}$$
$$\frac{\partial c}{\partial w^{(l)}_{jk}} = \sum_{t=1}^T \delta^{(l)}_{t,j} a^{(l-1)}_{t,k}, \tag{13}$$
$$\frac{\partial c}{\partial \omega_{jk}} = \sum_{t=1}^T \delta_{t,j} a^{(2)}_{t-1,k}, \tag{14}$$
$$\frac{\partial C}{\partial p} = \frac{1}{n} \sum_{(X,Y) \in D} \frac{\partial c}{\partial p}, \tag{15}$$

where $p$ is any parameter (weight or bias) in the network, $G$ is either $I$, $F$, or $O$, and in statements (12) and (13) the layer index $l$ refers either to the output layer ($l = 3$) or to the cell parameters of a memory block ($l = 2$, with $\delta^{(2)}_{t,j} = \delta_{t,j}$).

We now present proofs of the aforementioned statements for backpropagation through time in long short-term memory networks.

Consider the statement $\delta^{(3)}_{t,j} = \frac{\partial c}{\partial a^{(3)}_{t,j}} \sigma'(z^{(3)}_{t,j})$. For concreteness, let $c$ be the cross-entropy cost function for sequence element classification, and consider a sigmoid output layer. Because $c$ is differentiable with respect to $a^{(3)}_{t,j}$, and $z^{(3)}_{t,j}$ only affects $c$ through $a^{(3)}_{t,j}$,

$$\delta^{(3)}_{t,j} = \frac{\partial c}{\partial z^{(3)}_{t,j}} = \frac{\partial c}{\partial a^{(3)}_{t,j}} \frac{\partial a^{(3)}_{t,j}}{\partial z^{(3)}_{t,j}} = \frac{\partial c}{\partial a^{(3)}_{t,j}} \sigma'(z^{(3)}_{t,j}),$$

as we wanted to show. This proof is identical to the one presented in the previous section. A sigmoid output layer also gives $\delta^{(3)}_{t,j} = \frac{a^{(3)}_{t,j} - Y_{t,j}}{T}$, which was also proved in the previous section. A negative log-likelihood cost function $c$ for sequence element classification with a softmax output layer also results in $\delta^{(3)}_{t,j} = \frac{a^{(3)}_{t,j} - Y_{t,j}}{T}$, although the statement $\delta^{(3)}_{t,j} = \frac{\partial c}{\partial a^{(3)}_{t,j}} \sigma'(z^{(3)}_{t,j})$ is no longer true. The proof is straightforward given the analysis of the negative log-likelihood cost function presented in a previous section.

A dependency graph may be useful to understand the relationships between the variables on which the next proofs rely.

Consider the statement $\delta^O_{t,j} = f'(z^O_{t,j}) h(s_{t,j}) \varepsilon^a_{t,j}$. Because $z^O_{t,j}$ only affects $c$ through $a^{(2)}_{t,j}$,

$$\delta^O_{t,j} = \frac{\partial c}{\partial a^{(2)}_{t,j}} \frac{\partial a^{(2)}_{t,j}}{\partial z^O_{t,j}} = \varepsilon^a_{t,j} \frac{\partial}{\partial z^O_{t,j}} \left[ a^O_{t,j} h(s_{t,j}) \right].$$

Because $s_{t,j}$ is not affected by $z^O_{t,j}$,

$$\delta^O_{t,j} = \varepsilon^a_{t,j} \, h(s_{t,j}) \, f'(z^O_{t,j}),$$

as we wanted to show.

Consider the statement $\delta_{t,j} = a^I_{t,j} g'(z_{t,j}) \varepsilon^s_{t,j}$. Because $z_{t,j}$ only affects $c$ through $s_{t,j}$,

$$\delta_{t,j} = \frac{\partial c}{\partial s_{t,j}} \frac{\partial s_{t,j}}{\partial z_{t,j}} = \varepsilon^s_{t,j} \frac{\partial}{\partial z_{t,j}} \left[ a^F_{t,j} s_{t-1,j} + a^I_{t,j} g(z_{t,j}) \right].$$

Because only $g(z_{t,j})$ is affected by $z_{t,j}$ in the expression between brackets,

$$\delta_{t,j} = a^I_{t,j} \, g'(z_{t,j}) \, \varepsilon^s_{t,j},$$

as we wanted to show. Notice how this error is regulated by the activation of the input gate. Intuitively, if the input gate is closed (when its activation is close to zero), the weighted input parameters do not significantly affect the cost.

Consider the statement $\delta^F_{t,j} = f'(z^F_{t,j}) s_{t-1,j} \varepsilon^s_{t,j}$. Because $z^F_{t,j}$ only affects $c$ through $s_{t,j}$,

$$\delta^F_{t,j} = \varepsilon^s_{t,j} \frac{\partial}{\partial z^F_{t,j}} \left[ a^F_{t,j} s_{t-1,j} + a^I_{t,j} g(z_{t,j}) \right].$$

Because only $a^F_{t,j}$ is affected by $z^F_{t,j}$ in the expression between brackets,

$$\delta^F_{t,j} = \varepsilon^s_{t,j} \, f'(z^F_{t,j}) \, s_{t-1,j},$$

as we wanted to show.

Consider the statement $\delta^I_{t,j} = f'(z^I_{t,j}) g(z_{t,j}) \varepsilon^s_{t,j}$. Because $z^I_{t,j}$ only affects $c$ through $s_{t,j}$, and only $a^I_{t,j}$ is affected by $z^I_{t,j}$ in the expression that defines $s_{t,j}$,

$$\delta^I_{t,j} = \varepsilon^s_{t,j} \, f'(z^I_{t,j}) \, g(z_{t,j}),$$

as we wanted to show.


More information

232 Calculus and Structures

232 Calculus and Structures 3 Calculus and Structures CHAPTER 17 JUSTIFICATION OF THE AREA AND SLOPE METHODS FOR EVALUATING BEAMS Calculus and Structures 33 Copyrigt Capter 17 JUSTIFICATION OF THE AREA AND SLOPE METHODS 17.1 THE

More information

Differential Calculus (The basics) Prepared by Mr. C. Hull

Differential Calculus (The basics) Prepared by Mr. C. Hull Differential Calculus Te basics) A : Limits In tis work on limits, we will deal only wit functions i.e. tose relationsips in wic an input variable ) defines a unique output variable y). Wen we work wit

More information

lecture 26: Richardson extrapolation

lecture 26: Richardson extrapolation 43 lecture 26: Ricardson extrapolation 35 Ricardson extrapolation, Romberg integration Trougout numerical analysis, one encounters procedures tat apply some simple approximation (eg, linear interpolation)

More information

Chapter 5 FINITE DIFFERENCE METHOD (FDM)

Chapter 5 FINITE DIFFERENCE METHOD (FDM) MEE7 Computer Modeling Tecniques in Engineering Capter 5 FINITE DIFFERENCE METHOD (FDM) 5. Introduction to FDM Te finite difference tecniques are based upon approximations wic permit replacing differential

More information

Sin, Cos and All That

Sin, Cos and All That Sin, Cos and All Tat James K. Peterson Department of Biological Sciences and Department of Matematical Sciences Clemson University Marc 9, 2017 Outline Sin, Cos and all tat! A New Power Rule Derivatives

More information

How to Find the Derivative of a Function: Calculus 1

How to Find the Derivative of a Function: Calculus 1 Introduction How to Find te Derivative of a Function: Calculus 1 Calculus is not an easy matematics course Te fact tat you ave enrolled in suc a difficult subject indicates tat you are interested in te

More information

3.1 Extreme Values of a Function

3.1 Extreme Values of a Function .1 Etreme Values of a Function Section.1 Notes Page 1 One application of te derivative is finding minimum and maimum values off a grap. In precalculus we were only able to do tis wit quadratics by find

More information

Excursions in Computing Science: Week v Milli-micro-nano-..math Part II

Excursions in Computing Science: Week v Milli-micro-nano-..math Part II Excursions in Computing Science: Week v Milli-micro-nano-..mat Part II T. H. Merrett McGill University, Montreal, Canada June, 5 I. Prefatory Notes. Cube root of 8. Almost every calculator as a square-root

More information

Reading Group on Deep Learning Session 4 Unsupervised Neural Networks

Reading Group on Deep Learning Session 4 Unsupervised Neural Networks Reading Group on Deep Learning Session 4 Unsupervised Neural Networks Jakob Verbeek & Daan Wynen 206-09-22 Jakob Verbeek & Daan Wynen Unsupervised Neural Networks Outline Autoencoders Restricted) Boltzmann

More information

THE IDEA OF DIFFERENTIABILITY FOR FUNCTIONS OF SEVERAL VARIABLES Math 225

THE IDEA OF DIFFERENTIABILITY FOR FUNCTIONS OF SEVERAL VARIABLES Math 225 THE IDEA OF DIFFERENTIABILITY FOR FUNCTIONS OF SEVERAL VARIABLES Mat 225 As we ave seen, te definition of derivative for a Mat 111 function g : R R and for acurveγ : R E n are te same, except for interpretation:

More information

Deep Belief Network Training Improvement Using Elite Samples Minimizing Free Energy

Deep Belief Network Training Improvement Using Elite Samples Minimizing Free Energy Deep Belief Network Training Improvement Using Elite Samples Minimizing Free Energy Moammad Ali Keyvanrad a, Moammad Medi Homayounpour a a Laboratory for Intelligent Multimedia Processing (LIMP), Computer

More information

Click here to see an animation of the derivative

Click here to see an animation of the derivative Differentiation Massoud Malek Derivative Te concept of derivative is at te core of Calculus; It is a very powerful tool for understanding te beavior of matematical functions. It allows us to optimize functions,

More information

Introduction to Derivatives

Introduction to Derivatives Introduction to Derivatives 5-Minute Review: Instantaneous Rates and Tangent Slope Recall te analogy tat we developed earlier First we saw tat te secant slope of te line troug te two points (a, f (a))

More information

LIMITATIONS OF EULER S METHOD FOR NUMERICAL INTEGRATION

LIMITATIONS OF EULER S METHOD FOR NUMERICAL INTEGRATION LIMITATIONS OF EULER S METHOD FOR NUMERICAL INTEGRATION LAURA EVANS.. Introduction Not all differential equations can be explicitly solved for y. Tis can be problematic if we need to know te value of y

More information

Quantum Mechanics Chapter 1.5: An illustration using measurements of particle spin.

Quantum Mechanics Chapter 1.5: An illustration using measurements of particle spin. I Introduction. Quantum Mecanics Capter.5: An illustration using measurements of particle spin. Quantum mecanics is a teory of pysics tat as been very successful in explaining and predicting many pysical

More information

New Streamfunction Approach for Magnetohydrodynamics

New Streamfunction Approach for Magnetohydrodynamics New Streamfunction Approac for Magnetoydrodynamics Kab Seo Kang Brooaven National Laboratory, Computational Science Center, Building 63, Room, Upton NY 973, USA. sang@bnl.gov Summary. We apply te finite

More information

Flavius Guiaş. X(t + h) = X(t) + F (X(s)) ds.

Flavius Guiaş. X(t + h) = X(t) + F (X(s)) ds. Numerical solvers for large systems of ordinary differential equations based on te stocastic direct simulation metod improved by te and Runge Kutta principles Flavius Guiaş Abstract We present a numerical

More information

2.11 That s So Derivative

2.11 That s So Derivative 2.11 Tat s So Derivative Introduction to Differential Calculus Just as one defines instantaneous velocity in terms of average velocity, we now define te instantaneous rate of cange of a function at a point

More information

HOW TO DEAL WITH FFT SAMPLING INFLUENCES ON ADEV CALCULATIONS

HOW TO DEAL WITH FFT SAMPLING INFLUENCES ON ADEV CALCULATIONS HOW TO DEAL WITH FFT SAMPLING INFLUENCES ON ADEV CALCULATIONS Po-Ceng Cang National Standard Time & Frequency Lab., TL, Taiwan 1, Lane 551, Min-Tsu Road, Sec. 5, Yang-Mei, Taoyuan, Taiwan 36 Tel: 886 3

More information

. If lim. x 2 x 1. f(x+h) f(x)

. If lim. x 2 x 1. f(x+h) f(x) Review of Differential Calculus Wen te value of one variable y is uniquely determined by te value of anoter variable x, ten te relationsip between x and y is described by a function f tat assigns a value

More information

Order of Accuracy. ũ h u Ch p, (1)

Order of Accuracy. ũ h u Ch p, (1) Order of Accuracy 1 Terminology We consider a numerical approximation of an exact value u. Te approximation depends on a small parameter, wic can be for instance te grid size or time step in a numerical

More information

1 Calculus. 1.1 Gradients and the Derivative. Q f(x+h) f(x)

1 Calculus. 1.1 Gradients and the Derivative. Q f(x+h) f(x) Calculus. Gradients and te Derivative Q f(x+) δy P T δx R f(x) 0 x x+ Let P (x, f(x)) and Q(x+, f(x+)) denote two points on te curve of te function y = f(x) and let R denote te point of intersection of

More information

Quaternion Dynamics, Part 1 Functions, Derivatives, and Integrals. Gary D. Simpson. rev 01 Aug 08, 2016.

Quaternion Dynamics, Part 1 Functions, Derivatives, and Integrals. Gary D. Simpson. rev 01 Aug 08, 2016. Quaternion Dynamics, Part 1 Functions, Derivatives, and Integrals Gary D. Simpson gsim1887@aol.com rev 1 Aug 8, 216 Summary Definitions are presented for "quaternion functions" of a quaternion. Polynomial

More information

THE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA EXAMINATION MODULE 5

THE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA EXAMINATION MODULE 5 THE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA EXAMINATION NEW MODULAR SCHEME introduced from te examinations in 009 MODULE 5 SOLUTIONS FOR SPECIMEN PAPER B THE QUESTIONS ARE CONTAINED IN A SEPARATE FILE

More information

1. State whether the function is an exponential growth or exponential decay, and describe its end behaviour using limits.

1. State whether the function is an exponential growth or exponential decay, and describe its end behaviour using limits. Questions 1. State weter te function is an exponential growt or exponential decay, and describe its end beaviour using its. (a) f(x) = 3 2x (b) f(x) = 0.5 x (c) f(x) = e (d) f(x) = ( ) x 1 4 2. Matc te

More information

Convergence and Descent Properties for a Class of Multilevel Optimization Algorithms

Convergence and Descent Properties for a Class of Multilevel Optimization Algorithms Convergence and Descent Properties for a Class of Multilevel Optimization Algoritms Stepen G. Nas April 28, 2010 Abstract I present a multilevel optimization approac (termed MG/Opt) for te solution of

More information

Overdispersed Variational Autoencoders

Overdispersed Variational Autoencoders Overdispersed Variational Autoencoders Harsil Sa, David Barber and Aleksandar Botev Department of Computer Science, University College London Alan Turing Institute arsil.sa.15@ucl.ac.uk, david.barber@ucl.ac.uk,

More information

Polynomial Interpolation

Polynomial Interpolation Capter 4 Polynomial Interpolation In tis capter, we consider te important problem of approximatinga function fx, wose values at a set of distinct points x, x, x,, x n are known, by a polynomial P x suc

More information

Adaptive Neural Filters with Fixed Weights

Adaptive Neural Filters with Fixed Weights Adaptive Neural Filters wit Fixed Weigts James T. Lo and Justin Nave Department of Matematics and Statistics University of Maryland Baltimore County Baltimore, MD 150, U.S.A. e-mail: jameslo@umbc.edu Abstract

More information

Math 1241 Calculus Test 1

Math 1241 Calculus Test 1 February 4, 2004 Name Te first nine problems count 6 points eac and te final seven count as marked. Tere are 120 points available on tis test. Multiple coice section. Circle te correct coice(s). You do

More information

Journal of Computational and Applied Mathematics

Journal of Computational and Applied Mathematics Journal of Computational and Applied Matematics 94 (6) 75 96 Contents lists available at ScienceDirect Journal of Computational and Applied Matematics journal omepage: www.elsevier.com/locate/cam Smootness-Increasing

More information

1watt=1W=1kg m 2 /s 3

1watt=1W=1kg m 2 /s 3 Appendix A Matematics Appendix A.1 Units To measure a pysical quantity, you need a standard. Eac pysical quantity as certain units. A unit is just a standard we use to compare, e.g. a ruler. In tis laboratory

More information

Cubic Functions: Local Analysis

Cubic Functions: Local Analysis Cubic function cubing coefficient Capter 13 Cubic Functions: Local Analysis Input-Output Pairs, 378 Normalized Input-Output Rule, 380 Local I-O Rule Near, 382 Local Grap Near, 384 Types of Local Graps

More information

Financial Econometrics Prof. Massimo Guidolin

Financial Econometrics Prof. Massimo Guidolin CLEFIN A.A. 2010/2011 Financial Econometrics Prof. Massimo Guidolin A Quick Review of Basic Estimation Metods 1. Were te OLS World Ends... Consider two time series 1: = { 1 2 } and 1: = { 1 2 }. At tis

More information

Solution for the Homework 4

Solution for the Homework 4 Solution for te Homework 4 Problem 6.5: In tis section we computed te single-particle translational partition function, tr, by summing over all definite-energy wavefunctions. An alternative approac, owever,

More information

Spike train entropy-rate estimation using hierarchical Dirichlet process priors

Spike train entropy-rate estimation using hierarchical Dirichlet process priors publised in: Advances in Neural Information Processing Systems 26 (23), 276 284. Spike train entropy-rate estimation using ierarcical Diriclet process priors Karin Knudson Department of Matematics kknudson@mat.utexas.edu

More information

GRID CONVERGENCE ERROR ANALYSIS FOR MIXED-ORDER NUMERICAL SCHEMES

GRID CONVERGENCE ERROR ANALYSIS FOR MIXED-ORDER NUMERICAL SCHEMES GRID CONVERGENCE ERROR ANALYSIS FOR MIXED-ORDER NUMERICAL SCHEMES Cristoper J. Roy Sandia National Laboratories* P. O. Box 5800, MS 085 Albuquerque, NM 8785-085 AIAA Paper 00-606 Abstract New developments

More information

4. The slope of the line 2x 7y = 8 is (a) 2/7 (b) 7/2 (c) 2 (d) 2/7 (e) None of these.

4. The slope of the line 2x 7y = 8 is (a) 2/7 (b) 7/2 (c) 2 (d) 2/7 (e) None of these. Mat 11. Test Form N Fall 016 Name. Instructions. Te first eleven problems are wort points eac. Te last six problems are wort 5 points eac. For te last six problems, you must use relevant metods of algebra

More information

Continuous Stochastic Processes

Continuous Stochastic Processes Continuous Stocastic Processes Te term stocastic is often applied to penomena tat vary in time, wile te word random is reserved for penomena tat vary in space. Apart from tis distinction, te modelling

More information

Higher Derivatives. Differentiable Functions

Higher Derivatives. Differentiable Functions Calculus 1 Lia Vas Higer Derivatives. Differentiable Functions Te second derivative. Te derivative itself can be considered as a function. Te instantaneous rate of cange of tis function is te second derivative.

More information

A h u h = f h. 4.1 The CoarseGrid SystemandtheResidual Equation

A h u h = f h. 4.1 The CoarseGrid SystemandtheResidual Equation Capter Grid Transfer Remark. Contents of tis capter. Consider a grid wit grid size and te corresponding linear system of equations A u = f. Te summary given in Section 3. leads to te idea tat tere migt

More information

New Distribution Theory for the Estimation of Structural Break Point in Mean

New Distribution Theory for the Estimation of Structural Break Point in Mean New Distribution Teory for te Estimation of Structural Break Point in Mean Liang Jiang Singapore Management University Xiaou Wang Te Cinese University of Hong Kong Jun Yu Singapore Management University

More information

INTRODUCTION AND MATHEMATICAL CONCEPTS

INTRODUCTION AND MATHEMATICAL CONCEPTS INTODUCTION ND MTHEMTICL CONCEPTS PEVIEW Tis capter introduces you to te basic matematical tools for doing pysics. You will study units and converting between units, te trigonometric relationsips of sine,

More information

Chapter 4: Numerical Methods for Common Mathematical Problems

Chapter 4: Numerical Methods for Common Mathematical Problems 1 Capter 4: Numerical Metods for Common Matematical Problems Interpolation Problem: Suppose we ave data defined at a discrete set of points (x i, y i ), i = 0, 1,..., N. Often it is useful to ave a smoot

More information

1. Questions (a) through (e) refer to the graph of the function f given below. (A) 0 (B) 1 (C) 2 (D) 4 (E) does not exist

1. Questions (a) through (e) refer to the graph of the function f given below. (A) 0 (B) 1 (C) 2 (D) 4 (E) does not exist Mat 1120 Calculus Test 2. October 18, 2001 Your name Te multiple coice problems count 4 points eac. In te multiple coice section, circle te correct coice (or coices). You must sow your work on te oter

More information

Chapter 2 Limits and Continuity

Chapter 2 Limits and Continuity 4 Section. Capter Limits and Continuity Section. Rates of Cange and Limits (pp. 6) Quick Review.. f () ( ) () 4 0. f () 4( ) 4. f () sin sin 0 4. f (). 4 4 4 6. c c c 7. 8. c d d c d d c d c 9. 8 ( )(

More information

CHAPTER (A) When x = 2, y = 6, so f( 2) = 6. (B) When y = 4, x can equal 6, 2, or 4.

CHAPTER (A) When x = 2, y = 6, so f( 2) = 6. (B) When y = 4, x can equal 6, 2, or 4. SECTION 3-1 101 CHAPTER 3 Section 3-1 1. No. A correspondence between two sets is a function only if eactly one element of te second set corresponds to eac element of te first set. 3. Te domain of a function

More information

1. Which one of the following expressions is not equal to all the others? 1 C. 1 D. 25x. 2. Simplify this expression as much as possible.

1. Which one of the following expressions is not equal to all the others? 1 C. 1 D. 25x. 2. Simplify this expression as much as possible. 004 Algebra Pretest answers and scoring Part A. Multiple coice questions. Directions: Circle te letter ( A, B, C, D, or E ) net to te correct answer. points eac, no partial credit. Wic one of te following

More information

Gradient Descent etc.

Gradient Descent etc. 1 Gradient Descent etc EE 13: Networked estimation and control Prof Kan) I DERIVATIVE Consider f : R R x fx) Te derivative is defined as d fx) = lim dx fx + ) fx) Te cain rule states tat if d d f gx) )

More information

= 0 and states ''hence there is a stationary point'' All aspects of the proof dx must be correct (c)

= 0 and states ''hence there is a stationary point'' All aspects of the proof dx must be correct (c) Paper 1: Pure Matematics 1 Mark Sceme 1(a) (i) (ii) d d y 3 1x 4x x M1 A1 d y dx 1.1b 1.1b 36x 48x A1ft 1.1b Substitutes x = into teir dx (3) 3 1 4 Sows d y 0 and states ''ence tere is a stationary point''

More information

A First-Order System Approach for Diffusion Equation. I. Second-Order Residual-Distribution Schemes

A First-Order System Approach for Diffusion Equation. I. Second-Order Residual-Distribution Schemes A First-Order System Approac for Diffusion Equation. I. Second-Order Residual-Distribution Scemes Hiroaki Nisikawa W. M. Keck Foundation Laboratory for Computational Fluid Dynamics, Department of Aerospace

More information

Name: Answer Key No calculators. Show your work! 1. (21 points) All answers should either be,, a (finite) real number, or DNE ( does not exist ).

Name: Answer Key No calculators. Show your work! 1. (21 points) All answers should either be,, a (finite) real number, or DNE ( does not exist ). Mat - Final Exam August 3 rd, Name: Answer Key No calculators. Sow your work!. points) All answers sould eiter be,, a finite) real number, or DNE does not exist ). a) Use te grap of te function to evaluate

More information

Polynomial Interpolation

Polynomial Interpolation Capter 4 Polynomial Interpolation In tis capter, we consider te important problem of approximating a function f(x, wose values at a set of distinct points x, x, x 2,,x n are known, by a polynomial P (x

More information

RECOGNITION of online handwriting aims at finding the

RECOGNITION of online handwriting aims at finding the SUBMITTED ON SEPTEMBER 2017 1 A General Framework for te Recognition of Online Handwritten Grapics Frank Julca-Aguilar, Harold Moucère, Cristian Viard-Gaudin, and Nina S. T. Hirata arxiv:1709.06389v1 [cs.cv]

More information

Material for Difference Quotient

Material for Difference Quotient Material for Difference Quotient Prepared by Stepanie Quintal, graduate student and Marvin Stick, professor Dept. of Matematical Sciences, UMass Lowell Summer 05 Preface Te following difference quotient

More information

Solve exponential equations in one variable using a variety of strategies. LEARN ABOUT the Math. What is the half-life of radon?

Solve exponential equations in one variable using a variety of strategies. LEARN ABOUT the Math. What is the half-life of radon? 8.5 Solving Exponential Equations GOAL Solve exponential equations in one variable using a variety of strategies. LEARN ABOUT te Mat All radioactive substances decrease in mass over time. Jamie works in

More information

WYSE Academic Challenge 2004 Sectional Mathematics Solution Set

WYSE Academic Challenge 2004 Sectional Mathematics Solution Set WYSE Academic Callenge 00 Sectional Matematics Solution Set. Answer: B. Since te equation can be written in te form x + y, we ave a major 5 semi-axis of lengt 5 and minor semi-axis of lengt. Tis means

More information

DIGRAPHS FROM POWERS MODULO p

DIGRAPHS FROM POWERS MODULO p DIGRAPHS FROM POWERS MODULO p Caroline Luceta Box 111 GCC, 100 Campus Drive, Grove City PA 1617 USA Eli Miller PO Box 410, Sumneytown, PA 18084 USA Clifford Reiter Department of Matematics, Lafayette College,

More information

THE hidden Markov model (HMM)-based parametric

THE hidden Markov model (HMM)-based parametric JOURNAL OF L A TEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 1 Modeling Spectral Envelopes Using Restricted Boltzmann Macines and Deep Belief Networks for Statistical Parametric Speec Syntesis Zen-Hua Ling,

More information

INTRODUCTION AND MATHEMATICAL CONCEPTS

INTRODUCTION AND MATHEMATICAL CONCEPTS Capter 1 INTRODUCTION ND MTHEMTICL CONCEPTS PREVIEW Tis capter introduces you to te basic matematical tools for doing pysics. You will study units and converting between units, te trigonometric relationsips

More information

1 The concept of limits (p.217 p.229, p.242 p.249, p.255 p.256) 1.1 Limits Consider the function determined by the formula 3. x since at this point

1 The concept of limits (p.217 p.229, p.242 p.249, p.255 p.256) 1.1 Limits Consider the function determined by the formula 3. x since at this point MA00 Capter 6 Calculus and Basic Linear Algebra I Limits, Continuity and Differentiability Te concept of its (p.7 p.9, p.4 p.49, p.55 p.56). Limits Consider te function determined by te formula f Note

More information

Exercises for numerical differentiation. Øyvind Ryan

Exercises for numerical differentiation. Øyvind Ryan Exercises for numerical differentiation Øyvind Ryan February 25, 2013 1. Mark eac of te following statements as true or false. a. Wen we use te approximation f (a) (f (a +) f (a))/ on a computer, we can

More information

Average Rate of Change

Average Rate of Change Te Derivative Tis can be tougt of as an attempt to draw a parallel (pysically and metaporically) between a line and a curve, applying te concept of slope to someting tat isn't actually straigt. Te slope

More information

HOMEWORK HELP 2 FOR MATH 151

HOMEWORK HELP 2 FOR MATH 151 HOMEWORK HELP 2 FOR MATH 151 Here we go; te second round of omework elp. If tere are oters you would like to see, let me know! 2.4, 43 and 44 At wat points are te functions f(x) and g(x) = xf(x)continuous,

More information

Impact of Lightning Strikes on National Airspace System (NAS) Outages

Impact of Lightning Strikes on National Airspace System (NAS) Outages Impact of Ligtning Strikes on National Airspace System (NAS) Outages A Statistical Approac Aurélien Vidal University of California at Berkeley NEXTOR Berkeley, CA, USA aurelien.vidal@berkeley.edu Jasenka

More information

The total error in numerical differentiation

The total error in numerical differentiation AMS 147 Computational Metods and Applications Lecture 08 Copyrigt by Hongyun Wang, UCSC Recap: Loss of accuracy due to numerical cancellation A B 3, 3 ~10 16 In calculating te difference between A and

More information

Te comparison of dierent models M i is based on teir relative probabilities, wic can be expressed, again using Bayes' teorem, in terms of prior probab

Te comparison of dierent models M i is based on teir relative probabilities, wic can be expressed, again using Bayes' teorem, in terms of prior probab To appear in: Advances in Neural Information Processing Systems 9, eds. M. C. Mozer, M. I. Jordan and T. Petsce. MIT Press, 997 Bayesian Model Comparison by Monte Carlo Caining David Barber D.Barber@aston.ac.uk

More information

11.6 DIRECTIONAL DERIVATIVES AND THE GRADIENT VECTOR

11.6 DIRECTIONAL DERIVATIVES AND THE GRADIENT VECTOR SECTION 11.6 DIRECTIONAL DERIVATIVES AND THE GRADIENT VECTOR 633 wit speed v o along te same line from te opposite direction toward te source, ten te frequenc of te sound eard b te observer is were c is

More information

158 Calculus and Structures

158 Calculus and Structures 58 Calculus and Structures CHAPTER PROPERTIES OF DERIVATIVES AND DIFFERENTIATION BY THE EASY WAY. Calculus and Structures 59 Copyrigt Capter PROPERTIES OF DERIVATIVES. INTRODUCTION In te last capter you

More information

Digital Filter Structures

Digital Filter Structures Digital Filter Structures Te convolution sum description of an LTI discrete-time system can, in principle, be used to implement te system For an IIR finite-dimensional system tis approac is not practical

More information

Derivatives. By: OpenStaxCollege

Derivatives. By: OpenStaxCollege By: OpenStaxCollege Te average teen in te United States opens a refrigerator door an estimated 25 times per day. Supposedly, tis average is up from 10 years ago wen te average teenager opened a refrigerator

More information

Lab 6 Derivatives and Mutant Bacteria

Lab 6 Derivatives and Mutant Bacteria Lab 6 Derivatives and Mutant Bacteria Date: September 27, 20 Assignment Due Date: October 4, 20 Goal: In tis lab you will furter explore te concept of a derivative using R. You will use your knowledge

More information

Teaching Differentiation: A Rare Case for the Problem of the Slope of the Tangent Line

Teaching Differentiation: A Rare Case for the Problem of the Slope of the Tangent Line Teacing Differentiation: A Rare Case for te Problem of te Slope of te Tangent Line arxiv:1805.00343v1 [mat.ho] 29 Apr 2018 Roman Kvasov Department of Matematics University of Puerto Rico at Aguadilla Aguadilla,

More information

Notes on wavefunctions II: momentum wavefunctions

Notes on wavefunctions II: momentum wavefunctions Notes on wavefunctions II: momentum wavefunctions and uncertainty Te state of a particle at any time is described by a wavefunction ψ(x). Tese wavefunction must cange wit time, since we know tat particles

More information