
CS294A Lecture notes
Andrew Ng

Sparse autoencoder

1 Introduction

Supervised learning is one of the most powerful tools of AI, and has led to automatic zip code recognition, speech recognition, self-driving cars, and a continually improving understanding of the human genome. Despite its significant successes, supervised learning today is still severely limited. Specifically, most applications of it still require that we manually specify the input features x given to the algorithm. Once a good feature representation is given, a supervised learning algorithm can do well. But in such domains as computer vision, audio processing, and natural language processing, there are now hundreds or perhaps thousands of researchers who have spent years of their lives slowly and laboriously hand-engineering vision, audio or text features. While much of this feature-engineering work is extremely clever, one has to wonder if we can do better. Certainly this labor-intensive hand-engineering approach does not scale well to new problems; further, ideally we'd like to have algorithms that can automatically learn even better feature representations than the hand-engineered ones.

These notes describe the sparse autoencoder learning algorithm, which is one approach to automatically learn features from unlabeled data. In some domains, such as computer vision, this approach is not by itself competitive with the best hand-engineered features, but the features it can learn do turn out to be useful for a range of problems (including ones in audio, text, etc). Further, there are more sophisticated versions of the sparse autoencoder (not described in these notes, but that you'll hear more about later in the class) that do surprisingly well, and in some cases are competitive with or sometimes even better than some of the hand-engineered representations.

This set of notes is organized as follows. We will first describe feedforward neural networks and the backpropagation algorithm for supervised learning. Then, we show how this is used to construct an autoencoder, which is an unsupervised learning algorithm, and finally how we can build on this to derive a sparse autoencoder. Because these notes are fairly notation-heavy, the last page also contains a summary of the symbols used.

2 Neural networks

Consider a supervised learning problem where we have access to labeled training examples (x^{(i)}, y^{(i)}). Neural networks give a way of defining a complex, non-linear form of hypotheses h_{W,b}(x), with parameters W, b that we can fit to our data.

To describe neural networks, we use the following diagram to denote a single "neuron":

[Figure: a single neuron with inputs x_1, x_2, x_3, a +1 intercept term, and output h_{w,b}(x)]

This "neuron" is a computational unit that takes as input x_1, x_2, x_3 (and a +1 intercept term), and outputs

    h_{w,b}(x) = f(w^T x + b) = f\left( \sum_{i=1}^{3} w_i x_i + b \right),

where f : R -> R is called the activation function. One possible choice for f(\cdot) is the sigmoid function f(z) = 1/(1 + \exp(-z)); in that case, our single neuron corresponds exactly to the input-output mapping defined by logistic regression. In these notes, however, we will use a different activation function, the hyperbolic tangent, or tanh, function:

    f(z) = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}.     (1)

Here is a plot of the tanh(z) function:

[Figure: plot of tanh(z)]
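To make this computation concrete, here is a minimal NumPy sketch of a single tanh neuron; the function name and all parameter values below are illustrative, not anything defined in these notes:

```python
import numpy as np

def neuron_output(w, b, x):
    """A single tanh neuron: f(w^T x + b)."""
    return np.tanh(np.dot(w, x) + b)

# Illustrative values for a neuron with three inputs and a +1 intercept term.
w = np.array([0.5, -0.3, 0.8])   # weights w_1, w_2, w_3
b = 0.1                          # intercept (bias) term
x = np.array([1.0, 2.0, -1.0])   # inputs x_1, x_2, x_3
print(neuron_output(w, b, x))    # a single real number in (-1, 1)
```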

The tanh(z) function is a rescaled version of the sigmoid, and its output range is [-1, 1] instead of [0, 1]. Our description of neural networks will use this activation function.

Note that unlike CS221 and (parts of) CS229, we are not using the convention here of x_0 = 1. Instead, the intercept term is handled separately by the parameter b.

Finally, one identity that will be useful later: if f(z) = \tanh(z), then its derivative is given by f'(z) = 1 - (f(z))^2. (Derive this yourself using the definition of tanh(z) given in Equation 1.)

2.1 Neural network formulation

A neural network is put together by hooking together many of our simple "neurons," so that the output of a neuron can be the input of another. For example, here is a small neural network:

[Figure: a network with three input units, one hidden layer of three units, a single output unit, and +1 bias units]

In this figure, we have used circles to also denote the inputs to the network. The circles labeled "+1" are called bias units, and correspond to the intercept term. The leftmost layer of the network is called the input layer, and the rightmost layer the output layer (which, in this example, has only one node). The middle layer of nodes is called the hidden layer, because its values are not observed in the training set. We also say that our example neural network has 3 input units (not counting the bias unit), 3 hidden units, and 1 output unit.

We will let n_l denote the number of layers in our network; thus n_l = 3 in our example. We label layer l as L_l, so layer L_1 is the input layer, and layer L_{n_l} the output layer. Our neural network has parameters (W, b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}), where we write W^{(l)}_{ij} to denote the parameter (or weight) associated with the connection between unit j in layer l, and unit i in layer l+1. (Note the order of the indices.) Also, b^{(l)}_i is the bias associated with unit i in layer l+1. Thus, in our example, we have W^{(1)} \in R^{3 \times 3}, and W^{(2)} \in R^{1 \times 3}. Note that bias units don't have inputs or connections going into them, since they always output the value +1. We also let s_l denote the number of nodes in layer l (not counting the bias unit).

We will write a^{(l)}_i to denote the activation (meaning output value) of unit i in layer l. For l = 1, we also use a^{(1)}_i = x_i to denote the i-th input. Given a fixed setting of the parameters W, b, our neural network defines a hypothesis h_{W,b}(x) that outputs a real number. Specifically, the computation that this neural network represents is given by:

    a^{(2)}_1 = f(W^{(1)}_{11} x_1 + W^{(1)}_{12} x_2 + W^{(1)}_{13} x_3 + b^{(1)}_1)     (2)
    a^{(2)}_2 = f(W^{(1)}_{21} x_1 + W^{(1)}_{22} x_2 + W^{(1)}_{23} x_3 + b^{(1)}_2)     (3)
    a^{(2)}_3 = f(W^{(1)}_{31} x_1 + W^{(1)}_{32} x_2 + W^{(1)}_{33} x_3 + b^{(1)}_3)     (4)
    h_{W,b}(x) = a^{(3)}_1 = f(W^{(2)}_{11} a^{(2)}_1 + W^{(2)}_{12} a^{(2)}_2 + W^{(2)}_{13} a^{(2)}_3 + b^{(2)}_1)     (5)

In the sequel, we also let z^{(l)}_i denote the total weighted sum of inputs to unit i in layer l, including the bias term (e.g., z^{(2)}_i = \sum_{j=1}^{n} W^{(1)}_{ij} x_j + b^{(1)}_i), so that a^{(l)}_i = f(z^{(l)}_i).
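As a sanity check, Equations (2)-(5) translate almost line by line into NumPy; the parameter values below are illustrative random numbers, not learned weights:

```python
import numpy as np

f = np.tanh                                   # activation function

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.01, size=(3, 3))      # W^(1): layer 1 -> layer 2
b1 = rng.normal(scale=0.01, size=3)           # b^(1)
W2 = rng.normal(scale=0.01, size=(1, 3))      # W^(2): layer 2 -> layer 3
b2 = rng.normal(scale=0.01, size=1)           # b^(2)

x = np.array([0.5, -1.0, 2.0])

# Equations (2)-(4): the three hidden-unit activations, one unit at a time.
a2 = np.array([f(W1[i, 0] * x[0] + W1[i, 1] * x[1] + W1[i, 2] * x[2] + b1[i])
               for i in range(3)])

# Equation (5): the single output unit, h_{W,b}(x) = a^(3)_1.
h = f(W2[0, 0] * a2[0] + W2[0, 1] * a2[1] + W2[0, 2] * a2[2] + b2[0])
print(h)
```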

Note that this easily lends itself to a more compact notation. Specifically, if we extend the activation function f(\cdot) to apply to vectors in an element-wise fashion (i.e., f([z_1, z_2, z_3]) = [\tanh(z_1), \tanh(z_2), \tanh(z_3)]), then we can write Equations (2)-(5) more compactly as:

    z^{(2)} = W^{(1)} x + b^{(1)}
    a^{(2)} = f(z^{(2)})
    z^{(3)} = W^{(2)} a^{(2)} + b^{(2)}
    h_{W,b}(x) = a^{(3)} = f(z^{(3)})

More generally, recalling that we also use a^{(1)} = x to denote the values from the input layer, then given layer l's activations a^{(l)}, we can compute layer l+1's activations a^{(l+1)} as:

    z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}     (6)
    a^{(l+1)} = f(z^{(l+1)})     (7)

By organizing our parameters in matrices and using matrix-vector operations, we can take advantage of fast linear algebra routines to quickly perform calculations in our network.

We have so far focused on one example neural network, but one can also build neural networks with other architectures (meaning patterns of connectivity between neurons), including ones with multiple hidden layers. The most common choice is an n_l-layered network where layer 1 is the input layer, layer n_l is the output layer, and each layer l is densely connected to layer l+1. In this setting, to compute the output of the network, we can successively compute all the activations in layer L_2, then layer L_3, and so on, up to layer L_{n_l}, using Equations (6)-(7). This is one example of a feedforward neural network, since the connectivity graph does not have any directed loops or cycles.
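The general feedforward computation in Equations (6)-(7) might be sketched in NumPy as follows; this is a minimal sketch, assuming the parameters are stored in Python lists Ws and bs where the zero-based entry Ws[l] holds what the notes write as W^(l+1):

```python
import numpy as np

def forward_pass(x, Ws, bs, f=np.tanh):
    """Forward propagation using Equations (6)-(7).

    Returns the activations [a^(1), ..., a^(n_l)] and the weighted inputs
    [z^(2), ..., z^(n_l)] so they can be reused later (e.g., by backpropagation).
    """
    a, zs = [x], []
    for W, b in zip(Ws, bs):
        z = W @ a[-1] + b            # z^(l+1) = W^(l) a^(l) + b^(l), Equation (6)
        zs.append(z)
        a.append(f(z))               # a^(l+1) = f(z^(l+1)), Equation (7)
    return a, zs

# Example: the 3-3-1 network from before, now computed in matrix-vector form.
rng = np.random.default_rng(0)
Ws = [rng.normal(scale=0.01, size=(3, 3)), rng.normal(scale=0.01, size=(1, 3))]
bs = [rng.normal(scale=0.01, size=3), rng.normal(scale=0.01, size=1)]
a, zs = forward_pass(np.array([0.5, -1.0, 2.0]), Ws, bs)
print(a[-1])   # h_{W,b}(x) = a^(3)
```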

Neural networks can also have multiple output units. For example, here is a network with two hidden layers (layers L_2 and L_3) and two output units in layer L_4:

[Figure: a four-layer network with 3 input units, two hidden layers of 3 units each, and 2 output units]

To train this network, we would need training examples (x^{(i)}, y^{(i)}) where y^{(i)} \in R^2. This sort of network is useful if there are multiple outputs that you're interested in predicting. (For example, in a medical diagnosis application, the vector x might give the input features of a patient, and the different outputs y_i's might indicate presence or absence of different diseases.)

2.2 Backpropagation algorithm

We will train our neural network using stochastic gradient descent. For much of CS221 and CS229, we considered a setting in which we have a fixed training set {(x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)})}, and we ran either batch or stochastic gradient descent on that fixed training set. In these notes, we will take an online learning view, in which we imagine that our algorithm has access to an unending sequence of training examples {(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), (x^{(3)}, y^{(3)}), ...}. In practice, if we have only a finite training set, then we can form such a sequence by repeatedly visiting our fixed training set, so that the examples in the sequence will repeat. But even in this case, the online learning view will make some of our algorithms easier to describe.

In this setting, stochastic gradient descent will proceed as follows:

For i = 1, 2, 3, ...
    Get next training example (x^{(i)}, y^{(i)}).
    Update
        W^{(l)}_{jk} := W^{(l)}_{jk} - \alpha \frac{\partial}{\partial W^{(l)}_{jk}} J(W, b; x^{(i)}, y^{(i)})
        b^{(l)}_j := b^{(l)}_j - \alpha \frac{\partial}{\partial b^{(l)}_j} J(W, b; x^{(i)}, y^{(i)})

Here, \alpha is the learning rate parameter, and J(W, b) = J(W, b; x, y) is a cost function defined with respect to a single training example. (When there is no risk of ambiguity, we drop the dependence of J on the training example x, y, and simply write J(W, b).) If the training examples are drawn IID from some training distribution D, we can think of this algorithm as trying to minimize

    E_{(x,y) \sim D}[J(W, b; x, y)].

Alternatively, if our sequence of examples is obtained by repeating some fixed, finite training set {(x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)})}, then this algorithm is standard stochastic gradient descent for minimizing

    \frac{1}{m} \sum_{i=1}^{m} J(W, b; x^{(i)}, y^{(i)}).
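As a sketch, the online SGD loop above might look like the following in NumPy; grad_fn stands in for a routine that returns the partial derivatives of J(W, b; x, y) (this is what backpropagation, described shortly, computes), and the value of alpha is illustrative:

```python
import numpy as np

def sgd_train(example_stream, Ws, bs, grad_fn, alpha=0.01):
    """Online stochastic gradient descent over a stream of (x, y) examples.

    grad_fn(Ws, bs, x, y) is assumed to return the partial derivatives of
    J(W, b; x, y) with respect to each W^(l) and b^(l); the backpropagation
    algorithm described below provides exactly this.
    """
    for x, y in example_stream:
        gWs, gbs = grad_fn(Ws, bs, x, y)
        for l in range(len(Ws)):
            Ws[l] -= alpha * gWs[l]   # W^(l) := W^(l) - alpha * dJ/dW^(l)
            bs[l] -= alpha * gbs[l]   # b^(l) := b^(l) - alpha * dJ/db^(l)
    return Ws, bs
```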

To train our neural network, we will use the cost function:

    J(W, b; x, y) = \frac{1}{2} \| h_{W,b}(x) - y \|^2 + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2

The first term is a sum-of-squares error term; the second is a regularization term (also called a weight decay term) that tends to decrease the magnitude of the weights, and helps prevent overfitting.[1] The weight decay parameter \lambda controls the relative importance of the two terms.

This cost function is often used both for classification and for regression problems. For classification, we let y = +1 or -1 represent the two class labels (recall that the tanh(z) activation function outputs values in [-1, 1], so we use +1/-1 valued outputs instead of 0/1). For regression problems, we first scale our outputs to ensure that they lie in the [-1, 1] range.

Our goal is to minimize E_{(x,y)}[J(W, b; x, y)] as a function of W and b. To train our neural network, we will initialize each parameter W^{(l)}_{ij} and each b^{(l)}_i to a small random value near zero (say, according to a N(0, \epsilon^2) distribution for some small \epsilon, say 0.01), and then apply stochastic gradient descent. Since J(W, b; x, y) is a non-convex function, gradient descent is susceptible to local optima; however, in practice gradient descent usually works fairly well. Also, in neural network training, stochastic gradient descent is almost always used rather than batch gradient descent.

Finally, note that it is important to initialize the parameters randomly, rather than to all 0's. If all the parameters start off at identical values, then all the hidden layer units will end up learning the same function of the input (more formally, W^{(1)}_{ij} will be the same for all values of i, so that a^{(2)}_1 = a^{(2)}_2 = ... for any input x). The random initialization serves the purpose of symmetry breaking.

[1] Usually weight decay is not applied to the bias terms b^{(l)}_i, as reflected in our definition for J(W, b; x, y). Applying weight decay to the bias units usually makes only a small difference to the final network, however. If you took CS229, you may also recognize weight decay as essentially a variant of the Bayesian regularization method you saw there, where we placed a Gaussian prior on the parameters and did MAP (instead of maximum likelihood) estimation.
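In code, the cost function and the random near-zero initialization just described might be sketched as follows; this reuses the forward_pass helper from the earlier sketch, and the names init_params, layer_sizes, eps, and seed are illustrative:

```python
import numpy as np

def cost(Ws, bs, x, y, lam):
    """J(W, b; x, y): half squared error plus the weight-decay term."""
    a, _ = forward_pass(x, Ws, bs)           # forward_pass from the earlier sketch
    squared_error = 0.5 * np.sum((a[-1] - y) ** 2)
    weight_decay = 0.5 * lam * sum(np.sum(W ** 2) for W in Ws)  # biases excluded
    return squared_error + weight_decay

def init_params(layer_sizes, eps=0.01, seed=0):
    """Draw each W^(l)_ij and b^(l)_i from N(0, eps^2) to break symmetry."""
    rng = np.random.default_rng(seed)
    Ws = [rng.normal(scale=eps, size=(n_out, n_in))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
    bs = [rng.normal(scale=eps, size=n_out) for n_out in layer_sizes[1:]]
    return Ws, bs
```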

We now describe the backpropagation algorithm, which gives an efficient way to compute the partial derivatives we need in order to perform stochastic gradient descent. The intuition behind the algorithm is as follows. Given a training example (x, y), we will first run a "forward pass" to compute all the activations throughout the network, including the output value of the hypothesis h_{W,b}(x). Then, for each node i in layer l, we would like to compute an "error term" \delta^{(l)}_i that measures how much that node was "responsible" for any errors in our output. For an output node, we can directly measure the difference between the network's activation and the true target value, and use that to define \delta^{(n_l)}_i (where layer n_l is the output layer). How about hidden units? For those, we will compute \delta^{(l)}_i based on a weighted average of the error terms of the nodes that use a^{(l)}_i as an input.

In detail, here is the backpropagation algorithm:

1. Perform a feedforward pass, computing the activations for layers L_2, L_3, and so on up to the output layer L_{n_l}.

2. For each output unit i in layer n_l (the output layer), set

       \delta^{(n_l)}_i = \frac{\partial}{\partial z^{(n_l)}_i} \frac{1}{2} \| y - h_{W,b}(x) \|^2 = -(y_i - a^{(n_l)}_i) \cdot f'(z^{(n_l)}_i)

3. For l = n_l - 1, n_l - 2, n_l - 3, ..., 2
   For each node i in layer l, set

       \delta^{(l)}_i = \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) f'(z^{(l)}_i)

4. Update each weight W^{(l)}_{ij} and b^{(l)}_i according to:

       W^{(l)}_{ij} := W^{(l)}_{ij} - \alpha \left( a^{(l)}_j \delta^{(l+1)}_i + \lambda W^{(l)}_{ij} \right)
       b^{(l)}_i := b^{(l)}_i - \alpha \delta^{(l+1)}_i.

Although we have not proved it here, it turns out that

    \frac{\partial}{\partial W^{(l)}_{ij}} J(W, b; x, y) = a^{(l)}_j \delta^{(l+1)}_i + \lambda W^{(l)}_{ij},
    \frac{\partial}{\partial b^{(l)}_i} J(W, b; x, y) = \delta^{(l+1)}_i.

Thus, this algorithm is exactly implementing stochastic gradient descent.

Finally, we can also re-write the algorithm using matrix-vectorial notation. We will use \odot to denote the element-wise product operator (denoted ".*" in Matlab or Octave, and also called the Hadamard product), so that if a = b \odot c, then a_i = b_i c_i. Similar to how we extended the definition of f(\cdot) to apply element-wise to vectors, we also do the same for f'(\cdot) (so that f'([z_1, z_2, z_3]) = [f'(z_1), f'(z_2), f'(z_3)]). The algorithm can then be written:

1. Perform a feedforward pass, computing the activations for layers L_2, L_3, up to the output layer L_{n_l}, using Equations (6)-(7).

2. For the output layer (layer n_l), set

       \delta^{(n_l)} = -(y - a^{(n_l)}) \odot f'(z^{(n_l)})

3. For l = n_l - 1, n_l - 2, n_l - 3, ..., 2
   Set

       \delta^{(l)} = \left( (W^{(l)})^T \delta^{(l+1)} \right) \odot f'(z^{(l)})

4. Update the parameters according to:

       W^{(l)} := W^{(l)} - \alpha \left( \delta^{(l+1)} (a^{(l)})^T + \lambda W^{(l)} \right)
       b^{(l)} := b^{(l)} - \alpha \delta^{(l+1)}.

Implementation note 1: In steps 2 and 3 above, we need to compute f'(z^{(l)}_i) for each value of i. Assuming f(z) is the tanh activation function, we would already have a^{(l)}_i stored away from the forward pass through the network. Thus, using the expression that we worked out earlier for f'(z), we can compute this as f'(z^{(l)}_i) = 1 - (a^{(l)}_i)^2.

Implementation note 2: Backpropagation is a notoriously difficult algorithm to debug and get right, especially since many subtly buggy implementations of it (for example, one that has an off-by-one error in the indices and thus only trains some of the layers of weights, or an implementation that omits the bias term) will manage to learn something that can look surprisingly reasonable, while performing less well than a correct implementation. Thus, even with a buggy implementation, it may not at all be apparent that anything is amiss. So, when implementing backpropagation, do read and re-read your code to check it carefully. Some people also numerically check their computation of the derivatives; if you know how to do this, it is worth considering too. (Feel free to ask us if you want to learn more about this.)
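A compact NumPy sketch of the matrix-vectorial form above, together with the kind of numerical check suggested in implementation note 2, might look like the following; it reuses forward_pass and cost from the earlier sketches, and the finite-difference step eps and the notion of "very small" are illustrative choices:

```python
import numpy as np

def backprop_grads(Ws, bs, x, y, lam):
    """Gradients of J(W, b; x, y) via the matrix-vectorial backpropagation above."""
    a, _ = forward_pass(x, Ws, bs)                  # step 1: feedforward pass
    n = len(Ws)
    delta = -(y - a[-1]) * (1 - a[-1] ** 2)         # step 2: output layer, using f'(z) = 1 - a^2
    gWs, gbs = [None] * n, [None] * n
    for l in range(n - 1, -1, -1):                  # steps 3-4, from the top layer down
        gWs[l] = np.outer(delta, a[l]) + lam * Ws[l]     # dJ/dW^(l) = delta^(l+1) (a^(l))^T + lam W^(l)
        gbs[l] = delta                                   # dJ/db^(l) = delta^(l+1)
        if l > 0:
            delta = (Ws[l].T @ delta) * (1 - a[l] ** 2)  # propagate the error term one layer back
    return gWs, gbs

def gradient_check(Ws, bs, x, y, lam, eps=1e-4):
    """Compare backprop gradients against centered finite differences of the cost."""
    gWs, _ = backprop_grads(Ws, bs, x, y, lam)
    worst = 0.0
    for l in range(len(Ws)):
        for idx in np.ndindex(Ws[l].shape):
            old = Ws[l][idx]
            Ws[l][idx] = old + eps
            J_plus = cost(Ws, bs, x, y, lam)
            Ws[l][idx] = old - eps
            J_minus = cost(Ws, bs, x, y, lam)
            Ws[l][idx] = old
            numeric = (J_plus - J_minus) / (2 * eps)
            worst = max(worst, abs(numeric - gWs[l][idx]))
    return worst   # should be very small for a correct implementation
```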

3 Autoencoders and sparsity

So far, we have described the application of neural networks to supervised learning, in which we have labeled training examples. Now suppose we have only a set of unlabeled training examples {x^{(1)}, x^{(2)}, x^{(3)}, ...}, where x^{(i)} \in R^n. An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation, setting the target values to be equal to the inputs. I.e., it uses y^{(i)} = x^{(i)}.

Here is an autoencoder:

[Figure: an autoencoder network whose output layer \hat{x} has the same number of units as the input layer x]

The autoencoder tries to learn a function h_{W,b}(x) \approx x. In other words, it is trying to learn an approximation to the identity function, so as to output \hat{x} that is similar to x. The identity function seems a particularly trivial function to be trying to learn; but by placing constraints on the network, such as by limiting the number of hidden units, we can discover interesting structure about the data. As a concrete example, suppose the inputs x are the pixel intensity values from a 10x10 image (100 pixels), so n = 100, and there are s_2 = 50 hidden units in layer L_2. Note that we also have y \in R^{100}. Since there are only 50 hidden units, the network is forced to learn a compressed representation of the input. I.e., given only the vector of hidden unit activations a^{(2)} \in R^{50}, it must try to reconstruct the 100-pixel input x.
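Using the earlier sketches, the 100-50-100 autoencoder just described could be set up along these lines; the random stand-in image is purely illustrative:

```python
import numpy as np

# 100 inputs (a 10x10 image), 50 hidden units, 100 outputs; reuses
# init_params and forward_pass from the earlier sketches.
Ws, bs = init_params([100, 50, 100])

x = np.random.default_rng(0).uniform(-1.0, 1.0, size=100)  # a stand-in "image"
y = x                                  # autoencoder target: y^(i) = x^(i)
a, _ = forward_pass(x, Ws, bs)
x_hat = a[-1]                          # the network's reconstruction of x
code = a[1]                            # a^(2) in R^50, the compressed representation
```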

If the input were completely random, say, each x_i comes from an IID Gaussian independent of the other features, then this compression task would be very difficult. But if there is structure in the data, for example, if some of the input features are correlated, then this algorithm will be able to discover some of those correlations.[2]

Our argument above relied on the number of hidden units s_2 being small. But even when the number of hidden units is large (perhaps even greater than the number of input pixels), we can still discover interesting structure, by imposing other constraints on the network. In particular, if we impose a sparsity constraint on the hidden units, then the autoencoder will still discover interesting structure in the data, even if the number of hidden units is large.

Informally, we will think of a neuron as being "active" (or as "firing") if its output value is close to 1, or as being "inactive" if its output value is close to -1. We would like to constrain the neurons to be inactive most of the time.[3] We will do this in an online learning fashion.

More formally, we again imagine that our algorithm has access to an unending sequence of training examples {x^{(1)}, x^{(2)}, x^{(3)}, ...} drawn IID from some distribution D. Also, let a^{(2)}_i as usual denote the activation of hidden unit i in the autoencoder. We would like to (approximately) enforce the constraint that

    E_{x \sim D}[a^{(2)}_i] = \rho,

where \rho is our sparsity parameter, typically a value slightly above -1.0 (say, \rho \approx -0.9). In other words, we would like the expected activation of each hidden neuron i to be close to -0.9 (say). To satisfy this expectation constraint, the hidden unit's activations must mostly be near -1.

Our algorithm for (approximately) enforcing the expectation constraint will have two major components: First, for each hidden unit i, we will keep a running estimate of E_{x \sim D}[a^{(2)}_i]. Second, after each iteration of stochastic gradient descent, we will slowly adjust that unit's parameters to make this expected value closer to \rho.

[2] In fact, this simple autoencoder often ends up learning a low-dimensional representation very similar to PCA's.

[3] The term "sparsity" comes from an alternative formulation of these ideas using networks with a sigmoid activation function f, so that the activations are between 0 and 1 (rather than -1 and 1). In this case, "sparsity" refers to most of the activations being near 0.

In each iteration of gradient descent, when we see each training input x we will compute the hidden units' activations a^{(2)}_i for each i. We will keep a running estimate \hat{\rho}_i of E_{x \sim D}[a^{(2)}_i] by updating:

    \hat{\rho}_i := 0.999 \hat{\rho}_i + 0.001 a^{(2)}_i.

(Or, in vector notation, \hat{\rho} := 0.999 \hat{\rho} + 0.001 a^{(2)}.) Here, the "0.999" (and "0.001") is a parameter of the algorithm, and there is a wide range of values that will work fine. This particular choice causes \hat{\rho}_i to be an exponentially-decayed weighted average of roughly the last 1000 observed values of a^{(2)}_i. Our running estimates \hat{\rho}_i can be initialized to 0 at the start of the algorithm.

The second part of the algorithm modifies the parameters so as to try to satisfy the expectation constraint. If \hat{\rho}_i > \rho, then we would like hidden unit i to become less active, or equivalently, for its activations to become closer to -1. Recall that unit i's activation is

    a^{(2)}_i = f\left( \sum_{j=1}^{n} W^{(1)}_{ij} x_j + b^{(1)}_i \right),     (8)

where b^{(1)}_i is the bias term. Thus, we can make unit i less active by decreasing b^{(1)}_i. Similarly, if \hat{\rho}_i < \rho, then we would like unit i's activations to become larger, which we can do by increasing b^{(1)}_i. Finally, the further \hat{\rho}_i is from \rho, the more aggressively we might want to decrease or increase b^{(1)}_i so as to drive the expectation towards \rho. Concretely, we can use the following learning rule:

    b^{(1)}_i := b^{(1)}_i - \alpha \beta (\hat{\rho}_i - \rho)     (9)

where \beta is an additional learning rate parameter.

To summarize, in order to learn a sparse autoencoder using online learning, upon getting an example x, we will (i) run a forward pass on our network on input x, to compute all units' activations; (ii) perform one step of stochastic gradient descent using backpropagation; (iii) perform the updates given in Equations (8)-(9).
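A sketch of one such online sparsity update, combining the running-estimate update with the learning rule in Equation (9); the default values of rho, beta, and alpha below are illustrative, not prescribed by the notes:

```python
import numpy as np

def sparsity_step(rho_hat, a2, b1, rho=-0.9, beta=0.1, alpha=0.01):
    """One online sparsity update for the hidden layer.

    rho_hat: running estimates of E[a^(2)_i]; a2: the hidden activations for
    the current example; b1: the hidden-layer biases b^(1).
    """
    rho_hat = 0.999 * rho_hat + 0.001 * a2        # running average of activations
    b1 = b1 - alpha * beta * (rho_hat - rho)      # Equation (9): drive E[a^(2)_i] toward rho
    return rho_hat, b1

# Example usage with the 50-hidden-unit autoencoder sketched earlier:
# rho_hat = np.zeros(50)
# rho_hat, bs[0] = sparsity_step(rho_hat, a[1], bs[0])
```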

4 Visualization

Having trained a (sparse) autoencoder, we would now like to visualize the function learned by the algorithm, to try to understand what it has learned. Consider the case of training an autoencoder on 10x10 images, so that n = 100. Each hidden unit i computes a function of the input:

    a^{(2)}_i = f\left( \sum_{j=1}^{100} W^{(1)}_{ij} x_j + b^{(1)}_i \right).

We will visualize the function computed by hidden unit i, which depends on the parameters W^{(1)}_{ij} (ignoring the bias term for now), using a 2D image. In particular, we think of a^{(2)}_i as some non-linear feature of the input x. We ask: What input image x would cause a^{(2)}_i to be maximally activated? For this question to have a non-trivial answer, we must impose some constraints on x. If we suppose that the input is norm constrained by \|x\|^2 = \sum_{i=1}^{100} x_i^2 \le 1, then one can show (try doing this yourself) that the input which maximally activates hidden unit i is given by setting pixel x_j (for all 100 pixels, j = 1, ..., 100) to

    x_j = \frac{W^{(1)}_{ij}}{\sqrt{\sum_{j=1}^{100} (W^{(1)}_{ij})^2}}.

By displaying the image formed by these pixel intensity values, we can begin to understand what feature hidden unit i is looking for.

If we have an autoencoder with 100 hidden units (say), then our visualization will have 100 such images, one per hidden unit. By examining these 100 images, we can try to understand what the ensemble of hidden units is learning.
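In code, these maximally activating inputs can be read directly off the rows of W^(1); the following is a minimal sketch, assuming the 100-hidden-unit, 10x10-input setup discussed here:

```python
import numpy as np

def max_activating_inputs(W1):
    """Norm-bounded input image that maximally activates each hidden unit.

    W1 is W^(1), with one row of 100 weights per hidden unit. Row i is
    rescaled to x_j = W^(1)_ij / sqrt(sum_j (W^(1)_ij)^2) and reshaped to 10x10.
    """
    X = W1 / np.linalg.norm(W1, axis=1, keepdims=True)
    return X.reshape(W1.shape[0], 10, 10)
```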

When we do this for a sparse autoencoder (trained with 100 hidden units on 10x10 pixel inputs[4]), we get the following result:

[Figure: a 10x10 grid of squares, each showing the norm-bounded input image that maximally activates one of the 100 hidden units]

Each square in the figure above shows the (norm bounded) input image x that maximally activates one of the 100 hidden units. We see that the different hidden units have learned to detect edges at different positions and orientations in the image. These features are, not surprisingly, useful for such tasks as object recognition and other vision tasks. When applied to other input domains (such as audio), this algorithm also learns useful representations/features for those domains.

[4] These results were obtained by training on whitened natural images. Whitening is a preprocessing step which removes redundancy in the input, by causing adjacent pixels to become less correlated.

5 Summary of notation

x : Input features for a training example, x \in R^n.
y : Output/target values. Here, y can be vector valued. In the case of an autoencoder, y = x.
(x^{(i)}, y^{(i)}) : The i-th training example.
h_{W,b}(x) : Output of our hypothesis on input x, using parameters W, b. This should be a vector of the same dimension as the target value y.
W^{(l)}_{ij} : The parameter associated with the connection between unit j in layer l, and unit i in layer l+1.
b^{(l)}_i : The bias term associated with unit i in layer l+1. Can also be thought of as the parameter associated with the connection between the bias unit in layer l and unit i in layer l+1.
a^{(l)}_i : Activation (output) of unit i in layer l of the network. In addition, since layer L_1 is the input layer, we also have a^{(1)}_i = x_i.
f(\cdot) : The activation function. Throughout these notes, we used f(z) = \tanh(z).
z^{(l)}_i : Total weighted sum of inputs to unit i in layer l. Thus, a^{(l)}_i = f(z^{(l)}_i).
\alpha : Learning rate parameter.
s_l : Number of units in layer l (not counting the bias unit).
n_l : Number of layers in the network. Layer L_1 is usually the input layer, and layer L_{n_l} the output layer.
\lambda : Weight decay parameter.
\hat{x} : For an autoencoder, its output; i.e., its reconstruction of the input x. Same meaning as h_{W,b}(x).
\rho : Sparsity parameter, which specifies our desired level of sparsity.
\hat{\rho}_i : Our running estimate of the expected activation of unit i (in the sparse autoencoder).
\beta : Learning rate parameter for the algorithm trying to (approximately) satisfy the sparsity constraint.