Lecture 16: Backpropagation Algorithm. Neural Networks with smooth activation functions

CO-511: Learning Theory                                                      Spring 2017
Lecture 16: Backpropagation Algorithm
Lecturer: Roi Livni

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

So far we have discussed convex learning problems. Convex learning problems are of particular interest mainly because they come with strong theoretical guarantees; for example, we can apply the SGD algorithm to obtain desirable learning rates. As it turns out, even though non-convex problems pose formidable challenges in theory, they often capture many interesting problems in practice. In this lecture we discuss the task of training neural networks using the Stochastic Gradient Descent algorithm. Even though we cannot guarantee that this algorithm converges to an optimum, it often produces state-of-the-art results and has become a benchmark algorithm for machine learning.

16.1 Neural Networks with smooth activation functions

Recall that given a graph $(V, E)$ and an activation function $\sigma$ we defined $\mathcal{N}_{(V,E),\sigma}$ to be the class of all neural networks implementable by the architecture $(V, E)$ with activation function $\sigma$ (see Lectures 5 and 6). Given a fixed architecture, a target function $f_{\omega,b} \in \mathcal{N}_{(V,E),\sigma}$ is parametrized by a set of weights $\omega : E \to \mathbb{R}$ and biases $b : V \to \mathbb{R}$. The empirical (0-1) loss is given by

$$L_S^{(0,1)}(\omega, b) = \frac{1}{m} \sum_{i=1}^{m} \ell_{0,1}\big(f^{(0,1)}_{\omega,b}(x^{(i)}), y_i\big),$$

where we add the superscript $(0,1)$ to indicate that we are considering a target function in the class $\mathcal{N}_{(V,E),\sigma_{\mathrm{sign}}}$.

Of course, the problem above is non-differentiable (in fact not continuous), so we cannot apply gradient-descent-like methods. We therefore make two alterations to the architectures considered so far. First, instead of $\sigma_{\mathrm{sign}}$ we consider a different activation function, namely

$$\sigma(a) = \frac{1}{1 + e^{-a}}.$$

This means that each neuron now returns as output $v^{(t)}_k(x) = \sigma\big(\langle \omega^{(t)}_k, v^{(t-1)}(x)\rangle + b^{(t)}_k\big)$, which is a smooth function of its parameters. In turn the function $f_{\omega,b}$ becomes smooth in its parameters (since it is a composition and sum of smooth functions).

Remark: Note that we care about smoothness in terms of $\omega$ and $b$. While $f_{\omega,b}$ is a function of $x$, in training we consider the empirical loss as a function of the parameters, and these are what we optimize over.

Of course, the target function now returns a real number rather than 0 or 1, so we also replace the 0-1 loss with a surrogate convex loss function. For concreteness we let $\ell(a, y) = (a - y)^2$. We thus obtain the differentiable empirical problem

$$\min_{\omega, b}\; L_S(\omega, b) = \frac{1}{m} \sum_{i=1}^{m} \ell\big(f_{\omega,b}(x^{(i)}), y_i\big).$$
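To make these two alterations concrete, here is a minimal NumPy sketch (an illustration, not part of the original notes) of a sigmoid-activated feed-forward network together with the surrogate empirical loss $L_S$. The architecture, variable names, and random data are assumptions chosen only for illustration.

    import numpy as np

    def sigmoid(a):
        # Smooth activation replacing the 0-1 / sign activation.
        return 1.0 / (1.0 + np.exp(-a))

    def forward(x, weights, biases):
        # v^(0) = x; each layer computes v^(t) = sigma(W^(t) v^(t-1) + b^(t)).
        v = x
        for W, b in zip(weights, biases):
            v = sigmoid(W @ v + b)
        return v  # output of the single output neuron

    def empirical_loss(X, y, weights, biases):
        # L_S(w, b) = (1/m) * sum_i (f_{w,b}(x_i) - y_i)^2, the surrogate squared loss.
        preds = np.array([forward(x, weights, biases).item() for x in X])
        return np.mean((preds - y) ** 2)

    # Tiny illustrative architecture: 3 inputs -> 4 hidden neurons -> 1 output.
    rng = np.random.default_rng(0)
    weights = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]
    biases = [rng.normal(size=4), rng.normal(size=1)]
    X = rng.normal(size=(10, 3))
    y = rng.integers(0, 2, size=10).astype(float)
    print(empirical_loss(X, y, weights, biases))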

To see that these alterations do not cause any loss in expressive power or generalization, we prove the following claim.

Claim 16.1. Let $(V, E)$ be a fixed feed-forward graph. Then for every sample $S$:

1. $\inf_{\omega, b} L_S(\omega, b) \le \inf_{\omega, b} L_S^{(0,1)}(\omega, b)$.

2. For every $(\omega^*, b^*)$:
$$\frac{1}{m}\sum_{i=1}^{m} \ell_{0,1}\big(\mathrm{sign}(f_{\omega^*,b^*}(x^{(i)})), y_i\big) \le L_S(\omega^*, b^*).$$

The first statement shows that we can achieve a solution that is competitive with the loss of the optimal neural network with 0-1 activation function. The second statement tells us that the thresholded solution obtained from the optimizer of $L_S$ will also have small 0-1 loss. In other words, by minimizing the differentiable problem, we achieve a solution with small empirical 0-1 loss.

Proof. For the first statement, note that $\lim_{a\to\infty}\sigma(a) = 1$ and $\lim_{a\to-\infty}\sigma(a) = 0$, hence
$$\lim_{c\to\infty} f_{c\omega,\, cb} = f^{(0,1)}_{\omega, b},$$
and therefore
$$\lim_{c\to\infty} L_S(c\omega, cb) \le L_S^{(0,1)}(\omega, b),$$
so the first statement holds.

As for the second statement, it follows from the fact that $\ell$ is a surrogate loss function.

Thus we have turned the non-smooth problem into a differentiable one. This means that we can now try to apply a gradient descent method, similar to SGD as we used for convex problems. There are two issues to overcome:

1. Though the loss function might be convex, the ERM problem as a whole, given its dependence on the parameters, is non-convex. We have only shown that SGD converges when the ERM problem is convex in the parameters.

2. To perform SGD we still need to compute the gradient $\nabla f_{\omega,b}$, where the dependence on the parameters may be highly involved.

The first problem turns out to be a real issue, and indeed there is no guarantee that SGD will converge to a global optimum when the problem is essentially non-convex. In fact, even convergence to a local minimum is not guaranteed, though one can show that SGD converges to a critical point (more accurately, to a point where $\|\nabla f_{\omega,b}\| \le \epsilon$, under certain smoothness assumptions). In practice, this is generally handled by re-running the algorithm from different initialization points, with the hope that one of the runs converges to a sufficiently good point (a short schematic sketch of this restart heuristic appears below). However, all the hardness results we discussed so far still apply: for any method, if the network is expressive enough to represent, for example, intersections of halfspaces, then on some instances the method must fail.

The second point is actually solvable, and we will next see how one can compute the gradient of the loss. This is known as the Backpropagation algorithm, which has become the workhorse of machine learning in the past few years.
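Before moving on, here is the promised schematic sketch of the restart heuristic. The function names init_fn, sgd_fn, and loss_fn are hypothetical placeholders for an initializer, an SGD routine, and the empirical loss; none of them are specified in these notes.

    import numpy as np

    def random_restarts(init_fn, sgd_fn, loss_fn, num_restarts=10, seed=0):
        # Re-run SGD from several random initialization points and keep the
        # parameters that achieve the smallest empirical (surrogate) loss.
        rng = np.random.default_rng(seed)
        best_params, best_loss = None, np.inf
        for _ in range(num_restarts):
            params = init_fn(rng)          # fresh random initialization
            params = sgd_fn(params, rng)   # run SGD from this starting point
            loss = loss_fn(params)         # empirical loss L_S of the final iterate
            if loss < best_loss:
                best_params, best_loss = params, loss
        return best_params, best_loss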

16.1.1 A Few Remarks on NNs in practice

Before presenting the Backpropagation algorithm, it is worth discussing some simplifications we have made here relative to what is often used in practice.

The activation function. We restrict our attention to a sigmoidal activation function. These have been used in the past, the general intuition being that they are a smoothing of the 0-1 activation function. In reality, training with sigmoidal functions tends to get stuck: when the weights are very large, the derivatives start to behave roughly like those of the 0-1 function, which means they vanish. One change that has been suggested is to use the ReLU activation function
$$\sigma_{\mathrm{relu}}(a) = \max(0, a).$$
Unlike the sigmoidal function, its derivative does not vanish whenever the input is positive. In terms of expressive power, ReLU networks can express sigmoidal-like functions using $\sigma_{\mathrm{relu}}(a+1) - \sigma_{\mathrm{relu}}(a)$, so the overall expressivity of the network does not change (as long as we allow twice as many neurons at each layer, which is the same order of neurons).

Regularization. For generalization we rely here on the generalization bound of $O(|E| \log |E|)$. In practice the number of free parameters (weights and biases) tends to be vastly larger than the number of examples. Therefore some regularization is often imposed on the weights (e.g. $\ell_2$ or $\ell_1$ regularization). There have also been other heuristics for regularizing neural networks, such as dropout, where, roughly, during training one zeroes out some weights during the update step. As we saw in a past lecture, SGD comes with its own generalization guarantees. Generalization bounds for SGD for non-convex optimization have recently been obtained [?], but these are not necessarily for the learning rates used in practice.

16.1.2 The Backpropagation Algorithm

We next discuss the Backpropagation algorithm, which computes the gradient with respect to $(\omega, b)$ in linear time. To simplify the notation, instead of carrying a bias term, let us assume that each layer $V^{(t)}$ contains a single neuron $v^{(t)}_0$ that always outputs the constant 1. Thus the output of a neuron is given by $\sigma(\langle \omega_k, v^{(t-1)} \rangle)$, and we suppress the bias $b_k$ as an additional weight $\omega_{k,0}$.

We next wish to compute the derivative $\frac{\partial f}{\partial \omega_{k,j}}$. Suppose neuron $v^{(t)}_k$ computes
$$v^{(t)}_k(x) = \sigma\big(u^{(t)}_k(x)\big), \quad \text{where} \quad u^{(t)}_k(x) = \langle \omega_k, v^{(t-1)}(x) \rangle.$$
Then using a simple chain rule we obtain that
$$\frac{\partial f}{\partial \omega_{k,j}} = \frac{\partial f}{\partial u^{(t)}_k} \cdot \frac{\partial u^{(t)}_k}{\partial \omega_{k,j}} = \frac{\partial f}{\partial u^{(t)}_k} \cdot v^{(t-1)}_j(x).$$
Thus, to compute the partial derivative with respect to a single weight, it is enough to compute $\frac{\partial f}{\partial u^{(t)}_k}$.
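As a quick numerical sanity check of this identity (an illustration, not part of the original notes, under the assumption that the neuron in question is the output neuron, so $f = \sigma(u^{(T)})$ and $\partial f/\partial u^{(T)} = \sigma'(u^{(T)})$), one can compare the chain-rule expression against a finite difference:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    rng = np.random.default_rng(1)
    v_prev = sigmoid(rng.normal(size=4))   # v^(T-1): outputs of the previous layer
    w = rng.normal(size=4)                 # weights into the output neuron

    def f(w):
        # f = sigma(u^(T)) with u^(T) = <w, v^(T-1)>
        return sigmoid(w @ v_prev)

    u = w @ v_prev
    j = 2
    # Chain rule: df/dw_j = sigma'(u^(T)) * v_j^(T-1), with sigma'(u) = sigma(u)(1 - sigma(u)).
    analytic = sigmoid(u) * (1.0 - sigmoid(u)) * v_prev[j]

    eps = 1e-6
    w_plus = w.copy()
    w_plus[j] += eps
    numeric = (f(w_plus) - f(w)) / eps
    print(analytic, numeric)               # the two values agree up to O(eps)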

So we focus on computing $\frac{\partial f}{\partial u^{(t)}_k}$. Suppose now that $f$ is a function of $u^{(t)}_1, \dots, u^{(t)}_m$, which are in turn functions of some variable $z$. Then by the chain rule we have

$$\frac{\partial f}{\partial z} = \sum_{k=1}^{m} \frac{\partial f}{\partial u^{(t)}_k} \cdot \frac{\partial u^{(t)}_k}{\partial z}. \qquad (16.1)$$

Now if $z = u^{(t-1)}_j$ is the output of some neuron in a previous layer, the calculation of $\frac{\partial u^{(t)}_k}{\partial u^{(t-1)}_j}$ is easy for our choice of activation function:
$$u^{(t)}_k = \sum_{j} \omega_{k,j}\, \sigma\big(u^{(t-1)}_j\big) \quad \Longrightarrow \quad \frac{\partial u^{(t)}_k}{\partial u^{(t-1)}_j} = \omega_{k,j}\, \sigma'\big(u^{(t-1)}_j\big).$$
Using Eq. (16.1) with $f = u^{(t)}_k$, we can recursively calculate all partial derivatives $\frac{\partial f}{\partial u^{(t')}_j}$ for $t' < t$, which in turn will also give us $\frac{\partial f}{\partial \omega_{k,j}}$.

The naive approach to calculating the gradient is to compute inductively all derivatives of the form $\frac{\partial u^{(t)}_k}{\partial u^{(t')}_j}$ for $t' < t$: having computed them up to layer $t$, we use Eq. (16.1) with $f = u^{(t+1)}_k$ to calculate all derivatives $\frac{\partial u^{(t+1)}_k}{\partial u^{(t')}_l}$. This calculation is invoked for each neuron a number of times proportional to $|E|$, the number of edges, so the overall running time is $O(|V| \cdot |E|)$. The backpropagation algorithm calculates the derivatives through dynamic programming and reduces the complexity to $O(|V| + |E|)$.

16.1.3 Backpropagation

We next consider an approach to calculating the partial derivatives that takes time $O(|V| + |E|)$.

Algorithm 1 Backpropagation
  Input: Graph $G(V, E)$ and parameters $\omega : E \to \mathbb{R}$.
  SET $T = \mathrm{depth}(G)$, i.e. $v^{(T)}$ is the output neuron.
  SET $m^{(T)} = 1$.
  for $t = T-1, \dots, 1$ do   % Start from the top layer and move toward the bottom layer
    for $k = 1, \dots, |V^{(t)}|$ do   % Go over all neurons at layer $t$
      Neuron $v^{(t)}_k$ receives the messages $m^{(t+1)}_j(v^{(t)}_k)$ and sums them up:
      $$\frac{\partial f}{\partial u^{(t)}_k} = \sum_{j=1}^{|V^{(t+1)}|} m^{(t+1)}_j\big(v^{(t)}_k\big),$$
      and passes a message $m^{(t)}_k(v^{(t-1)}_j)$ to each neuron at the lower level:
      $$m^{(t)}_k\big(v^{(t-1)}_j\big) = \frac{\partial f}{\partial u^{(t)}_k} \cdot \frac{\partial u^{(t)}_k}{\partial u^{(t-1)}_j}.$$
    end for
  end for
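The following is a minimal sketch of Algorithm 1 for a fully connected layered network, written as an illustration rather than a definitive implementation. The layer shapes, the function names, and the convention that $f = u^{(T)}$ (the output pre-activation, so that the top message is 1, matching the initialization $m^{(T)} = 1$) are assumptions. Each layer sums the messages arriving from above to obtain $\partial f/\partial u^{(t)}$ and then passes $\omega_{k,j}\,\sigma'(u^{(t-1)}_j)\,\frac{\partial f}{\partial u^{(t)}_k}$ downward; the per-weight gradients follow from the chain rule of Section 16.1.2.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def sigmoid_prime(a):
        s = sigmoid(a)
        return s * (1.0 - s)

    def forward_pass(x, weights):
        # u[i] are the pre-activations of layer i+1, v[i] are layer outputs, v[0] = x.
        v, u = [x], []
        for W in weights:
            u.append(W @ v[-1])
            v.append(sigmoid(u[-1]))
        return u, v

    def backprop_messages(x, weights):
        # Returns deltas[i] = df/du for the pre-activations of each layer, and
        # grads[i][k, j] = df/dw for the weight from neuron j into neuron k,
        # where f = u^(T) is the output pre-activation (so the top delta is 1).
        u, v = forward_pass(x, weights)
        T = len(weights)
        deltas = [None] * T
        deltas[T - 1] = np.ones_like(u[T - 1])                 # m^(T) = 1
        for t in range(T - 2, -1, -1):                         # top layer down to the bottom
            # Sum of incoming messages: sum_k w_{k,j} * sigma'(u_j) * df/du_k (layer above).
            deltas[t] = (weights[t + 1].T @ deltas[t + 1]) * sigmoid_prime(u[t])
        grads = [np.outer(deltas[t], v[t]) for t in range(T)]  # df/dw_{k,j} = delta_k * v_j
        return deltas, grads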

Claim 16.2. At each node $v^{(t)}_k$, the value computed by the algorithm is exactly $\frac{\partial f}{\partial u^{(t)}_k}$.

Proof. We prove the statement by induction. At the output neuron the value is given by $m^{(T)} = \frac{\partial u^{(T)}}{\partial u^{(T)}} = 1$. Next, for each neuron we have by induction
$$\frac{\partial f}{\partial u^{(t)}_j} = \sum_{k=1}^{|V^{(t+1)}|} \frac{\partial f}{\partial u^{(t+1)}_k} \cdot \frac{\partial u^{(t+1)}_k}{\partial u^{(t)}_j} = \sum_{k=1}^{|V^{(t+1)}|} \frac{\partial f}{\partial u^{(t+1)}_k}\, \omega_{k,j}\, \sigma'\big(u^{(t)}_j\big), \qquad (16.2)$$
which by the chain rule equals the sum of the messages received at $v^{(t)}_j$, giving the desired result.

In Backpropagation, each neuron performs a number of computations proportional to its degree. Overall, the number of computations is proportional to twice the number of edges, which gives a total of $O(|V| + |E|)$ computations.
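As a usage example, and as an informal numerical check of Claim 16.2, the per-weight gradients produced by the message passing can be compared against a finite difference of $f = u^{(T)}$. This snippet reuses the hypothetical forward_pass and backprop_messages functions from the sketch above; the network sizes and the chosen weight index are arbitrary.

    import numpy as np

    rng = np.random.default_rng(2)
    weights = [rng.normal(size=(4, 3)), rng.normal(size=(3, 4)), rng.normal(size=(1, 3))]
    x = rng.normal(size=3)

    def f(ws):
        u, _ = forward_pass(x, ws)   # forward_pass from the sketch above
        return u[-1].item()          # f = u^(T), the output pre-activation

    _, grads = backprop_messages(x, weights)

    t, k, j = 1, 2, 1                # an arbitrary weight w^(t)_{k,j}
    eps = 1e-6
    perturbed = [W.copy() for W in weights]
    perturbed[t][k, j] += eps
    numeric = (f(perturbed) - f(weights)) / eps
    print(grads[t][k, j], numeric)   # backpropagated and numerical values agree up to O(eps)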