Statistical Learning Theory, Part I
5. Deep Learning
Sumio Watanabe, Tokyo Institute of Technology
Review: Supervised Learning
Information source: q(x, y) = q(x) q(y|x)
Training data: (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)
Test data: (X, Y)
Learning machine: y = f(x, w)
Review: Training and Generalization
Training error:       E(w) = (1/n) Σ_{i=1}^{n} (Y_i - f(X_i, w))^2
Generalization error: F(w) = (1/m) Σ_{i=1}^{m} (Y'_i - f(X'_i, w))^2, computed on the test data (X'_i, Y'_i).
Stochastic gradient descent: at each step t, one example (X_i, Y_i) is chosen at random and
w(t+1) - w(t) = - η(t) grad_w (Y_i - f(X_i, w(t)))^2
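As an illustration of the stochastic gradient descent rule above, here is a minimal sketch in Python/NumPy. The linear model, the data, and the step size are illustrative choices, not part of the lecture.

import numpy as np

def sgd_step(w, x, y, f, grad_f, eta):
    # one SGD update on a single example (x, y):
    # w(t+1) = w(t) - eta * grad_w (y - f(x, w))^2
    residual = f(x, w) - y
    return w - eta * 2.0 * residual * grad_f(x, w)

# illustrative example: fit a line y = w[0] * x + w[1]
f      = lambda x, w: w[0] * x + w[1]
grad_f = lambda x, w: np.array([x, 1.0])          # d f / d w
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 100)
Y = 2.0 * X + 0.5 + 0.1 * rng.standard_normal(100)

w = np.zeros(2)
for t in range(2000):
    i = rng.integers(len(X))                      # pick one example at random
    w = sgd_step(w, X[i], Y[i], f, grad_f, eta=0.05)
print(w)                                          # approaches [2.0, 0.5]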
I-5-1 Deep Learning
Deep Network
We have studied a three-layer network. It is easy to define a network with deeper layers:
o_i = σ( Σ_{j=1}^{H2} w_ij σ( Σ_{k=1}^{H1} w_jk σ( Σ_{m=1}^{M} w_km x_m + θ_k ) + θ_j ) + θ_i )
Note: w_ij, w_jk, and w_km are different parameters.
[Figure: a four-layer network with input (x_1, ..., x_M), hidden layers of sizes H1 and H2, and output (o_1, ..., o_N)]
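A minimal sketch of this forward computation, assuming Python/NumPy; the layer sizes and the names W1, W2, W3, forward are illustrative choices, not fixed by the lecture.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

M, H1, H2, N = 25, 8, 6, 2          # input, hidden 1, hidden 2, output sizes
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((H1, M)) * 0.1, np.zeros(H1)   # w_km, theta_k
W2, b2 = rng.standard_normal((H2, H1)) * 0.1, np.zeros(H2)  # w_jk, theta_j
W3, b3 = rng.standard_normal((N, H2)) * 0.1, np.zeros(N)    # w_ij, theta_i

def forward(x):
    o_k = sigmoid(W1 @ x + b1)      # input    -> hidden 1
    o_j = sigmoid(W2 @ o_k + b2)    # hidden 1 -> hidden 2
    o_i = sigmoid(W3 @ o_j + b3)    # hidden 2 -> output
    return o_i

x = rng.uniform(0, 1, M)            # e.g. a 5x5 image flattened to 25 values
print(forward(x))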
Error Backpropagation (1)
It is easy to generalize the training algorithm to a deeper network.
The network output is o = (o_i), the desired output is y = (y_i), and the square error is
E(w) = (1/2) Σ_{i=1}^{N} (o_i - y_i)^2     (N = output dimension)
with layers
Hidden 2 → Output:   o_i = σ( Σ_{j=1}^{H2} w_ij o_j + θ_i )
Hidden 1 → Hidden 2: o_j = σ( Σ_{k=1}^{H1} w_jk o_k + θ_j )
Input → Hidden 1:    o_k = σ( Σ_{m=1}^{M} w_km x_m + θ_k )
(1) Hidden 2 → Output
∂E/∂w_ij = (o_i - y_i) ∂o_i/∂w_ij = (o_i - y_i) o_i (1 - o_i) o_j = δ_i o_j,
where δ_i = (o_i - y_i) o_i (1 - o_i).
Error Backpropagation (2)
E(w) = (1/2) Σ_{i=1}^{N} (o_i - y_i)^2
(2) Hidden 1 → Hidden 2
∂E/∂w_jk = Σ_{i=1}^{N} (o_i - y_i) (∂o_i/∂o_j)(∂o_j/∂w_jk),
with ∂o_i/∂o_j = o_i (1 - o_i) w_ij and ∂o_j/∂w_jk = o_j (1 - o_j) o_k, so
∂E/∂w_jk = Σ_{i=1}^{N} (o_i - y_i) o_i (1 - o_i) w_ij o_j (1 - o_j) o_k = δ_j o_k,
where δ_j = ( Σ_i δ_i w_ij ) o_j (1 - o_j).
(Layer definitions as in (1).)
Error Backpropagation (3)
E(w) = (1/2) Σ_{i=1}^{N} (o_i - y_i)^2
(3) Input → Hidden 1
∂E/∂w_km = Σ_{i=1}^{N} Σ_{j=1}^{H2} (o_i - y_i) (∂o_i/∂o_j)(∂o_j/∂o_k)(∂o_k/∂w_km)
         = Σ_{i=1}^{N} Σ_{j=1}^{H2} (o_i - y_i) o_i (1 - o_i) w_ij o_j (1 - o_j) w_jk o_k (1 - o_k) x_m = δ_k x_m,
where δ_k = ( Σ_j δ_j w_jk ) o_k (1 - o_k).
(Layer definitions as in (1).)
Deep Training Algorithm
Forward pass:
(1) Input → Hidden 1:    o_k = σ( Σ_{m=1}^{M} w_km x_m + θ_k )
(2) Hidden 1 → Hidden 2: o_j = σ( Σ_{k=1}^{H1} w_jk o_k + θ_j )
(3) Hidden 2 → Output:   o_i = σ( Σ_{j=1}^{H2} w_ij o_j + θ_i )
Backward pass:
(4) Hidden 2 → Output:   δ_i = (o_i - y_i) o_i (1 - o_i),       Δw_ij = -η(t) δ_i o_j,  Δθ_i = -η(t) δ_i
(5) Hidden 1 → Hidden 2: δ_j = ( Σ_i δ_i w_ij ) o_j (1 - o_j),  Δw_jk = -η(t) δ_j o_k,  Δθ_j = -η(t) δ_j
(6) Input → Hidden 1:    δ_k = ( Σ_j δ_j w_jk ) o_k (1 - o_k),  Δw_km = -η(t) δ_k x_m,  Δθ_k = -η(t) δ_k
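A minimal sketch of one training step implementing (1)-(6), assuming Python/NumPy; the matrices W1, W2, W3 correspond to w_km, w_jk, w_ij, and train_step is an illustrative name, not from the lecture.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def train_step(x, y, W1, b1, W2, b2, W3, b3, eta):
    # forward pass (1)-(3)
    o_k = sigmoid(W1 @ x + b1)
    o_j = sigmoid(W2 @ o_k + b2)
    o_i = sigmoid(W3 @ o_j + b3)
    # backward pass (4)-(6)
    d_i = (o_i - y) * o_i * (1 - o_i)          # delta_i
    d_j = (W3.T @ d_i) * o_j * (1 - o_j)       # delta_j
    d_k = (W2.T @ d_j) * o_k * (1 - o_k)       # delta_k
    # gradient descent updates with step size eta = eta(t)
    W3 -= eta * np.outer(d_i, o_j); b3 -= eta * d_i
    W2 -= eta * np.outer(d_j, o_k); b2 -= eta * d_j
    W1 -= eta * np.outer(d_k, x);   b1 -= eta * d_k
    return 0.5 * np.sum((o_i - y) ** 2)        # square error of this example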
Example
A four-layer network is trained for character recognition.
Characters: 5*5 images (25 inputs). Training data: 400. Test data: 400.
[Figure: sample character images such as '0' and '6', and the deep network]
Example
[Figure: training error and generalization error versus training cycle for the trained deep network]
I-5-2 Parameter Space of Deep Learning
Deeper network
It is easy to define an arbitrarily deep network:
f_i = σ( Σ_{j=1}^{H} u_ij σ( Σ_{k} w_jk ( · )_k + θ_j ) + φ_i ),
where ( · )_k denotes the output of the lower layers, which may themselves form a deep network of the input (x_1, ..., x_M).
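A minimal sketch of a network with any number of layers, assuming Python/NumPy; representing the parameters as a list of (W, b) pairs is an illustrative choice, not from the lecture.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward_deep(x, layers):
    # layers = [(W1, b1), (W2, b2), ...], applied from the input upward
    o = x
    for W, b in layers:
        o = sigmoid(W @ o + b)
    return o

sizes = [25, 8, 6, 4, 2]                       # as many layers as desired
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
print(forward_deep(rng.uniform(0, 1, 25), layers))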
Difficulty of deep learning
The network output (f_1, ..., f_N) is fitted to the desired output by minimizing the square error.
A parameter near the output layer is far from the input, and a parameter near the input layer is far from the output, so it is difficult for such parameters to estimate the relation between input and output.
Hence it takes much time for a deep network to find the probabilistic relation between inputs and outputs. Improved methods are now being studied.
[Figure: a deep network from input (x_1, ..., x_M) to output (f_1, ..., f_N), with the parameters far from the input and far from the output indicated]
Parameter Space of a Deep Network
[Figure: square error surface over the very high dimensional parameter space]
The minimum square error estimator diverges, and there are locally optimal parameter sets with singularities.
Likelihood of Y = a tanh(bX) + noise (n = 100)
[Figure: the data (n = 100), the likelihood over the parameter (a, b), the estimates by maximum likelihood and Bayes, and the resulting generalization errors]
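A minimal sketch of how the likelihood surface of this model can be computed over a grid of (a, b), assuming Python/NumPy, Gaussian noise, and illustrative true parameters; it only locates the maximum likelihood point and does not reproduce the Bayes comparison shown in the figure.

import numpy as np

rng = np.random.default_rng(0)
n = 100
a0, b0 = 0.3, 0.5                                # illustrative small true parameters (near-singular case)
X = rng.standard_normal(n)
Y = a0 * np.tanh(b0 * X) + 0.1 * rng.standard_normal(n)

a = np.linspace(-2, 2, 201)
b = np.linspace(-2, 2, 201)
A, B = np.meshgrid(a, b)
# Gaussian log-likelihood up to scale and constant: minus half the residual sum of squares
loglik = -0.5 * np.sum((Y[None, None, :] - A[..., None] * np.tanh(B[..., None] * X)) ** 2, axis=-1)
i, j = np.unravel_index(np.argmax(loglik), loglik.shape)
print("ML estimate (a, b):", A[i, j], B[i, j])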
Parameter Space in Deep Learning
(1) The parameter space has very high dimension.
(2) The minimum square error estimator often diverges.
(3) The square error cannot be approximated by any quadratic form.
(4) There are many locally optimal parameter subsets which have singularities.
(5) The thermodynamical limit does not exist.
(6) Bayesian estimation makes the generalization error smaller than the maximum likelihood method.
There is still no mathematical foundation by which such a statistical model can be analyzed.
Example: Character Recognition
Characters: 5*5 images. Training data: 2000. Test data: 2000.
Network: Input 25 → Hidden 8 → Hidden 6 → Hidden 4 → Output 2.
[Figure: sample images of '0' and '6', and the deep network]
(1) Simple error backpropagation
Training and test errors are observed for different initial parameters.
Training error: average 213.5, standard deviation 414.7.
Test error: average 265.5, standard deviation 388.0.
(2) Sequential Learning
First, a shallow network is trained; then deeper networks are trained sequentially, each one copying the already trained layers of the previous network.
[Figure: a three-layer network is trained toward the desired output (f_1, ..., f_N), its layers are copied into a four-layer network, and so on, each taking the input (x_1, ..., x_M)]
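One possible reading of the copy step, as a minimal Python/NumPy sketch: the trained lower layers are kept, and a new hidden layer with freshly initialized weights is inserted below a new output layer before training continues. grow_network and the list-of-(W, b) representation are illustrative names, not from the lecture.

import numpy as np

def grow_network(shallow_layers, new_hidden_size, rng):
    # shallow_layers = [(W1, b1), ..., (W_out, b_out)], already trained
    W_out, b_out = shallow_layers[-1]
    lower = shallow_layers[:-1]                       # copied, trained layers
    H_prev = W_out.shape[1]                           # size of the layer below the old output
    W_new = rng.standard_normal((new_hidden_size, H_prev)) * 0.1
    b_new = np.zeros(new_hidden_size)
    W_out2 = rng.standard_normal((W_out.shape[0], new_hidden_size)) * 0.1
    b_out2 = np.zeros(W_out.shape[0])
    return lower + [(W_new, b_new), (W_out2, b_out2)]

# illustrative usage (train() stands for any backpropagation routine, e.g. the earlier sketches):
#   layers3 = train(initial_three_layer, data)        # train the shallow network first
#   layers4 = grow_network(layers3, new_hidden_size=6, rng=np.random.default_rng(0))
#   layers4 = train(layers4, data)                    # then train the deeper network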
Sequential learning
Training error: average 4.1, standard deviation 1.8.
Test error: average 61.6, standard deviation 7.0.
[Figure: results for the three-layer and four-layer networks]
I-5-3 Autoencoder
Autoencoder
The same vector X = (X_1, X_2, ..., X_M) is used as both the input and the desired output, so the data act as their own teacher signal.
The number of hidden units is made smaller than M. If, for example, the hidden layer has 3 units, a 3-dimensional manifold in R^M is obtained.
[Figure: network with input (X_1, ..., X_M) at the bottom, a narrow hidden layer, and output (X_1, ..., X_M) at the top]
Learning in the Autoencoder
The encoder g = g(x, w_1) maps the input to the compressed information, and the decoder f(g, w_2) reconstructs the input from g.
The same error backpropagation can be applied to the square error
E(w) = Σ_i || x_i - f( g(x_i, w_1), w_2 ) ||^2,
where w = (w_1, w_2) is the parameter.
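A minimal sketch of one training step for this square error, assuming Python/NumPy, a single sigmoid hidden layer as the compressed information g, and a linear decoder output; all names and sizes are illustrative choices.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

M, K = 30, 4                                   # data dimension, compressed dimension
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((K, M)) * 0.1, np.zeros(K)   # encoder  g(x, w1)
W2, b2 = rng.standard_normal((M, K)) * 0.1, np.zeros(M)   # decoder  f(g, w2)

def autoencoder_step(x, eta):
    global W1, b1, W2, b2
    g = sigmoid(W1 @ x + b1)                   # encoder: compressed information
    x_hat = W2 @ g + b2                        # decoder: reconstruction (linear output)
    d_out = x_hat - x                          # gradient of 0.5*||x - x_hat||^2 w.r.t. x_hat
    d_hid = (W2.T @ d_out) * g * (1 - g)
    W2 -= eta * np.outer(d_out, g); b2 -= eta * d_out
    W1 -= eta * np.outer(d_hid, x); b1 -= eta * d_hid
    return 0.5 * np.sum(d_out ** 2)

X = rng.uniform(0, 1, (500, M))                # illustrative data
for epoch in range(20):
    for x in X:
        autoencoder_step(x, eta=0.05)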
Example
Many 5*6 images are prepared.
How to use the autoencoder (1)
The 30-dimensional images (input) are compressed into 4-dimensional vectors at the middle layer, and noise is reduced in the reconstructed output.
The autoencoder can be understood as nonlinear principal component analysis (PCA), because PCA minimizes
E(A, B) = Σ_i || x_i - A B x_i ||^2,
where A and B are matrices whose ranks are not larger than K (K < M).
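A minimal sketch of this linear case, assuming Python/NumPy: taking the encoder B as the top-K principal directions of the centered data and the decoder A = B^T minimizes E(A, B); the data here are illustrative.

import numpy as np

rng = np.random.default_rng(0)
M, K = 30, 4
X = rng.standard_normal((500, M)) @ rng.standard_normal((M, M)) * 0.1   # correlated illustrative data

Xc = X - X.mean(axis=0)                     # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
B = Vt[:K]                                  # K x M encoder (top-K principal directions)
A = B.T                                     # M x K decoder
err = np.sum((Xc - Xc @ B.T @ A.T) ** 2)    # reconstruction error of the rank-K PCA
print(err)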
How to use the autoencoder (2)
The 30-dimensional data are compressed into [0,1]^2. By using the decoder, a 30-dimensional image can be generated from the 2-dimensional information.
A two-dimensional map of the 30-dimensional images is generated.
[Figure: the images arranged in the (x, y) plane of the compressed representation]
I-5-4 Convolution Neural Network
Data Structure
For several kinds of data, structural properties are known before learning.
Image: each pixel has almost the same value as its neighbors, except at boundaries.
Speech: nonlinear expansion and contraction in time are contained.
Object recognition: the same object can be observed from different angles.
Weather: prefectures in the same region have almost the same weather.
By using such data structure, an appropriate network can be devised.
Image analysis and convolution network
In image analysis, convolution networks are often employed successfully.
[Figure: local analyses in the image are combined into global information]
Multi-resolution analysis
In multi-resolution analysis, local analyses are successively integrated into global information. A convolution network can be understood as a kind of multi-resolution analysis.
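A minimal sketch of this local-to-global idea, assuming Python/NumPy: a shared small kernel performs the local analysis, and average pooling integrates neighboring responses into coarser information. The sizes and names are illustrative, not from the lecture.

import numpy as np

def conv2d(image, kernel):
    # valid convolution: the same small kernel is applied at every local position
    H, W = image.shape
    h, w = kernel.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
    return out

def pool2(feature):
    # 2x2 average pooling: local responses are integrated into a coarser map
    H, W = feature.shape
    return feature[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.uniform(0, 1, (8, 8))              # an 8x8 illustrative input image
kernel = rng.standard_normal((3, 3)) * 0.1     # one shared 3x3 filter (local analysis)
feature = conv2d(image, kernel)                # 6x6 local feature map
print(pool2(feature).shape)                    # (3, 3): coarser, more global map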
Time Delay Neural Network (TDNN)
In human speech, local expansion and contraction in time are often observed: "OHAYO" is sometimes pronounced as "O H Ha Yo O O", "O Ha Yo", or "O Ha Y". The time delay neural network was devised so that it can absorb such modifications.
[Figure: the desired output over time and the time-shifted input windows]
Deep Learning and Feature Extraction
(1) Automatic feature extraction. In deep learning, feature vectors are automatically generated at the hidden units. If you are lucky, you can find an unknown, appropriate feature vector for a set of training data.
(2) Preprocessing using feature vectors. If you want to make a recognition system invariant to parallel translation or rotation, you had better use invariant feature vectors made by preprocessing.
Example
A time sequence is predicted from its past 27 values by a nonlinear function f,
x(t) = f(x(t-1), x(t-2), ..., x(t-27)),
which is made by deep learning.
[Figure: a fully connected network and a convolution network applied to the time window]
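A minimal sketch of how the lagged inputs for x(t) = f(x(t-1), ..., x(t-27)) can be constructed, assuming Python/NumPy; the sine-wave signal here is illustrative, not the series used in the lecture.

import numpy as np

def make_lagged(series, lags=27):
    # each row is (x(t-1), ..., x(t-lags)) and the target is x(t)
    X = np.array([series[t - lags:t][::-1] for t in range(lags, len(series))])
    y = series[lags:]
    return X, y

t = np.arange(400)
series = np.sin(0.2 * t) + 0.05 * np.random.default_rng(0).standard_normal(400)
X, y = make_lagged(series, lags=27)
print(X.shape, y.shape)     # (373, 27) inputs and (373,) targets for f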
Example
Hakusai (Chinese cabbage) prices by month, from January 1970 to December 2013. Source: e-Stat, http://www.e-stat.go.jp/sg1/estat/estattopportal.do
[Figure: price versus month; training data (red: data, blue: trained) and test data (red: data, blue: predicted)]