Statistical Learning Theory, Part I, 5. Deep Learning. Sumio Watanabe, Tokyo Institute of Technology.

Review: Supervised Learning. Training data (X_1, Y_1), ..., (X_n, Y_n) are generated by the information source q(x,y) = q(x) q(y|x); test data (X, Y) come from the same source. The learning machine is y = f(x, w).

Review: Training and Generalization. Training error: E(w) = (1/n) Σ_{i=1}^{n} (Y_i - f(X_i, w))^2. Generalization error (on m test data): F(w) = (1/m) Σ_{i=1}^{m} (Y_i - f(X_i, w))^2. Stochastic gradient descent: at each step t, a pair (X_i, Y_i) is chosen at random and w(t+1) - w(t) = -η(t) grad (Y_i - f(X_i, w(t)))^2.
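As a minimal illustration of this update rule (my own sketch, not from the slides; the model f and its gradient grad_f are assumed placeholders), stochastic gradient descent can be written as:

```python
import numpy as np

def sgd(X, Y, w, f, grad_f, eta=0.01, steps=10000, seed=0):
    """Stochastic gradient descent on the square error (Y_i - f(X_i, w))^2.

    f(x, w)      : model output (assumed placeholder)
    grad_f(x, w) : gradient of f with respect to w (assumed placeholder)
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    for t in range(steps):
        i = rng.integers(n)                        # (X_i, Y_i) is randomly chosen
        residual = Y[i] - f(X[i], w)               # Y_i - f(X_i, w(t))
        # grad of (Y_i - f)^2 with respect to w is -2 * residual * grad_f,
        # so w(t+1) = w(t) - eta * grad = w(t) + 2 * eta * residual * grad_f
        w = w + 2.0 * eta * residual * grad_f(X[i], w)
    return w

# Usage with a linear model f(x, w) = w . x (illustrative only)
X = np.random.randn(100, 3)
Y = X @ np.array([1.0, -2.0, 0.5])
w_hat = sgd(X, Y, np.zeros(3), lambda x, w: w @ x, lambda x, w: x)
```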

I-5-1 Deep Learning

Deep Network. We have already learned the three-layer network; it is easy to define a network with deeper layers. For a four-layer network with input x = (x_1, ..., x_M), hidden layers of sizes H_1 and H_2, and output o = (o_1, ..., o_N),
o_i = σ( Σ_{j=1}^{H_2} w_ij σ( Σ_{k=1}^{H_1} w_jk σ( Σ_{m=1}^{M} w_km x_m + θ_k ) + θ_j ) + θ_i ).
Note: w_ij, w_jk, and w_km are different parameters.
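A minimal numpy sketch of this forward computation (a sketch only; the layer sizes below are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """o_i = sigma( sum_j w_ij sigma( sum_k w_jk sigma( sum_m w_km x_m + th_k ) + th_j ) + th_i )."""
    W_km, th_k, W_jk, th_j, W_ij, th_i = params
    o_k = sigmoid(W_km @ x + th_k)    # input    -> hidden 1
    o_j = sigmoid(W_jk @ o_k + th_j)  # hidden 1 -> hidden 2
    o_i = sigmoid(W_ij @ o_j + th_i)  # hidden 2 -> output
    return o_i, o_j, o_k

# Illustrative sizes: M = 25 inputs, H1 = 8, H2 = 6, N = 2 outputs (assumptions)
M, H1, H2, N = 25, 8, 6, 2
rng = np.random.default_rng(0)
params = (rng.normal(0.0, 0.1, (H1, M)), np.zeros(H1),
          rng.normal(0.0, 0.1, (H2, H1)), np.zeros(H2),
          rng.normal(0.0, 0.1, (N, H2)), np.zeros(N))
o_i, o_j, o_k = forward(rng.random(M), params)
```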

Error Backpropagation (1). It is easy to generalize the training algorithm to deeper networks. Let the network output be o = (o_i) and the desired output be y = (y_i), with square error E(w) = (1/2) Σ_{i=1}^{N} (o_i - y_i)^2, where N is the output dimension. The layers are
Hidden 2 -> Output: o_i = σ( Σ_{j=1}^{H_2} w_ij o_j + θ_i ),
Hidden 1 -> Hidden 2: o_j = σ( Σ_{k=1}^{H_1} w_jk o_k + θ_j ),
Input -> Hidden 1: o_k = σ( Σ_{m=1}^{M} w_km x_m + θ_k ).
(1) Hidden 2 -> Output:
∂E/∂w_ij = (o_i - y_i) ∂o_i/∂w_ij = (o_i - y_i) o_i (1 - o_i) o_j = δ_i o_j, where δ_i = (o_i - y_i) o_i (1 - o_i).

Error Backpropagation (2).
(2) Hidden 1 -> Hidden 2:
∂E/∂w_jk = Σ_{i=1}^{N} (o_i - y_i) (∂o_i/∂o_j)(∂o_j/∂w_jk),
with ∂o_i/∂o_j = o_i (1 - o_i) w_ij and ∂o_j/∂w_jk = o_j (1 - o_j) o_k, so
∂E/∂w_jk = Σ_{i=1}^{N} (o_i - y_i) o_i (1 - o_i) w_ij o_j (1 - o_j) o_k = δ_j o_k, where δ_j = Σ_i δ_i w_ij o_j (1 - o_j).
(The layer equations are the same as in (1).)

Error Backpropagation (3).
(3) Input -> Hidden 1:
∂E/∂w_km = Σ_{i=1}^{N} Σ_{j=1}^{H_2} (o_i - y_i) (∂o_i/∂o_j)(∂o_j/∂o_k)(∂o_k/∂w_km)
= Σ_{i=1}^{N} Σ_{j=1}^{H_2} (o_i - y_i) o_i (1 - o_i) w_ij o_j (1 - o_j) w_jk o_k (1 - o_k) x_m = δ_k x_m, where δ_k = Σ_j δ_j w_jk o_k (1 - o_k).

Deep Training Algorithm.
Forward pass:
(1) Input -> Hidden 1: o_k = σ( Σ_{m=1}^{M} w_km x_m + θ_k )
(2) Hidden 1 -> Hidden 2: o_j = σ( Σ_{k=1}^{H_1} w_jk o_k + θ_j )
(3) Hidden 2 -> Output: o_i = σ( Σ_{j=1}^{H_2} w_ij o_j + θ_i )
Backward pass:
(4) Hidden 2 -> Output: δ_i = (o_i - y_i) o_i (1 - o_i), Δw_ij = -η(t) δ_i o_j, Δθ_i = -η(t) δ_i
(5) Hidden 1 -> Hidden 2: δ_j = Σ_i δ_i w_ij o_j (1 - o_j), Δw_jk = -η(t) δ_j o_k, Δθ_j = -η(t) δ_j
(6) Input -> Hidden 1: δ_k = Σ_j δ_j w_jk o_k (1 - o_k), Δw_km = -η(t) δ_k x_m, Δθ_k = -η(t) δ_k
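Steps (1)-(6) translate directly into code. The following numpy sketch (my own illustration, not the simulation code used in the slides) performs one stochastic-gradient update of the four-layer sigmoid network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, y, params, eta):
    """One backpropagation update for the four-layer sigmoid network, following steps (1)-(6)."""
    W_km, th_k, W_jk, th_j, W_ij, th_i = params

    # Forward pass: steps (1)-(3)
    o_k = sigmoid(W_km @ x + th_k)     # input    -> hidden 1
    o_j = sigmoid(W_jk @ o_k + th_j)   # hidden 1 -> hidden 2
    o_i = sigmoid(W_ij @ o_j + th_i)   # hidden 2 -> output

    # Backward pass: steps (4)-(6)
    delta_i = (o_i - y) * o_i * (1 - o_i)            # (4)
    delta_j = (W_ij.T @ delta_i) * o_j * (1 - o_j)   # (5)
    delta_k = (W_jk.T @ delta_j) * o_k * (1 - o_k)   # (6)

    # Updates: delta_w = -eta * delta * (previous layer output), delta_theta = -eta * delta
    W_ij -= eta * np.outer(delta_i, o_j); th_i -= eta * delta_i
    W_jk -= eta * np.outer(delta_j, o_k); th_j -= eta * delta_j
    W_km -= eta * np.outer(delta_k, x);   th_k -= eta * delta_k
    return 0.5 * np.sum((o_i - y) ** 2)              # square error E(w) for this sample
```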

Example. A four-layer network for character recognition is simulated. Characters are 5x5 images (25 inputs); training data 400, test data 400.

Example. Plot of training error and generalization error of the trained deep network against the training cycle.

I-5-2 Parameter Space of Deep Learning

Deeper Network. It is easy to define a still deeper network: with input x = (x_1, ..., x_M) and outputs f = (f_1, ..., f_N),
f_i = σ( Σ_{j=1}^{H} u_ij σ( Σ_{k=1}^{M} w_jk ( · )_k + θ_j ) + φ_i ),
where ( · )_k stands for the output of the further layers below.

Difficulty of Deep Learning. The square error between the output f = (f_1, ..., f_N) and the desired output is minimized. Parameters near the output are far from the input, and parameters near the input are far from the output, so it is difficult for such parameters to estimate the relation between input and output. Consequently, a deep network takes much time to find the probabilistic relation between inputs and outputs. Improved methods are now being studied.

Parameter Space of a Deep Network. The parameter space is very high dimensional, the minimum square error estimator diverges, and the square-error surface contains locally optimal sets with singularities.

Example: likelihood of Y = a tanh(bX) + noise with n = 100 data. Figures show the data, the likelihood on the parameter space (a, b), the estimates by maximum likelihood and Bayes, and the generalization errors of ML and Bayes.

Parameter Space in Deep Learning.
(1) The parameter space has very high dimension.
(2) The minimum square error estimator often diverges.
(3) The square error cannot be approximated by any quadratic form.
(4) There are many locally optimal parameter subsets which have singularities.
(5) The thermodynamic limit does not exist.
(6) Bayesian estimation makes the generalization error smaller than the maximum likelihood method.
There is still no mathematical foundation by which such statistical models can be analyzed.

Example. Character recognition with 5x5 images; training data 2000, test data 2000. Network: Input 25 - Hidden 8 - Hidden 6 - Hidden 4 - Output 2.

(1) Simple error backpropagation. Training and test errors are observed while changing the initial parameters. Training error: average 213.5, standard deviation 414.7. Test error: average 265.5, standard deviation 388.0.

(2) Sequential Learning. First a shallow network is trained; its trained layers are then copied into a deeper network, which is trained in turn, and this is applied sequentially to deeper and deeper networks.

Sequential learning (three-layer, then four-layer). Training error: average 4.1, standard deviation 1.8. Test error: average 61.6, standard deviation 7.0.

I-5-3 Autoencoder

Autoencoder. For both the input and the desired output (teacher), the same X = (X_1, X_2, ..., X_M) is prepared. The number of hidden units is made smaller than M; with 3 hidden units, a 3-dimensional manifold in R^M is obtained.

Learning in the Autoencoder. The encoder produces the compressed information g = g(x, w_1) and the decoder produces the output f(g, w_2). The same error backpropagation can be applied to the square error
E(w) = Σ_i || x_i - f( g(x_i, w_1), w_2 ) ||^2,
where w = (w_1, w_2) is the parameter.
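A minimal numpy sketch of this autoencoder, with one sigmoid encoder g(x, w_1) and one sigmoid decoder f(g, w_2), trained by the same square-error backpropagation (the sizes and data below are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_step(x, params, eta=0.1):
    """One SGD step minimizing || x - f(g(x, w1), w2) ||^2 for a one-hidden-layer autoencoder."""
    W1, b1, W2, b2 = params
    g = sigmoid(W1 @ x + b1)                         # encoder: compressed information g(x, w1)
    f = sigmoid(W2 @ g + b2)                         # decoder: reconstruction f(g, w2)

    delta_out = (f - x) * f * (1 - f)                # output-layer delta (the target is x itself)
    delta_hid = (W2.T @ delta_out) * g * (1 - g)     # hidden-layer delta

    W2 -= eta * np.outer(delta_out, g); b2 -= eta * delta_out
    W1 -= eta * np.outer(delta_hid, x); b1 -= eta * delta_hid
    return np.sum((x - f) ** 2)

# Illustrative: M = 30 pixels compressed to K = 4 hidden units, placeholder data in [0, 1]
M, K = 30, 4
rng = np.random.default_rng(0)
params = [rng.normal(0.0, 0.1, (K, M)), np.zeros(K),
          rng.normal(0.0, 0.1, (M, K)), np.zeros(M)]
X = rng.random((200, M))
for epoch in range(50):
    for x in X:
        autoencoder_step(x, params)
```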

Example. Many 5x6 images (30 pixels each) are prepared.

How to Use the Autoencoder (1). The 30-dimensional images are compressed into 4-dimensional vectors at the hidden layer, and noise is reduced. The autoencoder can be understood as nonlinear principal component analysis (PCA), because PCA minimizes
E(A, B) = Σ_i || x_i - A B x_i ||^2,
where A and B are matrices whose ranks are not larger than K (K < M).
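For comparison, the linear case can be checked directly: the rank-K minimizer of E(A, B) is given by the top-K principal directions. A small numpy sketch with placeholder data (my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))            # 200 samples, M = 30 dimensions (placeholder data)
Xc = X - X.mean(axis=0)                   # PCA is applied to centered data

K = 4
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
B = Vt[:K]                                # encoder (K x M): projection onto the top-K components
A = Vt[:K].T                              # decoder (M x K)

# E(A, B) = sum_i || x_i - A B x_i ||^2; for this choice it equals the discarded variance
error = np.sum((Xc - Xc @ B.T @ A.T) ** 2)
print(error, np.sum(S[K:] ** 2))          # the two values agree
```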

How to Use the Autoencoder (2). The 30-dimensional data are compressed into [0,1]^2. By using the decoder, a 30-dimensional image can be generated from the 2-dimensional information.

A two-dimensional map (x, y) of the 30-dimensional images is generated.

I-5-4 Convolutional Neural Network

Data Structure. For several kinds of data, the structure is known before learning.
Image: each pixel has almost the same value as its neighbors, except at boundaries.
Speech: nonlinear expansion and contraction in time are contained.
Object recognition: the same object can be observed from different angles.
Weather: prefectures in the same region have almost the same weather.
By using the data structure, an appropriate network can be devised.

Image Analysis and Convolutional Networks. In image analysis, convolutional networks are often employed successfully: local analyses are combined into global information.

Multi-Resolution Analysis. In multi-resolution analysis, local analyses are successively integrated into global information. A convolutional network can be understood as a kind of multi-resolution analysis.
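A minimal numpy sketch of this local-to-global idea (my own illustration; the filter and image sizes are arbitrary assumptions): a small filter performs a local analysis of each neighborhood, and pooling integrates the local responses into coarser, more global information.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation): each output value is a local analysis of one neighborhood."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kH, c:c + kW] * kernel)
    return out

def max_pool(x, size=2):
    """Max pooling: integrates local responses into coarser, more global information."""
    H2, W2 = x.shape[0] // size, x.shape[1] // size
    return x[:H2 * size, :W2 * size].reshape(H2, size, W2, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((8, 8))                    # placeholder 8x8 image
local_filter = np.array([[1.0, -1.0],         # illustrative vertical-edge filter
                         [1.0, -1.0]])
local = conv2d(image, local_filter)           # local analysis (7x7)
coarser = max_pool(local)                     # more global information (3x3)
```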

Time Delay Neural Network (TDNN). In human speech, local expansion and contraction in time are often observed; for example, OHAYO is sometimes pronounced as "O H Ha Yo O O", "O Ha Yo", or "O Ha Y". The time delay neural network was devised so that it can absorb such modifications.

Deep Learning and Feature Extraction.
(1) Automatic feature extraction. In deep learning, feature vectors are automatically generated at the hidden variables. If you are lucky, you can find an appropriate but previously unknown feature vector for a set of training data.
(2) Preprocessing using feature vectors. If you want to make a recognition system invariant to parallel translation or rotation, it is better to use such invariant feature vectors constructed by preprocessing.

Example. A time sequence is predicted from the previous 27 values by a nonlinear function f,
x(t) = f(x(t-1), x(t-2), ..., x(t-27)),
which is built by deep learning. Two architectures are compared: a network with all connections (fully connected) and a convolutional network.
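The convolutional alternative can be sketched as follows (my own illustration, not the network trained in the slides): instead of connecting all 27 past values independently, one small filter is shared across the lag window, and its responses are combined into the prediction.

```python
import numpy as np

def conv_predictor(lags, filt, v, b=0.0):
    """Predict x(t) from the past values x(t-1), ..., x(t-27).

    A single filter `filt` is slid over the lag window (weight sharing in time),
    and the resulting feature sequence is combined linearly by `v`.
    All parameter shapes here are illustrative assumptions.
    """
    width = len(filt)
    features = np.array([np.tanh(np.dot(filt, lags[s:s + width]))
                         for s in range(len(lags) - width + 1)])
    return float(np.dot(v, features) + b)

rng = np.random.default_rng(0)
lags = rng.random(27)            # x(t-1), ..., x(t-27) (placeholder values)
filt = rng.normal(size=5)        # shared convolution filter of width 5
v = rng.normal(size=27 - 5 + 1)  # one weight per filter position
x_t_hat = conv_predictor(lags, filt, v)
```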

Example. Monthly hakusai (Chinese cabbage) prices from January 1970 to December 2013 (E-stat page, http://www.e-stat.go.jp/sg1/estat/estattopportal.do). Plots of price against month: training data (red: data, blue: trained) and test data (red: data, blue: predicted).