Cheng Soon Ong & Christian Walder
Research Group and College of Engineering and Computer Science
Canberra, February - June 2018

Outlines: Overview; Introduction; Linear Algebra; Probability; Linear Regression 1; Linear Regression 2; Linear Classification 1; Linear Classification 2; Kernel Methods; Sparse Kernel Methods; Mixture Models and EM 1; Mixture Models and EM 2; Neural Networks 1; Neural Networks 2; Principal Component Analysis; Graphical Models 1; Graphical Models 2; Graphical Models 3; Sampling; Sequential Data 1; Sequential Data 2

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

Part XXII: Neural Networks 3

Number of layers

- The expression of a function is compact when it has few computational elements, i.e. few degrees of freedom that need to be tuned by learning.
- For a fixed number of training examples, we expect that compact representations of the target function would yield better generalisation.
- Example representations:
  - affine operations + sigmoid: logistic regression has depth 1, with a fixed number of units (a.k.a. neurons)
  - fixed kernel + affine operations: a kernel machine (e.g. SVM) has two levels, with as many units as data points
  - stacked neural network of multiple linear transformations, each followed by a non-linearity: a deep neural network has arbitrary depth with an arbitrary number of units per layer

Easier to represent with more layers

- An old result: functions that can be compactly represented by a depth $k$ architecture might require an exponential number of computational elements to be represented by a depth $k-1$ architecture.
- Example: the $d$-bit parity function
  $$\mathrm{parity}: (b_1, \dots, b_d) \in \{0,1\}^d \mapsto \begin{cases} 1 & \text{if } \sum_{i=1}^{d} b_i \text{ is even} \\ 0 & \text{otherwise} \end{cases}$$
- Theorem: $d$-bit parity circuits of depth 2 have exponential size.
- Analogous result in modern deep learning: shallow networks require exponentially more parameters for the same number of modes (Canadian deep learning mafia).

Recall: Multi-layer Neural Network Architecture

$$y_k(\mathbf{x}, \mathbf{w}) = g\left( \sum_{j=0}^{M} w^{(2)}_{kj}\, h\left( \sum_{i=0}^{D} w^{(1)}_{ji}\, x_i \right) \right)$$

where $\mathbf{w}$ now contains all weight and bias parameters.

[Figure: two-layer network diagram with inputs $x_0, \dots, x_D$, hidden units $z_0, \dots, z_M$, outputs $y_1, \dots, y_K$, and weights $w^{(1)}_{MD}$, $w^{(2)}_{KM}$, $w^{(2)}_{10}$.]

We could add more hidden layers.
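
A minimal sketch of this two-layer forward pass in Python/NumPy. The activation choices ($g$ the identity, $h = \tanh$) and the array names are assumptions, not part of the slides:

```python
import numpy as np

def forward(x, W1, W2, h=np.tanh, g=lambda a: a):
    """Two-layer network: y_k(x, w) = g(sum_j W2[k, j] * h(sum_i W1[j, i] * x_i)).

    W1 has shape (M, D + 1) and W2 has shape (K, M + 1); column 0 of each
    holds the bias weights, matching the x_0 = z_0 = 1 convention above.
    """
    x_aug = np.concatenate(([1.0], x))   # prepend x_0 = 1
    z = h(W1 @ x_aug)                    # hidden activations z_1 .. z_M
    z_aug = np.concatenate(([1.0], z))   # prepend z_0 = 1
    return g(W2 @ z_aug)                 # outputs y_1 .. y_K

# toy example: D = 3 inputs, M = 4 hidden units, K = 2 outputs
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 4))             # (M, D + 1)
W2 = rng.normal(size=(2, 5))             # (K, M + 1)
print(forward(rng.normal(size=3), W1, W2))
```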

Empirical observations - pre 2006

- Deep architectures get stuck in local minima or plateaus.
- As the architecture gets deeper, it is more difficult to obtain good generalisation.
- Hard to initialise random weights well.
- 1 or 2 hidden layers seem to perform better.
- 2006: unsupervised pre-training, which finds a distributed representation.

Deep representation - intuition

[Figure from Bengio, "Learning Deep Architectures for AI", 2009.]

Deep representation - practice

[Figure: AlexNet / VGG-F network visualized by mneuron.]

Recall: PCA

- Idea: linearly project the data points onto a lower dimensional subspace such that the variance of the projected data is maximised, or the distortion error from the projection is minimised.
- Both formulations lead to the same result.
- Need to find the lower dimensional subspace, called the principal subspace.

[Figure: data points $x_n$ in the $(x_1, x_2)$ plane projected onto the principal direction $u_1$.]

Multiple PCA layers? - linear transforms

- Principal Component Analysis is a linear transformation (because it is a projection).
- The composite of two linear transformations is linear.
- Linear transformations $M: \mathbb{R}^m \to \mathbb{R}^n$ are matrices.
- Let $S$ and $T$ be matrices of appropriate dimension such that $ST$ is defined. Then
  $$ST(X + X') = ST(X) + ST(X')$$
  and similarly for multiplication with a scalar.
- $\Rightarrow$ multiple PCA layers are pointless.

Multiple PCA layers? - projection

- Let $X^T X = U \Lambda U^T$ be the eigenvalue decomposition of the covariance matrix (what is assumed about the mean?).
- Define $U_k$ to be the matrix of the first $k$ columns of $U$, for the $k$ largest eigenvalues. Define $\Lambda_k$ similarly.
- Consider the features formed by projecting onto the principal components: $Z = X U_k$.
- We perform PCA a second time: $Z^T Z = V \Lambda_Z V^T$.
- By the definition of $Z$ and $X^T X$, and the orthogonality of $U$,
  $$Z^T Z = (X U_k)^T (X U_k) = U_k^T X^T X U_k = U_k^T U \Lambda U^T U_k = \Lambda_k$$
- Hence $\Lambda_Z = \Lambda_k$ and $V$ is the identity, therefore the second PCA has no effect (numerical check below).
- Again, multiple PCA layers are pointless.
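
A quick numerical check of this argument in NumPy (a sketch; the toy data matrix and the choice of $k$ are arbitrary). After centring, the covariance of the projected features $Z$ is already diagonal with the top-$k$ eigenvalues, so a second PCA has nothing left to rotate:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # correlated toy data
X = X - X.mean(axis=0)                                   # PCA assumes zero mean
k = 2

# First PCA: eigendecomposition X^T X = U Lambda U^T (eigh -> ascending eigenvalues)
lam, U = np.linalg.eigh(X.T @ X)
U_k = U[:, ::-1][:, :k]       # first k columns, for the k largest eigenvalues
lam_k = lam[::-1][:k]

Z = X @ U_k                   # project onto the principal subspace

# "Second PCA": the covariance of Z is already diagonal, equal to Lambda_k,
# so the second eigendecomposition has no effect.
print(np.allclose(Z.T @ Z, np.diag(lam_k)))   # True
```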

Autoencoder

- An autoencoder is trained to encode the input $x$ into some representation $c(x)$ so that the input can be reconstructed from that representation: the target output of the autoencoder is the autoencoder input itself.
- With one linear hidden layer and the mean squared error criterion, the $k$ hidden units learn to project the input onto the span of the first $k$ principal components of the data.
- If the hidden layer is nonlinear, the autoencoder behaves differently from PCA, with the ability to capture multimodal aspects of the input distribution.
- Let $f$ be the decoder. We want to minimise the reconstruction error (code sketch below)
  $$\sum_{n=1}^{N} \ell\big(x_n, f(c(x_n))\big)$$
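
A minimal sketch of a one-hidden-layer autoencoder in PyTorch. The layer sizes, activation, optimiser settings and the stand-in data are illustrative assumptions; with `nn.Identity` as the activation and MSE loss this is the linear case discussed above:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, d_in=784, d_code=32, nonlinear=True):
        super().__init__()
        act = nn.Tanh() if nonlinear else nn.Identity()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_code), act)  # c(x)
        self.decoder = nn.Linear(d_code, d_in)                      # f(.)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(64, 784)          # stand-in batch; the target is the input itself
for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x), x)   # sum_n l(x_n, f(c(x_n)))
    loss.backward()
    opt.step()
```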

Cost function

- Recall: $f(c(x))$ is the reconstruction produced by the network.
- Minimisation of the negative log likelihood of the reconstruction, given the encoding $c(x)$:
  $$RE = -\log P\big(x \mid c(x)\big)$$
- If $x \mid c(x)$ is Gaussian, we recover the familiar squared error.
- If the inputs $x_i$ are either binary or considered to be binomial probabilities, then the loss function would be the cross entropy (code sketch below)
  $$-\log P\big(x \mid c(x)\big) = -\sum_i \Big[ x_i \log f_i(c(x)) + (1 - x_i) \log\big(1 - f_i(c(x))\big) \Big]$$
  where $f_i(\cdot)$ is the $i$th component of the decoder.
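
Both criteria in a few lines of NumPy (a sketch; `x` and the decoder output `fx` are assumed to be arrays of the same shape, with `fx` in (0, 1) for the cross-entropy case):

```python
import numpy as np

def squared_error(x, fx):
    """Gaussian reconstruction model -> squared error."""
    return 0.5 * np.sum((x - fx) ** 2)

def cross_entropy(x, fx, eps=1e-12):
    """Binary / binomial-probability inputs -> cross-entropy reconstruction loss."""
    fx = np.clip(fx, eps, 1 - eps)          # guard the logarithms
    return -np.sum(x * np.log(fx) + (1 - x) * np.log(1 - fx))

x  = np.array([1.0, 0.0, 1.0])
fx = np.array([0.9, 0.2, 0.7])
print(squared_error(x, fx), cross_entropy(x, fx))
```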

Undercomplete

- Consider a small number of hidden units: $c(x)$ is viewed as a lossy compression of $x$.
- Cannot have small loss for all $x$, so focus on the training examples.
- Hope that the code $c(x)$ is a distributed representation that captures the main factors of variation in the data.

Stacking autoencoders

- Let $c_j$ and $f_j$ be the encoder and corresponding decoder of the $j$th layer.
- Let $z_j$ be the representation after the encoder $c_j$.
- We can define multiple layers of autoencoders recursively. For example, let $z_1 = c_1(x)$ and $z_2 = c_2(z_1)$; the corresponding decoding is given by $f_1(f_2(z_2))$ (see the sketch below).
- Because of the non-linear activation functions, the latent feature $z_2$ can capture more complex patterns than $z_1$.
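
A self-contained sketch of greedy layer-wise stacking in PyTorch. The dimensions, activation, training settings and the stand-in data are illustrative assumptions; each layer is trained on the codes produced by the previous one:

```python
import torch
import torch.nn as nn

def train_autoencoder(d_in, d_code, z, steps=200, lr=1e-3):
    """Greedily train one autoencoder layer on the representation z."""
    enc = nn.Sequential(nn.Linear(d_in, d_code), nn.Tanh())
    dec = nn.Linear(d_code, d_in)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(dec(enc(z)), z)
        loss.backward()
        opt.step()
    return enc, dec

x = torch.randn(256, 784)                 # stand-in data
c1, f1 = train_autoencoder(784, 128, x)   # layer 1: c_1, f_1
z1 = c1(x).detach()                       # z_1 = c_1(x)
c2, f2 = train_autoencoder(128, 32, z1)   # layer 2: c_2, f_2, trained on z_1
z2 = c2(z1).detach()                      # z_2 = c_2(z_1)
reconstruction = f1(f2(z2))               # decode: f_1(f_2(z_2))
```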

Higher level image features - faces

[Figure from codingplayground.blogspot.com]

Pre-training supervised neural networks

- Latent features $z_j$ in layer $j$ can capture high level patterns:
  $$z_j = c_j\big(c_{j-1}(\cdots c_2(c_1(x)) \cdots)\big)$$
- These features may also be useful for supervised learning tasks.
- In contrast to the feed forward network, the features $z_j$ are constructed in an unsupervised fashion.
- Discard the decoding layers, and directly use $z_j$ with a supervised training method, such as logistic regression.
- Various such pre-trained networks are available on-line, e.g. VGG-19.

Xavier Initialisation / ReLU

- Layer-wise unsupervised pre-training facilitates learning by extracting useful features for subsequent supervised backprop.
- Pre-training also avoids saturation (large magnitude arguments to, say, sigmoidal units).
- The simpler Xavier initialisation can also avoid saturation. Let the inputs $x_i \sim \mathcal{N}(0, 1)$, the weights $w_i \sim \mathcal{N}(0, \sigma^2)$, and the activation $z = \sum_{i=1}^{m} x_i w_i$. Then
  $$\mathrm{VAR}[z] = E\big[(z - E[z])^2\big] = E[z^2] = E\Big[\Big(\sum_{i=1}^{m} x_i w_i\Big)^2\Big] = \sum_{i=1}^{m} E\big[(x_i w_i)^2\big] = \sum_{i=1}^{m} E[x_i^2]\, E[w_i^2] = m\sigma^2.$$
- So we set $\sigma = 1/\sqrt{m}$ to have nice activations (numerical check below). Similarly for subsequent layers in the network.
- ReLU activations $h(x) = \max(x, 0)$ also help in practice.
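
A small NumPy check of this calculation (a sketch; $m$ and the sample size are arbitrary): with $\sigma = 1/\sqrt{m}$ the pre-activations have variance close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_samples = 256, 100_000

sigma = 1.0 / np.sqrt(m)                         # Xavier-style scale
x = rng.normal(0.0, 1.0, size=(n_samples, m))    # inputs  x_i ~ N(0, 1)
w = rng.normal(0.0, sigma, size=(n_samples, m))  # weights w_i ~ N(0, sigma^2)
z = np.sum(x * w, axis=1)                        # z = sum_i x_i w_i

print(np.var(z))   # close to m * sigma^2 = 1: activations neither blow up nor vanish
```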

Higher dimensional hidden layer

- If there is no other constraint, then an autoencoder with $d$-dimensional input and an encoding of dimension at least $d$ could potentially just learn the identity function.
- Avoid this by:
  - regularisation
  - early stopping of stochastic gradient descent
  - adding noise in the encoding
  - a sparsity constraint on the code $c(x)$.

Denoising autoencoder

- Add noise to the input, keeping the perfect example as the output (code sketch below).
- The autoencoder tries to:
  1. preserve the information of the input
  2. undo the stochastic corruption process
- Reconstruction log likelihood
  $$\log P\big(x \mid c(\hat{x})\big)$$
  where $x$ is noise free and $\hat{x}$ is corrupted.
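
A sketch of how the training pairs are built, in PyTorch. The additive Gaussian corruption, its scale, the layer sizes and the stand-in data are assumptions (masking noise is another common choice); the key point is that the loss compares the reconstruction of the corrupted input against the clean input:

```python
import torch
import torch.nn as nn

d = 784
encoder = nn.Sequential(nn.Linear(d, 64), nn.Tanh())   # c(.)
decoder = nn.Linear(64, d)                              # f(.)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(128, d)                                  # clean inputs (stand-in data)
for _ in range(100):
    x_hat = x + 0.3 * torch.randn_like(x)               # corrupted input \hat{x}
    opt.zero_grad()
    loss = nn.functional.mse_loss(decoder(encoder(x_hat)), x)  # target is the clean x
    loss.backward()
    opt.step()
```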

Image denoising

Images with Gaussian noise added, denoised using stacked sparse denoising autoencoders. Images from Xie et al., NIPS 2012.

Inpainting - 1

Free a bird. Image from http://cimg.eu/greycstoration/demonstration.shtml

Inpainting - 2

Undo text over an image. Image from Bach et al., ICCV tutorial 2009.

Recall: Basis functions

- For fixed basis functions $\phi(x)$, we use domain knowledge for encoding the features.
- Neural networks use data to learn a set of transformations: $\phi_i(x)$ is the $i$th component of the feature vector, and is learned by the network.
- The transformations $\phi_i(\cdot)$ for a particular dataset may no longer be orthogonal, and furthermore may be minor variations of each other.
- We collect all the transformed features into a matrix $\Phi$.

Sparse representations

- Idea: have many hidden nodes, but only a few active for a particular code $c(x)$. Two options:
  - Student-t prior on the codes
  - $\ell_1$ penalty on the coefficients $\alpha$
- Given the bases in the matrix $\Phi$, look for codes by choosing $\alpha$ such that the input signal $x$ is reconstructed with low $\ell_2$ reconstruction error, while $\alpha$ is sparse (code sketch below):
  $$\min_{\alpha} \sum_{n=1}^{N} \frac{1}{2}\|x_n - \Phi \alpha_n\|_2^2 + \lambda \|\alpha_n\|_1$$
- $\Phi$ is overcomplete, no longer orthogonal.
- Sparse: small number of non-zero $\alpha_i$.
- Exact recovery under certain conditions (coherence): the $\ell_1$ solution coincides with the $\ell_0$ solution.
- The $\ell_1$ regulariser corresponds to a Laplace prior $p(\alpha_i) = \frac{\lambda}{2}\exp(-\lambda |\alpha_i|)$.
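
A sketch of solving this $\ell_1$ sparse coding problem for a single signal with a fixed dictionary, using scikit-learn's Lasso. The dictionary $\Phi$, the synthetic signal and the penalty value are made up for illustration; note scikit-learn scales the squared error by $1/(2m)$, so its `alpha` plays the role of $\lambda / m$ in the objective above:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
m, p = 64, 256                                  # signal dimension, number of atoms
Phi = rng.normal(size=(m, p))
Phi /= np.linalg.norm(Phi, axis=0)              # unit-norm columns (atoms)

alpha_true = np.zeros(p)
alpha_true[rng.choice(p, size=5, replace=False)] = rng.normal(size=5)
x = Phi @ alpha_true + 0.01 * rng.normal(size=m)   # noisy sparse signal

# l1 sparse coding of x against the fixed dictionary Phi
lasso = Lasso(alpha=0.01, fit_intercept=False, max_iter=10_000)
lasso.fit(Phi, x)
alpha_hat = lasso.coef_
print(np.count_nonzero(np.abs(alpha_hat) > 1e-6), "non-zero coefficients")
```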

The image denoising problem

$$\underbrace{y}_{\text{measurements}} = \underbrace{x_{\text{orig}}}_{\text{original image}} + \underbrace{w}_{\text{noise}}$$

Sparsity assumption

- Only have noisy measurements
  $$\underbrace{y}_{\text{measurements}} = \underbrace{x_{\text{orig}}}_{\text{original image}} + \underbrace{w}_{\text{noise}}$$
- Given $\Phi \in \mathbb{R}^{m \times p}$, find $\alpha$ such that $\|\alpha\|_0$ is small for $x \approx \Phi\alpha$, where $\|\cdot\|_0$ is the number of non-zero elements of $\alpha$.
- $\Phi$ is not necessarily features constructed from training data.
- Minimise the reconstruction error
  $$\min_{\alpha} \sum_{n=1}^{N} \frac{1}{2}\|x_n - \Phi \alpha_n\|_2^2 + \lambda \|\alpha_n\|_0$$

Convex relaxation

- Want to minimise the number of components
  $$\min_{\alpha} \sum_{n=1}^{N} \frac{1}{2}\|x_n - \Phi \alpha_n\|_2^2 + \lambda \|\alpha_n\|_0$$
  but $\|\cdot\|_0$ is hard to optimise.
- Relax to a convex norm
  $$\min_{\alpha} \sum_{n=1}^{N} \frac{1}{2}\|x_n - \Phi \alpha_n\|_2^2 + \lambda \|\alpha_n\|_1$$
  where $\|\alpha\|_1 = \sum_i |\alpha_i|$.
- In some settings, does minimisation with $\ell_1$ regularisation give the same solution as minimisation with $\ell_0$ regularisation (exact recovery)? A small numerical comparison follows below.
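
A small experiment comparing the two formulations (a sketch; scikit-learn's OrthogonalMatchingPursuit serves as a greedy stand-in for the $\ell_0$ problem, and Lasso for the $\ell_1$ relaxation, with made-up sizes and penalties). For a sufficiently sparse, noiseless code both typically recover the same support:

```python
import numpy as np
from sklearn.linear_model import Lasso, OrthogonalMatchingPursuit

rng = np.random.default_rng(1)
m, p, k = 64, 128, 4
Phi = rng.normal(size=(m, p))
Phi /= np.linalg.norm(Phi, axis=0)                 # unit-norm atoms

support = rng.choice(p, size=k, replace=False)     # true sparsity pattern
alpha = np.zeros(p)
alpha[support] = rng.normal(size=k) + np.sign(rng.normal(size=k))
x = Phi @ alpha                                    # noiseless sparse signal

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k).fit(Phi, x)        # ~ l0
lasso = Lasso(alpha=1e-4, fit_intercept=False, max_iter=50_000).fit(Phi, x)  # l1

s_omp = set(np.flatnonzero(np.abs(omp.coef_) > 1e-6))
s_l1 = set(np.flatnonzero(np.abs(lasso.coef_) > 1e-3))
print(sorted(support), sorted(s_omp), sorted(s_l1))   # typically the same support
```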

Mutual coherence

- Expect to be OK when the columns of $\Phi$ are not too parallel.
- Assume the columns of $\Phi$ are normalised to unit norm.
- Let $K = \Phi^T \Phi$ be the Gram matrix; then $K(i, j)$ is the value of the inner product between $\phi_i$ and $\phi_j$.
- Define the mutual coherence (code sketch below)
  $$M = M(\Phi) = \max_{i \neq j} |K(i, j)|$$
- If we have an orthogonal basis, then $\Phi$ is an orthogonal matrix, hence $K(i, j) = 0$ when $i \neq j$. However, if we have very similar columns, then $M$ is close to 1.
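
Computing the mutual coherence in NumPy (a sketch; the overcomplete dictionary is random, purely for illustration):

```python
import numpy as np

def mutual_coherence(Phi):
    """M(Phi) = max_{i != j} |<phi_i, phi_j>| for unit-norm columns phi_i."""
    Phi = Phi / np.linalg.norm(Phi, axis=0)   # normalise columns to unit norm
    K = Phi.T @ Phi                           # Gram matrix of the atoms
    np.fill_diagonal(K, 0.0)                  # ignore the i == j entries
    return np.max(np.abs(K))

rng = np.random.default_rng(0)
print(mutual_coherence(np.eye(8)))                   # orthogonal basis -> 0
print(mutual_coherence(rng.normal(size=(64, 256))))  # overcomplete -> strictly > 0
```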

Exact recovery conditions

- If a minimiser $\alpha$ of the true $\ell_0$ problem satisfies
  $$\|\alpha\|_0 < \frac{1}{M}$$
  then it is the unique sparsest solution.
- If $\alpha$ satisfies the stronger condition
  $$\|\alpha\|_0 < \frac{1}{2}\left(1 + \frac{1}{M}\right)$$
  then the minimiser of the $\ell_1$ relaxation has the same sparsity pattern as $\alpha$ (check below).
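
A sketch that evaluates these two conditions for a random dictionary and a candidate sparsity level (the sizes and the sparsity level are arbitrary assumptions; the coherence computation mirrors the previous example):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(64, 256))
Phi /= np.linalg.norm(Phi, axis=0)        # unit-norm atoms

K = Phi.T @ Phi
np.fill_diagonal(K, 0.0)
M = np.max(np.abs(K))                     # mutual coherence M(Phi)

k = 1                                     # candidate sparsity ||alpha||_0
print("coherence M =", round(M, 3))
print("unique sparsest solution guaranteed:", k < 1.0 / M)
print("l1 recovers the sparsity pattern:  ", k < 0.5 * (1.0 + 1.0 / M))
```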

References

- Yoshua Bengio, "Learning Deep Architectures for AI", Foundations and Trends in Machine Learning, 2009
- http://deeplearning.net/tutorial/
- http://www.deeplearningbook.org/contents/autoencoders.html
- Fuchs, "On Sparse Representations in Arbitrary Redundant Bases", IEEE Trans. Info. Theory, 2004
- Xavier Glorot and Yoshua Bengio, "Understanding the difficulty of training deep feedforward neural networks", 2010