Probabilistic Reasoning in Deep Learning


Probabilistic Reasoning in Deep Learning. Dr Konstantina Palla, PhD, palla@stats.ox.ac.uk. September 2017, Deep Learning Indaba, Johannesburg.

OVERVIEW OF THE TALK
Basics of Bayesian inference.
Bayesian reasoning inside DNNs: the Bayesian paradigm as an approach to account for uncertainty in DNNs.
Connection to other statistical models (nonparametric models): how kernels and Gaussian processes fit into the picture of NNs.

PART I: Bayesian Inference Basics

BAYESIAN INFERENCE: BASICS
Data: observations Y = {y^(1), y^(2), ..., y^(N)}.
Problem: how were the data generated?
Answer: make distributional assumptions, y^(i) | θ ∼ p(y | θ).
Example: y^(i) | θ ∼ N(y; µ, σ²), with θ = {µ, σ²}.
Likelihood: the probability of the data under the model parameters θ,
p(Y | θ) = ∏_{i=1}^{N} p(y^(i) | θ) = ∏_{i=1}^{N} N(y^(i); µ, σ²).
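A minimal sketch of evaluating this likelihood, assuming NumPy; the data and parameter values are made up for illustration and are not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=1.5, scale=0.8, size=100)   # toy observations, assumed i.i.d. Gaussian

def gaussian_log_likelihood(y, mu, sigma2):
    """log p(Y | theta) = sum_i log N(y^(i); mu, sigma^2)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - 0.5 * (y - mu) ** 2 / sigma2)

# Parameters close to the data-generating ones score higher than a bad guess.
print(gaussian_log_likelihood(y, mu=1.5, sigma2=0.64))
print(gaussian_log_likelihood(y, mu=0.0, sigma2=0.64))
```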

BAYESIAN INFERENCE: BASICS
Bayesian principle: all forms of uncertainty should be expressed by means of probability distributions.
Prior θ ∼ p(θ): our belief before seeing the data.
Posterior p(θ | Y): our updated belief after having seen the data.
Bayes' rule: p(θ | Y) = p(Y | θ) p(θ) / p(Y).

BAYESIAN INFERENCE: BASICS
Bayes' rule: posterior p(θ | Y) = p(Y | θ) p(θ) / p(Y), where p(Y | θ) is the likelihood and p(θ) the prior.
The posterior is proportional to likelihood × prior: I update my prior beliefs after I see the data.
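To make the update concrete, here is a sketch of one case where the posterior is available in closed form: a Gaussian likelihood with known variance and a Gaussian prior on the mean. NumPy is assumed and all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 1.0                                   # known observation variance
y = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=50)

# Prior belief about the mean before seeing the data: mu ~ N(m0, s0_2).
m0, s0_2 = 0.0, 10.0

# Conjugate update: posterior mu | Y ~ N(m_post, s_post_2), i.e. likelihood x prior, renormalized.
n = len(y)
s_post_2 = 1.0 / (1.0 / s0_2 + n / sigma2)
m_post = s_post_2 * (m0 / s0_2 + y.sum() / sigma2)

print(m_post, s_post_2)   # the prior mean is pulled toward the data, with shrinking uncertainty
```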

BAYESIAN INFERENCE: BASICS
p(θ | Y) ∝ p(Y | θ) p(θ)

BAYESIAN INFERENCE: BASICS
Making predictions: for an unseen y*, marginalise!
p(y* | Y) = ∫ p(y* | θ) p(θ | Y) dθ
The prediction does not depend on a point estimate of θ; it is expressed as a weighted average over all the possible values of θ.
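The integral can be approximated by averaging over posterior samples; a minimal sketch continuing the Gaussian example above, with NumPy assumed and illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 1.0
m_post, s_post_2 = 1.9, 0.02                       # illustrative posterior over the mean

# p(y* | Y) = ∫ p(y* | theta) p(theta | Y) dtheta ≈ (1/S) sum_s p(y* | theta_s)
theta_samples = rng.normal(m_post, np.sqrt(s_post_2), size=5000)
y_star = 2.5
densities = np.exp(-0.5 * (y_star - theta_samples) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
print(densities.mean())   # a weighted average over plausible theta, not a single point estimate
```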

BAYESIAN INFERENCE: BASICS
Why is this useful?
> Uncertainty is taken into account: no point estimates, but a distribution over the unknown θ, p(θ | Y).
> Natural handling of missing data: a probability distribution is estimated for each missing value.
> Addresses issues like regularization and overfitting, which is useful in deep neural networks.

PART II: Bayesian Reasoning in (Deep) Neural Networks
Overview of (D)NNs.
Bayesian reasoning in DNNs.

OVERVIEW OF (DEEP) NEURAL NETWORKS
[Figure: input layer, several hidden layers, output layer.]
What is deep learning? A framework for constructing flexible models: a way of constructing outputs using functions on inputs.

OVERVIEW OF DEEP NNS
A deep network is a repetition of building blocks.

OVERVIEW OF DEEP NNS
Building block: inputs x_1, ..., x_M, weights β, output y.
> Output: y.
> Linear predictor: η = Σ_{m=1}^{M} β_m x_m = βᵀx; the coefficients β are the weights (linear part).
> Inverse link function: f(η), the activation function (non-linear part).

OVERVIEW OF DEEP NNS
In each layer l the input is linearly composed, η_l = β_lᵀ x_l, and a non-linear function is applied, f_l(η_l).
A deep neural network is a construction of recursive building blocks.
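A minimal sketch of this recursion in NumPy; the architecture, initialisation, and tanh activation are arbitrary choices for illustration:

```python
import numpy as np

def forward(x, weights, activation=np.tanh):
    """Recursive building block: in each layer l, eta_l = B_l @ x_l, then f_l(eta_l)."""
    h = x
    for B in weights[:-1]:
        h = activation(B @ h)    # linear composition followed by a non-linearity
    return weights[-1] @ h       # linear output layer

rng = np.random.default_rng(3)
layer_sizes = [4, 8, 8, 1]       # illustrative architecture
weights = [0.5 * rng.normal(size=(m, n))
           for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
print(forward(rng.normal(size=4), weights))
```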

OVERVIEW OF DEEP NNS
x → η = βᵀx → f(η) → ... → η_l = β_lᵀ x_l → f_l(η_l) → ... → y
A deep neural network can represent functions of increasing complexity.

DEEP NNS AS A PROBABILISTIC MODEL
The network defines a parametric conditional distribution p(y | x; β). Parametric!
Example, regression: with identity activation f(η) = η = βᵀx,
y = βᵀx + ε,  ε ∼ N(0, σ²),  so  p(y | x, β) = N(βᵀx, σ²).
Training the network means learning β.

DEEP NNS AS A PROBABILISTIC MODEL
Learn β by the maximum likelihood principle: minimize the negative log-likelihood
J(β) = −log p(y | x, β),  β_MLE = arg min_β J(β).
Regression: with p(y | x, β) = N(βᵀx, σ²) this is the squared loss function,
J(β) = (1/2) Σ_{i=1}^{N} {y^(i) − βᵀx^(i)}² + const.
> Solve using stochastic gradient descent, back-propagation, etc.
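A minimal sketch of this maximum-likelihood fit for the linear regression case; NumPy is assumed, and the data, learning rate, and iteration count are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
N, D = 200, 3
X = rng.normal(size=(N, D))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=N)    # y = beta^T x + eps

# Minimize J(beta) = 0.5 * sum_i (y_i - beta^T x_i)^2 by (full-batch) gradient descent.
beta = np.zeros(D)
lr = 0.1
for _ in range(2000):
    grad = X.T @ (X @ beta - y) / N
    beta -= lr * grad

print(beta)   # point estimate beta_MLE, close to beta_true
```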

DEEP NNS AS A PROBABILISTIC MODEL
Maximum likelihood estimation: β is fixed but unknown. Cons: point estimates; overfitting.
Bayesian approach: β is unknown/uncertain. Pros: accounts for uncertainty via p(β) and p(β | Y); regularisation.

BAYESIAN REASONING IN DEEP LEARNING
Maximum a posteriori estimation: put a distribution over the network weights β,
β ∼ p(β),  y | x, β ∼ p(y | x, β).
Update the belief over β with Bayes' rule: p(β | Y) ∝ p(Y | β, X) p(β) (likelihood × prior).
Maximum a posteriori: find β_MAP that maximizes log p(β | Y), i.e. minimizes J(β) = −log p(β | Y),
β_MAP = arg min_β J(β).

BAYESIAN REASONING IN DEEP LEARNING
Regression, maximum a posteriori: y = βᵀx + ε, with a multivariate Gaussian prior over the weights,
p(β) = N(m_β, λ⁻¹I),  p(y | x, β) = N(βᵀx, σ²).
Loss function J(β) = −log p(β | Y):
J(β) = (1/2) Σ_{i=1}^{N} {y^(i) − βᵀx^(i)}² + (λ/2) βᵀβ + const.
The (λ/2) βᵀβ term is weight decay: being Bayesian over the weights makes a natural regularizer arise, a Bayesian side effect.
β_MAP = arg min_β J(β).
Predictions: y* | Y, x* ∼ N(β_MAPᵀ x*, σ²).
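A minimal sketch of this MAP fit using the closed-form ridge solution; NumPy is assumed, the data and λ are illustrative, and a zero prior mean m_β = 0 is assumed for simplicity:

```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 200, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=N)

# Minimizing J(beta) = 0.5 * sum_i (y_i - beta^T x_i)^2 + (lam/2) * beta^T beta
# (prior precision lam acting as weight decay) has a closed-form solution:
lam = 1.0
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

x_star = rng.normal(size=D)
print(beta_map, beta_map @ x_star)   # prediction mean of N(beta_MAP^T x*, sigma^2)
```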

BAYESIAN REASONING IN DEEP LEARNING
What we have seen so far...
> Being Bayesian over the weights β, i.e. putting a prior distribution on them, allows for a better loss function in learning: natural regularization.
> (D)NNs don't handle missing data: given train data (x^(i), y^(i)), i = 1, ..., N, the network only gives p(y* | x*; β); if an input x^(i) is missing, we need a model for the inputs, x ∼ p(x).
> (D)NNs don't generate new examples: for that we need the joint p(x, y) = p(y | x) p(x), so that we can draw x^(i) ∼ p(x) and then y^(i) ∼ p(y | x^(i)).
We need a fully probabilistic model.

BAYESIAN REASONING IN DEEP LEARNING
The autoencoder as a fully probabilistic neural network.

BAYESIAN REASONING IN DEEP LEARNING
Autoencoder: input y → encoder → code z → decoder → output ỹ.
> Two parts:
> Encoder: "encodes" the input into a hidden code (representation) z.
> Decoder: "decodes" the code back into ỹ.
> Minimize the reconstruction error L = ‖y − ỹ‖₂².

BAYESIAN REASONING IN DEEP LEARNING
The autoencoder is an unsupervised method: it experiences data y and attempts to learn the distribution over y, i.e. p(y | z).

BAYESIAN REASONING IN DEEP LEARNING
Probabilistic view: the encoder is p_φ(z | y), the decoder is p_θ(y | z); {θ, φ} are the NN's weights.

BAYESIAN REASONING IN DEEP LEARNING
Decoder = generative model, a latent variable model: z ∼ p(z), ỹ ∼ p_θ(y | z); learn θ.
Encoder = inference mechanism: learn p_φ(z | y), i.e. learn φ.
Inference: learn the weights φ and θ.

BAYESIAN REASONING IN DEEP LEARNING
Variational inference. The exact posterior
p(z | y) = p(y | z) p(z) / p(y)
is intractable: we cannot sample from it or construct it using the NN structure (weights φ).
Main principle: introduce an approximate distribution q_φ(z) ≈ p(z | y) and minimize KL[q_φ(z) ‖ p(z | y)], the information lost when using q to approximate p.

BAYESIAN REASONING IN DEEP LEARNING
Variational inference cost function: a metric for how well {φ, θ} capture the data-generating distribution p_θ(y | z) and the latent distribution p(z | y). The negative marginal likelihood −log p(y | θ, φ) is intractable to minimize directly, so minimize the free energy F instead:
F(φ, θ; y) = −E_{q_φ(z|y)}[log p_θ(y | z)] + KL[q_φ(z | y) ‖ p(z)]
> Reconstruction cost: the expected log-likelihood measures how well samples from q_φ(z) are able to explain the data y.
> Penalty: the KL divergence keeps the approximation close to the prior; since F = KL[q_φ(z | y) ‖ p(z | y)] − log p(y), minimizing F also minimizes the information lost (in units of nats) when using q_φ(z) to represent p(z | y).
The penalty is derived from the model and does not need to be designed.
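A minimal single-datapoint sketch of evaluating this free energy with a Gaussian encoder and decoder and the reparameterization trick; NumPy is assumed, and the tiny linear "networks", their weights, and all numbers are hypothetical illustrations rather than the talk's model:

```python
import numpy as np

rng = np.random.default_rng(6)

def free_energy(y, enc_w, dec_w, n_samples=1000):
    """F(phi, theta; y) = -E_q[log p_theta(y|z)] + KL[q_phi(z|y) || p(z)], with p(z) = N(0, 1)."""
    # Encoder q_phi(z|y): a tiny linear "network" producing mean and log-variance of z.
    mu, logvar = enc_w[0] * y + enc_w[1], enc_w[2] * y + enc_w[3]
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, 1).
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=n_samples)
    # Decoder p_theta(y|z) = N(y; dec_w[0]*z + dec_w[1], 1): Monte Carlo expected log-likelihood.
    recon = np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (y - (dec_w[0] * z + dec_w[1])) ** 2)
    # Closed-form KL between the Gaussian q_phi(z|y) and the standard normal prior.
    kl = 0.5 * (np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return -recon + kl

print(free_energy(y=0.7, enc_w=[0.5, 0.0, 0.1, -1.0], dec_w=[1.2, 0.0]))
```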

BAYESIAN REASONING IN DEEP LEARNING
The Bayesian & DL marriage: what have we gained?
++ A principled inference approach (Bayesian): the penalty is an intrinsic part of the model.
++ Uncertainty is encoded.
++ Missing data can be imputed.

DEEP LEARNING AS A KERNEL METHOD
> So far: predictions made using point estimates of the weights β (parametric).
> Now: store the entire training set in order to make predictions, a kernel approach (non-parametric).

DEEP LEARNING AS A KERNEL METHOD
Regression example with basis functions φ(·):
y = βᵀφ(x) = Σ_{m=1}^{M} β_m φ_m(x).

DEEP LEARNING AS A KERNEL METHOD
Parametric approach:
J(β) = (1/2) Σ_{n=1}^{N} {βᵀφ(x_n) − y_n}² + (λ/2) βᵀβ
> MLE training: point estimates of β. Setting ∇J(β) = 0 gives
β_MLE = −(1/λ) Σ_{n=1}^{N} {βᵀφ(x_n) − y_n} φ(x_n).
> Predictions: y* = β_MLEᵀ φ(x*).
Predictions are based purely on the learned parameters: all the information from the data is transferred into a set of parameters. Parametric!

DEEP LEARNING AS A KERNEL METHOD
Change of scenery... let the data speak: the dual representation.
J(β) = (1/2) Σ_{n=1}^{N} {βᵀφ(x_n) − y_n}² + (λ/2) βᵀβ
∇J(β) = 0  ⇒  β_MLE = −(1/λ) Σ_{n=1}^{N} {βᵀφ(x_n) − y_n} φ(x_n) = Σ_{n=1}^{N} α_n φ(x_n) = Φᵀα,
with α_n = −(1/λ)(βᵀφ(x_n) − y_n) and Φ the N × M design matrix.
Solving ∇J(α) = 0 gives α = (K + λI_N)⁻¹ y, where K is the N × N Gram matrix with entries
K_lm = ⟨φ(x_l), φ(x_m)⟩ = k(x_l, x_m), the kernel function.
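A minimal sketch of this dual solution as kernel ridge regression with an RBF kernel; NumPy is assumed, and the data, kernel, lengthscale, and λ are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=40)   # toy 1-D regression data

def rbf_kernel(A, B, lengthscale=1.0):
    """k(x_l, x_m) = exp(-||x_l - x_m||^2 / (2 * lengthscale^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

lam = 0.1
K = rbf_kernel(X, X)                                    # N x N Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)    # alpha = (K + lam I)^-1 y

X_star = np.array([[0.5]])
y_star = rbf_kernel(X_star, X) @ alpha                  # prediction keeps (and uses) all the data
print(y_star)
```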

DEEP LEARNING AS A KERNEL METHOD
Why?
Parametric: β_MLE = −(1/λ) Σ_{n=1}^{N} {βᵀφ(x_n) − y_n} φ(x_n) is a point estimate with no memory of the data; predictions y* = β_MLEᵀ φ(x*). The memory lives in β_MLE.
Non-parametric: α = (K + λI_N)⁻¹ y uses all the data; predictions y* = k(x, x*)ᵀ (K + λI)⁻¹ y. The memory comes from actually storing all the data.

DEEP LEARNING AS A KERNEL METHOD
Deep networks and kernel methods are duals of each other.

DEEP LEARNING AS A KERNEL METHOD
Step back... y = βᵀφ(x) = Σ_{m=1}^{M} β_m φ_m(x), i.e. y = f(x).
TASK: learn the best regression function f(·).
What about a bit more Bayesian reasoning? Consider f = [f(x_1), f(x_2), ..., f(x_n)] a multivariate variable and apply a distribution to it, f ∼ p(f).

DEEP LEARNING AS A KERNEL METHOD
y = βᵀφ(x) = Σ_{m=1}^{M} β_m φ_m(x) = f(x),  f = [f(x_1), f(x_2), ..., f(x_n)].
With a Gaussian prior on the weights, β ∼ N(m, Σ), and the number of hidden units φ_m(x) taken to infinity, M → ∞:
f ∼ N(0, K),  K_ij = k(x_i, x_j).
As N → ∞, we sample from a Gaussian process, f ∼ GP(0, K).
Predict y* = f(x*):
p(f* | X, y, x*) = ∫ p(f* | x*, X, f) p(f | X, y) df.
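A minimal sketch of this Gaussian process predictive for noisy regression; NumPy is assumed, the RBF kernel, noise level, and data are illustrative, and the mean and variance follow the standard GP regression formulas:

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=40)

def rbf_kernel(A, B, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

noise = 0.01                                        # observation noise variance
K = rbf_kernel(X, X) + noise * np.eye(len(X))
X_star = np.array([[0.5]])
k_star = rbf_kernel(X_star, X)

# p(f* | X, y, x*) is Gaussian; the integral over f is done in closed form:
mean = k_star @ np.linalg.solve(K, y)
var = rbf_kernel(X_star, X_star) - k_star @ np.linalg.solve(K, k_star.T)
print(mean, var)                                    # the variance quantifies predictive uncertainty
```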

DEEP LEARNING AS A KERNEL METHOD
Taking the number of hidden units φ_m(x) to infinity, M → ∞: Gaussian processes are the infinite limits of deep networks.

Thank you!