Probabilistic Reasoning in Deep Learning


Probabilistic Reasoning in Deep Learning. Dr Konstantina Palla, PhD, palla@stats.ox.ac.uk. September 2017, Deep Learning Indaba, Johannesburg.

OVERVIEW OF THE TALK
Basics of Bayesian inference.
Bayesian reasoning inside DNNs: the Bayesian paradigm as an approach to account for uncertainty in DNNs.
Connection to other statistical models (nonparametric models): how kernels and Gaussian processes fit into the picture of NNs.

PART I: Bayesian Inference Basics

BAYESIAN INFERENCE: BASICS
Data: observations Y = {y^(1), y^(2), ..., y^(N)}.
Problem: how were the data generated?
Answer: make distributional assumptions, y^(i) | θ ∼ p(y | θ).
Example: y^(i) | θ ∼ N(y; µ, σ²), with θ = {µ, σ²}.
Likelihood: the probability of the data under the model parameters θ,
p(Y | θ) = ∏_{i=1}^{N} p(y^(i) | θ) = ∏_{i=1}^{N} N(y^(i); µ, σ²).
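A minimal sketch of evaluating this likelihood, assuming NumPy; the data and parameter values are made up for illustration and are not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=1.5, scale=0.8, size=100)   # toy observations, assumed i.i.d. Gaussian

def gaussian_log_likelihood(y, mu, sigma2):
    """log p(Y | theta) = sum_i log N(y^(i); mu, sigma^2)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - 0.5 * (y - mu) ** 2 / sigma2)

# Parameters close to the data-generating ones score higher than a bad guess.
print(gaussian_log_likelihood(y, mu=1.5, sigma2=0.64))
print(gaussian_log_likelihood(y, mu=0.0, sigma2=0.64))
```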

BAYESIAN INFERENCE: BASICS
Bayesian principle: all forms of uncertainty should be expressed by means of probability distributions.
Prior θ ∼ p(θ): our belief before seeing the data.
Posterior p(θ | Y): our updated belief after having seen the data.
Bayes' rule: p(θ | Y) = p(Y | θ) p(θ) / p(Y).

BAYESIAN INFERENCE: BASICS
Bayes' rule: posterior p(θ | Y) = p(Y | θ) p(θ) / p(Y), where p(Y | θ) is the likelihood and p(θ) the prior.
The posterior is proportional to likelihood × prior: I update my prior beliefs after I see the data.
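To make the update concrete, here is a sketch of one case where the posterior is available in closed form: a Gaussian likelihood with known variance and a Gaussian prior on the mean. NumPy is assumed and all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 1.0                                   # known observation variance
y = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=50)

# Prior belief about the mean before seeing the data: mu ~ N(m0, s0_2).
m0, s0_2 = 0.0, 10.0

# Conjugate update: posterior mu | Y ~ N(m_post, s_post_2), i.e. likelihood x prior, renormalized.
n = len(y)
s_post_2 = 1.0 / (1.0 / s0_2 + n / sigma2)
m_post = s_post_2 * (m0 / s0_2 + y.sum() / sigma2)

print(m_post, s_post_2)   # the prior mean is pulled toward the data, with shrinking uncertainty
```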

BAYESIAN INFERENCE: BASICS
p(θ | Y) ∝ p(Y | θ) p(θ)

BAYESIAN INFERENCE: BASICS
Making predictions: for an unseen y*, marginalise!
p(y* | Y) = ∫ p(y* | θ) p(θ | Y) dθ
The prediction does not depend on a point estimate of θ; it is expressed as a weighted average over all the possible values of θ.
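The integral can be approximated by averaging over posterior samples; a minimal sketch continuing the Gaussian example above, with NumPy assumed and illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 1.0
m_post, s_post_2 = 1.9, 0.02                       # illustrative posterior over the mean

# p(y* | Y) = ∫ p(y* | theta) p(theta | Y) dtheta ≈ (1/S) sum_s p(y* | theta_s)
theta_samples = rng.normal(m_post, np.sqrt(s_post_2), size=5000)
y_star = 2.5
densities = np.exp(-0.5 * (y_star - theta_samples) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
print(densities.mean())   # a weighted average over plausible theta, not a single point estimate
```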

BAYESIAN INFERENCE: BASICS
Why is this useful?
> Uncertainty is taken into account: no point estimates, but a distribution over the unknown θ, p(θ | Y).
> Natural handling of missing data: a probability distribution is estimated for each missing value.
> Addresses issues like regularization and overfitting, which is useful in deep neural networks.

PART II: Bayesian Reasoning in (Deep) Neural Networks
Overview of (D)NNs.
Bayesian reasoning in DNNs.

OVERVIEW OF (DEEP) NEURAL NETWORKS
[Figure: input layer, several hidden layers, output layer.]
What is deep learning? A framework for constructing flexible models: a way of constructing outputs using functions on inputs.

OVERVIEW OF DEEP NNS
A deep network is a repetition of building blocks.

OVERVIEW OF DEEP NNS
Building block: inputs x_1, ..., x_M, weights β, output y.
> Output: y.
> Linear predictor: η = Σ_{m=1}^{M} β_m x_m = βᵀx; the coefficients β are the weights (linear part).
> Inverse link function: f(η), the activation function (non-linear part).

OVERVIEW OF DEEP NNS
In each layer l the input is linearly composed, η_l = β_lᵀ x_l, and a non-linear function is applied, f_l(η_l).
A deep neural network is a construction of recursive building blocks.
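A minimal sketch of this recursion in NumPy; the architecture, initialisation, and tanh activation are arbitrary choices for illustration:

```python
import numpy as np

def forward(x, weights, activation=np.tanh):
    """Recursive building block: in each layer l, eta_l = B_l @ x_l, then f_l(eta_l)."""
    h = x
    for B in weights[:-1]:
        h = activation(B @ h)    # linear composition followed by a non-linearity
    return weights[-1] @ h       # linear output layer

rng = np.random.default_rng(3)
layer_sizes = [4, 8, 8, 1]       # illustrative architecture
weights = [0.5 * rng.normal(size=(m, n))
           for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
print(forward(rng.normal(size=4), weights))
```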

OVERVIEW OF DEEP NNS
x → η = βᵀx → f(η) → ... → η_l = β_lᵀ x_l → f_l(η_l) → ... → y
A deep neural network can represent functions of increasing complexity.

DEEP NNS AS A PROBABILISTIC MODEL
The network defines a parametric conditional distribution p(y | x; β). Parametric!
Example, regression: with identity activation f(η) = η = βᵀx,
y = βᵀx + ε,  ε ∼ N(0, σ²),  so  p(y | x, β) = N(βᵀx, σ²).
Training the network means learning β.

DEEP NNS AS A PROBABILISTIC MODEL
Learn β by the maximum likelihood principle: minimize the negative log-likelihood
J(β) = −log p(y | x, β),  β_MLE = arg min_β J(β).
Regression: with p(y | x, β) = N(βᵀx, σ²) this is the squared loss function,
J(β) = (1/2) Σ_{i=1}^{N} {y^(i) − βᵀx^(i)}² + const.
> Solve using stochastic gradient descent, back-propagation, etc.
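A minimal sketch of this maximum-likelihood fit for the linear regression case; NumPy is assumed, and the data, learning rate, and iteration count are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
N, D = 200, 3
X = rng.normal(size=(N, D))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=N)    # y = beta^T x + eps

# Minimize J(beta) = 0.5 * sum_i (y_i - beta^T x_i)^2 by (full-batch) gradient descent.
beta = np.zeros(D)
lr = 0.1
for _ in range(2000):
    grad = X.T @ (X @ beta - y) / N
    beta -= lr * grad

print(beta)   # point estimate beta_MLE, close to beta_true
```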

DEEP NNS AS A PROBABILISTIC MODEL
Maximum likelihood estimation: β is fixed but unknown. Cons: point estimates; overfitting.
Bayesian approach: β is unknown/uncertain. Pros: accounts for uncertainty via p(β) and p(β | Y); regularisation.

BAYESIAN REASONING IN DEEP LEARNING
Maximum a posteriori estimation: put a distribution over the network weights β,
β ∼ p(β),  y | x, β ∼ p(y | x, β).
Update the belief over β with Bayes' rule: p(β | Y) ∝ p(Y | β, X) p(β) (likelihood × prior).
Maximum a posteriori: find β_MAP that maximizes log p(β | Y), i.e. minimizes J(β) = −log p(β | Y),
β_MAP = arg min_β J(β).

BAYESIAN REASONING IN DEEP LEARNING
Regression, maximum a posteriori: y = βᵀx + ε, with a multivariate Gaussian prior over the weights,
p(β) = N(m_β, λ⁻¹I),  p(y | x, β) = N(βᵀx, σ²).
Loss function J(β) = −log p(β | Y):
J(β) = (1/2) Σ_{i=1}^{N} {y^(i) − βᵀx^(i)}² + (λ/2) βᵀβ + const.
The (λ/2) βᵀβ term is weight decay: being Bayesian over the weights makes a natural regularizer arise, a Bayesian side effect.
β_MAP = arg min_β J(β).
Predictions: y* | Y, x* ∼ N(β_MAPᵀ x*, σ²).
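A minimal sketch of this MAP fit using the closed-form ridge solution; NumPy is assumed, the data and λ are illustrative, and a zero prior mean m_β = 0 is assumed for simplicity:

```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 200, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=N)

# Minimizing J(beta) = 0.5 * sum_i (y_i - beta^T x_i)^2 + (lam/2) * beta^T beta
# (prior precision lam acting as weight decay) has a closed-form solution:
lam = 1.0
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

x_star = rng.normal(size=D)
print(beta_map, beta_map @ x_star)   # prediction mean of N(beta_MAP^T x*, sigma^2)
```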

BAYESIAN REASONING IN DEEP LEARNING
What we have seen so far...
> Being Bayesian over the weights β, i.e. putting a prior distribution on them, allows for a better loss function in learning: natural regularization.
> (D)NNs don't handle missing data: given train data (x^(i), y^(i)), i = 1, ..., N, the network only gives p(y* | x*; β); if an input x^(i) is missing, we need a model for the inputs, x ∼ p(x).
> (D)NNs don't generate new examples: for that we need the joint p(x, y) = p(y | x) p(x), so that we can draw x^(i) ∼ p(x) and then y^(i) ∼ p(y | x^(i)).
We need a fully probabilistic model.

BAYESIAN REASONING IN DEEP LEARNING
The autoencoder as a fully probabilistic neural network.

BAYESIAN REASONING IN DEEP LEARNING
Autoencoder: input y → encoder → code z → decoder → output ỹ.
> Two parts:
> Encoder: "encodes" the input into a hidden code (representation) z.
> Decoder: "decodes" the code back into ỹ.
> Minimize the reconstruction error L = ‖y − ỹ‖₂².

BAYESIAN REASONING IN DEEP LEARNING
The autoencoder is an unsupervised method: it experiences data y and attempts to learn the distribution over y, i.e. p(y | z).

BAYESIAN REASONING IN DEEP LEARNING
Probabilistic view: the encoder is p_φ(z | y), the decoder is p_θ(y | z); {θ, φ} are the NN's weights.

BAYESIAN REASONING IN DEEP LEARNING
Decoder = generative model, a latent variable model: z ∼ p(z), ỹ ∼ p_θ(y | z); learn θ.
Encoder = inference mechanism: learn p_φ(z | y), i.e. learn φ.
Inference: learn the weights φ and θ.

BAYESIAN REASONING IN DEEP LEARNING
Variational inference. The exact posterior
p(z | y) = p(y | z) p(z) / p(y)
is intractable: we cannot sample from it or construct it using the NN structure (weights φ).
Main principle: introduce an approximate distribution q_φ(z) ≈ p(z | y) and minimize KL[q_φ(z) ‖ p(z | y)], the information lost when using q to approximate p.

BAYESIAN REASONING IN DEEP LEARNING
Variational inference cost function: a metric for how well {φ, θ} capture the data-generating distribution p_θ(y | z) and the latent distribution p(z | y). The negative marginal likelihood −log p(y | θ, φ) is intractable to minimize directly, so minimize the free energy F instead:
F(φ, θ; y) = −E_{q_φ(z|y)}[log p_θ(y | z)] + KL[q_φ(z | y) ‖ p(z)]
> Reconstruction cost: the expected log-likelihood measures how well samples from q_φ(z) are able to explain the data y.
> Penalty: the KL divergence keeps the approximation close to the prior; since F = KL[q_φ(z | y) ‖ p(z | y)] − log p(y), minimizing F also minimizes the information lost (in units of nats) when using q_φ(z) to represent p(z | y).
The penalty is derived from the model and does not need to be designed.
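A minimal single-datapoint sketch of evaluating this free energy with a Gaussian encoder and decoder and the reparameterization trick; NumPy is assumed, and the tiny linear "networks", their weights, and all numbers are hypothetical illustrations rather than the talk's model:

```python
import numpy as np

rng = np.random.default_rng(6)

def free_energy(y, enc_w, dec_w, n_samples=1000):
    """F(phi, theta; y) = -E_q[log p_theta(y|z)] + KL[q_phi(z|y) || p(z)], with p(z) = N(0, 1)."""
    # Encoder q_phi(z|y): a tiny linear "network" producing mean and log-variance of z.
    mu, logvar = enc_w[0] * y + enc_w[1], enc_w[2] * y + enc_w[3]
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, 1).
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=n_samples)
    # Decoder p_theta(y|z) = N(y; dec_w[0]*z + dec_w[1], 1): Monte Carlo expected log-likelihood.
    recon = np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (y - (dec_w[0] * z + dec_w[1])) ** 2)
    # Closed-form KL between the Gaussian q_phi(z|y) and the standard normal prior.
    kl = 0.5 * (np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return -recon + kl

print(free_energy(y=0.7, enc_w=[0.5, 0.0, 0.1, -1.0], dec_w=[1.2, 0.0]))
```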

BAYESIAN REASONING IN DEEP LEARNING
The Bayesian & DL marriage: what have we gained?
++ A principled inference approach (Bayesian): the penalty is an intrinsic part of the model.
++ Uncertainty is encoded.
++ Missing data can be imputed.

DEEP LEARNING AS A KERNEL METHOD
> So far: predictions made using point estimates of the weights β (parametric).
> Now: store the entire training set in order to make predictions, a kernel approach (non-parametric).

DEEP LEARNING AS A KERNEL METHOD
Regression example with basis functions φ(·):
y = βᵀφ(x) = Σ_{m=1}^{M} β_m φ_m(x).

DEEP LEARNING AS A KERNEL METHOD
Parametric approach:
J(β) = (1/2) Σ_{n=1}^{N} {βᵀφ(x_n) − y_n}² + (λ/2) βᵀβ
> MLE training: point estimates of β. Setting ∇J(β) = 0 gives
β_MLE = −(1/λ) Σ_{n=1}^{N} {βᵀφ(x_n) − y_n} φ(x_n).
> Predictions: y* = β_MLEᵀ φ(x*).
Predictions are based purely on the learned parameters: all the information from the data is transferred into a set of parameters. Parametric!

DEEP LEARNING AS A KERNEL METHOD
Change of scenery... let the data speak: the dual representation.
J(β) = (1/2) Σ_{n=1}^{N} {βᵀφ(x_n) − y_n}² + (λ/2) βᵀβ
∇J(β) = 0  ⇒  β_MLE = −(1/λ) Σ_{n=1}^{N} {βᵀφ(x_n) − y_n} φ(x_n) = Σ_{n=1}^{N} α_n φ(x_n) = Φᵀα,
with α_n = −(1/λ)(βᵀφ(x_n) − y_n) and Φ the N × M design matrix.
Solving ∇J(α) = 0 gives α = (K + λI_N)⁻¹ y, where K is the N × N Gram matrix with entries
K_lm = ⟨φ(x_l), φ(x_m)⟩ = k(x_l, x_m), the kernel function.
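A minimal sketch of this dual solution as kernel ridge regression with an RBF kernel; NumPy is assumed, and the data, kernel, lengthscale, and λ are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=40)   # toy 1-D regression data

def rbf_kernel(A, B, lengthscale=1.0):
    """k(x_l, x_m) = exp(-||x_l - x_m||^2 / (2 * lengthscale^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

lam = 0.1
K = rbf_kernel(X, X)                                    # N x N Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)    # alpha = (K + lam I)^-1 y

X_star = np.array([[0.5]])
y_star = rbf_kernel(X_star, X) @ alpha                  # prediction keeps (and uses) all the data
print(y_star)
```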

DEEP LEARNING AS A KERNEL METHOD
Why?
Parametric: β_MLE = −(1/λ) Σ_{n=1}^{N} {βᵀφ(x_n) − y_n} φ(x_n) is a point estimate with no memory of the data; predictions y* = β_MLEᵀ φ(x*). The memory lives in β_MLE.
Non-parametric: α = (K + λI_N)⁻¹ y uses all the data; predictions y* = k(x, x*)ᵀ (K + λI)⁻¹ y. The memory comes from actually storing all the data.

DEEP LEARNING AS A KERNEL METHOD
Deep networks and kernel methods are duals of each other.

DEEP LEARNING AS A KERNEL METHOD
Step back... y = βᵀφ(x) = Σ_{m=1}^{M} β_m φ_m(x), i.e. y = f(x).
TASK: learn the best regression function f(·).
What about a bit more Bayesian reasoning? Consider f = [f(x_1), f(x_2), ..., f(x_n)] a multivariate variable and apply a distribution to it, f ∼ p(f).

DEEP LEARNING AS A KERNEL METHOD
y = βᵀφ(x) = Σ_{m=1}^{M} β_m φ_m(x) = f(x),  f = [f(x_1), f(x_2), ..., f(x_n)].
With a Gaussian prior on the weights, β ∼ N(m, Σ), and the number of hidden units φ_m(x) taken to infinity, M → ∞:
f ∼ N(0, K),  K_ij = k(x_i, x_j).
As N → ∞, we sample from a Gaussian process, f ∼ GP(0, K).
Predict y* = f(x*):
p(f* | X, y, x*) = ∫ p(f* | x*, X, f) p(f | X, y) df.
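A minimal sketch of this Gaussian process predictive for noisy regression; NumPy is assumed, the RBF kernel, noise level, and data are illustrative, and the mean and variance follow the standard GP regression formulas:

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=40)

def rbf_kernel(A, B, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

noise = 0.01                                        # observation noise variance
K = rbf_kernel(X, X) + noise * np.eye(len(X))
X_star = np.array([[0.5]])
k_star = rbf_kernel(X_star, X)

# p(f* | X, y, x*) is Gaussian; the integral over f is done in closed form:
mean = k_star @ np.linalg.solve(K, y)
var = rbf_kernel(X_star, X_star) - k_star @ np.linalg.solve(K, k_star.T)
print(mean, var)                                    # the variance quantifies predictive uncertainty
```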

DEEP LEARNING AS A KERNEL METHOD
Taking the number of hidden units φ_m(x) to infinity, M → ∞: Gaussian processes are the infinite limits of deep networks.

Thank you!