Estimation theory and information geometry based on denoising
Slide 1: Estimation theory and information geometry based on denoising
Aapo Hyvärinen
Dept of Computer Science & HIIT, Dept of Mathematics and Statistics
University of Helsinki, Finland
Slide 2: Abstract
- What is the best prior to be used in denoising by Bayesian inference?
  - Consider infinitesimal Gaussian noise
  - Assume we can estimate prior parameters from noise-free data
- Solution: fitting the gradient of the log-density, $\psi$, w.r.t. the data variable
  - Minimize the squared distance between $\psi$ of the data and $\psi$ of the model
  - Using partial integration, the distance can be computed by a simple formula
- Related problem: estimation of non-normalized models
  - Computationally simple solution provided by the same estimator
  - No need to compute the normalization constant (partition function)
- Leads to a new kind of information geometry
Slide 3: Starting point: Best prior for denoising
- Consider an observed signal $y$ which is a noisy version of an original signal $x$, which comes from a prior distribution with parameter vector $\theta$.
- Assume:
$$p(y, x \mid \theta) = c \exp\left(-\frac{1}{2\sigma^2}\|y - x\|^2\right) p(x \mid \theta) \qquad (1)$$
- We infer the original signal by MAP inference:
$$\hat{x}_{\mathrm{MAP}}(\hat\theta, y) = \arg\max_x \, p(y \mid x)\, p(x \mid \hat\theta) = \arg\max_x \left[\log p(y \mid x) + \log p(x \mid \hat\theta)\right]$$
- We estimate the parameters $\theta$ from a separate sample of noise-free signals $x$.
- What is the optimal method of estimating $\theta$? (A single point estimate)
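As a rough illustration of the MAP step on this slide, the sketch below runs gradient ascent on $\log p(y \mid x) + \log p(x \mid \hat\theta)$ for Gaussian noise of variance `sigma2`. The function names, the step size, and the unit-variance Gaussian prior used in the check are assumptions made for the example; the slides leave the prior general.

```python
import numpy as np

def map_denoise(y, sigma2, prior_score, n_iter=200, step=None):
    """Gradient ascent on log p(y|x) + log p(x|theta), started at x = y.

    prior_score(x) must return the gradient of log p(x|theta) w.r.t. x.
    """
    y = np.asarray(y, dtype=float)
    x = y.copy()
    if step is None:
        step = 0.5 * sigma2          # heuristic step size (assumption)
    for _ in range(n_iter):
        # Gradient of the Gaussian log-likelihood plus the prior score.
        grad = (y - x) / sigma2 + prior_score(x)
        x = x + step * grad
    return x

def gaussian_prior_score(theta):
    """Score of a hypothetical unit-variance Gaussian prior with mean theta."""
    return lambda x: -(x - theta)
```

For this Gaussian prior the MAP estimate has the closed form $(y + \sigma^2\theta)/(1 + \sigma^2)$, which gives a simple way to check the iteration.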
Slide 4: Difference to classical optimality analysis
- Classical analysis of the optimality of estimators considers errors in parameter values
- Here, we consider the error in the restored (denoised) signal (Euclidean distance between $x$ and its MAP estimate)
- These errors need not be related, cf. collinearity in linear regression
- Also: to be computationally realistic, we don't use a full Bayesian restoration; instead we take a point estimate of $\theta$ and use the MAP estimate.
- We also assume that we can observe noise-free signals from which to estimate the parameters.
Slide 5: Analysis of estimation error
- Assume the signal is corrupted by infinitely small Gaussian noise as above
- Theorem 1: Assume that all the log-pdfs are differentiable, and that the estimation error in MAP estimation, $\Delta x = \hat{x} - x$, is small. Then the first-order approximation of the error is
$$\|\Delta x\|^2 = \sigma^4 \left[\|E_1\|^2 + \|E_2\|^2\right] + \text{smaller terms} \qquad (2)$$
where $E_1 = \psi_0(x) - \psi(x \mid \hat\theta)$ and $E_2 = \psi_0(x) + \psi(y \mid x)$.
- Note that $E_2$ does not depend on $\theta$
- Thus, optimal estimation of $\theta$ is by minimization of $E_{p_x}\{\|E_1\|^2\}$
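Slide 6: Definition of score function (in this talk)
- Define the model score function $\mathbb{R}^n \to \mathbb{R}^n$ as
$$\psi(\xi \mid \theta) = \left(\frac{\partial \log p(\xi \mid \theta)}{\partial \xi_1}, \ldots, \frac{\partial \log p(\xi \mid \theta)}{\partial \xi_n}\right)^T = \nabla_\xi \log p(\xi \mid \theta)$$
- Similarly, define the data score function as
$$\psi_x(\xi) = \nabla_\xi \log p_x(\xi)$$
where the observed data is assumed to follow $p_x(\cdot)$.
- The optimal estimator is obtained by minimizing a distance between the model score function $\psi(\cdot \mid \theta)$ and the score function of the observed data $\psi_x(\cdot)$:
$$J(\theta) = \frac{1}{2} \int_{\xi \in \mathbb{R}^n} p_x(\xi)\, \|\psi(\xi \mid \theta) - \psi_x(\xi)\|^2 \, d\xi \qquad (3)$$
- Estimator consistent almost by construction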
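A minimal sketch of the distance in (3), assuming the data score is available in closed form (which is rarely the case in practice) and replacing the integral by an average over samples from $p_x$. The function name and the Gaussian example are illustrative assumptions.

```python
import numpy as np

def explicit_score_distance(samples, model_score, data_score):
    """Monte Carlo estimate of 0.5 * E_{p_x} ||psi(xi|theta) - psi_x(xi)||^2.

    samples: array of shape (T, n) drawn from p_x.
    model_score, data_score: functions mapping (T, n) -> (T, n).
    """
    diff = model_score(samples) - data_score(samples)
    return 0.5 * np.mean(np.sum(diff ** 2, axis=-1))

# Example: standard normal data, zero-mean Gaussian model with precision lam.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))
data_score = lambda x: -x                  # score of N(0, I)
model_score = lambda lam: (lambda x: -lam * x)
J_at_truth = explicit_score_distance(X, model_score(1.0), data_score)
J_off = explicit_score_distance(X, model_score(2.0), data_score)
```

The distance vanishes when the model score equals the data score, which is the sense in which the estimator is consistent almost by construction.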
Slide 7: Related problem: Non-normalized model estimation
- We want to estimate a parametric model of a multivariate random vector $x \in \mathbb{R}^n$
- The density function is known only up to a multiplicative constant:
$$p(x \mid \theta) = \frac{1}{Z(\theta)}\, q(x \mid \theta), \qquad Z(\theta) = \int_{\xi \in \mathbb{R}^n} q(\xi \mid \theta)\, d\xi$$
- The functional form of $q$ is known (can be easily computed)
- The normalization constant $Z$ cannot be computed in reasonable computing time
- Typical application: Markov Random Fields
Slide 8: Previous solutions to estimation of non-normalized models
- Monte Carlo methods for estimating the normalization constant $Z(\theta)$
  - Consistent estimators (convergence to real parameter values when sample size $\to \infty$)
  - Computation very slow (I think)
- Various approximations, e.g. variational methods
  - Computation often fast
  - Consistency not known, or proven inconsistent
- Pseudo-likelihood and contrastive divergence
  - Presumably consistent
  - Computations slow with continuous-valued variables: needs 1-D integration at every step, or sophisticated MCMC methods
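Slide 9: Score matching can be used for non-normalized models
- No need to compute the normalization constant $Z(\theta)$, because it does not depend on $\xi$:
$$\psi(\xi \mid \theta) = \nabla_\xi \log q(\xi \mid \theta) - \nabla_\xi \log Z(\theta) = \nabla_\xi \log q(\xi \mid \theta) - 0 \qquad (4)$$
- In the objective function we have the score function of the data distribution, $\psi_x(\cdot)$. How to compute it?
- In fact, there is no need to compute it, because:
- Theorem 2: Assume some regularity conditions, and smooth densities. Then the score matching objective function $J$ can be expressed as
$$J(\theta) = \int_{\xi \in \mathbb{R}^n} p_x(\xi) \sum_{i=1}^n \left[\partial_i \psi_i(\xi \mid \theta) + \frac{1}{2} \psi_i(\xi \mid \theta)^2\right] d\xi + \text{const.} \qquad (5)$$
where the constant does not depend on $\theta$, and
$$\psi_i(\xi \mid \theta) = \frac{\partial \log q(\xi \mid \theta)}{\partial \xi_i}, \qquad \partial_i \psi_i(\xi \mid \theta) = \frac{\partial^2 \log q(\xi \mid \theta)}{\partial \xi_i^2}$$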
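Theorem 2 can be checked numerically in one dimension. The sketch below assumes standard normal data and the non-normalized model $q(x \mid \theta) = \exp(-\theta x^2/2)$, and evaluates both forms of the objective by a Riemann sum on a fine grid; the grid limits and the function name are arbitrary choices for the example. For this pair the two values should differ by the same constant, 1/2, for every $\theta$.

```python
import numpy as np

def explicit_and_implicit_objectives(theta, lo=-12.0, hi=12.0, n=240001):
    """Return (explicit, implicit) objectives for p_x = N(0, 1) and
    the non-normalized model q(x|theta) = exp(-theta * x**2 / 2)."""
    x = np.linspace(lo, hi, n)
    dx = x[1] - x[0]
    p = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)   # data density
    psi_data = -x                                      # data score
    psi_model = -theta * x                             # d/dx log q
    dpsi_model = -theta * np.ones_like(x)              # d^2/dx^2 log q
    explicit = 0.5 * np.sum(p * (psi_model - psi_data) ** 2) * dx
    implicit = np.sum(p * (dpsi_model + 0.5 * psi_model ** 2)) * dx
    return explicit, implicit

# The gap between the two objectives for a few parameter values.
gaps = [np.subtract(*explicit_and_implicit_objectives(t)) for t in (0.5, 1.0, 2.0, 3.0)]
```

Note that the implicit form uses only derivatives of $\log q$, so neither the data score nor $Z(\theta)$ is needed.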
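Slide 10: Simple explanation of trick
- Consider the objective function $J(\theta)$:
$$\frac{1}{2}\int p_x(\xi)\, \|\psi_x(\xi)\|^2 d\xi + \frac{1}{2}\int p_x(\xi)\, \|\psi(\xi \mid \theta)\|^2 d\xi - \int p_x(\xi)\, \psi_x(\xi)^T \psi(\xi \mid \theta)\, d\xi$$
- The first term does not depend on $\theta$. The second term is easy to compute.
- The trick is to use partial integration on the third term. In one dimension:
$$\int p_x(x)\, (\log p_x)'(x)\, \psi(x \mid \theta)\, dx = \int p_x(x)\, \frac{p_x'(x)}{p_x(x)}\, \psi(x \mid \theta)\, dx = \int p_x'(x)\, \psi(x \mid \theta)\, dx = 0 - \int p_x(x)\, \psi'(x \mid \theta)\, dx$$
- This is why the score function of the data distribution $p_x(x)$ disappears!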
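The one-dimensional partial integration step can be verified numerically. The sketch below assumes standard normal data and a hypothetical model score $\psi(x \mid \theta) = -\tanh(\theta x)$, chosen only because its derivative is easy to write down; the boundary term vanishes because $p_x(x)\psi(x \mid \theta) \to 0$ at infinity.

```python
import numpy as np

def check_partial_integration(theta, lo=-10.0, hi=10.0, n=200001):
    """Compare int p_x (log p_x)' psi dx with -int p_x psi' dx on a grid."""
    x = np.linspace(lo, hi, n)
    dx = x[1] - x[0]
    p = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)   # data density p_x
    psi_data = -x                                      # (log p_x)'
    psi_model = -np.tanh(theta * x)                    # hypothetical model score
    dpsi_model = -theta / np.cosh(theta * x) ** 2      # its derivative
    lhs = np.sum(p * psi_data * psi_model) * dx        # cross term with data score
    rhs = -np.sum(p * dpsi_model) * dx                 # after partial integration
    return lhs, rhs

lhs, rhs = check_partial_integration(1.5)
```

The right-hand side involves only the model score, so it can be estimated from samples without knowing $p_x$.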
Slide 11: Final method of score matching
- Replace integration over the sample density $p_x(\cdot)$ by a sample average
- Given $T$ observations $x(1), \ldots, x(T)$, minimize
$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{n} \left[\partial_i \psi_i(x(t) \mid \theta) + \frac{1}{2} \psi_i(x(t) \mid \theta)^2\right] \qquad (6)$$
where $\psi_i$ is a partial derivative of the non-normalized model log-density $\log q$, and $\partial_i \psi_i$ a second partial derivative
- Only needs evaluation of some derivatives of the non-normalized (log-)density $q$, which are simple to compute (by assumption)
- Thus: statistical optimality in denoising and computational simplicity for non-normalized models are obtained with the same estimator
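A minimal sketch of the sample objective (6), assuming the user supplies the first and second partial derivatives of $\log q$ as functions. The zero-mean Gaussian model with diagonal precision is a hypothetical example for which the minimizer is known: each precision equals the inverse of the sample second moment.

```python
import numpy as np

def score_matching_objective(X, psi, dpsi):
    """Sample average over t of sum_i [d_i psi_i + 0.5 * psi_i^2].

    X: (T, n) data; psi(X) and dpsi(X) return (T, n) arrays holding
    d log q / d xi_i and d^2 log q / d xi_i^2 at each data point.
    """
    return np.mean(np.sum(dpsi(X) + 0.5 * psi(X) ** 2, axis=1))

def diag_gaussian_model(lam):
    """Hypothetical model: log q(x) = -0.5 * sum_i lam_i * x_i^2."""
    lam = np.asarray(lam, dtype=float)
    psi = lambda X: -X * lam
    dpsi = lambda X: np.broadcast_to(-lam, X.shape)
    return psi, dpsi

def fit_diag_precision(X):
    """Minimizer of the objective for the model above (set gradient to zero)."""
    return 1.0 / np.mean(X ** 2, axis=0)
```

No normalization constant appears anywhere: only derivatives of the non-normalized log-density are evaluated.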
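Slide 12: Interesting result: Closed-form solution in the exponential family
- Assume the pdf can be expressed in the form
$$\log p(\xi \mid \theta) = \sum_{k=1}^{m} \theta_k F_k(\xi) - \log Z(\theta) \qquad (7)$$
- Define matrices of partial derivatives:
$$K_{ki}(\xi) = \frac{\partial F_k}{\partial \xi_i}, \quad \text{and} \quad H_{ki}(\xi) = \frac{\partial^2 F_k}{\partial \xi_i^2} \qquad (8)$$
- Then the score matching estimator is given by:
$$\hat\theta = -\left[\hat{E}\{K(x) K(x)^T\}\right]^{-1} \left(\sum_i \hat{E}\{h_i(x)\}\right) \qquad (9)$$
where $\hat{E}$ denotes the sample average, and the vector $h_i$ is the $i$-th column of the matrix $H$.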
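The closed-form estimator can be sketched for a one-dimensional Gaussian written as an exponential family with $F_1(\xi) = \xi$ and $F_2(\xi) = -\xi^2/2$ (an assumption made for the example, so that $\theta_2$ is the precision and $\theta_1$ is precision times mean). The minus sign in front of the inverse follows from setting the gradient of the quadratic objective $\theta^T \sum_i \hat{E}\{h_i\} + \frac{1}{2}\theta^T \hat{E}\{KK^T\}\theta$ to zero. Function names are hypothetical.

```python
import numpy as np

def score_matching_expfam(K, H):
    """Closed-form score matching estimate for an exponential family.

    K, H: arrays of shape (T, m, n) with first and second partial
    derivatives of the m statistics F_k w.r.t. the n data coordinates.
    """
    T = K.shape[0]
    M = np.einsum('tki,tli->kl', K, K) / T   # sample average of K K^T (m x m)
    h = H.sum(axis=2).mean(axis=0)           # sum over i of sample average of h_i
    return -np.linalg.solve(M, h)

def gaussian_1d_features(x):
    """K and H for F_1(xi) = xi and F_2(xi) = -xi^2 / 2 (m = 2, n = 1)."""
    x = np.asarray(x, dtype=float)
    K = np.stack([np.ones_like(x), -x], axis=1)[:, :, None]
    H = np.stack([np.zeros_like(x), -np.ones_like(x)], axis=1)[:, :, None]
    return K, H
```

For this model the estimate reduces to the sample mean divided by the sample variance, and one over the sample variance, which coincides with the moment estimates.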
Slide 13: Extensions of score matching
- Can be extended to non-negative data
  - Basic score matching cannot be directly used because the density is typically not smooth over $\mathbb{R}^n$.
- Can be extended to binary variables
  - However, utility questionable because pseudolikelihood is computationally efficient in that case
- Can be shown to be equivalent to a special case of contrastive divergence (equal in expectation when using a Langevin MCMC method and infinitesimal step size)
Slide 14: An information geometry
- Considering $p_x$ fixed, we define a Hilbertian structure in the space of score functions:
$$\langle p_1, p_2 \rangle = \int p_x(\xi) \left[\sum_{i=1}^{n} \psi_{1,i}(\xi)\, \psi_{2,i}(\xi)\right] d\xi = \int p_x(\xi)\, \psi_1(\xi)^T \psi_2(\xi)\, d\xi \qquad (10)$$
- The dot-product defines a norm and a distance
- Score matching is performed by minimization of the distance between $p_x$ and $p(\cdot \mid \theta)$ in this metric.
Slide 15: Pythagorean decomposition for exponential families
- An exponential family is a linear subspace
- Estimation is an orthogonal projection onto that subspace
- Pythagorean equality:
$$\|p_x\|^2 = \mathrm{dist}^2(p(\cdot \mid \hat\theta), p_x) + \|p(\cdot \mid \hat\theta)\|^2 \qquad (11)$$
- Can be interpreted in terms of the denoising capability of MAP estimation:
  variance of noise which can be removed by MAP denoising = noise variance not removed due to imperfect prior + noise variance removed by prior
- Intuitively, denoising is possible because of structure in the signal, which leads to a more speculative interpretation:
  Structure in data = Structure not modelled + Structure modelled
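The inner product and the Pythagorean equality can be illustrated numerically in one dimension. The sketch below assumes the data follow a two-component Gaussian mixture (so the data score is available in closed form) and the model family is the zero-mean Gaussian with precision $\theta$, whose scores $-\theta\xi$ form a one-dimensional linear subspace. Integrals are replaced by Riemann sums; all names and numerical settings are assumptions.

```python
import numpy as np

def pythagoras_check(lo=-8.0, hi=8.0, n=40001):
    """Project the data score onto the span of b(xi) = -xi and compare norms."""
    x = np.linspace(lo, hi, n)
    dx = x[1] - x[0]
    means, s, w = np.array([-1.0, 1.0]), 0.5, np.array([0.5, 0.5])
    comps = (w[:, None] * np.exp(-0.5 * ((x - means[:, None]) / s) ** 2)
             / (s * np.sqrt(2.0 * np.pi)))
    p = comps.sum(axis=0)                                   # mixture density p_x
    dp = (comps * (-(x - means[:, None]) / s ** 2)).sum(axis=0)
    psi_x = dp / p                                          # data score
    inner = lambda f, g: np.sum(p * f * g) * dx             # inner product (10)
    b = -x                                                  # model score per unit precision
    theta_hat = inner(psi_x, b) / inner(b, b)               # orthogonal projection
    psi_hat = theta_hat * b
    lhs = inner(psi_x, psi_x)
    rhs = inner(psi_x - psi_hat, psi_x - psi_hat) + inner(psi_hat, psi_hat)
    return theta_hat, lhs, rhs

theta_hat, lhs, rhs = pythagoras_check()
```

Here the projection coefficient equals one over the second moment of the data (1.25 for this mixture), which is also what score matching returns for the same model.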
Slide 16: Interesting point: We can even use improper densities
- Nothing in the method requires the densities to be integrable at all
- We can use all kinds of functional forms for the densities
- For example, the density can stay constant at infinity:
$$p(x; \mu, \sigma) = \left[1 + \exp\left(-\frac{x - \mu}{\sigma}\right)\right]^{-1} \qquad (12)$$
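Slide 17: Experiment: overcomplete basis of natural images
- Likelihood:
$$\log p(x) = \sum_{k=1}^{m} \alpha_k G(w_k^T x) + Z(w_1, \ldots, w_n, \alpha_1, \ldots, \alpha_n)$$
- Objective function (with $g = G'$):
$$J = \sum_{k=1}^{m} \alpha_k \frac{1}{T} \sum_{t=1}^{T} g'(w_k^T x(t)) + \frac{1}{2} \sum_{j,k=1}^{m} \alpha_j \alpha_k\, w_j^T w_k\, \frac{1}{T} \sum_{t=1}^{T} g(w_k^T x(t))\, g(w_j^T x(t)) \qquad (13)$$
- 120 basis vectors from 8 × 8 image patches (no dimension reduction)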
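A minimal sketch of how the objective (13) might be evaluated, under two assumptions that are not stated on the slide: the nonlinearity is $G(u) = -\log\cosh(u)$, so that $g(u) = -\tanh(u)$, and the basis vectors $w_k$ have unit norm (needed for the first term to match the general score matching objective). The function name is hypothetical.

```python
import numpy as np

def overcomplete_sm_objective(W, alpha, X):
    """Score matching objective for log q(x) = sum_k alpha_k G(w_k^T x).

    W: (m, n) matrix whose rows w_k are assumed to have unit norm.
    alpha: (m,) weights. X: (T, n) data matrix.
    Assumes G(u) = -log cosh(u), hence g = G' = -tanh.
    """
    U = X @ W.T                                   # (T, m): w_k^T x(t)
    g = -np.tanh(U)                               # g(w_k^T x(t))
    g_prime = -(1.0 - np.tanh(U) ** 2)            # g'(w_k^T x(t))
    term1 = np.sum(alpha * g_prime.mean(axis=0))  # sum_k alpha_k mean_t g'
    C = (g.T @ g) / X.shape[0]                    # (m, m): mean_t g_k g_j
    A = np.outer(alpha, alpha) * (W @ W.T)        # alpha_j alpha_k w_j^T w_k
    term2 = 0.5 * np.sum(A * C)
    return term1 + term2
```

In practice this value would be minimized over `W` and `alpha` with a gradient-based optimizer, renormalizing the rows of `W` after each step.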
Slide 18: Experiment 2: denoising
- $p_0$: several 1-D densities of zero mean
- Modelled (approximated) by a logistic distribution with a location parameter $\theta$:
$$\log p(x \mid \theta) = -2 \log\cosh\left(\frac{\pi}{2\sqrt{3}}(x - \theta)\right) - \log 4$$
- Gaussian noise added, parameter estimated by different methods, and MAP inference done.
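The estimation step of this experiment can be sketched as follows, assuming the model is the unit-variance logistic form $\log p(x \mid \theta) = -2\log\cosh\left(\frac{\pi}{2\sqrt{3}}(x-\theta)\right) - \log 4$ and using a plain grid search in place of a proper optimizer. SM denotes score matching and ML maximum likelihood; the function names and the grid are assumptions for illustration.

```python
import numpy as np

A = np.pi / (2.0 * np.sqrt(3.0))   # scale of the unit-variance logistic model

def sm_objective(theta, x):
    """Score matching objective: mean of psi' + 0.5 * psi^2."""
    u = A * (x - theta)
    psi = -2.0 * A * np.tanh(u)                  # d/dx log p(x|theta)
    dpsi = -2.0 * A ** 2 / np.cosh(u) ** 2       # d^2/dx^2 log p(x|theta)
    return np.mean(dpsi + 0.5 * psi ** 2)

def neg_log_likelihood(theta, x):
    """Negative mean log-likelihood of the same model."""
    u = A * (x - theta)
    return np.mean(2.0 * np.log(np.cosh(u)) + np.log(4.0))

def grid_estimate(objective, x, grid):
    """Return the grid point that minimizes objective(theta, x)."""
    values = np.array([objective(th, x) for th in grid])
    return grid[np.argmin(values)]
```

When the data really follow the model, both estimates should land near the true location; for the non-logistic densities on the slide the two estimates may differ, which is the situation the experiment compares.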
Slide 19: Denoise by MAP
[Table; numerical entries missing from the transcription. Columns: gauss mixt 1, gauss mixt 2, chi square, Laplacian. Rows: SM: value of $\hat\theta$; ML: value of $\hat\theta$; then, for each noise variance (0.05, 0.1, 0.2, 0.5): SM: squared error in $\hat{x}$; ML: squared error in $\hat{x}$; PP: error in $x$; p-value of difference; SM: performance index; ML: performance index.]
Slide 20: Conclusion
- We propose to estimate a parametric model by minimizing the squared distance between the score functions (gradients of log-density w.r.t. the data variable) of the model density and the data distribution
- Statistically optimal prior for removing infinitesimal Gaussian noise by MAP inference
- Computationally simple (no integration) for non-normalized densities, yet consistent
- Closed-form solution in some exponential families
- Geometric interpretations possible
- No need for densities to be integrable at all
Deep Variational Inference FLARE Reading Group Presentation Wesley Tansey 9/28/2016 What is Variational Inference? What is Variational Inference? Want to estimate some distribution, p*(x) p*(x) What is
More informationAnnouncements. Proposals graded
Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:
More informationNaïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability
Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish
More informationRiemannian Stein Variational Gradient Descent for Bayesian Inference
Riemannian Stein Variational Gradient Descent for Bayesian Inference Chang Liu, Jun Zhu 1 Dept. of Comp. Sci. & Tech., TNList Lab; Center for Bio-Inspired Computing Research State Key Lab for Intell. Tech.
More informationMachine Learning Basics: Maximum Likelihood Estimation
Machine Learning Basics: Maximum Likelihood Estimation Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics 1. Learning
More informationThe Poisson transform for unnormalised statistical models. Nicolas Chopin (ENSAE) joint work with Simon Barthelmé (CNRS, Gipsa-LAB)
The Poisson transform for unnormalised statistical models Nicolas Chopin (ENSAE) joint work with Simon Barthelmé (CNRS, Gipsa-LAB) Part I Unnormalised statistical models Unnormalised statistical models
More informationStat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2
Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, 2010 Jeffreys priors Lecturer: Michael I. Jordan Scribe: Timothy Hunter 1 Priors for the multivariate Gaussian Consider a multivariate
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters
More informationTraining an RBM: Contrastive Divergence. Sargur N. Srihari
Training an RBM: Contrastive Divergence Sargur N. srihari@cedar.buffalo.edu Topics in Partition Function Definition of Partition Function 1. The log-likelihood gradient 2. Stochastic axiu likelihood and
More informationA NEW INFORMATION THEORETIC APPROACH TO ORDER ESTIMATION PROBLEM. Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.
A EW IFORMATIO THEORETIC APPROACH TO ORDER ESTIMATIO PROBLEM Soosan Beheshti Munther A. Dahleh Massachusetts Institute of Technology, Cambridge, MA 0239, U.S.A. Abstract: We introduce a new method of model
More informationBayesian Dropout. Tue Herlau, Morten Morup and Mikkel N. Schmidt. Feb 20, Discussed by: Yizhe Zhang
Bayesian Dropout Tue Herlau, Morten Morup and Mikkel N. Schmidt Discussed by: Yizhe Zhang Feb 20, 2016 Outline 1 Introduction 2 Model 3 Inference 4 Experiments Dropout Training stage: A unit is present
More information