Scalable robust hypothesis tests using graphical models


Scalable robust hypothesis tests using graphical models
Umamahesh Srinivas
ipal Group Meeting, October 22, 2010

Binary hypothesis testing problem

Random vector x = (x_1, ..., x_n) ∈ R^n generated from either of two hypotheses:
H_0: x ~ g(x|H_0)
H_1: x ~ g(x|H_1)

Given: training sets T_0 and T_1, with K samples each.
Goal: classify a new sample as coming from H_0 or H_1.

Assumptions:
- Conditional densities g(x|H_0) and g(x|H_1) are known exactly.
- Samples in T_0 and T_1 are generated i.i.d. from g(x|H_0) and g(x|H_1), respectively.

Likelihood ratio test (LRT):

    L(x) := \frac{g(x \mid H_1)}{g(x \mid H_0)} \gtrless_{H_0}^{H_1} \tau    (1)
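
As a concrete illustration, here is a minimal sketch of the LRT in Eq. (1) for two known Gaussian conditional densities; the means, covariance, and threshold below are illustrative assumptions, not values from the talk.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative nominal densities g(x|H0), g(x|H1): n-dimensional Gaussians
# with a mean shift (assumed for this sketch; not specified in the talk).
n = 5
g0 = multivariate_normal(mean=np.zeros(n), cov=np.eye(n))
g1 = multivariate_normal(mean=0.5 * np.ones(n), cov=np.eye(n))

def lrt(x, tau=1.0):
    """Decide H1 (return 1) if g(x|H1)/g(x|H0) >= tau, else H0 (return 0)."""
    # Compare log-likelihoods for numerical stability.
    return int(g1.logpdf(x) - g0.logpdf(x) >= np.log(tau))

x_new = g1.rvs(random_state=0)   # a test sample drawn from H1
print("decision:", lrt(x_new))
```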

Need for robustness

Assumption of knowledge of the true densities is unrealistic:
- Limited training
- Training data acquired in the presence of noise
- Dynamically evolving conditional densities
- Secondary physical effects on the signal not modeled

Robust hypothesis test (RHT) [Huber, 1965]:
- Uncertainty in the knowledge of the true densities is modeled as a class of distributions in the proximity of some nominal density.
- A minimum level of performance is guaranteed for all models in the vicinity of the nominal density.

Measures of model proximity

Contamination model:

    F_k^c = \{ f(x) : f(x) = (1 - \epsilon_k) f_k(x) + \epsilon_k h(x) \},  k = 0, 1,

where f_k(x) are the nominal densities, 0 ≤ ε_0, ε_1 ≤ 1, and h(x) is an unknown probability density.

Total variation:

    F_k^{TV} = \{ f(x) : d_{TV}(f_k, f) = \int |f_k(x) - f(x)| \, dx < \epsilon \},  k = 0, 1.

Kullback-Leibler divergence:

    F_k^{KL} = \{ f(x) : D(f_k \| f) = \int f_k(x) \ln\!\left( \frac{f_k(x)}{f(x)} \right) dx < \epsilon \},  k = 0, 1.
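
For intuition, a small numerical sketch of these proximity measures on a discretized 1-D grid follows; the nominal and contaminating densities are toy choices of mine, not ones used in the talk.

```python
import numpy as np
from scipy.stats import norm

# Toy 1-D densities on a grid (assumed purely for illustration).
x = np.linspace(-6, 6, 2001)
dx = x[1] - x[0]
f0 = norm.pdf(x)                       # nominal density f_0
h = norm.pdf(x, loc=0.0, scale=3.0)    # arbitrary contaminating density h

eps = 0.1
f = (1 - eps) * f0 + eps * h           # a member of the contamination class F_0^c

d_tv = np.sum(np.abs(f0 - f)) * dx     # total variation distance d_TV(f_0, f)
d_kl = np.sum(f0 * np.log(f0 / f)) * dx  # KL divergence D(f_0 || f)
print(d_tv, d_kl)
```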

Problem set-up

D: convex set of pointwise randomized decision functions δ(·). For an observation x, we select H_1 with probability δ(x) and H_0 with probability 1 - δ(x).

False alarm:

    P_F(\delta, f_0) = \int \delta(x) f_0(x) \, dx    (2)

Miss:

    P_M(\delta, f_1) = \int (1 - \delta(x)) f_1(x) \, dx    (3)

For equally likely hypotheses, the probability of error is given by

    P_E(\delta, f_0, f_1) = \frac{1}{2} \left[ P_F(\delta, f_0) + P_M(\delta, f_1) \right].    (4)
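
The error probabilities (2)-(4) can be estimated by Monte Carlo once δ(·) and the densities are fixed; a small sketch with placeholder 1-D densities and an arbitrary randomized rule is shown below (my own assumptions, purely illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)

def delta(x):
    # An arbitrary randomized decision rule: probability of choosing H1 given x.
    return 1.0 / (1.0 + np.exp(-4.0 * (x - 0.25)))

x0 = rng.normal(0.0, 1.0, size=100_000)   # samples from a placeholder f_0
x1 = rng.normal(0.5, 1.0, size=100_000)   # samples from a placeholder f_1

p_fa = delta(x0).mean()            # Eq. (2): E_{f0}[delta(x)]
p_miss = (1.0 - delta(x1)).mean()  # Eq. (3): E_{f1}[1 - delta(x)]
p_err = 0.5 * (p_fa + p_miss)      # Eq. (4), equally likely hypotheses
print(p_fa, p_miss, p_err)
```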

Minimax RHT

    (\delta_R, f_0^L(x), f_1^L(x)) = \arg\min_{\delta \in D} \; \max_{f_0, f_1 \in F^c} P_E(\delta, f_0, f_1),    (5)

where
- δ_R is the robust test
- (f_0^L, f_1^L) are the least favorable densities in F^c = F_0^c × F_1^c.

Solution to minimax RHT

    f_0^L(x) = \begin{cases} (1 - \epsilon_0) f_0(x), & f_1(x)/f_0(x) < c'' \\ \frac{1}{c''}(1 - \epsilon_0) f_1(x), & f_1(x)/f_0(x) \ge c'' \end{cases}    (6)

    f_1^L(x) = \begin{cases} (1 - \epsilon_1) f_1(x), & f_1(x)/f_0(x) > c' \\ c'(1 - \epsilon_1) f_0(x), & f_1(x)/f_0(x) \le c' \end{cases}    (7)

    \delta_R(x) = \begin{cases} 1, & f_1^L(x)/f_0^L(x) \ge 1 \\ 0, & f_1^L(x)/f_0^L(x) < 1 \end{cases}    (8)

where c' and c'' are defined such that f_0^L and f_1^L are valid probability densities, leading to:

    P_0\!\left( \frac{f_1(x)}{f_0(x)} < c'' \right) + \frac{1}{c''} P_1\!\left( \frac{f_1(x)}{f_0(x)} \ge c'' \right) = \frac{1}{1 - \epsilon_0}    (9)

    P_1\!\left( \frac{f_1(x)}{f_0(x)} > c' \right) + c' \, P_0\!\left( \frac{f_1(x)}{f_0(x)} \le c' \right) = \frac{1}{1 - \epsilon_1}.    (10)

P_k is the probability measure w.r.t. f_k(x).
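
Equations (6)-(8) amount to thresholding a clipped version of the nominal likelihood ratio; the sketch below makes that explicit, with c', c'' and the contamination levels passed in as parameters (their values would come from Eqs. (9)-(10), not from this code).

```python
import numpy as np

def robust_lrt(log_lr, eps0, eps1, c_lo, c_hi, tau=1.0):
    """Robust decision rule of Eq. (8).

    log_lr : log of the nominal likelihood ratio f1(x)/f0(x)
    c_lo, c_hi : the clipping thresholds c' and c'' from Eqs. (9)-(10)
    """
    lr_clipped = np.clip(np.exp(log_lr), c_lo, c_hi)
    # Ratio of the least favorable densities, f1^L(x) / f0^L(x):
    robust_lr = (1.0 - eps1) / (1.0 - eps0) * lr_clipped
    return int(robust_lr >= tau)

# Example: a large nominal ratio gets clipped to c'' before the comparison.
print(robust_lrt(np.log(5.0), eps0=0.1, eps1=0.1, c_lo=0.5, c_hi=2.0))
```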

Underlying intuition

Choice of c' and c'': consider

    L(x) = \frac{g(x \mid H_1)}{g(x \mid H_0)} = \prod_{i=1}^{n} \frac{g(x_i \mid H_1)}{g(x_i \mid H_0)}.

If any factor in the product approaches 0 or ∞, L(x) is affected. Robustness is introduced by clipping the likelihood ratio to the range [c', c''].

Least favorable densities: choose f_0^L(x) as close as possible to f_1(x), and f_1^L(x) as close as possible to f_0(x).

Scalability challenge

RHT reduces to finding c' and c'' such that:

    P_0\!\left( \frac{f_1(x)}{f_0(x)} < c'' \right) + \frac{1}{c''} P_1\!\left( \frac{f_1(x)}{f_0(x)} \ge c'' \right) = \frac{1}{1 - \epsilon_0}

    P_1\!\left( \frac{f_1(x)}{f_0(x)} > c' \right) + c' \, P_0\!\left( \frac{f_1(x)}{f_0(x)} \le c' \right) = \frac{1}{1 - \epsilon_1}.

These are highly nonlinear equations and require Monte Carlo methods (sample generation) to solve. This scales very poorly with dimension and becomes computationally intractable.
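
As a sketch of the Monte Carlo step, Eq. (9) can be solved for c'' by replacing the probabilities with empirical frequencies over samples drawn from the nominal densities and handing the resulting scalar equation to a root finder; the bracketing interval below is an assumption, and Eq. (10) would be solved for c' analogously.

```python
import numpy as np
from scipy.optimize import brentq

def solve_c_hi(lr_at_f0_samples, lr_at_f1_samples, eps0):
    """Solve Eq. (9) for c'' given arrays of f1(x)/f0(x) evaluated at
    samples x ~ f0 and x ~ f1, respectively."""
    def residual(c):
        lhs = np.mean(lr_at_f0_samples < c) + np.mean(lr_at_f1_samples >= c) / c
        return lhs - 1.0 / (1.0 - eps0)
    # Bracket assumed wide enough to contain the root for this problem.
    return brentq(residual, 1e-3, 1e3)
```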

Probabilistic graphical models

A graph G = (V, E) is defined by a set of nodes V = {1, ..., n} and a set of edges E ⊆ V × V which connect pairs of nodes.

Graphical model: a random vector defined on a graph such that each node represents one (or more) random variables, and the edges reveal conditional dependencies.

- The underlying graph structure leads to a factorization of the joint probability distribution.
- Leverage efficient graph-based algorithms for statistical inference and learning.
- Trade-off between graph complexity and approximation accuracy.

Some graph structures

Tree: undirected acyclic graph with exactly (n - 1) edges.

Figure: a tree on nodes x_1, ..., x_7, with factorization

    f(x) = f(x_1) f(x_2 | x_1) f(x_3 | x_1) f(x_4 | x_2) f(x_5 | x_2) f(x_6 | x_3) f(x_7 | x_3).

Chow-Liu (1968): the optimal tree approximation reduces to a maximum weight spanning tree (MWST) problem.

Forest: graph with k < (n - 1) edges.

Junction tree: tree-structured graph with edges between clusters of nodes. Clusters connected by an edge have at least one common node.
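
As a sketch of the Chow-Liu idea for the zero-mean Gaussian case used later in the talk, the pairwise mutual information is I(x_i; x_j) = -1/2 log(1 - ρ_ij²), so the optimal tree is a maximum-weight spanning tree under those weights; the helper below is my own illustration, not code from the talk.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu_tree_gaussian(cov):
    """Return the edge list of the Chow-Liu tree for a zero-mean Gaussian
    with covariance `cov`, using Gaussian mutual information as edge weights."""
    d = np.sqrt(np.diag(cov))
    corr = cov / np.outer(d, d)
    mi = -0.5 * np.log(1.0 - np.clip(corr ** 2, 0.0, 1.0 - 1e-12))
    np.fill_diagonal(mi, 0.0)
    # Negate so that a minimum spanning tree maximizes total mutual information.
    # (Caveat: scipy drops exactly-zero entries, i.e. pairs with zero MI.)
    mst = minimum_spanning_tree(-mi).toarray()
    return list(zip(*np.nonzero(mst)))
```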

Block-tree graphs

Disjoint clusters of nodes, with only one path connecting any two clusters.

Figure: example of a block-tree graph with clusters {1,2,3}, {4,5,6}, {7,8,9}, {10,11,12}, {13,14,15}, {16,17,18}.

Benefits:
- Favorable complexity-performance trade-off
- Low cost of sample generation
- Efficient greedy algorithms to compute block-trees

Realizing RHT on block-tree graphs

Suppose f(x) is Gaussian with mean zero. The state-space model on the block-tree graph [Vats and Moura, 2010] is given as:

    x_{C_i} = A_i x_{C_{\Upsilon(i)}} + u_{C_i},    (11)

where u_{C_i} is white noise, C_{\Upsilon(i)} denotes the parent cluster of C_i, and

    A_i = E\!\left( x_{C_i} x_{C_{\Upsilon(i)}}^T \right) \left[ E\!\left( x_{C_{\Upsilon(i)}} x_{C_{\Upsilon(i)}}^T \right) \right]^{-1}    (12)

    E\!\left( u_{C_i} u_{C_i}^T \right) = E\!\left( x_{C_i} x_{C_i}^T \right) - A_i \, E\!\left( x_{C_{\Upsilon(i)}} x_{C_i}^T \right).    (13)

Computing c' and c'':
1. For each f_k(x), compute a block-tree graph G_k using a specified value of m (number of nodes in a cluster). Using recursive sampling, generate sample sets S_k, k = 0, 1.
2. Using S_0 and S_1, compute c' and c'' by Monte Carlo methods.
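
A minimal sketch of the recursive sampling step of Eqs. (11)-(13), here for the special case where the clusters form a chain C_1 → C_2 → ... (a block-tree in which each cluster has a single child); the covariance and the cluster partition are inputs assumed by this sketch. For a general block-tree, the same update would be applied from each parent cluster to each of its children.

```python
import numpy as np

def sample_block_chain(Sigma, clusters, rng):
    """Draw one sample of a zero-mean Gaussian x ~ N(0, Sigma) by recursing
    over a chain of clusters, per Eqs. (11)-(13).  `clusters` is a list of
    index arrays partitioning {0, ..., n-1}."""
    x = np.empty(Sigma.shape[0])
    root = clusters[0]
    x[root] = rng.multivariate_normal(np.zeros(len(root)), Sigma[np.ix_(root, root)])
    for parent, child in zip(clusters[:-1], clusters[1:]):
        S_cp = Sigma[np.ix_(child, parent)]            # E[x_Ci x_CY(i)^T]
        S_pp = Sigma[np.ix_(parent, parent)]           # E[x_CY(i) x_CY(i)^T]
        A = S_cp @ np.linalg.inv(S_pp)                 # Eq. (12)
        Q = Sigma[np.ix_(child, child)] - A @ S_cp.T   # Eq. (13): noise covariance
        x[child] = A @ x[parent] + rng.multivariate_normal(np.zeros(len(child)), Q)  # Eq. (11)
    return x
```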

Complexity benefits

Assuming Gaussianity, generating a sample from f(x) is O(n^3) (inversion of an n × n matrix). For L generated samples, the total complexity is O(L n^3).

Using a block-tree graph with cluster size m, computing the block-tree graph has complexity O(log n) + O(m n^2) ≈ O(m n^2), while generating samples has complexity O(r m^3) = O(m^2 n), where r = n/m is the number of clusters. For L' generated samples, the total complexity is O(L'(m n^2 + m^2 n)) ≈ O(L' m n^2).

This is a reduction in complexity for sparse graphical models, since m ≪ n and L' ≪ L.

Results

Figure: Error probability as a function of ε (= ε_0 = ε_1) for classical hypothesis testing and RHT.

Results

Figure: Error probability as a function of training size, for RHT and graph-based RHT with m = 10 and m = 20 (dense inverse covariance matrix).

Results

Figure: Error probability as a function of training size, for RHT and graph-based RHT (sparse inverse covariance matrix).

Results

Figure: Automatic target recognition: misclassification probability as a function of the number of training samples for graph-based RHT and RHT. Classification is performed on real-world SAR images.

Summary

- Real-world classification problems involve high-dimensional data, limited training, and noisy acquisition, hence the need for robust hypothesis tests.
- The minimax test optimizes worst-case decision performance via the pursuit of least favorable densities.
- RHT is computationally intractable for high-dimensional data.
- Approximating the densities by block-tree graphs and instantiating RHT on them yields significant computational benefits with a tolerable loss in classification performance.