Challenges (& Some Solutions) and Making Connections


Real-life Search All search algorithm theorems have the form: if the world behaves like ..., then (with probability 1) the algorithm will recover the true structure. This raises two types of problems: probability-1 proofs don't help in the short run, and the world might not fit our assumptions.

Search in the Short Run Bayes net learning algorithms can give the wrong answer if the data fail to reflect the true associations and independencies. Of course, this is a problem for all inference: we might just be really unlucky. Note: this is not (really) the problem of unrepresentative samples (e.g., black swans).

Convergence in Search In search, we would like to bound our possible error as we acquire data; i.e., we want search procedures that have uniform convergence. Without uniform convergence, we cannot set confidence intervals for inference, and not every Bayesian, regardless of priors over hypotheses, agrees on probable bounds, no matter how loose.

Pointwise Convergence Assume hypothesis H is true. Then, for any standard of closeness to H and any standard of successful refutation: for every hypothesis that is not close to H, there is a sample size at which that hypothesis is refuted.

Uniform Convergence Assume hypothesis H is true. Then, for any standard of closeness to H and any standard of successful refutation: there is a sample size such that every hypothesis H* that is not close to H is refuted.
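
The difference is easiest to see as a quantifier swap. A schematic formalization (the symbols ε for the closeness standard, α for the refutation standard, and N for sample size are placeholders, not notation from the slides):

\text{Pointwise: } \forall \varepsilon, \forall \alpha,\; \forall H^{*} \text{ not } \varepsilon\text{-close to } H,\; \exists N \text{ such that samples of size} \ge N \text{ refute } H^{*} \text{ at level } \alpha

\text{Uniform: } \forall \varepsilon, \forall \alpha,\; \exists N \text{ such that } \forall H^{*} \text{ not } \varepsilon\text{-close to } H, \text{ samples of size} \ge N \text{ refute } H^{*} \text{ at level } \alpha

Only the position of the existential quantifier over N changes: under uniform convergence, one sample size works for every rival hypothesis at once, which is what licenses finite-sample confidence bounds.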

Theorems about Convergence There are procedures that, for every model, pointwise converge to the OME class containing the true causal model. (Spirtes, Glymour, & Scheines, 1993) There is no procedure that, for every model, uniformly converges to the OME class containing the true causal model. (Robins, Scheines, Spirtes, & Wasserman, 1999; 2003)

Uncooperative World The world can fail to cooperate in several ways: mixed populations, the variable definition problem, structure mis-specification, and selection bias.

Mixed Populations A mixture of populations #1 and #2 can have different statistics than either population by itself, and the differences can be substantial: e.g., association in the mixture where the sub-populations all have independencies, or positive association when they all have negative associations, and so on.

Mixed Populations Simple example. Population #1: P(X) = 0.2, P(Y) = 0.2. Population #2: P(X) = 0.8, P(Y) = 0.8. X and Y are independent within each population. Now mix the populations 50% / 50%. In this particular mixture, P(X) = 0.5 but P(X | Y) = 0.68, so X and Y are associated in the mixture.
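
A quick numerical check of this example (a minimal sketch; the 0.2/0.8 marginals, the 50/50 mix, and within-population independence are as stated above):

# Mixing two populations in which X and Y are independent
# induces an association between X and Y in the mixture.
p1_x, p1_y = 0.2, 0.2   # population #1 marginals
p2_x, p2_y = 0.8, 0.8   # population #2 marginals

# Joint and marginals in the 50/50 mixture (independence within each population)
p_xy = 0.5 * (p1_x * p1_y) + 0.5 * (p2_x * p2_y)
p_x = 0.5 * p1_x + 0.5 * p2_x
p_y = 0.5 * p1_y + 0.5 * p2_y

print(p_x)          # 0.5
print(p_xy / p_y)   # P(X | Y) = 0.68 > P(X) = 0.5, so X and Y are associated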

Mixed Populations More complicated: consider the following data.

Men          Treated   Untreated
Alive            3        20
Dead             3        24
P(A | T) = 0.5       P(A | U) = 0.45

Women        Treated   Untreated
Alive           16         3
Dead            25         6
P(A | T) = 0.39      P(A | U) = 0.333

Treatment is superior in both groups!

Mixed Populations More complicated: pooling the same data, we are better off not being treated!

Pooled       Treated   Untreated
Alive           19        23
Dead            28        30
P(A | T) = 0.404     P(A | U) = 0.434
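
A short script to check both tables (a minimal sketch; the counts are exactly those above):

# Simpson's paradox: treatment looks better within each group,
# but worse once the groups are pooled.
groups = {
    "men":   {"treated": (3, 3),   "untreated": (20, 24)},   # (alive, dead)
    "women": {"treated": (16, 25), "untreated": (3, 6)},
}

def p_alive(alive, dead):
    return alive / (alive + dead)

pooled = {"treated": [0, 0], "untreated": [0, 0]}
for group, arms in groups.items():
    for arm, (alive, dead) in arms.items():
        pooled[arm][0] += alive
        pooled[arm][1] += dead
        print(group, arm, round(p_alive(alive, dead), 3))   # treatment wins in each group

for arm, (alive, dead) in pooled.items():
    print("pooled", arm, round(p_alive(alive, dead), 3))    # treatment loses when pooled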

Variable Definition Illustrative example: suppose you want to know the impact of Total Cholesterol (TC) on Heart Disease (HD). Problem: TC = Low Density Lipoprotein (LDL) + High Density Lipoprotein (HDL), but LDL raises HD risk (+) while HDL lowers it (-). There is no fact of the matter about the effect of TC on HD.

Variable Definition In particular, consider predicting the effect of an intervention on Total Cholesterol. We cannot make a determinate prediction, since it depends on how the intervention occurs: increase TC by raising LDL, or by raising HDL? Search is also difficult in this condition, and a similar situation arises with other variables.
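
To make the point concrete, here is a toy simulation (the structural equations and coefficients are invented for illustration, not taken from the talk): HD depends positively on LDL and negatively on HDL, so two interventions that each "raise TC by one unit" have different effects.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical baseline population (made-up distributions)
ldl = rng.normal(3.0, 1.0, n)
hdl = rng.normal(1.5, 0.5, n)

def hd(ldl, hdl):
    # Hypothetical heart-disease risk score: LDL harmful, HDL protective
    return 0.8 * ldl - 0.6 * hdl

baseline = hd(ldl, hdl).mean()
print(round(hd(ldl + 1.0, hdl).mean() - baseline, 2))   # raise TC via LDL: +0.8
print(round(hd(ldl, hdl + 1.0).mean() - baseline, 2))   # raise TC via HDL: -0.6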

Variable Definition Idea: just divide up the world into the finest-grained structure available to you. This won't work in all cases, but it will for many. Problem: lots of fine-grained structure doesn't make a causal difference (we don't need to know molecular structure to do lots of causal inference). So which differences matter? That depends on the causal structure, which is what we were trying to learn in the first place: a cycle.

Model Mis-specification If the model contains parameters that do not occur in the true structure (or vice versa), then the estimated parameters can be deeply misleading about the nature of the world (though formally correct). That is, the best estimated model might still be quite different from the truth.

Model Mis-specification Example: true model: X Y Z W; estimated model: X W Y Z. Some of the estimation errors: the P(W | X) parameters will all be (close-to-) identical; the P(Y | Z, W) parameters depend on W, even though W and Y are (in the data) independent; and so on.

Selection Bias Sometimes, a variable of interest is a cause of whether people get into the sample: e.g., measuring various skills or knowledge in college students, or measuring joblessness by a phone survey during the middle of the day. Simple problem: you might get a skewed picture of the population.

Selection Bias If two variables matter, then we have: Factor A → Sample ← Factor B, with Sample = 1 for everyone we measure. That is equivalent to conditioning on Sample, which induces an association between A and B!
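
A small simulation of this selection effect (a sketch under assumed distributions: A and B are independent standard normals, and a unit enters the sample only when A + B is large):

import numpy as np

rng = np.random.default_rng(1)
n = 200_000

a = rng.normal(size=n)
b = rng.normal(size=n)                 # A and B are marginally independent
sampled = (a + b) > 1.0                # both A and B cause selection into the sample

print(round(np.corrcoef(a, b)[0, 1], 3))                    # ~0.00 in the full population
print(round(np.corrcoef(a[sampled], b[sampled])[0, 1], 3))  # clearly negative among the sampled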

Connections Bayes nets can help us understand many other techniques (regression; factor analysis, PCA, ICA; Naïve Bayes classification) and give novel inspiration for other types of problems, such as Markov blanket classifiers.

Regression Given data D, find the coefficients for Y = Σ_i a_i X_i + ε_Y that minimize squared error in prediction. This is a relatively simple computation, and there are also nonlinear versions. A coefficient is non-zero iff X_i is not independent of Y given {X_1, ..., X_n} \ {X_i}.

Regression as Search Regression is a (classical) parameter estimation method that assumes a particular causal structure. But many people use regression for search (i.e., they treat a non-zero coefficient as evidence of a causal link). Regression is a reliable parameter estimator, but unreliable for search.

Regression as Search Example: a structure with an unobserved variable U and observed variables X, Y, and E (and knowledge that X and Y precede E in time). Regression concludes that X and Y both cause E; constraint-based Bayes net search concludes that neither X nor Y causes E.
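
The slide's exact graph is not reproduced here, but the general phenomenon is easy to simulate. In the sketch below (an assumed structure, not necessarily the one in the example), a latent U causes both X and E, Y causes X, and neither X nor Y causes E; regressing E on X and Y nevertheless returns clearly non-zero coefficients for both.

import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Assumed structure (for illustration only): U latent, Y -> X <- U, U -> E
u = rng.normal(size=n)
y = rng.normal(size=n)
x = y + u + 0.5 * rng.normal(size=n)
e = u + 0.5 * rng.normal(size=n)

# Ordinary least squares of E on X and Y (with an intercept)
design = np.column_stack([np.ones(n), x, y])
coef, *_ = np.linalg.lstsq(design, e, rcond=None)
print(np.round(coef, 2))   # X and Y coefficients are both far from zero, yet neither causes E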

Factor Analysis Assume linear equations. Given some set of (observed) features, determine the coefficients for (a fixed number of) unobserved variables that minimize the error.

Factor Analysis If we have one factor, then we find coefficients to minimize error in: F_i = a_i + b_i U, where U is the unobserved variable (with fixed mean and variance). With two factors, we minimize error in: F_i = a_i + b_{i,1} U_1 + b_{i,2} U_2.

Factor Analysis The decision about exactly how many factors to use is typically based on some simplicity vs. fit tradeoff. Also, the interpretation of the unobserved factors must be provided by the scientist; the data do not dictate the meaning of the unobserved factors (though it can sometimes be obvious).
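
For concreteness, a minimal sketch of one-factor analysis on simulated data using scikit-learn's FactorAnalysis (the loadings and noise level below are invented for illustration):

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(3)
n = 5_000

# Simulate features driven by a single unobserved factor U (made-up loadings)
u = rng.normal(size=n)
loadings = np.array([0.9, 0.7, 0.5, 0.8])
features = u[:, None] * loadings + 0.3 * rng.normal(size=(n, 4))

fa = FactorAnalysis(n_components=1)   # one unobserved factor
fa.fit(features)
print(np.round(fa.components_, 2))    # estimated loadings, close to the true ones (up to sign)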

Factor Analysis & Graphs One-variable factor analysis is equivalent to finding the ML parameter estimates for the SEM whose graph has U as the sole parent of every feature: U → F_1, U → F_2, ..., U → F_n.

Factor Analysis & Graphs Two-variable factor analysis is equivalent to finding the ML parameter estimates for the SEM whose graph has both U_1 and U_2 as parents of every feature: U_1 → F_i and U_2 → F_i for each i.

PCA and ICA Two widely-used analysis techniques: Principal Components Analysis (PCA) and Independent Components Analysis (ICA). Both are generalizations of factor analysis, and both have natural reconstructions / explanations in terms of Bayes nets.

Naïve Bayes Classifiers General classification problem: suppose we have classes C defined on a set of features F_1, ..., F_n. Classify a novel instance by computing P(C_i | F_1, ..., F_n). In general, this computation is hard!

Naïve Bayes Classifiers Naïve Bayes assumption: each feature is independent of every other feature conditional on the class. For all i, j, k: F_i ⊥ F_j | C_k. Hence P(C_k | F_1, ..., F_n) ∝ P(C_k) P(F_1 | C_k) ... P(F_n | C_k).
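
A hand-rolled sketch of this computation for binary features (the counts-based estimates and add-one smoothing are illustrative choices; library implementations such as scikit-learn's naive Bayes classifiers also work in log space):

import numpy as np

def naive_bayes_predict(X_train, y_train, x_new):
    """Score each class k by P(C_k) * prod_i P(F_i = x_new[i] | C_k), binary features."""
    scores = {}
    for k in np.unique(y_train):
        Xk = X_train[y_train == k]
        prior = len(Xk) / len(X_train)
        p_one = (Xk.sum(axis=0) + 1) / (len(Xk) + 2)   # P(F_i = 1 | C_k), add-one smoothing
        likelihood = np.prod(np.where(x_new == 1, p_one, 1 - p_one))
        scores[k] = prior * likelihood
    return max(scores, key=scores.get)

# Tiny made-up example: two binary features, two classes
X = np.array([[1, 1], [1, 0], [0, 0], [0, 1]])
y = np.array([1, 1, 0, 0])
print(naive_bayes_predict(X, y, np.array([1, 1])))   # predicts class 1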

Naïve Bayes Classifiers Advantages: fast classification of novel cases; large sample sizes for parameter estimation; a relatively small number of parameters. Disadvantages: the fundamental assumption almost never holds in the real world (i.e., the class almost never screens off the features), but sometimes it works well enough.

Naïve Bayes as a Graph Consider the graph in which C is the sole parent of every feature: C → F_1, C → F_2, ..., C → F_n. Markov + faithfulness for this graph corresponds exactly to the Naïve Bayes assumption.

Markov Blanket Classification Generally, doing classification well requires using only the variables that matter. The Markov Blanket of X is a subset of variables such that, conditional on that subset, X is independent of all of the other variables in the system; i.e., the Markov blanket is the set that screens off X from the rest of the system.

Markov Blanket Classification Classification is computing conditional probabilities, so the Markov blanket suffices for optimal classification: given the Markov blanket, no other variables in the system carry information about X. Graphical (causal) structure thus defines the inputs for the optimal classifier.

Markov Blanket Classification The Markov blanket of X contains: the children of X, the parents of X, and the parents of the children of X. Justification: the first two are obvious; knowing the children of X induces an association between X and the children's parents, because we are conditioning on a collider, as in Gas → Car ← Battery.

Markov Blanket Classification Consider X in a graph over the variables Z, Y, A, W, X, C, B, and D: the Markov blanket of X consists of the parents of X, the children of X, and the parents of the children of X.
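
The computation is mechanical once the graph is represented explicitly. Below is a small helper function (the example edges are hypothetical, not the graph from the slide):

from collections import defaultdict

def markov_blanket(parents, x):
    """Markov blanket of x in a DAG given as {node: set of parents}."""
    children = defaultdict(set)
    for node, pars in parents.items():
        for p in pars:
            children[p].add(node)
    blanket = set(parents.get(x, set())) | children[x]   # parents and children of x
    for child in children[x]:                            # parents of x's children ("spouses")
        blanket |= parents.get(child, set())
    blanket.discard(x)
    return blanket

# Hypothetical graph: Z -> A, A -> X, X -> C, B -> C
dag = {"X": {"A"}, "C": {"X", "B"}, "A": {"Z"}, "B": set(), "Z": set()}
print(markov_blanket(dag, "X"))   # {'A', 'B', 'C'} (in some order)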

Markov Blanket Classification Given some graph, we thus have an efficient method for calculating the inputs to an optimal classifier for any X. And in lots of cases, the quantitative side of the classifier is also easy to assemble. Of course, if you have the wrong variables, or bad measurements, then ...