Machine Learning 4771

Machine Learning 4771. Instructor: Tony Jebara

Topic 14: Structuring Probability Functions for Storage; Structuring Probability Functions for Inference; Basic Graphical Models; Graphical Models; Parameters as Nodes

Structuring PDFs for Storage. Probability tables quickly grow if p has many variables: p(x) = p(flu?, headache?, ..., temperature?). For D true/false medical variables the joint probability table has 2^D entries, an exponential blow-up of storage. Example: 8x8 binary images of digits already give D = 64. If each variable is instead multinomial with M choices, how big are the tables? As in Naïve Bayes or the Multivariate Bernoulli model, if the variables were independent things are much more efficient: p(x) = p(flu?) p(headache?) ... p(temperature?). Each variable then needs only its own small table (e.g. [0.73, 0.27], [0.2, 0.8], [0.54, 0.46]), for 2D numbers in total (really even fewer than that, since each table sums to one).
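
A quick sanity check of the storage argument (a minimal sketch; the variable count D is made up for illustration):

```python
# Storage needed for a joint table over D binary variables
# versus D independent per-variable tables.
D = 20  # hypothetical number of true/false medical variables

joint_entries = 2 ** D          # full joint probability table
independent_entries = 2 * D     # one [p, 1-p] table per variable

print(f"joint table:       {joint_entries:,} entries")       # 1,048,576
print(f"independent model: {independent_entries} entries")   # 40
```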

Structuring PDFs for Inference. Inference: the goal is to predict some variables given others. x1: flu, x2: fever, x3: sinus infection, x4: temperature, x5: sinus swelling, x6: headache. A patient claims headache and high temperature; does he have a flu? Given finding variables X_f and unknown variables X_u, predict the queried variables X_q. Classical approach: truth tables (slow) or logic networks. Modern approach: probability tables (slow) or Bayesian networks (fast belief propagation, junction tree algorithm).

From Logic Nets to Bayes Nets. In the 1980s, expert systems & logic networks became popular, built from hard rules such as:

x1 x2 | x1 v x2 | x1 ^ x2 | x1 -> x2
 T  T |    T    |    T    |    T
 T  F |    T    |    F    |    F
 F  T |    T    |    F    |    T
 F  F |    F    |    F    |    T

Problem: inconsistency, two paths can give different answers. Problem: rules are hard; instead use soft probability tables. For the rule x3 = x1 ^ x2, the hard table has p(x3=1 | x1, x2) = 1 only when x1 = x2 = 1 (and p(x3=0 | x1, x2) = 1 otherwise); a soft table replaces this with, e.g., p(x3=1 | x1, x2) = 0.2, 0.3, 0.3, 0.9 and p(x3=0 | x1, x2) = 0.8, 0.7, 0.7, 0.1 over the four parent settings (x1, x2) = (0,0), (0,1), (1,0), (1,1). These directed graphs are called Bayesian Networks.
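
The hard rule and its soft replacement can be written down directly as arrays (a sketch; the numbers are the ones from the slide):

```python
import numpy as np

# p(x3=1 | x1, x2) indexed as table[x1, x2]
hard_cpt = np.array([[0.0, 0.0],   # x1=0: x3 is never 1
                     [0.0, 1.0]])  # x1=1: x3=1 only when x2=1 (x3 = x1 AND x2)

soft_cpt = np.array([[0.2, 0.3],   # x1=0
                     [0.3, 0.9]])  # x1=1: x3 is *probably* 1 when both parents are 1

# p(x3=0 | x1, x2) is just the complement, so the pair of entries
# [p(x3=0 | .), p(x3=1 | .)] sums to one for every parent setting.
print(1.0 - soft_cpt)  # [[0.8 0.7], [0.7 0.1]]
```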

Graphical Models & Bayes Nets. Independence assumptions make probability tables smaller, but real events in the world are not completely independent! Complete independence is unrealistic. Graphical models use a graph to describe more subtle dependencies and independencies, namely conditional independencies (like causality, but not exactly). A Directed Graphical Model, also called a Bayesian Network, uses a directed acyclic graph (DAG). Neural Network = graphical function representation; Bayesian Network = graphical probability representation.

Graphical Models & Bayes Nets. Node: a random variable (discrete or continuous). No link: independent; link: dependent. Arrow: from parent to child (like causality, but not exactly). Child: destination of the arrow, the response. Parent: root of the arrow, the trigger; parents of child i = pa_i = π_i. The graph encodes dependence/independence and shows the factorization of the joint: joint = product of conditionals, p(x1, ..., xn) = ∏_i p(xi | pa_i) = ∏_i p(xi | π_i), over a DAG (directed acyclic graph). Unlinked nodes: p(x, y) = p(x) p(y); linked x → y: p(x, y) = p(y | x) p(x).
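
To make "joint = product of conditionals" concrete, here is a minimal sketch for the two-node graph x → y (the table entries are made up):

```python
import numpy as np

p_x = np.array([0.7, 0.3])                 # p(x)
p_y_given_x = np.array([[0.9, 0.1],        # p(y | x=0)
                        [0.4, 0.6]])       # p(y | x=1)

# Joint from the factorization p(x, y) = p(x) p(y | x)
p_xy = p_x[:, None] * p_y_given_x
print(p_xy)            # full 2x2 joint table
print(p_xy.sum())      # 1.0 -- a joint sums to one over *all* entries
```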

Basic Graphical Models. Independence: all nodes are unlinked. Shading: a variable is observed; conditioning on it moves it to the right of the bar in the pdf. Examples of the simplest conditional independence situations, all with p(x1, ..., xn) = ∏_i p(xi | pa_i) = ∏_i p(xi | π_i). (1) Markov chain x → y → z: p(x, y, z) = p(x) p(y | x) p(z | y). Example binary events: x = president says war, y = general orders attack, z = soldier shoots gun. Here x is conditionally independent of z given y: p(x | y, z) = p(x, y, z) / p(y, z) = p(x | y).
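
A minimal numerical check of the Markov-chain independence x ⟂ z | y (the three conditional tables below are invented for illustration):

```python
import numpy as np

p_x = np.array([0.6, 0.4])                       # p(x)
p_y_x = np.array([[0.8, 0.2], [0.3, 0.7]])       # p(y | x), rows indexed by x
p_z_y = np.array([[0.9, 0.1], [0.2, 0.8]])       # p(z | y), rows indexed by y

# Joint p(x, y, z) = p(x) p(y | x) p(z | y)
joint = p_x[:, None, None] * p_y_x[:, :, None] * p_z_y[None, :, :]

# p(x | y, z) should not depend on z once y is known.
p_x_given_yz = joint / joint.sum(axis=0, keepdims=True)     # p(x | y, z)
p_xy = joint.sum(axis=2)                                    # p(x, y)
p_x_given_y = p_xy / p_xy.sum(axis=0, keepdims=True)        # p(x | y)

print(np.allclose(p_x_given_yz, p_x_given_y[:, :, None]))   # True
```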

Basic Graphical Models. (2) One cause, two effects, x ← y → z: y = flu, x = sore throat, z = temperature; p(x, y, z) = p(y) p(x | y) p(z | y). (3) Two causes, one effect, x → y ← z: x = rain, y = wet driveway, z = car oil leak; p(x, y, z) = p(x) p(z) p(y | x, z). The causes are marginally independent but become dependent once the effect is observed: explaining away. Each conditional is a mini-table (Multinomial or Bernoulli) conditioned on the parents.

Basic Graphical Models. The same two structures with another example for the two-causes, one-effect case: x = dad is diabetic, y = child is diabetic, z = mom is diabetic; p(x, y, z) = p(x) p(z) p(y | x, z). Again the parents are marginally independent, but observing that the child is diabetic makes them dependent (explaining away). Each conditional is a mini-table (Multinomial or Bernoulli) conditioned on the parents.
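
Explaining away can also be checked numerically: for the v-structure x → y ← z the causes are independent a priori but become dependent once the effect is observed. A sketch with made-up numbers for the diabetes example:

```python
import numpy as np

p_x = np.array([0.9, 0.1])   # p(dad diabetic):  p(x=0), p(x=1)
p_z = np.array([0.9, 0.1])   # p(mom diabetic):  p(z=0), p(z=1)
# p(child diabetic = 1 | dad, mom), indexed [x, z]
p_y1_xz = np.array([[0.05, 0.5],
                    [0.5,  0.9]])

# Joint over (x, z, y) from p(x) p(z) p(y | x, z)
p_y_xz = np.stack([1 - p_y1_xz, p_y1_xz], axis=-1)              # [x, z, y]
joint = p_x[:, None, None] * p_z[None, :, None] * p_y_xz

# Marginally the parents are independent: p(x=1, z=1) = p(x=1) p(z=1)
print(np.isclose(joint.sum(axis=2)[1, 1], p_x[1] * p_z[1]))     # True

# Conditioned on a diabetic child (y=1), learning the mom is diabetic
# lowers the probability that the dad is: the mom "explains away" the child.
post = joint[:, :, 1] / joint[:, :, 1].sum()                    # p(x, z | y=1)
p_dad_given_child = post.sum(axis=1)[1]                         # p(x=1 | y=1)
p_dad_given_child_mom = post[1, 1] / post[:, 1].sum()           # p(x=1 | y=1, z=1)
print(p_dad_given_child, p_dad_given_child_mom)                 # second is smaller
```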

Graphical Models. Example: factorization of the following system of variables using p(x1, ..., xn) = ∏_i p(xi | pa_i) = ∏_i p(xi | π_i). Start of the build-up: p(x1, ..., x6) = p(x1) · ...

Graphical Models. Example: factorization of the following system of variables, p(x1, ..., xn) = ∏_i p(xi | pa_i) = ∏_i p(xi | π_i). Building up one node at a time:
p(x1, ..., x6) = p(x1)
= p(x1) p(x2 | x1)
= p(x1) p(x2 | x1) p(x3 | x1)
= p(x1) p(x2 | x1) p(x3 | x1) p(x4 | x2)
= p(x1) p(x2 | x1) p(x3 | x1) p(x4 | x2) p(x5 | x3)
= p(x1) p(x2 | x1) p(x3 | x1) p(x4 | x2) p(x5 | x3) p(x6 | x2, x5)
How big are these tables (if the variables are binary)?

Graphical Models. Example: factorization of the following system of variables, p(x1, ..., xn) = ∏_i p(xi | pa_i) = ∏_i p(xi | π_i). Interpretation??? p(x1, ..., x6) = p(x1) p(x2 | x1) p(x3 | x1) p(x4 | x2) p(x5 | x3) p(x6 | x2, x5)

Graphical Models. Example: factorization of the following system of variables, p(x1, ..., xn) = ∏_i p(xi | pa_i) = ∏_i p(xi | π_i). Interpretation: 1: flu, 2: fever, 3: sinus infection, 4: temperature, 5: sinus swelling, 6: headache. p(x1, ..., x6) = p(x1) p(x2 | x1) p(x3 | x1) p(x4 | x2) p(x5 | x3) p(x6 | x2, x5)
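
Under this factorization, the inference question from earlier (headache and high temperature observed; is flu likely?) can be answered by brute-force enumeration, summing the joint over the unobserved variables. A minimal sketch; all CPT numbers are invented for illustration:

```python
import itertools
import numpy as np

# Hypothetical CPTs for the binary network
# p(x1) p(x2|x1) p(x3|x1) p(x4|x2) p(x5|x3) p(x6|x2,x5)
p1 = np.array([0.95, 0.05])                         # flu
p2 = np.array([[0.9, 0.1], [0.2, 0.8]])             # fever | flu
p3 = np.array([[0.95, 0.05], [0.7, 0.3]])           # sinus infection | flu
p4 = np.array([[0.9, 0.1], [0.15, 0.85]])           # high temperature | fever
p5 = np.array([[0.95, 0.05], [0.2, 0.8]])           # sinus swelling | sinus infection
p6 = np.array([[[0.9, 0.1], [0.3, 0.7]],            # headache | fever, swelling
               [[0.4, 0.6], [0.1, 0.9]]])           # indexed [x2, x5, x6]

def joint(x1, x2, x3, x4, x5, x6):
    return (p1[x1] * p2[x1, x2] * p3[x1, x3] *
            p4[x2, x4] * p5[x3, x5] * p6[x2, x5, x6])

# Observe headache (x6=1) and high temperature (x4=1); sum out x2, x3, x5.
def evidence_prob(x1):
    return sum(joint(x1, x2, x3, 1, x5, 1)
               for x2, x3, x5 in itertools.product([0, 1], repeat=3))

p_flu = evidence_prob(1) / (evidence_prob(0) + evidence_prob(1))
print(f"p(flu | headache, high temp) = {p_flu:.3f}")
```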

Graphical Models. Normalizing probability tables: joint distributions sum to 1, BUT conditionals sum to 1 for each setting of the parents. For binary variables:
p(x): 2 entries, 1 constraint (1 free parameter): Σ_x p(x) = 1
p(x, y): 4 entries, 1 constraint (3 free): Σ_{x,y} p(x, y) = 1
p(x | y): 4 entries, 2 constraints (2 free): Σ_x p(x | y=0) = 1 and Σ_x p(x | y=1) = 1
p(x, y, z): 8 entries, 1 constraint (7 free): Σ_{x,y,z} p(x, y, z) = 1
p(x | y, z): 8 entries, 4 constraints (4 free): Σ_x p(x | y, z) = 1 for each of the four settings (y, z) in {0,1} x {0,1}
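
The normalization rule translates directly into code: a conditional table is normalized over the child's values separately for every parent configuration, not over the whole array. A small sketch with an arbitrary unnormalized table:

```python
import numpy as np

counts = np.array([[3., 1.],     # unnormalized table indexed [parent y, child x]
                   [2., 6.]])

p_x_given_y = counts / counts.sum(axis=1, keepdims=True)  # one sum per parent row
print(p_x_given_y)              # each row sums to 1
print(p_x_given_y.sum(axis=1))  # [1. 1.]

p_xy = counts / counts.sum()    # a *joint* would instead sum to 1 overall
print(p_xy.sum())               # 1.0
```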

Graphical Models. Example: factorization of the following system of variables, p(x1, ..., xn) = ∏_i p(xi | pa_i) = ∏_i p(xi | π_i). Interpretation: 1: flu, 2: fever, 3: sinus infection, 4: temperature, 5: sinus swelling, 6: headache. p(x1, ..., x6) = p(x1) p(x2 | x1) p(x3 | x1) p(x4 | x2) p(x5 | x3) p(x6 | x2, x5). Counting degrees of freedom for binary variables: the full joint needs 2^6 - 1 = 63 free parameters, while the factorized form needs 1 + 2 + 2 + 2 + 2 + 4 = 13. That is 63 vs. 13 degrees of freedom.
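
The 63-versus-13 count can be reproduced from the parent sets alone (a sketch; the parent lists are the ones in the factorization above):

```python
# parents of each binary variable in the medical example
parents = {1: [], 2: [1], 3: [1], 4: [2], 5: [3], 6: [2, 5]}

full_joint_dof = 2 ** len(parents) - 1                                  # 63
factored_dof = sum((2 - 1) * 2 ** len(pa) for pa in parents.values())   # 13
print(full_joint_dof, factored_dof)
```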

Parameters as Nodes. Consider the model parameter θ ALSO as a random variable. We would then need a prior distribution p(θ); ignore that for now. Recall Naïve Bayes, where word probabilities are independent. Text, Multivariate Bernoulli over D = 50000 words: p(x | α) = ∏_{d=1}^{D} α_d^{x_d} (1 − α_d)^{1 − x_d}. Text, Multinomial over word counts: p(X | α) = [(Σ_{m=1}^{M} X_m)! / ∏_{m=1}^{M} X_m!] ∏_{m=1}^{M} α_m^{X_m}.
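
Both text likelihoods are straightforward to evaluate in log space. A sketch with a tiny made-up vocabulary and counts; the 50,000-word case works the same way:

```python
import numpy as np
from scipy.special import gammaln  # gammaln(n + 1) = log(n!)

# Multivariate Bernoulli: alpha_d = p(word d appears), x_d in {0, 1}
alpha_bern = np.array([0.2, 0.7, 0.4])
x = np.array([1, 1, 0])
log_p_bern = np.sum(x * np.log(alpha_bern) + (1 - x) * np.log(1 - alpha_bern))

# Multinomial: alpha is a distribution over words, X_m are word counts
alpha_mult = np.array([0.1, 0.6, 0.3])
X = np.array([2, 5, 1])
log_p_mult = (gammaln(X.sum() + 1) - gammaln(X + 1).sum()
              + np.sum(X * np.log(alpha_mult)))
print(log_p_bern, log_p_mult)
```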

Continuous Conditional Models. In the previous slide, θ and α were random variables in the graph, but θ and α are continuous. A network can have both discrete & continuous nodes. The joint then factorizes into conditionals that are either: (1) discrete conditional probability tables, or (2) continuous conditional probability distributions. The most popular continuous distribution is the Gaussian.
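
One common way to mix the two node types is to let a continuous child depend on a discrete parent through a Gaussian whose mean changes with the parent's value. A minimal sketch; all parameters are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete parent: flu in {0, 1}; continuous child: temperature in Celsius.
p_flu = np.array([0.9, 0.1])
mean_temp = {0: 36.8, 1: 39.0}   # p(temp | flu) = N(mean_temp[flu], sigma^2)
sigma = 0.5

# Ancestral sampling from the two-node network flu -> temperature
flu = rng.choice([0, 1], p=p_flu)
temp = rng.normal(mean_temp[flu], sigma)
print(flu, temp)

# Density of the continuous conditional, used in place of a CPT entry
def p_temp_given_flu(t, flu):
    return np.exp(-0.5 * ((t - mean_temp[flu]) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

print(p_temp_given_flu(38.5, flu))
```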

Graphical Models. In EM, we saw how to handle nodes that are observed (shaded), hidden variables (E-step), or parameters (M-step). But we only considered simple iid, single-parent structures. More generally, we have an arbitrary DAG (directed, without loops). Notation: G = {X, E} = {nodes/random vars, edges}, X = {x1, ..., xM}, E = {(xi, xj) : i → j}, with subsets such as X_c = {x1, x3, x4}. We want to do four things with these graphical models: (1) learn parameters (to fit to data), (2) query independence/dependence, (3) perform inference (get marginals / max a posteriori), (4) compute likelihood (e.g. for classification).

Graphical Models. The graph factorizes the probability: p(x1, ..., x6) = p(x1) p(x2 | x1) p(x3 | x1) p(x4 | x2) p(x5 | x3) p(x6 | x2, x5) = ∏_i p(xi | π_i). Topological order: the nodes are ordered so that parents π_i come before their children. Question: which is the more general graph?

Graphical Models. The graph factorizes the probability: p(x1, ..., x6) = p(x1) p(x2 | x1) p(x3 | x1) p(x4 | x2) p(x5 | x3) p(x6 | x2, x5) = ∏_i p(xi | π_i). Topological order: the nodes are ordered so that parents π_i come before their children. Question: which is the more general graph? Answer: the graph with more edges. Its conditional probability tables can be chosen to make the busier graph look exactly like the simpler graph.
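
The answer can be seen directly in the tables: if we add an extra edge, say x3 → x4 on top of x2 → x4, and fill the larger table with values that ignore x3, the busier graph reproduces the simpler one exactly. A sketch; the edge choice and numbers are for illustration only:

```python
import numpy as np

# Simpler graph: p(x4 | x2) only
p_x4_given_x2 = np.array([[0.9, 0.1],
                          [0.15, 0.85]])          # rows indexed by x2

# Busier graph: p(x4 | x2, x3) -- fill the bigger table so x3 has no effect
p_x4_given_x2_x3 = np.stack([p_x4_given_x2, p_x4_given_x2], axis=1)  # [x2, x3, x4]

# Every slice over x3 matches the simpler conditional, so the busier
# (more general) graph can represent everything the simpler one can.
print(np.allclose(p_x4_given_x2_x3[:, 0, :], p_x4_given_x2))  # True
print(np.allclose(p_x4_given_x2_x3[:, 1, :], p_x4_given_x2))  # True
```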