Day 5: Generative models, structured classification


Day 5: Generative models, structured classification. Introduction to Machine Learning Summer School, June 18 - June 29, 2018, Chicago. Instructor: Suriya Gunasekar, TTI Chicago. 22 June 2018.

Topics so far
o linear regression
o classification: nearest neighbors, decision trees, logistic regression
Yesterday
o maximum margin classifiers, kernel trick
Today
o quick review of probability
o generative models: naive Bayes classifier
o structured prediction: conditional random fields

Several slides adapted from David Sontag, who in turn credits Luke Zettlemoyer, Carlos Guestrin, Dan Klein, and Vibhav Gogate.

Bayesian/probabilistic learning
Uses probability to model data and/or quantify uncertainty in predictions
o Systematic framework to incorporate prior knowledge
o Framework for composing and reasoning about uncertainty
o What is the confidence in the prediction given the observations so far?
Model assumptions need not hold (and often do not hold) in reality
o even so, many probabilistic models work really well in practice

Quick overview of random variables
Random variables: a variable about which we (may) have uncertainty
o e.g., W = weather tomorrow, or T = temperature
For a random variable X, the domain 𝒳 of X is the set of values X can take
Discrete random variables: the probability distribution is a table, e.g., Pr(W = sun) = 0.6
o For a discrete RV X: Pr(X = x) ≥ 0 for every x ∈ 𝒳, and ∑_{x ∈ 𝒳} Pr(X = x) = 1
Continuous random variable X with domain 𝒳 ⊆ ℝ
o Cumulative distribution function F_X(t) = Pr(X ≤ t); again F_X(t) ∈ [0, 1], with F_X(−∞) = 0 and F_X(+∞) = 1
o Probability density function (if it exists) p_X(t) = dF_X(t)/dt; it is always non-negative, but can be greater than 1

Quick overview of random variables
Expectation: for a discrete RV, E[f(X)] = ∑_{x ∈ 𝒳} f(x) Pr(X = x)
Mean: E[X]
Variance: E[(X − E[X])^2]
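A minimal sketch in Python of the formulas above, computing E[f(X)], the mean, and the variance directly from a probability table; the temperature values and probabilities are made up for illustration:

```python
# Minimal sketch: expectation, mean, and variance of a discrete random variable,
# computed directly from its probability table. The values below are made up.

pmf = {15: 0.2, 20: 0.5, 25: 0.3}   # Pr(T = t) for each value t; sums to 1

def expectation(pmf, f=lambda x: x):
    """E[f(X)] = sum_x f(x) * Pr(X = x)."""
    return sum(f(x) * p for x, p in pmf.items())

mean = expectation(pmf)                                   # E[X]
variance = expectation(pmf, f=lambda x: (x - mean) ** 2)  # E[(X - E[X])^2]

print(mean, variance)   # 20.5, 12.25
```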

Joint distributions
The joint distribution of random variables X_1, X_2, ..., X_d is defined for all x_1 ∈ 𝒳_1, x_2 ∈ 𝒳_2, ..., x_d ∈ 𝒳_d as
p(x_1, x_2, ..., x_d) = Pr(X_1 = x_1, X_2 = x_2, ..., X_d = x_d)
How many numbers are needed for d variables, each having a domain of K values?
o K^d!! Too many numbers; usually some assumption is made to reduce the number of probabilities

Marginal distribution
Sub-tables obtained by elimination of variables
Probability distribution of a subset of variables

Marginal distribution
Sub-tables obtained by elimination of variables
Probability distribution of a subset of variables
Given: a joint distribution p(x_1, x_2, ..., x_d) = Pr(X_1 = x_1, X_2 = x_2, ..., X_d = x_d) for x_1 ∈ 𝒳_1, x_2 ∈ 𝒳_2, ..., x_d ∈ 𝒳_d
Say we want to get a marginal of just x_1, x_2, x_4, that is, we want
p(x_1, x_2, x_4) = Pr(X_1 = x_1, X_2 = x_2, X_4 = x_4)
This can be obtained by marginalizing:
p(x_1, x_2, x_4) = ∑_{z_3 ∈ 𝒳_3} ∑_{z_5 ∈ 𝒳_5} ⋯ ∑_{z_d ∈ 𝒳_d} p(x_1, x_2, z_3, x_4, z_5, ..., z_d)
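A small Python sketch of this marginalization step, on a made-up joint table over three binary variables: the marginal is obtained by summing the joint over every value of the eliminated variable.

```python
from itertools import product

# Minimal sketch of marginalization: start from a full joint table over three
# binary variables (the values are made up; they just need to sum to 1) and
# eliminate X3 to get the marginal p(x1, x2).

joint = {(x1, x2, x3): 0.125 for x1, x2, x3 in product([0, 1], repeat=3)}  # uniform toy joint

def marginal(joint, keep):
    """Sum the joint over all variables whose positions are not in `keep`."""
    out = {}
    for assignment, p in joint.items():
        key = tuple(assignment[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

print(marginal(joint, keep=(0, 1)))   # p(x1, x2) = 0.25 for each of the 4 combinations
```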

Conditioning
Random variables X and Y with domains 𝒳 and 𝒴
Pr(X = x | Y = y) = Pr(X = x, Y = y) / Pr(Y = y)
Probability distributions of a subset of variables with fixed values of the others

Conditioning
Random variables X and Y with domains 𝒳 and 𝒴
Pr(X = x | Y = y) = Pr(X = x, Y = y) / Pr(Y = y)
Conditional expectation: E[f(X) | Y = y] = ∑_{x ∈ 𝒳} f(x) Pr(X = x | Y = y)
h(y) = E[f(X) | Y = y] is a function of y
h(Y) is a random variable with distribution given by Pr(h(Y) = h(y)) = Pr(Y = y)
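A short Python sketch of both formulas on a toy joint table: conditioning renormalizes the slice of the joint with Y = y, and the conditional expectation is an ordinary expectation under that renormalized slice. The table entries are made up.

```python
# Minimal sketch of conditioning on a toy joint table Pr(X = x, Y = y).
# The numbers are made up; they sum to 1.

joint = {('sun', 'warm'): 0.4, ('sun', 'cold'): 0.2,
         ('rain', 'warm'): 0.1, ('rain', 'cold'): 0.3}   # keys are (x, y)

def conditional(joint, y):
    """Pr(X = x | Y = y) = Pr(X = x, Y = y) / Pr(Y = y)."""
    pr_y = sum(p for (x, yy), p in joint.items() if yy == y)
    return {x: p / pr_y for (x, yy), p in joint.items() if yy == y}

def conditional_expectation(joint, y, f):
    """E[f(X) | Y = y] = sum_x f(x) * Pr(X = x | Y = y)."""
    return sum(f(x) * p for x, p in conditional(joint, y).items())

print(conditional(joint, 'warm'))   # {'sun': 0.8, 'rain': 0.2}
print(conditional_expectation(joint, 'warm', lambda x: 1.0 if x == 'sun' else 0.0))  # 0.8
```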

Product rule
Going from conditional distributions to the joint distribution
Pr(X = x | Y = y) = Pr(X = x, Y = y) / Pr(Y = y)
Pr(X = x, Y = y) = Pr(Y = y) Pr(X = x | Y = y)
What about three variables?
Pr(X_1 = x_1, X_2 = x_2, X_3 = x_3) = Pr(X_1 = x_1) Pr(X_2 = x_2 | X_1 = x_1) Pr(X_3 = x_3 | X_1 = x_1, X_2 = x_2)
More generally,
Pr(X_1 = x_1, X_2 = x_2, ..., X_d = x_d) = Pr(X_1 = x_1) ∏_{i=2}^{d} Pr(X_i = x_i | X_{i−1} = x_{i−1}, X_{i−2} = x_{i−2}, ..., X_1 = x_1)
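One way to read the chain rule is as a recipe for sampling from a joint distribution one variable at a time: draw X_1, then X_2 given X_1, then X_3 given both. A minimal Python sketch with made-up conditional tables over three binary variables:

```python
import random

# Minimal sketch: sampling (x1, x2, x3) from the chain-rule factorization
# Pr(X1) * Pr(X2 | X1) * Pr(X3 | X1, X2). All tables below are made up.

pr_x1 = {0: 0.7, 1: 0.3}                                     # Pr(X1 = x1)
pr_x2_given_x1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}  # Pr(X2 = x2 | X1 = x1)
pr_x3_given_x12 = {(x1, x2): {0: 0.5, 1: 0.5}                # Pr(X3 = x3 | X1, X2)
                   for x1 in (0, 1) for x2 in (0, 1)}

def sample_from(table):
    """Draw a value from a {value: probability} table."""
    return random.choices(list(table), weights=list(table.values()))[0]

x1 = sample_from(pr_x1)
x2 = sample_from(pr_x2_given_x1[x1])
x3 = sample_from(pr_x3_given_x12[(x1, x2)])
print(x1, x2, x3)
```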

Optimal unrestricted classifier
C-class classification problem: 𝒴 = {1, 2, ..., C}
Population distribution: let (x, y) ∼ D
Consider the population 0-1 loss (risk) of a classifier ŷ(x):
L(ŷ) = E_{(x,y)∼D}[1[y ≠ ŷ(x)]] = Pr(y ≠ ŷ(x)) = ∑_x Pr(x) Pr(y ≠ ŷ(x) | x)
Conditional risk: L(ŷ | x) = Pr(y ≠ ŷ(x) | x) = 1 − Pr(y = ŷ(x) | x)
Check that this is minimized for ŷ(x) = argmax_c Pr(y = c | x)
Optimal unrestricted classifier, or Bayes optimal classifier: ŷ*(x) = argmax_c Pr(y = c | x)
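Spelling out the "check that this is minimized" step as a short derivation: at a fixed x, the conditional risk depends on the prediction only through the posterior probability it selects, so it is minimized by picking the class with the largest posterior.

```latex
% Conditional risk of a classifier \hat{y} at a fixed input x:
%   L(\hat{y} \mid x) = \Pr\bigl(y \neq \hat{y}(x) \mid x\bigr)
%                     = 1 - \Pr\bigl(y = \hat{y}(x) \mid x\bigr).
% Minimizing over the predicted class, separately for each x:
\[
\hat{y}^{*}(x)
  \in \arg\min_{c \in \{1,\dots,C\}} \bigl( 1 - \Pr(y = c \mid x) \bigr)
  = \arg\max_{c \in \{1,\dots,C\}} \Pr(y = c \mid x).
\]
% Minimizing the conditional risk at every x also minimizes the overall risk
% L(\hat{y}) = \sum_x \Pr(x)\, L(\hat{y} \mid x), since \Pr(x) \ge 0.
```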

Generative vs discriminative models
Recall the optimal unrestricted predictor for the following cases:
o Regression + squared loss → f*(x) = E[y | x]
o Classification + 0-1 loss → ŷ*(x) = argmax_c Pr(y = c | x)
Non-probabilistic approach: don't deal with probabilities, just fit f(x) directly to the data
Discriminative models: estimate/infer the conditional density Pr(y | x)
o Typically uses a parametric model of Pr(y | x)
Generative models: estimate the full joint probability density Pr(y, x)
o Normalize to find the conditional density Pr(y | x)
o Specify models for Pr(x, y), or for Pr(x | y) and Pr(y)
o Why? In two slides!

Bayes rule
Optimal classifier: ŷ*(x) = argmax_c Pr(y = c | x)
Bayes rule: Pr(x, y) = Pr(y | x) Pr(x) = Pr(x | y) Pr(y)
ŷ*(x) = argmax_c Pr(y = c | x)
      = argmax_c Pr(x | y = c) Pr(y = c) / Pr(x)
      = argmax_c Pr(x | y = c) Pr(y = c)

Bayes rule
Optimal classifier: ŷ*(x) = argmax_c Pr(y = c | x) = argmax_c Pr(x | y = c) Pr(y = c)
Why is this helpful?
o One conditional might be tricky to model with prior knowledge, while the other is simple
o e.g., say we want to specify a model for digit recognition from binary images: compare specifying Pr(image | digit = 1) vs Pr(digit = 1 | image)

Generative model for classification
argmax_c Pr(y = c | x) = argmax_c Pr(x | y = c) Pr(y = c)
C-class classification with binary features x ∈ {0, 1}^d and y ∈ {1, 2, ..., C}
Want to specify Pr(x | y) = Pr(x_1, x_2, ..., x_d | y)
If, more generally, each of x_1, x_2, ..., x_d can take one of K values, how many parameters are needed to specify Pr(x | y)?
o ~C·K^d!! Too many

Naive Bayes assumption
Specifying Pr(x | y) = Pr(x_1, x_2, ..., x_d | y) in full requires ~C·K^d parameters
Naive Bayes assumption: features are independent given the class y
e.g., for two features: Pr(x_1, x_2 | y) = Pr(x_1 | y) Pr(x_2 | y)
more generally: Pr(x_1, x_2, ..., x_d | y) = Pr(x_1 | y) Pr(x_2 | y) ⋯ Pr(x_d | y) = ∏_{j=1}^{d} Pr(x_j | y)
Number of parameters if each of x_1, x_2, ..., x_d can take one of K values?
o ~C·K·d
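To make the savings concrete, a toy calculation in Python; the numbers (10 classes, 784 binary pixels, as in 28x28 binary digit images) are illustrative and not from the slides:

```python
# Toy arithmetic: parameter count with and without the naive Bayes assumption.
# Illustrative numbers: C = 10 classes, d = 784 binary features (K = 2 values each).
C, d, K = 10, 784, 2

full_joint_params = C * (K ** d - 1)   # one table with K^d - 1 free entries per class
naive_bayes_params = C * d * (K - 1)   # one Bernoulli parameter per feature per class

print(f"full conditional: ~{float(full_joint_params):.3e} parameters")
print(f"naive Bayes:       {naive_bayes_params} parameters")
```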

Naive Bayes classifier
Naive Bayes assumption: features are independent given the class: Pr(x_1, x_2, ..., x_d | y) = ∏_{j=1}^{d} Pr(x_j | y)
C classes, 𝒴 = {1, 2, ..., C}; d binary features, 𝒳 = {0, 1}^d
Model parameters: specify from prior knowledge and/or learn from data
o Priors Pr(y = c) → #parameters: C − 1
o Conditional probabilities Pr(x_j = 1 | y = c) → #parameters: C·d
o If x_1, ..., x_d take one of K discrete values rather than binary → #parameters: (K − 1)·C·d
o If x_1, ..., x_d are continuous, additionally model Pr(x_j | y = c) as some parametric distribution, e.g., a Gaussian Pr(x_j | y = c) = N(μ_{j,c}, σ), and estimate the parameters (μ_{j,c}, σ) from data
Classifier rule:
ŷ_NB(x) = argmax_c Pr(x_1, x_2, ..., x_d | y = c) Pr(y = c) = argmax_c Pr(y = c) ∏_{j=1}^{d} Pr(x_j | y = c)
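A minimal Python sketch of the classifier rule above for binary features. It uses a sum of log-probabilities in place of the product for numerical stability; the parameter values are placeholders standing in for the estimates derived on the later slides.

```python
import math

# Minimal sketch of the naive Bayes decision rule for d binary features and C classes:
#   predict argmax_c  log Pr(y = c) + sum_j log Pr(x_j | y = c).
# The parameters below are placeholders; in practice they are estimated from data.

prior = {0: 0.6, 1: 0.4}                         # Pr(y = c)
cond = {0: [0.1, 0.7, 0.5], 1: [0.8, 0.2, 0.5]}  # cond[c][j] = Pr(x_j = 1 | y = c)

def predict(x):
    """Return the class maximizing the (log) joint Pr(x, y = c) under naive Bayes."""
    scores = {}
    for c in prior:
        log_joint = math.log(prior[c])
        for j, xj in enumerate(x):
            p1 = cond[c][j]
            log_joint += math.log(p1 if xj == 1 else 1.0 - p1)
        scores[c] = log_joint
    return max(scores, key=scores.get)

print(predict([1, 0, 1]))   # -> 1
```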

Digit recognizer (figure; slide credit: David Sontag)

What has to be learned? (figure; slide credit: David Sontag)

MLE for parameters of NB
Training dataset S = {(x^(i), y^(i)) : i = 1, 2, ..., N}
Maximum likelihood estimation for naive Bayes with discrete features and labels; assume S has iid examples
o Prior: what is the probability of observing label y?
Pr(y = c) = (1/N) ∑_{i=1}^{N} 1[y^(i) = c]
o Conditional distribution:
Pr(x_j = z_j | y = c) = ∑_{i=1}^{N} 1[x_j^(i) = z_j, y^(i) = c] / ∑_{i=1}^{N} 1[y^(i) = c]
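A small Python sketch of these counting estimates for binary features; the tiny dataset is made up purely to exercise the formulas.

```python
from collections import Counter

# Minimal sketch of the MLE counting estimates for naive Bayes with binary features:
#   Pr(y = c)           = (# examples with label c) / N
#   Pr(x_j = 1 | y = c) = (# examples with x_j = 1 and label c) / (# examples with label c)
# The toy dataset below is made up.

X = [[1, 0], [1, 1], [0, 0], [0, 1]]   # N = 4 examples, d = 2 binary features
y = [1, 1, 0, 0]

N, d = len(X), len(X[0])
label_counts = Counter(y)

prior = {c: label_counts[c] / N for c in label_counts}
cond = {c: [sum(xi[j] for xi, yi in zip(X, y) if yi == c) / label_counts[c]
            for j in range(d)]
        for c in label_counts}

print(prior)   # {1: 0.5, 0: 0.5}
print(cond)    # {1: [1.0, 0.5], 0: [0.0, 0.5]}
```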

MLE for parameters of NB (figure; slide credit: David Sontag)

Smoothing for parameters of NB
Training dataset S = {(x^(i), y^(i)) : i = 1, 2, ..., N}
Maximum likelihood estimation for naive Bayes with discrete features and labels; assume S has iid examples
o Prior: Pr(y = c) = (1/N) ∑_{i=1}^{N} 1[y^(i) = c]
o Conditional distribution, smoothed by adding a small ε > 0 to each count:
Pr(x_j = z_j | y = c) = (∑_{i=1}^{N} 1[x_j^(i) = z_j, y^(i) = c] + ε) / (∑_{i=1}^{N} 1[y^(i) = c] + K·ε),
where K is the number of values x_j can take, so the smoothed probabilities still sum to 1 over z_j
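The same conditional estimate with the ε term added, as a short Python sketch. The K·ε normalizer is my reading of the smoothed denominator (it keeps the probabilities over the K feature values summing to 1); the toy dataset matches the earlier MLE sketch.

```python
# Minimal sketch of the smoothed conditional estimate for a discrete feature x_j:
#   Pr(x_j = v | y = c) = (count(x_j = v, y = c) + eps) / (count(y = c) + K * eps)
# with K = 2 values for a binary feature. The K*eps normalizer is an assumption
# chosen so that the smoothed probabilities over v sum to 1.

def smoothed_conditional(X, y, j, v, c, eps=1.0, K=2):
    """Smoothed estimate of Pr(x_j = v | y = c) from a list-of-lists dataset."""
    num = sum(1 for xi, yi in zip(X, y) if xi[j] == v and yi == c) + eps
    den = sum(1 for yi in y if yi == c) + K * eps
    return num / den

X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
print(smoothed_conditional(X, y, j=0, v=1, c=0))   # 0.25 instead of the MLE's 0.0
```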

Missing features
One of the key strengths of Bayesian approaches is that they can naturally handle missing data
What happens if we don't have the value of some feature x_j?
o e.g., an applicant's credit history is unknown
o e.g., some medical tests were not performed
How do we compute Pr(x_1, ..., x_{j−1}, ?, x_{j+1}, ..., x_d | y)?
o e.g., three coin tosses, E = {H, ?, T}
o Pr(E) = Pr({H, H, T}) + Pr({H, T, T})
More generally,
Pr(x_1, ..., x_{j−1}, ?, x_{j+1}, ..., x_d | y) = ∑_z Pr(x_1, ..., x_{j−1}, z, x_{j+1}, ..., x_d | y)
Slide credit: David Sontag

Missing features in naive Bayes
Pr(x_1, ..., x_{j−1}, ?, x_{j+1}, ..., x_d | y) = ∑_z Pr(x_1, ..., x_{j−1}, z, x_{j+1}, ..., x_d | y)
  = ∑_z Pr(z | y) ∏_{k ≠ j} Pr(x_k | y)
  = [∏_{k ≠ j} Pr(x_k | y)] ∑_z Pr(z | y)
  = ∏_{k ≠ j} Pr(x_k | y)
Simply ignore the missing values and compute the likelihood based only on the observed features; there is no need to fill in or explicitly model the missing values
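A minimal Python sketch of the resulting rule, reusing the placeholder parameters from the earlier prediction sketch: missing features (marked None) are simply skipped, which is exactly the marginalization derived above.

```python
import math

# Minimal sketch: naive Bayes prediction when some features are missing.
# Missing values (None) are skipped, which is the marginalization derived above.
# Parameters are placeholders, as in the earlier sketch.

prior = {0: 0.6, 1: 0.4}                         # Pr(y = c)
cond = {0: [0.1, 0.7, 0.5], 1: [0.8, 0.2, 0.5]}  # cond[c][j] = Pr(x_j = 1 | y = c)

def predict_with_missing(x):
    """argmax_c log Pr(y = c) + sum over *observed* j of log Pr(x_j | y = c)."""
    scores = {}
    for c in prior:
        log_joint = math.log(prior[c])
        for j, xj in enumerate(x):
            if xj is None:            # missing feature: marginalized out, so skip it
                continue
            p1 = cond[c][j]
            log_joint += math.log(p1 if xj == 1 else 1.0 - p1)
        scores[c] = log_joint
    return max(scores, key=scores.get)

print(predict_with_missing([1, None, 1]))   # -> 1
```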

Naive Bayes (generative model)
o Models Pr(x | y) and Pr(y)
o Prediction: models the full joint distribution and uses Bayes rule to get Pr(y | x)
o Can generate data given a label
o Naturally handles missing data

Logistic Regression (discriminative model)
o Models Pr(y | x)
o Prediction: directly models what we want, Pr(y | x)
o Cannot generate data
o Cannot handle missing data easily