Generative modeling of data. Instructor: Taylor Berg-Kirkpatrick. Slides: Sanjoy Dasgupta.

Parametric versus nonparametric classifiers. Nearest neighbor classification: the size of the classifier grows with the size of the data set. Nonparametric: model complexity (number of parameters) is not fixed. Good: can fit any decision boundary, given enough data. Bad: typically needs a lot of data. Parametric classifiers: a fixed number of parameters.

Parametric classifiers. A fixed number of parameters: e.g., lines (linear decision boundaries). Good: can get a reasonable model with limited data. Bad: less flexibility in modeling. Basic principle: pick a good approximation.

The generative approach to classification. Two ways to classify: Generative: model the individual classes. Discriminative: model the decision boundary between the classes.

Recall: Bayes' rule. Pr(A | B) is the probability that event A occurs given that we have observed B occur: Pr(A | B) = Pr(A and B) / Pr(B) = Pr(B | A) Pr(A) / Pr(B). Example: Coin 1 has heads probability 1/6 and Coin 2 has heads probability 1/3. We close our eyes, pick a coin at random, and flip it. It comes out heads. What is the probability it is Coin 1?
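A quick sanity check of the coin example, working Bayes' rule through in exact arithmetic (the only inputs are the numbers on the slide):

```python
# Bayes' rule for the coin example: Pr(Coin 1 | heads)
from fractions import Fraction

prior = {1: Fraction(1, 2), 2: Fraction(1, 2)}      # a coin is picked uniformly at random
p_heads = {1: Fraction(1, 6), 2: Fraction(1, 3)}    # Pr(heads | coin)

# Pr(heads) = sum_c Pr(heads | coin c) * Pr(coin c)
evidence = sum(p_heads[c] * prior[c] for c in prior)

# Pr(Coin 1 | heads) = Pr(heads | Coin 1) * Pr(Coin 1) / Pr(heads)
posterior_1 = p_heads[1] * prior[1] / evidence
print(posterior_1)   # 1/3
```

So observing heads shifts the posterior toward Coin 2, which makes sense: heads is twice as likely under Coin 2 as under Coin 1.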

Generative models. Example: data space X = R, classes/labels Y = {1, 2, 3}.
[Figure: class-conditional densities P_1(x), P_2(x), P_3(x) over x, with class weights π_1 = 10%, π_2 = 50%, π_3 = 40%.]
Capture the joint distribution over (x, y) via: the probabilities of the classes, π_j = Pr(y = j), and the distributions of the individual classes, P_1(x), P_2(x), P_3(x). Overall distribution: Pr(x, y) = π_y P_y(x).
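To make the generative story concrete, here is a minimal sketch of sampling from such a model; the three Gaussian class-conditional densities and their parameters are illustrative stand-ins, not the densities drawn on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

# Class weights pi_j = Pr(y = j) from the running example
pi = np.array([0.10, 0.50, 0.40])

# Hypothetical class-conditional densities P_j: univariate Gaussians with made-up parameters
means = np.array([-2.0, 0.0, 3.0])
stds = np.array([0.5, 1.0, 0.8])

def sample(n):
    """Draw (x, y) pairs from the joint distribution Pr(x, y) = pi_y * P_y(x)."""
    y = rng.choice(3, size=n, p=pi)        # first pick a class according to pi
    x = rng.normal(means[y], stds[y])      # then draw x from that class's density
    return x, y + 1                        # report labels as 1, 2, 3, matching the slide

x, y = sample(5)
print(list(zip(np.round(x, 2), y)))
```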

Optimal predictions. For any data point x ∈ X and any candidate label j, Pr(y = j | x) = Pr(y = j) Pr(x | y = j) / Pr(x) = π_j P_j(x) / Σ_{i=1}^k π_i P_i(x). Optimal prediction: the class j with the largest π_j P_j(x).
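As a sketch of this computation, the posterior is just the vector of π_j P_j(x) values renormalized; the class weights are the running example's, while the Gaussian densities below are placeholders for whatever class-conditional distributions P_j are in play:

```python
import numpy as np
from scipy.stats import norm

# Class weights from the running example, plus illustrative class-conditional densities P_j
pi = np.array([0.10, 0.50, 0.40])
densities = [norm(-2.0, 0.5), norm(0.0, 1.0), norm(3.0, 0.8)]

def posterior(x):
    """Pr(y = j | x) = pi_j P_j(x) / sum_i pi_i P_i(x), for j = 1, ..., k."""
    scores = np.array([p * d.pdf(x) for p, d in zip(pi, densities)])
    return scores / scores.sum()

post = posterior(0.3)
print(post, "-> optimal prediction: class", int(np.argmax(post)) + 1)
```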

A classification problem. You have a bottle of wine whose label is missing. Which winery is it from: 1, 2, or 3? Solve this problem using visual and chemical properties of the wine.

The data set. Training set obtained from 130 bottles: Winery 1: 43 bottles, Winery 2: 51 bottles, Winery 3: 36 bottles. For each bottle, 13 features: Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, Proline. Also, a separate test set of 48 labeled points.

Recall: the generative approach. For any data point x ∈ X and any candidate label j, Pr(y = j | x) = Pr(y = j) Pr(x | y = j) / Pr(x) = π_j P_j(x) / Σ_{i=1}^k π_i P_i(x). Optimal prediction: the class j with the largest π_j P_j(x).

Fitting a generative model. Training set of 130 bottles: Winery 1: 43 bottles, Winery 2: 51 bottles, Winery 3: 36 bottles. For each bottle, 13 features: Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, Proline. Class weights: π_1 = 43/130 ≈ 0.33, π_2 = 51/130 ≈ 0.39, π_3 = 36/130 ≈ 0.28. We also need distributions P_1, P_2, P_3, one per class. Base these on a single feature: Alcohol.
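A minimal sketch of this fitting step, assuming the training data is available as a NumPy array alcohol holding the Alcohol feature and an array winery of labels in {1, 2, 3} (both names are hypothetical, not from the lecture):

```python
import numpy as np

def fit_generative_model(alcohol, winery):
    """Estimate class weights pi_j and per-class Gaussian parameters (mu_j, var_j)."""
    params = {}
    n = len(winery)
    for j in (1, 2, 3):
        x_j = alcohol[winery == j]
        params[j] = {
            "pi": len(x_j) / n,    # class weight, e.g. 43/130 for winery 1
            "mu": x_j.mean(),      # mean of the Alcohol feature within class j
            "var": x_j.var(),      # variance of the Alcohol feature within class j
        }
    return params
```

Under maximum likelihood, the class weight is just the empirical fraction of bottles from that winery, and the Gaussian parameters are the within-class sample mean and variance.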

The univariate Gaussian. The Gaussian N(µ, σ²) has mean µ, variance σ², and density function p(x) = (2πσ²)^(−1/2) exp(−(x − µ)² / (2σ²)).
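For concreteness, the density can be evaluated directly and checked against scipy.stats.norm; the particular µ and σ² below (winery 1's values from the next slide) are just for illustration:

```python
import numpy as np
from scipy.stats import norm

def gaussian_pdf(x, mu, var):
    """p(x) = (2*pi*var)^(-1/2) * exp(-(x - mu)^2 / (2*var))."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

print(gaussian_pdf(13.5, 13.72, 0.20))                 # direct use of the formula
print(norm(loc=13.72, scale=np.sqrt(0.20)).pdf(13.5))  # should agree
```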

The distribution for winery 1. Single feature: Alcohol. Mean µ = 13.72, standard deviation σ = 0.44 (variance ≈ 0.20).

All three wineries. π_1 = 0.33, P_1 = N(13.7, 0.20); π_2 = 0.39, P_2 = N(12.3, 0.28); π_3 = 0.28, P_3 = N(13.2, 0.27). To classify x: pick the j with the highest π_j P_j(x). Test error: 14/48 ≈ 29%. What if we use two features?
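Putting the fitted numbers together, a sketch of the resulting single-feature classifier (the class weights and N(µ, σ²) parameters are the ones quoted above; scipy's norm is used only to evaluate the densities):

```python
import numpy as np
from scipy.stats import norm

# Class weights and per-class Gaussians N(mu, var) over the Alcohol feature
classes = {
    1: (0.33, norm(13.7, np.sqrt(0.20))),
    2: (0.39, norm(12.3, np.sqrt(0.28))),
    3: (0.28, norm(13.2, np.sqrt(0.27))),
}

def classify(x):
    """Return the winery j with the largest pi_j * P_j(x)."""
    return max(classes, key=lambda j: classes[j][0] * classes[j][1].pdf(x))

print(classify(13.9))   # a high alcohol value lands closest to winery 1's distribution
```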

The data set, again. Training set obtained from 130 bottles: Winery 1: 43 bottles, Winery 2: 51 bottles, Winery 3: 36 bottles. For each bottle, 13 features: Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, Proline. Also, a separate test set of 48 labeled points. This time: Alcohol and Flavanoids.

Why it helps to add features: better separation between the classes! The error rate drops from 29% to 8%.

The bivariate Gaussian. Model class 1 by a bivariate Gaussian, parametrized by mean µ = (13.7, 3.0) and covariance matrix Σ = [[0.20, 0.06], [0.06, 0.12]].

Dependence between two random variables. Suppose X_1 has mean µ_1 and X_2 has mean µ_2. We can measure the dependence between them by their covariance: cov(X_1, X_2) = E[(X_1 − µ_1)(X_2 − µ_2)] = E[X_1 X_2] − µ_1 µ_2. It is maximized when X_1 = X_2, in which case it equals var(X_1). It is at most std(X_1) std(X_2).
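A small numerical check of the identity cov(X_1, X_2) = E[X_1 X_2] − µ_1 µ_2 on simulated data; the correlated sample below is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate correlated (X1, X2) pairs with arbitrary parameters
x1 = rng.normal(1.0, 2.0, size=100_000)
x2 = 0.5 * x1 + rng.normal(0.0, 1.0, size=100_000)

# cov(X1, X2) two ways: E[(X1 - mu1)(X2 - mu2)] and E[X1 X2] - mu1 * mu2
lhs = np.mean((x1 - x1.mean()) * (x2 - x2.mean()))
rhs = np.mean(x1 * x2) - x1.mean() * x2.mean()
print(lhs, rhs)                          # the two estimates coincide
print(np.cov(x1, x2, bias=True)[0, 1])   # and match numpy's (biased) sample covariance
```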

The bivariate (2-d) Gaussian. A distribution over (x_1, x_2) ∈ R², parametrized by a mean (µ_1, µ_2) ∈ R², where µ_1 = E(X_1) and µ_2 = E(X_2), and a covariance matrix Σ = [[Σ_11, Σ_12], [Σ_21, Σ_22]], where Σ_11 = var(X_1), Σ_22 = var(X_2), and Σ_12 = Σ_21 = cov(X_1, X_2). The density is highest at the mean and falls off in ellipsoidal contours.

Density of the bivariate Gaussian. With mean µ = (µ_1, µ_2) and covariance matrix Σ = [[Σ_11, Σ_12], [Σ_21, Σ_22]], the density is p(x_1, x_2) = (1 / (2π |Σ|^(1/2))) exp(−(1/2) (x − µ)^T Σ^(−1) (x − µ)), where x = (x_1, x_2).
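A short sketch that evaluates this density for the winery-1 parameters given earlier, and checks the hand-written formula against scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Winery 1's bivariate Gaussian over (Alcohol, Flavanoids), from the earlier slide
mu = np.array([13.7, 3.0])
Sigma = np.array([[0.20, 0.06],
                  [0.06, 0.12]])

def bivariate_gaussian_pdf(x, mu, Sigma):
    """p(x) = exp(-0.5 (x - mu)^T Sigma^{-1} (x - mu)) / (2*pi*|Sigma|^(1/2))."""
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))

x = np.array([13.5, 2.8])
print(bivariate_gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should agree
```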

Bivariate Gaussian: examples. In either case, the mean is (1, 1). The two covariance matrices are Σ = [[4, 0], [0, 1]] and Σ = [[4, 1.5], [1.5, 1]].
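As a small aside on reading these matrices, the off-diagonal entry fixes the correlation between the two coordinates; a quick computation for the second example:

```python
import numpy as np

# Second example from the slide: Sigma = [[4, 1.5], [1.5, 1]], mean (1, 1)
Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])

# Correlation = cov(X1, X2) / (std(X1) * std(X2))
corr = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])
print(corr)   # 0.75: the positive correlation is what tilts the elliptical contours
```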

The decision boundary. Going from 1 to 2 features, the error rate drops from 29% to 8%. What kind of function is this decision boundary? And can we use more features?