EM Algorithm & High Dimensional Data

EM Algorithm & High Dimensional Data Nuno Vasconcelos (Ken Kreutz-Delgado) UCSD

Gaussian EM Algorithm

For the Gaussian mixture model, we have

Expectation Step (E-Step):
h_ij = P_{Z|X}(j|x_i) = π_j G(x_i; μ_j, Σ_j) / Σ_k π_k G(x_i; μ_k, Σ_k)

Maximization Step (M-Step):
π_j^new = (1/n) Σ_i h_ij
μ_j^new = Σ_i h_ij x_i / Σ_i h_ij
Σ_j^new = Σ_i h_ij (x_i − μ_j^new)(x_i − μ_j^new)^T / Σ_i h_ij
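The updates above map almost line for line onto code. Below is a minimal NumPy sketch of one EM iteration for a Gaussian mixture; the array names (X, pi, mu, Sigma) and the helper em_step are illustrative, not from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, Sigma):
    """One EM iteration for a Gaussian mixture.
    X: (n, d) data; pi: (C,) weights; mu: (C, d) means; Sigma: (C, d, d) covariances."""
    n, C = X.shape[0], len(pi)

    # E-step: responsibilities h_ij = P(Z = j | x_i)
    h = np.column_stack([
        pi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j]) for j in range(C)
    ])
    h /= h.sum(axis=1, keepdims=True)

    # M-step: soft (responsibility-weighted) parameter updates
    Nj = h.sum(axis=0)                       # effective number of points per component
    pi_new = Nj / n
    mu_new = (h.T @ X) / Nj[:, None]
    Sigma_new = np.empty_like(Sigma)
    for j in range(C):
        Xc = X - mu_new[j]
        Sigma_new[j] = (h[:, j, None] * Xc).T @ Xc / Nj[j]
    return pi_new, mu_new, Sigma_new
```

Iterating em_step until the data log-likelihood stops improving gives the usual EM fit.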

EM versus K-means

Data/class assignments:
- EM (soft decisions): h_ij = P_{Z|X}(j|x_i)
- K-means (hard decisions): j*(x_i) = argmax_j P_{Z|X}(j|x_i), i.e. h_ij = 1 if P_{Z|X}(j|x_i) > P_{Z|X}(k|x_i) for all k ≠ j, and h_ij = 0 otherwise

Parameter updates:
- EM (soft updates): μ_j^new = Σ_i h_ij x_i / Σ_i h_ij
- K-means (hard updates): μ_j^new = (1/n_j) Σ_{i in cluster j} x_i
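To make the contrast concrete, here is a small sketch (variable names are illustrative) of how k-means hardens the soft responsibilities; plugging h_hard into the soft mean update recovers the k-means update.

```python
import numpy as np

def hard_assignments(h_soft):
    """K-means-style hard decisions: one-hot vector of the most probable component.
    h_soft: (n, C) responsibilities P(Z = j | x_i) from the E-step."""
    h_hard = np.zeros_like(h_soft)
    h_hard[np.arange(h_soft.shape[0]), h_soft.argmax(axis=1)] = 1.0
    return h_hard

# With h = h_hard, the soft update mu_j = sum_i h_ij x_i / sum_i h_ij reduces to
# the k-means update mu_j = (1/n_j) * sum of the points assigned to cluster j.
```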

Important Application of EM

Recall, in Bayesian decision theory we have:
- World: states Y in {1, ..., M} and observations of X
- Class-conditional densities P_{X|Y}(x|y)
- Class (prior) probabilities P_Y(i)
- Bayes decision rule (BDR): i*(x) = argmax_i [ log P_{X|Y}(x|i) + log P_Y(i) ]

We have seen that this is only optimal insofar as all the probabilities involved are correctly estimated. One of the important applications of EM is to learn the class-conditional densities more accurately.

Example

Image segmentation: given this image, can we segment it into the cheetah and background classes? This is useful for many applications:
- Recognition: this image has a cheetah
- Compression: code the cheetah with fewer bits
- Graphics: a plug-in for Photoshop would allow manipulating objects

Since we have two classes (cheetah and grass), we should be able to do this with a Bayesian classifier.

Example

Start by collecting a lot of examples of cheetahs and a lot of examples of grass. One can get tons of such images via Google image search.

Example

Represent images as bags of little image patches. Each patch is passed through the discrete cosine transform (DCT), giving a bag of DCT vectors, and we can fit a simple Gaussian P_{X|Y}(x|cheetah) to the transformed patches.

Example

Do the same for grass and apply the BDR to classify each patch into cheetah or grass:

i* = argmax_i [ log G(x, μ_i, Σ_i) + log P_Y(i) ]

Each DCT vector extracted from the image is evaluated under both P_{X|Y}(x|grass) and P_{X|Y}(x|cheetah).
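As a rough sketch of how this per-patch rule would look in code (assuming the DCT vectors have already been extracted; dct_patches, params, and priors are hypothetical names used only for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

def classify_patches(dct_patches, params, priors):
    """BDR with a single Gaussian per class.
    dct_patches: (n, d) DCT vectors; params[i] = (mu_i, Sigma_i); priors[i] = P_Y(i)."""
    scores = np.column_stack([
        multivariate_normal.logpdf(dct_patches, mean=mu, cov=Sigma) + np.log(p)
        for (mu, Sigma), p in zip(params, priors)
    ])                              # log G(x; mu_i, Sigma_i) + log P_Y(i), one column per class
    return scores.argmax(axis=1)    # i*(x) for every patch
```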

Example

Better performance is achieved by modeling the cheetah class distribution as a mixture of Gaussians, i.e. fitting a mixture P_{X|Y}(x|cheetah) to the bag of DCT vectors.

Example

Do the same for grass and apply the BDR to classify each patch:

i* = argmax_i [ log Σ_k π_{i,k} G(x, μ_{i,k}, Σ_{i,k}) + log P_Y(i) ]

As before, each DCT vector is evaluated under both P_{X|Y}(x|grass) and P_{X|Y}(x|cheetah), now modeled as mixtures.
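A minimal sketch of the mixture version, using scikit-learn's GaussianMixture (which fits each class-conditional by EM); the number of components and the class prior below are placeholder values, not the ones used in the slides.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_mixtures(cheetah_dct, grass_dct, n_components=8):
    """Fit one Gaussian mixture per class to its training DCT vectors (arrays assumed given)."""
    gmm_cheetah = GaussianMixture(n_components=n_components).fit(cheetah_dct)
    gmm_grass = GaussianMixture(n_components=n_components).fit(grass_dct)
    return gmm_cheetah, gmm_grass

def classify_with_mixtures(dct_patches, gmm_cheetah, gmm_grass, prior_cheetah=0.2):
    """BDR with a mixture per class: argmax_i [ log P_{X|Y}(x|i) + log P_Y(i) ]."""
    log_post = np.column_stack([
        gmm_grass.score_samples(dct_patches) + np.log(1.0 - prior_cheetah),   # class 0: grass
        gmm_cheetah.score_samples(dct_patches) + np.log(prior_cheetah),       # class 1: cheetah
    ])
    return log_post.argmax(axis=1)   # 1 where the patch is classified as cheetah
```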

Classification

The use of more sophisticated probability models, e.g. mixtures, usually improves performance. However, it is not a magic solution. Earlier on in the course we talked about features; typically, you have to start from a good feature set. It turns out that even with a good feature set, you must be careful. Consider the following example, from our image classification problem.

Example

Cheetah Gaussian classifier, DCT space:

features:        first 8 DCT features   all 64 features
prob. of error:  4%                     8%

Interesting observation: more features = higher error!

Comments on the Example

The first reason why this happens is that things are not always what we think they are in high dimensions. One could say that high dimensional spaces are STRANGE!!! In practice, we invariably have to do some form of dimensionality reduction, and we will see that eigenvalues play a major role in this. One of the major dimensionality reduction techniques is principal component analysis (PCA). But let's start by discussing the problems of high dimensions.

High Dimensional Spaces

Are strange! First thing to know: never fully trust your intuition in high dimensions! More often than not you will be wrong! There are many examples of this. We will do a couple here, skipping most of the math; these examples are both fun and instructive.

The Hypersphere

Consider the ball of radius r in a space of dimension d. The surface of this ball is a (d−1)-dimensional hypersphere. The ball has volume

V_d(r) = π^(d/2) r^d / Γ(d/2 + 1),

where Γ(n) is the gamma function. When we talk of the volume of a hypersphere, we will actually mean the volume of the ball it contains. Similarly for the volume of a hypercube, etc.

Hypercube versus Hypersphere

Consider the hypercube [−a, a]^d and the inscribed hypersphere of radius a.

Q: what does your intuition tell you about the relative sizes of these two volumes?
1. volume of sphere ≈ volume of cube?
2. volume of sphere >> volume of cube?
3. volume of sphere << volume of cube?

Answer

To find the answer, we can compute the relative volumes:

f_d = V_sphere(a) / V_cube(a) = [π^(d/2) a^d / Γ(d/2 + 1)] / (2a)^d = π^(d/2) / [2^d Γ(d/2 + 1)]

This is a sequence that does not depend on the radius a, just on the dimension d!

d:    1     2     3     4     5     6     7
f_d:  1   .785  .524  .308  .164  .08   .037

The relative volume goes to zero, and goes to zero fast!
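The sequence f_d is easy to check numerically with the ball-volume formula from the previous slide (a small sketch; scipy is assumed available):

```python
import numpy as np
from scipy.special import gamma

def ball_volume(d, r=1.0):
    """Volume of the d-dimensional ball of radius r: pi^(d/2) r^d / Gamma(d/2 + 1)."""
    return np.pi ** (d / 2) * r ** d / gamma(d / 2 + 1)

for d in range(1, 8):
    f_d = ball_volume(d) / 2 ** d        # cube [-1, 1]^d has volume 2^d; the radius cancels out
    print(f"d = {d}: f_d = {f_d:.3f}")
# 1.000, 0.785, 0.524, 0.308, 0.164, 0.081, 0.037 -- the table above, up to rounding
```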

Hypercube vs Hypersphere

This means that, as the dimension of the space increases, the volume of the sphere becomes much smaller (infinitesimally so) than that of the cube! Is this really going against intuition? It is actually not very surprising if we think about it; we can see it even in low dimensions:
1. d = 1: the "cube" [−a, a] and the "sphere" [−a, a] coincide, so the volume is the same
2. d = 2: the volume (area) of the inscribed circle is already smaller than that of the square

Hypercube vs Hypersphere

As the dimension increases, the volume of the shaded corners becomes larger. In high dimensions the picture you should imagine is: all the volume of the cube is in the spikes (corners)!

Believe it or Not

We can actually check this mathematically. Consider the vector d = (a, ..., a) from the center of the cube to one of its corners, and the vector p = (a, 0, ..., 0) from the center to the middle of a face. Then

cos θ(d, p) = (d · p) / (||d|| ||p||) = a² / (a√d · a) = 1/√d → 0 and ||d|| / ||p|| = √d → ∞,

i.e. the corner direction d becomes orthogonal to p as the dimension d increases, and infinitely larger!

But there is even more

Consider the crust of thickness ε of a sphere S1 of radius a, i.e. the region between S1 and the inner sphere S2 of radius a − ε. We can compute the volume of the inner sphere relative to that of the whole sphere:

V(S2) / V(S1) = ((a − ε)/a)^d = (1 − ε/a)^d

No matter how small ε is, this ratio goes to zero as d increases. I.e., all the volume is in the crust!
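A quick numerical check, with a deliberately thin crust (ε equal to 1% of the radius; the values are only for illustration):

```python
# Fraction of the ball's volume NOT in the crust: (1 - eps/a)^d
a, eps = 1.0, 0.01
for d in (1, 10, 100, 1000):
    inner = (1 - eps / a) ** d
    print(f"d = {d:4d}: inner fraction = {inner:.5f}, crust fraction = {1 - inner:.5f}")
# Even a 1%-thick crust holds essentially all of the volume once d reaches the hundreds.
```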

High Dimensional Gaussian

For a Gaussian, it can be shown that if X ~ N(μ, σ²I) in n dimensions, and one considers the region outside of the hypersphere where the probability density drops to 1% of its peak value, i.e. {x : ||x − μ||²/σ² > 2 ln 100 ≈ 9.2}, then the probability mass in this region is

P_n = P(χ²(n) > 2 ln 100),

where χ²(n) is a chi-squared random variable with n degrees of freedom.

High-Dimensional Gaussian

If you evaluate this, you'll find out that

n:      1     2     3     4     5     6    10    15    20
1−P_n: .998  .99   .97   .94   .89   .83   .48  .134   .02

As the dimension increases, virtually all the probability mass is in the tails. Yet, the point of maximum density is still the mean. This is really strange: in high dimensions the Gaussian is a very heavy-tailed distribution.

Take-home message: in high dimensions never trust your low-dimensional intuition!
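The table itself can be reproduced with scipy's chi-squared CDF; the density falls to 1% of its peak at squared normalized radius 2 ln 100 ≈ 9.21 (a minimal sketch):

```python
import numpy as np
from scipy.stats import chi2

threshold = 2 * np.log(100)     # squared normalized radius where the density is 1% of its peak
for n in (1, 2, 3, 4, 5, 6, 10, 15, 20):
    mass_inside = chi2.cdf(threshold, df=n)      # 1 - P_n: mass where density > 1% of peak
    print(f"n = {n:2d}: 1 - P_n = {mass_inside:.3f}")
# 0.998, 0.990, 0.973, 0.944, 0.899, 0.838, 0.488, 0.134, 0.020 -- the table above, up to rounding
```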

The Curse of Dimensionality

Typical observation in Bayes decision theory: error increases when the number of features is large. This is unintuitive since, theoretically, if I have a problem in n dimensions I can always generate a problem in n+1 dimensions without increasing the probability of error, and often even decreasing it. E.g., two uniform classes A and B in 1-D can be transformed into a 2-D problem with the same error: just add a non-informative variable (extra dimension) y.

Curse of Dimensionality

Sometimes it is possible to reduce the error by adding a second variable which is informative. On the left there is no decision boundary that will achieve zero error; on the right, the decision boundary shown has zero error.

Curse of Dimensionality

In fact, it is theoretically impossible to do worse in 2-D than in 1-D: if we move the classes along the lines shown in green, the error can only go down, since there will be less overlap.

Curse of Dimensionality

So why do we observe this curse of dimensionality? The problem is the quality of the density estimates. Everything we have seen so far assumes perfect estimation of the BDR, and we discussed various reasons why this is not easy:
- Most densities are not simply a Gaussian, exponential, etc. Typically densities are, at best, a mixture of several components.
- There are many unknowns (# of components, what type), the likelihood has local minima, etc.
- Even with algorithms like EM, it is difficult to get this right.

Curse of Dimensionality

But the problem goes much deeper than this. Even for simple models (e.g. Gaussian) we need a large number of examples n to have good estimates. Q: What does "large" mean? This depends on the dimension of the space. The best way to see this is to think of a histogram: suppose you have 100 points and you need at least 10 bins per axis in order to get a reasonable quantization. For uniform data you get, on average:

dimension:   1    2     3
points/bin: 10    1    0.1

This is decent in 1-D, bad in 2-D, and terrible in 3-D (9 out of each 10 bins are empty).
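A quick way to see this empirically: draw 100 uniform points, bin each axis into 10 bins, and count how many bins stay empty (the numbers here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, bins = 100, 10
for d in (1, 2, 3):
    X = rng.uniform(size=(n_points, d))
    H, _ = np.histogramdd(X, bins=bins, range=[(0.0, 1.0)] * d)
    print(f"d = {d}: {n_points / bins**d:5.1f} points/bin on average, "
          f"{(H == 0).mean():.0%} of the {bins**d} bins are empty")
```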

Dimensionality Reduction

We see that it can quickly become impossible to fill up a high-dimensional space with a sufficient number of data points. What do we do about this? We avoid unnecessary dimensions! "Unnecessary" can be measured in two ways:
1. Features are non-discriminant (insufficiently discriminating)
2. Features are not independent

Non-discriminant means that they don't separate classes well. (Figure: a discriminant feature separates the two classes; a non-discriminant feature does not.)

END