Machine Learning for Data Science (CS 4786)
Lecture 6, 7 & 8: Ellipsoidal Clustering, Gaussian Mixture Models and General Mixture Models

The text in black outlines the high-level ideas. The text in blue provides simple mathematical details to derive or get to the algorithm or method. The text in red gives mathematical details for those who are interested.

1 Motivation

The K-means algorithm looks for round clusters and cannot explicitly model clusters where one cluster has fewer points than another. We would like to address this issue. We will do this by changing the dissimilarity function, first to allow ellipsoidal clusters, and further by explicitly maintaining a parameter $\pi$, called the mixture distribution, that tells us the proportion of points within each cluster.

2 Ellipsoidal Clustering

The basic idea is that each of our clusters will be explicitly modeled by an ellipsoid.

2.1 Prelude: Axis-aligned case

To this end, say the data points within a cluster are spread as in the figure below.

(Figure: an axis-aligned cloud of points around a center $r_j$, with variance 2 along one coordinate and variance 0.5 along the other.)

For this example, intuitively we would like the ellipse to be a vertically standing one. How do we obtain such an ellipse?

Well, what we require is that, in terms of the dissimilarity measure, all the blue points on the outer ellipse have the same value of dissimilarity. That is to say, we want to squish the ellipse vertically and elongate it horizontally so that it becomes circular (i.e., all blue dots are at the same distance from the center). Now say we scale each coordinate by $1/\sqrt{\text{variance of that coordinate}}$. In this case, we will find that the points have a variance of 1 on each coordinate. To see this, define

$$\tilde{x}_t = \left[ \frac{x_t[1]}{\sqrt{\mathrm{Var}(x_1[1], \ldots, x_n[1])}},\ \frac{x_t[2]}{\sqrt{\mathrm{Var}(x_1[2], \ldots, x_n[2])}} \right]$$

That is, for each $x_t$, $\tilde{x}_t$ is the new variable whose coordinates are scaled inversely by the standard deviation. We will notice that when the set of points is axis aligned (i.e., standing vertically or lying down horizontally), the points $\tilde{x}_1, \ldots, \tilde{x}_n$ will have a variance of 1 on each coordinate and a covariance of 0 between coordinates. Hence, under $\tilde{x}_1, \ldots, \tilde{x}_n$, all the blue points that were on the ellipse for the original set now lie on a circle. To see this mathematically, note that

$$
\mathrm{Var}(\tilde{x}_1[1], \ldots, \tilde{x}_n[1])
= \frac{1}{n} \sum_{t=1}^{n} \left( \tilde{x}_t[1] - \frac{1}{n} \sum_{s=1}^{n} \tilde{x}_s[1] \right)^2
= \frac{1}{n} \sum_{t=1}^{n} \left( \frac{x_t[1]}{\sqrt{\mathrm{Var}(x_1[1], \ldots, x_n[1])}} - \frac{1}{n} \sum_{s=1}^{n} \frac{x_s[1]}{\sqrt{\mathrm{Var}(x_1[1], \ldots, x_n[1])}} \right)^2
$$
$$
= \frac{\frac{1}{n} \sum_{t=1}^{n} \left( x_t[1] - \frac{1}{n} \sum_{s=1}^{n} x_s[1] \right)^2}{\mathrm{Var}(x_1[1], \ldots, x_n[1])}
= \frac{\mathrm{Var}(x_1[1], \ldots, x_n[1])}{\mathrm{Var}(x_1[1], \ldots, x_n[1])} = 1.
$$

Similarly, you will find that the variance of the second coordinate is 1 as well. Now, since we began with points distributed in an axis-aligned way, the covariance between different coordinates will be 0. Thus the new set of points is best described by a circle.

Thus, we find that to define the right ellipse for the original set, that is, for the axis-aligned points $x_1, \ldots, x_n$, to measure the ellipsoidal distance of a point $x$ to the center $r_j$, we can instead measure the usual notion of distance (Euclidean distance) from the new point

$$\tilde{x} = \begin{bmatrix} \frac{1}{\sqrt{\mathrm{Var}(x_1[1], \ldots, x_n[1])}} & 0 \\ 0 & \frac{1}{\sqrt{\mathrm{Var}(x_1[2], \ldots, x_n[2])}} \end{bmatrix} x
\quad \text{to the modified center} \quad
\tilde{r}_j = \begin{bmatrix} \frac{1}{\sqrt{\mathrm{Var}(x_1[1], \ldots, x_n[1])}} & 0 \\ 0 & \frac{1}{\sqrt{\mathrm{Var}(x_1[2], \ldots, x_n[2])}} \end{bmatrix} r_j.$$

That is, the ellipsoidal distance is

$$
d(x, C_j) = \|\tilde{x} - \tilde{r}_j\|^2 = (\tilde{x} - \tilde{r}_j)^\top (\tilde{x} - \tilde{r}_j)
= (x - r_j)^\top \begin{bmatrix} \frac{1}{\mathrm{Var}(x_1[1], \ldots, x_n[1])} & 0 \\ 0 & \frac{1}{\mathrm{Var}(x_1[2], \ldots, x_n[2])} \end{bmatrix} (x - r_j)
= (x - r_j)^\top \Sigma_j^{-1} (x - r_j),
$$

where $\Sigma_j$ is the covariance matrix. Thus we have established that, for the axis-aligned case, the dissimilarity measure is

$$d(x, C_j) = (x - r_j)^\top \Sigma_j^{-1} (x - r_j).$$

We will next show that, even for the general case, we have the same dissimilarity measure.
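To make the axis-aligned picture concrete, here is a minimal NumPy sketch (the synthetic data and variable names are illustrative, not from the lecture). It checks that rescaling each coordinate by its standard deviation gives unit variance per coordinate, and that the squared Euclidean distance after rescaling matches $(x - r_j)^\top \Sigma_j^{-1} (x - r_j)$ when $\Sigma_j$ is diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic axis-aligned cluster: variance 2 on coordinate 1, variance 0.5 on coordinate 2.
X = rng.normal(size=(500, 2)) * np.sqrt([2.0, 0.5])
r = X.mean(axis=0)            # cluster center
var = X.var(axis=0)           # per-coordinate variances

# Rescaling each coordinate by 1/std gives unit variance on every coordinate.
X_tilde = X / np.sqrt(var)
print(X_tilde.var(axis=0))    # -> [1. 1.]

# Ellipsoidal dissimilarity with a diagonal covariance matrix ...
Sigma = np.diag(var)
x = X[0]
d_ellipsoidal = (x - r) @ np.linalg.inv(Sigma) @ (x - r)

# ... equals the squared Euclidean distance between the rescaled point and rescaled center.
d_euclidean = np.sum((x / np.sqrt(var) - r / np.sqrt(var)) ** 2)
print(np.isclose(d_ellipsoidal, d_euclidean))   # -> True
```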

2.2 General Ellipsoids

Consider the general (slanted) ellipsoid shown below.

(Figure: a tilted, non-axis-aligned ellipsoidal cloud of points around a center $r_j$.)

How do we get a slanted ellipsoidal dissimilarity for the case above? The high-level picture is that, if we could somehow rotate the points so that they are axis aligned, then we can use the axis-aligned version of the dissimilarity. How do we rotate the points? Basically, what we want is to rotate the points such that the new set of rotated points is axis aligned. We can achieve this by considering the eigendecomposition

$$\Sigma_j = U \Lambda U^\top,$$

where $U$ is a rotation matrix (and hence $U^{-1} = U^\top$) and $\Lambda$ is a diagonal matrix. Now, for the given $x_1, \ldots, x_n$, consider a new, rotated bunch of points $\tilde{x}_1, \ldots, \tilde{x}_n$ where $\tilde{x}_t = U^\top x_t$. Note that the covariance matrix of $\tilde{x}_1, \ldots, \tilde{x}_n$, say $\tilde{\Sigma}$, is given by

$$
\tilde{\Sigma} = \frac{1}{n} \sum_{t=1}^{n} \left( \tilde{x}_t - \frac{1}{n} \sum_{s=1}^{n} \tilde{x}_s \right) \left( \tilde{x}_t - \frac{1}{n} \sum_{s=1}^{n} \tilde{x}_s \right)^\top
= \frac{1}{n} \sum_{t=1}^{n} \left( U^\top x_t - \frac{1}{n} \sum_{s=1}^{n} U^\top x_s \right) \left( U^\top x_t - \frac{1}{n} \sum_{s=1}^{n} U^\top x_s \right)^\top
$$
$$
= U^\top \left[ \frac{1}{n} \sum_{t=1}^{n} \left( x_t - \frac{1}{n} \sum_{s=1}^{n} x_s \right) \left( x_t - \frac{1}{n} \sum_{s=1}^{n} x_s \right)^\top \right] U
= U^\top \Sigma_j U = U^\top U \Lambda U^\top U = \Lambda.
$$

Thus we see that, with $\tilde{x}_t = U^\top x_t$, the rotated points are now axis aligned with covariance $\Lambda$. Hence, the dissimilarity measure for the general case can be set as

$$
d(x, C_j) = (\tilde{x} - \tilde{r}_j)^\top \Lambda^{-1} (\tilde{x} - \tilde{r}_j)
= (U^\top x - U^\top r_j)^\top \Lambda^{-1} (U^\top x - U^\top r_j)
= (x - r_j)^\top U \Lambda^{-1} U^\top (x - r_j)
= (x - r_j)^\top \Sigma_j^{-1} (x - r_j).
$$

Thus we can see that, even for the general case,

$$d(x, C_j) = (x - r_j)^\top \Sigma_j^{-1} (x - r_j)$$

defines the right ellipsoidal dissimilarity measure.
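Here is a small NumPy check of this rotation argument (again with illustrative synthetic data and names): it verifies numerically that rotating the points by $U^\top$, taken from the eigendecomposition of the empirical covariance, makes their covariance the diagonal matrix $\Lambda$, and that the dissimilarity computed after rotation agrees with $(x - r_j)^\top \Sigma_j^{-1} (x - r_j)$ computed directly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic tilted (non-axis-aligned) 2-D cluster.
A = np.array([[1.5, 0.8],
              [0.3, 0.6]])
X = rng.normal(size=(500, 2)) @ A.T
r = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False, bias=True)     # empirical covariance (1/n convention)

# Eigendecomposition Sigma = U diag(lam) U^T with U orthogonal.
lam, U = np.linalg.eigh(Sigma)

# Rotate the points: x_tilde_t = U^T x_t. Their covariance is the diagonal matrix Lambda.
X_tilde = X @ U                                 # each row is (U^T x_t)^T
print(np.allclose(np.cov(X_tilde, rowvar=False, bias=True), np.diag(lam)))   # -> True

# The dissimilarity computed after rotation matches the direct formula.
x = X[0]
d_direct = (x - r) @ np.linalg.inv(Sigma) @ (x - r)
d_rotated = (U.T @ (x - r)) @ np.diag(1.0 / lam) @ (U.T @ (x - r))
print(np.isclose(d_direct, d_rotated))          # -> True
```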

Now, as for the algorithm, it has the same flavor as the K-means algorithm: we first randomly initialize the parameters $r_1, \ldots, r_K$ and $\Sigma_1, \ldots, \Sigma_K$. Next, each point is assigned to the closest cluster under the new ellipsoidal dissimilarity measure

$$d(x, C_j) = (x - r_j)^\top \Sigma_j^{-1} (x - r_j).$$

Next, in that iteration, for each cluster we recompute the mean $r_j$ and covariance $\Sigma_j$. We repeat these two steps iteratively, as shown in the pseudocode in the lecture slides.

2.3 Modeling the Mixture Distribution

Say we had two clusters drawn from normal distributions with the same covariance structure, with means separated by some distance. Now say we have a point equidistant from the means of the two clusters. If the numbers of points drawn from the two Gaussians were exactly the same, then we would of course have to conclude that this point, equidistant from the means, could belong to either cluster with the same probability. However, now say you were informed that one of the clusters has, say, $c$ times the number of points of the other cluster. Then you would expect this point, equidistant from the means, to be $c$ times more likely to be in the first cluster than in the second. However, our cluster assignment step, which only looks at dissimilarity, does not capture this information. Hence, to fix this, we maintain a mixture distribution parameter that keeps track of the proportion of points in each cluster at any iteration and aims to penalize more likely clusters less. This penalty, added to the dissimilarity function, is given by $-\log(\pi_j)$ for the $j$-th cluster. The algorithm is given in the lecture slides.
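As a rough sketch of the resulting procedure (the authoritative pseudocode is in the lecture slides), the following assumes the data sit in an $n \times d$ NumPy array. The function name, the number of iterations, the random initialization at data points, the small ridge added to each covariance, and the handling of empty clusters are all illustrative choices rather than part of the notes.

```python
import numpy as np

def hard_ellipsoidal_clustering(X, K, n_iters=50, seed=0):
    """K-means-style loop with ellipsoidal dissimilarity plus the -log(pi_j) penalty."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialization: centers at random data points, identity covariances, uniform proportions.
    r = X[rng.choice(n, size=K, replace=False)].astype(float)
    Sigma = np.stack([np.eye(d)] * K)
    pi = np.full(K, 1.0 / K)

    for _ in range(n_iters):
        # Assignment step: penalized ellipsoidal dissimilarity of every point to every cluster.
        dis = np.empty((n, K))
        for j in range(K):
            diff = X - r[j]
            dis[:, j] = np.einsum("ti,ij,tj->t", diff, np.linalg.inv(Sigma[j]), diff) - np.log(pi[j])
        z = dis.argmin(axis=1)

        # Update step: recompute mean, covariance and proportion of each cluster.
        for j in range(K):
            members = X[z == j]
            if len(members) == 0:
                continue                      # keep previous parameters for an empty cluster
            r[j] = members.mean(axis=0)
            Sigma[j] = np.cov(members, rowvar=False, bias=True) + 1e-6 * np.eye(d)
            pi[j] = len(members) / n
    return z, r, Sigma, pi
```

Replacing the penalized dissimilarity with the negative log of $\pi_j$ times the Gaussian density introduced in Section 3 below turns this loop into the hard Gaussian mixture model assignment.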

3 Probabilistic Interpretation: Hard Gaussian Mixture Models

One can obtain a probabilistic interpretation of the algorithm as follows: the probability of a point belonging to a particular cluster $j$ is proportional to the probability of picking cluster $j$, given by $\pi_j$, times the likelihood of the point belonging to cluster $j$. Notice that $\pi_j = \exp(-(-\log(\pi_j)))$; similarly, we can set the likelihood $p(x \mid C_j) \propto \exp(-\mathrm{Dissimilarity}(x, C_j))$. Notice that this ensures that the density $p$ is non-negative. To ensure that $p$ is a valid density, it needs to integrate to 1. Hence, $p(x \mid C_j) = \frac{1}{Z_j} \exp(-\mathrm{Dissimilarity}(x, C_j))$, and one can calculate $Z_j$, the normalizing constant, so that $p$ integrates to 1. For the probabilistic interpretation of the ellipsoidal dissimilarity, assume

$$\mathrm{Dissimilarity}(x, C_j) = \tfrac{1}{2}(x - r_j)^\top \Sigma_j^{-1} (x - r_j).$$

The factor $1/2$ just makes the calculations easier. The likelihood is given by

$$p(x; r_j, \Sigma_j) \propto \exp\!\left( -\tfrac{1}{2}(x - r_j)^\top \Sigma_j^{-1} (x - r_j) \right).$$

But note that $p$ is basically proportional to the multivariate Gaussian distribution, and

$$\int \exp\!\left( -\tfrac{1}{2}(x - r_j)^\top \Sigma_j^{-1} (x - r_j) \right) dx = (2\pi)^{d/2} \sqrt{\det(\Sigma_j)},$$

and hence the density function for the probabilistic interpretation can be obtained by setting

$$p(x; r_j, \Sigma_j) = \frac{1}{(2\pi)^{d/2} \sqrt{\det(\Sigma_j)}} \exp\!\left( -\tfrac{1}{2}(x - r_j)^\top \Sigma_j^{-1} (x - r_j) \right),$$

which is the multivariate Gaussian distribution. Under the probabilistic interpretation, the hard Gaussian mixture model algorithm can be found in the lecture slides. Specifically, for hard cluster assignment, a point is assigned to the cluster to which it has the maximum probability of belonging. This probability is proportional to $\pi_j$ (the probability of picking cluster $j$) times the likelihood of the point belonging to cluster $j$.

4 (Soft) Gaussian Mixture Models

One issue with hard clustering is that when we begin, we randomly guess the parameters and recompute them multiple times, hoping to converge to the right ones. Say now there is a point that has probability 0.5 of belonging to cluster 1 and probability 0.49 of belonging to cluster 2 on some iteration. So the point is close to being equally likely to belong to each of the two clusters, and this computation is based on randomly initialized parameters. Based on this, assigning the point to only cluster 1 and not cluster 2 seems too harsh. The soft assignment takes care of this issue by replacing the cluster assignment step on each iteration with a step that updates, for each point, the probability that it belongs to each of the $K$ clusters. That is, every point belongs to every cluster with some probability, given by the variable $Q$. Specifically, at any iteration $m$, $Q^{(m)}_t[j]$ specifies, based on the parameters at iteration $m$, the probability that point $x_t$ belongs to cluster $j$. Now, when we compute the means and covariances at step 2 of the iteration, for every cluster we compute the weighted covariance and mean (and $\pi_j$) as shown in the lecture slides. When we get to probabilistic models, we will come back to mixture models and see how this makes sense.
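Here is a minimal sketch of one soft iteration, assuming the weighted updates take the usual responsibility-weighted form; the exact update rules are the ones in the lecture slides. The function name soft_gmm_iteration and the use of scipy.stats.multivariate_normal for the Gaussian density are illustrative choices, not part of the notes.

```python
import numpy as np
from scipy.stats import multivariate_normal

def soft_gmm_iteration(X, r, Sigma, pi):
    """One soft iteration: update responsibilities Q, then weighted means, covariances and pi."""
    n, d = X.shape
    K = len(pi)

    # Step 1: responsibilities Q[t, j] proportional to pi_j * N(x_t; r_j, Sigma_j), normalized over j.
    Q = np.empty((n, K))
    for j in range(K):
        Q[:, j] = pi[j] * multivariate_normal.pdf(X, mean=r[j], cov=Sigma[j])
    Q /= Q.sum(axis=1, keepdims=True)

    # Step 2: every point contributes to every cluster, weighted by its responsibility.
    N_j = Q.sum(axis=0)                              # effective number of points per cluster
    r_new = (Q.T @ X) / N_j[:, None]                 # weighted means
    Sigma_new = np.empty((K, d, d))
    for j in range(K):
        diff = X - r_new[j]
        Sigma_new[j] = (Q[:, j, None] * diff).T @ diff / N_j[j]   # weighted covariances
    pi_new = N_j / n                                 # updated mixture proportions
    return Q, r_new, Sigma_new, pi_new
```

In this form, the hard assignment of Section 3 is recovered by replacing each row of Q with an indicator of its largest entry before the weighted updates.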