
Distributional Similarity Models (cont.)
Regina Barzilay
EECS Department, MIT
October 19, 2004

Last Time
- Semantic similarity
- Vector space model
- Similarity measures: cosine, Euclidean distance, ...
- Clustering: k-means, hierarchical

Example
Delicately handling the beautiful satin bindings, Emma looked with dazzled eyes at the names of the unknown authors. The orange blossoms were yellow with dust and the silver bordered satin ribbons frayed at the borders. The confessional forms a pendant to a statuette of the Virgin, clothed in a satin robe. Never had Emma been so beautiful as at this period. He picked up a cigar-case with a green silk border.

             the   border   Emma   ribbon
beautiful     1      0       2       0
satin         3      1       1       1
silk          0      1       0       0
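The recap above lists cosine as one similarity measure; below is a minimal NumPy sketch applying it to these count rows (the dictionary, the cosine helper, and the printed pairs are my own illustration, not part of the lecture):

```python
import numpy as np

# Word-context counts from the example above (columns: the, border, Emma, ribbon).
counts = {
    "beautiful": np.array([1, 0, 2, 0], dtype=float),
    "satin":     np.array([3, 1, 1, 1], dtype=float),
    "silk":      np.array([0, 1, 0, 0], dtype=float),
}

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(counts["satin"], counts["silk"]))       # shared 'border' context -> nonzero
print(cosine(counts["beautiful"], counts["silk"]))   # no shared contexts -> 0.0
```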

EM Clustering
- Soft version of k-means clustering
- Input: n m-dimensional objects X = {x_1, ..., x_n} \subseteq R^m to be clustered into k groups
- Observable data: X = {x_i}, where x_i = (x_{i1}, ..., x_{im})
- Unobservable data: Z = {z_i}, where within each z_i = (z_{i1}, ..., z_{ik}) the component z_{ij} is 1 if object i is a member of cluster j and 0 otherwise
- Clustering is viewed as estimating a mixture of probability distributions

Example of the EM algorithm for Soft Clustering
[Figure: three panels showing the same data with cluster centers c_1 and c_2 -- initial state, after iteration 1, after iteration 2]

Multivariate Normal Distributions
Key assumption: the data are generated by k Gaussians.
The probability density function for a Gaussian:
n_j(x; \mu_j, \Sigma_j) = \frac{1}{\sqrt{(2\pi)^m |\Sigma_j|}} \exp\left[-\tfrac{1}{2}(x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j)\right]
Goal: find the maximum likelihood model of the form
\sum_{j=1}^{k} \pi_j \, n_j(x; \mu_j, \Sigma_j)
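A minimal NumPy sketch of this density (the function name and argument names are mine):

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """n_j(x; mu_j, Sigma_j): multivariate normal density for an m-dimensional point x."""
    m = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** m * np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm
```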

The EM Algorithm for Gaussian Mixtures
Hidden parameters: \Theta_j = (\mu_j, \Sigma_j, \pi_j)
Log likelihood of the data:
L(X \mid \Theta) = \log \prod_{i=1}^{n} P(x_i) = \log \prod_{i=1}^{n} \sum_{j=1}^{k} \pi_j n_j(x_i; \mu_j, \Sigma_j) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} \pi_j n_j(x_i; \mu_j, \Sigma_j)
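Reusing the density sketch above, the log likelihood can be computed directly from this last form (a sketch; in practice one would work in log space with a log-sum-exp for numerical stability):

```python
def log_likelihood(X, pis, mus, sigmas):
    """L(X | Theta) = sum_i log sum_j pi_j * n_j(x_i; mu_j, Sigma_j)."""
    return sum(
        np.log(sum(pi * gaussian_density(x, mu, s)
                   for pi, mu, s in zip(pis, mus, sigmas)))
        for x in X
    )
```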

Iterative Solution
Estimate: If we knew the value of \Theta, we could compute the expected values of the hidden structure of the model.
Maximize: If we knew the expected values of the hidden structure of the model, then we could compute the maximum likelihood value of \Theta.

Initialization
The covariance matrices \Sigma_j are initialized to the identity matrix.
The means \mu_j are selected to be a random perturbation away from a data point randomly selected from X.
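A sketch of this initialization (the uniform mixing weights \pi_j = 1/k and the perturbation scale are my assumptions; the slide only specifies the covariances and means):

```python
def init_params(X, k, scale=0.01, seed=0):
    """Identity covariances; means = randomly chosen data points plus a small perturbation."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    idx = rng.choice(n, size=k, replace=False)
    mus = X[idx] + scale * rng.standard_normal((k, m))
    sigmas = [np.eye(m) for _ in range(k)]
    pis = np.full(k, 1.0 / k)          # assumption: start with uniform mixing weights
    return pis, mus, sigmas
```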

Expectation Step
Given the current parameters, compute the cluster membership probabilities
h_{ij} = E(z_{ij} \mid x_i; \Theta) = \frac{P(x_i \mid n_j; \Theta)}{\sum_{l=1}^{k} P(x_i \mid n_l; \Theta)}
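A sketch of the E-step; here I read E(z_{ij} | x_i; \Theta) as the usual responsibility, which weights each component density by its mixing weight \pi_j (the slide's notation folds this into P(x_i | n_j; \Theta)):

```python
def e_step(X, pis, mus, sigmas):
    """h_ij: probability that object i belongs to cluster j under the current Theta."""
    H = np.array([[pi * gaussian_density(x, mu, s)
                   for pi, mu, s in zip(pis, mus, sigmas)]
                  for x in X])
    return H / H.sum(axis=1, keepdims=True)   # normalize over clusters l = 1..k
```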

Maximization Step
Given the cluster membership probabilities (expected values), compute the most likely parameters \Theta:
\mu_j = \frac{\sum_{i=1}^{n} h_{ij} x_i}{\sum_{i=1}^{n} h_{ij}}
\Sigma_j = \frac{\sum_{i=1}^{n} h_{ij} (x_i - \mu_j)(x_i - \mu_j)^T}{\sum_{i=1}^{n} h_{ij}}
\pi_j = \frac{\sum_{i=1}^{n} h_{ij}}{\sum_{j=1}^{k} \sum_{i=1}^{n} h_{ij}}
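A sketch of the M-step, plus the estimate/maximize loop from the Iterative Solution slide, reusing the helpers above (the iteration count and the two-cluster toy data are arbitrary choices of mine):

```python
def m_step(X, H):
    """Re-estimate (mu_j, Sigma_j, pi_j) from the soft memberships H (n x k)."""
    n_j = H.sum(axis=0)                                  # soft counts per cluster
    mus = (H.T @ X) / n_j[:, None]
    sigmas = []
    for j in range(H.shape[1]):
        diff = X - mus[j]
        sigmas.append((H[:, j, None] * diff).T @ diff / n_j[j])
    pis = n_j / n_j.sum()
    return pis, mus, sigmas

# Toy run on two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(3.0, 0.5, size=(50, 2))])
pis, mus, sigmas = init_params(X, k=2)
for _ in range(25):
    H = e_step(X, pis, mus, sigmas)
    pis, mus, sigmas = m_step(X, H)
```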

Example of a Gaussian Mixture
Posterior probabilities P(w_i | c_j):

Main cluster   Word       1      2      3      4      5
1              ballot     0.63   0.12   0.04   0.09   0.11
1              polls      0.58   0.11   0.06   0.10   0.14
1              Gov        0.58   0.12   0.03   0.10   0.17
1              seats      0.11   0.59   0.02   0.14   0.15
2              profit     0.58   0.12   0.03   0.10   0.17
2              finance    0.15   0.55   0.01   0.13   0.16
2              payments   0.12   0.66   0.01   0.09   0.11
3              NFL        0.13   0.05   0.58   0.09   0.16
3              Reds       0.05   0.01   0.86   0.02   0.06

Other Methods of Dimensionality Reduction: Latent Semantic Indexing
- Similar objects are projected onto the same dimensions.
- The representation in the original space is changed as little as possible.

Document-by-Word Matrix

             d1   d2   d3   d4   d5   d6
cosmonaut     1    0    1    0    0    0
astronaut     0    1    0    0    0    0
moon          1    1    0    0    0    0
car           1    0    0    1    1    0
truck         0    0    0    0    0    1
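For the SVD examples that follow, this matrix can be entered directly (a sketch; the variable names are mine):

```python
import numpy as np

# Document-by-word matrix from the slide: rows = terms, columns = d1..d6.
terms = ["cosmonaut", "astronaut", "moon", "car", "truck"]
A = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 0, 0, 1]], dtype=float)
```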

Least-Squares Methods: Linear Regression
[Figure: data points plotted as y against x]

Least-Squares Methods: Linear Regression
Input: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n)
Goal: find f(x) = mx + b that minimizes the sum of the squares of the differences
SS(m, b) = \sum_{i=1}^{n} (y_i - f(x_i))^2

Linear Regression
Minimize
SS(m, b) = \sum_{i=1}^{n} (y_i - f(x_i))^2 = \sum_{i=1}^{n} (y_i - m x_i - b)^2
Setting \frac{\partial SS(m,b)}{\partial b} = \sum_{i=1}^{n} 2(y_i - m x_i - b)(-1) = 0 gives
b = \bar{y} - m \bar{x}, where \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i and \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
Substituting b back in and setting \frac{\partial SS(m,b)}{\partial m} = \frac{d}{dm} \sum_{i=1}^{n} (y_i - m x_i - \bar{y} + m \bar{x})^2 = 0 gives
m = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
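A sketch of the resulting closed-form fit (the function name and the toy example are mine):

```python
import numpy as np

def fit_line(x, y):
    """Least-squares line y ~ m*x + b using the closed form derived above."""
    x_bar, y_bar = x.mean(), y.mean()
    m = ((y - y_bar) * (x - x_bar)).sum() / ((x - x_bar) ** 2).sum()
    b = y_bar - m * x_bar
    return m, b

# Example: a noiseless line is recovered exactly.
x = np.array([1.0, 2.0, 3.0, 4.0])
m, b = fit_line(x, 2 * x + 1)      # -> m = 2.0, b = 1.0
```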

Singular Value Decomposition (SVD)
Rationale: increase similarity in representation by dimensionality reduction.
SVD projects an n-dimensional space onto a k-dimensional space, where n > k.
Example: word-document matrices in information retrieval; n is the number of word types in the collection, and k can be 100.
Constraint: the reduced representation \hat{A} is chosen such that the distance \delta = \|A - \hat{A}\| is minimal.

Singular Value Decomposition
Any m-by-n matrix A can be factored into
A = T \Sigma D^T = (orthogonal)(diagonal)(orthogonal)
The columns of T (m by m) are eigenvectors of AA^T, and the columns of D (n by n) are eigenvectors of A^T A.
The r singular values on the diagonal of \Sigma (m by n) are the square roots of the nonzero eigenvalues of both AA^T and A^T A.
The SVD is unique (up to sign flips in D and T).
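A sketch of this factorization applied to the document-by-word matrix A defined earlier (NumPy's SVD may flip the signs of some columns relative to the tables below, which is consistent with the uniqueness statement above):

```python
# Full SVD: T is m x m, Dt = D^T is n x n, s holds the singular values.
T, s, Dt = np.linalg.svd(A, full_matrices=True)

# Rebuild the m x n diagonal matrix Sigma and check that A = T Sigma D^T.
Sigma = np.zeros_like(A)
Sigma[:len(s), :len(s)] = np.diag(s)
assert np.allclose(T @ Sigma @ Dt, A)
```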

Intuition
SVD rotates the axes of the n-dimensional space so that the first axis runs along the largest variation among the documents, the second dimension runs along the second largest variation, and so on.
The matrices T and D represent the terms and documents in the new space.

Original Matrix

             d1   d2   d3   d4   d5   d6
cosmonaut     1    0    1    0    0    0
astronaut     0    1    0    0    0    0
moon          1    1    0    0    0    0
car           1    0    0    1    1    0
truck         0    0    0    0    0    1

T Matrix

             Dim 1   Dim 2   Dim 3   Dim 4   Dim 5
cosmonaut    -0.44   -0.30    0.57    0.58    0.25
astronaut    -0.13   -0.33   -0.59            0.73
moon         -0.48   -0.51   -0.37            0.61
car          -0.70    0.35    0.15   -0.58    0.16
truck        -0.26    0.65   -0.41    0.58   -0.09

D^T Matrix

          d1      d2      d3      d4      d5      d6
Dim 1    -0.75   -0.28   -0.20   -0.45   -0.33    0.12
Dim 2    -0.29   -0.53   -0.19    0.63    0.22    0.41
Dim 3     0.28   -0.75    0.45   -0.20   -0.12   -0.33
Dim 4                     0.58           -0.58    0.58
Dim 5    -0.53    0.29    0.63    0.19    0.41   -0.22

Matrix of Singular Values (diagonal entries of \Sigma)

2.16   1.59   1.28   1.00   0.39

Reduction
Restrict the matrices T, S, and D to their first k < n columns:
T_{t \times k} S_{k \times k} (D_{d \times k})^T is the best least-squares approximation of A by a matrix of rank k.
Term similarity can be computed as (T_{t \times k} S_{k \times k})(T_{t \times k} S_{k \times k})^T, since
AA^T = TSD^T (TSD^T)^T = TSD^T D S^T T^T = (TS)(TS)^T
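A sketch of the rank-k reduction and the term-similarity computation, reusing the SVD above (k = 2 is an arbitrary illustrative choice of mine):

```python
k = 2
Tk = T[:, :k]                  # first k columns of T
Sk = np.diag(s[:k])            # top k singular values
Dk = Dt[:k, :].T               # first k columns of D

A_k = Tk @ Sk @ Dk.T                 # best least-squares rank-k approximation of A
term_sim = (Tk @ Sk) @ (Tk @ Sk).T   # approximates A A^T in the reduced space
```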

Pros and Cons
+ Clean formal framework with a clearly defined optimization criterion
+ Used in a variety of applications (from IR to dialogue processing)
- Computationally expensive
- Assumes normally-distributed data

Conclusions
- The EM algorithm for Gaussian mixtures
- Latent Semantic Indexing
- Singular Value Decomposition