
Distributional Similarity Models (cont.)
Regina Barzilay
EECS Department, MIT
October 19, 2004

Last Time
- Semantic similarity
- Vector space model
- Similarity measures: cosine, Euclidean distance, ...
- Clustering: k-means, hierarchical

Example
Delicately handling the beautiful satin bindings, Emma looked with dazzled eyes at the names of the unknown authors. The orange blossoms were yellow with dust and the silver bordered satin ribbons frayed at the borders. The confessional forms a pendant to a statuette of the Virgin, clothed in a satin robe. Never had Emma been so beautiful as at this period. He picked up a cigar-case with a green silk border.

             the   border   Emma   ribbon
beautiful     1      0       2       0
satin         3      1       1       1
silk          0      1       0       0
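The recap above lists cosine as one similarity measure; below is a minimal NumPy sketch applying it to these count rows (the dictionary, the cosine helper, and the printed pairs are my own illustration, not part of the lecture):

```python
import numpy as np

# Word-context counts from the example above (columns: the, border, Emma, ribbon).
counts = {
    "beautiful": np.array([1, 0, 2, 0], dtype=float),
    "satin":     np.array([3, 1, 1, 1], dtype=float),
    "silk":      np.array([0, 1, 0, 0], dtype=float),
}

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(counts["satin"], counts["silk"]))       # shared 'border' context -> nonzero
print(cosine(counts["beautiful"], counts["silk"]))   # no shared contexts -> 0.0
```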

EM Clustering
- Soft version of k-means clustering
- Input: n m-dimensional objects X = {x_1, ..., x_n} \subseteq R^m to be clustered into k groups
- Observable data: X = {x_i}, where x_i = (x_{i1}, ..., x_{im})
- Unobservable data: Z = {z_i}, where within each z_i = (z_{i1}, ..., z_{ik}) the component z_{ij} is 1 if object i is a member of cluster j and 0 otherwise
- Clustering is viewed as estimating a mixture of probability distributions

Example of the EM algorithm for Soft Clustering
[Figure: three panels showing the same data with cluster centers c_1 and c_2 -- initial state, after iteration 1, after iteration 2]

Multivariate Normal Distributions
Key assumption: the data are generated by k Gaussians.
The probability density function for a Gaussian:
n_j(x; \mu_j, \Sigma_j) = \frac{1}{\sqrt{(2\pi)^m |\Sigma_j|}} \exp\left[-\tfrac{1}{2}(x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j)\right]
Goal: find the maximum likelihood model of the form
\sum_{j=1}^{k} \pi_j \, n_j(x; \mu_j, \Sigma_j)
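A minimal NumPy sketch of this density (the function name and argument names are mine):

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """n_j(x; mu_j, Sigma_j): multivariate normal density for an m-dimensional point x."""
    m = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** m * np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm
```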

The EM Algorithm for Gaussian Mixtures
Hidden parameters: \Theta_j = (\mu_j, \Sigma_j, \pi_j)
Log likelihood of the data:
L(X \mid \Theta) = \log \prod_{i=1}^{n} P(x_i) = \log \prod_{i=1}^{n} \sum_{j=1}^{k} \pi_j n_j(x_i; \mu_j, \Sigma_j) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} \pi_j n_j(x_i; \mu_j, \Sigma_j)
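Reusing the density sketch above, the log likelihood can be computed directly from this last form (a sketch; in practice one would work in log space with a log-sum-exp for numerical stability):

```python
def log_likelihood(X, pis, mus, sigmas):
    """L(X | Theta) = sum_i log sum_j pi_j * n_j(x_i; mu_j, Sigma_j)."""
    return sum(
        np.log(sum(pi * gaussian_density(x, mu, s)
                   for pi, mu, s in zip(pis, mus, sigmas)))
        for x in X
    )
```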

Iterative Solution
Estimate: If we knew the value of \Theta, we could compute the expected values of the hidden structure of the model.
Maximize: If we knew the expected values of the hidden structure of the model, then we could compute the maximum likelihood value of \Theta.

Initialization
The covariance matrices \Sigma_j are initialized to the identity matrix.
The means \mu_j are selected to be a random perturbation away from a data point randomly selected from X.
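A sketch of this initialization (the uniform mixing weights \pi_j = 1/k and the perturbation scale are my assumptions; the slide only specifies the covariances and means):

```python
def init_params(X, k, scale=0.01, seed=0):
    """Identity covariances; means = randomly chosen data points plus a small perturbation."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    idx = rng.choice(n, size=k, replace=False)
    mus = X[idx] + scale * rng.standard_normal((k, m))
    sigmas = [np.eye(m) for _ in range(k)]
    pis = np.full(k, 1.0 / k)          # assumption: start with uniform mixing weights
    return pis, mus, sigmas
```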

Expectation Step
Given the current parameters, compute the cluster membership probabilities
h_{ij} = E(z_{ij} \mid x_i; \Theta) = \frac{P(x_i \mid n_j; \Theta)}{\sum_{l=1}^{k} P(x_i \mid n_l; \Theta)}
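A sketch of the E-step; here I read E(z_{ij} | x_i; \Theta) as the usual responsibility, which weights each component density by its mixing weight \pi_j (the slide's notation folds this into P(x_i | n_j; \Theta)):

```python
def e_step(X, pis, mus, sigmas):
    """h_ij: probability that object i belongs to cluster j under the current Theta."""
    H = np.array([[pi * gaussian_density(x, mu, s)
                   for pi, mu, s in zip(pis, mus, sigmas)]
                  for x in X])
    return H / H.sum(axis=1, keepdims=True)   # normalize over clusters l = 1..k
```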

Maximization Step
Given the cluster membership probabilities (expected values), compute the most likely parameters \Theta:
\mu_j = \frac{\sum_{i=1}^{n} h_{ij} x_i}{\sum_{i=1}^{n} h_{ij}}
\Sigma_j = \frac{\sum_{i=1}^{n} h_{ij} (x_i - \mu_j)(x_i - \mu_j)^T}{\sum_{i=1}^{n} h_{ij}}
\pi_j = \frac{\sum_{i=1}^{n} h_{ij}}{\sum_{j=1}^{k} \sum_{i=1}^{n} h_{ij}}
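A sketch of the M-step, plus the estimate/maximize loop from the Iterative Solution slide, reusing the helpers above (the iteration count and the two-cluster toy data are arbitrary choices of mine):

```python
def m_step(X, H):
    """Re-estimate (mu_j, Sigma_j, pi_j) from the soft memberships H (n x k)."""
    n_j = H.sum(axis=0)                                  # soft counts per cluster
    mus = (H.T @ X) / n_j[:, None]
    sigmas = []
    for j in range(H.shape[1]):
        diff = X - mus[j]
        sigmas.append((H[:, j, None] * diff).T @ diff / n_j[j])
    pis = n_j / n_j.sum()
    return pis, mus, sigmas

# Toy run on two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(3.0, 0.5, size=(50, 2))])
pis, mus, sigmas = init_params(X, k=2)
for _ in range(25):
    H = e_step(X, pis, mus, sigmas)
    pis, mus, sigmas = m_step(X, H)
```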

Example of a Gaussian Mixture
Posterior probabilities P(w_i | c_j):

Main cluster   Word       1      2      3      4      5
1              ballot     0.63   0.12   0.04   0.09   0.11
1              polls      0.58   0.11   0.06   0.10   0.14
1              Gov        0.58   0.12   0.03   0.10   0.17
1              seats      0.11   0.59   0.02   0.14   0.15
2              profit     0.58   0.12   0.03   0.10   0.17
2              finance    0.15   0.55   0.01   0.13   0.16
2              payments   0.12   0.66   0.01   0.09   0.11
3              NFL        0.13   0.05   0.58   0.09   0.16
3              Reds       0.05   0.01   0.86   0.02   0.06

Other Methods of Dimensionality Reduction: Latent Semantic Indexing
- Similar objects are projected onto the same dimensions.
- The representation in the original space is changed as little as possible.

Document-by-Word Matrix

             d1   d2   d3   d4   d5   d6
cosmonaut     1    0    1    0    0    0
astronaut     0    1    0    0    0    0
moon          1    1    0    0    0    0
car           1    0    0    1    1    0
truck         0    0    0    0    0    1
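For the SVD examples that follow, this matrix can be entered directly (a sketch; the variable names are mine):

```python
import numpy as np

# Document-by-word matrix from the slide: rows = terms, columns = d1..d6.
terms = ["cosmonaut", "astronaut", "moon", "car", "truck"]
A = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 0, 0, 1]], dtype=float)
```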

Least-Squares Methods: Linear Regression
[Figure: data points plotted as y against x]

Least-Squares Methods: Linear Regression
Input: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n)
Goal: find f(x) = mx + b that minimizes the sum of the squares of the differences
SS(m, b) = \sum_{i=1}^{n} (y_i - f(x_i))^2

Linear Regression
Minimize
SS(m, b) = \sum_{i=1}^{n} (y_i - f(x_i))^2 = \sum_{i=1}^{n} (y_i - m x_i - b)^2
Setting \frac{\partial SS(m,b)}{\partial b} = \sum_{i=1}^{n} 2(y_i - m x_i - b)(-1) = 0 gives
b = \bar{y} - m \bar{x}, where \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i and \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
Substituting b back in and setting \frac{\partial SS(m,b)}{\partial m} = \frac{d}{dm} \sum_{i=1}^{n} (y_i - m x_i - \bar{y} + m \bar{x})^2 = 0 gives
m = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
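A sketch of the resulting closed-form fit (the function name and the toy example are mine):

```python
import numpy as np

def fit_line(x, y):
    """Least-squares line y ~ m*x + b using the closed form derived above."""
    x_bar, y_bar = x.mean(), y.mean()
    m = ((y - y_bar) * (x - x_bar)).sum() / ((x - x_bar) ** 2).sum()
    b = y_bar - m * x_bar
    return m, b

# Example: a noiseless line is recovered exactly.
x = np.array([1.0, 2.0, 3.0, 4.0])
m, b = fit_line(x, 2 * x + 1)      # -> m = 2.0, b = 1.0
```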

Singular Value Decomposition (SVD)
Rationale: increase similarity in representation by dimensionality reduction.
SVD projects an n-dimensional space onto a k-dimensional space, where n > k.
Example: word-document matrices in information retrieval; n is the number of word types in the collection, and k can be 100.
Constraint: the reduced representation \hat{A} is chosen such that the distance \delta = \|A - \hat{A}\| is minimal.

Singular Value Decomposition
Any m-by-n matrix A can be factored into
A = T \Sigma D^T = (orthogonal)(diagonal)(orthogonal)
The columns of T (m by m) are eigenvectors of AA^T, and the columns of D (n by n) are eigenvectors of A^T A.
The r singular values on the diagonal of \Sigma (m by n) are the square roots of the nonzero eigenvalues of both AA^T and A^T A.
The SVD is unique (up to sign flips in D and T).
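A sketch of this factorization applied to the document-by-word matrix A defined earlier (NumPy's SVD may flip the signs of some columns relative to the tables below, which is consistent with the uniqueness statement above):

```python
# Full SVD: T is m x m, Dt = D^T is n x n, s holds the singular values.
T, s, Dt = np.linalg.svd(A, full_matrices=True)

# Rebuild the m x n diagonal matrix Sigma and check that A = T Sigma D^T.
Sigma = np.zeros_like(A)
Sigma[:len(s), :len(s)] = np.diag(s)
assert np.allclose(T @ Sigma @ Dt, A)
```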

Intuition
SVD rotates the axes of the n-dimensional space so that the first axis runs along the largest variation among the documents, the second dimension runs along the second largest variation, and so on.
The matrices T and D represent the terms and documents in the new space.

Original Matrix

             d1   d2   d3   d4   d5   d6
cosmonaut     1    0    1    0    0    0
astronaut     0    1    0    0    0    0
moon          1    1    0    0    0    0
car           1    0    0    1    1    0
truck         0    0    0    0    0    1

T Matrix

             Dim 1   Dim 2   Dim 3   Dim 4   Dim 5
cosmonaut    -0.44   -0.30    0.57    0.58    0.25
astronaut    -0.13   -0.33   -0.59            0.73
moon         -0.48   -0.51   -0.37            0.61
car          -0.70    0.35    0.15   -0.58    0.16
truck        -0.26    0.65   -0.41    0.58   -0.09

D^T Matrix

          d1      d2      d3      d4      d5      d6
Dim 1    -0.75   -0.28   -0.20   -0.45   -0.33    0.12
Dim 2    -0.29   -0.53   -0.19    0.63    0.22    0.41
Dim 3     0.28   -0.75    0.45   -0.20   -0.12   -0.33
Dim 4                     0.58           -0.58    0.58
Dim 5    -0.53    0.29    0.63    0.19    0.41   -0.22

Matrix of Singular Values (diagonal entries of \Sigma)

2.16   1.59   1.28   1.00   0.39

Reduction
Restrict the matrices T, S, and D to their first k < n columns:
T_{t \times k} S_{k \times k} (D_{d \times k})^T is the best least-squares approximation of A by a matrix of rank k.
Term similarity can be computed as (T_{t \times k} S_{k \times k})(T_{t \times k} S_{k \times k})^T, since
AA^T = TSD^T (TSD^T)^T = TSD^T D S^T T^T = (TS)(TS)^T
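A sketch of the rank-k reduction and the term-similarity computation, reusing the SVD above (k = 2 is an arbitrary illustrative choice of mine):

```python
k = 2
Tk = T[:, :k]                  # first k columns of T
Sk = np.diag(s[:k])            # top k singular values
Dk = Dt[:k, :].T               # first k columns of D

A_k = Tk @ Sk @ Dk.T                 # best least-squares rank-k approximation of A
term_sim = (Tk @ Sk) @ (Tk @ Sk).T   # approximates A A^T in the reduced space
```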

Pros and Cons
+ Clean formal framework with a clearly defined optimization criterion
+ Used in a variety of applications (from IR to dialogue processing)
- Computationally expensive
- Assumes normally-distributed data

Conclusions
- The EM algorithm for Gaussian mixtures
- Latent Semantic Indexing
- Singular Value Decomposition