Lecture 13b: Latent Semantic Analysis



CS540, 4/19/18
Material borrowed (with permission) from Vasileios Hatzivassiloglou & Evimaria Terzi. Mistakes are mine.

Announcement
  Project #2 test problems released
    Input files are complete states
    Output files are things that must be true in the final state
    All outputs are possible
  Input/output #1 (reordering)
    Start with red on bottom, black on top
    End with black on bottom, red on top
    Lots of steps to achieve 2nd subgoal
  Input/output #2 (adding subgoal)
    Goal: black on red
    Twist: move black first, even though red is on bottom & must move

Announcements (II)
  Input/output #3: Stack randomly scattered blocks
    Lots of solutions (many ?s in output file)
    Large search space

Announcements (III)
  Project #2 Paper: Due Tuesday
  Four pages (no longer)
  Three parts
    Introduction: describe your motivation and program
      Why you designed it the way you did
      How it is supposed to work
      Predicted limitations and risks
    Performance
      How well does it work on the problems given?
      Report relevant quantitative metrics for your design
      You may supplement with additional problems of your own design
    Conclusion
      Do your predictions match the results? If not, why not?

Announcements (IV)
  In-class presentation: also Tuesday
    5 minutes maximum (timed)
    Powerpoint or pdf slides
    Email to me (draper@colostate.edu) the night before
    Same 3-part structure as the paper
  Reading assignment for next Thursday
    Thomas Hofmann. Probabilistic Latent Semantic Indexing. ACM SIGIR Forum, 51(2), 2017.

Homonyms
  Crane
    A bird with long legs and a long neck
    A tool used to lift large objects
    A motion that stretches (e.g. crane your neck)
  Date
    A fruit of the palm tree
    A romantic evening
    A day of the month
  Foil
    A material used to wrap things (sometimes made of aluminum)
    To ruin a plan or scheme
    A dramatic foil

Resolving Homonyms
  Part of speech tagging
    Distinguishes between meanings with different grammatical roles
    E.g. crane/bird (noun) vs crane/motion (verb)
    But all the homonyms on the previous slide have noun meanings
  Direct modifiers (adjectives/adverbs/phrases)
    1 ton crane: probably not a bird
      Anyone seen the movie Rampage?
    Not always available
    Require parsing and de-referencing
      To know which word is being modified

Meaning as Association
  Context
    Different meanings (word senses) occur in different contexts
    Crane/bird often occurs with fish, marsh, water, rare, ...
    Crane/tool often occurs with construction, building, lift, collapse, ...
  Use surrounding text to select word meanings

Bag of Words
  Ignore syntax altogether
  Treat text as an unordered set of words
  Two texts are similar if their words are similar
  Problem: two texts may use different terms for the same thing
    We've gone from homonyms to synonyms
    How do we judge the similarity of groups of words?

Association as information
  Calculating mutual information
    Given a random variable X, entropy is
      $H(X) = -\sum_{x} p(x) \log p(x)$
    Mutual information between X and Y:
      $I(X, Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$
    Mutual information is the reduction in entropy from knowing another variable
      Based on Kullback-Leibler distance $D(p \,\|\, q)$
      $I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$

Specific Mutual information
  Church and Hanks, 1989; Smadja 1990
  Only the 1-1 term
    $SI(X, Y) = \log \frac{P(X, Y)}{P(X)\, P(Y)}$

Association as conditional probabilities
  The Dice coefficient (Dice, 1945)
    $D(X, Y) = \frac{2\, P(X, Y)}{P(X) + P(Y)}$
  Similar to Jaccard coefficient
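The association measures above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: the function names, the use of base-2 logarithms, and the co-occurrence counts in the usage lines are all assumptions made for the example.

```python
import math

def entropy(p):
    """H(X) = -sum p(x) log2 p(x); zero-probability outcomes contribute nothing."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def mutual_information(joint):
    """I(X,Y) from a joint probability table joint[x][y] (entries sum to 1)."""
    p_x = [sum(row) for row in joint]          # marginal of X
    p_y = [sum(col) for col in zip(*joint)]    # marginal of Y
    total = 0.0
    for x, row in enumerate(joint):
        for y, p_xy in enumerate(row):
            if p_xy > 0:
                total += p_xy * math.log2(p_xy / (p_x[x] * p_y[y]))
    return total

def specific_mi(n_xy, n_x, n_y, n_total):
    """SI(X,Y) = log2 P(X,Y) / (P(X) P(Y)), with probabilities estimated from counts."""
    return math.log2((n_xy / n_total) / ((n_x / n_total) * (n_y / n_total)))

def dice(n_xy, n_x, n_y):
    """Dice coefficient 2 P(X,Y) / (P(X) + P(Y)); the corpus size cancels out."""
    return 2 * n_xy / (n_x + n_y)

# Hypothetical counts: "crane" in 20 sentences, "marsh" in 20, both together in 10,
# out of 1000 sentences.
print("SI(crane, marsh):", specific_mi(10, 20, 20, 1000))
print("Dice(crane, marsh):", dice(10, 20, 20))
print("I(X,Y), independent variables:", mutual_information([[0.25, 0.25], [0.25, 0.25]]))
```

Specific mutual information looks only at the single (word present, word present) cell, while `mutual_information` sums over every cell of the joint table, which is the "only the 1-1 term" distinction made on the slide.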

Datasets in the form of matrices
  We are given n objects and d features describing the objects. (Each object has d numeric values describing it.)
  Dataset
    An n-by-d matrix A; $A_{ij}$ shows the importance of feature j for object i.
    Every row of A represents an object.
  Goal
    1. Understand the structure of the data, e.g., the underlying process generating the data.
    2. Reduce the number of features representing the data

Market basket matrices
  n customers, d products (e.g., milk, bread, wine, etc.)
  $A_{ij}$ = quantity of the j-th product purchased by the i-th customer
  Find a subset of the products that characterize customer behavior

Social-network matrices
  n users, d groups (e.g., BU group, opera, etc.)
  $A_{ij}$ = participation of the i-th user in the j-th group
  Find a subset of the groups that accurately clusters social-network users

Document matrices
  n documents, d terms (e.g., theorem, proof, etc.)
  $A_{ij}$ = frequency of the j-th term in the i-th document
  Find a subset of the terms that accurately clusters the documents

The Singular Value Decomposition (SVD)
  Data matrices have n rows (one for each object) and d columns (one for each feature).
  Rows: vectors in a Euclidean space.
  Two objects are close if the angle between their corresponding vectors is small.
  [Figure: objects d and x drawn as vectors in feature space, with the angle (d,x) between them]

SVD: Example
  [Figure: 2-dimensional points with the 1st and 2nd (right) singular vectors drawn through them]
  Input: 2-dimensional points
  Output:
    1st (right) singular vector: direction of maximal variance,
    2nd (right) singular vector: direction of maximal variance, after removing the projection of the data along the first singular vector.
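The 2-dimensional SVD example can be reproduced with NumPy. The sketch below makes a few assumptions: the five data points are invented, and the data is mean-centered before the decomposition so that "direction of maximal variance" is measured around the mean (the slide does not say whether its points were centered).

```python
import numpy as np

# Hypothetical 2-dimensional points (one object per row), spread mostly along (1, 1).
A = np.array([
    [ 3.0,  3.2],
    [ 1.0,  0.8],
    [ 0.0,  0.0],
    [-1.0, -1.1],
    [-3.0, -2.9],
])

# Center each feature so that variance is measured around the mean.
A_centered = A - A.mean(axis=0)

# Rows of Vt are the right singular vectors, ordered by decreasing singular value.
U, s, Vt = np.linalg.svd(A_centered, full_matrices=False)
v1, v2 = Vt[0], Vt[1]

def cosine(a, b):
    """Cosine of the angle between two object vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("1st right singular vector:", v1)
print("2nd right singular vector:", v2)
print("singular values:", s)
print("cosine between objects 0 and 1:", cosine(A[0], A[1]))
```

The sign of each singular vector is arbitrary, so `v1` may come out pointing along (1, 1) or along (-1, -1); both describe the same direction.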

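Singular values
  [Figure: the same 2-dimensional points, with the 1st and 2nd (right) singular vectors labeled $\sigma_1$ and $\sigma_2$]
  $\sigma_1$: measures how much of the data variance is explained by the first singular vector.
  $\sigma_2$: measures how much of the data variance is explained by the second singular vector.

SVD decomposition
  $A = U \Sigma V^T$, with shapes $(n \times d) = (n \times \ell)(\ell \times \ell)(\ell \times d)$
  U (V): orthogonal matrix containing the left (right) singular vectors of A.
  $\Sigma$: diagonal matrix containing the singular values of A: $(\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_\ell)$
  Exact computation of the SVD of an m-by-n matrix takes $O(\min\{mn^2, m^2 n\})$ time.
  The top k left/right singular vectors/values can be computed faster using Lanczos/Arnoldi methods.

SVD and Rank-k approximations
  $A = U \Sigma V^T$
  [Figure: A (objects by features) factored into U, $\Sigma$, $V^T$, each split into a significant part and a noise part]

Rank-k approximations ($A_k$)
  $A_k = U_k \Sigma_k V_k^T$, with shapes $(n \times d) = (n \times k)(k \times k)(k \times d)$
  $U_k$ ($V_k$): orthogonal matrix containing the top k left (right) singular vectors of A.
  $\Sigma_k$: diagonal matrix containing the top k singular values of A
  $A_k$ is an approximation of A
  $A_k$ is the best rank-k approximation of A

SVD as an optimization problem
  Find C to minimize:
    $\min_{C} \| A - C X \|_F^2$, where A is $n \times d$, C is $n \times k$, X is $k \times d$
  Frobenius norm:
    $\| A \|_F^2 = \sum_{i,j} A_{ij}^2$

Relating SVD to PCA
  SVD is a matrix decomposition algorithm. Applied to the raw matrix, we get:
    $A = U \Sigma V^T$
  When we apply SVD to the covariance matrix, we call it PCA
    $A A^T = U \Sigma^2 U^T$
  There is therefore a second form of PCA:
    $A^T A = V \Sigma^2 V^T$
  Note: Eigenvalues are singular values squared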
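A short NumPy check of the two claims above: that the error of the rank-k approximation is governed by the discarded singular values, and that the eigenvalues of $A^T A$ are the squared singular values of A. The random 6-by-4 matrix, the seed, and the helper name `rank_k_approximation` are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))          # hypothetical 6 objects x 4 features

U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k_approximation(U, s, Vt, k):
    """A_k = U_k Sigma_k V_k^T, built from the top k singular triplets."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

k = 2
A_k = rank_k_approximation(U, s, Vt, k)

# Frobenius error of A_k compared with the discarded singular values.
error = np.linalg.norm(A - A_k, "fro")
discarded = np.sqrt(np.sum(s[k:] ** 2))

# Eigenvalues of A^T A (sorted in descending order) compared with sigma^2.
eigenvalues = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]

print("rank of A_k:", np.linalg.matrix_rank(A_k))
print("||A - A_k||_F:", error, "| sqrt(sum of discarded sigma^2):", discarded)
print("eigenvalues of A^T A:", eigenvalues)
print("squared singular values:", s ** 2)
```

For PCA proper the columns of A would normally be mean-centered first, so that $A^T A$ is proportional to the covariance matrix; the identity between eigenvalues and squared singular values holds either way.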

The CX-decomposition
  Find C that contains a subset of the columns of A to minimize:
    $\min_{C} \| A - C X \|_F^2$, where A is $n \times d$, C is $n \times k$, X is $k \times d$
  Given C it is easy to find X from standard least squares.
  However, finding C is now hard!!!

Why CX-decomposition
  If A is an object-feature matrix, then selecting representative columns is equivalent to selecting representative features
  This leads to easier interpretability; compare to eigenfeatures, which are linear combinations of all features.

Algorithms for the CX decomposition
  The SVD-based algorithm
  The greedy algorithm
  The k-means-based algorithm

Algorithms for the CX decomposition
  The SVD-based algorithm
    Do SVD first
    Map k columns of A to the left singular vectors
  The greedy algorithm
    Greedily pick k columns of A that minimize the error
  The k-means-based algorithm
    Find k centers (by clustering the columns)
    Map the k centers to columns of A

Discussion on the CX decomposition
  The vectors in C are not orthogonal: they do not define a space
  It maintains the sparsity of the data

Latent Semantic Analysis
  How do we use SVD (or CX, or ICA, or ...) in NLP?
  Let's look at the text retrieval problem
    Given a corpus of documents
    And a new document A
    Order the documents in the corpus by similarity to A
  Not really. Usually IR is defined as returning the N most similar documents. But we will look at retrieval issues later. For the moment, defining an ordering will do.
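Going back to the CX decomposition, the greedy algorithm can be sketched as follows. This is one plausible reading of "greedily pick k columns of A that minimize the error", not necessarily the exact procedure the slides had in mind: at each step it adds the column that gives the lowest Frobenius error once X is refit by least squares. The function names and the small test matrix are hypothetical.

```python
import numpy as np

def cx_error(A, cols):
    """Frobenius error ||A - C X||_F with C = A[:, cols] and X from least squares."""
    C = A[:, cols]
    X, *_ = np.linalg.lstsq(C, A, rcond=None)
    return float(np.linalg.norm(A - C @ X, "fro")), X

def greedy_cx(A, k):
    """Add, k times, the not-yet-chosen column of A that most reduces the error."""
    chosen = []
    for _ in range(k):
        candidates = [j for j in range(A.shape[1]) if j not in chosen]
        best = min(candidates, key=lambda j: cx_error(A, chosen + [j])[0])
        chosen.append(best)
    error, X = cx_error(A, chosen)
    return chosen, X, error

# Hypothetical object-feature matrix whose 4 columns span only 2 dimensions.
c0 = np.array([1.0, 0.0, 2.0, 1.0])
c2 = np.array([0.0, 1.0, 1.0, 3.0])
A = np.column_stack([c0, 2 * c0, c2, c0 + c2])

cols, X, error = greedy_cx(A, k=2)
print("selected columns:", cols)
print("reconstruction error:", error)
```

Because every column of C is an actual column of A, the selected indices can be read directly as "representative features", which is the interpretability argument made for CX above. Each greedy step refits a least-squares problem per candidate, so this sketch is only practical for small matrices.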

LSA Step 1: Analyze corpus by SVD
  corpus matrix = $U \Sigma V^T$
  [Figure: the SVD diagram from the rank-k slide (objects by features, split into significant and noise parts)]

LSA Step 1 (cont)
  The left singular vectors U map documents to concepts
  Discard all but the first K columns of U
    Assuming they are ordered by the magnitude of the singular values
    The 1st K columns will map documents onto the K major concepts in your corpus
  $U_k$ is a compacted version of the corpus
    The corpus is a documents x terms matrix
      Note: terms is the size of your lexicon!
    $U_k$ is documents x K
      Much smaller
      Arguably more meaningful
  Each row is a vector
    Each term $U_k[i,j]$ measures how much concept j occurs in document i

LSA Step 2
  When given A (the query document, a 1 x terms row vector), compute $a = A V_k \Sigma_k^{-1}$
    $V_k$ (terms x K) holds the top K right singular vectors; this is the same map that sends each corpus row to its row of $U_k$
    This will give a vector a with K entries
    The ith term measures how much concept i appears in A
  Compute the cosine of the angle between a and every row of $U_k$
    The closer to 1, the more similar the document is to A
    $\cos(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|}$

LSA Step 2 (cont)
  Faster than you might think
  Typically
    One corpus
    Many query documents A, given one at a time
  Compute $U_k$ once
    Normalize the rows to have magnitude 1.
  For every query document
    Compute $a = A V_k \Sigma_k^{-1}$
    Normalize it to have magnitude 1
    Now compute $U_k a$ using the normalized versions.
    The result is a vector of cosines of angles.

Disambiguating Homonyms
  We started with the homonym problem
    How to select the correct word sense?
  A solution: use LSA
    Collect every sentence the word occurs in from a corpus
      This is a set of small documents
    Perform LSA, keeping K singular vectors
      K should account for 85% of energy
    For every new use, find closest word sense
  Advantage: works for new terms, if you add them to the corpus
  Disadvantage: word sense is a column number.
    Linguists might still want to know what it means
    If CX decomposition is used, you can say "as used in document y"
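The two LSA steps can be put together in a small NumPy sketch. Several things here are assumptions rather than part of the lecture: the toy documents x terms counts, the helper names, the reading of "energy" as the sum of squared singular values, and the fold-in $a = A V_k \Sigma_k^{-1}$, which is used because it is the map that takes each corpus row to its own row of $U_k$ and therefore puts the query in the same K-dimensional concept space.

```python
import numpy as np

# Hypothetical documents x terms counts; terms are
# [crane, bird, marsh, construction, lift].
corpus = np.array([
    [2.0, 1.0, 1.0, 0.0, 0.0],   # doc 0: crane as a bird
    [1.0, 2.0, 1.0, 0.0, 0.0],   # doc 1: crane as a bird
    [2.0, 0.0, 0.0, 1.0, 2.0],   # doc 2: crane as a tool
    [1.0, 0.0, 0.0, 2.0, 1.0],   # doc 3: crane as a tool
])

def normalize_rows(M):
    """Scale each row to unit length (all-zero rows are left as zeros)."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return M / np.where(norms == 0, 1.0, norms)

# Step 1: SVD of the corpus.
U, s, Vt = np.linalg.svd(corpus, full_matrices=False)

# Smallest K whose singular values hold at least 85% of the energy
# (assumption: energy = sum of squared singular values).
energy = np.cumsum(s ** 2) / np.sum(s ** 2)
K = int(np.searchsorted(energy, 0.85)) + 1

U_k, s_k, V_k = U[:, :K], s[:K], Vt[:K, :].T
U_k_unit = normalize_rows(U_k)           # computed once per corpus

def rank_documents(query):
    """Step 2: fold the query into concept space and rank the corpus documents."""
    a = (query @ V_k) / s_k              # a = A V_k Sigma_k^{-1}
    a_unit = a / np.linalg.norm(a)
    cosines = U_k_unit @ a_unit          # one cosine per corpus document
    return np.argsort(-cosines), cosines

query = np.array([1.0, 1.0, 1.0, 0.0, 0.0])   # hypothetical bird-sense query
order, cosines = rank_documents(query)
print("K:", K)
print("documents ordered by similarity:", order)
print("cosines:", cosines)
```

Normalizing the rows of `U_k` once and each query once turns the per-query work into a single matrix-vector product, which is the "faster than you might think" point from the slide. For homonym disambiguation, the rows of `corpus` would be the individual sentences containing the word.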
