1 Computation of Similarity: Similarity Search as Computation
Stoyan Mihov and Klaus U. Schulz

2 Roadmap
- What is similarity computation?
- NLP examples of similarity computation
  - Machine translation
  - Speech recognition
  - Text correction and historical text normalization
- Research directions
  - General picture
  - Efficient similarity filtering with universal Levenshtein automata
  - Distance definition with expectation maximization
  - Viterbi and beam search in HMMs and approximate search in large databases
- Conclusion

3 What is similarity computation?
For a given pattern, find the object from a target set which is:
- most similar with respect to a given distance (e.g. acoustic similarity, orthographic similarity), AND
- most plausible (e.g. with respect to syntax, context); see the sketch after this list.
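This criterion is just an argmax over candidates that trades distance against plausibility. Below is a minimal sketch, assuming hypothetical `distance` and `log_plausibility` callables and a weighting factor `lam`; none of these names come from the slides.

```python
# Minimal sketch of similarity computation as constrained optimization:
# pick the candidate that is close to the pattern AND plausible on its own.
# distance, log_plausibility, and lam are illustrative placeholders.

def most_similar(pattern, candidates, distance, log_plausibility, lam=1.0):
    """Candidate maximizing log-plausibility minus weighted distance."""
    return max(candidates,
               key=lambda c: log_plausibility(c) - lam * distance(pattern, c))
```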

4 Statistical Machine Translation
- Input pattern: sequence of words (sentence in the source language)
- Target object: sequence of words (sentence in the target language)
- Target set: all finite sequences of words in the target language
- Similarity measure: translation equivalence probability of the correspondence of individual pairs of words or phrases, given by a translation model
- Object plausibility: object probability in the target language, given by a language model (e.g. Markov model)

5 Statistical Machine Translation (cont.)
Noisy-channel decomposition and decoding:
$\Pr(e \mid f) = \dfrac{\Pr(f \mid e)\,\Pr(e)}{\Pr(f)}, \qquad \hat{e} = \operatorname{argmax}_e \Pr(f \mid e)\,\Pr(e)$
Language model ($k$-gram):
$\Pr(e) = \prod_{i=1}^{n} \Pr(e_i \mid e_{i-k}\, e_{i-k+1} \cdots e_{i-1})$
Translation model, summing over alignments $a$:
$\Pr(f \mid e) = \sum_a \Pr(f, a \mid e), \qquad \Pr(f, a \mid e) = \dfrac{\Pr(m \mid e)}{(l+1)^m} \prod_{j=1}^{m} \Pr(f_j \mid e_{a_j})$
where $e = e_1 e_2 \cdots e_l$, $f = f_1 f_2 \cdots f_m$, $a = a_1 a_2 \cdots a_m$, $a_i \in \{0, 1, \ldots, l\}$.
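For a toy illustration of the translation model: in an IBM Model 1 style table the sum over alignments factorizes per target word. This is a minimal sketch, assuming a hypothetical dict `t` mapping (f, e) pairs to probabilities and a constant `eps` standing in for $\Pr(m \mid e)$.

```python
# Sketch: Pr(f|e) = eps/(l+1)^m * prod_j sum_i t(f_j | e_i),
# i.e. the alignment sum from the slide, factorized per target word.
# t and eps are illustrative assumptions, not from the slides.

def translation_prob(f_words, e_words, t, eps=1.0):
    """Model-1-style Pr(f | e) for word lists f_words, e_words."""
    l, m = len(e_words), len(f_words)
    extended = ["NULL"] + e_words      # e_0: the empty word (a_i = 0)
    prob = eps / (l + 1) ** m          # uniform alignment normalization
    for f in f_words:
        prob *= sum(t.get((f, e), 0.0) for e in extended)
    return prob
```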

6 Speech Recognition
- Input pattern: audio signal converted to a sequence of feature vectors, one for each slice of the signal
- Target object: sequence of words in the target language
- Target set: all finite sequences of words in the target language
- Similarity measure: acoustic equivalence probability of the correspondence of subsequences of feature vectors with a word's phonetization, given by an acoustic model
- Object plausibility: object probability in the target language, given by a language model (e.g. Markov model)

7 Speech Recognition (cont.)

8 Text correction and historical text normalization
- Input pattern: sequence of possibly garbled / historical words
- Target object: sequence of corrected / normalized words
- Target set: all finite sequences of words in the correct / modern language
- Similarity measure: edit distance with respect to primitive edit operations / historical variation patterns of individual pairs of words or phrases
- Object plausibility: object probability in the correct / modern language, given by a language model (e.g. Markov model)

9 Text correction and historical text normalization (cont.)
Dictionary: $D = \{w_1, w_2, \ldots, w_n\}$
Uncorrected or historical text: $u_1 u_2 \cdots u_l \in (\Sigma^*)^l$
Corrected or normalized text:
$w_{i_1} w_{i_2} \cdots w_{i_l} = \operatorname{argmax}_{w_{i_1} w_{i_2} \cdots w_{i_l} \in D^l} \Bigl( \log P(w_{i_1} w_{i_2} \cdots w_{i_l}) - \sum_{k=1}^{l} d(u_k, w_{i_k}) \Bigr)$
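A minimal runnable sketch of this criterion, under two simplifying assumptions: a unigram language model (so the argmax decomposes per token) and plain Levenshtein distance for $d$; `log_p` is a hypothetical dict of word log-probabilities.

```python
# Sketch of per-token correction: argmax over dictionary words w of
# log P(w) - d(u, w). Unigram LM and plain Levenshtein distance are
# simplifying assumptions; the slide optimizes over whole sequences.

def edit_distance(u, w):
    """Levenshtein distance between strings u and w (dynamic programming)."""
    dp = list(range(len(w) + 1))       # row for the empty prefix of u
    for i, cu in enumerate(u, 1):
        prev, dp[0] = dp[0], i
        for j, cw in enumerate(w, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,            # deletion
                                     dp[j - 1] + 1,        # insertion
                                     prev + (cu != cw))    # substitution/match
    return dp[len(w)]

def correct_token(u, dictionary, log_p):
    """Dictionary word w maximizing log P(w) - d(u, w)."""
    return max(dictionary, key=lambda w: log_p[w] - edit_distance(u, w))
```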

10 Research roadmap
- Base notion of similarity
- Basic methods for approximate search
- Refined notion of similarity (feedback of similarity results)
- Online efficiency improvement
- Efficient approximate search methods
- Efficient approximate search methods with refined notion of similarity

11 Research examples
- Efficient methods for approximate search: universal Levenshtein automata, possibly aligning all parts of the pattern in parallel (instead of left-to-right alignment)
- Distance definition based on data analysis: expectation maximization

12 Universal Levenshtein Automata
- Effective characterization of the set of words within a given distance to a pattern
- Allows efficient similarity filtering
- Universal: does not depend on the pattern
- Transitions on characteristic vectors calculated with respect to the pattern (see the sketch after this list)
- Existence of a universal automaton for the natural edit distances
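To make "transitions on characteristic vectors" concrete: for each input symbol the automaton consumes a bit-vector recording where that symbol occurs in a window of the pattern around the current position. This sketch assumes a window of length 2n+1 for distance bound n; the exact windowing convention varies between constructions, and the universal automaton's state table itself is not reproduced here.

```python
# Sketch: the characteristic vector a universal Levenshtein automaton
# consumes for one input symbol. Window length 2n+1 is an illustrative
# assumption; the automaton's states and transition table are omitted.

def characteristic_vector(symbol, pattern, pos, n):
    """Bits marking occurrences of symbol in pattern[pos : pos + 2n + 1]."""
    window = pattern[pos : pos + 2 * n + 1]
    return tuple(int(c == symbol) for c in window)

# Example: characteristic_vector('a', 'banana', 0, 1) == (0, 1, 0)
```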

13 Approximate search in large databases
- Wall effect: tolerating all mismatches at the beginning of the sequence leads to a full traversal of all the prefixes in the data set.
- Possible solution: tolerate only half of the possible mismatches in the first half of the sequence, and all of them in the second half; repeat the procedure on the backwards dictionary. A sketch follows below.
[Figure: forward dictionary and backwards dictionary]
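A brute-force sketch of the split-error idea, standing in for the trie traversals on the slide: a word passes the forward check if the first half of the pattern matches some prefix of it with at most half the error budget; the same check on reversed strings plays the role of the backwards dictionary. It reuses `edit_distance` from the correction sketch above; `prefix_errors` and `half_tolerant_match` are hypothetical helpers.

```python
# Sketch of the wall-effect mitigation: <= k//2 errors in the first half
# of the pattern, <= k overall; union with the same check on reversed
# strings. Brute force over the word list stands in for trie traversal.

def prefix_errors(u, w):
    """Minimal edit distance between u and any prefix of w."""
    dp = list(range(len(w) + 1))
    for i, cu in enumerate(u, 1):
        prev, dp[0] = dp[0], i
        for j, cw in enumerate(w, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (cu != cw))
    return min(dp)

def half_tolerant_match(pattern, word, k):
    """First half of pattern may use at most k // 2 errors, k in total."""
    half = pattern[: len(pattern) // 2]
    return (prefix_errors(half, word) <= k // 2
            and edit_distance(pattern, word) <= k)

def approximate_search(pattern, dictionary, k):
    """Union of the forward pass and the pass on reversed strings."""
    forward = {w for w in dictionary if half_tolerant_match(pattern, w, k)}
    backward = {w for w in dictionary
                if half_tolerant_match(pattern[::-1], w[::-1], k)}
    return forward | backward
```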

14 Viterbi and beam search in HMMs
[Figure: left-to-right HMM with transitions $a_{1,1}, a_{1,2}, a_{2,2}, a_{2,3}, a_{3,3}$ and emissions $b_1, b_2, b_3$]
Viterbi algorithm:
$\delta_t(i) = \max_{s_1,\ldots,s_{t-1}} P(O_1,\ldots,O_t,\, s_1,\ldots,s_{t-1},\, s_t = i)$
$\delta_1(i) = a_{0,i}\, b_i(O_1), \qquad \delta_{t+1}(i) = \max_j \delta_t(j)\, a_{j,i}\, b_i(O_{t+1})$
Naïve search complexity: $O(N^T)$. Viterbi search complexity: $O(N^2 T)$. Beam search complexity: $O(MNT)$. A runnable sketch follows below.
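A minimal sketch of the Viterbi recursion above, under assumed toy conventions: `pi[i]` for the initial probabilities $a_{0,i}$, `a[j][i]` for transitions, and `b[i]` as a per-state dict of emission probabilities.

```python
# Sketch of the Viterbi algorithm: delta[i] holds the probability of the
# best state sequence ending in state i after the observations seen so far.
# pi, a, b follow hypothetical toy conventions described in the lead-in.

def viterbi(obs, pi, a, b):
    """Most probable state path for the observation sequence obs."""
    n = len(pi)
    # Initialization: delta_1(i) = a_{0,i} * b_i(O_1)
    delta = [pi[i] * b[i][obs[0]] for i in range(n)]
    back = []
    # Recursion: delta_{t+1}(i) = max_j delta_t(j) * a_{j,i} * b_i(O_{t+1})
    for o in obs[1:]:
        best = [max(range(n), key=lambda j: delta[j] * a[j][i])
                for i in range(n)]
        delta = [delta[best[i]] * a[best[i]][i] * b[i][o] for i in range(n)]
        back.append(best)
    # Backtracking from the best final state
    state = max(range(n), key=lambda i: delta[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Beam search would differ only in pruning `delta` to the M highest-scoring states after each step, which gives the $O(MNT)$ bound on the slide.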

15 Acoustic similarity: HMM training with the Baum-Welch algorithm
[Figure: left-to-right HMM with transitions $a_{1,1}, a_{1,2}, a_{2,2}, a_{2,3}, a_{3,3}$ and emissions $b_1, b_2, b_3$]
Forward algorithm:
$\alpha_t(i) = P(O_1,\ldots,O_t,\, s_t = i) = \sum_{s_1,\ldots,s_{t-1}} P(O_1,\ldots,O_t,\, s_1,\ldots,s_{t-1},\, s_t = i)$
$\alpha_1(i) = a_{0,i}\, b_i(O_1), \qquad \alpha_{t+1}(i) = \sum_j \alpha_t(j)\, a_{j,i}\, b_i(O_{t+1})$
Backward algorithm:
$\beta_t(i) = P(O_{t+1},\ldots,O_T \mid s_t = i), \qquad \beta_T(i) = 1, \qquad \beta_t(i) = \sum_j a_{i,j}\, b_j(O_{t+1})\, \beta_{t+1}(j)$
Forward-backward algorithm:
$\xi_t(i,j) = P(s_t = i,\, s_{t+1} = j \mid O_1^T) = \dfrac{P(O_1^T,\, s_t = i,\, s_{t+1} = j)}{P(O_1^T)} = \dfrac{\alpha_t(i)\, a_{i,j}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{P(O_1^T)}$
$\gamma_t(i) = P(s_t = i \mid O_1^T) = \sum_j \xi_t(i,j)$
Baum-Welch algorithm:
$L(\lambda \mid O_1^T) = \log P(O_1^T \mid \lambda), \qquad \operatorname{argmax}_\lambda L(\lambda \mid O_1^T)$
$\hat{a}_{i,j} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad \hat{c}_{i,k} = \dfrac{\sum_{t=1}^{T} \gamma_t(i,k)}{\sum_{t=1}^{T} \gamma_t(i)}, \qquad \gamma_t(i,k) = \dfrac{\sum_j \alpha_{t-1}(j)\, a_{j,i}\, c_{i,k}\, N(O_t; \mu_{i,k}, \Sigma_{i,k})\, \beta_t(i)}{P(O_1^T)}$
$\hat{\mu}_{i,k} = \dfrac{\sum_{t=1}^{T} \gamma_t(i,k)\, O_t}{\sum_{t=1}^{T} \gamma_t(i,k)}, \qquad \hat{\Sigma}_{i,k} = \dfrac{\sum_{t=1}^{T} \gamma_t(i,k)\, (O_t - \hat{\mu}_{i,k})(O_t - \hat{\mu}_{i,k})^\top}{\sum_{t=1}^{T} \gamma_t(i,k)}$
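To ground the recursions above, here is a minimal runnable sketch of the forward and backward passes and the $\xi_t(i,j)$ quantity, assuming the same toy conventions as the Viterbi sketch (`pi`, `a`, `b`) and discrete emissions in place of the slide's Gaussian mixture densities; the full Baum-Welch re-estimation loop is omitted.

```python
# Sketch of the forward and backward recursions that Baum-Welch is built on.
# Discrete emissions (b[i] as a dict) stand in for Gaussian mixtures.

def forward(obs, pi, a, b):
    """alpha[t][i] = P(O_1..O_{t+1}, s_{t+1} = i), with zero-based t."""
    n = len(pi)
    alpha = [[pi[i] * b[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        alpha.append([sum(alpha[-1][j] * a[j][i] for j in range(n)) * b[i][o]
                      for i in range(n)])
    return alpha

def backward(obs, pi, a, b):
    """beta[t][i] = P(O_{t+2}..O_T | s_{t+1} = i), with beta[-1][i] = 1."""
    n = len(pi)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        beta.insert(0, [sum(a[i][j] * b[j][o] * beta[0][j] for j in range(n))
                        for i in range(n)])
    return beta

def xi(t, i, j, obs, alpha, beta, a, b, total):
    """xi_t(i,j); total = P(O) = sum(forward(...)[-1])."""
    return alpha[t][i] * a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j] / total
```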

16 Conclusion
- The similarity computation paradigm is not a universal model of computation, but it arises in many NLP applications.
- Adaptive methods are necessary, since the optimal notion of similarity is context and application dependent.
- Efficiency is a central issue because of the huge search space.