Infant Language Acquisition, MIT (Saffran et al., 1997)

- 8-month-old babies were exposed to a stream of syllables
- The stream was composed of synthetic words (e.g., pabikumalikiwabufa)
- After only 2 minutes of exposure, infants can distinguish words from non-words (e.g., pabiku vs. kumali)

Today: Unsupervised Vocabulary Induction

- Vocabulary induction from unsegmented text
- Vocabulary induction from the speech signal
- Sequence alignment algorithms

Vocabulary Induction

Task: unsupervised learning of word boundary segmentation

Simple case:
Ourenemiesareinnovativeandresourceful,andsoarewe.
Theyneverstopthinkingaboutnewwaystoharmourcountryandourpeople,andneitherdowe.

More ambitious: text in a writing system with no spaces at all, as in the Japanese newswire experiments below.
Word Segmentation (Ando & Lee, 2000)

Key idea: for each candidate boundary, compare the frequency of the n-grams adjacent to the proposed boundary with the frequency of the n-grams that straddle it.

[Figure: a candidate boundary inside the string TINGEVID, with non-straddling 4-grams s1 = TING and s2 = EVID, and straddling 4-grams t1 = INGE, t2 = NGEV, t3 = GEVI]

For N = 4, consider the 6 questions of the form: Is #(s_i) >= #(t_j)?, where #(x) is the number of occurrences of x in the corpus.
Example: Is TING more frequent in the corpus than INGE?

Notation:
- s^n_1, s^n_2: the non-straddling n-grams immediately to the left and to the right of location k
- t^n_j: the straddling n-gram with j characters to the right of location k
- I_>=(y, z): an indicator function that is 1 when y >= z, and 0 otherwise

Algorithm for Word Segmentation

1. Calculate the fraction of affirmative answers for each n in N:

   v_n(k) = (1 / (2(n-1))) * Sum_{i=1}^{2} Sum_{j=1}^{n-1} I_>=(#(s^n_i), #(t^n_j))

2. Average the contributions of each n-gram order:

   V_N(k) = (1 / |N|) * Sum_{n in N} v_n(k)

Place a boundary at every location l such that either:
- l is a local maximum: V_N(l) > V_N(l - 1) and V_N(l) > V_N(l + 1), or
- V_N(l) >= t, a threshold parameter

Experimental Framework

- Corpus: 150 megabytes of 1993 Nikkei newswire
- Manual annotations: 50 sequences for the development set (parameter tuning) and 50 sequences for the test set
- Baseline algorithms: the Chasen and Juman morphological analyzers (vocabularies of 115,000 and 231,000 words)
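The boundary-scoring procedure above can be sketched in a few lines. This is an illustrative implementation, not the authors' code: the function name, the choice of orders N = {2, 3}, and the handling of string edges are assumptions.

```python
from collections import Counter

def boundary_scores(text, orders=(2, 3)):
    """Score each candidate boundary k (between text[k-1] and text[k]) with
    V_N(k): the averaged fraction of affirmative answers to the questions
    "is a non-straddling n-gram at least as frequent as a straddling one?"."""
    # Precompute n-gram counts for every order in N.
    counts = {n: Counter(text[i:i + n] for i in range(len(text) - n + 1))
              for n in orders}
    scores = {}
    # Only score locations with enough context on both sides.
    for k in range(max(orders), len(text) - max(orders)):
        total = 0.0
        for n in orders:
            s_left = text[k - n:k]        # non-straddling n-gram left of k
            s_right = text[k:k + n]       # non-straddling n-gram right of k
            affirmative = 0
            for j in range(1, n):         # straddling n-grams t_1 .. t_{n-1}
                t_j = text[k - (n - j):k + j]  # j characters right of k
                for s in (s_left, s_right):
                    if counts[n][s] >= counts[n][t_j]:
                        affirmative += 1
            total += affirmative / (2 * (n - 1))   # v_n(k)
        scores[k] = total / len(orders)            # V_N(k)
    return scores
```

Boundaries would then be placed at local maxima of the returned scores, or wherever the score clears the threshold t.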
Evaluation

- Precision (P): the percentage of proposed brackets that exactly match word-level brackets in the annotation
- Recall (R): the percentage of word-level annotation brackets that are proposed by the algorithm
- F = 2PR / (P + R)
- Result: F = 82% (an improvement of 1.38% over Juman and of 5.39% over Chasen)

Today: Unsupervised Vocabulary Induction

- Vocabulary induction from unsegmented text
- Vocabulary induction from the speech signal
- Sequence alignment algorithms

Performance on Other Datasets (Cheng & Mitzenmacher)

  Orwell (English)       79.8
  Song lyrics (Romaji)   67.6
  Goethe (German)        75.2
  Verne (French)         72.9
  Arrighi (Italian)      73.1

Aligning Two Sequences

Given two possibly related strings S1 and S2, find the longest common subsequence.
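The bracket-based evaluation can be made concrete with a small helper. This is a minimal sketch, assuming brackets are represented as (start, end) character spans and scored by exact match; the function name is illustrative.

```python
def bracket_f_measure(proposed, gold):
    """Precision, recall, and F over word-level brackets.
    `proposed` and `gold` are collections of (start, end) spans."""
    proposed, gold = set(proposed), set(gold)
    matched = len(proposed & gold)          # exact-match brackets only
    p = matched / len(proposed)             # fraction of proposals that are correct
    r = matched / len(gold)                 # fraction of gold brackets recovered
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

For example, proposing spans {(0,3), (3,6), (6,9)} against gold {(0,3), (3,6), (6,11)} matches 2 of 3 on each side, giving P = R = F = 2/3.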
How Can We Compute the Best Alignment?

We need a scoring system for ranking alignments.

Substitution cost:

        A     G     T     C
  A    +1  -0.5    -1    -1
  G  -0.5    +1    -1    -1
  T    -1    -1    +1  -0.5
  C    -1    -1  -0.5    +1

plus a gap (insertion & deletion) cost.

Key insight: the score is additive, so we can compute the best alignment recursively. For a given aligned pair (i, j), the best alignment is:

  best alignment of S1[1...i] and S2[1...j]
  + best alignment of S1[i...n] and S2[j...m]

Can We Simply Enumerate All Possible Alignments?

Naive enumeration is prohibitively expensive:

  C(n+m, m) = (m+n)! / (m! n!), which for n = m equals (2m)! / (m!)^2, roughly 2^(2m) / sqrt(pi*m)

  n      number of alignments
  10     184,756
  20     1.4E+11
  100    9.0E+58

Alignment using dynamic programming can be done in O(nm).

Alignment Matrix

Alignment of two sequences can be modeled as the task of finding the path with the highest weight in a matrix.

[Figure: an example alignment of two short strings and the corresponding path through the alignment matrix]
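The counts in the table above follow directly from the binomial formula; a one-line check (function name is mine, not from the slides):

```python
import math

def alignment_count(n, m):
    """Number of distinct global alignments of two sequences of lengths
    n and m, counted as C(n+m, m) = (n+m)! / (n! m!)."""
    return math.comb(n + m, m)
```

For n = m = 10 this gives 184,756, and for n = m = 100 about 9.0E+58, matching the table and motivating the O(nm) dynamic program.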
Global Alignment: the Needleman-Wunsch Algorithm

To align two strings x and y, we construct a matrix F:

  F(i, j): the score of the best alignment between the initial segment x_{1...i} of x up to x_i and the initial segment y_{1...j} of y up to y_j

We compute F recursively, starting from F(0, 0) = 0:

  F(i, j) = max of:
    F(i-1, j-1) + s(x_i, y_j)
    F(i-1, j) - d
    F(i, j-1) - d

where s(x_i, y_j) is the similarity between x_i and y_j, and d is the gap penalty.

Boundary conditions:
- The top row: F(i, 0) = -i*d, representing alignments of a prefix of x to all gaps in y
- The left column: F(0, j) = -j*d

Dynamic Programming Formulation

- We know how to compute the best score: it is the bottom-right entry, F(n, m)
- But we also need to remember where each value came from: store a pointer to the choice made at each step, then retrace the path through the matrix
- Time: O(mn)

Local Alignment: the Smith-Waterman Algorithm

- Global alignment: find the best match between the sequences from one end to the other
- Local alignment: find the best match between subsequences of the two sequences
- Useful for comparing highly divergent sequences when only local similarity is expected
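The recurrence and traceback above can be sketched as follows. This is an illustrative implementation of Needleman-Wunsch with a simple match/mismatch score standing in for the substitution matrix s(x_i, y_j); the parameter values are assumptions, not from the slides.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=2):
    """Global alignment: returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d              # top row: prefix of x against all gaps
    for j in range(1, m + 1):
        F[0][j] = -j * d              # left column: prefix of y against all gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align x_i with y_j
                          F[i - 1][j] - d,       # gap in y
                          F[i][j - 1] - d)       # gap in x
    # Traceback from F(n, m), re-deriving the choice made at each cell.
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + (
                match if x[i - 1] == y[j - 1] else mismatch):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append('-'); i -= 1
        else:
            ax.append('-'); ay.append(y[j - 1]); j -= 1
    return F[n][m], ''.join(reversed(ax)), ''.join(reversed(ay))
```

Instead of re-deriving the choice during traceback, one can store an explicit pointer per cell, as the slides describe; both run in O(nm) time.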
Dynamic Programming Formulation (Local Alignment)

  F(i, j) = max of:
    0
    F(i-1, j-1) + s(x_i, y_j)
    F(i-1, j) - d
    F(i, j-1) - d

Boundary conditions: F(i, 0) = F(0, j) = 0

Finding the best local alignment:
- Find the highest value of F(i, j), and start the traceback from there
- The traceback ends when a cell with value 0 is reached

Local vs. Global Alignment

[Figure: for an example pair of strings, the similarity matrix and the filled-in global-alignment and local-alignment score matrices; in the local matrix all entries are clipped at 0, and its maximum entry marks the best local match]

Today: Unsupervised Vocabulary Induction

- Vocabulary induction from unsegmented text
- Vocabulary induction from the speech signal
- Sequence alignment algorithms

Finding Words in Speech

Traditional approaches to speech recognition are supervised:
- Recognizers are trained using a large corpus of speech with corresponding transcripts
- During the training process, a recognizer is provided with a vocabulary

Is it possible to learn a vocabulary directly from the speech signal?
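The only changes from the global recurrence are the 0 option, the zeroed boundaries, and starting the traceback at the matrix maximum. A minimal score-only sketch (the match/mismatch/gap values are illustrative assumptions):

```python
def smith_waterman(x, y, match=2, mismatch=-1, d=1):
    """Best local alignment score between x and y (Smith-Waterman)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]   # boundary rows/columns stay 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,                     # restart: empty local alignment
                          F[i - 1][j - 1] + s,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            best = max(best, F[i][j])            # traceback would start here
    return best
```

Because every cell is clipped at 0, a poor prefix can never drag down a good local match, which is exactly what makes the algorithm robust for divergent sequences.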
Vocabulary Induction: Outline

Spectral Vectors

- A spectral vector is a vector in which each component is a measure of energy in a particular frequency band
- We divide the acoustic signal (a one-dimensional waveform) into short overlapping intervals (25 msec windows with 15 msec overlap)
- We convert each window to the frequency domain using the Fourier transform

Comparing Acoustic Signals

Example of Spectral Vectors

[Figure: spectrograms (frequency 0-6000 Hz vs. time in seconds) of two utterances: "he too was diagnosed with paranoid schizophrenia" and "were willing to put nash s schizophrenia on record"]
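The windowing step can be sketched in plain Python. This is a toy illustration under stated assumptions: it uses a naive DFT (a real system would use an FFT and mel-scale filter banks), and the band count and function name are mine, not from the slides.

```python
import math

def spectral_vectors(signal, rate=16000, win_ms=25, overlap_ms=15, n_bands=8):
    """Slice a waveform into 25 ms windows with 15 ms overlap, take the
    magnitude spectrum of each window, and sum it into n_bands bands."""
    win = int(rate * win_ms / 1000)
    hop = win - int(rate * overlap_ms / 1000)     # 10 ms step between windows
    vectors = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win]
        # Naive DFT magnitudes for the first win//2 frequency bins.
        mags = []
        for k in range(win // 2):
            re = sum(frame[t] * math.cos(2 * math.pi * k * t / win)
                     for t in range(win))
            im = -sum(frame[t] * math.sin(2 * math.pi * k * t / win)
                      for t in range(win))
            mags.append(math.hypot(re, im))
        band = len(mags) // n_bands
        # One energy measure per frequency band -> one spectral vector.
        vectors.append([sum(mags[b * band:(b + 1) * band])
                        for b in range(n_bands)])
    return vectors
```

Each returned vector summarizes one 25 ms window, so an utterance becomes a sequence of spectral vectors that can then be compared across utterances.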
Comparing Spectral Vectors: Computing Local Alignment

- Divide the acoustic signal into word segments based on pauses
- Compute spectral vectors for each segment
- Build a distance matrix for each pair of word segments, using Euclidean distance to compare spectral vectors

Example of a Distance Matrix

Clustering Similar Utterances
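The distance-matrix step above is a direct pairwise computation. A minimal sketch (function name assumed), where each segment is a sequence of equal-length spectral vectors:

```python
import math

def distance_matrix(seg_a, seg_b):
    """Pairwise Euclidean distances between the spectral vectors of two
    word segments; local alignment is then run over this matrix (with
    distances converted to similarities) to find matching subsequences."""
    return [[math.dist(u, v) for v in seg_b] for u in seg_a]
```

Low-distance diagonal bands in this matrix correspond to stretches where the two segments sound alike, which is what the local alignment extracts before clustering similar utterances.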
Examples of Computed Clusters