Text Mining. March 3, March 3, / 49

Size: px

Start display at page:

Download "Text Mining. March 3, March 3, / 49"

Arthur Jenkins
5 years ago
Views:

1 Text Mining March 3, 2017 March 3, / 49

2 Outline Language Identification Tokenisation Part-Of-Speech (POS) tagging Hidden Markov Models - Sequential Taggers Viterbi Algorithm March 3, / 49

3 Language Identification Introduction Process of determining the natural language in which the contents of the text is written. An important research area - focused as early as 1970s. Two approaches: non-computational: linguistic knowledge, such as diacritics and symbols, most frequent words used, character combination etc. computational: statistical techniques rather than linguistic knowledge March 3, / 49

4 Language Identification Architecture of Language Identifier Figure: General architecture of a language identifier [Padró and Padró, 2004] March 3, / 49

5 Language Identification N-Gram-based Text Categorization [Cavnar et al., 1994] Zipf s law: The frequency of any word is inversely proportional to its rank in the frequency table. Figure: Empirical evaluation of Zipf s law on Tom Sawyer There is always a set of words which dominates most of the other words of the language in terms of frequency of use. March 3, / 49

6 Language Identification N-Gram-based Text Categorization [Cavnar et al., 1994] Figure: Dataflow for N-Gram-based text categorization [Cavnar et al., 1994] March 3, / 49

7 Language Identification N-Gram-and-Wikipedia Joint Approach [Yang and Liang, 2010] Previous methods required large corpus as training data for determining each language. Joint approach solved this problem by using a single language corpus i.e., local English corpus for all languages. Two steps: implementation of segmentation algorithm which uses n-gram frequency statistics for segmentation into language-specific units determine language of each unit by utilizing di erent language versions of Wikipedia. Wikipedia supports 262 language versions of the documents. Joint approach achieves approximately 100% accurate results March 3, / 49

8 Tokenisation Tokenisation Tokenisation is the process of splitting a text into words What is a word? string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes, but no other punctuation marks [Kučera and Francis, 1967]. Di culties: $22.50; Micro$oft; C# net; :-) Occurrence of whitespace - a space or tab is the main clue in English Full stop, comma etc can be used as pre-defined delimiters For languages that do not use spaces to separate words, supervised sequential tagging methods are used to perform tokenisation during morphological analysis. March 3, / 49

9 Part-of Speech Tagging Part-of-Speech Tagging (POS) Task of tagging POS tags (Nouns, Verbs, Adjectives, Adverbs,...) for words POS tags provide lot of information about a word knowing whether a word is noun or verb gives information bout neighbouring words nouns are preceded by determiners; adjectives and verbs by nouns provide useful features for named entity recognition Given a word, we assume it can belong to only of the POS tags. POS Tagging problem Given a sentence S = w 1 w 2...w n consisting of n words, determine the corresponding tag sequence P = P 1 P 2...P n March 3, / 49

10 Part-of Speech Tagging Part-of-Speech Tagging (POS) Figure: Penn Treebank POS Tags March 3, / 49

11 Part-of Speech Tagging Hidden Markov Model (HMM) March 3, / 49

12 Part-of Speech Tagging Hidden Markov Model (HMM) March 3, / 49

13 Markov Chains Markov Chains Probabilistic graphical model for representing probabilistic assumptions in a graph. Q = q 1, q 2,...,q N : a set of states A = a 01, a 02,...a n1,...,a nn :a transition probability matrix A, each a ij representing the probability of moving from state i to state j, s.t. P n j=1 a ij =1 8i q 0, q end :aspecialstart and end state which are not associated with observations March 3, / 49

14 Markov Chains Markov Chains 1, 2,..., N :aninitial probability distribution over states. i is the probability that the Markov chain will start in state i. Markov Assumption: P(q i q 1, q 2,...,q i 1 )=P(q i q i 1 ) P(cold hot cold hot) = P(cold) P(hot cold) P(cold hot) P(hot cold) =0.3 x 0.2 x 0.2 x 0.2 = March 3, / 49

15 Hidden Markov Model HMMs allows us to model both observed events (words that we see) and hidden events (POS tags). March 3, / 49 Hidden Markov Model (HMM) Markov chains are useful for observed events However, in many cases the events are not observed Example: POS tagging - POS tags are not observed

16 Hidden Markov Model Amotivatingexample Probability of transition to another Urn after picking a ball: U 1 U 2 U 3 U U U March 3, / 49

17 Hidden Markov Model AMotivatingExample(contd.) Given: Transition Probabilities U 1 U 2 U 3 U U U Given: Output Probabilities R G B U U U Observation: RRGGBRGR State Sequence (Urn chosen corresponding to each ball):? March 3, / 49

18 Hidden Markov Model Diagrammatic Representation - 1 Transition Probabilities U 1 U 2 U 3 U U U Output Probabilities R G B U U U Observation: RRGGBRGR State Sequence (Urn chosen corresponding to each ball):? March 3, / 49

19 Hidden Markov Model Diagrammatic Representation - 2 Transition Probabilities U 1 U 2 U 3 U U U Output Probabilities R G B U U U Observation: RRGGBRGR State Sequence (Urn chosen corresponding to each ball):? March 3, / 49

20 Hidden Markov Model Example (contd.) Transition Probabilities (A) States Set: S = {U 1, U 2, U 3 } Observation Set: V = {R, G, B} Observation Sequence: O = {O 1...O n } State Sequence: Q = {q 1...q n } Initial Probability: i = P(q i = U i ) U 1 U 2 U 3 U U U Output Probabilities (B) R G B U U U March 3, / 49

21 Hidden Markov Model Observations and states O O 3 O 4 O 5 O 6 O 7 O 8 OBS: R R G G B R G R State: S 1 S 2 S 3 S 4 S 5 S 6 S 7 S 8 S i = U 1 /U 2 /U 3 ;Aparticularstate S: State sequence O: Observation sequence S* = best possible state (urn) sequence Goal: Maximize P(S O by choosing best S Goal: Maximize P(S O) wheres is the State Sequence and O is the Observation Sequence S = argmax s (P(S O)) March 3, / 49

22 Hidden Markov Model O O 3 O 4 O 5 O 6 O 7 O 8 OBS: R R G G B R G R State: S 1 S 2 S 3 S 4 S 5 S 6 S 7 S 8 P(S O) =P(S 1 8 O 1 8 ) P(S O) =P(S 1 O)P(S 2 S 1, O)P(S 3 S 12, O)...P(S 8 S 1 7, O) Markov Assumption: a state depends only on the previous state P(S O) =P(S 1 O)P(S 2 S 1, O)P(S 3 S 2, O)...P(S 8 S 7, O) Baye s Theorem P(A B) = P(A)P(B A) P(B) P(A): Prior P(B A): Likelihood argmax s P(S O) =argmax x P(S)P(O S) March 3, / 49

23 Hidden Markov Model State Transitions Probability P(S) =P(S 1 8 ) P(S) =P(S 1 )P(S 2 S 1 )P(S 3 S 1 2 )P(S 4 S 1 3 )...P(S 8 S 1 7 ) By Markov Assumption (k=1) P(S) =P(S 1 )P(S 2 S 1 )P(S 3 S 2 )P(S 4 S 3 )...P(S 8 S 7 ) March 3, / 49

24 Hidden Markov Model Observations Sequence Probability P(O S) = P(O 1 S 1 8 )P(O 2 O 1, S 1 8 )P(O 3 O 1 2, S 1 8 )...P(O 8 O 1 7, S 1 8 ) Assumption that ball drawn depends only on the Urn Chosen P(O S) =P(O 1 S 1 )P(O 2 S 2 )P(O 3 S 3 )...P(O 8 S 8 ) P(S O) =P(S)P(O S) P(S O) =P(S 1 )P(S 2 S 1 )P(S 3 S 2 )P(S 4 S 3 )...P(S 8 S 7 )P(O 1 S 1 ) P(O 2 S 2 )P(O 3 S 3 )...P(O 8 S 8 ) March 3, / 49

25 Hidden Markov Model O 0 O O 3 O 4 O 5 O 6 O 7 O 8 OBS: R R G G B R G R State: S 0 S 1 S 2 S 3 S 4 S 5 S 6 S 7 S 8 S 9 P(S).P(O S) = [P(O 0 S 0 ).P(S 1 S 0 )] [P(O 1 S 1 ).P(S 2 S 1 )] [P(O 2 S 2 ).P(S 3 S 2 )] [P(O 3 S 3 ).P(S 4 S 3 )] [P(O 4 S 4 ).P(S 5 S 4 )] [P(O 5 S 5 ).P(S 6 S 5 )] [P(O 6 S 6 ).P(S 7 S 6 )] [P(O 7 S 7 ).P(S 8 S 7 )] [P(O 8 S 8 ).P(S 9 S 8 )] States S 0 and S 9 is introduced as initial and final states After S 8 the next state is S 9 with probability 1, i.e., P(S 9 S 8 )==1 O 0 is -transition March 3, / 49

26 Hidden Markov Model O 0 O O 3 O 4 O 5 O 6 O 7 O 8 OBS: R R G G B R G R State: S 0 S 1 S 2 S 3 S 4 S 5 S 6 S 7 S 8 S 9 P(O k S k ).P(S k+1 S k )=P(S k o! Sk+1 ) March 3, / 49

27 Hidden Markov Model Three problems of HMM Problem 1 (Decoding): Given an observation sequence O and an HMM =(A, B), discover the best hidden state sequence S. Problem 2 (Computing Likelihood): Given an HMM =(A, B) and an observation sequence O, determine the likelihood P(O ). Problem 3 (Learning) : Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B. March 3, / 49

28 Viterbi Algorithm Problem 1 (Decoding): Given an observation sequence O and an HMM =(A, B), discover the best hidden state sequence S. March 3, / 49

29 Viterbi Algorithm March 3, / 49

30 Viterbi Algorithm Viterbi Algorithm for the Urn problem (first two symbols) March 3, / 49

31 Viterbi Algorithm HMM - Computational Complexity March 3, / 49

32 Viterbi Algorithm HMM - Computational Complexity if the tree is grown in this manner RRGGBRGR - Observation Sequence length = 9 (including epsilon) at each level multiply the node by 3 level 1 ( ) -3 1,atlevel2(R)-3 2,...atlevel9(R)-3 9 (nodes at leaf) complexity without restriction = S o S =NumberoStates, O =lengthoftheobservationsequence March 3, / 49

33 Viterbi Algorithm Viterbi Algorithm for the Urn problem (first two symbols) At every stage, we only keep three nodes at the end of observation sequence - we have three nodes (total nodes - 3 x 8) complexity comes down from S o to S. o March 3, / 49

34 Viterbi Algorithm Probabilistic FSM March 3, / 49

35 Viterbi Algorithm Probabilistic FSM (contd.) March 3, / 49

36 Viterbi Algorithm Probabilistic FSM (contd.) March 3, / 49

37 Viterbi Algorithm Tabular Representation of the Tree a 1 a 2 a 1 a 2 S (1.0*0.1,0.0*0.2) =(0.1,0.0) (0.02, 0.09) (0.009, 0.012) (0.0024, ) S (1.0*0.3,0.0*0.3) =(0.3,0.0) (0.04, 0.06) (0.027, 0.018) (0.0048, ) Number of columns = length of observation sequence +1 ( ) Rows - ending state March 3, / 49

38 HMM-POS Tagging Example HMM - Definition Markov Assumption: P(q 1 q 1,...q i 1 = P(q i q i 1 ) Output Independence Assumption: P(o i q 1,...,q i,...,q n, o 1,...,o i,...,o n )=P(o i q i ) March 3, / 49

39 HMM-POS Tagging Example HMM - POS Tagging Figure: Markov chain corresponding to the hidden states of HMM. The transition probabilities A are used to compute the prior probability. March 3, / 49

40 HMM-POS Tagging Example HMM - POS Tagging Figure: Observation likelihoods B for the HMM. March 3, / 49

41 HMM-POS Tagging Example HMM - POS Tagging Goal: choose the most probable tag sequence given the observation sequence of n words ˆ w n 1 Using Bayes rule Simplifying further by dropping the denominator March 3, / 49

42 HMM-POS Tagging Example HMM - POS Tagging HMM makes two further assumptions: 1 probability of a word depends only on its tag and is independent of neighbouring words and tags 2 probability of a word depends only on its tag and is independent of neighbouring words and tags Using these simplifications: March 3, / 49

43 HMM-POS Tagging Example Viterbi Algorithm - Pseudocode March 3, / 49

HMM-POS Tagging Example POS Tagging - Example [Jurafsky and James, 2000] Janet will back the bill Janet/NNP will/md back/vb the/dt bill/nn Figure: Transition

44 HMM-POS Tagging Example POS Tagging - Example [Jurafsky and James, 2000] Janet will back the bill Janet/NNP will/md back/vb the/dt bill/nn Figure: Transition probabilities P(t i t i 1 ( computed from WSJ corpus (without smoothing). Rows are labeled with the conditioning event; thus P(VB MD) is March 3, / 49

45 HMM-POS Tagging Example POS Tagging - Example Janet will back the bill Janet/NNP will/md back/vb the/dt bill/nn Figure: Observation likelihoods computed from WSJ corpus (without smoothing) March 3, / 49

46 HMM-POS Tagging Example POS Tagging - Example Janet will back the bill Janet/NNP will/md back/vb the/dt bill/nn Figure: Schematic diagram showing possible POS tag sequences March 3, / 49

47 HMM-POS Tagging Example POS Tagging - Example Janet will back the bill Janet/NNP will/md back/vb the/dt bill/nn March 3, / 49

48 HMM-POS Tagging Example Yang, X. and Liang, W. (2010). An n-gram-and-wikipedia joint approach to natural language identification. In Universal Communication Symposium (IUCS), th International, pages IEEE. March 3, / 49 Cavnar, W. B., Trenkle, J. M., et al. (1994). N-gram-based text categorization. Ann Arbor MI, 48113(2): Jurafsky, D. and James, H. (2000). Speech and language processing an introduction to natural language processing, computational linguistics, and speech. Kučera, H. and Francis, W. N. (1967). Computational analysis of present-day American English. Dartmouth Publishing Group. Padró, M. and Padró, L. (2004). Comparing methods for language identification. Procesamiento del lenguaje natural, 33:

Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs

Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs (based on slides by Sharon Goldwater and Philipp Koehn) 21 February 2018 Nathan Schneider ENLP Lecture 11 21