Named Entity Recognition using Maximum Entropy Model SEEM5680


1 Named Entity Recognition using Maximum Entropy Model SEEM5680

2 Named Entity Recognition System Named Entity Recognition (NER): identifying certain phrases/word sequences in free text. Generally it involves assigning labels to noun phrases, say: person, organization, location, time, quantity, miscellaneous, etc. NER is useful for information extraction, intelligent searching, etc.

3 Named Entity Recognition System Example: Bill Gates (person name) opened the gate (thing) of the cinema hall and sat on a front seat to watch a movie named The Gate (movie name). Simply retrieving every document containing the word "Gates" will not always help: it may be confused with other uses of the word "gate". A good use of NER is a model that can distinguish between these items.

4 Entity Tag Design BIO encoding: Person B-PER, I-PER; Location B-LOC, I-LOC; Organization B-ORG, I-ORG; Others O. Example:

Word       Name Entity Tag
United     B-ORG
Nations    I-ORG
official   O
Peter      B-PER
Marcus     I-PER
arrived    O
in         O
Seattle    B-LOC
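
The mapping from BIO tags back to entity spans can be made concrete with a short sketch. The following Python snippet is an illustration (not part of the course materials); it decodes the slide's example sequence, and the function name bio_to_spans is ours.

```python
# A minimal sketch of decoding BIO tags back into entity spans,
# using the slide's example sentence.
words = ["United", "Nations", "official", "Peter", "Marcus",
         "arrived", "in", "Seattle"]
tags = ["B-ORG", "I-ORG", "O", "B-PER", "I-PER", "O", "O", "B-LOC"]

def bio_to_spans(words, tags):
    """Collect (entity_type, phrase) pairs from a BIO-tagged sequence."""
    spans, current = [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):                # a new entity begins
            if current:
                spans.append(current)
            current = (tag[2:], [word])
        elif tag.startswith("I-") and current:  # continue the open entity
            current[1].append(word)
        else:                                   # "O" (or a stray I-) closes any open entity
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(ws)) for etype, ws in spans]

print(bio_to_spans(words, tags))
# [('ORG', 'United Nations'), ('PER', 'Peter Marcus'), ('LOC', 'Seattle')]
```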

5 NER as sequence prediction The basic NER task can be defined as follows. Let t_1, t_2, t_3, ..., t_n be a sequence of entity tags, denoted by T. Let w_1, w_2, w_3, ..., w_n be a sequence of words, denoted by W. Given some W, find the best T.

6 Entropy Entropy measures the amount of information in a random variable: H(X) = -\sum_x P(X = x) \log P(X = x). Here the random variable is the named entity tag, and the sum runs over the probabilities of the different values that tag can take: H(NET) = -[P(per) \log P(per) + P(loc) \log P(loc) + ...].
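
As a quick check on the formula, here is a minimal Python sketch (ours, with made-up tag probabilities) that computes H for a uniform and a skewed tag distribution.

```python
import math

# A small sketch of the entropy formula H(X) = -sum_x P(x) log P(x),
# applied to hypothetical distributions over entity tags.
def entropy(dist):
    """Entropy in bits of a probability distribution given as a dict."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

uniform = {"PER": 0.25, "LOC": 0.25, "ORG": 0.25, "O": 0.25}
print(entropy(uniform))           # 2.0 bits: the uniform case maximizes entropy

skewed = {"PER": 0.7, "LOC": 0.1, "ORG": 0.1, "O": 0.1}
print(round(entropy(skewed), 3))  # ~1.357 bits: less uncertainty
```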

7 Entropy Numerical Values [Table: numerical entropy values, with columns "Probability P(x)", "log(P(x)) = L", and "-(P(x) * L)"; the table entries are not recoverable from the transcription.]

8 Maximum Entropy Model A probability estimation technique widely used for natural language understanding tasks such as text segmentation, sentence boundary detection, POS tagging, prepositional phrase attachment, ambiguity resolution, stochastic attribute-value grammars, and language modelling problems.

9 Maximum Entropy Why maximum entropy? Maximize entropy = minimize commitment. Model all that is known and assume nothing about what is unknown. Model all that is known: satisfy a set of constraints that must hold. Assume nothing about what is unknown: choose the most uniform distribution, i.e., the one with maximum entropy.

10 Basic Idea Goal: estimate p. Choose p with maximum entropy (or "uncertainty") subject to the constraints (or "evidence"):

H(p) = -\sum_{x \in A \times B} p(x) \log p(x)

11 Maximum Entropy (MaxEnt) Model: Theory and Method

12 Ex1: Coin-flip example (Klein & Manning, 2003) Toss a coin: p(H) = p1, p(T) = p2. Constraint: p1 + p2 = 1. Question: what is your estimate of p = (p1, p2)? Answer: choose the p that maximizes H(p) = -\sum_x p(x) \log p(x). [Plot: H(p) as a function of p1, with a point such as p1 = 0.3 marked.]

13 Coin-flip example (cont) [Plot: the entropy surface H over (p1, p2) intersected with the constraint p1 + p2 = 1.0; a point such as p1 = 0.3 satisfies the constraint but does not maximize H.]
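
A small grid search (our illustration) confirms the answer: under the single constraint p1 + p2 = 1, entropy peaks at the uniform point p1 = 0.5.

```python
import math

# Sketch: under the constraint p1 + p2 = 1, entropy
# H(p) = -p1*log(p1) - p2*log(p2) is maximized at p1 = p2 = 0.5.
def H(p1):
    p2 = 1.0 - p1
    return -sum(p * math.log(p) for p in (p1, p2) if p > 0)

best = max((i / 1000 for i in range(1, 1000)), key=H)
print(best, round(H(best), 4))   # 0.5 0.6931 (= ln 2)
```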

14 Ex2: A Machine Translation (MT) example (Berger et al., 1996) Possible translations of the word "in" are: {dans, en, à, au cours de, pendant}. Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1. Intuitive answer: a uniform 1/5 for each.

15 An MT example (cont) Constraints: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1 and p(dans) + p(en) = 3/10. Intuitive answer: p(dans) = p(en) = 3/20, and 7/30 for each of the remaining three.

16 An MT example (cont) Constraints: the two above plus p(dans) + p(à) = 1/2. Intuitive answer: ?? The constraints now overlap, and no uniform split is obviously right; this is where the MaxEnt machinery is needed (see the sketch below).
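
Once the constraints overlap like this, the maximum-entropy distribution is best found numerically. Below is a hedged sketch using scipy; the candidate set and constraint values follow Berger et al.'s published example, and the variable ordering (dans, en, à, au cours de, pendant) is our choice.

```python
import numpy as np
from scipy.optimize import minimize

# Numeric sketch of the third MT setting: maximize entropy over
# p = (dans, en, à, "au cours de", pendant) subject to overlapping
# equality constraints. Minimizing sum p log p = maximizing entropy.
def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log(p))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},      # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},  # p(dans) + p(en) = 3/10
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 0.5},  # p(dans) + p(à)  = 1/2
]
res = minimize(neg_entropy, x0=np.full(5, 0.2),
               bounds=[(0, 1)] * 5, constraints=constraints)
print(np.round(res.x, 4))   # the MaxEnt answer -- not an "intuitive" one
```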

17 Ex3: POS tagging (Klein and Manning, 2003)

18 Ex3 (cont)

19 Ex4: overlapping features (Klein and Manning, 2003)

20 Features Features are elementary pieces of evidence that link aspects of what we observe, b, with a category, a, that we want to predict. A feature has a real value: f : A \times B \to \mathbb{R}. Usually features are indicator functions of a property of the input and a particular class, defined as [a = \hat{a} \wedge \Phi(b)], which has a value of 0 or 1. We can also say that \Phi(b) is a feature of the data b.

21 Features A feature (a.k.a. feature function, indicator function) is a binary-valued function on events: f : A \times B \to \{0, 1\}, where A is the set of possible classes (e.g., NER tags) and B is the space of contexts (e.g., neighboring words / NER tags). Example:

f(a, b) = 1 if a = B-PER and prevword(b) = "Mr.", and 0 otherwise.
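
In code, such an indicator feature is just a tiny function. The sketch below transcribes the slide's example; representing the context b as a dict with a prev_word key is our assumption.

```python
# A direct transcription of the slide's example indicator feature:
# f(a, b) = 1 if the class is B-PER and the previous word is "Mr."
def f_per_mr(a, b):
    """a: candidate NER tag; b: context dict with the neighboring words."""
    return 1 if a == "B-PER" and b.get("prev_word") == "Mr." else 0

print(f_per_mr("B-PER", {"prev_word": "Mr."}))  # 1
print(f_per_mr("B-LOC", {"prev_word": "Mr."}))  # 0
```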

22 NER Model and Features A sequence model across words; each word is classified by a local model. Examples of features: the current word, previous word, and next word; the previous tag; the previous, next, and current part-of-speech (POS) tags; character n-gram features, etc. There could be more than 800K features.
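
A feature-extraction sketch along these lines might look as follows (our illustration; the "name=value" string encoding of features is an assumption, not the course's implementation).

```python
# A minimal sketch of the feature templates the slide lists: current /
# previous / next word, previous tag, and character n-grams.
def extract_features(words, i, prev_tag):
    w = words[i]
    feats = {
        f"curr_word={w}": 1,
        f"prev_word={words[i-1] if i > 0 else '<S>'}": 1,
        f"next_word={words[i+1] if i < len(words)-1 else '</S>'}": 1,
        f"prev_tag={prev_tag}": 1,
    }
    for n in (2, 3):                       # character n-gram features
        for j in range(len(w) - n + 1):
            feats[f"ngram={w[j:j+n]}"] = 1
    return feats

print(extract_features(["Mr.", "Peter", "Marcus"], 1, "O"))
```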

23 Conditional Model We have some data (b, a) and we want to place probability distributions over it. A conditional model gives the probability P(a | b): it takes the data b as given and models only the conditional probability of the class a.

24 Modeling the problem Objective function: H(p). Goal: among all the distributions that satisfy the constraints, choose the one, p*, that maximizes H(p):

p* = \arg\max_{p \in P} H(p)

Question: how do we represent the constraints?

25 Some notations Finite training sample of events: S. Observed probability of x in S: \tilde{p}(x). The model's probability of x: p(x). The j-th feature: f_j. Observed expectation of f_j (the empirical count of f_j): E_{\tilde{p}} f_j = \sum_x \tilde{p}(x) f_j(x). Model expectation of f_j: E_p f_j = \sum_x p(x) f_j(x).

26 Constraints Model's feature expectation = observed feature expectation: E_p f_j = E_{\tilde{p}} f_j. How to calculate E_{\tilde{p}} f_j? Over the N training events:

E_{\tilde{p}} f_j = \frac{1}{N} \sum_{i=1}^{N} f_j(x_i)

Example: f_j(a, b) = 1 if a = B-PER and prevword(b) = "Mr.", and 0 otherwise.
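
Computing the observed expectation is just counting. A minimal sketch, with a made-up four-event training sample:

```python
# The observed expectation of a feature is its relative frequency
# over the N training events: E~[f] = (1/N) * sum_i f(x_i).
events = [("B-PER", {"prev_word": "Mr."}),
          ("B-PER", {"prev_word": "in"}),
          ("O",     {"prev_word": "Mr."}),
          ("B-LOC", {"prev_word": "in"})]

def f(a, b):                      # the slide's "Mr." feature
    return 1 if a == "B-PER" and b["prev_word"] == "Mr." else 0

emp_expectation = sum(f(a, b) for a, b in events) / len(events)
print(emp_expectation)            # 0.25: the feature fires in 1 of 4 events
```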

27 Restating the problem The task: find p* s.t. p* = \arg\max_{p \in P} H(p), where P = \{p \mid E_p f_j = E_{\tilde{p}} f_j, j \in \{0, ..., k\}\}. Objective function: -H(p). Constraints: E_p f_j = E_{\tilde{p}} f_j = d_j, j \in \{0, ..., k\}.

28 Questions Is P empty? Does p* exist? Is p* unique? What is the form of p*? How do we find p*?

29 Using Lagrangian multipliers Minimize A(p):

A(p) = \sum_x p(x) \log p(x) - \sum_{j=0}^{k} \lambda_j \Big( \sum_x p(x) f_j(x) - d_j \Big)

Setting A'(p) = 0 for each x:

\log p(x) + 1 - \sum_{j=0}^{k} \lambda_j f_j(x) = 0
\Rightarrow \log p(x) = \sum_{j=0}^{k} \lambda_j f_j(x) - 1
\Rightarrow p(x) = e^{\sum_{j=1}^{k} \lambda_j f_j(x)} \cdot e^{\lambda_0 - 1} = \frac{1}{Z} e^{\sum_{j=1}^{k} \lambda_j f_j(x)}, \quad \text{where } Z = e^{1 - \lambda_0}

(taking f_0 \equiv 1 for the normalization constraint \sum_x p(x) = 1).

30 Two equivalent forms

p(x) = \frac{1}{Z} \exp \Big( \sum_{j=1}^{k} \lambda_j f_j(x) \Big), \quad Z = \sum_x \exp \Big( \sum_{j=1}^{k} \lambda_j f_j(x) \Big)

An equivalent form:

p(x) = \frac{1}{Z} \prod_{j=1}^{k} \mu_j^{f_j(x)}, \quad \mu_j = e^{\lambda_j}, \text{ i.e., } \lambda_j = \ln \mu_j

31 The form of p* Let P = \{p \mid E_p f_j = E_{\tilde{p}} f_j, j \in \{1, ..., k\}\} and Q = \{p \mid p(x) = \frac{1}{Z} \exp(\sum_j \lambda_j f_j(x))\}. Theorem: if p* \in P \cap Q then p* = \arg\max_{p \in P} H(p). Furthermore, p* is unique.

32 The Model Form The model for a data set (A, B) can be expressed through the conditional log-likelihood of the data:

\log P(A \mid B, \lambda) = \sum_{(a,b) \in (A,B)} \log P(a \mid b, \lambda) = \sum_{(a,b) \in (A,B)} \log \frac{\exp \sum_j \lambda_j f_j(a, b)}{\sum_{a'} \exp \sum_j \lambda_j f_j(a', b)}

The aim is to find parameters \lambda that maximize this exponential model.
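
The conditional form P(a | b) is a softmax over feature scores. Here is a small self-contained sketch (our illustration, with one feature and a hypothetical weight):

```python
import math

# Sketch of the conditional model form:
# P(a|b) = exp(sum_j l_j f_j(a,b)) / sum_a' exp(sum_j l_j f_j(a',b)).
def cond_prob(a, b, classes, features, lambdas):
    def score(cls):
        return sum(lam * f(cls, b) for lam, f in zip(lambdas, features))
    z = sum(math.exp(score(cls)) for cls in classes)   # normalizer Z(b)
    return math.exp(score(a)) / z

classes = ["B-PER", "B-LOC", "O"]
features = [lambda a, b: 1 if a == "B-PER" and b["prev_word"] == "Mr." else 0]
lambdas = [2.0]                   # hypothetical learned weight
print(round(cond_prob("B-PER", {"prev_word": "Mr."}, classes, features, lambdas), 3))
# 0.787: the "Mr." feature pushes probability mass toward B-PER
```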

33 Practical Issues of Scale Huge number of features: some problems can easily have over a million features. Sparsity issues: some parameters become too large, and many features seen in training will never occur again at test time. Optimization issue: some parameters can be infinite. Smoothing can tackle some of the above problems.

34 An Example Consider a coin-flipping problem where the data contains h heads and t tails. Features: Heads, Tails. Let \lambda_H and \lambda_T be the parameters for the head and tail respectively. According to the exponential model form, the model distribution is:

p(H) = \frac{e^{\lambda_H}}{e^{\lambda_H} + e^{\lambda_T}}, \quad p(T) = \frac{e^{\lambda_T}}{e^{\lambda_H} + e^{\lambda_T}}

35 An Example Since \lambda_H and \lambda_T are related, there is only one degree of freedom. Let \lambda = \lambda_H - \lambda_T, so that

p(H) = \frac{e^{\lambda}}{e^{\lambda} + 1}, \quad p(T) = \frac{1}{e^{\lambda} + 1}

The conditional likelihood of the data (h, t) is then

\log P(h, t \mid \lambda) = h \log p(H) + t \log p(T) = h\lambda - (h + t)\log(e^{\lambda} + 1)
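
Setting the derivative of this likelihood to zero gives e^\lambda = h/t, i.e. p(H) = h/(h + t), which the following sketch verifies (our illustration, assuming the one-parameter form above):

```python
import math

# Sketch of the coin model's conditional log-likelihood for h heads and
# t tails: L(lam) = h*lam - (h+t)*log(e^lam + 1). Setting the derivative
# to zero gives e^lam = h/t, i.e. p(H) = h/(h+t).
def log_likelihood(lam, h, t):
    return h * lam - (h + t) * math.log(math.exp(lam) + 1)

h, t = 3, 1
lam_star = math.log(h / t)                 # optimum at lam = log(h/t)
print(round(lam_star, 4))                  # 1.0986
print(round(math.exp(lam_star) / (math.exp(lam_star) + 1), 4))  # p(H) = 0.75
# With t = 0 (the 4-heads case on the next slide), log(h/t) diverges:
# the optimal lambda is infinite, which is why smoothing is needed.
```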

36 Smoothing Issues For a data set with 4 heads and 0 tails, there are two problems: the optimal value of \lambda is infinity, which imposes problems for any optimization procedure, and the learned distribution is not smooth. To solve both issues, we can apply smoothing. One way to do smoothing is to just stop the optimization early (early stopping), but the value of \lambda may still be too large.

37 Smoothing Gaussian Priors Suppose we had a prior expectation that parameter values would not be very large. We could balance evidence suggesting large parameter values (or infinity) against this prior. Parameter values would then be smoothed (kept finite). As a result, the objective function is refined as: objective = evidence + prior. We make use of Gaussian priors to achieve this.

38 Smoothing Gaussian Priors Intuition: model parameters such as \lambda_j should not be too large. Formalization: a prior expectation that each parameter is distributed according to a Gaussian with mean \mu and variance \sigma^2. For example, with \mu = 0, large values of \lambda_j are unlikely a priori. This penalizes parameters for drifting too far from their mean prior value.

39 Smoothing Gaussian Priors In general:

\log P(\lambda_j) = -\frac{(\lambda_j - \mu_j)^2}{2\sigma_j^2} + \text{const}

As a result, the objective function is refined as:

\log P(\lambda \mid \text{data}) = \log P(\text{data} \mid \lambda) - \sum_j \frac{(\lambda_j - \mu_j)^2}{2\sigma_j^2} + \text{const}
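
Concretely, for the coin example the penalized objective stays bounded even on the 4-heads / 0-tails data. A grid-search sketch (ours; \mu = 0 and \sigma^2 = 1 are assumed values):

```python
import math

# Sketch: a Gaussian prior (mean 0, variance sigma^2) turns the coin
# objective into  h*lam - (h+t)*log(e^lam + 1) - lam^2 / (2*sigma^2),
# which stays finite even for the 4-heads / 0-tails data set.
def penalized(lam, h, t, sigma2):
    ll = h * lam - (h + t) * math.log(math.exp(lam) + 1)
    return ll - lam ** 2 / (2 * sigma2)

h, t, sigma2 = 4, 0, 1.0
best = max((i / 100 for i in range(-500, 501)),
           key=lambda l: penalized(l, h, t, sigma2))
print(best)   # a finite optimum (~1.04) instead of lambda -> infinity
```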

40 Parameter Estimation Algorithms Generalized Iterative Scaling (GIS) (Darroch and Ratcliff, 1972); Improved Iterative Scaling (IIS) (Della Pietra et al., 1995).
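
As a flavor of how iterative scaling works, here is a compact GIS sketch on a toy three-event problem (our illustration; the toy features and target expectations are made up). GIS's update is \lambda_j \mathrel{+}= \frac{1}{C} \log \frac{E_{\tilde{p}} f_j}{E_p f_j}, where C bounds the per-event feature sum:

```python
import math

# A compact sketch of Generalized Iterative Scaling (GIS) on a toy joint
# model over three events. GIS needs every x to have the same total
# feature count C, so a "slack" feature pads each event up to C.
X = ["a", "b", "c"]
feats = [lambda x: 1.0 if x == "a" else 0.0,          # f1
         lambda x: 1.0 if x in ("a", "b") else 0.0]   # f2 (overlaps f1)
target = [0.5, 0.8]          # observed expectations E~[f1], E~[f2]

C = 2.0
def fvec(x):
    raw = [f(x) for f in feats]
    return raw + [C - sum(raw)]              # append the slack feature

targets = target + [C - sum(target)]         # slack's observed expectation
lambdas = [0.0, 0.0, 0.0]

for _ in range(200):
    scores = {x: math.exp(sum(l * v for l, v in zip(lambdas, fvec(x)))) for x in X}
    Z = sum(scores.values())
    p = {x: s / Z for x, s in scores.items()}
    model_E = [sum(p[x] * fvec(x)[j] for x in X) for j in range(3)]
    # GIS update: lambda_j += (1/C) * log(E~[f_j] / E_p[f_j])
    lambdas = [l + math.log(t / m) / C
               for l, t, m in zip(lambdas, targets, model_E)]

print({x: round(p[x], 3) for x in X})   # ~{'a': 0.5, 'b': 0.3, 'c': 0.2}
```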

41 Feature selection

42 Feature selection Throw in many features and let the machine select the weights: manually specify feature templates. Problem: too many features. An alternative: a greedy algorithm. Start with an empty set S and add one feature at each iteration.

43 Notation With the feature set S, the model is p_S. After adding a feature f, the model is p_{S \cup \{f\}}. The gain in the log-likelihood of the training data from adding f is:

\Delta L(S, f) = L(p_{S \cup \{f\}}) - L(p_S)

44 Feature selection algorithm (Berger et al., 1996) Start with S being empty; thus p_S is uniform. Repeat until the gain is small enough: for each candidate feature f \notin S, compute the model p_{S \cup \{f\}} using IIS and calculate the log-likelihood gain; choose the feature with maximal gain and add it to S. Problem: this is too expensive. Instead of recalculating all the weights, calculate only the weight of the new feature.
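
The greedy loop, with the cheaper one-weight approximation, can be sketched as follows (our illustration; log_likelihood_gain is a placeholder for the IIS-based gain computation):

```python
# Sketch of the greedy selection loop: at each step, try each candidate
# feature, estimate the log-likelihood gain from adding it (fitting only
# its weight, per the slide's approximation), and keep the best one.
def greedy_select(candidates, log_likelihood_gain, min_gain=1e-3):
    """candidates: list of feature names; log_likelihood_gain(S, f):
    approximate gain from adding f to the current set S."""
    S = []
    while True:
        remaining = [f for f in candidates if f not in S]
        if not remaining:
            break
        gains = {f: log_likelihood_gain(S, f) for f in remaining}
        best = max(gains, key=gains.get)
        if gains[best] < min_gain:
            break                      # stop when the gain is small enough
        S.append(best)
    return S

# Toy usage with a made-up gain table that shrinks as S grows.
table = {"prev_word": 0.30, "curr_word": 0.50, "suffix": 0.02}
print(greedy_select(list(table), lambda S, f: table[f] / (len(S) + 1)))
# ['curr_word', 'prev_word', 'suffix']
```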
