Structure Learning in Sequential Data

1 Structure Learning in Sequential Data. Liam Stewart, Richard Zemel

2 Motivation

3 Structure Learning Structure: patterns and relationships... - between labels and observations, - between labels. Basic components: - Observations X = {x_1, ..., x_T | x_i ∈ R^d} - Labels Y = {y_1, ..., y_T | y_i ∈ Y} Goal: learn a mapping from observations to labels.

4 Why is it difficult? If sequences are considered as a whole: - Observations have indeterminate dimensionality, - Number of joint classifications of labels is exponential in T. If the items are considered separately: - Information about label structures is lost. How do we learn a mapping while maintaining tractability and exploiting structure?

5 Outline 1. Probabilistic models for structure learning 2. Experimental results 3. Conclusions and future work

6 Probabilistic Models Choices: - Generative vs. conditional. - Directed vs. undirected model. - Use latent states? - Connectivity? Generative models express the conditional distribution P(Y|X) in terms of the joint distribution P(Y,X) and require a model of the observations P(X). We focus on conditional models.

7 Conditional Models These model the conditional distribution p(Y|X) directly and make extensive use of features: functions of the observations. Features can be overlapping and non-independent. Examples: - Regular expression: [A-Z] - Category: is a name - Exact match: Morgan Assumption: raw observations have been pre-processed by a set of feature functions.
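A minimal sketch in Python of such pre-processing (the function names and the toy vocabulary are hypothetical illustrations, not the feature set used in the experiments):

```python
import re

# Illustrative binary feature functions over a raw token.
NAMES = {"Morgan", "Smith"}  # toy category vocabulary (hypothetical)

def capitalized(token):
    return re.match(r"[A-Z]", token) is not None   # regular-expression feature

def is_name(token):
    return token in NAMES                          # category feature

def exact_match_morgan(token):
    return token == "Morgan"                       # exact-match feature

FEATURES = [capitalized, is_name, exact_match_morgan]

def featurize(token):
    """Pre-process a raw token into a binary feature vector x_t."""
    return [1.0 if f(token) else 0.0 for f in FEATURES]

print(featurize("Morgan"))  # [1.0, 1.0, 1.0] -- the features overlap
```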

8 Models We implemented: - logistic regression (LR), - the maximum entropy Markov model (MEMM) and a version with observation-independent transitions (the IMEMM), - the conditional random field (CRF) and a version with observation-independent edge potentials (the ICRF), - the input output hidden Markov model (IOHMM), - the hidden random field (HRF), - the RBM-CRF template model. We report results for the LR, CRF, IOHMM, HRF, and RBM-CRF models.

9 Conditional Models. Fully observed chain structures: CRF. Latent state: IOHMM, HRF. Template models: RBM-CRF.

10 CRF [chain graph: labels y_0, y_1, ..., y_T; observations x_1, ..., x_T] Conditional random field [Lafferty, McCallum, Pereira 2001]. It is the undirected version of the maximum entropy Markov model (MEMM). p(Y|X) = (1/Z(X)) ∏_{t=1}^T φ(y_t, y_{t-1}, x_t) = (1/Z(X)) ∏_{t=1}^T exp{θ_{y_t, y_{t-1}} · x_t}
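As a concrete reading of the product above, a small sketch that scores one labeling, assuming θ is stored as an (L, L, d) tensor indexed by (current label, previous label) and a fixed dummy start label y_0 = 0 (both layout choices are assumptions):

```python
import numpy as np

def crf_unnormalized_logscore(theta, X, y, y0=0):
    """Compute sum_t theta[y_t, y_{t-1}] . x_t, the log of the product of
    potentials above before normalizing by Z(X).
    theta: (L, L, d); X: (T, d) feature vectors; y: length-T label sequence."""
    score, prev = 0.0, y0
    for x_t, y_t in zip(X, y):
        score += theta[y_t, prev] @ x_t
        prev = y_t
    return score
```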

11 CRF [chain graph as above] Most implementations use a second-order quasi-Newton optimizer (e.g. L-BFGS) to optimize the weights. Let m_{i,t,j} be one when y_t^{(i)} = j. l(θ; D) = Σ_{i=1}^N log p(Y_i|X_i) = Σ_{i=1}^N Σ_{t=1}^{T_i} θ_{y_t^{(i)}, y_{t-1}^{(i)}} · x_t^{(i)} − Σ_{i=1}^N log Z(X_i) ∂l(θ; D)/∂θ_{jk} = Σ_{i=1}^N Σ_{t=1}^{T_i} [m_{i,t,j} m_{i,t-1,k} − p(y_t = j, y_{t-1} = k | X_i)] x_t^{(i)}
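A sketch of this gradient for one sequence, computing the pairwise marginals p(y_t = j, y_{t-1} = k | X) with a log-space forward-backward pass (same hypothetical (L, L, d) weight layout and fixed start label as above):

```python
import numpy as np
from scipy.special import logsumexp

def crf_grad_one_seq(theta, X, y, y0=0):
    """Gradient of log p(Y|X) for one sequence, matching the formula above:
    empirical pair counts minus model pairwise marginals, weighted by x_t.
    theta: (L, L, d) indexed [current, previous, feature]; X: (T, d)."""
    T, L = X.shape[0], theta.shape[0]
    logphi = np.einsum('jkd,td->tjk', theta, X)   # logphi[t, j, k] = theta[j, k] . x_t
    alpha = np.zeros((T, L))
    beta = np.zeros((T, L))
    alpha[0] = logphi[0, :, y0]                   # fixed dummy start label y_0
    for t in range(1, T):
        alpha[t] = logsumexp(alpha[t - 1][None, :] + logphi[t], axis=1)
    for t in range(T - 2, -1, -1):
        beta[t] = logsumexp(beta[t + 1][:, None] + logphi[t + 1], axis=0)
    logZ = logsumexp(alpha[-1])
    grad, prev = np.zeros_like(theta), y0
    for t in range(T):
        if t == 0:                                # only k = y0 has mass at t = 0
            pair = np.zeros((L, L))
            pair[:, y0] = np.exp(logphi[0, :, y0] + beta[0] - logZ)
        else:                                     # p(y_t = j, y_{t-1} = k | X)
            pair = np.exp(alpha[t - 1][None, :] + logphi[t] + beta[t][:, None] - logZ)
        pair[y[t], prev] -= 1.0                   # subtract empirical m_{t,j} m_{t-1,k}
        grad -= pair[:, :, None] * X[t]           # accumulate (empirical - expected) x_t
        prev = y[t]
    return grad
```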

12 Conditional Models. Fully observed chain structures: CRF. Latent state: IOHMM, HRF. Template models: RBM-CRF.

13 IOHMM [chain graph: labels y_1, ..., y_T; latent states h_0, h_1, ..., h_T; observations x_1, ..., x_T] Input output HMM [Bengio, Frasconi 1995]. It is an HMM where the transition and emission distributions are conditional on the observations. A set of latent random variables H that form a chain structure is used to maintain state information. p(Y|X) = Σ_H ∏_{t=1}^T p(h_t | h_{t-1}, x_t) p(y_t | h_t, x_t), with p(h_t = j | h_{t-1} = k, x_t) ∝ exp{λ_{jk} · x_t} and p(y_t = j | h_t = k, x_t) ∝ exp{θ_{jk} · x_t}
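A sketch of the marginalization over H as a forward recursion, with the transition and emission softmaxes parameterized as above (the (S, S, d) / (labels, S, d) weight layouts and the fixed initial state are assumptions):

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def iohmm_loglik(lam, theta, X, y, h0=0):
    """log p(Y|X), marginalizing the latent chain with a forward recursion.
    lam: (S, S, d) transition weights; theta: (num_labels, S, d) emission
    weights, giving the conditional softmaxes above. Assumes a fixed
    initial state h_0; a full implementation would rescale alpha."""
    alpha = None
    for t in range(len(X)):
        trans = softmax(np.einsum('jkd,d->jk', lam, X[t]))    # p(h_t=j | h_{t-1}=k, x_t)
        emit = softmax(np.einsum('jkd,d->jk', theta, X[t]))   # p(y_t=j | h_t=k, x_t)
        if t == 0:
            alpha = trans[:, h0] * emit[y[t], :]
        else:
            alpha = (trans @ alpha) * emit[y[t], :]           # alpha[j] = p(y_1..y_t, h_t=j | X)
    return np.log(alpha.sum())
```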

14 IOHMM [chain graph as above] The EM algorithm is used to train the IOHMM. In the E step, we need to calculate the posterior distributions p(h_t^{(i)} = j | Y_i, X_i) and p(h_t^{(i)} = j, h_{t-1}^{(i)} = k | Y_i, X_i). The M step updates: 1. the transition weights, 2. the emission weights. Both of these updates are weighted logistic regression problems.
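A sketch of the emission M-step as one weighted logistic regression per latent state, with the E-step posteriors gamma[t, k] = p(h_t = k | Y, X) as sample weights (scikit-learn stands in here for whichever optimizer is actually used):

```python
from sklearn.linear_model import LogisticRegression

def emission_m_step(X, y, gamma):
    """Fit p(y_t | h_t = k, x_t) for each latent state k on all items,
    weighting item t by its E-step posterior gamma[t, k]."""
    models = []
    for k in range(gamma.shape[1]):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, y, sample_weight=gamma[:, k])  # weighted logistic regression
        models.append(clf)
    return models
```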

15 HRF [chain graph: labels y_1, ..., y_T; latent states h_0, h_1, ..., h_T; observations x_1, ..., x_T] Hidden random field [Kakade, Teh, Roweis 2002]. It is the undirected equivalent of the IOHMM. p(Y|X) = Σ_H p(Y, H|X) = (1/Z(X)) Σ_H ∏_{t=1}^T φ(h_t, h_{t-1}, x_t) φ(y_t, h_t, x_t) = (1/Z(X)) Σ_H ∏_{t=1}^T exp{λ_{h_t, h_{t-1}} · x_t + θ_{y_t, h_t} · x_t} Like the IOHMM, training uses the EM algorithm. The M step involves optimizing a fully-observed CRF.

16 Conditional Models. Fully observed chain structures: CRF. Latent state: IOHMM, HRF. Template models: RBM-CRF.

17 Features CRF models are fully specified with respect to label configurations. Adding higher-order features results in an exponential increase in model complexity. Idea: learn higher-order structures from the data using label features. Label feature: a parameterized feature on a group of label variables. The weights can be adjusted to change what the label feature looks for.

18 RBM-CRF Restricted Boltzmann machine CRF [He, Zemel, Carreira-Perpiñán 2004]. A label feature is a binary random variable connected to J labels. Define label features through groups: g = (n_g, o_g, {W_{g,n}}_{n=1}^{n_g}, b_g) n_g: number of label features in the group. o_g: vector of offsets that specifies the connectivity. Instantiate a group at time t (replication r): g(t) = (n_g, o_g(t), {W_{g,n}}_{n=1}^{n_g}, b_g), where o_g(t) = o_g + t
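A small container capturing this definition (hypothetical code, not the authors' implementation):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Group:
    """A group of label features, following the definition above."""
    n_g: int                   # number of label features in the group
    o_g: List[int]             # offsets: which labels the features connect to
    W: List[List[float]]       # one weight vector W_{g,n} per label feature
    b: List[float]             # biases b_g

    def instantiate(self, t):
        """Replication at time t: shift the offsets by t, share the weights."""
        return Group(self.n_g, [o + t for o in self.o_g], self.W, self.b)
```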

19 RBM-CRF: Example [figure: two groups rolled out over labels y_1, ..., y_4; group f (offsets 0:1) has two label features replicated three times (h_{f,n,r}); group g has one label feature replicated twice (h_{g,1,1}, h_{g,1,2})]

20 RBM-CRF An RBM-CRF is just a collection of groups. When rolled out across a sequence, the model has the form of a restricted Boltzmann machine (RBM). p_g(Y, H) ∝ exp{Σ_r Σ_n [b_{g,n} + Σ_k w_{g,n,k} l_k] h_{g,n,r}} = ∏_r ∏_n exp{[b_{g,n} + Σ_k w_{g,n,k} l_k] h_{g,n,r}} = ∏_r ∏_n p_{g,n,r}(h_{g,n,r}, Y)

21 RBM-CRF: Inference and Training Exact inference is hard because of the loopy structure. However, we can exploit the bipartite nature of the graph to implement a block Gibbs sampler. RBM-CRF models are products of experts, so we can use contrastive divergence to train them.
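A generic sketch of one CD-1 update for a binary RBM over the rolled-out label vector, using the bipartite structure for the block Gibbs updates (all hidden units given the labels, then all labels given the hiddens); this is the plain RBM update, not the exact RBM-CRF one:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(W, b, c, v0, lr=0.05):
    """One contrastive-divergence update. v0: observed binary label vector;
    W: (hidden, visible) weights; b, c: hidden / visible biases."""
    h0p = sigmoid(W @ v0 + b)                           # block-update all hiddens given labels
    h0 = (rng.random(h0p.shape) < h0p).astype(float)    # sample the hidden layer
    v1p = sigmoid(W.T @ h0 + c)                         # block-update all labels given hiddens
    v1 = (rng.random(v1p.shape) < v1p).astype(float)
    h1p = sigmoid(W @ v1 + b)
    W += lr * (np.outer(h0p, v0) - np.outer(h1p, v1))   # positive - negative statistics
    b += lr * (h0p - h1p)
    c += lr * (v0 - v1)
    return W, b, c
```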

22 RBM-CRF: Observations There are several ways to incorporate observations into RBM-CRF models: 1. Use a pre-trained classifier (e.g. logistic regression, CRF). 2. Train a classifier at the same time as the RBM-CRF. 3. Incorporate observations directly into the parameterization of the label features: p_g(Y, H | X) ∝ exp{Σ_r Σ_n [b_{g,n} + Σ_j θ_{g,n,j} x_j + Σ_k w_{g,n,k} l_k] h_{g,n,r}}

23 Outline 1. Probabilistic models for structure learning 2. Experimental results 3. Conclusions and future work

24 Evaluation Metrics With respect to label l, let - A be the number of true positives, B be the number of false negatives, - C be the number of false positives, D be the number of true negatives. F1 score: F1 = 2A/(2A + B + C). Accuracy: number of items labeled correctly divided by the total number of items in a sequence. Instance accuracy: percentage of sequences labeled entirely correctly. We use Viterbi decoding for the CRF, IOHMM, and HRF and maximum marginals for logistic regression and the RBM-CRF.
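A direct transcription of these metrics (a sketch; decoding is assumed to have already produced the predicted sequences):

```python
def label_f1(true, pred, label):
    """F1 = 2A / (2A + B + C) with A, B, C as defined above."""
    A = sum(t == label and p == label for t, p in zip(true, pred))  # true positives
    B = sum(t == label and p != label for t, p in zip(true, pred))  # false negatives
    C = sum(t != label and p == label for t, p in zip(true, pred))  # false positives
    return 2 * A / (2 * A + B + C) if (2 * A + B + C) else 0.0

def accuracy(true, pred):
    """Fraction of items in one sequence labeled correctly."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)

def instance_accuracy(true_seqs, pred_seqs):
    """Fraction of sequences labeled entirely correctly."""
    exact = sum(list(t) == list(p) for t, p in zip(true_seqs, pred_seqs))
    return exact / len(true_seqs)
```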

25 Toy Problem 1 training instances & 100 test instances, each of length 50. Models require memory / knowledge of structures to classify observation 5. RBM-CRF models look at groups of six label variables. The IOHMM/HRF have 3 states. [table: observations 1-5 and their labels (C, D, E, Q, S); observation 5 is labeled D or E depending on context]

26 Toy Problem 1 [results table: per-label F1 for D and E, avg. F1, avg. accuracy, and instance accuracy for LR/CRF, IOHMM, HRF, RBM(1), RBM(2), RBM(3)]

27 Toy Problem 1 [figure: learned label-feature weights over labels C, D, E, Q for each model's features]

28 Toy Problem 2 Three labels: A, B, O. Label structures: five patterns over A and B, separated by O. Each label has a distribution over observations, but the interior of the larger structures is a mixture distribution. 100 training instances, 1000 test instances. RBM(1) has 9 pair-wise label features; RBM(2) has 9 pair-wise features and 3 features that look at groups of 8 label variables. Five-fold cross-validation was used to choose the number of states in the IOHMM and the HRF (8).

29 Toy Problem 2 [results table: avg. F1, avg. accuracy, and instance accuracy for LR, CRF, IOHMM, HRF, RBM(1), RBM(2)]

30 Toy Problem 3 From [Kakade, Teh, Roweis 2002]. Investigates variable-term memory. The CRF, IOHMM, and HRF can model the task, but achieve suboptimal performance. RBM-CRF models with label features conditional on the input can improve upon the performance of the CRF, but are not able to model the data perfectly because they maintain no state.

31 Cora References Consists of 500 bibliography entries from research papers; 350 are used for training and 150 for testing. Labels: author, book title, date, editor, institution, journal, location, note, pages, publisher, tech, title, and volume. Entries are tokenized by whitespace. Each token is processed by 4191 features: 3 * (19 regular expressions + 4 categorical vocabulary features). The IOHMM and HRF took too long to train.

32 Cora References
Name: RBM-LR(2). Groups: (10,0:4), (10,0:9), (10,0:19). Observations: pre-trained LR.
Name: RBM-CRF(2). Groups: (10,0:4), (10,0:9), (10,0:19). Observations: pre-trained CRF.
Name: RBM(3). Groups: (15,0:1), (20,0:2), (5,0:9), (5,0:19). Observations: LR.
Notation: (number of nodes, offsets)

33 Cora References [results table: avg. F1, avg. accuracy, and instance accuracy for LR, RBM-LR(2), CRF*, CRF, RBM-CRF(2), RBM(3)] CRF*: CRF trained by [Peng, McCallum 2005].

34 Outline 1. Probabilistic models for structure learning 2. Experimental results 3. Conclusions and future work

35 Conclusions We evaluated several methods for sequence labeling, including latent state models and parameterized template models. IOHMMs and HRFs can be more expressive than CRFs, but it can be difficult to pick a good number of latent states, and training takes a long time and is subject to local optima. Template models may improve performance, but it is easy to pick suboptimal architectures. They do not maintain state, so they may not be appropriate for all tasks.

36 Future Work RBM-CRF: - More experiments on real data are required. - Feature induction to learn the architecture of the model. - Extension to a hierarchical model. - Integration of the RBM-CRF with one or more IOHMM/HRF models in a product model. IOHMM/HRF: - Investigation of training issues and methods. All: - Put the models in the Bayesian framework.

37 end.

38 Toy Problem 1 [timing table: mean training time (s) and standard deviation for LR, MEMM, IMEMM, CRF, ICRF, IOHMM, HRF, RBM(1), RBM(2), RBM(6)]

39 [figure: label-feature weights by replication and training iteration for RBM(1), RBM(2), and RBM(3)]

40 Toy Problem 2 [figure: five-fold cross-validation error versus number of latent states]

41 Toy Problem 2 [timing table: mean training time (s) and standard deviation for LR, MEMM, IMEMM, CRF, ICRF, IOHMM, HRF, RBM(1), RBM(2)]

42 Toy Problem 2 [figure: the nine pair-wise label-feature weight matrices over labels A, B, O for RBM(1)]

43 Toy Problem 2 [figure: the nine pair-wise label-feature weight matrices for RBM(2)]

44 Toy Problem 2 [figure: the three higher-order label-feature weight vectors for RBM(2)]

45 Toy Problem 2 [figure: training error versus number of hidden states for the IOHMM and HRF (two parameter settings: 2 and 10)]

46 Toy Problem 2 [figure: test error versus number of hidden states for the IOHMM and HRF (same two parameter settings)]

47 Toy Problem 3 From [Kakade, Teh, Roweis 2002]. The label of the first I is always 1. The CRF can only model A/B/R. No RBM-CRF can model A/B/R/I, as no state is maintained. [table: observation-to-label mapping for labels A, B, R, I]

48 Toy Problem 3 [figure: group architectures; RBM(1) has one group with offsets -1:0, RBM(2) has groups 1-10]

49 Toy Problem 3 [results tables: avg. F1, avg. accuracy, and instance accuracy for LR, CRF, IOHMM, HRF, RBM(1), RBM(2); per-observation error rates per label for the same models]
