1 The algorithm for variable-order linear-chain CRFs



In this section, we introduce three algorithms that constitute the main part of this paper. One is the Sum-Difference algorithm, which is used to calculate the expectations of the feature functions in order to estimate the parameters. This algorithm uses sum scores and difference scores when calculating forward and backward scores, which makes dynamic programming applicable. The other two are decoding algorithms used to infer the best labels for a sequence of observations. One processes the sequence from head to tail and the other from tail to head.

1.1 Notations

We introduce the notations used in this paper. $T$ denotes the length of a sequence. $x$ and $y$ represent respectively a sequence of observations of length $T$ and a sequence of labels of length $T + 2$. $y$ has positions $0$ and $T + 1$, which represent respectively the beginning and the ending of the sequence and carry the labels $l_{BOS}$ and $l_{EOS}$. Suppose we have a set of labels $Y = \{l_1, \ldots, l_L\}$. We define the set of labels $Y_t$ at a particular position $t$ as follows.

$$Y_t = \begin{cases} \{l_{BOS}\} & \text{if } t = 0 \\ \{l_{EOS}\} & \text{if } t = T + 1 \\ Y & \text{otherwise} \end{cases} \tag{1}$$

We write $z \in Y_{n:m}$ when a label sequence $z$ satisfies $|z| = m - n + 1$ and $z_t \in Y_{n+t-1}$ for all $t$ with $1 \le t \le m - n + 1$. For a label sequence $z$, we define $z_{i:j} = z_i \cdots z_j$. When $j < i$, let $z_{i:j} = \epsilon$, where $\epsilon$ represents an empty string. When two label sequences $z^1$ and $z^2$ are concatenated into $z^3$, we denote it as $z^3 = z^1 + z^2$.

When a label sequence $z^1$ is a suffix of $z^2$, we denote it as $z^1 \le_s z^2$, otherwise $z^1 \not\le_s z^2$. For any label sequence $z$, $\epsilon \le_s z$. When $z^1$ is a proper suffix of $z^2$, we denote it as $z^1 <_s z^2$. The following two statements can be trivially derived:

$$z^1 <_s z^2,\ z^1 \ne \epsilon \ \Rightarrow\ z^1_{1:|z^1|-1} <_s z^2_{1:|z^2|-1} \tag{2}$$

$$z^1 <_s z^3,\ z^2 <_s z^3 \ \Rightarrow\ z^1 <_s z^2 \ \text{or}\ z^2 \le_s z^1 \tag{3}$$

For a set of label sequences $\mathcal{S}$, we define $s(z^1, \mathcal{S})$ as follows:

$$s(z^1, \mathcal{S}) = z^2 \ \text{if and only if}\ z^2 \in \mathcal{S} \ \text{and}\ z^2 <_s z^1 \ \text{and}\ \forall z \in \mathcal{S},\ z <_s z^1 \Rightarrow z \le_s z^2 \tag{4}$$

and call it the longest proper suffix of $z^1$ with respect to $\mathcal{S}$. $z^1$ may or may not be an element of $\mathcal{S}$.
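The suffix relations and the longest proper suffix of (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: label sequences are represented as tuples of strings, the empty tuple plays the role of $\epsilon$, and the function names (`is_suffix`, `is_proper_suffix`, `longest_proper_suffix`) are hypothetical, not taken from the paper. Because all suffixes of one sequence are totally ordered by length, the longest candidate is simply the one of maximal length.

```python
from typing import Iterable, Optional, Tuple

Seq = Tuple[str, ...]  # a label sequence; () stands for the empty string


def is_suffix(z1: Seq, z2: Seq) -> bool:
    """z1 <=_s z2: z1 is a (possibly improper) suffix of z2."""
    return len(z1) <= len(z2) and z2[len(z2) - len(z1):] == z1


def is_proper_suffix(z1: Seq, z2: Seq) -> bool:
    """z1 <_s z2: z1 is a suffix of z2 and strictly shorter."""
    return len(z1) < len(z2) and is_suffix(z1, z2)


def longest_proper_suffix(z1: Seq, S: Iterable[Seq]) -> Optional[Seq]:
    """s(z1, S) as in (4); returns None when no element of S qualifies."""
    candidates = [z for z in S if is_proper_suffix(z, z1)]
    return max(candidates, key=len) if candidates else None


S = {(), ("b",), ("a", "b"), ("c", "a", "b")}
longest = longest_proper_suffix(("c", "a", "b"), S)
```

The linear scan over `S` is chosen for readability only; it is not meant to be efficient.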

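Similarly, we define $S(z^1, \mathcal{S})$ as follows:

$$S(z^1, \mathcal{S}) = z^2 \ \text{if and only if}\ z^2 \in \mathcal{S} \ \text{and}\ z^2 \le_s z^1 \ \text{and}\ \forall z \in \mathcal{S},\ z \le_s z^1 \Rightarrow z \le_s z^2 \tag{5}$$

and call it the longest suffix of $z^1$ with respect to $\mathcal{S}$. $S(z, \mathcal{S}) = z$ if and only if $z$ is an element of $\mathcal{S}$. For any set $\mathcal{S}$, let $s(\epsilon, \mathcal{S}) = \bot$. Here $\bot$ is an imaginary label sequence that never equals any other label sequence ($\forall z,\ z \ne \bot$), and $\bot_n$ is such a label sequence of length $n$.

Let the feature functions be $f_1, \ldots, f_m$. Each feature function is a binary function that takes three arguments (observations, label sequence, position) and can be expressed as a product of two binary functions: $b_i(x, t)$, which we call the observation function, and $L_i(y, t)$, which we call the label function.

$$f_i(x, y, t) = b_i(x, t)\, L_i(y, t) \tag{6}$$

Each $L_i$ is associated with a label sequence $z^i$ and is defined as follows:

$$L_i(y, t) = \begin{cases} 1 & \text{if } y_{t-|z^i|+1:t} = z^i \\ 0 & \text{otherwise} \end{cases} \tag{7}$$

We also define $L'_i(y^s, t)$ and $L''_i(z)$ as alternate representations of the label function, where $y^s$ is a suffix of a label sequence $y$ and $z$ is an arbitrary label sequence. We use zero-based indexes for $y^s$.

$$L'_i(y^s, t) = \begin{cases} 1 & \text{if } t + |y^s| - |y| - |z^i| + 1 \ge 0 \text{ and } y^s_{t+|y^s|-|y|-|z^i|+1 \,:\, t+|y^s|-|y|} = z^i \\ 0 & \text{otherwise} \end{cases} \tag{8}$$

$$L''_i(z) = \begin{cases} 1 & \text{if } |z| \ge |z^i| \text{ and } z_{|z|-|z^i|+1:|z|} = z^i \\ 0 & \text{otherwise} \end{cases} \tag{9}$$

We define $f'_i$ and $f''_i$ as the products of $b_i$ and $L'_i$, and of $b_i$ and $L''_i$, respectively.

$$f'_i(x, y^s, t) = b_i(x, t)\, L'_i(y^s, t) \tag{10}$$

$$f''_i(x, z, t) = b_i(x, t)\, L''_i(z)$$

When we define $f_i$, $f'_i$ and $f''_i$ as above, the following can be said:

$$f_i(x, y, t) = f''_i(x, y_{t-|z^i|+1:t}, t) \tag{13}$$

$$f'_i(x, y^s, t) = f''_i(x, y^s_{t+|y^s|-|y|-|z^i|+1 \,:\, t+|y^s|-|y|}, t) \tag{14}$$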
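The label functions in (7) and (9), and the identity (13) that links them, can be illustrated with a small Python sketch. The representation is an assumption made for this example: $y$ is a tuple indexed by zero-based positions $0, \ldots, T+1$, the observation function $b_i(x, t)$ is passed in as a plain boolean, and the names `label_fn`, `label_fn_seq` and `feature` are hypothetical.

```python
from typing import Sequence, Tuple


def label_fn(y: Sequence[str], t: int, z_i: Tuple[str, ...]) -> int:
    """L_i(y, t) as in (7): 1 if the labels of y ending at position t spell z_i."""
    start = t - len(z_i) + 1
    return int(start >= 0 and tuple(y[start:t + 1]) == tuple(z_i))


def label_fn_seq(z: Sequence[str], z_i: Tuple[str, ...]) -> int:
    """L''_i(z) as in (9): 1 if z_i is a suffix of z."""
    return int(len(z) >= len(z_i) and tuple(z[len(z) - len(z_i):]) == tuple(z_i))


def feature(b_i_active: bool, y: Sequence[str], t: int, z_i: Tuple[str, ...]) -> int:
    """f_i(x, y, t) = b_i(x, t) * L_i(y, t), with b_i(x, t) supplied by the caller."""
    return int(b_i_active) * label_fn(y, t, z_i)


y = ("<s>", "a", "b", "</s>")   # positions 0 .. T+1 with T = 2
z_i = ("a", "b")
# Identity (13): L_i(y, t) agrees with L''_i applied to the window ending at t.
agree = all(label_fn(y, t, z_i) == label_fn_seq(y[t - 1:t + 1], z_i) for t in range(1, 4))
```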

1.2 Training of High-order CRF

Here is the expected sum of $f_i$ based on the current parameters:

$$E[f_i] = \sum_{(x, \tilde{y})} \sum_{y} P(y \mid x) \sum_{t=|z^i|-1}^{T+1} f_i(x, y, t) \tag{15}$$

We will need the expectations of $f_i$ later when we update the parameters. Here we introduce a way to calculate them.

1.2.1 Patterns

We define $P_t$, the pattern set at the position $t$, as follows:

$$P_t = Y_t \cup \{\epsilon\} \cup \bigcup_{t' \ge t} \{\, z^k_{1:|z^k|-t'+t} \mid b_k(x, t') = 1,\ |z^k| > t' - t \,\} \tag{16}$$

We call each element of $P_t$ a pattern at the position $t$. A pattern is either one of the label sequences associated with the feature functions, or a prefix of a pattern at a later position $t' > t$ with the last $t' - t$ labels removed. When a label sequence $y$ satisfies $y_{t-|z^p|+1:t} = z^p$ for a pattern $z^p$ at a position $t$, we say that $y$ contains $z^p$ at the position $t$. When we define the pattern sets as in (16), the following can be said:

$$z \in P_t,\ z \ne \epsilon,\ t > 0 \ \Rightarrow\ z_{1:|z|-1} \in P_{t-1} \tag{17}$$

We can also say the following:

$$f''_i(x, z, t) = f''_i(x, S(z, P_t), t) \tag{18}$$

Assume this is not true, so that there is a feature function with $f''_i(x, S(z, P_t), t) = 0$ and $f''_i(x, z, t) = 1$. Then $b_i(x, t) = 1$ and $S(z, P_t) <_s z^i \le_s z$. This means that $z^i \in P_t$, which contradicts the definition of the longest suffix given in (5).

1.2.2 Scores of Patterns

Considering positions $0$ and $T + 1$ as the beginning and ending of a label sequence, the conditional probability distribution of a linear-chain CRF is defined as $P(y \mid x) = Z_x(y) / Z_x$, where

$$Z_x(y) = \exp\Bigl(\sum_{i=1}^{m} \sum_{t=|z^i|-1}^{T+1} f_i(x, y, t)\, \lambda_i\Bigr), \qquad Z_x = \sum_{y} Z_x(y).$$

We call $Z_x(y)$ the score of $y$. Let $\sigma(z^p, t)$ denote the sum of the scores of the label sequences $y$ that contain $z^p$ at a position $t$. It can be expressed in the following way:

$$\sigma(z^p, t) = \sum_{y \,:\, y \in Y_{0:T+1},\ y_{t-|z^p|+1:t} = z^p} \exp\Bigl(\sum_{i=1}^{m} \sum_{t'=1}^{T+1} f_i(x, y, t')\, \lambda_i\Bigr) \tag{19}$$

Unlike first-order linear-chain CRFs, we cannot directly decompose this using forward and backward scores. The fact that a label sequence $y$ contains a pattern $z^p$ at a position $t$ does not necessarily mean that $z^p$ is the longest pattern that $y$ contains at that position, so backward scores cannot be determined. We therefore introduce supplementary variables $\theta(z^p, t)$ that denote the sum of the scores of the label sequences $y$ that contain a pattern $z^p \in P_t$ at a position $t$ and do not contain any longer pattern at the same position, i.e. do not contain any pattern that has $z^p$ as its longest proper suffix.

$$\theta(z^p, t) = \sum_{\substack{y \,:\, y \in Y_{0:T+1},\ y_{t-|z^p|+1:t} = z^p, \\ \forall z \in P_t,\ z^p <_s z \Rightarrow y_{t-|z|+1:t} \ne z}} \exp\Bigl(\sum_{i=1}^{m} \sum_{t'=1}^{T+1} f_i(x, y, t')\, \lambda_i\Bigr) = \sum_{\substack{y \,:\, y \in Y_{0:T+1},\ y_{t-|z^p|+1:t} = z^p, \\ \forall z \in P_t,\ s(z, P_t) = z^p \Rightarrow y_{t-|z|+1:t} \ne z}} \exp\Bigl(\sum_{i=1}^{m} \sum_{t'=1}^{T+1} f_i(x, y, t')\, \lambda_i\Bigr) \tag{20}$$

When we define $\theta(z^p, t)$ like this, we can express $\sigma(z^p, t)$ as a sum of $\theta$:

$$\sigma(z^p, t) = \sum_{z \,:\, z \in P_t,\ z^p \le_s z} \theta(z, t) \tag{21}$$

By the definition of patterns given in (16), if a label sequence $y$ contains $z^p$ at a position $t$ and does not contain any longer pattern there, no feature function at a position after $t$ is affected by the labels before position $t - |z^p| + 1$. So we can decompose $\theta(z^p, t)$ into two parts, the forward part ($1 \le t' \le t$) and the backward part ($t + 1 \le t' \le T + 1$). The forward part can also be represented in the form of a subtraction. Let $\alpha(z^p, t)$ and $\beta(z^p, t)$ denote the forward and backward parts respectively.

$$\theta(z^p, t) = \Biggl[\sum_{\substack{z \,:\, z \in Y_{0:t},\ z_{t-|z^p|+1:t} = z^p, \\ \forall z' \in P_t,\ s(z', P_t) = z^p \Rightarrow z_{t-|z'|+1:t} \ne z'}} \exp\Bigl(\sum_{i=1}^{m} \sum_{t'=1}^{t} f_i(x, z, t')\, \lambda_i\Bigr)\Biggr] \Biggl[\sum_{z \,:\, z \in Y_{t+1:T+1}} \exp\Bigl(\sum_{i=1}^{m} \sum_{t'=t+1}^{T+1} f'_i(x, z^p + z, t')\, \lambda_i\Bigr)\Biggr] = \alpha(z^p, t)\, \beta(z^p, t) \tag{22}$$

1.2.3 Pattern weights

For a pattern $z^p \in P_t$ and a position $t$, we define $w(z^p, t)$ as follows:

$$w(z^p, t) = \sum_{i \,:\, z^i = z^p,\ b_i(x, t) = 1} \lambda_i \tag{23}$$

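This is the sum of the weights of the feature functions associated with the pattern $z^p$ at the position $t$. We also define $W(z^p, t)$ for a pattern $z^p \in P_t$ and a position $t$ as follows:

$$W(z^p, t) = \sum_{i \,:\, z^i \le_s z^p,\ b_i(x, t) = 1} \lambda_i = \sum_{i=1}^{m} f''_i(x, z^p, t)\, \lambda_i \tag{24}$$

This is the sum of the weights of the feature functions associated with the pattern $z^p$ or any of its suffixes at the position $t$. From (24) and (23), we can express $W$ using $w$ as follows:

$$W(z^p, t) = \sum_{z \,:\, z \in P_t,\ z \le_s z^p} w(z, t) \tag{25}$$

When we define $W$ and $w$ as above, we can say that for any $z^p \ne \epsilon$, $W(z^p, t) = W(s(z^p, P_t), t) + w(z^p, t)$.

1.2.4 The Forward Variables

As shown in (22), $\alpha(z^p, t)$ is defined for a pattern $z^p$ and a position $t \ge 1$ as follows:

$$\alpha(z^p, t) = \sum_{\substack{z \,:\, z \in Y_{0:t},\ z_{t-|z^p|+1:t} = z^p, \\ \forall z' \in P_t,\ s(z', P_t) = z^p \Rightarrow z_{t-|z'|+1:t} \ne z'}} \exp\Bigl(\sum_{i=1}^{m} \sum_{t'=1}^{t} f_i(x, z, t')\, \lambda_i\Bigr) \tag{26}$$

Let $\alpha(\epsilon, t) = 0$ for any $t$. In order to calculate these using dynamic programming, we also define $\gamma(z^p, t)$ for a pattern $z^p \in P_t$ and a position $t$ as follows:

$$\gamma(z^p, t) = \sum_{z \,:\, z \in Y_{0:t},\ z_{t-|z^p|+1:t} = z^p} \exp\Bigl(\sum_{i=1}^{m} \sum_{t'=1}^{t} f_i(x, z, t')\, \lambda_i\Bigr) \tag{27}$$

When we define $\alpha$ and $\gamma$ like this, we can say the following using (24):

$$\alpha(z^p, t) = \sum_{\substack{z \,:\, z \in Y_{0:t},\ z_{t-|z^p|+1:t} = z^p, \\ \forall z' \in P_t,\ s(z', P_t) = z^p \Rightarrow z_{t-|z'|+1:t} \ne z'}} \exp\Bigl(\sum_{i=1}^{m} \sum_{t'=1}^{t-1} f_i(x, z_{1:|z|-1}, t')\, \lambda_i\Bigr) \exp\Bigl(\sum_{i=1}^{m} f''_i(x, z^p, t)\, \lambda_i\Bigr) = \Bigl(\gamma(z^p_{1:|z^p|-1}, t-1) - \sum_{z \,:\, z \in P_t,\ s(z, P_t) = z^p} \gamma(z_{1:|z|-1}, t-1)\Bigr) \exp\bigl(W(z^p, t)\bigr) \tag{28}$$

$$\gamma(z^p, t) = \sum_{z \,:\, z \in P_t,\ z^p \le_s z} \alpha(z, t) \tag{29}$$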
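The relation between $w$ and $W$ stated in the pattern-weights subsection can be checked numerically. The sketch below is an illustration under assumptions: it handles a single position $t$, patterns are tuples with `()` standing for $\epsilon$, the pattern set is given explicitly and contains every active feature's label sequence, and the function names are hypothetical. It computes $W$ through the recursion $W(z^p, t) = W(s(z^p, P_t), t) + w(z^p, t)$ and, separately, directly from definition (24).

```python
from collections import defaultdict


def longest_proper_suffix(z, patterns):
    """s(z, patterns): longest element of `patterns` that is a proper suffix of z."""
    cands = [c for c in patterns if len(c) < len(z) and z[len(z) - len(c):] == c]
    return max(cands, key=len)


def pattern_weights(patterns, active_feats):
    """Return (w, W) for one position t.

    patterns:     the pattern set P_t as a set of tuples, including ()
    active_feats: (z_i, lambda_i) pairs for the features with b_i(x, t) = 1
    """
    w = defaultdict(float)
    for z_i, lam in active_feats:
        w[z_i] += lam                      # definition (23)
    W = {(): 0.0}
    # Sorting by length guarantees that every proper suffix is handled first.
    for z in sorted(patterns - {()}, key=len):
        W[z] = W[longest_proper_suffix(z, patterns)] + w[z]
    return w, W


def W_direct(z, active_feats):
    """Definition (24): weights of all active features whose z_i is a suffix of z."""
    return sum(lam for z_i, lam in active_feats
               if len(z_i) <= len(z) and z[len(z) - len(z_i):] == z_i)


patterns = {(), ("b",), ("a", "b"), ("c", "a", "b")}
active_feats = [(("b",), 0.5), (("a", "b"), 1.0), (("a", "b"), 0.25), (("c", "a", "b"), 2.0)]
w, W = pattern_weights(patterns, active_feats)
```

Two features share the label sequence `("a", "b")` in this example, which is why the recursion accumulates `w` per pattern rather than per feature.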

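1.2.5 The Backward Variables

Here we rewrite the definition of $\beta(z^p, t)$ given in (22) using $f'_i$ defined in (10):

$$\beta(z^p, t) = \sum_{z \,:\, z \in Y_{t+1:T+1}} \exp\Bigl(\sum_{i=1}^{m} \sum_{t'=t+1}^{T+1} f_i(x, \bot_{t-|z^p|+1} + z^p + z, t')\, \lambda_i\Bigr) = \sum_{z \,:\, z \in Y_{t+1:T+1}} \exp\Bigl(\sum_{i=1}^{m} \sum_{t'=t+1}^{T+1} f'_i(x, z^p + z, t')\, \lambda_i\Bigr) \tag{30}$$

In order to calculate these using dynamic programming, we also define $\delta(z^p, t)$ for a pattern $z^p \in P_t$ and a position $t$ as follows:

$$\delta(z^p, t) = \begin{cases} \displaystyle\sum_{z \in Y_{t+1:T+1}} \exp\Bigl(\sum_{i=1}^{m} \sum_{t'=t+1}^{T+1} f'_i(x, z, t')\, \lambda_i\Bigr) & \text{if } z^p = \epsilon \\[2ex] \displaystyle\sum_{z \in Y_{t+1:T+1}} \Bigl[\exp\Bigl(\sum_{i=1}^{m} \sum_{t'=t+1}^{T+1} f'_i(x, z^p + z, t')\, \lambda_i\Bigr) - \exp\Bigl(\sum_{i=1}^{m} \sum_{t'=t+1}^{T+1} f'_i(x, s(z^p, P_t) + z, t')\, \lambda_i\Bigr)\Bigr] & \text{otherwise} \end{cases} \tag{31}$$

When we define $\beta$ and $\delta$ like this, we can say the following:

$$\delta(z^p, t) = \begin{cases} \displaystyle\sum_{z \in P_{t+1},\ z_{1:|z|-1} = z^p} \beta(z, t+1) \exp\bigl(W(z, t+1)\bigr) & \text{if } z^p = \epsilon \\[2ex] \displaystyle\sum_{z \in P_{t+1},\ z_{1:|z|-1} = z^p} \Bigl[\beta(z, t+1) \exp\bigl(W(z, t+1)\bigr) - \beta(s(z, P_{t+1}), t+1) \exp\bigl(W(s(z, P_{t+1}), t+1)\bigr)\Bigr] & \text{otherwise} \end{cases} \tag{32}$$

$$\beta(z^p, t) = \sum_{z \in P_t,\ z \le_s z^p} \delta(z, t) \tag{33}$$

Here we present the derivation of (32) for $z^p \ne \epsilon$, which is not straightforward. First, we split off the label $z'$ at position $t + 1$ and rewrite the right hand side of (31) as follows:

$$\sum_{z' \in Y_{t+1:t+1}} \sum_{z \in Y_{t+2:T+1}} \Bigl[\exp\Bigl(\sum_{i=1}^{m} \sum_{t'=t+2}^{T+1} f'_i(x, z^p + z' + z, t')\, \lambda_i\Bigr) \exp\Bigl(\sum_{i=1}^{m} f''_i(x, z^p + z', t+1)\, \lambda_i\Bigr) - \exp\Bigl(\sum_{i=1}^{m} \sum_{t'=t+2}^{T+1} f'_i(x, s(z^p, P_t) + z' + z, t')\, \lambda_i\Bigr) \exp\Bigl(\sum_{i=1}^{m} f''_i(x, s(z^p, P_t) + z', t+1)\, \lambda_i\Bigr)\Bigr] \tag{34}$$

Using (18), we can rewrite (34) as follows:

$$\sum_{z' \in Y_{t+1:t+1}} \sum_{z \in Y_{t+2:T+1}} \Bigl[\exp\Bigl(\sum_{i=1}^{m} \sum_{t'=t+2}^{T+1} f'_i(x, S(z^p + z', P_{t+1}) + z, t')\, \lambda_i\Bigr) \exp\Bigl(\sum_{i=1}^{m} f''_i(x, S(z^p + z', P_{t+1}), t+1)\, \lambda_i\Bigr) - \exp\Bigl(\sum_{i=1}^{m} \sum_{t'=t+2}^{T+1} f'_i(x, S(s(z^p, P_t) + z', P_{t+1}) + z, t')\, \lambda_i\Bigr) \exp\Bigl(\sum_{i=1}^{m} f''_i(x, S(s(z^p, P_t) + z', P_{t+1}), t+1)\, \lambda_i\Bigr)\Bigr] \tag{35}$$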

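Here we can say the following:

$$s(z^p + z', P_{t+1}) \le_s s(z^p, P_t) + z' \tag{36}$$

Assume (36) is not true. By the definition of the longest proper suffix in (4), $s(z^p + z', P_{t+1}) <_s z^p + z'$ and $s(z^p, P_t) + z' <_s z^p + z'$. From (3) and the assumption that (36) is not true, $s(z^p, P_t) + z' <_s s(z^p + z', P_{t+1})$. Let $z'' = s(z^p + z', P_{t+1})$. Since $z'' \in P_{t+1}$ and $z'' \ne \epsilon$, we can derive $z''_{1:|z''|-1} \in P_t$ from (17) and $s(z^p, P_t) <_s z''_{1:|z''|-1} <_s z^p$ from (2). This contradicts the definition of the longest proper suffix in (4).

From the definition of the longest proper suffix given in (4), we can also say that there is no $z'' \in P_{t+1}$ that satisfies $s(z^p + z', P_{t+1}) <_s z'' <_s z^p + z'$. From the above argument, we can say the following:

$$S(s(z^p, P_t) + z', P_{t+1}) = s(z^p + z', P_{t+1}) \tag{37}$$

When $z^p + z' \notin P_{t+1}$, we can say the following from (4) and (5):

$$S(z^p + z', P_{t+1}) = s(z^p + z', P_{t+1}) \tag{38}$$

So in (35), for $z^p + z' \notin P_{t+1}$, the left and right sides of the subtraction cancel out. Only the terms with $z^p + z' \in P_{t+1}$ remain, and writing $z'$ for such a pattern we can rewrite (35) as follows:

$$\sum_{z' \in P_{t+1},\ z'_{1:|z'|-1} = z^p} \ \sum_{z \in Y_{t+2:T+1}} \Bigl[\exp\Bigl(\sum_{i=1}^{m} \sum_{t'=t+2}^{T+1} f'_i(x, z' + z, t')\, \lambda_i\Bigr) \exp\Bigl(\sum_{i=1}^{m} f''_i(x, z', t+1)\, \lambda_i\Bigr) - \exp\Bigl(\sum_{i=1}^{m} \sum_{t'=t+2}^{T+1} f'_i(x, s(z', P_{t+1}) + z, t')\, \lambda_i\Bigr) \exp\Bigl(\sum_{i=1}^{m} f''_i(x, s(z', P_{t+1}), t+1)\, \lambda_i\Bigr)\Bigr] \tag{39}$$

Applying the definition of $\beta$ given in (30) and that of $W$ given in (24), we can derive equation (32).

The $\alpha$'s at a position $t$ can be calculated using the $\gamma$'s at the position $t - 1$, and the $\gamma$'s using the $\alpha$'s at the same position. Likewise, the $\delta$'s at a position $t$ can be calculated using the $\beta$'s at the position $t + 1$, and the $\beta$'s using the $\delta$'s at the same position. Together with $\alpha(l_{BOS}, 0) = 1$ and $\delta(\epsilon, T + 1) = 1$, this lets us calculate all the $\alpha$'s, $\beta$'s, $\gamma$'s and $\delta$'s using dynamic programming. We can then calculate the $\theta$'s using the $\alpha$'s and $\beta$'s, and then the $\sigma$'s using the $\theta$'s. The normalization factor $Z_x$ can be expressed as $\gamma(\epsilon, T + 1)$, because:

$$\gamma(\epsilon, T+1) = \sum_{z \,:\, z \in Y_{0:T+1},\ z_{T+2:T+1} = \epsilon} \exp\Bigl(\sum_{i=1}^{m} \sum_{t'=1}^{T+1} f_i(x, z, t')\, \lambda_i\Bigr) = \sum_{y} Z_x(y) = Z_x \tag{40}$$

So we can now calculate the expectation of each feature function as follows:

$$E_P[f_i \mid x] = \sum_{t=1}^{T+1} b_i(x, t)\, \frac{\sigma(z^i, t)}{Z_x} \tag{41}$$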

1.2.6 The algorithm to calculate the variables

Here we present the algorithm that efficiently calculates the variables introduced in the previous section. First we enumerate the patterns that belong to $P_t$ at positions $t = 1, \ldots, T + 1$. We also enumerate the feature functions whose values equal 1 at the position $t$ if a label sequence contains the pattern $z^p$ at that position, and collect them in a set $F_{z^p, t}$. This procedure can be described as in Algorithm 1.

Algorithm 1 Enumerate patterns
    for $t = 1$ to $T + 1$ do
        $F \leftarrow \{f_k \mid b_k(x, t) = 1\}$
        for all $f_i \in F$ do
            $F_{z^i, t} \leftarrow F_{z^i, t} \cup \{f_i\}$
            for $j = 0$ to $|z^i| - 1$ do
                $P_{t-j} \leftarrow P_{t-j} \cup \{z^i_{1:|z^i|-j}\}$

We can easily link a pattern $z \ne \epsilon \in P_t$ to its corresponding prefix pattern, which we here define as $z_{1:|z|-1} \in P_{t-1}$. We can also link a pattern $z \in P_t$ to its longest proper suffix $s(z, P_t)$ by storing the pattern sets in tries, using the reversed label sequences of the patterns as keys. We can sort the patterns in an order such that $z^1$ appears before $z^2$ when $z^1 <_s z^2$. We call this order the ascending order and its reverse the descending order. Since the sorting order of the patterns, their prefix patterns and their longest proper suffixes do not change over iterations, we only have to execute this procedure once and store the result.

We execute Algorithm 2 for each iteration to calculate the expectations of the feature functions. Assume that all $w(z, t)$ and the other numeric variables are initialized to 0. The complexity of this procedure is

$$O\Bigl(\sum_{t=1}^{T+1} \sum_{z^p \,:\, z^p \in P_t} |F_{z^p, t}| + \sum_{t=1}^{T+1} |P_t|\Bigr) \tag{42}$$

for an iteration on a sequence. The $\gamma$'s and $\delta$'s are temporary variables, and old ones can be discarded.

1.3 Inference

We will describe the procedure to infer the labels using the estimated parameters. First, the label sequence which we want to infer can be described as follows:

$$\hat{y} = \arg\max_{y} \frac{1}{Z_x} \exp\Bigl(\sum_{t} \sum_{i} \lambda_i f_i(x, y, t)\Bigr)$$

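Algorithm 2 Sum-difference
    for $t = 1$ to $T + 1$ do
        for all $z \ne \epsilon \in P_t$ in ascending order do
            for all $f_i \in F_{z,t}$ do
                $w(z, t) \leftarrow w(z, t) + \lambda_i$
            $W(z, t) \leftarrow W(s(z, P_t), t) + w(z, t)$
    $\gamma(\epsilon, 0) \leftarrow 1$, $\gamma(l_{BOS}, 0) \leftarrow 1$
    for $t = 1$ to $T + 1$ do
        for all $z \ne \epsilon \in P_t$ in descending order do
            $\alpha(s(z, P_t), t) \leftarrow \alpha(s(z, P_t), t) - \gamma(z_{1:|z|-1}, t-1)$
            $\alpha(z, t) \leftarrow \bigl(\alpha(z, t) + \gamma(z_{1:|z|-1}, t-1)\bigr) \exp(W(z, t))$
        $\alpha(\epsilon, t) \leftarrow 0$
        for all $z \ne \epsilon \in P_t$ in descending order do
            $\gamma(z, t) \leftarrow \gamma(z, t) + \alpha(z, t)$
            $\gamma(s(z, P_t), t) \leftarrow \gamma(s(z, P_t), t) + \gamma(z, t)$
    $Z_x \leftarrow \gamma(\epsilon, T + 1)$
    $\delta(\epsilon, T + 1) \leftarrow 1$
    for $t = T + 1$ downto $1$ do
        $\beta(\epsilon, t) \leftarrow \delta(\epsilon, t)$
        for all $z \ne \epsilon \in P_t$ in ascending order do
            $\beta(z, t) \leftarrow \beta(s(z, P_t), t) + \delta(z, t)$
        for all $z \ne \epsilon \in P_t$ do
            $\delta(z_{1:|z|-1}, t-1) \leftarrow \delta(z_{1:|z|-1}, t-1) + \beta(z, t) \exp(W(z, t))$
            if $z_{1:|z|-1} \ne \epsilon$ then
                $\delta(z_{1:|z|-1}, t-1) \leftarrow \delta(z_{1:|z|-1}, t-1) - \beta(s(z, P_t), t) \exp(W(s(z, P_t), t))$
            end if
    for $t = 1$ to $T + 1$ do
        for all $z \ne \epsilon \in P_t$ in descending order do
            $\theta(z, t) \leftarrow \alpha(z, t)\, \beta(z, t)$
            $\sigma(z, t) \leftarrow \sigma(z, t) + \theta(z, t)$
            $\sigma(s(z, P_t), t) \leftarrow \sigma(s(z, P_t), t) + \sigma(z, t)$
            for all $f_i \in F_{z,t}$ do
                $E[f_i] \leftarrow E[f_i] + \sigma(z, t) / Z_x$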
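Algorithms 1 and 2 can be tried out on a toy problem with the following Python sketch, which compares $Z_x$ and $\sigma$ against brute-force enumeration of all label sequences. Everything about the representation is an assumption made for illustration: features are given as hypothetical triples `(z_i, positions where b_i(x, t) = 1, lambda_i)` that fit inside the sequence, patterns are tuples, and `lps` finds longest proper suffixes by a linear scan instead of the trie described in the text. Following the case split in (32), the subtraction in the $\delta$ update is applied only when the prefix pattern is not $\epsilon$, and $\theta$ is taken as $\alpha \beta$ as in (22).

```python
import itertools
import math
from collections import defaultdict

BOS, EOS = "<s>", "</s>"  # stand-ins for l_BOS and l_EOS


def lps(z, P):
    """s(z, P): longest proper suffix of z among the patterns P (P contains ())."""
    return max((c for c in P if len(c) < len(z) and z[len(z) - len(c):] == c), key=len)


def sum_difference(T, labels, feats):
    """Return (Z_x, sigma), where sigma[z, t] is the pattern score of (19)."""
    Y = lambda t: [BOS] if t == 0 else [EOS] if t == T + 1 else list(labels)
    P = {t: {()} | {(l,) for l in Y(t)} for t in range(T + 2)}
    w = defaultdict(float)
    for z, active, lam in feats:               # Algorithm 1, accumulating w(z, t)
        for t in active:
            w[z, t] += lam
            for j in range(len(z)):
                P[t - j].add(z[:len(z) - j])
    asc = {t: sorted(P[t] - {()}, key=len) for t in P}   # proper suffixes come first
    s = {t: {z: lps(z, P[t]) for z in asc[t]} for t in P}
    W = defaultdict(float)
    for t in range(1, T + 2):
        for z in asc[t]:
            W[z, t] = W[s[t][z], t] + w[z, t]
    alpha, gamma = defaultdict(float), defaultdict(float)
    gamma[(), 0] = gamma[(BOS,), 0] = 1.0
    for t in range(1, T + 2):                  # forward pass, (28) and (29)
        for z in reversed(asc[t]):
            g = gamma[z[:-1], t - 1]
            alpha[s[t][z], t] -= g
            alpha[z, t] = (alpha[z, t] + g) * math.exp(W[z, t])
        alpha[(), t] = 0.0
        for z in reversed(asc[t]):
            gamma[z, t] += alpha[z, t]
            gamma[s[t][z], t] += gamma[z, t]
    Zx = gamma[(), T + 1]
    beta, delta = defaultdict(float), defaultdict(float)
    delta[(), T + 1] = 1.0
    for t in range(T + 1, 0, -1):              # backward pass, (32) and (33)
        beta[(), t] = delta[(), t]
        for z in asc[t]:
            beta[z, t] = beta[s[t][z], t] + delta[z, t]
        for z in asc[t]:
            d = beta[z, t] * math.exp(W[z, t])
            if len(z) > 1:                     # subtract only when the prefix is not ()
                d -= beta[s[t][z], t] * math.exp(W[s[t][z], t])
            delta[z[:-1], t - 1] += d
    sigma = defaultdict(float)
    for t in range(1, T + 2):                  # theta = alpha * beta, then (21)
        for z in reversed(asc[t]):
            sigma[z, t] += alpha[z, t] * beta[z, t]
            sigma[s[t][z], t] += sigma[z, t]
    return Zx, sigma


def brute_force(T, labels, feats):
    """Reference values obtained by enumerating every label sequence."""
    Z, sigma = 0.0, defaultdict(float)
    keys = {(z, t) for z, active, _ in feats for t in active}
    for mid in itertools.product(labels, repeat=T):
        y = (BOS,) + mid + (EOS,)
        hit = lambda z, t: y[t - len(z) + 1:t + 1] == z
        score = math.exp(sum(lam for z, active, lam in feats for t in active if hit(z, t)))
        Z += score
        for z, t in keys:
            if hit(z, t):
                sigma[z, t] += score
    return Z, sigma


feats = [(("a",), {1, 2, 3}, 0.3), (("a", "b"), {2, 3}, 0.7),
         (("b", "a", "b"), {3}, -0.4), (("b", EOS), {4}, 0.2), ((BOS, "a"), {1}, 0.5)]
Z, sigma = sum_difference(3, ["a", "b"], feats)
Z_ref, sigma_ref = brute_force(3, ["a", "b"], feats)
```

Dividing `sigma[z_i, t]` by `Z` for every position where feature $i$ is active and summing gives the expectation in (41). The brute-force routine is exponential in $T$ and is only useful as a check on small inputs.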

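Since $Z_x$ does not depend on $y$ and $\exp$ is monotonically increasing, the label sequence to infer can equivalently be written as

$$\hat{y} = \arg\max_{y} \sum_{t} \sum_{i} \lambda_i f_i(x, y, t)$$

The algorithm we present here has the complexity of

$$O\Bigl(\sum_{t=1}^{T+1} \sum_{i \,:\, b_i(x, t) = 1} 1 + L \sum_{t=1}^{T+1} \sum_{z^p \,:\, z^p \in P_t} 1\Bigr) \tag{43}$$

which is $O(L)$ times bigger than that of training for one iteration.

1.3.1 Inference Algorithm

We have to calculate $W$ for each pattern beforehand. This procedure, Algorithm 3, is the same as the one that we used in the Sum-Difference algorithm.

Algorithm 3 Calculate weights
    for $t = 1$ to $T + 1$ do
        for all $z \ne \epsilon \in P_t$ in ascending order do
            for all $f_i \in F_{z,t}$ do
                $w(z, t) \leftarrow w(z, t) + \lambda_i$
            $W(z, t) \leftarrow W(s(z, P_t), t) + w(z, t)$

We define $p(z^p, t)$ and $q(z^p, t)$ for a pattern $z^p \in P_t$ as follows.

$$p(z^p, t) = \max_{\substack{z \,:\, z \in P_{t-1},\ z^p_{1:|z^p|-1} \le_s z, \\ \forall z' \in P_t,\ z^p <_s z' \Rightarrow z'_{1:|z'|-1} \not\le_s z}} p(z, t-1)\, \exp\bigl(W(z^p, t)\bigr) \tag{44}$$

$$q(z^p, t) = \mathop{\arg\max}_{\substack{z \,:\, z \in P_{t-1},\ z^p_{1:|z^p|-1} \le_s z, \\ \forall z' \in P_t,\ z^p <_s z' \Rightarrow z'_{1:|z'|-1} \not\le_s z}} p(z, t-1) \tag{45}$$

When we define $p$ and $q$ like this, we can calculate them using Algorithm 4. After that, we can obtain the optimum label sequence $\hat{y}$ using Algorithm 5.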

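Algorithm 4 Inference 1
    $p(\epsilon, 0) \leftarrow 1$, $p(l_{BOS}, 0) \leftarrow 1$
    for $t = 1$ to $T + 1$ do
        for all $z \ne \epsilon \in P_t$ in descending order do
            if this is the first iteration or the last label in $z$ has changed then
                $z' \leftarrow$ the first $z' \in P_{t-1}$ in descending order
                for all $z'' \in P_{t-1}$ do
                    $r(z'', t-1) \leftarrow p(z'', t-1)$
                    $u(z'', t-1) \leftarrow z''$
            end if
            while $z' \ne z_{1:|z|-1}$ do
                if $r(z', t-1) > r(s(z', P_{t-1}), t-1)$ then
                    $r(s(z', P_{t-1}), t-1) \leftarrow r(z', t-1)$
                    $u(s(z', P_{t-1}), t-1) \leftarrow u(z', t-1)$
                end if
                $z' \leftarrow$ next $z' \in P_{t-1}$ in descending order
            end while
            $p(z, t) \leftarrow r(z', t-1) \exp(W(z, t))$
            $q(z, t) \leftarrow u(z', t-1)$

Algorithm 5 Inference 2
    $z \leftarrow \arg\max_{z \in P_{T+1}} p(z, T + 1)$
    for $t = T + 1$ downto $1$ do
        $y_t \leftarrow z_{|z|}$
        $z \leftarrow q(z, t)$
    $y_0 \leftarrow l_{BOS}$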
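The recursion behind $p$ and $q$, together with the backtracking of Algorithm 5, can also be illustrated with a direct, unoptimized Python sketch. It is not Algorithm 4: instead of sweeping the sorted pattern lists, it extends every reachable pattern $z' \in P_{t-1}$ by each label of $Y_t$ and takes the longest suffix $S(z' + l, P_t)$ as the successor state, which is one way of reading the constraint in (44). Scores are kept in the log domain, so $W$ is added rather than multiplied through $\exp$. The feature triples `(z_i, active positions, lambda_i)` and all function names are assumptions made for this example.

```python
from collections import defaultdict

BOS, EOS = "<s>", "</s>"  # stand-ins for l_BOS and l_EOS


def longest_suffix(z, P):
    """S(z, P): longest element of P that is a suffix of z (P contains ())."""
    return max((c for c in P if len(c) <= len(z) and z[len(z) - len(c):] == c), key=len)


def decode(T, labels, feats):
    """Return the highest-scoring label sequence y_0 .. y_{T+1} as a tuple."""
    Y = lambda t: [BOS] if t == 0 else [EOS] if t == T + 1 else list(labels)
    P = {t: {()} | {(l,) for l in Y(t)} for t in range(T + 2)}
    w = defaultdict(float)
    for z, active, lam in feats:              # pattern sets as in Algorithm 1
        for t in active:
            w[z, t] += lam
            for j in range(len(z)):
                P[t - j].add(z[:len(z) - j])

    def W(z, t):
        """Sum of the weights of z and of all of its suffixes at position t, as in (24)."""
        return sum(w[z[k:], t] for k in range(len(z)))

    p = {(BOS,): 0.0}                         # log-domain p(z, 0)
    q = {}                                    # back pointers q(z, t)
    for t in range(1, T + 2):
        new_p = {}
        for z_prev, score in p.items():
            for l in Y(t):
                z = longest_suffix(z_prev + (l,), P[t])
                cand = score + W(z, t)
                if z not in new_p or cand > new_p[z]:
                    new_p[z] = cand
                    q[z, t] = z_prev
        p = new_p
    z = max(p, key=p.get)                     # Algorithm 5: backtrack from T + 1
    y = [None] * (T + 2)
    for t in range(T + 1, 0, -1):
        y[t] = z[-1]
        z = q[z, t]
    y[0] = BOS
    return tuple(y)


feats = [(("a",), {1, 2, 3}, 0.3), (("a", "b"), {2, 3}, 0.7),
         (("b", "a", "b"), {3}, -0.4), (("b", EOS), {4}, 0.2), ((BOS, "a"), {1}, 0.5)]
y_hat = decode(3, ["a", "b"], feats)
```

This sketch visits every (pattern, label) pair at each position and scans the pattern set for each of them, so it is slower than the bound in (43); it is meant only to make the state transitions explicit.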
