Sampling Equation Derivation for Lex-MED-RTM


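Weiwei Yang, Computer Science, University of Maryland, College Park, MD
Jordan Boyd-Graber, Computer Science, University of Colorado, Boulder, CO
Philip Resnik, Linguistics and UMIACS, University of Maryland, College Park, MD

1 Sampling Topics

The probability that documents $d$ and $d'$ are linked is defined as

\[
p(y_{d,d'} \mid \eta, \tau, \bar{z}_d, \bar{z}_{d'}, \bar{w}_d, \bar{w}_{d'}) = \exp\left(-2c \max(0, \zeta_{d,d'})\right), \tag{1}
\]

where $\bar{z}_d = \frac{1}{N_d} \sum_n z_{d,n}$ and $\bar{w}_d = \frac{1}{N_d} \sum_n w_{d,n}$; $\eta$ and $\tau$ are weight vectors for the element-wise products of the two documents' topic proportions and word proportions respectively; $c$ is the regularization parameter; $\zeta_{d,d'}$ is defined as

\[
\zeta_{d,d'} = 1 - y_{d,d'} \left( \eta^T (\bar{z}_d \circ \bar{z}_{d'}) + \tau^T (\bar{w}_d \circ \bar{w}_{d'}) \right), \tag{2}
\]

where $\circ$ denotes the element-wise product of two vectors. Equation 1 can be expressed [1] as

\[
p(y_{d,d'} \mid \eta, \tau, \bar{z}_d, \bar{z}_{d'}, \bar{w}_d, \bar{w}_{d'}) = \int_0^\infty \frac{1}{\sqrt{2\pi\lambda_{d,d'}}} \exp\left( -\frac{(c\zeta_{d,d'} + \lambda_{d,d'})^2}{2\lambda_{d,d'}} \right) d\lambda_{d,d'} \tag{3}
\]

by introducing a latent variable $\lambda_{d,d'}$. Therefore the joint probability of Lex-MED-RTM is

\[
p(w, z, y) \propto \prod_{k=1}^{K} \frac{\Delta(N_k + \beta)}{\Delta(\beta)} \prod_{d=1}^{D} \frac{\Delta(N_d + \alpha)}{\Delta(\alpha)} \prod_{d,d'} \int_0^\infty \frac{1}{\sqrt{2\pi\lambda_{d,d'}}} \exp\left( -\frac{(c\zeta_{d,d'} + \lambda_{d,d'})^2}{2\lambda_{d,d'}} \right) d\lambda_{d,d'}, \tag{4}
\]

where $D$ and $K$ are the numbers of documents and topics respectively; $d, d'$ ranges over the document pairs that are actually linked; $\Delta(\cdot)$ is defined as

\[
\Delta(x) = \frac{\prod_{i=1}^{\dim(x)} \Gamma(x_i)}{\Gamma\left( \sum_{i=1}^{\dim(x)} x_i \right)}, \tag{5}
\]

where $\Gamma(\cdot)$ denotes the Gamma function. Then the Gibbs sampling equation can be derived as

\begin{align}
p(z_{d,n} = k \mid z_{-d,n}, w, y)
&\propto \frac{p(z, w, y)}{p(z_{-d,n}, w_{-d,n}, y)} \tag{6} \\
&\propto \frac{\Delta(N_k + \beta)}{\Delta(N_k^{-d,n} + \beta)} \cdot \frac{\Delta(N_d + \alpha)}{\Delta(N_d^{-d,n} + \alpha)} \prod_{d'} \frac{\exp\left( -\frac{(c\zeta_{d,d'} + \lambda_{d,d'})^2}{2\lambda_{d,d'}} \right)}{\exp\left( -\frac{(c\zeta_{d,d'}^{-d,n} + \lambda_{d,d'})^2}{2\lambda_{d,d'}} \right)} \tag{7} \\
&\propto \frac{N_{k,v}^{-d,n} + \beta}{N_{k,\cdot}^{-d,n} + V\beta} \left( N_{d,k}^{-d,n} + \alpha \right) \prod_{d'} \exp\left( -\frac{(c\zeta_{d,d'} + \lambda_{d,d'})^2}{2\lambda_{d,d'}} \right). \tag{8}
\end{align}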

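In Equations 6 to 8, $N_{k,v}$ denotes the count of word $v$ assigned to topic $k$, with $v = w_{d,n}$ the word of the token being sampled; $N_{d,k}$ is the number of tokens in document $d$ that are assigned to topic $k$. Marginal counts are denoted by $\cdot$; the superscript $-d,n$ denotes that the count excludes token $n$ in document $d$; $d'$ ranges over the indexes of documents which are actually linked to document $d$.

The next step is to expand the hinge loss term. Using $y_{d,d'}^2 = 1$ and dropping factors that do not depend on $z_{d,n}$,

\begin{align}
\exp\left( -\frac{(c\zeta_{d,d'} + \lambda_{d,d'})^2}{2\lambda_{d,d'}} \right)
&\propto \exp\left( -\frac{c^2 \zeta_{d,d'}^2 + 2\lambda_{d,d'} c \zeta_{d,d'}}{2\lambda_{d,d'}} \right) \tag{9} \\
&\propto \exp\Bigg( -\frac{1}{2\lambda_{d,d'}} \Big[ c^2 \Big( -2 y_{d,d'} \left( \eta^T (\bar{z}_d \circ \bar{z}_{d'}) + \tau^T (\bar{w}_d \circ \bar{w}_{d'}) \right) + \left( \eta^T (\bar{z}_d \circ \bar{z}_{d'}) + \tau^T (\bar{w}_d \circ \bar{w}_{d'}) \right)^2 \Big) \notag \\
&\qquad\qquad - 2\lambda_{d,d'} c\, y_{d,d'} \left( \eta^T (\bar{z}_d \circ \bar{z}_{d'}) + \tau^T (\bar{w}_d \circ \bar{w}_{d'}) \right) \Big] \Bigg) \tag{10} \\
&\propto \exp\left( \frac{c\, y_{d,d'} (c + \lambda_{d,d'})}{\lambda_{d,d'}} \eta^T (\bar{z}_d \circ \bar{z}_{d'}) - \frac{c^2}{2\lambda_{d,d'}} \left( \eta^T (\bar{z}_d \circ \bar{z}_{d'}) + \tau^T (\bar{w}_d \circ \bar{w}_{d'}) \right)^2 \right). \tag{11}
\end{align}

With $z_{d,n} = k$, the two inner products can be written in terms of counts, where $N_{d,v}$ is the number of occurrences of word $v$ in document $d$:

\[
\eta^T (\bar{z}_d \circ \bar{z}_{d'}) = \frac{\sum_{k'=1}^{K} \eta_{k'} N_{d,k'}^{-d,n} N_{d',k'} + \eta_k N_{d',k}}{N_{d,\cdot} N_{d',\cdot}}, \qquad
\tau^T (\bar{w}_d \circ \bar{w}_{d'}) = \frac{\sum_{v=1}^{V} \tau_v N_{d,v} N_{d',v}}{N_{d,\cdot} N_{d',\cdot}}.
\]

Substituting these into Equation 11 gives

\begin{align}
&\exp\Bigg( \frac{c\, y_{d,d'} (c + \lambda_{d,d'})}{\lambda_{d,d'}} \cdot \frac{\sum_{k'=1}^{K} \eta_{k'} N_{d,k'}^{-d,n} N_{d',k'} + \eta_k N_{d',k}}{N_{d,\cdot} N_{d',\cdot}} \notag \\
&\qquad - \frac{c^2}{2\lambda_{d,d'}} \left( \frac{\sum_{k'=1}^{K} \eta_{k'} N_{d,k'}^{-d,n} N_{d',k'} + \eta_k N_{d',k} + \sum_{v=1}^{V} \tau_v N_{d,v} N_{d',v}}{N_{d,\cdot} N_{d',\cdot}} \right)^2 \Bigg) \tag{12--14} \\
&\propto \exp\Bigg( \frac{c\, y_{d,d'} (c + \lambda_{d,d'})}{\lambda_{d,d'}} \cdot \frac{\eta_k N_{d',k}}{N_{d,\cdot} N_{d',\cdot}} \notag \\
&\qquad - \frac{c^2}{2\lambda_{d,d'}} \left[ \left( \frac{\eta_k N_{d',k}}{N_{d,\cdot} N_{d',\cdot}} \right)^2 + \frac{2 \eta_k N_{d',k}}{N_{d,\cdot} N_{d',\cdot}} \cdot \frac{\sum_{k'=1}^{K} \eta_{k'} N_{d,k'}^{-d,n} N_{d',k'} + \sum_{v=1}^{V} \tau_v N_{d,v} N_{d',v}}{N_{d,\cdot} N_{d',\cdot}} \right] \Bigg) \tag{15--16} \\
&= \exp\Bigg( \frac{c\, y_{d,d'} (c + \lambda_{d,d'})}{\lambda_{d,d'}} \cdot \frac{\eta_k N_{d',k}}{N_{d,\cdot} N_{d',\cdot}} \tag{17} \\
&\qquad - \frac{c^2}{2\lambda_{d,d'}} \cdot \frac{\eta_k^2 N_{d',k}^2 + 2 \eta_k N_{d',k} \left( \sum_{k'=1}^{K} \eta_{k'} N_{d,k'}^{-d,n} N_{d',k'} + \sum_{v=1}^{V} \tau_v N_{d,v} N_{d',v} \right)}{N_{d,\cdot}^2 N_{d',\cdot}^2} \Bigg). \tag{18}
\end{align}

In the sampling process, we only consider linked documents, which means that $y_{d,d'} = 1$, so $y_{d,d'}$ can be removed from the sampling equation.

2 Optimizing Topic and Lexical Regression Parameters

Assuming that each element of the topic regression parameters $\eta$ and the lexical regression parameters $\tau$ is given a Gaussian prior $\mathcal{N}(0, \nu^2)$, the likelihood of $\eta$ and $\tau$ is computed as

\[
p(\eta, \tau \mid z, w, \lambda) \propto \prod_{k=1}^{K} \exp\left( -\frac{\eta_k^2}{2\nu^2} \right) \prod_{v=1}^{V} \exp\left( -\frac{\tau_v^2}{2\nu^2} \right) \prod_{d,d'} \exp\left( -\frac{(\lambda_{d,d'} + c\zeta_{d,d'})^2}{2\lambda_{d,d'}} \right). \tag{19}
\]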
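Returning to the expanded hinge loss factor of Section 1, the claim behind Equations 17 and 18 is that, for a linked pair ($y_{d,d'} = 1$), the expanded exponent differs from the full exponent $-(c\zeta_{d,d'} + \lambda_{d,d'})^2 / (2\lambda_{d,d'})$ only by a term that does not depend on the candidate topic $k$. The following is a minimal numerical sketch of that claim, not the authors' implementation. All function names, argument names, and toy values are hypothetical; `s_tau` stands for $\sum_v \tau_v N_{d,v} N_{d',v}$.

```python
def direct_exponent(k, eta, n_dk_excl, n_dp_k, s_tau, n_d, n_dp, c, lam):
    """Full exponent -(c*zeta + lam)^2 / (2*lam) with token n set to topic k (y = 1)."""
    counts = list(n_dk_excl)
    counts[k] += 1  # put the current token back, assigned to topic k
    w = (sum(e * a * b for e, a, b in zip(eta, counts, n_dp_k)) + s_tau) / (n_d * n_dp)
    zeta = 1.0 - w  # Equation 2 with y = 1
    return -((c * zeta + lam) ** 2) / (2.0 * lam)


def expanded_exponent(k, eta, n_dk_excl, n_dp_k, s_tau, n_d, n_dp, c, lam):
    """Exponent of Equations 17 and 18 with y = 1."""
    s_eta = sum(e * a * b for e, a, b in zip(eta, n_dk_excl, n_dp_k))
    x = eta[k] * n_dp_k[k]
    denom = n_d * n_dp
    linear = c * (c + lam) / lam * x / denom
    quadratic = c * c / (2.0 * lam) * (x * x + 2.0 * x * (s_eta + s_tau)) / (denom * denom)
    return linear - quadratic


# Hypothetical toy values: K = 3 topics, document d has 6 tokens
# (5 excluding the current one), linked document d' has 7 tokens.
eta = [0.8, -0.3, 1.5]
n_dk_excl = [2, 0, 3]
n_dp_k = [1, 4, 2]
args = (eta, n_dk_excl, n_dp_k, 0.9, 6, 7, 1.0, 0.7)

# If the expansion is right, this gap is the same for every topic k.
gaps = [direct_exponent(k, *args) - expanded_exponent(k, *args) for k in range(3)]
print(gaps)
```

Because the gap is constant in $k$, it cancels when the sampling weights of Equation 8 are normalized, which is why only the terms in Equations 17 and 18 need to be evaluated per candidate topic.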

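From Equation 19, the log likelihood $L(\eta, \tau)$ is, up to an additive constant,

\begin{align}
L(\eta, \tau) &= -\sum_{k=1}^{K} \frac{\eta_k^2}{2\nu^2} - \sum_{v=1}^{V} \frac{\tau_v^2}{2\nu^2} - \sum_{d,d'} \frac{(\lambda_{d,d'} + c\zeta_{d,d'})^2}{2\lambda_{d,d'}} + \text{const} \tag{20} \\
&= -\sum_{k=1}^{K} \frac{\eta_k^2}{2\nu^2} - \sum_{v=1}^{V} \frac{\tau_v^2}{2\nu^2} - \sum_{d,d'} \frac{c^2 \zeta_{d,d'}^2 + 2c\lambda_{d,d'} \zeta_{d,d'}}{2\lambda_{d,d'}} + \text{const}. \tag{21}
\end{align}

It can be further expanded, using $y_{d,d'} = 1$ for linked pairs, as

\begin{align}
L(\eta, \tau) &= -\sum_{k=1}^{K} \frac{\eta_k^2}{2\nu^2} - \sum_{v=1}^{V} \frac{\tau_v^2}{2\nu^2} \notag \\
&\quad - \sum_{d,d'} \frac{c^2 \left( 1 - \eta^T (\bar{z}_d \circ \bar{z}_{d'}) - \tau^T (\bar{w}_d \circ \bar{w}_{d'}) \right)^2 + 2c\lambda_{d,d'} \left( 1 - \eta^T (\bar{z}_d \circ \bar{z}_{d'}) - \tau^T (\bar{w}_d \circ \bar{w}_{d'}) \right)}{2\lambda_{d,d'}} + \text{const} \tag{22--23} \\
&= -\sum_{k=1}^{K} \frac{\eta_k^2}{2\nu^2} - \sum_{v=1}^{V} \frac{\tau_v^2}{2\nu^2} \notag \\
&\quad + \sum_{d,d'} \frac{2c(c + \lambda_{d,d'}) \left( \eta^T (\bar{z}_d \circ \bar{z}_{d'}) + \tau^T (\bar{w}_d \circ \bar{w}_{d'}) \right) - c^2 \left( \eta^T (\bar{z}_d \circ \bar{z}_{d'}) + \tau^T (\bar{w}_d \circ \bar{w}_{d'}) \right)^2}{2\lambda_{d,d'}} + \text{const}. \tag{24--25}
\end{align}

Let

\[
W_{d,d'} = \eta^T (\bar{z}_d \circ \bar{z}_{d'}) + \tau^T (\bar{w}_d \circ \bar{w}_{d'}), \tag{26}
\]

then $L(\eta, \tau)$ is

\[
L(\eta, \tau) = -\sum_{k=1}^{K} \frac{\eta_k^2}{2\nu^2} - \sum_{v=1}^{V} \frac{\tau_v^2}{2\nu^2} + \sum_{d,d'} \frac{2c(c + \lambda_{d,d'}) W_{d,d'} - c^2 W_{d,d'}^2}{2\lambda_{d,d'}} + \text{const}. \tag{27}
\]

The next step is to compute the derivatives. We first compute the derivatives of $W_{d,d'}$, where $N_{d,v}$ is the number of occurrences of word $v$ in document $d$:

\begin{align}
\frac{\partial W_{d,d'}}{\partial \eta_k} &= \frac{N_{d,k} N_{d',k}}{N_{d,\cdot} N_{d',\cdot}} \tag{28} \\
\frac{\partial W_{d,d'}}{\partial \tau_v} &= \frac{N_{d,v} N_{d',v}}{N_{d,\cdot} N_{d',\cdot}} \tag{29} \\
\frac{\partial W_{d,d'}^2}{\partial \eta_k} &= 2 W_{d,d'} \frac{\partial W_{d,d'}}{\partial \eta_k} = \frac{2 W_{d,d'} N_{d,k} N_{d',k}}{N_{d,\cdot} N_{d',\cdot}} \tag{30} \\
\frac{\partial W_{d,d'}^2}{\partial \tau_v} &= 2 W_{d,d'} \frac{\partial W_{d,d'}}{\partial \tau_v} = \frac{2 W_{d,d'} N_{d,v} N_{d',v}}{N_{d,\cdot} N_{d',\cdot}}. \tag{31}
\end{align}

Therefore, the derivatives are

\begin{align}
\frac{\partial L(\eta, \tau)}{\partial \eta_k} &= -\frac{\eta_k}{\nu^2} + \sum_{d,d'} \frac{c N_{d,k} N_{d',k} \left( c + \lambda_{d,d'} - c W_{d,d'} \right)}{\lambda_{d,d'} N_{d,\cdot} N_{d',\cdot}} \tag{32} \\
\frac{\partial L(\eta, \tau)}{\partial \tau_v} &= -\frac{\tau_v}{\nu^2} + \sum_{d,d'} \frac{c N_{d,v} N_{d',v} \left( c + \lambda_{d,d'} - c W_{d,d'} \right)}{\lambda_{d,d'} N_{d,\cdot} N_{d',\cdot}}. \tag{33}
\end{align}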
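The objective in Equation 27 and the gradients in Equations 32 and 33 are the two quantities an L-BFGS routine needs. The sketch below evaluates both and compares the analytic gradient against central finite differences. It is an illustration under stated assumptions rather than the authors' code: the function names and the toy inputs are hypothetical, and each linked pair is summarized by precomputed vectors `u` and `v` holding $N_{d,k} N_{d',k} / (N_{d,\cdot} N_{d',\cdot})$ and $N_{d,v} N_{d',v} / (N_{d,\cdot} N_{d',\cdot})$, so that $W_{d,d'} = \eta^T u + \tau^T v$.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def objective_and_gradient(eta, tau, pairs, c, nu):
    """Equation 27 (without the constant) and Equations 32 and 33.

    pairs holds one (u, v, lam) triple per linked document pair.
    """
    value = -(dot(eta, eta) + dot(tau, tau)) / (2.0 * nu * nu)
    grad_eta = [-e / (nu * nu) for e in eta]
    grad_tau = [-t / (nu * nu) for t in tau]
    for u, v, lam in pairs:
        w = dot(eta, u) + dot(tau, v)  # W_{d,d'}, Equation 26
        value += (2.0 * c * (c + lam) * w - c * c * w * w) / (2.0 * lam)
        coeff = c * (c + lam - c * w) / lam
        grad_eta = [g + coeff * uk for g, uk in zip(grad_eta, u)]
        grad_tau = [g + coeff * vj for g, vj in zip(grad_tau, v)]
    return value, grad_eta, grad_tau


# Hypothetical toy problem: K = 2 topics, V = 3 word types, two linked pairs.
eta, tau = [0.4, -0.2], [0.1, 0.0, -0.3]
pairs = [([0.20, 0.05], [0.02, 0.10, 0.00], 0.8),
         ([0.00, 0.30], [0.05, 0.01, 0.04], 1.7)]
c, nu = 1.0, 1.0

_, grad_eta, grad_tau = objective_and_gradient(eta, tau, pairs, c, nu)


def numeric_partial(index, step=1e-5):
    """Central finite difference over the concatenated vector (eta, tau)."""
    k = len(eta)
    upper = eta + tau
    lower = eta + tau
    upper[index] += step
    lower[index] -= step
    f_upper = objective_and_gradient(upper[:k], upper[k:], pairs, c, nu)[0]
    f_lower = objective_and_gradient(lower[:k], lower[k:], pairs, c, nu)[0]
    return (f_upper - f_lower) / (2.0 * step)


analytic = grad_eta + grad_tau
max_error = max(abs(numeric_partial(i) - g) for i, g in enumerate(analytic))
print(max_error)
```

Since L-BFGS implementations are usually written as minimizers, the negated value and negated gradient would be passed to the optimizer when maximizing $L(\eta, \tau)$.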

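3 Sampling Latent Variables

The likelihood of the latent variable $\lambda_{d,d'}$ is

\begin{align}
p(\lambda_{d,d'} \mid z, \eta, \tau)
&\propto \frac{1}{\sqrt{2\pi\lambda_{d,d'}}} \exp\left( -\frac{(\lambda_{d,d'} + c\zeta_{d,d'})^2}{2\lambda_{d,d'}} \right) \tag{34} \\
&\propto \frac{1}{\sqrt{2\pi\lambda_{d,d'}}} \exp\left( -\frac{1}{2} \left( \frac{c^2 \zeta_{d,d'}^2}{\lambda_{d,d'}} + \lambda_{d,d'} \right) \right) \tag{35} \\
&= \mathrm{GIG}\left( \lambda_{d,d'}; \tfrac{1}{2}, 1, c^2 \zeta_{d,d'}^2 \right), \tag{36}
\end{align}

where GIG is the generalized inverse Gaussian distribution, which is defined as

\[
\mathrm{GIG}(x; p, a, b) = C(p, a, b)\, x^{p-1} \exp\left( -\frac{1}{2} \left( \frac{b}{x} + ax \right) \right). \tag{37}
\]

We can sample $\lambda_{d,d'}^{-1}$ from an inverse Gaussian distribution

\[
p(\lambda_{d,d'}^{-1} \mid z, \eta, \tau) = \mathrm{IG}\left( \lambda_{d,d'}^{-1}; \frac{1}{c\,|\zeta_{d,d'}|}, 1 \right), \tag{38}
\]

where

\[
\mathrm{IG}(x; a, b) = \sqrt{\frac{b}{2\pi x^3}} \exp\left( -\frac{b(x - a)^2}{2a^2 x} \right) \tag{39}
\]

for $a > 0$ and $b > 0$.

4 Sampling Process

The general sampling process of Lex-MED-RTM is given in Algorithm 1, which is similar to MED-LDA [2].

Algorithm 1 Sampling Process
1: set $\lambda = 1$ and draw $z_{d,n}$ from a uniform distribution
2: for $m = 1$ to $M$ do
3:   optimize $\eta$ and $\tau$ using L-BFGS (Equations 27, 32 and 33)
4:   for $d = 1$ to $D$ do
5:     for each word $n$ in document $d$ do
6:       draw a topic $z_{d,n}$ from the multinomial distribution (Equations 8, 17 and 18)
7:     end for
8:     for each document $d'$ which document $d$ links do
9:       draw $\lambda_{d,d'}^{-1}$ and then $\lambda_{d,d'}$ from the inverse Gaussian distribution (Equation 38)
10:     end for
11:   end for
12: end for

The sampling process starts from the initialization of $\lambda$ and the topic assignments. In each iteration, $\eta$ and $\tau$ are optimized by feeding their likelihood and derivatives to L-BFGS (MALLET provides an implementation). When sampling for documents, we first sample each word's topic assignment. Then, for each $\lambda_{d,d'}$, we sample its reciprocal from the inverse Gaussian distribution.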
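Line 9 of Algorithm 1 can be sketched with the standard library alone. The example below draws $\lambda_{d,d'}^{-1}$ from $\mathrm{IG}(1 / (c\,|\zeta_{d,d'}|), 1)$ as in Equation 38 and inverts it. The inverse Gaussian sampler uses the transformation method of Michael, Schucany and Haas, which is a choice made for this sketch and is not named in the derivation; the function names and toy values are likewise hypothetical, and the code assumes $\zeta_{d,d'} \neq 0$.

```python
import math
import random


def sample_inverse_gaussian(mean, shape, rng):
    """Draw from IG(mean, shape) (Equation 39 with a = mean, b = shape)
    using the Michael-Schucany-Haas transformation method."""
    y = rng.gauss(0.0, 1.0) ** 2
    x = (mean
         + mean * mean * y / (2.0 * shape)
         - mean / (2.0 * shape) * math.sqrt(4.0 * mean * shape * y + mean * mean * y * y))
    # Choose between the two roots with the method's acceptance probability.
    if rng.random() <= mean / (mean + x):
        return x
    return mean * mean / x


def sample_lambda(c, zeta, rng):
    """Equation 38: draw lambda^{-1} ~ IG(1 / (c |zeta|), 1), then return lambda.
    Assumes zeta != 0; a full implementation would need to guard that case."""
    inverse = sample_inverse_gaussian(1.0 / (c * abs(zeta)), 1.0, rng)
    return 1.0 / inverse


# Hypothetical toy setting: c = 1 and zeta = 0.5.
rng = random.Random(0)
draws = [sample_lambda(1.0, 0.5, rng) for _ in range(100_000)]
mean_lambda = sum(draws) / len(draws)
print(mean_lambda)
```

A quick sanity check is available because the reciprocal of an $\mathrm{IG}(a, b)$ variable has mean $1/a + 1/b$, so the sample mean of $\lambda_{d,d'}$ should be close to $c\,|\zeta_{d,d'}| + 1$, which is 1.5 for the toy values above.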

References

[1] Nicholas G. Polson and Steven L. Scott. Data augmentation for support vector machines. Bayesian Analysis, 6(1):1-23, 2011.
[2] Jun Zhu, Ning Chen, Hugh Perkins, and Bo Zhang. Gibbs max-margin topic models with data augmentation. The Journal of Machine Learning Research, 15:1073-1110, 2014.
