Accelerated Training of Max-Margin Markov Networks with Kernels

Xinhua Zhang
University of Alberta, Alberta Innovates Centre for Machine Learning (AICML)
Joint work with Ankan Saha (Univ. of Chicago) and SVN Vishwanathan (Purdue Univ.)

Outline
Objective of max-margin Markov network (M³N)
Smoothing for M³N
Excessive gap technique in general, and problem for M³N
Bregman divergence for prox-function
  Retain the accelerated rates
  Efficient computation by graphical model factorization
Kernelization
Conclusion

Structured output prediction
Structured feature/label joint map
Linear discriminant
Structured label loss
Hinge loss: the loss-augmented discriminant, maximized over the labels
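The formulas for this slide are not preserved in the transcript. A minimal sketch of the standard structured hinge-loss setup, where every symbol ($\phi$, $w$, $\ell$, $\mathcal{Y}$, $(x^i, y^i)$) is notation assumed here for illustration rather than taken from the slide:

```latex
\begin{align*}
\text{joint feature map:}\quad & \phi(x, y) \in \mathbb{R}^p \\
\text{linear discriminant:}\quad & F(x, y; w) = \langle w, \phi(x, y) \rangle \\
\text{label loss:}\quad & \ell(y, y^i) \ge 0 \quad\text{with}\quad \ell(y^i, y^i) = 0 \\
\text{hinge loss:}\quad & \max_{y \in \mathcal{Y}} \big\{ \ell(y, y^i) + \langle w, \phi(x^i, y) \rangle \big\} - \langle w, \phi(x^i, y^i) \rangle
\end{align*}
```

The term inside the max is the loss-augmented discriminant; the max runs over every label in $\mathcal{Y}$.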

Max-margin Markov networks and conditional random fields
Structured feature/label joint map
Linear discriminant
Structured label loss
Resulting objectives: M³N and CRF
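The two objectives compared on this slide are missing from the transcript. In their usual regularized form they can be sketched as follows, assuming a regularization constant $\lambda$, $n$ training pairs $(x^i, y^i)$, a joint feature map $\phi$ and a label loss $\ell$ (all of this notation is an assumption):

```latex
\begin{align*}
\text{M}^3\text{N:}\quad & J(w) = \frac{\lambda}{2}\|w\|^2
  + \frac{1}{n}\sum_{i=1}^{n} \max_{y \in \mathcal{Y}}
    \big\{ \ell(y, y^i) - \langle w, \phi(x^i, y^i) - \phi(x^i, y) \rangle \big\} \\
\text{CRF:}\quad & J(w) = \frac{\lambda}{2}\|w\|^2
  + \frac{1}{n}\sum_{i=1}^{n} \log \sum_{y \in \mathcal{Y}}
    \exp\big( -\langle w, \phi(x^i, y^i) - \phi(x^i, y) \rangle \big)
\end{align*}
```

The M³N risk takes a hard max over labels, while the CRF risk takes a log-sum-exp (a soft max) over the same label space.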

Max-margin Markov networks: major challenges
Large label space, so the factorization must be (carefully) kept
Loss is not smooth

Factorization for structured output space
Feature factorization
Loss factorization
Probability factorization
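Why factorization matters can be illustrated on a chain-structured label space: when the score decomposes into node and edge terms, the max over exponentially many labelings reduces to dynamic programming. The sketch below is only an illustration under that assumption; the names `node`, `edge`, `brute_force_max` and `viterbi_max` are hypothetical and not from the slides.

```python
import itertools
import numpy as np

def brute_force_max(node, edge):
    """Maximize the chain score by enumerating all K**T labelings."""
    T, K = node.shape
    best = -np.inf
    for y in itertools.product(range(K), repeat=T):
        s = sum(node[t, y[t]] for t in range(T))
        s += sum(edge[y[t], y[t + 1]] for t in range(T - 1))
        best = max(best, s)
    return best

def viterbi_max(node, edge):
    """Same maximum via dynamic programming in O(T * K**2) time."""
    T, K = node.shape
    m = node[0].copy()
    for t in range(1, T):
        # m[j] becomes the best score of any prefix ending in label j at position t
        m = node[t] + np.max(m[:, None] + edge, axis=0)
    return m.max()

rng = np.random.default_rng(0)
node = rng.normal(size=(6, 3))  # node[t, k]: score of label k at position t
edge = rng.normal(size=(3, 3))  # edge[j, k]: score of the transition j -> k
print(brute_force_max(node, edge), viterbi_max(node, edge))
```

Both functions return the same value, but the enumeration visits 3**6 labelings while the dynamic program does 5 steps of 3x3 work.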

Non-smooth solvers: state of the art for M³N
Algorithms compared by rate of convergence:
BMRM / SVM-Struct
Extragradient
Exponentiated Gradient
SMO
Our algorithm

Intuition of smoothing
Find a smooth and tight approximation of the non-smooth objectives
Q: Is there a general procedure for smoothing?
A: Fenchel conjugation

Key observation
Loss for M³N has rich structure (though non-smooth)
Empirical risk can be rewritten in terms of A, a matrix stacking the features
It can be further written using a convex function with a compact domain
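The rewritten form itself is missing from the transcript. One way to reconstruct it, assuming the notation $\psi^i_y := \phi(x^i, y^i) - \phi(x^i, y)$, $\ell^i_y := \ell(y, y^i)$, and $Q$ for the product of $n$ probability simplices over $\mathcal{Y}$ (all assumed here, not taken from the slide), is:

```latex
\begin{align*}
R_{\mathrm{emp}}(w)
  &= \frac{1}{n} \sum_{i=1}^{n} \max_{y \in \mathcal{Y}}
     \big\{ \ell^i_y - \langle w, \psi^i_y \rangle \big\} \\
  &= \max_{\alpha \in Q} \; \frac{1}{n} \sum_{i=1}^{n} \sum_{y \in \mathcal{Y}}
     \alpha^i_y \big( \ell^i_y - \langle w, \psi^i_y \rangle \big) \\
  &= \max_{\alpha \in Q} \; \langle A^\top w, \alpha \rangle - g(\alpha)
   \;=\; g^*(A^\top w),
\end{align*}
```

where the column of $A$ indexed by $(i, y)$ is $-\psi^i_y / n$ and $g(\alpha) = -\frac{1}{n} \sum_{i, y} \ell^i_y \alpha^i_y$ is restricted to $Q$. Here $g$ is linear, hence convex, and its domain $Q$ is compact. The second line holds because a max over labels equals a max of the expectation over all distributions on the labels.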

Significance of this reformulation: smoothing
It helps us to design a tight and smooth approximation
Use a prox-function d
d is strongly convex with modulus 1 (wrt some norm on its domain)
Desirable property: the approximation has a Lipschitz continuous gradient (lcg)
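The smoothed function defined on this slide is lost in the transcript. A sketch of the usual prox-function smoothing, assuming the risk has the form $g^*(A^\top w)$ with $g$ convex on a compact domain $Q$, a smoothing parameter $\mu > 0$, and $d \ge 0$ on $Q$ (notation assumed here):

```latex
\begin{align*}
g^*_\mu(u) &:= \max_{\alpha \in Q} \; \langle u, \alpha \rangle - g(\alpha) - \mu\, d(\alpha)
            \;=\; (g + \mu d)^*(u), \\
g^*_\mu(u) &\le g^*(u) \le g^*_\mu(u) + \mu \max_{\alpha \in Q} d(\alpha)
            \qquad \text{(tight)}, \\
\nabla g^*_\mu &\ \text{is Lipschitz continuous with constant } 1/\mu
            \qquad \text{(smooth)}.
\end{align*}
```

Smaller $\mu$ gives a tighter approximation but a larger Lipschitz constant, which is the trade-off the accelerated method has to manage.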

Example approximation: tight and smooth
Example: hinge loss
Entropic prox-function: logistic loss
Quadratic prox-function
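The two smoothings of the binary hinge loss can be checked numerically. The sketch below assumes the hinge is written as max over a in [0, 1] of a*(1 - z), with the entropic prox-function a*log(a) + (1-a)*log(1-a) + log(2) and the quadratic prox-function a**2/2. These specific choices, and the function names, are assumptions made for illustration and may differ from the slide.

```python
import numpy as np

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

def hinge_entropic(z, mu):
    # Closed form of max_a a*(1-z) - mu*(a log a + (1-a) log(1-a) + log 2):
    # a logistic-type loss, shifted down by mu*log(2).
    u = (1.0 - z) / mu
    return mu * (np.logaddexp(0.0, u) - np.log(2.0))

def hinge_quadratic(z, mu):
    # Closed form of max_a a*(1-z) - mu*a**2/2: quadratic near the kink, linear beyond.
    u = 1.0 - z
    a = np.clip(u / mu, 0.0, 1.0)  # maximizing a
    return a * u - 0.5 * mu * a ** 2

z = np.linspace(-3.0, 3.0, 601)
for mu in (1.0, 0.1, 0.01):
    gap_e = np.max(hinge(z) - hinge_entropic(z, mu))
    gap_q = np.max(hinge(z) - hinge_quadratic(z, mu))
    print(f"mu={mu}: largest entropic gap {gap_e:.4f}, largest quadratic gap {gap_q:.4f}")
```

Under these assumptions both smoothed losses lie below the hinge, by at most mu*log(2) in the entropic case and mu/2 in the quadratic case, so they tighten as mu shrinks.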

Smoothing M³N into CRF
M³N loss
Use the entropic prox-function; the smoothed loss then takes the CRF form
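The effect of entropic smoothing on a max over labels can be sketched numerically. The code assumes a small, explicitly enumerated label set with hypothetical loss-augmented scores `s`, and an entropic prox-function normalized to be nonnegative on the simplex; `smoothed_max` is an illustrative name, not from the slides.

```python
import numpy as np

def smoothed_max(s, mu):
    """mu * log(mean(exp(s / mu))): the max of s smoothed by an entropic prox-function."""
    m = s.max()  # subtract the max for numerical stability
    return m + mu * np.log(np.mean(np.exp((s - m) / mu)))

rng = np.random.default_rng(0)
s = rng.normal(size=50)  # hypothetical loss-augmented scores, one per label
for mu in (1.0, 0.1, 0.01):
    print(f"mu={mu}: max={s.max():.4f}, smoothed={smoothed_max(s, mu):.4f}")
```

The smoothed value is a log-sum-exp over labels, the same functional form as the CRF log-partition term, and it stays within mu*log(number of labels) below the true max.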

Excessive Gap Technique

Key technical challenges
Key challenges of excessive gap minimization:
Let the smoothing parameter approach 0 as rapidly as possible
Still allow the primal and dual iterates to be updated efficiently
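Why the smoothing parameter should shrink quickly can be seen from the excessive gap condition. The sketch below follows the general form of Nesterov's technique and assumes a primal objective $J(w) = f(w) + g^*(A^\top w)$, a prox-function $d \ge 0$ on $Q$, and iterates $(w_k, \alpha_k, \mu_k)$; this notation is assumed here and is not taken from the slides.

```latex
\begin{align*}
J_\mu(w) &:= f(w) + (g + \mu d)^*(A^\top w), \qquad
D(\alpha) := -g(\alpha) - f^*(-A\alpha), \\
\text{excessive gap condition:}\quad & J_{\mu_k}(w_k) \le D(\alpha_k), \\
\text{consequence:}\quad & 0 \le J(w_k) - D(\alpha_k) \le \mu_k \max_{\alpha \in Q} d(\alpha).
\end{align*}
```

The duality gap is therefore controlled by $\mu_k$ alone, so the faster $\mu_k$ can be driven to 0 while keeping the condition satisfied, the faster the method converges.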

Rates of convergence
Rates when using Euclidean prox-function
But the Euclidean prox-function does not work for M³N
Key issue: cannot maintain factorization in the updates
Need to evaluate the smooth objective
Maximizer must factorize over the graphical model
Intuition: the arithmetic mean of two iid densities is not iid
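The intuition about arithmetic means can be checked on a toy case. The sketch assumes two binary variables and reads "iid" as "the joint table equals the product of its marginals"; the helper names `joint` and `is_product` are hypothetical.

```python
import numpy as np

def joint(p, q):
    """Joint table of two independent binary variables with P(first=1)=p, P(second=1)=q."""
    return np.outer([1 - p, p], [1 - q, q])

def is_product(P):
    """True if the 2-D table P equals the outer product of its own marginals."""
    return np.allclose(P, np.outer(P.sum(axis=1), P.sum(axis=0)), atol=1e-12)

P1 = joint(0.9, 0.9)
P2 = joint(0.1, 0.1)
arith = 0.5 * (P1 + P2)  # arithmetic mean of two factorized distributions

print(is_product(P1), is_product(P2), is_product(arith))
```

Each input factorizes, but their arithmetic mean puts mass 0.41 on the outcome (0, 0) while the product of its marginals would put 0.25 there, so the factorization is lost.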

Using Bregman divergence prox-function
We show that the Bregman divergence maintains factorization
Intuition: the geometric mean of two iid densities is still iid
We show that the same rates hold for the Bregman divergence prox-function
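The geometric-mean intuition can be verified on a toy case. As a sketch, assume two binary variables and read "iid" as "the joint table equals the product of its marginals"; the helper names `joint` and `is_product` are hypothetical.

```python
import numpy as np

def joint(p, q):
    """Joint table of two independent binary variables with P(first=1)=p, P(second=1)=q."""
    return np.outer([1 - p, p], [1 - q, q])

def is_product(P):
    """True if the 2-D table P equals the outer product of its own marginals."""
    return np.allclose(P, np.outer(P.sum(axis=1), P.sum(axis=0)), atol=1e-12)

P1 = joint(0.9, 0.6)
P2 = joint(0.2, 0.3)
geo = np.sqrt(P1 * P2)  # elementwise geometric mean
geo /= geo.sum()        # renormalize to a probability table

print(is_product(geo))
```

The elementwise square root of a product of outer products is again an outer product, so the normalized geometric mean still factorizes.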

Comparison
Resulting rates: ours vs. (Collins et al., 2008)

Kernelization
The features enter the objective only via inner products
So kernelize on the joint feature map
Further factorize the kernel onto the graphical model
Key idea: implicitly represent the weight vector in terms of the dual variables
This factorizes over the graphical model
The required quantities can then be computed by using kernels
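The basic identity behind kernelization can be sketched in a few lines: if the weight vector is a linear combination of feature vectors, every inner product with it reduces to kernel evaluations. The example assumes a degree-2 polynomial kernel with an explicit feature map so both sides can be compared; it does not reproduce the clique-wise factorization of the kernel described on the slide, and `phi`, `k`, `alpha` are illustrative names.

```python
import numpy as np

def phi(x):
    """Explicit feature map of the homogeneous degree-2 polynomial kernel."""
    return np.outer(x, x).ravel()

def k(x, z):
    """Kernel k(x, z) = <x, z>**2, equal to <phi(x), phi(z)>."""
    return float(np.dot(x, z)) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # hypothetical training inputs
alpha = rng.normal(size=5)    # hypothetical expansion coefficients
x_new = rng.normal(size=3)

# Explicit route: build w = sum_i alpha_i * phi(x_i), then take an inner product.
w = sum(a * phi(x) for a, x in zip(alpha, X))
explicit = float(w @ phi(x_new))

# Implicit route: never form w, use kernel evaluations only.
implicit = sum(a * k(x, x_new) for a, x in zip(alpha, X))

print(explicit, implicit)
```

The two numbers agree up to rounding, which is what lets the weight vector stay implicit when the feature space is large or infinite-dimensional.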

Conclusion
Excessive gap technique enjoys accelerated rates
  But only shown for Euclidean prox-function
Euclidean prox-function is problematic for M³N
  Does not allow computations to factorize
We extend prox-function to Bregman divergence
  Efficient computation by graphical model factorization
  Improved rates compared with state-of-the-art M³N solvers
  Admits kernelization
