Algorithms for Predicting Structured Data
Slide 1: Algorithms for Predicting Structured Data
Thomas Gärtner (Fraunhofer IAIS) / Shankar Vembu (UIUC), ECML PKDD 2010

Slide 2: Structured Prediction
Predicting multiple outputs with complex internal structure and dependencies. Contrast with single-valued prediction problems such as binary classification and regression.

Slide 3: Part-of-Speech Tagging
Input: a sentence. Output: part-of-speech tags.
Example: "The brown fox chased the lazy dog" → DT JJ N VBD DT JJ N

Slide 4: Linguistic Parsing
Input: a sentence. Output: a parse tree.
Example: (S (NP (DT The) (JJ brown) (N fox)) (VP (VBD chased) (NP (DT the) (JJ lazy) (N dog))))

Slide 5: Machine Learning Problems
Multi-class prediction, multi-label classification, hierarchical classification, label ranking, ...

Slide 6: Real-World Applications
Natural language processing, computational biology, bioinformatics, computer vision, robotics, ...

Slide 7: Outline
Part I: From Binary to Structured Prediction (loss functions; algorithms)
Part II: Exact and Approximate Inference (positive and negative results for approximate inference; complexity of learning; new training methods)
Part III: Weakly Supervised Learning
Slide 8: Tutorial Focus
Discriminative methods in structured prediction; algorithmic techniques.

Slide 9: Topics Not Covered
Advanced optimization methods, connections to reinforcement learning, learning theory / generalization bounds, inference in graphical models, X-supervised learning.

Slide 10: Outline
Part I: Binary to Structured (loss functions; algorithms)

Slide 11: Road Map
Perceptron → structured perceptron
Regression → kernel dependency estimation
Support vector machines → structured SVMs
Logistic regression → conditional random fields
Slide 12: Preliminaries

Slide 13: Supervised Learning
Input and output spaces $\mathcal{X}$, $\mathcal{Y}$. Classification: $\mathcal{Y} = \{-1, +1\}$; regression: $\mathcal{Y} = \mathbb{R}$.
Training data $X = (x_1, \ldots, x_m) \in \mathcal{X}^m$, $Y = (y_1, \ldots, y_m) \in \mathcal{Y}^m$, drawn i.i.d. from an unknown distribution.
Function approximation $f : \mathcal{X} \to \mathcal{Y}$, aiming for good performance on unseen examples, where performance is measured w.r.t. a loss function.

Slide 14: Generative and Discriminative Learning
Generative learning (e.g., naive Bayes, HMM): model the joint probability $p(x, y)$ and predict using Bayes' rule,
$p(y \mid x) = \frac{p(x, y)}{\sum_{z \in \mathcal{Y}} p(x, z)} = \frac{p(x \mid y)\, p(y)}{\sum_{z \in \mathcal{Y}} p(x, z)}$.
Discriminative learning (e.g., logistic regression, CRF): model the conditional probability $p(y \mid x)$, or learn a mapping from inputs to outputs $f : \mathcal{X} \to \mathcal{Y}$ directly.

Slide 15: Regularized Risk Minimization
$\operatorname{argmin}_{f \in \mathcal{H}} \sum_{i=1}^{m} \ell(f(x_i), y_i) + \lambda\, \Omega(f)$
where $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ is a non-negative loss function, $\Omega : \mathcal{H} \to \mathbb{R}_+$ is a regularization function, and $\lambda > 0$ is the regularization parameter.

Slide 16: Regularized Risk Minimization
Example: linear classifiers $f(x) = \langle w, \phi(x) \rangle$:
$\operatorname{argmin}_{w \in \mathbb{R}^n} \sum_{i=1}^{m} \ell(\langle w, \phi(x_i) \rangle, y_i) + \lambda \|w\|_2^2$
where $\phi : \mathcal{X} \to \mathbb{R}^n$ is a feature map.

Slide 17: Structured Prediction
Input and output spaces $\mathcal{X}$, $\mathcal{Y}$. Training data $X = (x_1, \ldots, x_m) \in \mathcal{X}^m$, $Y = (y_1, \ldots, y_m) \in \mathcal{Y}^m$.
Joint scoring function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$.
Prediction: $\hat{y} = \operatorname{argmax}_{y \in \mathcal{Y}(x)} f(x, y) = \operatorname{argmax}_{y \in \mathcal{Y}(x)} \langle w, \phi(x, y) \rangle$.

Slide 18: Joint Feature Maps
Extract features from inputs AND outputs. Feature map $\phi : \mathcal{X} \times \mathcal{Y} \to \mathcal{H}$.
Joint input-output kernel: $k[(x, y), (x', y')] = \langle \phi(x, y), \phi(x', y') \rangle$.

Slide 19: Joint Feature Maps
Example (multi-label classification, hierarchical classification): $x \in \mathbb{R}^n$, $y \in \{0, 1\}^d$, $\phi(x, y) = x \otimes y$.
Tensor product kernel: $k[(x, y), (x', y')] = k_X(x, x')\, k_Y(y, y')$.
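To make the tensor-product construction concrete, here is a minimal NumPy sketch (the toy dimensions and names are mine, not from the tutorial): it builds $\phi(x, y) = x \otimes y$ via np.kron and checks that the joint kernel factorizes into $k_X \cdot k_Y$ for linear input and output kernels.

```python
import numpy as np

def joint_feature_map(x, y):
    """Tensor-product joint feature map phi(x, y) = x (x) y."""
    return np.kron(y, x)  # length n*d vector; block j equals y[j] * x

# Toy example: n = 3 input features, d = 2 output bits.
x, xp = np.array([1.0, 2.0, 0.5]), np.array([0.0, 1.0, 1.0])
y, yp = np.array([1.0, 0.0]), np.array([1.0, 1.0])

# Joint kernel computed two ways: explicitly, and factored as k_X * k_Y.
k_joint = joint_feature_map(x, y) @ joint_feature_map(xp, yp)
k_factored = (x @ xp) * (y @ yp)
assert np.isclose(k_joint, k_factored)  # tensor product kernel identity
```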
Slide 20: Loss Functions (Binary → Structured)

Slide 21: Regularized Risk Minimization
$\operatorname{argmin}_{f \in \mathcal{H}} \sum_{i=1}^{m} \ell(f(x_i), y_i) + \lambda\, \Omega(f)$
where $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ is a non-negative loss function, $\Omega : \mathcal{H} \to \mathbb{R}_+$ is a regularization function, and $\lambda > 0$ is the regularization parameter.

Slide 22: Zero-One Loss
Binary: $\ell_{0\text{-}1}(y, z) = 0$ if $y = z$, and $1$ otherwise.
Structured: $\ell_{\max}(f, (x, y)) = \Delta(\operatorname{argmax}_{z \in \mathcal{Y}} f(x, z),\, y)$.
Non-convex, non-differentiable, hard to optimize. Use surrogate losses instead: convex upper bounds.

Slide 23: Squared Loss
$\ell_{\mathrm{square}}(y, z) = (y - z)^2$, with $\mathcal{Y} = \mathbb{R}$. Used in regression problems. Its extension to structured prediction is non-trivial due to inter-dependencies among the multiple outputs (more on this later).

Slide 24: Hinge Loss
Binary: $\ell_{\mathrm{hinge}}(y, z) = \max(0, 1 - yz)$.
Structured: $\ell_{\mathrm{hinge}}(f, (x, y)) = \max_{z \in \mathcal{Y}} [\Delta(z, y) + f(x, z) - f(x, y)]$.

Slide 25: Logistic Loss
Binary: $\ell_{\log}(y, z) = \ln(1 + \exp(-yz))$.
Structured: $\ell_{\log}(f, (x, y)) = \ln\left[\sum_{z \in \mathcal{Y}} \exp(f(x, z))\right] - f(x, y)$.

Slide 26: Exponential Loss
Binary: $\ell_{\exp}(y, z) = \exp(-yz)$.
Structured: $\ell_{\exp}(f, (x, y)) = \sum_{z \in \mathcal{Y}} \exp[f(x, z) - f(x, y)]$.

Slide 27: Loss Functions
[Figure: the squared, hinge, logistic, and exponential losses plotted as functions of the margin $t = y \cdot f(x)$.]
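The four binary surrogates can be evaluated directly; a small sketch (my code, reproducing the quantities behind the plotted curves):

```python
import numpy as np

def surrogate_losses(t):
    """Binary surrogate losses as functions of the margin t = y * f(x)."""
    return {
        "square": (1.0 - t) ** 2,           # with y in {-1,+1}: (y - f)^2 = (1 - t)^2
        "hinge": np.maximum(0.0, 1.0 - t),  # hinge loss
        "log": np.log1p(np.exp(-t)),        # logistic loss
        "exp": np.exp(-t),                  # exponential loss
    }

for t in [-1.0, 0.0, 1.0, 2.0]:
    print(t, {k: round(float(v), 3) for k, v in surrogate_losses(t).items()})
```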
Slide 28: Learning Algorithms (parameter estimation)

Slide 29: Perceptron → Structured Perceptron

Slide 30: Perceptron
Online learning algorithm. Prediction: $\hat{y} = f(x) = \operatorname{sgn}\langle w, x \rangle$.
Algorithm: initialize $w = 0$. For $t = 1, \ldots, T$:
1. receive input $x_t$
2. predict $\hat{y}_t = \operatorname{sgn}\langle w, x_t \rangle$
3. receive true label $y_t \in \{-1, +1\}$
4. if $y_t \neq \hat{y}_t$, update $w \leftarrow w + y_t x_t$
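A runnable version of this update rule, as a sketch on synthetic linearly separable data (the data generation and names are my assumptions):

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Online perceptron: w <- w + y_t * x_t on each mistake."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            if y_t * (w @ x_t) <= 0:  # mistake: sgn<w, x_t> != y_t
                w += y_t * x_t
    return w

# Linearly separable toy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X @ np.array([2.0, -1.0]) + 1e-9)
w = perceptron(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```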
Slide 31: Mistake Bound (separable data) [Block, 62; Novikoff, 62]
Theorem. Let $(x_1, y_1), \ldots, (x_m, y_m)$ be a sequence of training examples with $\|x_i\| \leq R$ for all $i \in [[m]]$. Suppose there exists a unit-norm vector $u$ such that $y_i \langle u, x_i \rangle \geq \gamma$ for all the examples. Then the number of mistakes made by the perceptron algorithm on this sequence is at most $(R/\gamma)^2$.

Slide 32: Mistake Bound (inseparable data) [Freund and Schapire, 99]
Theorem. Let $(x_1, y_1), \ldots, (x_m, y_m)$ be a sequence of training examples with $\|x_i\| \leq R$ for all $i \in [[m]]$. Let $u$ be any unit-norm weight vector and let $\gamma > 0$. Define the deviation of each example as $d_i = \max(0, \gamma - y_i \langle u, x_i \rangle)$ and let $D = \sqrt{\sum_{i=1}^{m} d_i^2}$. Then the number of mistakes made by the perceptron algorithm on this sequence of examples is at most $\left(\frac{R + D}{\gamma}\right)^2$.

Slide 33: Constraint Classification (Har-Peled et al., ALT 2002)
Generalizes the perceptron to various problems including multi-class prediction, multi-label classification, and label ranking.
Example: multi-class prediction, $\mathcal{Y} = \{1, \ldots, d\}$. Weights $(w_1, \ldots, w_d) \in \mathbb{R}^{nd}$. Prediction: $\hat{y} = f(x) = \operatorname{argmax}_{i \in [[d]]} \langle w_i, x \rangle$.

Slide 34: Constraint Classification
Algorithm: initialize $(w_1, \ldots, w_d) = 0$. For $t = 1, \ldots, T$:
1. receive input $x_t$ and label $y_t \in [[d]]$
2. construct $\tilde{y}_t = \{(y_t, i) : i \in [[d]] \setminus \{y_t\}\}$
3. for all $(p, q) \in \tilde{y}_t$: if $\langle w_p - w_q, x_t \rangle < 0$, then $w_p \leftarrow w_p + x_t$ and $w_q \leftarrow w_q - x_t$

Slide 35: Structured Perceptron (Collins, EMNLP 2002)
Linear scoring function $f(x, y) = \langle w, \phi(x, y) \rangle$.
Algorithm: initialize $w = 0$. For $t = 1, \ldots, T$:
1. receive input $x_t$
2. predict $\hat{y}_t = \operatorname{argmax}_{y \in \mathcal{Y}} f(x_t, y)$
3. receive true label $y_t \in \mathcal{Y}$
4. update $w \leftarrow w + \phi(x_t, y_t) - \phi(x_t, \hat{y}_t)$
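A sketch of the structured perceptron on a toy multi-class problem, reusing the tensor-product feature map from above so that the argmax over $\mathcal{Y}$ is a plain enumeration (the setup is illustrative, not from the slides):

```python
import numpy as np

def phi(x, y, d):
    """Joint feature map: tensor product of x with a one-hot encoding of y."""
    e = np.zeros(d); e[y] = 1.0
    return np.kron(e, x)

def structured_perceptron(X, Y, d, epochs=10):
    w = np.zeros(d * X.shape[1])
    for _ in range(epochs):
        for x_t, y_t in zip(X, Y):
            # Inference: enumerate the (small) output space.
            y_hat = max(range(d), key=lambda z: w @ phi(x_t, z, d))
            if y_hat != y_t:
                w += phi(x_t, y_t, d) - phi(x_t, y_hat, d)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) + np.repeat(np.eye(2) * 3, 100, axis=0)  # two clusters
Y = np.repeat([0, 1], 100)
w = structured_perceptron(X, Y, d=2)
preds = [max(range(2), key=lambda z: w @ phi(x, z, 2)) for x in X]
print("accuracy:", np.mean(np.array(preds) == Y))
```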
Slide 36: Mistake Bound (separable data) [Collins, 02]
Theorem. Let $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$ be a sequence of training examples which is separable with margin $\gamma$. Let $R$ be a constant satisfying $\|\phi(x_i, y_i) - \phi(x_i, z)\| \leq R$ for all $i \in [[m]]$ and all $z \in \mathrm{GEN}(x_i)$. Then the number of mistakes made by the perceptron algorithm on this sequence of examples is at most $R^2 / \gamma^2$.

Slide 37: Mistake Bound (inseparable data) [Collins, 02]
For a training example $(x_i, y_i)$ and a $(v, \gamma)$ pair, define $m_i = \langle v, \phi(x_i, y_i) \rangle - \max_{z \in \mathrm{GEN}(x_i)} \langle v, \phi(x_i, z) \rangle$ and $\epsilon_i = \max\{0, \gamma - m_i\}$, and define $D_{v,\gamma} = \sqrt{\sum_{i=1}^{m} \epsilon_i^2}$. Then the number of mistakes made by the structured perceptron is at most $\min_{v,\gamma} \frac{(R + D_{v,\gamma})^2}{\gamma^2}$.

Slide 38: Regression → KDE

Slide 39: Regularized Least-Squares Regression
OPT: $\min_{w \in \mathbb{R}^n} \lambda \|w\|^2 + \sum_{i=1}^{m} (\langle w, x_i \rangle - y_i)^2$
Solution: $w = (X^\top X + \lambda I)^{-1} X^\top y$, where $X \in \mathbb{R}^{m \times n}$ is the data/input matrix, $y \in \mathbb{R}^{m \times 1}$ is the label/output vector, and $I$ is the identity matrix.
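The closed-form solution in a few lines of NumPy (a sketch; the data and the value of $\lambda$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                # data matrix, m x n
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)  # noisy targets

lam = 0.1
n = X.shape[1]
w = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)  # (X'X + lam*I)^{-1} X'y
print("recovered weights:", np.round(w, 2))
```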
Slide 40: Kernelized Version
OPT1: $\min_{f \in \mathcal{H}} \lambda \|f\|^2 + \sum_{i=1}^{m} (f(x_i) - y_i)^2$
Representer theorem: $f = \sum_{i=1}^{m} c_i\, k(x_i, \cdot)$
OPT2: $\min_{c \in \mathbb{R}^m} \lambda\, c^\top K c + \|y - Kc\|^2$
Solution: $c = (K + \lambda I)^{-1} y$
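The kernelized solution, here sketched with a Gaussian RBF kernel (the kernel choice and bandwidth are my assumptions, not from the slide):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||a_i - b_j||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)

lam = 0.1
K = rbf_kernel(X, X)
c = np.linalg.solve(K + lam * np.eye(len(X)), y)  # c = (K + lam*I)^{-1} y

X_test = np.linspace(-3, 3, 5)[:, None]
f_test = rbf_kernel(X_test, X) @ c                # f(x) = sum_i c_i k(x_i, x)
print(np.round(f_test, 2))
```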
Slide 41: Kernel Dependency Estimation (Weston et al., NIPS 2002)
Output feature map $\psi : \mathcal{Y} \to \mathcal{H}_Y$, with $k_Y(y, y') = \langle \psi(y), \psi(y') \rangle$.
Note: this makes it possible to consider a large class of non-linear loss functions in the output space.
KDE uses kernel PCA to "decorrelate" the outputs, and trains independent RLSRs.

Slide 42: Kernel Dependency Estimation (Weston et al., NIPS 2002)
Algorithm:
1. decompose $\{\psi(y_i) : i \in [[m]]\}$ into $k$ orthogonal directions using kernel PCA
2. learn $f_j : \mathcal{X} \to \mathbb{R}$ for each direction $j \in [[k]]$ using RLSR
3. predict by solving the pre-image problem:
$\hat{y}(x) = \operatorname{argmin}_{y \in \mathcal{Y}} \left\| [v_1^\top \psi(y), \ldots, v_k^\top \psi(y)] - [f_1(x), \ldots, f_k(x)] \right\|^2$

Slide 43: Kernel Dependency Estimation (Weston et al., NIPS 2002)
Note: the projection of $\psi(y)$ onto the $p$-th principal component $v_p = \sum_{i=1}^{m} \alpha_i^p \psi(y_i)$ is given by $v_p^\top \psi(y) = \sum_{i=1}^{m} \alpha_i^p k_Y(y_i, y)$, where $\alpha^p$ is the $p$-th eigenvector of the kernel matrix $K_Y$ (after centering it).
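An end-to-end KDE sketch on a toy problem with an output space small enough to brute-force the pre-image step. For simplicity it skips the kernel-matrix centering mentioned on the slide; the data, kernels, and candidate set are my illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, lam = 80, 2, 2, 0.1
X = rng.normal(size=(m, n))
Y = (X > 0).astype(float)                        # toy outputs in {0,1}^2

# Kernel PCA on the (uncentered, for simplicity) linear output kernel K_Y = Y Y^T.
Ky = Y @ Y.T
eigval, eigvec = np.linalg.eigh(Ky)
alphas = eigvec[:, -k:] / np.sqrt(np.maximum(eigval[-k:], 1e-12))

def project(y):
    """v_p^T psi(y) = sum_i alpha_i^p k_Y(y_i, y), for p = 1..k."""
    return alphas.T @ (Y @ y)

# Targets: projections of the training outputs; one ridge regression per direction.
P = np.array([project(y) for y in Y])            # m x k
W = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ P)

# Pre-image by enumeration over the small candidate set {0,1}^2.
candidates = [np.array(c, dtype=float) for c in [(0, 0), (0, 1), (1, 0), (1, 1)]]
def predict(x):
    f = W.T @ x
    return min(candidates, key=lambda y: np.sum((project(y) - f) ** 2))

acc = np.mean([np.array_equal(predict(x), y) for x, y in zip(X, Y)])
print("exact-match accuracy:", acc)
```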
Slide 44: SVM → Structured SVM

Slide 45: Support Vector Machine
Linear large-margin classifier $f(x) = \langle w, \phi(x) \rangle$; minimizes the hinge loss.
OPT: $\min_w \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \max\{0, 1 - y_i \langle w, \phi(x_i) \rangle\}$
OPT with slack variables:
$\min_{w, \xi} \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \xi_i$
s.t. $y_i \langle w, \phi(x_i) \rangle \geq 1 - \xi_i$ and $\xi_i \geq 0$ for all $i \in [[m]]$.

Slide 46: Support Vector Machine
SVM dual:
$\min_{c \in \mathbb{R}^m} \frac{1}{2} c^\top Y K Y c - \mathbf{1}^\top c$
s.t. $0 \leq c_i \leq \frac{1}{m\lambda}$ for all $i \in [[m]]$.
Representer theorem: $f(\cdot) = \sum_{i=1}^{m} c_i\, k(x_i, \cdot)$

Slide 47: Structured SVM / Max-Margin Markov Network
Minimizes the structured hinge loss. Two formulations: (1) slack rescaling, (2) margin rescaling.

Slide 48: Structured SVM (slack rescaling)
$\min_{w, \xi} \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \xi_i$
s.t. $\langle w, \phi(x_i, y_i) \rangle - \langle w, \phi(x_i, z) \rangle \geq 1 - \frac{\xi_i}{\Delta(y_i, z)}$ for all $z, i$; $\xi_i \geq 0$ for all $i \in [[m]]$.

Slide 49: Structured SVM (margin rescaling)
$\min_{w, \xi} \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \xi_i$
s.t. $\langle w, \phi(x_i, y_i) \rangle - \langle w, \phi(x_i, z) \rangle \geq \Delta(y_i, z) - \xi_i$ for all $z, i$; $\xi_i \geq 0$ for all $i \in [[m]]$.

Slide 50: Solving the OPT
Major issue: an exponential number of constraints. It suffices to design a sub-routine (loss-augmented inference) that computes
$\hat{y} = \operatorname{argmax}_{z \in \mathcal{Y}} \left[1 - \langle w, \phi(x, y) - \phi(x, z) \rangle\right] \Delta(z, y)$ (slack rescaling)
$\hat{y} = \operatorname{argmax}_{z \in \mathcal{Y}} \left[\Delta(z, y) - \langle w, \phi(x, y) - \phi(x, z) \rangle\right]$ (margin rescaling)
and to iteratively add constraints.
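When the output decomposes over positions and $\Delta$ is the Hamming loss, margin-rescaled loss-augmented inference for a model with only per-position (unary) scores reduces to a position-wise argmax. A sketch under exactly those assumptions (mine, not stated on the slide):

```python
import numpy as np

def loss_augmented_inference(scores, y_true):
    """Margin rescaling: argmax_z [ Hamming(z, y) + score(x, z) ] for a
    sequence model with only unary scores; decomposes per position."""
    T, L = scores.shape               # positions x labels
    aug = scores + 1.0                # +1 for every label...
    aug[np.arange(T), y_true] -= 1.0  # ...except the true one (Hamming loss)
    return aug.argmax(axis=1)

scores = np.array([[2.0, 1.9], [0.5, 0.0], [0.0, 3.0]])
y_true = np.array([0, 0, 1])
print(loss_augmented_inference(scores, y_true))  # most violating output
```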
Slide 51: Cutting-Plane Method (Tsochantaridis et al., JMLR 2005)
Input: $(x_1, y_1), \ldots, (x_m, y_m)$, $\lambda$, $\epsilon$
$S_i \leftarrow \emptyset$ for all $i \in [[m]]$
repeat
  for $i = 1, \ldots, m$ do
    $H(y) := \Delta(y_i, y) + w^\top \phi(x_i, y) - w^\top \phi(x_i, y_i)$
    compute $\hat{y} = \operatorname{argmax}_{y \in \mathcal{Y}} H(y)$
    compute $\xi_i = \max\{0, \max_{y \in S_i} H(y)\}$
    if $H(\hat{y}) > \xi_i + \epsilon$ then
      $S_i \leftarrow S_i \cup \{\hat{y}\}$
      $w \leftarrow$ optimize the primal over $\bigcup_i S_i$
    end if
  end for
until no $S_i$ has changed during an iteration
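A runnable toy rendering of this loop for multi-class classification, where the "optimize primal" step is approximated by subgradient descent over the active working sets rather than the QP solve used in the paper; everything here is a simplified sketch under my assumptions:

```python
import numpy as np

def phi(x, z, d=2):
    """One-hot tensor-product joint feature map."""
    e = np.zeros(d); e[z] = 1.0
    return np.kron(e, x)

def delta(y, z):
    return float(y != z)  # zero-one label loss

def optimize_primal(Xs, Ys, S, lam, steps=300, lr=0.1):
    """Approximate primal solve by subgradient descent over the working sets."""
    w = np.zeros(phi(Xs[0], Ys[0]).shape)
    m = len(Xs)
    for _ in range(steps):
        g = 2.0 * lam * w
        for i, (x, y) in enumerate(zip(Xs, Ys)):
            viol = [(delta(y, z) + w @ (phi(x, z) - phi(x, y)), z) for z in S[i]]
            if viol:
                v, z = max(viol)
                if v > 0:
                    g += (phi(x, z) - phi(x, y)) / m
        w = w - lr * g
    return w

def cutting_plane(Xs, Ys, labels=(0, 1), lam=0.01, eps=1e-3):
    w = np.zeros(phi(Xs[0], Ys[0]).shape)
    S = [set() for _ in Xs]
    changed = True
    while changed:
        changed = False
        for i, (x, y) in enumerate(zip(Xs, Ys)):
            H = lambda z: delta(y, z) + w @ (phi(x, z) - phi(x, y))
            y_hat = max(labels, key=H)             # loss-augmented inference
            xi = max([0.0] + [H(z) for z in S[i]])
            if H(y_hat) > xi + eps:                # a new most-violated constraint
                S[i].add(y_hat)
                w = optimize_primal(Xs, Ys, S, lam)
                changed = True
    return w

rng = np.random.default_rng(0)
Xs = np.vstack([rng.normal(size=(20, 2)) + [3, 0], rng.normal(size=(20, 2)) + [0, 3]])
Ys = [0] * 20 + [1] * 20
w = cutting_plane(Xs, Ys)
preds = [max((0, 1), key=lambda z: w @ phi(x, z)) for x in Xs]
print("accuracy:", np.mean([p == y for p, y in zip(preds, Ys)]))
```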
Slide 52: Theoretical Guarantees with Exact Inference (Tsochantaridis et al., 2005; Finley and Joachims, ICML 2008)
Polynomial-time termination: the algorithm terminates in a polynomial number of iterations.
Correctness: the algorithm solves OPT accurately to a desired precision $\epsilon$.
Empirical risk bound: $\frac{1}{m} \sum_{i=1}^{m} \xi_i$ upper-bounds the empirical risk.

Slide 53: Kernelized SSVM
$\min_{\alpha} \sum_{i, j \in [[m]],\, z, z' \in \mathcal{Y}} \alpha_{iz}\, \alpha_{jz'}\, k[(x_i, z), (x_j, z')] - \sum_{i, z} \Delta(y_i, z)\, \alpha_{iz}$
s.t. $\sum_{z \in \mathcal{Y}} \alpha_{iz} \leq \lambda$ for all $i \in [[m]]$; $\alpha_{iz} \geq 0$ for all $i \in [[m]]$, $z \in \mathcal{Y}$.
Representer theorem: $f(\cdot, \cdot) = \sum_{i \in [[m]],\, z \in \mathcal{Y}} \alpha_{iz}\, k[(x_i, z), (\cdot, \cdot)]$

Slide 54: Structured Perceptron Revisited
Linear scoring function $f(x, y) = \langle w, \phi(x, y) \rangle$.
Algorithm: initialize $w = 0$. For $t = 1, \ldots, T$:
1. receive input $x_t$
2. predict $\hat{y}_t = \operatorname{argmax}_{y \in \mathcal{Y}} f(x_t, y)$
3. receive true label $y_t \in \mathcal{Y}$
4. update $w \leftarrow w + \phi(x_t, y_t) - \phi(x_t, \hat{y}_t)$
Note: the update rule is not influenced by the loss function $\Delta(y, z)$.

Slide 55: Structured Perceptron Revisited
Replace inference with loss-augmented inference.
Algorithm: initialize $w = 0$. For $t = 1, \ldots, T$:
1. receive example pair $(x_t, y_t)$
2. predict $\hat{y}_t = \operatorname{argmax}_{y \in \mathcal{Y}} [f(x_t, y) + \Delta(y_t, y)]$
3. update $w \leftarrow w + \eta_t (\phi(x_t, y_t) - \phi(x_t, \hat{y}_t))$
Slide 56: Min-Max Formulation (Taskar et al., ICML 2005)
Recall the brute-force enumeration:
$\min_{w, \xi} \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \xi_i$
s.t. $\langle w, \phi(x_i, y_i) \rangle - \langle w, \phi(x_i, z) \rangle \geq \Delta(y_i, z) - \xi_i$ for all $z, i$; $\xi_i \geq 0$ for all $i \in [[m]]$.

Slide 57: Min-Max Formulation (Taskar et al., ICML 2005)
$\min_{w, \xi} \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \xi_i$
s.t. $\langle w, \phi(x_i, y_i) \rangle + \xi_i \geq \max_{z \in \mathcal{Y}} [\langle w, \phi(x_i, z) \rangle + \Delta(y_i, z)]$ for all $i \in [[m]]$; $\xi_i \geq 0$ for all $i \in [[m]]$.
Key steps: plug in LP inference; use LP duality to replace the max with a min; rewrite the OPT as a concise QP with a polynomial number of variables and constraints (depending on the output structure).
Slide 58: Logistic Regression → CRF

Slide 59: Logistic Regression
Probabilistic (binary) classifier. Likelihood function:
$p(y \mid x, w) = \frac{\exp(y \langle \phi(x), w \rangle)}{\exp(y \langle \phi(x), w \rangle) + \exp(-y \langle \phi(x), w \rangle)}$
Estimate the parameters $w$ by minimizing the negative log-likelihood.

Slide 60: Exponential Family
Family of probability distributions
$p(x; w) = \exp(\langle \phi(x), w \rangle - \ln Z(w))$
where $\phi(x)$ are the sufficient statistics of $x$ and $Z(w)$ is the partition function, $Z(w) = \int_x \exp(\langle \phi(x), w \rangle)\, dx$.

Slide 61: Log-Partition Function
$g(w) = \ln Z(w) = \ln \int_x \exp(\langle \phi(x), w \rangle)\, dx$
$g(w)$ is a cumulant generator, i.e.,
$\nabla_w g(w) = \frac{\int \phi(x) \exp(\langle \phi(x), w \rangle)\, dx}{\int \exp(\langle \phi(x), w \rangle)\, dx} = \mathbb{E}_{x \sim p(x; w)}[\phi(x)]$
$\nabla_w^2 g(w) = \mathrm{Cov}_{x \sim p(x; w)}[\phi(x)]$
Note: $g(w)$ is convex!
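For a finite sample space the integral becomes a sum, and the identity $\nabla_w g(w) = \mathbb{E}[\phi(x)]$ can be verified numerically. A sketch on a discrete toy family (my construction):

```python
import numpy as np
from scipy.special import logsumexp

# Discrete exponential family over 4 outcomes with 2-dim sufficient statistics.
phi = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w = np.array([0.5, -1.0])

g = logsumexp(phi @ w)   # log-partition g(w) = ln sum_x exp(<phi(x), w>)
p = np.exp(phi @ w - g)  # p(x; w)
grad_g = p @ phi         # E_p[phi(x)]

# Finite-difference check that grad g(w) = E[phi(x)].
eps = 1e-6
fd = [(logsumexp(phi @ (w + eps * e)) - g) / eps for e in np.eye(2)]
print(np.round(grad_g, 4), np.round(fd, 4))
```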
Slide 62: Examples
Binomial, multinomial, Gaussian, Laplace, Poisson, Dirichlet, ...

Slide 63: (Conditional) Exponential Family
Family of conditional distributions
$p(y \mid x, w) = \exp(\langle \phi(x, y), w \rangle - \ln Z(w \mid x))$
where $\phi(x, y)$ are the joint sufficient statistics of $x$ and $y$, and $Z(w \mid x)$ is the partition function, $Z(w \mid x) = \sum_{y \in \mathcal{Y}} \exp(\langle \phi(x, y), w \rangle)$.

Slide 64: MAP Estimation
$p(w \mid Y, X) \propto p(w, Y \mid X) = p(Y \mid X, w)\, p(w)$
Point estimate with a Gaussian prior on $w$, $p(w) \propto \exp(-\lambda \|w\|^2)$:
$\hat{w} = \operatorname{argmax}_w \ln p(w, Y \mid X) = \operatorname{argmax}_w \left[\frac{1}{m} \sum_{i=1}^{m} \ln p(y_i \mid x_i, w)\right] - \lambda \|w\|^2$
Note: a convex optimization problem.

Slide 65: Linear-Chain Conditional Random Fields
[Graph: a chain $Y_{i-1} - Y_i - Y_{i+1}$, with each label $Y_i$ connected to its observation $X_i$.]
Cliques: $(y_t, x_t)$ and $(y_{t-1}, y_t)$ for all $t$.
$p(y \mid x, w) = \exp\left( \sum_t \left[ \langle \phi(y_t, x_t), w_{xy} \rangle + \langle \phi(y_{t-1}, y_t), w_{yy} \rangle \right] - \ln Z(w \mid x) \right)$
Efficient inference using dynamic programming; MLE / MAP estimation of parameters.
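The log-partition $\ln Z(w \mid x)$ of a linear-chain model is computable by the forward algorithm in $O(T \cdot L^2)$ time. A log-space sketch, checked against brute-force enumeration (the score matrices are random placeholders):

```python
import numpy as np
from itertools import product
from scipy.special import logsumexp

def crf_log_partition(unary, trans):
    """Forward algorithm: log Z(w|x) for a linear-chain CRF.
    unary[t, y] = <phi(y_t=y, x_t), w_xy>; trans[y', y] = <phi(y_{t-1}=y', y_t=y), w_yy>."""
    T, L = unary.shape
    alpha = unary[0].copy()  # log-potentials of length-1 prefixes
    for t in range(1, T):
        # alpha[y] = logsumexp_{y'} (alpha[y'] + trans[y', y]) + unary[t, y]
        alpha = logsumexp(alpha[:, None] + trans, axis=0) + unary[t]
    return logsumexp(alpha)

rng = np.random.default_rng(0)
T, L = 4, 3
unary, trans = rng.normal(size=(T, L)), rng.normal(size=(L, L))

# Brute-force check: sum over all L^T label sequences.
brute = logsumexp([unary[range(T), list(y)].sum() +
                   sum(trans[y[t - 1], y[t]] for t in range(1, T))
                   for y in product(range(L), repeat=T)])
print(np.isclose(crf_log_partition(unary, trans), brute))
```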
Slide 66: HMMs and CRFs
[Graphs: the same chain over $Y_{i-1}, Y_i, Y_{i+1}$ with observations $X_{i-1}, X_i, X_{i+1}$, directed for the HMM and undirected for the CRF.]
Hidden Markov model: maximize $p(x, y \mid w)$. Conditional random field: maximize $p(y \mid x, w)$.

Slide 67: Learning Reductions (ICML 2009 tutorial)
Reductions transform complex learning problems into simpler, core problems. Desideratum: good performance on the core problem should imply good performance on the complex problem.
Examples:
Multi-class prediction → binary classification
Ranking → classification
Cost-sensitive classification → binary classification
Structured prediction → binary classification (SEARN)

Slide 68: Structured → Binary (SEARN) (Daumé et al., 2009)
Decomposable outputs $y = (y_1, \ldots, y_T)$. SEARN learns a policy $\pi$ that maps tuples $(x, y_1, \ldots, y_{t-1})$ to $y_t$.
Reduces structured prediction to cost-sensitive (multi-class) classification: good performance on cost-sensitive classification implies good performance on the structured prediction problem.
Note: no argmax problem!

Slide 69: Reduction to Cost-Sensitive Classification
Distribution over cost-sensitive examples $(\text{input}, c_1, \ldots, c_K)$. For every structured example $(x, y)$:
1. sample $t$ uniformly from $[[T]]$
2. run the policy $\pi$ for $t - 1$ steps to yield $(\hat{y}_1, \ldots, \hat{y}_{t-1})$
3. input: $(x, \hat{y}_1, \ldots, \hat{y}_{t-1})$
4. costs: $c_k = \mathbb{E}_{\hat{y}_{t+1}, \ldots, \hat{y}_T \sim \pi}\; \ell(y, (\hat{y}_1, \ldots, \hat{y}_{t-1}, k, \hat{y}_{t+1}, \ldots, \hat{y}_T))$ for all $k \in [[K]]$
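A sketch of this example-generation procedure, estimating each expectation with a single policy rollout (the toy policy, loss, and names are mine, not from the tutorial):

```python
import numpy as np

def make_cs_example(x, y, policy, loss, K, rng):
    """Generate one cost-sensitive example from a structured example (x, y).
    policy(x, prefix) -> next label; loss(y, y_hat) -> scalar."""
    T = len(y)
    t = rng.integers(1, T + 1)          # 1. sample t uniformly from [[T]]
    prefix = []
    for _ in range(t - 1):              # 2. run the policy for t-1 steps
        prefix.append(policy(x, prefix))
    costs = np.zeros(K)
    for k in range(K):                  # 4. cost of taking action k at step t
        rollout = list(prefix) + [k]
        while len(rollout) < T:         # complete the output with the policy
            rollout.append(policy(x, rollout))
        costs[k] = loss(y, rollout)     # single rollout estimates the expectation
    return (x, tuple(prefix)), costs

# Toy usage: labels 0/1, Hamming loss, and a policy that always predicts 0.
rng = np.random.default_rng(0)
x, y = "dummy input", [0, 1, 0, 1]
policy = lambda x, prefix: 0
loss = lambda y, z: sum(a != b for a, b in zip(y, z))
print(make_cs_example(x, y, policy, loss, K=2, rng=rng))
```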
Slide 70: Intermediate Summary
Loss functions (binary → structured). Discriminative structured prediction algorithms:
Perceptron → structured perceptron
RLSR → KDE
SVM → Struct-SVM, M3N
LogReg → CRF
Structured prediction → cost-sensitive classification