Structured Prediction
Ningshan Zhang
Advanced Machine Learning, Spring 2016
Outline
Ensemble Methods for Structured Prediction [1]
  On-line learning
  Boosting
A Generalized Kernel Approach to Structured Output Learning [2]
Structured Prediction Problem
Structured output: $\mathcal{Y} = \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_l$.
Loss function: $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$, decomposable.
Training data: a sample drawn i.i.d. from $\mathcal{X} \times \mathcal{Y}$ according to some distribution $\mathcal{D}$, $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times \mathcal{Y})^m$.
Problem: find a hypothesis $h : \mathcal{X} \to \mathcal{Y}$ with small generalization error $\mathbb{E}_{(x,y) \sim \mathcal{D}}[L(h(x), y)]$.
Ensemble Methods: Further Assumptions
The number of substructures $l \geq 1$ is fixed.
The loss function $L$ can be decomposed as a sum of loss functions $\ell_k : \mathcal{Y}_k \times \mathcal{Y}_k \to \mathbb{R}_+$, i.e. for all $y = (y^1, \ldots, y^l)$ and $y' = (y'^1, \ldots, y'^l)$,
$L(y, y') = \sum_{k=1}^{l} \ell_k(y^k, y'^k)$
A set of structured prediction experts $h_1, \ldots, h_p$: $h_j(x) = (h_j^1(x), \ldots, h_j^l(x))$, $j = 1, \ldots, p$.
Path Experts
Intuition: a particular expert may be better than the others at predicting the $k$-th substructure, but not elsewhere.
[Figure: path expert]
Construct a larger set of experts, the path experts:
$\mathcal{H} = \{(h_{j_1}^1, \ldots, h_{j_l}^l) : j_k \in \{1, \ldots, p\},\ k = 1, \ldots, l\}$, with $|\mathcal{H}| = p^l$.
The best $h \in \mathcal{H}$ may not coincide with any of the original experts.
Review: Randomized Weighted Majority (WM)
Task: find a distribution over experts.
Initialize the weights of all experts: $w_{1,j} = 1/p$, $j = 1, \ldots, p$.
After receiving sample $(x_t, y_t)$, update the weight of expert $j$, $j = 1, \ldots, p$:
$w_{t+1,j} = \dfrac{w_{t,j}\, e^{-\eta L(\hat{y}_{t,j},\, y_t)}}{\sum_{j=1}^{p} w_{t,j}\, e^{-\eta L(\hat{y}_{t,j},\, y_t)}}$, where $\eta > 0$ is a constant.
Return $w_{T+1} = (w_{T+1,1}, \ldots, w_{T+1,p})$.
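The update is just a multiplicative exponential-weights step followed by normalization. A minimal sketch (the function name and array layout are assumptions for illustration):

```python
import numpy as np

def weighted_majority_update(w, losses, eta):
    """One round of Randomized Weighted Majority:
    w_{t+1,j} is proportional to w_{t,j} * exp(-eta * L(yhat_{t,j}, y_t)).

    w: (p,) current distribution over the p experts
    losses: (p,) losses L(yhat_{t,j}, y_t) of each expert on (x_t, y_t)
    eta: learning rate, eta > 0
    """
    w = w * np.exp(-eta * losses)
    return w / w.sum()

# usage: start from w = np.full(p, 1.0 / p), then one call per round t
```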
Randomized Weighted Majority (WM)
Apply the Weighted Majority algorithm to the path expert set $\mathcal{H}$: for all $h \in \mathcal{H}$,
$w_{t+1}(h) = \dfrac{w_t(h)\, e^{-\eta L(h(x_t),\, y_t)}}{\sum_{h \in \mathcal{H}} w_t(h)\, e^{-\eta L(h(x_t),\, y_t)}}$
This requires $p^l$ updates per round! Using the structure of $\mathcal{Y}$ and the additive loss assumption, we can reduce this to $pl$ updates per round.
Weighted Majority with Weight Pushing (WMWP)
For every $h = (h_{j_1}^1, \ldots, h_{j_l}^l)$, set $w(h) = \prod_{k=1}^{l} w_{j_k}^k$.
Enforce that the weights at each substructure sum to one: $\sum_{j=1}^{p} w_j^k = 1$, $\forall k \in \{1, \ldots, l\}$.
Then the overall weights automatically sum to one: $\sum_{h \in \mathcal{H}} w(h) = 1$.
Example: $p = 2$, $l = 2$,
$w_1^1 w_1^2 + w_1^1 w_2^2 + w_2^1 w_1^2 + w_2^1 w_2^2 = (w_1^1 + w_2^1)(w_1^2 + w_2^2) = 1$
Weighted Majority with Weight Pushing (WMWP)
Initialize the weights: $w_{1,j}^k = 1/p$, $\forall (k, j) \in \{1, \ldots, l\} \times \{1, \ldots, p\}$.
At round $t$, for all $(k, j) \in \{1, \ldots, l\} \times \{1, \ldots, p\}$, update the weight
$w_{t+1,j}^k = \dfrac{w_{t,j}^k\, e^{-\eta\, \ell_k(h_j^k(x_t),\, y_t^k)}}{\sum_{j=1}^{p} w_{t,j}^k\, e^{-\eta\, \ell_k(h_j^k(x_t),\, y_t^k)}}$
Return $\mathbf{W}_1, \ldots, \mathbf{W}_{T+1}$, where $\mathbf{W}_t = (w_{t,j}^k)$.
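A minimal sketch of one WMWP round, storing the weights as an $l \times p$ matrix (names and layout are assumptions); it makes visible that only $pl$ weights are updated, not $p^l$:

```python
import numpy as np

def wmwp_update(W, losses, eta):
    """One WMWP round: update all l*p substructure weights independently.

    W: (l, p) array, W[k, j] = w^k_{t,j}; each row is a distribution over experts
    losses: (l, p) array, losses[k, j] = l_k(h_j^k(x_t), y_t^k)
    eta: learning rate, eta > 0
    """
    W = W * np.exp(-eta * losses)
    return W / W.sum(axis=1, keepdims=True)   # renormalize each substructure row
```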
On-line to Batch Conversion
How do we define a hypothesis based on $\{\mathbf{W}_1, \ldots, \mathbf{W}_{T+1}\}$? Two steps:
1. Choose a good collection of distributions. $\mathcal{P} = \{\mathbf{W}_1, \ldots, \mathbf{W}_{T+1}\}$ is one choice, but not necessarily the best.
2. Use $\mathcal{P}$ to define hypotheses for prediction.
On-line to Batch Conversion
Step 1: choose a good collection of distributions.
For any collection $\mathcal{P}$, define the score
$\Gamma(\mathcal{P}) = \dfrac{1}{|\mathcal{P}|} \sum_{\mathbf{W}_t \in \mathcal{P}} \sum_{h \in \mathcal{H}} \mathbf{W}_t(h)\, L(h(x_t), y_t) + M \sqrt{\dfrac{\log \frac{1}{\delta}}{|\mathcal{P}|}}$
Choose $\mathcal{P}_\delta = \arg\min_{\mathcal{P}} \Gamma(\mathcal{P})$.
On-line to Batch Conversion
Step 2: define hypotheses.
Randomized: randomly select a path expert $h \in \mathcal{H}$ according to
$p(h) = \dfrac{1}{|\mathcal{P}_\delta|} \sum_{\mathbf{W}_t \in \mathcal{P}_\delta} \mathbf{W}_t(h)$;
denote the resulting randomized hypothesis by $H_{\mathrm{Rand}}$.
Deterministic: define a scoring function
$\tilde{h}_{\mathrm{MVote}}(x, y) = \prod_{k=1}^{l} \left( \dfrac{1}{|\mathcal{P}_\delta|} \sum_{\mathbf{W}_t \in \mathcal{P}_\delta} \sum_{j=1}^{p} w_{t,j}^k\, 1_{h_j^k(x) = y^k} \right)$
$H_{\mathrm{MVote}}(x) = \arg\max_{y \in \mathcal{Y}} \tilde{h}_{\mathrm{MVote}}(x, y)$
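Because every $\mathbf{W}_t$ factorizes over substructures, sampling from $p(h)$ never requires enumerating the $p^l$ path experts: pick a round $t$ uniformly from $\mathcal{P}_\delta$, then sample each substructure's expert independently. A minimal sketch of the sampling step behind $H_{\mathrm{Rand}}$ under that observation (names are assumptions):

```python
import numpy as np

def sample_path_expert(P_delta, rng=np.random.default_rng()):
    """Sample a path expert h with probability p(h) = (1/|P_delta|) * sum_t W_t(h).

    P_delta: list of (l, p) weight matrices W_t; W_t[k, j] = w^k_{t,j},
        with each row of W_t summing to one.
    Returns a tuple (j_1, ..., j_l) of expert indices, one per substructure.
    """
    W = P_delta[rng.integers(len(P_delta))]          # pick W_t uniformly
    l_dim, p_experts = W.shape
    # each substructure's expert is drawn independently from the chosen row
    return tuple(rng.choice(p_experts, p=W[k]) for k in range(l_dim))
```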
Learning Guarantees
Regret $R_T$:
$R_T = \sum_{t=1}^{T} \mathbb{E}_{h \sim \mathbf{W}_t}[L(h(x_t), y_t)] - \inf_{h \in \mathcal{H}} \sum_{t=1}^{T} L(h(x_t), y_t)$
For any $\delta > 0$, with probability at least $1 - \delta$ over a sample $(x_i, y_i)_{i=1}^{T}$ drawn i.i.d. from $\mathcal{D}$, the following hold:
$\mathbb{E}[L(H_{\mathrm{Rand}}(x), y)] \leq \dfrac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{h \sim \mathbf{W}_t}[L(h(x_t), y_t)] + M \sqrt{\dfrac{\log \frac{T}{\delta}}{T}}$,
$\mathbb{E}[L(H_{\mathrm{Rand}}(x), y)] \leq \inf_{h \in \mathcal{H}} \mathbb{E}[L(h(x), y)] + \dfrac{R_T}{T} + 2M \sqrt{\dfrac{\log \frac{2T}{\delta}}{T}}$.
Learning Guarantees
The following inequality always holds, with expectations taken over $(x, y) \sim \mathcal{D}$, and over $h \sim p$ for $H_{\mathrm{Rand}}$:
$\mathbb{E}[L_{\mathrm{Ham}}(H_{\mathrm{MVote}}(x), y)] \leq 2\, \mathbb{E}[L_{\mathrm{Ham}}(H_{\mathrm{Rand}}(x), y)]$
Proof: by the definition of $H_{\mathrm{MVote}}$, the following always holds (if the majority vote is wrong at position $k$, at least half of the total weight must be on experts that are wrong at position $k$):
$\dfrac{1}{2}\, 1_{H_{\mathrm{MVote}}^k(x) \neq y^k} \leq \dfrac{1}{|\mathcal{P}_\delta|} \sum_{\mathbf{W}_t \in \mathcal{P}_\delta} \sum_{j=1}^{p} w_{t,j}^k\, 1_{h_j^k(x) \neq y^k}$
Summing over $k$ and taking expectations over $\mathcal{D}$ yields the desired result.
AdaBoost: Review
Hypothesis space: $H = \left\{ \sum_{j=1}^{N} \alpha_j h_j : \alpha_j \geq 0 \right\}$, where the $h_j \in H_0$ are base classifiers, $h_j : \mathcal{X} \to \mathbb{R}$.
Objective function: $F(\alpha) = \sum_{i=1}^{m} e^{-y_i \sum_{j=1}^{N} \alpha_j h_j(x_i)}$
Apply coordinate descent to $F(\alpha)$.
Return the hypothesis $h(x) = \mathrm{sgn}\left( \sum_{j=1}^{N} \alpha_j h_j(x) \right)$.
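As a concrete illustration of the coordinate-descent view, here is a minimal sketch over a fixed, finite set of base classifiers whose $\pm 1$ predictions are precomputed in a matrix; the function name, the matrix layout, and the choice of always picking the coordinate with the smallest weighted error are assumptions, not part of the slides:

```python
import numpy as np

def adaboost_coordinate_descent(H, y, T=50):
    """Coordinate descent on F(alpha) = sum_i exp(-y_i * sum_j alpha_j h_j(x_i)).

    H: (m, N) array, H[i, j] = h_j(x_i) in {-1, +1}
    y: (m,) array of labels in {-1, +1}
    Returns the coefficient vector alpha.
    """
    m, N = H.shape
    alpha = np.zeros(N)
    D = np.full(m, 1.0 / m)                    # normalized exponential-loss weights
    for _ in range(T):
        # weighted error of each base classifier under the current distribution
        eps = np.array([D[y != H[:, j]].sum() for j in range(N)])
        j = int(np.argmin(eps))                # steepest coordinate (assumes eps < 1/2)
        e = float(np.clip(eps[j], 1e-12, 1 - 1e-12))
        step = 0.5 * np.log((1 - e) / e)       # closed-form line search on coordinate j
        alpha[j] += step
        D *= np.exp(-step * y * H[:, j])       # reweight examples
        D /= D.sum()
    return alpha

# prediction on new data with base-classifier outputs H_new: np.sign(H_new @ alpha)
```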
ESPBoost: Hypothesis Space
How to make a convex combination of path experts? Prediction → score → convex combination of scores → combined prediction.
For each path expert $h_t \in \mathcal{H}$, define the score $\tilde{h}_t : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$:
$\forall (x, y) \in \mathcal{X} \times \mathcal{Y}, \quad \tilde{h}_t(x, y) = \sum_{k=1}^{l} 1_{h_t^k(x) = y^k}$
Convex combination of scores:
$\left\{ \tilde{h} = \sum_{t=1}^{T} \alpha_t \tilde{h}_t : \tilde{h}_t \text{ derived from a path expert},\ \alpha_t \geq 0 \right\}$
Combined prediction:
$h(x) = \arg\max_{y \in \mathcal{Y}} \tilde{h}(x, y) = \arg\max_{y \in \mathcal{Y}} \sum_{t=1}^{T} \sum_{k=1}^{l} \alpha_t\, 1_{h_t^k(x) = y^k}$
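Since the combined score is additive over substructures, the $\arg\max$ over $\mathcal{Y}$ decomposes into $l$ independent per-substructure votes. A minimal sketch of the combined prediction for one input $x$ (all names are assumptions):

```python
def espboost_predict(alpha, expert_preds, label_sets):
    """Compute argmax_y sum_t sum_k alpha_t * 1[h_t^k(x) = y^k], one substructure
    at a time.

    alpha: length-T list of nonnegative coefficients
    expert_preds: length-T list; expert_preds[t][k] = h_t^k(x)
    label_sets: length-l list; label_sets[k] is the candidate label set Y_k
    """
    prediction = []
    for k, labels in enumerate(label_sets):
        votes = {lab: 0.0 for lab in labels}       # weight received by each label
        for a, preds in zip(alpha, expert_preds):
            votes[preds[k]] += a                   # assumes preds[k] lies in Y_k
        prediction.append(max(votes, key=votes.get))
    return tuple(prediction)
```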
Loss Function and Upper Bound
Normalized Hamming loss:
$\dfrac{1}{m} \sum_{i=1}^{m} L_{\mathrm{Ham}}(H_{\mathrm{ESPBoost}}(x_i), y_i) = \dfrac{1}{ml} \sum_{i=1}^{m} \sum_{k=1}^{l} 1_{\tilde{h}^k(x_i, y_i^k) - \max_{y^k \neq y_i^k} \tilde{h}^k(x_i, y^k) < 0}$
$\leq \dfrac{1}{ml} \sum_{i=1}^{m} \sum_{k=1}^{l} \exp\left( - \sum_{t=1}^{T} \alpha_t\, \rho(\tilde{h}_t^k, x_i, y_i) \right) := F(\alpha)$
Margin: $\rho(\tilde{h}^k, x_i, y_i) := \tilde{h}^k(x_i, y_i^k) - \max_{y^k \neq y_i^k} \tilde{h}^k(x_i, y^k)$
ESPBoost
Hypothesis space: the set of combined predictions
$\left\{ H_{\mathrm{ESPBoost}}(x) = \arg\max_{y \in \mathcal{Y}} \tilde{h}(x, y) \right\}$
Objective function $F : \mathbb{R}_+^T \to \mathbb{R}$:
$F(\alpha) = \dfrac{1}{ml} \sum_{i=1}^{m} \sum_{k=1}^{l} \exp\left( - \sum_{t=1}^{T} \alpha_t\, \rho(\tilde{h}_t^k, x_i, y_i) \right)$
Apply coordinate descent.
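For a single path expert the per-substructure score is a 0/1 indicator, so its margin is $+1$ when $h_t^k(x_i) = y_i^k$ and $-1$ otherwise; $F(\alpha)$ therefore has the same form as the AdaBoost objective over the $ml$ (example, substructure) pairs. A minimal sketch of evaluating $F(\alpha)$ under that observation (array names are assumptions):

```python
import numpy as np

def espboost_objective(alpha, correct):
    """F(alpha) = (1/(m*l)) * sum_{i,k} exp(-sum_t alpha_t * rho_{t,i,k}).

    alpha: (T,) nonnegative coefficients
    correct: (T, m, l) boolean array, correct[t, i, k] = (h_t^k(x_i) == y_i^k)
    """
    rho = 2.0 * correct.astype(float) - 1.0       # margins in {-1, +1}
    scores = np.tensordot(alpha, rho, axes=1)     # (m, l): sum_t alpha_t * rho_t
    return np.exp(-scores).mean()                 # average over the m*l pairs
```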
ESPBoost: algorithm
Generalized Kernel Approach
Problem setting: given $(x_i, y_i)_{i=1}^{n} \in \mathcal{X} \times \mathcal{Y}$, we want to learn a mapping $f$ from $\mathcal{X}$ to $\mathcal{Y}$.
$l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$: a kernel function
$\mathcal{F}_{\mathcal{Y}}$: the RKHS associated with $l$
$\Phi_l$: the mapping from $\mathcal{Y}$ to $\mathcal{F}_{\mathcal{Y}}$
Step 1: learn the mapping $g$ from $\mathcal{X}$ to $\mathcal{F}_{\mathcal{Y}}$.
Step 2: find the pre-image of $g(x)$.
Generalized Kernel Approach
Step 1: learn the mapping $g : \mathcal{X} \to \mathcal{F}_{\mathcal{Y}}$.
Preliminaries: operator-valued kernels, function-valued RKHS.
Operator-valued Kernels
Let $L(\mathcal{F}_{\mathcal{Y}})$ be the set of bounded operators $T : \mathcal{F}_{\mathcal{Y}} \to \mathcal{F}_{\mathcal{Y}}$.
A non-negative $L(\mathcal{F}_{\mathcal{Y}})$-valued kernel is a map $K : \mathcal{X} \times \mathcal{X} \to L(\mathcal{F}_{\mathcal{Y}})$ such that:
$\forall x_i, x_j \in \mathcal{X}$, $K(x_i, x_j) = K(x_j, x_i)^*$;
for any $m > 0$ and $\{(x_i, \varphi_i)\}_{i=1,\ldots,m} \subset \mathcal{X} \times \mathcal{F}_{\mathcal{Y}}$, $\sum_{i,j=1}^{m} \langle K(x_i, x_j)\varphi_j, \varphi_i \rangle_{\mathcal{F}_{\mathcal{Y}}} \geq 0$.
Note: $^*$ denotes the adjoint operator, i.e. $\forall \varphi_1, \varphi_2 \in \mathcal{F}_{\mathcal{Y}}$, $\langle K^*(\varphi_1), \varphi_2 \rangle_{\mathcal{F}_{\mathcal{Y}}} = \langle \varphi_1, K(\varphi_2) \rangle_{\mathcal{F}_{\mathcal{Y}}}$.
Function-valued RKHS
A Hilbert space $\mathcal{F}_{\mathcal{XY}}$ of functions $g : \mathcal{X} \to \mathcal{F}_{\mathcal{Y}}$ is an $\mathcal{F}_{\mathcal{Y}}$-valued RKHS if there is a non-negative $L(\mathcal{F}_{\mathcal{Y}})$-valued kernel $K$ with the following properties:
$\forall x \in \mathcal{X}, \forall \varphi \in \mathcal{F}_{\mathcal{Y}}$, the function $K(x, \cdot)\varphi \in \mathcal{F}_{\mathcal{XY}}$;
$\forall g \in \mathcal{F}_{\mathcal{XY}}, \forall x \in \mathcal{X}, \forall \varphi \in \mathcal{F}_{\mathcal{Y}}$, $\langle g, K(x, \cdot)\varphi \rangle_{\mathcal{F}_{\mathcal{XY}}} = \langle g(x), \varphi \rangle_{\mathcal{F}_{\mathcal{Y}}}$.
Theorem: there is a bijection between $\mathcal{F}_{\mathcal{XY}}$ and $K$, as long as $K$ is non-negative.
Kernel Ridge Regression
Step 1: learn $g : \mathcal{X} \to \mathcal{F}_{\mathcal{Y}}$ by kernel ridge regression, which has a closed-form solution:
$\arg\min_{g \in \mathcal{F}_{\mathcal{XY}}} \sum_{i=1}^{n} \| g(x_i) - \Phi_l(y_i) \|_{\mathcal{F}_{\mathcal{Y}}}^2 + \lambda \| g \|_{\mathcal{F}_{\mathcal{XY}}}^2$
$g(x) = \mathbf{K}_x (\mathbf{K} + \lambda I)^{-1} \boldsymbol{\Phi}_l$
$\mathbf{K}_x$: a row vector of operators, $[K(x, x_i) \in L(\mathcal{F}_{\mathcal{Y}})]_{i=1}^{n}$
$\mathbf{K}$: a matrix of operators, $[K(x_i, x_j) \in L(\mathcal{F}_{\mathcal{Y}})]_{i,j=1}^{n}$
$\boldsymbol{\Phi}_l$: a column vector of functions, $[\Phi_l(y_i) \in \mathcal{F}_{\mathcal{Y}}]_{i=1}^{n}$
Find the Pre-image
Step 2: find the pre-image of $g(x)$:
$f(x) = \arg\min_{y \in \mathcal{Y}} \| g(x) - \Phi_l(y) \|_{\mathcal{F}_{\mathcal{Y}}}^2$
$= \arg\min_{y \in \mathcal{Y}} \| \mathbf{K}_x (\mathbf{K} + \lambda I)^{-1} \boldsymbol{\Phi}_l - \Phi_l(y) \|_{\mathcal{F}_{\mathcal{Y}}}^2$
$= \arg\min_{y \in \mathcal{Y}} \; l(y, y) - 2\, \langle \mathbf{K}_x (\mathbf{K} + \lambda I)^{-1} \boldsymbol{\Phi}_l,\ \Phi_l(y) \rangle_{\mathcal{F}_{\mathcal{Y}}}$
$\Phi_l(y)$ is unknown, so use a generalized kernel trick:
$\langle T\, \Phi_l(y_i), \Phi_l(y) \rangle = [T\, l(y_i, \cdot)](y)$
Express $f(x)$ using only kernel functions:
$f(x) = \arg\min_{y \in \mathcal{Y}} \; l(y, y) - 2\, \big[ \mathbf{K}_x (\mathbf{K} + \lambda I)^{-1} \mathbf{L}_{\bullet} \big](y)$
where $\mathbf{L}_{\bullet}$ is a column vector of the functions $[l(y_i, \cdot)]_{i=1}^{n}$.
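A minimal end-to-end sketch of both steps, under the simplifying assumption $K(x_i, x_j) = k(x_i, x_j)\, \mathrm{Id}$ (the identity operator, not the covariance-based kernel of the next slide), for which the operator inverse reduces to an ordinary matrix inverse and the pre-image search is done by enumerating a finite candidate set; all names are illustrative:

```python
import numpy as np

def fit_predict(k, l, X, Y, candidates, x_new, lam=1e-2):
    """Two-step kernel approach with K(x_i, x_j) = k(x_i, x_j) * Identity.

    k: scalar input kernel k(x, x'); l: output kernel l(y, y')
    X, Y: training inputs/outputs; candidates: finite set of candidate outputs
    """
    n = len(X)
    # Step 1: ridge coefficients c(x) = (K + lam*I)^{-1} k_x (scalar Gram matrix)
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    k_x = np.array([k(x_new, xi) for xi in X])
    c = np.linalg.solve(K + lam * np.eye(n), k_x)
    # Step 2: pre-image via kernels only: argmin_y l(y, y) - 2 * sum_i c_i * l(y_i, y)
    scores = [l(y, y) - 2.0 * sum(ci * l(yi, y) for ci, yi in zip(c, Y))
              for y in candidates]
    return candidates[int(np.argmin(scores))]
```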
Covariance-based Operator-valued Kernels
Covariance-based operator-valued kernels: $K(x_i, x_j) = k(x_i, x_j)\, C_{YY}$
$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$: a scalar-valued kernel;
$C_{YY} : \mathcal{F}_{\mathcal{Y}} \to \mathcal{F}_{\mathcal{Y}}$: a covariance operator defined by a random variable $Y \in \mathcal{Y}$: $\langle \varphi_1, C_{YY}\, \varphi_2 \rangle_{\mathcal{F}_{\mathcal{Y}}} = \mathbb{E}[\varphi_1(Y)\, \varphi_2(Y)]$
Empirical covariance operator: $\hat{C}_{YY}(\varphi) = \dfrac{1}{n} \sum_{i=1}^{n} \varphi(y_i)\, l(\cdot, y_i)$
To account for the effect of the input, we can also use the conditional covariance operator $C_{YY|X} = C_{YY} - C_{YX} C_{XX}^{-1} C_{XY}$.
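The empirical covariance operator applied to a function can be evaluated pointwise using only the output kernel: $[\hat{C}_{YY}\varphi](y) = \frac{1}{n}\sum_{i=1}^{n} \varphi(y_i)\, l(y, y_i)$. A tiny sketch of that evaluation (names assumed):

```python
import numpy as np

def empirical_cov_apply(phi_vals, Y_train, l, y):
    """Evaluate [C_hat_YY(phi)](y) = (1/n) * sum_i phi(y_i) * l(y, y_i).

    phi_vals: values phi(y_i) on the training outputs Y_train; l: output kernel
    """
    return float(np.mean([p * l(y, yi) for p, yi in zip(phi_vals, Y_train)]))
```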
Conclusion
Ensemble methods: ensemble learning with an expanded set of path experts.
  On-line algorithm (WMWP): efficient learning and inference by exploiting the output structure.
  On-line-to-batch conversion: randomized and deterministic algorithms with learning guarantees.
  Boosting (ESPBoost): efficient by exploiting the output structure.
Kernel method: use a joint feature space.
  Covariance-based operator-valued kernels encode interactions between outputs.
  Conditional covariance operators correlate the input with interactions between outputs.
  Express the final hypothesis using only kernel functions.
References
[1] Corinna Cortes, Vitaly Kuznetsov, and Mehryar Mohri. Ensemble methods for structured prediction. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pages 1134-1142, 2014.
[2] Hachem Kadri, Mohammad Ghavamzadeh, and Philippe Preux. A generalized kernel approach to structured output learning. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), Atlanta, United States, June 2013.