Structured Prediction


Ningshan Zhang
Advanced Machine Learning, Spring 2016

Outline
- Ensemble Methods for Structured Prediction [1]
  - On-line learning
  - Boosting
- A Generalized Kernel Approach to Structured Output Learning [2]

Structured Prediction Problem
Structured output: $\mathcal{Y} = \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_l$.
Loss function: $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$, decomposable.
Training data: a sample drawn i.i.d. from $\mathcal{X} \times \mathcal{Y}$ according to some distribution $\mathcal{D}$: $S = ((x_1, y_1), \ldots, (x_m, y_m))$.
Problem: find a hypothesis $h : \mathcal{X} \to \mathcal{Y}$ with small generalization error $\mathbb{E}_{(x,y) \sim \mathcal{D}}[L(h(x), y)]$.

Ensemble Methods: Further Assumptions
The number of substructures $l \geq 1$ is fixed.
The loss function $L$ decomposes as a sum of loss functions $\ell_k : \mathcal{Y}_k \times \mathcal{Y}_k \to \mathbb{R}_+$, i.e. for all $y = (y^1, \ldots, y^l)$ and $y' = (y'^1, \ldots, y'^l)$,
$$L(y, y') = \sum_{k=1}^{l} \ell_k(y^k, y'^k).$$
A set of structured prediction experts $h_1, \ldots, h_p$ is given, with
$$h_j(x) = (h_j^1(x), \ldots, h_j^l(x)), \qquad j = 1, \ldots, p.$$

Path Experts
Intuition: one particular expert may be better at predicting the $k$-th substructure, but not elsewhere.
[Figure: a path expert selecting one expert per substructure.]
Construct a larger set of experts, the path experts:
$$\mathcal{H} = \left\{ (h_{j_1}^1, \ldots, h_{j_l}^l) : j_k \in \{1, \ldots, p\},\ k = 1, \ldots, l \right\}, \qquad |\mathcal{H}| = p^l.$$
The best $h \in \mathcal{H}$ may not coincide with any of the original experts.
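To make the combinatorial size $|\mathcal{H}| = p^l$ concrete, here is a tiny sketch that enumerates all path experts built from $p$ toy experts over $l$ substructures (the experts, names, and data are purely illustrative):

```python
from itertools import product

# Toy setting: p = 2 experts, l = 3 substructures.
# Each expert is a tuple of l per-substructure predictors h_j^k : X -> Y_k.
# Here the "predictors" are constant functions, purely for illustration.
experts = [
    (lambda x: "a", lambda x: "b", lambda x: "c"),  # expert 1
    (lambda x: "A", lambda x: "B", lambda x: "C"),  # expert 2
]
p, l = len(experts), len(experts[0])

# A path expert picks, for each substructure k, one of the p experts.
path_experts = [
    tuple(experts[j_k][k] for k, j_k in enumerate(choice))
    for choice in product(range(p), repeat=l)
]
assert len(path_experts) == p ** l  # 2**3 = 8 path experts

# Prediction of one path expert on an input x: one output per substructure.
x = None  # placeholder input
print([h_k(x) for h_k in path_experts[0]])  # ['a', 'b', 'c']
```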

Review: Randomized Weighted Majority (WM)
Task: find a distribution over experts.
Initialize the weights of all experts: $w_{1,j} = 1/p$, $j = 1, \ldots, p$.
After receiving sample $(x_t, y_t)$, update the weight of expert $j$, for $j = 1, \ldots, p$:
$$w_{t+1,j} = \frac{w_{t,j}\, e^{-\eta L(\widehat{y}_{t,j},\, y_t)}}{\sum_{j'=1}^{p} w_{t,j'}\, e^{-\eta L(\widehat{y}_{t,j'},\, y_t)}}, \qquad \eta > 0 \text{ a constant}.$$
Return $w_{T+1} = (w_{T+1,1}, \ldots, w_{T+1,p})$.
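A minimal sketch of this multiplicative update, assuming the losses $L(\widehat{y}_{t,j}, y_t)$ of the $p$ experts at round $t$ have already been computed (array names are illustrative):

```python
import numpy as np

def wm_update(weights, losses, eta=0.5):
    """One round of Randomized Weighted Majority.

    weights: current distribution over the p experts (sums to 1).
    losses:  L(yhat_{t,j}, y_t) for each expert j at this round.
    eta:     learning rate (> 0).
    """
    new_weights = weights * np.exp(-eta * losses)
    return new_weights / new_weights.sum()  # renormalize

# Example: 3 experts, uniform initial weights, one round of losses.
w = np.ones(3) / 3
w = wm_update(w, losses=np.array([0.0, 1.0, 0.5]))
print(w)  # the zero-loss expert gets the largest weight
```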

Randomized Weighted Majority (WM)
Apply the Weighted Majority algorithm to the path expert set $\mathcal{H}$: for all $h \in \mathcal{H}$,
$$w_{t+1}(h) = \frac{w_t(h)\, e^{-\eta L(h(x_t),\, y_t)}}{\sum_{h' \in \mathcal{H}} w_t(h')\, e^{-\eta L(h'(x_t),\, y_t)}}.$$
That is $p^l$ updates per round! Using the structure of $\mathcal{Y}$ and the additive loss assumption, we can reduce this to $pl$ updates per round.

Weighted Majority Weight Pushing (WMWP)
For every $h = (h_{j_1}^1, \ldots, h_{j_l}^l)$, set
$$w(h) = \prod_{k=1}^{l} w_{j_k}^k.$$
Enforce that the weights at each substructure sum to one: $\sum_{j=1}^{p} w_j^k = 1$ for all $k \in \{1, \ldots, l\}$.
Then the overall weights automatically sum to one: $\sum_{h \in \mathcal{H}} w(h) = 1$.
Example ($p = 2$, $l = 2$):
$$w_1^1 w_1^2 + w_1^1 w_2^2 + w_2^1 w_1^2 + w_2^1 w_2^2 = (w_1^1 + w_2^1)(w_1^2 + w_2^2) = 1.$$
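A quick numerical check of this factorization for the $p = 2$, $l = 2$ example (the weight values are arbitrary):

```python
import numpy as np

# Per-substructure weights: w[k][j] is the weight of expert j at substructure k.
w = np.array([[0.3, 0.7],   # substructure 1 (sums to 1)
              [0.6, 0.4]])  # substructure 2 (sums to 1)

# Sum of the product-weights over all 2**2 = 4 path experts ...
total = sum(w[0, j1] * w[1, j2] for j1 in range(2) for j2 in range(2))
# ... equals the product of the per-substructure sums, i.e. 1.
assert np.isclose(total, w[0].sum() * w[1].sum()) and np.isclose(total, 1.0)
```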

Weighted Majority Weight Pushing (WMWP)
Initialize the weights: $w_{1,j}^k = 1/p$ for all $(k, j) \in \{1, \ldots, l\} \times \{1, \ldots, p\}$.
At round $t$, for all $(k, j) \in \{1, \ldots, l\} \times \{1, \ldots, p\}$, update
$$w_{t+1,j}^k = \frac{w_{t,j}^k\, e^{-\eta\, \ell_k(h_j^k(x_t),\, y_t^k)}}{\sum_{j'=1}^{p} w_{t,j'}^k\, e^{-\eta\, \ell_k(h_{j'}^k(x_t),\, y_t^k)}}.$$
Return $W_1, \ldots, W_{T+1}$, where $W_t = (w_{t,j}^k)$.
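A minimal sketch of one WMWP round, assuming the per-substructure losses $\ell_k(h_j^k(x_t), y_t^k)$ are given as an $(l, p)$ matrix (names and the toy losses are illustrative); each row is updated exactly as in the scalar WM algorithm:

```python
import numpy as np

def wmwp_round(W, substructure_losses, eta=0.5):
    """One WMWP round: an independent WM update per substructure.

    W: (l, p) matrix, row k is the current distribution over the p experts
       for substructure k.
    substructure_losses: (l, p) matrix of losses l_k(h_j^k(x_t), y_t^k).
    Returns the updated (l, p) weight matrix: only l*p updates per round.
    """
    W_new = W * np.exp(-eta * substructure_losses)
    return W_new / W_new.sum(axis=1, keepdims=True)  # renormalize each row

# Example: p = 2 experts, l = 3 substructures, uniform initialization.
l, p = 3, 2
W = np.full((l, p), 1.0 / p)
losses = np.array([[0.0, 1.0],   # expert 1 correct on substructure 1, ...
                   [1.0, 0.0],
                   [0.0, 0.0]])
W = wmwp_round(W, losses)
print(W)  # each row still sums to 1; better experts gain weight per substructure
```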

On-line to Batch Conversion
How do we define a hypothesis based on $\{W_1, \ldots, W_{T+1}\}$? Two steps:
1. Choose a good collection of distributions. $\mathcal{P} = \{W_1, \ldots, W_{T+1}\}$ is one choice, but not necessarily the best.
2. Use $\mathcal{P}$ to define hypotheses for prediction.

On-line to Batch Conversion
Step 1: choose a good collection of distributions. For any collection $\mathcal{P}$, define the score
$$\Gamma(\mathcal{P}) = \frac{1}{|\mathcal{P}|} \sum_{W_t \in \mathcal{P}} \sum_{h \in \mathcal{H}} W_t(h)\, L(h(x_t), y_t) + M \sqrt{\frac{\log \frac{1}{\delta}}{|\mathcal{P}|}}.$$
Choose $\mathcal{P}_\delta = \arg\min_{\mathcal{P}} \Gamma(\mathcal{P})$.
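A small sketch of how this score might be computed, assuming the per-round expected losses $\sum_{h} W_t(h)\, L(h(x_t), y_t)$ are available and that the candidate collections are, say, suffixes of $\{W_1, \ldots, W_{T+1}\}$; both assumptions are illustrative and not prescribed by the slide:

```python
import numpy as np

def gamma_score(expected_losses, M=1.0, delta=0.05):
    """Gamma(P): average expected loss over the rounds in P plus a confidence term."""
    P_size = len(expected_losses)
    return np.mean(expected_losses) + M * np.sqrt(np.log(1.0 / delta) / P_size)

# expected_losses[t] = sum_h W_t(h) * L(h(x_t), y_t) for each round t.
expected_losses = np.array([0.9, 0.7, 0.4, 0.3, 0.25, 0.2])

# Score every suffix collection P = {W_s, ..., W_T} and keep the best one.
suffix_scores = [gamma_score(expected_losses[s:]) for s in range(len(expected_losses))]
best_start = int(np.argmin(suffix_scores))
print(best_start, suffix_scores[best_start])
```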

On-line to Batch Conversion
Step 2: define the hypotheses.
Randomized: randomly select a path expert $h \in \mathcal{H}$ according to
$$p(h) = \frac{1}{|\mathcal{P}_\delta|} \sum_{W_t \in \mathcal{P}_\delta} W_t(h);$$
denote the resulting hypothesis by $H_{\mathrm{Rand}}$.
Deterministic: define a scoring function
$$\widetilde{h}_{\mathrm{MVote}}(x, y) = \prod_{k=1}^{l} \left( \frac{1}{|\mathcal{P}_\delta|} \sum_{W_t \in \mathcal{P}_\delta} \sum_{j=1}^{p} w_{t,j}^k\, 1_{h_j^k(x) = y^k} \right),$$
$$H_{\mathrm{MVote}}(x) = \arg\max_{y \in \mathcal{Y}} \widetilde{h}_{\mathrm{MVote}}(x, y).$$
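A sketch of the two predictors, assuming $\mathcal{P}_\delta$ is given as a list of per-substructure weight matrices $W_t$ and the experts' per-substructure predictions for the current input have already been computed (all names and toy values are illustrative):

```python
import numpy as np
from collections import defaultdict

def predict_randomized(W_list, expert_preds, rng=np.random.default_rng(0)):
    """H_Rand: pick a round t uniformly from P_delta, then sample one expert per
    substructure from W_t; this draws a path expert h with probability p(h).

    W_list: list of (l, p) weight matrices W_t in P_delta.
    expert_preds: (l, p) array, expert_preds[k, j] = h_j^k(x) for the current input x.
    """
    W_t = W_list[rng.integers(len(W_list))]
    l, p = W_t.shape
    return [expert_preds[k, rng.choice(p, p=W_t[k])] for k in range(l)]

def predict_mvote(W_list, expert_preds):
    """H_MVote: the factored score decomposes over k, so take a weighted
    per-substructure majority vote using the weights averaged over P_delta."""
    W_avg = np.mean(W_list, axis=0)
    l, p = W_avg.shape
    output = []
    for k in range(l):
        votes = defaultdict(float)
        for j in range(p):
            votes[expert_preds[k, j]] += W_avg[k, j]
        output.append(max(votes, key=votes.get))
    return output

W_list = [np.array([[0.8, 0.2], [0.4, 0.6]]), np.array([[0.6, 0.4], [0.5, 0.5]])]
expert_preds = np.array([["A", "B"], ["C", "D"]])  # h_j^k(x) for l = 2, p = 2
print(predict_mvote(W_list, expert_preds))         # ['A', 'D']
```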

Learning Guarantees
Regret:
$$R_T = \sum_{t=1}^{T} \mathbb{E}_{h \sim W_t}[L(h(x_t), y_t)] - \inf_{h \in \mathcal{H}} \sum_{t=1}^{T} L(h(x_t), y_t).$$
For any $\delta > 0$, with probability at least $1 - \delta$ over a sample $(x_i, y_i)_{i=1}^{T}$ drawn i.i.d. from $\mathcal{D}$, the following hold:
$$\mathbb{E}[L(H_{\mathrm{Rand}}(x), y)] \leq \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{h \sim W_t}[L(h(x_t), y_t)] + M \sqrt{\frac{\log \frac{T}{\delta}}{T}},$$
$$\mathbb{E}[L(H_{\mathrm{Rand}}(x), y)] \leq \inf_{h \in \mathcal{H}} \mathbb{E}[L(h(x), y)] + \frac{R_T}{T} + 2M \sqrt{\frac{\log \frac{2T}{\delta}}{T}}.$$

Learning Guarantees
The following inequality always holds, with expectations taken over $(x, y) \sim \mathcal{D}$ and over $h \sim p$ for $H_{\mathrm{Rand}}$:
$$\mathbb{E}[L_{\mathrm{Ham}}(H_{\mathrm{MVote}}(x), y)] \leq 2\, \mathbb{E}[L_{\mathrm{Ham}}(H_{\mathrm{Rand}}(x), y)].$$
Proof: by the definition of $H_{\mathrm{MVote}}$, the following always holds:
$$\frac{1}{2}\, 1_{H_{\mathrm{MVote}}^k(x) \neq y^k} \leq \frac{1}{|\mathcal{P}_\delta|} \sum_{W_t \in \mathcal{P}_\delta} \sum_{j=1}^{p} w_{t,j}^k\, 1_{h_j^k(x) \neq y^k};$$
summing over $k$ and taking expectations over $\mathcal{D}$ yields the desired result.

AdaBoost: Review
Hypothesis space: $H = \left\{ \sum_{j=1}^{N} \alpha_j h_j : \alpha_j \geq 0 \right\}$, where the $h_j \in H_0$ are base classifiers, $h_j : \mathcal{X} \to \mathbb{R}$.
Objective function: $F(\alpha) = \sum_{i=1}^{m} e^{-y_i \sum_{j=1}^{N} \alpha_j h_j(x_i)}$.
Apply coordinate descent to $F(\alpha)$.
Return the hypothesis $h(x) = \mathrm{sgn}\left( \sum_{j=1}^{N} \alpha_j h_j(x) \right)$.
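A minimal sketch of this coordinate descent, assuming a fixed finite pool of base classifiers taking values in $\{-1, +1\}$ and represented by their prediction matrix on the training set (toy data and names are illustrative); each round updates the single coordinate with smallest weighted error using the closed-form step:

```python
import numpy as np

def adaboost_coordinate_descent(H, y, T=10):
    """Coordinate descent on F(alpha) = sum_i exp(-y_i * sum_j alpha_j h_j(x_i)).

    H: (m, N) matrix of base-classifier predictions h_j(x_i) in {-1, +1}.
    y: (m,) labels in {-1, +1}.
    Returns the coefficient vector alpha (one coordinate updated per round).
    """
    m, N = H.shape
    alpha = np.zeros(N)
    margins = np.zeros(m)                 # y_i * sum_j alpha_j h_j(x_i)
    for _ in range(T):
        D = np.exp(-margins)
        D /= D.sum()                      # AdaBoost's example weights
        errors = (D[:, None] * (H != y[:, None])).sum(axis=0)  # weighted errors eps_j
        j = int(np.argmin(errors))        # best coordinate this round
        eps = np.clip(errors[j], 1e-12, 1 - 1e-12)
        step = 0.5 * np.log((1 - eps) / eps)   # closed-form coordinate step
        alpha[j] += step
        margins += step * y * H[:, j]
    return alpha

# Toy example: 4 points, 3 base classifiers ("stumps") given by their predictions.
H = np.array([[+1, -1, +1], [+1, +1, -1], [-1, +1, +1], [-1, -1, -1]])
y = np.array([+1, +1, -1, -1])
alpha = adaboost_coordinate_descent(H, y, T=5)
print(np.sign(H @ alpha))  # predictions of the combined classifier
```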

ESPBoost: Hypothesis Space
How do we form a convex combination of path experts? Prediction $\to$ score; convex combination of scores $\to$ combined prediction.
For each path expert $h_t \in \mathcal{H}$, define the score $\widetilde{h}_t : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$:
$$\forall (x, y) \in \mathcal{X} \times \mathcal{Y}, \quad \widetilde{h}_t(x, y) = \sum_{k=1}^{l} 1_{h_t^k(x) = y^k}.$$
Convex combination of scores:
$$\widetilde{h} = \sum_{t=1}^{T} \alpha_t \widetilde{h}_t, \qquad \widetilde{h}_t \text{ derived from path experts}, \quad \alpha_t \geq 0.$$
Combined prediction:
$$h(x) = \arg\max_{y \in \mathcal{Y}} \widetilde{h}(x, y) = \arg\max_{y \in \mathcal{Y}} \sum_{t=1}^{T} \sum_{k=1}^{l} \alpha_t\, 1_{h_t^k(x) = y^k}.$$

Loss Function and Upper Bound
Normalized Hamming loss:
$$\frac{1}{m} \sum_{i=1}^{m} L_{\mathrm{Ham}}(H_{\mathrm{ESPBoost}}(x_i), y_i) = \frac{1}{ml} \sum_{i=1}^{m} \sum_{k=1}^{l} 1_{\widetilde{h}^k(x_i, y_i^k) - \max_{y^k \neq y_i^k} \widetilde{h}^k(x_i, y^k) < 0} \leq \frac{1}{ml} \sum_{i=1}^{m} \sum_{k=1}^{l} \exp\left( -\sum_{t=1}^{T} \alpha_t\, \widetilde{\rho}(\widetilde{h}_t^k, x_i, y_i) \right) := F(\alpha),$$
with margin
$$\widetilde{\rho}(\widetilde{h}^k, x_i, y_i) = \widetilde{h}^k(x_i, y_i^k) - \max_{y^k \neq y_i^k} \widetilde{h}^k(x_i, y^k).$$

ESPBoost
Hypothesis space: the set of combined predictors
$$H_{\mathrm{ESPBoost}}(x) = \arg\max_{y \in \mathcal{Y}} \widetilde{h}(x, y).$$
Objective function $F : \mathbb{R}_+^T \to \mathbb{R}$:
$$F(\alpha) = \frac{1}{ml} \sum_{i=1}^{m} \sum_{k=1}^{l} \exp\left( -\sum_{t=1}^{T} \alpha_t\, \widetilde{\rho}(\widetilde{h}_t^k, x_i, y_i) \right).$$
Apply coordinate descent.

ESPBoost: algorithm
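A minimal sketch of an ESPBoost-style coordinate descent on $F(\alpha)$, assuming a fixed finite pool of candidate path experts summarized by their per-substructure margins $\widetilde{\rho}(\widetilde{h}_t^k, x_i, y_i) \in \{-1, +1\}$; the selection rule and step size below follow the standard AdaBoost closed form along one coordinate and are illustrative rather than a faithful reproduction of the algorithm in [1]:

```python
import numpy as np

def espboost(rho, T=10):
    """Coordinate descent on F(alpha) = (1/(m*l)) * sum_{i,k} exp(-sum_t alpha_t rho[t,i,k]).

    rho: (n_experts, m, l) array of per-substructure margins rho(h_t^k, x_i, y_i),
         which are +1 if expert t is correct on substructure k of example i, else -1.
    Returns the weights alpha over the candidate path experts.
    """
    n_experts, m, l = rho.shape
    alpha = np.zeros(n_experts)
    margins = np.zeros((m, l))                  # sum_t alpha_t * rho[t]
    for _ in range(T):
        D = np.exp(-margins)
        D /= D.sum()                            # distribution over (example, substructure) pairs
        errors = (D[None] * (rho < 0)).sum(axis=(1, 2))   # weighted error of each expert
        t = int(np.argmin(errors))
        eps = np.clip(errors[t], 1e-12, 1 - 1e-12)
        step = 0.5 * np.log((1 - eps) / eps)    # exact minimizer along this coordinate for +/-1 margins
        alpha[t] += step
        margins += step * rho[t]
    return alpha

# Toy example: 3 candidate path experts, m = 4 examples, l = 2 substructures.
rng = np.random.default_rng(0)
rho = rng.choice([-1.0, 1.0], size=(3, 4, 2))
print(espboost(rho, T=5))
```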

Generalized Kernel Approach
Problem setting: given $(x_i, y_i)_{i=1}^{n} \subset \mathcal{X} \times \mathcal{Y}$, we want to learn a mapping $f$ from $\mathcal{X}$ to $\mathcal{Y}$.
$\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$: a kernel function on outputs.
$\mathcal{F}_{\mathcal{Y}}$: the RKHS associated with $\ell$.
$\Phi_\ell$: the mapping from $\mathcal{Y}$ to $\mathcal{F}_{\mathcal{Y}}$.
Step 1: learn a mapping $g$ from $\mathcal{X}$ to $\mathcal{F}_{\mathcal{Y}}$.
Step 2: find the pre-image of $g(x)$.

Generalized Kernel Approach
Step 1: learn the mapping $g : \mathcal{X} \to \mathcal{F}_{\mathcal{Y}}$.
Preliminaries: operator-valued kernels; function-valued RKHS.

Operator-valued Kernels
Let $L(\mathcal{F}_{\mathcal{Y}})$ be the set of bounded operators $T : \mathcal{F}_{\mathcal{Y}} \to \mathcal{F}_{\mathcal{Y}}$.
A non-negative $L(\mathcal{F}_{\mathcal{Y}})$-valued kernel is a map $K : \mathcal{X} \times \mathcal{X} \to L(\mathcal{F}_{\mathcal{Y}})$ such that:
$\forall x_i, x_j \in \mathcal{X}$, $K(x_i, x_j) = K(x_j, x_i)^*$;
for any $m > 0$ and $\{(x_i, \varphi_i)\}_{i=1,\ldots,m} \subset \mathcal{X} \times \mathcal{F}_{\mathcal{Y}}$, $\sum_{i,j=1}^{m} \langle K(x_i, x_j)\varphi_j, \varphi_i \rangle_{\mathcal{F}_{\mathcal{Y}}} \geq 0$.
Note: $*$ denotes the adjoint operator, i.e. $\forall \varphi_1, \varphi_2 \in \mathcal{F}_{\mathcal{Y}}$, $\langle K^*\varphi_1, \varphi_2 \rangle_{\mathcal{F}_{\mathcal{Y}}} = \langle \varphi_1, K\varphi_2 \rangle_{\mathcal{F}_{\mathcal{Y}}}$.

Function-valued RKHS
A Hilbert space $\mathcal{F}_{\mathcal{XY}}$ of functions $g : \mathcal{X} \to \mathcal{F}_{\mathcal{Y}}$ is an $\mathcal{F}_{\mathcal{Y}}$-valued RKHS if there is a non-negative $L(\mathcal{F}_{\mathcal{Y}})$-valued kernel $K$ with the following properties:
$\forall x \in \mathcal{X},\ \varphi \in \mathcal{F}_{\mathcal{Y}}$: the function $K(x, \cdot)\varphi \in \mathcal{F}_{\mathcal{XY}}$;
$\forall g \in \mathcal{F}_{\mathcal{XY}},\ x \in \mathcal{X},\ \varphi \in \mathcal{F}_{\mathcal{Y}}$: $\langle g, K(x, \cdot)\varphi \rangle_{\mathcal{F}_{\mathcal{XY}}} = \langle g(x), \varphi \rangle_{\mathcal{F}_{\mathcal{Y}}}$.
Theorem: there is a bijection between $\mathcal{F}_{\mathcal{XY}}$ and $K$, as long as $K$ is non-negative.

Kernel Ridge Regression
Step 1: learn $g : \mathcal{X} \to \mathcal{F}_{\mathcal{Y}}$ by kernel ridge regression, which has a closed-form solution:
$$\arg\min_{g \in \mathcal{F}_{\mathcal{XY}}} \sum_{i=1}^{n} \|g(x_i) - \Phi_\ell(y_i)\|_{\mathcal{F}_{\mathcal{Y}}}^2 + \lambda \|g\|_{\mathcal{F}_{\mathcal{XY}}}^2, \qquad g(x) = \mathbf{K}_x (\mathbf{K} + \lambda I)^{-1} \boldsymbol{\Phi}_\ell,$$
where $\mathbf{K}_x$ is a row vector of operators $[K(\cdot, x_i) \in L(\mathcal{F}_{\mathcal{Y}})]_{i=1}^{n}$, $\mathbf{K}$ is a matrix of operators $[K(x_i, x_j) \in L(\mathcal{F}_{\mathcal{Y}})]_{i,j=1}^{n}$, and $\boldsymbol{\Phi}_\ell$ is a column vector of functions $[\Phi_\ell(y_i) \in \mathcal{F}_{\mathcal{Y}}]_{i=1}^{n}$.

Find the Pre-image
Step 2: find the pre-image of $g(x)$:
$$f(x) = \arg\min_{y \in \mathcal{Y}} \|g(x) - \Phi_\ell(y)\|_{\mathcal{F}_{\mathcal{Y}}}^2 = \arg\min_{y \in \mathcal{Y}} \|\mathbf{K}_x(\mathbf{K} + \lambda I)^{-1}\boldsymbol{\Phi}_\ell - \Phi_\ell(y)\|_{\mathcal{F}_{\mathcal{Y}}}^2 = \arg\min_{y \in \mathcal{Y}}\; \ell(y, y) - 2\,\langle \mathbf{K}_x(\mathbf{K} + \lambda I)^{-1}\boldsymbol{\Phi}_\ell,\, \Phi_\ell(y) \rangle_{\mathcal{F}_{\mathcal{Y}}}.$$
$\Phi_\ell(y)$ is unknown, so use a generalized kernel trick:
$$\langle T\,\Phi_\ell(y_i), \Phi_\ell(y) \rangle = [T\, \ell(y_i, \cdot)](y).$$
Express $f(x)$ using only kernel functions:
$$f(x) = \arg\min_{y \in \mathcal{Y}}\; \ell(y, y) - 2\,\big[\mathbf{K}_x(\mathbf{K} + \lambda I)^{-1} \mathbf{L}\big](y),$$
where $\mathbf{L}$ is a column vector of the functions $[\ell(y_i, \cdot)]_{i=1}^{n}$.
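As a concrete illustration, here is a minimal sketch of both steps in the special case of a separable kernel $K(x_i, x_j) = k(x_i, x_j)\,\mathrm{Id}$ (identity operator instead of the covariance operator of the next slide), where the formula reduces to scalar Gram matrices and the pre-image is found by searching a finite candidate set; all names and the toy data are illustrative:

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    """Scalar Gaussian kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_predict(X, Y, x_new, candidates, output_kernel, lam=0.1, gamma=1.0):
    """Structured output prediction with K(x, x') = k(x, x') Id.

    Step 1 (ridge): coefficients A = (K_X + lam I)^{-1}.
    Step 2 (pre-image): f(x) = argmin_y l(y,y) - 2 k_x^T A l_Y(y) over the candidates.
    """
    n = len(X)
    A = np.linalg.solve(rbf(X, X, gamma) + lam * np.eye(n), np.eye(n))
    k_x = rbf(x_new[None, :], X, gamma)[0]          # k(x, x_i), shape (n,)
    best, best_obj = None, np.inf
    for y in candidates:
        l_y = np.array([output_kernel(y_i, y) for y_i in Y])   # l(y_i, y)
        obj = output_kernel(y, y) - 2.0 * k_x @ A @ l_y
        if obj < best_obj:
            best, best_obj = y, obj
    return best

# Toy data: outputs are label sequences; the output kernel counts matching positions.
output_kernel = lambda y, yp: float(sum(a == b for a, b in zip(y, yp)))
X = np.array([[0.0], [1.0], [2.0]])
Y = [("a", "b"), ("a", "c"), ("d", "c")]
print(fit_predict(X, Y, np.array([0.1]), candidates=Y, output_kernel=output_kernel))
```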

Covariance-based Operator-valued Kernels
Covariance-based operator-valued kernels: $K(x_i, x_j) = k(x_i, x_j)\, C_{YY}$, where
$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a scalar-valued kernel;
$C_{YY} : \mathcal{F}_{\mathcal{Y}} \to \mathcal{F}_{\mathcal{Y}}$ is a covariance operator defined by a random variable $Y \in \mathcal{Y}$: $\langle \varphi_1, C_{YY}\varphi_2 \rangle_{\mathcal{F}_{\mathcal{Y}}} = \mathbb{E}[\varphi_1(Y)\,\varphi_2(Y)]$.
Empirical covariance operator:
$$\widehat{C}_{YY}(\varphi) = \frac{1}{n} \sum_{i=1}^{n} \varphi(y_i)\, \ell(\cdot, y_i).$$
To account for the effects of the input, we could also use the conditional covariance operator
$$C_{YY|X} = C_{YY} - C_{YX}\, C_{XX}^{-1}\, C_{XY}.$$
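A tiny sketch of the empirical covariance operator in action: applying $\widehat{C}_{YY}$ to a function of the form $\varphi = \ell(\cdot, y)$ and evaluating the result at a point $y'$ gives $[\widehat{C}_{YY}\,\ell(\cdot, y)](y') = \frac{1}{n}\sum_i \ell(y_i, y)\,\ell(y', y_i)$ (the output kernel and data below are illustrative):

```python
import numpy as np

def empirical_cov_apply(output_kernel, Y_sample, y, y_prime):
    """Evaluate [C_hat_YY l(., y)](y') = (1/n) * sum_i l(y_i, y) * l(y', y_i)."""
    n = len(Y_sample)
    return sum(output_kernel(y_i, y) * output_kernel(y_prime, y_i) for y_i in Y_sample) / n

# Toy output kernel on label sequences: number of matching positions.
output_kernel = lambda a, b: float(sum(u == v for u, v in zip(a, b)))
Y_sample = [("a", "b"), ("a", "c"), ("d", "c")]
print(empirical_cov_apply(output_kernel, Y_sample, ("a", "b"), ("a", "c")))
```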

Conclusion
Ensemble methods: ensemble learning with an expanded set of path experts.
- On-line algorithm (WMWP): efficient learning and inference by exploiting the output structure.
- On-line-to-batch conversion: randomized and deterministic algorithms with learning guarantees.
- Boosting (ESPBoost): efficient for structured outputs.
Kernel method: uses a joint feature space.
- Covariance-based operator-valued kernels encode interactions between the outputs.
- Conditional covariance operators correlate the input with the interactions between outputs.
- The final hypothesis is expressed using only kernel functions.

References
[1] Corinna Cortes, Vitaly Kuznetsov, and Mehryar Mohri. Ensemble methods for structured prediction. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pages 1134-1142, 2014.
[2] Hachem Kadri, Mohammad Ghavamzadeh, and Philippe Preux. A Generalized Kernel Approach to Structured Output Learning. In International Conference on Machine Learning (ICML), Atlanta, United States, June 2013.