Notes on the framework of Ando and Zhang (2005)
Karl Stratos

1 Beyond learning good functions: learning good spaces

1.1 A single binary classification problem

Let $X$ denote the problem domain. Suppose we want to learn a binary hypothesis function $h : X \rightarrow \{-1, +1\}$ defined as
$$h(x) := \begin{cases} +1 & \text{if } f(x) \geq 0 \\ -1 & \text{otherwise} \end{cases}$$
where $f : X \rightarrow \mathbb{R}$ assigns a real-valued score to a given $x \in X$. We assume some loss function $L : \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}$ that can be used to evaluate the prediction $f(x) \in \mathbb{R}$ against a given label $y \in \{-1, +1\}$. Here are some example loss functions:
$$L(y', y) := \log(1 + \exp(-y'y)) \qquad \text{(logistic loss)}$$
$$L(y', y) := \max(0, 1 - y'y) \qquad \text{(hinge loss)}$$
In a risk minimization approach, we seek a function $f^*$ within a space of allowed functions $H$ that minimizes the expected loss over the distribution $D$ of instances $(x, y) \in X \times \{-1, +1\}$:
$$f^* := \arg\min_{f \in H} \mathbf{E}_{(X,Y) \sim D}\left[L(f(X), Y)\right]$$
Of course, we don't know the distribution $D$. In practice, we seek a function $\hat{f}$ that minimizes the empirical loss over the training data $(x^{(1)}, y^{(1)}) \ldots (x^{(n)}, y^{(n)})$:
$$\hat{f} := \arg\min_{f \in H} \sum_{i=1}^{n} L(f(x^{(i)}), y^{(i)}) \qquad (1)$$
This approach is called empirical risk minimization (ERM). One can easily find $\hat{f}$ if $L$ is defined so that the objective in Eq. (1) is convex with respect to the function parameters.

1.2 Multiple binary classification problems

Suppose we now have $m$ binary classification problems. Let $X_l$ denote the problem domain of the $l$-th problem. We want to learn $m$ binary hypothesis functions $h_1 \ldots h_m$ defined as
$$h_l(x) := \begin{cases} +1 & \text{if } f_l(x) \geq 0 \\ -1 & \text{otherwise} \end{cases}$$
where $f_l : X_l \rightarrow \mathbb{R}$ assigns a real-valued score to a given input $x \in X_l$ from the $l$-th domain. A simple way to learn $f_1 \ldots f_m$ is to assume that each problem is independent, so that there is no interaction between the hypothesis spaces $H_1 \ldots H_m$. Then, following the usual ERM approach, we simply solve for each function separately. Assuming we have $n_l$ training examples $(x^{(1;l)}, y^{(1;l)}) \ldots (x^{(n_l;l)}, y^{(n_l;l)}) \in X_l \times \{-1, +1\}$ for the $l$-th problem drawn iid from a distribution $D_l$, find
$$\hat{f}_l := \arg\min_{f_l \in H_l} \sum_{i=1}^{n_l} L(f_l(x^{(i;l)}), y^{(i;l)}) \qquad (2)$$
for $l = 1 \ldots m$.
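As a concrete illustration of Eq. (1), here is a minimal sketch (not from the notes) of ERM with the logistic loss for a single binary problem, using a linear scorer $f(x) = w^\top \phi(x)$ and plain gradient descent. The feature data, step size, iteration count, and the optional squared-norm regularizer are illustrative assumptions, and the code minimizes the average rather than the sum of the losses, which differs from Eq. (1) only by a constant factor.

```python
import numpy as np

def erm_logistic(Phi, y, lam=0.0, lr=0.5, iters=500):
    """Minimize (1/n) * sum_i log(1 + exp(-y_i * w.phi_i)) + lam * ||w||^2 by gradient descent.

    Phi : (n, d) array whose rows are feature vectors phi(x_i)
    y   : (n,) array of labels in {-1, +1}
    """
    n, d = Phi.shape
    w = np.zeros(d)
    for _ in range(iters):
        margins = y * (Phi @ w)                              # y_i * f(x_i)
        # gradient of the logistic loss: -y_i * phi_i * sigmoid(-margin_i)
        grad = -(Phi.T @ (y / (1.0 + np.exp(margins)))) / n + 2 * lam * w
        w -= lr * grad
    return w

# Toy usage: two Gaussian blobs labeled +1 and -1.
rng = np.random.default_rng(0)
Phi = np.vstack([rng.normal(+1.0, 1.0, (50, 2)), rng.normal(-1.0, 1.0, (50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])
w_hat = erm_logistic(Phi, y, lam=0.01)
print("training accuracy:", np.mean(np.sign(Phi @ w_hat) == y))
```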

A more interesting setting is one where the problems share a certain common structure. To realize this setting, we introduce an abstract parameter $\Theta \in \Gamma$ for some parameter space $\Gamma$ and assume that all $m$ hypothesis spaces are influenced by $\Theta$. To mark this influence, we now denote the spaces by $H_{1,\Theta} \ldots H_{m,\Theta}$ and the functions by $f_{1,\Theta} \ldots f_{m,\Theta}$. If we know $\Theta$, then this new setting reduces to the previous setting, since the problems are conditionally independent given $\Theta$. Again, we simply solve for each function separately in a manner similar to Eq. (2):
$$\hat{f}_{l,\Theta} := \arg\min_{f_{l,\Theta} \in H_{l,\Theta}} \sum_{i=1}^{n_l} L(f_{l,\Theta}(x^{(i;l)}), y^{(i;l)})$$
But instead of assuming a known $\Theta \in \Gamma$, we will optimize it jointly with the model parameters $\{w_l, v_l\}_{l=1}^m$ introduced in Section 2. In this approach, we seek to minimize the sum of average losses across the $m$ problems:
$$\Theta^*, \{f_{l,\Theta}^*\}_{l=1}^m := \arg\min_{\Theta \in \Gamma,\, f_{l,\Theta} \in H_{l,\Theta}} \sum_{l=1}^m \frac{1}{n_l} \sum_{i=1}^{n_l} L(f_{l,\Theta}(x^{(i;l)}), y^{(i;l)}) \qquad (3)$$
Note that for each problem $l = 1 \ldots m$, we minimize the average loss (normalized by the number of training examples $n_l$), since the problems may have vastly different amounts of training data.

2 Learning strategies

To make the exposition in this section concrete, we will adopt the following setup of Ando and Zhang (2005). We assume a single feature function $\phi$ that maps an input $x \in X_l$ from any of the $m$ problems to a vector $\phi(x) \in \mathbb{R}^d$. We consider the structural parameter space $\Gamma = \mathbb{R}^{k \times d}$ for some $k$. For the $l$-th problem and for a given structural parameter $\Theta \in \mathbb{R}^{k \times d}$, we consider a hypothesis space $H_{l,\Theta}$ of linear functions $f_{l,\Theta} : X_l \rightarrow \mathbb{R}$ parameterized by $w_l \in \mathbb{R}^d$ and $v_l \in \mathbb{R}^k$:
$$f_{l,\Theta}(x) = (w_l^\top + v_l^\top \Theta)\,\phi(x) \qquad \forall x \in X_l \qquad (4)$$
Here is how to interpret the setup. If $\Theta$ is a zero matrix, then the hypothesis function for each problem independently makes its own predictions as a standard linear classifier: $f_{l,\Theta}(x) = w_l^\top \phi(x)$. Otherwise, $\Theta$ serves as a global force that influences the predictions of all $m$ hypothesis functions $f_{1,\Theta} \ldots f_{m,\Theta}$.

Example 2.1. Let $m$ denote the number of distinct words in the vocabulary. Given an occurrence of word $x$ in a corpus, we use $\phi(x) \in \mathbb{R}^d$ to represent the context of $x$. For example, for the sentence "the dog barked", $\phi(\text{dog}) \in \mathbb{R}^{2m}$ might be a two-hot encoding of the words "the" and "barked". Note that the word $x$ itself is not encoded in $\phi(x)$. We have $m$ binary predictors $f_{1,\Theta} \ldots f_{m,\Theta}$, where $f_{l,\Theta}(x) = (w_l^\top + v_l^\top \Theta)\phi(x)$ corresponds to predicting whether $x$ is the $l$-th word type given only the context $\phi(x)$.

2.1 Joint optimization on training data

Assume $R(\Theta)$ and $r_l(w_l, v_l)$ are some regularizers for matrices $\Theta \in \mathbb{R}^{k \times d}$ and vectors $w_l \in \mathbb{R}^d$, $v_l \in \mathbb{R}^k$. Then we can implement the abstract training objective in Eq. (3) by defining $H_{l,\Theta}$ as a space of regularized linear functions parameterized by $\Theta \in \mathbb{R}^{k \times d}$, $w_l \in \mathbb{R}^d$, and $v_l \in \mathbb{R}^k$ as in Eq. (4):
$$\Theta^*, \{w_l^*, v_l^*\}_{l=1}^m := \arg\min_{\Theta, \{w_l, v_l\}_{l=1}^m} R(\Theta) + \sum_{l=1}^m \left( r_l(w_l, v_l) + \frac{1}{n_l} \sum_{i=1}^{n_l} L((w_l^\top + v_l^\top \Theta)\phi(x^{(i;l)}), y^{(i;l)}) \right) \qquad (5)$$
We assume that $r_l$ and $L$ are chosen so that Eq. (5) is convex with respect to $w_l, v_l$ for a fixed $\Theta$. But it is not convex with respect to $w_l, v_l, \Theta$ jointly, so one cannot hope to find the globally optimal $\Theta^*, \{w_l^*, v_l^*\}_{l=1}^m$.
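To make Eqs. (4) and (5) concrete, here is a small sketch (not from the notes) that scores inputs with the shared structure $\Theta$ and evaluates the joint objective on toy data. The logistic loss and the specific regularizers $R(\Theta) = \lambda_\Theta \|\Theta\|_F^2$ and $r_l(w_l, v_l) = \lambda(\|w_l\|^2 + \|v_l\|^2)$ are illustrative assumptions; the notes leave $R$ and $r_l$ abstract.

```python
import numpy as np

def score(w_l, v_l, Theta, phi_x):
    """Eq. (4): f_{l,Theta}(x) = (w_l^T + v_l^T Theta) phi(x)."""
    return (w_l + Theta.T @ v_l) @ phi_x

def joint_objective(Theta, W, V, data, lam_theta=0.1, lam=0.1):
    """Eq. (5) with assumed regularizers R(Theta) = lam_theta*||Theta||_F^2 and
    r_l(w_l, v_l) = lam*(||w_l||^2 + ||v_l||^2), and the logistic loss.

    data[l] = (Phi_l, y_l): an (n_l, d) feature matrix and an (n_l,) label vector in {-1, +1}.
    """
    obj = lam_theta * np.sum(Theta ** 2)
    for l, (Phi_l, y_l) in enumerate(data):
        preds = Phi_l @ (W[l] + Theta.T @ V[l])          # vectorized Eq. (4)
        obj += lam * (W[l] @ W[l] + V[l] @ V[l])
        obj += np.mean(np.log1p(np.exp(-y_l * preds)))   # average loss, i.e. (1/n_l) * sum
    return obj

# Toy usage: m = 3 problems, d = 5 features, k = 2 shared dimensions.
rng = np.random.default_rng(0)
m, d, k = 3, 5, 2
Theta = rng.normal(size=(k, d))
W = [rng.normal(size=d) for _ in range(m)]
V = [rng.normal(size=k) for _ in range(m)]
data = [(rng.normal(size=(20, d)), rng.choice([-1, 1], size=20)) for _ in range(m)]
print("score of first example in problem 0:", score(W[0], V[0], Theta, data[0][0][0]))
print("joint objective:", joint_objective(Theta, W, V, data))
```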

A natural optimization method is alternating minimization. Initialize $\Theta$ to some value and repeat until convergence:

1. Optimize the objective in Eq. (5) with respect to $w_l, v_l$, with $\Theta$ fixed.
2. Optimize the objective in Eq. (5) with respect to $\Theta$, with $w_l, v_l$ fixed.

But Ando and Zhang (2005) modify the objective slightly so that the alternating minimization can use singular value decomposition (SVD) as a subroutine.

2.2 Alternating minimization using SVD

Here is a variant of the objective in Eq. (5):
$$\Theta^*, \{w_l^*, v_l^*\}_{l=1}^m := \arg\min_{\Theta, \{w_l, v_l\}_{l=1}^m} \sum_{l=1}^m \left( \lambda_l \|w_l\|^2 + \frac{1}{n_l} \sum_{i=1}^{n_l} L((w_l^\top + v_l^\top \Theta)\phi(x^{(i;l)}), y^{(i;l)}) \right) \quad \text{subject to } \Theta\Theta^\top = I_{k \times k} \qquad (6)$$
In other words, we set $r_l(w_l, v_l) = \lambda_l \|w_l\|^2$ (where $\lambda_l \geq 0$ is a regularization hyperparameter) and introduce an orthogonality constraint $\Theta\Theta^\top = I_{k \times k}$. This constraint can be seen as implicitly regularizing the parameter $\Theta$.

Now a key step: a change of variables. Define $u_l := w_l + \Theta^\top v_l \in \mathbb{R}^d$. We can then rewrite Eq. (6) as:
$$\Theta^*, \{u_l^*, v_l^*\}_{l=1}^m := \arg\min_{\Theta, \{u_l, v_l\}_{l=1}^m} \sum_{l=1}^m \left( \lambda_l \|u_l - \Theta^\top v_l\|^2 + \frac{1}{n_l} \sum_{i=1}^{n_l} L(u_l^\top \phi(x^{(i;l)}), y^{(i;l)}) \right) \qquad (7)$$
Once we have $\Theta^*, \{u_l^*, v_l^*\}_{l=1}^m$, we can recover $w_l^* = u_l^* - \Theta^{*\top} v_l^*$. The purpose of this non-obvious step is to achieve the following effect. When $u_l$ is fixed for $l = 1 \ldots m$, Eq. (7) reduces to:
$$\Theta^*, \{v_l^*\}_{l=1}^m := \arg\min_{\Theta, \{v_l\}_{l=1}^m} \sum_{l=1}^m \lambda_l \|u_l - \Theta^\top v_l\|^2 \qquad (8)$$
One more key step: inspect what the optimal parameters $\{v_l^*\}_{l=1}^m$ are for a fixed $\Theta$ in Eq. (8). That is, analyze the form of
$$\{v_l^*\}_{l=1}^m := \arg\min_{\{v_l\}_{l=1}^m} \sum_{l=1}^m \lambda_l \|u_l - \Theta^\top v_l\|^2 \qquad (9)$$
for some $\Theta \in \mathbb{R}^{k \times d}$ such that $\Theta\Theta^\top = I_{k \times k}$. Differentiating the right-hand side with respect to $v_l$ and setting it to zero (using $\Theta\Theta^\top = I_{k \times k}$), we see that $v_l^* = \Theta u_l$. Note that this holds for any given $\Theta$, in particular the optimal $\Theta^*$ in Eq. (8). This is good, because we can now pretend there is no $v_l$ in Eq. (8) by plugging in $v_l = \Theta u_l$. Since $\|u_l - \Theta^\top \Theta u_l\|^2 = \|u_l\|^2 - u_l^\top \Theta^\top \Theta u_l$ under the constraint $\Theta\Theta^\top = I_{k \times k}$, minimizing Eq. (8) is equivalent to maximizing $\sum_{l=1}^m \lambda_l u_l^\top \Theta^\top \Theta u_l$. Eq. (8) is now formulated entirely in terms of $\Theta$:
$$\Theta^* := \arg\max_{\Theta : \Theta\Theta^\top = I_{k \times k}} \sum_{l=1}^m \lambda_l u_l^\top \Theta^\top \Theta u_l \qquad (10)$$
This form is almost there. To see more clearly what the solution $\Theta^*$ is, we take a final step: organize the vectors $u_1 \ldots u_m$ and the regularization parameters $\lambda_1 \ldots \lambda_m$ into a matrix $M = [\sqrt{\lambda_1}\, u_1 \ldots \sqrt{\lambda_m}\, u_m] \in \mathbb{R}^{d \times m}$. Let $\theta_l \in \mathbb{R}^d$ denote the $l$-th row of $\Theta \in \mathbb{R}^{k \times d}$. By construction, the diagonal entries of $\Theta M M^\top \Theta^\top \in \mathbb{R}^{k \times k}$ have the following form: for $l = 1 \ldots k$,
$$[\Theta M M^\top \Theta^\top]_{l,l} = \theta_l^\top M M^\top \theta_l = \sum_{l'=1}^m \lambda_{l'} (\theta_l^\top u_{l'})^2$$
and summing these diagonal entries over $l = 1 \ldots k$ recovers the objective $\sum_{l=1}^m \lambda_l u_l^\top \Theta^\top \Theta u_l$ in Eq. (10). Then Eq. (10) can be rewritten as:
$$\{\theta_1^*, \ldots, \theta_k^*\} := \arg\max_{\theta_1 \ldots \theta_k} \sum_{l=1}^k \theta_l^\top M M^\top \theta_l \quad \text{subject to } \theta_i^\top \theta_j = 1 \text{ if } i = j \text{ and } 0 \text{ otherwise} \qquad (11)$$
At last, it is clear from Eq. (11) that the solution $\theta_1^* \ldots \theta_k^*$ is given by the $k$ eigenvectors of $M M^\top \in \mathbb{R}^{d \times d}$ that correspond to the largest $k$ eigenvalues. Equivalently, they are the $k$ left singular vectors of $M \in \mathbb{R}^{d \times m}$ that correspond to the largest $k$ singular values.
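Here is a small numerical check of this conclusion (a sketch; the dimensions and values below are arbitrary): build $M$ from the columns $\sqrt{\lambda_l}\, u_l$, take the top-$k$ left singular vectors of $M$ as the rows of $\Theta$, and compare the Eq. (10)/(11) objective against a random orthonormal $\Theta$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 8, 5, 3                          # illustrative sizes
lam = rng.uniform(0.5, 2.0, size=m)        # regularization parameters lambda_l
U = rng.normal(size=(d, m))                # column l is the vector u_l

# M = [sqrt(lambda_1) u_1 ... sqrt(lambda_m) u_m], of shape (d, m)
M = U * np.sqrt(lam)

def objective(Theta):
    """Eq. (10): sum_l lambda_l u_l^T Theta^T Theta u_l = trace(Theta M M^T Theta^T)."""
    return np.trace(Theta @ M @ M.T @ Theta.T)

# Rows of the optimal Theta = top-k left singular vectors of M.
left_sv, sigma, _ = np.linalg.svd(M, full_matrices=False)
Theta_opt = left_sv[:, :k].T               # shape (k, d), satisfies Theta Theta^T = I

# A random orthonormal Theta for comparison (QR of a random d x k matrix).
Theta_rand = np.linalg.qr(rng.normal(size=(d, k)))[0].T

print("objective with SVD Theta    :", objective(Theta_opt))
print("sum of top-k squared sigmas :", np.sum(sigma[:k] ** 2))   # equals the line above
print("objective with random Theta :", objective(Theta_rand))    # never larger
```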

Based on this derivation, we have an alternating minimization algorithm that uses SVD as a subroutine to approximate the solution of Eq. (6):

SVD-based Alternating Optimization of Ando and Zhang (2005)

Input: for $l = 1 \ldots m$: training examples $(x^{(1;l)}, y^{(1;l)}) \ldots (x^{(n_l;l)}, y^{(n_l;l)}) \in X_l \times \{-1, +1\}$ and a regularization parameter $\lambda_l \geq 0$; a suitable loss function $L : \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}$

Output: an approximation of the parameters in Eq. (6): $\hat{w}_l \in \mathbb{R}^d$ and $\hat{v}_l \in \mathbb{R}^k$ for $l = 1 \ldots m$, and a structural parameter $\hat{\Theta} \in \mathbb{R}^{k \times d}$ shared across the $m$ problems

Initialize $\hat{u}_l \leftarrow 0 \in \mathbb{R}^d$ for $l = 1 \ldots m$ and $\hat{\Theta} \in \mathbb{R}^{k \times d}$ randomly. Until convergence:

1. For each $l = 1 \ldots m$, set $\hat{v}_l = \hat{\Theta} \hat{u}_l$ and optimize the convex objective (e.g., using subgradient descent)
$$\hat{w}_l = \arg\min_{w_l \in \mathbb{R}^d} \lambda_l \|w_l\|^2 + \frac{1}{n_l} \sum_{i=1}^{n_l} L((w_l^\top + \hat{v}_l^\top \hat{\Theta})\phi(x^{(i;l)}), y^{(i;l)})$$

2. For each $l = 1 \ldots m$, set $\hat{u}_l = \hat{w}_l + \hat{\Theta}^\top \hat{v}_l$ and set
$$\hat{M} = [\sqrt{\lambda_1}\, \hat{u}_1 \ldots \sqrt{\lambda_m}\, \hat{u}_m] \in \mathbb{R}^{d \times m}$$
Compute the $k$ left singular vectors $\hat{\theta}_1 \ldots \hat{\theta}_k \in \mathbb{R}^d$ of $\hat{M}$ corresponding to the $k$ largest singular values and set $\hat{\Theta}$ to be the matrix whose rows are $\hat{\theta}_1 \ldots \hat{\theta}_k$.

Return $\hat{\Theta}$ and $\{\hat{w}_l, \hat{v}_l\}_{l=1}^m$, where $\hat{v}_l = \hat{\Theta} \hat{u}_l$.
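The following is a compact sketch of this loop under illustrative assumptions: the logistic loss, plain gradient descent for the convex step in place of subgradient descent, and fixed iteration counts instead of a convergence check. It is not the authors' implementation, only one possible instance of the algorithm above.

```python
import numpy as np

def svd_alternating_optimization(data, lams, k, outer_iters=5, lr=0.5, inner_iters=200):
    """Sketch of SVD-based alternating optimization (assumes k <= min(d, m)).

    data : list of (Phi_l, y_l) pairs; Phi_l has shape (n_l, d), y_l entries are in {-1, +1}
    lams : list of regularization parameters lambda_l >= 0
    k    : number of rows of the structural parameter Theta
    """
    m, d = len(data), data[0][0].shape[1]
    rng = np.random.default_rng(0)
    U = np.zeros((d, m))                                  # column l holds u_l, initialized to 0
    Theta = np.linalg.qr(rng.normal(size=(d, k)))[0].T    # random orthonormal rows
    W = [np.zeros(d) for _ in range(m)]
    V = [np.zeros(k) for _ in range(m)]

    for _ in range(outer_iters):
        # Step 1: fix Theta, set v_l = Theta u_l, and solve the convex problem in w_l.
        for l, (Phi, y) in enumerate(data):
            V[l] = Theta @ U[:, l]
            shift = Phi @ (Theta.T @ V[l])                # the v_l^T Theta phi(x) part of the score
            w = np.zeros(d)
            for _ in range(inner_iters):                  # gradient descent on regularized logistic loss
                margins = y * (Phi @ w + shift)
                grad = -(Phi.T @ (y / (1.0 + np.exp(margins)))) / len(y) + 2 * lams[l] * w
                w -= lr * grad
            W[l] = w
        # Step 2: recompute u_l = w_l + Theta^T v_l and refresh Theta from the SVD of M.
        for l in range(m):
            U[:, l] = W[l] + Theta.T @ V[l]
        M = U * np.sqrt(np.asarray(lams))                 # M = [sqrt(lambda_l) u_l]
        Theta = np.linalg.svd(M, full_matrices=False)[0][:, :k].T
    return Theta, W, [Theta @ U[:, l] for l in range(m)]  # v_l = Theta u_l
```

In the semi-supervised recipe of Section 3 below, a routine like this would be run on the auxiliary word-prediction problems, and only the returned $\hat{\Theta}$ would be kept.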

3 Application to semi-supervised learning

Ando and Zhang (2005) propose applying this framework to semi-supervised learning, i.e., leveraging unlabeled data on top of the labeled data one already has. The basic idea is the following. Given labeled data for some supervised task, one creates a large amount of auxiliary labeled data for some related task. By finding a common structure $\Theta$ shared between the original task and the artificially constructed task, one can hope to utilize the information in the unlabeled data.

How to artificially construct a task such that (1) it is related to the original task and (2) its labeled data can be easily obtained from unlabeled data is best illustrated with an example. Consider the task of associating a word in a sentence with a part-of-speech tag. For simplicity, we will not consider any structured prediction such as Viterbi decoding. Instead, we independently predict the part-of-speech tag of word $x_j$ (the $j$-th word in sentence $x$) given some feature representation $\phi(x, j)$. We will probably want to include features such as the identity of words in a local context ($x_{j-1}, x_j, x_{j+1}$), the spelling of the considered word (prefixes and suffixes of $x_j$), and so on. Observing that part-of-speech tags are intimately correlated with the word types themselves, we create an artificial problem of predicting the word in the current position given the previous and next words. This task is also illustrated in Example 2.1. Note that we can use unlabeled text for this problem.

In summary, here is a description of how we might apply this framework to this semi-supervised task:

1. Apply the SVD-based Alternating Optimization algorithm to learn binary classifiers (corresponding to word types) that predict the word in the current position given the previous and next words. In the described setup, we can simply use the existing feature function $\phi(x, j)$ by setting all irrelevant features (such as spelling features) to zero. Ando and Zhang (2005) report that iterating only once is sufficient in practice.

2. Fix the output structure $\hat{\Theta}$ from step 1 and throw away the other parameters $\hat{w}_l, \hat{v}_l$ returned by the algorithm. Now re-learn $\hat{w}_l, \hat{v}_l$ on the original task of predicting the part-of-speech tag of word $x_j$ given $\phi(x, j)$, using the fixed $\hat{\Theta}$ from step 1 in the algorithm.

The output of this procedure is a set of binary classifiers for predicting the part-of-speech tag of a word in a sentence. But during training, these classifiers took advantage of the shared structure $\hat{\Theta}$ learned on the task of predicting the current word given its neighboring words. Since the two tasks are closely related, the original problem benefits from using $\hat{\Theta}$. Ando and Zhang (2005) report very good performance on sequence labeling tasks such as named-entity recognition. But because this framework is formulated entirely in terms of binary classification, they use their own sequence labeling method, which also reduces the problem to binary classification and is beyond the scope of this note.

References

Ando, R. K. and Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6, 1817-1853.