Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.

Course website: http://goo.gl/jv7vj9

KYOTO UNIVERSITY
Statistical Machine Learning Theory
From Multi-class Classification to Structured Output Prediction
Hisashi Kashima kashima@i.kyoto-u.ac.jp
DEPARTMENT OF INTELLIGENCE SCIENCE AND TECHNOLOGY

Topics of the 2nd half of this course: Advanced supervised learning and unsupervised learning
- Multi-class classification and structured output prediction
- Other variants of supervised learning problems: semi-supervised learning, active learning, & transfer learning
- On-line learning: follow the leader, on-line gradient descent, perceptron; regret analysis
- Sparse modeling: L1 regularization, Lasso, & reduced rank regression
- Model evaluation

Homework

Homework: Supervised regression
Work on a supervised regression problem:
1. Implement at least one linear regression method by yourself
2. Use publicly available nonlinear regression implementations
Participate in the competition at http://universityofbigdata.net
- Register with your Google account (if you have not done so yet)
- The competition will last until Dec. 31st
- Submit your predictions at least twice (once with your own implementation; once with another method)

Submitting your prediction: http://goo.gl/3bmpf4
See the instructions at http://universityofbigdata.net/competition/5757948332605440?lang=en

Report submission: Submit a report summarizing your work
Submission:
- Due: noon, Jan. 7th, 2016
- Send your report to kashipong+report@gmail.com with the subject "SML2015 competition report", and confirm that you receive an acknowledgment before the 8th
Report format:
- Must include a brief description of your implementation (not the source code): your approach, analysis pipeline, results, and discussion
- At least 3 pages, but do not exceed 6 pages, in LNCS format

Topics of the 2nd half of this course: Advanced supervised learning and unsupervised learning
- Multi-class classification and structured output prediction
- Other variants of supervised learning problems: semi-supervised learning, active learning, & transfer learning
- On-line learning: follow the leader, on-line gradient descent, perceptron; regret analysis
- Sparse modeling: L1 regularization, Lasso, & reduced rank regression
- Model evaluation

Multi-class Classification

Multi-class classification: Generalization of supervised two-class classification
Training dataset: (x_1, y_1), ..., (x_i, y_i), ..., (x_N, y_N)
- input x_i ∈ X = R^D: D-dimensional real vector
- output y_i ∈ Y: one-dimensional scalar
Estimate a deterministic mapping f: X → Y (often with a confidence value) or a conditional probability P(y|x)
Classification:
- Y = {+1, -1}: two-class classification
- Y = {1, 2, ..., K}: K-class classification (hand-written digit recognition, text classification, ...)

Two-class classification model: One model with one model parameter vector
Two-class classification models:
- Linear classifier: f(x) = sign(w·x) ∈ {+1, -1}
- Logistic regression: P(y = +1 | x) = 1 / (1 + exp(-w·x))
The model is specified by the parameter vector w = (w_1, w_2, ..., w_D)
Our goal is to find the parameter w by using the training dataset (x_1, y_1), (x_2, y_2), ..., (x_N, y_N)
Generalization: accurate prediction for future data sampled from some underlying distribution D_(x,y)
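As a minimal sketch of the two models above (the parameter values and the toy input are illustrative, not from the slides), the following Python snippet evaluates the linear classifier sign(w·x) and the logistic probability P(y = +1 | x):

import numpy as np

def linear_classify(w, x):
    # Two-class linear classifier: predict +1 or -1 from the sign of w.x
    return 1 if np.dot(w, x) >= 0 else -1

def logistic_prob(w, x):
    # Logistic regression: P(y = +1 | x) = 1 / (1 + exp(-w.x))
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

w = np.array([1.0, -2.0, 0.5])   # example parameter vector
x = np.array([0.3, 0.1, 1.0])    # example input
print(linear_classify(w, x), logistic_prob(w, x))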

Simple approaches to multi-class classification: Reduction to two-class classification
Reduce the problem to a set of two-class classification problems.
Approach 1: One-versus-rest
- Construct K two-class classifiers; each classifier sign(w^(k)·x) discriminates class k from the others
- Prediction: the class with the highest confidence w^(k)·x
Approach 2: One-versus-one
- Construct K(K-1)/2 two-class classifiers, each of which discriminates between a pair of classes
- Prediction by voting, weighted by confidence
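A minimal sketch of the one-versus-rest prediction rule, assuming the K per-class weight vectors have already been trained and are stacked as the rows of a matrix W (the matrix layout and the toy numbers are my own):

import numpy as np

def one_vs_rest_predict(W, x):
    # W: (K, D) matrix whose k-th row is the weight vector of the "class k vs. rest" classifier.
    # Return the class whose classifier gives the highest confidence w^(k).x.
    scores = W @ x
    return int(np.argmax(scores))

W = np.array([[ 1.0,  0.0],    # classifier for class 0 vs. rest
              [ 0.0,  1.0],    # classifier for class 1 vs. rest
              [-1.0, -1.0]])   # classifier for class 2 vs. rest
print(one_vs_rest_predict(W, np.array([0.2, 0.9])))  # -> 1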

Error Correcting Output Code (ECOC): An approach inspired by error-correcting coding
Approach 3: Error correcting output code (ECOC)
- Construct a set of two-class classifiers, each of which discriminates between two groups of classes, e.g. AB vs. CD
- Prediction by finding the nearest code in terms of Hamming distance

Codes (rows = classes, columns = two-class classification problems):
  class    1   2   3   4   5   6
  A        1   1   1   1   1   1
  B        1  -1   1  -1  -1  -1
  C       -1  -1  -1   1  -1   1
  D       -1   1   1  -1  -1   1
Prediction code (1, 1, 1, 1, 1, -1) is closest to the code for class A.

Design of ECOC: Code design is the key to good classification
Codes (rows) should be far apart from each other in terms of Hamming distance.
  class    1   2   3   4   5   6
  A        1   1   1   1   1   1
  B        1  -1   1  -1  -1  -1
  C       -1  -1  -1   1  -1   1
  D       -1   1   1  -1  -1   1
Hamming distances between codes:
        A   B   C   D
  A     0   4   4   3
  B         0   4   3
  C             0   3
  D                 0
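A small sketch of ECOC decoding with the code table above; the outputs of the six two-class classifiers are taken as given here (in practice they would come from trained classifiers):

import numpy as np

# Code matrix from the slides: rows = classes A-D, columns = two-class problems 1-6.
codes = np.array([[ 1,  1,  1,  1,  1,  1],   # A
                  [ 1, -1,  1, -1, -1, -1],   # B
                  [-1, -1, -1,  1, -1,  1],   # C
                  [-1,  1,  1, -1, -1,  1]])  # D
classes = ["A", "B", "C", "D"]

def ecoc_decode(prediction):
    # Return the class whose code row is closest to the prediction in Hamming distance.
    hamming = np.sum(codes != prediction, axis=1)
    return classes[int(np.argmin(hamming))]

print(ecoc_decode(np.array([1, 1, 1, 1, 1, -1])))  # -> "A" (Hamming distance 1)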

Multi-class classification model: One model parameter vector for each class
More direct modeling of multi-class classification: one parameter vector w^(k) for each class k
- Multi-class linear classifier: f(x) = argmax_{k∈Y} w^(k)·x
- Multi-class logistic regression: P(k|x) = exp(w^(k)·x) / Σ_{k'∈Y} exp(w^(k')·x)
  exp converts real values into positive values, and the normalization then yields a probability value in [0,1]
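A minimal sketch of the two multi-class models above, with the K class parameter vectors stacked as the rows of a matrix W; the stacking and the max-subtraction trick for numerical stability are my own additions:

import numpy as np

def predict(W, x):
    # Multi-class linear classifier: argmax_k w^(k).x
    return int(np.argmax(W @ x))

def class_probs(W, x):
    # Multi-class logistic regression: P(k|x) = exp(w^(k).x) / sum_k' exp(w^(k').x)
    scores = W @ x
    scores = scores - scores.max()   # subtracting the max does not change the probabilities
    expd = np.exp(scores)
    return expd / expd.sum()

W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
x = np.array([0.2, 0.8])
print(predict(W, x), class_probs(W, x))

Subtracting the maximum score before exponentiation leaves the probabilities unchanged but avoids numerical overflow.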

Training a multi-class classifier: Constraints for correct classification
Training the multi-class linear classifier f(x) = argmax_{k∈Y} w^(k)·x:
- One can use the one-versus-rest method, but it is not perfect.
- Constraints for correct classification of the training data:
  w^(y_i)·x_i > w^(k)·x_i for all k ≠ y_i, i.e. w^(y_i)·x_i > max_{k∈Y, k≠y_i} w^(k)·x_i
- Learning algorithms find solutions satisfying (almost all of) these constraints: the multi-class perceptron, the multi-class SVM, ...

Multi-class perceptron: Incremental learning algorithm for the linear classifier
The multi-class linear perceptron trains a classifier to meet the constraints w^(y_i)·x_i > max_{k∈Y, k≠y_i} w^(k)·x_i
Algorithm:
1. Given (x_i, y_i), make a prediction with f(x_i) = argmax_{k∈Y} w^(k)·x_i
2. Update the parameters only when the prediction is wrong:
   - w^(y_i) ← w^(y_i) + x_i : reinforces the correct prediction
   - w^(f(x_i)) ← w^(f(x_i)) - x_i : discourages the wrong prediction
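A sketch of the multi-class perceptron loop above on a toy dataset; the data, the number of epochs, and the zero initialization are illustrative choices:

import numpy as np

def train_multiclass_perceptron(X, y, num_classes, epochs=10):
    # X: (N, D) inputs, y: (N,) integer labels in {0, ..., K-1}
    N, D = X.shape
    W = np.zeros((num_classes, D))            # one weight vector per class
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            pred = int(np.argmax(W @ x_i))    # step 1: predict argmax_k w^(k).x
            if pred != y_i:                   # step 2: update only on mistakes
                W[y_i] += x_i                 # reinforce the correct class
                W[pred] -= x_i                # discourage the wrongly predicted class
    return W

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([0, 1, 2])
W = train_multiclass_perceptron(X, y, num_classes=3)
print([int(np.argmax(W @ x)) for x in X])     # -> [0, 1, 2] after convergence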

Training multi-class logistic regression: (Regularized) maximum likelihood estimation
Find the parameters that minimize the negative log-likelihood
  J({w^(y)}_{y∈Y}) = - Σ_{i=1,...,N} log P(y^(i) | x^(i)) + γ Σ_{y∈Y} ||w^(y)||_2^2
  where Σ_{y∈Y} ||w^(y)||_2^2 is a regularizer to avoid overfitting
For multi-class logistic regression P(k|x) = exp(w^(k)·x) / Σ_{k'∈Y} exp(w^(k')·x), the objective becomes
  J = - Σ_i w^(y^(i))·x^(i) + Σ_i log Σ_{k∈Y} exp(w^(k)·x^(i)) + reg.
Minimization using gradient-based optimization methods
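A sketch of the regularized negative log-likelihood J and its gradient, ready to be plugged into any gradient-based optimizer; the array layout, the regularization weight, the step size, and the toy data are assumptions of this illustration:

import numpy as np

def nll_and_grad(W, X, y, gamma=0.1):
    # W: (K, D) class weight vectors, X: (N, D) inputs, y: (N,) integer labels
    scores = X @ W.T                                  # (N, K) values w^(k).x_i
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    N = X.shape[0]
    nll = -log_probs[np.arange(N), y].sum() + gamma * np.sum(W ** 2)
    probs = np.exp(log_probs)                         # (N, K) values P(k | x_i)
    Y_onehot = np.zeros_like(probs)
    Y_onehot[np.arange(N), y] = 1.0
    grad = (probs - Y_onehot).T @ X + 2 * gamma * W   # (K, D) gradient of J
    return nll, grad

# A few steps of plain gradient descent on a toy dataset:
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([0, 1, 2])
W = np.zeros((3, 2))
for _ in range(100):
    loss, grad = nll_and_grad(W, X, y)
    W -= 0.1 * grad
print(loss, np.argmax(X @ W.T, axis=1))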

Difference between the perceptron and ML estimation: The perceptron needs only the max operation; ML also needs the sum
Perceptron:
- Training & prediction need only the argmax_{k∈Y} operation (the SVM does, too)
(Regularized) maximum likelihood estimation:
- Training needs the Σ_{k∈Y} operation
- Prediction needs the argmax_{k∈Y} operation

Equivalent form of multi-class logistic regression: Representation with one (huge) parameter vector
Consider a joint feature space of x and y:
  φ(x, y) = (δ(y = 1)x, δ(y = 2)x, ..., δ(y = K)x)
Corresponding parameter vector: w = (w^(1), w^(2), ..., w^(K)), i.e. a KD-dimensional feature space
Multi-class LR model: P(y|x) = exp(w·φ(x, y)) / Σ_{k∈Y} exp(w·φ(x, k))
  Equivalent to the previous model P(k|x) = exp(w^(k)·x) / Σ_{k'∈Y} exp(w^(k')·x)
Useful when we consider structured output prediction
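A minimal sketch of the joint feature map φ(x, y), checking that w·φ(x, y) with the stacked parameter vector reproduces w^(y)·x; all names and sizes here are illustrative:

import numpy as np

def joint_feature(x, y, num_classes):
    # phi(x, y): the input x is copied into the block belonging to class y, zeros elsewhere
    D = x.shape[0]
    phi = np.zeros(num_classes * D)
    phi[y * D:(y + 1) * D] = x
    return phi

K, D = 3, 2
W = np.arange(K * D, dtype=float).reshape(K, D)   # per-class parameter vectors w^(k)
w = W.reshape(-1)                                 # stacked (K*D,) parameter vector
x = np.array([0.5, -1.0])
for y in range(K):
    assert np.isclose(w @ joint_feature(x, y, K), W[y] @ x)
print("w . phi(x, y) equals w^(y) . x for every class y")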

Structured Output Prediction

Ultimate predictive modeling: Learn a mapping between general sets
In supervised learning, what we want is a mapping f: X → Y
  X = R^D, Y = R (regression) or a discrete set (classification)
The ultimate predictor should handle arbitrary sets X and Y
In practice, we have to restrict the classes of X and Y
  In particular, general output spaces are difficult to handle in the current framework, e.g. classification with an infinite number of classes

Structured output prediction: Outputs are sequences, trees, and graphs
(Inputs and) outputs have complex structures such as sequences, trees, and graphs in many applications:
- Natural language processing: texts, parse trees, ...
- Bioinformatics: sequences and structures of DNA/RNA/proteins
Structured output prediction task, e.g. syntactic parsing (sequences to trees):
  x = (John, loves, Mary): sequence
  y = (S (NP (NNP)) (VP (VPZ) (NP (NNP)))): tree
  (Figure: example parse tree, http://en.wikipedia.org/wiki/treebank#mediaviewer/file:example-tree.png)

Sequence labeling: Structured prediction with sequential input & output
Sequence labeling gives a label to each element of a sequence:
  x = (x_1, x_2, ..., x_T): input sequence of length T
  y = (y_1, y_2, ..., y_T): output sequence of the same length
The simplest structured prediction problem.
Example: part-of-speech tagging gives a part-of-speech tag to each word in a sentence
  x: sentence (a sequence of words)
  y: part-of-speech tags (e.g. noun, verb, ...)
  (Figure: the chain x_1, x_2, ..., x_T aligned with the labels y_1, y_2, ..., y_T)

Sequence labeling as multi-class classification: Impossible to work with exponentially many parameters
Formulation as T independent classification problems:
- Predict y_t using the surrounding words (..., x_{t-1}, x_t, x_{t+1}, ...)
- Often works quite well and is efficient
- But there is no guarantee of consistency among the predicted labels; we might want to include dependencies among labels, such as "a verb is likely to follow a noun"
The problem can also be considered as one multi-class classification problem with K^T classes:
- f(x) = argmax_{k∈Y} w^(k)·x is almost impossible to work with because of the exponentially many parameters

Key to solving structured output prediction: Formulation as a problem of evaluating input-output pairs
Remember the other form of the multi-class classifier using the joint feature space:
  P(y|x) = exp(w·φ(x, y)) / Σ_{k∈Y} exp(w·φ(x, k))  or  f(x) = argmax_{y∈Y} w·φ(x, y)
Both evaluate the affinity w·φ(x, y) of an input-output pair.
The problem is still not solved, because the dimensionality of φ(x, y) is still huge; but we can consider reducing the dimensionality of φ(x, y).

Features for sequence labeling: The first-order Markov assumption gives two feature types
Two types of features for sequence labeling:
1. Combination of one input element x_t and one output label y_t
   The standard feature for multi-class classification, e.g. [x_t = "loves" AND y_t = verb]
2. Combination of two consecutive output labels y_{t-1} and y_t
   The Markov assumption on the output labels, e.g. [y_{t-1} = noun AND y_t = verb]

Feature vector definition: The number of appearances of each pattern
Each dimension of φ(x, y) is defined as the number of appearances of a pattern in the joint sequence (x, y), e.g.
  [φ(x, y)]_1 = #appearances of [x_t = "loves" AND y_t = verb]
  [φ(x, y)]_2 = #appearances of [y_{t-1} = noun AND y_t = verb]
Features are defined for all possible combinations of POS tags and words.
Example: x = (John, loves, Mary) with y = (noun, verb, noun) or y = (noun, noun, noun)
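A small sketch of how such count-based joint features could be collected for a labeled sentence; the dictionary representation and the feature names are illustrative only:

from collections import Counter

def joint_features(x, y):
    # Count emission features (word, tag) and first-order transition features (prev_tag, tag)
    feats = Counter()
    prev = None
    for word, tag in zip(x, y):
        feats[("emit", word, tag)] += 1          # e.g. x_t = "loves" AND y_t = verb
        if prev is not None:
            feats[("trans", prev, tag)] += 1     # e.g. y_{t-1} = noun AND y_t = verb
        prev = tag
    return feats

x = ["John", "loves", "Mary"]
print(joint_features(x, ["noun", "verb", "noun"]))
print(joint_features(x, ["noun", "noun", "noun"]))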

Impact of the first-order Markov assumption: Reduced dimensionality of the feature space
The dimensionality of the feature vector is reduced from O(K^T) to O(K^2) (K is the number of labels for each position)
The space problem is solved; we can calculate w·φ(x, y)
The prediction problem (i.e. argmax_{y∈Y} w·φ(x, y)) has not been solved yet
  For sequence labeling, it can be solved by dynamic programming
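A sketch of that dynamic program for argmax_y w·φ(x, y) under first-order Markov features (the Viterbi recursion); it assumes the per-position and label-pair scores have already been collected into emission and transition matrices, which is a convention of this illustration rather than the slides' notation:

import numpy as np

def viterbi(emit, trans):
    # emit: (T, K) scores from the (x_t, y_t) features; trans: (K, K) scores from (y_{t-1}, y_t) features.
    # Returns the label sequence maximizing the total score, in O(T K^2) time.
    T, K = emit.shape
    delta = np.zeros((T, K))            # best score of any prefix ending with label k at position t
    back = np.zeros((T, K), dtype=int)  # backpointers
    delta[0] = emit[0]
    for t in range(1, T):
        cand = delta[t - 1][:, None] + trans + emit[t][None, :]   # (K_prev, K_curr)
        back[t] = np.argmax(cand, axis=0)
        delta[t] = np.max(cand, axis=0)
    y = [int(np.argmax(delta[T - 1]))]
    for t in range(T - 1, 0, -1):
        y.append(int(back[t, y[-1]]))
    return y[::-1]

emit = np.array([[2.0, 0.0], [0.0, 1.0], [1.5, 0.5]])   # toy scores: 3 positions, 2 labels
trans = np.array([[0.0, 1.0], [1.0, 0.0]])
print(viterbi(emit, trans))   # -> [0, 1, 0] for these toy scores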

Structured perceptron: Simple structured output learning algorithm
The structured perceptron learns w satisfying w·φ(x^(i), y^(i)) > max_{y∈Y, y≠y^(i)} w·φ(x^(i), y)
Algorithm:
1. Given (x^(i), y^(i)), make a prediction with f(x^(i)) = argmax_{y∈Y} w·φ(x^(i), y)
2. Update the parameters only when the prediction is wrong:
   w_NEW ← w_OLD + φ(x^(i), y^(i)) - φ(x^(i), f(x^(i)))
For sequence labeling, the prediction can be done in polynomial time by dynamic programming.
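A self-contained sketch of the structured perceptron update for sequence labeling; to keep it short, the argmax is computed here by brute-force enumeration of label sequences instead of the dynamic programming the slide refers to, and the tag set, features, and data are illustrative:

from collections import Counter
from itertools import product

TAGS = ["noun", "verb"]

def phi(x, y):
    # Joint feature counts: (word, tag) emissions and (prev_tag, tag) transitions
    feats = Counter()
    for t, (word, tag) in enumerate(zip(x, y)):
        feats[("emit", word, tag)] += 1
        if t > 0:
            feats[("trans", y[t - 1], tag)] += 1
    return feats

def score(w, feats):
    return sum(w.get(f, 0.0) * c for f, c in feats.items())

def predict(w, x):
    # argmax over all label sequences; brute force for clarity (Viterbi in practice)
    return max(product(TAGS, repeat=len(x)), key=lambda y: score(w, phi(x, y)))

def perceptron_update(w, x, y_true):
    y_pred = predict(w, x)
    if y_pred != tuple(y_true):
        for f, c in phi(x, y_true).items():
            w[f] = w.get(f, 0.0) + c      # move toward the correct output's features
        for f, c in phi(x, y_pred).items():
            w[f] = w.get(f, 0.0) - c      # move away from the wrongly predicted output
    return w

w = {}
x, y = ["John", "loves", "Mary"], ["noun", "verb", "noun"]
for _ in range(3):
    w = perceptron_update(w, x, y)
print(predict(w, x))   # -> ('noun', 'verb', 'noun')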

Conditional random field: Conditional probabilistic model for structured prediction
Conditional random field (CRF): a conditional probabilistic model
  P(y|x) = exp(w·φ(x, y)) / Σ_{y'∈Y} exp(w·φ(x, y'))
ML estimation needs the sum over all possible outputs:
  J = - Σ_i w·φ(x^(i), y^(i)) + Σ_i log Σ_{y∈Y} exp(w·φ(x^(i), y)) + reg.
The sum can be taken with dynamic programming.
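A sketch of that dynamic program for the sum: the log-partition term log Σ_y exp(w·φ(x, y)) computed by the forward recursion under first-order Markov features, with precomputed emission and transition score matrices as in the Viterbi sketch above (an assumption of this illustration):

import numpy as np

def logsumexp(a, axis=0):
    # Numerically stable log(sum(exp(a))) along the given axis
    m = a.max(axis=axis)
    return m + np.log(np.exp(a - a.max(axis=axis, keepdims=True)).sum(axis=axis))

def log_partition(emit, trans):
    # log of the sum over all label sequences y of exp(w.phi(x, y)), via the forward recursion in O(T K^2)
    # emit: (T, K) per-position label scores, trans: (K, K) consecutive-label pair scores
    T, K = emit.shape
    alpha = emit[0].copy()
    for t in range(1, T):
        # alpha[j] = log sum_i exp(alpha_prev[i] + trans[i, j]) + emit[t, j]
        alpha = logsumexp(alpha[:, None] + trans, axis=0) + emit[t]
    return logsumexp(alpha, axis=0)

emit = np.array([[2.0, 0.0], [0.0, 1.0], [1.5, 0.5]])
trans = np.array([[0.0, 1.0], [1.0, 0.0]])
print(log_partition(emit, trans))   # slightly above the best single-sequence score (6.5 here)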

Perceptron vs. CRF: The perceptron needs only the max operation; ML also needs the sum
Just as in multi-class classification:
- The structured perceptron can work with only the argmax operation
- Maximum likelihood estimation also needs the sum operation
There are structured output problems where the argmax operation is easy but the sum operation is difficult, e.g. bipartite matching