Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.


Statistical Machine Learning Theory
From Multi-class Classification to Structured Output Prediction
Hisashi Kashima (kashima@i.kyoto-u.ac.jp)
Department of Intelligence Science and Technology, Kyoto University
Course website: http://goo.gl/xilnmn

Topics of the 2nd half of this course: advanced supervised learning and unsupervised learning
- Multi-class classification and structured output prediction
- Other variants of supervised learning problems: semi-supervised learning, active learning, and transfer learning
- On-line learning: follow the leader, on-line gradient descent, perceptron; regret analysis
- Sparse modeling: L1 regularization, Lasso, and reduced-rank regression
- Model evaluation

Multi-class Classification

Multi-class classification: generalization of supervised two-class classification
- Training dataset: (x_1, y_1), ..., (x_i, y_i), ..., (x_N, y_N)
  - Input x_i ∈ X = R^D: a D-dimensional real vector
  - Output y_i ∈ Y: a one-dimensional scalar
- Estimate a deterministic mapping f: X → Y (often with a confidence value) or a conditional probability P(y|x)
- Classification:
  - Y = {+1, -1}: two-class classification
  - Y = {1, 2, ..., K}: K-class classification, e.g. hand-written digit recognition, text classification, ...

Two-class classification model: one model with one model parameter vector
- Two-class classification models:
  - Linear classifier: f(x) = sign(w·x) ∈ {+1, -1}
  - Logistic regression: P(y = +1 | x) = 1 / (1 + exp(-w·x))
- The model is specified by the parameter vector w = (w_1, w_2, ..., w_D)
- Our goal is to find the parameter w by using the training dataset (x_1, y_1), (x_2, y_2), ..., (x_N, y_N)
- Generalization: accurate prediction for future data sampled from some underlying distribution D_(x,y)

Simple approaches to multi-class classification: reduction to two-class classification
Reduce the problem to a set of two-class classification problems.
- Approach 1: One-versus-rest
  - Construct K two-class classifiers; each classifier sign(w^(k)·x) discriminates class k from the others
  - Prediction: the class with the highest confidence w^(k)·x
- Approach 2: One-versus-one
  - Construct K(K-1)/2 two-class classifiers, each of which discriminates between one pair of classes
  - Prediction by (confidence-weighted) voting
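As an illustration (my own sketch, not course code), one-versus-rest prediction with K binary classifiers represented only by their weight vectors w^(k):

```python
import numpy as np

def ovr_predict(W, x):
    """One-versus-rest prediction.

    W: array of shape (K, D), one weight vector per "class k vs. rest" classifier.
    x: array of shape (D,), the input feature vector.
    Returns the index of the class with the highest confidence w^(k)·x.
    """
    scores = W @ x              # confidence of each binary classifier
    return int(np.argmax(scores))
```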

Error Correcting Output Code (ECOC): an approach inspired by error-correcting coding
- Approach 3: Error correcting output code (ECOC)
  - Construct a set of two-class classifiers, each of which discriminates between two groups of classes, e.g. AB vs. CD
  - Prediction by finding the nearest class code in terms of Hamming distance
- Code table (one row per class, one column per two-class classification problem):

  class |  1   2   3   4   5   6
  A     |  1   1   1   1   1   1
  B     |  1  -1   1  -1  -1  -1
  C     | -1  -1  -1   1  -1   1
  D     | -1   1   1  -1  -1   1

- Example: the predicted code (1, 1, 1, 1, 1, -1) is nearest to the code for class A

Design of ECOC: code design is the key to good classification
- Codes (rows) should be far apart from each other in terms of Hamming distance
- Hamming distances between the codes of the table above:

        A   B   C   D
  A     0   4   4   3
  B         0   4   3
  C             0   3
  D                 0
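A small sketch (my own illustration, not from the slides) of ECOC prediction: the six binary classifiers each output ±1, and the class whose code row is nearest in Hamming distance is returned.

```python
import numpy as np

# Code table from the slide: one ±1 code row per class, one column per binary problem
CODES = {
    "A": np.array([ 1,  1,  1,  1,  1,  1]),
    "B": np.array([ 1, -1,  1, -1, -1, -1]),
    "C": np.array([-1, -1, -1,  1, -1,  1]),
    "D": np.array([-1,  1,  1, -1, -1,  1]),
}

def ecoc_predict(binary_outputs):
    """binary_outputs: length-6 ±1 vector produced by the six two-class classifiers."""
    b = np.asarray(binary_outputs)
    # Hamming distance = number of positions where the signs disagree
    dists = {c: int(np.sum(code != b)) for c, code in CODES.items()}
    return min(dists, key=dists.get)

print(ecoc_predict([1, 1, 1, 1, 1, -1]))  # -> "A" (Hamming distance 1 to class A's code)
```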

Multi-class classification model: one model parameter vector for each class
- More direct modeling of multi-class classification: one parameter vector w^(k) for each class k
- Multi-class linear classifier: f(x) = argmax_{k∈Y} w^(k)·x
- Multi-class logistic regression: P(k|x) = exp(w^(k)·x) / Σ_{k'∈Y} exp(w^(k')·x)
  - exp() converts real values into positive values, and the denominator normalizes them to obtain probability values in [0,1]
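For concreteness, a minimal softmax predictor for multi-class logistic regression (my own sketch), assuming a weight matrix W with one row per class:

```python
import numpy as np

def softmax_probs(W, x):
    """P(k|x) = exp(w^(k)·x) / Σ_k' exp(w^(k')·x), computed in a numerically stable way."""
    scores = W @ x                  # one score per class
    scores = scores - scores.max()  # subtracting the max does not change the probabilities
    expd = np.exp(scores)
    return expd / expd.sum()

def predict(W, x):
    return int(np.argmax(W @ x))    # f(x) = argmax_k w^(k)·x
```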

Training a multi-class classifier: constraints for correct classification
- To train the multi-class linear classifier f(x) = argmax_{k∈Y} w^(k)·x, we can use the one-versus-rest method, but it is not perfect
- Constraints for correct classification of the training data:
  w^(y^(i))·x_i > w^(k)·x_i for all k ≠ y^(i), i.e. w^(y^(i))·x_i > max_{k∈Y, k≠y^(i)} w^(k)·x_i
- Learning algorithms find solutions satisfying (almost all of) these constraints: multi-class perceptron, multi-class SVM, ...

Multi-class perceptron: an incremental learning algorithm for the linear classifier
- The multi-class linear perceptron trains a classifier to meet the constraints
  w^(y^(i))·x_i > max_{k∈Y, k≠y^(i)} w^(k)·x_i
- Algorithm:
  1. Given (x_i, y_i), make a prediction: f(x_i) = argmax_{k∈Y} w^(k)·x_i
  2. Update the parameters only when the prediction is wrong:
     - w^(y^(i)) ← w^(y^(i)) + x_i (reinforces the correct prediction)
     - w^(f(x_i)) ← w^(f(x_i)) - x_i (discourages the wrong prediction)
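A compact sketch of this mistake-driven update (illustrative Python, assuming integer class labels 0, ..., K-1):

```python
import numpy as np

def multiclass_perceptron(X, y, K, epochs=10):
    """X: (N, D) inputs, y: (N,) integer labels in {0, ..., K-1}."""
    N, D = X.shape
    W = np.zeros((K, D))                    # one weight vector per class
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            pred = int(np.argmax(W @ x_i))  # f(x_i) = argmax_k w^(k)·x_i
            if pred != y_i:                 # update only on mistakes
                W[y_i] += x_i               # reinforce the correct class
                W[pred] -= x_i              # discourage the wrongly predicted class
    return W
```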

Training multi-class logistic regression: (regularized) maximum likelihood estimation
- Find the parameters that minimize the negative log-likelihood
  J(w^(1), ..., w^(K)) = - Σ_{i=1,...,N} log P(y^(i)|x^(i)) + γ Σ_{y∈Y} ||w^(y)||_2^2
  - ||w^(y)||_2^2: a regularizer to avoid overfitting
- For multi-class logistic regression P(k|x) = exp(w^(k)·x) / Σ_{k'∈Y} exp(w^(k')·x), this becomes
  J = - Σ_i w^(y^(i))·x_i + Σ_i log Σ_{k∈Y} exp(w^(k)·x_i) + reg.
- Minimize J using gradient-based optimization methods
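As a rough, self-contained illustration (my own sketch) of this objective and its gradient for a single training example:

```python
import numpy as np

def nll_and_grad(W, x, y, gamma=0.0):
    """Negative log-likelihood of one example (x, y) under multi-class logistic
    regression, plus its gradient w.r.t. W (shape (K, D)); y is an integer class."""
    scores = W @ x
    scores = scores - scores.max()  # numerical stability
    expd = np.exp(scores)
    p = expd / expd.sum()           # P(k|x) for all classes k
    loss = -np.log(p[y]) + gamma * np.sum(W ** 2)
    grad = np.outer(p, x)           # expected features under the model, per class
    grad[y] -= x                    # minus the observed features of the true class
    grad += 2 * gamma * W           # gradient of the L2 regularizer
    return loss, grad
```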

Difference between the perceptron and ML estimation: the perceptron needs only the max operation; ML also needs the sum
- Perceptron: training and prediction need only the argmax_{k∈Y} operation (the SVM does as well)
- (Regularized) maximum likelihood estimation:
  - Training needs the Σ_{k∈Y} operation
  - Prediction needs the argmax_{k∈Y} operation

Equivalent form of multi-class logistic regression: representation with one (huge) parameter vector
- Consider a joint feature space of x and y:
  φ(x, y) = (δ(y = 1)x, δ(y = 2)x, ..., δ(y = K)x)
- Corresponding parameter vector: w = (w^(1), w^(2), ..., w^(K)), a KD-dimensional feature space
- Multi-class LR model: P(y|x) = exp(w·φ(x, y)) / Σ_{k∈Y} exp(w·φ(x, k))
  - Equivalent to the previous model P(k|x) = exp(w^(k)·x) / Σ_{k'∈Y} exp(w^(k')·x)
- This form is useful when we consider structured output prediction
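A tiny sketch (my own illustration) of this joint feature map: φ(x, y) places x in the block belonging to class y and zeros elsewhere, so that w·φ(x, y) = w^(y)·x.

```python
import numpy as np

def joint_feature(x, y, K):
    """phi(x, y): a K*D vector whose y-th block of length D is x; all other blocks are zero."""
    D = x.shape[0]
    phi = np.zeros(K * D)
    phi[y * D:(y + 1) * D] = x
    return phi

# With w = concatenation of (w^(1), ..., w^(K)):  w @ joint_feature(x, y, K) == w^(y) @ x
```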

Structured Output Prediction

Generalized supervised learning problem: learn a mapping between general sets
- In supervised learning, what we want is a mapping f: X → Y
  - X = R^D, and Y = R (regression) or a discrete set (classification)
- A more general problem setting takes arbitrary sets X and Y
- But we have to restrict the classes of X and Y in practice
  - In particular, cases with general output spaces are difficult to handle in the current framework, e.g. classification with an infinite number of classes

Structured output prediction: outputs are sequences, trees, and graphs
- In many applications, (inputs and) outputs have complex structures such as sequences, trees, and graphs
  - Natural language processing: texts, parse trees, ...
  - Bioinformatics: sequences and structures of DNA/RNA/proteins
- Structured output prediction task, e.g. syntactic parsing: sequences to trees
  - x = (John, loves, Mary): sequence
  - y = (S(NP(NNP))(VP(VPZ)(NP(NNP)))): tree
  [Figure: example parse tree, http://en.wikipedia.org/wiki/treebank#mediaviewer/file:example-tree.png]

Sequence labeling: structured prediction with sequential input and output
- Sequence labeling gives a label to each element of a sequence
  - x = (x_1, x_2, ..., x_T): input sequence of length T
  - y = (y_1, y_2, ..., y_T): output sequence of the same length
- The simplest structured prediction problem
- Example: part-of-speech tagging gives a part-of-speech tag to each word in a sentence
  - x: a sentence (a sequence of words)
  - y: part-of-speech tags (e.g. noun, verb, ...)

Sequence labeling as multi-class classification: impossible to work with exponentially many parameters
- Formulation as T independent classification problems: predict y_t using the surrounding words (..., x_{t-1}, x_t, x_{t+1}, ...)
  - Sometimes works quite well and is efficient
  - But there is no guarantee of consistency among the predicted labels; we might want to include dependencies among labels, such as "a verb is likely to follow a noun"
- The problem can also be considered as one multi-class classification problem with K^T classes
  - f(x) = argmax_{k∈Y} w^(k)·x is almost impossible to work with: exponentially many parameters

Key to solving structured output prediction: formulation as a validation problem of input/output pairs
- Remember the other form of the multi-class classifier using the joint feature space:
  P(y|x) = exp(w·φ(x, y)) / Σ_{k∈Y} exp(w·φ(x, k))  or  f(x) = argmax_{y∈Y} w·φ(x, y)
- These evaluate the affinity of an input-output pair via w·φ(x, y)
- The problem is still not solved, because the dimensionality of φ(x, y) is still huge; but we can consider reducing the dimensionality of φ(x, y)

Features for sequence labeling: the first-order Markov assumption gives two feature types
- Two types of features for sequence labeling:
  1. Combinations of one input element x_t and one output label y_t
     - The standard feature for multi-class classification
     - e.g. [x_t = "loves" ∧ y_t = verb]
  2. Combinations of two consecutive output labels y_{t-1} and y_t
     - Markov assumption on the output labels
     - e.g. [y_{t-1} = noun ∧ y_t = verb]

Feature vector definition: the number of appearances of each pattern
- Each dimension of φ(x, y) is defined as the number of appearances of a particular pattern in the joint sequence (x, y), e.g.
  - φ(x, y)_1 = #appearances of [x_t = "loves" ∧ y_t = verb]
  - φ(x, y)_2 = #appearances of [y_{t-1} = noun ∧ y_t = verb]
- Features are defined for all possible combinations of POS tags and words
  - e.g. for x = (John, loves, Mary), compare the labelings y = (noun, verb, noun) and y = (noun, noun, noun)
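A small illustration (my own sketch; the feature keys are an arbitrary naming choice) of counting these two feature types for one labeled sequence:

```python
from collections import Counter

def sequence_features(words, tags):
    """Count emission features (word, tag) and transition features (prev_tag, tag)."""
    phi = Counter()
    prev = None
    for word, tag in zip(words, tags):
        phi[("emit", word, tag)] += 1       # e.g. x_t = "loves" and y_t = "verb"
        if prev is not None:
            phi[("trans", prev, tag)] += 1  # e.g. y_{t-1} = "noun" and y_t = "verb"
        prev = tag
    return phi

print(sequence_features(["John", "loves", "Mary"], ["noun", "verb", "noun"]))
```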

Impact of the first-order Markov assumption: reduced dimensionality of the feature space
- The dimensionality of the feature vector decreases from O(K^T) to O(K^2), where K is the number of labels for each position
- The space problem is solved: we can calculate w·φ(x, y)
- The prediction problem (i.e. argmax_{y∈Y} w·φ(x, y)) has not been solved yet
  - For sequence labeling, it can be solved by dynamic programming (the Viterbi algorithm)
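For reference, a compact Viterbi-style dynamic program (an illustrative sketch, not course code; the emit/trans score matrices are an assumed decomposition of w·φ(x, y)) that finds the best of the K^T label sequences in O(T·K^2) time:

```python
import numpy as np

def viterbi(emit, trans):
    """emit[t, k]: score of assigning label k at position t;
    trans[j, k]: score of the transition y_{t-1} = j -> y_t = k.
    Returns the highest-scoring label sequence as a list of label indices."""
    T, K = emit.shape
    score = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    score[0] = emit[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + trans + emit[t][None, :]  # cand[j, k]: via previous label j
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0)
    y = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        y.append(int(back[t][y[-1]]))
    return y[::-1]
```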

Structured perceptron: a simple structured output learning algorithm
- The structured perceptron learns w satisfying
  w·φ(x^(i), y^(i)) > max_{y∈Y, y≠y^(i)} w·φ(x^(i), y)
- Algorithm:
  1. Given (x^(i), y^(i)), make a prediction: f(x^(i)) = argmax_{y∈Y} w·φ(x^(i), y)
  2. Update the parameters only when the prediction is wrong:
     w^NEW ← w^OLD + φ(x^(i), y^(i)) - φ(x^(i), f(x^(i)))
- For sequence labeling, the prediction can be done in polynomial time by dynamic programming
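Putting the pieces together, a sketch (my own illustration, with assumed helper functions) of one structured perceptron update; for sequence labeling, predict would be a Viterbi decoder and features a counter such as sequence_features above:

```python
def structured_perceptron_step(w, x_i, y_i, predict, features):
    """One mistake-driven update of the structured perceptron.
    w        : dict mapping feature names to weights
    predict  : function x -> argmax_y w·phi(x, y), e.g. Viterbi decoding for sequences
    features : function (x, y) -> feature count dict phi(x, y)
    """
    y_hat = predict(x_i)
    if y_hat != y_i:                  # update only when the prediction is wrong
        for f, c in features(x_i, y_i).items():
            w[f] = w.get(f, 0.0) + c  # w <- w + phi(x_i, y_i)
        for f, c in features(x_i, y_hat).items():
            w[f] = w.get(f, 0.0) - c  # w <- w - phi(x_i, f(x_i))
    return w
```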

Conditional random field: a conditional probabilistic model for structured prediction
- Conditional random field: the conditional probabilistic model
  P(y|x) = exp(w·φ(x, y)) / Σ_{y'∈Y} exp(w·φ(x, y'))
- ML estimation needs the sum over all possible outputs:
  J = - Σ_i w·φ(x^(i), y^(i)) + Σ_i log Σ_{y∈Y} exp(w·φ(x^(i), y)) + reg.
- The sum can be taken with dynamic programming
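As a counterpart to the Viterbi sketch, a forward-style dynamic program (illustrative only, using the same assumed emit/trans score convention) that computes log Σ_y exp(w·φ(x, y)), the sum the CRF likelihood needs:

```python
import numpy as np
from scipy.special import logsumexp

def log_partition(emit, trans):
    """log of the sum over all K^T label sequences of exp(total score), where
    total score = Σ_t emit[t, y_t] + Σ_t trans[y_{t-1}, y_t]."""
    T, K = emit.shape
    alpha = emit[0].copy()  # log-scores of partial sequences ending at position 0
    for t in range(1, T):
        # alpha_new[k] = logsumexp_j(alpha[j] + trans[j, k]) + emit[t, k]
        alpha = logsumexp(alpha[:, None] + trans, axis=0) + emit[t]
    return float(logsumexp(alpha))
```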

Perceptron vs. CRF: the perceptron needs only the max operation; ML also needs the sum
- Just as in multi-class classification:
  - The structured perceptron works with only the argmax operation
  - Maximum likelihood estimation also needs the sum operation
- There are structured output problems where the argmax operation is easy but the sum operation is difficult, e.g. bipartite matching

Homework

Homework: supervised regression
- Work on a supervised regression problem:
  1. Implement at least one method by yourself
  2. Use publicly available implementations (e.g. scikit-learn)
- Participate in a competition at http://universityofbigdata.net
  - Temperature prediction problem
  - Starts on Jun. 3rd, ends on Jul. 10th
- Submit a report summarizing your work
  - Due: Jul. 20th, noon

How to participate: register at University of Big Data
- We use our educational competition platform: http://universityofbigdata.net/?lang=en
- Register with your Google account (if you have not already), using the registration code SML2016
- If you already have an account, send an email to universityofbigdata@gmail.com so that we can give you permission
  - This is a closed competition, and hence we have to give you permission

Submitting your prediction: http://goo.gl/a2liuz
- See the instructions at http://universityofbigdata.net/competition/5500001?lang=en

Report submission: submit a report summarizing your work
- Submission:
  - Due: Jul. 20th, noon
  - Send your report to kashipong+report@gmail.com with the subject "SML2016 competition report", and confirm that you receive an acknowledgement before the 21st
- Report format:
  - Must include a brief description of your implementation (not source code): your approach, analysis pipeline, results, and discussion
  - At least 3 pages, but do not exceed 6 pages, in LNCS format