Multiclass Multilabel Classification with More Classes than Examples

Multiclass Multilabel Classification with More Classes than Examples. Ohad Shamir, Weizmann Institute of Science. Joint work with Ofer Dekel, MSR. NIPS 2015 Extreme Classification Workshop

Extreme Multiclass Multilabel Problems Label set is a folksonomy (a.k.a. collaborative tagging or social tagging)

Categories: 1452 births / 1519 deaths / 15th century in science / ambassadors of the republic of Florence / Ballistic experts / Fabulists / giftedness / mathematics and culture / Italian inventors / Members of the Guild of Saint Luke / Tuscan painters / people persecuted under anti-homosexuality laws...

Problem Definition: Multiclass multilabel classification with m training examples and k categories, where m and k are both large and grow together; possibly even k > m. Goal: Categorize unseen instances

Extreme Multiclass: Supervised learning starts with binary classification (k = 2) and extends to multiclass learning. Theory: VC dimension -> Natarajan dimension. Algorithms: binary -> multiclass. Usually, one assumes k = O(1). Some exceptions: a hierarchy with prior knowledge on label relationships (not always available), or additional assumptions (e.g. talk by Marius earlier)

Application: Classify the web based on Wikipedia categories. Training set: All Wikipedia pages (m = 4.2 x 10^6). Labels: All Wikipedia categories (k = 1.1 x 10^6)

Challenges. Statistical problem: Can't get a large (or even moderate) sample from each class. Computational problem: Many classification algorithms will choke on millions of labels

Propagating Labels on the Click-Graph: A bipartite graph of queries and web pages, derived from search engine logs, with clicks encoded as weighted edges. Wikipedia pages are labeled web pages; labels propagate along edges to other pages
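A minimal sketch of this kind of propagation, under simplifying assumptions: one round of page -> query -> page flow over a toy bipartite graph, with all names and the data format hypothetical (the talk does not specify the actual propagation rule).

```python
from collections import defaultdict

def propagate_labels(edges, seed_labels):
    """One round of label propagation over a bipartite click graph.

    edges: iterable of (query, page, weight) triples, one per click edge.
    seed_labels: dict mapping a page to its set of known labels
                 (e.g. the categories of a Wikipedia page).
    Returns: dict mapping every page to {label: accumulated weight}.
    """
    # Step 1: labels flow from labeled pages to the queries clicking to them.
    query_scores = defaultdict(lambda: defaultdict(float))
    for q, p, w in edges:
        for label in seed_labels.get(p, ()):
            query_scores[q][label] += w
    # Step 2: labels flow from queries back out to all connected pages.
    page_scores = defaultdict(lambda: defaultdict(float))
    for q, p, w in edges:
        for label, score in query_scores[q].items():
            page_scores[p][label] += w * score
    return page_scores
```

For instance, a page co-clicked with the Leonardo da Vinci Wikipedia page would inherit that page's categories, weighted by the click counts.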

Example: http://en.wikipedia.org/wiki/Leonardo_da_Vinci passes multiple labels to http://www.greatitalians.com. Among them: "Renaissance artists" (good), "1452 births" (bad). Observation: "1452 births" induces many false positives (FP): best to remove it altogether from the classifier output (FP -> TN, TP -> FN)

Simple Label Pruning Approach. 1. Split the dataset into a training set and a validation set. 2. Use the training set to build an initial classifier h_pre (e.g. by propagating labels over the click-graph). 3. Apply h_pre to the validation set; count FP_j and TP_j for each label j. 4. For each j in {1, ..., k}, remove label j if FP_j > ((1 - γ)/γ) TP_j. This defines a new pruned classifier h_post
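The pruning step can be sketched in a few lines. This assumes labels are integer ids and per-example predictions/ground truth are sets; the function name and data format are illustrative, not from the talk.

```python
def prune_labels(val_pred, val_true, k, gamma=0.5):
    """Return the set of labels to prune from the classifier output.

    val_pred / val_true: per-validation-example sets of label ids
    (predicted labels and true labels, respectively).
    A label j is pruned when FP_j > (1 - gamma) / gamma * TP_j,
    i.e. when removing it lowers the gamma-weighted empirical risk.
    """
    fp = [0] * k
    tp = [0] * k
    for pred, true in zip(val_pred, val_true):
        for j in pred:
            if j in true:
                tp[j] += 1
            else:
                fp[j] += 1
    threshold = (1 - gamma) / gamma
    return {j for j in range(k) if fp[j] > threshold * tp[j]}
```

Note that the rule is a per-label decision on two counts, so it runs in time linear in the number of predicted labels, regardless of k.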

Simple Label Pruning Approach: Explicitly minimizes empirical risk with respect to the γ-weighted loss

l(h(x), y) = Σ_{j=1}^k [ γ I(h_j(x) = 1, y_j = 0) + (1 - γ) I(h_j(x) = 0, y_j = 1) ]

where the first term counts false positives (FP) and the second false negatives (FN)
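As a sanity check on the definition, here is the same loss for a single example, with predictions and true labels again represented as sets (this representation is an assumption for illustration):

```python
def gamma_weighted_loss(pred, true, gamma=0.5):
    """Gamma-weighted multilabel loss for one example:
    gamma per false positive + (1 - gamma) per false negative.

    pred / true: sets of label ids for this example.
    """
    false_pos = len(pred - true)   # predicted but not a true label
    false_neg = len(true - pred)   # a true label but not predicted
    return gamma * false_pos + (1 - gamma) * false_neg
```

With gamma = 1/2 this is (half) the symmetric Hamming-style loss; other values of gamma trade false positives off against false negatives.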

Main Question: Would this actually reduce the risk? That is, is E_{x,y}[l(h_post(x), y)] < E_{x,y}[l(h_pre(x), y)] - is the risk difference positive?

Baseline Approach: Prove that, uniformly over all labels j, the empirical counts concentrate: FP_j/m ≈ Pr(j predicted but not a true label) and TP_j/m ≈ Pr(j predicted and a true label). Problem: m and k grow together, and many classes only have a handful of examples

Uniform Convergence Approach: The algorithm implicitly chooses a hypothesis from a certain hypothesis class - pruning rules on top of the fixed predictor h_pre. Prove uniform convergence by bounding the VC dimension / Rademacher complexity. Conclude that if the empirical risk decreases, the risk decreases as well

Uniform Convergence Fails: Unfortunately, there is no uniform convergence...... and even no algorithm/data-dependent convergence! Writing fp_j, tp_j for the population false-positive and true-positive rates of label j (for γ = 1/2, up to constants):

R(h_pre) - E[R(h_post)] = Σ_{j=1}^k Pr(j pruned) (fp_j - tp_j) = Σ_{j=1}^k Pr(FP_j > TP_j) (fp_j - tp_j)

In the m ≈ k regime, the pruning event FP_j > TP_j is only weakly correlated with the sign of fp_j - tp_j

A Less Obvious Approach: Prove directly that the risk decreases. Important (but mild) assumption: each example is labeled by at most s labels. Step 1: The risk of h_post is concentrated around its expectation: for all ε, Pr(|R(h_post) - E[R(h_post)]| ≥ ε) is small

A Less Obvious Approach: Step 2: It is enough to prove R(h_pre) - E[R(h_post)] > 0. Assuming γ = 1/2 for simplicity, it can be shown that

R(h_pre) - E[R(h_post)] > pos - O( ||(FP_j + TP_j)_j||_{1/2} / m )

where ||w||_{1/2} = (Σ_j sqrt(w_j))^2

A Less Obvious Approach: For a probability vector w, ||w||_{1/2} is always at most k (attained by the uniform distribution), and is smaller the more non-uniform the distribution. So the O(||(FP_j + TP_j)_j||_{1/2} / m) term above is small whenever the label-frequency distribution is skewed
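The effect of non-uniformity on ||w||_{1/2} can be checked numerically. A small sketch (the dimension and power-law exponent here are illustrative choices, not the talk's numbers):

```python
def half_norm(w):
    """||w||_{1/2} = (sum_j sqrt(w_j)) ** 2."""
    return sum(x ** 0.5 for x in w) ** 2

k = 10000
uniform = [1.0 / k] * k                      # uniform probability vector
raw = [j ** -1.6 for j in range(1, k + 1)]   # power-law weights
z = sum(raw)
powerlaw = [x / z for x in raw]              # normalized probability vector
# half_norm(uniform) equals k, the maximum possible value;
# half_norm(powerlaw) comes out far smaller than k.
```

By Cauchy-Schwarz, (Σ_j sqrt(w_j))^2 ≤ k Σ_j w_j = k for any probability vector, with equality exactly at the uniform distribution, which is what the uniform case above exhibits.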

Wikipedia Power-Law: r = 1.6

Wikipedia Power-Law: r = 1.6. Plugging this in:

R(h_pre) - E[R(h_post)] > pos - O( k^0.4 / m )
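Plugging the Wikipedia-scale numbers from earlier in the talk into the k^0.4 / m term (the constants hidden inside the O(·) are not given, so this only shows that the raw ratio itself is tiny at this scale):

```python
k = 1.1e6   # number of Wikipedia categories
m = 4.2e6   # number of Wikipedia pages
ratio = k ** 0.4 / m
# k ** 0.4 is only a few hundred, so the ratio is on the order of 6e-5
```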

Experiment Click graph on the entire web (based on search engine logs)

Experiment Categories from Wikipedia pages propagated twice through graph

Experiment: Train/test split of Wikipedia pages. How well do categories propagated from the training set predict the categories of test-set pages?

Experiment

Another Less Obvious Approach:

R(h_pre) - E[R(h_post)] = Σ_{j=1}^k Pr(j pruned) (fp_j - tp_j) = Σ_{j=1}^k Pr(FP_j > TP_j) (fp_j - tp_j)

There is a weak but positive correlation between the pruning event and fp_j - tp_j, even if there are only a few examples per label. For large k, the sum will tend to be positive

Different Application: Crowdsourcing (Dekel and S., 2009)

Different Application: Crowdsourcing. How can we improve crowdsourced data? Standard approach: repeated labeling, but expensive. A bootstrap approach: Learn a predictor from the data of all workers. Throw away examples labeled by workers who disagree a lot with the predictor. Re-train on the remaining examples. Works! (under certain assumptions). Challenge: each worker often labels only a handful of examples
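The filtering step of this bootstrap approach can be sketched as follows. The data format, function name, and disagreement threshold are all illustrative assumptions; the talk does not specify them.

```python
from collections import defaultdict

def filter_workers(examples, predictor, max_disagreement=0.3):
    """Bootstrap-style cleaning of crowdsourced labels.

    examples: list of (x, label, worker_id) triples.
    predictor: a model already trained on all workers' data.
    Drops every example from workers whose labels disagree with the
    predictor on more than max_disagreement of their own examples.
    """
    total = defaultdict(int)
    wrong = defaultdict(int)
    for x, y, worker in examples:
        total[worker] += 1
        if predictor(x) != y:
            wrong[worker] += 1
    bad = {w for w in total if wrong[w] / total[w] > max_disagreement}
    return [(x, y, w) for x, y, w in examples if w not in bad]
```

One would then re-train the predictor on the returned examples. The analysis difficulty mirrors the label-pruning setting: each worker's disagreement rate is estimated from only a handful of examples.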

Different Application: Crowdsourcing # examples/worker might be small, but many workers...

Different Application: Crowdsourcing # examples/worker might be small, but many workers...

Different Application: Crowdsourcing # examples/worker might be small, but many workers...

Conclusions: A number of classes comparable to (or exceeding) the number of examples violates the assumptions of most multiclass analyses, which are often based on generalizations of binary classification. Possible approach: avoid the standard analysis - extreme X can be a blessing rather than a curse. Other applications? More complex learning algorithms (e.g. substitution)?

Thanks!