Online Passive-Aggressive Algorithms Tirgul 11

Multi-Label Classification

Multilabel Problem: Example. Mapping apps to smart folders: assign an installed app (e.g., Candy Crush Saga) to one or more folders.

Goal. Given an installed app A (e.g., Farm Story), assign it to y, the set of relevant folders (e.g., Games, Social), where Y is the set of all folders: Games, Social, Music, Photography, Shopping.

Multilabel Classification. A variant of the classification problem in which multiple target labels must be assigned to each instance. Setting: there are k different possible labels, Y = {1, ..., k}. Every instance x_i is associated with a set of relevant labels y_i ⊆ Y. Special case: when there is a single relevant label for each instance, we recover multiclass (single-label) classification.

Multilabel Classification. Another use case: text categorization. x_i represents a document; y_i is the set of topics relevant to the document, chosen from a predefined collection of topics. E.g., a text might be about any of religion, politics, finance, or education at the same time, or about none of these.

Task. Assign a user's installed applications to one or more smart folders automatically.

Task. Training set of examples: each example (A_i, y_i) contains an application and a set of one or more smart folders, e.g., (Farm Story, {Games, Social}).

Model. Find a function ŷ = f_w(A), with parameters w, that assigns a set of folders ŷ to an installed app A.

Model and Inference. ŷ = f_w(A) = argsort_y w · φ(A, y), where w are the weight vectors and φ(A, y) are the feature maps.

Feature maps: TF-IDF of the title, Google Play's category, TF-IDF of the description, a representation of related apps.

Inference. ŷ_i = argsort_y w · φ(A_i, y). The algorithm's output upon receiving an instance x_i is a score for each of the k labels in Y. E.g., for x_i = Farm Story and Y = {Social, Music, Games, Photography, Shopping}, the scores are (0.6, 0.3, 0.7, 0.1, 0.01).

Inference. ŷ = argsort_y w · φ(A, y). The algorithm's prediction is a vector in R^k where each element corresponds to the score assigned to the respective label. This form of prediction is often referred to as label ranking. E.g., for Y = {Social, Music, Games, Photography, Shopping}, the prediction is (0.6, 0.3, 0.7, 0.1, 0.01) ∈ R^k.

Inference. For a pair of labels r, s ∈ Y: if score(r) > score(s), then label r is ranked higher than label s. Goal of the algorithm: rank every relevant label above every irrelevant label.
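
To make the scoring-and-sorting step concrete, here is a minimal Python/NumPy sketch. The joint feature map phi below is a hypothetical block construction (place the instance's feature vector in the block belonging to label y), not necessarily the app-folder features listed earlier; label_ranking scores every label and sorts them from best to worst.

import numpy as np

def phi(x, y, k):
    """Hypothetical joint feature map: put x in the block of label y."""
    d = x.shape[0]
    out = np.zeros(k * d)
    out[y * d:(y + 1) * d] = x
    return out

def label_ranking(w, x, k):
    """Score every label and return the labels sorted from highest to lowest score."""
    scores = np.array([w @ phi(x, y, k) for y in range(k)])
    order = np.argsort(-scores)          # best-ranked label first
    return scores, order

# toy usage: 5 labels, 3-dimensional instance
k, d = 5, 3
w = np.random.randn(k * d)
x = np.array([1.0, 0.5, -0.2])
scores, order = label_ranking(w, x, k)
print(scores, order)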

The Margin. Example: an instance (A_i, y_i), an app together with its correct set of folders (e.g., Games). After making the prediction ŷ_i, the algorithm receives the correct set y_i. 1. Find the least probable (lowest-scoring) correct folder. 2. Find the most probable (highest-scoring) wrong folder.

Iterate over examples. Example: an instance (A_i, y_i). After making the prediction ŷ_i, the algorithm receives the correct set y_i, compares the lowest-scoring correct folder with the highest-scoring wrong folder, and performs an update.

The Margin. We define the margin attained by the algorithm on round i for the example (x_i, y_i) as γ(w_i; x_i, y_i) = min_{r ∈ y_i} w_i · φ(x_i, r) − max_{s ∉ y_i} w_i · φ(x_i, s).

The Margin. The margin is positive if all relevant labels are ranked higher than all irrelevant labels. We are not satisfied with only a positive margin; we require the margin of every prediction to be at least 1. γ(w_i; x_i, y_i) = min_{r ∈ y_i} w_i · φ(x_i, r) − max_{s ∉ y_i} w_i · φ(x_i, s).

The Loss. A first candidate is the zero-one loss 1[min_{r ∈ y_i} w_i · φ(x_i, r) − max_{s ∉ y_i} w_i · φ(x_i, s) < 0], which only signals whether some irrelevant label is ranked above a relevant one. Recall γ(w_i; x_i, y_i) = min_{r ∈ y_i} w_i · φ(x_i, r) − max_{s ∉ y_i} w_i · φ(x_i, s). Define instead a hinge loss: l(w_i; x_i, y_i) = 0 if γ(w_i; x_i, y_i) ≥ 1, and 1 − γ(w_i; x_i, y_i) otherwise.

The Loss. l(w_i; x_i, y_i) = 0 if γ(w_i; x_i, y_i) ≥ 1, and 1 − γ(w_i; x_i, y_i) otherwise. It can also be written as l(w_i; x_i, y_i) = [1 − γ(w_i; x_i, y_i)]_+, where [a]_+ = max(0, a).
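
A minimal sketch of the margin and the hinge loss under the same assumptions, reusing numpy and the hypothetical phi helper from the inference sketch above.

def margin(w, x, relevant, k):
    """gamma = min score over relevant labels - max score over irrelevant labels."""
    scores = np.array([w @ phi(x, y, k) for y in range(k)])
    irrelevant = [y for y in range(k) if y not in relevant]
    return scores[list(relevant)].min() - scores[irrelevant].max()

def hinge_loss(w, x, relevant, k):
    """l = [1 - gamma]_+ : zero once the margin reaches 1."""
    return max(0.0, 1.0 - margin(w, x, relevant, k))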

Learning. First approach.

Multilabel PA Optimization Problem. An alternative approach: let r_i = argmin_{r ∈ y_i} w_i · φ(x_i, r) and s_i = argmax_{s ∉ y_i} w_i · φ(x_i, s), so that γ(w_i; x_i, y_i) = w_i · φ(x_i, r_i) − w_i · φ(x_i, s_i). Set w_{i+1} = argmin_w (1/2) ‖w − w_i‖² s.t. l(w; x_i, y_i) = [1 − γ(w; x_i, y_i)]_+ = 0, i.e., the margin is at least 1.

Passive Aggressive (PA). The algorithm is passive whenever the hinge loss is zero: if l_i = 0, then w_{i+1} = w_i. When the loss is positive, the algorithm aggressively forces w_{i+1} to satisfy the constraint l(w_{i+1}; x_i, y_i) = 0, regardless of the step size required.

Passive Aggressive. Solving with Lagrange multipliers, we get the following update rule: w_{i+1} = w_i + τ_i (φ(x_i, r_i) − φ(x_i, s_i)), where τ_i = l_i / ‖φ(x_i, r_i) − φ(x_i, s_i)‖².
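
A sketch of one full multilabel PA step under the same assumptions (numpy and the hypothetical phi block features from the earlier sketches): find the lowest-scoring relevant label r_i and the highest-scoring irrelevant label s_i, compute the hinge loss, and update w as in the rule above.

def pa_multilabel_step(w, x, relevant, k):
    """One passive-aggressive update for the multilabel / label-ranking setting."""
    scores = np.array([w @ phi(x, y, k) for y in range(k)])
    irrelevant = [y for y in range(k) if y not in relevant]
    r = min(relevant, key=lambda y: scores[y])      # lowest-scoring relevant label
    s = max(irrelevant, key=lambda y: scores[y])    # highest-scoring irrelevant label
    loss = max(0.0, 1.0 - (scores[r] - scores[s]))
    if loss == 0.0:                                  # passive: margin already >= 1
        return w
    diff = phi(x, r, k) - phi(x, s, k)
    tau = loss / (diff @ diff)                       # aggressive: satisfy the constraint exactly
    return w + tau * diff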

Passive Aggressive. The updated vector w_{i+1} will classify the example x_i with l(w_{i+1}; x_i, y_i) = 0.

Ranking

Ranking Problem: Example. A prediction bar: predict the apps the user is most likely to use at any given time and location.

Assume the user is in a given context, e.g., it is 12:47, at the office, and today is Wednesday; the prediction bar shows the apps ranked from the most likely down to the 4th most likely.

Goal. Find a function f that takes as input the current context x and predicts the apps the user is likely to click in that context.

Features: time of day, day of week, location.

Discriminative Model. Parameters w_A ∈ R^d for each app A; context x ∈ R^d. Prediction: f_w(x, A), the score of app A in context x (the apps in the bar are ranked by this score).

New Evaluation. The performance of the prediction system is measured by the Receiver Operating Characteristic (ROC) curve: true positive rate = (# contexts where the app was in the prediction bar and was clicked) / (total contexts where the app was clicked); false positive rate = (# contexts where the app was in the prediction bar and wasn't clicked) / (total contexts where the app wasn't clicked).

Maximizing AUC. By the definition of the AUC (Bamber, 1975; Hanley and McNeil, 1982), the AUC is the probability that a randomly chosen positive (clicked) context is scored higher than a randomly chosen negative (non-clicked) context.

Pairwise Dataset. For an application (e.g., Waze), define two sets of contexts: the contexts in which it was clicked (x^+) and the contexts in which it was not clicked (x^−).

Maximizing AUC. By the same definition of the AUC (Bamber, 1975; Hanley and McNeil, 1982), the empirical AUC over the pairwise dataset is the fraction of pairs (x^+, x^−) in which the clicked context is scored above the non-clicked one; maximizing the AUC therefore amounts to correctly ordering these pairs.
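
To make the empirical AUC concrete, here is a minimal Python sketch; it assumes we already have the model's scores for an app's clicked (positive) and non-clicked (negative) contexts, and counts correctly ordered pairs following the pairwise definition cited above.

def empirical_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive context scores higher
    (ties counted as half), i.e., the pairwise definition of the AUC."""
    correct = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos_scores) * len(neg_scores))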

Solve using PA. Set the next weight vector w_{i+1} to be the minimizer of the following optimization problem: min_{w ∈ R^d} (1/2) ‖w − w_i‖² s.t. f_w(x_i^+, A_i) − f_w(x_i^−, A_i) ≥ 1.
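
A sketch of one pairwise PA step for this ranking setting. It assumes the discriminative model scores an app by an inner product, f_w(x, A) = w_A · x, which is an assumption about the model form made here for illustration; under that assumption the closed-form step mirrors the multilabel PA update above, with the pair difference x^+ − x^− playing the role of φ(x_i, r_i) − φ(x_i, s_i).

def pa_pairwise_step(w_A, x_pos, x_neg):
    """One PA update for the per-app weights w_A (numpy arrays), given a clicked
    context x_pos and a non-clicked context x_neg; after the step the pair is
    separated with margin 1. Assumes f_w(x, A) = w_A . x (assumed model form)."""
    diff = x_pos - x_neg
    loss = max(0.0, 1.0 - w_A @ diff)     # hinge loss on the pairwise margin
    if loss == 0.0:                        # passive: pair already ordered with margin 1
        return w_A
    tau = loss / (diff @ diff)             # aggressive: smallest step satisfying the constraint
    return w_A + tau * diff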

Implementation. An online algorithm solves the optimization problem efficiently on huge data, with theoretical guarantees of convergence.