Online Passive-Aggressive Algorithms. Tirgul 11

Kelley Harrison
5 years ago

1 Online Passive-Aggressive Algorithms Tirgul 11

2 Multi-Label Classification

3 Multilabel Problem: Example. Mapping apps to smart folders: assign an installed app (e.g., Candy Crush Saga) to one or more folders.

4 Goal: Given an installed app $A$ (e.g., Farm Story), assign it to $y \subseteq \mathcal{Y}$, the set of relevant folders (e.g., Games and Social), where $\mathcal{Y}$ is the set of all folders: Social, Music, Games, Photography, Shopping.

5 Multilabel Classification: A variant of the classification problem in which multiple target labels must be assigned to each instance. Setting: there are $k$ different possible labels, $\mathcal{Y} = \{1, \dots, k\}$. Every instance $x_i$ is associated with a set of relevant labels $y_i \subseteq \mathcal{Y}$. Special case: when there is a single relevant label for each instance, the problem is multiclass (single-label) classification.

6 Multilabel Classification: Another use is text categorization: $x_i$ represents a document and $y_i$ is the set of topics relevant to the document, chosen from a predefined collection of topics. E.g., a text might be about any of religion, politics, finance, or education at the same time, or about none of these.

7 Task: Automatically assign a user's installed applications to one or more smart folders.

8 Task: Training set of examples: each example $(A_i, y_i)$ contains an application and a set of one or more smart folders, e.g. (Farm Story, {Games, Social}).

9 Model: Find a function $\hat{y} = f_w(A)$ with parameters $w$ that assigns a set of folders $\hat{y}$ to an installed app $A$.

10 Model and Inference: $\hat{y} = f_w(A) = \operatorname{argsort}_{y} \, w \cdot \phi(A, y)$, where $w$ are the weight vectors and $\phi$ are the feature maps.

11 Feature maps: TF-IDF of the title; Google Play's category; TF-IDF of the description; representation of related apps.

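12 Inference: $\hat{y}_i = \operatorname{argsort}_{y} \, w \cdot \phi(A_i, y)$. The algorithm's output upon receiving an instance $x_i$ is a score for each of the $k$ labels in $\mathcal{Y}$. Example: for $x_i$ = Farm Story and $\mathcal{Y}$ = (Social, Music, Games, Photography, Shopping), the scores are (0.6, 0.3, 0.7, 0.1, ...).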

13 Inference: $\hat{y} = \operatorname{argsort}_{y} \, w \cdot \phi(A, y)$. The algorithm's prediction is a vector in $\mathbb{R}^k$ in which each element is the score assigned to the corresponding label. This form of prediction is often referred to as label ranking. Example: $\mathcal{Y}$ = (Social, Music, Games, Photography, Shopping) with scores $(0.6, 0.3, 0.7, 0.1, 0.01) \in \mathbb{R}^k$.
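The label-ranking step can be sketched in a few lines of Python. This is only an illustration: the block-structured feature map `phi` (copy $x$ into the slot of label $y$) and the one-dimensional toy input are assumptions made here, not the feature maps used in the slides; the weights are chosen so that the scores match the example vector above.

```python
LABELS = ["Social", "Music", "Games", "Photography", "Shopping"]


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def phi(x, y):
    """Hypothetical feature map: x copied into the block of label y, zeros elsewhere."""
    d = len(x)
    out = [0.0] * (d * len(LABELS))
    start = LABELS.index(y) * d
    out[start:start + d] = x
    return out


def rank_labels(w, x, labels=LABELS):
    """Score every label with w . phi(x, y) and sort labels by decreasing score."""
    scores = {y: dot(w, phi(x, y)) for y in labels}
    ranking = sorted(labels, key=lambda y: scores[y], reverse=True)
    return ranking, scores


# Toy weights chosen so that the scores equal the slide's example vector.
w = [0.6, 0.3, 0.7, 0.1, 0.01]
ranking, scores = rank_labels(w, [1.0])
print(ranking)
```

With these toy weights, Games is ranked first and Shopping last.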

14 Inference: For a pair of labels $r, s \in \mathcal{Y}$: if $\text{score}(r) > \text{score}(s)$, label $r$ is ranked higher than label $s$. Goal of the algorithm: rank every relevant label above every irrelevant label. (Labels in the example: Social, Music, Games, Photography, Shopping.)

15 The Margin: Example $(A_i, y_i)$ with correct folders Social and Games. After making its prediction $\hat{y}_i$, the algorithm receives the correct set $y_i$. 1. Find the least probable correct folder (here Social). 2. Find the most probable wrong folder, i.e. the max over Music, Shopping, and Photography.

16 Iterate over examples: Example $(A_i, y_i)$. After making its prediction $\hat{y}_i$, the algorithm receives the correct set $y_i$. Update: [figure: folders Social, Music, Games, Shopping, Photography, with the max taken over the wrong folders]

17 The Margin: We define the margin attained by the algorithm on round $i$ for example $(x_i, y_i)$ as $\gamma(w_i; (x_i, y_i)) = \min_{r \in y_i} w_i \cdot \phi(x_i, r) - \max_{s \notin y_i} w_i \cdot \phi(x_i, s)$.

18 The Margin: The margin is positive if all relevant labels are ranked higher than all irrelevant labels. We are not satisfied with only a positive margin; we require the margin of every prediction to be at least 1. $\gamma(w_i; (x_i, y_i)) = \min_{r \in y_i} w_i \cdot \phi(x_i, r) - \max_{s \notin y_i} w_i \cdot \phi(x_i, s)$.
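As a small worked example of the margin, the sketch below computes $\gamma$ directly from a table of label scores. The scores are the example vector from the inference slides; treating Social and Games as the relevant set is an assumption taken from the Farm Story training example.

```python
def margin(scores, relevant):
    """Lowest score among relevant labels minus highest score among irrelevant ones."""
    lowest_relevant = min(scores[r] for r in relevant)
    highest_irrelevant = max(v for label, v in scores.items() if label not in relevant)
    return lowest_relevant - highest_irrelevant


scores = {"Social": 0.6, "Music": 0.3, "Games": 0.7,
          "Photography": 0.1, "Shopping": 0.01}
relevant = {"Social", "Games"}  # assumed relevant folders for the example app

gamma = margin(scores, relevant)
print(round(gamma, 6))  # Social (0.6) minus Music (0.3)
```

Here the margin is positive, so the ranking is correct, but it is below 1, so the margin requirement is not yet met.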

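19 The Loss: The zero-one loss $\mathbb{1}\left[\min_{r \in y_i} w_i \cdot \phi(x_i, r) - \max_{s \notin y_i} w_i \cdot \phi(x_i, s) < 0\right]$ indicates a negative margin, where $\gamma(w_i; (x_i, y_i)) = \min_{r \in y_i} w_i \cdot \phi(x_i, r) - \max_{s \notin y_i} w_i \cdot \phi(x_i, s)$. Define a hinge loss: $\ell(w_i; (x_i, y_i)) = 0$ if $\gamma(w_i; (x_i, y_i)) \ge 1$, and $\ell(w_i; (x_i, y_i)) = 1 - \gamma(w_i; (x_i, y_i))$ otherwise.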

20 The Loss: $\ell(w_i; (x_i, y_i)) = 0$ if $\gamma(w_i; (x_i, y_i)) \ge 1$, and $1 - \gamma(w_i; (x_i, y_i))$ otherwise. This can also be written as $\ell(w_i; (x_i, y_i)) = \left[1 - \gamma(w_i; (x_i, y_i))\right]_+$, where $[a]_+ = \max(0, a)$.
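The two forms of the hinge loss can be checked against each other numerically. The function names below are illustrative only.

```python
def hinge_loss_piecewise(gamma):
    """Piecewise form: zero when the margin is at least 1, otherwise 1 - margin."""
    if gamma >= 1.0:
        return 0.0
    return 1.0 - gamma


def hinge_loss(gamma):
    """Compact form [1 - gamma]_+ with [a]_+ = max(0, a)."""
    return max(0.0, 1.0 - gamma)


# Both forms agree on margins below, at, and above 1.
for g in (-0.5, 0.3, 1.0, 2.5):
    assert hinge_loss(g) == hinge_loss_piecewise(g)
    print(g, hinge_loss(g))
```

Note that a correct ranking with a margin between 0 and 1 still incurs a positive loss.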

21 Learning: First approach. Goal:

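22 Multilabel PA Optimization Problem: An alternative approach. Let $r_i = \operatorname{argmin}_{r \in y_i} w_i \cdot \phi(x_i, r)$ and $s_i = \operatorname{argmax}_{s \notin y_i} w_i \cdot \phi(x_i, s)$. Then $w_{i+1} = \operatorname{argmin}_{w} \frac{1}{2} \|w - w_i\|^2$ s.t. $\ell(w; (x_i, y_i)) = \left[1 - \gamma(w; (x_i, y_i))\right]_+ = 0$, i.e., the margin is at least 1, where $\gamma(w_i; (x_i, y_i)) = w_i \cdot \phi(x_i, r_i) - w_i \cdot \phi(x_i, s_i)$.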

23 Passive Aggressive (PA): The algorithm is passive whenever the hinge loss is zero: $\ell_i = 0 \Rightarrow w_{i+1} = w_i$. When the loss is positive, the algorithm aggressively forces $w_{i+1}$ to satisfy the constraint $\ell(w_{i+1}; (x_i, y_i)) = 0$, regardless of the step size required.

24 Passive Aggressive: Solving with Lagrange multipliers, we get the following update rule: $w_{i+1} = w_i + \tau_i \left(\phi(x_i, r_i) - \phi(x_i, s_i)\right)$, with $\tau_i = \dfrac{\ell_i}{\|\phi(x_i, r_i) - \phi(x_i, s_i)\|^2}$.
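A minimal sketch of one round of this update follows, assuming a hypothetical block-structured feature map `phi` and a one-dimensional toy input; neither comes from the slides. It also assumes that the relevant set is non-empty, that at least one label is irrelevant, and that $\phi(x_i, r_i) \ne \phi(x_i, s_i)$.

```python
LABELS = ["Social", "Music", "Games", "Photography", "Shopping"]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def phi(x, y):
    """Hypothetical feature map: x copied into the block of label y, zeros elsewhere."""
    d = len(x)
    out = [0.0] * (d * len(LABELS))
    start = LABELS.index(y) * d
    out[start:start + d] = x
    return out


def pa_multilabel_update(w, x, relevant, labels=LABELS):
    """One PA round: returns (new weights, hinge loss suffered, r_i, s_i)."""
    scores = {y: dot(w, phi(x, y)) for y in labels}
    r = min((y for y in labels if y in relevant), key=lambda y: scores[y])
    s = max((y for y in labels if y not in relevant), key=lambda y: scores[y])
    loss = max(0.0, 1.0 - (scores[r] - scores[s]))
    if loss == 0.0:
        return list(w), loss, r, s  # passive step: weights unchanged
    diff = [a - b for a, b in zip(phi(x, r), phi(x, s))]
    tau = loss / dot(diff, diff)  # step size from the closed-form solution
    return [wi + tau * di for wi, di in zip(w, diff)], loss, r, s


w = [0.6, 0.3, 0.7, 0.1, 0.01]
x = [1.0]
w_new, loss, r, s = pa_multilabel_update(w, x, {"Social", "Games"})
pair_margin = dot(w_new, phi(x, r)) - dot(w_new, phi(x, s))
print(r, s, round(loss, 6), round(pair_margin, 6))
```

In this toy run the lowest relevant label is Social and the highest irrelevant label is Music; after the update the margin between that pair is 1.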

25 Passive Aggressive: The updated vector $w_{i+1}$ satisfies the margin constraint on example $x_i$ for the pair $(r_i, s_i)$: $w_{i+1} \cdot \phi(x_i, r_i) - w_{i+1} \cdot \phi(x_i, s_i) \ge 1$.

26 Ranking

27 Ranking Problem: Example. A prediction bar: predict the apps the user is most likely to use at any given time and location.

28 Assume the user is in a given context: it is 12:47, the user is at the office, and today is Wednesday. The prediction bar lists apps from the most likely app to the 4th most likely app.

29 Goal: Find a function $f$ that takes the current context $x$ as input and predicts the apps the user is likely to click in that context.

30 Features: time of day, day of week, location.

31 Discriminative Model: Parameters $w_A \in \mathbb{R}^d$ of app $A$. Context $x \in \mathbb{R}^d$. Prediction:

32 New Evaluation: The performance of the prediction system is measured by the receiver operating characteristic (ROC) curve. $\text{true positive rate} = \dfrac{\#\,\text{contexts where the app was in the prediction bar and was clicked}}{\#\,\text{contexts where the app was clicked}}$, and $\text{false positive rate} = \dfrac{\#\,\text{contexts where the app was in the prediction bar and wasn't clicked}}{\#\,\text{contexts where the app wasn't clicked}}$.
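The two rates can be computed from a log of contexts for a single app. The sketch below assumes a hypothetical log format, a list of `(in_bar, clicked)` boolean pairs with one entry per context, and returns 0.0 for a rate whose denominator is zero; both choices are assumptions made for illustration.

```python
def roc_point(events):
    """Return (true positive rate, false positive rate) for one app.

    events: list of (in_bar, clicked) booleans, one pair per context.
    """
    true_pos = sum(1 for in_bar, clicked in events if in_bar and clicked)
    false_pos = sum(1 for in_bar, clicked in events if in_bar and not clicked)
    clicked_total = sum(1 for _, clicked in events if clicked)
    not_clicked_total = len(events) - clicked_total
    tpr = true_pos / clicked_total if clicked_total else 0.0
    fpr = false_pos / not_clicked_total if not_clicked_total else 0.0
    return tpr, fpr


# Made-up log: five contexts for one app.
log = [(True, True), (True, False), (False, True), (False, False), (True, True)]
tpr, fpr = roc_point(log)
print(tpr, fpr)
```

Sweeping the size of the prediction bar (or a score threshold) and plotting the resulting points traces out the ROC curve.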

33 Maximizing AUC: By definition of the AUC (Bamber, 1975; Hanley and McNeil, 1982): AUC

34 Pairwise Dataset: For an application (e.g., Waze), define two sets of contexts: those in which the app was clicked ($x^+$) and those in which it was not ($x^-$).

35 Maximizing AUC: By definition of the AUC (Bamber, 1975; Hanley and McNeil, 1982):
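One common way to state the AUC is as the fraction of (clicked, not-clicked) context pairs that the scoring function orders correctly (the Wilcoxon-Mann-Whitney form). The sketch below assumes that form, counts ties as half a correct pair by convention, and uses made-up scores.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    correct = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5  # tie convention (an assumption)
    return correct / (len(pos_scores) * len(neg_scores))


# Made-up scores of one app in contexts where it was / was not clicked.
clicked = [0.9, 0.4]
not_clicked = [0.5, 0.1]
auc = pairwise_auc(clicked, not_clicked)
print(auc)
```

Three of the four pairs are ordered correctly here, so the value is 0.75. This pairwise view is what motivates training on pairs of contexts.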

37 Solve using PA: Set the next $w_i$ to be the minimizer of the following optimization problem: $\min_{w \in \mathbb{R}^d,\, \xi} \|w - w_{i-1}\|^2$ s.t. $f_w(x_i^+, A_i) - f_w(x_i^-, A_i) \ge 1$.
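A minimal sketch of one pairwise PA step is given below. It assumes a linear scorer $f_w(x, A) = w_A \cdot x$ for a single app and uses the hard-constraint form with no slack variable; both are assumptions made for illustration, and the context vectors are made up.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pa_pair_update(w, x_pos, x_neg):
    """One PA step for one app on a (clicked, not-clicked) pair of contexts.

    Leaves w unchanged if the pair is already separated by a margin of 1.
    """
    diff = [p - n for p, n in zip(x_pos, x_neg)]
    loss = max(0.0, 1.0 - dot(w, diff))
    norm_sq = dot(diff, diff)
    if loss == 0.0 or norm_sq == 0.0:
        return list(w)  # passive step (or identical contexts: nothing to do)
    tau = loss / norm_sq
    return [wi + tau * di for wi, di in zip(w, diff)]


w0 = [0.0, 0.0, 0.0]
x_pos = [1.0, 0.0, 1.0]  # context in which the app was clicked
x_neg = [0.0, 1.0, 1.0]  # context in which it was not clicked
w1 = pa_pair_update(w0, x_pos, x_neg)
print(w1, dot(w1, x_pos) - dot(w1, x_neg))
```

After one step the clicked context outscores the other by a margin of 1, so a second step on the same pair leaves the weights unchanged.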

38 Implementation: An online algorithm solves the optimization problem efficiently on huge data sets, with theoretical convergence guarantees.
