Applied Machine Learning Lecture 5: Linear classifiers, continued. Richard Johansson


1 Applied Machine Learning Lecture 5: Linear classifiers, continued Richard Johansson

2 overview
  preliminaries
  logistic regression
  training a logistic regression classifier
  side note: multiclass linear classifiers
  support vector classification
  optimizing the LR and SVM objectives

3 reformulating the perceptron a bit more compactly
  the perceptron algorithm can be expressed a bit more compactly if we code the positive class and negative class as +1 and -1, respectively
  for instance: >50K → +1, <=50K → -1

4 misclassifications and updates, more compactly
  if $y_i = +1$ (positive), we have a misclassification if $w \cdot x_i$ is negative, that is, $y_i \cdot \text{score} \le 0$, and the update is $w = w + y_i \cdot x_i$
  if $y_i = -1$ (negative), we have a misclassification if $w \cdot x_i$ is positive, that is, again $y_i \cdot \text{score} \le 0$, and the update is again $w = w + y_i \cdot x_i$

5 the perceptron again, a bit more compactly
  if the y_i are coded as +1 or -1:

    w = (0, ..., 0)
    for (x_i, y_i) in the training set
        score = w · x_i
        if y_i · score ≤ 0
            w = w + y_i · x_i
    return w
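The compact perceptron above can be sketched in NumPy as follows. This is a minimal illustration, not the lecturer's own code: the function name `train_perceptron`, the `n_epochs` parameter (the slide shows a single pass over the data) and the toy data are assumptions made for the example.

```python
import numpy as np

def train_perceptron(X, Y, n_epochs=10):
    """Perceptron with labels coded as +1 / -1.

    X: array of shape (m, n); Y: array of m labels in {+1, -1}.
    n_epochs (number of passes over the data) is an assumption;
    the slide shows a single pass.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, Y):
            score = w.dot(x_i)
            # misclassified (or exactly on the boundary): move w towards y_i * x_i
            if y_i * score <= 0:
                w = w + y_i * x_i
    return w

# Toy, linearly separable data (made up for illustration).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]])
Y = np.array([1, 1, -1, -1])

w = train_perceptron(X, Y)
predictions = np.sign(X.dot(w))
```

Because the labels are +1 / -1, a single test (`y_i * score <= 0`) and a single update rule cover both classes.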


7 how can we get the confidence of a linear classifier?
  $\text{score} = w \cdot x$
  large positive score: strong support that x belongs to the positive class
  large negative score: strong support that x belongs to the negative class
  near zero: we are unsure

8 prediction score in scikit-learn

    clf = ... (train a classifier) ...
    scores = clf.decision_function(x)


10 how to interpret the output scores?
  linear classifiers select the outputs based on a scoring function: $\text{score} = w \cdot x$
  the confidence isn't directly interpretable
  can we create a model where the output can be interpreted as a probability?

11 the logistic regression model
  logistic regression is a method to train a linear classifier that gives a probabilistic output
  how to get the probability? use a logistic or sigmoid function:
    $P(\text{positive output} \mid x) = \frac{1}{1 + e^{-\text{score}}}$
  where $e^{-\text{score}}$ is np.exp(-score)
  this is formally a probability: always between 0 and 1, and the sum of the probabilities of the possible outcomes is 1

12 the logistic / sigmoid function
  [plot: P(y = positive | x) as a function of the classifier score]

13 conversely
    $P(\text{negative output} \mid x) = 1 - \frac{1}{1 + e^{-\text{score}}} = \frac{1}{1 + e^{\text{score}}}$

14 making it a bit more compact
  if we code the positive class as +1 and the negative class as -1, then we can write the probability a bit more neatly:
    $P(y \mid x) = \frac{1}{1 + e^{-y \cdot \text{score}}}$
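A short numerical check of the last few slides: the sigmoid gives the probability of the positive class, its complement gives the negative class, and the compact formula with y coded as +1 / -1 reproduces both. The helper name `prob_of_label` and the example score are assumptions made for illustration.

```python
import numpy as np

def sigmoid(score):
    # probability of the positive class
    return 1 / (1 + np.exp(-score))

def prob_of_label(y, score):
    # compact form; y must be +1 (positive) or -1 (negative)
    return 1 / (1 + np.exp(-y * score))

score = 1.5                 # an arbitrary example score
p_pos = sigmoid(score)
p_neg = 1 - p_pos           # equals 1 / (1 + exp(score))
```

A score of 0 gives probability 0.5 for either class, which matches the "near zero: we are unsure" reading of the raw score.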

15 in scikit-learn
  LR is called sklearn.linear_model.LogisticRegression
  predict_proba gives the probability output

16 code example: using a logistic regression classifier
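The code shown on this slide is not preserved in the transcript. The following is a minimal sketch of what such an example might look like, using made-up one-dimensional toy data; it is not the lecturer's original code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (made up): one feature, class 1 for the larger values.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
Y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, Y)

scores = clf.decision_function(X)  # raw linear scores
probs = clf.predict_proba(X)       # one column per class, in the order of clf.classes_
guesses = clf.predict(X)           # most probable class for each row
```

For a binary problem, the second column of `probs` is the sigmoid of the corresponding entry in `scores`.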


18 recall: the maximum likelihood principle
  in a probabilistic model, we can train the model by selecting parameters that assign a high probability to the data
  in our case, the parameters are the weight vector w
  adjust w so that each output label gets a high probability

19 the likelihood function
  formally, the probability of the data is defined by the likelihood function
  this is the product of the probabilities of all m individual training instances:
    $L(w) = P(y_1 \mid x_1) \cdot \ldots \cdot P(y_m \mid x_m)$
  in our case, this means
    $L(w) = \frac{1}{1 + e^{-y_1 \cdot (w \cdot x_1)}} \cdot \ldots \cdot \frac{1}{1 + e^{-y_m \cdot (w \cdot x_m)}}$

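20 rewriting a bit...
  we rewrite the previous formula
    $L(w) = \frac{1}{1 + e^{-y_1 \cdot (w \cdot x_1)}} \cdot \ldots \cdot \frac{1}{1 + e^{-y_m \cdot (w \cdot x_m)}}$
  as
    $-\log L(w) = \text{Loss}(w, x_1, y_1) + \ldots + \text{Loss}(w, x_m, y_m)$
  where
    $\text{Loss}(w, x, y) = \log(1 + \exp(-y \cdot (w \cdot x)))$
  is called the log loss function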
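The rewriting step can be checked numerically: the negative log of the likelihood equals the sum of the per-instance log losses. The weight vector and data below are arbitrary values chosen for illustration.

```python
import numpy as np

def log_loss(w, x, y):
    # log loss for one instance; y is +1 or -1
    return np.log(1 + np.exp(-y * w.dot(x)))

def likelihood(w, X, Y):
    # product of P(y_i | x_i) over all training instances
    probs = 1 / (1 + np.exp(-Y * X.dot(w)))
    return np.prod(probs)

# Arbitrary example values.
w = np.array([0.5, -1.0])
X = np.array([[1.0, 2.0], [3.0, 0.5], [-1.0, 1.0]])
Y = np.array([-1, 1, -1])

neg_log_likelihood = -np.log(likelihood(w, X, Y))
total_loss = sum(log_loss(w, x, y) for x, y in zip(X, Y))
```

Working with the sum of losses rather than the product of probabilities also avoids numerical underflow when m is large.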

21 plot of the log loss
  [plot: log loss as a function of y * classifier score]

22 recall: the fundamental tradeoff in machine learning
  goodness of fit: the learned classifier should be able to correctly classify the examples in the training data
  regularization: the classifier should be simple
  but so far in our LR description, we've just taken care of the first part!
    $-\log L(w) = \text{Loss}(w, x_1, y_1) + \ldots + \text{Loss}(w, x_m, y_m)$

23 regularization in logistic regression models
  just like we saw for linear regression models (Ridge and Lasso), we can add a regularizer that keeps the weights small
  most commonly, the $L_2$ regularizer:
    $\|w\|^2 = w_1 \cdot w_1 + \ldots + w_n \cdot w_n = w \cdot w$
  ... or an $L_1$ regularizer:
    $\|w\|_1 = |w_1| + \ldots + |w_n|$
  which will do some feature selection

24 combining the pieces
  we combine the loss and the regularizer:
    $\sum_i \text{Loss}(w, x_i, y_i) + \lambda \cdot \|w\|^2$
  in this formula, $\lambda$ is a tweaking parameter that controls the tradeoff between loss and regularization
  note: in some formulations (including scikit-learn), there is a parameter C instead of the $\lambda$ that is put before the loss:
    $C \cdot \sum_i \text{Loss}(w, x_i, y_i) + \|w\|^2$

25 check
    $\min_w \; C \cdot \sum_i \text{Loss}(w, x_i, y_i) + \|w\|^2$
  how do we convert this into an algorithm?
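As a sketch, the objective to be minimized can be written as a function of w. It assumes that the loss is summed over all training instances and that the regularizer is the squared L2 norm; the function name `lr_objective` and the toy data are illustrative only.

```python
import numpy as np

def lr_objective(w, X, Y, C=1.0):
    # C * (sum of log losses over the training set) + squared L2 norm of w
    losses = np.log(1 + np.exp(-Y * X.dot(w)))
    return C * losses.sum() + w.dot(w)

# Arbitrary example values.
X = np.array([[1.0, 2.0], [3.0, 0.5], [-1.0, 1.0]])
Y = np.array([-1, 1, -1])

value_at_zero = lr_objective(np.zeros(2), X, Y, C=2.0)
value_elsewhere = lr_objective(np.array([0.5, -1.0]), X, Y, C=2.0)
```

At w = 0 every instance has probability 0.5, so the objective is C * m * log(2). A larger C puts more weight on fitting the training data, a smaller C on keeping the weights small.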


27 two-class (binary) linear classifiers
  a linear classifier is a classifier that is defined in terms of a scoring function like this:
    $\text{score} = w \cdot x$
  this is a binary (2-class) classifier: return the first class if the score > 0, otherwise the second class
  how can we deal with non-binary (multi-class) problems when using linear classifiers?


29 decomposing multi-class classification problems
  idea 1: break down the complex problem into simpler problems, train a classifier for each separately
  one-versus-rest ("long jump"):
    for each class c, make a binary classifier to distinguish c from all other classes
    so if there are n classes, there are n classifiers
    at test time, we select the class giving the highest score
  one-versus-one ("football league"):
    for each pair of classes $c_1$ and $c_2$, make a classifier to distinguish $c_1$ from $c_2$
    if there are n classes, there are $\frac{n \cdot (n - 1)}{2}$ classifiers
    at test time, we select the class that has the most wins

30 example
  assume we're training a classifier of fruits and we have the classes apple, orange, mango
  in one-vs-rest, we train the following three classifiers:
    apple vs orange+mango
    orange vs apple+mango
    mango vs apple+orange
  in one-vs-one, we train the following three:
    apple vs orange
    apple vs mango
    orange vs mango

31 example (continued)
  we train classifiers to distinguish between apple, orange, and mango, using one-vs-rest
  so we get $w_{\text{apple}}$, $w_{\text{orange}}$, $w_{\text{mango}}$
  for some instance x, the respective scores are [-1, 2.2, 1.5]
  so our guess is orange
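The one-vs-rest decision in this example is just an argmax over the per-class scores. A minimal sketch using the scores from the slide:

```python
import numpy as np

classes = ["apple", "orange", "mango"]

# Scores from the three one-vs-rest classifiers for some instance x,
# in the same order as `classes` (values taken from the slide).
scores = np.array([-1.0, 2.2, 1.5])

# Pick the class whose classifier gives the highest score.
guess = classes[int(np.argmax(scores))]
```

In a full implementation the scores would come from the three weight vectors, for example as the product of a weight matrix (one row per class) with x.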

32 in scikit-learn
  scikit-learn includes implementations of both of the methods we have discussed:
    OneVsRestClassifier
    OneVsOneClassifier
  however, the built-in algorithms (e.g. Perceptron, LogisticRegression) will do this automatically for you
  they use one-versus-rest
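A small sketch of the two wrapper classes, checking that the number of underlying binary classifiers matches the counts given earlier (n for one-vs-rest, n(n-1)/2 for one-vs-one). The toy data and the fourth class "pear" are made up so that the two counts differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Toy data (made up): four well-separated classes, two instances each.
X = np.array([[0.0, 0.0], [0.2, 0.1],
              [5.0, 0.0], [5.1, 0.3],
              [0.0, 5.0], [0.2, 5.2],
              [5.0, 5.0], [5.2, 5.1]])
Y = np.array(["apple", "apple", "orange", "orange",
              "mango", "mango", "pear", "pear"])

ovr = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
ovo = OneVsOneClassifier(LogisticRegression()).fit(X, Y)

n_ovr = len(ovr.estimators_)  # one binary classifier per class
n_ovo = len(ovo.estimators_)  # one binary classifier per pair of classes
```

With four classes this gives 4 classifiers for one-vs-rest and 6 for one-vs-one.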

33 multiclass learning algorithms
  is it good to separate the multiclass task into smaller tasks that are trained independently?
  maybe training should be similar to testing?
  idea 2: make a model where one-vs-rest is used while training
  let's see how this can be done for logistic regression

34 binary LR: reminders
  the logistic or sigmoid function:
    $P(\text{positive output} \mid x) = \frac{1}{1 + e^{-\text{score}}}$

    def sigmoid(score):
        return 1 / (1 + np.exp(-score))

  when training, we minimize the log-loss
    $\text{Loss}(w, x, y) = \log(1 + \exp(-y \cdot (w \cdot x)))$

35 multiclass LR using the softmax
  the softmax function is used in multiclass LR instead of the logistic:
    $P(y_i \mid x) = \frac{e^{\text{score}_i}}{\sum_k e^{\text{score}_k}}$

    def softmax(scores):
        expscores = np.exp(scores)
        return expscores / sum(expscores)

  [exercise: make softmax numerically stable]

36 softmax example

    def softmax(scores):
        expscores = np.exp(scores)
        return expscores / sum(expscores)

    scores = [-1, 2.2, 1.5, -0.3]
    print(softmax(scores))

    array([ , , , ])
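One common way to approach the numerical stability exercise is to subtract the largest score before exponentiating. This leaves the probabilities unchanged, because the common factor cancels in the ratio, but keeps np.exp from overflowing. The sketch below is one possible solution, not necessarily the one intended in the lecture.

```python
import numpy as np

def softmax(scores):
    scores = np.asarray(scores, dtype=float)
    # Subtracting the maximum does not change the result,
    # but the largest exponent becomes exp(0) = 1, so nothing overflows.
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

probs = softmax([-1, 2.2, 1.5, -0.3])   # the scores from the slide
large = softmax([1000.0, 1001.0])       # would overflow without the shift
```

The second class (score 2.2) receives the highest probability, and the very large scores still give finite probabilities.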

37 cross-entropy loss
  when training, the softmax probabilities lead to the cross-entropy loss instead of the log loss
    $\text{Loss}_{CE}(w, x_i, y_i) = -\log P(y_i \mid x_i) = -\log \frac{e^{\text{score}_i}}{\sum_k e^{\text{score}_k}}$
  just like the log-loss:
    high probability for the correct label $y_i$ → low loss
    low probability for $y_i$ → high loss
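A minimal sketch of the cross-entropy loss for a single instance, computed from the vector of class scores. The function name and the argument `correct_index` (the position of the correct label in the score vector) are assumptions made for illustration.

```python
import numpy as np

def cross_entropy_loss(scores, correct_index):
    scores = np.asarray(scores, dtype=float)
    # log of the softmax, computed with the max-shift for numerical stability
    shifted = scores - scores.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    # negative log probability of the correct label
    return -log_probs[correct_index]

scores = [-1, 2.2, 1.5, -0.3]
loss_likely_label = cross_entropy_loss(scores, 1)    # correct label has the highest score
loss_unlikely_label = cross_entropy_loss(scores, 0)  # correct label has the lowest score
```

The loss is small when the correct label has a high probability and large when it has a low one.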

38 multiclass LR in scikit-learn
    LogisticRegression(multi_class='multinomial')
  (otherwise, separate classifiers are trained independently)


40 geometric view geometrically, a linear classifier can be interpreted as separating the vector space into two regions using a line (plane, hyperplane)

41 margin of separation the margin γ denotes how well w separates the classes: γ is the shortest distance from the separator to the nearest training instance
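The margin can be computed directly from this definition. The sketch below assumes a separator through the origin (no bias term) and a w that classifies every training instance correctly; the function name and toy data are illustrative.

```python
import numpy as np

def margin(w, X, Y):
    # Distance of each instance to the hyperplane w . x = 0,
    # positive when the instance is on the correct side (Y coded as +1 / -1).
    distances = Y * X.dot(w) / np.linalg.norm(w)
    # The margin is the distance to the nearest instance.
    return distances.min()

# Toy data (made up), separated by the vertical axis.
w = np.array([1.0, 0.0])
X = np.array([[2.0, 1.0], [3.0, -1.0], [-1.0, 0.0], [-4.0, 2.0]])
Y = np.array([1, 1, -1, -1])

gamma = margin(w, X, Y)
```

Here the nearest instance is (-1, 0), at distance 1 from the separator.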

42 large margins are good
  a result from statistical learning theory:
    true error ≤ training error + BigUglyFormula($\frac{1}{\gamma^2}$)
  larger margin → better generalization

43 support vector machines support vector machines (SVM) or support vector classifiers (SVC) are linear classifiers constructed by selecting the w that maximizes the margin

44 soft-margin SVMs
  in some cases the dataset is inseparable, or nearly inseparable
  soft-margin SVM: allow some examples to be disregarded when maximizing the margin
  [figure: A) hard-margin SVM; B) soft-margin SVM]

45 stating the SVM as an objective function
  the hard-margin and soft-margin SVM can be stated mathematically in a number of ways
  we'll skip the details, but with a bit of work we can show that the soft-margin SVM can be stated as minimizing
    $C \cdot \sum_i \text{Loss}(w, x_i, y_i) + \|w\|^2$
  where
    $\text{Loss}(w, x, y) = \max(0, 1 - y \cdot (w \cdot x))$
  is called the hinge loss
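A small sketch of the hinge loss and the resulting soft-margin objective. It assumes that the loss is summed over all training instances; the function names and example values are illustrative.

```python
import numpy as np

def hinge_loss(w, x, y):
    # zero when the instance is on the correct side with y * score >= 1
    return max(0.0, 1 - y * w.dot(x))

def svm_objective(w, X, Y, C=1.0):
    # C * (sum of hinge losses over the training set) + squared L2 norm of w
    losses = np.maximum(0, 1 - Y * X.dot(w))
    return C * losses.sum() + w.dot(w)

w = np.array([1.0, 0.0])
safe = hinge_loss(w, np.array([2.0, 0.0]), 1)      # y * score = 2
too_close = hinge_loss(w, np.array([0.5, 0.0]), 1) # y * score = 0.5
wrong = hinge_loss(w, np.array([1.0, 0.0]), -1)    # y * score = -1
```

Unlike the log loss, the hinge loss is exactly zero for instances that are classified correctly with y * score of at least 1; instances that are correct but too close to the separator are still penalized.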

46 plot of the hinge loss
  [plot: hinge loss as a function of y * classifier score]

47 in scikit-learn
  linear SVM is called sklearn.svm.LinearSVC


49 SVM and LR have convex objective functions

50 optimizing SVM and LR
  since the objective functions of SVM and LR are convex, we can find w by stochastic gradient descent
  pseudocode:
    set w to some initial value, e.g. all zero
    iterate a fixed number of times:
        select a single training instance x
        select a suitable step length η
        compute the gradient of the hinge loss or log loss
        subtract step length · gradient from w
  note the similarity to the perceptron!

51 missing pieces
  setting the learning rate η
  gradients for SVM and LR loss functions (hinge loss and log loss)
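As a sketch of the second missing piece, the gradients of the two loss functions with respect to w can be written down and checked against finite differences. The hinge loss is not differentiable at y * score = 1, so a subgradient is used there. The function names and test values are assumptions made for illustration.

```python
import numpy as np

def log_loss(w, x, y):
    return np.log(1 + np.exp(-y * w.dot(x)))

def log_loss_gradient(w, x, y):
    # d/dw log(1 + exp(-y * w.x)) = -y * x / (1 + exp(y * w.x))
    return -y * x / (1 + np.exp(y * w.dot(x)))

def hinge_loss(w, x, y):
    return max(0.0, 1 - y * w.dot(x))

def hinge_loss_subgradient(w, x, y):
    # -y * x where the loss is active, zero where the loss is flat
    if y * w.dot(x) < 1:
        return -y * x
    return np.zeros_like(x)

def numerical_gradient(f, w, eps=1e-6):
    # central finite differences, one coordinate at a time
    grad = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

w = np.array([0.5, -1.0])
x = np.array([1.0, 2.0])
y = 1
```

For a misclassified instance the hinge subgradient is -y * x, so a gradient step adds a multiple of y * x to w, which is the perceptron update.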

52 setting the learning rate η
  in principle, you can try to select a small enough value of η
  in practice, it's better to decrease η gradually
  we'll use the Pegasos algorithm, where we set η as follows:
    $\eta = \frac{C}{t}$
  where t is the current step (1, 2, ...) and C is the loss/regularization tradeoff
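Putting the pieces together, a Pegasos-style SGD loop for the SVM might be sketched as below. Several details are assumptions rather than the lecture's exact formulation: the per-instance objective is taken as hinge loss plus (λ/2)·‖w‖² with λ = 1/C, so that the step length C/t coincides with the usual Pegasos choice 1/(λ·t); the names `pegasos_svm`, `n_steps` and `seed` are illustrative.

```python
import numpy as np

def pegasos_svm(X, Y, C=1.0, n_steps=1000, seed=0):
    """Pegasos-style SGD for a linear SVM; Y must be coded as +1 / -1."""
    rng = np.random.default_rng(seed)
    lam = 1.0 / C                      # assumed link between C and the regularization strength
    w = np.zeros(X.shape[1])
    for t in range(1, n_steps + 1):
        i = rng.integers(len(Y))       # select a single training instance at random
        x_i, y_i = X[i], Y[i]
        eta = C / t                    # step length decreases over time
        # subgradient of lam/2 * ||w||^2 + hinge loss for this instance
        grad = lam * w
        if y_i * w.dot(x_i) < 1:
            grad = grad - y_i * x_i
        w = w - eta * grad
    return w

# Toy, linearly separable data (made up for illustration).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]])
Y = np.array([1, 1, -1, -1])

w = pegasos_svm(X, Y)
predictions = np.sign(X.dot(w))
```

Swapping the hinge subgradient for the log loss gradient turns the same loop into a trainer for logistic regression.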

53 some comments about assignment 2
  implement SVM and LR and test them in a classifier
  Pegasos (which is just SGD) works in an iterative fashion similar to the perceptron...
  so if you start from my perceptron code this will be a breeze
  optional tasks to speed up the implementation using sparse vectors

54 next couple of weeks
  Thursday: lab session for assignment 1
  Friday: evaluation methods
  Tuesday: no class! (CHARM)
  Wednesday: deadline assignment 1
  Thursday: lab session for assignment 2
  Friday: guest lecture (Ericsson)
