PATTERN RECOGNITION AND MACHINE LEARNING

Size: px

Start display at page:

Download "PATTERN RECOGNITION AND MACHINE LEARNING"

Brianne Blankenship
6 years ago
Views:

1 PATTERN RECOGNITION AND MACHINE LEARNING Slide Set : Machine Learning: Linear Models January 18 Heikki Huttunen heikki.huttunen@tut.fi Department of Signal Processing Tampere University of Technology

2 Classification ˆ Many machine learning problems can be posed as classification tasks. ˆ Most classification tasks can be posed as a problem of partitioning a vector space into disjoint regions. ˆ These problems consist of following components: Samples: x[], x[1],..., x[n 1] R P Class labels: y[], y[1],..., y[n 1] {1,,..., C} Classifier: F(x) : R P {1,,..., C} ˆ Now, the task is to find the function F that maps the samples most accurately to their corresponding labels. ˆ For example: Find the function F that minimizes the number of erroneous predictions, i.e., the cases F(x[k]) = y[k].

3 Regression ˆ The second large class of machine learning problems consists of regression tasks. ˆ For regression, the output is real-valued instead of categorical. ˆ These problems consist of following components: Inputs: x[], x[1],..., x[n 1] R P Targets: y[], y[1],..., y[n 1] R Predictor: F(x) : R P R ˆ This time the task is to find the function F that maps the samples most accurately to their corresponding targets. ˆ For example: Find the function F that minimizes the squared sum of distances between predictions and targets: N 1 E = (y[k] F(x[k])). k=

4 Classification Example ˆ For example, consider the -dimensional dataset on the right. ˆ The data consists of blue crosses and red circles. ˆ Based on these data, what would be a good partition of the D space into "red" and "blue" regions? ˆ What kind of boundaries are allowed between the regions? ˆ Straight lines? ˆ Continuous curved boundaries? ˆ Boundary without any restriction? ˆ Note: In -D this can be solved manually, but not in -D Which class?

5 Regression Example 1. ˆ For example, consider the 1-dimensional data on the right. ˆ The data consists of 1 data points, where y coordinate is a function of x. ˆ Based on these data, what would be a good prediction of the target value at x = 1.3 (the dashed line)? ˆ An obvious solution is to fit a curve into the data points. What kind of forms may the curve have? ˆ Straight lines? ˆ Continuous curves? y Which y coordinate x

6 Different Classifiers ˆ We will study the following widely used classifiers. ˆ Nearest Neighbor classifier ˆ Linear classifiers (with linear boundary) ˆ The support vector machine (with nonlinear boundary) ˆ Ensemble classifiers: Random Forest ˆ Black boxes: Neural networks and deep neural networks ˆ For the first four, we will refer to the Scikit-Learn module: ˆ The neural network part will be studied with the Keras package:

7 Scikit-Learn ˆ Scikit-Learn started as a Google Summer of Code project. ˆ The project became a success: There was a clear need for free elegant platform bringing together widely used methods. ˆ Instead of each contributor providing their own package, scikit-learn has a unified API for each method. ˆ For details: Pedregosa, et al. Scikit-learn: Machine learning in Python, The Journal of Machine Learning Research,11.

8 Scikit-Learn Methods Sklearn API is straightforward: ˆ Initialization: Each model has its own constructor, e.g., model = LogisticRegression(penalty = "l1", C =.1). ˆ Training: Every model implements a.fit() method that trains the model; e.g., model.fit(x_train, y_train) ˆ Prediction: Every model implements a.predict() method that predicts the target output for new inputs; e.g., y_hat = model.predict(x_test) ˆ Probabilities: Many models also implement a.predict_proba() method that predicts the class probabilities for new inputs; e.g., p = model.predict_proba(x_test)

9 Sample Scikit-learn Session # Training code: from sklearn.linear_model import LogisticRegression model = LogisticRegression() model.fit(x, y) # Testing code: >>> model.predict([1,]) array([ 1.]) >>> model.predict([-1,-]) array([.] >>> model.predict_proba([[1,], [-1,-]]) Out[35]: array([[.881, ], [.59781,.39139]]) # [1,] is class 1 with 99.5 % confidence. # [-1,-] is class with 5.1 % confidence. ˆ In the example code, X consists of two-dimensional samples from two classes

10 Nearest Neighbor Classifier Nearest Neighbor Classifier Classified as RED ˆ Probably the most natural approach for deciding the class is simply to see what kind of samples are nearby. ˆ This is the idea behind the Nearest neighbor classifier: Just copy the class label of the most similar training sample to the unknown test sample. 8

11 K-Nearest Neighbor Classifier 9-Nearest Neighbor Classifier Classified as BLUE ˆ The problem with Nearest Neighbor is its fragility to changes in the training data. ˆ The classification boundary may change a lot by moving only one training sample. ˆ The robustness can be increased by taking a majority vote of more nearby samples. ˆ The K-Nearest neighbor classifier selects the most frequent class label among the K nearest training samples

12 Nearest Neighbor in Scikit-learn # Training code: from sklearn.neighbors import KNeighborsClassifier model = KNeighborsClassifier(n_neighbors = 5, metric = "euclidean") model.fit(x, y) # Testing code: >>> model.predict([-1,-]) array([.]) >>> model.predict([1,]) array([ 1.]) >>> model.predict_proba([,-3]) array([[.,.]]) # Ask what are the 5 nearest neighbors. # Returns the distances and indices (rows) in X >>> distances, indices = model.kneighbors([,-3]) [.1,.5,.3,.1,.] [379, 11, 15, 37, 198] # What are the classes for these five: >>> y[indices] array([[., 1.,.,., 1.]]) ˆ Parameters of constructor include: ˆ n_neighbors: Number of neighbors K ˆ metric: How distance is calculated ˆ algorithm: Which algorithm finds the nearest samples; e.g., methods like K-D Tree or brute force ˆ Probability prediction counts how many of the nearest K samples belong to each group. ˆ Thus, the probabilities are very quantized and often either or 1.

13 Benefits of Nearest Neighbor ˆ Training time is minimal: Either just store the data as is; or reorganize it into a tree structure for efficient search. ˆ Accuracy is often relatively good. ˆ Allows complicated definitions of distance, e.g., ˆ Find the 5 nearest samples holding attributes A and B but not C. ˆ New data can be added/deleted without retraining.

14 Problems with the Nearest Neighbor ˆ Nearest neighbor is prone to overlearning: Especially the 1-NN forms local regions around every isolated data point. ˆ It is highly unlikely that these represent the general trend of the whole population. ˆ Moreover, the training step extremely fast while the classification step becomes extremely slow (consider training data with billion high-dimensional samples). ˆ Therefore, more compact representations are preferred: Training time is usually not critical while execution time is.

15 Linear Classifiers ˆ A linear classifier learns a linear decision boundary between classes. ˆ In mathematical terms, the classification rule can be written as A Linear Classifier Classified as BLUE Class 1, F(x) = Class, if w T x < b if w T x b 8 where the weights w are learned from the data. ˆ The expression w T x = k w kx k essentially transforms the multidimensional data x to a real number, which is then compared to a threshold b

16 Flavors of Linear Classifiers There exists many algorithms for learning the weights w, including: ˆ Linear Discriminant Analysis (LDA) Linear Discriminant Analysis Classified as BLUE ˆ Support Vector Machine (SVM) Support Vector Machine Classified as RED ˆ Logistic Regression (LR) Logistic Regression Classified as RED

17 Flavors of Linear Classifiers ˆ The LDA: ˆ The oldest of the three: Fisher, ˆ "Find the projection that maximizes class separation", i.e., pull the classes as far from each other as possible. ˆ Closed form solution, fast to train. ˆ The SVM: ˆ Vapnik and Chervonenkis, 193. ˆ "Maximize the margin between classes". ˆ Slowest of the three 1, performs well with sparse high-dimensional data. ˆ Logistic Regression (a.k.a. Generalized Linear Model): ˆ History traces back to 17 s, but proper formulation and efficient training algorithm by Nelder and Wedderburn in 197. ˆ Statistical algorithm: "Maximize the likelihood of the data". ˆ Outputs also class probability. Has been extended to automatic feature selection. 1 Slowest in training; testing time is the same for all

18 Linear Discriminant Analysis ˆ The LDA maps high dimensional data into a single scalar simultaneously pulling individual classes apart. ˆ Thus, the task is essentially finding a good linear projection R N R Good Projection: Good Separation Poor Projection: Poor separation

19 Linear Discriminant Analysis ˆ A good measure of class separation could be, e.g., the distance between their projected means. ˆ However, this quantity depends on the scale, and increases simply by multiplying by a large number. ˆ Thus, we want to pull class means apart while keeping the variance of each class small. ˆ As a result, we look into the following separability score: Distance of Class Means J(w) = Variance of Classes = (wt μ 1 w T μ ) w T Σ 1 w + w T Σ w, where μ 1 and μ are the class means and Σ 1 and Σ covariance matrices of the two classes.

20 Maximization of J(w) ˆ The Fisher separation score can be maximized analytically. ˆ First simplify J(w) a bit: J(w) = (wt μ 1 w T μ ) w T Σ 1 w + w T Σ w = wt S B w w T S W w with S B = (μ 1 μ )(μ 1 μ ) T and S W = Σ 1 + Σ. ˆ This form is known as Generalized Rayleigh Quotient, which appears in many contexts. ˆ Using Lagrange multipliers, one can show that the maximum must satisfy with λ R a constant. S B w = λs W w

21 Maximization of J(w) ˆ The above condition is a generalized eigenvalue problem. ˆ In other words, there are many solutions: one for each eigenvalue. ˆ Thus, it is straightforward to conclude that the quantity is maximized when or equivalently J(w) = wt S B w w T S W w S B w = λs W w w T S B w = λw T S W w ˆ Finally, the ratio w T S B w/w T S W w is maximized by choosing λ as the largest eigenvalue (and w as the corresponding eigenvector).

22 Solution 1 using Eigenvalues # Training code: import numpy as np from scipy.linalg import eig # X and X1 contain data of the classes m = np.mean(x, axis = ) m1 = np.mean(x1, axis = ) C = np.cov(x - m, rowvar = False) C1 = np.cov(x1 - m1, rowvar = False) SB = np.multiply.outer(m1 - m, m1 - m) SW = C + C1 D, V = eig(sb, SW) w = V[:, -1] T = np.mean([np.dot(m1, w), np.dot(m, w)]) # Testing code: >>> w array([ ,.97198]) >>> np.dot(w, [,-]) - T. # Positive -> in class "red circles" >>> np.dot(w, [,-3]) - T # Negative -> in class "blue crosses" ˆ We need to solve the generalized eigenvalue problem S B w = λs W w. ˆ scipy has a solver for that. ˆ scipy.linalg.eig returns ˆ Array of eigenvalues D in increasing order. ˆ Matrix of eigenvectors V (in columns). ˆ Thus, we want the rightmost ( 1 th ) column V[:, -1] (corresponding to largest eigenvalue). ˆ The decision is done by comparing the projected vector to threshold T; the center of projected class means.

23 LDA Projection with Threshold Threshold 8 1

24 Solution without Eigenvalues ˆ It turns out that solving the generalized eigenvalue problem S B w = λs W w is not necessary. ˆ The maximum of J(w) has to satisfy S B w = λs W w, or equivalently, S 1 W S Bw = λw. ˆ Insert the definition of S B to get: or, S 1 W (μ 1 μ ) (μ 1 μ ) T w = λw }{{} scalar C S 1 W (μ 1 μ ) C λ = w

25 Solution without Eigenvalues ˆ Thus, the maximum of J(w) has to satisfy this condition as well (regardless of λ). ˆ We are looking for a direction vector w, so the multiplier C/λ is not relevant either ( c w has the same direction as w). λ ˆ Thus, w is given by w = S 1 W (μ 1 μ ) = (Σ + Σ 1 ) 1 (μ 1 μ ).

26 Solution without Eigenvalues # Training code: import numpy as np from scipy.linalg import eig # X and X1 contain data of the classes m = np.mean(x.t, axis = ) m1 = np.mean(x1.t, axis = ) C = np.cov(x - m, rowvar = False) C1 = np.cov(x1 - m1, rowvar = False) w = np.dot(np.linalg.inv(c + C1), (m1 - m)) T = np.mean([np.dot(m1, w), np.dot(m, w)]) ˆ The Python code is very similar to the eigenvalue based solution. ˆ Note that the resulting w is not exactly the same: The lengths are different, but directions are the same. Threshold # Testing code: >>> w array([.53757, 1.177]) >>> np.dot(w, [,-]) - T.5 # Positive -> in class "red circles" >>> np.dot(w, [,-3]) - T # Negative -> in class "blue crosses" 8 1

27 Solution 3: Scikit-Learn # Training code: from sklearn.discriminant_analysis import LinearDiscriminantAnalysis clf = LinearDiscriminantAnalysis() # X contains all samples, and y their class # labels: y = [,1,1,,...] clf.fit(x, y) # Testing code: >>> clf.coef_ # This is our w vector array([[ ,.8353]]) >>> clf.intercept_ # This is the threshold array([ ]) >>> clf.predict([, -]) array([ 1.]) >>> clf.predict([, -3]) array([.]) ˆ Scikit-Learn implements LDA in a straightforward manner. ˆ First construct your classifier using LinearDiscriminantAnalysis() ˆ Then train the model using fit() ˆ Then predict classes using predict() ˆ Or class probabilities using predict_proba() >>> clf.predict_proba([, -3]) array([[.8598,.31315]])

28 Multiclass LDA ˆ The LDA can be generalized to handle more than classes. ˆ Multiclass LDA can be understood in two contexts: ˆ As dimensionality reduction: We seek for a set of projections, that lower the dimensionality of the data while retaining its original separability. ˆ As classification: We seek for a set of discriminant functions that give a probability score for each class. ˆ We are more interested in the latter interpretation. 8

29 Multiclass LDA ˆ In this case, the classifier is defined by a set of linear functions g 1 (x) = w T 1 x + w 1. g K (x) = w T K x + w K ˆ Then the vector x is assigned class k = 1,,..., K if for all j = k. g k (x) > g j (x)

30 Multiclass LDA # Training code: import numpy as np from sklearn.discriminant_analysis import LinearDiscriminantAnalysis clf = LinearDiscriminantAnalysis() # X contains all samples, and y their class # labels: y = [1,3,1,,...] clf.fit(x, y) ˆ LinearDiscriminantAnalysis() model applies to multiclass case as well. ˆ The fit() method finds the three discriminants also shown in the plot below. # Testing code: # Show the three discriminants (rows): >>> clf.coef_ array([[-1.95, ], [ , ], [ ,.33]]) # Compute discriminants for [,-] >>> scores = np.dot(clf.coef_, [,-]) + clf.intercept_ array([.59889, -1.5, ]) # First class (red) has largest score # Predict class more easily using predict() >>> clf.predict([[, -], [, -]]) array([ 1.,.]) 8 1 8

31 The Support Vector Machine ˆ The Support Vector Machine (SVM) is characterized by its maximum margin property. ˆ In other words, it attempts to maximize the margin between classes. ˆ In the attached example, the binary data is perfectly separable, and the SVM sets the boundary in the middle of the two classes. ˆ Thus, the boundary location is defined by three samples only; called support vectors. 8 Support Vectors Margin 8

32 The Support Vector Machine ˆ Maximizing the margin M for the SVM can be characterized as an optimization problem maximize w,b, w =1 subject to M, y i (x T w + b) M, for i = 1,,..., N i ˆ Here, y i encodes the class as y i { 1, 1}, so the last condition can also be written as x T i w + b M, if y i = 1 x T i w + b M, if y i = 1 Support Vectors Margin 8

33 The Support Vector Machine ˆ Alternatively, the SVM criterion can be simplified into an equivalent form minimize w, w,b subject to y i (x T w + b) 1, for i = 1,,..., N i ˆ This is a convex optimization problem with constraints. Efficient tailored algorithms exist. ˆ The implementation of Scikit-learn is based on the widely used LibSVM library.

34 The Support Vector Machine ˆ The earlier definition of SVM assumes that the classes are separate (and there exists a margin between classes). ˆ In reality we have to allow some samples to reside on the wrong side of the margin. ˆ The solution is to penalize for samples on the wrong side. ˆ The resulting function to be minimized is: minimize w,b N [max(, 1 y i (w T x i + b))] + C w. i=1 ˆ For each sample on the wrong side, we add a penalty (called hinge loss) defined as max(, x) ˆ Note, that this increases the number of SV s Hinge Loss Right Side Wrong Side

35 SVM in Scikit-Learn ˆ The parameter C determines the balance between margin width and penalty for being on the wrong side. ˆ Left to right: Linear SVM s with decreasing penalty. SVM with C = 1 SVM with C = 1 1 SVM with C = 1 5 SVM with C =

36 Kernel Trick SVM with nd order Polynomial Kernel ˆ The SVM can be extended to nonlinear boundaries using the kernel trick. ˆ The kernel trick essentially maps the data into a higher dimension and designs the linear SVM there. ˆ For example, the top plot can be generated as follows: ˆ Map the D samples into 3D: (x, y) (x, y, xy) ˆ Train the SVM with the new 3D samples. ˆ The decision boundary is linear in 3D but nonlinear in D. ˆ However, this explicit transformation is slow; the kernel trick does the same thing implicitly. 8 SVM with the RBF Kernel 8

37 Kernel Trick ˆ It can be shown that substitution of the dot product by some nonlinear function is equivalent to the explicit mapping. ˆ More specifically, an algorithm can be kernelized by inserting a kernel function κ(x, y) in place of all dot products x y. ˆ It can be shown that under certain conditions on the kernel (e.g., positive semidefiniteness), this is equivalent to an explicit mapping. ˆ A lot of research has been done on the relation of kernel and the corresponding mapping. ˆ However, we take a more practical approach and only consider an example. SVM with nd order Polynomial Kernel 8 SVM with the RBF Kernel 8

38 Kernel Trick ˆ As an example, consider what happens when the inner product of D vectors x = (x 1, x ) and y = (y 1, y ) is substituted by its second power: ˆ We can expand the kernel as follows κ(x, y) = (x y) κ(x, y) = (x y) = (x 1 y 1 + x y ) = (x 1 y 1 ) + (x y ) + x 1 y 1 x y

39 Kernel Trick ˆ The result can be cleverly rearranged to make it look like a dot product in 3D: κ(x, y) = (x 1 y 1 ) + (x y ) + x 1 y 1 x y = (x 1 y 1 ) + (x y ) + ( x 1 x )( y 1 y ) = x x 1, x, 1 x y y 1, y, 1 y

40 Kernel Trick ˆ Thus, the following two things are equivalent: ˆ Explicit Mapping: Transform D data into 3D explicitly and fit the SVM with transformed data: u u v v uv ˆ Implicit Mapping: Substitute each dot product in the SVM algorithm by the kernel κ(x, y) = (x y) and fit with original D data.

41 Popular Kernels ˆ As mentioned, there is lot of literature on kernels. Popular ones include ˆ Linear Kernel: κ(x, y) = x y. This is the basic SVM with no mapping. ˆ Polynomial Kernel: κ(x, y) = (x y) d. Raises dot product to d th power. ˆ Inhomogeneous Polynomial Kernel: κ(x, y) = (x y + 1) d. Similar to polynomial kernel, but produces a bit more dimensions. ˆ Sigmoid Kernel: κ(x, y) = tanh(ax y + b), with a > and b <. x y σ kernel. Also known as Radial Basis Function (RBF) kernel. ˆ Gaussian Kernel: κ(x, y) = exp. Probably the most widely used ˆ The RBF kernel is special in the sense that it corresponds to a mapping into an infinite dimensional vector space. ˆ In addition to the mathematical excitement, the RBF kernel is often the best performing one, as well.

42 SVM in Scikit-Learn # Training code: from sklearn.svm import SVC # Alternatively: from sklearn.svm import LinearSVC # Kernels include: "linear", "rbf", "poly", "sigmoid" # C is the penalty term (default = 1) clf = SVC(kernel = linear, C = 1) # X contains all samples, and y their class # labels: y = [,1,1,,...] clf.fit(x, y) ˆ Scikit-Learn wraps the LibSVM and LibLinear libraries (latter is linear kernel optimized). Support Vectors # Testing code: >>> clf.coef_ # This is our w vector (linear case) array([[ ,.85]]) >>> clf.intercept_ # This is the threshold array([ ]) >>> clf.predict([[-,-], [-,]]) array([., 1.]) >>> clf.support_vectors_ array([[-.339, ], [ , ], [ , ]]) Margin 8

43 Multiclass SVM ˆ SVM is inherently two-class. ˆ Generalization to many classes is done by comparing pairs of classes. ˆ For each class, we train a SVM that classifies this class vs. all others. ˆ Scikit-Learn provides a few meta-classifiers for this purpose. ˆ multiclass.onevsrestclassifier compares each class vs. all others (requires K classifiers) ˆ multiclass.onevsoneclassifier compares all class pairs (requires K(K 1)/ classifiers)

Multiclass SVM ˆ A further extension is multilabel classification, where several classes can be present simultaneously. ˆ Targets are presented as binary indicators: >>> from sklearn.

44 Multiclass SVM ˆ A further extension is multilabel classification, where several classes can be present simultaneously. ˆ Targets are presented as binary indicators: >>> from sklearn.preprocessing import MultiLabelBinarizer >>> y = [[, 3, ], [], [, 1, 3], [, 1,, 3, ], [, 1, ]] >>> MultiLabelBinarizer().fit_transform(y) array([[,, 1, 1, 1], # Classes,3, are "on" [,, 1,, ], # Class 3 is "on" [1, 1,, 1, ], #... [1, 1, 1, 1, 1], [1, 1, 1,, ]]) ˆ Occurs often in, e.g., image recognition problems: "What is shown in this picture." ˆ For example, the following attributes are "on" in the attached picture: {"fly", "flower", "summer"}

45 Multiclass SVM # Training code: from sklearn.svm import LinearSVC from sklearn.multiclass import OneVsRestClassifier from sklearn.multiclass import OneVsOneClassifier clf_ova = OneVsRestClassifier(LinearSVC()) clf_ova.fit(x, y) clf_ovo = OneVsOneClassifier(LinearSVC()) clf_ovo.fit(x, y) # Testing code: >>> clf_ova.predict(np.array([[,-3.], [,-]])) array([., 3.]) >>> clf_ovo.predict(np.array([[,-3.], [,-]])) array([ 1., 3.]) >>> len(clf_ova.estimators_) 3 ˆ SVC() in fact implements OvO heuristic inherently. ˆ LinearSVC() does not, so let s use that as our example Linear SVM with OvA Wrapper Linear SVM with OvO Wrapper 1 3 5

46 Logistic Regression ˆ The third member of linear classifier family is Logistic Regression (LR). ˆ Unlike the other two, LR is probabilistic, i.e., it models the class probabilities instead of plain class memberships. ˆ For two-class case (c {, 1}), the model is: p(c = 1 x) = ˆ Also: p(c = x) = 1 p(c = 1 x) exp[ (w T x + b)]. ˆ In essence, the model maps the projection w T x + b. through the sigmoid function (thus limiting the. range to [, 1]) Logistic Sigmoid Function

47 Logistic Regression ˆ The model is illustrated in the flowgraph: ˆ The samples are projected to 1D using the learned weights w and b. ˆ The resulting "class scores" are mapped through the logistic function, which transforms the scores to probabilities p(c = 1 x) [, 1]. ˆ The probability estimates can also be mapped back to D, as shown in the figure on the right. 8 Logistic Regression Probabilities Projection Class Score Logistic Sigmoid Function Class Probability

48 Training Logistic Regression Models ˆ Logistic regression is trained by maximum likelihood. ˆ In other words, we maximize the likelihood of observing these data points with respect to model parameters. ˆ The likelihood of samples X = [x, x 1,..., x N 1 ] with class labels y, y 1,..., y N 1 {1,,..., K} to have occurred from the model with parameters θ is N 1 p(x θ) = p(x n ; θ, y n ). n= ˆ As usual, we consider the log-likelihood instead: N 1 ln p(x θ) = ln p(x n ; θ, y n ). n=

49 Training Logistic Regression Models ˆ Due to simplicity, we will continue with the two-class case only. ˆ Let s label the classes as y k { 1, 1}, because this will simplify the notation later. ˆ Also: let s hide the constant term b by catenating it at the end of w: w [w, b] and x [x, 1]

50 Training Logistic Regression Models ˆ The likelihood for classes 1 and 1 are now 1 p(x n y n = 1) = 1 + exp( w T x n ) 1 p(x n y n = 1) = exp( w T x n ) = (1 + exp( wt x n )) exp( w T x n ) = exp( wt x n ) 1 + exp( w T x n ) 1 = exp(w T x n ) + 1 = exp(w T x n )

51 Training Logistic Regression Models ˆ Now p(x n y n = 1) and p(x n y n = 1) are almost the same form and can be conveniently combined into one formula: 1 p(x n y n ) = 1 + exp( y n w T x n ) ˆ The likelihood for all samples X = [x, x 1,..., x N 1 ] is now N 1 1 p(x w, y) = 1 + exp( y n w T x n ) n=

52 Training Logistic Regression Models ˆ The log-likelihood becomes N 1 ln p(x w, y) = ln 1 ln(1 + exp( y n w T x n )) n= N 1 = n= ln(1 + exp( y n w T x n )).

53 Training Logistic Regression Models ˆ In order to maximize the likelihood, we can equivalently minimize the logistic loss function: N 1 l(w) = ln(1 + exp( y n w T x n )). n= ˆ There exists several algorithms for minimization of logistic loss, e.g., ˆ Iterative Reweighted Least Squares (IRLS) is an algorithm specifically tailored for this kind of problems. Used by statsmodels.api.logit and statsmodels.api.mnlogit. ˆ Optimization theory based approaches. Since ln p(x θ) is convex, in principle any optimization approach can be used. scikit-learn uses something called Trust Region Newton Algorithm.

54 Training Logistic Regression Models ˆ The optimization theory approach has become dominant, for two reasons: 1 The SVM training can also be posed in the same framework: minimize the hinge loss: hinge-loss = max(, 1 y n w T x n ). A general purpose optimizer can minimize modified losses, as well. Most importantly, one can add a penalty term into the loss function; e.g., N 1 penalized log-loss = ln(1 + exp( y n w T x n )) + Cw T w n= where the latter term favors small coefficients in w. We will return to this technique called regularization later.

55 Training Logistic Regression Models ˆ In the exercises, we will implement log loss minimization for two dimensional data. 1 Initialize w at random. Adjust w towards the negative gradient: w1 3 1 Starting point Optimization path Endpoint w 3 Return to step. w = w ε l (w) ˆ The result is shown on the right. Accuracy / % 8 Classification Accuracy 8 1 Iteration

56 Example: Effect of Regularization Parameter C from sklearn.cross_validation import train_test_split # Split data to training and testing X_train, X_test, y_train, y_test = \ train_test_split(x, y, test_size =.) # Test C values -, -3,..., C_range = 1. ** np.arange(-, 3) clf = LogisticRegression() for C in C_range: clf.c = C clf.fit(x_train, y_train) y_hat = clf.predict(x_test) accuracy = 1. * np.mean(y_hat == y_test) print ("Accuracy for C = %.e is %.1f %% ( w = %.f)" % \ (C, accuracy, np.linalg.norm(clf.coef_))) # Code output for the 3-class example data: Accuracy for C = 1.e- is 5.8 % ( w =.) Accuracy for C = 1.e-3 is 59. % ( w =.333) Accuracy for C = 1.e- is 73.3 % ( w = 1.37) Accuracy for C = 1.e-1 is 88.3 % ( w =.715) Accuracy for C = 1.e+ is 9.8 % ( w = 3.153) Accuracy for C = 1.e+1 is 91.7 % ( w = 3.773) Accuracy for C = 1.e+ is 91.7 % ( w = 3.87) ˆ In this example, we train the LR classifier with a range of values for parameter C. ˆ Each C value can be set inside a for loop as clf.c = <new value> ˆ We test the accuracy on a separate test data extracted from the complete data set.

57 Multiclass Logistic Regression There are two ways to extend LR to multiclass. ➊ A straightforward way would just normalize the exponential terms to sum up to 1: p(c = k x) = exp(wt k x + b k) K j=1 exp(wt j x + b j) However, one of the terms is unnecessary, because the probabilities add up to 1. ➋ Thus, we get an alternative model with fewer parameters: p(c = k x) = p(c = K x) = exp(w T k x + b k) 1 + K 1 j=1 exp(wt j x + b, k = 1,,..., K 1 j) K 1 j=1 exp(wt j x + b j).

Multiclass Version ➊ ˆ The first approach designs one probability model p(c = k x) for each class. ˆ However, the model is ill-posed: there are infinitely many solutions (by scaling).

58 Multiclass Version ➊ ˆ The first approach designs one probability model p(c = k x) for each class. ˆ However, the model is ill-posed: there are infinitely many solutions (by scaling). ˆ Thus, the model needs to be regularized to create a unique solution. ˆ This is the model given by scikit-learn. 8 1 Data 8 Predicted Classes # Training code: from sklearn.linear_model import LogisticRegression clf = LogisticRegression() clf.fit(x, y) # Test code: >>> clf.predict([[1,-1], [, -]]) array([ 1.,.]) clf.predict_proba([[1,-1], [, -]]) array([[.97,.38,.35], [.1,.51,.98]] # Model parameters: >>> clf.coef_ array([[-1.38, ], [ , ], [.13883,.58581]]) >>> clf.intercept_ array([ , , ]) Class Probabilities 8

59 Multiclass Version ➋ ˆ The second version is historically older. ˆ Has a unique solution even without regularization. ˆ Is not implemented in scikit-learn, but can be found in statsmodels library. # Training code: from statsmodels.api import MNLogit 8 1 Data 8 Predicted Classes # statsmodel has a bit different API. # In reality, X should zero-mean, but # let s simplify the code a bit here. clf = MNLogit(y, X) clf = clf.fit() # Test code: >>> clf.predict([[1,-1], [, -]]) array([[.97,.19,.9], [.1,.71,.58]], # clf.predict() gives probabilities. # To get classes, we find the max: >>> clf.predict([[1,-1], [, -]]).argmax(axis = 1) array([, ]) # Model parameters: >> clf.params array([[-.177,.8185], [-.5979, ]]) # Note: only two sets of weights. # Note: no intercept; assumes zero mean 8 Class Probabilities 1 8

PATTERN RECOGNITION AND MACHINE LEARNING

PATTERN RECOGNITION AND MACHINE LEARNING Slide Set 3: Detection Theory January 2018 Heikki Huttunen heikki.huttunen@tut.fi Department of Signal Processing Tampere University of Technology Detection theory