PATTERN RECOGNITION AND MACHINE LEARNING

1 PATTERN RECOGNITION AND MACHINE LEARNING
Slide Set: Machine Learning: Linear Models
January 2018
Heikki Huttunen (heikki.huttunen@tut.fi)
Department of Signal Processing, Tampere University of Technology

2 Classification
• Many machine learning problems can be posed as classification tasks.
• Most classification tasks can be posed as a problem of partitioning a vector space into disjoint regions.
• These problems consist of the following components:
  Samples: x[0], x[1], ..., x[N-1] ∈ R^P
  Class labels: y[0], y[1], ..., y[N-1] ∈ {1, 2, ..., C}
  Classifier: F(x): R^P → {1, 2, ..., C}
• The task is to find the function F that maps the samples most accurately to their corresponding labels.
• For example: find the function F that minimizes the number of erroneous predictions, i.e., the cases F(x[k]) ≠ y[k].
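
A minimal sketch (not from the original slides) of the error-count criterion above: it counts the misclassified samples for an arbitrary classifier. The toy data X, labels y and the placeholder classifier F are made up for illustration.

import numpy as np

# Toy data: N samples with P = 2 features, labels in {1, 2}
X = np.array([[0.0, 1.0], [1.0, 2.0], [3.0, 0.5], [4.0, 1.5]])
y = np.array([1, 1, 2, 2])

def F(x):
    # Placeholder classifier: class 2 if the first feature exceeds 2, else class 1
    return 2 if x[0] > 2 else 1

# Number of erroneous predictions, i.e., the cases where F(x[k]) != y[k]
errors = sum(F(x) != label for x, label in zip(X, y))
print("Misclassified samples:", errors)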

3 Regression
• The second large class of machine learning problems consists of regression tasks.
• For regression, the output is real-valued instead of categorical.
• These problems consist of the following components:
  Inputs: x[0], x[1], ..., x[N-1] ∈ R^P
  Targets: y[0], y[1], ..., y[N-1] ∈ R
  Predictor: F(x): R^P → R
• This time the task is to find the function F that maps the samples most accurately to their corresponding targets.
• For example: find the function F that minimizes the sum of squared distances between predictions and targets:
  E = Σ_{k=0}^{N-1} (y[k] - F(x[k]))².
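
As a small illustration of the squared-error criterion E (again not from the slides), the sketch below evaluates E for a hand-picked straight-line predictor on made-up 1-D data; x, y and F are placeholders.

import numpy as np

# Toy 1-D regression data
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([0.1, 0.9, 2.1, 3.2, 3.9])

def F(x):
    # Placeholder predictor: a hand-picked straight line
    return 2.0 * x

# Sum of squared distances between predictions and targets
E = np.sum((y - F(x)) ** 2)
print("Squared error E =", E)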

4 Classification Example
(Figure: a 2-dimensional dataset of blue crosses and red circles, with an unknown sample labelled "Which class?")
• For example, consider the 2-dimensional dataset on the right.
• The data consists of blue crosses and red circles.
• Based on these data, what would be a good partition of the 2-D space into "red" and "blue" regions?
• What kind of boundaries are allowed between the regions?
  • Straight lines?
  • Continuous curved boundaries?
  • Boundaries without any restriction?
• Note: In 2-D this can be solved manually, but not in higher-dimensional spaces.

5 Regression Example
(Figure: 1-dimensional data points with the question "Which y coordinate?" at the dashed line x = 1.3.)
• For example, consider the 1-dimensional data on the right.
• The data consists of data points where the y coordinate is a function of x.
• Based on these data, what would be a good prediction of the target value at x = 1.3 (the dashed line)?
• An obvious solution is to fit a curve to the data points. What kind of forms may the curve have?
  • Straight lines?
  • Continuous curves?

6 Different Classifiers
• We will study the following widely used classifiers:
  • Nearest Neighbor classifier
  • Linear classifiers (with linear boundary)
  • The support vector machine (with nonlinear boundary)
  • Ensemble classifiers: Random Forest
  • Black boxes: neural networks and deep neural networks
• For the first four, we will refer to the Scikit-Learn module.
• The neural network part will be studied with the Keras package.

7 Scikit-Learn
• Scikit-Learn started as a Google Summer of Code project.
• The project became a success: there was a clear need for a free, elegant platform bringing together widely used methods.
• Instead of each contributor providing their own package, scikit-learn has a unified API for each method.
• For details: Pedregosa et al., "Scikit-learn: Machine Learning in Python", The Journal of Machine Learning Research, 2011.

8 Scikit-Learn Methods
The sklearn API is straightforward:
• Initialization: Each model has its own constructor, e.g., model = LogisticRegression(penalty = "l1", C = 0.1).
• Training: Every model implements a .fit() method that trains the model; e.g., model.fit(X_train, y_train).
• Prediction: Every model implements a .predict() method that predicts the target output for new inputs; e.g., y_hat = model.predict(X_test).
• Probabilities: Many models also implement a .predict_proba() method that predicts the class probabilities for new inputs; e.g., p = model.predict_proba(X_test).

9 Sample Scikit-learn Session
# Training code:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X, y)

# Testing code:
>>> model.predict([[1, 2]])
array([ 1.])
>>> model.predict([[-1, -2]])
array([ 0.])
>>> model.predict_proba([[1, 2], [-1, -2]])
# Returns one row of class probabilities per input sample:
# [1, 2] is class 1 with 99.5 % confidence.
# [-1, -2] is class 0 with high confidence.

• In the example code, X consists of two-dimensional samples from two classes.

10 Nearest Neighbor Classifier
(Figure: Nearest Neighbor classifier; the test sample is classified as RED.)
• Probably the most natural approach for deciding the class is simply to see what kind of samples are nearby.
• This is the idea behind the Nearest Neighbor classifier: just copy the class label of the most similar training sample to the unknown test sample.

11 K-Nearest Neighbor Classifier
(Figure: 9-Nearest Neighbor classifier; the test sample is classified as BLUE.)
• The problem with the Nearest Neighbor classifier is its fragility to changes in the training data.
• The classification boundary may change a lot by moving only one training sample.
• The robustness can be increased by taking a majority vote among more of the nearby samples.
• The K-Nearest Neighbor classifier selects the most frequent class label among the K nearest training samples.

12 Nearest Neighbor in Scikit-learn
# Training code:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors = 5, metric = "euclidean")
model.fit(X, y)

# Testing code:
>>> model.predict([[-1, -2]])
array([ 0.])
>>> model.predict([[1, 2]])
array([ 1.])
>>> model.predict_proba([[0, -3]])
array([[ 0.6,  0.4]])

# Ask what the 5 nearest neighbors are.
# Returns the distances and indices (rows) in X:
>>> distances, indices = model.kneighbors([[0, -3]])
# distances and indices are arrays with one entry per neighbor.

# What are the classes of these five:
>>> y[indices]
array([[ 0.,  1.,  0.,  0.,  1.]])

• Parameters of the constructor include:
  • n_neighbors: the number of neighbors K
  • metric: how the distance is calculated
  • algorithm: which algorithm finds the nearest samples; e.g., K-D Tree or brute force
• Probability prediction counts how many of the K nearest samples belong to each group.
• Thus, the probabilities are heavily quantized and often either 0 or 1.

13 Benefits of Nearest Neighbor
• Training time is minimal: either just store the data as is, or reorganize it into a tree structure for efficient search.
• Accuracy is often relatively good.
• Allows complicated definitions of distance, e.g., "find the 5 nearest samples holding attributes A and B but not C."
• New data can be added or deleted without retraining.

14 Problems with the Nearest Neighbor
• Nearest neighbor is prone to overlearning: especially the 1-NN forms local regions around every isolated data point.
• It is highly unlikely that these represent the general trend of the whole population.
• Moreover, the training step is extremely fast while the classification step becomes extremely slow (consider training data with a billion high-dimensional samples).
• Therefore, more compact representations are preferred: training time is usually not critical while execution time is.

15 Linear Classifiers
(Figure: a linear classifier; the test sample is classified as BLUE.)
• A linear classifier learns a linear decision boundary between classes.
• In mathematical terms, the classification rule can be written as

  F(x) = Class 1, if w^T x < b
         Class 2, if w^T x ≥ b,

  where the weights w are learned from the data.
• The expression w^T x = Σ_k w_k x_k essentially transforms the multidimensional data x to a real number, which is then compared to a threshold b.
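
A minimal sketch of the thresholding rule above (not part of the slides): the weights w and threshold b are arbitrary illustrative values rather than weights learned from data.

import numpy as np

# Illustrative weights and threshold (not learned from any real data)
w = np.array([0.5, 1.2])
b = 0.3

def linear_classify(x, w, b):
    # Compare the projection w^T x against the threshold b
    return 1 if np.dot(w, x) < b else 2

print(linear_classify(np.array([0.1, 0.1]), w, b))   # class 1
print(linear_classify(np.array([2.0, 1.0]), w, b))   # class 2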

16 Flavors of Linear Classifiers
There exist many algorithms for learning the weights w, including:
• Linear Discriminant Analysis (LDA)
• Support Vector Machine (SVM)
• Logistic Regression (LR)
(Figures: the same test sample is classified as BLUE by LDA, and as RED by the SVM and by Logistic Regression.)

17 Flavors of Linear Classifiers
• The LDA:
  • The oldest of the three: Fisher, 1936.
  • "Find the projection that maximizes class separation", i.e., pull the classes as far from each other as possible.
  • Closed-form solution, fast to train.
• The SVM:
  • Vapnik and Chervonenkis, 1963.
  • "Maximize the margin between classes."
  • Slowest of the three¹, performs well with sparse high-dimensional data.
• Logistic Regression (a.k.a. Generalized Linear Model):
  • Its history traces back to the 1700s, but the proper formulation and an efficient training algorithm are due to Nelder and Wedderburn in 1972.
  • Statistical algorithm: "Maximize the likelihood of the data."
  • Also outputs class probabilities. Has been extended to automatic feature selection.

¹ Slowest in training; testing time is the same for all.

18 Linear Discriminant Analysis
• The LDA maps high-dimensional data onto a single scalar, simultaneously pulling the individual classes apart.
• Thus, the task is essentially finding a good linear projection R^N → R.
(Figures: a good projection gives good separation of the classes; a poor projection gives poor separation.)

19 Linear Discriminant Analysis
• A good measure of class separation could be, e.g., the distance between the projected class means.
• However, this quantity depends on the scale, and increases simply by multiplying w by a large number.
• Thus, we want to pull the class means apart while keeping the variance of each class small.
• As a result, we look at the following separability score:

  J(w) = (Distance of Class Means)² / (Variance of Classes)
       = (w^T μ₁ - w^T μ₀)² / (w^T Σ₁ w + w^T Σ₀ w),

  where μ₀ and μ₁ are the class means and Σ₀ and Σ₁ the covariance matrices of the two classes.
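
The following sketch (not from the slides) simply evaluates the separability score J(w) for a candidate projection w; the two-class data X0, X1 and the vector w are made up for illustration.

import numpy as np

# Illustrative two-class data and a candidate projection w
X0 = np.random.randn(100, 2)
X1 = np.random.randn(100, 2) + np.array([3.0, 1.0])
w = np.array([1.0, 0.5])

def fisher_score(w, X0, X1):
    # J(w) = (w^T mu1 - w^T mu0)^2 / (w^T Sigma1 w + w^T Sigma0 w)
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    S0, S1 = np.cov(X0, rowvar=False), np.cov(X1, rowvar=False)
    num = (w @ m1 - w @ m0) ** 2
    den = w @ S1 @ w + w @ S0 @ w
    return num / den

print("J(w) =", fisher_score(w, X0, X1))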

20 Maximization of J(w)
• The Fisher separation score can be maximized analytically.
• First simplify J(w) a bit:

  J(w) = (w^T μ₁ - w^T μ₀)² / (w^T Σ₁ w + w^T Σ₀ w) = (w^T S_B w) / (w^T S_W w),

  with S_B = (μ₁ - μ₀)(μ₁ - μ₀)^T and S_W = Σ₀ + Σ₁.
• This form is known as the Generalized Rayleigh Quotient, which appears in many contexts.
• Using Lagrange multipliers, one can show that the maximum must satisfy

  S_B w = λ S_W w,

  with λ ∈ R a constant.

21 Maximization of J(w)
• The above condition is a generalized eigenvalue problem.
• In other words, there are many solutions: one for each eigenvalue.
• Thus, it is straightforward to conclude that the quantity

  J(w) = (w^T S_B w) / (w^T S_W w)

  is maximized when

  S_B w = λ S_W w,   or equivalently,   w^T S_B w = λ w^T S_W w.

• Finally, the ratio w^T S_B w / (w^T S_W w) is maximized by choosing λ as the largest eigenvalue (and w as the corresponding eigenvector).

22 Solution 1 using Eigenvalues
# Training code:
import numpy as np
from scipy.linalg import eig

# X0 and X1 contain the data of the two classes
m0 = np.mean(X0, axis = 0)
m1 = np.mean(X1, axis = 0)
C0 = np.cov(X0 - m0, rowvar = False)
C1 = np.cov(X1 - m1, rowvar = False)

SB = np.multiply.outer(m1 - m0, m1 - m0)
SW = C0 + C1

D, V = eig(SB, SW)
# eig does not guarantee any ordering of the eigenvalues,
# so pick the eigenvector of the largest eigenvalue explicitly:
w = V[:, np.argmax(D.real)]
T = np.mean([np.dot(m1, w), np.dot(m0, w)])

# Testing code:
>>> np.dot(w, [0, -2]) - T   # Positive -> in class "red circles"
>>> np.dot(w, [0, -3]) - T   # Negative -> in class "blue crosses"

• We need to solve the generalized eigenvalue problem S_B w = λ S_W w.
• scipy has a solver for this: scipy.linalg.eig returns the array of eigenvalues D and the matrix of eigenvectors V (in columns).
• We want the eigenvector corresponding to the largest eigenvalue (eig does not sort the eigenvalues, so the code above selects it with argmax).
• The decision is done by comparing the projected vector to the threshold T, the midpoint of the projected class means.

23 LDA Projection with Threshold
(Figure: the projected data of both classes with the decision threshold marked.)

24 Solution without Eigenvalues
• It turns out that solving the generalized eigenvalue problem S_B w = λ S_W w is not necessary.
• The maximum of J(w) has to satisfy S_B w = λ S_W w, or equivalently, S_W^{-1} S_B w = λ w.
• Insert the definition of S_B to get

  S_W^{-1} (μ₁ - μ₀) (μ₁ - μ₀)^T w = λ w,

  where the term (μ₁ - μ₀)^T w is a scalar, denoted by C, or

  (C / λ) S_W^{-1} (μ₁ - μ₀) = w.

25 Solution without Eigenvalues
• Thus, the maximum of J(w) has to satisfy this condition as well (regardless of λ).
• We are looking for a direction vector w, so the multiplier C/λ is not relevant either ((C/λ) w has the same direction as w).
• Thus, w is given by

  w = S_W^{-1} (μ₁ - μ₀) = (Σ₀ + Σ₁)^{-1} (μ₁ - μ₀).

26 Solution without Eigenvalues
# Training code:
import numpy as np

# X0 and X1 contain the data of the two classes
m0 = np.mean(X0, axis = 0)
m1 = np.mean(X1, axis = 0)
C0 = np.cov(X0 - m0, rowvar = False)
C1 = np.cov(X1 - m1, rowvar = False)

w = np.dot(np.linalg.inv(C0 + C1), (m1 - m0))
T = np.mean([np.dot(m1, w), np.dot(m0, w)])

# Testing code:
>>> w
array([ 0.53757,  1.177])
>>> np.dot(w, [0, -2]) - T   # Positive -> in class "red circles"
>>> np.dot(w, [0, -3]) - T   # Negative -> in class "blue crosses"

• The Python code is very similar to the eigenvalue-based solution.
• Note that the resulting w is not exactly the same: the lengths differ, but the directions are the same.
(Figure: the projected data with the decision threshold, as before.)

27 Solution 3: Scikit-Learn
# Training code:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
clf = LinearDiscriminantAnalysis()
# X contains all samples, and y their class
# labels: y = [0, 1, 1, 0, ...]
clf.fit(X, y)

# Testing code:
>>> clf.coef_        # This is our w vector
>>> clf.intercept_   # This is the threshold
>>> clf.predict([[0, -2]])
array([ 1.])
>>> clf.predict([[0, -3]])
array([ 0.])
>>> clf.predict_proba([[0, -3]])
# Returns the class probabilities for the sample.

• Scikit-Learn implements LDA in a straightforward manner:
  • First construct the classifier using LinearDiscriminantAnalysis().
  • Then train the model using fit().
  • Then predict classes using predict(), or class probabilities using predict_proba().

28 Multiclass LDA
• The LDA can be generalized to handle more than two classes.
• Multiclass LDA can be understood in two contexts:
  • As dimensionality reduction: we seek a set of projections that lower the dimensionality of the data while retaining its original separability.
  • As classification: we seek a set of discriminant functions that give a probability score for each class.
• We are more interested in the latter interpretation.

29 Multiclass LDA
• In this case, the classifier is defined by a set of linear functions

  g₁(x) = w₁^T x + w₁₀
  ...
  g_K(x) = w_K^T x + w_K0

• Then the vector x is assigned to class k ∈ {1, 2, ..., K} if

  g_k(x) > g_j(x)   for all j ≠ k.

30 Multiclass LDA
# Training code:
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
clf = LinearDiscriminantAnalysis()
# X contains all samples, and y their class
# labels: y = [1, 3, 1, 2, ...]
clf.fit(X, y)

• The LinearDiscriminantAnalysis() model applies to the multiclass case as well.
• The fit() method finds the three discriminants, also shown in the plot below.

# Testing code:
# Show the three discriminants (one per row):
>>> clf.coef_

# Compute the discriminant scores for [0, -2]:
>>> scores = np.dot(clf.coef_, [0, -2]) + clf.intercept_
# The first class (red) has the largest score.

# Predict the class more easily using predict():
>>> clf.predict([[0, -2]])
array([ 1.])

(Figure: the three-class data and the decision regions of the three discriminants.)

31 The Support Vector Machine
(Figure: a linearly separable two-class dataset with the margin and the three support vectors marked.)
• The Support Vector Machine (SVM) is characterized by its maximum margin property.
• In other words, it attempts to maximize the margin between classes.
• In the attached example, the binary data is perfectly separable, and the SVM sets the boundary in the middle of the two classes.
• Thus, the boundary location is defined by only three samples, called support vectors.

32 The Support Vector Machine
• Maximizing the margin M of the SVM can be characterized as an optimization problem:

  maximize_{w, b, ‖w‖ = 1}  M
  subject to  y_i (x_i^T w + b) ≥ M,   for i = 1, 2, ..., N.

• Here, y_i encodes the class as y_i ∈ {-1, 1}, so the last condition can also be written as

  x_i^T w + b ≥ M,   if y_i = 1
  x_i^T w + b ≤ -M,  if y_i = -1.

33 The Support Vector Machine
• Alternatively, the SVM criterion can be simplified into an equivalent form:

  minimize_{w, b}  ‖w‖
  subject to  y_i (x_i^T w + b) ≥ 1,   for i = 1, 2, ..., N.

• This is a convex optimization problem with constraints, and efficient tailored algorithms exist.
• The implementation in Scikit-Learn is based on the widely used LibSVM library.

34 The Support Vector Machine
• The earlier definition of the SVM assumes that the classes are separate (and that there exists a margin between the classes).
• In reality we have to allow some samples to reside on the wrong side of the margin.
• The solution is to penalize samples on the wrong side.
• The resulting function to be minimized is

  minimize_{w, b}  Σ_{i=1}^{N} max(0, 1 - y_i (w^T x_i + b)) + C ‖w‖².

• For each sample on the wrong side, we add a penalty (called the hinge loss) of the form max(0, x).
• Note that this increases the number of support vectors.
(Figure: the hinge loss; zero on the right side of the margin, a linear penalty on the wrong side.)
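
A minimal sketch (not from the slides) that evaluates the soft-margin objective above for a given w and b; the data X, labels y, and the values of w, b and C are illustrative assumptions.

import numpy as np

# Illustrative data with labels in {-1, +1}
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

def soft_margin_objective(w, b, X, y, C=1.0):
    # Hinge loss for each sample: max(0, 1 - y_i (w^T x_i + b))
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    # Penalized objective: total hinge loss plus C * ||w||^2
    return hinge.sum() + C * np.dot(w, w)

w = np.array([0.5, 0.5])
b = 0.0
print("Objective value:", soft_margin_objective(w, b, X, y))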

35 SVM in Scikit-Learn
• The parameter C determines the balance between the margin width and the penalty for being on the wrong side.
(Figure: from left to right, linear SVMs trained with decreasing values of the penalty C.)

36 Kernel Trick
(Figures: SVM with a 2nd order polynomial kernel; SVM with the RBF kernel.)
• The SVM can be extended to nonlinear boundaries using the kernel trick.
• The kernel trick essentially maps the data into a higher dimension and designs the linear SVM there.
• For example, the top plot can be generated as follows:
  • Map the 2-D samples into 3-D: (x, y) ↦ (x, y, xy).
  • Train the SVM with the new 3-D samples.
  • The decision boundary is linear in 3-D but nonlinear in 2-D.
• However, this explicit transformation is slow; the kernel trick does the same thing implicitly.

37 Kernel Trick
• It can be shown that substituting the dot product by a suitable nonlinear function is equivalent to such an explicit mapping.
• More specifically, an algorithm can be kernelized by inserting a kernel function κ(x, y) in place of all dot products x · y.
• It can be shown that under certain conditions on the kernel (e.g., positive semidefiniteness), this is equivalent to an explicit mapping.
• A lot of research has been done on the relation between a kernel and the corresponding mapping.
• However, we take a more practical approach and only consider an example.
(Figures: SVM with a 2nd order polynomial kernel; SVM with the RBF kernel.)

38 Kernel Trick
• As an example, consider what happens when the inner product of the 2-D vectors x = (x₁, x₂) and y = (y₁, y₂) is substituted by its second power:

  κ(x, y) = (x · y)².

• We can expand the kernel as follows:

  κ(x, y) = (x · y)² = (x₁y₁ + x₂y₂)² = (x₁y₁)² + (x₂y₂)² + 2 x₁y₁x₂y₂.

39 Kernel Trick
• The result can be cleverly rearranged to make it look like a dot product in 3-D:

  κ(x, y) = (x₁y₁)² + (x₂y₂)² + 2 x₁y₁x₂y₂
          = (x₁y₁)² + (x₂y₂)² + (√2 x₁x₂)(√2 y₁y₂)
          = (x₁², x₂², √2 x₁x₂) · (y₁², y₂², √2 y₁y₂).

40 Kernel Trick
• Thus, the following two things are equivalent:
  • Explicit mapping: transform the 2-D data into 3-D explicitly via (u, v) ↦ (u², v², √2 uv) and fit the SVM with the transformed data.
  • Implicit mapping: substitute each dot product in the SVM algorithm by the kernel κ(x, y) = (x · y)² and fit with the original 2-D data.
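
A short numerical check of this equivalence (not from the slides): the kernel value (x · y)² equals the dot product of the explicitly mapped vectors. The two test vectors are arbitrary.

import numpy as np

def kernel(x, y):
    # Second-order polynomial kernel: (x . y)^2
    return np.dot(x, y) ** 2

def phi(v):
    # Explicit 3-D mapping (u, v) -> (u^2, v^2, sqrt(2) u v)
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(kernel(x, y))             # (1*3 + 2*(-1))^2 = 1.0
print(np.dot(phi(x), phi(y)))   # the same value via the explicit mapping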

41 Popular Kernels
• As mentioned, there is a lot of literature on kernels. Popular ones include:
  • Linear kernel: κ(x, y) = x · y. This is the basic SVM with no mapping.
  • Polynomial kernel: κ(x, y) = (x · y)^d. Raises the dot product to the d-th power.
  • Inhomogeneous polynomial kernel: κ(x, y) = (x · y + 1)^d. Similar to the polynomial kernel, but produces a few more dimensions.
  • Sigmoid kernel: κ(x, y) = tanh(a x · y + b), with a > 0 and b < 0.
  • Gaussian kernel: κ(x, y) = exp(-‖x - y‖² / (2σ²)). Probably the most widely used kernel. Also known as the Radial Basis Function (RBF) kernel.
• The RBF kernel is special in the sense that it corresponds to a mapping into an infinite-dimensional vector space.
• In addition to the mathematical excitement, the RBF kernel is often the best performing one as well.
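
As an illustration (not from the slides), the sketch below computes a Gaussian/RBF kernel matrix for a few points, assuming the exp(-‖x - y‖² / (2σ²)) form listed above; the data and the value of sigma are made up.

import numpy as np

def rbf_kernel(X, sigma=1.0):
    # Pairwise squared Euclidean distances between the rows of X
    sq = np.sum(X ** 2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2 * X @ X.T
    # Gaussian / RBF kernel matrix: exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
print(rbf_kernel(X, sigma=1.0))   # the diagonal is always 1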

42 SVM in Scikit-Learn
# Training code:
from sklearn.svm import SVC
# Alternatively: from sklearn.svm import LinearSVC

# Kernels include: "linear", "rbf", "poly", "sigmoid"
# C is the penalty term (default = 1)
clf = SVC(kernel = "linear", C = 1)
# X contains all samples, and y their class
# labels: y = [0, 1, 1, 0, ...]
clf.fit(X, y)

# Testing code:
>>> clf.coef_              # This is our w vector (linear case)
>>> clf.intercept_         # This is the threshold
>>> clf.predict([[-2, -2], [-2, 2]])
array([ 0.,  1.])
>>> clf.support_vectors_   # The samples that define the boundary

• Scikit-Learn wraps the LibSVM and LibLinear libraries (the latter is optimized for the linear kernel).
(Figure: the separating boundary with the margin and support vectors marked.)

43 Multiclass SVM
• The SVM is inherently a two-class classifier.
• Generalization to many classes is done by comparing pairs of classes, or each class against the rest.
• For example, for each class we can train an SVM that classifies this class vs. all others.
• Scikit-Learn provides a few meta-classifiers for this purpose:
  • multiclass.OneVsRestClassifier compares each class vs. all others (requires K classifiers).
  • multiclass.OneVsOneClassifier compares all class pairs (requires K(K-1)/2 classifiers).

44 Multiclass SVM
• A further extension is multilabel classification, where several classes can be present simultaneously.
• Targets are presented as binary indicators:

>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> y = [[2, 3, 4], [2], [0, 1, 3], [0, 1, 2, 3, 4], [0, 1, 2]]
>>> MultiLabelBinarizer().fit_transform(y)
array([[0, 0, 1, 1, 1],   # Classes 2, 3, 4 are "on"
       [0, 0, 1, 0, 0],   # Class 2 is "on"
       [1, 1, 0, 1, 0],   # ...
       [1, 1, 1, 1, 1],
       [1, 1, 1, 0, 0]])

• Multilabel problems occur often in, e.g., image recognition: "What is shown in this picture?"
• For example, the following attributes are "on" in the attached picture: {"fly", "flower", "summer"}.

45 Multiclass SVM
# Training code:
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multiclass import OneVsOneClassifier

clf_ova = OneVsRestClassifier(LinearSVC())
clf_ova.fit(X, y)

clf_ovo = OneVsOneClassifier(LinearSVC())
clf_ovo.fit(X, y)

# Testing code:
>>> clf_ova.predict(np.array([[0, -3.], [0, -2]]))
array([ 2.,  3.])
>>> clf_ovo.predict(np.array([[0, -3.], [0, -2]]))
array([ 1.,  3.])
>>> len(clf_ova.estimators_)
3

• SVC() in fact implements the OvO heuristic inherently.
• LinearSVC() does not, so let's use that as our example.
(Figures: decision regions of the linear SVM with the OvA wrapper and with the OvO wrapper.)

46 Logistic Regression
• The third member of the linear classifier family is Logistic Regression (LR).
• Unlike the other two, LR is probabilistic, i.e., it models the class probabilities instead of plain class memberships.
• For the two-class case (c ∈ {0, 1}), the model is

  p(c = 1 | x) = 1 / (1 + exp[-(w^T x + b)]).

• Also: p(c = 0 | x) = 1 - p(c = 1 | x).
• In essence, the model maps the projection w^T x + b through the sigmoid function (thus limiting the range to [0, 1]).
(Figure: the logistic sigmoid function.)
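
A minimal sketch of the two-class model above (not from the slides); the weights w, bias b, and the test point x are arbitrary illustrative values, not trained parameters.

import numpy as np

def predict_proba_lr(x, w, b):
    # p(c = 1 | x) = 1 / (1 + exp(-(w^T x + b)))
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

# Illustrative parameters (not learned from data)
w = np.array([1.5, -0.5])
b = 0.2

x = np.array([2.0, 1.0])
p1 = predict_proba_lr(x, w, b)
print("p(c = 1 | x) =", p1)
print("p(c = 0 | x) =", 1.0 - p1)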

47 Logistic Regression
• The model is illustrated in the flowgraph:
  • The samples are projected to 1-D using the learned weights w and b.
  • The resulting "class scores" are mapped through the logistic function, which transforms the scores to probabilities p(c = 1 | x) ∈ [0, 1].
• The probability estimates can also be mapped back to 2-D, as shown in the figure on the right.
(Figures: flowgraph "projection, class score, logistic sigmoid, class probability", and the logistic regression probabilities mapped back to 2-D.)

48 Training Logistic Regression Models
• Logistic regression is trained by maximum likelihood.
• In other words, we maximize the likelihood of observing these data points with respect to the model parameters.
• The likelihood of the samples X = [x₀, x₁, ..., x_{N-1}] with class labels y₀, y₁, ..., y_{N-1} ∈ {1, 2, ..., K} to have occurred from the model with parameters θ is

  p(X | θ) = Π_{n=0}^{N-1} p(x_n; θ, y_n).

• As usual, we consider the log-likelihood instead:

  ln p(X | θ) = Σ_{n=0}^{N-1} ln p(x_n; θ, y_n).

49 Training Logistic Regression Models
• For simplicity, we will continue with the two-class case only.
• Let's label the classes as y_k ∈ {-1, 1}, because this will simplify the notation later.
• Also, let's hide the constant term b by concatenating it to the end of w:

  w ← [w, b]   and   x ← [x, 1].

50 Training Logistic Regression Models
• The likelihoods for classes 1 and -1 are now

  p(x_n | y_n = 1) = 1 / (1 + exp(-w^T x_n))

  p(x_n | y_n = -1) = 1 - 1 / (1 + exp(-w^T x_n))
                    = [(1 + exp(-w^T x_n)) - 1] / (1 + exp(-w^T x_n))
                    = exp(-w^T x_n) / (1 + exp(-w^T x_n))
                    = 1 / (exp(w^T x_n) + 1)
                    = 1 / (1 + exp(w^T x_n)).

51 Training Logistic Regression Models
• Now p(x_n | y_n = 1) and p(x_n | y_n = -1) have almost the same form and can be conveniently combined into one formula:

  p(x_n | y_n) = 1 / (1 + exp(-y_n w^T x_n)).

• The likelihood of all samples X = [x₀, x₁, ..., x_{N-1}] is now

  p(X | w, y) = Π_{n=0}^{N-1} 1 / (1 + exp(-y_n w^T x_n)).

52 Training Logistic Regression Models
• The log-likelihood becomes

  ln p(X | w, y) = Σ_{n=0}^{N-1} [ln 1 - ln(1 + exp(-y_n w^T x_n))]
                 = - Σ_{n=0}^{N-1} ln(1 + exp(-y_n w^T x_n)).

53 Training Logistic Regression Models
• In order to maximize the likelihood, we can equivalently minimize the logistic loss function:

  ℓ(w) = Σ_{n=0}^{N-1} ln(1 + exp(-y_n w^T x_n)).

• There exist several algorithms for minimizing the logistic loss, e.g.:
  • Iteratively Reweighted Least Squares (IRLS) is an algorithm specifically tailored for this kind of problem. It is used by statsmodels.api.Logit and statsmodels.api.MNLogit.
  • Optimization-theory-based approaches: since the logistic loss is convex, in principle any optimization approach can be used. Scikit-Learn uses a Trust Region Newton algorithm.
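
A minimal sketch (not from the slides) that evaluates the logistic loss ℓ(w) above for a given weight vector; the data X (with the constant 1 appended for the bias, as on slide 49), the labels y, and the vector w are illustrative assumptions.

import numpy as np

def logistic_loss(w, X, y):
    # l(w) = sum_n ln(1 + exp(-y_n w^T x_n)), with labels y_n in {-1, +1}
    margins = y * (X @ w)
    return np.sum(np.log1p(np.exp(-margins)))

# Illustrative data with the bias handled via an appended constant 1 feature
X = np.array([[1.0, 2.0, 1.0], [2.0, 1.0, 1.0], [-1.0, -1.0, 1.0], [-2.0, 0.5, 1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([0.5, 0.5, 0.0])

print("Logistic loss:", logistic_loss(w, X, y))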

54 Training Logistic Regression Models
• The optimization theory approach has become dominant, for two reasons:
  1. The SVM training can also be posed in the same framework: minimize the hinge loss

     hinge-loss = max(0, 1 - y_n w^T x_n).

     A general purpose optimizer can minimize such modified losses as well.
  2. Most importantly, one can add a penalty term to the loss function; e.g.,

     penalized log-loss = Σ_{n=0}^{N-1} ln(1 + exp(-y_n w^T x_n)) + C w^T w,

     where the latter term favors small coefficients in w. We will return to this technique, called regularization, later.

55 Training Logistic Regression Models
• In the exercises, we will implement log-loss minimization for two-dimensional data:
  1. Initialize w at random.
  2. Adjust w towards the negative gradient: w ← w - ε ∇ℓ(w).
  3. Return to step 2.
• The result is shown on the right.
(Figures: the optimization path of w from the starting point to the endpoint, and the classification accuracy as a function of the iteration.)
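
The following is a minimal sketch of the gradient descent loop described above, not the course exercise solution: the synthetic data, the step size ε and the iteration count are arbitrary choices made for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data with labels in {-1, +1}; the bias is handled via an appended 1
X = np.vstack([rng.normal(loc=[2, 2], size=(50, 2)), rng.normal(loc=[-2, -2], size=(50, 2))])
X = np.hstack([X, np.ones((100, 1))])
y = np.hstack([np.ones(50), -np.ones(50)])

def grad_logistic_loss(w, X, y):
    # Gradient of sum_n ln(1 + exp(-y_n w^T x_n)) with respect to w
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))   # equals sigmoid(-y_n w^T x_n)
    return -(X.T @ (y * s))

w = rng.normal(size=3)    # 1. Initialize w at random
eps = 0.01                # step size (arbitrary choice)
for _ in range(200):      # 2.-3. Repeat the gradient step
    w = w - eps * grad_logistic_loss(w, X, y)

accuracy = np.mean(np.sign(X @ w) == y)
print("Training accuracy:", accuracy)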

56 Example: Effect of the Regularization Parameter C
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Split the data into training and testing parts
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size = 0.2)

# Test C values 10^-4, 10^-3, ..., 10^2
C_range = 10. ** np.arange(-4, 3)

clf = LogisticRegression()
for C in C_range:
    clf.C = C
    clf.fit(X_train, y_train)
    y_hat = clf.predict(X_test)
    accuracy = 100. * np.mean(y_hat == y_test)
    print("Accuracy for C = %.0e is %.1f %% (||w|| = %.3f)" % \
          (C, accuracy, np.linalg.norm(clf.coef_)))

# Code output for the 3-class example data (both the accuracy and ||w||
# grow with C); the last values are:
Accuracy for C = 1e+00 is 90.8 % (||w|| = 3.153)
Accuracy for C = 1e+01 is 91.7 % (||w|| = 3.773)
Accuracy for C = 1e+02 is 91.7 % (||w|| = 3.870)

• In this example, we train the LR classifier with a range of values for the parameter C.
• Each C value can be set inside a for loop as clf.C = <new value>.
• We test the accuracy on a separate test set extracted from the complete data set.

57 Multiclass Logistic Regression
There are two ways to extend LR to the multiclass case.
➊ A straightforward way is to normalize the exponential terms to sum up to 1:

  p(c = k | x) = exp(w_k^T x + b_k) / Σ_{j=1}^{K} exp(w_j^T x + b_j).

  However, one of the terms is unnecessary, because the probabilities add up to 1.
➋ Thus, we get an alternative model with fewer parameters:

  p(c = k | x) = exp(w_k^T x + b_k) / (1 + Σ_{j=1}^{K-1} exp(w_j^T x + b_j)),   k = 1, 2, ..., K-1,

  p(c = K | x) = 1 / (1 + Σ_{j=1}^{K-1} exp(w_j^T x + b_j)).
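
A minimal sketch (not from the slides) of model ➊ above, computing the normalized class probabilities for one sample; the weight matrix W, the intercepts b, and the test point x are made-up values for K = 3 classes and 2 features.

import numpy as np

def softmax_proba(x, W, b):
    # Version 1: p(c = k | x) = exp(w_k^T x + b_k) / sum_j exp(w_j^T x + b_j)
    scores = W @ x + b
    scores -= scores.max()          # subtract the maximum for numerical stability
    e = np.exp(scores)
    return e / e.sum()

# Illustrative parameters (not learned from data)
W = np.array([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]])
b = np.array([0.0, 0.1, -0.1])

x = np.array([2.0, -1.0])
print(softmax_proba(x, W, b))   # three probabilities summing to 1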

58 Multiclass Version ➊
• The first approach designs one probability model p(c = k | x) for each class.
• However, the model is ill-posed: there are infinitely many solutions (by scaling).
• Thus, the model needs to be regularized to create a unique solution.
• This is the model used by Scikit-Learn.

# Training code:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X, y)

# Test code:
>>> clf.predict([[1, -1], [0, -2]])
array([ 1.,  2.])
>>> clf.predict_proba([[1, -1], [0, -2]])
# One row of K = 3 class probabilities per sample.

# Model parameters:
>>> clf.coef_       # three weight vectors (one per class)
>>> clf.intercept_  # three intercepts

(Figures: the data, the predicted classes, and the class probabilities.)

59 Multiclass Version ➋
• The second version is historically older.
• It has a unique solution even without regularization.
• It is not implemented in Scikit-Learn, but can be found in the statsmodels library.

# Training code:
from statsmodels.api import MNLogit

# statsmodels has a slightly different API.
# In reality, X should be zero-mean, but
# let's simplify the code a bit here.
clf = MNLogit(y, X)
clf = clf.fit()

# Test code:
>>> clf.predict([[1, -1], [0, -2]])
# clf.predict() gives the class probabilities (one row per sample).
# To get the classes, we take the index of the largest probability:
>>> clf.predict([[1, -1], [0, -2]]).argmax(axis = 1)

# Model parameters:
>>> clf.params
# Note: only two sets of weights.
# Note: no intercept; assumes zero-mean data.

(Figures: the data, the predicted classes, and the class probabilities.)


More information

Conjugate-Gradient. Learn about the Conjugate-Gradient Algorithm and its Uses. Descent Algorithms and the Conjugate-Gradient Method. Qx = b.

Conjugate-Gradient. Learn about the Conjugate-Gradient Algorithm and its Uses. Descent Algorithms and the Conjugate-Gradient Method. Qx = b. Lab 1 Conjugate-Gradient Lab Objective: Learn about the Conjugate-Gradient Algorithm and its Uses Descent Algorithms and the Conjugate-Gradient Method There are many possibilities for solving a linear

More information

Machine Learning for Signal Processing Bayes Classification and Regression

Machine Learning for Signal Processing Bayes Classification and Regression Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For

More information

Motivating the Covariance Matrix

Motivating the Covariance Matrix Motivating the Covariance Matrix Raúl Rojas Computer Science Department Freie Universität Berlin January 2009 Abstract This note reviews some interesting properties of the covariance matrix and its role

More information

Constrained Optimization and Support Vector Machines

Constrained Optimization and Support Vector Machines Constrained Optimization and Support Vector Machines Man-Wai MAK Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University enmwmak@polyu.edu.hk http://www.eie.polyu.edu.hk/

More information

Assignment 4. Machine Learning, Summer term 2014, Ulrike von Luxburg To be discussed in exercise groups on May 12-14

Assignment 4. Machine Learning, Summer term 2014, Ulrike von Luxburg To be discussed in exercise groups on May 12-14 Assignment 4 Machine Learning, Summer term 2014, Ulrike von Luxburg To be discussed in exercise groups on May 12-14 Exercise 1 (Rewriting the Fisher criterion for LDA, 2 points) criterion J(w) = w, m +

More information

Last updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

Last updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition Last updated: Oct 22, 2012 LINEAR CLASSIFIERS Problems 2 Please do Problem 8.3 in the textbook. We will discuss this in class. Classification: Problem Statement 3 In regression, we are modeling the relationship

More information

CENG 793. On Machine Learning and Optimization. Sinan Kalkan

CENG 793. On Machine Learning and Optimization. Sinan Kalkan CENG 793 On Machine Learning and Optimization Sinan Kalkan 2 Now Introduction to ML Problem definition Classes of approaches K-NN Support Vector Machines Softmax classification / logistic regression Parzen

More information

10/05/2016. Computational Methods for Data Analysis. Massimo Poesio SUPPORT VECTOR MACHINES. Support Vector Machines Linear classifiers

10/05/2016. Computational Methods for Data Analysis. Massimo Poesio SUPPORT VECTOR MACHINES. Support Vector Machines Linear classifiers Computational Methods for Data Analysis Massimo Poesio SUPPORT VECTOR MACHINES Support Vector Machines Linear classifiers 1 Linear Classifiers denotes +1 denotes -1 w x + b>0 f(x,w,b) = sign(w x + b) How

More information

Support vector machines

Support vector machines Support vector machines Jianxin Wu LAMDA Group National Key Lab for Novel Software Technology Nanjing University, China wujx2001@gmail.com May 10, 2018 Contents 1 The key SVM idea 2 1.1 Simplify it, simplify

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Machine Learning Support Vector Machines. Prof. Matteo Matteucci

Machine Learning Support Vector Machines. Prof. Matteo Matteucci Machine Learning Support Vector Machines Prof. Matteo Matteucci Discriminative vs. Generative Approaches 2 o Generative approach: we derived the classifier from some generative hypothesis about the way

More information