
1 DATA MINING AND MACHINE LEARNING Lecture 4: Linear models for regression and classification Lecturer: Simone Scardapane Academic Year 2016/2017

2 Table of contents Linear models for regression Regularized least squares SVD analysis of ridge regression Online estimation of LRR Multiple outputs Hints on the geometry of least squares Linear expansion on random bases Linear models for classification Least squares classification Logistic regression Discriminant analysis

3 OLS for a fixed basis expansion Going back to OLS, we can easily obtain a nonlinear model by first transforming the input x via a nonlinear function $h(x): \mathbb{R}^d \to \mathbb{R}^h$. In this case, we construct a linear model in $\mathbb{R}^h$, which is implicitly nonlinear in the original space $\mathbb{R}^d$: $f(x) = w^T h(x)$. As a motivating example, consider a one-dimensional input x and, for some user-specified P, the polynomial expansion: $h(x) = \left[x^0, x^1, \ldots, x^P\right]$.
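
As a rough sketch of this idea (not code from the lecture), the snippet below builds the polynomial expansion h(x) for a one-dimensional input and fits the weights by ordinary least squares with NumPy; the data-generating function and noise level are arbitrary choices mirroring the example in the next slides.

```python
import numpy as np

def poly_expand(x, P):
    """Map a 1-D input array to the polynomial basis [x^0, x^1, ..., x^P]."""
    return np.vander(x, N=P + 1, increasing=True)

# Toy dataset (arbitrary choice, loosely following the slides' example).
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=30)
y = 2 * np.sin(x) - 0.4 * x + rng.normal(scale=0.3, size=x.shape)

P = 3
H = poly_expand(x, P)                      # N x (P+1) design matrix
w, *_ = np.linalg.lstsq(H, y, rcond=None)  # OLS solution in the expanded space

y_hat = poly_expand(x, P) @ w              # predictions f(x) = w^T h(x)
print("training MSE:", np.mean((y - y_hat) ** 2))
```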

4 Polynomial expansion (1) Figure 1: Data sampled from $2\sin(x) - 0.4x$ with Gaussian noise; for P = 1, we obtain the standard OLS estimate.

5 Polynomial expansion (2) Figure 2: As before, but we choose P = 3. The model is now nonlinear in the original space, providing a relatively good approximation to the underlying model.

6 Polynomial expansion (3) Figure 3: We now increase P to 7. The model complexity has increased, but its approximation capabilities have clearly worsened. This is an example of overfitting.

7 The learning curve By setting P = N, we can construct a predictor which passes exactly through all training points, achieving an empirical error of 0. This is reminiscent of the k-NN behavior with k = 1. We can visualize this behavior by plotting simultaneously the evolution of the training error and of the error on an independent test set, as in the next slide. Note that this polynomial expansion becomes exponentially more complex as we increase d. As an example, for d = 3 and P = 2 we already need h = 10 terms: $h(x) = \left[1, x_1, x_2, x_3, x_1^2, x_2^2, x_3^2, x_1x_2, x_1x_3, x_2x_3\right]$.

8 Example of overfitting Figure 4: The training set is composed of 20 elements, the test set of 250 independent samples. We see that overfitting worsens as P increases.

9 Example of overfitting (2) Figure 5: Similar to before, but we increase the training set size to 40. The results are better for low P, but overfitting is still a problem for larger values of the polynomial expansion.

10 The cause of overfitting in this case Figure 6: If we plot the norm of w for varying P, we see that overfitting is closely linked to the magnitude of the weights. Larger weights make the functions less smooth and more prone to overfitting.

11 Regularized least squares We can try to improve the results by including in the optimization process a term favoring low-norm solutions, in line with our intuition that such solutions generalize better: $J(w) = \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - w^Th(x_i)\right)^2 + \frac{C}{2}\left\lVert w\right\rVert_2^2$, where C balances the two terms. This model is called regularized least squares (RLS), or linear ridge regression (LRR). The solution can still be expressed in closed form as: $w = \left(H^TH + CI\right)^{-1}H^Ty$, where H is constructed similarly to X, and I is the identity matrix of appropriate size.
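
A minimal sketch of the closed-form solution above, assuming the expanded design matrix H has already been built (for instance with the polynomial expansion of the earlier sketch); the value of C is an arbitrary illustrative choice.

```python
import numpy as np

def ridge_fit(H, y, C):
    """Closed-form RLS/ridge solution w = (H^T H + C I)^{-1} H^T y.

    Solving the linear system is preferred to forming the inverse explicitly.
    """
    d = H.shape[1]
    return np.linalg.solve(H.T @ H + C * np.eye(d), H.T @ y)

# Example usage with random data (shapes only matter for the illustration).
rng = np.random.default_rng(0)
H = rng.normal(size=(50, 8))
y = rng.normal(size=50)
w = ridge_fit(H, y, C=0.01)
```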

12 Polynomial expansion with regularization (1) Figure 7: We solve the model for P = 7 with a small amount of regularization. The results are greatly improved.

13 Polynomial expansion with regularization (2) Figure 8: If we increase C too much (0.1 in this case), the model is again too simple to model anything interesting. This is an example of underfitting.

14 The learning curve with regularization Figure 9: By including a small regularization term C = 0.01, the overfitting is kept under control even for large values of P.

15 The norm of the weights with regularization Figure 10: If we plot the norm of the weights, we see that after a certain value of P, the norm is curtailed by the presence of the $\ell_2$ norm in the optimization problem.

16 Regularized risk minimization The previous idea is extremely powerful; in fact, a large set of supervised learning methods can be expressed as the solution of a regularized risk functional given by two terms: $I_{\text{reg}}[f] = \underbrace{\frac{1}{N}\sum_{i=1}^{N}L\left(y_i, f(x_i)\right)}_{\text{Training error}} + \underbrace{C\,R[f]}_{\text{Regularization}}$, where R[f] embodies any a priori information we might have on the optimal function, generally trading off its complexity with respect to training accuracy. The parameter C is a hyper-parameter similar to the k of k-NN, allowing the user to select a complexity adequate to the problem under consideration.

17 Table of contents Linear models for regression Regularized least squares SVD analysis of ridge regression Online estimation of LRR Multiple outputs Hints on the geometry of least squares Linear expansion on random bases Linear models for classification Least squares classification Logistic regression Discriminant analysis

18 The SVD decomposition The singular value decomposition (SVD) of the matrix X is defined as $X = UDV^T$, where: U is an $N \times d$ semi-orthogonal matrix, whose columns span the column space of X. V is a $d \times d$ orthogonal matrix, whose columns span the row space of X. D is a $d \times d$ diagonal matrix, whose diagonal elements satisfy $D_{11} \ge D_{22} \ge \ldots \ge D_{dd} \ge 0$ and are called the singular values of X.

19 OLS with SVD decomposition The SVD decomposition can be computed in a stable fashion, and its factors have many useful properties: For U, due to the orthonormality of the columns, we have $U^TU = I$; the same is true for V, which (being square) also satisfies $V^{-1} = V^T$. For D, being diagonal, we have that $D^T = D$; also, $D^{-1}$ is a diagonal matrix with elements $1/D_{ii}$. In addition to this, we recall that for any two matrices A, B of appropriate sizes: $(AB)^T = B^TA^T$ and $(AB)^{-1} = B^{-1}A^{-1}$.

20 OLS with SVD decomposition (2) We can rewrite the OLS estimate as: $w = \left(X^TX\right)^{-1}X^Ty = \left(VDU^TUDV^T\right)^{-1}VDU^Ty = VD^{-1}U^Ty$. Note that if X is singular, at least one singular value is 0, leading to an undefined operation in the computation of $D^{-1}$. More generally, the operation is problematic even when X is very close to singular (e.g., when two features are highly correlated).

21 LRR with SVD decomposition With regularization, the matrix inversion step is modified as: $\left(VD^2V^T + CI\right)^{-1} = \left(VD^2V^T + CVV^T\right)^{-1} = \left(V\left(D^2 + CI\right)V^T\right)^{-1}$. From this we can write the LRR solution as: $w = V\tilde{D}^{-1}U^Ty$, with $\tilde{D}^{-1}_{ii} = \frac{D_{ii}}{D_{ii}^2 + C}$ (the regularized counterpart of $D^{-1}$).

22 Regularization and singular values The formula above, apart from providing an efficient way to compute the LRR estimate, shows that regularization acts on the singular values of X. In fact, even if a singular value is 0, the corresponding diagonal element of $D^2 + CI$ is C, so the inversion remains well defined. More generally, LRR with C > 0 can be shown to be strongly convex. The following quantity is called the effective degrees of freedom of the estimator: $\mathrm{df}(C) = \sum_{i=1}^{d}\frac{D_{ii}^2}{D_{ii}^2 + C}$. Even if we are adapting d parameters, their adaptation is constrained by the value of C, so that the effective number of free parameters of the model (and hence its complexity) is lower.
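
To make the SVD analysis concrete, the sketch below computes the LRR solution and df(C) directly from the thin SVD of X, following the formulas above; the data is random and only serves to show the shapes involved.

```python
import numpy as np

def ridge_via_svd(X, y, C):
    """Ridge solution w = V diag(D_ii / (D_ii^2 + C)) U^T y using the thin SVD."""
    U, D, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(D) V^T
    shrink = D / (D ** 2 + C)                          # regularized inverse singular values
    w = Vt.T @ (shrink * (U.T @ y))
    df = np.sum(D ** 2 / (D ** 2 + C))                 # effective degrees of freedom df(C)
    return w, df

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
y = rng.normal(size=40)
w, df = ridge_via_svd(X, y, C=0.1)
print("df(C) =", df)
```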

23 Effective degrees of freedom Figure 11: A plot of df(C) for the previous problem, with P = 8.

24 QR decomposition In practice, alternative decomposition strategies are used to solve the LRR problem. As an example, the matrix X (if not singular) can be factored as $X = QR$, where $Q \in \mathbb{R}^{N\times d}$ is semi-orthogonal, while $R \in \mathbb{R}^{d\times d}$ is upper triangular. This is called the QR decomposition of X.

25 Solving OLS with QR decomposition Substituting in the OLS solution we obtain: $w = \left(R^TQ^TQR\right)^{-1}R^TQ^Ty = \left(R^TR\right)^{-1}R^TQ^Ty = R^{-1}Q^Ty$. The overall system can be solved efficiently via back substitution. The most complex step is now the QR decomposition which, however, is much more stable than inverting the original matrix.
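
A possible implementation of the QR route: factor X once, then solve the triangular system $Rw = Q^Ty$ by back substitution (here via scipy.linalg.solve_triangular, an assumption of this sketch rather than something prescribed by the slides).

```python
import numpy as np
from scipy.linalg import solve_triangular

def ols_via_qr(X, y):
    """OLS solution w = R^{-1} Q^T y from the thin QR factorization X = QR."""
    Q, R = np.linalg.qr(X)               # Q: N x d (semi-orthogonal), R: d x d upper triangular
    return solve_triangular(R, Q.T @ y)  # back substitution, no explicit inverse

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)
w = ols_via_qr(X, y)
```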

26 Table of contents Linear models for regression Regularized least squares SVD analysis of ridge regression Online estimation of LRR Multiple outputs Hints on the geometry of least squares Linear expansion on random bases Linear models for classification Least squares classification Logistic regression Discriminant analysis

27 Online estimation In online learning, the training dataset is not available beforehand, but it is received one sample at a time in a streaming fashion. The objective is to create a sequence of estimates $f_0, f_1, \ldots, f_N$ converging to the solution of the regularized risk. Note the similarity between online learning and stochastic optimization. As a general way of tackling online learning problems, we can apply an SGD formulation using the current sample $(x_n, y_n)$ to compute an estimate of the gradient. For LRR, the result is the so-called least mean square (LMS) algorithm: $w_{n+1} = w_n + \alpha_n\left(y_n - w_n^Tx_n\right)x_n - \alpha_nCw_n$.

28 Sherman-Morrison formula We can obtain a more efficient formulation with some algebraic manipulation. In particular, denote by $X_n$ the matrix X restricted to the first n rows (and similarly for $y_n$). Defining for simplicity $P_n = X_n^TX_n + CI$, we have $P_{n+1} = P_n + x_{n+1}x_{n+1}^T$. The Sherman-Morrison formula allows us to compute the inverse $P_{n+1}^{-1}$ from the knowledge of $P_n^{-1}$: $P_{n+1}^{-1} = P_n^{-1} - \frac{P_n^{-1}x_{n+1}x_{n+1}^TP_n^{-1}}{1 + x_{n+1}^TP_n^{-1}x_{n+1}}$.

29 Recursive least squares We can write the complete LRR solution after n + 1 samples as: $w_{n+1} = P_{n+1}^{-1}\left(X_n^Ty_n + x_{n+1}y_{n+1}\right)$. Putting everything together we obtain: $w_{n+1} = w_n + P_{n+1}^{-1}x_{n+1}\left(y_{n+1} - w_n^Tx_{n+1}\right)$, which, combined with the previous update for $P_n^{-1}$, gives us a fully iterative solution, initialized with $P_0^{-1} = \frac{1}{C}I$. This is known as recursive least squares (RLS). Note that, differently from the SGD solution, $w_n$ is the exact solution we would obtain by solving the LRR problem with the data received up to now.
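
A sketch of the recursive least-squares recursion above: $P^{-1}$ starts at $\frac{1}{C}I$ and is refreshed with the Sherman-Morrison formula, while w is corrected by the prediction error on each incoming sample. The simulated stream and all constants are illustrative.

```python
import numpy as np

def rls_init(d, C):
    """Start from the LRR solution with no data: w = 0, P^{-1} = (1/C) I."""
    return np.zeros(d), np.eye(d) / C

def rls_update(w, Pinv, x, y):
    """One recursive least-squares step for a new pair (x, y)."""
    Px = Pinv @ x
    Pinv = Pinv - np.outer(Px, Px) / (1.0 + x @ Px)  # Sherman-Morrison update of P^{-1}
    w = w + Pinv @ x * (y - w @ x)                   # correct w by the prediction error
    return w, Pinv

# Simulated stream (for illustration only).
rng = np.random.default_rng(0)
d, w_true = 4, np.array([1.0, -2.0, 0.5, 3.0])
w, Pinv = rls_init(d, C=0.01)
for _ in range(200):
    x = rng.normal(size=d)
    y = w_true @ x + rng.normal(scale=0.1)
    w, Pinv = rls_update(w, Pinv, x, y)
print("estimated w:", np.round(w, 2))
```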

30 Table of contents Linear models for regression Regularized least squares SVD analysis of ridge regression Online estimation of LRR Multiple outputs Hints on the geometry of least squares Linear expansion on random bases Linear models for classification Least squares classification Logistic regression Discriminant analysis

31 Handling multiple outputs If we wish to estimate a vector $y \in \mathbb{R}^c$ in output, the linear model can be written as: $f(x) = W^Tx$, where we need to adapt a matrix $W \in \mathbb{R}^{d\times c}$. The objective function can now be written as: $J(W) = \frac{1}{2N}\sum_{k=1}^{c}\sum_{i=1}^{N}\left(y_{ik} - f_k(x_i)\right)^2 + \frac{C}{2}\sum_{i=1}^{d}\sum_{j=1}^{c}W_{ij}^2$.

32 Handling multiple outputs (2) After some manipulations, the objective can be rewritten as: $J(W) = \frac{1}{2N}\,\mathrm{Tr}\left\{\left(Y - XW\right)^T\left(Y - XW\right)\right\} + \frac{C}{2}\left\lVert\mathrm{vect}(W)\right\rVert_2^2$, where Y is constructed by stacking row-wise the $y_i$'s; $\mathrm{Tr}\{\cdot\}$ is the trace operator; and $\mathrm{vect}(\cdot)$ transforms the matrix into a vector. Computing the gradient and equating it to 0 shows that the solution is unchanged: $W = \left(X^TX + CI\right)^{-1}X^TY$. In particular, the estimates for each output are independent of each other.

33 Table of contents Linear models for regression Regularized least squares SVD analysis of ridge regression Online estimation of LRR Multiple outputs Hints on the geometry of least squares Linear expansion on random bases Linear models for classification Least squares classification Logistic regression Discriminant analysis

34 An alternative formulation of LRR We can write the predictions of LRR as: $\hat{y} = Xw = \left(X\left(X^TX + CI\right)^{-1}X^T\right)y = My$. The LRR model can be seen as performing a linear combination of the training points in order to predict a value, weighting them via the corresponding rows of the hat matrix M. The errors of the model have a similar interpretation: $y = \hat{y} + (y - \hat{y}) = My + (I - M)y$.

35 Interpreting the geometry The desired vector y defines a point in some high-dimensional space $\mathbb{R}^N$. Consider the set of all possible linear models: $\hat{y} = Xw = \sum_{i=1}^{d}w_iX_{:i}$, where $X_{:i}$ is the ith column of X. Technically, the columns of X span a linear subspace (of dimension d if $N \ge d$ and the matrix has full rank) of the larger space $\mathbb{R}^N$. We can think of M as the projection matrix that orthogonally projects y onto the closest possible point of this linear subspace.

36 A concrete example Consider a simple example with d = 1 and N = 2, where X is a single two-dimensional column vector and y is a point in $\mathbb{R}^2$. In this case, X spans a line in $\mathbb{R}^2$ formed by all multiples of X itself. By working out the OLS solution, we find the closest point $\hat{y}$ on this line with respect to y.

37 A concrete example (2) Figure 12: The red point is y, the blue line is the space spanned by X. Multiplying by M projects y onto the closest point.

38 Table of contents Linear models for regression Regularized least squares SVD analysis of ridge regression Online estimation of LRR Multiple outputs Hints on the geometry of least squares Linear expansion on random bases Linear models for classification Least squares classification Logistic regression Discriminant analysis

39 Random bases The polynomial expansion becomes infeasible for more than a few dimensions, due to the exponential number of terms that need to be computed. Can we find some alternative expansion $h(\cdot)$ with good approximation properties? One interesting idea is to combine h nonlinear transformations, whose parameters are extracted randomly from some distribution. As an example, consider the following nonlinear transformation: $h_i(x) = \exp\left\{a_i^Tx - b_i\right\}$, with $a_i, b_i$ taken from some uniform distribution in $[-\lambda, \lambda]$.

40 Random vector functional-link networks The previous model is known as a random vector functional-link (RVFL) network. Crucially, it can be shown that the model can approximate any continuous nonlinear function (on a compact set), provided h is large enough. Thanks to the closed-form solution, this model can be trained extremely fast and performs reasonably well in many situations. It provides a good baseline solution for many scenarios, including low-power devices. Clearly, the number of hidden bases might grow too large in some problems. Additionally, improper regularization results in an extremely large variance of the model. [1] Pao, Y.H., Park, G.H. and Sobajic, D.J., Learning and generalization characteristics of the random vector functional-link net. Neurocomputing, 6(2).
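
A minimal RVFL-style sketch under the assumptions of the previous slides: random weights $a_i$ and biases $b_i$ drawn uniformly in $[-\lambda, \lambda]$, the exponential transformation as nonlinearity, and a ridge read-out on the expanded features. The constants ($\lambda$, h, C) and the toy dataset are illustrative choices, not values from the lecture.

```python
import numpy as np

def rvfl_features(X, A, b):
    """Random nonlinear expansion h_i(x) = exp(a_i^T x - b_i) for every row of X."""
    return np.exp(X @ A - b)

def ridge_fit(H, y, C):
    """Closed-form ridge read-out (same formula as before)."""
    return np.linalg.solve(H.T @ H + C * np.eye(H.shape[1]), H.T @ y)

rng = np.random.default_rng(0)
N, d, h, lam, C = 200, 1, 500, 1.0, 0.001

X = rng.uniform(-3, 3, size=(N, d))
y = 2 * np.sin(X[:, 0]) - 0.4 * X[:, 0] + rng.normal(scale=0.3, size=N)

A = rng.uniform(-lam, lam, size=(d, h))   # random input weights, never trained
b = rng.uniform(-lam, lam, size=h)        # random biases
H = rvfl_features(X, A, b)
w = ridge_fit(H, y, C)
y_hat = rvfl_features(X, A, b) @ w
print("training MSE:", np.mean((y - y_hat) ** 2))
```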

41 Example of RVFL application Figure 13: An example of an RVFL model, with h = 500.

42 Table of contents Linear models for regression Regularized least squares SVD analysis of ridge regression Online estimation of LRR Multiple outputs Hints on the geometry of least squares Linear expansion on random bases Linear models for classification Least squares classification Logistic regression Discriminant analysis

43 Classification problems Let us now consider classification problems, i.e.: Binary classification: the desired output is a single value $y_i \in \{0, 1\}$, defining whether the corresponding input belongs or not to that class. Multi-class classification: the output can take one of M different values. One way to tackle this problem is to consider a dummy encoding of the output variable, fit a model on the target vector, and select the class according to the largest value in output. Alternatively, we may fit M different discriminant functions separately, one for each class.

44 Least squares classification If we apply an LRR model, we obtain a least squares classification algorithm. The class is obtained as: $\text{Class of } x = \arg\max_{k=1,\ldots,M}\left\{f_k(x)\right\}$. There are (at least) two possible drawbacks to this approach: 1. We cannot guarantee that the elements $f_k(x)$ represent valid probability distributions with respect to the classes, as they can be negative or larger than 1. 2. More importantly, for M > 2, there is the possibility that one class is masked by the others, as shown later in a simple 3-class example.

45 An example of least squares classification Figure 14: With 3 perfectly separable classes, the decision boundaries found by least squares classification are correct.

46 An example of masking Figure 15: In this case, even if the classes are linearly separable, one of the three is entirely masked by the others.

47 Table of contents Linear models for regression Regularized least squares SVD analysis of ridge regression Online estimation of LRR Multiple outputs Hints on the geometry of least squares Linear expansion on random bases Linear models for classification Least squares classification Logistic regression Discriminant analysis

48 Squashing functions Focusing on the binary classification case, one way to obtain valid probabilities from the linear model is to squash all possible values $w^Tx$ into the interval [0, 1] using some scalar function $\sigma(\cdot)$. One example is the sigmoid (or logistic) function, defined as: $\sigma(s) = \frac{1}{1 + \exp\{-s\}}$. This is an example of a squashing function, i.e. a function such that $\lim_{s\to+\infty}\sigma(s) = 1$ and $\lim_{s\to-\infty}\sigma(s) = 0$, and which is monotonically increasing.

49 Visualizing the sigmoid Figure 16: A plot of the sigmoid function.

50 Inverse link function Rearranging the model, we see that it is still linear with respect to a nonlinear transformation of the output: $\log\left\{\frac{f(x)}{1 - f(x)}\right\} = w^Tx$. If we interpret f(x) as the probability $p(y = 1 \mid x)$, we are constructing a linear model of the logarithm of the odds (i.e., the ratio of the success probability to the failure probability). The nonlinear transformation above is called the logit function in statistics, and it is an example of a link function (linking probabilities with outputs). Since the sigmoid is its inverse, it is sometimes called an inverse link function.

51 Selecting a loss function It is possible to train the logistic regression model using a squared error, but if we interpret its values as probabilities, we can think of designing more appropriate loss functions. Let us consider the very natural 0/1 loss: $L(y, \hat{y}) = \begin{cases}0 & \text{if } y = \mathrm{rnd}(\hat{y}),\\ 1 & \text{otherwise,}\end{cases}$ where $\mathrm{rnd}(\cdot)$ rounds the value. This loss cannot be optimized easily because it is not differentiable. We need to resort to an approximation of the 0/1 loss, generally called a surrogate loss function.

52 Logistic loss function Given two probability distributions $q_1(\cdot), q_2(\cdot)$ taking values in a discrete space D, their similarity can be computed by the cross-entropy function as: $H(q_1, q_2) = -\sum_{d\in D}q_1(d)\log\left(q_2(d)\right)$. Applying this to our context, we derive the cross-entropy loss: $L(y, \hat{y}) = -y\log\left(\hat{y}\right) - (1 - y)\log\left(1 - \hat{y}\right)$. The cross-entropy loss can in principle be used for any classification model. In the case of logistic regression, it is sometimes called the logistic loss.

53 Bernoulli distribution We can justify the cross-entropy loss from a probabilistic point of view. Remember that f(x) is the probability of observing 1 given x, while 1 − f(x) is the probability of observing 0. Mathematically, the output follows a Bernoulli distribution: $p(y \mid f(x)) = f(x)^y\left(1 - f(x)\right)^{1-y}$. Since the elements of the training set are independent, the probability of having observed the entire dataset is the product of the probabilities: $P(w) = \prod_{i=1}^{N}p\left(y_i \mid f(x_i)\right)$. The previous quantity is called the likelihood function.

54 Maximum likelihood A low value of P(w) means that the probability of having observed the training set S is small. Finding a set of parameters w maximizing this criterion is called the maximum likelihood (ML) approach: $w^* = \arg\max_{w\in\mathbb{R}^d}\left\{\prod_{i=1}^{N}p\left(y_i \mid f(x_i)\right)\right\}$. This is the first example we encounter of an approach explicitly modeling the conditional probabilities. Taking the logarithm and exchanging maximization with minimization: $J(w) = -\sum_{i=1}^{N}\log\left(p\left(y_i \mid f(x_i)\right)\right)$.

55 Logistic regression algorithm Expanding each term in the logarithm, we obtain the logistic regression cost function: $J(w) = -\sum_{i=1}^{N}\left[y_i\log\left(f(x_i)\right) + (1 - y_i)\log\left(1 - f(x_i)\right)\right]$, which is equivalent to minimizing the cross-entropy loss on our training dataset. We can easily construct a regularized version of logistic regression by adding an $\ell_2$ penalty on w to the previous cost function. Its probabilistic derivation is slightly more involved and is left for a future lecture.

56 Solving the logistic regression problem After some manipulations, we express the objective function as: $J(w) = -\sum_{i=1}^{N}\left[y_iw^Tx_i - \log\left(1 + e^{w^Tx_i}\right)\right]$. In the simplest case, we can solve this using a gradient descent algorithm, where (using the chain rule): $\nabla J(w) = -\sum_{i=1}^{N}\left[y_ix_i - \frac{e^{w^Tx_i}}{1 + e^{w^Tx_i}}x_i\right] = -\sum_{i=1}^{N}\left(y_i - f(x_i)\right)x_i$, and we made use of the fact that $\frac{1}{1 + e^{-s}} = \frac{e^s}{1 + e^s}$.
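
A gradient-descent sketch of the update above, using $\nabla J(w) = -\sum_i (y_i - f(x_i))x_i$; the learning rate, iteration count, and toy dataset are arbitrary.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def logreg_gd(X, y, lr=0.1, n_iter=500):
    """Minimize the logistic (cross-entropy) loss with plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)        # f(x_i) for every training point
        grad = -X.T @ (y - p)     # gradient of the negative log-likelihood
        w -= lr * grad / len(y)   # averaged step for a stable learning rate
    return w

# Toy two-class data (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(+1, 1, size=(50, 2))])
X = np.hstack([X, np.ones((100, 1))])            # append a bias column
y = np.concatenate([np.zeros(50), np.ones(50)])
w = logreg_gd(X, y)
print("training accuracy:", np.mean((sigmoid(X @ w) > 0.5) == y))
```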

57 Newton's method More commonly, logistic regression is solved with Newton's method, an optimization algorithm that makes use of Hessian information. The method works by constructing, at each iteration, a second-order approximation to J(w) around $w_n$ using a Taylor expansion: $J(w) \approx J(w_n) + \left(\nabla J(w_n)\right)^T\left(w - w_n\right) + \frac{1}{2}\left(w - w_n\right)^T\nabla^2J(w_n)\left(w - w_n\right)$. The estimate is updated according to the analytical minimum of this approximation: $w_{n+1} = w_n - \left(\nabla^2J(w_n)\right)^{-1}\nabla J(w_n)$.

58 Newton's method in logistic regression We can write gradient and Hessian as: $\nabla J(w) = -X^T(y - p)$ and $\nabla^2J(w) = X^TWX$, where p is the vector of predicted outputs, and W is a diagonal matrix with $W_{ii} = f(x_i)\left(1 - f(x_i)\right)$. Putting everything together: $w_{n+1} = w_n + \left(X^TWX\right)^{-1}X^T(y - p)$.
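
The Newton/IRLS update translates almost line by line into code; the sketch below assumes $X^TWX$ is invertible (a small ridge term could be added otherwise) and can be run on the same toy data as the previous sketch.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def logreg_newton(X, y, n_iter=15):
    """Newton's method (IRLS) for logistic regression."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        W = p * (1.0 - p)                          # diagonal weights of the Hessian
        H = X.T @ (W[:, None] * X)                 # X^T W X without building diag(W)
        w = w + np.linalg.solve(H, X.T @ (y - p))  # w_{n+1} = w_n + (X^T W X)^{-1} X^T (y - p)
    return w

# Usage: w = logreg_newton(X, y) on the toy data of the previous sketch.
```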

59 An example of logistic regression Figure 17: An example of application of logistic regression: (a) start, (b) 3 iterations, (c) 15 iterations.

60 Logistic regression for multi-class problems For multi-class problems with M classes, we use a dummy encoding for the output y, where $y_k = 1$ if the pattern is of class k, 0 otherwise. Consider a linear model $s = Wx$. The equivalent of the sigmoid function here is the softmax function: $f_k(s) = \frac{\exp\{s_k\}}{\sum_{l=1}^{M}\exp\{s_l\}}$. This ensures that the outputs $f_1, \ldots, f_M$ form a valid probability distribution over the classes. The cross-entropy loss trivially extends: $J(W) = -\sum_{i=1}^{N}\sum_{k=1}^{M}y_{ik}\log\left(f_k(x_i)\right)$.
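
For the multi-class case, a schematic sketch of the softmax and of gradient descent on the cross-entropy loss; the gradient step $W \leftarrow W + \alpha X^T(Y - F)$ follows from the same derivation as the binary case, with Y the one-hot target matrix. This is an illustration, not the lecture's reference implementation.

```python
import numpy as np

def softmax(S):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(Y, F, eps=1e-12):
    """Multi-class cross-entropy J(W), averaged over the dataset."""
    return -np.mean(np.sum(Y * np.log(F + eps), axis=1))

def softmax_regression_gd(X, Y, lr=0.5, n_iter=300):
    """Plain gradient descent on the softmax cross-entropy; W is d x M."""
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        F = softmax(X @ W)
        W += lr * X.T @ (Y - F) / len(Y)  # gradient step on the cross-entropy
    return W
```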

61 Table of contents Linear models for regression Regularized least squares SVD analysis of ridge regression Online estimation of LRR Multiple outputs Hints on the geometry of least squares Linear expansion on random bases Linear models for classification Least squares classification Logistic regression Discriminant analysis

62 Discriminant analysis Discriminant analysis (DA) is an alternative classification technique, which is very popular in statistics, also for historical reasons. In DA, we model several different probabilities: The prior probabilities $\pi_1, \ldots, \pi_M$ are the probabilities of observing a specific class, irrespective of x (i.e., the relative frequencies of the classes). For each class k, we model the probability $p(x \mid k)$ describing the distribution of points in that class. This is our first example of a generative model. Using Bayes' theorem, we can write the probability that x belongs to class k as: $p(k \mid x) = \frac{p(x \mid k)\pi_k}{\sum_{t=1}^{M}p(x \mid t)\pi_t}$.

63 Modeling the probabilities Each prior probability can be modeled easily as $\pi_k = \frac{N_k}{N}$, where $N_k$ is the number of points in the dataset belonging to class k. For simplicity, we can model each class-conditional probability as a Gaussian distribution: $p(x \mid k) = \frac{1}{(2\pi)^{d/2}\left|\Sigma_k\right|^{1/2}}\exp\left\{-\frac{1}{2}\left(x - \mu_k\right)^T\Sigma_k^{-1}\left(x - \mu_k\right)\right\}$, where $\mu_k$ is the mean, and $\Sigma_k$ the covariance of class k.

64 Estimating the probabilities Both the mean and the covariance can be estimated empirically from the data: $\mu_k = \frac{1}{N_k}\sum_{x\in S_k}x$ and $\Sigma_k = \frac{1}{N_k - 1}\sum_{x\in S_k}\left(x - \mu_k\right)\left(x - \mu_k\right)^T$, where $S_k$ is the set of input points belonging to class k. While this approach is extremely efficient, it requires us to estimate $1 + d + d^2$ parameters for each class, for a total of $M + Md(d + 1)$ parameters, as compared to the lower number of $Md$ parameters in a logistic regression model.

65 Quadratic discriminant analysis By Bayes' theorem, we can associate the following discriminant function to each class: $\delta_k(x) = -\frac{1}{2}\log\left|\Sigma_k\right| - \frac{1}{2}\left(x - \mu_k\right)^T\Sigma_k^{-1}\left(x - \mu_k\right) + \log\pi_k$. By choosing the class corresponding to the largest $\delta_k(\cdot)$, we obtain the quadratic discriminant analysis (QDA) algorithm. If we further suppose that all classes share the same covariance matrix $\Sigma_k = \Sigma,\ k = 1, \ldots, M$, we obtain linear discriminant analysis (LDA), requiring $d^2 + dM + M$ parameters: $\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k$.
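
A compact sketch of QDA as described above: estimate priors, means, and covariances per class, then classify with the largest discriminant $\delta_k(x)$; replacing the per-class covariances with a single pooled estimate would give LDA. Here X is an N × d array and y holds integer class labels; the helper names are illustrative, not from the lecture.

```python
import numpy as np

def qda_fit(X, y):
    """Estimate (pi_k, mu_k, Sigma_k) for every class k from labelled data."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = (len(Xk) / len(X),          # prior pi_k
                     Xk.mean(axis=0),           # mean mu_k
                     np.cov(Xk, rowvar=False))  # covariance Sigma_k (unbiased, 1/(N_k - 1))
    return params

def qda_predict(X, params):
    """Pick the class with the largest discriminant delta_k(x)."""
    scores = []
    for k, (pi, mu, Sigma) in params.items():
        diff = X - mu
        inv = np.linalg.inv(Sigma)
        _, logdet = np.linalg.slogdet(Sigma)
        d_k = -0.5 * logdet - 0.5 * np.sum(diff @ inv * diff, axis=1) + np.log(pi)
        scores.append(d_k)
    keys = list(params.keys())
    return np.array(keys)[np.argmax(np.vstack(scores), axis=0)]
```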

66 Naïve Bayes Naïve Bayes is a simplified form of discriminant analysis, where we assume that all features are conditionally independent given the class label, i.e.: $p(x \mid k) = \prod_{i=1}^{d}p(x_i \mid k)$. As an example, for continuous features we can model each $p(x_i \mid k)$ as a univariate Gaussian, for a total of $M + 2dM$ parameters. Naïve Bayes is particularly powerful whenever d is large and we have many categorical features.
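
For continuous features, Gaussian naïve Bayes reduces to the previous scheme with per-feature univariate Gaussians (i.e., diagonal covariances); a minimal sketch, with a small variance floor added as a common practical safeguard (an assumption of this sketch, not mentioned in the slides).

```python
import numpy as np

def gnb_fit(X, y, var_floor=1e-9):
    """Per class: prior, per-feature means and variances (diagonal Gaussian model)."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = (len(Xk) / len(X), Xk.mean(axis=0), Xk.var(axis=0) + var_floor)
    return params

def gnb_predict(X, params):
    """Classify by the largest log-posterior log pi_k + sum_i log N(x_i; mu_ki, var_ki)."""
    scores = []
    for k, (pi, mu, var) in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)
        scores.append(np.log(pi) + log_lik)
    keys = list(params.keys())
    return np.array(keys)[np.argmax(np.vstack(scores), axis=0)]
```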

67 Visualizing QDA Figure 18: Plot of the two Gaussian distributions for QDA in a simple binary classification problem.

68 Visualizing LDA Figure 19: Plot of the two Gaussian distributions for LDA in the same dataset.

69 Visualizing Naïve Bayes Figure 20: Plot of the two distributions for Naïve Bayes in the same dataset.

70 Further readings Most of the material in this lecture is rearranged from Chapters 3 and 4 of The Elements of Statistical Learning. An introductory article to linear expansions on random bases: [1] Scardapane, S. and Wang, D., Randomness in neural networks: An overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7(2). Two interesting articles on the Naïve Bayes algorithm: [2] Domingos, P. and Pazzani, M., On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3). [3] Lewis, D.D., Naive (Bayes) at forty: The independence assumption in information retrieval. In ECML (pp. 4-15). Springer Berlin Heidelberg.
