# Machine Learning (CS 567) Lecture 5


## Transcription

Slide 1: Machine Learning (CS 567) Lecture 5

- Time: T-Th 5:00pm - 6:20pm
- Location: GFS 118
- Instructor: Sofus A. Macskassy. Office: SAL 216. Office hours: by appointment
- Teaching assistant: Cheol Han. Office: SAL 229. Office hours: M 2-3pm, W
- Class web page:

Slide 2: Administrative: Regrading

This regrading policy applies to quizzes, homework, the midterm, and the final exam.

Grading complaint policy: if you have a problem with how an assignment was graded, talk to the instructor or TA about it within one week of getting the graded work back. After one week, no regrading will be done.

Even if you request regrading of only one answer, the whole assignment will be checked and graded again, so you take the risk of losing points overall.

Lecture 4 - Sofus A. Macskassy

Slide 3: Lecture 5 Outline

- Logistic Regression (from last week)
- Linear Discriminant Analysis
- Off-the-shelf Classifiers

Slide 4: Linear Threshold Units

$$h(x) = \begin{cases} +1 & \text{if } w_1 x_1 + \cdots + w_n x_n \ge w_0 \\ -1 & \text{otherwise} \end{cases}$$

We assume that each feature $x_j$ and each weight $w_j$ is a real number (we will relax this later).

We will study three different algorithms for learning linear threshold units:

- Perceptron: classifier
- Logistic Regression: conditional distribution
- Linear Discriminant Analysis: joint distribution
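
As a small illustration of the threshold rule on this slide, the sketch below evaluates a linear threshold unit on two points. The function name `ltu_predict` and the weight and threshold values are assumptions chosen for the example, not values from the lecture.

```python
import numpy as np

def ltu_predict(w, w0, x):
    """Return +1 if w1*x1 + ... + wn*xn >= w0, otherwise -1."""
    return 1 if np.dot(w, x) >= w0 else -1

# Hypothetical weights and threshold, for illustration only.
w = np.array([2.0, -1.0])
w0 = 0.5

pred_a = ltu_predict(w, w0, np.array([1.0, 1.0]))  # 2*1 - 1*1 = 1.0 >= 0.5
pred_b = ltu_predict(w, w0, np.array([0.0, 1.0]))  # 2*0 - 1*1 = -1.0 < 0.5
print(pred_a, pred_b)
```

All three algorithms named on the slide produce a weight vector and threshold of this form; they differ only in how the weights are learned.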

Slide 5: Linear Discriminant Analysis

Learn $P(x, y)$. This is sometimes called the generative approach, because we can think of $P(x, y)$ as a model of how the data is generated.

For example, if we factor the joint distribution into the form

$$P(x, y) = P(y)\, P(x \mid y)$$

we can think of first generating a value for $y$ according to $P(y)$, and then generating a value for $x$ according to $P(x \mid y)$, given the previously generated value of $y$.

This can be described as a Bayesian network: $y \rightarrow x$.

Slide 6: Linear Discriminant Analysis (2)

$P(y)$ is a discrete multinomial distribution. Example: $P(y = 0) = 0.31$, $P(y = 1) = 0.69$ will generate 31% negative examples and 69% positive examples.

For LDA, we assume that $P(x \mid y)$ is a multivariate normal distribution with mean $\mu_k$ and covariance matrix $\Sigma$:

$$P(x \mid y = k) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} [x - \mu_k]^T \Sigma^{-1} [x - \mu_k]\right)$$
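
The generative story (draw $y$ from $P(y)$, then draw $x$ from the Gaussian for that class) can be sketched as a sampler. The class priors are the ones from the slide; the class means, the shared covariance, and the name `sample_lda` are assumptions made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

priors = np.array([0.31, 0.69])              # P(y = 0), P(y = 1) from the slide
means = np.array([[0.0, 0.0], [2.0, 3.0]])   # hypothetical class means mu_0, mu_1
cov = np.array([[2.0, 0.5], [0.5, 1.0]])     # hypothetical shared covariance Sigma

def sample_lda(n):
    """Draw n pairs (x, y): first y ~ P(y), then x ~ N(mu_y, Sigma)."""
    y = rng.choice(2, size=n, p=priors)
    noise = rng.multivariate_normal(np.zeros(2), cov, size=n)
    x = means[y] + noise
    return x, y

X, y = sample_lda(10_000)
print("fraction of positive examples:", y.mean())
```

With enough samples, the fraction of positive examples should be close to 0.69, matching the prior.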

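Slide 7: Multivariate Normal Distributions: A tutorial

Recall that the univariate normal (Gaussian) distribution has the formula

$$p(x) = \frac{1}{(2\pi)^{1/2} \sigma} \exp\left[-\frac{1}{2} \frac{(x - \mu)^2}{\sigma^2}\right]$$

where $\mu$ is the mean and $\sigma^2$ is the variance.

[Figure: plot of the univariate Gaussian density; not included in this transcription.]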

Slide 8: The Multivariate Gaussian

A 2-dimensional Gaussian is defined by a mean vector $\mu = (\mu_1, \mu_2)$ and a covariance matrix

$$\Sigma = \begin{bmatrix} \sigma^2_{1,1} & \sigma^2_{1,2} \\ \sigma^2_{1,2} & \sigma^2_{2,2} \end{bmatrix}$$

where $\sigma^2_{i,j} = E[(x_i - \mu_i)(x_j - \mu_j)]$ is the variance (if $i = j$) or covariance (if $i \ne j$).

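Slide 9: The Multivariate Gaussian (2)

If $\Sigma$ is the identity matrix

$$\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

and $\mu = (0, 0)$, we get the standard normal distribution.

[Figure: plot of the standard 2-dimensional normal density; not included in this transcription.]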

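Slide 10: The Multivariate Gaussian (3)

If $\Sigma$ is a diagonal matrix, then $x_1$ and $x_2$ are independent random variables, and lines of equal probability are ellipses parallel to the coordinate axes. For example, when

$$\Sigma = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} \quad \text{and} \quad \mu = (2, 3)$$

we obtain an axis-aligned density.

[Figure: plot of this density; not included in this transcription.]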

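Slide 11: The Multivariate Gaussian (4)

Finally, if $\Sigma$ is an arbitrary covariance matrix with nonzero off-diagonal entries, then $x_1$ and $x_2$ are dependent, and lines of equal probability are ellipses tilted relative to the coordinate axes. For example, when

$$\Sigma = \begin{bmatrix} 2 & 0.5 \\ 0.5 & 1 \end{bmatrix} \quad \text{and} \quad \mu = (2, 3)$$

we obtain a tilted density.

[Figure: plot of this density; not included in this transcription.]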
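
To make the effect of the covariance matrix concrete, here is a minimal sketch that evaluates the multivariate normal density for the identity, diagonal, and non-diagonal covariance matrices used on these slides. The helper name `mvn_pdf` and the test points are assumptions for illustration.

```python
import numpy as np

def mvn_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x, following the formula on the slides."""
    n = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(sigma, diff)   # [x - mu]^T Sigma^{-1} [x - mu]
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / norm

identity = np.eye(2)
diagonal = np.array([[2.0, 0.0], [0.0, 1.0]])
tilted = np.array([[2.0, 0.5], [0.5, 1.0]])
mu = np.array([2.0, 3.0])

# Standard normal at the origin: 1 / (2*pi).
p_std = mvn_pdf(np.zeros(2), np.zeros(2), identity)

# Two points mirrored across the x1 axis through the mean.
up = mu + np.array([1.0, 1.0])
down = mu + np.array([1.0, -1.0])

# Diagonal Sigma: the ellipses are axis-aligned, so mirrored points have equal density.
p_diag_up, p_diag_down = mvn_pdf(up, mu, diagonal), mvn_pdf(down, mu, diagonal)

# Non-diagonal Sigma: the ellipses are tilted, so the densities differ.
p_tilt_up, p_tilt_down = mvn_pdf(up, mu, tilted), mvn_pdf(down, mu, tilted)
print(p_std, p_diag_up, p_diag_down, p_tilt_up, p_tilt_down)
```

Because the off-diagonal entry 0.5 is positive, the point displaced in the same direction along both axes gets the higher density under the tilted covariance.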

Slide 12: Estimating a Multivariate Gaussian

Given a set of $N$ data points $\{x_1, \ldots, x_N\}$, we can compute the maximum likelihood estimate for the multivariate Gaussian distribution as follows:

$$\hat{\mu} = \frac{1}{N} \sum_i x_i$$

$$\hat{\Sigma} = \frac{1}{N} \sum_i (x_i - \hat{\mu}) (x_i - \hat{\mu})^T$$

Note that the product in the second equation is an outer product. The outer product of two vectors is a matrix:

$$x y^T = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \begin{bmatrix} y_1 & y_2 & y_3 \end{bmatrix} = \begin{bmatrix} x_1 y_1 & x_1 y_2 & x_1 y_3 \\ x_2 y_1 & x_2 y_2 & x_2 y_3 \\ x_3 y_1 & x_3 y_2 & x_3 y_3 \end{bmatrix}$$

For comparison, the usual dot product is written as $x^T y$.
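
The two estimators can be sketched in NumPy as follows. The function name `fit_gaussian_mle` and the synthetic data are assumptions for the example; the explicit loop over outer products is included only to show that the matrix form computes the same sum.

```python
import numpy as np

def fit_gaussian_mle(X):
    """Maximum likelihood mean and covariance for the rows of X (shape N x n)."""
    N = X.shape[0]
    mu_hat = X.mean(axis=0)
    centered = X - mu_hat
    # centered.T @ centered equals the sum of outer products (x_i - mu)(x_i - mu)^T.
    sigma_hat = centered.T @ centered / N
    return mu_hat, sigma_hat

rng = np.random.default_rng(0)
X = rng.multivariate_normal([2.0, 3.0], [[2.0, 0.5], [0.5, 1.0]], size=500)

mu_hat, sigma_hat = fit_gaussian_mle(X)

# Same estimate written as an explicit sum of outer products.
sigma_loop = sum(np.outer(x - mu_hat, x - mu_hat) for x in X) / len(X)
print(mu_hat)
print(sigma_hat)
```

Note the division by $N$ rather than $N - 1$: this is the maximum likelihood estimate, which corresponds to `np.cov(X, rowvar=False, bias=True)`.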

Slide 13: The LDA Model

Linear discriminant analysis assumes that the joint distribution has the form

$$P(x, y) = P(y)\, \underbrace{\frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} [x - \mu_y]^T \Sigma^{-1} [x - \mu_y]\right)}_{P(x \mid y)}$$

where each $\mu_y$ is the mean of a multivariate Gaussian for examples belonging to class $y$ and $\Sigma$ is a single covariance matrix shared by all classes.

Slide 14: Fitting the LDA Model

It is easy to learn the LDA model in two passes through the data. Let $\hat{\pi}_k$ be our estimate of $P(y = k)$, and let $N_k$ be the number of training examples belonging to class $k$.

$$\hat{\pi}_k = \frac{N_k}{N}$$

$$\hat{\mu}_k = \frac{1}{N_k} \sum_{\{i : y_i = k\}} x_i$$

$$\hat{\Sigma} = \frac{1}{N} \sum_i (x_i - \hat{\mu}_{y_i}) (x_i - \hat{\mu}_{y_i})^T$$

Note that the mean of the corresponding class, $\hat{\mu}_{y_i}$, is subtracted from each $x_i$ prior to taking the outer product. This gives us the pooled estimate of $\Sigma$.
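
The two passes can be sketched as below: the first pass computes the class priors and class means, the second the pooled covariance. The function name `fit_lda`, the assumption that labels are integers `0..K-1` with every class non-empty, and the toy data are all choices made for this example.

```python
import numpy as np

def fit_lda(X, y, num_classes):
    """Estimate class priors, class means, and the pooled covariance."""
    N, n = X.shape
    priors = np.zeros(num_classes)
    means = np.zeros((num_classes, n))

    # Pass 1: pi_k = N_k / N and mu_k = mean of the examples in class k.
    for k in range(num_classes):
        members = X[y == k]
        priors[k] = members.shape[0] / N
        means[k] = members.mean(axis=0)

    # Pass 2: subtract each example's own class mean, then pool the outer products.
    centered = X - means[y]
    sigma = centered.T @ centered / N
    return priors, means, sigma

# Toy data: two examples per class.
X = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 5.0], [6.0, 3.0]])
y = np.array([0, 0, 1, 1])

priors, means, sigma = fit_lda(X, y, num_classes=2)
print(priors)   # class priors
print(means)    # one row per class
print(sigma)    # pooled covariance
```

For this toy data the class means are (1, 1) and (5, 4), and the centered points are (-1, -1), (1, 1), (-1, 1), (1, -1), so the pooled covariance is the identity matrix.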

Slide 15: LDA learns an LTU

Consider the 2-class case with a 0/1 loss function. Recall that

$$P(y = 0 \mid x) = \frac{P(x, y = 0)}{P(x, y = 0) + P(x, y = 1)}$$

$$P(y = 1 \mid x) = \frac{P(x, y = 1)}{P(x, y = 0) + P(x, y = 1)}$$

Also recall from our derivation of the Logistic Regression classifier that we should classify into class $\hat{y} = 1$ if

$$\log \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} > 0$$

Hence, for LDA, we should classify into $\hat{y} = 1$ if

$$\log \frac{P(x, y = 1)}{P(x, y = 0)} > 0$$

because the denominators cancel.

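Slide 16: LDA learns an LTU (2)

$$P(x, y) = P(y) \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} [x - \mu_y]^T \Sigma^{-1} [x - \mu_y]\right)$$

$$\frac{P(x, y = 1)}{P(x, y = 0)} = \frac{P(y = 1) \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} [x - \mu_1]^T \Sigma^{-1} [x - \mu_1]\right)}{P(y = 0) \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} [x - \mu_0]^T \Sigma^{-1} [x - \mu_0]\right)}$$

$$\frac{P(x, y = 1)}{P(x, y = 0)} = \frac{P(y = 1) \exp\left(-\frac{1}{2} [x - \mu_1]^T \Sigma^{-1} [x - \mu_1]\right)}{P(y = 0) \exp\left(-\frac{1}{2} [x - \mu_0]^T \Sigma^{-1} [x - \mu_0]\right)}$$

$$\log \frac{P(x, y = 1)}{P(x, y = 0)} = \log \frac{P(y = 1)}{P(y = 0)} - \frac{1}{2} \left( [x - \mu_1]^T \Sigma^{-1} [x - \mu_1] - [x - \mu_0]^T \Sigma^{-1} [x - \mu_0] \right)$$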

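Slide 17: LDA learns an LTU (3)

Let's focus on the term in brackets:

$$[x - \mu_1]^T \Sigma^{-1} [x - \mu_1] - [x - \mu_0]^T \Sigma^{-1} [x - \mu_0]$$

Expand the quadratic forms as follows:

$$[x - \mu_1]^T \Sigma^{-1} [x - \mu_1] = x^T \Sigma^{-1} x - x^T \Sigma^{-1} \mu_1 - \mu_1^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1$$

$$[x - \mu_0]^T \Sigma^{-1} [x - \mu_0] = x^T \Sigma^{-1} x - x^T \Sigma^{-1} \mu_0 - \mu_0^T \Sigma^{-1} x + \mu_0^T \Sigma^{-1} \mu_0$$

Subtract the lower line from the upper line and collect similar terms. Note that the quadratic terms cancel! This leaves only terms that are linear in $x$ or constant:

$$x^T \Sigma^{-1} (\mu_0 - \mu_1) + (\mu_0 - \mu_1)^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0$$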

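Slide 18: LDA learns an LTU (4)

$$x^T \Sigma^{-1} (\mu_0 - \mu_1) + (\mu_0 - \mu_1)^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0$$

Note that since $\Sigma^{-1}$ is symmetric, $a^T \Sigma^{-1} b = b^T \Sigma^{-1} a$ for any two vectors $a$ and $b$. Hence, the first two terms can be combined to give

$$2 x^T \Sigma^{-1} (\mu_0 - \mu_1) + \mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0.$$

Now plug this back in:

$$\log \frac{P(x, y = 1)}{P(x, y = 0)} = \log \frac{P(y = 1)}{P(y = 0)} - \frac{1}{2} \left[ 2 x^T \Sigma^{-1} (\mu_0 - \mu_1) + \mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0 \right]$$

$$\log \frac{P(x, y = 1)}{P(x, y = 0)} = \log \frac{P(y = 1)}{P(y = 0)} + x^T \Sigma^{-1} (\mu_1 - \mu_0) - \frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_0^T \Sigma^{-1} \mu_0$$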

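Slide 19: LDA learns an LTU (5)

$$\log \frac{P(x, y = 1)}{P(x, y = 0)} = \log \frac{P(y = 1)}{P(y = 0)} + x^T \Sigma^{-1} (\mu_1 - \mu_0) - \frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_0^T \Sigma^{-1} \mu_0$$

Let

$$w = \Sigma^{-1} (\mu_1 - \mu_0)$$

$$c = \log \frac{P(y = 1)}{P(y = 0)} - \frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_0^T \Sigma^{-1} \mu_0$$

Then we will classify into class $\hat{y} = 1$ if

$$w \cdot x + c > 0.$$

This is an LTU.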
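
As a numerical check of this derivation, the sketch below computes $w$ and $c$ from a set of LDA parameters and confirms that $w \cdot x + c$ equals the log ratio of the two joint probabilities evaluated directly. The parameter values, the test point, and the names `lda_to_ltu` and `log_joint` are assumptions for illustration.

```python
import numpy as np

def lda_to_ltu(priors, means, sigma):
    """Return (w, c) such that the LDA rule is: predict 1 if w . x + c > 0."""
    sigma_inv = np.linalg.inv(sigma)
    mu0, mu1 = means
    w = sigma_inv @ (mu1 - mu0)
    c = (np.log(priors[1] / priors[0])
         - 0.5 * mu1 @ sigma_inv @ mu1
         + 0.5 * mu0 @ sigma_inv @ mu0)
    return w, c

def log_joint(x, prior, mu, sigma):
    """log P(x, y) = log P(y) + log N(x; mu, sigma), computed directly."""
    n = mu.shape[0]
    diff = x - mu
    return (np.log(prior)
            - 0.5 * n * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(sigma))
            - 0.5 * diff @ np.linalg.solve(sigma, diff))

# Hypothetical parameters (priors taken from the earlier example slide).
priors = np.array([0.31, 0.69])
means = np.array([[0.0, 0.0], [2.0, 3.0]])
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])

w, c = lda_to_ltu(priors, means, sigma)
x = np.array([1.0, 0.5])

log_ratio = (log_joint(x, priors[1], means[1], sigma)
             - log_joint(x, priors[0], means[0], sigma))
linear_score = w @ x + c
print(log_ratio, linear_score)
```

The two printed values agree up to floating-point error, because the normalizing constants and the quadratic term $x^T \Sigma^{-1} x$ cancel exactly when the covariance is shared.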

Slide 20: Two Geometric Views of LDA

View 1: Mahalanobis Distance

The quantity

$$D_M(x, u)^2 = (x - u)^T \Sigma^{-1} (x - u)$$

is known as the (squared) Mahalanobis distance between $x$ and $u$. We can think of the matrix $\Sigma^{-1}$ as a linear distortion of the coordinate system that converts the standard Euclidean distance into the Mahalanobis distance.

Note that, up to an additive constant that does not depend on $k$,

$$\log P(x, y = k) = \log \pi_k - \frac{1}{2} [(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)] + \text{const}$$

$$\log P(x, y = k) = \log \pi_k - \frac{1}{2} D_M(x, \mu_k)^2 + \text{const}$$

Therefore, we can view LDA as computing $D_M(x, \mu_0)^2$ and $D_M(x, \mu_1)^2$ and then classifying $x$ according to which mean, $\mu_0$ or $\mu_1$, is closest in Mahalanobis distance (corrected by $\log \pi_k$).
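
The Mahalanobis view can be sketched as a nearest-mean classifier with a prior correction. In the example below, the covariance, means, and query point are assumptions chosen so that the Euclidean and Mahalanobis answers disagree; the function names are likewise illustrative.

```python
import numpy as np

def mahalanobis_sq(x, u, sigma):
    """Squared Mahalanobis distance (x - u)^T Sigma^{-1} (x - u)."""
    diff = x - u
    return diff @ np.linalg.solve(sigma, diff)

def classify_by_mahalanobis(x, priors, means, sigma):
    """Pick the class k maximizing log(pi_k) - 0.5 * D_M(x, mu_k)^2."""
    scores = [np.log(p) - 0.5 * mahalanobis_sq(x, mu, sigma)
              for p, mu in zip(priors, means)]
    return int(np.argmax(scores))

# Hypothetical setup: large variance along x1, small variance along x2.
sigma = np.array([[9.0, 0.0], [0.0, 1.0]])
means = np.array([[0.0, 0.0], [3.0, 2.0]])
priors = np.array([0.5, 0.5])
x = np.array([3.0, 0.5])

euclidean_nearest = int(np.argmin(np.linalg.norm(means - x, axis=1)))
mahalanobis_nearest = classify_by_mahalanobis(x, priors, means, sigma)
print(euclidean_nearest, mahalanobis_nearest)
```

Here the point is closer to $\mu_1$ in Euclidean distance, but a displacement of 3 along the high-variance axis counts for less than a displacement of 1.5 along the low-variance axis, so the Mahalanobis rule assigns it to class 0.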

Slide 21: Mahalanobis Distance

Which point is closer to the vertical line?

[Figure not included in this transcription.]

A weak point of the Mahalanobis distance: if only a few points have been observed, the estimated variance is not meaningful.

Slide 22: View 2: Most Informative Low-Dimensional Projection

LDA can also be viewed as finding a hyperplane of dimension $K - 1$ such that $x$ and the $\{\mu_k\}$ are projected down into this hyperplane, and then $x$ is classified to the nearest $\mu_k$ using Euclidean distance inside this hyperplane.

Slide 23: Nonlinear boundaries with LDA

[Figure, two panels; not included in this transcription.]

- Left: Linear decision boundary found by LDA.
- Right: Quadratic decision boundary found by LDA. This was obtained by finding linear decision boundaries in the five-dimensional space $X_1, X_2, X_1 X_2, X_1^2, X_2^2$. Linear inequalities in this space are quadratic inequalities in the original space.
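
The feature expansion described on this slide can be sketched as follows. The function name `quadratic_features`, the weight vector, and the threshold are assumptions for illustration; the weights are set by hand here rather than learned by LDA, purely to show that a linear rule in the expanded space is a quadratic rule in the original space.

```python
import numpy as np

def quadratic_features(X):
    """Map rows (x1, x2) to (x1, x2, x1*x2, x1^2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

X = np.array([[0.5, 0.5], [1.0, 1.0]])
Z = quadratic_features(X)

# Hand-picked linear rule in the expanded space: x1^2 + x2^2 >= 1.
# In the original space this is the outside of the unit circle.
w = np.array([0.0, 0.0, 0.0, 1.0, 1.0])
w0 = 1.0
scores = Z @ w
labels = np.where(scores >= w0, 1, -1)
print(Z)
print(labels)
```

Any linear classifier, including LDA, can be fit on `Z` instead of `X` without other changes.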

Slide 24: Generalizations of LDA

- General Gaussian Classifier. Instead of assuming that all classes share the same $\Sigma$, we can allow each class $k$ to have its own $\Sigma_k$. In this case, the resulting classifier will be a quadratic threshold unit (instead of an LTU).
- Naïve Gaussian Classifier. Allow each class to have its own $\Sigma_k$, but require that each $\Sigma_k$ be diagonal. This means that within each class, any pair of features $x_{j_1}$ and $x_{j_2}$ will be assumed to be statistically independent. The resulting classifier is still a quadratic threshold unit (but with a restricted form).

Slide 25: Quadratic Discriminant Analysis (QDA)

Quadratic Discriminant Analysis (QDA) is a generalization of LDA where different classes are allowed to have different covariance matrices, i.e., $\Sigma_k \ne \Sigma_l$ for classes $k \ne l$.

$$P(x, y = k) = P(y = k) \frac{1}{(2\pi)^{n/2} |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2} [x - \mu_k]^T \Sigma_k^{-1} [x - \mu_k]\right)$$

$$\hat{\Sigma}_k = \frac{1}{N_k} \sum_{\{i : y_i = k\}} (x_i - \hat{\mu}_k) (x_i - \hat{\mu}_k)^T$$

We can solve for

$$\log \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} > 0$$

in a similar way as before. Note, however, that the determinants $|\Sigma_0|$ and $|\Sigma_1|$ do not cancel.
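
A minimal sketch of the QDA decision rule, assuming the per-class parameters are already known. The names `qda_log_joint` and `qda_predict` and the parameter values are assumptions for illustration. The factor $(2\pi)^{n/2}$ is dropped because it is the same for every class, but the log-determinant term is kept because it differs between classes.

```python
import numpy as np

def qda_log_joint(x, prior, mu, sigma_k):
    """log P(x, y = k) up to the class-independent constant -(n/2) log(2*pi)."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma_k)
    return np.log(prior) - 0.5 * logdet - 0.5 * diff @ np.linalg.solve(sigma_k, diff)

def qda_predict(x, priors, means, sigmas):
    scores = [qda_log_joint(x, p, mu, s) for p, mu, s in zip(priors, means, sigmas)]
    return int(np.argmax(scores))

# Hypothetical setup: same mean and prior, but class 1 is much more spread out.
priors = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [0.0, 0.0]])
sigmas = np.array([np.eye(2), 4.0 * np.eye(2)])

near = qda_predict(np.array([0.5, 0.0]), priors, means, sigmas)
far = qda_predict(np.array([3.0, 0.0]), priors, means, sigmas)
print(near, far)
```

With identical means and priors, LDA's $w$ would be zero and it could not separate these classes at all; QDA assigns points near the origin to the tight class and points far away to the spread-out class, which is a quadratic (here circular) boundary.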

Slide 26: LDA vs. QDA

[Figure, two panels; not included in this transcription.]

- Left: Quadratic fit by LDA in transformed space.
- Right: Quadratic fit by QDA.

The differences in the fit are very small.


Slide 28: Different Discriminant Functions

[Figure not included in this transcription.]

Slide 29: Summary of Linear Discriminant Analysis

- Learns the joint probability distribution $P(x, y)$.
- Direct Computation. The maximum likelihood estimate of $P(x, y)$ can be computed from the data without search. However, inverting the matrix $\Sigma$ requires $O(n^3)$ time.
- Eager. The classifier is constructed from the training examples. The examples can then be discarded.
- Batch. Only a batch algorithm is available. An online algorithm could be constructed if there is an online algorithm for incrementally updating $\Sigma^{-1}$. [This is easy for the case where $\Sigma$ is diagonal.]

Slide 30: Comparing Perceptron, Logistic Regression, and LDA

How should we choose among these three algorithms? There is a big debate within the machine learning community!

Slide 31: Issues in the Debate

- Statistical Efficiency. If the generative model $P(x, y)$ is correct, then LDA usually gives the highest accuracy, particularly when the amount of training data is small. If the model is correct, LDA requires 30% less data than Logistic Regression in theory.
- Computational Efficiency. Generative models typically are the easiest to learn. In our example, LDA can be computed directly from the data without using gradient descent.

Slide 32: Issues in the Debate

- Robustness to changing loss functions. Both generative and conditional probability models allow the loss function to be changed at run time without re-learning. Perceptron requires re-training the classifier when the loss function changes.
- Robustness to model assumptions. The generative model usually performs poorly when the assumptions are violated. For example, if $P(x \mid y)$ is very non-Gaussian, then LDA won't work well. Logistic Regression is more robust to model assumptions, and Perceptron is even more robust.
- Robustness to missing values and noise. In many applications, some of the features $x_{ij}$ may be missing or corrupted in some of the training examples. Generative models typically provide better ways of handling this than non-generative models.
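
The point about changing the loss function at run time can be illustrated with a small sketch. Given a posterior $P(y = 1 \mid x)$ from any probabilistic model (LDA or Logistic Regression), the prediction is the label with the smallest expected loss, so swapping the loss matrix changes the decision without touching the model. The function name `decide`, the posterior value, and both loss matrices are assumptions for this example.

```python
import numpy as np

def decide(p1, loss):
    """Choose the prediction with minimum expected loss.

    p1 is P(y = 1 | x); loss[i, j] is the cost of predicting i when the truth is j.
    """
    posterior = np.array([1.0 - p1, p1])
    expected_loss = loss @ posterior
    return int(np.argmin(expected_loss))

zero_one = np.array([[0.0, 1.0],
                     [1.0, 0.0]])

# Hypothetical asymmetric loss: missing a positive example costs 10 times more.
costly_miss = np.array([[0.0, 10.0],
                        [1.0, 0.0]])

p1 = 0.3  # posterior from an already-trained model (illustrative value)
pred_zero_one = decide(p1, zero_one)
pred_costly = decide(p1, costly_miss)
print(pred_zero_one, pred_costly)
```

Under 0/1 loss the example is labeled 0, since 0.3 is below 0.5; under the asymmetric loss the expected cost of predicting 0 is 3.0 versus 0.7 for predicting 1, so the same posterior now yields label 1. A perceptron outputs only the label, so it has no posterior to re-threshold in this way.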


### Machine Learning 1. Linear Classifiers. Marius Kloft. Humboldt University of Berlin Summer Term Machine Learning 1 Linear Classifiers 1

Machine Learning 1 Linear Classifiers Marius Kloft Humboldt University of Berlin Summer Term 2014 Machine Learning 1 Linear Classifiers 1 Recap Past lectures: Machine Learning 1 Linear Classifiers 2 Recap

### Lecture 3: Pattern Classification. Pattern classification

EE E68: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mitures and

### CS 195-5: Machine Learning Problem Set 1

CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

### ISyE 6416: Computational Statistics Spring Lecture 5: Discriminant analysis and classification

ISyE 6416: Computational Statistics Spring 2017 Lecture 5: Discriminant analysis and classification Prof. Yao Xie H. Milton Stewart School of Industrial and Systems Engineering Georgia Institute of Technology

### Bayesian Decision Theory

Bayesian Decision Theory Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551, Fall 2017 c 2017, Selim Aksoy (Bilkent University) 1 / 46 Bayesian

### Learning Neural Networks

Learning Neural Networks Neural Networks can represent complex decision boundaries Variable size. Any boolean function can be represented. Hidden units can be interpreted as new features Deterministic

### Perceptron (Theory) + Linear Regression

10601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Perceptron (Theory) Linear Regression Matt Gormley Lecture 6 Feb. 5, 2018 1 Q&A

What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff

### 10-701/ Machine Learning - Midterm Exam, Fall 2010

10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam

### Introduction to Logistic Regression and Support Vector Machine

Introduction to Logistic Regression and Support Vector Machine guest lecturer: Ming-Wei Chang CS 446 Fall, 2009 () / 25 Fall, 2009 / 25 Before we start () 2 / 25 Fall, 2009 2 / 25 Before we start Feel

### Introduction to Machine Learning Midterm Exam Solutions

10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

### CSC 411: Lecture 09: Naive Bayes

CSC 411: Lecture 09: Naive Bayes Class based on Raquel Urtasun & Rich Zemel s lectures Sanja Fidler University of Toronto Feb 8, 2015 Urtasun, Zemel, Fidler (UofT) CSC 411: 09-Naive Bayes Feb 8, 2015 1

### Classification. Chapter Introduction. 6.2 The Bayes classifier

Chapter 6 Classification 6.1 Introduction Often encountered in applications is the situation where the response variable Y takes values in a finite set of labels. For example, the response Y could encode

### LDA, QDA, Naive Bayes

LDA, QDA, Naive Bayes Generative Classification Models Marek Petrik 2/16/2017 Last Class Logistic Regression Maximum Likelihood Principle Logistic Regression Predict probability of a class: p(x) Example:

### Gaussian Models

Gaussian Models ddebarr@uw.edu 2016-04-28 Agenda Introduction Gaussian Discriminant Analysis Inference Linear Gaussian Systems The Wishart Distribution Inferring Parameters Introduction Gaussian Density

### Linear Decision Boundaries

Linear Decision Boundaries A basic approach to classification is to find a decision boundary in the space of the predictor variables. The decision boundary is often a curve formed by a regression model:

### UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and