Outline. Supervised Learning. Hong Chang. Institute of Computing Technology, Chinese Academy of Sciences. Machine Learning Methods (Fall 2012)

Outline I
1. Linear Models for Regression: Linear Regression; Probabilistic Interpretation; Generalized Linear Regression
2. Logistic Regression
3. Generative Classification: Gaussian Discriminant Analysis; Naive Bayes

A Regression Example
Consider the house price problem as follows:

  living area (m^2)   location       price (RMB 10,000)
  65                  Zhongguancun   350
  100                 Shangdi        320
  105                 Shangdi        380
  110                 Changping      168

Input samples: $X = \{x^{(1)}, \dots, x^{(N)}\}$; features include $x_1^{(i)}$: living area and $x_2^{(i)}$: location.
Output targets: $Y = \{y^{(1)}, \dots, y^{(N)}\}$.
Linear regression model:
$$f_w(x) = w_0 + w_1 x_1 + \dots + w_D x_D = \sum_{j=0}^{D} w_j x_j = w^T x,$$
where $w = (w_0, w_1, \dots, w_D)^T$ and $x = (1, x_1, \dots, x_D)^T$.

Least Mean Square Algorithm
Cost function: $J(w) = \frac{1}{2} \sum_{i=1}^{N} (f_w(x^{(i)}) - y^{(i)})^2$
Gradient descent: $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$
For a single training example: $w_j := w_j + \alpha (y^{(i)} - f_w(x^{(i)})) x_j^{(i)}$

Least Mean Square Algorithm (2)
For a training set of more than one example:
Batch gradient descent: repeat until convergence {
  $w_j := w_j + \alpha \sum_{i=1}^{N} (y^{(i)} - f_w(x^{(i)})) x_j^{(i)}$
}
Stochastic gradient descent: loop {
  for $i = 1$ to $N$ {
    $w_j := w_j + \alpha (y^{(i)} - f_w(x^{(i)})) x_j^{(i)}$
  }
}
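As a minimal sketch of the two update rules (not from the slides; the learning rate alpha, iteration counts, and function names are illustrative assumptions):

    import numpy as np

    def batch_gradient_descent(X, y, alpha=0.01, n_iters=1000):
        # Illustrative sketch. X: (N, D+1) design matrix with a leading column of ones; y: (N,) targets
        w = np.zeros(X.shape[1])
        for _ in range(n_iters):
            residual = y - X @ w                 # y^(i) - f_w(x^(i)) for every example
            w = w + alpha * (X.T @ residual)     # batch update: sum over all N examples
        return w

    def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=50):
        w = np.zeros(X.shape[1])
        for _ in range(n_epochs):
            for i in range(X.shape[0]):          # one example at a time
                residual = y[i] - X[i] @ w
                w = w + alpha * residual * X[i]
        return w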

Probabilistic Assumptions
Relate the inputs and targets via $y^{(i)} = w^T x^{(i)} + \epsilon^{(i)}$, with $\epsilon^{(i)}$ IID Gaussian distributed, i.e.,
$$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\right)$$
Thus,
$$p(y^{(i)} \mid x^{(i)}; w) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - w^T x^{(i)})^2}{2\sigma^2}\right)$$
Likelihood function:
$$L(w) = L(w; X, y) = p(y \mid X; w) = \prod_{i=1}^{N} p(y^{(i)} \mid x^{(i)}; w)$$

Maximum Likelihood
Log likelihood:
$$\ell(w) = \log L(w) = \sum_{i=1}^{N} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - w^T x^{(i)})^2}{2\sigma^2}\right) = N \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{N} (y^{(i)} - w^T x^{(i)})^2$$
Maximizing $\ell(w)$ gives the same answer as minimizing $\frac{1}{2}\sum_{i=1}^{N} (y^{(i)} - w^T x^{(i)})^2$.
Under the previous probabilistic assumptions on the data, least-squares regression corresponds to finding the maximum likelihood estimate of $w$.

Locally Weighted Linear Regression (LWR)
The LWR algorithm:
1. Fit $w$ to minimize $\sum_i \theta^{(i)} (y^{(i)} - w^T x^{(i)})^2$
2. Output $w^T x$
A standard choice for the weights is $\theta^{(i)} = \exp\left(-\frac{\|x^{(i)} - x\|^2}{2\tau^2}\right)$; more general forms are possible.
The weights depend on the particular point $x$ at which we are trying to evaluate.
LWR is an example of a non-parametric algorithm.
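A hedged sketch of LWR as a weighted least-squares solve around the query point, assuming Gaussian weights with bandwidth tau (the function name, defaults, and use of the weighted normal equations are illustrative choices, not from the slides):

    import numpy as np

    def lwr_predict(X, y, x_query, tau=0.5):
        # Illustrative sketch: fit w around x_query with Gaussian weights, then output w^T x_query.
        # X: (N, D+1) design matrix with leading ones; y: (N,); x_query: (D+1,)
        weights = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau ** 2))
        W = np.diag(weights)
        # Weighted normal equations: w = (X^T W X)^{-1} X^T W y
        w = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
        return x_query @ w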

Linear Regression with Nonlinear Basis
Consider linear combinations of nonlinear basis functions of the input variables:
$$f_w(x) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x) = w^T \phi(x), \qquad \phi = (\phi_0, \phi_1, \dots, \phi_{M-1}), \ \phi_0(x) = 1$$
Examples of basis functions:
$\phi(x) = (1, x, x^2, \dots, x^{M-1})$
$\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2 s^2}\right)$
We can still use the least squares method to estimate $w$.
A linear method can model nonlinear functions by using nonlinear basis functions.
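A short illustrative sketch, assuming a polynomial basis and NumPy's least-squares solver (the synthetic data and the choice M = 4 are arbitrary assumptions):

    import numpy as np

    def polynomial_design_matrix(x, M):
        # phi(x) = (1, x, x^2, ..., x^{M-1}) for a vector of scalar inputs x
        return np.vander(x, N=M, increasing=True)

    # Least squares in the basis-expanded space (synthetic data for illustration)
    x = np.linspace(0.0, 1.0, 20)
    y = np.sin(2.0 * np.pi * x) + 0.1 * np.random.randn(20)
    Phi = polynomial_design_matrix(x, M=4)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # minimizes ||Phi w - y||^2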

Two-Class Example
Let $P(y = 1 \mid x) = t$ and $P(y = 0 \mid x) = 1 - t$.
Classification rule: choose $y = 1$ if $t > 0.5$, and $y = 0$ otherwise.
Equivalent test: choose $y = 1$ if $\frac{t}{1-t} > 1$, or equivalently $\log \frac{t}{1-t} > 0$, where $\frac{t}{1-t}$ is called the odds of $t$ and $\log \frac{t}{1-t}$ is called the log odds or logit function of $t$.

Logistic Regression
The posterior probability (or the hypothesis) is written as:
$$P(y = 1 \mid x) = f_w(x) = g(w^T x) = \frac{1}{1 + \exp(-w^T x)},$$
where $g(\cdot)$ is called the logistic or sigmoid function.
The derivative of the sigmoid function: $g'(z) = g(z)(1 - g(z))$.
Note that this is a model for classification rather than regression.

Parameter Estimation
Since $P(y = 1 \mid x; w) = f_w(x)$ and $P(y = 0 \mid x; w) = 1 - f_w(x)$, we can write
$$P(y \mid x; w) = (f_w(x))^y (1 - f_w(x))^{1-y}.$$
The likelihood of the parameters (given $N$ training examples):
$$L(w) = \prod_{i=1}^{N} P(y^{(i)} \mid x^{(i)}; w) = \prod_{i=1}^{N} (f_w(x^{(i)}))^{y^{(i)}} (1 - f_w(x^{(i)}))^{1 - y^{(i)}}$$
The log likelihood function:
$$\ell(w) = \log L(w) = \sum_{i=1}^{N} y^{(i)} \log f_w(x^{(i)}) + (1 - y^{(i)}) \log(1 - f_w(x^{(i)}))$$

Parameter Estimation (2)
For one training example $(x, y)$, maximizing the log likelihood gives
$$\frac{\partial}{\partial w_j} \ell(w) = (y - f_w(x))\, x_j,$$
using the facts that $f_w(x) = g(w^T x)$ and $g'(z) = g(z)(1 - g(z))$.
This therefore gives the stochastic gradient ascent rule:
$$w_j := w_j + \alpha (y^{(i)} - f_w(x^{(i)})) x_j^{(i)}$$
The update rule is identical in form to the least mean squares update rule for linear regression!
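A minimal sketch of logistic regression trained by the stochastic gradient ascent rule above (the learning rate, epoch count, and function names are illustrative assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_regression_sga(X, y, alpha=0.1, n_epochs=100):
        # Illustrative sketch of stochastic gradient ascent on the log likelihood:
        #   w_j := w_j + alpha * (y^(i) - g(w^T x^(i))) * x_j^(i)
        w = np.zeros(X.shape[1])
        for _ in range(n_epochs):
            for i in range(X.shape[0]):
                w = w + alpha * (y[i] - sigmoid(X[i] @ w)) * X[i]
        return w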

Logistic Discrimination
Logistic discrimination does not model the class-conditional densities $p(x \mid C_k)$ but their ratio.
Parametric classification (studied later) learns the classifier by estimating the parameters of $p(x \mid C_k)$; logistic discrimination estimates the parameters of the discriminant function directly.

Discriminative vs. Generative Classification (Revisited)
Discriminative classification: model the posterior probabilities or learn a discriminant function directly.
- Compute $P(C_k \mid x)$: e.g., logistic regression.
- Compute a discriminant function: e.g., perceptron, SVM.
- Makes an assumption on the form of the discriminants, but not on the densities.
Generative classification: explicitly or implicitly model the distribution of inputs as well as outputs.
- Assume a model for the class-conditional probability densities $p(x \mid C_k)$.
- Estimate $p(x \mid C_k)$ and $P(C_k)$ from data, and apply Bayes' rule to compute the posterior probabilities $P(C_k \mid x)$.
- Perform optimal classification based on $P(C_k \mid x)$.

Multivariate Normal Distribution
$x \sim \mathcal{N}_D(\mu, \Sigma)$:
$$p(x) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$

Multivariate Normal Distribution (2)
The Mahalanobis distance measures the distance from $x$ to $\mu$ in terms of $\Sigma$: $(x - \mu)^T \Sigma^{-1} (x - \mu)$.
$(x - \mu)^T \Sigma^{-1} (x - \mu) = c^2$ is a D-dimensional hyperellipsoid centered at $\mu$; its shape and orientation are defined by $\Sigma$.
Euclidean distance is a special case of the Mahalanobis distance when $\Sigma = s^2 I$; the hyperellipsoid then degenerates into a hypersphere.
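A small illustrative helper for the squared Mahalanobis distance (the function name and the use of a linear solve instead of an explicit inverse are assumptions on my part):

    import numpy as np

    def mahalanobis_sq(x, mu, Sigma):
        # Illustrative sketch: squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)
        d = x - mu
        return d @ np.linalg.solve(Sigma, d)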

Gaussian Discriminant Analysis (GDA)
GDA models $p(x \mid y)$ using a multivariate normal distribution:
$y \sim \text{Bernoulli}(\beta)$
$x \mid y = 0 \sim \mathcal{N}(\mu_0, \Sigma)$
$x \mid y = 1 \sim \mathcal{N}(\mu_1, \Sigma)$
The parameters of the model are $\{\beta, \mu_0, \mu_1, \Sigma\}$. Note that a common covariance matrix is usually used.

Gaussian Discriminant Analysis (2)
The log likelihood of the data is:
$$\ell(\beta, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{N} p(x^{(i)}, y^{(i)}; \beta, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{N} p(x^{(i)} \mid y^{(i)}; \mu_0, \mu_1, \Sigma)\, p(y^{(i)}; \beta)$$
The maximum likelihood estimates are:
$$\beta = \frac{1}{N} \sum_{i=1}^{N} 1\{y^{(i)} = 1\}$$
$$\mu_k = \frac{\sum_{i=1}^{N} 1\{y^{(i)} = k\}\, x^{(i)}}{\sum_{i=1}^{N} 1\{y^{(i)} = k\}}, \qquad k \in \{0, 1\}$$
$$\Sigma = \frac{1}{N} \sum_{i=1}^{N} (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T$$
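A hedged sketch of the maximum likelihood estimates above, assuming binary labels and a shared covariance matrix (the function name and array layout are illustrative):

    import numpy as np

    def fit_gda(X, y):
        # Illustrative sketch: ML estimates for GDA with a shared covariance matrix.
        # X: (N, D) real-valued features; y: (N,) labels in {0, 1}
        N = X.shape[0]
        beta = np.mean(y == 1)
        mu0 = X[y == 0].mean(axis=0)
        mu1 = X[y == 1].mean(axis=0)
        centered = X - np.where((y == 1)[:, None], mu1, mu0)   # x^(i) - mu_{y^(i)}
        Sigma = centered.T @ centered / N
        return beta, mu0, mu1, Sigma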

Illustration of GDA (figure not reproduced in the transcription)

GDA and Logistic Regression
If we view $P(y = 1 \mid x; \beta, \mu_0, \mu_1, \Sigma)$ as a function of $x$, it can be expressed in the form
$$P(y = 1 \mid x; \beta, \mu_0, \mu_1, \Sigma) = \frac{1}{1 + \exp(-w^T x)},$$
where $w$ is an appropriate function of $\beta, \mu_0, \mu_1, \Sigma$.
This is exactly the form that logistic regression (a discriminative algorithm) uses to model $P(y = 1 \mid x)$.
GDA makes stronger modeling assumptions about the data (i.e., that $p(x \mid y)$ is multivariate Gaussian) and is more data efficient when those assumptions hold.
Logistic regression makes weaker assumptions and is more robust to deviations from the modeling assumptions.
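For reference (this derivation is not spelled out on the slide), applying Bayes' rule to the shared-covariance Gaussian class conditionals gives one standard form of the mapping, with $x$ augmented by a constant component for the bias term:
$$w_{1:D} = \Sigma^{-1}(\mu_1 - \mu_0), \qquad w_0 = -\tfrac{1}{2}\mu_1^T \Sigma^{-1}\mu_1 + \tfrac{1}{2}\mu_0^T \Sigma^{-1}\mu_0 + \log\frac{\beta}{1-\beta}.$$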

Bernoulli and Multinomial
Bernoulli: the distribution for a single binary variable $x \in \{0, 1\}$, governed by a single continuous parameter $\beta \in [0, 1]$ which represents the probability of $x = 1$:
$$\text{Bern}(x \mid \beta) = \beta^x (1 - \beta)^{1-x}$$
Binomial: gives the probability of observing $m$ occurrences of $x = 1$ in a set of $N$ samples from a Bernoulli distribution:
$$\text{Bin}(m \mid N, \beta) = \binom{N}{m} \beta^m (1 - \beta)^{N-m}$$
Multinomial: the multivariate generalization of the binomial; gives the distribution over counts $m_k$ for a K-state discrete variable to be in state $k$:
$$\text{Mult}(m_1, \dots, m_K \mid N, \beta) = \binom{N}{m_1\, m_2\, \cdots\, m_K} \prod_{k=1}^{K} \beta_k^{m_k}$$

Email Spam Filter
Classifying emails as spam ($y = 1$) or non-spam ($y = 0$): one example of a broader set of problems called text classification.
Feature vector: $x = (1, 0, 1, \dots, 0)^T \in \{0, 1\}^D$, with entries indexed by the vocabulary words (a, aardwolf, ..., buy, ..., zygmurgy); $D$ is the size of the set of words (the vocabulary).
Word selection: choose the vocabulary from words that appear in the emails rather than from a dictionary; exclude very high-frequency words; etc.
If the $x_j$'s can take more general values in $\{1, 2, \dots, k_j\}$, then the model $p(x_j \mid y)$ is multinomial rather than Bernoulli.

Email Spam Filter (2)
To model $p(x \mid y)$, we assume that the $x_j$'s are conditionally independent given $y$. This is the Naive Bayes assumption. In the Naive Bayes classifier:
$$p(x_1, \dots, x_D \mid y) = p(x_1 \mid y)\, p(x_2 \mid y) \cdots p(x_D \mid y) = \prod_{j=1}^{D} p(x_j \mid y)$$
Given a training set $\{x^{(i)}, y^{(i)}\}$, $i = 1, \dots, N$, the joint likelihood of the data is:
$$L(p(x_j \mid y), P(y)) = \prod_{i=1}^{N} p(x^{(i)}, y^{(i)})$$

Email Spam Filter (3)
Maximizing $L$ w.r.t. $p(x_j \mid y)$ and $P(y)$ gives the following estimates:
$$p(x_j = 1 \mid y = 1) = \frac{\sum_{i=1}^{N} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\}}{\sum_{i=1}^{N} 1\{y^{(i)} = 1\}}$$
$$p(x_j = 1 \mid y = 0) = \frac{\sum_{i=1}^{N} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\}}{\sum_{i=1}^{N} 1\{y^{(i)} = 0\}}$$
$$p(y = 1) = \frac{\sum_{i=1}^{N} 1\{y^{(i)} = 1\}}{N}$$

Email Spam Filter (4)
To make a prediction on a new example $x$, we simply calculate:
$$p(y = 1 \mid x) = \frac{p(x \mid y = 1)\, p(y = 1)}{p(x)} = \frac{\left(\prod_{j=1}^{D} p(x_j \mid y = 1)\right) p(y = 1)}{\left(\prod_{j=1}^{D} p(x_j \mid y = 1)\right) p(y = 1) + \left(\prod_{j=1}^{D} p(x_j \mid y = 0)\right) p(y = 0)}$$
When the original, continuous-valued attributes are not well modeled by a multivariate normal distribution, discretizing the features and using Naive Bayes (instead of GDA) will often result in a better classifier.
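A minimal sketch of the Bernoulli Naive Bayes fit and prediction above (function names, array shapes, and the absence of smoothing at this point are illustrative assumptions):

    import numpy as np

    def fit_bernoulli_nb(X, y):
        # Illustrative sketch: ML estimates for the Bernoulli Naive Bayes spam model.
        # X: (N, D) binary word-occurrence matrix; y: (N,) labels in {0, 1}
        p_y1 = np.mean(y == 1)
        p_x1_given_y1 = X[y == 1].mean(axis=0)   # p(x_j = 1 | y = 1) for every j
        p_x1_given_y0 = X[y == 0].mean(axis=0)   # p(x_j = 1 | y = 0) for every j
        return p_y1, p_x1_given_y1, p_x1_given_y0

    def predict_spam_prob(x, p_y1, p_x1_given_y1, p_x1_given_y0):
        # p(y = 1 | x) via Bayes' rule under the conditional-independence assumption
        lik1 = np.prod(np.where(x == 1, p_x1_given_y1, 1.0 - p_x1_given_y1)) * p_y1
        lik0 = np.prod(np.where(x == 1, p_x1_given_y0, 1.0 - p_x1_given_y0)) * (1.0 - p_y1)
        return lik1 / (lik1 + lik0)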

Laplace Smoothing
If some word was never seen in the finite training set, its estimated conditional probabilities are zero and the Naive Bayes classifier cannot make a sensible decision (the posterior becomes 0/0).
To estimate the mean of a multinomial random variable $x$ taking values in $\{1, \dots, k\}$, given $N$ observations $\{x^{(1)}, \dots, x^{(N)}\}$, the maximum likelihood estimates are:
$$p(x = j) = \frac{\sum_{i=1}^{N} 1\{x^{(i)} = j\}}{N}, \qquad j = 1, \dots, k$$
Laplace smoothing:
$$p(x = j) = \frac{\sum_{i=1}^{N} 1\{x^{(i)} = j\} + 1}{N + k}, \qquad j = 1, \dots, k$$
$\sum_{j=1}^{k} p(x = j) = 1$ still holds, and $p(x = j) \neq 0$ for all $j$.

Laplace Smoothing (2)
Returning to the Naive Bayes classifier, with Laplace smoothing we obtain the following estimates:
$$p(x_j = 1 \mid y = 1) = \frac{\sum_{i=1}^{N} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\} + 1}{\sum_{i=1}^{N} 1\{y^{(i)} = 1\} + 2}$$
$$p(x_j = 1 \mid y = 0) = \frac{\sum_{i=1}^{N} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\} + 1}{\sum_{i=1}^{N} 1\{y^{(i)} = 0\} + 2}$$
In practice, it usually doesn't matter whether we apply Laplace smoothing to $p(y = 1)$ or not, since we typically have a fair fraction each of spam and non-spam emails.
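A hedged variant of the earlier illustrative fit, with the +1 / +2 Laplace-smoothed counts from the slide (function name and layout remain illustrative):

    import numpy as np

    def fit_bernoulli_nb_smoothed(X, y):
        # Illustrative sketch: add 1 to each word count and 2 to each class count
        n1 = np.sum(y == 1)
        n0 = np.sum(y == 0)
        p_y1 = n1 / len(y)
        p_x1_given_y1 = (X[y == 1].sum(axis=0) + 1.0) / (n1 + 2.0)
        p_x1_given_y0 = (X[y == 0].sum(axis=0) + 1.0) / (n0 + 2.0)
        return p_y1, p_x1_given_y1, p_x1_given_y0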

Event Models for Text Classification
The Naive Bayes classifier above uses the multivariate Bernoulli event model: the probability of an email is given by $P(y) \prod_{j=1}^{D} p(x_j \mid y)$.
A different way to represent emails: $x = (x_1, \dots, x_M)$, where $x_j$ denotes the $j$th word in the email, taking values in $\{1, \dots, |V|\}$; $V$ is the vocabulary and $M$ is the length of the email.
Multinomial event model: $p(x_j \mid y)$ is now multinomial, and the overall probability of an email is $P(y) \prod_{j=1}^{M} p(x_j \mid y)$.

Event Models for Text Classification (2)
Given a training set $\{x^{(i)}, y^{(i)}\}$, $i = 1, \dots, N$, where $x^{(i)} = (x_1^{(i)}, \dots, x_{M_i}^{(i)})^T$ ($M_i$ is the number of words in the $i$th training example), the maximum likelihood estimates of the model are:
$$p(x_j = k \mid y = 1) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{M_i} 1\{x_j^{(i)} = k \wedge y^{(i)} = 1\}}{\sum_{i=1}^{N} 1\{y^{(i)} = 1\}\, M_i}, \quad \text{for any } j$$
$$p(x_j = k \mid y = 0) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{M_i} 1\{x_j^{(i)} = k \wedge y^{(i)} = 0\}}{\sum_{i=1}^{N} 1\{y^{(i)} = 0\}\, M_i}, \quad \text{for any } j$$
$$P(y = 1) = \frac{\sum_{i=1}^{N} 1\{y^{(i)} = 1\}}{N}$$
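A short sketch of the multinomial event model estimates, assuming emails are given as lists of word indices (names and data layout are illustrative; smoothing is omitted to mirror the formulas above):

    import numpy as np

    def fit_multinomial_event_model(docs, y, V):
        # Illustrative sketch. docs: list of N sequences of word indices in 0..V-1;
        # y: sequence of labels in {0, 1}; V: vocabulary size.
        counts = np.zeros((2, V))
        totals = np.zeros(2)
        for doc, label in zip(docs, y):
            for k in doc:
                counts[label, k] += 1        # occurrences of word k in class `label`
            totals[label] += len(doc)        # total word positions in class `label`
        p_word_given_y = counts / totals[:, None]
        p_y1 = np.mean(np.asarray(y) == 1)
        return p_word_given_y, p_y1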

Naive Bayes vs. Logistic Regression
Asymptotic comparison (number of training examples → infinity):
When the model assumption is correct, Naive Bayes (NB) and logistic regression (LR) produce identical classifiers.
When the model assumption is incorrect, LR is less biased (it does not assume conditional independence) and is therefore expected to outperform NB.

Naive Bayes vs. Logistic Regression (2)
Non-asymptotic comparison (see [Ng and Jordan, 2002]):
Convergence rate of parameter estimation: how many training examples are needed to obtain good estimates?
NB needs on the order of log D examples (where D is the number of attributes in X); LR needs on the order of D examples.
NB converges more quickly to its asymptotic estimates.