Machine Learning Methods (Fall 2012)
Hong Chang, Institute of Computing Technology, Chinese Academy of Sciences
Outline I
1 Linear Models for Regression
  - Linear Regression
  - Probabilistic Interpretation
  - Generalized Linear Regression
2 Logistic Regression
3
A Regression Example
Consider the house price problem as follows:

  living area (m^2) | location     | price (RMB 10,000)
  65                | Zhongguancun | 350
  100               | Shangdi      | 320
  105               | Shangdi      | 380
  110               | Changping    | 168

- Input samples: X = {x^{(1)}, ..., x^{(N)}}. Features include x_1^{(i)}: living area; x_2^{(i)}: location.
- Output target: Y = {y^{(1)}, ..., y^{(N)}}
- Linear regression model: f_w(x) = w_0 + w_1 x_1 + ... + w_D x_D = \sum_{j=0}^{D} w_j x_j = w^T x, where w = (w_0, w_1, ..., w_D)^T and x = (1, x_1, ..., x_D)^T.
Least Mean Square Algorithm
- Cost function: J(w) = \frac{1}{2} \sum_{i=1}^{N} (f_w(x^{(i)}) - y^{(i)})^2
- Gradient descent: w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)
- For a single training example: w_j := w_j + \alpha (y^{(i)} - f_w(x^{(i)})) x_j^{(i)}
Least Mean Square Algorithm (2)
For a training set of more than one example:
- Batch gradient descent:
  Repeat until convergence {
    w_j := w_j + \alpha \sum_{i=1}^{N} (y^{(i)} - f_w(x^{(i)})) x_j^{(i)}
  }
- Stochastic gradient descent:
  Loop {
    For i = 1 to N {
      w_j := w_j + \alpha (y^{(i)} - f_w(x^{(i)})) x_j^{(i)}
    }
  }
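As a concrete illustration of the two update rules, here is a minimal NumPy sketch; the function names, step size, and toy data are illustrative assumptions, not part of the slides.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """LMS with batch updates: w_j := w_j + alpha * sum_i (y^(i) - w^T x^(i)) x_j^(i).

    X is N x (D+1) with a leading column of ones; y has length N.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        residuals = y - X @ w            # (y^(i) - f_w(x^(i))) for all i
        w += alpha * X.T @ residuals     # sum over the whole training set
    return w

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=100):
    """LMS with one update per training example (stochastic gradient descent)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in range(X.shape[0]):
            residual = y[i] - X[i] @ w
            w += alpha * residual * X[i]
    return w

# Toy usage: one feature (a rescaled living area), plus the constant term.
rng = np.random.default_rng(0)
area = rng.uniform(0.5, 1.5, size=50)              # hypothetical, rescaled inputs
price = 3.0 + 2.0 * area + 0.1 * rng.standard_normal(50)
X = np.column_stack([np.ones_like(area), area])    # x = (1, x_1)^T
print(batch_gradient_descent(X, price))
print(stochastic_gradient_descent(X, price))
```

With the batch rule the step size must be small relative to the data scale for the iterates to converge; the stochastic rule updates after every example and often makes progress faster on large training sets.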
Probabilistic Assumptions
- Relate the inputs and target via: y^{(i)} = w^T x^{(i)} + \epsilon^{(i)}, with \epsilon^{(i)} IID Gaussian distributed, i.e., p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\right)
- Thus, p(y^{(i)} | x^{(i)}; w) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y^{(i)} - w^T x^{(i)})^2}{2\sigma^2}\right)
- Likelihood function: L(w) = L(w; X, y) = p(y | X; w) = \prod_{i=1}^{N} p(y^{(i)} | x^{(i)}; w)
Maximum Likelihood
- Log likelihood:
  \ell(w) = \log L(w) = \sum_{i=1}^{N} \log \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y^{(i)} - w^T x^{(i)})^2}{2\sigma^2}\right) = N \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{N} (y^{(i)} - w^T x^{(i)})^2
- Maximizing \ell(w) gives the same answer as minimizing \frac{1}{2} \sum_{i=1}^{N} (y^{(i)} - w^T x^{(i)})^2
- Under the previous probabilistic assumptions on the data, least-squares regression corresponds to finding the maximum likelihood estimate of w.
Locally Weighted Linear Regression (LWR)
The LWR algorithm:
1 Fit w to minimize \sum_i \theta^{(i)} (y^{(i)} - w^T x^{(i)})^2
2 Output w^T x
- A standard choice for the weights is: \theta^{(i)} = \exp\left(-\frac{\|x^{(i)} - x\|^2}{2\tau^2}\right). More general forms are possible.
- The weights depend on the particular point x at which we are trying to evaluate.
- LWR is an example of a non-parametric algorithm.
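A sketch of how the LWR prediction could be computed, assuming the weighted objective is solved in closed form via the weighted normal equations (the slides specify only the objective and the weights; the solver, function name, and toy sine data are assumptions for illustration).

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Locally weighted linear regression prediction at a single query point.

    Fits w to minimize sum_i theta^(i) (y^(i) - w^T x^(i))^2 with
    theta^(i) = exp(-||x^(i) - x_query||^2 / (2 tau^2)), then returns w^T x_query.
    """
    theta = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau ** 2))
    W = np.diag(theta)
    # Solve the weighted normal equations (X^T W X) w = X^T W y for this query point.
    w = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ w

# Toy usage: a noisy sine curve, fitted locally near x_1 = 1.5.
rng = np.random.default_rng(1)
x1 = np.linspace(0, 3, 80)
y = np.sin(x1) + 0.1 * rng.standard_normal(80)
X = np.column_stack([np.ones_like(x1), x1])
print(lwr_predict(X, y, np.array([1.0, 1.5])))
```

Because a new fit is required for every query point, the entire training set must be kept around, which is what makes the method non-parametric.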
Linear Regression with Nonlinear Basis
- Consider linear combinations of nonlinear basis functions of the input variables: f_w(x) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x) = w^T \phi(x), where \phi = (\phi_0, \phi_1, ..., \phi_{M-1})^T and \phi_0(x) = 1.
- Examples of basis functions: \phi(x) = (1, x, x^2, ..., x^{M-1}); \phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2 s^2}\right)
- We can still use the least squares method to estimate w.
- A linear method can model nonlinear functions by using nonlinear basis functions.
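A small sketch of fitting the two kinds of basis functions listed above with ordinary least squares on the transformed features; the helper names, the choice M = 6, the basis centers and width, and the toy data are illustrative assumptions.

```python
import numpy as np

def polynomial_basis(x, M):
    """phi(x) = (1, x, x^2, ..., x^{M-1}) for scalar inputs x."""
    return np.column_stack([x ** j for j in range(M)])

def gaussian_basis(x, centers, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), plus a constant phi_0(x) = 1."""
    phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * s ** 2))
    return np.column_stack([np.ones_like(x), phi])

# The model is still linear in w, so ordinary least squares applies unchanged.
rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 100)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(100)

Phi = polynomial_basis(x, M=6)
w_poly, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # minimizes sum_i (w^T phi(x_i) - y_i)^2
print(w_poly)

Phi_g = gaussian_basis(x, centers=np.linspace(-1, 1, 5), s=0.4)
w_gauss, *_ = np.linalg.lstsq(Phi_g, y, rcond=None)
print(w_gauss)
```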
Two-Class Example
- Let P(y = 1 | x) = t, P(y = 0 | x) = 1 - t.
- Classification rule: choose y = 1 if t > 0.5, y = 0 otherwise.
- Equivalent test for the classification rule: choose y = 1 if \frac{t}{1-t} > 1, or \log \frac{t}{1-t} > 0, where \frac{t}{1-t} is called the odds of t and \log \frac{t}{1-t} is called the log odds or logit function of t.
Logistic Regression
- The posterior probability (or the hypothesis) is written as: P(y = 1 | x) = f_w(x) = g(w^T x) = \frac{1}{1 + \exp(-w^T x)}, where g(\cdot) is called the logistic or sigmoid function.
- The derivative of the sigmoid function: g'(z) = g(z)(1 - g(z))
- Note that this is a model for classification rather than regression.
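A minimal sketch of the sigmoid and the derivative identity g'(z) = g(z)(1 - g(z)), with a finite-difference check; the function names and test points are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)

# Numerical check of the derivative identity at a few points.
z = np.array([-2.0, 0.0, 3.0])
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.allclose(numeric, sigmoid_grad(z)))   # True
```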
Parameter Estimation
- Since P(y = 1 | x; w) = f_w(x) and P(y = 0 | x; w) = 1 - f_w(x), we have P(y | x; w) = (f_w(x))^y (1 - f_w(x))^{1-y}.
- The likelihood of the parameters (given N training examples): L(w) = \prod_{i=1}^{N} P(y^{(i)} | x^{(i)}; w) = \prod_{i=1}^{N} (f_w(x^{(i)}))^{y^{(i)}} (1 - f_w(x^{(i)}))^{1 - y^{(i)}}
- The log likelihood function: \ell(w) = \log L(w) = \sum_{i=1}^{N} \left[ y^{(i)} \log f_w(x^{(i)}) + (1 - y^{(i)}) \log(1 - f_w(x^{(i)})) \right]
Parameter Estimation (2)
- For one training example (x, y), differentiating the log likelihood gives: \frac{\partial}{\partial w_j} \ell(w) = (y - f_w(x)) x_j, using the facts that f_w(x) = g(w^T x) and g'(z) = g(z)(1 - g(z)).
- This therefore gives the stochastic gradient ascent rule: w_j := w_j + \alpha (y^{(i)} - f_w(x^{(i)})) x_j^{(i)}
- The update rule is identical in form to the least mean squares update rule for linear regression!
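A sketch of the stochastic gradient ascent rule for logistic regression, looping over the examples exactly as in the LMS case; the step size, epoch count, and synthetic labels are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_sga(X, y, alpha=0.1, n_epochs=200):
    """Stochastic gradient ascent on the log likelihood:
    w_j := w_j + alpha * (y^(i) - g(w^T x^(i))) * x_j^(i), one example at a time."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in range(X.shape[0]):
            w += alpha * (y[i] - sigmoid(X[i] @ w)) * X[i]
    return w

# Toy usage with made-up, roughly separable data; x = (1, x_1)^T as before.
rng = np.random.default_rng(3)
x1 = rng.standard_normal(100)
X = np.column_stack([np.ones(100), x1])
y = (x1 + 0.3 * rng.standard_normal(100) > 0).astype(float)
w = logistic_regression_sga(X, y)
print(w)   # a large positive weight on x_1 is expected
```

Only the form of the hypothesis differs from the linear-regression sketch: f_w(x) is g(w^T x) here rather than w^T x.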
Logistic Discrimination
- Logistic discrimination does not model the class-conditional densities p(x | C_k) but rather their ratio.
- The parametric classification approach (studied later) learns the classifier by estimating the parameters of p(x | C_k); logistic discrimination estimates the parameters of the discriminant function directly.
Discriminative vs. Generative Classification (Revisited)
Discriminative classification: model the posterior probabilities or learn a discriminant function directly.
- Compute P(C_k | x): such as logistic regression.
- Compute a discriminant function: such as the perceptron, SVM.
- Makes an assumption on the form of the discriminants, but not on the densities.
Generative classification: explicitly or implicitly model the distribution of inputs as well as outputs.
- Assume a model for the class-conditional probability densities p(x | C_k).
- Estimate p(x | C_k) and P(C_k) from data, and apply Bayes' rule to compute the posterior probabilities P(C_k | x).
- Perform optimal classification based on P(C_k | x).
Multivariate Normal Distribution
x \sim \mathcal{N}_D(\mu, \Sigma):
p(x) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right)
Multivariate Normal Distribution (2)
- The Mahalanobis distance measures the distance from x to \mu in terms of \Sigma: (x - \mu)^T \Sigma^{-1} (x - \mu)
- (x - \mu)^T \Sigma^{-1} (x - \mu) = c^2 is a D-dimensional hyperellipsoid centered at \mu. Its shape and orientation are defined by \Sigma.
- Euclidean distance is a special case of Mahalanobis distance when \Sigma = s^2 I; the hyperellipsoid degenerates into a hypersphere.
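A short sketch that evaluates the density and the squared Mahalanobis distance directly from the formulas above, and checks the \Sigma = s^2 I special case; the example \mu, \Sigma, and x are arbitrary.

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    d = x - mu
    return float(d @ np.linalg.solve(Sigma, d))

def mvn_density(x, mu, Sigma):
    """Multivariate normal density N_D(mu, Sigma) evaluated at x."""
    D = len(mu)
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * mahalanobis_sq(x, mu, Sigma)) / norm

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
x = np.array([1.0, -1.0])
print(mahalanobis_sq(x, mu, Sigma))
print(mvn_density(x, mu, Sigma))

# With Sigma = s^2 I the Mahalanobis distance reduces to the Euclidean distance / s^2.
s = 2.0
print(mahalanobis_sq(x, mu, s ** 2 * np.eye(2)), np.sum((x - mu) ** 2) / s ** 2)
```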
Gaussian Discriminant Analysis (GDA)
GDA models p(x | y) using a multivariate normal distribution:
  y \sim \text{Bernoulli}(\beta)
  x | y = 0 \sim \mathcal{N}(\mu_0, \Sigma)
  x | y = 1 \sim \mathcal{N}(\mu_1, \Sigma)
The parameters of the model are \{\beta, \mu_0, \mu_1, \Sigma\}. Note that a common covariance matrix is usually used.
Gaussian Discriminant Analysis (2)
The log likelihood of the data is:
\ell(\beta, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{N} p(x^{(i)}, y^{(i)}; \beta, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{N} p(x^{(i)} | y^{(i)}; \mu_0, \mu_1, \Sigma) \, p(y^{(i)}; \beta)
The maximum likelihood estimates are:
\beta = \frac{1}{N} \sum_{i=1}^{N} 1\{y^{(i)} = 1\}
\mu_k = \frac{\sum_{i=1}^{N} 1\{y^{(i)} = k\} x^{(i)}}{\sum_{i=1}^{N} 1\{y^{(i)} = k\}}, \quad k \in \{0, 1\}
\Sigma = \frac{1}{N} \sum_{i=1}^{N} (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T
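The maximum likelihood estimates above translate almost line for line into code. The following sketch assumes binary labels and a shared covariance matrix; the function name and the made-up two-dimensional data are illustrative.

```python
import numpy as np

def fit_gda(X, y):
    """Maximum likelihood estimates for GDA with a shared covariance matrix."""
    N = len(y)
    beta = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    mu = np.where((y == 1)[:, None], mu1, mu0)      # mu_{y^(i)} for each example
    diff = X - mu
    Sigma = diff.T @ diff / N
    return beta, mu0, mu1, Sigma

# Toy data: two Gaussian classes sharing a covariance (numbers invented for illustration).
rng = np.random.default_rng(4)
X0 = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], size=100)
X1 = rng.multivariate_normal([2, 2], [[1, 0.3], [0.3, 1]], size=100)
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(100), np.ones(100)])
beta, mu0, mu1, Sigma = fit_gda(X, y)
print(beta, mu0, mu1, Sigma, sep="\n")
```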
Illustration of GDA (figure)
GDA and Logistic Regression
- If we view P(y = 1 | x; \beta, \mu_0, \mu_1, \Sigma) as a function of x, it can be expressed in the form: P(y = 1 | x; \beta, \mu_0, \mu_1, \Sigma) = \frac{1}{1 + \exp(-w^T x)}, where w is some appropriate function of \beta, \mu_0, \mu_1, \Sigma.
- This is exactly the form that logistic regression (a discriminative algorithm) uses to model P(y = 1 | x).
- GDA makes stronger modeling assumptions about the data (i.e., that p(x | y) is multivariate Gaussian), and is more data efficient.
- Logistic regression makes weaker assumptions, and is more robust to deviations from the modeling assumptions.
Bernoulli and Multinomial
- Bernoulli: the distribution for a single binary variable x \in \{0, 1\}, governed by a single continuous parameter \beta \in [0, 1] which represents the probability of x = 1: \text{Bern}(x | \beta) = \beta^x (1 - \beta)^{1-x}
- Binomial: gives the probability of observing m occurrences of x = 1 in a set of N samples from a Bernoulli distribution: \text{Bin}(m | N, \beta) = \binom{N}{m} \beta^m (1 - \beta)^{N - m}
- Multinomial: the multivariate generalization of the binomial; gives the distribution over counts m_k for a K-state discrete variable to be in state k: \text{Mult}(m_1, ..., m_K | N, \beta) = \binom{N}{m_1 \, m_2 \, \cdots \, m_K} \prod_{k=1}^{K} \beta_k^{m_k}
Email Spam Filter
- Classifying emails as spam (y = 1) or non-spam (y = 0).
- One example of a broader set of problems called text classification.
- Feature vector: x = (1, 0, 1, ..., 0)^T, with one entry per vocabulary word (a, aardwolf, buy, ..., zygmurgy); x \in \{0, 1\}^D, where D is the size of the set of words (the vocabulary).
- Word selection: include words that appear in the emails but not in a dictionary; exclude very high-frequency words; etc.
- The x_j's can take more general values in \{1, 2, ..., k_j\}, in which case the model p(x_j | y) is multinomial rather than Bernoulli.
Email Spam Filter (2)
- To model p(x | y), we assume that the x_j's are conditionally independent given y. This is called the naive Bayes assumption.
- In the naive Bayes classifier: p(x_1, ..., x_D | y) = p(x_1 | y) p(x_2 | y) \cdots p(x_D | y) = \prod_{j=1}^{D} p(x_j | y)
- Given a training set \{x^{(i)}, y^{(i)}\}, i = 1, ..., N, the joint likelihood of the data is: L(p(x_j | y), P(y)) = \prod_{i=1}^{N} p(x^{(i)}, y^{(i)})
Email Spam Filter (3)
Maximizing L w.r.t. p(x_j | y) and P(y) gives the following estimates:
p(x_j = 1 | y = 1) = \frac{\sum_{i=1}^{N} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\}}{\sum_{i=1}^{N} 1\{y^{(i)} = 1\}}
p(x_j = 1 | y = 0) = \frac{\sum_{i=1}^{N} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\}}{\sum_{i=1}^{N} 1\{y^{(i)} = 0\}}
p(y = 1) = \frac{\sum_{i=1}^{N} 1\{y^{(i)} = 1\}}{N}
Email Spam Filter (4)
To make a prediction on a new example x, we simply calculate:
p(y = 1 | x) = \frac{p(x | y = 1) p(y = 1)}{p(x)} = \frac{\left(\prod_{j=1}^{D} p(x_j | y = 1)\right) p(y = 1)}{\left(\prod_{j=1}^{D} p(x_j | y = 1)\right) p(y = 1) + \left(\prod_{j=1}^{D} p(x_j | y = 0)\right) p(y = 0)}
When the original, continuous-valued attributes are not well modeled by a multivariate normal distribution, discretizing the features and using naive Bayes (instead of GDA) will often result in a better classifier.
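Putting the estimation and prediction formulas together, here is a minimal naive Bayes sketch for the Bernoulli (word-presence) model. The tiny six-email corpus and three-word vocabulary are invented for illustration, and no smoothing is applied yet.

```python
import numpy as np

def fit_naive_bayes(X, y):
    """ML estimates from slide (3): phi1[j] = p(x_j = 1 | y = 1),
    phi0[j] = p(x_j = 1 | y = 0), prior = p(y = 1)."""
    phi1 = X[y == 1].mean(axis=0)
    phi0 = X[y == 0].mean(axis=0)
    prior = np.mean(y == 1)
    return phi1, phi0, prior

def predict_spam_prob(x, phi1, phi0, prior):
    """p(y = 1 | x) via Bayes' rule under the conditional-independence assumption."""
    like1 = np.prod(np.where(x == 1, phi1, 1 - phi1)) * prior
    like0 = np.prod(np.where(x == 1, phi0, 1 - phi0)) * (1 - prior)
    return like1 / (like1 + like0)

# Tiny made-up example: 6 emails, 3-word vocabulary ("buy", "meeting", "cheap").
X = np.array([[1, 0, 1],
              [1, 0, 0],
              [1, 0, 1],
              [0, 1, 0],
              [0, 1, 0],
              [0, 0, 0]])
y = np.array([1, 1, 1, 0, 0, 0])
phi1, phi0, prior = fit_naive_bayes(X, y)
print(predict_spam_prob(np.array([1, 0, 1]), phi1, phi0, prior))
```

Note that p(x_1 = 1 | y = 0) is exactly zero here, since "buy" never appears in a non-spam training email; a test email whose word pattern was never seen in either class would drive both numerator and denominator to zero, which is the problem Laplace smoothing addresses next. In practice one also sums log probabilities rather than multiplying, to avoid underflow when D is large.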
Laplace Smoothing
- If you haven't seen some word before in your finite training set, the classifier does not know how to make a decision (the estimated class-conditional probabilities for that word are zero, so both terms of the posterior vanish).
- To estimate the mean of a multinomial random variable x taking values in \{1, ..., k\}, given N observations \{x^{(1)}, ..., x^{(N)}\}, the maximum likelihood estimates are: p(x = j) = \frac{\sum_{i=1}^{N} 1\{x^{(i)} = j\}}{N}, \quad j = 1, ..., k
- Laplace smoothing: p(x = j) = \frac{\sum_{i=1}^{N} 1\{x^{(i)} = j\} + 1}{N + k}, \quad j = 1, ..., k
- \sum_{j=1}^{k} p(x = j) = 1 still holds, and p(x = j) \neq 0 for all j.
Laplace Smoothing (2)
Returning to the naive Bayes classifier, with Laplace smoothing we obtain the following estimates:
p(x_j = 1 | y = 1) = \frac{\sum_{i=1}^{N} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\} + 1}{\sum_{i=1}^{N} 1\{y^{(i)} = 1\} + 2}
p(x_j = 1 | y = 0) = \frac{\sum_{i=1}^{N} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\} + 1}{\sum_{i=1}^{N} 1\{y^{(i)} = 0\} + 2}
In practice, it usually doesn't matter whether we apply Laplace smoothing to p(y = 1) or not, since we typically have a fair fraction of both spam and non-spam emails.
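The same estimates with the add-one counts from this slide, reusing the invented six-email example from the earlier sketch; the function name is again a made-up convenience.

```python
import numpy as np

def fit_naive_bayes_smoothed(X, y):
    """Bernoulli naive Bayes estimates with Laplace (add-one) smoothing,
    matching the slide: (count + 1) / (class count + 2)."""
    n1 = np.sum(y == 1)
    n0 = np.sum(y == 0)
    phi1 = (X[y == 1].sum(axis=0) + 1) / (n1 + 2)
    phi0 = (X[y == 0].sum(axis=0) + 1) / (n0 + 2)
    prior = np.mean(y == 1)          # prior left unsmoothed, as the slide notes
    return phi1, phi0, prior

# Reusing the tiny 6-email example: no estimated probability is exactly 0 or 1 now.
X = np.array([[1, 0, 1], [1, 0, 0], [1, 0, 1], [0, 1, 0], [0, 1, 0], [0, 0, 0]])
y = np.array([1, 1, 1, 0, 0, 0])
print(fit_naive_bayes_smoothed(X, y))
```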
Event Models for Text Classification
- The naive Bayes model above uses the multivariate Bernoulli event model: the probability of an email is given by P(y) \prod_{j=1}^{D} p(x_j | y).
- A different way to represent emails: x = (x_1, ..., x_M), where x_j denotes the j-th word in the email, taking values in \{1, ..., |V|\}; V is the vocabulary and M is the length of the email.
- Multinomial event model: p(x_j | y) is now multinomial. The overall probability of an email is P(y) \prod_{j=1}^{M} p(x_j | y).
Event Models for Text Classification (2)
Given a training set \{x^{(i)}, y^{(i)}\}, i = 1, ..., N, where x^{(i)} = (x_1^{(i)}, ..., x_{M_i}^{(i)})^T (M_i is the number of words in the i-th training example), the maximum likelihood estimates of the model are:
p(x_j = k | y = 1) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{M_i} 1\{x_j^{(i)} = k \wedge y^{(i)} = 1\}}{\sum_{i=1}^{N} 1\{y^{(i)} = 1\} M_i}, \quad \text{for any } j
p(x_j = k | y = 0) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{M_i} 1\{x_j^{(i)} = k \wedge y^{(i)} = 0\}}{\sum_{i=1}^{N} 1\{y^{(i)} = 0\} M_i}, \quad \text{for any } j
P(y = 1) = \frac{\sum_{i=1}^{N} 1\{y^{(i)} = 1\}}{N}
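A sketch of the multinomial event model estimates, assuming each email is given as an array of word indices; the corpus, vocabulary size, and function name are made up for illustration, and no Laplace smoothing is shown, though the same add-one idea applies.

```python
import numpy as np

def fit_multinomial_event_model(docs, y, V):
    """ML estimates for the multinomial event model (no smoothing):
    p(word = k | y) = (occurrences of k in class-y emails) / (total words in class-y emails).
    `docs` is a list of 1-D integer arrays of word indices in {0, ..., V-1}."""
    counts = {0: np.zeros(V), 1: np.zeros(V)}
    totals = {0: 0, 1: 0}
    for words, label in zip(docs, y):
        counts[label] += np.bincount(words, minlength=V)
        totals[label] += len(words)
    phi1 = counts[1] / totals[1]
    phi0 = counts[0] / totals[0]
    prior = np.mean(np.asarray(y) == 1)
    return phi1, phi0, prior

# Made-up corpus with a 4-word vocabulary; integer indices stand in for tokens.
docs = [np.array([0, 0, 2]), np.array([0, 2, 2, 3]), np.array([1, 3]), np.array([1, 1, 3])]
y = [1, 1, 0, 0]
print(fit_multinomial_event_model(docs, y, V=4))
```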
Naive Bayes vs. Logistic Regression
Asymptotic comparison (number of training examples → ∞):
- When the model assumption is correct: naive Bayes (NB) and logistic regression (LR) produce identical classifiers.
- When the model assumption is incorrect: LR is less biased (it does not assume conditional independence) and is therefore expected to outperform NB.
Naive Bayes vs. Logistic Regression (2)
Non-asymptotic comparison (see [Ng and Jordan, 2002]):
- Convergence rate of parameter estimation: how many training examples are needed to obtain good estimates?
  - NB: order log D (where D = the number of attributes in X)
  - LR: order D
- NB converges more quickly to its asymptotic estimates.