Lecture 2 October 11
Introduction to probabilistic graphical models 2013/2014
Lecturer: Guillaume Obozinski    Scribes: Aymeric Reshef, Claire Vernade
Course webpage:

2.1 Single node models (last part)

The previous course introduced the notion of Maximum Likelihood Estimator (MLE). Basic examples on the Bernoulli model, the multinomial model and the Gaussian model were made explicit, and side notes detailed the use of Lagrange multipliers and of differentials. The last example used the multivariate Gaussian model. We recall it briefly in the next subsection.

2.1.1 The Multivariate Gaussian model

Let X be a random variable taking values in R^d, let µ ∈ R^d and let Σ ∈ R^{d×d} be a positive definite matrix. X follows a multivariate Gaussian model (denoted by X ∼ N(µ, Σ)) if

    p_{µ,Σ}(x) = 1 / ( (2π)^{d/2} √(det Σ) ) · exp( −½ (x − µ)^⊤ Σ^{−1} (x − µ) ).

Let X_1, …, X_n ∼ N(µ, Σ), iid. Then, the negative log-likelihood of the joint distribution is

    ℓ(µ, Σ) = − ∑_{i=1}^n log p_{µ,Σ}(x_i) = (nd/2) log(2π) + (n/2) log(det Σ) + ½ ∑_{i=1}^n (x_i − µ)^⊤ Σ^{−1} (x_i − µ).

Its gradient with respect to µ is given by

    ∇_µ ℓ(µ, Σ) = −Σ^{−1} ∑_{i=1}^n (x_i − µ) = −Σ^{−1} ( ∑_{i=1}^n x_i − nµ ) = −n Σ^{−1} (x̄ − µ),

which leads to µ̂ = x̄, the empirical mean. In order to compute the gradient with respect to Σ, we first write A = Σ^{−1}, so that

    ℓ(µ, A) = (nd/2) log(2π) − (n/2) log(det A) + ½ ∑_{i=1}^n (x_i − µ)^⊤ A (x_i − µ)
            = (nd/2) log(2π) − (n/2) log(det A) + (n/2) Tr(A Σ̃),
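The closed-form estimators µ̂ and Σ̂ derived here can be checked numerically. The following sketch (not part of the original notes; data and seed are arbitrary) computes the Gaussian MLE with numpy and compares the covariance against numpy's biased estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.multivariate_normal(mean=[1.0, -2.0, 0.5],
                            cov=np.diag([1.0, 2.0, 0.5]), size=n)

# MLE of the mean: the empirical mean x-bar.
mu_hat = X.mean(axis=0)

# MLE of the covariance: the empirical covariance with a 1/n factor
# (not the unbiased 1/(n-1) estimator).
centered = X - mu_hat
Sigma_hat = centered.T @ centered / n

# Sanity check against numpy's biased covariance estimator.
assert np.allclose(Sigma_hat, np.cov(X.T, bias=True))
```

Note the 1/n factor: the MLE is the biased empirical covariance, which is why `bias=True` is needed in the check.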
where we introduced the empirical covariance matrix Σ̃ defined as

    Σ̃ = (1/n) ∑_{i=1}^n (x_i − µ)(x_i − µ)^⊤.

The matrix A appears in the expression of the log-likelihood in two terms: −(n/2) log det A and (n/2) Tr(A Σ̃). Denote by f(A) = Tr(A Σ̃). Then f(A + H) − f(A) = Tr(H Σ̃), which leads to ∇f(A) = Σ̃. Now, write log det A as

    log det(A + H) = log det( A^{1/2} ( I + A^{−1/2} H A^{−1/2} ) A^{1/2} ) = log det A + log det(I + H̃),

where A^{1/2} stands for the square root matrix of A (it exists, since A is positive definite) and H̃ = A^{−1/2} H A^{−1/2}. Let us see what log det(I + H̃) looks like. Noting that log det I = 0, and denoting by (λ_1, …, λ_d) the eigenvalues of H̃, we have that

    log det(I + H̃) = log det(I + H̃) − log det I = ∑_{j=1}^d log(1 + λ_j) = ∑_{j=1}^d λ_j + o(‖H̃‖).

But then,

    ∑_{j=1}^d λ_j = Tr(H̃) = Tr(A^{−1/2} H A^{−1/2}) = Tr(H A^{−1}).

We conclude that ∇_A log det A = A^{−1}. Plugging these results into the gradient of the log-likelihood with respect to A, we have

    ∇_A ℓ(A) = −(n/2) A^{−1} + (n/2) Σ̃.

The optimality condition ∇_A ℓ(A) = 0 leads to A^{−1} = Σ̃, which means that

    Σ̂ = (1/n) ∑_{i=1}^n (x_i − µ̂)(x_i − µ̂)^⊤

is the empirical covariance matrix. Note that we assumed that A was invertible, which is an implicit condition when writing log det A. This implies that in a rigorous sense the maximum likelihood estimator is undefined when Σ̃ is not invertible. In practice, the MLE is extended by continuity to the rank-deficient case.

2.2 Models with two nodes

In this section, we work with two nodes: one node corresponds to an input X, and one node corresponds to an output Y. Recall that when dealing with two random variables X and Y, one can use a generative model, i.e. one which models the joint distribution p(X, Y), or one can instead use a conditional model (often considered equivalent to the slightly different concept of discriminative model), which models the conditional probability of the output given the input, p(Y | X). The two following models, linear regression and logistic regression, are conditional models.
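The two matrix gradients used above, ∇_A Tr(A Σ̃) = Σ̃ and ∇_A log det A = A^{−1} (for symmetric A), can be verified with finite differences. A quick numerical check (not part of the original notes; the matrices are arbitrary test data):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
B = rng.standard_normal((d, d))
A = B @ B.T + d * np.eye(d)   # symmetric positive definite, well conditioned
S = rng.standard_normal((d, d))
S = (S + S.T) / 2             # symmetric, plays the role of Sigma-tilde

eps = 1e-6
grad_logdet = np.zeros((d, d))
grad_trace = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        H = np.zeros((d, d))
        H[i, j] = eps
        # forward difference of log det at A in direction e_i e_j^T
        grad_logdet[i, j] = (np.linalg.slogdet(A + H)[1]
                             - np.linalg.slogdet(A)[1]) / eps
        # Tr(A S) is linear in A, so this difference is exact up to rounding
        grad_trace[i, j] = (np.trace((A + H) @ S) - np.trace(A @ S)) / eps

# For a general matrix the gradient of log det is A^{-T}; A is symmetric here.
assert np.allclose(grad_logdet, np.linalg.inv(A), atol=1e-4)
assert np.allclose(grad_trace, S, atol=1e-4)
```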
2.2.1 Linear regression

Let us assume that Y ∈ R depends linearly on X ∈ R^p. Let w ∈ R^p be a weighting vector and σ² > 0. We make the following assumption:

    Y | X ∼ N(w^⊤ X, σ²),

which can be rewritten as

    Y = w^⊤ X + ɛ, with ɛ ∼ N(0, σ²).

Note that if there is an offset w_0 ∈ R, that is, if Y = w^⊤ X + w_0 + ɛ, one can always redefine a weighting vector w̃ ∈ R^{p+1} such that

    Y = w̃^⊤ (x, 1)^⊤ + ɛ.

Let D = {(x_1, y_1), …, (x_n, y_n)} be a training set of i.i.d. random variables. Each y_i is a label (a decision) on observation x_i. We consider the conditional distribution of all outputs given all inputs, which is a product of terms because of the independence of the pairs forming the training set:

    p(y_1, …, y_n | x_1, …, x_n; w, σ²) = ∏_{i=1}^n p(y_i | x_i; w, σ²).

The associated negative log-likelihood has the following expression:

    ℓ(w, σ²) = − ∑_{i=1}^n log p(y_i | x_i) = (n/2) log(2πσ²) + ½ ∑_{i=1}^n (y_i − w^⊤ x_i)² / σ².

The minimization problem with respect to w can now be reformulated as:

    find ŵ = arg min_w ½ ∑_{i=1}^n (y_i − w^⊤ x_i)².

Define the so-called design matrix X as the matrix whose rows are the observations,

    X = (x_1^⊤ ; … ; x_n^⊤) ∈ R^{n×p},

and denote by y the vector of coordinates (y_1, …, y_n). The minimization problem over w can be rewritten in a more compact way as:

    find ŵ = arg min_w ½ ‖y − Xw‖².

Let f : w ↦ ½ ‖y − Xw‖² = ½ ( y^⊤ y − 2 w^⊤ X^⊤ y + w^⊤ X^⊤ X w ). f is strictly convex if and only if its Hessian matrix X^⊤X is invertible. This is never the case when n < p (in this case, we deal with underdetermined problems). Most of the time, the Hessian matrix is invertible when n ≥ p. When this is not the case, we often use the Tikhonov regularization, which adds
a penalization of the ℓ₂-norm of w by minimizing f(w) + λ‖w‖² with some hyperparameter λ > 0.

The gradient of f is ∇f(w) = X^⊤(Xw − y), and

    ∇f(w) = 0 ⟺ X^⊤ X w = X^⊤ y.

The equation X^⊤Xw = X^⊤y is known as the normal equation. If X^⊤X is invertible, then the optimal weighting vector is

    ŵ = (X^⊤X)^{−1} X^⊤ y = X^† y,

where X^† = (X^⊤X)^{−1} X^⊤ is the Moore-Penrose pseudo-inverse of X. If X^⊤X is not invertible, the solution is not unique anymore, and for any h ∈ ker(X), ŵ = X^† y + h is an admissible solution. In that case however it would be necessary to use regularization. The computational cost to evaluate the optimal weighting vector from X and y is O(p³) once X^⊤X has been formed (use a Cholesky decomposition of the matrix X^⊤X and solve two triangular systems).

Now, let us differentiate ℓ(w, σ²) with respect to σ²: we have

    ∂ℓ/∂σ² (w, σ²) = n/(2σ²) − 1/(2σ⁴) ∑_{i=1}^n (y_i − w^⊤ x_i)².

Setting ∂ℓ/∂σ² (w, σ²) to zero gives

    σ̂² = (1/n) ∑_{i=1}^n (y_i − w^⊤ x_i)².

In practice, whenever we use a data matrix X in machine learning, we first preprocess it so that it is not too badly conditioned, in order to avoid numerical issues. Two main operations are applied columnwise: first, a centering (remove the mean of the coefficients), and then a normalization (divide the coefficients of a column by the standard deviation of the column vector). Note that this preprocessing *does not guarantee* that the matrix we obtain is well-conditioned: in particular, it can be low rank.

2.2.2 Logistic regression

Let X ∈ R^p, Y ∈ {0, 1}. We assume that Y follows a Bernoulli distribution with parameter θ. The problem is to find θ. Let us define the sigmoid function σ, defined on the real axis and taking values in [0, 1], such that

    ∀z ∈ R, σ(z) = 1 / (1 + e^{−z}).

The sigmoid function is plotted in Figure 2.1. One can easily prove that

    ∀z ∈ R, σ(−z) = 1 − σ(z),
    ∀z ∈ R, σ′(z) = σ(z)(1 − σ(z)) = σ(z) σ(−z).
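The normal-equation solve and the noise-variance MLE for linear regression can be sketched as follows (illustrative code with synthetic data, not from the notes; note that `np.linalg.solve` does not exploit the triangular structure of the Cholesky factors, a dedicated triangular solver would):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 5
X = rng.standard_normal((n, p))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Normal equation X^T X w = X^T y, solved via a Cholesky factorization
# X^T X = L L^T followed by two triangular solves (O(p^3) once X^T X is formed).
G = X.T @ X
L = np.linalg.cholesky(G)
z = np.linalg.solve(L, X.T @ y)    # forward solve:  L z = X^T y
w_hat = np.linalg.solve(L.T, z)    # backward solve: L^T w = z

# MLE of the noise variance: the mean squared residual.
sigma2_hat = np.mean((y - X @ w_hat) ** 2)

# Agrees with numpy's dedicated least-squares solver.
assert np.allclose(w_hat, np.linalg.lstsq(X, y, rcond=None)[0])
```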
[Figure 2.1: the sigmoid function σ(x).]

We now assume that, for a given observation X = x, the output Y | X = x follows a Bernoulli law with parameter θ = σ(w^⊤x), where w is again a weighting vector. In practice, we can still add an offset: w^⊤x + w_0. Then, the conditional distribution is given by

    p(Y = y | X = x) = θ^y (1 − θ)^{1−y} = σ(w^⊤x)^y σ(−w^⊤x)^{1−y}.

Given a training set D = {(x_1, y_1), …, (x_n, y_n)} of iid random variables, we can compute the log-likelihood

    ℓ(w) = ∑_{i=1}^n [ y_i log σ(w^⊤x_i) + (1 − y_i) log σ(−w^⊤x_i) ].

In order to maximize the log-likelihood (equivalently, to minimize −ℓ, which is convex since z ↦ log(1 + e^{−z}) is a convex function and w ↦ w^⊤x_i is linear), we calculate its gradient. We write η_i = σ(w^⊤x_i):

    ∇_w ℓ(w) = ∑_{i=1}^n [ y_i x_i σ(w^⊤x_i) σ(−w^⊤x_i) / σ(w^⊤x_i) − (1 − y_i) x_i σ(w^⊤x_i) σ(−w^⊤x_i) / σ(−w^⊤x_i) ]
             = ∑_{i=1}^n x_i (y_i − η_i).

Thus, ∇_w ℓ(w) = 0 ⟺ ∑_{i=1}^n x_i (y_i − σ(w^⊤x_i)) = 0. This equation is nonlinear and we need an iterative optimization method to solve it. For this purpose, we derive the Hessian matrix of ℓ:

    Hℓ(w) = − ∑_{i=1}^n σ′(w^⊤x_i) x_i x_i^⊤ = − ∑_{i=1}^n η_i (1 − η_i) x_i x_i^⊤ = − X^⊤ Diag(η_i (1 − η_i)) X,

where X is the design matrix defined previously. In the following we discuss first- and second-order optimization methods and apply them to logistic regression.
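The gradient formula ∇_w ℓ(w) = ∑_i x_i (y_i − η_i) can be checked against a finite-difference approximation of the log-likelihood. A small sketch (synthetic data, not part of the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
n, p = 100, 3
X = rng.standard_normal((n, p))
y = (rng.random(n) < 0.5).astype(float)
w = rng.standard_normal(p)

def loglik(w):
    z = X @ w
    # l(w) = sum_i y_i log sigma(w.x_i) + (1 - y_i) log sigma(-w.x_i)
    return np.sum(y * np.log(sigmoid(z)) + (1 - y) * np.log(sigmoid(-z)))

eta = sigmoid(X @ w)
grad = X.T @ (y - eta)                      # sum_i x_i (y_i - eta_i)
hess = -X.T @ np.diag(eta * (1 - eta)) @ X  # -X^T Diag(eta_i (1 - eta_i)) X

# Central-difference check of the gradient formula, coordinate by coordinate.
eps = 1e-6
fd = np.array([(loglik(w + eps * e) - loglik(w - eps * e)) / (2 * eps)
               for e in np.eye(p)])
assert np.allclose(grad, fd, atol=1e-4)
```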
First-order methods

Let f : R^p → R be the convex C¹ function that we want to minimize. A descent direction at point x is a vector d such that ⟨d, ∇f(x)⟩ < 0. The minimization of f can be done by applying a descent algorithm, which iteratively takes a step in a descent direction, leading to an iterative scheme of the form

    x^{(k+1)} = x^{(k)} + ε^{(k)} d^{(k)},

where ε^{(k)} is the stepsize. The direction d^{(k)} is often chosen as the opposite of the gradient of f at point x^{(k)}: d^{(k)} = −∇f(x^{(k)}). There are several choices for ε^{(k)}:

1. Constant step: ε^{(k)} = ε. But the scheme does not necessarily converge.
2. Decreasing step size: ε^{(k)} ∝ 1/k, with ∑_k ε^{(k)} = ∞ and ∑_k (ε^{(k)})² < ∞. The scheme is guaranteed to converge.
3. One can determine ε^{(k)} by doing a line search, which tries to find min_ε f(x^{(k)} + ε d^{(k)}): either exactly (but this is costly and rather useless in many situations), or approximately (see the Armijo line search). This is a better method.

Second-order methods

This time, let f : R^p → R be the C² function that we want to minimize. We write the second-order Taylor expansion of f:

    f(x) = f(x^t) + (x − x^t)^⊤ ∇f(x^t) + ½ (x − x^t)^⊤ Hf(x^t) (x − x^t) + o(‖x − x^t‖²) =: g_t(x) + o(‖x − x^t‖²).

A local optimum x* is then reached when

    ∇f(x*) = 0 and Hf(x*) ⪰ 0.

In order to solve such a problem, we are going to use Newton's method. If f is a convex function, then ∇g_t(x) = ∇f(x^t) + Hf(x^t)(x − x^t) and we only need to find x so that ∇g_t(x) = 0, i.e. we set

    x^{t+1} = x^t − [Hf(x^t)]^{−1} ∇f(x^t).

If the Hessian matrix is not invertible, we can regularize the problem and minimize g_t(x) + λ‖x − x^t‖² instead. In general the previous update, called the pure Newton step, does not lead to a convergent algorithm, even if the function is convex! In general it is necessary to use the so-called damped Newton method to obtain a convergent algorithm, which consists in doing the following iterations:

    x^{t+1} = x^t − ε_t [Hf(x^t)]^{−1} ∇f(x^t),

where ε_t is set with the Armijo line search.
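The descent scheme with an approximate (Armijo backtracking) line search can be sketched as follows. This is an illustrative gradient-descent implementation on a small convex quadratic, not from the notes; the constants `beta` and `c` are conventional choices:

```python
import numpy as np

def armijo_step(f, grad_f, x, d, eps0=1.0, beta=0.5, c=0.1):
    """Backtracking (Armijo) line search: shrink the step until the
    sufficient-decrease condition f(x + eps d) <= f(x) + c eps <grad f(x), d>
    holds. d must be a descent direction, so <grad f(x), d> < 0."""
    eps = eps0
    fx, g = f(x), grad_f(x)
    while f(x + eps * d) > fx + c * eps * (g @ d):
        eps *= beta
    return eps

# Minimize the strictly convex quadratic f(x) = 1/2 x^T A x - b^T x
# by gradient descent with Armijo line search; the minimizer solves A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b

x = np.zeros(2)
for _ in range(200):
    d = -grad_f(x)          # steepest-descent direction: <d, grad f(x)> < 0
    if np.linalg.norm(d) < 1e-10:
        break
    x = x + armijo_step(f, grad_f, x, d) * d

assert np.allclose(x, np.linalg.solve(A, b), atol=1e-5)
```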
This method may be computationally costly in high dimension because of the inverse of the Hessian matrix that needs to be computed at each iteration. For some functions, however, the pure Newton method does converge. This is the case for logistic regression. In the context of non-convex optimization, the situation is more complicated because the Hessian can have negative eigenvalues. In that case, so-called trust-region methods are typically used.

Application to logistic regression

We will write the form that Newton's algorithm takes for logistic regression. We had:

    ℓ(w) = ∑_{i=1}^n [ y_i log σ(w^⊤x_i) + (1 − y_i) log σ(−w^⊤x_i) ],
    ∇_w ℓ(w) = ∑_{i=1}^n x_i (y_i − η_i) = X^⊤ (y − η),
    Hℓ(w) = − X^⊤ Diag(η_i (1 − η_i)) X.

The second-order Taylor expansion of the loss function is

    ℓ(w) ≈ ℓ(w^t) + (w − w^t)^⊤ ∇ℓ(w^t) + ½ (w − w^t)^⊤ Hℓ(w^t) (w − w^t).

Let us set h = w − w^t. Maximizing this quadratic model of ℓ amounts to the minimization problem

    min_h { − h^⊤ X^⊤ (y − η) + ½ h^⊤ X^⊤ Diag(η(1 − η)) X h }.

This leads, according to the previous part, to set

    w^{t+1} = w^t − Hℓ(w^t)^{−1} ∇_w ℓ(w^t) = w^t + ( X^⊤ Diag(η(1 − η)) X )^{−1} X^⊤ (y − η).

The minimization problem above can be seen as a weighted linear regression over h, of the form

    min_h ∑_{i=1}^n (ỹ_i − x_i^⊤ h)² / σ̃_i²,    where ỹ_i = σ̃_i² (y_i − η_i) and σ̃_i² = [η_i (1 − η_i)]^{−1}.

Thus, this method is often referred to as the iteratively reweighted least squares (IRLS) algorithm.

We may run into a classification problem with more than two classes: Y ∈ {1, …, K}, with Y ∼ M(1, π_1(x), …, π_K(x)). We will need to define a rule over the classifiers (softmax function, one-versus-all, etc.) in order to make a decision.

2.2.3 Generative models

This section briefly presents the Fisher linear discriminant, also known as linear discriminant analysis. Suppose that we have X ∈ R^p and Y ∈ {0, 1}. By Bayes' rule,

    P(Y = 1 | X = x) = P(X = x | Y = 1) P(Y = 1) / [ P(X = x | Y = 1) P(Y = 1) + P(X = x | Y = 0) P(Y = 0) ].

The assumption then consists in considering P(X = x | Y = 1) = N(x; µ_1, Σ_1) and P(X = x | Y = 0) = N(x; µ_0, Σ_0). Fisher's assumption is the assumption that Σ_1 = Σ_0 = Σ.
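The Newton/IRLS update for logistic regression can be sketched compactly on synthetic data (illustrative code, not from the notes; the dataset and number of iterations are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls_logistic(X, y, n_iter=20):
    """Newton / IRLS updates for logistic regression:
    w <- w + (X^T D X)^{-1} X^T (y - eta), with D = Diag(eta_i (1 - eta_i))."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        eta = sigmoid(X @ w)
        D = eta * (1 - eta)
        # Solve the linear system instead of forming the inverse explicitly.
        w = w + np.linalg.solve(X.T @ (D[:, None] * X), X.T @ (y - eta))
    return w

rng = np.random.default_rng(4)
n, p = 500, 3
X = rng.standard_normal((n, p))
w_true = np.array([1.5, -1.0, 0.5])
y = (rng.random(n) < sigmoid(X @ w_true)).astype(float)

w_hat = irls_logistic(X, y)
# At the optimum, the score equations sum_i x_i (y_i - eta_i) = 0 hold.
eta = sigmoid(X @ w_hat)
assert np.max(np.abs(X.T @ (y - eta))) < 1e-6
```

On non-separable data like this, the pure Newton iteration converges rapidly, consistent with the remark above; on linearly separable data the MLE does not exist and the iterates diverge, so a regularizer or step-size control would be needed.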
2.3 Unsupervised classification

Unsupervised learning consists in finding a label prediction function based on unlabeled training data only. In the case where the learning problem is a classification problem, and under the assumption that the classes form clusters in input space, the problem reduces to a clustering problem, which consists in finding groups of points that form dense clusters. When the clusters are assumed to be isotropic, the formulation of the K-means algorithm is appropriate.

The K-means algorithm

We start from a set of data points (x_1, …, x_n) (where x_i ∈ R^p) that are unlabeled. We wish to divide this set into K clusters defined by their centroids (µ_1, …, µ_K). The problem can be formulated as:

    min_{µ_1, …, µ_K} ∑_{i=1}^n min_k ‖x_i − µ_k‖².

The minimization step inside the summation leads to a nonconvex problem. The K-means algorithm is a greedy algorithm which consists in iteratively applying two steps:

    C_k ← { i : ‖x_i − µ_k‖² = min_j ‖x_i − µ_j‖² },
    µ_k ← (1/|C_k|) ∑_{i ∈ C_k} x_i.

The first step defines the clusters C_k by assigning each data point to its closest centroid. The second step then updates the centroids given the new clusters. Two remarks:

- It can be shown that K-means converges in a finite number of steps.
- The algorithm however typically gets stuck in local minima, and in practice it is necessary to try several restarts of the algorithm with random initializations to have a chance of obtaining a better solution.
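The two alternating steps above translate directly into code. A minimal sketch on synthetic isotropic clusters (not part of the notes; initialization by sampling K data points is one common choice among several):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means: alternate assigning each point to its closest centroid
    and updating each centroid as the mean of its cluster."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared distances to every centroid, then argmin.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: mean of each cluster (empty clusters keep their centroid).
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centroids[k] for k in range(K)])
        if np.allclose(new, centroids):
            break  # fixed point reached: K-means converges in finitely many steps
        centroids = new
    return centroids, labels

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc, 0.2, size=(50, 2))
               for loc in ([0, 0], [3, 3], [0, 3])])
centroids, labels = kmeans(X, K=3)
assert centroids.shape == (3, 2)
```

A single run like this can land in a local minimum (for instance when two initial centroids fall in the same cluster), which is exactly why the notes recommend several random restarts, keeping the run with the lowest objective.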
THE KALMAN FILTER RAUL ROJAS Abstract. This paper provides a getle itroductio to the Kalma filter, a umerical method that ca be used for sesor fusio or for calculatio of trajectories. First, we cosider
More informationOn Nonsingularity of Saddle Point Matrices. with Vectors of Ones
Iteratioal Joural of Algebra, Vol. 2, 2008, o. 4, 197-204 O Nosigularity of Saddle Poit Matrices with Vectors of Oes Tadeusz Ostrowski Istitute of Maagemet The State Vocatioal Uiversity -400 Gorzów, Polad
More informationFeedback in Iterative Algorithms
Feedback i Iterative Algorithms Charles Byre (Charles Byre@uml.edu), Departmet of Mathematical Scieces, Uiversity of Massachusetts Lowell, Lowell, MA 01854 October 17, 2005 Abstract Whe the oegative system
More informationLecture 20: Multivariate convergence and the Central Limit Theorem
Lecture 20: Multivariate covergece ad the Cetral Limit Theorem Covergece i distributio for radom vectors Let Z,Z 1,Z 2,... be radom vectors o R k. If the cdf of Z is cotiuous, the we ca defie covergece
More information(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3
MATH 337 Sequeces Dr. Neal, WKU Let X be a metric space with distace fuctio d. We shall defie the geeral cocept of sequece ad limit i a metric space, the apply the results i particular to some special
More informationBayesian Methods: Introduction to Multi-parameter Models
Bayesia Methods: Itroductio to Multi-parameter Models Parameter: θ = ( θ, θ) Give Likelihood p(y θ) ad prior p(θ ), the posterior p proportioal to p(y θ) x p(θ ) Margial posterior ( θ, θ y) is Iterested
More informationRegularization methods for large scale machine learning
Regularizatio methods for large scale machie learig Lorezo Rosasco March 7, 2017 Abstract After recallig a iverse problems perspective o supervised learig, we discuss regularizatio methods for large scale
More informationDefinitions and Theorems. where x are the decision variables. c, b, and a are constant coefficients.
Defiitios ad Theorems Remember the scalar form of the liear programmig problem, Miimize, Subject to, f(x) = c i x i a 1i x i = b 1 a mi x i = b m x i 0 i = 1,2,, where x are the decisio variables. c, b,
More informationb i u x i U a i j u x i u x j
M ath 5 2 7 Fall 2 0 0 9 L ecture 1 9 N ov. 1 6, 2 0 0 9 ) S ecod- Order Elliptic Equatios: Weak S olutios 1. Defiitios. I this ad the followig two lectures we will study the boudary value problem Here
More informationLecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting
Lecture 6 Chi Square Distributio (χ ) ad Least Squares Fittig Chi Square Distributio (χ ) Suppose: We have a set of measuremets {x 1, x, x }. We kow the true value of each x i (x t1, x t, x t ). We would
More informationThe log-behavior of n p(n) and n p(n)/n
Ramauja J. 44 017, 81-99 The log-behavior of p ad p/ William Y.C. Che 1 ad Ke Y. Zheg 1 Ceter for Applied Mathematics Tiaji Uiversity Tiaji 0007, P. R. Chia Ceter for Combiatorics, LPMC Nakai Uivercity
More informationTable 12.1: Contingency table. Feature b. 1 N 11 N 12 N 1b 2 N 21 N 22 N 2b. ... a N a1 N a2 N ab
Sectio 12 Tests of idepedece ad homogeeity I this lecture we will cosider a situatio whe our observatios are classified by two differet features ad we would like to test if these features are idepedet
More informationSequences A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence
Sequeces A sequece of umbers is a fuctio whose domai is the positive itegers. We ca see that the sequece 1, 1, 2, 2, 3, 3,... is a fuctio from the positive itegers whe we write the first sequece elemet
More informationSection 14. Simple linear regression.
Sectio 14 Simple liear regressio. Let us look at the cigarette dataset from [1] (available to dowload from joural s website) ad []. The cigarette dataset cotais measuremets of tar, icotie, weight ad carbo
More informationSingular Continuous Measures by Michael Pejic 5/14/10
Sigular Cotiuous Measures by Michael Peic 5/4/0 Prelimiaries Give a set X, a σ-algebra o X is a collectio of subsets of X that cotais X ad ad is closed uder complemetatio ad coutable uios hece, coutable
More informationEfficient GMM LECTURE 12 GMM II
DECEMBER 1 010 LECTURE 1 II Efficiet The estimator depeds o the choice of the weight matrix A. The efficiet estimator is the oe that has the smallest asymptotic variace amog all estimators defied by differet
More informationTR/46 OCTOBER THE ZEROS OF PARTIAL SUMS OF A MACLAURIN EXPANSION A. TALBOT
TR/46 OCTOBER 974 THE ZEROS OF PARTIAL SUMS OF A MACLAURIN EXPANSION by A. TALBOT .. Itroductio. A problem i approximatio theory o which I have recetly worked [] required for its solutio a proof that the
More informationSummary. Recap ... Last Lecture. Summary. Theorem
Last Lecture Biostatistics 602 - Statistical Iferece Lecture 23 Hyu Mi Kag April 11th, 2013 What is p-value? What is the advatage of p-value compared to hypothesis testig procedure with size α? How ca
More informationLecture 10: Universal coding and prediction
0-704: Iformatio Processig ad Learig Sprig 0 Lecture 0: Uiversal codig ad predictio Lecturer: Aarti Sigh Scribes: Georg M. Goerg Disclaimer: These otes have ot bee subjected to the usual scrutiy reserved
More informationMixtures of Gaussians and the EM Algorithm
Mixtures of Gaussias ad the EM Algorithm CSE 6363 Machie Learig Vassilis Athitsos Computer Sciece ad Egieerig Departmet Uiversity of Texas at Arligto 1 Gaussias A popular way to estimate probability desity
More informationA Risk Comparison of Ordinary Least Squares vs Ridge Regression
Joural of Machie Learig Research 14 (2013) 1505-1511 Submitted 5/12; Revised 3/13; Published 6/13 A Risk Compariso of Ordiary Least Squares vs Ridge Regressio Paramveer S. Dhillo Departmet of Computer
More informationRademacher Complexity
EECS 598: Statistical Learig Theory, Witer 204 Topic 0 Rademacher Complexity Lecturer: Clayto Scott Scribe: Ya Deg, Kevi Moo Disclaimer: These otes have ot bee subjected to the usual scrutiy reserved for
More informationChapter 3. Strong convergence. 3.1 Definition of almost sure convergence
Chapter 3 Strog covergece As poited out i the Chapter 2, there are multiple ways to defie the otio of covergece of a sequece of radom variables. That chapter defied covergece i probability, covergece i
More information