Probabilistic Learning


Statistical Machine Learning Notes 11
Instructor: Justin Domke

Probabilistic Learning

Contents

1 Introduction
2 Maximum Likelihood
3 Examples of Maximum Likelihood
  3.1 Binomial
  3.2 Uniform Distribution
  3.3 Univariate Gaussian
  3.4 Multivariate Gaussian
  3.5 Spherical Multivariate Gaussian
4 Properties of Maximum Likelihood
  4.1 Maximum Likelihood is Consistent
  4.2 Maximum Likelihood is Equivariant
  4.3 Maximum Likelihood is Efficient
  4.4 Maximum Likelihood Assumes a Whole Lot
  4.5 Maximum Likelihood is Empirical Risk Minimization of the KL-divergence
5 Bayesian Methods

1 Introduction

Almost all of our methods for learning have been based on the notions of risk and loss. We have worked by picking some class of functions f(x) mapping from inputs to outputs. We quantified how we wanted that function to behave in terms of the true risk

    R_true(f) = E_{p_0}[ L(f(x), y) ] = ∫∫ p_0(x, y) L(f(x), y) dx dy,    (1.1)

where p_0 is the true (unknown) distribution. Then we approximated this by an empirical risk, fit the function f, and we were done.

Gazing at Eq. 1.1, however, another possible strategy comes to mind. Namely, why don't we approximate p_0 with some function p? Then, for a specific input x, we can pick the best guess y' by solving

    min_{y'} ∫ p(x, y) L(y', y) dy.

Then we could do everything exactly! The fundamental question here is: when should we apply the loss function? In the traditional strategy, we apply it at training time: the predictor f(x) is fit to give the best possible performance, with the loss baked in. Now, we apply the loss function only at test time. Notice that we could even change loss functions on the fly. We could also flip things around: if we suddenly decide we would rather predict x from y, we can do that too.

This may seem very attractive. As we will see, however, there is a price to be paid for this generality. We postpone discussion of the tradeoffs until later. The immediate question is more basic: how should we fit p?

(Note: when fitting p(x, y), x and y are on an even footing. Thus, for simplicity, we will usually write the variables together as a single vector x.)
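To make the "loss only at test time" idea concrete, here is a minimal sketch (my own illustration, not part of the notes; the joint distribution and the losses are made-up numbers). Given a fitted p(x, y) over a small discrete space, the best guess for a given x is whichever y' minimizes the expected loss, and the same fitted p supports different losses at prediction time:

    import numpy as np

    # Hypothetical fitted joint distribution p(x, y) over x in {0, 1, 2} and y in {0, 1}.
    p = np.array([[0.10, 0.05],
                  [0.20, 0.25],
                  [0.15, 0.25]])

    def best_guess(x, loss):
        # Pick y' minimizing sum_y p(x, y) * loss(y', y).
        expected = [sum(p[x, y] * loss(yp, y) for y in range(p.shape[1]))
                    for yp in range(p.shape[1])]
        return int(np.argmin(expected))

    zero_one = lambda yp, y: float(yp != y)
    # A made-up asymmetric loss: predicting 1 when the truth is 0 is five times worse.
    asymmetric = lambda yp, y: 5.0 * (yp == 1 and y == 0) + 1.0 * (yp == 0 and y == 1)

    print(best_guess(x=2, loss=zero_one))     # 1: the more probable y given x = 2
    print(best_guess(x=2, loss=asymmetric))   # 0: the cautious choice under this loss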

2 Maximum Likelihood

There have been many methods proposed for fitting distributions. In this class, we will focus on the maximum likelihood method. Suppose that we are fitting a distribution p(x; θ), parametrized by some vector θ. Where does this distribution come from? You pick it. How do you pick it? We will come back to that!

Let the data be a set of vectors {x̂}. The log-likelihood is

    l(θ) = Σ_x̂ log p(x̂; θ).

The maximum likelihood method, surprisingly enough, consists of picking θ to maximize the likelihood:

    θ^* = arg max_θ l(θ).

This method has some nice properties, but before worrying about them, let's try some examples. You may have seen these before in a statistics class.

3 Examples of Maximum Likelihood

3.1 Binomial

A binomial distribution is a distribution over a binary variable x ∈ {0, 1}, given by

    p(x; θ) = θ^x (1 − θ)^(1 − x).

Given some training data, we can calculate

    l(θ) = Σ_x̂ log p(x̂; θ) = Σ_x̂ ( x̂ log θ + (1 − x̂) log(1 − θ) ).

Now, we can maximize this by setting the derivative with respect to θ to zero. We have

    dl/dθ = 0 = Σ_x̂ ( x̂ (1/θ) − (1 − x̂) (1/(1 − θ)) ) = #[x̂ = 1] (1/θ) − #[x̂ = 0] (1/(1 − θ)),

where #[x̂ = 1] is the number of points in the training data with x̂ = 1. This equation is solved by

    θ = #[x̂ = 1] / ( #[x̂ = 1] + #[x̂ = 0] ).

Thus, the maximum likelihood estimate is that the binomial distribution has the same probability of being 1 as in the training data. This is quite intuitive.
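This closed-form result is easy to sanity-check numerically. A small sketch (mine, with toy data) comparing the closed form against a grid search over the log-likelihood:

    import numpy as np

    x = np.array([1, 0, 1, 1, 0, 1, 0, 1])           # toy binary data: five 1s, three 0s

    theta_closed = x.mean()                           # #[x=1] / (#[x=1] + #[x=0]) = 0.625

    thetas = np.linspace(0.001, 0.999, 999)
    loglik = [np.sum(x * np.log(t) + (1 - x) * np.log(1 - t)) for t in thetas]
    theta_grid = thetas[np.argmax(loglik)]

    print(theta_closed, theta_grid)                   # both 0.625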

3.2 Uniform Distribution

Consider the distribution uniform on 0 to θ,

    p(x; θ) = (1/θ) I[0 ≤ x ≤ θ].

The log-likelihood is

    l(θ) = Σ_x̂ log p(x̂; θ).

If θ is less than some value x̂, then the probability of that point is zero, and we can think of the log-likelihood as being −∞. On the other hand, suppose that θ ≥ x̂ for all x̂. Then we have that

    l(θ) = Σ_x̂ log(1/θ) = −Σ_x̂ log θ.

This is decreasing in θ, so the likelihood is maximized by making θ as small as possible while still covering all of the data, i.e. by setting θ = max_x̂ x̂.
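A quick numerical sketch (again mine, with toy data assumed to come from U(0, θ)) confirming that the largest observation maximizes the likelihood, while any smaller θ is impossible:

    import numpy as np

    x = np.array([0.3, 1.7, 0.9, 2.4, 1.1])           # toy data, assumed drawn from U(0, theta)

    theta_mle = x.max()                                # the smallest theta that covers all points

    def loglik(theta):
        # -inf whenever some point falls outside [0, theta]
        return -len(x) * np.log(theta) if np.all(x <= theta) else -np.inf

    print(theta_mle)                                   # 2.4
    print(loglik(2.4), loglik(3.0), loglik(2.3))       # 2.4 beats 3.0; 2.3 is impossible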

3.3 Univariate Gaussian

A univariate Gaussian distribution is defined by

    p(x; µ, σ^2) = (1/√(2πσ^2)) exp( −(1/2) (x − µ)^2 / σ^2 ).

Through a bunch of manipulation, we can take the logarithm of this, and then the derivatives of the logarithm:

    log p(x; µ, σ^2) = −(1/2) (x − µ)^2 / σ^2 − (1/2) log σ^2 − (1/2) log 2π

    ∂/∂µ log p(x; µ, σ^2) = (x − µ) / σ^2

    ∂/∂σ^2 log p(x; µ, σ^2) = (x − µ)^2 / (2 (σ^2)^2) − 1 / (2σ^2).

Now, we want to do maximum likelihood estimation. That is, we need to solve

    max_{µ, σ^2} l(µ, σ^2).

We can do this by solving the two equations

    ∂/∂µ l(µ, σ^2) = Σ_x̂ ∂/∂µ log p(x̂; µ, σ^2) = 0

    ∂/∂σ^2 l(µ, σ^2) = Σ_x̂ ∂/∂σ^2 log p(x̂; µ, σ^2) = 0.

From the first condition, it is easy to see that

    µ = mean_x̂ x̂.

From the second condition, we can then find

    0 = Σ_x̂ ( (x̂ − µ)^2 / (2 (σ^2)^2) − 1 / (2σ^2) ) = (1 / (2 (σ^2)^2)) Σ_x̂ ( (x̂ − µ)^2 − σ^2 )

    ⟹ σ^2 = mean_x̂ (x̂ − µ)^2.
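As a check (illustrative only, with synthetic data), the estimates are just the sample mean and the 1/n sample variance:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=3.0, size=1000)     # toy data

    mu_mle = x.mean()
    var_mle = np.mean((x - mu_mle) ** 2)              # note the 1/n, not 1/(n - 1)

    print(mu_mle, var_mle)                            # close to 2.0 and 9.0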

3.4 Multivariate Gaussian

A multivariate Gaussian is defined by

    p(x; µ, Σ) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp( −(1/2) (x − µ)^T Σ^{-1} (x − µ) )

    log p(x; µ, Σ) = −(1/2) (x − µ)^T Σ^{-1} (x − µ) − (d/2) log(2π) + (1/2) log |Σ^{-1}|.

It is unfortunate, but firmly established, to use the symbol Σ to denote the covariance matrix. It is important not to get this confused with a sum. (In these notes, the difference is indicated by the size of the symbol, as well as context.)

First off, let's calculate some properties of this distribution. It is not hard to see that, by symmetry, E[x] = µ. It can also be shown that E[(x − µ)(x − µ)^T] = Σ. Thus, it makes sense to call µ the mean and Σ the covariance matrix.

In order to calculate the maximum likelihood estimate, we will need some derivatives. Using the fact that Σ^{-1} is symmetric,

    ∂/∂µ log p(x; µ, Σ) = Σ^{-1} (x − µ).

Using the fact that ∂/∂X (a^T X a) = a a^T, and the strange but true fact that ∂/∂X log |X| = X^{-T}, and again assuming that Σ is symmetric, we have

    ∂/∂Σ^{-1} log p(x; µ, Σ) = −(1/2) ( (x − µ)(x − µ)^T − Σ ).

Now, as ever, when doing maximum likelihood estimation, our goal is to accomplish the maximization

    max_{Σ, µ} l(Σ, µ) = max_{Σ, µ} Σ_x̂ log p(x̂; µ, Σ).

Setting ∂l/∂µ = 0, we have

    Σ_x̂ ∂/∂µ log p(x̂; µ, Σ) = Σ_x̂ Σ^{-1} (x̂ − µ) = 0

    ⟹ µ = mean_x̂ x̂.

Setting ∂l/∂Σ^{-1} = 0, we have

    Σ_x̂ ∂/∂Σ^{-1} log p(x̂; µ, Σ) = Σ_x̂ ( −(1/2) ( (x̂ − µ)(x̂ − µ)^T − Σ ) ) = 0

    ⟹ Σ = mean_x̂ (x̂ − µ)(x̂ − µ)^T.

Again, this is all very intuitive. The mean is the empirical mean, and the covariance matrix is the empirical covariance matrix. However, this is not unbiased. (Recall that to estimate the variance of a scalar variable, we should use the formula (1/(n−1)) Σ_{i=1}^n (x_i − µ)^2, rather than the empirical variance (1/n) Σ_{i=1}^n (x_i − µ)^2.) So maximum likelihood will tend to slightly underestimate the variance when the number of data points is small.

3.5 Spherical Multivariate Gaussian

A spherical Gaussian is just a Gaussian distribution where we constrain the covariance matrix to take the form Σ = aI for some constant a. Using the fact that |aI| = a^d, this is

    p(x; µ, a) = (1 / ((2π)^{d/2} (a^d)^{1/2})) exp( −(1/(2a)) (x − µ)^T (x − µ) ).

This turns out to have a maximum likelihood solution of

    µ = mean_x̂ x̂

    a = (1/d) mean_x̂ (x̂ − µ)^T (x̂ − µ).
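The full-covariance and spherical estimates are equally mechanical to compute. A minimal sketch (mine, with a made-up 2-dimensional example):

    import numpy as np

    rng = np.random.default_rng(1)
    d, n = 2, 5000
    true_mu = np.array([1.0, -2.0])
    true_cov = np.array([[2.0, 0.6],
                         [0.6, 1.0]])
    X = rng.multivariate_normal(true_mu, true_cov, size=n)

    mu = X.mean(axis=0)                               # empirical mean
    diffs = X - mu
    Sigma = diffs.T @ diffs / n                       # empirical covariance (1/n)
    a = np.mean(np.sum(diffs ** 2, axis=1)) / d       # spherical: (1/d) mean ||x - mu||^2

    print(mu)                                         # near [1, -2]
    print(Sigma)                                      # near the true covariance
    print(a)                                          # near (2.0 + 1.0) / 2 = 1.5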

4 Properties of Maximum Likelihood

Here we will informally discuss some of the properties of maximum likelihood.

4.1 Maximum Likelihood is Consistent

If the data is actually being generated by a distribution p(x; θ_0), for some vector θ_0, then (absent pathological conditions) as the amount of data goes to infinity, the parameters θ recovered by maximum likelihood will converge to θ_0. This is definitely a good property, as we would probably consider any method lacking it to be, more or less, broken.

4.2 Maximum Likelihood is Equivariant

Another nice property of maximum likelihood is that it is equivariant. This just means that we can reparameterize without affecting the solution. Specifically, suppose we are considering estimating some distribution p(x; θ), and suppose the maximum likelihood estimate of θ on some dataset is θ^*. Now, we choose to instead parametrize our function by φ, which is related to θ by some (possibly nonlinear) transformation

    θ = g(φ).

Now, if we define q(x; φ) = p(x; g(φ)) and do maximum likelihood estimation of φ, we will recover φ^* such that θ^* = g(φ^*). (Proving this is quite easy.) Again, this is a reassuring property: the exact details of how we have parametrized our function don't matter. Failing to be equivariant wouldn't seem to be quite so disqualifying as failing to be consistent, but it is certainly comforting.
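Equivariance is easy to verify numerically. The sketch below (not from the notes) fits the binomial model twice, once in θ directly and once in the log-odds φ with θ = g(φ) = 1/(1 + e^{−φ}), and checks that the two maximizers describe the same distribution:

    import numpy as np

    x = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])      # toy binary data: seven 1s out of ten

    def loglik(theta):
        return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

    thetas = np.linspace(0.001, 0.999, 999)
    theta_star = thetas[np.argmax([loglik(t) for t in thetas])]

    g = lambda phi: 1.0 / (1.0 + np.exp(-phi))         # reparameterization theta = g(phi)
    phis = np.linspace(-6.0, 6.0, 20001)
    phi_star = phis[np.argmax([loglik(g(p)) for p in phis])]

    print(theta_star, g(phi_star))                     # both approximately 0.7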

4.3 Maximum Likelihood is Efficient

Perhaps the strongest argument in favor of maximum likelihood is that it is asymptotically efficient. Suppose the data is actually being generated by a distribution p(x; θ_0), for some vector θ_0. As discussed above, maximum likelihood is consistent, in the sense that it converges to θ_0. The next question is: how fast does it do that? Is there some other estimator that converges faster? Asymptotically, the answer is no. This result hinges on defining "faster" in terms of the expected squared distance between our estimate θ and the true parameters θ_0.

This follows from two results that are described here informally. (For simplicity, these are stated for a scalar parameter θ.) They make use of a quantity called the Fisher information,

    I(θ_0) = E[ ( ∂ log p(X; θ_0) / ∂θ )^2 ].

Intuitively, we can understand this as follows. Consider a landscape of different values θ, in which we seek to locate the true value θ_0. If the log-likelihood changes a lot in the region around θ_0, then we should expect the true parameters to be relatively easy to locate.

So, the two results are:

1. The Cramer-Rao bound. This states that no unbiased estimator can have a variance less than 1/(n I(θ_0)). (Technically, maximum likelihood is not unbiased, but this is good enough for our purposes, since we are looking for an asymptotic result anyway.)

2. The asymptotic normality of maximum likelihood. This shows that, as the amount of data becomes large, the estimated parameters will be distributed with variance 1/(n I(θ_0)). Specifically, they will be distributed as a Gaussian distribution with this variance, centered at θ_0.
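The 1/(n I(θ_0)) rate can be seen in a small simulation. This is my own toy setup, not from the notes: for the binomial model I(θ_0) = 1/(θ_0 (1 − θ_0)), so the MLE computed from n samples should have variance near θ_0 (1 − θ_0)/n:

    import numpy as np

    rng = np.random.default_rng(2)
    theta0, n, trials = 0.3, 200, 20000

    # The binomial MLE is just the sample mean of each simulated dataset.
    estimates = rng.binomial(1, theta0, size=(trials, n)).mean(axis=1)

    empirical_var = estimates.var()
    predicted_var = theta0 * (1 - theta0) / n          # 1 / (n I(theta0))

    print(empirical_var, predicted_var)                # both around 0.00105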

4.4 Maximum Likelihood Assumes a Whole Lot

Now, suppose the true distribution is p_0(x). Most of the above properties have hinged on the assumption that there exists a vector θ_0 such that

    p_0(x) = p(x; θ_0).

Another way of stating this is that we have a well-specified model. You might ask: how could we ever know this? The brief answer is that we probably don't. Now, we can create somewhat contrived situations where it is true. It is hard to see how a binary variable can fail to be Binomial! In general, however, making a model tends to be an educated guess of sorts.

4.5 Maximum Likelihood is Empirical Risk Minimization of the KL-divergence

All of the above discussion has depended on the assumption that the true data-generating distribution lies in our model class. As this is almost never true in practice, it might seem like maximum likelihood could almost never be used! Unfortunately, without the assumption of a well-specified model, almost all of the above properties disappear. After all, how can θ converge to θ_0 if θ_0 doesn't exist?

On the other hand, intuitively, it seems like maximum likelihood should still do something reasonable if the model is almost well-specified. That is, if there exists some vector of parameters θ such that p_0(x) ≈ p(x; θ), shouldn't maximum likelihood converge to something close to p_0? After all, in the previous case, the data could have come from parameters θ; how could maximum likelihood even know it did not?

In fact, maximum likelihood does behave reasonably in the face of minor misspecification. To understand this, we must first introduce the Kullback-Leibler divergence,

    KL(p_0 ‖ p) = ∫ p_0(x) log ( p_0(x) / p(x) ) dx.

This is a sort of divergence measure between probability distributions. Its origins come from information theory.¹ The important thing to note about the KL-divergence is that it is non-negative, and zero only when p_0 = p. Note also that it is not actually a distance measure, as it is not symmetric. Suppose that there is some region of points where p_0 is significant, but p is near zero. As the KL-divergence measures the logarithm of p_0/p, this region leads to a large divergence.

The following figures show several base distributions p_0 (shown as dotted lines). For each distribution, the Gaussian that minimizes the KL-divergence to it is computed (shown as solid lines).

¹ Where it can be thought of as measuring the expected number of bits wasted if you build a code for x assuming that the distribution is p when the actual distribution is p_0.

[Figure: several base distributions p_0(x) (dotted) together with arg min_q KL(p_0 ‖ q) over Gaussians q (solid).]

Now, consider the KL-divergence between the true distribution p_0 and the one that we fit, p:

    arg min_p KL(p_0 ‖ p) = arg min_p [ ∫ p_0(x) log p_0(x) dx − ∫ p_0(x) log p(x) dx ]
                          = arg max_p ∫ p_0(x) log p(x) dx
                          ≈ arg max_p Σ_x̂ log p(x̂).

In the third line we have made essentially an empirical approximation of the true risk, as above. The way to understand this is that if the true distribution is any of the dotted curves above and we fit a Gaussian, then, as the amount of data increases, we will recover the solid curve.
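As an illustration (my own construction, not the figure from the notes): if p_0 is a mixture of two Gaussians and we fit a single Gaussian by maximum likelihood, the fit approaches the mean and variance of p_0 itself, which is exactly the Gaussian minimizing KL(p_0 ‖ p):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 100000

    # p_0: a 50/50 mixture of N(-2, 1) and N(+2, 1); the model class is a single Gaussian.
    comp = rng.integers(0, 2, size=n)
    x = rng.normal(loc=np.where(comp == 0, -2.0, 2.0), scale=1.0)

    mu_mle, var_mle = x.mean(), np.mean((x - x.mean()) ** 2)

    # The KL-minimizing Gaussian matches the mean and variance of p_0: mean 0, variance 1 + 4 = 5.
    print(mu_mle, var_mle)                             # close to 0.0 and 5.0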

5 Bayesian Methods

Suppose we have a big jar full of bent coins. We happen to know that, inside of this jar, there are 75 coins of type A that come up heads with probability 60%, and 25 coins of type B that come up heads with probability 40%. Now, we pick a coin at random out of the jar. We flip it 8 times, and observe 3 heads, followed by 5 tails. What is the probability that we have in our hands a coin of type A?

One approach to this is to apply Bayes' theorem,

    Pr(X | Y) = Pr(Y | X) Pr(X) / Pr(Y).

In our case, we want to calculate the probability that we have a coin of type A, given that we have observed 3 heads in 8 coin flips:

    Pr(A | Data) = Pr(Data | A) Pr(A) / Pr(Data)

    Pr(B | Data) = Pr(Data | B) Pr(B) / Pr(Data).

Now, in our case, we know that we have a 75% chance of grabbing a coin of type A:

    Pr(A) = .75,  Pr(B) = .25.

If we had a coin of type A, the observed sequence would have probability

    Pr(Data | A) = .6^3 × .4^5,

and similarly

    Pr(Data | B) = .4^3 × .6^5.

Thus, we have

    Pr(A | Data) = .6^3 × .4^5 × .75 / Pr(Data)

    Pr(B | Data) = .4^3 × .6^5 × .25 / Pr(Data).

Now, notice that we don't need to go through too much calculation to recover Pr(Data).

Since we know that

    Pr(A | Data) + Pr(B | Data) = 1,

we can just normalize and calculate

    Pr(A | Data) = .57,  Pr(B | Data) = .43.

Thus, there is a 57% chance we have a coin of type A.

Now, all of the above may seem quite uncontroversial. However, when we say there is a 57% chance our coin is of type A, what exactly does that mean? After all, we picked one particular coin. It is either of type A or it isn't. What probabilities exactly are we talking about here? The traditional view holds that talking about such probabilities is meaningless. On the other hand, forced to bet, wouldn't everyone choose A?² There are philosophical issues here about the meaning of probability. We won't get too deeply into these; just note that they exist, and are a part of the debate in statistics between Bayesian and traditional frequentist methods.

² Is it a contradiction to choose A and yet reject the idea of probabilities like this?
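The arithmetic above is easy to reproduce; a minimal sketch:

    # Posterior over the two coin types after seeing 3 heads and 5 tails.
    prior = {"A": 0.75, "B": 0.25}
    heads_prob = {"A": 0.6, "B": 0.4}

    def likelihood(coin, heads=3, tails=5):
        p = heads_prob[coin]
        return p ** heads * (1 - p) ** tails

    unnormalized = {c: likelihood(c) * prior[c] for c in prior}
    Z = sum(unnormalized.values())                     # this plays the role of Pr(Data)
    posterior = {c: unnormalized[c] / Z for c in unnormalized}

    print(posterior)                                   # {'A': ~0.57, 'B': ~0.43}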

Now, let's try to formalize the process that we used above and scale it up to larger problems. Instead of just two types of coins (two different binomial distributions), imagine we have a set of potential probability distributions p. Imagine also that we have, from some prior knowledge, a distribution Pr(p) over these distributions. What happens is the following:

1. Some distribution p is picked, with probability proportional to Pr(p).
2. A bunch of samples {x̂} is drawn from p.
3. We get to see {x̂}, and need to make predictions about the future.

The simplest way to approach this situation is to again apply Bayes' equation,

    Pr(p | {x̂}) = Pr({x̂} | p) Pr(p) / Pr({x̂}).

Now, it makes sense to try to recover the most probable p. This means searching for

    arg max_p Pr(p | {x̂}) = arg max_p Pr({x̂} | p) Pr(p)
                          = arg max_p [ log Pr({x̂} | p) + log Pr(p) ]
                          = arg max_p [ log Π_x̂ p(x̂) + log Pr(p) ]
                          = arg max_p [ Σ_x̂ log p(x̂) + log Pr(p) ].

(In the first line, we exploit the fact that Pr({x̂}) is constant with respect to p and so does not affect the maximizer. In the second line, we take the logarithm. In the third line, we use the fact that Pr({x̂} | p) = Π_x̂ Pr(x̂ | p) = Π_x̂ p(x̂). The fourth line is just algebra.)

Thus, in the last line, we just have the log-likelihood plus the log prior Pr(p). Searching for p to maximize Pr(p | {x̂}) is known as maximum a posteriori (MAP) estimation.

Notice the similarity to regularized maximum likelihood estimation. For example, it is common to parameterize p by some vector θ, and to set Pr(p) = Pr(θ) to be a Gaussian centered at the origin. It is easy to show that doing this results in log Pr(θ) = −a ‖θ‖^2 (plus a constant), where a depends on the covariance of the Gaussian. Similarly, it can be shown that the lasso penalty corresponds to a prior of the form Pr(θ) ∝ exp(−a ‖θ‖_1). Thus, many Bayesians view regularized maximum likelihood estimation as implicit MAP estimation.

We should note, though, that real Bayesians do not do MAP estimation. To understand why not, suppose that we have a probability distribution parameterized by a scalar θ, and the posterior Pr(θ | {x̂}) looks something like the following:

[Figure: a posterior P(θ) over θ with a narrow spike at θ^(1) and a broad bump, carrying much more total probability, around θ^(2).]

MAP estimation will choose θ^(1) as the most probable set of parameters. However, this doesn't look so good, since most of the probability is in the area of θ^(2). What real Bayesians do is not estimate one particular distribution, but rather make predictions directly from the posterior Pr(p | Data). How is this done? Let's look at an example. Suppose we need to guess one single value for x. Consider the loss of some guess x':

    min_{x'} ∫ L(x', x) Pr(x | Data) dx.

Now, we can calculate the probability of some particular output x by integrating over the possible p:

    Pr(x | Data) = ∫_θ Pr(x, θ | Data) dθ
                 = ∫_θ Pr(x | θ) Pr(θ | Data) dθ
                 ∝ ∫_θ Pr(x | θ) Pr(Data | θ) Pr(θ) dθ.

Thus, finally, the true Bayesian chooses their best guess x' by solving the problem

    min_{x'} ∫_θ ∫ L(x', x) Pr(x | θ) Pr(Data | θ) Pr(θ) dx dθ.    (5.1)

The question is: how to do this integral? In some situations, this can be done in closed form. In general, however, one must resort to Markov chain Monte Carlo techniques for approximately doing the integral.³ This can be quite computationally challenging, which can be a major drawback of Bayesian methods.

Let's consider the advantages and disadvantages of the Bayesian approach. The major advantage is that it is, in a certain sense, the optimal method. If the true distribution is drawn from Pr(θ) then, on average, no method for making predictions can have lower loss than Eq. 5.1. To make this precise, suppose that we repeatedly draw parameters θ from the distribution Pr(θ), sample some data from Pr(Data | θ), make a prediction x', and then measure the loss L(x', x) on some new x drawn from p(x; θ). The above recipe will have the lowest average loss of any method. For this reason, many people feel that Bayesian methods are the one, true way to do machine learning.

A disadvantage of Bayesian methods is that they can often be quite computationally expensive. As mentioned above, in complex problems, it is common to use MCMC techniques to do inference. These techniques do have guarantees of eventually converging to the right answer, but these guarantees are usually asymptotic in nature. Thus, given a finite amount of running time, one can be unsure how close the current answer is to the best one. Research is ongoing on faster MCMC methods, with an eye on Bayesian inference.

³ See "Introduction to Monte Carlo methods" by David MacKay for a good tutorial on these techniques.
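For the two-coin example, θ ranges over only two values, so the integrals above collapse to sums and the fully Bayesian prediction can be computed exactly. A sketch (mine, continuing that example) of the posterior-predictive probability of heads on the next flip; under squared-error loss this posterior mean is the Bayes-optimal guess, whereas MAP would commit to coin A and predict 0.6:

    # Posterior predictive for the next flip in the two-coin example (3 heads, 5 tails observed).
    prior = {"A": 0.75, "B": 0.25}
    heads_prob = {"A": 0.6, "B": 0.4}

    unnormalized = {c: heads_prob[c] ** 3 * (1 - heads_prob[c]) ** 5 * prior[c] for c in prior}
    Z = sum(unnormalized.values())
    posterior = {c: unnormalized[c] / Z for c in unnormalized}

    # Pr(heads on next flip | Data) = sum over coin types of Pr(heads | coin) Pr(coin | Data).
    p_heads = sum(heads_prob[c] * posterior[c] for c in posterior)

    print(p_heads)                                     # about 0.51, between 0.6 (A) and 0.4 (B)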

The most obvious issue with Bayesian methods is the need to specify the prior Pr(θ). In real applications, where does this prior come from? This is similar to the issue we faced when doing (non-Bayesian) probabilistic modeling: we needed to specify a correct parametric model p(x; θ). While specifying the prior may appear to be a drawback of Bayesian methods, it is also something of an advantage. If you have a lot of knowledge about a particular domain, and you are able to specify this knowledge as a prior, Bayesian methods provide a nice framework for combining your knowledge with knowledge gained from data. Note also that, in the view of some, techniques like regularization are essentially MAP estimation in all but name. There is a great deal of material out there on the debate between frequentist and Bayesian statistics.
