Machine Learning Lecture Notes


Predrag Radivojac
January 25, 2015

1 Basic Principles of Parameter Estimation

In probabilistic modeling, we are typically presented with a set of observations, and the objective is to find a model, or function, M̂ that shows good agreement with the data and respects certain additional requirements. We shall roughly categorize these requirements into three groups: (i) the ability to generalize well, (ii) the ability to incorporate prior knowledge and assumptions into modeling, and (iii) scalability. First, the model should be able to stand the test of time; that is, its performance on previously unseen data should not deteriorate once this new data is presented. Models with such performance are said to generalize well. Second, M̂ must be able to incorporate information about the model space ℳ from which it is selected, and the process of selecting a model should be able to accept training advice from an analyst. Finally, when large amounts of data are available, learning algorithms must be able to provide solutions in reasonable time given resources such as memory or CPU power. In summary, the choice of a model ultimately depends on the observations at hand, our experience with modeling real-life phenomena, and the ability of algorithms to find good solutions given limited resources.

An easy way to think about finding the best model is through learning the parameters of a distribution. Suppose we are given a set of observations D = {x_i}_{i=1}^n, where x_i ∈ R, and we know that ℳ is the family of all univariate Gaussian distributions, e.g. M = Gaussian(μ, σ²), with μ ∈ R and σ² ∈ R⁺. In this case, the problem of finding the best model (by which we mean function) can be seen as finding the best parameters μ and σ²; i.e. the problem can be seen as parameter estimation. We call this process estimation because the typical assumption is that the data was generated by an unknown model from ℳ whose parameters we are trying to recover from the data. We will formalize parameter estimation using probabilistic techniques and will subsequently find solutions through optimization, occasionally with constraints in the parameter space. The main assumption throughout this part will be that the set of observations D was generated (or collected) independently and according to the same distribution p_X(x). The statistical framework for model inference is shown in Figure 1.

[Figure 1: Statistical framework for model inference. The estimates of the parameters are made using a set of observations D as well as experience in the form of the model space ℳ, the prior distribution p(M), or specific starting solutions in the optimization step.]
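To make this parameter-estimation view concrete, the following is a minimal sketch (assuming NumPy is available; the observations are made up for illustration) of selecting a member of the Gaussian family above by estimating μ and σ² from the data via the sample mean and sample variance:

    import numpy as np

    # Made-up observations; for the family M = {Gaussian(mu, sigma^2)},
    # choosing a model amounts to choosing mu and sigma^2 from the data.
    x = np.array([2.1, 5.3, 9.0, 5.2, 4.4, 8.1])

    mu_hat = x.mean()                        # sample mean
    sigma2_hat = ((x - mu_hat) ** 2).mean()  # sample variance
    print(mu_hat, sigma2_hat)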

1.1 Maximum a posteriori and maximum likelihood estimation

The idea behind maximum a posteriori (MAP) estimation is to find the most probable model for the observed data. Given the data set D, we formalize the MAP solution as

    M_MAP = argmax_{M ∈ ℳ} { p(M | D) },

where p(M | D) is the posterior distribution of the model given the data. In discrete model spaces, p(M | D) is a probability mass function and the MAP estimate is exactly the most probable model. Its counterpart in continuous spaces is the model with the largest value of the posterior density function. Note that we use the words model, which is a function, and its parameters, which are the coefficients of that function, somewhat interchangeably. However, we should keep in mind the difference, even if only for pedantic reasons.

To calculate the posterior distribution we start by applying Bayes' rule:

    p(M | D) = p(D | M) p(M) / p(D),    (1)

where p(D | M) is called the likelihood function, p(M) is the prior distribution of the model, and p(D) is the marginal distribution of the data. Notice that we use D for the observed data set, but that we usually think of it as a realization of a multidimensional random variable D drawn according to some distribution p(D). Using the formula of total probability, we can express p(D) as

    p(D) = Σ_{M ∈ ℳ} p(D | M) p(M)     (ℳ discrete)
    p(D) = ∫_ℳ p(D | M) p(M) dM        (ℳ continuous).

Therefore, the posterior distribution can be fully described using the likelihood and the prior. The field of research and practice involving ways to determine this distribution and optimal models is referred to as inferential statistics. The posterior distribution is sometimes referred to as the inverse probability.

Finding M_MAP can be greatly simplified because p(D) in the denominator does not affect the solution. We can rewrite Eq. (1) as

    p(M | D) = p(D | M) p(M) / p(D) ∝ p(D | M) p(M),

where ∝ is the proportionality symbol. Thus, we can find the MAP solution by solving the following optimization problem

    M_MAP = argmax_{M ∈ ℳ} { p(D | M) p(M) }.

In some situations we may not have a reason to prefer one model over another and can think of p(M) as constant over the model space ℳ. Then maximum a posteriori estimation reduces to maximization of the likelihood function; i.e.

    M_ML = argmax_{M ∈ ℳ} { p(D | M) }.

We will refer to this solution as the maximum likelihood solution. Formally speaking, the assumption that p(M) is constant is problematic because a uniform distribution cannot always be defined (say, over R). Thus, it may be useful to think of the maximum likelihood approach as a separate technique, rather than as a special case of MAP estimation, with a conceptual caveat. Observe that the MAP and ML approaches report the solutions corresponding to the mode of the posterior distribution and of the likelihood function, respectively. We shall later contrast this estimation technique with the view of Bayesian statistics, in which the goal is to minimize the posterior risk. Such estimation typically results in calculating conditional expectations, which can be complex integration problems. From a different point of view, MAP and ML estimates are called point estimates, as opposed to estimates that report confidence intervals for a particular group of parameters.

Example 8. Suppose the data set D = {2, 5, 9, 5, 4, 8} is an i.i.d. sample from a Poisson distribution with an unknown parameter λ_t. Find the maximum likelihood estimate of λ_t.

The probability mass function of a Poisson distribution is expressed as p(x | λ) = λ^x e^{−λ} / x!, with parameter λ ∈ R⁺. We will estimate this parameter as

    λ_ML = argmax_{λ ∈ (0, ∞)} { p(D | λ) }.

We can write the likelihood function as

    p(D | λ) = p({x_i}_{i=1}^n | λ) = ∏_{i=1}^n p(x_i | λ) = λ^{Σ_{i=1}^n x_i} e^{−nλ} / ∏_{i=1}^n x_i!.

To find the λ that maximizes the likelihood, we will first take the logarithm (a monotonic function) to simplify the calculation, then find its first derivative with respect to λ, and finally equate it with zero to find the maximum. Specifically, we express the log-likelihood ll(D, λ) = ln p(D | λ) as

ll(d, )ln i and proceed with the first derivative as x i n ln (x i!) i @ll(d, ) @ n X 0. i x i n By substituting n 6and values from D, wecancomputethesolutionas ML n 5.5, which is simply a sample mean. The second derivative of the lielihood function is always negative because must be positive; thus, the previous expression indeed maximizes the lielihood. We have ignored the anomalous cases when D contains all zeros. Example 9. Let D {2, 5, 9, 5, 4, 8} again be an i.i.d. sample from Poisson( t ),butnowwearealsogivenadditionalinformation. Supposethe prior nowledge about t can be expressed using a gamma distribution (x, ) with parameters 3and. Find the maximum a posteriori estimate of t. First, we write the probability density function of the gamma family as i x i e x (x, ) x (), where x>0, >0, and >0. () is the gamma function that generalizes the factorial function; when is an integer, we have () ( )!. The MAP estimate of the parameters can be found as MAP arg max {p(d )p( )}. 2(0,) As before, we can write the lielihood function as p(d ) i xi e n Q n i x i! and the prior distribution as 5

    p(λ) = λ^{α−1} e^{−λ/β} / (β^α Γ(α)).

Now, we can maximize the logarithm of the posterior distribution p(λ | D) using

    ln p(λ | D) ∝ ln p(D | λ) + ln p(λ)
              = (α − 1 + Σ_{i=1}^n x_i) ln λ − λ (n + 1/β) − Σ_{i=1}^n ln x_i! − α ln β − ln Γ(α)

to obtain

    λ_MAP = (α − 1 + Σ_{i=1}^n x_i) / (n + 1/β) = 5

after incorporating all the data.
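Both closed-form estimates are easy to check numerically. A minimal sketch, assuming NumPy is available, that reproduces λ_ML = 5.5 and λ_MAP = 5 for the data in Examples 8 and 9:

    import numpy as np

    x = np.array([2, 5, 9, 5, 4, 8])   # the data set D from Examples 8 and 9
    n = len(x)
    alpha, beta = 3.0, 1.0             # gamma prior parameters

    lambda_ml = x.sum() / n                                 # sample mean = 5.5
    lambda_map = (alpha - 1 + x.sum()) / (n + 1.0 / beta)   # (2 + 33) / 7 = 5.0
    print(lambda_ml, lambda_map)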

A quick look at λ_MAP and λ_ML suggests that as n grows, the numerators and denominators in the expressions above become increasingly similar. Unfortunately, equality between λ_MAP and λ_ML when n → ∞ cannot be guaranteed, because it could occur by (a distant) chance that S_n = Σ_{i=1}^n x_i does not grow with n. One way to proceed is to calculate the expectation of the difference between λ_MAP and λ_ML and then investigate what happens as n → ∞. Let us do this formally.

We shall first note that both estimates can be considered random variables. This follows from the fact that D is assumed to be an i.i.d. sample from a Poisson distribution with some true parameter λ_t. With this, we can write λ_MAP = (α − 1 + S_n)/(n + 1/β) and λ_ML = S_n/n, where S_n = Σ_{i=1}^n X_i and X_i ~ Poisson(λ_t). We shall now prove the following convergence (in mean) result

    lim_{n→∞} E_D[|λ_MAP − λ_ML|] = 0,

where the expectation is taken with respect to the random variable D. Let us first bound the absolute difference between the two estimates as

    |λ_MAP − λ_ML| = |(α − 1 + S_n)/(n + 1/β) − S_n/n|
                   ≤ (α − 1)/(n + 1/β) + S_n / (β n (n + 1/β))
                   = Δ,

where Δ is a random variable that bounds the absolute difference between λ_MAP and λ_ML. We shall now express the expectation of Δ as

    E[Δ] = (α − 1)/(n + 1/β) + E[Σ_{i=1}^n X_i] / (β n (n + 1/β))
         = (α − 1)/(n + 1/β) + λ_t / (β (n + 1/β))

and therefore calculate that

    lim_{n→∞} E[Δ] = 0.

This result shows that the MAP estimate approaches the ML solution for large data sets. In other words, large data diminishes the importance of prior knowledge. This is an important conclusion because it simplifies the mathematical apparatus necessary for practical inference.

1.1.1 The relationship with Kullback-Leibler divergence

We now investigate the relationship between maximum likelihood estimation and the Kullback-Leibler divergence. The Kullback-Leibler divergence between two probability distributions p(x) and q(x) is defined as

    D_KL(p ‖ q) = ∫ p(x) log (p(x) / q(x)) dx.

In information theory, the Kullback-Leibler divergence has a natural interpretation as the inefficiency of signal compression when the code is constructed using a suboptimal distribution q(x) instead of the correct (but unknown) distribution p(x) according to which the data has been generated. However, more often than not, the Kullback-Leibler divergence is simply considered a measure of divergence between two probability distributions. Although this divergence is not a metric (it is not symmetric and does not satisfy the triangle inequality), it has important theoretical properties in that (i) it is always non-negative and (ii) it is equal to zero if and only if p(x) = q(x).
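For intuition, here is a small sketch (assuming NumPy) of the discrete analogue of this definition, D_KL(p ‖ q) = Σ_x p(x) log(p(x)/q(x)), illustrating properties (i) and (ii); the distributions are made up for illustration:

    import numpy as np

    def kl(p, q):
        """Discrete Kullback-Leibler divergence D_KL(p || q) for strictly positive p and q."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        return np.sum(p * np.log(p / q))

    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.4, 0.4, 0.2])

    print(kl(p, q), kl(q, p))   # non-negative and, in general, asymmetric
    print(kl(p, p))             # zero if and only if the two distributions are equal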

Consider now the divergence between an estimated probability distribution p(x | θ) and the underlying (true) distribution p(x | θ_t) according to which the data set D = {x_i}_{i=1}^n was generated. The Kullback-Leibler divergence between p(x | θ) and p(x | θ_t) is

    D_KL(p(x | θ_t) ‖ p(x | θ)) = ∫ p(x | θ_t) log (p(x | θ_t) / p(x | θ)) dx
                                = −∫ p(x | θ_t) log p(x | θ) dx + ∫ p(x | θ_t) log p(x | θ_t) dx.

The second term in the above equation is simply the negative entropy of the true distribution and is not influenced by our choice of the model. The first term, on the other hand, can be expressed as

    −∫ p(x | θ_t) log p(x | θ) dx = −E[log p(x | θ)].

Therefore, maximizing E[log p(x | θ)] minimizes the Kullback-Leibler divergence between p(x | θ) and p(x | θ_t). Using the strong law of large numbers, we know that

    (1/n) Σ_{i=1}^n log p(x_i | θ)  →  E[log p(x | θ)]   almost surely

as n → ∞. Thus, when the data set is sufficiently large, maximizing the likelihood function minimizes the Kullback-Leibler divergence and leads to the conclusion that p(x | θ_ML) = p(x | θ_t), if the underlying assumptions are satisfied. Under reasonable conditions, we can infer from this that θ_ML = θ_t. This will hold for families of distributions in which a set of parameters uniquely determines the probability distribution; e.g. it will not generally hold for mixtures of distributions, but we will discuss this situation later. This result is only one of the many connections between statistics and information theory.

1.2 Parameter estimation for mixtures of distributions

We now investigate parameter estimation for mixture models, which is most commonly carried out using the expectation-maximization (EM) algorithm. As before, we are given a set of i.i.d. observations D = {x_i}_{i=1}^n, with the goal of estimating the parameters of the mixture distribution

    p(x | θ) = Σ_{j=1}^m w_j p(x | θ_j).

In the equation above, we used θ = (w_1, w_2, ..., w_m, θ_1, θ_2, ..., θ_m) to combine all the parameters. Just to be more concrete, we shall assume that each p(x | θ_j) is an exponential distribution with parameter λ_j, i.e.

    p(x | λ_j) = λ_j e^{−λ_j x},

where λ_j > 0. Finally, we shall assume that m is given and will address simultaneous estimation of θ and m later.

Let us attempt to find the maximum likelihood solution first. Plugging the formula for p(x | θ) into the likelihood function, we obtain

    p(D | θ) = ∏_{i=1}^n p(x_i | θ) = ∏_{i=1}^n ( Σ_{j=1}^m w_j p(x_i | θ_j) ),    (2)

which, unfortunately, is difficult to maximize using differential calculus (why?). Note that although p(D | θ) has O(m^n) terms when expanded, it can be calculated in O(mn) time as a log-likelihood.
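The O(mn) log-likelihood computation mentioned above can be sketched as follows; this assumes NumPy and SciPy are available and uses log-sum-exp for numerical stability (the helper name and the example values are illustrative only):

    import numpy as np
    from scipy.special import logsumexp

    def mixture_log_likelihood(x, w, lam):
        """log p(D | theta) for a mixture of exponentials, computed in O(mn)."""
        x = np.asarray(x, float)[:, None]                       # shape (n, 1)
        w, lam = np.asarray(w, float), np.asarray(lam, float)   # shape (m,)
        # log(w_j * lam_j * exp(-lam_j * x_i)) for every pair (i, j)
        log_terms = np.log(w) + np.log(lam) - lam * x           # shape (n, m)
        return logsumexp(log_terms, axis=1).sum()               # sum_i log sum_j

    x = [0.3, 1.2, 0.8, 4.5, 3.9]
    print(mixture_log_likelihood(x, w=[0.6, 0.4], lam=[2.0, 0.5]))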

Before introducing the EM algorithm, let us for a moment consider two hypothetical scenarios that will help us understand the algorithm. First, suppose that information is available as to which mixture component generated which data point. That is, suppose that D = {(x_i, y_i)}_{i=1}^n is an i.i.d. sample from some distribution p_XY(x, y), where y ∈ Y = {1, 2, ..., m} specifies the mixture component. How would the maximization be performed then? Let us write the likelihood function as

    p(D | θ) = ∏_{i=1}^n p(x_i, y_i | θ) = ∏_{i=1}^n p(x_i | y_i, θ) p(y_i | θ) = ∏_{i=1}^n w_{y_i} p(x_i | θ_{y_i}),    (3)

where w_j = p_Y(j) = P(Y = j). The log-likelihood is

    log p(D | θ) = Σ_{i=1}^n (log w_{y_i} + log p(x_i | θ_{y_i})) = Σ_{j=1}^m n_j log w_j + Σ_{i=1}^n log p(x_i | θ_{y_i}),

where n_j is the number of data points in D generated by the j-th mixture component. It is useful to observe here that when y = (y_1, y_2, ..., y_n) is known, the internal summation in Eq. (2) disappears. More importantly, it follows that Eq. (3) can be maximized in a relatively straightforward manner. Let us show how.

To find w = (w_1, w_2, ..., w_m) we need to solve a constrained optimization problem, which we will do using the method of Lagrange multipliers. We first form the Lagrangian

    L(w, α) = Σ_{j=1}^m n_j log w_j + α (1 − Σ_{j=1}^m w_j),

where α is the Lagrange multiplier. Then, by setting ∂L(w, α)/∂w_k = 0 for every k ∈ Y and ∂L(w, α)/∂α = 0, we derive that w_k = n_k/α and α = n. Thus,

    w_k = n_k/n = (1/n) Σ_{i=1}^n I(y_i = k),

where I(·) is the indicator function. To find all λ_j, we recall that we assumed a

mixture of exponential distributions. Thus, we proceed by setting

    ∂/∂λ_k ( Σ_{i=1}^n log p(x_i | y_i, θ) ) = 0

for each k ∈ Y. We obtain

    λ_k = Σ_{i=1}^n I(y_i = k) / Σ_{i=1}^n I(y_i = k) x_i,

which is simply the inverse mean over those data points generated by the k-th mixture component. In summary, we observe that if the mixture-component designations y are known, parameter estimation is greatly simplified. This was achieved by decoupling the estimation of the mixing proportions from the estimation of the parameters of the mixing distributions.

In the second hypothetical scenario, suppose that the parameters θ are known and that we would like to estimate the best configuration of the mixture designations y (one may be tempted to call them class labels). This task looks like clustering, in which cluster memberships need to be determined based on the known set of mixing distributions and mixing probabilities. To do this we can calculate the posterior distribution of y as

    p(y | D, θ) = ∏_{i=1}^n p(y_i | x_i, θ) = ∏_{i=1}^n [ w_{y_i} p(x_i | θ_{y_i}) / Σ_{j=1}^m w_j p(x_i | θ_j) ]    (4)

and subsequently find the best configuration out of the m^n possibilities. Obviously, because of the i.i.d. assumption, each element y_i can be estimated separately and, thus, this estimation can be completed in O(mn) time. The MAP estimate for y_i can be found as

    ŷ_i = argmax_{y_i ∈ Y} { w_{y_i} p(x_i | θ_{y_i}) / Σ_{j=1}^m w_j p(x_i | θ_j) }

for each i ∈ {1, 2, ..., n}.

In reality, neither the class labels y nor the parameters θ are known. Fortunately, we have just seen that the optimization step is relatively straightforward if one of them is known. Therefore, the intuition behind the EM algorithm is to form an iterative procedure by assuming that either y or θ is known and calculating the other. For example, we can initially pick some value for θ, say θ^(0), and then estimate y by computing p(y | D, θ^(0)) as in Eq. (4). We can refer to this estimate as y^(0). Using y^(0) we can then refine the estimate of θ to θ^(1) using Eq. (3). We can then iterate these two steps until convergence. In the case of a mixture of exponential distributions, we arrive at the following algorithm:

1. Initialize λ_k^(0) and w_k^(0) for all k ∈ Y
2. Calculate y_i^(0) = argmax_{k ∈ Y} { w_k^(0) p(x_i | λ_k^(0)) / Σ_{j=1}^m w_j^(0) p(x_i | λ_j^(0)) } for all i ∈ {1, 2, ..., n}
3. Set t = 0
4. Repeat until convergence
   (a) w_k^(t+1) = (1/n) Σ_{i=1}^n I(y_i^(t) = k) for all k ∈ Y
   (b) λ_k^(t+1) = Σ_{i=1}^n I(y_i^(t) = k) / Σ_{i=1}^n I(y_i^(t) = k) x_i for all k ∈ Y
   (c) t = t + 1
   (d) y_i^(t) = argmax_{k ∈ Y} { w_k^(t) p(x_i | λ_k^(t)) / Σ_{j=1}^m w_j^(t) p(x_i | λ_j^(t)) } for all i ∈ {1, 2, ..., n}
5. Report λ_k^(t) and w_k^(t) for all k ∈ Y

This procedure is not quite yet the EM algorithm; rather, it is a version of it referred to as the classification EM (CEM) algorithm. In the next section we will introduce the EM algorithm.
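Before moving on, here is a minimal sketch of the classification EM procedure above for a mixture of exponentials. It assumes NumPy is available, replaces the convergence test with a fixed number of iterations, and uses an illustrative random initialization and synthetic data:

    import numpy as np

    def cem_exponential_mixture(x, m, n_iter=100, seed=0):
        """Classification EM for a mixture of m exponential distributions."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x, float)
        w = np.full(m, 1.0 / m)                         # mixing proportions w_k
        lam = rng.uniform(0.5, 2.0, size=m) / x.mean()  # rate parameters lambda_k
        for _ in range(n_iter):
            # hard assignment: most probable component for each x_i (step (d));
            # the shared denominator does not affect the argmax
            log_post = np.log(w) + np.log(lam) - np.outer(x, lam)
            y = np.argmax(log_post, axis=1)
            # re-estimate proportions and rates from the assigned points (steps (a)-(b));
            # a component that receives no points keeps its previous values
            for k in range(m):
                mask = (y == k)
                if mask.any():
                    w[k] = mask.mean()
                    lam[k] = mask.sum() / x[mask].sum()
        return w, lam, y

    rng = np.random.default_rng(42)
    x = np.concatenate([rng.exponential(0.5, 200), rng.exponential(3.0, 200)])
    print(cem_exponential_mixture(x, m=2))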

E[log p(d, y ) (t) ] y y y log p(d, y )p(y D, (t) ) y n y n i y n i log p(x i,y i ) p(y l x l, (t) ) l log (w yi p(x i yi )) p(y l x l, (t) ). After several simplification steps, that we omit for space reasons, the expectation of the lielihood can be written as E[log p(d, y ) (t) ] i j l log (w j p(x i j )) p Yi (j x i, (t) ), from which we can see that w and { j } m j can be separately found. In the final two steps, we will first derive the update rule for the mixing probabilities and then by assuming the mixing distributions are exponential, derive the update rules for their parameters. To maximize E[log p(d, y ) (t) ] with respect to w, weobservethatthisis an instance of constrained optimization because it must hold that P m i w i. We will use the method of Lagrange multipliers; thus, for each 2Ywe need to solve 0 0 @ X n @ log w j p Yi (j x i, (t) )+ @ w j AA 0, @w j i where is the Lagrange multiplier. It is relatively straightforward to show that w (t+) n j p Yi ( x i, (t) ). (7) i Similarly, to find the solution for the parameters of the mixture distributions, we obtain that (t+) i p Y i ( x i, (t) ) i x ip Yi ( x i, (t) ) (8) for 2Y. As previously shown, we have p Yi ( x i, (t) ) w (t) (t) ) p(x i P m j w(t) j p(x i (t) j ), (9) which can be computed and stored as an n m matrix. In summary, for the mixture of m exponential distributions, we summarize the EM algorithm by combining Eqs. (7-9) as follows: 2

1. Initialize λ_k^(0) and w_k^(0) for all k ∈ Y
2. Set t = 0
3. Repeat until convergence
   (a) p_{Y_i}(k | x_i, θ^(t)) = w_k^(t) p(x_i | λ_k^(t)) / Σ_{j=1}^m w_j^(t) p(x_i | λ_j^(t)) for all (i, k)
   (b) w_k^(t+1) = (1/n) Σ_{i=1}^n p_{Y_i}(k | x_i, θ^(t)) for all k ∈ Y
   (c) λ_k^(t+1) = Σ_{i=1}^n p_{Y_i}(k | x_i, θ^(t)) / Σ_{i=1}^n x_i p_{Y_i}(k | x_i, θ^(t)) for all k ∈ Y
   (d) t = t + 1
4. Report λ_k^(t) and w_k^(t) for all k ∈ Y

Similar update rules can be obtained for other probability distributions; however, separate derivatives have to be found. It is important to observe and understand the difference between the CEM and the EM algorithms.
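A minimal sketch of the EM updates in Eqs. (7)–(9); as with the CEM sketch, NumPy is assumed, a fixed number of iterations stands in for a proper convergence test, and the initialization and data are illustrative:

    import numpy as np

    def em_exponential_mixture(x, m, n_iter=200, seed=0):
        """EM for a mixture of m exponential distributions, Eqs. (7)-(9)."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x, float)
        w = np.full(m, 1.0 / m)
        lam = rng.uniform(0.5, 2.0, size=m) / x.mean()
        for _ in range(n_iter):
            # E-step: responsibilities p_{Y_i}(k | x_i, theta^(t)), Eq. (9), as an n-by-m matrix
            unnorm = w * lam * np.exp(-np.outer(x, lam))
            post = unnorm / unnorm.sum(axis=1, keepdims=True)
            # M-step: mixing proportions, Eq. (7), and rates, Eq. (8)
            w = post.mean(axis=0)
            lam = post.sum(axis=0) / (x[:, None] * post).sum(axis=0)
        return w, lam

    rng = np.random.default_rng(42)
    x = np.concatenate([rng.exponential(0.5, 300), rng.exponential(3.0, 300)])
    print(em_exponential_mixture(x, m=2))

Unlike the CEM sketch above, the assignments here are soft: every observation contributes to every component's update in proportion to its responsibility.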

1.3 Bayesian estimation

The maximum a posteriori and maximum likelihood approaches report the solution that corresponds to the mode of the posterior distribution and of the likelihood function, respectively. This approach, however, does not consider the possibility of skewed distributions, multimodal distributions, or simply large regions with similar values of p(M | D). Bayesian estimation addresses those concerns. The main idea in Bayesian statistics is minimization of the posterior risk

    R = ∫_ℳ ℓ(M, M̂) p(M | D) dM,

where M̂ is our estimate and ℓ(M, M̂) is some loss function between two models. When ℓ(M, M̂) = (M − M̂)² (ignore the abuse of notation), we can minimize the posterior risk as follows

    ∂R/∂M̂ = 2M̂ − 2 ∫_ℳ M p(M | D) dM = 0,

from which it can be derived that the minimizer of the posterior risk is the posterior mean, i.e.

    M_B = ∫_ℳ M p(M | D) dM = E_M[M | D].

We shall refer to M_B as the Bayes estimator. It is important to mention that computing the posterior mean usually involves solving complex integrals. In some situations, these integrals can be solved analytically; in others, numerical integration is necessary.

Example 10. Let D = {2, 5, 9, 5, 4, 8} yet again be an i.i.d. sample from Poisson(λ_t). Suppose the prior knowledge about λ_t can be expressed using a gamma distribution with parameters α = 3 and β = 1. Find the Bayesian estimate of λ_t.

We want to find E[λ | D]. Let us first write the posterior distribution as

    p(λ | D) = p(D | λ) p(λ) / p(D) = p(D | λ) p(λ) / ∫_0^∞ p(D | λ) p(λ) dλ,

where, as shown in the previous examples, we have

    p(D | λ) = λ^{Σ_{i=1}^n x_i} e^{−nλ} / ∏_{i=1}^n x_i!

and

    p(λ) = λ^{α−1} e^{−λ/β} / (β^α Γ(α)).

Before calculating p(D), let us first note that

    ∫_0^∞ x^{α−1} e^{−x/β} dx = β^α Γ(α).

Now, we can derive that

    p(D) = ∫_0^∞ p(D | λ) p(λ) dλ
         = ∫_0^∞ [ λ^{Σ_i x_i} e^{−nλ} / ∏_i x_i! ] · [ λ^{α−1} e^{−λ/β} / (β^α Γ(α)) ] dλ
         = Γ(α + Σ_i x_i) / [ β^α Γ(α) ∏_i x_i! (n + 1/β)^{α + Σ_i x_i} ]

and subsequently that

p(d )p( ) p( D) p(d) i xi e n Q n i x i! Finally, e () + i xi e (n+ / ) (n + ) ( + i x i) () Q n i x i!(n + ) i x i+ ( + i x i) i x i+. E[ D] ˆ 0 p( D)d + i x i n + 5.4 which is nearly the same solution as the MAP estimate found in Example 9. It is evident from the previous example that selection of the prior distribution has important implications on calculation of the posterior mean. We have not piced the gamma distribution by chance; that is, when the lielihood was multiplied by the prior, the resulting distribution remained in the same class of functions as the lielihood. We shall refer to such prior distributions as conjugate priors. Conjugate priors are also simplifying the mathematics; in fact, this is a major reason for their consideration. Interestingly, in addition to the Poisson distribution, the gamma distribution is a conjugate prior to the exponential distribution as well as the gamma distribution itself. 5