Lecture 4: Exponential family of distributions and generalized linear model (GLM) (Draft: version 0.9.2)


Lectures on Machine Learning (Fall 2017)
Hyeong In Choi, Seoul National University

Topics to be covered:
- Exponential family of distributions
- Mean and (canonical) link functions
- Convexity of the log partition function
- Generalized linear model (GLM)
- Various GLM models

1 Exponential family of distributions

In this section, we study a family of probability distributions called the exponential family (of distributions). It is of a special form, but most, if not all, of the well-known probability distributions belong to this class.

1.1 Definition

Definition 1. A probability distribution (PDF or PMF) is said to belong to the exponential family of distributions (in natural or canonical form) if it is of the form

    P_θ(y) = (1/Z(θ)) h(y) e^{θ·T(y)},    (1)

where y = (y_1, ..., y_m) is a point in R^m; θ = (θ_1, ..., θ_k) ∈ R^k is a parameter called the canonical (natural) parameter; T : R^m → R^k is a map T(y) = (T_1(y), ..., T_k(y)); and

    Z(θ) = ∫ h(y) e^{θ·T(y)} dy

is called the partition function, while its logarithm, A(θ) = log Z(θ), is called the log partition (cumulant) function.

Remark. In this lecture and throughout this course, the dot notation as in θ·T(y) always means the inner (dot) product of two vectors.

Equivalently, P_θ(y) can be written in the form

    P_θ(y) = exp[ θ·T(y) − A(θ) + C(y) ],    (2)

where C(y) = log h(y). More generally, one sometimes introduces an extra parameter φ, called the dispersion parameter, to control the shape of P_θ(y) by

    P_θ(y) = exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ].

1.2 Standing assumption

In our discussion of the exponential family of distributions, we always assume the following.

- In case P_θ(y) is a probability density function (PDF), it is assumed to be continuous as a function of y; this means there is no singular part in the probability measure P_θ(y).
- In case P_θ(y) is a probability mass function (PMF), the discrete set of values y on which P_θ(y) is positive is the same for all θ.

If P_θ(y) satisfies the relevant condition, we say it is regular, which is always assumed throughout this course.

Remark. Sometimes people use the more general form of P_θ(y) and write

    P_θ(y) = (1/Z(θ)) h(y) e^{η(θ)·T(y)}.

But most of the time the same results can be obtained without the use of the general form η(θ). So it is not much of a loss of generality to stick to our convention of using just θ.

1.3 Examples

Let us now look at a few illustrative examples.

(1) Bernoulli(µ)

The Bernoulli distribution is perhaps the simplest member of the exponential family. Let Y be a random variable taking its binary value in {0, 1}, and let µ = P[Y = 1]. Its distribution (in fact, the probability mass function, PMF) is then succinctly written as P(y) = µ^y (1 − µ)^{1−y}. Then,

    P(y) = µ^y (1 − µ)^{1−y}
         = exp[ y log µ + (1 − y) log(1 − µ) ]
         = exp[ y log(µ/(1 − µ)) + log(1 − µ) ].

Letting T(y) = y and θ = log(µ/(1 − µ)), and recalling the definitions of the logit and σ functions given in Lecture 3, we have

    logit(µ) = log(µ/(1 − µ)).

Thus

    θ = logit(µ).    (3)

Its inverse function is the sigmoid function:

    µ = σ(θ),    (4)

where σ(θ) = e^θ/(1 + e^θ). Therefore, we have

    A(θ) = −log(1 − µ) = log(1 + e^θ).    (5)

Thus P_θ(y), written in canonical form as in (2), becomes

    P_θ(y) = exp[ θy − log(1 + e^θ) ].
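The following small numerical sketch (in Python; not part of the original notes) checks that (3) and (4) are inverses of each other and that the canonical form exp[θy − A(θ)], with A(θ) as in (5), reproduces the Bernoulli PMF.

import numpy as np

def logit(mu):
    return np.log(mu / (1.0 - mu))

def sigmoid(theta):
    return np.exp(theta) / (1.0 + np.exp(theta))

mu = 0.3
theta = logit(mu)
assert np.isclose(sigmoid(theta), mu)      # (3) and (4) are inverse functions

A = np.log(1.0 + np.exp(theta))            # log partition function (5)
for y in (0, 1):
    pmf = mu**y * (1.0 - mu)**(1 - y)      # Bernoulli PMF mu^y (1-mu)^(1-y)
    expfam = np.exp(theta * y - A)         # canonical form exp[theta*y - A(theta)]
    assert np.isclose(pmf, expfam)
print("Bernoulli canonical form verified")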

(2) Exponential distribution

The exponential distribution models the waiting time between independent arrivals. Its distribution (the probability density function, PDF) is given as

    P_θ(y) = θ e^{−θy} I(y ≥ 0).

To put it in the exponential family form, we use the same θ as the canonical parameter (strictly speaking, the exponent is (−θ)·y) and we let T(y) = y and h(y) = I(y ≥ 0). Since

    Z(θ) = 1/θ = ∫ e^{−θy} I(y ≥ 0) dy,

it is already in the canonical form given in (1).

(3) Normal distribution

The normal (Gaussian) distribution, given by

    P(y) = (1/√(2πσ²)) exp[ −(y − µ)²/(2σ²) ],

is the single most well-known distribution. As far as its relation to the exponential family is concerned, there are two views. (Here, this σ is a number, not the sigmoid function.)

1st view (σ² as a dispersion parameter)

This is the case when the main focus is the mean. In this case, the variance σ² is regarded as known, or as a parameter that can be fiddled with as if known. Writing P(y) in the form

    P_θ(y) = exp[ (µy − ½y² − ½µ²)/σ² − ½ log(2πσ²) ],

one can see right away that it is in the form of the exponential family if we set

    θ = (θ_1, θ_2) = (µ, −1)
    T(y) = (y, ½y²)
    φ = σ²
    A(θ) = ½µ² = ½θ_1²
    C(y, φ) = −½ log(2πσ²).

2nd view (φ = 1)

When both µ and σ are parameters to be treated as unknown, we take this point of view. Here, we set the dispersion parameter φ = 1. Writing out P_θ(y), we have

    P_θ(y) = exp[ −y²/(2σ²) + (µ/σ²)y − µ²/(2σ²) − log σ − ½ log 2π ].

Thus it is easy to see the following:

    T(y) = (y, ½y²)
    θ = (θ_1, θ_2) = (µ/σ², −1/σ²)
    A(θ) = µ²/(2σ²) + log σ = −θ_1²/(2θ_2) − ½ log(−θ_2)
    C(y) = −½ log(2π).
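As a quick sanity check on the second view, the following sketch (Python with numpy/scipy; not part of the original notes) evaluates the exponential-family form with the parameters listed above and compares it to the usual normal density.

import numpy as np
from scipy.stats import norm

mu, sigma = 1.5, 0.8
theta1, theta2 = mu / sigma**2, -1.0 / sigma**2            # 2nd-view canonical parameters
A = -theta1**2 / (2 * theta2) - 0.5 * np.log(-theta2)      # log partition function
C = -0.5 * np.log(2 * np.pi)

for y in np.linspace(-2.0, 4.0, 7):
    T = (y, 0.5 * y**2)
    p_expfam = np.exp(theta1 * T[0] + theta2 * T[1] - A + C)
    assert np.isclose(p_expfam, norm.pdf(y, loc=mu, scale=sigma))
print("Normal (2nd view) exponential-family form verified")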

1.4 Properties of exponential family

The log partition function A(θ) plays a key role, so let us now look at it more carefully. First, since

    ∫ P_θ(y) dy = ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] dy = 1,

taking ∇_θ, we have

    ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] · (T(y) − ∇_θ A(θ))/φ dy = 0.

Thus

    ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] T(y) dy
        = ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] ∇_θ A(θ) dy
        = ∇_θ A(θ) ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] dy.

Since

    ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] dy = 1,

we have the following:

Proposition 1. ∇_θ A(θ) = ∫ P_θ(y) T(y) dy = E[T(Y)], where Y is the random variable with distribution P_θ(y).

Writing this componentwise, we have

    ∂A/∂θ_i = ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] T_i(y) dy.

Taking the second partial derivative, we have

    ∂²A/∂θ_i∂θ_j = (1/φ) ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] { T_j(y) − ∂A(θ)/∂θ_j } T_i(y) dy
                 = (1/φ) ( ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] T_i(y) T_j(y) dy
                           − (∂A(θ)/∂θ_j) ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] T_i(y) dy ).

Using Proposition 1 once more, we have

    ∂²A/∂θ_i∂θ_j = (1/φ) { E[T_i(Y) T_j(Y)] − E[T_i(Y)] E[T_j(Y)] }
                 = (1/φ) E[ (T_i(Y) − E[T_i(Y)]) (T_j(Y) − E[T_j(Y)]) ]
                 = (1/φ) Cov(T_i(Y), T_j(Y)).

Therefore we have the following result on the Hessian matrix of A.

Proposition 2. D²_θ A = ( ∂²A/∂θ_i∂θ_j ) = (1/φ) Cov(T(Y), T(Y)) = (1/φ) Cov(T(Y)), where Y is the random variable with distribution P_θ(y). Here, Cov(T(Y), T(Y)) denotes the covariance matrix of T(Y) with itself, which is sometimes called the variance matrix.

Since the covariance matrix is always positive semi-definite, we have

Corollary 1. A(θ) is a convex function of θ.

Remark. In most exponential family models, Cov(T(Y)) is positive definite, in which case A(θ) is strictly convex. In this lecture and throughout the course, we always assume that Cov(T(Y)) is positive definite and therefore that A(θ) is strictly convex.
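Propositions 1 and 2 can be checked numerically. The sketch below (Python; not part of the original notes) differentiates the Bernoulli log partition function A(θ) = log(1 + e^θ) by finite differences, with φ = 1, and compares the results with E[T(Y)] = µ and Var(T(Y)) = µ(1 − µ).

import numpy as np

A = lambda t: np.log1p(np.exp(t))   # Bernoulli log partition function, phi = 1

theta, h = 0.7, 1e-4
dA  = (A(theta + h) - A(theta - h)) / (2 * h)               # finite-difference A'(theta)
d2A = (A(theta + h) - 2 * A(theta) + A(theta - h)) / h**2   # finite-difference A''(theta)

mu = np.exp(theta) / (1 + np.exp(theta))   # E[T(Y)] = E[Y] = mu
var = mu * (1 - mu)                        # Cov(T(Y)) = Var(Y)

print(dA, mu)      # Proposition 1: A'(theta) = E[T(Y)]
print(d2A, var)    # Proposition 2: A''(theta) = Var(T(Y))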

1.5 Maximum likelihood estimation

Let D = {y^{(i)}}_{i=1}^N be a given data set of IID samples, where y^{(i)} ∈ R^m. Then its likelihood function L(θ) and log likelihood function l(θ) are given by

    L(θ) = ∏_{i=1}^N exp[ (θ·T(y^{(i)}) − A(θ))/φ + C(y^{(i)}, φ) ]

    l(θ) = (1/φ) { Σ_{i=1}^N θ·T(y^{(i)}) − N A(θ) } + Σ_{i=1}^N C(y^{(i)}, φ).

Since A(θ) is strictly convex, l(θ) is strictly concave, so it has a unique maximizer θ̂, which is the unique solution of ∇_θ l(θ) = 0. Note that

    ∇_θ l(θ) = (1/φ) { Σ_{i=1}^N T(y^{(i)}) − N ∇_θ A(θ) }.

So we have the following:

Proposition 3. There is a unique θ̂ that maximizes l(θ), at which

    ∇_θ A(θ)|_{θ=θ̂} = (1/N) Σ_{i=1}^N T(y^{(i)}).

Example: Bernoulli distribution

Let us apply the above argument to the Bernoulli distribution Bernoulli(µ) and see what it leads to. First, differentiating A(θ) as in (5), we get

    A′(θ) = e^θ/(1 + e^θ) = σ(θ).

Thus by Proposition 3 we have a unique θ̂ satisfying

    σ(θ̂) = (1/N) Σ_{i=1}^N T(y^{(i)}).

Since T(y) = y, and y takes on 0 or 1 as its value, we have T(y) = I(y = 1). Defining µ̂ = σ(θ̂) as in (4), we then have

    µ̂ = σ(θ̂) = (1/N) Σ_{i=1}^N I(y^{(i)} = 1) = (number of 1's in the sample {y^{(i)}}) / N.

This confirms the usual estimate of µ.
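Proposition 3 and the Bernoulli computation above can be reproduced in a few lines (a Python sketch; not part of the original notes): draw IID Bernoulli samples, solve σ(θ̂) = (1/N) Σ_i T(y^{(i)}) for θ̂, and compare µ̂ = σ(θ̂) with the sample fraction of ones.

import numpy as np

rng = np.random.default_rng(0)
mu_true = 0.3
y = rng.binomial(1, mu_true, size=10_000)    # IID Bernoulli(mu_true) samples

t_bar = y.mean()                             # (1/N) sum_i T(y^(i)), with T(y) = y
theta_hat = np.log(t_bar / (1 - t_bar))      # solve sigma(theta) = t_bar, i.e. theta_hat = logit(t_bar)
mu_hat = np.exp(theta_hat) / (1 + np.exp(theta_hat))

print(mu_hat, t_bar)    # mu_hat equals the fraction of 1's in the sample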

Remark (T is a sufficient statistic). It is well known that the statistic S(y^{(1)}, ..., y^{(N)}) = Σ_{i=1}^N T(y^{(i)}) is sufficient, which means that the estimate θ̂ is constant on each level set {(y^{(1)}, ..., y^{(N)}) : S(y^{(1)}, ..., y^{(N)}) = const}. This provides many advantages in estimating θ, as one need not look inside the level sets.

1.6 Maximum Entropy Distribution

The exponential family is useful in that it is the most random distribution under given constraints. Let Y be a random variable with an unknown distribution P(y). Suppose the expected values of certain features f_1(y), ..., f_k(y) are nonetheless known. Namely,

    ∫ f_i(y) P(y) dy = C_i    (6)

is known for some given constant C_i, for i = 1, ..., k. The question is how to choose P(y) so that it is as random as possible while satisfying the above constraints. One approach is to construct a maximum entropy distribution satisfying the constraints (6). Entropy is a measure of randomness: in general, the higher it is, the more random the distribution is deemed to be. The definition of entropy is

    H = − Σ_y P(y) log P(y)

for the discrete case; for the continuous case, the sum becomes an integral, so that

    H = − ∫ P(y) log P(y) dy.

We will use the integral notation here without loss of generality. To solve the problem, we use the Lagrange multiplier method and set up the variational problem for the Lagrangian J(P, λ):

    J(P, λ) = − ∫ P(y) log P(y) dy + λ_0 ( 1 − ∫ P(y) dy ) + Σ_{i=1}^k λ_i ( C_i − ∫ f_i(y) P(y) dy ).

The solution is found as a critical point. Namely, setting the variational (functional or Fréchet) derivative equal to 0,

    δJ/δP = −1 − log P(y) − λ_0 − Σ_{i=1}^k λ_i f_i(y) = 0.    (7)

To see where this comes from, let η be any smooth function with compact support and define

    J(P + εη, λ) = − ∫ (P + εη) log(P + εη) dy + λ_0 ( 1 − ∫ (P + εη) dy ) + Σ_{i=1}^k λ_i ( C_i − ∫ f_i (P + εη) dy ).

Taking the derivative with respect to ε at ε = 0 and setting it equal to 0, we get

    d/dε J(P + εη, λ)|_{ε=0} = − ∫ (η + η log P) dy − λ_0 ∫ η dy − Σ_{i=1}^k λ_i ∫ f_i η dy
                             = ∫ ( −1 − log P − λ_0 − Σ_{i=1}^k λ_i f_i ) η dy = 0.

Since the above formula is true for any η, we have

    −1 − log P − λ_0 − Σ_{i=1}^k λ_i f_i = 0,

which is (7). Solving (7) for P(y), we have

    P(y) = (1/Z) exp( − Σ_{i=1}^k λ_i f_i(y) ),

where Z = e^{1+λ_0}. Now

    1 = ∫ P(y) dy = (1/Z) ∫ exp( − Σ_{i=1}^k λ_i f_i(y) ) dy.

Thus

    Z = ∫ exp( − Σ_{i=1}^k λ_i f_i(y) ) dy.

Therefore P(y) belongs to the exponential family of distributions. This kind of distribution is in general called the Gibbs distribution.
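For a discrete toy illustration of this construction (not part of the original notes), take y ∈ {0, 1, 2, 3} with a single feature f(y) = y constrained to have mean C = 1.2. The Python sketch below solves for the multiplier λ so that the Gibbs distribution P(y) ∝ exp(−λy) meets the constraint.

import numpy as np
from scipy.optimize import brentq

ys = np.arange(4)     # support {0, 1, 2, 3}
C = 1.2               # required mean of the feature f(y) = y

def gibbs(lam):
    w = np.exp(-lam * ys)          # unnormalized Gibbs weights exp(-lambda * f(y))
    return w / w.sum()

def constraint_gap(lam):
    return gibbs(lam) @ ys - C     # E[f(Y)] - C, zero at the solution

lam_star = brentq(constraint_gap, -10.0, 10.0)
P = gibbs(lam_star)
print(P, P @ ys)                   # maximum entropy distribution and its mean (= 1.2)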

2 Generalized linear model (GLM)

2.1 Mean parameter and canonical link function

Before we embark on the generalized linear model, we need the following elementary fact.

Lemma 1. Let f : R^k → R be a strictly convex C^1 function. Then ∇_x f : R^k → R^k is an invertible function.

Proof. Let p ∈ R^k be any given vector. By the strict convexity and the absence of corners (the C^1 condition), there is a unique hyperplane H in R^{k+1} that is tangent to the graph of y = f(x) and has a normal vector of the form (p, −1). Let (x, y) be the point of contact. Since the graph is the zero set of F(x, y) = f(x) − y, its normal vector at (x, y) is (∇f(x), −1). Therefore we must have p = ∇f(x).

[Figure 1: Graph of a strictly convex function with a tangent plane]

Let us now look at the exponential family

    P_θ(y) = exp[ θ·T(y) − A(θ) + C(y) ],    (8)

where the dispersion parameter is set to 1 for the sake of simplicity of presentation. (The argument does not change much even in the presence of the dispersion parameter.) Recall that Proposition 1 says that ∇_θ A(θ) = E[T(Y)]. We always assume that the log partition function A : R^k → R is strictly convex and C^1. Thus, by the above Lemma, ∇_θ A : R^k → R^k is invertible.
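In practice, the Lemma is what lets one recover θ from a given mean of T(Y) even when no closed-form inverse is available. The sketch below (Python with scipy; not part of the original notes) inverts ∇A numerically for the Bernoulli case A(θ) = log(1 + e^θ) and checks that the answer agrees with the closed-form logit.

import numpy as np
from scipy.optimize import brentq

A_prime = lambda t: np.exp(t) / (1 + np.exp(t))     # gradient of A(theta) = log(1 + e^theta)

mu = 0.7                                            # a given mean parameter in (0, 1)
theta = brentq(lambda t: A_prime(t) - mu, -50, 50)  # numerically invert the mean function

print(theta, np.log(mu / (1 - mu)))                 # agrees with the closed-form logit(mu)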

We now set up a few terminologies.

Definition 2. The mean parameter µ is defined as µ = E[T(Y)] = ∇_θ A(θ), and the function ∇_θ A : R^k → R^k is called the mean function.

Since ∇_θ A is invertible, we define

Definition 3. The inverse function (∇_θ A)^{−1} of ∇_θ A is called the canonical link function and is denoted by ψ. Thus θ = ψ(µ) = (∇_θ A)^{−1}(µ).

Therefore, combining the above definitions, the mean function is also written as

    µ = ψ^{−1}(θ) = ∇_θ A(θ).

Example: Bernoulli(µ)

Recall that for Bernoulli(µ), P(y) is given by

    P(y) = µ^y (1 − µ)^{1−y}
         = exp[ y log(µ/(1 − µ)) + log(1 − µ) ]
         = exp[ θy − log(1 + e^θ) ]
         = exp[ θy − A(θ) ],

where θ = log(µ/(1 − µ)) and A(θ) = log(1 + e^θ). Thus

    µ = sigmoid(θ) = A′(θ)
    θ = logit(µ) = (A′)^{−1}(µ).

Therefore, for the Bernoulli distribution, the sigmoid function is the mean function and the logit function is the canonical link function.

2.2 Conditional probability in GLM

From now on in this lecture, to simplify notation, we use ω to stand for both w and b, and extend x ∈ R^d to x ∈ R^{d+1} by adding x_0 = 1. (In the previous lectures, we used θ as a generic symbol to represent both w and b. But in the exponential family notation, θ is reserved for the canonical parameter, so we are forced to use ω. One more note: although the typography is difficult to discern, this ω is the Greek lower-case omega, not the English w.)

2.2.1 GLM recipe

With this notation and ω = (b, w_1, ..., w_d), the linear expression in (1) of Lecture 3 is now written as ω·x, and thus the conditional probability of Lecture 3 becomes

    P(y = 1 | x) = e^{ω·x} / (1 + e^{ω·x})
    P(y = 0 | x) = 1 / (1 + e^{ω·x}),

which can be written compactly as

    P(y | x) = e^{(ω·x)y} / (1 + e^{ω·x}),

for y ∈ {0, 1}. Using (5), this is easily seen to be equivalent to the following conditional probability model:

    P(y | x) = exp[ (ω·x)y − A(ω·x) ] = exp[ θy − A(θ) ].

Summarizing this process, we have the following:

GLM Recipe: P(y | x) is obtained from P_θ(y) by replacing the canonical parameter θ in (8) with a linear expression in x. In the binary case, the linear expression is ω·x.

The multiclass case mimics this in exactly the same way. To describe it correctly, first define the parameter matrix

    W = [ ω_1^T ]   [ ω_10 ... ω_1j ... ω_1d ]
        [  ...  ]   [  ...        ...        ]
        [ ω_i^T ] = [ ω_i0 ... ω_ij ... ω_id ]    (9)
        [  ...  ]   [  ...        ...        ]
        [ ω_k^T ]   [ ω_k0 ... ω_kj ... ω_kd ]

Then replace θ in (8) with Wx to get the conditional probability

    P(y | x) = exp[ (Wx)·T(y) − A(Wx) + C(y) ].    (10)

Written this way, we can see that

    (Wx)·T(y) = Σ_{l=1}^k Σ_{j=0}^d ω_lj x_j T_l(y).

Now, let D = {(x^{(i)}, y^{(i)})}_{i=1}^N be a given data set, where the x-part of the i-th data point is represented by the column vector x^{(i)} = [x_0^{(i)}, ..., x_d^{(i)}]^T. Using the conditional probability as in (10), one can write the likelihood function and the log likelihood function as

    L(W) = ∏_{i=1}^N exp[ (Wx^{(i)})·T(y^{(i)}) − A(Wx^{(i)}) + C(y^{(i)}) ]

    l(W) = log L(W)
         = Σ_{i=1}^N [ (Wx^{(i)})·T(y^{(i)}) − A(Wx^{(i)}) + C(y^{(i)}) ]
         = Σ_{i=1}^N [ Σ_{l=1}^k Σ_{j=0}^d ω_lj x_j^{(i)} T_l(y^{(i)}) − A(ω_1^T x^{(i)}, ..., ω_k^T x^{(i)}) + C(y^{(i)}) ].

Thus, taking the derivative, one gets

    ∂l(W)/∂ω_rs = Σ_{i=1}^N [ Σ_{l=1}^k Σ_{j=0}^d x_j^{(i)} δ_lr δ_js T_l(y^{(i)}) − Σ_{l=1}^k (∂A/∂θ_l) Σ_{j=0}^d x_j^{(i)} δ_lr δ_js ]
                = Σ_{i=1}^N [ x_s^{(i)} T_r(y^{(i)}) − (∂A/∂θ_r) x_s^{(i)} ]
                = Σ_{i=1}^N [ T_r(y^{(i)}) − (∂A/∂θ_r)(Wx^{(i)}) ] x_s^{(i)}.    (11)

This is a generalized form of what we got for logistic regression (see (10) of Lecture 3); a minimal instance of (11) for the binary case is sketched below.
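In the binary (Bernoulli) case with T(y) = y and A(θ) = log(1 + e^θ), formula (11) reduces to ∂l/∂ω_j = Σ_i [ y^{(i)} − σ(ω·x^{(i)}) ] x_j^{(i)}. The following gradient ascent sketch (Python; not part of the original notes, using made-up simulated data) implements exactly this.

import numpy as np

rng = np.random.default_rng(1)
N, d = 500, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # x_0 = 1 absorbs the bias
omega_true = np.array([-0.5, 2.0, -1.0])

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
y = rng.binomial(1, sigmoid(X @ omega_true))                # simulated binary labels

omega = np.zeros(d + 1)
lr = 0.1
for _ in range(2000):
    grad = X.T @ (y - sigmoid(X @ omega)) / N   # formula (11) for the Bernoulli GLM, averaged over the data
    omega += lr * grad                          # gradient ascent on the log likelihood

print(omega)   # should be close to omega_true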

2.3 Multiclass classification (softmax regression)

We now look at the multiclass classification problem. Formula (11) is the derivative formula with which the gradient descent algorithm can be used. But for multiclass regression, the categorical distribution has the redundancy we talked about in Lecture 3. As usual, let y_i = I(y = i) and let Prob[Y = i] = µ_i. Then we must have y_1 + ... + y_k = 1 and µ_1 + ... + µ_k = 1. Thus the probability (PMF) can be written as

    P(y) = µ_1^{y_1} µ_2^{y_2} ... µ_k^{y_k}.

Rewrite P(y) as

    P(y) = µ_1^{y_1} µ_2^{y_2} ... µ_{k−1}^{y_{k−1}} µ_k^{1 − Σ_{i=1}^{k−1} y_i}
         = exp[ y_1 log µ_1 + ... + y_{k−1} log µ_{k−1} + (1 − Σ_{i=1}^{k−1} y_i) log µ_k ]
         = exp[ Σ_{i=1}^{k−1} y_i log(µ_i/µ_k) + log µ_k ].

Note that when k = 2, this is exactly the Bernoulli distribution. Define θ_i = log(µ_i/µ_k) and T_i(y) = y_i for i = 1, ..., k − 1. Using the facts that µ_i = µ_k e^{θ_i} and 1 − µ_k = Σ_{i=1}^{k−1} µ_i, and solving for µ_k and then for µ_j, we get

    µ_k = 1 / (1 + Σ_{i=1}^{k−1} e^{θ_i}),
    µ_j = e^{θ_j} / (1 + Σ_{i=1}^{k−1} e^{θ_i}),    (12)

for j = 1, ..., k − 1. The expression on the right-hand side of (12) is called the generalized sigmoid (softmax) function. Therefore P(y) can be written in the exponential family form as

    P_θ(y) = exp[ θ·T(y) − A(θ) ],

where

    θ = (θ_1, ..., θ_{k−1}),   T(y) = (y_1, ..., y_{k−1}),

and

    A(θ) = −log µ_k = log( 1 + Σ_{i=1}^{k−1} e^{θ_i} ).
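The generalized sigmoid (12) is easy to code directly; the following sketch (Python; not part of the original notes) recovers µ_1, ..., µ_{k−1} and µ_k from θ, checks that they sum to one, and verifies that the generalized logit recovers θ.

import numpy as np

def generalized_sigmoid(theta):
    """Map the k-1 canonical parameters to (mu_1, ..., mu_{k-1}) and mu_k, as in (12)."""
    e = np.exp(theta)
    mu_k = 1.0 / (1.0 + e.sum())
    return e * mu_k, mu_k

theta = np.array([0.2, -1.0, 0.5])      # k = 4 classes, so k-1 = 3 canonical parameters
mu, mu_k = generalized_sigmoid(theta)

print(mu, mu_k, mu.sum() + mu_k)        # class probabilities; the total is 1
print(np.log(mu / mu_k), theta)         # generalized logit log(mu_j / mu_k) recovers theta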

Note that the mean parameter µ_i is given as

    µ_i = E[T_i(Y)] = E[Y_i] = Prob[Y_i = 1] = Prob[Y = i],

which again can be calculated from the fact that µ = ∇_θ A. Indeed, one can verify that

    ∂A/∂θ_j = e^{θ_j} / (1 + Σ_{i=1}^{k−1} e^{θ_i}) = µ_j.

We have shown above that θ_j = log(µ_j/µ_k). Thus for j = 1, ..., k − 1, we have

    θ_j = log( µ_j / (1 − Σ_{i=1}^{k−1} µ_i) ).

The expression on the right side of the above equation is called the generalized logit function.

2.4 Probit regression

Recall that the essence of the GLM for the exponential family is the way it links the (outside) features x_1, ..., x_d to the probability model. To be specific, given the probability model (8), the linking was done by setting θ = ψ(µ) = ω·x, so that the conditional probability becomes

    P(y | x) = exp[ (ω·x)·T(y) − A(ω·x) + C(y) ].

But there is no a priori reason why µ has to be related to ω·x via the canonical link function only. One may use any function, as long as it is invertible. So set g(µ) = ω·x, where g : R^k → R^k is an invertible function. We call such a g a link function, as opposed to the canonical link function, and its inverse g^{−1} the mean function. So we have

    g(µ) = ω·x : link function
    µ = g^{−1}(ω·x) : mean function.

Let Φ(t) be the cumulative distribution function (CDF) of the standard normal distribution N(0, 1), i.e.,

    Φ(t) = (1/√(2π)) ∫_{−∞}^t e^{−s²/2} ds.

Probit regression deals with the following situation: the output y is binary, taking values in {0, 1}, and its probability model is Bernoulli, so P(y) = µ^y (1 − µ)^{1−y}, where µ = P(y = 1). Its link to the outside variables (features) is via the link function g = Φ^{−1}. Thus

    g(µ) = ω·x = ω_0 + ω_1 x_1 + ... + ω_d x_d
    µ = g^{−1}(ω·x) = Φ(ω·x).    (13)

Therefore,

    P(y | x) = Φ(ω·x)^y (1 − Φ(ω·x))^{1−y}.

Since this is still a Bernoulli, hence an exponential family distribution, the general machinery of GLM can be employed. But in probit regression we change tack and take a different approach. First, we let y ∈ {−1, 1}. Then since

    µ = E[Y] = 1·P(y = 1) + (−1)·P(y = −1) = P(y = 1) − [1 − P(y = 1)],

we have µ = 2P(y = 1) − 1. Thus P(y = 1) = ½(1 + µ) and P(y = −1) = ½(1 − µ). Therefore,

    P(y) = ½ (1 + yµ).

Thus, using (13), we have

    P(y | x) = ½ [1 + y Φ(ω·x)].

Now let the data D = {(x_i, y_i)}_{i=1}^n be given. Then the log likelihood function is

    l(ω) = Σ_{i=1}^n log P(y_i | x_i) = Σ_{i=1}^n log( ½ [1 + y_i Φ(ω·x_i)] ).

Thus

    ∂l(ω)/∂ω_k = Σ_{i=1}^n [ y_i Φ′(ω·x_i) / (1 + y_i Φ(ω·x_i)) ] x_ik,

which is in a neat form for applying the numerical methods introduced in Lecture 3, where

    Φ′(ω·x) = (1/√(2π)) e^{−(ω·x)²/2}.
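With Φ and Φ′ available from scipy, the probit log likelihood and its gradient above translate directly into code. The following sketch (Python; not part of the original notes, with made-up simulated data) evaluates l(ω) and ∂l/∂ω for labels y ∈ {−1, 1}.

import numpy as np
from scipy.stats import norm

def probit_loglik_and_grad(omega, X, y):
    """l(omega) = sum_i log(0.5 * (1 + y_i * Phi(omega . x_i))) and its gradient."""
    z = X @ omega
    Phi = norm.cdf(z)        # Phi(omega . x_i)
    dPhi = norm.pdf(z)       # Phi'(omega . x_i)
    loglik = np.sum(np.log(0.5 * (1.0 + y * Phi)))
    grad = X.T @ (y * dPhi / (1.0 + y * Phi))
    return loglik, grad

rng = np.random.default_rng(2)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
y = np.where(X @ np.array([0.3, 1.0, -1.5]) + rng.normal(size=200) > 0, 1, -1)

print(probit_loglik_and_grad(np.zeros(3), X, y))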

Comments.

(1) When one uses a link function which is not canonical, the probability model does not even have to belong to the exponential family. As long as one can get hold of the conditional distribution P(y | x), one is in business: with it, one can form the likelihood function and apply the rest of the maximum likelihood machinery.

(2) One may not even need the probability distribution, as long as one can somehow estimate ω so that decisions can be made.

(3) Historically, the GLM was first developed for the exponential family but was later extended to non-exponential families and even to cases where the distribution is not completely known.

2.5 GLM for other distributions

GLM for Poisson(µ)

Recall that the Poisson distribution is the probability distribution (PMF) on the non-negative integers given by

    P(n) = e^{−µ} µ^n / n!,

for n = 0, 1, 2, .... It is an exponential family distribution, as it can be written as

    P(y) = e^{−µ} µ^y / y! = exp{ y log µ − µ − log(y!) },

where y = 0, 1, 2, .... It can be put in the exponential family form by setting

    θ = log µ : canonical link function
    µ = e^θ : mean function.

One can independently check that E[Y] = Σ_{n=0}^∞ n P(n) = µ.
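The Poisson case can be checked in the same spirit (a Python sketch; not part of the original notes): with the canonical link θ = log µ, the exponential-family form exp{yθ − e^θ − log(y!)} reproduces the Poisson PMF, and A(θ) = e^θ so that A′(θ) = e^θ = µ is the mean.

import numpy as np
from math import factorial

mu = 2.5
theta = np.log(mu)            # canonical link theta = log(mu)
A = np.exp(theta)             # A(theta) = e^theta; also A'(theta) = e^theta = mu

for n in range(10):
    pmf = np.exp(-mu) * mu**n / factorial(n)
    expfam = np.exp(theta * n - A - np.log(float(factorial(n))))   # exp[theta*y - A(theta) + C(y)]
    assert np.isclose(pmf, expfam)
print("Poisson exponential-family form verified; mean =", A)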

GLM for Γ(α, λ)

Recall that the gamma distribution is written in the following form:

    P(y) = λ e^{−λy} (λy)^{α−1} / Γ(α),   for y ≥ 0,
         = exp{ log λ − λy + (α − 1) log(λy) − log Γ(α) }
         = exp{ α log λ − λy + (α − 1) log y − log Γ(α) }.

Set φ = 1/α as the dispersion parameter and let θ = −λ/α. Then P(y) is an exponential family distribution written as

    P_θ(y) = exp{ [ θy − (−log(−θ)) ] / φ + C(y, φ) },

where

    C(y, φ) = (1/φ) log(1/φ) + (1/φ − 1) log y − log Γ(1/φ).

Therefore the log partition function is A(θ) = −log(−θ). Differentiating this, we get the mean parameter

    µ = A′(θ) = −1/θ.

Since E[Y] = α/λ for Y ∼ Γ(α, λ), this fact is independently verified. To recap, the mean function and the canonical link function are given by

    µ = ψ^{−1}(θ) = −1/θ : mean function
    θ = ψ(µ) = −1/µ : canonical link function.

In practice, this canonical link function is rarely used. Instead one uses one of the following three alternatives:

(i) the inverse link g(µ) = 1/µ,
(ii) the log link g(µ) = log µ,
(iii) the identity link g(µ) = µ.

GLM Summary

The GLM link and inverse link (mean) functions for the various probability models are summarized in Table 1. Here, the range of j in the link and the inverse link of the categorical distribution is j = 1, ..., k − 1, although µ = (µ_1, ..., µ_k).

    Model           | Range             | Link                                        | Inverse link (mean)                           | Dispersion
    N(µ, σ²)        | (−∞, ∞)           | θ = µ                                       | µ = θ                                         | σ²
    Bernoulli(µ)    | {0, 1}            | θ = logit(µ)                                | µ = σ(θ)                                      | 1
    Categorical(µ)  | {1, ..., k}       | θ_j = log( µ_j / (1 − Σ_{i=1}^{k−1} µ_i) )  | µ_j = e^{θ_j} / (1 + Σ_{i=1}^{k−1} e^{θ_i})   | 1
    Probit          | {0, 1} or {−1, 1} | g = Φ^{−1}                                  | g^{−1} = Φ                                    | 1
    Poisson(µ)      | Z_+               | θ = log µ                                   | µ = e^θ                                       | 1
    Gamma(α, λ)     | R_+               | θ = −1/µ                                    | µ = −1/θ                                      | 1/α

    Table 1: GLM Summary
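The one-parameter rows of Table 1 can be exercised programmatically; the sketch below (Python; not part of the original notes) lists each canonical link/inverse-link pair and verifies the round trip µ = g^{−1}(g(µ)) on a test value.

import numpy as np

# canonical link g and inverse link (mean function) for the one-parameter rows of Table 1
links = {
    "normal":    (lambda mu: mu,                     lambda th: th),
    "bernoulli": (lambda mu: np.log(mu / (1 - mu)),  lambda th: 1.0 / (1.0 + np.exp(-th))),
    "poisson":   (lambda mu: np.log(mu),             lambda th: np.exp(th)),
    "gamma":     (lambda mu: -1.0 / mu,              lambda th: -1.0 / th),
}

test_mu = {"normal": 1.3, "bernoulli": 0.25, "poisson": 4.0, "gamma": 2.0}
for name, (g, g_inv) in links.items():
    mu = test_mu[name]
    print(name, np.isclose(g_inv(g(mu)), mu))   # round trip mu -> theta -> mu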
