1 Multiple regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Multiple regression 1 / 36

2 Previous two lectures: Linear and logistic regression, viewed as probabilistic models that connect data and parameters. Estimating parameters by maximizing likelihood; this requires solving an optimization problem (analytical solutions are available in special cases, e.g. linear regression). Examples of numerical optimization: gradient descent, Newton's method. Important mathematical idea: convexity. Multiple regression 2 / 36

3 Applications to GWAS Practical issues Model checking Multiple hypothesis testing Multiple regression 3 / 36

4 This lecture Can we predict phenotype from genotype? Depends on heritability Multiple regression: ridge regression Bayesian statistics Multiple regression 4 / 36

5 GWAS so far 2554 studies and SNP-phenotype associations Success? Can we use the results of GWAS to predict phenotype? Multiple regression 5 / 36

7 Outline Heritability Bayesian statistics Bernoulli model Ridge regression Probabilistic interpretation Estimating β Estimating hyperparameters Back to Heritability Multiple regression Heritability 6 / 36

8 How well does a model fit the data? (Narrow-sense) heritability h^2: what is the accuracy of the best linear predictor of the phenotype? After we learn the parameters, how well can the model predict the phenotype? Narrow-sense: space of linear models. Broad-sense: space of all models. h^2 = R^2, where R^2 = 1 - SS_res/SS_tot, SS_res = Σ_{i=1}^n (y_i - β̂^T x_i)^2, and SS_tot = Σ_{i=1}^n (y_i - ȳ)^2. Multiple regression Heritability 7 / 36
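
As a quick illustration of the R^2 definition above, here is a minimal Python sketch (not part of the original slides; the function name and variables are illustrative):

```python
# A minimal sketch: R^2 of a fitted linear predictor, matching
# R^2 = 1 - SS_res / SS_tot from the slide above.
import numpy as np

def r_squared(y, X, beta_hat):
    """R^2 of the linear predictor X @ beta_hat for phenotypes y."""
    ss_res = np.sum((y - X @ beta_hat) ** 2)   # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)       # total sum of squares
    return 1.0 - ss_res / ss_tot
```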

9 Examples of heritability. [Figure: two scatter plots of phenotype against genotype, left panel h^2 = 0.2, right panel h^2 = 0.8.] With higher heritability we can better predict phenotype from genotype: left R^2 = 0.19 vs. right R^2 = 0.79. Multiple regression Heritability 8 / 36

10 Heritability estimates from GWAS: 18 GWAS variants for type 2 diabetes explained 6% of known heritability (Manolio et al., Nature 2009). 180 GWAS loci for height explain 10% of phenotypic variance (Lango-Allen et al., Nature 2010) from a sample of 133,653 individuals; h^2 for height is estimated to be 0.80 (Silventoinen et al., Twin Res. 2003). Multiple regression Heritability 9 / 36

11 Heritability estimates from GWAS (So et al., Gen. Epi. 2011):
Disease            h^2    GWAS loci   h^2 explained by GWAS loci   % of h^2 explained
Alzheimer's        0.79   4           0.18                         23%
Bipolar disorder   0.77   5           0.02                         3%
Breast cancer      0.53   13          0.07                         13%
CAD                0.49   12          0.12                         25%
Crohn's disease    0.55   32          0.07                         13%
Prostate cancer    0.50   27          0.15                         31%
Schizophrenia      0.81   4           0.00                         0%
SLE (lupus)        0.66   23          0.09                         13%
Type 1 diabetes    0.80   45          0.11                         14%
Type 2 diabetes    0.42   25          0.12                         28%
Multiple regression Heritability 9 / 36

12 The model for the phenotype: y = β_0 + Σ_{j=1}^m β_j x_j + ε, where y is the phenotype, x_j is the genotype at SNP j (sampled independently), and ε ~ N(0, σ^2). Then Var[y] = Var[β_0 + Σ_{j=1}^m β_j x_j + ε] = Σ_{j=1}^m Var[β_j x_j] + Var[ε] = Σ_{j=1}^m β_j^2 Var[x_j] + σ^2. Multiple regression Heritability 10 / 36

13 The model for the phenotype. To simplify notation, we assume that each genotype is standardized, so E[x_j] = 0 and Var[x_j] = 1, and that the phenotype has mean 0, E[y] = 0. Then Var[y] = Σ_{j=1}^m β_j^2 Var[x_j] + σ^2 = Σ_{j=1}^m β_j^2 + σ^2. Multiple regression Heritability 10 / 36
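
A small simulation sketch (not from the slides) checking the variance decomposition Var[y] = Σ_j β_j^2 + σ^2 for standardized genotypes, and the implied heritability; all numbers below are made up for illustration:

```python
# Simulate y = X beta + eps with standardized genotypes and compare the
# empirical Var[y] against sum(beta^2) + sigma^2.
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma2 = 50_000, 100, 1.0
beta = rng.normal(0.0, 0.1, size=m)            # true effect sizes
X = rng.normal(0.0, 1.0, size=(n, m))          # standardized genotypes (mean 0, var 1)
y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n)

var_y_theory = np.sum(beta ** 2) + sigma2
h2_theory = np.sum(beta ** 2) / var_y_theory
print(np.var(y), var_y_theory)                 # empirical vs. theoretical Var[y]
print(h2_theory)                               # h^2 = sum(beta^2) / (sum(beta^2) + sigma^2)
```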

14 Heritability. What is the best possible accuracy for predicting phenotype from genotype? The best accuracy is obtained when we use the true model (which we don't know in practice, of course). Accuracy here refers to low mean squared error: E[(y - (β_0 + β^T x))^2] = E[ε^2] = σ^2. Also note that E[(y - (β_0 + β^T x))^2] ≤ Var[y]. Multiple regression Heritability 11 / 36

16 Heritability. If we knew the true model: h^2 = 1 - E[(y - (β_0 + β^T x))^2] / Var[y] = 1 - σ^2 / (Σ_{j=1}^m β_j^2 + σ^2) = Σ_{j=1}^m β_j^2 / (Σ_{j=1}^m β_j^2 + σ^2). In practice we don't know the values of β_j and σ^2. Multiple regression Heritability 12 / 36

18 What happens in GWAS? We test each of m SNPs for association. A is the set of associated SNPs, and β̂_j is the estimate of the effect size. The heritability of the associated SNPs is ĥ^2_A = Σ_{j∈A} β̂_j^2 / Var[y]. If the number of discoveries |A| is smaller than the number of truly associated SNPs (at most m), then ĥ^2_A < h^2; the difference is termed missing heritability. Multiple regression Heritability 13 / 36
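
A rough sketch (not from the slides) of why summing squared effects over only the genome-wide-significant SNPs underestimates h^2; the sample size, effect-size distribution, standard-error approximation, and threshold below are all arbitrary choices for illustration:

```python
# Estimate per-SNP effects by single-SNP regression, keep only the
# "significant" ones, and compare h^2_A with the true h^2.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, m, sigma2 = 2_000, 200, 1.0
beta = rng.normal(0.0, 0.05, size=m)
X = rng.normal(size=(n, m))
y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n)

h2_true = np.sum(beta ** 2) / (np.sum(beta ** 2) + sigma2)

beta_hat = X.T @ y / n                         # per-SNP marginal effect estimates
se = np.sqrt(np.var(y) / n)                    # rough per-SNP standard error
pvals = 2 * stats.norm.sf(np.abs(beta_hat) / se)
A = pvals < 5e-8                               # genome-wide significance threshold
h2_A = np.sum(beta_hat[A] ** 2) / np.var(y)

print(h2_true, h2_A)                           # h^2_A is typically much smaller
```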

20 Reasons for missing heritability: Power (SNPs that should be in A are not included). The true function is non-linear. Estimates of heritability are biased upwards. Multiple regression Heritability 14 / 36

21 Solving the power issue. Idea: learn a function that relates all SNPs to the phenotype, y = Xβ + ε. We can compute the MLE (equivalently the OLS) estimate: β̂ = (X^T X)^{-1} X^T y. Multiple regression Heritability 15 / 36
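
For concreteness, a minimal sketch of the OLS/MLE estimate (not from the slides); solving the normal equations rather than forming an explicit inverse is the usual numerical practice:

```python
# Minimal OLS sketch: solve (X^T X) beta = X^T y rather than explicitly
# inverting X^T X, which is numerically preferable.
import numpy as np

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, and more robust when X^T X is ill-conditioned:
# beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```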

22 What if X^T X is not invertible in β̂ = (X^T X)^{-1} X^T y? Can you think of any reasons why that could happen? Answer 1: n < m + 1; intuitively, there is not enough data to estimate all the parameters. Answer 2: the columns of X are not linearly independent; intuitively, two features are perfectly correlated. In either case, the solution is not unique. Multiple regression Heritability 16 / 36

23 Ridge regression. For X^T X that is not invertible: β̂ = (X^T X + λI)^{-1} X^T y. This is equivalent to adding an extra term to RSS(β) and minimizing RSS(β) + λ‖β‖_2^2, where RSS(β) = β^T (X^T X) β - 2 (X^T y)^T β (up to a constant) and λ‖β‖_2^2 is the regularization term. Multiple regression Heritability 17 / 36
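
A minimal sketch of the ridge estimate (not from the slides); it shows that the estimate is well-defined even when n < m, where plain OLS breaks down:

```python
# Minimal ridge regression sketch. With n < m, X^T X is singular and OLS is
# not unique, but X^T X + lambda*I is invertible for any lambda > 0.
import numpy as np

def ridge(X, y, lam):
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 200))                 # n = 50 samples, m = 200 SNPs
y = rng.normal(size=50)
beta_ridge = ridge(X, y, lam=1.0)              # well-defined despite n < m
```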

24 Outline Heritability Bayesian statistics Bernoulli model Ridge regression Probabilistic interpretation Estimating β Estimating hyperparameters Back to Heritability Multiple regression Bayesian statistics 18 / 36

25 Bayesian statistics. Frequentist statistics: evaluate θ̂ on repeated samples. θ is fixed, X ~ P is random, so θ̂ = t(X) is also random; e.g., is E[θ̂] = θ? Bayesian: θ is random and X is random; apply Bayes' theorem, P(θ | X) = P(X | θ) P(θ) / P(X), i.e., posterior = likelihood × prior / marginal likelihood. Multiple regression Bayesian statistics 19 / 36

26 Bernoulli model: Bayesian treatment. X_1, ..., X_n iid ~ Ber(p). Likelihood: L(p) = P(x_1, ..., x_n | p) = ∏_{i=1}^n P(x_i | p) = ∏_{i=1}^n p^{x_i} (1-p)^{1-x_i} = p^{n x̄} (1-p)^{n(1-x̄)}. Prior: P(p) = Beta(p; α, β) ∝ p^{α-1} (1-p)^{β-1}. Multiple regression Bayesian statistics 20 / 36

28 Beta distribution. P(x | α, β) = [Γ(α+β) / (Γ(α)Γ(β))] x^{α-1} (1-x)^{β-1} 1{0 ≤ x ≤ 1}. If X ~ Beta(α, β), then E[X] = α/(α+β) and Var[X] = αβ / [(α+β)^2 (α+β+1)]. Multiple regression Bayesian statistics 21 / 36
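
A quick numerical check of these moment formulas (not from the slides), using scipy:

```python
# Check the Beta mean/variance formulas against scipy's implementation.
from scipy import stats

alpha, beta_ = 2.0, 8.0
dist = stats.beta(alpha, beta_)
print(dist.mean(), alpha / (alpha + beta_))
print(dist.var(), alpha * beta_ / ((alpha + beta_) ** 2 * (alpha + beta_ + 1)))
```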

29 Beta distribution with α = β = c. P(x | α, β) = [Γ(α+β)/(Γ(α)Γ(β))] x^{α-1} (1-x)^{β-1} 1{0 ≤ x ≤ 1}. [Figure: densities of Beta(c, c) for several values of c.] If X ~ Beta(c, c), then E[X] = 1/2 and Var[X] = 1/[4(2c+1)]. What happens when c → ∞? What happens when c → 0? Multiple regression Bayesian statistics 21 / 36

30 Beta distribution with α = d, β = cd. P(x | α, β) = [Γ(α+β)/(Γ(α)Γ(β))] x^{α-1} (1-x)^{β-1} 1{0 ≤ x ≤ 1}. [Figure: densities for (α, β) = (0.2, 0.8), (2, 8), (20, 80).] If X ~ Beta(d, cd), then E[X] = 1/(1+c) and Var[X] = c/[(1+c)^2 (d(1+c)+1)]. What happens when d → ∞? Multiple regression Bayesian statistics 21 / 36

31 Bernoulli model: Bayesian treatment. X_1, ..., X_n iid ~ Ber(p). Likelihood: P(x_1, ..., x_n | p) = p^{n x̄} (1-p)^{n(1-x̄)}. Prior: P(p) = Beta(p; α, β) ∝ p^{α-1} (1-p)^{β-1}. Posterior: P(p | x_1, ..., x_n) ∝ P(x_1, ..., x_n | p) P(p) = p^{α + n x̄ - 1} (1-p)^{β + n(1-x̄) - 1}, i.e., the posterior is Beta(p; α + n x̄, β + n(1-x̄)). Multiple regression Bayesian statistics 22 / 36

32 Bernoulli model: Bayesian treatment. Posterior mean: p_MEAN = E[p | x_1, ..., x_n] = ∫ p P(p | x_1, ..., x_n) dp = mean of a Beta with parameters (α + n x̄, β + n(1-x̄)) = (α + n x̄)/(α + β + n) = x̄ · n/(α + β + n) + [α/(α+β)] · (α+β)/(α+β+n). The posterior mean is a convex combination of the MLE and the prior mean. Multiple regression Bayesian statistics 22 / 36

33 Bernoulli model: Bayesian treatment. Posterior mean: E[p | x_1, ..., x_n] = x̄ · n/(α + β + n) + [α/(α+β)] · (α+β)/(α+β+n). The posterior mean is a smoothed version of the MLE. Example: observe all 1s in n trials; then p_MLE = 1 while p_MEAN = (α + n)/(α + β + n) (e.g. with α = β = 5). The posterior mean approaches the MLE as n → ∞, i.e., the prior matters less with more data. The prior can be viewed as adding pseudo-observations. Multiple regression Bayesian statistics 22 / 36
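
A small sketch (not from the slides) reproducing this smoothing effect for the all-ones example, with α = β = 5 and a few illustrative values of n:

```python
# Beta-Bernoulli posterior mean for the "all ones" example. With
# alpha = beta = 5, the prior acts like 10 pseudo-observations with mean 0.5.
alpha, beta_ = 5, 5
for n in (1, 10, 100, 1000):                   # illustrative sample sizes
    p_mle = 1.0                                # all trials are 1, so x-bar = 1
    p_mean = (alpha + n) / (alpha + beta_ + n) # posterior mean
    print(n, p_mle, round(p_mean, 3))
```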

34 Choosing the prior. How do we choose α, β? Subjective Bayes: encode all reasonable assumptions about the domain into the prior. Other considerations: computational efficiency (conjugate priors). Multiple regression Bayesian statistics 23 / 36

35 Outline Heritability Bayesian statistics Bernoulli model Ridge regression Probabilistic interpretation Estimating β Estimating hyperparameters Back to Heritability Multiple regression Ridge regression 24 / 36

36 Review: Probabilistic interpretation of OLS. Linear regression model: y = β^T x + ε, where ε ~ N(0, σ_0^2) is a Gaussian random variable; thus y ~ N(β^T x, σ_0^2). We assume that β is fixed (frequentist interpretation). We define p(y | x, β, σ_0^2) as the sampling distribution given fixed values for the parameters β, σ_0^2. The likelihood function maps parameters to probabilities: L(β, σ_0^2) = p(y | D, β, σ_0^2) = ∏_i p(y_i | x_i, β, σ_0^2). Maximizing the likelihood with respect to β minimizes RSS and yields the OLS solution: β_OLS = β_ML = arg max_β L(β, σ_0^2). Multiple regression Ridge regression 25 / 36

42 Probabilistic interpretation of ridge regression. Ridge regression model: y = β^T x + ε, so y ~ N(β^T x, σ_0^2) is a Gaussian random variable (as before), and β_j ~ N(0, σ^2) are i.i.d. Gaussian random variables (unlike before). Note that y has mean zero, and β is a random variable with a prior distribution. To find β given data D and (σ^2, σ_0^2), we can compute the posterior distribution of β: P(β | D, σ^2, σ_0^2) = P(D | β, σ^2, σ_0^2) P(β) / P(D | σ^2, σ_0^2). Maximum a posteriori (MAP) estimate: β_MAP = arg max_β P(β | D, σ^2, σ_0^2) = arg max_β P(D, β | σ^2, σ_0^2). What's the relationship between MAP and MLE? MAP reduces to MLE if we assume a uniform prior for p(β). Multiple regression Ridge regression 26 / 36

49 Estimating β given hyperparameters (σ^2, σ_0^2). Let y_1, ..., y_n be independent with y_i | β, x_i ~ N(β^T x_i, σ_0^2), and let the β_j be i.i.d. with β_j ~ N(0, σ^2). Joint likelihood of data and parameters (given σ_0, σ): P(D, β) = P(D | β) P(β) = ∏_i P(y_i | x_i, β) ∏_j P(β_j). Joint log likelihood, plugging in the Gaussian PDF: log P(D, β) = Σ_i log P(y_i | x_i, β) + Σ_j log P(β_j) = -Σ_i (β^T x_i - y_i)^2 / (2σ_0^2) - Σ_j β_j^2 / (2σ^2) + const. MAP estimate: β_MAP = arg max_β log P(D, β). As with OLS, set the gradient equal to zero and solve for β. Multiple regression Ridge regression 27 / 36

52 Maximum a posteriori (MAP) estimate. Regularized linear regression: a new error to minimize, E(β) = Σ_i (β^T x_i - y_i)^2 + λ‖β‖_2^2, where λ > 0 denotes σ_0^2/σ^2. The extra term ‖β‖_2^2 is called the regularization term (regularizer) and controls the model complexity. Intuitions: If λ → +∞, then σ_0^2 ≫ σ^2; that is, the variance of the noise is far greater than what our prior allows for β, so the prior on β is more informative than what the data can tell us and we get a simple model; numerically, β_MAP → 0. If λ → 0, then we trust our data more; numerically, β_MAP → β_OLS = arg min_β Σ_i (β^T x_i - y_i)^2. Multiple regression Ridge regression 28 / 36

56 Closed-form solution. For regularized linear regression, the solution changes very little (in form) from the OLS solution: arg min_β Σ_i (β^T x_i - y_i)^2 + λ‖β‖_2^2 gives β_MAP = (X^T X + λI)^{-1} X^T y, which reduces to the OLS solution when λ = 0, as expected. If we have to use a numerical procedure, the gradient and the Hessian matrix change only slightly too: ∇E(β) = 2(X^T X β - X^T y + λβ), H = 2(X^T X + λI). As long as λ ≥ 0, the optimization is convex. Multiple regression Ridge regression 29 / 36
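
A small sketch (not from the slides) checking that the closed-form β_MAP makes the gradient 2(X^T X β - X^T y + λβ) vanish:

```python
# The closed-form ridge solution zeroes the gradient of
# E(beta) = ||X beta - y||^2 + lambda * ||beta||^2.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 20))
y = rng.normal(size=100)
lam = 2.0

beta_map = np.linalg.solve(X.T @ X + lam * np.eye(20), X.T @ y)
grad = 2 * (X.T @ X @ beta_map - X.T @ y + lam * beta_map)
print(np.max(np.abs(grad)))                    # ~0 up to numerical error
```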

58 Estimating hyperparameters σ^2, σ_0^2. Ridge regression model: y = Xβ + ε with β ~ N(0, σ^2 I_m). To find (σ^2, σ_0^2) given data D = (y, X), we compute the marginal likelihood: L(σ^2, σ_0^2) = P(y | X, σ^2, σ_0^2) = ∫ P(y, β | X, σ^2, σ_0^2) dβ = ∫ P(y | β, X, σ_0^2) P(β | σ^2) dβ = N(y; 0, σ^2 X X^T + σ_0^2 I_n), so the log marginal likelihood is -½[y^T K^{-1} y + log det K] + const, where K = σ^2 X X^T + σ_0^2 I_n. Multiple regression Ridge regression 30 / 36
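
A minimal sketch (not from the slides) of evaluating this log marginal likelihood; a Cholesky factorization of K keeps the computation numerically stable:

```python
# Log marginal likelihood log N(y; 0, K), K = sigma2 * X X^T + sigma0_2 * I_n,
# with constant terms dropped.
import numpy as np

def log_marginal_likelihood(y, X, sigma2, sigma0_2):
    n = len(y)
    K = sigma2 * (X @ X.T) + sigma0_2 * np.eye(n)
    L = np.linalg.cholesky(K)                      # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^{-1} y
    logdet = 2.0 * np.sum(np.log(np.diag(L)))              # log det K
    return -0.5 * (y @ alpha + logdet)
```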

61 Statistical properties of the ridge-regression estimator. Assumption: the linear model is correct. β_MAP is a biased estimator of β. Contrast with OLS: β_OLS is an unbiased estimator of β (Lecture 3). What about the variance? Multiple regression Ridge regression 31 / 36
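
A quick Monte Carlo sketch (not from the slides) illustrating the bias: over repeated datasets drawn from a fixed β, the average ridge estimate is shrunk toward zero while the average OLS estimate is close to β. The sample sizes, λ, and effect sizes below are arbitrary.

```python
# Ridge (MAP) estimates are shrunk toward zero on average (biased),
# while OLS estimates are unbiased.
import numpy as np

rng = np.random.default_rng(4)
n, m, lam, reps = 200, 5, 50.0, 2000
beta_true = np.array([1.0, -0.5, 0.0, 2.0, 0.25])

ols_mean = np.zeros(m)
ridge_mean = np.zeros(m)
for _ in range(reps):
    X = rng.normal(size=(n, m))
    y = X @ beta_true + rng.normal(size=n)
    ols_mean += np.linalg.solve(X.T @ X, X.T @ y) / reps
    ridge_mean += np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y) / reps

print(beta_true)
print(ols_mean.round(3))      # close to beta_true
print(ridge_mean.round(3))    # shrunk toward zero
```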

62 Computing the ridge-regression estimator. Estimating β given the hyperparameters has the same runtime as OLS, O(m^2 n). Estimating the hyperparameters needs a numerical procedure; each iteration is O(n^3). Multiple regression Ridge regression 32 / 36

63 Outline Heritability Bayesian statistics Bernoulli model Ridge regression Probabilistic interpretation Estimating β Estimating hyperparameters Back to Heritability Multiple regression Back to Heritability 33 / 36

64 So how does this relate to heritability? Heritability is related to the hyperparameters: h^2_m = mσ^2 / (mσ^2 + σ_0^2). Given genotype and phenotype pairs {(x_i, y_i)}, model the phenotype y_i as y_i = β^T x_i + ε_i, where β_j ~ N(0, σ^2) and ε_i ~ N(0, σ_0^2). Estimate the hyperparameters (σ^2, σ_0^2) by maximizing the marginal likelihood. Multiple regression Back to Heritability 34 / 36
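
Putting the pieces together, a rough sketch (not from the slides) that estimates h^2_m by a simple grid search over the hyperparameters, reusing the log_marginal_likelihood sketch from earlier; real tools (e.g. REML-based linear mixed model software) use more careful optimization, so this is only illustrative:

```python
# Estimate h^2_m = m*sigma2 / (m*sigma2 + sigma0_2) by a grid search over
# (sigma2, sigma0_2), maximizing the Gaussian marginal likelihood.
# Assumes log_marginal_likelihood() from the earlier sketch is in scope.
import numpy as np

def estimate_h2(y, X, grid=np.linspace(0.01, 0.99, 50)):
    n, m = X.shape
    var_y = np.var(y)
    best_h2, best_ll = None, -np.inf
    for h2 in grid:                              # parameterize by h^2 at fixed Var[y]
        sigma2 = h2 * var_y / m                  # per-SNP effect variance
        sigma0_2 = (1.0 - h2) * var_y            # residual variance
        ll = log_marginal_likelihood(y, X, sigma2, sigma0_2)
        if ll > best_ll:
            best_h2, best_ll = h2, ll
    return best_h2
```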

65 Application of ridge regression to estimate heritability. This approach is termed linear mixed models in the genetics literature. Yang et al. applied this model to height and estimated h^2_G = 0.45, dramatically higher than the estimate from GWAS loci (0.05). It has since been applied to a number of phenotypes. Multiple regression Back to Heritability 35 / 36

66 Application of ridge regression to estimate heritability. [Figure: heritability estimates across traits, from Visscher et al., AJHG 2012.] Multiple regression Back to Heritability 35 / 36

67 Summary. Increasing the number of SNPs in the regression leads to statistical and numerical difficulties; regularized regression is a solution. Ridge regression is one form of regularization and can also be derived from a Bayesian perspective. These methods have been useful in closing the missing heritability gap. Multiple regression Back to Heritability 36 / 36
