CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS



1 CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS
Mingon Kang, Ph.D., Computer Science, Kennesaw State University
* Some contents are adapted from Dr. Hung Huang and Dr. Chengkai Li at UT Arlington

2 Problems in Linear Regression
More predictors than samples
  Such data are not unusual, e.g., microarray data
Ill-conditioned matrix
  LS estimates depend on $(X^\top X)^{-1}$; when $X$ is ill-conditioned, $X^\top X$ may be singular or nearly singular
  In this case, small changes to the elements of $X$ lead to large changes in $(X^\top X)^{-1}$

3 Ill-conditioned Matrices
Left system: x + y = 2, x … y = 2
Right system: x + y = 2, x … y = …
The solution is x = 2, y = 0 on the left and x = 1, y = 1 on the right.
The coefficient matrix is called ill-conditioned because a small change in the constant terms results in a large change in the solution.
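The effect described on this slide can be reproduced numerically. The sketch below is only an illustration: the coefficient 1.001 and the right-hand side 2.001 are assumed values, chosen so that the two systems have the solutions stated on the slide. They are not taken from the slide itself.

```python
import numpy as np

# Assumed coefficient matrix: the second row is almost a copy of the first.
# The value 1.001 is an assumption made for this illustration.
A = np.array([[1.0, 1.0],
              [1.0, 1.001]])

b_left = np.array([2.0, 2.0])     # constant terms of the left system
b_right = np.array([2.0, 2.001])  # one constant term changed by only 0.001

sol_left = np.linalg.solve(A, b_left)
sol_right = np.linalg.solve(A, b_right)
cond_A = np.linalg.cond(A)        # a large condition number signals ill-conditioning

print("left solution: ", sol_left)
print("right solution:", sol_right)
print("condition number of A:", cond_A)
```

A change of 0.001 in one constant term moves the solution from (2, 0) to (1, 1), which is the behavior the slide calls ill-conditioned.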

4 Solutions
Predictions can sometimes be improved by trading a little bias for a reduction in the variance of the predicted values
Determine a smaller subset of predictors that exhibit the strongest effects (feature selection)
This gives the big picture while sacrificing some of the small details

5 Motivation
If two or more independent variables are highly correlated:
The intercept is estimated well, but what about the coefficients?

6 Motivation
This happens because x1 and x2 are highly correlated.
RSS(40, -38) = 21.7 (our estimate) is very close to RSS(1, 1) = 22.6 (the truth)
An effective way of dealing with this problem is penalization: instead of minimizing the RSS alone, we add a penalty term to the regression objective
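A small simulation can illustrate the same phenomenon. This is a sketch under assumed settings (sample size, noise level, and random seed are all arbitrary), not the data behind the numbers on the slide. It compares the RSS at the OLS estimate with the RSS at the true coefficients when x1 and x2 are nearly identical.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility only
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)     # x2 is almost a copy of x1
X = np.column_stack([np.ones(n), x1, x2])
b_true = np.array([0.0, 1.0, 1.0])      # assumed truth: intercept 0, b1 = b2 = 1
y = X @ b_true + rng.normal(size=n)

def rss(b):
    """Residual sum of squares for the coefficient vector b."""
    r = y - X @ b
    return float(r @ r)

b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
corr = np.corrcoef(x1, x2)[0, 1]
cond_xtx = np.linalg.cond(X.T @ X)

print("correlation(x1, x2):", corr)
print("OLS estimate:", b_ols, "RSS:", rss(b_ols))
print("truth:       ", b_true, "RSS:", rss(b_true))
print("condition number of X'X:", cond_xtx)
```

The OLS coefficients for x1 and x2 can land far from (1, 1) while their RSS is only slightly smaller than the RSS at the truth, because the data cannot distinguish between the two nearly identical predictors.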

7 Ridge Regression
Ridge Regression Model
Minimize $\sum_{i=1}^{n} (y_i - x_i^\top b)^2$ s.t. $\sum_{j=1}^{p} b_j^2 \le c$, where $x_i^\top$ is the $i$-th row of $X$

8 Ridge Regression
Why does this help?
Smaller coefficients make the predictions less sensitive to changes in the variables.

9 Ridge Regression
Lagrange Multiplier
A strategy for finding the local maxima or minima of a function subject to equality/inequality constraints
Minimizing $\sum_{i=1}^{n} (Y_i - \mu)^2$ s.t. $\mu^2 \le C$
is equivalent to minimizing $\sum_{i=1}^{n} (Y_i - \mu)^2 + \lambda \mu^2$

10 Lagrange Multiplier Example
Find the extrema of the function $F(x, y) = 2y + x$ subject to the constraint $0 = g(x, y) = y^2 + xy - 1$
Derivative of $H(x, y, z) = 2y + x + z\,(y^2 + xy - 1)$
Test with
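The stationary points of $H$ can be worked out by hand. The derivation below is a sketch that assumes the constraint reads $y^2 + xy - 1 = 0$ (the minus sign is the only choice for which real solutions exist).

```latex
\begin{align*}
H(x, y, z) &= 2y + x + z\,(y^2 + xy - 1) \\
\frac{\partial H}{\partial x} &= 1 + zy = 0 \\
\frac{\partial H}{\partial y} &= 2 + z\,(2y + x) = 0 \\
\frac{\partial H}{\partial z} &= y^2 + xy - 1 = 0
\end{align*}
% From the first equation, z = -1/y. Substituting into the second:
%   2 - (2y + x)/y = 0  =>  x = 0.
% The constraint then gives y^2 = 1, so
%   (x, y, z) = (0,  1, -1) with F =  2,
%   (x, y, z) = (0, -1,  1) with F = -2.
```

On the constraint curve, $x = 1/y - y$, so $F = y + 1/y$. This confirms that $(0, 1)$ is a local minimum ($F = 2$) and $(0, -1)$ is a local maximum ($F = -2$).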

11 Ridge Regression
Ridge Regression Model
Minimize $\sum_{i=1}^{n} (y_i - x_i^\top b)^2 + \lambda \|b\|_2^2$
where $\|b\|_2$ is the L-2 norm of $b$
P-norm ($p \ge 1$): $\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$
p = 1: Manhattan norm (L-1 norm); p = 2: Euclidean norm; p = ∞: maximum norm
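The p-norm definition can be checked on a small vector. The helper name `p_norm` and the example vector are illustrative choices, not part of the slides.

```python
import math

def p_norm(x, p):
    """Return the p-norm of the sequence x for p >= 1 (p may be math.inf)."""
    if p < 1:
        raise ValueError("p must be >= 1")
    if math.isinf(p):
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

x = [3.0, -4.0]
l1 = p_norm(x, 1)            # Manhattan norm: |3| + |-4|
l2 = p_norm(x, 2)            # Euclidean norm: sqrt(9 + 16)
linf = p_norm(x, math.inf)   # maximum norm: max(|3|, |-4|)
print(l1, l2, linf)
```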

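12 Optimization
$H(b, \lambda) = (y - Xb)^\top (y - Xb) + \lambda b^\top b = y^\top y - 2 b^\top X^\top y + b^\top X^\top X b + \lambda b^\top b$
$\partial H(b, \lambda) / \partial b = -2 X^\top y + 2 X^\top X b + 2 \lambda b = 0$
$(X^\top X + \lambda I)\, b = X^\top y$
$\hat{b} = (X^\top X + \lambda I)^{-1} X^\top y$
$X^\top X + \lambda I$ is always invertible for $\lambda > 0$, so there is always a unique solution $\hat{b}$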
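The closed-form solution can be sketched in a few lines of NumPy. The function name `ridge_closed_form`, the simulated data, and the value of λ are assumptions for illustration. The example uses more predictors than samples, so $X^\top X$ is singular and plain least squares has no unique solution, while the ridge system is still solvable.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X'X + lam*I) b = X'y for b; lam > 0 is assumed."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)   # arbitrary seed
n, p = 5, 20                     # more predictors than samples
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# rank(X'X) equals rank(X), which is at most n < p, so X'X is singular.
rank_x = np.linalg.matrix_rank(X)
lam = 1.0
b_ridge = ridge_closed_form(X, y, lam)

# The solution satisfies the ridge normal equations up to rounding error.
residual = (X.T @ X + lam * np.eye(p)) @ b_ridge - X.T @ y
print("rank of X:", rank_x, "of", p, "columns")
print("max |normal-equation residual|:", np.max(np.abs(residual)))
```

Solving the linear system with `np.linalg.solve` rather than forming the inverse explicitly is a common numerical choice. The result is the same $\hat{b}$ as in the formula above.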

13 Ridge Regression
Similar to the ordinary least squares solution, but with the addition of a ridge regularization term
As $\lambda \to 0$, $\hat{b}^{ridge} \to \hat{b}^{OLS}$
As $\lambda \to \infty$, $\hat{b}^{ridge} \to 0$
Applying the ridge regression penalty shrinks the estimates toward zero
This introduces bias but reduces the variance of the estimate
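The two limits can be observed numerically. The sketch below uses simulated data with assumed coefficients and an arbitrary seed. It prints the norm of the ridge estimate over a grid of λ values.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed
X = rng.normal(size=(30, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=30)

def ridge(lam):
    """Ridge estimate (X'X + lam*I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
lams = [1e-8, 1.0, 100.0, 1e8]
norms = [float(np.linalg.norm(ridge(lam))) for lam in lams]

for lam, nrm in zip(lams, norms):
    print(f"lambda={lam:g}  ||b_ridge||_2={nrm:.6f}")
print("||b_ols||_2 =", float(np.linalg.norm(b_ols)))
```

For a very small λ the estimate is numerically indistinguishable from OLS. As λ grows, the norm of the estimate decreases toward zero.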

14 LASSO
LASSO: Least Absolute Shrinkage and Selection Operator
L-1 norm penalization
Most coefficients are shrunk all the way to zero; such a solution is called sparse
Minimize $\sum_{i=1}^{n} (y_i - x_i^\top b)^2$ s.t. $\sum_{j=1}^{p} |b_j| \le c$

15 LASSO
The absolute value in the penalty is not differentiable at zero, so there is no closed-form solution in general
In the special case of an orthonormal design matrix, $X^\top X = I$, it is possible to obtain a closed-form solution (but an approximate solution)
$\hat{b}_j^{lasso} = S(\hat{b}_j^{OLS}, \lambda)$, where $S$ is the soft-thresholding operator defined as:
$$S(z, \lambda) = \begin{cases} z - \lambda & \text{if } z > \lambda \\ 0 & \text{if } |z| \le \lambda \\ z + \lambda & \text{if } z < -\lambda \end{cases}$$
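The three-case definition of $S$ collapses into a single vectorized expression. The function name `soft_threshold` and the test vector are illustrative.

```python
import numpy as np

def soft_threshold(z, lam):
    """Elementwise S(z, lam): move z toward zero by lam, stopping at zero."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([3.0, 0.5, -0.5, -3.0])
shrunk = soft_threshold(z, lam=1.0)
print(shrunk)
```

Entries with magnitude at most λ become zero, and the remaining entries move toward zero by λ. This is how the LASSO produces sparse estimates in the orthonormal case.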

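16 Soft-thresholding operator
Proximal mapping of the L1-norm:
$\arg\min_b \sum_{i=1}^{n} (y_i - x_i^\top b)^2 + \lambda \|b\|_1$
$= \arg\min_b \|y - Xb\|_2^2 + \lambda \|b\|_1$
Setting the subgradient to zero (the constant factor 2 from the squared term is absorbed into $\lambda$): $0 \in -X^\top y + X^\top X b + \lambda\, \partial \|b\|_1$
The L-1 norm is separable, so we can consider each of its components separately.
Let's first examine the case where $b_j \ne 0$. Then $\partial |b_j| = \mathrm{sign}(b_j)$, and the optimum $b_j$ is obtained from
$-X_j^\top y + X_j^\top X_j b_j + \lambda\, \mathrm{sign}(b_j) = 0$
$b_j = X_j^\top y - \lambda\, \mathrm{sign}(b_j)$
where $X_j$ is the $j$-th column of $X$ and $X_j^\top X_j = 1$ for an orthonormal design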

17 Soft-thresholding operator
In the case where $b_j = 0$, the subdifferential of the L-1 norm is the interval $[-1, 1]$ and the optimality condition is
$0 \in -X_j^\top y + \lambda [-1, 1] \iff X_j^\top y \in [-\lambda, \lambda] \iff |X_j^\top y| \le \lambda$
Figure: absolute value function (left) and its subdifferential as a function of x (right)
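Both optimality conditions (nonzero and zero components) can be verified numerically on an orthonormal design. This sketch assumes the objective is scaled as $\tfrac{1}{2}\|y - Xb\|_2^2 + \lambda \|b\|_1$, which matches the gradient $-X^\top y + X^\top X b$ used in the derivation. The data, seed, and λ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)                   # arbitrary seed
Q, _ = np.linalg.qr(rng.normal(size=(40, 4)))    # columns of Q are orthonormal
X = Q
y = X @ np.array([3.0, 0.0, -2.0, 0.1]) + 0.05 * rng.normal(size=40)
lam = 0.5

z = X.T @ y                                      # OLS estimate, because X'X = I
b = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)   # soft-thresholded estimate

grad_smooth = -X.T @ y + X.T @ X @ b             # gradient of the squared-error part
nonzero = b != 0

# Nonzero components: gradient + lam * sign(b_j) must vanish.
stationarity = grad_smooth[nonzero] + lam * np.sign(b[nonzero])
# Zero components: |X_j' y| must not exceed lam.
zero_ok = np.abs(z[~nonzero]) <= lam

print("estimate:", b)
print("stationarity residuals:", stationarity)
print("zero-component condition holds:", bool(zero_ok.all()))
```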

18 Geometry of ridge vs lasso
A geometric illustration of why lasso results in sparsity.
Lasso (left): coefficients tend to be exactly zero

19 References 03/2-20.pdf e/statlearning/asset/model_selection.pdf yregularization.pdf /155

20 Reference LASSO reference

21 Elastic Net
Proposed by Hui Zou and Trevor Hastie in 2005
Regularizes by combining the L-1 and L-2 norm penalties of the LASSO and ridge regression
Overcomes a limitation of the LASSO: in the large p, small n case, where several variables are highly correlated with each other, LASSO tends to select only one variable among them (the others become zero).

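22 Elastic Net
The elastic net method includes the LASSO and ridge regression as special cases
$\hat{b} = \arg\min_b \sum_{i=1}^{n} (y_i - x_i^\top b)^2 + \lambda_1 \|b\|_1 + \lambda_2 \|b\|_2^2$
If $\lambda_1 = \lambda$, $\lambda_2 = 0$, it is LASSO
If $\lambda_1 = 0$, $\lambda_2 = \lambda$, it is ridge regression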
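The two special cases can be checked by evaluating the objective directly. The function name `elastic_net_objective` and the toy numbers are assumptions for illustration. The code evaluates the objective at a fixed $b$ and does not perform the minimization.

```python
import numpy as np

def elastic_net_objective(b, X, y, lam1, lam2):
    """RSS + lam1 * ||b||_1 + lam2 * ||b||_2^2, evaluated at b."""
    r = y - X @ b
    return float(r @ r + lam1 * np.sum(np.abs(b)) + lam2 * (b @ b))

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
b = np.array([0.5, -0.25])
lam = 2.0

rss = float(np.sum((y - X @ b) ** 2))
lasso_obj = rss + lam * float(np.sum(np.abs(b)))   # LASSO objective
ridge_obj = rss + lam * float(np.sum(b ** 2))      # ridge objective

en_as_lasso = elastic_net_objective(b, X, y, lam1=lam, lam2=0.0)
en_as_ridge = elastic_net_objective(b, X, y, lam1=0.0, lam2=lam)
print(en_as_lasso, lasso_obj)
print(en_as_ridge, ridge_obj)
```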

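23 Group LASSO
Proposed by Yuan and Lin in 2006
When pre-defined groups of covariates are given, the covariates in a group are allowed to be selected together
$\hat{b} = \arg\min_b \sum_{i=1}^{n} (y_i - x_i^\top b)^2 + \lambda \sum_{j=1}^{J} \|b_j\|_{K_j}$
where $b_j$ is the coefficient vector of group $j$ and $\|\varphi\|_K = (\varphi^\top K \varphi)^{1/2}$, $\varphi \in \mathbb{R}^d$, $K \in \mathbb{R}^{d \times d}$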
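The group penalty term can be computed directly from its definition. In this sketch the grouping, the coefficient values, and the choice $K_j = I$ are all assumptions made for illustration. Only the penalty is evaluated, not the full optimization.

```python
import numpy as np

def group_lasso_penalty(b, groups, kernels, lam):
    """lam * sum_j sqrt(b_j' K_j b_j) over the coefficient groups."""
    total = 0.0
    for idx, K in zip(groups, kernels):
        bj = b[idx]                       # coefficients belonging to group j
        total += np.sqrt(bj @ K @ bj)
    return lam * float(total)

b = np.array([3.0, 4.0, 0.0, 0.0, 1.0])
groups = [[0, 1], [2, 3], [4]]                # hypothetical grouping of 5 covariates
kernels = [np.eye(len(g)) for g in groups]    # assumption: K_j is the identity
penalty = group_lasso_penalty(b, groups, kernels, lam=0.5)
print(penalty)
```

A group whose coefficients are all zero contributes nothing to the penalty, which is why the penalty encourages whole groups to enter or leave the model together.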

24 Interaction effects
The linear model assumes that the variables act independently
When considering the relationship among two or more variables, the model needs to describe their simultaneous influence
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + e$
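An interaction model is still linear in its coefficients, so it can be fitted by least squares after adding the product $x_1 x_2$ as an extra column. The coefficient values and the seed below are hypothetical, and the data are noise-free ($e = 0$) so that the fit recovers them exactly.

```python
import numpy as np

rng = np.random.default_rng(3)            # arbitrary seed
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
beta = np.array([1.0, 2.0, -1.0, 0.5])    # hypothetical beta_0, beta_1, beta_2, beta_12

# Design matrix with an explicit interaction column x1 * x2.
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
y = X @ beta                              # noise-free response

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```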

25 References
Elastic net: H. Zou and T. Hastie, Regularization and variable selection via the elastic net, J. R. Statist. Soc.
Group Lasso: M. Yuan and Y. Lin, Model selection and estimation in regression with grouped variables, J. R. Statist. Soc.

26 SNP
Single-Nucleotide Polymorphisms (SNP)
A DNA variation in a single nucleotide that occurs commonly within a population (e.g., 1%)
There are two alleles, A and G, at this locus

27 SNP
Examine DNA variations at prespecified loci (about 10 million)
Bi-allelic (major/minor allele)
It is assumed that the minor allele may cause a specific rare phenotype (e.g., a disease), while the major allele is the one mainly observed in a population
Data Encoding
SNP1  SNP2  SNP3
G A   A C   C T
G G   A A   C C
G A   C C   C T
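One common way to turn genotypes into numbers is to count minor alleles per genotype (0, 1, or 2). Whether the slides use exactly this additive encoding is an assumption. The sketch below takes the genotype letters listed on the slide, grouped in pairs with rows as samples (also an assumption about the layout), and breaks frequency ties alphabetically, which is an arbitrary rule chosen for this example.

```python
from collections import Counter

# Genotype letters from the slide, grouped in pairs (layout is assumed):
# rows are samples, columns are SNP1..SNP3.
genotypes = [
    ["GA", "AC", "CT"],
    ["GG", "AA", "CC"],
    ["GA", "CC", "CT"],
]

def encode_minor_allele_counts(genotypes):
    """Replace each genotype by its number of minor alleles (0, 1, or 2)."""
    n_snps = len(genotypes[0])
    encoded = [[0] * n_snps for _ in genotypes]
    for j in range(n_snps):
        counts = Counter("".join(row[j] for row in genotypes))
        # Minor allele = least frequent allele; ties are broken alphabetically.
        minor = min(counts, key=lambda allele: (counts[allele], allele))
        for i, row in enumerate(genotypes):
            encoded[i][j] = row[j].count(minor)
    return encoded

encoded = encode_minor_allele_counts(genotypes)
for row in encoded:
    print(row)
```

In this toy table the two alleles of SNP2 are equally frequent, so the choice of minor allele there depends entirely on the tie-breaking rule.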

28 GWAS
Genome-Wide Association Study
Focuses on associations between SNPs and a phenotype (e.g., cancer)
Conventionally, the SNPs of two groups (case and control) are compared statistically based on a linear regression model
Pairwise univariate analysis
Perform a t-test between each SNP and the phenotype individually: $y = x_i \beta_i + \varepsilon$
Figure: SNP matrix X (one column per SNP) and phenotype vector y
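The per-SNP analysis can be sketched as one simple linear regression per column of X. This example is an illustration on simulated data: the genotype codes, the causal SNP index, the effect size, and the seed are all hypothetical, and an intercept is included in each regression (the slide's formula omits it). The t-statistic of the slope is computed from the Pearson correlation $r$ as $t = r \sqrt{(n-2)/(1-r^2)}$.

```python
import numpy as np

def snp_t_statistics(X, y):
    """t-statistic of the slope of y regressed on each SNP column separately."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Pearson correlation between each SNP column and the phenotype.
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return r * np.sqrt((n - 2) / (1.0 - r ** 2))

rng = np.random.default_rng(4)                       # arbitrary seed
n, p = 200, 50
X = rng.integers(0, 3, size=(n, p)).astype(float)    # simulated 0/1/2 genotype codes
y = 2.0 * X[:, 7] + 0.5 * rng.normal(size=n)         # hypothetical: SNP 7 is causal

t = snp_t_statistics(X, y)
top_snp = int(np.argmax(np.abs(t)))
print("SNP with the largest |t|:", top_snp)
print("its t-statistic:", t[top_snp])
```

Each SNP is tested on its own, so p such tests are performed in total. This is what leads to the multiple testing issue discussed on the following slides.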

29 Multiple correction problem
Occurs when one considers a set of statistical inferences simultaneously
With $n$ statistical tests, the expected number of false positives is $\alpha \times n$
When there are $10^6$ SNPs and their significance is determined with a p-value cutoff of 0.05 ($\alpha = 0.05$), about 50,000 false positives are expected
A stricter significance threshold is required for the individual comparisons

30 Multiple correction problem
Bonferroni correction
Threshold: 0.05 (or 0.01) / the number of hypotheses
If a trial is testing m = 20 hypotheses with a desired α = 0.05, the Bonferroni-corrected threshold is 0.05/20 = 0.0025
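The arithmetic on these two slides can be reproduced in a few lines. The function names are illustrative, and the expected count assumes that every null hypothesis is true.

```python
def expected_false_positives(alpha, n_tests):
    """Expected number of false positives when every null hypothesis is true."""
    return alpha * n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests

fp_gwas = expected_false_positives(0.05, 10**6)   # 10^6 SNPs at alpha = 0.05
thr_trial = bonferroni_threshold(0.05, 20)        # the m = 20 example
thr_gwas = bonferroni_threshold(0.05, 10**6)      # the same correction for 10^6 SNPs
print(fp_gwas, thr_trial, thr_gwas)
```

For a study with $10^6$ SNPs, the corrected per-test threshold is on the order of $5 \times 10^{-8}$.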

31 Manhattan plot

32 eQTL
expression Quantitative Trait Loci (eQTL) mapping study
Association between SNPs and mRNA
Find genomic loci that regulate the expression levels of mRNAs (gene expression)
Capture insight into the genetic architecture of gene expression
Figure: SNP matrix X (one column per SNP) and gene expression matrix Y (one column per gene)
