High-dimensional data analysis, fall 2013

1 High-dimensional data analysis, fall 2013
Yeast: understanding basic life functions; 11,904 p-values (Blomberg et al. 2003, 2010)
Arabidopsis thaliana: association mapping; 3,745 p-values (Zhao et al.)
fMRI brain scans: function of the brain language network; approx. 3 million p-values (Taylor et al. 2006)

2 Slides for B&vdG , 10.7: Stable solutions
Exercises: 10.1

3 B&vdG 10.2: Subsampling, stability and selection
Sometimes the aim is prediction, sometimes variable selection (and sometimes both). Both are important, but selection is harder!
Setting: $Y = X\beta + \varepsilon$; think of the Lasso (but the ideas are more general).
Recall from B&vdG 2 that the regularisation path is the set of $p$ functions of $\lambda$ defined as $\{\hat\beta_j(\lambda);\ \lambda \in \Lambda,\ j = 1, \dots, p\}$, where $\Lambda$ typically is some interval $[\lambda_{\min}, \lambda_{\max}]$, and that $\hat S_\lambda = \{j;\ \hat\beta_j(\lambda) \neq 0\}$.
Now write $\hat S_\lambda = \hat S_\lambda(I)$ to indicate dependence on the sample, which above is $I = \{1, \dots, n\}$.

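4 Let $I$ be a random subset of $\{1, \dots, n\}$ of size $m = \lfloor n/2 \rfloor$ selected by drawing without replacement, and for a subset $K$ (typically $K = \{j\}$) of $\{1, \dots, p\}$ let the subsampling probability be
$$\Pi_K(\lambda) = P[K \subseteq \hat S_\lambda(I)] = \frac{\#\{\text{size-}m \text{ subsets } I \text{ with } K \subseteq \hat S_\lambda(I)\}}{\binom{n}{m}}.$$
Here $\Pi_K(\lambda)$ may be estimated by drawing randomly, without replacement, a large number $B$ of subsets $I_1, \dots, I_B$ and computing
$$\hat\Pi_K(\lambda) = \frac{1}{B} \sum_{b=1}^{B} 1\{K \subseteq \hat S_\lambda(I_b)\}.$$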
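
The Monte Carlo estimate on this slide could be sketched as follows for $K = \{j\}$. This is a minimal illustration, not B&vdG's code: it assumes NumPy and uses a small hand-written coordinate-descent Lasso for the objective $\|Y - X\beta\|_2^2/n + \lambda\|\beta\|_1$; the names `lasso_cd` and `subsampling_probabilities` and the defaults `n_iter` and `B` are hypothetical choices.

```python
import numpy as np


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/n)*||y - X b||^2 + lam*||b||_1 (illustrative)."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.astype(float).copy()            # residual y - X @ beta
    col_sq = 2.0 * (X ** 2).sum(axis=0) / n   # curvature of each coordinate
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]        # remove covariate j from the fit
            rho = 2.0 * (X[:, j] @ resid) / n
            # soft-thresholding update for coordinate j
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= X[:, j] * new
            beta[j] = new
    return beta


def subsampling_probabilities(X, y, lam, B=100, seed=0):
    """Estimate Pi_j(lam) for every j from B subsamples of size floor(n/2)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = n // 2
    counts = np.zeros(p)
    for _ in range(B):
        idx = rng.choice(n, size=m, replace=False)  # draw without replacement
        beta = lasso_cd(X[idx], y[idx], lam)
        counts += (beta != 0)                       # indicator 1{j in S_lam(I_b)}
    return counts / B
```

Calling `subsampling_probabilities` over a grid of $\lambda$ values gives one column of the stability path per grid point. Any other selector could replace `lasso_cd`, since the subsampling idea is not specific to the Lasso.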

5 B&vdG argue that the stability path $\{\Pi_{\{j\}}(\lambda);\ \lambda \in \Lambda,\ j = 1, \dots, p\}$ is better for variable selection than the regularization path. Typically they use $\lambda_{\min} = 0$ and take $\lambda_{\max}$ as the smallest value of $\lambda$ for which all $\beta_j$ are estimated as zero (this value can be seen to be $\max_{1 \le j \le p} |2 X_j^T Y| / n$).
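
The value of $\lambda_{\max}$ can be computed directly from the data. The sketch below assumes NumPy and the Lasso objective $\|Y - X\beta\|_2^2/n + \lambda\|\beta\|_1$, which is the scaling under which the expression on this slide holds; `lambda_max` and `lambda_grid` are hypothetical helper names, and the equally spaced grid is only one possible choice.

```python
import numpy as np


def lambda_max(X, y):
    """Smallest lam for which the Lasso solution is all zeros,
    assuming the objective (1/n)*||y - X b||^2 + lam*||b||_1."""
    n = X.shape[0]
    return np.max(np.abs(2.0 * X.T @ y)) / n


def lambda_grid(X, y, num=50):
    """Grid on [0, lambda_max], ordered from lambda_max down to 0."""
    return lambda_max(X, y) * np.linspace(1.0, 0.0, num)
```

Plotting against `lambda_grid(X, y) / lambda_max(X, y)` gives the normalised axis $\lambda/\lambda_{\max}$ in reverse order.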

6 B&vdG: Vitamin B2 production using Bacillus subtilis
Numerical experiment using
n = 115 values of the logarithm of vitamin B2 production
p = 4088 gene expression values
6 genes were selected at random from the 200 genes with the highest marginal empirical correlation with the response variable, the log vitamin B2 production. The other genes were subjected to a random permutation of rows, so that their possible connections with the response variable disappeared.

7 [Figure: regularization path (left) and stability path (right). x-axis: $\lambda/\lambda_{\max}$ (in reverse order); y-axis: the $\hat\beta_j(\lambda)$ (left), the $\Pi_j(\lambda)$ (right). Red lines are the non-permuted genes.]

8 B&vdG: Motif regression
Heat shock experiment for finding transcription factor binding sites in DNA sequences. Subset containing
n = 1200 gene expression values
p = 666 motif scores
[Table: Lasso estimates $\hat\beta_j = \hat\beta_j(\lambda_{CV})$, with $\lambda_{CV}$ chosen by 10-fold cross-validation, and the corresponding subsampling probabilities $\Pi_j = \Pi_j(\lambda_{CV})$ for the 9 most promising motifs.]
Should one use the ordering from $\hat\beta_j$ or from $\Pi_j$?

9 Numerical experiment: choose 5 covariates at random, set the corresponding β-s to values which lead to a very low signal-to-noise ratio (= 0.1), set all other β-s to zero, and simulate with i.i.d. $N(0, 1)$ error variables $\varepsilon_t$. This gives the following result:
[Figure: x-axis: $\Pi_j(\lambda_{CV})$; y-axis: $\hat\beta_j(\lambda_{CV})$. Red crosses are the active genes.]

10 B&vdG 10.3: Stability selection
Traditionally: select one element, say $\hat S_{\lambda_0}$, from the set of models $\{\hat S_\lambda;\ \lambda \in \Lambda\}$.
Alternatively: select a threshold value $\Pi_{thr}$ and select the model $\hat S_{stable} = \{j;\ \max_{\lambda \in \Lambda} \Pi_j(\lambda) > \Pi_{thr}\}$ (and then perhaps re-estimate the β-s in this set with OLS). Often $\Lambda = \{\lambda_{CV}\}$.
Type 1 error: selecting a covariate which isn't active, i.e. a $j \notin S_0$
Type 2 error: not selecting a covariate which is active, i.e. a $j \in S_0$
Want to make the probability of both errors small.
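
Given estimated subsampling probabilities on a grid of $\lambda$ values, the stable set is a one-line computation. A minimal sketch in plain Python, where the layout of `pi_path` (one row per $\lambda$, one column per covariate) and the name `stable_set` are assumptions made for illustration:

```python
def stable_set(pi_path, threshold):
    """Indices j whose maximum subsampling probability over the lambda grid
    strictly exceeds the threshold.

    pi_path[k][j] is the (estimated) probability for covariate j at the
    k-th lambda value; a single-row pi_path corresponds to Lambda = {lambda_CV}.
    """
    p = len(pi_path[0])
    return [j for j in range(p)
            if max(row[j] for row in pi_path) > threshold]
```

For example, `stable_set([[0.2, 0.95, 0.5], [0.4, 0.7, 0.92]], 0.9)` keeps covariates 1 and 2, each of which exceeds 0.9 at some grid point.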

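11 $\hat S_\Lambda = \bigcup_{\lambda \in \Lambda} \hat S_\lambda$, $\quad q_\Lambda = E|\hat S_\Lambda|$, $\quad V = |S_0^c \cap \hat S_{stable}|$ = # type 1 errors
Thm 10.1: Assume $\{1(j \in \hat S_\Lambda);\ j \in S_0^c\}$ has an exchangeable distribution and that
$$\frac{E|S_0 \cap \hat S_\Lambda|}{E|S_0^c \cap \hat S_\Lambda|} \ge \frac{|S_0|}{|S_0^c|}.$$
Then, for $\Pi_{thr} > 1/2$,
$$E(V) \le \frac{1}{2\Pi_{thr} - 1} \frac{q_\Lambda^2}{p}.$$
$E(V)$ := PFER = per-family error rate
$E(V)/p$ = PCER = per-comparison error rate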

12 Type 1 error control: for a given value $\nu$ choose $\Pi_{thr}$ such that $E(V) \le \nu$. If $\nu$ is chosen as some suitably small number, say $\nu = \alpha = 0.05$, one then gets Type 1 (or, equivalently, PFER) error control, since by Markov's inequality $P(V > 0) \le E(V) \le \nu$.
However, sometimes bigger $\nu$-values are also of interest, e.g. if one wants to control PCER.
By Thm 10.1, $E(V) \le \nu$ holds if the threshold is chosen as
$$\frac{1}{2\Pi_{thr} - 1} \frac{q_\Lambda^2}{p} = \nu \iff \Pi_{thr} = \left(1 + \frac{q_\Lambda^2}{p\nu}\right) / 2$$
(only useful if $q_\Lambda^2 < p\nu$, so that $\Pi_{thr} < 1$).
Homework: Problem 10.1
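
The threshold formula and the bound from Thm 10.1 are simple arithmetic. The sketch below is plain Python; the function names `pfer_bound` and `threshold_for` are hypothetical, and `q` stands for $q_\Lambda$, which is assumed known here.

```python
def pfer_bound(q, p, pi_thr):
    """Upper bound on E(V) from Thm 10.1: q^2 / ((2*pi_thr - 1) * p)."""
    if not 0.5 < pi_thr <= 1.0:
        raise ValueError("the bound requires 1/2 < pi_thr <= 1")
    return q ** 2 / ((2.0 * pi_thr - 1.0) * p)


def threshold_for(q, p, nu):
    """Threshold pi_thr = (1 + q^2 / (p * nu)) / 2 that makes the bound equal nu."""
    pi_thr = (1.0 + q ** 2 / (p * nu)) / 2.0
    if pi_thr >= 1.0:
        raise ValueError("q^2 must be smaller than p * nu for a usable threshold")
    return pi_thr
```

As a made-up illustration, with `q = 10`, `p = 1000` and `nu = 1` the threshold is about 0.55, and plugging it back into `pfer_bound` returns a bound of about 1.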

13 But here $q_\Lambda = E|\hat S_\Lambda(I)|$ isn't known. One way to handle this is to decide beforehand on a value $q$ and then use a procedure which selects at most $q$ covariates. Then of course $E|\hat S_\Lambda(I)| \le q$. Possible ways of doing this include: (i) use the standard Lasso but only select the $q$ covariates with the largest absolute values of the regression coefficients; (ii) select the $q$ variables which enter first in the regularization path.
This instead leads to the problem of selecting $q$. An alternative is to turn things around and decide on a value of $\Pi_{thr}$, say $\Pi_{thr} = 0.9$, and then use $q = \sqrt{\nu p (2\Pi_{thr} - 1)}$, which is the value of $q$ at which the bound of Thm 10.1 equals $\nu$.
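
Both steps on this slide can be sketched in plain Python. The helper names `max_q` and `top_q` are hypothetical; rounding $q$ down to an integer is an added assumption (a count of covariates must be whole, and rounding down keeps the bound at or below $\nu$).

```python
import math


def max_q(nu, p, pi_thr):
    """Largest integer q with q^2 / ((2*pi_thr - 1) * p) <= nu,
    i.e. floor(sqrt(nu * p * (2*pi_thr - 1)))."""
    if not 0.5 < pi_thr <= 1.0:
        raise ValueError("requires 1/2 < pi_thr <= 1")
    return math.floor(math.sqrt(nu * p * (2.0 * pi_thr - 1.0)))


def top_q(beta, q):
    """Indices of at most q covariates with the largest nonzero |beta_j|,
    ordered from largest to smallest absolute coefficient."""
    order = sorted(range(len(beta)), key=lambda j: -abs(beta[j]))
    return [j for j in order[:q] if beta[j] != 0]
```

For instance, with $p = 4088$ as in the vitamin B2 data, $\Pi_{thr} = 0.9$ and an illustrative choice $\nu = 1$, `max_q(1, 4088, 0.9)` gives 57, so each subsample fit would be truncated to its 57 largest coefficients via `top_q`.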

14 B&vdG 10.4: A numerical experiment
[Figure. Red triangles: stability selection, controlled to $E(V) \le 2.5$. Black dots: cross-validated Lasso. Each pair is from a different simulation set-up.]

15 B&vdG 10.7: Proofs
Read!
