Lecture 2 Part 1 Optimization

1 Lecture 2 Part 1 Optimization (January 16, 2015) Mu Zhu University of Waterloo

2 Need for Optimization
- $E(y \mid x)$, $P(y \mid x)$: the quantities we want to go after
- first, model them (some examples last week)
- then, estimate them (didn't discuss last week); this often requires solving an optimization problem
Fields Institute, Toronto, Canada 2015 by Mu Zhu

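3 Ex I: Linear Regression
$$\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^T \beta \right)^2 = \| y - X\beta \|^2$$
$$y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \qquad X = \begin{bmatrix} x_1^T \\ \vdots \\ x_n^T \end{bmatrix}$$
$$\hat{\beta} = (X^T X)^{-1} X^T y$$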
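The closed-form estimate on this slide can be checked numerically. The following is a minimal sketch in Python with NumPy, assuming synthetic data; the helper name `ols_fit`, the seed, and the chosen coefficients are illustrative and not from the lecture. It solves the normal equations $(X^T X)\beta = X^T y$ rather than forming the inverse explicitly.

```python
import numpy as np

def ols_fit(X, y):
    """Least-squares estimate: solves (X^T X) beta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic data (an assumption made for illustration only).
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta_hat = ols_fit(X, y)

# Cross-check against NumPy's built-in least-squares solver.
beta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
```

At the solution, the residual $y - X\hat{\beta}$ is orthogonal to every column of $X$, which is another way of stating the normal equations.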

4 Ex II: Logistic Regression
$$y_i \sim \text{Bernoulli}(p_i), \qquad p_i \equiv P(y_i = 1 \mid x_i)$$
$$\log \frac{p_i}{1 - p_i} = x_i^T \beta \quad \text{or} \quad p_i = \frac{\exp(x_i^T \beta)}{1 + \exp(x_i^T \beta)}$$
$$L(p_i; y_i) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$$

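5 Ex II: Logistic Regression
log-likelihood:
$$\begin{aligned}
\ell(p_i; y_i) = \log L(p_i; y_i) &= \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] \\
&= \sum_{i=1}^{n} \left[ y_i \log \frac{p_i}{1 - p_i} + \log(1 - p_i) \right] \\
&= \sum_{i=1}^{n} \left[ y_i x_i^T \beta - \log\left( 1 + e^{x_i^T \beta} \right) \right] = \ell(\beta; y_i, x_i)
\end{aligned}$$
to estimate $\beta$, solve $\max_{\beta} \ell(\beta; y_i, x_i)$ by Newton-Raphson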

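6 Ex II: Logistic Regression
$$\begin{aligned}
\ell'(\beta) &= \sum_{i=1}^{n} \left[ y_i x_i - \frac{e^{x_i^T \beta}}{1 + e^{x_i^T \beta}} \, x_i \right] = \sum_{i=1}^{n} (y_i - p_i) \, x_i \\
&= \begin{bmatrix} x_1 & \cdots & x_n \end{bmatrix} \begin{bmatrix} y_1 - p_1 \\ \vdots \\ y_n - p_n \end{bmatrix} = X^T (y - p)
\end{aligned}$$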

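7 Ex II: Logistic Regression
$$\begin{aligned}
\ell''(\beta) &= -\sum_{i=1}^{n} x_i \left[ \frac{(e^{x_i^T \beta})(1 + e^{x_i^T \beta}) - (e^{x_i^T \beta})(e^{x_i^T \beta})}{(1 + e^{x_i^T \beta})^2} \right] x_i^T \\
&= -\sum_{i=1}^{n} x_i \left[ p_i - p_i^2 \right] x_i^T \\
&= -\begin{bmatrix} x_1 & \cdots & x_n \end{bmatrix} \begin{bmatrix} p_1 (1 - p_1) & & \\ & \ddots & \\ & & p_n (1 - p_n) \end{bmatrix} \begin{bmatrix} x_1^T \\ \vdots \\ x_n^T \end{bmatrix} = -X^T W X
\end{aligned}$$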

8 Ex II: Logistic Regression
$$\begin{aligned}
\beta_{new} &= \beta_{old} - \left[ \ell''(\beta_{old}) \right]^{-1} \left[ \ell'(\beta_{old}) \right] \\
&= \beta_{old} + \left[ X^T W X \right]^{-1} \left[ X^T (y - p) \right] \\
&= \underbrace{\left[ X^T W X \right]^{-1} X^T W}_{\text{weighted least squares}} \left[ X \beta_{old} + W^{-1} (y - p) \right]
\end{aligned}$$
(both $W$ and $p$ depend on $\beta_{old}$)
- $w_{ii} \equiv p_i (1 - p_i)$ has its max at $p_i = 1/2$ and its min at $p_i = 0$ or $p_i = 1$
- estimates are influenced mostly by points near the decision boundary
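The Newton-Raphson update above can be sketched in a few lines. This is a minimal illustration in Python with NumPy, assuming synthetic, non-separable data; the function name `logistic_newton`, the iteration cap, and the tolerance are hypothetical choices, not part of the lecture. It uses the gradient $X^T(y - p)$ and the matrix $X^T W X$ exactly as derived on the preceding slides.

```python
import numpy as np

def logistic_newton(X, y, n_iter=25, tol=1e-10):
    """Newton-Raphson for unpenalized logistic regression."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # p_i evaluated at beta_old
        w = p * (1.0 - p)                     # diagonal entries of W
        grad = X.T @ (y - p)                  # l'(beta) = X^T (y - p)
        xtwx = X.T @ (X * w[:, None])         # X^T W X = -l''(beta)
        step = np.linalg.solve(xtwx, grad)
        beta = beta + step                    # beta_new = beta_old + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Synthetic data (an assumption made for illustration only).
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 1.0, -1.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

beta_hat = logistic_newton(X, y)
p_hat = 1.0 / (1.0 + np.exp(-X @ beta_hat))
score = X.T @ (y - p_hat)   # gradient at the estimate
```

At convergence the gradient `score` should be numerically zero. The sketch has no safeguards (step halving, separation checks), so it is not a substitute for a library routine.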

9 Enter Big Data
- if all this sounds easy, think about $x \in \mathbb{R}^d$ where $d$ is large relative to $n$
- we end up estimating too much with too little
- the resulting estimate cannot be very good
- in statistical terms, $\text{Var}(\hat{\beta})$ is inflated
- to reduce variance, introduce bias

10 Penalized Regression
- let $x_1, \ldots, x_d$ denote the columns of $X$
- suppose $y, x_1, \ldots, x_d$ are all standardized ($1^T y = 0$, $\|y\| = 1$, etc.)
- bias each $\beta_j$ by introducing a penalty $J(\beta_j)$:
$$\min_{\beta} \; \frac{1}{2} \left\| y - (\beta_1 x_1 + \cdots + \beta_d x_d) \right\|^2 + \sum_{j=1}^{d} J(\beta_j)$$

11 A Class of Penalty Functions
$$J(\beta_j) = \lambda |\beta_j|^{\alpha}, \qquad \sum_{j=1}^{d} J(\beta_j) = \lambda \sum_{j=1}^{d} |\beta_j|^{\alpha} \equiv \lambda \|\beta\|_{\alpha}^{\alpha}$$

method              α    norm   difficulty   algorithm
ridge regression    2    l_2    convex       analytic (exercise)
LASSO               1    l_1    convex       coordinate descent
subset selection    0    l_0    NP-hard      heuristic

$$|\beta_j|^0 = \begin{cases} 0, & \beta_j = 0 \\ 1, & \beta_j \neq 0 \end{cases} \qquad \text{i.e.,} \quad |\beta_j|^0 = I(\beta_j \neq 0)$$

12 Bias-Variance Trade-off
Exercise. Show that, under typical model assumptions, $y = X\beta + \varepsilon$, $E(\varepsilon) = 0$, and $\text{Var}(\varepsilon) = \sigma^2 I$,
(i) the usual estimate, $\hat{\beta} = (X^T X)^{-1} X^T y$, is unbiased;
(ii) the ridge regression estimate (call it $\hat{\beta}_{\lambda}$) is biased, but there exists an orthonormal matrix $V$ such that $\text{Var}(\hat{\beta}_{\lambda}) = \sigma^2 V D_{\lambda} V^T$, $\text{Var}(\hat{\beta}) = \sigma^2 V D V^T$, and $D_{\lambda}(j,j) \le D(j,j)$ for all $j$.
(Hint: Use the singular value decomposition of $X$.)
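A numerical sanity check of the variance claim in part (ii), which is not a proof, can be sketched as follows. It assumes the penalized objective from the earlier slide with $\alpha = 2$, i.e. $\frac{1}{2}\|y - X\beta\|^2 + \lambda\|\beta\|^2$, whose minimizer is $(X^T X + 2\lambda I)^{-1} X^T y$; the factor $2\lambda$ follows from that scaling and would change under a different convention. The random design matrix and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 50, 4, 0.7
X = rng.normal(size=(n, d))     # illustrative design matrix

# Thin SVD: X = U diag(s) V^T, with V orthonormal (rows of Vt are columns of V).
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Diagonals of D and D_lambda, i.e. the variances divided by sigma^2.
D_ols = 1.0 / s**2
D_ridge = s**2 / (s**2 + 2.0 * lam) ** 2

# Cross-check against the direct matrix expressions for the two variances.
G = X.T @ X
A = np.linalg.inv(G + 2.0 * lam * np.eye(d))
var_ols_direct = np.linalg.inv(G)     # Var(beta_hat) / sigma^2
var_ridge_direct = A @ G @ A          # Var(beta_hat_lambda) / sigma^2
```

Each diagonal entry of `D_ridge` is no larger than the matching entry of `D_ols`, since $s^2 / (s^2 + 2\lambda)^2 \le 1 / s^2$ whenever $\lambda \ge 0$.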

13 Ex III: LASSO
coordinate descent: at each iteration, solve a univariate problem,
$$\min_{\beta_j} L(\beta_j) = \frac{1}{2} \| z - \beta_j x_j \|^2 + \lambda |\beta_j| + c_j,$$
where
$$z \equiv y - \sum_{k \neq j} \beta_k x_k, \qquad c_j = \sum_{k \neq j} \lambda |\beta_k|,$$
while fixing all $\beta_k$, $k \neq j$.
cycle through $j = 1, 2, \ldots, d, 1, 2, \ldots, d, \ldots$

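14 Ex III: LASSO
$$\frac{d}{d\beta_j} L(\beta_j) = 0 \;\Rightarrow\; \beta_j \underbrace{(x_j^T x_j)}_{\|x_j\|^2 = 1} + \lambda \, \text{sgn}(\beta_j) = x_j^T z$$
$$\beta_j = \begin{cases} x_j^T z - \lambda, & \beta_j > 0; \\ x_j^T z + \lambda, & \beta_j < 0. \end{cases}$$
The first case requires $x_j^T z > \lambda$ and the second requires $x_j^T z < -\lambda$; when $|x_j^T z| \le \lambda$, the minimizer is $\beta_j = 0$.
[Figure: $\beta_j$ as a function of $x_j^T z$, with a 45° reference and axis marks at $-\lambda$, $0$, $\lambda$.]
- the solution will typically contain many zeros, i.e., be sparse
- this is the selection effect of the LASSO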
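The univariate solution and the cyclic scheme from these two slides can be combined into a short sketch. It is written in Python with NumPy and assumes, as on the slides, that every column of $X$ has unit norm; the names `soft_threshold` and `lasso_cd`, the fixed number of sweeps, and the synthetic data are illustrative assumptions rather than the lecture's own code.

```python
import numpy as np

def soft_threshold(t, lam):
    """Minimizer of 0.5*(b - t)^2 + lam*|b|, i.e. the update when ||x_j|| = 1."""
    return np.sign(t) * max(abs(t) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for 0.5*||y - X b||^2 + lam*||b||_1.

    Assumes every column of X has unit Euclidean norm.
    """
    d = X.shape[1]
    beta = np.zeros(d)
    resid = y.copy()                        # current residual y - X @ beta
    for _ in range(n_sweeps):
        for j in range(d):
            z = resid + beta[j] * X[:, j]   # z = y - sum_{k != j} beta_k x_k
            new = soft_threshold(X[:, j] @ z, lam)
            resid = z - new * X[:, j]
            beta[j] = new
    return beta

# Synthetic standardized data (an assumption made for illustration only).
rng = np.random.default_rng(3)
n, d = 100, 8
X = rng.normal(size=(n, d))
X -= X.mean(axis=0)
X /= np.linalg.norm(X, axis=0)              # unit-norm columns
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=n)
y -= y.mean()

beta = lasso_cd(X, y, lam=0.5)
```

A fixed sweep count keeps the sketch short; a practical implementation would stop when the largest coefficient change in a sweep falls below a tolerance.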

15 The $\ell_1$ Penalty
D. L. Donoho (2006), "For most large underdetermined systems of linear equations the minimal $\ell_1$-norm solution is also the sparsest solution," Communications on Pure and Applied Mathematics 59.
- the $\ell_1$-problem as a convex relaxation of the $\ell_0$-problem

16 Ex IV: Graphical LASSO
- from $x_i \overset{iid}{\sim} N(\mu, \Sigma)$, estimate $\Omega \equiv \Sigma^{-1}$ (hard for $d$ large)
- recall that linear discriminant analysis requires $\Sigma^{-1}$ (last week)
- for $x = (x_1, \ldots, x_d)^T \sim N(\mu, \Sigma)$, variables $x_j$ and $x_k$ are conditionally independent given all other variables if and only if $\Omega_{jk} = 0$ (Dempster, 1972; Biometrics)
- for Gaussian graphical models, there is an edge between nodes $j$ and $k$ if and only if $\Omega_{jk} \neq 0$
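The link between zeros of $\Omega$ and conditional independence can be illustrated with a small worked example. The sketch below, in Python with NumPy, uses a hypothetical 3-variable precision matrix for a chain graph 1 - 2 - 3; the matrix entries are chosen only for illustration. It shows that $\Omega_{13} = 0$ while $\Sigma_{13} \neq 0$, and that the conditional covariance of $(x_1, x_3)$ given $x_2$ has a zero off-diagonal entry.

```python
import numpy as np

# Chain graph 1 - 2 - 3: no edge between variables 1 and 3, so Omega[0, 2] = 0.
Omega = np.array([[ 2.0, -1.0,  0.0],
                  [-1.0,  2.0, -1.0],
                  [ 0.0, -1.0,  2.0]])
Sigma = np.linalg.inv(Omega)    # marginal covariance; Sigma[0, 2] is not zero

# Partial correlation of x_j and x_k given the rest:
# -Omega_jk / sqrt(Omega_jj * Omega_kk).
scale = np.sqrt(np.diag(Omega))
partial_corr = -Omega / np.outer(scale, scale)
np.fill_diagonal(partial_corr, 1.0)

# Same conclusion from Sigma alone: conditional covariance of (x_1, x_3) given x_2.
idx, rest = [0, 2], [1]
cond_cov = Sigma[np.ix_(idx, idx)] - Sigma[np.ix_(idx, rest)] @ np.linalg.solve(
    Sigma[np.ix_(rest, rest)], Sigma[np.ix_(rest, idx)]
)
```

Variables 1 and 3 are marginally correlated through variable 2, yet become uncorrelated (and, being Gaussian, independent) once variable 2 is held fixed.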

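17 Ex IV: Graphical LASSO
$$\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad S = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})(x_i - \hat{\mu})^T$$
$$L(\Sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left[ -\frac{1}{2} (x_i - \hat{\mu})^T \Sigma^{-1} (x_i - \hat{\mu}) \right]$$
$$\begin{aligned}
\ell(\Omega) &= \text{const} + \frac{n}{2} \log |\Sigma^{-1}| - \frac{1}{2} \sum_{i=1}^{n} \text{tr}\left[ (x_i - \hat{\mu})^T \Sigma^{-1} (x_i - \hat{\mu}) \right] \\
&= \text{const} + \frac{n}{2} \log |\Omega| - \frac{n}{2} \text{tr}(\Omega S)
\end{aligned}$$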

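18 Ex IV: Graphical LASSO
maximize the $\ell_1$-penalized likelihood (Friedman, Hastie & Tibshirani, 2008; Biostatistics):
$$\max_{\Omega \succ 0} \; \underbrace{\log |\Omega| - \text{tr}(\Omega S)}_{\ell(\Omega)} - \lambda \|\Omega\|_1, \qquad \text{where } \|\Omega\|_1 = \sum_{j,k} |\Omega_{jk}|$$
(the bracketed term equals $\ell(\Omega)$ up to an additive constant and the factor $n/2$)
- solved by coordinate descent (one row/column at a time)
- the resulting graphical model is sparse in the number of edges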
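To make the objective concrete, the sketch below evaluates the penalized criterion for a given $\Omega$ in Python with NumPy. It only evaluates the objective and does not implement the row/column coordinate descent of Friedman, Hastie & Tibshirani; the function name `penalized_loglik` and the synthetic data are illustrative assumptions. With $\lambda = 0$ the maximizer is $S^{-1}$, which the last lines compare against a perturbed alternative.

```python
import numpy as np

def penalized_loglik(Omega, S, lam):
    """log|Omega| - tr(Omega S) - lam * sum_{j,k} |Omega_jk|, for symmetric Omega."""
    eigvals = np.linalg.eigvalsh(Omega)
    if eigvals.min() <= 0:
        return -np.inf                       # Omega is not positive definite
    return np.log(eigvals).sum() - np.trace(Omega @ S) - lam * np.abs(Omega).sum()

# Synthetic Gaussian sample (an assumption made for illustration only).
rng = np.random.default_rng(4)
n, d = 400, 3
x = rng.normal(size=(n, d))
mu_hat = x.mean(axis=0)
S = (x - mu_hat).T @ (x - mu_hat) / n        # sample covariance with divisor n

Omega_mle = np.linalg.inv(S)                 # unpenalized maximizer
obj_at_mle = penalized_loglik(Omega_mle, S, lam=0.0)

# A perturbed positive-definite alternative, for comparison.
Omega_alt = Omega_mle + 0.1 * np.eye(d)
obj_at_alt = penalized_loglik(Omega_alt, S, lam=0.0)
```

At $\Omega = S^{-1}$ the unpenalized value is $-\log|S| - d$, and any other positive-definite matrix gives a smaller value because the criterion is strictly concave in $\Omega$.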

19 Ex IV: Graphical LASSO
[Figure: H-Ras. Left: inactive form (4q21) & active form (6q21). Regions labelled Switch I & Switch II are known to undergo major conformational changes between 4q21 and 6q21. Right: estimated graphical model showing conditional dependence structures, obtained by analysing 4q21 alone. (L. Soltan-Ghoraie, F. Burkowski & M. Zhu)]

20 Ex V: The Netflix Problem (Part 1)
- high-profile, million-dollar Netflix contest
- rating matrix $R$, where $r_{ui}$ = rating of item $i$ by user $u$
- set $T = \{(u,i) : r_{ui} \text{ observed}\}$
- want to predict the missing entries of $R$

21 Ex V: The Netflix Problem (Part 1)
[Figure: Illustration of the rating matrix $R$]

22 Ex V: The Netflix Problem (Part 1)
want to solve
$$\min_{\hat{R}} \; \text{rank}(\hat{R}) \quad \text{s.t.} \quad \hat{r}_{ui} = r_{ui} \text{ for } (u,i) \in T$$
- philosophy: only a few factors affect user preferences, so the rank of the rating matrix must be low
- but the problem is NP-hard

23 Ex V: The Netflix Problem (Part 1)
instead, solve the convex relaxation,
$$\min_{\hat{R}} \; \|\hat{R}\|_* \quad \text{s.t.} \quad \hat{r}_{ui} = r_{ui} \text{ for } (u,i) \in T$$
where $\|\cdot\|_*$ denotes the nuclear norm of a matrix
- let $\sigma_1, \sigma_2, \ldots$ be the singular values of $A$; then
$$\|A\|_* = \sum_j |\sigma_j|^1 \qquad \text{whereas} \qquad \text{rank}(A) = \sum_j I(\sigma_j \neq 0) = \sum_j |\sigma_j|^0$$
- a matter of $\ell_1$ vs $\ell_0$
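The contrast between the nuclear norm and the rank can be seen directly from the singular values. The following minimal sketch in Python with NumPy builds a hypothetical rank-2 matrix and computes both quantities; the matrix sizes, the seed, and the numerical tolerance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
# A 6 x 5 matrix of rank 2, built as a product of thin factors.
A = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5))

sigma = np.linalg.svd(A, compute_uv=False)   # singular values, in descending order
tol = 1e-10 * sigma[0]                       # threshold for "numerically nonzero"

nuclear_norm = sigma.sum()                   # l_1-type quantity: sum of |sigma_j|
rank = int((sigma > tol).sum())              # l_0-type quantity: count of nonzero sigma_j
```

In floating point the trailing singular values are tiny rather than exactly zero, which is why a tolerance is needed for the rank but not for the nuclear norm.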

24 Ex V: The Netflix Problem (Part 1)
E. J. Candès & B. Recht (2009), "Exact matrix completion via convex optimization," Found Comput Math 9.
B. Recht, M. Fazel, & P. A. Parrilo (2010), "Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization," SIAM Review 52.
- under certain conditions, $\min \text{rank}(\cdot)$ and $\min \|\cdot\|_*$ have the same solution
- $\min \|\cdot\|_*$ is a semi-definite program

25 Ex VI: The Netflix Problem (Part 2)
explicit parameterization
$$R \approx P Q^T = \begin{bmatrix} p_1^T \\ \vdots \\ p_n^T \end{bmatrix} \begin{bmatrix} q_1 & \cdots & q_m \end{bmatrix}$$
- $p_u, q_i \in \mathbb{R}^K$ ($K \ll n, m$) are latent coordinates
- just estimate $p_u, q_i$ and predict missing entries with $\hat{r}_{ui} = p_u^T q_i$
- in reality, user- and item-effects are removed prior to doing this

26 Ex VI: The Netflix Problem (Part 2)
$$\min_{p_u, q_i} \sum_{(u,i) \in T} \left( r_{ui} - p_u^T q_i \right)^2 + \lambda \left[ \sum_u \|p_u\|^2 + \sum_i \|q_i\|^2 \right]$$
- coordinate descent still applies (over $p_1, \ldots, p_n, q_1, \ldots, q_m, \ldots$); strictly speaking, blockwise coordinate descent
- each step is convex, but the overall problem is nonconvex (both $p_u$ and $q_i$ unknown)
- many traps (e.g., local solutions, saddle points)
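Because each block subproblem is a small ridge regression, the blockwise coordinate descent described on this slide can be sketched compactly. The code below is a minimal illustration in Python with NumPy on a synthetic low-rank matrix; the names `objective` and `als_sweep`, the observation rate, and the starting values are hypothetical, and no user or item effects are removed. Each update solves $(Q_u^T Q_u + \lambda I)\, p_u = Q_u^T r_u$, where $Q_u$ stacks the $q_i$ for items rated by user $u$, and symmetrically for $q_i$.

```python
import numpy as np

def objective(R, mask, P, Q, lam):
    """Penalized squared error over the observed entries."""
    err = (R - P @ Q.T)[mask]
    return (err**2).sum() + lam * ((P**2).sum() + (Q**2).sum())

def als_sweep(R, mask, P, Q, lam):
    """One pass of blockwise coordinate descent: all p_u, then all q_i."""
    K = P.shape[1]
    for u in range(R.shape[0]):
        Qu = Q[mask[u]]                      # q_i for items rated by user u
        P[u] = np.linalg.solve(Qu.T @ Qu + lam * np.eye(K), Qu.T @ R[u, mask[u]])
    for i in range(R.shape[1]):
        Pi = P[mask[:, i]]                   # p_u for users who rated item i
        Q[i] = np.linalg.solve(Pi.T @ Pi + lam * np.eye(K), Pi.T @ R[mask[:, i], i])
    return P, Q

# Synthetic rank-2 "ratings" with about half the entries observed (illustrative).
rng = np.random.default_rng(6)
n, m, K, lam = 30, 20, 2, 0.1
R = rng.normal(size=(n, K)) @ rng.normal(size=(m, K)).T
mask = rng.random((n, m)) < 0.5              # True where r_ui is observed
P = 0.1 * rng.normal(size=(n, K))
Q = 0.1 * rng.normal(size=(m, K))

history = [objective(R, mask, P, Q, lam)]
for _ in range(20):
    P, Q = als_sweep(R, mask, P, Q, lam)
    history.append(objective(R, mask, P, Q, lam))
```

Every block update minimizes the objective exactly over its block, so `history` never increases; this says nothing about reaching a global minimum, in line with the slide's warning about local solutions and saddle points.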

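27 Ex VII: MCP
For the penalized regression problem, Zhang (2010; Ann. Stat.) proposed the so-called minimax concave penalty (MCP):
$$J(\beta_j) = \lambda \int_0^{|\beta_j|} \left( 1 - \frac{x}{\gamma\lambda} \right)_+ dx = \begin{cases} \lambda |\beta_j| - \dfrac{\beta_j^2}{2\gamma}, & |\beta_j| \le \gamma\lambda; \\[2mm] \dfrac{1}{2} \gamma \lambda^2, & |\beta_j| > \gamma\lambda, \end{cases}$$
instead of $J(\beta_j) = \lambda |\beta_j|^{\alpha}$.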
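The piecewise formula can be checked against the integral definition numerically. The sketch below, in Python with NumPy, evaluates both; the function names `mcp_penalty` and `mcp_by_integration`, the grid of test points, and the trapezoidal rule are illustrative choices, while $\lambda = 1$ and $\gamma = 10$ match the values quoted for the MCP figure in these slides.

```python
import numpy as np

def mcp_penalty(beta, lam, gamma):
    """Minimax concave penalty in closed form, evaluated elementwise."""
    a = np.abs(beta)
    inner = lam * a - a**2 / (2.0 * gamma)   # branch for |beta| <= gamma * lam
    outer = 0.5 * gamma * lam**2             # branch for |beta| >  gamma * lam
    return np.where(a <= gamma * lam, inner, outer)

def mcp_by_integration(b, lam, gamma, n_grid=100001):
    """Same penalty from its integral definition, using the trapezoidal rule."""
    x = np.linspace(0.0, abs(b), n_grid)
    f = lam * np.maximum(1.0 - x / (gamma * lam), 0.0)
    return float(np.sum((f[1:] + f[:-1]) * np.diff(x)) / 2.0)

lam, gamma = 1.0, 10.0
grid = np.array([0.0, 0.5, 5.0, 10.0, 25.0])
closed_form = mcp_penalty(grid, lam, gamma)
numeric = np.array([mcp_by_integration(b, lam, gamma) for b in grid])
```

For small $|\beta_j|$ the penalty is close to $\lambda |\beta_j|$, and beyond $\gamma\lambda$ it stays at the constant $\frac{1}{2}\gamma\lambda^2$.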

28 [Figure: $J(\beta_j)$ plotted against $\beta_j$. (a) MCP ($\lambda = 1$, $\gamma = 10$); (b) LASSO ($\lambda = 1$).]

29 Ex VII: MCP
- for $\beta_j$ large, $J(\cdot)$ is constant beyond a certain point: this reduces bias
- for $\beta_j$ small, $J(\cdot)$ is still like the LASSO: this keeps sparsity
- in between, a smooth interpolation that minimizes maximal concavity
- coordinate descent applies, but there are many traps (e.g., local solutions, saddle points)

30 Summary
key ideas:
- introduce bias to reduce variance; penalty functions
- $\ell_1$ norm; nuclear norm; nonconvex penalties
specific methods:
- ridge; LASSO; MCP
- graphical LASSO
- matrix completion; matrix factorization
- Newton-Raphson; coordinate descent
application areas:
- proteins
- recommender systems

31 Next...
- 2 pm, by Professor S. Vavasis, on optimization
- a short, 10-minute break
- research with my students (mostly, the Netflix Problem)

Expanded Alternating Optimization of Nonconvex Functions with Applications to Matrix Factorization and Penalized Regression
W. James Murdoch and Mu Zhu
arXiv:1412.4128v1 [stat.CO] 12 Dec 2014
Abstract