Subset selection with sparse matrices

1 Subset selection with sparse matrices. Alberto Del Pia (University of Wisconsin-Madison), Santanu S. Dey (Georgia Tech), Robert Weismantel (ETH Zürich). February 1, 2018, Schloss Dagstuhl.

2 Subset selection for regression. A large number of variables can be observed, and we are interested in the value of a predictor variable $b$. Due to time or cost constraints, it is not feasible to sample all the variables every time a prediction is required. Goal: select a small subset of variables to predict $b$ in the future. Natural applications abound in medical or social studies. Example: predict the risk of heart disease in terms of observable quantities (age, sex, blood pressure, cholesterol level, ...). The goal is to identify a small set of attributes for future tests.

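3 Subset selection for regression. $\min \|Mx + \mu\mathbf{1} - b\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\mu \in \mathbb{R}$, where $\mathbf{1}$ is the all-ones vector. [Figure omitted; axis labels $b$, $m_1$, $m_2$.] Image credit: The Elements of Statistical Learning.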

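4 Subset selection for regression. $\min \|Mx + \mu\mathbf{1} - b\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\mu \in \mathbb{R}$. [Figure omitted; labels $b$, $\hat{b}$, $M_1$, $M_2$.] Image credit: The Elements of Statistical Learning.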

5 Approaches. $\min \|Mx + \mu\mathbf{1} - b\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\mu \in \mathbb{R}$. Different communities and different approaches: Enumeration: best-subset regression. Greedy algorithms: forward- and backward-stepwise selection, forward-stagewise regression, etc. Branch and bound: leaps and bounds procedure. Convex optimization: shrinkage methods (ridge regression, the lasso, least angle regression, etc.).

6 Complexity. $\min \|Mx + \mu\mathbf{1} - b\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\mu \in \mathbb{R}$. The problem is NP-hard [Welch, 1982]. Few polynomially solvable cases are known: when the $n - k$ largest eigenvalues of $M^\top M$ are identical, for $k$ fixed [Gao, Li, 2013]; when the covariance graph is a tree [Das and Kempe, 2008]. Approximation algorithm when the covariance has constant bandwidth [Das, Kempe, 2008]. Under some conditions, using the $\ell_1$ norm to replace the cardinality constraint yields the exact solution with overwhelming probability [Donoho, 2006] [Candes, Romberg, Tao 2006].

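7 Our contribution. $\min \|Mx + \mu\mathbf{1} - b\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\mu \in \mathbb{R}$. Can be solved in polynomial time if: i) $M$ is obtained from a diagonal matrix by adding a fixed number of extra columns, that is, $M = (D \mid c^1 \cdots c^k)$ with $D = \operatorname{diag}(d_1, \dots, d_n)$.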

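8 Our contribution. $\min \|Mx + \mu\mathbf{1} - b\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\mu \in \mathbb{R}$. Can be solved in polynomial time if: ii) $M$ is obtained by adding a fixed number of extra columns to a block diagonal matrix, where each block has a fixed number of variables, that is, $M = (A \mid c^1 \cdots c^k)$ with $A = \operatorname{diag}(A^1, \dots, A^h)$ block diagonal.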

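9 First reduction. $\min \|(A \mid c^1 \cdots c^k)\,x + \mu\mathbf{1} - b\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^{n+k}$, $\mu \in \mathbb{R}$.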

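10 First reduction. $\min \|(A \mid c^1 \cdots c^k)\,x + \mu\mathbf{1} - b\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^{n+k}$, $\mu \in \mathbb{R}$. Lemma: We just need to show how to solve in polynomial time the problem $\min \|Ax - (b - \sum_{l=1}^{k} c^l \lambda_l)\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\lambda \in \mathbb{R}^k$.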
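
One way to see why the lemma suffices is to guess which of the $k$ extra columns may be used. This is only a sketch: it assumes the intercept $\mu$ multiplies the all-ones vector $\mathbf{1}$, the names $x'$ and $T$ are introduced here for illustration, and it is not necessarily the authors' argument.

```latex
% Split the variable vector as (x', \lambda) with x' \in R^n and \lambda \in R^k.
\[
  (A \mid c^1 \cdots c^k)\begin{pmatrix} x' \\ \lambda \end{pmatrix} + \mu\mathbf{1} - b
  \;=\; A x' - \Bigl(b - \sum_{l=1}^{k} c^l \lambda_l - \mu\mathbf{1}\Bigr).
\]
% For each T \subseteq \{1,\dots,k\} with |T| \le \sigma, force \lambda_l = 0 for l \notin T
% and leave \lambda_l (l \in T) and \mu unconstrained:
\[
  \min \Bigl\| A x' - \Bigl(b - \sum_{l \in T} c^l \lambda_l - \mu\mathbf{1}\Bigr) \Bigr\|_2^2
  \quad \text{s.t.} \quad |\operatorname{supp}(x')| \le \sigma - |T| .
\]
```

Each subproblem has the form in the lemma, with at most $k+1$ unconstrained columns (those indexed by $T$ together with $\mathbf{1}$) and budget $\sigma - |T|$. There are at most $2^k$ choices of $T$, which is a constant for fixed $k$, and the best subproblem value equals the optimum of the original problem.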

11 The diagonal case

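12 The diagonal case. $\min \|Dx - (b - \sum_{l=1}^{k} c^l \lambda_l)\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\lambda \in \mathbb{R}^k$.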

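13 The diagonal case. $\min \|Dx - (b - \sum_{l=1}^{k} c^l \lambda_l)\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\lambda \in \mathbb{R}^k$. First, we consider a simpler problem by fixing the variables $\lambda$ (the fixed term $\sum_{l} c^l \lambda_l$ is absorbed into $b$): $\min \|Dx - b\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$.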

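14 The diagonal case. $\min \|Dx - (b - \sum_{l=1}^{k} c^l \lambda_l)\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\lambda \in \mathbb{R}^k$. First, we consider a simpler problem by fixing the variables $\lambda$: $\min \|\operatorname{diag}(d_1, \dots, d_n)\,x - b\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$.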

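15 The diagonal case. $\min \|Dx - (b - \sum_{l=1}^{k} c^l \lambda_l)\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\lambda \in \mathbb{R}^k$. First, we consider a simpler problem by fixing the variables $\lambda$: $\min \sum_{j=1}^{n} (d_j x_j - b_j)^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$.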

16 A simpler diagonal problem. $\min \sum_{j=1}^{n} (d_j x_j - b_j)^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$.

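17 A simpler diagonal problem. $\min \sum_{j=1}^{n} (d_j x_j - b_j)^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$. We can order the indices $1, \dots, n$ such that $|b_{j_1}| \ge |b_{j_2}| \ge \dots \ge |b_{j_n}|$. The optimal support $\{j_1, j_2, \dots, j_\sigma\}$ only depends on the ordering. An optimal solution is $x_{j_i} := b_{j_i}/d_{j_i}$ for $i = 1, \dots, \sigma$, and $x_j := 0$ otherwise.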
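
The ordering rule can be sketched in a few lines of Python. This is a minimal illustration, assuming every $d_j$ is nonzero and that the ordering is by decreasing $|b_j|$; the function names `solve_diagonal` and `brute_force_diagonal` are hypothetical and not from the slides.

```python
from itertools import combinations

def solve_diagonal(d, b, sigma):
    """Minimise sum_j (d[j]*x[j] - b[j])**2 with at most `sigma` nonzeros in x.

    Assumes every d[j] is nonzero."""
    n = len(b)
    # Sort indices by decreasing |b_j|: keeping index j removes b_j**2 from the objective.
    order = sorted(range(n), key=lambda j: -abs(b[j]))
    x = [0.0] * n
    for j in order[:sigma]:
        x[j] = b[j] / d[j]  # cancels the j-th residual
    value = sum((d[j] * x[j] - b[j]) ** 2 for j in range(n))
    return x, value

def brute_force_diagonal(b, sigma):
    """Reference value: try every support of size sigma (exponential, tiny n only)."""
    n = len(b)
    return min(sum(b[j] ** 2 for j in range(n) if j not in s)
               for s in combinations(range(n), sigma))

x, value = solve_diagonal([2.0, -1.0, 4.0, 0.5], [1.0, -3.0, 0.5, 2.0], sigma=2)
print(x, value)  # x == [0.0, 3.0, 0.0, 4.0], value == 1.25
```

The brute-force helper is there only to cross-check the sorted rule on small inputs; it enumerates all supports and is not meant to scale.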

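18 A simpler diagonal problem. $\min \sum_{j=1}^{n} (d_j x_j - b_j)^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$. [Figure omitted; labels $x_1$, $x_2$, $b$.]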

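19 Back to the diagonal case. $\min \|Dx - (b - \sum_{l=1}^{k} c^l \lambda_l)\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\lambda \in \mathbb{R}^k$.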

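20 Back to the diagonal case. $\min \|Dx - (b - \sum_{l=1}^{k} c^l \lambda_l)\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\lambda \in \mathbb{R}^k$. [Figure omitted; labels $x_1$, $x_2$, $b - \sum_{l=1}^{k} c^l \lambda_l$.]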

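21 Back to the diagonal case. $\min \|Dx - (b - \sum_{l=1}^{k} c^l \lambda_l)\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\lambda \in \mathbb{R}^k$. We wish to partition all $\lambda$ vectors based on the optimal support they yield. In each cell of the partition, we want to have an ordering of all the values $|b_i - \sum_{l=1}^{k} c^l_i \lambda_l|$, $i = 1, \dots, n$.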

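22 Back to the diagonal case. $\min \|Dx - (b - \sum_{l=1}^{k} c^l \lambda_l)\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\lambda \in \mathbb{R}^k$. For every two indices $i, j$, we subdivide all points $\lambda \in \mathbb{R}^k$ based on whether $|b_i - \sum_{l=1}^{k} c^l_i \lambda_l| \ge |b_j - \sum_{l=1}^{k} c^l_j \lambda_l|$. We just need four hyperplanes. In total we obtain $O(n^2)$ hyperplanes. By the hyperplane arrangement theorem, we obtain $O(n^{2k})$ polyhedra $Q_t$ in $\mathbb{R}^k$.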
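
The count can be made explicit. Writing $u_i(\lambda) = b_i - \sum_{l} c^l_i \lambda_l$ (a shorthand introduced here, not taken from the slides), the comparison between $|u_i|$ and $|u_j|$ can only change across the following four hyperplanes, and the standard bound on the number of regions of a hyperplane arrangement then gives the number of cells:

```latex
\[
  u_i(\lambda) = 0, \qquad u_j(\lambda) = 0, \qquad
  u_i(\lambda) = u_j(\lambda), \qquad u_i(\lambda) = -u_j(\lambda).
\]
% Over all pairs i < j (with N the total number of hyperplanes):
\[
  N \le 4\binom{n}{2} = O(n^2), \qquad
  \#\text{cells} \;\le\; \sum_{t=0}^{k} \binom{N}{t} \;=\; O(N^k) \;=\; O(n^{2k})
  \quad \text{for fixed } k.
\]
```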

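23 Back to the diagonal case. $\min \|Dx - (b - \sum_{l=1}^{k} c^l \lambda_l)\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\lambda \in \mathbb{R}^k$. Consider one polyhedron $Q_t$. We obtain an ordering of all the $|b_i - \sum_{l=1}^{k} c^l_i \lambda_l|$ that holds for all $\lambda \in Q_t$. The $\sigma$ indices highest in the ordering yield an optimal support for all $\lambda$ in $Q_t$.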

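24 Back to the diagonal case. $\min \|Dx - (b - \sum_{l=1}^{k} c^l \lambda_l)\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\lambda \in \mathbb{R}^k$. Let $X$ be the set containing all these $O(n^{2k})$ supports. For each $\chi \in X$, the original problem with the additional constraints $x_i = 0$ for all $i \notin \chi$ can be solved in polynomial time. The original problem has an optimal solution with support contained in $\chi$ for some $\chi \in X$.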
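
For a single extra column ($k = 1$) the cells are intervals of the real line, so the whole procedure fits in a short sketch. The code below is an illustration under stated assumptions, not the authors' implementation: every $d_i$ is nonzero (so residuals on the support can be cancelled and $D$ drops out of the objective value), there is one column $c$ with scalar $\lambda$, and $\sigma \le n$. All function names are hypothetical. Only the hyperplanes $u_i = \pm u_j$ are used as cut points, since these are the ones where the ordering of the $|u_i|$ can change.

```python
from itertools import combinations

def residual_after_support(b, c, support):
    """Best objective when x may be nonzero only on `support` and lambda is free.

    With d[i] != 0, rows in the support have zero residual, so only the other
    rows contribute: a one-dimensional least-squares problem in lambda."""
    rest = [i for i in range(len(b)) if i not in support]
    den = sum(c[i] ** 2 for i in rest)
    lam = sum(c[i] * b[i] for i in rest) / den if den > 0 else 0.0
    return sum((b[i] - c[i] * lam) ** 2 for i in rest), lam

def candidate_supports(b, c, sigma):
    """One support per interval of the arrangement on the lambda line."""
    n = len(b)
    cuts = set()
    for i, j in combinations(range(n), 2):
        if c[i] != c[j]:           # u_i(lambda) = u_j(lambda)
            cuts.add((b[i] - b[j]) / (c[i] - c[j]))
        if c[i] + c[j] != 0:       # u_i(lambda) = -u_j(lambda)
            cuts.add((b[i] + b[j]) / (c[i] + c[j]))
    cuts = sorted(cuts)
    if cuts:
        mids = [(p + q) / 2 for p, q in zip(cuts, cuts[1:])]
        samples = [cuts[0] - 1.0] + mids + [cuts[-1] + 1.0]
    else:
        samples = [0.0]
    supports = set()
    for lam in samples:
        # Inside a cell the ordering of |b_i - c_i*lambda| is constant.
        order = sorted(range(n), key=lambda i: -abs(b[i] - c[i] * lam))
        supports.add(frozenset(order[:sigma]))
    return supports

def solve_k1(b, c, sigma):
    """Return (optimal value, support, lambda) using only the candidate supports."""
    cands = candidate_supports(b, c, sigma)
    best = min(cands, key=lambda s: residual_after_support(b, c, s)[0])
    value, lam = residual_after_support(b, c, best)
    return value, sorted(best), lam

def brute_force_k1(b, c, sigma):
    """Reference value over all supports of size sigma (exponential)."""
    return min(residual_after_support(b, c, s)[0]
               for s in combinations(range(len(b)), sigma))
```

On a small instance, `solve_k1(b, c, sigma)[0]` can be compared with `brute_force_k1(b, c, sigma)`; the two values should agree up to rounding, while the number of candidate supports grows only quadratically in $n$.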

25 The block diagonal case

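26 The block diagonal case. $\min \|Ax - (b - \sum_{l=1}^{k} c^l \lambda_l)\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\lambda \in \mathbb{R}^k$.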

27 The block diagonal case. $\min \|Ax - (b - \sum_{l=1}^{k} c^l \lambda_l)\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\lambda \in \mathbb{R}^k$. Same overall strategy: 1. Design an algorithm for the simpler problem obtained by fixing the variables $\lambda$. 2. Cover the space of the $\lambda$ variables with a polynomial number of regions such that in each region the algorithm yields the same optimal support.

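28 The block diagonal case. $\min \|Ax - (b - \sum_{l=1}^{k} c^l \lambda_l)\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\lambda \in \mathbb{R}^k$. First, we consider a simpler problem by fixing the variables $\lambda$ (the fixed term $\sum_{l} c^l \lambda_l$ is absorbed into $b$): $\min \|Ax - b\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$.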

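29 The block diagonal case. $\min \|Ax - (b - \sum_{l=1}^{k} c^l \lambda_l)\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\lambda \in \mathbb{R}^k$. First, we consider a simpler problem by fixing the variables $\lambda$: $\min \left\| \begin{pmatrix} A^1 & & \\ & \ddots & \\ & & A^h \end{pmatrix} \begin{pmatrix} x^1 \\ \vdots \\ x^h \end{pmatrix} - \begin{pmatrix} b^1 \\ \vdots \\ b^h \end{pmatrix} \right\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$.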

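30 The block diagonal case. $\min \|Ax - (b - \sum_{l=1}^{k} c^l \lambda_l)\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\lambda \in \mathbb{R}^k$. First, we consider a simpler problem by fixing the variables $\lambda$: $\min \sum_{i=1}^{h} \|A^i x^i - b^i\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$.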

31 A simpler block diagonal problem. $\min \sum_{i=1}^{h} \|A^i x^i - b^i\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$.

32 A simpler block diagonal problem. $\min \sum_{i=1}^{h} \|A^i x^i - b^i\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$. We have to decide how to distribute $\sigma$ among the different blocks. We can do it recursively for support $1, 2, \dots, \sigma$. At each iteration we only need to redistribute $O(\theta^3)$ supports, where $\theta$ is the maximum number of variables in a block. The optimal support only depends on an ordering of $h^{O(\theta^3)}$ values of the form $\bigl\{\min\{\|A^i x^i - b^i\|_2^2 : x^i \in \mathbb{R}^{n_i},\ |\operatorname{supp}(x^i)| \le j\}\bigr\}_{i,j}$.
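
To make the budget-distribution step concrete, here is a small Python sketch. It deliberately swaps in a standard knapsack-style dynamic program rather than the ordering-based recursive procedure described on the slide, and it assumes the per-block optimal values are already available in a table `values[i][j]` (best objective of block $i$ with at most $j$ nonzeros). The function name and the table layout are hypothetical.

```python
def distribute_budget(values, sigma):
    """Split a support budget `sigma` among blocks.

    values[i][j] is the best objective of block i using at most j nonzeros,
    for j = 0..theta_i. Returns (total objective, nonzeros given to each block)."""
    INF = float("inf")
    # best[s] = minimum cost of the blocks processed so far using exactly s units.
    best = [0.0] + [INF] * sigma
    choices = []
    for vals in values:
        new = [INF] * (sigma + 1)
        pick = [0] * (sigma + 1)
        for s in range(sigma + 1):
            for j in range(min(s, len(vals) - 1) + 1):
                cand = best[s - j] + vals[j]
                if cand < new[s]:
                    new[s], pick[s] = cand, j
        best = new
        choices.append(pick)
    # Using less than the full budget is allowed, so take the best reachable s.
    s = min(range(sigma + 1), key=lambda t: best[t])
    total = best[s]
    alloc = []
    for pick in reversed(choices):  # walk back through the recorded choices
        alloc.append(pick[s])
        s -= pick[s]
    alloc.reverse()
    return total, alloc

values = [[5.0, 2.0, 1.0], [4.0, 0.5], [3.0, 2.5, 0.0]]
print(distribute_budget(values, 3))  # (4.5, [2, 1, 0])
```

The dynamic program runs in time proportional to $h \cdot \sigma \cdot \theta$ for fixed $\lambda$, but unlike the slide's procedure it does not expose the ordering structure that the later partitioning argument relies on.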

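33 Back to the block diagonal case. (P): $\min \|Ax - (b - \sum_{l=1}^{k} c^l \lambda_l)\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\lambda \in \mathbb{R}^k$.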

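34 Back to the block diagonal case. (P): $\min \|Ax - (b - \sum_{l=1}^{k} c^l \lambda_l)\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\lambda \in \mathbb{R}^k$. We wish to partition all $\lambda$ vectors based on the optimal support they yield. In each cell of the partition, we want to have an ordering of all the $\bigl\{\min\{\|A^i x^i - (b^i - \sum_{l=1}^{k} c^i_l \lambda_l)\|_2^2 : x^i \in \mathbb{R}^{n_i},\ |\operatorname{supp}(x^i)| \le j\}\bigr\}_{i,j}$, where $c^i_l$ denotes the rows of $c^l$ belonging to block $i$.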

35 Partitioning the space. $\bigl\{\min\{\|A^i x^i - (b^i - \sum_{l=1}^{k} c^i_l \lambda_l)\|_2^2 : x^i \in \mathbb{R}^{n_i},\ |\operatorname{supp}(x^i)| \le j\}\bigr\}_{i,j}$. Issue 1: There is no such polyhedral decomposition of the $\lambda$ space. Why? There are quadratic functions involved. Solution: Linearization. Instead of the $\lambda$ space, we define a new space $S$ where we can linearize any quadratic monomial. The dimension of $S$ is $O(k^2)$.
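
The linearization can be illustrated as follows. This is a sketch with symbols introduced here for illustration ($v$, $Q_{lm}$, $q_l$, $r$, $y_{lm}$), not the slides' notation. For a fixed block and a fixed support, the least-squares optimum is the squared norm of a projected vector that depends affinely on $\lambda$, hence a quadratic function $v(\lambda)$; replacing each product $\lambda_l \lambda_m$ by a new coordinate makes it linear.

```latex
\[
  v(\lambda) = \sum_{1 \le l \le m \le k} Q_{lm}\,\lambda_l \lambda_m + \sum_{l=1}^{k} q_l\,\lambda_l + r
  \;\;\longmapsto\;\;
  \sum_{1 \le l \le m \le k} Q_{lm}\,y_{lm} + \sum_{l=1}^{k} q_l\,\lambda_l + r,
  \qquad y_{lm} := \lambda_l \lambda_m .
\]
% Coordinates of the lifted space in this construction: the k original \lambda_l
% plus one y_{lm} for each pair l \le m.
\[
  \dim S = k + \binom{k+1}{2} = O(k^2).
\]
```

Comparing two such functions then amounts to a linear inequality in the coordinates $(\lambda, y)$, which is what allows hyperplane arrangements to be used again.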

36 Partitioning the space. $\bigl\{\min\{\|A^i x^i - (b^i - \sum_{l=1}^{k} c^i_l \lambda_l)\|_2^2 : x^i \in \mathbb{R}^{n_i},\ |\operatorname{supp}(x^i)| \le j\}\bigr\}_{i,j}$. Issue 2: The objective value of each $(i, j)$ depends on its optimal support. Solution: Enumeration. We partition $S$ into polyhedra $P_t \subseteq S$, such that in each of them every $(i, j)$ has a fixed optimal support.

37 Partitioning the space. $\bigl\{\min\{\|A^i x^i - (b^i - \sum_{l=1}^{k} c^i_l \lambda_l)\|_2^2 : x^i \in \mathbb{R}^{n_i},\ |\operatorname{supp}(x^i)| \le j\}\bigr\}_{i,j}$. How? For each fixed support, the optimal value of every $(i, j)$ is a quadratic function. By comparing all pairs of quadratic functions, we obtain $O(h)$ hyperplanes in $S$ (for fixed $\theta$). By the hyperplane arrangement theorem, we obtain $h^{O(k^2)}$ polyhedra $P_t$ in $S$. The best quadratic function among those corresponding to $(i, j)$ yields the optimal support of $(i, j)$.

38 Partitioning the space. $\bigl\{\min\{\|A^i x^i - (b^i - \sum_{l=1}^{k} c^i_l \lambda_l)\|_2^2 : x^i \in \mathbb{R}^{n_i},\ |\operatorname{supp}(x^i)| \le j\}\bigr\}_{i,j}$. Now we can apply the old partitioning procedure: in each $P_t$ every expression is a linear function in $S$. For every two such expressions, we subdivide all points in $S$ based on which is larger. In total we obtain $h^{O(\theta^3)}$ hyperplanes. By the hyperplane arrangement theorem, we obtain $h^{O(\theta^3 k^2)}$ polyhedra $Q_{t,u} \subseteq P_t$.

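39 Wrapping up the block diagonal case. $\min \|Ax - (b - \sum_{l=1}^{k} c^l \lambda_l)\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\lambda \in \mathbb{R}^k$. Consider one polyhedron $Q_{t,u}$. We obtain an ordering of all the $\bigl\{\min\{\|A^i x^i - (b^i - \sum_{l=1}^{k} c^i_l \lambda_l)\|_2^2 : x^i \in \mathbb{R}^{n_i},\ |\operatorname{supp}(x^i)| \le j\}\bigr\}_{i,j}$ that holds for all $\lambda \in Q_{t,u}$. The algorithm for the simpler block diagonal problem yields the same optimal support for all $\lambda$ in $Q_{t,u}$.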

40 Wrapping up the block diagonal case. $\min \|Ax - (b - \sum_{l=1}^{k} c^l \lambda_l)\|_2^2$ subject to $|\operatorname{supp}(x)| \le \sigma$, $x \in \mathbb{R}^n$, $\lambda \in \mathbb{R}^k$. Let $X$ be the set containing all these $h^{O(\theta^3 k^2)}$ supports. For each $\chi \in X$, the original problem with the additional constraints $x_i = 0$ for all $i \notin \chi$ can be solved in polynomial time. The original problem has an optimal solution with support contained in $\chi$ for some $\chi \in X$.
