Color Scheme. swright/pcmi/ M. Figueiredo and S. Wright () Inference and Optimization PCMI, July / 14


2 Statistical Inference via Optimization
Many problems in statistical inference can be formulated as optimization problems:
- image reconstruction
- image restoration / denoising
- supervised learning (regression / classification)
- unsupervised learning
- ...
Standard formulation:
- observed data: $y$
- unknown mathematical object (signal, image, vector, matrix, ...): $x$
- inference criterion: $\hat{x} \in \arg\min_x g(x, y)$

3 Inference via Optimization
Inference criterion: $\hat{x} \in \arg\min_x g(x, y) = \{x : g(x, y) \le g(z, y), \ \forall z\}$
Question 1: how to build $g$? Where does it come from?
Answer: from the application domain (machine learning, signal processing, inverse problems, system identification, statistics, computer vision, bioinformatics, ...) together with statistical principles. Examples ahead.
Question 2: how to solve the optimization problem?
Answer: we'll discuss this in these sessions (see also earlier sessions: Mahoney, Duchi, ...).

4 Inference and Regularized Optimization
Inference criterion: $\hat{x} \in \arg\min_x g(x, y)$
Typical structure of $g$: $g(x, y) = h(x, y) + \tau\psi(x)$
- $h(x, y)$: how well $x$ fits / explains the data $y$ (data term, log-likelihood, loss function, observation model, ...)
- $\psi(x)$: knowledge / constraints / structure: the regularizer
- $\tau \ge 0$: the regularization parameter (or constant)
Since $y$ is fixed, we often drop it for convenience and write $f(x) = h(x, y)$, so the problem becomes $\min_x f(x) + \tau\psi(x)$.

5 Probabilistic / Bayesian Interpretations
Inference criterion: $\hat{x} \in \arg\min_x g(x, y)$
Typical structure of $g$: $g(x, y) = h(x, y) + \tau\psi(x)$
Likelihood (observation model): $p(y \mid x) = \frac{1}{Z_l} \exp(-h(x, y))$
Prior: $p(x) = \frac{1}{Z_p} \exp(-\tau\psi(x))$
- Gaussian: $\psi(x) = \|x\|_2^2$
- Laplacian: $\psi(x) = \|x\|_1$
Posterior: $p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}$
Log-posterior: $\log p(x \mid y) = K(y) - h(x, y) - \tau\psi(x) = K(y) - g(x, y)$
Hence $\hat{x}$ is a maximum a posteriori (MAP) estimate.
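To make the Gaussian / Laplacian contrast concrete, here is a minimal sketch for a scalar unknown. It assumes a data term $h(x, y) = \frac{1}{2}(x - y)^2$ (a unit-variance Gaussian observation model, which is an assumption for illustration and not taken from the slides); the function names are hypothetical. Under that assumption, the Gaussian regularizer gives linear shrinkage and the Laplacian regularizer gives the standard soft-threshold, which sets small observations exactly to zero.

```python
import math

def map_gaussian(y, tau):
    """Minimizer of 0.5*(x - y)**2 + tau*x**2 (Gaussian prior, psi(x) = x^2)."""
    # Setting the derivative (x - y) + 2*tau*x to zero gives linear shrinkage.
    return y / (1.0 + 2.0 * tau)

def map_laplacian(y, tau):
    """Minimizer of 0.5*(x - y)**2 + tau*|x| (Laplacian prior, psi(x) = |x|)."""
    # Soft-threshold: shrink |y| by tau and clip at zero, keeping the sign of y.
    return math.copysign(max(abs(y) - tau, 0.0), y)

def objective(x, y, tau, psi):
    """g(x, y) = h(x, y) + tau*psi(x) with the assumed quadratic data term."""
    return 0.5 * (x - y) ** 2 + tau * psi(x)

# Small demo: the Laplacian prior maps the small observation to exactly zero.
for y in (3.0, 0.4, -2.0):
    print(y, map_gaussian(y, 1.0), map_laplacian(y, 1.0))
```

The `objective` helper is only there so the closed forms can be checked against a brute-force search over a grid of candidate values.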

6 Regularizers
Inference criterion: $\min_x f(x) + \tau\psi(x)$
Typically, the unknown is a vector $x \in \mathbb{R}^n$ or a matrix $x \in \mathbb{R}^{n \times m}$.
Common regularizers impose / encourage one (or a combination) of the following characteristics:
- small norm (vector or matrix)
- sparsity (few nonzeros)
- specific nonzero patterns (e.g., group / tree structure)
- low rank (matrix)
- smoothness or piecewise smoothness

7 Unconstrained vs Constrained Formulations
Tikhonov regularization: $\min_x f(x) + \tau\psi(x)$
Morozov regularization: $\min_x \psi(x)$ subject to $f(x) \le \varepsilon$
Ivanov regularization: $\min_x f(x)$ subject to $\psi(x) \le \delta$
Under mild conditions, these are all equivalent.
Morozov and Ivanov can be written as Tikhonov using indicator functions.
Which one is most convenient depends on the application and context.

8 Relationship Between $\ell_1$ and $\ell_0$
Finding the sparsest solution is NP-hard (Muthukrishnan, 2005):
$\hat{w} = \arg\min_w \|w\|_0$ s.t. $\|Aw - y\|_2^2 \le \delta$.
The related best subset selection problem is also NP-hard (Amaldi and Kann, 1998; Davis et al., 1997):
$\hat{w} = \arg\min_w \|Aw - y\|_2^2$ s.t. $\|w\|_0 \le \tau$.
Under conditions, replacing $\ell_0$ with $\ell_1$ yields similar results: this is a central issue in compressive sensing (CS) (Candès et al., 2006; Donoho, 2006).

9 Under-Constrained Systems: Relating $\ell_0$ and $\ell_1$
Let $\bar{x}$ be the sparsest solution of $Ax = y$, where $A \in \mathbb{R}^{m \times n}$ and $m < n$:
$\bar{x} = \arg\min_x \|x\|_0$ s.t. $Ax = y$.
Suppose that $\bar{x}$ has $k$ nonzero elements, with $k \ll n$.
Consider the $\ell_1$ norm version: $\min_x \|x\|_1$ s.t. $Ax = y$.
Advantage: this is a convex problem! Fact: all norms are convex.
$\bar{x}$ will solve this problem too, provided that $\|\bar{x} + v\|_1 \ge \|\bar{x}\|_1$, $\forall v \in \ker(A)$.
Recall: $\ker(A) = \{x \in \mathbb{R}^n : Ax = 0\}$ is the kernel (a.k.a. null space) of $A$.
Next: elementary analysis by Yin and Zhang (2008), based on work by Kashin (1977) and Garnaev and Gluskin (1984).
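The convexity of the $\ell_1$ problem means it can be handed to a standard solver. The sketch below is one common way to do that, not something taken from the slides: it rewrites $\min \|x\|_1$ s.t. $Ax = y$ as a linear program by splitting $x = u - v$ with $u, v \ge 0$, and solves it with SciPy's `linprog`. The problem sizes, the seed, and the names `basis_pursuit` and `x_bar` are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    """Solve min ||x||_1 subject to A x = y as a linear program."""
    m, n = A.shape
    # Split x = u - v with u, v >= 0; at an optimum, sum(u) + sum(v) = ||x||_1.
    c = np.ones(2 * n)
    A_eq = np.hstack([A, -A])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n] - res.x[n:]

# Illustrative instance: Gaussian A with m < n and a k-sparse vector x_bar.
rng = np.random.default_rng(0)
m, n, k = 20, 50, 3
A = rng.standard_normal((m, n))
x_bar = np.zeros(n)
x_bar[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
y = A @ x_bar

x_hat = basis_pursuit(A, y)
print("l1 norm of LP solution:", np.abs(x_hat).sum())
print("l1 norm of sparse vector:", np.abs(x_bar).sum())
print("max deviation from sparse vector:", np.abs(x_hat - x_bar).max())
```

Because `x_bar` is feasible, the solution returned by the LP can never have a larger $\ell_1$ norm than `x_bar`; whether the two coincide depends on the kernel condition stated above.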

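10 Equivalence Between $\ell_1$ and $\ell_0$
Minimum $\ell_0$ (sparsest) solution: $\bar{x} \in \arg\min_x \|x\|_0$ s.t. $Ax = y$.
Minimum $\ell_1$ solution(s): $G = \arg\min_x \|x\|_1$ s.t. $Ax = y$.
$\bar{x} \in G$ if $\|\bar{x} + v\|_1 \ge \|\bar{x}\|_1$, $\forall v \in \ker(A)$.
Let $S = \{i : \bar{x}_i \neq 0\}$ (the support of $\bar{x}$, with cardinality $k \ll n$) and $S^c = \{1, \dots, n\} \setminus S$. Then
$\|\bar{x} + v\|_1 = \|\bar{x}_S + v_S\|_1 + \|v_{S^c}\|_1$
$\ge \|\bar{x}_S\|_1 + \|v_{S^c}\|_1 - \|v_S\|_1$   (since $|a + b| \ge |a| - |b|$)
$= \|\bar{x}\|_1 + \|v\|_1 - 2\|v_S\|_1$   (since $\|v_{S^c}\|_1 = \|v\|_1 - \|v_S\|_1$)
$\ge \|\bar{x}\|_1 + \|v\|_1 - 2\sqrt{k}\,\|v\|_2$   (since $\|a\|_1 \le \sqrt{n}\,\|a\|_2$ for $a \in \mathbb{R}^n$, applied here to $v_S \in \mathbb{R}^k$)
Hence $\bar{x} \in G$ if $\frac{1}{2} \frac{\|v\|_1}{\|v\|_2} \ge \sqrt{k}$, $\forall v \in \ker(A)$.
...but, in general, we have only $1 \le \frac{\|v\|_1}{\|v\|_2} \le \sqrt{n}$.
However, we may have $\frac{\|v\|_1}{\|v\|_2} \gg 1$ if $v$ is restricted to a random subspace.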
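The chain of inequalities on this slide can be checked numerically. The sketch below is a sanity check under illustrative assumptions (random test vectors, hypothetical helper names, `x_bar` standing for the sparsest solution), not part of the original analysis. It computes the left-hand side $\|\bar{x} + v\|_1$ and the two lower bounds $\|\bar{x}\|_1 + \|v\|_1 - 2\|v_S\|_1$ and $\|\bar{x}\|_1 + \|v\|_1 - 2\sqrt{k}\|v\|_2$. The chain itself holds for any $v$; membership in $\ker(A)$ only matters for feasibility of $\bar{x} + v$.

```python
import math
import random

def l1(a):
    """l1 norm of a list of floats."""
    return sum(abs(t) for t in a)

def l2(a):
    """l2 norm of a list of floats."""
    return math.sqrt(sum(t * t for t in a))

def chain(x_bar, v):
    """Return (||x_bar + v||_1, first lower bound, second lower bound)."""
    support = [i for i, t in enumerate(x_bar) if t != 0.0]
    k = len(support)
    v_S = [v[i] for i in support]
    lhs = l1([a + b for a, b in zip(x_bar, v)])
    bound1 = l1(x_bar) + l1(v) - 2.0 * l1(v_S)
    bound2 = l1(x_bar) + l1(v) - 2.0 * math.sqrt(k) * l2(v)
    return lhs, bound1, bound2

# One illustrative draw: a 3-sparse x_bar in R^12 and a random perturbation v.
random.seed(0)
n = 12
x_bar = [0.0] * n
for i in random.sample(range(n), 3):
    x_bar[i] = random.gauss(0.0, 1.0)
v = [random.gauss(0.0, 1.0) for _ in range(n)]
print(chain(x_bar, v))
```

The three returned values should appear in non-increasing order, mirroring the derivation line by line.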

11 Bounding the $\ell_1 / \ell_2$ Ratio in Random Kernels
If the elements of $A \in \mathbb{R}^{m \times n}$ are sampled i.i.d. from $\mathcal{N}(0, 1)$ (zero mean, unit variance Gaussian), then, with high probability,
$\frac{\|v\|_1}{\|v\|_2} \ge \frac{C \sqrt{m}}{\sqrt{\log(n/m)}}$, for all $v \in \ker(A)$,
for some constant $C$ (based on concentration of measure phenomena).
Thus, with high probability, $\bar{x} \in G$ if $m \ge \frac{4}{C^2}\, k \log n$.
Conclusion: we can solve an under-determined system, where $A$ has i.i.d. $\mathcal{N}(0, 1)$ elements, by solving $\min_x \|x\|_1$ s.t. $Ax = b$ (a convex problem), if the solution is sparse enough.
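The step from the ratio bound to the condition on $m$ can be spelled out as follows. This is a sketch that reads the slide's bound as $\|v\|_1 / \|v\|_2 \ge C\sqrt{m} / \sqrt{\log(n/m)}$ and its sufficient condition as $\frac{1}{2}\|v\|_1 / \|v\|_2 \ge \sqrt{k}$ (both are assumed readings of the flattened notation), and uses $\log(n/m) \le \log n$ for $m \ge 1$.

```latex
\begin{align*}
\frac{1}{2}\,\frac{\|v\|_1}{\|v\|_2}
  &\ge \frac{C\sqrt{m}}{2\sqrt{\log(n/m)}}
  && \text{for all } v \in \ker(A), \text{ with high probability,} \\
\frac{C\sqrt{m}}{2\sqrt{\log(n/m)}} \ge \sqrt{k}
  &\iff m \ge \frac{4}{C^2}\, k \log(n/m), \\
m \ge \frac{4}{C^2}\, k \log n
  &\implies m \ge \frac{4}{C^2}\, k \log(n/m)
  && \text{since } \log(n/m) \le \log n .
\end{align*}
```

So the stated condition $m \ge \frac{4}{C^2} k \log n$ is sufficient (though slightly stronger than needed) for the kernel condition to hold.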

12 Ratio $\|v\|_1 / \|v\|_2$ on Random Null Spaces
Random $A \in \mathbb{R}^{4 \times 7}$, showing $\|v\|_1$ (the ratio, since $\|v\|_2 = 1$) for $v \in \ker(A)$ with $\|v\|_2 = 1$.
Blue: $\|v\|_1 \approx 1$. Red: ratio $\approx \sqrt{7}$.
Note that $\|v\|_1$ is well away from the lower bound of 1 over the whole null space.

13 Ratio $\|v\|_1 / \|v\|_2$ on Random Null Spaces
The effect grows more pronounced as $m/n$ grows.
Random $A \in \mathbb{R}^{17 \times 20}$, showing $\|v\|_1$ (the ratio, since $\|v\|_2 = 1$) for $v \in N(A)$, the null space of $A$, with $\|v\|_2 = 1$.
Blue: $\|v\|_1 \approx 1$. Red: $\|v\|_1 \approx \sqrt{20}$.
Note that $\|v\|_1$ is closer to the upper bound throughout.
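An experiment of this kind can be reproduced roughly as follows. This is a minimal sketch, not the authors' code: it draws a Gaussian matrix, builds an orthonormal basis of its null space from the SVD (assuming full row rank, which holds with probability one for Gaussian entries), and samples random unit vectors in that null space. The function names, sample count, and seed are illustrative assumptions, and random sampling only probes the range of the ratio rather than computing its exact minimum.

```python
import numpy as np

def null_space_basis(A):
    """Orthonormal basis of ker(A) via the SVD (assumes A has full row rank)."""
    m, n = A.shape
    _, _, Vt = np.linalg.svd(A, full_matrices=True)
    return Vt[m:].T  # columns span ker(A); shape (n, n - m)

def ratio_range(m, n, samples=2000, seed=0):
    """Sampled min and max of ||v||_1 over unit-l2 vectors v in ker(A)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n))
    N = null_space_basis(A)
    # N has orthonormal columns, so ||N c||_2 = ||c||_2; normalize c instead of v.
    C = rng.standard_normal((n - m, samples))
    C /= np.linalg.norm(C, axis=0)
    V = N @ C
    ratios = np.abs(V).sum(axis=0)  # equals ||v||_1 / ||v||_2 because ||v||_2 = 1
    return ratios.min(), ratios.max()

for m, n in [(4, 7), (17, 20)]:
    lo, hi = ratio_range(m, n)
    print(f"m={m}, n={n}: sampled ratio in [{lo:.3f}, {hi:.3f}], "
          f"general bounds [1, {np.sqrt(n):.3f}]")
```

Comparing the sampled range with the general bounds $[1, \sqrt{n}]$ for the two sizes gives a quick feel for how far the ratio stays from the lower bound on a random null space.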

14 References I
Amaldi, E. and Kann, V. (1998). On the approximation of minimizing non zero variables or unsatisfied relations in linear systems. Theoretical Computer Science, 209.
Candès, E., Romberg, J., and Tao, T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52.
Davis, G., Mallat, S., and Avellaneda, M. (1997). Greedy adaptive approximation. Journal of Constructive Approximation, 13.
Donoho, D. (2006). Compressed sensing. IEEE Transactions on Information Theory, 52.
Garnaev, A. and Gluskin, E. (1984). The widths of an Euclidean ball. Doklady Akademii Nauk, 277.
Kashin, B. (1977). Diameters of certain finite-dimensional sets in classes of smooth functions. Izvestiya Akademii Nauk SSSR: Seriya Matematicheskaya, 41.
Muthukrishnan, S. (2005). Data Streams: Algorithms and Applications. Now Publishers, Boston, MA.
Yin, W. and Zhang, Y. (2008). Extracting salient features from less data via $\ell_1$-minimization. SIAG/OPT Views-and-News, 19.
