Sparse Kernel Density Estimation Technique Based on Zero-Norm Constraint
1 Sparse Kernel Density Estimation Technique Based on Zero-Norm Constraint
Xia Hong (1), Sheng Chen (2), Chris J. Harris (2)
(1) School of Systems Engineering, University of Reading, Reading RG6 6AY, UK. E-mail: x.hong@reading.ac.uk
(2) School of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK. E-mails: {sqc,cjh}@ecs.soton.ac.uk
International Joint Conference on Neural Networks 2010
2 Outline
1. Motivations: Existing Regularisation Approaches; Our Contributions
2. Proposed Sparse Kernel Density Estimator: Problem Formulation; Approximate Zero-Norm Regularisation; D-Optimality Based Subset Selection
3. Numerical Examples: Experimental Set Up; Experimental Results
4. Conclusions
4 Regularisation Methods
- Two-norm of the weight vector: combines naturally with a quadratic main cost function and admits computationally efficient implementation, but only drives many weights to small, near-zero values.
- One-norm of the weight vector: can drive many weights exactly to zero, and hence should achieve sparser results than the two-norm based method; however, it is harder to minimise and costlier to implement.
- Zero-norm of the weight vector: offers the ultimate model sparsity and generalisation performance, but is intractable to implement; even with approximation it is very difficult to minimise and imposes very high complexity.
- Two-norm and one-norm based regularisations have both been combined with the OLS algorithm, with the former approach providing highly efficient sparse kernel modelling.
6 Our Contributions
- We incorporate an effective approximate zero-norm regularisation into sparse kernel density estimation.
- The approximate zero-norm merges naturally into the underlying constrained nonnegative quadratic programming, so various SVM algorithms can readily be applied to obtain the SKD estimate efficiently.
- Proposed sparse kernel density estimator: first use D-optimality OLS subset selection to select a small number of significant kernels, in terms of kernel eigenvalues; then solve for the final SKD estimate from the associated subset constrained nonnegative quadratic programming.
8 Kernel Density Estimation
Given a finite data set $D_N = \{\mathbf{x}_k\}_{k=1}^{N}$, drawn from an unknown density $p(\mathbf{x})$, where $\mathbf{x}_k \in \mathbb{R}^m$, infer $p(\mathbf{x})$ based on $D_N$ using the kernel density estimate
$$\hat{p}(\mathbf{x}; \boldsymbol{\beta}_N, \rho) = \sum_{k=1}^{N} \beta_k K_\rho(\mathbf{x}, \mathbf{x}_k) \quad \text{s.t. } \beta_k \ge 0,\ 1 \le k \le N,\ \boldsymbol{\beta}_N^T \mathbf{1}_N = 1$$
Here $\boldsymbol{\beta}_N = [\beta_1\ \beta_2 \cdots \beta_N]^T$ is the kernel weight vector, $\mathbf{1}_N$ the vector of ones of dimension $N$, and $K_\rho(\cdot,\cdot)$ the chosen kernel function with kernel width $\rho$.
Unsupervised density estimation → supervised regression using the Parzen window estimate as the desired response.
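As a concrete illustration (not code from the paper), here is a minimal numpy sketch of this estimator assuming a Gaussian kernel of width $\rho$; the function names are ours. The Parzen window estimate is the special case of uniform weights $\beta_k = 1/N$.

```python
import numpy as np

def gaussian_kernel(x, centres, rho):
    """Evaluate K_rho(x, x_k) for every centre x_k (Gaussian kernel, width rho)."""
    d2 = np.sum((centres - x) ** 2, axis=1)      # squared distances ||x - x_k||^2
    m = centres.shape[1]                         # input dimension
    return np.exp(-d2 / (2.0 * rho ** 2)) / (2.0 * np.pi * rho ** 2) ** (m / 2.0)

def kde(x, centres, beta, rho):
    """Kernel density estimate p_hat(x; beta, rho) = sum_k beta_k K_rho(x, x_k)."""
    return beta @ gaussian_kernel(x, centres, rho)

# Parzen window estimate: uniform weights beta_k = 1/N
X = np.random.randn(500, 2)                      # a toy data set D_N
beta_parzen = np.full(len(X), 1.0 / len(X))
print(kde(np.zeros(2), X, beta_parzen, rho=1.1))
```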
9 Regression Formulation
For $\mathbf{x}_k \in D_N$, denote $\hat{y}_k = \hat{p}(\mathbf{x}_k; \boldsymbol{\beta}_N, \rho)$, let $y_k$ be the Parzen window estimate at $\mathbf{x}_k$, and set $\varepsilon_k = y_k - \hat{y}_k$; this yields the regression formulation
$$y_k = \hat{y}_k + \varepsilon_k = \boldsymbol{\phi}_N^T(k)\,\boldsymbol{\beta}_N + \varepsilon_k \quad \text{or, over } D_N,\quad \mathbf{y} = \boldsymbol{\Phi}_N \boldsymbol{\beta}_N + \boldsymbol{\varepsilon}$$
The associated constrained nonnegative quadratic programming is
$$\min_{\boldsymbol{\beta}_N} \left\{ \tfrac{1}{2}\boldsymbol{\beta}_N^T \mathbf{B}_N \boldsymbol{\beta}_N - \mathbf{v}_N^T \boldsymbol{\beta}_N \right\} \quad \text{s.t. } \boldsymbol{\beta}_N^T \mathbf{1}_N = 1 \text{ and } \beta_i \ge 0,\ 1 \le i \le N$$
where $\mathbf{B}_N = \boldsymbol{\Phi}_N^T \boldsymbol{\Phi}_N$ is the design matrix and $\mathbf{v}_N = \boldsymbol{\Phi}_N^T \mathbf{y}$.
Note: this is not using the kernel density estimate to fit the Parzen window estimate!
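Continuing the sketch above, the quantities of this NNQP can be assembled as follows; again an illustrative sketch, where $\boldsymbol{\Phi}_N$ simply collects the kernel responses at the training points.

```python
def design_matrix(X, rho):
    """Phi_N with entries Phi[k, i] = K_rho(x_k, x_i)."""
    return np.stack([gaussian_kernel(x, X, rho) for x in X])

Phi = design_matrix(X, rho=1.1)
y = Phi @ beta_parzen        # desired response: Parzen window estimates at the x_k
B = Phi.T @ Phi              # design matrix B_N
v = Phi.T @ y                # v_N = Phi_N^T y
```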
11 Zero-Norm Constraint
Given $\alpha > 0$, an approximation to the zero norm $\|\boldsymbol{\beta}_N\|_0$ is
$$\|\boldsymbol{\beta}_N\|_0 \approx \sum_{i=1}^{N} \left(1 - e^{-\alpha |\beta_i|}\right)$$
Combining this zero-norm constraint with the constrained NNQP gives
$$\min_{\boldsymbol{\beta}_N} \left\{ \tfrac{1}{2}\boldsymbol{\beta}_N^T \mathbf{B}_N \boldsymbol{\beta}_N - \mathbf{v}_N^T \boldsymbol{\beta}_N + \lambda \sum_{i=1}^{N}\left(1 - e^{-\alpha |\beta_i|}\right) \right\} \quad \text{s.t. } \boldsymbol{\beta}_N^T \mathbf{1}_N = 1 \text{ and } \beta_i \ge 0,\ 1 \le i \le N$$
with $\lambda > 0$ a small regularisation parameter. With a 2nd-order Taylor series expansion for $e^{-\alpha |\beta_i|}$,
$$e^{-\alpha |\beta_i|} \approx 1 - \alpha |\beta_i| + \frac{\alpha^2 \beta_i^2}{2} \quad\Rightarrow\quad \sum_{i=1}^{N}\left(1 - e^{-\alpha |\beta_i|}\right) \approx \alpha \sum_{i=1}^{N} |\beta_i| - \frac{\alpha^2}{2} \sum_{i=1}^{N} \beta_i^2$$
12 Constrained NNQP
Hence the new constrained NNQP is
$$\min_{\boldsymbol{\beta}_N} \left\{ \tfrac{1}{2}\boldsymbol{\beta}_N^T \mathbf{A}_N \boldsymbol{\beta}_N - \mathbf{v}_N^T \boldsymbol{\beta}_N \right\} \quad \text{s.t. } \boldsymbol{\beta}_N^T \mathbf{1}_N = 1 \text{ and } \beta_i \ge 0,\ 1 \le i \le N$$
where $\mathbf{A}_N = \mathbf{B}_N - \delta \mathbf{I}_N$ and $\delta = \lambda\alpha^2$ is a predetermined small parameter.
Remark: under the convexity constraint on $\boldsymbol{\beta}_N$ (the weights are nonnegative and sum to one, so $\alpha\sum_i |\beta_i| = \alpha$ is constant), minimisation of the approximate zero norm is equivalent to maximisation of the two norm $\boldsymbol{\beta}_N^T \mathbf{I}_N \boldsymbol{\beta}_N$.
- The design matrix $\mathbf{B}_N$ should be positive definite, and $\delta$ must be bounded by the smallest eigenvalue of $\mathbf{B}_N$ so that $\mathbf{A}_N$ is also positive definite.
- It is common for the $\mathbf{B}_N$ of a large data set to be ill-conditioned, so the approach is most effective when applied after some model subset selection preprocessing (see the sketch below).
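A sketch of forming $\mathbf{A}_N$ with the eigenvalue check just described; the values of $\lambda$ and $\alpha$ are illustrative only, not from the paper.

```python
lam, alpha = 1e-3, 5.0                       # illustrative values; only delta = lam * alpha**2 matters
delta = lam * alpha ** 2                     # the predetermined small parameter
sigma_min = np.linalg.eigvalsh(B).min()      # smallest eigenvalue of B_N
if delta < sigma_min:
    A = B - delta * np.eye(len(B))           # A_N = B_N - delta * I_N stays positive definite
else:
    # B_N is too ill-conditioned for this delta -- precisely why the
    # D-optimality subset selection below is applied first
    raise ValueError("reduce delta or select a kernel subset first")
```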
14 D-Optimality Design
The least squares estimate $\hat{\boldsymbol{\beta}}_N = \mathbf{B}_N^{-1}\boldsymbol{\Phi}_N^T \mathbf{y}$ is unbiased, and the covariance matrix of the estimate satisfies $\mathrm{Cov}[\hat{\boldsymbol{\beta}}_N] \propto \mathbf{B}_N^{-1}$.
Estimation accuracy depends on the condition number
$$C = \frac{\max\{\sigma_i,\ 1 \le i \le N\}}{\min\{\sigma_i,\ 1 \le i \le N\}}$$
where $\sigma_i$, $1 \le i \le N$, are the eigenvalues of $\mathbf{B}_N$.
D-optimality design maximises the determinant of the design matrix: the selected subset model $\boldsymbol{\Phi}_{N_s}$ maximises $\det(\boldsymbol{\Phi}_{N_s}^T \boldsymbol{\Phi}_{N_s}) = \det(\mathbf{B}_{N_s})$.
This prevents an oversized, ill-posed model and high estimate variances. Unsupervised D-optimality design is particularly suitable for determining the structure of a kernel density estimate.
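Since $\det(\mathbf{B}_{N_s})$ equals the product of the squared norms of the orthogonalised columns of $\boldsymbol{\Phi}_{N_s}$, a greedy orthogonal forward pass (detailed on the next slide) can grow this determinant by repeatedly picking the candidate column with the largest residual energy. A minimal sketch under that interpretation, assuming modified Gram-Schmidt deflation:

```python
def d_optimality_ofr(Phi, n_select):
    """Greedy D-optimality subset selection by orthogonal forward regression.

    det(Phi_s^T Phi_s) is the product of the squared norms of the
    orthogonalised columns, so each step picks the candidate with the
    largest residual energy and deflates the remaining columns.
    """
    R = Phi.astype(float).copy()              # residuals of all candidate columns
    selected = []
    for _ in range(n_select):
        energy = np.sum(R ** 2, axis=0)
        energy[selected] = -np.inf            # never re-select a column
        k = int(np.argmax(energy))
        selected.append(k)
        w = R[:, k] / np.linalg.norm(R[:, k])
        R -= np.outer(w, w @ R)               # remove the new direction from all columns
    return selected

cols = d_optimality_ofr(Phi, n_select=12)     # N_s = 12 is arbitrary here
Phi_s = Phi[:, cols]
```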
15 OFR Aided Algorithm
Orthogonal forward regression selects a $\boldsymbol{\Phi}_{N_s}$ of $N_s$ significant kernels based on the D-optimality criterion; the complexity of this preprocessing is no more than $O(N^2)$.
This preprocessing results in the subset constrained NNQP
$$\min_{\boldsymbol{\beta}_{N_s}} \left\{ \tfrac{1}{2}\boldsymbol{\beta}_{N_s}^T \mathbf{A}_{N_s} \boldsymbol{\beta}_{N_s} - \mathbf{v}_{N_s}^T \boldsymbol{\beta}_{N_s} \right\} \quad \text{s.t. } \boldsymbol{\beta}_{N_s}^T \mathbf{1}_{N_s} = 1 \text{ and } \beta_i \ge 0,\ 1 \le i \le N_s$$
with $\mathbf{v}_{N_s} = \boldsymbol{\Phi}_{N_s}^T \mathbf{y}$, $\mathbf{A}_{N_s} = \mathbf{B}_{N_s} - \delta\mathbf{I}_{N_s}$, $\mathbf{B}_{N_s} = \boldsymbol{\Phi}_{N_s}^T \boldsymbol{\Phi}_{N_s}$, and $\delta < \mathbf{w}_{N_s}^T \mathbf{w}_{N_s}$.
Various SVM algorithms can be used to solve this problem. As $N_s$ is very small and $\mathbf{A}_{N_s}$ is well-conditioned, we use the simple multiplicative nonnegative quadratic programming (MNQP) algorithm, whose complexity is negligible in comparison with the $O(N^2)$ of the D-optimality based OFR preprocessing.
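A sketch of one common form of the multiplicative NNQP update for this problem, with a Lagrange multiplier enforcing the unit-sum constraint; the clipping and renormalisation lines are our safeguards against round-off, not part of the published algorithm.

```python
def mnqp(A, v, n_iter=200):
    """Multiplicative updates for min 1/2 b^T A b - v^T b, s.t. b >= 0, sum(b) = 1."""
    b = np.full(len(v), 1.0 / len(v))         # feasible uniform start
    for _ in range(n_iter):
        c = b / (A @ b)                       # per-component multiplicative factors
        h = (1.0 - c @ v) / c.sum()           # Lagrange multiplier keeping sum(b) = 1
        b = np.maximum(c * (v + h), 0.0)      # update; clip tiny negatives from round-off
        b /= b.sum()                          # renormalise onto the simplex
    return b

A_s = Phi_s.T @ Phi_s - delta * np.eye(len(cols))   # subset A_Ns = B_Ns - delta * I
v_s = Phi_s.T @ y
beta_s = mnqp(A_s, v_s)                       # sparse kernel weights on the selected subset
```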
17 Experimental Setup
The training set contained $N$ randomly drawn samples, while a test set of $N_{\text{test}} = 10{,}000$ samples was used for calculating the $L_1$ test error
$$L_1 = \frac{1}{N_{\text{test}}} \sum_{k=1}^{N_{\text{test}}} \left| p(\mathbf{x}_k) - \hat{p}(\mathbf{x}_k; \boldsymbol{\beta}_N, \rho) \right|$$
between the true density $p(\mathbf{x})$ and the estimate $\hat{p}(\mathbf{x}; \boldsymbol{\beta}_N, \rho)$. A numerical approximation of the Kullback-Leibler divergence (KLD)
$$D_{KL}(p \,\|\, \hat{p}) = \int_{\mathbb{R}^m} p(\mathbf{x}) \log \frac{p(\mathbf{x})}{\hat{p}(\mathbf{x}; \boldsymbol{\beta}_N, \rho)} \, d\mathbf{x}$$
was also used for testing in the 2-D cases.
The proposed SKD estimator is compared with the Parzen window (PW) estimator, our previous SKD estimator, the reduced set density estimator (RSDE), and the Gaussian mixture model (GMM) estimator.
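A sketch of the $L_1$ test error computation, reusing the kde helper from the earlier sketch; p_true stands for the known true density of the benchmark (an assumed callable, defined below for the first example).

```python
def l1_test_error(p_true, X_test, centres, beta, rho):
    """Mean absolute deviation between the true density and the estimate."""
    return float(np.mean([abs(p_true(x) - kde(x, centres, beta, rho))
                          for x in X_test]))
```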
19 First 2-D Example
True density: a mixture of Gaussian and Laplacian distributions,
$$p(x_1, x_2) = \frac{1}{4\pi}\, e^{-\frac{(x_1-2)^2}{2}}\, e^{-\frac{(x_2-2)^2}{2}} + \frac{0.35}{8}\, e^{-0.7|x_1+2|}\, e^{-0.5|x_2+2|}$$
$N = 500$, and the experiment was repeated $N_{\text{run}} = 100$ times.
Performance comparison, $N = 500$, averaged over 100 runs (most numeric entries were lost in transcription; only the surviving figures are shown):

| estimator | PW | previous SKD | RSDE | GMM | proposed SKD |
|---|---|---|---|---|---|
| kernel width | $\rho_{Par} = 0.42$ | $\rho = 1.1$ | $\rho = 1.2$ | tunable | $\rho = 1.1$ |
| $L_1$ | — | — | — | — | — $\pm$ 0.69 |
| KLD | — | — | — | — | — $\pm$ 0.31 |
| kernel number | $N$ (maximum) | — | — | — | — $\pm$ 1.5 (minimum) |

Similar test performance to the existing kernel density estimators, but a sparser estimate.
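For reference, the Gaussian-plus-Laplacian mixture above can be evaluated as follows; the $0.35/8$ factor is what makes the equally weighted mixture integrate to one. This is the p_true callable assumed by the l1_test_error sketch.

```python
def p_true(x):
    """True density of the first 2-D example: Gaussian + Laplacian mixture."""
    g = np.exp(-0.5 * (x[0] - 2) ** 2 - 0.5 * (x[1] - 2) ** 2) / (4.0 * np.pi)
    l = (0.35 / 8.0) * np.exp(-0.7 * abs(x[0] + 2) - 0.5 * abs(x[1] + 2))
    return g + l
```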
20 Second 2-D Example
True density: a mixture of five Gaussian distributions,
$$p(x, y) = \sum_{i=1}^{5} \frac{1}{10\pi}\, e^{-\frac{(x-\mu_{i,1})^2}{2}}\, e^{-\frac{(y-\mu_{i,2})^2}{2}}$$
Five means of the Gaussian distributions: [ ], [ ], [ ], [ ], and [ ].
Performance comparison, $N = 500$, averaged over 100 runs (most numeric entries were lost in transcription; only the surviving figures are shown):

| estimator | PW | previous SKD | RSDE | GMM | proposed SKD |
|---|---|---|---|---|---|
| kernel width | $\rho_{Par} = 0.5$ | $\rho = 1.1$ | $\rho = 1.2$ | tunable | $\rho = 1.0$ |
| $L_1$ | — | — | — | — | — $\pm$ 0.63 |
| KLD | — | — | — | — | — $\pm$ 1.09 |
| kernel number | $N$ (maximum) | — | — | — | — $\pm$ 1.3 (minimum) |

Similar test performance to the existing kernel density estimators, but a sparser estimate.
21 6-D Example
True density: a mixture of three Gaussian distributions,
$$p(\mathbf{x}) = \sum_{i=1}^{3} \frac{1}{3}\, \frac{1}{(2\pi)^{6/2}\, \det^{1/2}(\boldsymbol{\Gamma}_i)}\, e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \boldsymbol{\Gamma}_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i)}$$
with (the entries of the mean vectors were lost in transcription)
$\boldsymbol{\mu}_1 = [\ ]^T$, $\boldsymbol{\Gamma}_1 = \mathrm{diag}\{1.0, 2.0, 1.0, 2.0, 1.0, 2.0\}$
$\boldsymbol{\mu}_2 = [\ ]^T$, $\boldsymbol{\Gamma}_2 = \mathrm{diag}\{2.0, 1.0, 2.0, 1.0, 2.0, 1.0\}$
$\boldsymbol{\mu}_3 = [\ ]^T$, $\boldsymbol{\Gamma}_3 = \mathrm{diag}\{2.0, 1.0, 2.0, 1.0, 2.0, 1.0\}$
The estimation set contained $N = 600$ samples, and the experiment was repeated $N_{\text{run}} = 100$ times.
22 6-D Example Results
Performance comparison, $N = 600$, averaged over 100 runs (most numeric entries were lost in transcription; only the surviving figures are shown):

| estimator | PW | previous SKD | RSDE | GMM | proposed SKD |
|---|---|---|---|---|---|
| kernel width | $\rho_{Par} = 0.65$ | $\rho = 1.2$ | $\rho = 1.2$ | tunable | $\rho = 1.2$ |
| $L_1$ | — | — | — | — | — $\pm$ 0.24 |
| kernel number | $N$ (maximum) | — | — | — | — $\pm$ 1.3 (minimum) |

Similar test performance to the existing kernel density estimators, but a sparser estimate.
23 Conclusions
- We have integrated zero-norm regularisation naturally into the construction of a sparse kernel density estimator, with the classical Parzen window estimate as the desired response.
- The convexity constraint, combined with the zero-norm approximation, turns the problem into a tractable nonnegative quadratic programme.
- D-optimality preprocessing selects a small subset of significant kernels to ensure a well-conditioned solution.
- The complexity compares favourably with existing sparse kernel density estimators.
- The zero-norm regularisation and D-optimality aided estimator offers an efficient means of obtaining very sparse kernel density estimates with excellent generalisation performance.