Sparse Learning and Distributed PCA. Jianqing Fan
1 Sparse Learning and Distributed PCA, with control of statistical errors and computing resources. Jianqing Fan, Princeton University
2 Coauthors: Han Liu, Qiang Sun, Tong Zhang, Dong Wang, Kaizheng Wang, Ziwei Zhu
3 Outline. Computational resources and statistical errors. Sparse learning: complexity and stat error. Distributed PCA: communication and stat error. Conclusion.
4 Introduction
5 Data Tsunami. Large and complex data arise routinely. Biological Sci.: genomics, genetics, neuroscience, medicine. Engineering: machine learning, surveillance videos, social media, networks. Natural Sci.: astronomy, earth sciences, meteorology. Social Sci.: economics, finance, business, and digital humanities.
6 Big Data are ubiquitous. [Figure: Big Data at the center of Medicine, Internet, Business, Science, Finance, Biological Science, Digital Humanities, Government, and Engineering, with data volumes in exabytes (EB) and zettabytes (ZB)]
7 Deep Impact. System: storage, communication, computation architectures. Analysis: statistics, computation, optimization, privacy.
8 Data Science. [Figure: data science pipeline: Acquisition & Storage, Computing, Analysis, Applications]
9 Statisticians' Dream: efficient methods, computable in polynomial time and communicable.
10 About this talk. Control: statistical error + computing resources. Sparse learning technique: folded concave penalty (complexity). Distributed PCA approach: divide-and-conquer (communication).
11 Sparse Learning, with control of computational complexity and statistical error. Han Liu, Qiang Sun, Tong Zhang.
12 A General Framework for Sparsity Learning. Penalized M-estimator (Fan & Li, 01; Negahban et al., 12): $\hat\beta = \arg\min_{\beta \in \mathbb{R}^d} \{ L(\beta) + P_\lambda(\beta) \}$. $L(\beta)$: quadratic, GLIM, Gaussian pseudo-likelihood, Huber loss, etc. $P_\lambda(\cdot)$: Lasso, folded concave: SCAD or MCP. [Figure: penalty functions $P_\lambda(\beta_j)$ versus $\beta_j$ for Lasso, SCAD, and MCP]
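These penalties enter the algorithms below only through their derivatives $p_\lambda'$. A minimal numpy sketch of the three derivatives, which is all the LLA scheme later in the talk needs; the function names and the default values of $a$ are our choices, not the slides':

```python
import numpy as np

# Derivatives p'_lambda(t) for t >= 0; these become the LLA weights below.

def dlasso(t, lam):
    """Lasso: p_lambda(t) = lam * t, so p'_lambda(t) = lam everywhere."""
    return lam * np.ones_like(t)

def dscad(t, lam, a=3.7):
    """SCAD derivative (Fan & Li, 2001): constant lam near 0, then linearly
    decaying, and exactly 0 for t >= a*lam (no bias on large coefficients)."""
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))

def dmcp(t, lam, a=3.0):
    """MCP derivative (Zhang, 2010): lam - t/a, clipped at 0 for t >= a*lam."""
    return np.maximum(lam - t / a, 0.0)
```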
13 Convex and Folded-Concave Regularization. Lasso is fast to compute, but introduces estimation bias. Inferior estimation rate: $\|\hat\beta - \beta^*\|_2 \asymp \sqrt{s\log d / n}$ (Bickel et al., 09; Negahban et al., 12; ...). Folded-concave estimators have oracle properties. Oracle estimation rate: $\|\hat\beta - \beta^*\|_2 \asymp \sqrt{s/n}$, with beta-min condition. Obtain the oracle estimator with high probability, but require heavy computation.
15 Nonconvex Penalized M-estimator. Theoretical properties hold for the global optimum (or a specific local optimum), not obtainable in practice. Local min: Fan & Li (2001), Kim et al. (2008), Fan & Lv (2011), Wang, Liu, Zhang (2014), Loh & Wainwright (2014), ... Global min: Zhang & Zhang (2012), Kim & Kwon (2012), Negahban et al. (2012), Agarwal et al. (2012), ... Practical algorithms without theoretical guarantees: coordinate descent (Friedman et al., 08; Breheny & Huang, 11). What is computed is not what is proved.
17 Algorithmic Approach to Sparse Learning
18 Folded Concave Penalized Regression: $\arg\min L(\beta) + \sum_{j=1}^d p_\lambda(|\beta_j|)$. Given a good initial estimator $\hat\beta^{(0)}$, LQA and LLA: LQA: $p_\lambda(|\beta_j|) \approx p_\lambda(|\hat\beta_j^{(0)}|) + 0.5\,\{p_\lambda'(|\hat\beta_j^{(0)}|)/|\hat\beta_j^{(0)}|\}\,(\beta_j^2 - (\hat\beta_j^{(0)})^2)$ (Fan & Li, 01). LLA: $p_\lambda(|\beta_j|) \approx p_\lambda(|\hat\beta_j^{(0)}|) + p_\lambda'(|\hat\beta_j^{(0)}|)(|\beta_j| - |\hat\beta_j^{(0)}|)$ (Zou & Li, 2008). Iterative use is an MM algorithm (Hunter & Li, 05). [Figure: LQA and LLA majorizations of the penalty]
19 Local Linear Approximation. Given $\hat\beta^{(0)} = 0$, compute $\hat\beta^{(1)} = \arg\min \{ L(\beta) + \sum_{j=1}^d \lambda_j^{(0)} |\beta_j| \}$ with $\lambda_j^{(0)} = p_\lambda'(|\hat\beta_j^{(0)}|)$, then $\hat\beta^{(2)} = \arg\min \{ L(\beta) + \sum_{j=1}^d \lambda_j^{(1)} |\beta_j| \}$ with $\lambda_j^{(1)} = p_\lambda'(|\hat\beta_j^{(1)}|)$, and so on. How to solve each step? When to stop?
20 A Scalable Solution. Isotropic LQA: $L(\hat\beta^{(0)}) + \langle \nabla L(\hat\beta^{(0)}),\, \beta - \hat\beta^{(0)} \rangle + \frac{\phi}{2}\|\beta - \hat\beta^{(0)}\|^2$. Avoids storage and computation of the Hessian. Analytic solution with the LLA penalty $\sum_{j=1}^d w_j |\beta_j|$. Related to (proximal) gradient methods (Nesterov, 83, 13; Tseng, 08; Boyd & Vandenberghe, 09; Agarwal et al.); used by Loh and Wainwright, 14; Wang, Liu, Zhang, 14. Global majorization requires $\phi \ge \|\nabla^2 L(\hat\beta^{(0)})\|$, but majorization is required only locally, which is solved by line search.
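A minimal sketch of one LAMM step, assuming the quadratic loss $L(\beta) = \|y - X\beta\|^2/(2n)$ as a concrete case (the slides allow general $L$). The isotropic quadratic plus weighted $\ell_1$ subproblem separates across coordinates and is solved by soft-thresholding, and $\phi$ is doubled until the local majorization check passes; function names are ours:

```python
import numpy as np

def soft_threshold(z, w):
    """Coordinatewise solution of min_b 0.5*(b - z)^2 + w*|b|."""
    return np.sign(z) * np.maximum(np.abs(z) - w, 0.0)

def lamm_step(beta, grad_fn, loss_fn, weights, phi=1.0, max_doublings=50):
    """One LAMM update: minimize the isotropic LQA plus weighted l1 penalty,
    doubling phi (line search) until majorization holds at the new point."""
    g = grad_fn(beta)
    for _ in range(max_doublings):
        new = soft_threshold(beta - g / phi, weights / phi)
        # Local majorization check: loss at new point <= isotropic quadratic.
        quad = (loss_fn(beta) + g @ (new - beta)
                + 0.5 * phi * np.sum((new - beta) ** 2))
        if loss_fn(new) <= quad + 1e-12:
            return new, phi
        phi *= 2.0
    return new, phi
```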
22 (Local Adaptive) MM Algorithm. MM Principle: to minimize a general function $f(\beta)$, majorize $f(\beta)$ by $g_k(\beta)$ at the point $\beta^{(k)}$, i.e. $f(\beta^{(k)}) = g_k(\beta^{(k)})$ and $f(\beta) \le g_k(\beta)$, then compute $\beta^{(k+1)} = \arg\min_\beta g_k(\beta)$. [Figure: LQA and LLA as majorizers] Property: the target values decrease: $f(\beta^{(k+1)}) \le g_k(\beta^{(k+1)}) \le g_k(\beta^{(k)}) = f(\beta^{(k)})$; the first inequality is only a local requirement, needed at $\beta^{(k+1)}$ alone.
23 Solution to General Sparse Learning. Idea: iteratively refine the estimator via convex programs: $\hat\beta^{(0)} \to \hat\beta^{(1)} \to \hat\beta^{(2)} \to \cdots \to \hat\beta^{(T)}$, where $\hat\beta^{(l)} = \arg\min_{\beta \in \mathbb{R}^d} \{ L(\beta) + \|\lambda^{(l-1)} \circ \beta\|_1 \}$, $l = 1, \ldots, T$. LLA applied to the penalty: $\lambda^{(l-1)} = (p_\lambda'(|\hat\beta_1^{(l-1)}|), \ldots, p_\lambda'(|\hat\beta_d^{(l-1)}|))^T$. Each step is solved approximately by iterative use of LAMM.
24 Algorithm with Controlled Complexity. Contraction stage: $l = 1$, with optimization error $\varepsilon_c$. Tightening stage: $2 \le l \le T$, with optimization error $\varepsilon_t$. $\varepsilon_c \gg \varepsilon_t$: for Gaussian noise, $\varepsilon_c \asymp \sqrt{n^{-1}\log d}$ and $\varepsilon_t \asymp n^{-1}$. [Diagram: within each stage $l$, LAMM iterates $\hat\beta^{(l,0)} = \hat\beta^{(l-1)} \to \hat\beta^{(l,1)} \to \cdots \to \hat\beta^{(l,k_l)} = \hat\beta^{(l)}$, with $k_1 \asymp \varepsilon_c^{-2}$ and $k_l \asymp \log(1/\varepsilon_t)$ for $l \ge 2$.] $T \asymp \log(\lambda\sqrt{n}) \asymp \log(\log d)$.
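Combining the pieces, a sketch of the two-stage I-LAMM loop under the quadratic-loss assumption, reusing `dscad` and `lamm_step` from the sketches above; fixed iteration budgets stand in for the exact $\varepsilon_c$/$\varepsilon_t$ stopping rules:

```python
import numpy as np

def ilamm(X, y, lam, T=10, k_contract=200, k_tighten=30):
    """Two-stage I-LAMM sketch: one contraction stage (l = 1, crude accuracy
    eps_c), then T-1 tightening stages (accuracy eps_t). Each stage is a
    weighted-l1 program solved by repeated LAMM updates."""
    n, d = X.shape
    loss_fn = lambda b: np.sum((y - X @ b) ** 2) / (2 * n)
    grad_fn = lambda b: X.T @ (X @ b - y) / n

    beta = np.zeros(d)
    weights = dscad(np.abs(beta), lam)   # p'_lam(0) = lam: plain Lasso first
    for l in range(T):
        k = k_contract if l == 0 else k_tighten
        phi = 1.0
        for _ in range(k):
            beta, phi = lamm_step(beta, grad_fn, loss_fn, weights, phi)
        weights = dscad(np.abs(beta), lam)   # LLA: refresh penalty weights
    return beta
```

Because $p_\lambda'(0) = \lambda$ for SCAD, the contraction stage is exactly a Lasso fit, matching the LLA scheme initialized at $\hat\beta^{(0)} = 0$.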
25 Algorithmic Convergence: Phase Transition. Contraction stage: sub-linear convergence $1/\sqrt{k}$; not a strongly convex problem, but it gives a sparse solution. [Figure: computational rate for constant correlation design; log error versus iteration count for the contraction stage ($l = 1$) and tightening stages ($l = 2, 3, 4$)] Tightening stage: linear convergence $\rho^k$; strongly convex.
26 Effects on Statistical Errors. Contraction region: sparse strong convexity and smoothness. [Figure: iterates $\hat\beta^{(0)}, \hat\beta^{(1)}, \ldots, \hat\beta^{(T)}$ contracting toward $\beta^*$ within a radius of order $\sqrt{s/n}$] $\|\hat\beta^{(l)} - \beta^*\|_2 \le \underbrace{C\sqrt{s/n}}_{\text{stat error}} + \delta\,\|\hat\beta^{(l-1)} - \beta^*\|_2$, for a $\delta \in (0,1)$.
27 Theoretical Properties
28 Localized Sparse Eigenvalue. Definition (Localized Sparse Eigenvalue): $\rho_+(m,r) = \sup_{u,\beta}\{ u_J^T \nabla^2 L(\beta)\, u_J : \|u_J\|_2^2 = 1,\ |J| \le m,\ \|\beta - \beta^*\|_2 \le r \}$; $\rho_-(m,r)$ is defined by taking the inf. Assumption 1 (LSE Condition): let $s = \|\beta^*\|_0$. There exists $r \gtrsim \lambda\sqrt{s}$ such that $0 < \rho_* \le \rho_-(6s,r) < \rho_+(6s,r) \le \rho^* < +\infty$. Other assumptions: $\min_{j \in S} |\beta_j^*| \gtrsim \lambda$, and $p_\lambda'(\alpha\lambda) = 0$ for large $\alpha$.
30 Statistical Theory. Theorem (Optimal Statistical Rate). Contracting property: there exists a $\delta \in (0,1)$ such that $\|\hat\beta^{(l)} - \beta^*\|_2 \le C\sqrt{s/n} + \delta\,\|\hat\beta^{(l-1)} - \beta^*\|_2$ for $l \ge 2$. If we take $T \gtrsim \log(\lambda\sqrt{n})$, then $\|\hat\beta^{(T)} - \beta^*\|_2 \lesssim \sqrt{s/n}$. The result is deterministic: $(\|\nabla L(\beta^*)\|_\infty + \varepsilon_c) \lesssim \lambda \lesssim r/\sqrt{s}$. For $l = 1$, $\|\hat\beta^{(1)} - \beta^*\|_2 \lesssim \lambda\sqrt{s} \asymp \sqrt{s\log(d)/n}$, not enough for post-Lasso.
32 Computational Theory. Theorem (Algorithmic Complexity). The total number of I-LAMM iterations needed is $C_1\,\varepsilon_c^{-2} + (T-1)\,C_1'\log(\varepsilon_t^{-1})$; with the choices above, the number of LAMM iterations is $O(n/\log d + \log(\log d)\log n)$. The initial (contraction) estimator is cruder, but dominates the computation time.
34 Conclusion. I-LAMM provides a statistical optimization algorithm for solving nonconvex optimization problems. Examines the effects of iteration and optimization error. Proposes weaker conditions, which require a localized analysis. All approximate solutions of I-LAMM enjoy the optimal statistical convergence rate with controlled algorithmic complexity.
35 Distributed PCA. Dong Wang, Kaizheng Wang, Ziwei Zhu.
36 Principal Component Analysis. Fundamental tool for statistical machine learning: dimension reduction, data summarization, latent factors. Rate of convergence: $\|\hat v_1 - v_1\|_2 = O_P\big(\sqrt{d/(n\lambda)}\big)$, where $d$ is the dimension, $n$ the sample size, and $\lambda$ the eigengap (Paul, 07; Jung et al., 09; Johnstone and Lu, 12; Shen et al., 16; Wang and Fan, 17). What about distributed PCA?
38 Problem Setup. $N$ samples stored on $m$ servers, i.i.d. sub-Gaussian with covariance $\Sigma$ (generalized to $\{\Sigma^{(l)}\}_{l=1}^m$). Spectral decomposition: $\Sigma = V\Lambda V^T$. Interest: the column space of $V_K$. Error: $\rho(\hat V_K, V_K) = \|\hat V_K \hat V_K^T - V_K V_K^T\|_F$.
39 Distributed PCA. Communication: $K$ $d$-vectors from each machine. Aggregation: $\tilde V_K = \arg\min_{U_K:\, d \times K \text{ orthonormal}} \frac{1}{m}\sum_{l=1}^m \rho^2(U_K, \hat V_K^{(l)})$. Spill-over: send more than $K$ eigenvectors.
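This aggregation has a closed form: the minimizer $\tilde V_K$ consists of the top-$K$ eigenvectors of the averaged projection matrix $\frac{1}{m}\sum_l \hat V_K^{(l)}(\hat V_K^{(l)})^T$, since $\|U_K U_K^T\|_F^2 = K$ is constant and the objective reduces to maximizing a trace. A minimal numpy sketch (function names ours):

```python
import numpy as np

def local_pca(X, K):
    """Top-K eigenvectors of the local sample covariance (one machine)."""
    S = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(S)   # eigenvalues in ascending order
    return vecs[:, -K:]              # d x K, only K d-vectors communicated

def aggregate(V_list):
    """Minimizer of (1/m) sum_l rho^2(U, V^(l)): top-K eigenvectors of
    the averaged projection matrix (1/m) sum_l V^(l) V^(l)^T."""
    K = V_list[0].shape[1]
    P = sum(V @ V.T for V in V_list) / len(V_list)
    vals, vecs = np.linalg.eigh(P)
    return vecs[:, -K:]

def rho(U, V):
    """Projection distance ||U U^T - V V^T||_F."""
    return np.linalg.norm(U @ U.T - V @ V.T, 'fro')
```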
40 Statistical Error Analysis Bias and Sample Variance
41 Bias and Variance. $V_K^*$: top-$K$ eigenvectors of $\Sigma^* = E[\hat V_K^{(l)}(\hat V_K^{(l)})^T]$. Then $\rho(\tilde V_K, V_K) \le \underbrace{\rho(\tilde V_K, V_K^*)}_{\text{sample variance term}} + \underbrace{\rho(V_K^*, V_K)}_{\text{bias term}}$. Sample variance term: $\frac{1}{m}\sum_{l=1}^m \hat V_K^{(l)} \hat V_K^{(l)T} \to \Sigma^*$ implies $\rho(\tilde V_K, V_K^*) \to 0$.
42 Analysis of Sample Variance. Theorem 1 (variance). Let $r(\Sigma) = (E\|X\|_2)^2/\|\Sigma\|$ (effective rank). Then $\big\| \|\hat V_K^{(l)} \hat V_K^{(l)T} - \Sigma^*\|_F \big\|_{\psi_1} = O\Big(\frac{\|\Sigma\|\sqrt{K r(\Sigma)}}{(\lambda_K - \lambda_{K+1})\sqrt{n}}\Big)$ and $\|\rho(\tilde V_K, V_K^*)\|_{\psi_1} = O\Big(\frac{\|\Sigma\|\sqrt{K r(\Sigma)}}{(\lambda_K - \lambda_{K+1})\sqrt{mn}}\Big)$. Here $\|X\|_{\psi_1} \lesssim a_n$ implies $X = O_P(a_n)$ with exponential concentration.
43 Analysis of Bias. $\rho(V_K^*, V_K) = \|V_K^*(V_K^*)^T - V_K V_K^T\|_F$. Comparisons: (1) $\hat\Sigma^{(l)} \xrightarrow{\text{PCA}} \hat V_K^{(l)}(\hat V_K^{(l)})^T \xrightarrow{\text{expectation}} \Sigma^* \xrightarrow{\text{PCA}} V_K^*(V_K^*)^T$; (2) $\hat\Sigma^{(l)} \xrightarrow{\text{expectation}} \Sigma \xrightarrow{\text{PCA}} V_K V_K^T$. When are they the same?
45 Unbiasedness. Definition: $Z$ is symmetric if $Z \stackrel{d}{=} I_j Z$ for all $j$, where $I_j$ negates the $j$-th coordinate. $X$ with covariance $\Sigma = V\Lambda V^T$ has symmetric innovation if $V^T X$ is symmetric. Theorem 2 (unbiasedness). If $X$ has symmetric innovation, then $\Sigma^*$ and $\Sigma$ share the same set of eigenvectors. If $\|\rho(\hat V_K^{(l)}, V_K)\|_{\psi_1} \le 1/4$, then $\rho(V_K^*, V_K) = 0$.
46 Analysis of Bias: General Case. Use Taylor expansion and perturbation: if $\hat V_K^{(l)}$ and $V_K$ are the top-$K$ eigenvectors of $\hat\Sigma^{(l)}$ and $\Sigma$, then $\hat V_K^{(l)}(\hat V_K^{(l)})^T = V_K V_K^T + \text{linear in } (\hat\Sigma^{(l)} - \Sigma) + O_P(\|\hat\Sigma^{(l)} - \Sigma\|^2)$, so $\Sigma^* = V_K V_K^T + O(\|\hat\Sigma^{(l)} - \Sigma\|^2)$. Theorem 3 (general case bias): $\rho(V_K^*, V_K) = O\Big(\frac{\|\Sigma\|^2 K r(\Sigma)}{n(\lambda_K - \lambda_{K+1})^2}\Big)$.
47 Statistical Error Analysis. Theorem 4 (MSE). Assume that $n \gtrsim K r(\Sigma)\big(\|\Sigma\|/(\lambda_K - \lambda_{K+1})\big)^2$. (1) Symmetric dist.: $\|\rho(\tilde V_K, V_K)\|_{\psi_1} = O\Big(\frac{\|\Sigma\|\sqrt{K r(\Sigma)}}{(\lambda_K - \lambda_{K+1})\sqrt{N}}\Big)$. (2) General dist.: for universal constants $C_1$ and $C_2$, $\|\rho(\tilde V_K, V_K)\|_{\psi_1} \le \underbrace{C_1 \frac{\|\Sigma\|\sqrt{K r(\Sigma)}}{(\lambda_K - \lambda_{K+1})\sqrt{N}}}_{\text{sample variance}} + \underbrace{C_2 \frac{\|\Sigma\|^2 K r(\Sigma)}{n(\lambda_K - \lambda_{K+1})^2}}_{\text{bias}}$. If $m \le C_3/\mathrm{Bias}$, the bias is negligible. Same rate as PCA on the whole data. Extends to heterogeneous data. The required $n$ is sharp.
48 Simulation Study
49 Simulation Setting. $\{x_i \in \mathbb{R}^d\}_{i=1}^{mn}$ i.i.d. $\sim N(0, \Sigma)$, with $\Sigma = \mathrm{diag}(\lambda, \frac{\lambda}{2}, \frac{\lambda}{4}, 1, \ldots, 1)$, $K = 3$, $V_K = (e_1, e_2, e_3)$, and eigengap $\delta = \frac{\lambda}{4} - 1$. When $\lambda = O(d)$, we have $\|\rho(\tilde V_K, V_K)\|_{\psi_1} = O\big(\sqrt{d/(mn\delta)}\big)$. Errors are computed over 100 simulations.
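A sketch of this simulation, reusing `local_pca`, `aggregate`, and `rho` from the earlier sketch; the specific sizes and the 20 replications below are our choices for a quick run (the slides use 100 replications over a grid of $d$, $m$, $n$):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, lam, m, n = 100, 3, 30.0, 10, 500
# Sigma = diag(lam, lam/2, lam/4, 1, ..., 1); eigengap delta = lam/4 - 1
Sigma_diag = np.concatenate([[lam, lam / 2, lam / 4], np.ones(d - 3)])
V_true = np.eye(d)[:, :K]                       # (e1, e2, e3)

errors = []
for _ in range(20):
    # m machines with n samples each, x ~ N(0, diag(Sigma_diag))
    machines = [rng.normal(size=(n, d)) * np.sqrt(Sigma_diag)
                for _ in range(m)]
    V_tilde = aggregate([local_pca(X, K) for X in machines])
    errors.append(rho(V_tilde, V_true))         # invariant to sign/order
print(np.mean(errors))                          # ~ sqrt(d / (m * n * delta))
```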
50 Statistical Error Rate: $\rho(\tilde V_K, V_K)$ vs. $m$. [Figure: log(error) versus log(m) for $n = 500, 1000, 2000, 4000$ and $d = 100, 200, 400, 800, \ldots$] $\rho(\tilde V_K, V_K) \propto m^{-1/2}$; also verified: $\rho(\tilde V_K, V_K) \propto n^{-1/2}, d^{1/2}, \delta^{-1/2}$.
51 Statistical Error Rate: $\rho(\tilde V_K, V_K)$ vs. $\{d, m, n, \delta\}$. Multiple regression: $\log(\rho(\tilde V_K, V_K)) = \beta_0 + \beta_1\log(d) + \beta_2\log(m) + \beta_3\log(n) + \beta_4\log(\delta) + \varepsilon$. [Table: estimated coefficients $\hat\beta_1$ through $\hat\beta_4$ and multiple $R^2$]
52 Statistical Error Rate: $\rho(\tilde V_K, V_K)$ vs. $\{d, m, n, \delta\}$. [Figure: observed and fitted values of $\log(\rho(\tilde V_K, V_K))$]
53 Comparison of Three Approaches 1 DP: Distributed PCA 2 FP: Full sample PCA 3 DP5: Distributed PCA with 5 extra eigenvectors Purpose: Examine spill-over effect
54 Comparison of Three Approaches: $\lambda = 30$. [Figure: log(error) versus log(n) for DP, FP, and DP5 at $m = 5, 10, 20, 50$, comparing the three approaches when $\lambda = 30$] Little spill-over!
55 Conclusion. Distributed PCA is unbiased when the innovation is symmetric. In general, the bias is controllable when $m$ is not too large. Distributed PCA has an aggregated variance effect: it enjoys the same performance as PCA on the full data.
56 The End. Thank You!