Nearest Neighbor based Coordinate Descent
1 Nearest Neighbor based Coordinate Descent. Pradeep Ravikumar, UT Austin. Joint work with Inderjit Dhillon and Ambuj Tewari. Simons Workshop on Information Theory, Learning and Big Data, 2015.
2 Modern Big Data. Across modern applications (fMRI images, gene expression profiles, social networks) there are very many variables in the system, but not enough observations. Curse of dimensionality: to train the system or model, the number of observations has to be much larger than the number of variables in the system (scaling exponentially in non-parametric models, polynomially in parametric models).
3 Modern Big Data. Research over the last decade and a half: finesse the curse of dimensionality when there is some intrinsic low-dimensional structure, such as (group) sparsity, low rank, etc.
4 Modern Big Data. Research over the last decade and a half: finesse the curse of dimensionality when there is some intrinsic low-dimensional structure, such as (group) sparsity, low rank, etc. See Negahban, Ravikumar, Wainwright, Yu, 2012 and Chandrasekaran, Recht, Willsky, 2012 for a general linear-algebraic notion of structure.
5 Modern Big Data. Research over the last decade and a half: finesse the curse of dimensionality when there is some intrinsic low-dimensional structure, such as (group) sparsity, low rank, etc. See Negahban, Ravikumar, Wainwright, Yu, 2012 and Chandrasekaran, Recht, Willsky, 2012 for a general linear-algebraic notion of structure. Under such structure, we know how to obtain estimators whose statistical or sample complexity depends only weakly on the problem dimension p, typically scaling as log(p).
6 Modern Big Data. Research over the last decade and a half: finesse the curse of dimensionality when there is some intrinsic low-dimensional structure, such as (group) sparsity, low rank, etc. See Negahban, Ravikumar, Wainwright, Yu, 2012 and Chandrasekaran, Recht, Willsky, 2012 for a general linear-algebraic notion of structure. Under such structure, we know how to obtain estimators whose statistical or sample complexity depends only weakly on the problem dimension p, typically scaling as log(p). Can we achieve similar weak dependence on p in computational complexity?
7 Convex Optimization. Optimization problem: $\min_{w \in \mathbb{R}^p} L(w)$. Loss $L$ is convex and smooth: $\|\nabla L(w) - \nabla L(v)\|_\infty \le \kappa_1 \|w - v\|_1$. Sparse minimizer $w^*$: $\|w^*\|_0 = s$, $\|w^*\|_\infty \le B$.
8 Coordinate Descent. Optimization problem: $\min_{w \in \mathbb{R}^p} L(w)$.
Algorithm 1 Cyclic Coordinate Descent
Initialize: set the initial value $w^0$.
for t = 1, 2, ... do
  $j = t \bmod p$
  $w_j^t = \arg\min_{\alpha} L(w^{t-1} + \alpha e_j)$
  $w_l^t = w_l^{t-1}$ for $l \ne j$
end for
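As a concrete illustration of the per-step work, here is a minimal Python sketch of the cyclic update for the squared loss $L(w) = \frac{1}{2}\|y - Xw\|_2^2$, where the one-dimensional minimization has a closed form. The function and variable names are illustrative, not from the talk.

import numpy as np

def cyclic_cd(X, y, T):
    # Cyclic coordinate descent for the squared loss L(w) = 0.5*||y - Xw||^2.
    # Each step exactly minimizes L along one coordinate; the others are frozen.
    n, p = X.shape
    w = np.zeros(p)
    r = y - X @ w                      # residual y - Xw, kept in sync with w
    col_sq = (X ** 2).sum(axis=0)      # precomputed ||x_j||^2
    for t in range(T):
        j = t % p                      # cycle through the coordinates
        alpha = X[:, j] @ r / col_sq[j]   # argmin_alpha L(w + alpha*e_j)
        w[j] += alpha
        r -= alpha * X[:, j]
    return w

Each iteration touches only a single column of X, which is what keeps the per-step cost small.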
9 Coordinate Descent (CD). Optimize only a single coordinate per step.
10 Coordinate Descent (CD). Optimize only a single coordinate per step. Small computation per step; well suited for high-dimensional problems.
11 Coordinate Descent (CD). Optimize only a single coordinate per step. Small computation per step; well suited for high-dimensional problems. Recently shown to enjoy good empirical performance.
12 Coordinate Descent (CD). Optimize only a single coordinate per step. Small computation per step; well suited for high-dimensional problems. Recently shown to enjoy good empirical performance. But the computational complexity depends at least linearly (or worse) on p!
13 Coordinate Descent (CD). Optimize only a single coordinate per step. Small computation per step; well suited for high-dimensional problems. Recently shown to enjoy good empirical performance. But the computational complexity depends at least linearly (or worse) on p! Suppose the optimal solution is sparse (very few coordinates are non-zero).
14 Coordinate Descent (CD). Optimize only a single coordinate per step. Small computation per step; well suited for high-dimensional problems. Recently shown to enjoy good empirical performance. But the computational complexity depends at least linearly (or worse) on p! Suppose the optimal solution is sparse (very few coordinates are non-zero). If CD judiciously chooses the coordinate to optimize at each step, can it be expected to leverage the potential sparsity of the optimum?
15 Greedy Coordinate Descent (GCD). Optimization problem: $\min_{w \in \mathbb{R}^p} L(w)$.
Algorithm 3 Greedy Coordinate Gradient Descent
Initialize: set the initial value $w^0$.
for t = 1, 2, ... do
  $j = \arg\max_{l} |\nabla_l L(w^{t-1})|$
  $w^t = w^{t-1} - \frac{1}{\kappa_1} \nabla_j L(w^{t-1})\, e_j$
end for
16 Greedy Coordinate Descent: Analysis. Loss $L$ is convex and smooth: $\|\nabla L(w) - \nabla L(v)\|_\infty \le \kappa_1 \|w - v\|_1$. Sparse minimizer $w^*$: $\|w^*\|_0 = s$, $\|w^*\|_\infty \le B$.
Initialize: $w^0 = 0$.
for t = 1, 2, ... do
  $j = \arg\max_{l} |\nabla_l L(w^{t-1})|$
  $w^t = w^{t-1} - \frac{1}{\kappa_1} \nabla_j L(w^{t-1})\, e_j$
end for
Greedy Coordinate Descent guarantee: $L(w^t) - L(w^*) \le \frac{\kappa_1 \|w^0 - w^*\|_1^2}{2t} \le \frac{\kappa_1 s^2 B^2}{2t}$, since $\|w^*\|_1 \le sB$ when $w^0 = 0$.
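A minimal sketch of the greedy variant, with the greedy coordinate found by a naive O(p) scan over the gradient (exactly the bottleneck the next slides address). The squared-loss usage at the end and the choice $\kappa_1 = 1$ for unit-norm columns are illustrative assumptions, not from the talk.

import numpy as np

def greedy_cd(grad, kappa1, p, T):
    # Greedy coordinate gradient descent: pick the coordinate with the
    # largest absolute gradient, then take a (1/kappa1)-scaled gradient step.
    w = np.zeros(p)
    for _ in range(T):
        g = grad(w)
        j = int(np.argmax(np.abs(g)))   # greedy step: naive O(p) scan
        w[j] -= g[j] / kappa1
    return w

# Illustrative usage with L(w) = 0.5*||y - Xw||^2, grad L(w) = X^T (Xw - y):
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 500)); X /= np.linalg.norm(X, axis=0)
w_true = np.zeros(500); w_true[:5] = 1.0
y = X @ w_true
w_hat = greedy_cd(lambda w: X.T @ (X @ w - y), kappa1=1.0, p=500, T=200)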
17 Greedy CD. PRO: the number of iterations avoids a costly dependence on the dimension p. CON: each GCD iteration (naively implemented) takes Ω(p) time.
18 Greedy CD. PRO: the number of iterations avoids a costly dependence on the dimension p. CON: each GCD iteration (naively implemented) takes Ω(p) time. Solution: perform approximate greedy steps via a reduction to Approximate Nearest Neighbor (ANN) search.
19 Greedy CD. PRO: the number of iterations avoids a costly dependence on the dimension p. CON: each GCD iteration (naively implemented) takes Ω(p) time. Solution: perform approximate greedy steps via a reduction to Approximate Nearest Neighbor (ANN) search. This allows us to use recent advances in sublinear-time ANN search, e.g. locality sensitive hashing (LSH).
20 Fast Greedy and Nearest Neighbor. Common objective in statistical learning: $L(w) = \sum_{i=1}^{n} \ell(w^T x_i, y_i)$.
21 Fast Greedy and Nearest Neighbor. Common objective in statistical learning: $L(w) = \sum_{i=1}^{n} \ell(w^T x_i, y_i)$. Here $\nabla_j L(w) = \langle x_j, r(w) \rangle$ is an inner product between feature $j$ (the $j$-th column of the design matrix) and the residual $r(w) = (\ell'(w^T x_i, y_i))_{i=1}^{n}$.
22 Fast Greedy and Nearest Neighbor. Common objective in statistical learning: $L(w) = \sum_{i=1}^{n} \ell(w^T x_i, y_i)$. Here $\nabla_j L(w) = \langle x_j, r(w) \rangle$ is an inner product between feature $j$ and the residual $r(w) = (\ell'(w^T x_i, y_i))_{i=1}^{n}$. The greedy step needs to compute (assuming $\|x_j\|_2 = 1$): $\arg\max_{j \in [p]} |\langle x_j, r(w^t) \rangle| \;\equiv\; \arg\min_{j \in [2p]} \|x_j - r(w^t)\|_2^2$, where the $2p$ candidates are $\{\pm x_1, \ldots, \pm x_p\}$ so that the sign ambiguity from the absolute value is covered.
23 Fast Greedy and Nearest Neighbor. Common objective in statistical learning: $L(w) = \sum_{i=1}^{n} \ell(w^T x_i, y_i)$. Here $\nabla_j L(w) = \langle x_j, r(w) \rangle$ is an inner product between feature $j$ and the residual $r(w) = (\ell'(w^T x_i, y_i))_{i=1}^{n}$. The greedy step needs to compute (assuming $\|x_j\|_2 = 1$): $\arg\max_{j \in [p]} |\langle x_j, r(w^t) \rangle| \;\equiv\; \arg\min_{j \in [2p]} \|x_j - r(w^t)\|_2^2$ over the candidates $\{\pm x_j\}$. Leverage the state of the art in NN search to do this in o(p) time.
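The equivalence is easy to check: with $\|x_j\|_2 = 1$, $\|x_j - r\|_2^2 = 1 + \|r\|_2^2 - 2\langle x_j, r\rangle$, so the nearest point to $r$ among $\{\pm x_j\}$ maximizes $|\langle x_j, r\rangle|$. A brute-force Python sketch of this reduction (illustrative names; in practice the linear scan is replaced by an ANN index over the $2p$ points):

import numpy as np

def greedy_coordinate_via_nn(X, r):
    # Exact version of the reduction: the greedy coordinate is recovered
    # from the nearest neighbor of r among the 2p points {+x_j, -x_j}.
    candidates = np.hstack([X, -X])              # n x 2p candidate points
    d2 = ((candidates - r[:, None]) ** 2).sum(axis=0)
    j2 = int(np.argmin(d2))                      # nearest of the 2p points
    return j2 % X.shape[1]                       # recover coordinate index j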
24 Approximate Greedy CD: Analysis. If the greedy step has multiplicative approximation factor $(1 + \epsilon_{nn})$, then either $L(w^t) - L(w^*) \le \epsilon + \epsilon_{nn}(1 + \epsilon_{nn})\, \|r(w^t)\|^2$, or $L(w^t) - L(w^*) \le \frac{1 + \epsilon_{nn}}{\epsilon_{nn}} \cdot \frac{\big((1/\epsilon) + 1\big)\, \kappa_1 \|w^0 - w^*\|_1^2}{t}$. In summary, the convergence rate is $\frac{K \kappa_1 s^2}{t}$ for a constant $K$ depending on $\epsilon_{nn}$.
25 Fast Greedy: Computational Complexity. If each greedy step costs $C_t(n, p, \epsilon_{nn})$, the overall cost $C_G$ to accuracy $\epsilon$ is: $C_G = C_t(n, p, \epsilon_{nn}) \cdot \frac{K \kappa_1 s^2}{\epsilon}$. The preprocessing time $C(n, p, \epsilon_{nn})$ can be amortized.
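For example, plugging in LSH's per-step cost $C_t = O(n p^{\rho})$ from the next slide gives $C_G = O\big(n p^{\rho} \cdot K \kappa_1 s^2 / \epsilon\big)$ with $\rho < 1$, i.e. a sublinear dependence on $p$, compared with the $O(np)$ cost of even a single full gradient computation.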
26 Fast Greedy: Computational Complexity. Locality Sensitive Hashing: uses random projections to hash data points such that distant points are unlikely to collide. With $\rho = 1/(1 + \epsilon_{nn}) < 1$, it gives: $C_t = O(n p^{\rho})$, $C = O\!\left(\frac{n p^{1+\rho}}{\epsilon_{nn}^2}\right)$.
27 Fast Greedy: Computational Complexity. Locality Sensitive Hashing: uses random projections to hash data points such that distant points are unlikely to collide. With $\rho = 1/(1 + \epsilon_{nn}) < 1$, it gives: $C_t = O(n p^{\rho})$, $C = O\!\left(\frac{n p^{1+\rho}}{\epsilon_{nn}^2}\right)$. Ailon & Chazelle (2006)'s method: uses multiple lookup tables after random projections. It gives: $C_t = O\!\left(n \log n + \epsilon_{nn}^{-3} \log^2 p\right)$, $C = O\!\left(\frac{p}{\epsilon_{nn}^2}\right)$.
28 Fast Greedy: Computational Complexity. Locality Sensitive Hashing: uses random projections to hash data points such that distant points are unlikely to collide. With $\rho = 1/(1 + \epsilon_{nn}) < 1$, it gives: $C_t = O(n p^{\rho})$, $C = O\!\left(\frac{n p^{1+\rho}}{\epsilon_{nn}^2}\right)$. Ailon & Chazelle (2006)'s method: uses multiple lookup tables after random projections. It gives: $C_t = O\!\left(n \log n + \epsilon_{nn}^{-3} \log^2 p\right)$, $C = O\!\left(\frac{p}{\epsilon_{nn}^2}\right)$. Quad trees + random projections: under mutual incoherence, using a simple quad tree with random Gaussian projections, we obtain: $C_t = O\!\left(\frac{\log p}{\epsilon_{nn}^2}\right)$, $C = O\!\left(\frac{n p \log p}{\epsilon_{nn}^2}\right)$. Mutual incoherence ($\mu = \max_{i \ne j} |\langle x_i, x_j \rangle| < 1$) plays an important role in the statistical complexity of sparse parameter recovery; here it is also related to computational complexity.
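A minimal sketch of the LSH idea for this step, using random-hyperplane (sign) hashes; the number of tables, number of bits, and the fallback scan are illustrative engineering choices, not the talk's exact scheme. The index would be built once over the $2p$ points $\{\pm x_j\}$ and queried with $r(w^t)$ at each iteration.

import numpy as np

def build_lsh_tables(points, n_tables=8, n_bits=12, seed=0):
    # Index the columns of `points` with random-hyperplane LSH: nearby
    # points share sign patterns under random projections with high prob.
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    tables = []
    for _ in range(n_tables):
        H = rng.standard_normal((n_bits, n))     # random hyperplanes
        keys = (H @ points > 0)                  # n_bits x m sign patterns
        table = {}
        for idx in range(points.shape[1]):
            table.setdefault(keys[:, idx].tobytes(), []).append(idx)
        tables.append((H, table))
    return tables

def approx_nearest(tables, points, q):
    # Collect candidates colliding with q in any table, then rank exactly.
    cand = set()
    for H, table in tables:
        cand.update(table.get((H @ q > 0).tobytes(), []))
    cand = list(cand) if cand else list(range(points.shape[1]))
    d2 = ((points[:, cand] - q[:, None]) ** 2).sum(axis=0)
    return cand[int(np.argmin(d2))]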
29 Non-smooth Objectives. Smooth plus separable composite objective (think of $R = \lambda \|\cdot\|_1$): $\min_{w \in \mathbb{R}^p} L(w) + R(w)$. Separable regularizer: $R(w) = \sum_j R_j(w_j)$.
30 Non-smooth Objectives. Smooth plus separable composite objective (think of $R = \lambda \|\cdot\|_1$): $\min_{w \in \mathbb{R}^p} L(w) + R(w)$. Separable regularizer: $R(w) = \sum_j R_j(w_j)$. If we updated coordinate $j$: $w_j^{t+1} = \arg\min_{w} \; g_j^t (w - w_j^t) + \frac{\kappa_1}{2} (w - w_j^t)^2 + R_j(w)$, where $g_j^t = \nabla_j L(w^t)$. The guaranteed descent in the objective is $\frac{\kappa_1}{2} (\eta_j^t)^2$, where $\eta_j^t = w_j^{t+1} - w_j^t$.
31 Modified Greedy for Non-smooth Objectives. Modified greedy algorithm (chooses the $j$ with maximum guaranteed descent):
Initialize: $w^0 = 0$.
for t = 1, 2, ... do
  $j_t \in \arg\max_{j \in [p]} |\eta_j^t|$
  $w^{t+1} = w^t + \eta_{j_t}^t e_{j_t}$
end for
Guarantee: $L(w^t) + R(w^t) - L(w^*) - R(w^*) \le \frac{\kappa_1 \|w^0 - w^*\|_1^2}{2t}$.
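For $R_j(w) = \lambda |w|$, the coordinate subproblem has a closed form via soft thresholding, so all the $\eta_j^t$ can be computed from one gradient. A minimal sketch under that assumption (illustrative names; grad is a gradient oracle for the smooth part L):

import numpy as np

def soft_threshold(z, tau):
    # prox of tau*|.|: argmin_w 0.5*(w - z)^2 + tau*|w|
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def modified_greedy_l1(grad, kappa1, lam, p, T):
    # Modified greedy CD for min L(w) + lam*||w||_1: update the coordinate
    # whose model minimizer moves farthest (maximum guaranteed descent).
    w = np.zeros(p)
    for _ in range(T):
        g = grad(w)
        # Closed-form minimizer shift eta_j of the quadratic upper model
        # g_j*(u - w_j) + (kappa1/2)*(u - w_j)^2 + lam*|u|, for every j:
        eta = soft_threshold(w - g / kappa1, lam / kappa1) - w
        j = int(np.argmax(np.abs(eta)))
        w[j] += eta[j]
    return w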
32 Experiments. Three algorithms (cyclic CD, greedy CD, greedy CD+LSH); two loss functions (logistic, squared); $\ell_1$ regularization. Data: $X$ standard Gaussian with normalized columns, $Y = X w_{tr}$ with $w_{tr}$ 100-sparse, and $n = 400 \log(p)$. [Figure: objective vs. CPU time (in seconds) for the three algorithms.]
33 Experiments: Logistic Loss. [Figure: objective vs. CPU time (in seconds) for Cyclic, Greedy.LSH, and Greedy; three panels: n=3684, p=10000, k=100; n=4605, p=100000, k=100; n=5526, p=1000000, k=100; all with logistic loss.]
34 Experiments: Squared Loss. [Figure: objective vs. CPU time (in seconds) for Cyclic, Greedy.LSH, and Greedy; three panels: n=3684, p=10000, k=100; n=4605, p=100000, k=100; n=5526, p=1000000, k=100; all with squared loss.]
35 Summary. An optimization method with sub-linear dependence on p! New connections between computational geometry and first-order optimization. Interplay between statistical and computational efficiency: under mutual incoherence, a very simple data structure works for ANN. A new greedy algorithm for composite objectives.