Divide-and-combine Strategies in Statistical Modeling for Massive Data
Liqun Yu, Washington University in St. Louis
March 30, 2017
Introduction

Many statistical problems can be formulated in the following form:

$$\min_{\theta \in \mathbb{R}^p} \; L(\{x_i, y_i\}_{i=1}^{n}, \theta) + \lambda P(\theta), \qquad (1)$$

where $L(\{x_i, y_i\}_{i=1}^{n}, \theta)$ is a loss function, a negative log-likelihood, or some other criterion function (e.g., for an M-estimator), and $P(\theta)$ is a regularization term on $\theta$.

Examples
- Linear regression: $\hat\beta = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$ (or other penalties)
- Logistic regression: $\hat\beta = \arg\min_{\beta} \underbrace{\sum_{i=1}^{n} \log\left(1 + e^{x_i^T \beta}\right) - \sum_{i=1}^{n} y_i x_i^T \beta}_{\text{negative log-likelihood}} + \lambda \|\beta\|_1$
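As a concrete instance of (1), here is a minimal NumPy sketch of the two example objectives; the function names are illustrative, not from the slides:

```python
import numpy as np

def lasso_objective(beta, X, y, lam):
    """||y - X beta||_2^2 + lam * ||beta||_1, the penalized least-squares form of (1)."""
    resid = y - X @ beta
    return resid @ resid + lam * np.abs(beta).sum()

def penalized_logistic_objective(beta, X, y, lam):
    """Logistic negative log-likelihood plus an l1 penalty, with labels y in {0, 1}."""
    z = X @ beta
    # logaddexp(0, z) = log(1 + exp(z)), computed in a numerically stable way
    return np.sum(np.logaddexp(0.0, z) - y * z) + lam * np.abs(beta).sum()
```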
When is Divide-and-Combine Needed?

Notation:

$$X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_K \end{pmatrix} \in \mathbb{R}^{n \times p} \quad \text{and} \quad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_K \end{pmatrix} \in \mathbb{R}^{n},$$

where each $(X_k, y_k)$ with $X_k \in \mathbb{R}^{n_k \times p}$, $y_k \in \mathbb{R}^{n_k}$ is a subset of the data $(X, y)$.

Two scenarios:
1. The subsets $(X_k, y_k)$ are collected and stored separately at different locations (e.g., by different organizations), and transferring data is prohibitive due to communication cost or security/privacy reasons.
2. The data $(X, y)$ are too big to be stored or processed on a single computer, e.g., petabytes of data that cannot fit on a single computer/server.
A Trivial Example: Simple Linear Regression

Simple linear regression is embarrassingly parallel:

$$\hat\beta = (X^T X)^{-1} X^T y = \left(\sum_{k=1}^{K} X_k^T X_k\right)^{-1} \left(\sum_{k=1}^{K} X_k^T y_k\right).$$

Compute the $X_k^T X_k$'s and $X_k^T y_k$'s in parallel, then aggregate; a sketch follows below.

Simple linear regression has a closed-form solution that happens to be computable in parallel. But what about more general cases? For example,

$$\hat\beta_{\text{LASSO}} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1.$$

There is no closed-form solution, and it is not straightforward to compute $\hat\beta_{\text{LASSO}}$ in parallel.
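A minimal sketch of this aggregation in NumPy, with the subset statistics computed by a worker pool; the function names and the multiprocessing setup are illustrative assumptions:

```python
import numpy as np
from multiprocessing import Pool

def local_stats(block):
    """Return the local sufficient statistics (X_k^T X_k, X_k^T y_k)."""
    X_k, y_k = block
    return X_k.T @ X_k, X_k.T @ y_k

def distributed_ols(blocks):
    """Sum the per-subset statistics, then solve the normal equations once."""
    with Pool() as pool:
        stats = pool.map(local_stats, blocks)
    XtX = sum(s[0] for s in stats)
    Xty = sum(s[1] for s in stats)
    return np.linalg.solve(XtX, Xty)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 5))
    y = X @ np.arange(5.0) + rng.normal(size=10_000)
    blocks = list(zip(np.array_split(X, 4), np.array_split(y, 4)))
    print(distributed_ols(blocks))  # close to [0, 1, 2, 3, 4]
```

Note that only the $p \times p$ and $p \times 1$ statistics cross machine boundaries, so the communication cost does not depend on $n$.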
Two Approaches (not including subsampling)

There are generally two approaches for solving (1) in parallel.

1. Distributed numerical optimization algorithms that solve (1) in parallel:
   - Distributed coordinate descent, for example [1].
   - Lagrangian primal-dual algorithms, including ADMM [2] and CoCoA [3].
   - CSL/EDSL [4, 5]: quadratic approximation with a local Hessian.
2. Divide-and-combine statistical aggregation: aggregate the subset results.

   $\min_\theta L(X_1, y_1; \theta) \,(+\,\lambda P(\theta)) \;\rightarrow\; \hat\theta_1$
   $\min_\theta L(X_2, y_2; \theta) \,(+\,\lambda P(\theta)) \;\rightarrow\; \hat\theta_2$
   ...
   $\min_\theta L(X_K, y_K; \theta) \,(+\,\lambda P(\theta)) \;\rightarrow\; \hat\theta_K$
   Aggregate (how?) $\rightarrow \hat\theta_{\text{GLOBAL}}$
Part I: Distributed Optimization Algorithms
Focus on ADMM (CSL/EDSL if time permits).
ADMM

The ADMM solves the following problem:

$$\min_{x,z} \; \{f(x) + g(z)\} \quad \text{s.t.} \quad Ax + Bz = c, \qquad (2)$$

where $x$ and $z$ are the parameters of interest and $A$, $B$, $c$ are constants. Many statistical problems can be formulated in this form, e.g., linear regression with regularization:

$$\min_{\beta} \|y - X\beta\|_2^2 + \lambda\|\beta\|_1 \quad \Longleftrightarrow \quad \min_{\beta, r} \|r\|_2^2 + \lambda\|\beta\|_1 \quad \text{s.t.} \quad r = y - X\beta. \qquad (3)$$

The ADMM solves (2) by iteratively minimizing its augmented Lagrangian

$$L_\rho(x, z, u) := f(x) + g(z) + u^T(Ax + Bz - c) + \frac{\rho}{2}\|Ax + Bz - c\|_2^2$$

in the primal variables $x$ and $z$ and updating the dual variable $u$ via dual ascent, where $\rho$ is a tunable augmentation parameter. Specifically, writing the iterations in scaled form (with scaled dual variable $u$), the ADMM carries out the following updates at iteration $t$:

$$\begin{aligned}
x^{t+1} &:= \arg\min_x \; f(x) + \tfrac{\rho}{2}\|Ax + Bz^t - c + u^t\|_2^2, \\
z^{t+1} &:= \arg\min_z \; g(z) + \tfrac{\rho}{2}\|Ax^{t+1} + Bz - c + u^t\|_2^2, \\
u^{t+1} &:= u^t + (Ax^{t+1} + Bz^{t+1} - c).
\end{aligned} \qquad (4)$$
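A minimal sketch of the updates (4) applied to the lasso. Instead of the splitting in (3), it uses the equally common splitting $f(\beta) = \frac{1}{2}\|y - X\beta\|_2^2$, $g(z) = \lambda\|z\|_1$ with constraint $\beta = z$, so that both subproblems are closed-form (a ridge-type solve and a soft-threshold). This is a sketch under those assumptions, not the slides' exact formulation:

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_admm(X, y, lam, rho=1.0, n_iter=200):
    """ADMM for min_b 0.5 * ||y - X b||^2 + lam * ||b||_1 via the split b = z."""
    n, p = X.shape
    beta = z = u = np.zeros(p)
    # Form (X^T X + rho I) once; every beta-update is then a cheap linear solve.
    XtX_rho = X.T @ X + rho * np.eye(p)
    Xty = X.T @ y
    for _ in range(n_iter):
        beta = np.linalg.solve(XtX_rho, Xty + rho * (z - u))  # x-update (ridge solve)
        z = soft_threshold(beta + u, lam / rho)               # z-update (prox of l1)
        u = u + beta - z                                      # scaled dual ascent
    return z
```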
Parallelize the ADMM

With $X$ and $y$ partitioned into $(X_k, y_k)$, $k = 1, \dots, K$, as before, the problem

$$\min_{\theta \in \mathbb{R}^p} L(X, y; \theta) + \lambda P(\theta) = \min_{\theta} \sum_{k=1}^{K} L(X_k, y_k; \theta) + \lambda P(\theta)$$

is equivalent to the (generic) consensus formulation

$$\min_{\theta_1, \dots, \theta_K, \theta} \; \sum_{k=1}^{K} L(X_k, y_k; \theta_k) + \lambda P(\theta), \quad \text{s.t.} \quad \theta_k = \theta, \;\; \forall k.$$

Applying ADMM:

$$\begin{aligned}
\theta_k^{t+1} &:= \arg\min_{\theta_k} \; L(X_k, y_k; \theta_k) + \tfrac{\rho}{2}\|\theta_k - \theta^t + u_k^t\|_2^2, \quad \text{(typically easy to solve, in parallel)} \\
\theta^{t+1} &:= \arg\min_{\theta} \; \lambda P(\theta) + \tfrac{K\rho}{2}\|\theta - \bar\theta^{t+1} - \bar u^t\|_2^2, \quad \text{(closed-form solution)} \\
u_k^{t+1} &:= u_k^t + (\theta_k^{t+1} - \theta^{t+1}),
\end{aligned}$$

where $\bar\theta^{t+1} = \frac{1}{K}\sum_{k=1}^{K} \theta_k^{t+1}$ and $\bar u^t = \frac{1}{K}\sum_{k=1}^{K} u_k^t$. There can be other formulations that result in easier subproblems, depending on the specific form of the problem, e.g., (3).
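A minimal sketch of these consensus updates for squared-error loss with an $\ell_1$ penalty, under the assumption that every subset fits in memory; in practice the $K$ local solves would run on separate machines, here they are a simple loop:

```python
import numpy as np

def consensus_admm_lasso(blocks, lam, rho=1.0, n_iter=100):
    """Consensus ADMM: K local ridge-type solves plus a global soft-threshold.

    blocks: list of (X_k, y_k); solves min_b sum_k 0.5*||y_k - X_k b||^2 + lam*||b||_1.
    """
    p = blocks[0][0].shape[1]
    K = len(blocks)
    thetas = np.zeros((K, p))
    us = np.zeros((K, p))
    theta = np.zeros(p)
    # Pre-form each local system (X_k^T X_k + rho I, X_k^T y_k).
    facs = [(Xk.T @ Xk + rho * np.eye(p), Xk.T @ yk) for Xk, yk in blocks]
    for _ in range(n_iter):
        for k, (A, b) in enumerate(facs):  # embarrassingly parallel in practice
            thetas[k] = np.linalg.solve(A, b + rho * (theta - us[k]))
        v = thetas.mean(axis=0) + us.mean(axis=0)
        # Global update: prox of (lam / (K * rho)) * ||.||_1 at v
        theta = np.sign(v) * np.maximum(np.abs(v) - lam / (K * rho), 0.0)
        us += thetas - theta
    return theta
```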
ADMM: An Application

The model:

$$Y = X_6 + X_{12} + X_{15} + \cdots + X_1\,\epsilon, \qquad (5)$$

where $\epsilon \overset{\text{i.i.d.}}{\sim} N(0, 1)$.

[Figure: two panels showing estimation accuracy ($\ell_1$ error) and time (seconds) versus iteration, for M = 1, 10, 100.]

Figure: ADMM applied to non-convex (SCAD) penalized quantile regression for (5) with $\tau = 0.3$. Sample size $n = 30{,}000$, dimension $p = 100$; $M$ is the number of subsets.
ADMM

Pros:
- General purpose, minimal assumptions. No approximation and no further statistical analysis needed.
- Very flexible parallelization; the convergence rate is insensitive to the number of partitions $K$.
- Example: split along both $n$ and $p$, useful when both $n$ and $p$ are large [8]:

$$X = \begin{pmatrix} X_{11} & X_{12} & \dots & X_{1N} \\ X_{21} & X_{22} & \dots & X_{2N} \\ \vdots & \vdots & & \vdots \\ X_{M1} & X_{M2} & \dots & X_{MN} \end{pmatrix}$$

Cons:
- Iterative; convergence is slow and communication is expensive.
CSL/EDSL

Fast convergence, but stronger assumptions on the problems it solves [4, 5]. A sketch of one CSL-style round follows below.

[Figure: convergence rate of CSL-type divide-and-conquer quantile regression; estimation error versus iteration for K = 5, 10, 20 and the global estimator.]

Figure: CSL applied to penalized quantile regression with $\tau = 0.3$ (a surrogate Hessian is used). Sample size $n = 500$, dimension $p = 15$; $K$ is the number of subsets.
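The slides do not spell out the CSL update, so the sketch below follows the surrogate-loss idea of [4] as commonly stated: after one round of gradient communication, the master block minimizes its local loss with a linear correction that aligns the local gradient with the global one at the current iterate. The smooth logistic loss, equal block sizes, and all names are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import minimize

def logistic_loss_grad(theta, X, y):
    """Average logistic negative log-likelihood and its gradient (labels y in {0, 1})."""
    z = X @ theta
    loss = np.mean(np.logaddexp(0.0, z) - y * z)
    grad = X.T @ (1.0 / (1.0 + np.exp(-z)) - y) / len(y)
    return loss, grad

def csl_round(theta0, blocks):
    """One CSL-style round: one gradient communication, then a corrected
    minimization over the first (master) block only."""
    grads = [logistic_loss_grad(theta0, Xk, yk)[1] for Xk, yk in blocks]
    global_grad = np.mean(grads, axis=0)  # assumes equal block sizes
    shift = grads[0] - global_grad        # local-vs-global gradient gap at theta0

    X1, y1 = blocks[0]
    def surrogate(theta):
        # Local loss minus a linear term; its gradient equals the global
        # gradient at theta0, which drives the fast convergence.
        loss, grad = logistic_loss_grad(theta, X1, y1)
        return loss - shift @ theta, grad - shift

    return minimize(surrogate, theta0, jac=True, method="L-BFGS-B").x
```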
Part II: Divide-and-combine Statistical Aggregation

$\min_\theta L(X_1, y_1; \theta) \,(+\,\lambda P(\theta)) \;\rightarrow\; \hat\theta_1$
$\min_\theta L(X_2, y_2; \theta) \,(+\,\lambda P(\theta)) \;\rightarrow\; \hat\theta_2$
...
$\min_\theta L(X_K, y_K; \theta) \,(+\,\lambda P(\theta)) \;\rightarrow\; \hat\theta_K$
Aggregate (how?) $\rightarrow \hat\theta_{\text{GLOBAL}}$

Consider simple cases with $\lambda = 0$. Examples for $\lambda \neq 0$: [6] and the de-biased lasso in [7, 10], among others.
Naive Approach: Simple Average

$$\hat\theta_{\text{GLOBAL}} = \frac{\sum_{k=1}^{K} \hat\theta_k}{K}$$

This performs poorly in general, especially when the underlying data-generating model is non-linear.
One Step Further

In [9], a one-step estimator built from the average of the subset estimates is considered, in the general context of M-estimation, where $L(\cdot)$ is the criterion function.

1. First, take the average
$$\hat\theta^{(0)} = \frac{\sum_{k=1}^{K} \hat\theta_k}{K}.$$
2. Then compute the one-step estimator
$$\hat\theta^{(1)} = \hat\theta^{(0)} - [\nabla^2 L(X, y; \hat\theta^{(0)})]^{-1} [\nabla L(X, y; \hat\theta^{(0)})],$$
where
$$\nabla^2 L(X, y; \hat\theta^{(0)}) = \sum_{k=1}^{K} \nabla^2 L(X_k, y_k; \hat\theta^{(0)}) \quad \text{and} \quad \nabla L(X, y; \hat\theta^{(0)}) = \sum_{k=1}^{K} \nabla L(X_k, y_k; \hat\theta^{(0)}).$$
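A minimal sketch of the two steps for logistic regression; the Newton-based local solver and all names are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_mle(X, y, n_iter=25):
    """Newton's method for the (unpenalized) logistic MLE on one subset."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        theta = theta - np.linalg.solve(hess, grad)
    return theta

def one_step_estimator(blocks):
    """Step 1: average the subset MLEs. Step 2: one global Newton step,
    accumulating the full-data gradient and Hessian subset by subset."""
    theta0 = np.mean([local_mle(Xk, yk) for Xk, yk in blocks], axis=0)
    d = blocks[0][0].shape[1]
    grad, hess = np.zeros(d), np.zeros((d, d))
    for Xk, yk in blocks:
        p = sigmoid(Xk @ theta0)
        grad += Xk.T @ (p - yk)
        hess += (Xk * (p * (1 - p))[:, None]).T @ Xk
    return theta0 - np.linalg.solve(hess, grad)
```

Note that the second step needs only one more pass over the data, with each subset contributing its local gradient and Hessian at $\hat\theta^{(0)}$.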
One Step Further

Theorem 1 (Huang and Huo (2015))
Denote by $\theta_0$ the true parameter. Under some mild conditions, the one-step estimator $\hat\theta^{(1)}$ satisfies

$$\sqrt{n}\,(\hat\theta^{(1)} - \theta_0) \rightarrow N(0, \Sigma)$$

as long as $K = O(\sqrt{n})$, where $\Sigma$ is the usual sandwich covariance of the M-estimator,

$$\Sigma = \left(\mathbb{E}[\nabla^2 \ell(\theta_0)]\right)^{-1} \mathbb{E}\!\left[\nabla \ell(\theta_0)\, \nabla \ell(\theta_0)^T\right] \left(\mathbb{E}[\nabla^2 \ell(\theta_0)]\right)^{-1},$$

with $\ell$ the per-observation criterion. That is, the aggregated estimator $\hat\theta^{(1)}$ is asymptotically equivalent to the global estimator, as if the estimation were made with all the data, as long as the number of subsets does not grow too fast. For example, in the MLE case, $\Sigma$ reduces to the inverse Fisher information.
Aggregated Estimating Equation (AEE)

Take a closer look at simple linear regression:

$$\hat\beta = (X^T X)^{-1} X^T y = \left(\sum_{k=1}^{K} X_k^T X_k\right)^{-1} \sum_{k=1}^{K} X_k^T y_k = \left(\sum_{k=1}^{K} X_k^T X_k\right)^{-1} \sum_{k=1}^{K} X_k^T X_k\, \hat\beta_k,$$

a weighted average of the subset estimates. Use the local curvature of the loss function, i.e., the gradient of the estimating equation, as weights:

$$X^T(y - X\beta) = \sum_{k=1}^{K} X_k^T(y_k - X_k\beta) \quad \text{(OLS estimating equation)}.$$
Aggregated Estimating Equation (AEE)

Lin and Xi [11] generalize the idea to estimating-equation estimation:

$$M(\theta) = \sum_{i=1}^{n} \phi(x_i, y_i; \theta) = 0 \quad \left(\text{or } \sum_{i=1}^{n} \nabla L(x_i, y_i; \theta) = 0\right).$$

- Use the gradient of $M_k(\theta) = \sum_{i \in S_k} \phi(x_i, y_i; \theta)$ at $\hat\theta_k$ as the weight for subset $k$, i.e.,
$$A_k = \sum_{i \in S_k} \nabla_\theta\, \phi(x_i, y_i; \theta)\Big|_{\theta = \hat\theta_k}.$$
- Then calculate the AEE estimator as
$$\hat\theta_{\text{AEE}} = \left(\sum_{k=1}^{K} A_k\right)^{-1} \sum_{k=1}^{K} A_k \hat\theta_k.$$

It is proved in [11] that the AEE estimator is asymptotically equivalent to the global estimator under some mild conditions, as long as $K = O(n^\gamma)$ for some $0 < \gamma < 1$. A worked OLS instance follows below.
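For OLS the weights are $A_k = -X_k^T X_k$ (the sign cancels in the aggregation), and AEE reproduces the global estimator exactly, as the derivation on the previous slide shows. A minimal sketch with illustrative names:

```python
import numpy as np

def aee_ols(blocks):
    """AEE aggregation for OLS: weight each subset estimate by A_k = X_k^T X_k.
    (The sign of the estimating-function gradient cancels in the aggregation.)"""
    p = blocks[0][0].shape[1]
    A_sum = np.zeros((p, p))
    weighted = np.zeros(p)
    for Xk, yk in blocks:
        Ak = Xk.T @ Xk
        beta_k = np.linalg.solve(Ak, Xk.T @ yk)  # subset OLS estimate
        A_sum += Ak
        weighted += Ak @ beta_k                  # equals Xk.T @ yk for OLS
    return np.linalg.solve(A_sum, weighted)

# For OLS this reproduces the global estimator exactly:
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = X @ np.array([1.0, -2.0, 0.0, 3.0]) + rng.normal(size=1000)
blocks = list(zip(np.array_split(X, 5), np.array_split(y, 5)))
print(np.allclose(aee_ols(blocks), np.linalg.solve(X.T @ X, X.T @ y)))  # True
```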
Summary

Another approach: Song et al. proposed a D&C method that combines local confidence distributions or confidence inference functions [12].

Distributed optimization algorithms vs. statistical aggregation:
- A distributed optimization algorithm is generally iterative, while statistical aggregation is non-iterative; statistical aggregation methods are communication-efficient.
- A distributed optimization algorithm solves the original problem and finds the global estimate. A statistical aggregation estimate is not identical to the global estimate, but an ideal aggregation method is asymptotically equivalent to the global solution, as if the estimation were made on the entire data.

Inference was not covered here, but:
- For bootstrap-based inference, distributed optimization algorithms help parallelize and speed up the computation.
- For inference based on asymptotic results, the asymptotic covariance matrices are often computable in a distributed fashion.
Q&A

Thank you!
References I

[1] Peter Richtárik and Martin Takáč (2016). Distributed coordinate descent method for learning with big data. Journal of Machine Learning Research.
[2] Stephen Boyd et al. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning.
[3] Virginia Smith et al. (2016). CoCoA: A general framework for communication-efficient distributed optimization. arXiv preprint.
[4] Michael I. Jordan et al. (2016). Communication-efficient distributed statistical inference. arXiv preprint (v3).
References II

[5] Jialei Wang et al. (2016). Efficient distributed learning with sparsity. arXiv preprint (v1).
[6] Xueying Chen and Min-ge Xie (2014). A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica.
[7] Adel Javanmard and Andrea Montanari (2014). Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research.
[8] Neal Parikh and Stephen Boyd (2014). Block splitting for distributed optimization. Mathematical Programming Computation.
[9] Cheng Huang and Xiaoming Huo (2015). A distributed one-step estimator. arXiv preprint (v2).
References III

[10] Jason Lee et al. (2015). Communication-efficient sparse regression: a one-shot approach. Journal of Machine Learning Research.
[11] Nan Lin and Ruibin Xi (2011). Aggregated estimating equation estimation. Statistics and Its Interface.
[12] Peter X.K. Song et al. (2016). Confidence distributions and confidence inference functions: General data integration methods for big complex data. Personal communication.