Supplementary Materials to Learning Sparse Causal Gaussian Networks With Experimental Intervention: Regularization and Coordinate Descent


Fei Fu and Qing Zhou

1 Proofs

1.1 Proof of Proposition 1

We want to minimize
\[
g = \frac{1}{2}\log\bigl[(\beta_{kj}-\xi_{kj})^2 + (c_{kj}-\xi_{kj}^2)\bigr] + \eta|\beta_{kj}|
\]
over $\beta_{kj}$. After differentiating $g$ with respect to $\beta_{kj}$ and setting the derivative to zero, we obtain, for $\beta_{kj} > 0$,
\[
\eta\beta_{kj}^2 - (2\eta\xi_{kj}-1)\beta_{kj} + (c_{kj}\eta - \xi_{kj}) = 0, \tag{1}
\]
and for $\beta_{kj} < 0$,
\[
\eta\beta_{kj}^2 - (2\eta\xi_{kj}+1)\beta_{kj} + (c_{kj}\eta + \xi_{kj}) = 0. \tag{2}
\]
Apparently, both (1) and (2) have the same discriminant $\Delta = 1 - 4(c_{kj}-\xi_{kj}^2)\eta^2$. The only possible minimizers of $g$ are $0$, positive real roots of (1), or negative real roots of (2). In the rest of the proof, we will only show that Proposition 1 holds when $\xi_{kj} \ge 0$. The proof for $\xi_{kj} < 0$ is analogous.

First, consider $\xi_{kj} = 0$. It is easily seen that $g$ is minimized at $\beta_{kj} = 0$, which is included in the third case of the proposition.

Now consider the case when $\xi_{kj} > 0$. In this case, let
\[
\beta_1 = \frac{2\eta\xi_{kj} - 1 + \sqrt{\Delta}}{2\eta}, \qquad \beta_2 = \frac{2\eta\xi_{kj} - 1 - \sqrt{\Delta}}{2\eta}
\]
be the two possible real roots of (1). If (1) has two real roots, $\beta_2$ is a local maximum. Also note that if $\xi_{kj} > 0$, (2) can only have positive real roots. Thus, $g$ can only be minimized at $0$ or at $\beta_1$ if it is real, and we only need to find out when $0$ or $\beta_1$ minimizes $g$. There are four cases:

Case 1. $\Delta > 0$ and $\beta_1 > 0 > \beta_2$: This is equivalent to $0 < \eta < \xi_{kj}/c_{kj}$. In this case, we have $g(\beta_1) < g(0)$ (Figure S1A).

Case 2. $\Delta > 0$ and $\beta_1 > \beta_2 \ge 0$: This is equivalent to $\xi_{kj}/c_{kj} \le \eta < 1/\bigl(2\sqrt{c_{kj}-\xi_{kj}^2}\bigr)$ and $\eta > (2\xi_{kj})^{-1}$. In this case, $\beta_1$ is a local minimum and $\beta_2$ is a local maximum (Figure S1B). Thus, we need to compare $g(\beta_1)$ with $g(0)$ to determine $\arg\min_{\beta_{kj}} g$.

Figure S1: Examples illustrating different scenarios for minimizing $g$ over $\beta_{kj}$ when $\xi_{kj} > 0$. Each panel plots $g$ against $\tilde\beta_{kj}$, with (A) $\xi_{kj}=0.8$, $c_{kj}=0.8$, $\gamma=0.5$; (B) $\xi_{kj}=0.9$, $c_{kj}=0.82$, $\gamma=2$; (C) $\xi_{kj}=0.4$, $c_{kj}=0.8$, $\gamma=0.5$; (D) $\xi_{kj}=0.9$, $c_{kj}=0.85$, $\gamma=2.5$.

Case 3. $\Delta > 0$ and $0 \ge \beta_1 > \beta_2$: This is equivalent to $\xi_{kj}/c_{kj} \le \eta < 1/\bigl(2\sqrt{c_{kj}-\xi_{kj}^2}\bigr)$ and $\eta \le (2\xi_{kj})^{-1}$. In this case, neither $\beta_1$ nor $\beta_2$ is positive, so $\arg\min_{\beta_{kj}} g = 0$ (Figure S1C).

Case 4. $\Delta \le 0$: This is equivalent to $\eta \ge 1/\bigl(2\sqrt{c_{kj}-\xi_{kj}^2}\bigr)$. If $\Delta < 0$, clearly $\arg\min_{\beta_{kj}} g = 0$. If $\Delta = 0$, $\beta_1 = \beta_2$ is an inflection point if they are positive (Figure S1D), so it is also true that $\arg\min_{\beta_{kj}} g = 0$.

Therefore, we have shown that Proposition 1 holds.
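To make the case analysis concrete, the following small Python sketch implements the case-by-case minimizer for $\xi_{kj} \ge 0$ and checks it against a grid search. It is an illustration only, not the authors' code; the function names and the test values are made up, and it assumes $c_{kj} > \xi_{kj}^2$ so that the objective is well defined.

import numpy as np

def g(beta, xi, c, eta):
    # Single-coordinate objective from the proof above:
    # g(beta) = 0.5*log((beta - xi)^2 + (c - xi^2)) + eta*|beta|
    return 0.5 * np.log((beta - xi) ** 2 + (c - xi ** 2)) + eta * np.abs(beta)

def argmin_g(xi, c, eta):
    # Case-by-case minimizer for xi >= 0 (the case xi < 0 is symmetric).
    if xi == 0:
        return 0.0                                   # minimized at zero
    disc = 1.0 - 4.0 * (c - xi ** 2) * eta ** 2      # discriminant of equation (1)
    if disc <= 0:                                    # Case 4
        return 0.0
    beta1 = (2 * eta * xi - 1 + np.sqrt(disc)) / (2 * eta)
    if beta1 <= 0:                                   # Case 3
        return 0.0
    if eta < xi / c:                                 # Case 1
        return beta1
    # Case 2: compare the positive stationary point with zero
    return beta1 if g(beta1, xi, c, eta) < g(0.0, xi, c, eta) else 0.0

# Quick numerical check against a dense grid (illustrative values).
xi, c, eta = 0.9, 0.82, 0.4
grid = np.linspace(-2.0, 2.0, 200001)
print(argmin_g(xi, c, eta), grid[np.argmin(g(grid, xi, c, eta))])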

1.2 Proof of Theorem 3

We omit the proof of the first part of Theorem 3, since it is similar to that of Theorem 2. To prove the second part, let $B = \{j : \phi_j \ne 0\}$. For notational ease, by permuting the indices we first rewrite the parameter as $\theta = (\theta_a, \theta_b)$, where $\theta_a = \phi_A$ and $\theta_b$ collects $\phi_B$ and the variance parameters. Let $r = |A|$ be the number of zero components of $\phi$. Now we only need to show that, with probability tending to 1, for any $\theta_b$ within $O_p(n^{-1/2})$ of its true value and any constant $C > 0$,
\[
(0, \theta_b) = \arg\max_{\|\theta_a\| \le C/\sqrt{n}} R\bigl((\theta_a, \theta_b)\bigr). \tag{3}
\]
To establish (3), we again study the behavior of $R(\theta)$ around the point $(0, \theta_b)$ by expanding $L(\theta)$ around $(0, \theta_b)$. Let $a_n = 1/\sqrt{n}$, $\theta^o = (0, \theta_b)$, and $u = (u_a, 0)$ be such that $\|u\| \le C$ and $\theta^o + a_n u \in \Omega$. Then we have the following result, similar to that in the proof of Theorem 2:
\[
\begin{aligned}
R(\theta^o + a_n u) - R(\theta^o)
&= \sum_{k=1}^{p}\Bigl[\alpha_k \nabla L_k(\theta_k^o)^T u_k\{1+o_p(1)\} - \frac{\alpha_k n_k}{2}\, u_k^T I(\theta_k^o)\, u_k\{1+o_p(1)\}\Bigr] - \lambda_n\sqrt{n}\sum_{j=1}^{r}\tau_j|u_j| \\
&= \sum_{k=1}^{p}\Bigl[\alpha_k \nabla L_k(\theta_k^o)^T u_k\{1+o_p(1)\} - \frac{\alpha_k n_k}{2}\, u_k^T I(\theta_k^o)\, u_k\{1+o_p(1)\}\Bigr] - \lambda_n n^{\gamma/2}\sqrt{n}\sum_{j=1}^{r}\bigl|\sqrt{n}\,\hat\phi_j\bigr|^{-\gamma}|u_j|.
\end{aligned}
\tag{4}
\]
Note that both the first and the second terms in the last line of (4) are $O_p(1)$ for any fixed constant $C$. Since $\hat\phi_j$ is $\sqrt{n}$-consistent, we have $\sqrt{n}\,\hat\phi_j = O_p(1)$ for $j = 1, \dots, r$. Then the third term in the last line of (4) is of order $\lambda_n n^{(\gamma+1)/2}$. Therefore, (3) holds, and the proof is complete.
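As a reading aid, the step between the two lines of (4) and the stated order of the third term follow from a short calculation; the sketch below assumes the adaptive weights have the form $\tau_j = |\hat\phi_j|^{-\gamma}$, which is what the last line of (4) suggests:
\[
\lambda_n\sqrt{n}\,\tau_j|u_j|
 = \lambda_n\sqrt{n}\,|\hat\phi_j|^{-\gamma}|u_j|
 = \lambda_n\sqrt{n}\,n^{\gamma/2}\bigl|\sqrt{n}\,\hat\phi_j\bigr|^{-\gamma}|u_j|
 = \lambda_n n^{(\gamma+1)/2}\bigl|\sqrt{n}\,\hat\phi_j\bigr|^{-\gamma}|u_j|,
\]
and since $\sqrt{n}\,\hat\phi_j = O_p(1)$ for $j = 1,\dots,r$, the factor $|\sqrt{n}\,\hat\phi_j|^{-\gamma}$ is bounded away from zero in probability, so the penalty term grows at the rate $\lambda_n n^{(\gamma+1)/2}$ and dominates the $O_p(1)$ terms.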

2 Supplementary Algorithm

The following algorithm is used in the second step of the CD algorithm to check the acyclicity constraint. Its time complexity is $O(|V| + |E|)$.

Algorithm S1: Check whether a DAG $G$ remains acyclic if an edge $i \to j$ is added.

function Cycle(G, i, j)
    for v ∈ V \ {i} do
        C_v ← 0
    end for
    C_i ← 1
    Q ← ∅
    Q ← ENQUEUE(Q, i)
    while Q ≠ ∅ do
        u ← DEQUEUE(Q)
        for v ∈ Π_u^G do
            if v = j then
                return true
            else if C_v = 0 then
                C_v ← 1
                Q ← ENQUEUE(Q, v)
            end if
        end for
    end while
    return false
end function
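For concreteness, here is a small, self-contained Python sketch of the same breadth-first check. It is an illustration only, not the authors' implementation; the adjacency-list representation, the function name would_create_cycle, and the example graph are made up.

from collections import deque

def would_create_cycle(parents, i, j):
    """Return True if adding the edge i -> j to the DAG would create a cycle.

    parents[v] lists the parents of node v (nodes u with an existing edge u -> v).
    Adding i -> j creates a cycle iff j is already an ancestor of i, so we run a
    breadth-first search from i that follows parent links, as in Algorithm S1.
    """
    visited = {i}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        for v in parents.get(u, []):
            if v == j:
                return True
            if v not in visited:
                visited.add(v)
                queue.append(v)
    return False

# Tiny illustrative example with edges 1 -> 2 and 2 -> 3:
parents = {2: [1], 3: [2]}
print(would_create_cycle(parents, 3, 1))  # True: 1 is already an ancestor of 3
print(would_create_cycle(parents, 1, 3))  # False: adding 1 -> 3 keeps the graph acyclic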

3 Supplementary Figures

3.1 Demonstration of convergence

Figure S2 demonstrates the convergence of the CD algorithm on a simulated data set with p = 200. The figure plots the maximum absolute difference (MAD) in the coefficient matrix between two adjacent iterations. The two bumps at iterations 16 and 25 reflect changes in the active set of blocks after a complete cycle. The decrease in the MAD before, between, and after the bumps indicates convergence given a fixed active set of blocks, while the decreasing MAD at the bumps (iterations 1, 16, and 25) demonstrates convergence in the structure (the active set). At the final iteration, the CD algorithm again cycles through all blocks (which allows the active set to be updated), but the active set stays the same and the MAD is already below the threshold, so the algorithm stops. That only a few changes in the active set occur is due to the use of a warm start, which often gives a good initial estimate.

Figure S2: A typical plot for the convergence of the CD algorithm (p = 200, $\beta_{ij} = 0.5$; maximum absolute difference versus iteration t).
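As an illustration of the stopping rule described above, the following minimal Python snippet computes the MAD between two successive coefficient matrices; the matrices, the helper name, and the threshold are hypothetical, not taken from the original supplement.

import numpy as np

def max_abs_diff(B_old, B_new):
    # Maximum absolute difference (MAD) in the coefficient matrix between two adjacent iterations.
    return np.max(np.abs(B_new - B_old))

# Illustrative check with two small coefficient matrices (values are arbitrary):
B_prev = np.array([[0.0, 0.50], [0.0, 0.0]])
B_curr = np.array([[0.0, 0.48], [0.0, 0.0]])
print(max_abs_diff(B_prev, B_curr) < 1e-4)  # False: the change is still above this illustrative threshold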

3.2 Choice of α

Figure S3 shows the sensitivity of the simulation results with p = 100 and $\beta_{ij} = 0.5$ (Table 1) to the choice of α [Equation (11) in the main paper]. The solid line plots TPR versus α, the dot-dashed line plots R/P versus α, and the long-dashed line plots FP/P versus α. The sum of the latter two curves gives FDR versus α. See Table 1 for the notation.

Figure S3: Simulation results for different values of α (p = 100, $\beta_{ij} = 0.5$).
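The FDR curve in Figure S3 is simply the sum of the two plotted ratios. A tiny sketch of that relationship (the counts are made up; P, R, and FP follow the notation of Table 1 in the main text):

def fdr(P, R, FP):
    # As described for Figure S3: FDR = R/P + FP/P = (R + FP)/P, where P is the number of
    # predicted edges, R the number of reversed edges, and FP the number of false positives.
    return (R + FP) / P

print(fdr(P=200, R=15, FP=10))  # 0.125 with these hypothetical counts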

4 Supplementary Tables

All notation in the supplementary tables is defined in Table 1 in the main text.

Table S1 shows the large-sample performance of the CD algorithm and the KO method. The data were simulated in the same way as described in Section 5.1 in the main text. Here, the sample size is n = 6000 and the number of edges is 2p for each data set.

Table S1: Large sample results for the CD algorithm and the KO method. Columns: p and $\beta_{ij}$; P, E, R, M, FP, TPR, and FDR for the CD algorithm; TPR and FDR for the KO method.

Table S2 shows the performance of the CD algorithm for DAGs with different degrees of sparsity. The data were simulated with intervention as described in Section 5.1 in the main text. We fixed the sample size of each data set to n = 5p, where p is the number of nodes, and varied the number of edges from p to 4p. The coefficients were $\beta_{ij} = 0.5$ for all the results in this table.

Table S2: Performance of the CD algorithm for DAGs with different degrees of sparsity. Columns: p, number of edges, TPR, and FDR.

The following two tables summarize the results of the CD algorithm on observational data, as well as a comparison with the PC-based method on observational data.

Table S3: Performance of the CD algorithm on observational data. Columns: p and $\beta_{ij}$; P, E, R, M, FP, TPR, and FDR.

Table S4: Performance comparison between the PC-based method and the CD algorithm on observational data. Columns: p and $\beta_{ij}$; P, TPR, and FDR for the PC-based method; P, TPR, and FDR for the CD algorithm.
