Supplementary Materials to Learning Sparse Causal Gaussian Networks With Experimental Intervention: Regularization and Coordinate Descent
Fei Fu and Qing Zhou

1 Proofs

1.1 Proof of Proposition 1

We want to minimize
\[
g(\beta_{kj}) = \frac{1}{2}\log\left[(\beta_{kj} - \xi_{kj})^2 + c_{kj} - \xi_{kj}^2\right] + \eta\,|\beta_{kj}|
\]
over $\beta_{kj}$. Differentiating $g$ with respect to $\beta_{kj}$ and setting the derivative to zero, we obtain for $\beta_{kj} > 0$,
\[
\eta\beta_{kj}^2 - (2\eta\xi_{kj} - 1)\beta_{kj} + (c_{kj}\eta - \xi_{kj}) = 0, \tag{1}
\]
and for $\beta_{kj} < 0$,
\[
\eta\beta_{kj}^2 - (2\eta\xi_{kj} + 1)\beta_{kj} + (c_{kj}\eta + \xi_{kj}) = 0. \tag{2}
\]
Both (1) and (2) have the same discriminant $\Delta = 1 - 4(c_{kj} - \xi_{kj}^2)\eta^2$. The only possible minimizers of $g$ are $0$, the positive real roots of (1), and the negative real roots of (2). In the rest of the proof we show that Proposition 1 holds when $\xi_{kj} \ge 0$; the proof for $\xi_{kj} < 0$ is analogous.

First, consider $\xi_{kj} = 0$. It is easily seen that $g$ is minimized at $\beta_{kj} = 0$, which is included in the third case of the proposition.

Now consider the case $\xi_{kj} > 0$, and let
\[
\tilde\beta_1 = \frac{2\eta\xi_{kj} - 1 + \sqrt{\Delta}}{2\eta}, \qquad
\tilde\beta_2 = \frac{2\eta\xi_{kj} - 1 - \sqrt{\Delta}}{2\eta}
\]
be the two possible real roots of (1). If (1) has two real roots, $\tilde\beta_2$ is a local maximum. Also note that if $\xi_{kj} > 0$, (2) can only have positive real roots. Thus $g$ can only be minimized at $0$ or at $\tilde\beta_1$ when it is real, and we only need to determine when $0$ or $\tilde\beta_1$ minimizes $g$. There are four cases:

Case 1. $\Delta > 0$ and $\tilde\beta_1 > 0 > \tilde\beta_2$: This is equivalent to $0 < \eta < \xi_{kj}/c_{kj}$. In this case, $g(\tilde\beta_1) < g(0)$ (Figure S1A).

Case 2. $\Delta > 0$ and $\tilde\beta_1 > \tilde\beta_2 \ge 0$: This is equivalent to $\xi_{kj}/c_{kj} \le \eta < \big(2\sqrt{c_{kj} - \xi_{kj}^2}\big)^{-1}$ and $\eta > (2\xi_{kj})^{-1}$. In this case, $\tilde\beta_1$ is a local minimum and $\tilde\beta_2$ is a local maximum (Figure S1B). Thus we need to compare $g(\tilde\beta_1)$ with $g(0)$ to determine $\arg\min_{\beta_{kj}} g$.
Figure S1: Examples illustrating different scenarios for minimizing $g$ over $\beta_{kj}$ when $\xi_{kj} > 0$: (A) $\xi_{kj} = 0.8$, $c_{kj} = 0.8$, $\gamma = 0.5$; (B) $\xi_{kj} = 0.9$, $c_{kj} = 0.82$, $\gamma = 2$; (C) $\xi_{kj} = 0.4$, $c_{kj} = 0.8$, $\gamma = 0.5$; (D) $\xi_{kj} = 0.9$, $c_{kj} = 0.85$, $\gamma = 2.5$.

Case 3. $\Delta > 0$ and $0 \ge \tilde\beta_1 > \tilde\beta_2$: This is equivalent to $\xi_{kj}/c_{kj} \le \eta < \big(2\sqrt{c_{kj} - \xi_{kj}^2}\big)^{-1}$ and $\eta \le (2\xi_{kj})^{-1}$. In this case, neither $\tilde\beta_1$ nor $\tilde\beta_2$ is positive, so $\arg\min_{\beta_{kj}} g = 0$ (Figure S1C).

Case 4. $\Delta \le 0$: This is equivalent to $\eta \ge \big(2\sqrt{c_{kj} - \xi_{kj}^2}\big)^{-1}$. If $\Delta < 0$, clearly $\arg\min_{\beta_{kj}} g = 0$. If $\Delta = 0$, then $\tilde\beta_1 = \tilde\beta_2$ is an inflection point if it is positive (Figure S1D), so it is again true that $\arg\min_{\beta_{kj}} g = 0$.

Therefore, we have shown that Proposition 1 holds.
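To make the case analysis concrete, here is a minimal numerical sketch (not from the paper) that computes the coordinate minimizer of $g$ by enumerating the candidates identified above: $0$, the positive roots of (1), and the negative roots of (2). The function and parameter names are illustrative, and the sketch assumes $\eta > 0$ and $c_{kj} > \xi_{kj}^2$.

```python
import numpy as np

def coordinate_minimizer(xi, c, eta):
    """Minimize g(b) = 0.5*log((b - xi)^2 + c - xi^2) + eta*|b| over b
    by enumerating the candidates from Proposition 1: 0, the positive
    roots of (1), and the negative roots of (2). Assumes eta > 0, c > xi**2."""
    def g(b):
        return 0.5 * np.log((b - xi) ** 2 + c - xi ** 2) + eta * abs(b)

    candidates = [0.0]
    delta = 1.0 - 4.0 * (c - xi ** 2) * eta ** 2  # common discriminant of (1) and (2)
    if delta >= 0.0:
        root = np.sqrt(delta)
        # roots of (1): eta*b^2 - (2*eta*xi - 1)*b + (c*eta - xi) = 0; keep if positive
        candidates += [b for b in ((2 * eta * xi - 1 + root) / (2 * eta),
                                   (2 * eta * xi - 1 - root) / (2 * eta)) if b > 0]
        # roots of (2): eta*b^2 - (2*eta*xi + 1)*b + (c*eta + xi) = 0; keep if negative
        candidates += [b for b in ((2 * eta * xi + 1 + root) / (2 * eta),
                                   (2 * eta * xi + 1 - root) / (2 * eta)) if b < 0]
    return min(candidates, key=g)

# Case 1 example (eta below xi/c): the minimizer is the positive root of (1).
print(coordinate_minimizer(xi=0.8, c=0.8, eta=0.5))
```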
1.2 Proof of Theorem 3

We omit the proof of the first part of Theorem 3, since it is similar to that of Theorem 2. To prove the second part, let $\mathcal{B} = \{j : \phi_j \neq 0\}$ and $\mathcal{A} = \{j : \phi_j = 0\}$. For notational ease, by permuting the indices we rewrite the parameter $\theta$ as $\theta = (\theta_a, \theta_b) = (\phi_{\mathcal{A}}, \phi_{\mathcal{B}}, \sigma_1^2, \ldots, \sigma_p^2)$, where $\theta_a = \phi_{\mathcal{A}}$ and $\theta_b = (\phi_{\mathcal{B}}, \sigma_1^2, \ldots, \sigma_p^2)$. Let $r = |\mathcal{A}|$ be the number of zero components of $\phi$. Now we only need to show that, with probability tending to one, for any $\theta_b$ satisfying $\|\theta_b - \theta_b^*\| = O_p(n^{-1/2})$, where $\theta_b^*$ is the true value, and any constant $C > 0$,
\[
(0, \theta_b) = \arg\max_{\|\theta_a\| \le C/\sqrt{n}} R\big((\theta_a, \theta_b)\big). \tag{3}
\]
To establish (3), we again study the behavior of $R(\theta)$ around the point $(0, \theta_b)$ by expanding $L(\theta)$ around $(0, \theta_b)$. Let $a_n = 1/\sqrt{n}$, $\theta^o = (0, \theta_b)$, and $u = (u_a, 0)$ such that $\|u\| \le C$ and $\theta^o + a_n u \in \Omega$. Then we have the following result, similar to that in the proof of Theorem 2:
\[
\begin{aligned}
& R(\theta^o + a_n u) - R(\theta^o) \\
&= \sum_{k=1}^{p} \left[ \frac{\alpha_k}{\sqrt{n_k}} L_k'(\theta_k^o)^T u_k \{1 + o_p(1)\} - \frac{\alpha_k}{2} u_k^T I(\theta_k^o) u_k \{1 + o_p(1)\} \right] - \frac{\lambda_n}{\sqrt{n}} \sum_{j=1}^{r} \tau_j |u_j| \\
&= \sum_{k=1}^{p} \left[ \frac{\alpha_k}{\sqrt{n_k}} L_k'(\theta_k^o)^T u_k \{1 + o_p(1)\} - \frac{\alpha_k}{2} u_k^T I(\theta_k^o) u_k \{1 + o_p(1)\} \right] - \frac{\lambda_n n^{\gamma/2}}{\sqrt{n}} \sum_{j=1}^{r} \big|\sqrt{n}\,\hat\phi_j\big|^{-\gamma} |u_j|. 
\end{aligned} \tag{4}
\]
Note that both the first and the second terms in the last line of (4) are of order $O_p(1)$ for any fixed constant $C$. Since $\hat\phi_j$ is $\sqrt{n}$-consistent, we have $\sqrt{n}\,\hat\phi_j = O_p(1)$ for $j = 1, \ldots, r$, so the third term in the last line of (4) is of order $\lambda_n n^{(\gamma-1)/2}$. Since $\lambda_n n^{(\gamma-1)/2} \to \infty$ under the conditions of Theorem 3, this negative term dominates the first two whenever $u_a \neq 0$. Therefore, (3) holds, and the proof is complete.
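For completeness, the rate of the penalty term can be made explicit. Assuming, as the two displays in (4) indicate, that the adaptive weights are $\tau_j = |\hat\phi_j|^{-\gamma}$, a one-line bookkeeping computation (a sketch, not part of the original proof) gives
\[
\frac{\lambda_n}{\sqrt{n}} \sum_{j=1}^{r} \tau_j |u_j|
 = \frac{\lambda_n}{\sqrt{n}} \sum_{j=1}^{r} |\hat\phi_j|^{-\gamma} |u_j|
 = \lambda_n n^{(\gamma-1)/2} \sum_{j=1}^{r} \big(\sqrt{n}\,|\hat\phi_j|\big)^{-\gamma} |u_j|,
\]
and since $\sqrt{n}\,\hat\phi_j = O_p(1)$ for $j \le r$, each factor $\big(\sqrt{n}\,|\hat\phi_j|\big)^{-\gamma}$ is bounded away from zero with probability tending to one, which is why the term is of order $\lambda_n n^{(\gamma-1)/2}$.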
2 Supplementary Algorithm

The following algorithm is used in the second step of the CD algorithm to check the acyclicity constraint. Here $\Pi_u^G$ denotes the parent set of node $u$ in $G$; the function returns true if and only if adding the edge $i \to j$ would create a cycle, i.e., if $j$ is an ancestor of $i$ in $G$. Its time complexity is $O(|V| + |E|)$.

Algorithm S1 Check whether a DAG $G$ remains acyclic if an edge $i \to j$ is added.

function Cycle(G, i, j)
    for v in V \ {i} do
        C_v <- 0
    end for
    C_i <- 1
    Q <- empty queue; ENQUEUE(Q, i)
    while Q is not empty do
        u <- DEQUEUE(Q)
        for v in Pi_u^G do
            if v = j then
                return true
            else if C_v = 0 then
                C_v <- 1
                ENQUEUE(Q, v)
            end if
        end for
    end while
    return false
end function
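As a concrete illustration, below is a minimal Python sketch of this breadth-first search (not the paper's implementation). It assumes the DAG is stored as a dictionary `parents` mapping each node to its parent set, and uses a visited set in place of the labels $C_v$.

```python
from collections import deque

def creates_cycle(parents, i, j):
    """Check whether adding edge i -> j to a DAG would create a cycle.

    Breadth-first search from i following parent sets, as in Algorithm S1:
    adding i -> j closes a cycle iff j is an ancestor of i, i.e. iff j is
    reached while traversing parents. `parents[v]` is the parent set of v.
    Runs in O(|V| + |E|) time.
    """
    visited = {i}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        for v in parents.get(u, ()):
            if v == j:
                return True  # a path j -> ... -> i exists, so i -> j closes a cycle
            if v not in visited:
                visited.add(v)
                queue.append(v)
    return False

# Example: chain 0 -> 1 -> 2; adding 2 -> 0 creates a cycle, adding 0 -> 2 does not.
parents = {1: {0}, 2: {1}}
print(creates_cycle(parents, 2, 0))  # True
print(creates_cycle(parents, 0, 2))  # False
```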
3 Supplementary Figures

3.1 Demonstration of convergence

Figure S2 demonstrates the convergence of the CD algorithm on a simulated data set with $p = 200$. The figure plots the maximum absolute difference (MAD) in the coefficient matrix between two adjacent iterations. The two bumps at iterations 16 and 25 reflect changes in the active set of blocks after a complete cycle. The decrease in the MAD before, between, and after the bumps indicates convergence given a fixed active set of blocks, while the decreasing MAD across the bumps (iterations 1, 16, and 25) demonstrates convergence in the structure (the active set). At the final iteration, the CD algorithm again cycles through all blocks (which allows updates to the active set), but the active set stays the same and the MAD is already below the threshold, so the algorithm terminates. That there are only a few changes in the active set is due to the use of a warm start, which often gives a good initial estimate.

Figure S2: A typical plot of the convergence of the CD algorithm ($p = 200$, $\beta_{ij} = 0.5$); the maximum absolute difference is plotted against the iteration $t$.
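As a hedged sketch of the stopping rule just described (the function names and tolerance are illustrative, not the paper's code), the MAD criterion can be computed as follows.

```python
import numpy as np

def max_abs_diff(B_new, B_old):
    """Maximum absolute difference (MAD) between the coefficient matrices
    of two adjacent CD iterations, the quantity plotted in Figure S2."""
    return float(np.max(np.abs(B_new - B_old)))

def converged(B_new, B_old, active_new, active_old, tol=1e-4):
    """Illustrative stopping rule: terminate once a full cycle through all
    blocks leaves the active set unchanged and the MAD falls below tol."""
    return active_new == active_old and max_abs_diff(B_new, B_old) < tol
```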
3.2 Choice of α

Figure S3 shows the sensitivity of the simulation results with $p = 100$ and $\beta_{ij} = 0.5$ (Table 1) to the choice of $\alpha$ [Equation (11) in the main paper]. The solid line plots TPR versus $\alpha$, the dot-dashed line plots R/P versus $\alpha$, and the long-dashed line plots FP/P versus $\alpha$. The sum of the latter two curves gives the FDR versus $\alpha$. See Table 1 for the notation.

Figure S3: Simulation results for different values of $\alpha$ ($p = 100$, $\beta_{ij} = 0.5$).
4 Supplementary Tables

All notation in the supplementary tables is defined in Table 1 of the main text.

Table S1 shows the large-sample performance of the CD algorithm and the KO method. The data were simulated in the same way as described in Section 5.1 of the main text. Here the sample size is $n = 6000$ and the number of edges is $2p$ for each data set.

Table S1: Large sample results for the CD algorithm and the KO method. Columns: $p$, $\beta_{ij}$; for the CD algorithm: P, E, R, M, FP, TPR, FDR; for the KO method: TPR, FDR.
Table S2 shows the performance of the CD algorithm for DAGs with different degrees of sparsity. The data were simulated with intervention as described in Section 5.1 of the main text. We fixed the sample size of a data set to $n = 5p$, where $p$ is the number of nodes, and varied the number of edges from $p$ to $4p$. The coefficients are $\beta_{ij} = 0.5$ for all results in this table.

Table S2: Performance of the CD algorithm for DAGs with different degrees of sparsity. Columns: $p$, number of edges, TPR, FDR.
The following two tables summarize the results of the CD algorithm on observational data as well as a comparison with the PC-based method on observational data.

Table S3: Performance of the CD algorithm on observational data. Columns: $p$, $\beta_{ij}$, P, E, R, M, FP, TPR, FDR.
Table S4: Performance comparison between the PC-based method and the CD algorithm on observational data. Columns: $p$, $\beta_{ij}$; for the PC-based method: P, TPR, FDR; for the CD algorithm: P, TPR, FDR.