LARGE NUMBERS OF EXPLANATORY VARIABLES. H.S. Battey. WHOA-PSI, St Louis, 9 September 2018
1 LARGE NUMBERS OF EXPLANATORY VARIABLES. H. S. Battey, Department of Mathematics, Imperial College London. WHOA-PSI, St Louis, 9 September 2018.
2 Regression, broadly defined. Response variable $Y_i$, e.g. blood pressure, disease status, survival time. Vector $X_i$ of $v$ potential explanatory variables, observed on units (e.g. patients) $i = 1, \ldots, n$, with $v \gg n$, e.g. $X_i$ arising from gene expression analysis.
3 Goal: scientific understanding. Sparsity is critical for interpretability and statistical stability. Lasso (Tibshirani, 1996): penalized least squares, minimize $\|Y - X\beta\|_2^2 + \lambda \|\beta\|_1$; more generally, penalized maximum likelihood. There results a single model. Cox, D. R. and Battey, H. S. (2017), Proc. Nat. Acad. Sci., 114: aim instead for a confidence set of models.
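For intuition about why the $\ell_1$ penalty yields a single sparse model, a minimal sketch (not part of the talk; all parameter values are illustrative): with an orthonormal design, $X^{\top}X = I$, the lasso objective separates coordinatewise and its exact minimizer is soft-thresholding of the least-squares estimates $X^{\top}Y$.

```python
import numpy as np

rng = np.random.default_rng(7)

def soft_threshold(z, t):
    """Coordinatewise soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Orthonormal design (X'X = I) via QR; a sparse truth.
n, p, lam = 50, 5, 1.0
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
beta = np.array([3.0, -2.0, 0.0, 0.0, 0.0])
y = X @ beta + 0.1 * rng.standard_normal(n)

# With X'X = I the objective ||y - Xb||^2 + lam*||b||_1 separates over
# coordinates, and its exact minimizer is soft-thresholding of X'y.
bhat = soft_threshold(X.T @ y, lam / 2)
print(bhat)
```

For general designs the solution has no closed form, but the same soft-thresholding step drives coordinate-descent lasso solvers.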
5 Arrange variable indices in a $k \times k \times k$ hypercube, where $k \le 15$. Traverse the cube: rows, columns, etc. Assess each variable several times, each time alongside $k - 1$ different companions.
8 Arrange variable indices in a $k \times k \times k$ hypercube, where $k \le 15$. Traverse the cube: rows, columns, etc. Assess each variable several times, each time alongside $k - 1$ different companions. Retain or discard variables according to a decision rule. Repeat. Informal checks. Test low-dimensional subsets for compatibility with the data: a confidence set of models.
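As an illustrative sketch of this scheme (not the authors' implementation; the linear-model fit, the particular decision rule and all numerical values are assumptions for demonstration): variable indices are reshaped into a $k \times k \times k$ cube, every row along each of the three axes is analysed by least squares, and a variable is retained if it ranks among the two most significant in at least two of its three analyses.

```python
import numpy as np

rng = np.random.default_rng(0)

k = 10                          # cube side, so v = k**3 = 1000 variables
v, n = k**3, 100
X = rng.standard_normal((n, v))
beta = np.zeros(v)
beta[[3, 500, 999]] = 1.0       # three signal variables, unknown to the analyst
y = X @ beta + rng.standard_normal(n)

cube = np.arange(v).reshape(k, k, k)

def analyses(cube):
    """Yield every 'row' of k indices along each of the three axes; each
    variable appears in exactly three analyses, with different companions."""
    for axis in range(3):
        for idx in np.moveaxis(cube, axis, -1).reshape(-1, cube.shape[0]):
            yield idx

def abs_tscores(idx):
    """|t|-statistics from least squares of y on the k variables in idx."""
    Xs = X[:, idx]
    bhat, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ bhat
    s2 = resid @ resid / (n - len(idx))
    se = np.sqrt(s2 * np.diag(np.linalg.inv(Xs.T @ Xs)))
    return np.abs(bhat / se)

# Illustrative decision rule: a variable scores a 'hit' when it is among the
# two most significant in an analysis; retain it on at least two hits.
hits = np.zeros(v, dtype=int)
for idx in analyses(cube):
    hits[idx[np.argsort(abs_tscores(idx))[-2:]]] += 1
retained = np.flatnonzero(hits >= 2)
print(len(retained))            # a severe reduction from v = 1000
```

In the actual procedure the retained set is re-arranged in a smaller cube and the step is repeated; the decision rule above merely stands in for the strategies analysed on the following slides.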
11 TYPICAL OUTPUT. [Table: models and their recoded variable indices.] This arose from an example with $n = 129$ individuals (105 with osteoarthritis, 24 controls) and $v \approx 50{,}000$ genetic variables. Statistics can take us no further: any choice between such models would require additional data or subject-matter expertise.
12 THREE PHASES. A reduction phase uses significance tests as an informal guide to discarding variables. On the reduced set, an exploratory phase allows assessment of anomalies. A model selection phase assesses candidate models for their compatibility with the data.
13 REDUCTION PHASE
14 REMARKS ON THE REDUCTION PHASE. No restriction to perfect (hyper)cubes: partially balanced incomplete block designs (Yates, 1936). A more severe reduction than marginal screening. Arrangement rerandomization. Prior assessment of importance. Binary responses and separating hyperplanes.
15 KEY ASPECTS OF ANALYSIS. What is the probability that a signal variable is falsely discarded? How many of the variables ultimately suggested as potentially important are noise variables?
16 Battey, H. S. and Cox, D. R. (2018), Proc. R. Soc. Lond. A, 474. Specify behaviour under idealized conditions; provide guidance on decision rules. The goal is not to set up a procedure to achieve pre-assigned error rates.
17 SOME FIRST-STAGE REDUCTION STRATEGIES. 1. Retain the single variable with the highest score (lowest p-value). 2. Retain the two variables with the highest scores. 3. Retain all variables, if any, whose scores exceed a threshold. In the second stage of the procedure, the third strategy is always used.
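The three rules can be stated compactly in code (an illustrative sketch; the function name and example p-values are ours):

```python
import numpy as np

def reduce_strategy(pvals, strategy, alpha=0.01):
    """Indices retained within one analysis under each first-stage rule."""
    order = np.argsort(pvals)
    if strategy == 1:                      # single most significant variable
        return order[:1]
    if strategy == 2:                      # two most significant variables
        return order[:2]
    return np.flatnonzero(pvals < alpha)   # all below a threshold (possibly none)

p = np.array([0.40, 0.003, 0.72, 0.08, 0.90])
print(reduce_strategy(p, 1))   # [1]
print(reduce_strategy(p, 2))   # [1 3]
print(reduce_strategy(p, 3))   # [1]
```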
18 A SIMPLIFYING APPROXIMATION. Analyses involving the same variable are treated as independent: a slightly pessimistic assessment of the proportion of correctly retained signal variables, and a slightly optimistic assessment of the number of falsely retained noise variables. Signal variables are on an equal footing (same signal strength).
19 Behaviour is governed largely by $\theta$, the probability that a particular signal variable survives reduction in any single analysis in which it appears. This depends on the reduction strategy used.
20 In a set of $k$ variables chosen at random for test, the number of signal variables has approximately a Poisson distribution of mean $a = k v_{S0}/v$. Since $\lim_{a \to 0} \{1 - e^{-a}(1+a)\}/(1 - e^{-a}) = 0$, sets containing more than one signal variable are negligible for small $a$. So $\theta \approx \vartheta$ for small $a$, where $\vartheta$ is the survival probability conditioned on the event that there is exactly one signal variable in the set.
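The limit on this slide is easy to check numerically: for $N \sim \mathrm{Poisson}(a)$, the chance that a set contributing at least one signal variable in fact contributes several vanishes as $a \to 0$ (the values of $a$ below are illustrative):

```python
import math

def pr_multiple_given_some(a):
    """P(N >= 2 | N >= 1) for N ~ Poisson(a):
    {1 - e^{-a}(1 + a)} / (1 - e^{-a})."""
    return (1 - math.exp(-a) * (1 + a)) / (1 - math.exp(-a))

for a in (1.0, 0.1, 0.01):     # a = k * v_S0 / v
    print(a, pr_multiple_given_some(a))
```

The ratio behaves like $a/2$ for small $a$, which is why conditioning on exactly one signal variable per set ($\theta \approx \vartheta$) is harmless when signals are sparse.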
21 SOME ANALYSIS OF FIRST-STAGE REDUCTION. To avoid distributional assumptions on the response variable, we frame the discussion in terms of p-values.* For noise variables these are uniformly distributed on $(0, 1)$. For signal variables, we model their density as $(1 - \gamma)x^{-\gamma}$, $0 < \gamma < 1$. [*A parallel analysis for the normal-theory linear model gives qualitatively the same conclusions.]
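The signal-variable model has distribution function $F(x) = x^{1-\gamma}$, so such p-values are simple to simulate by inversion (a sketch; $\gamma$ and the sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def signal_pvalues(m, gamma, rng):
    """m draws from the density (1 - gamma) x^{-gamma} on (0, 1),
    by inverting the CDF F(x) = x^{1 - gamma}."""
    return rng.uniform(size=m) ** (1.0 / (1.0 - gamma))

gamma = 0.8
p = signal_pvalues(100_000, gamma, rng)
# The density has mean (1 - gamma) / (2 - gamma); compare with the sample.
print(p.mean(), (1 - gamma) / (2 - gamma))
```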
22 STRATEGY 1.
$$\vartheta^{(1)} = \mathrm{pr}\{\text{signal beats best of } k-1 \text{ noise}\} = (1-\gamma)\int_0^1 (1-x)^{k-1} x^{-\gamma}\,dx = (1-\gamma)\,\frac{\Gamma(1-\gamma)\,\Gamma(k)}{\Gamma(1-\gamma+k)} \approx \Gamma(2-\gamma)/k^{1-\gamma}.$$
$\Gamma(2-\gamma)$ is close to one over the range of interest. If $\gamma = 0$, the notional signal variable is selected with probability $1/k$, i.e. at random.
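A quick numerical check of this formula (with illustrative $\gamma$ and $k$): simulate one signal p-value against the minimum of $k-1$ uniform noise p-values, and compare with both the exact Beta-function expression and the approximation $\Gamma(2-\gamma)/k^{1-\gamma}$.

```python
import math
import numpy as np

rng = np.random.default_rng(2)

def theta1_exact(gamma, k):
    """(1-gamma) B(1-gamma, k) = Gamma(2-gamma) Gamma(k) / Gamma(1-gamma+k)."""
    return math.gamma(2 - gamma) * math.gamma(k) / math.gamma(1 - gamma + k)

def theta1_approx(gamma, k):
    return math.gamma(2 - gamma) / k ** (1 - gamma)

gamma, k, m = 0.5, 10, 200_000
signal = rng.uniform(size=m) ** (1 / (1 - gamma))  # density (1-gamma) x^{-gamma}
noise = rng.uniform(size=(m, k - 1))
mc = np.mean(signal < noise.min(axis=1))           # signal beats the best noise
print(mc, theta1_exact(gamma, k), theta1_approx(gamma, k))
```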
23 STRATEGY 2.
$$\vartheta^{(2)} = \mathrm{pr}(\text{signal among 2 most significant}) = \vartheta^{(1)} + \vartheta^{(21)}, \quad \text{where}$$
$$\vartheta^{(21)} = \mathrm{pr}(\text{signal comes second}) = \int_0^1 (k-1)(1-x)^{k-2}\, x \,(1-\gamma)x^{-\gamma}\,dx \approx (1-\gamma)\,\Gamma(2-\gamma)/k^{1-\gamma}.$$
$\vartheta^{(2)} - \vartheta^{(1)}$ is negligible for $\gamma$ close to 1. If $\gamma = 0$, $\vartheta^{(2)} = 2/k$.
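The boundary case $\gamma = 0$ gives a symmetry check of $\vartheta^{(2)} = 2/k$: a uniform signal p-value is exchangeable with the noise p-values, so it lands among the two most significant with probability $2/k$ (illustrative simulation):

```python
import numpy as np

rng = np.random.default_rng(3)

k, m = 10, 200_000
signal = rng.uniform(size=m)                   # gamma = 0: same law as noise
noise = np.sort(rng.uniform(size=(m, k - 1)), axis=1)
# Signal ranks in the top two iff it beats the second-smallest noise p-value.
mc = np.mean(signal < noise[:, 1])
print(mc, 2 / k)
```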
24 APPROXIMATION ERROR. Figure: exact and approximate values of $\vartheta^{(1)}$ (left) and $\vartheta^{(21)}$ (right) over $\gamma \in (0, 1)$ for $k = 10$.
25 STRATEGY 3. If we set a critical level $\alpha$, then
$$\vartheta^{(3)} = \int_0^\alpha (1-\gamma)x^{-\gamma}\,dx = \alpha^{1-\gamma}.$$
26 COMPARISON OF STRATEGIES. Choose $\alpha$ to equalize the survival probabilities of signal variables and compare the expected number of retained noise variables. Strategies 1 and 3 are equivalent at this $\alpha$. Strategy 2 dominates strategy 3, i.e. strategy 3 selects more noise variables on average for the same probability of selecting a signal variable. Strategy 2 increases the number of retained noise variables by roughly a factor of four over strategy 1.
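The comparison can be reproduced directly from the formulas on the preceding slides (a sketch; $\gamma = 0.5$ and $k = 10$ are illustrative choices). With one signal variable per analysis, the expected numbers of retained noise variables are $1 - \vartheta^{(1)}$ for strategy 1, $2 - \vartheta^{(2)}$ for strategy 2, and $(k-1)\alpha$ for strategy 3, with $\alpha$ chosen so that $\alpha^{1-\gamma}$ matches the survival probability being imitated:

```python
import math

def theta1(gamma, k):
    """Exact theta^(1) = Gamma(2-gamma) Gamma(k) / Gamma(1-gamma+k)."""
    return math.gamma(2 - gamma) * math.gamma(k) / math.gamma(1 - gamma + k)

def theta21(gamma, k):
    """Exact theta^(21) = (1-gamma)(k-1) Gamma(2-gamma) Gamma(k-1) / Gamma(1-gamma+k)."""
    return ((1 - gamma) * (k - 1) * math.gamma(2 - gamma) * math.gamma(k - 1)
            / math.gamma(1 - gamma + k))

gamma, k = 0.5, 10
t1 = theta1(gamma, k)
t2 = t1 + theta21(gamma, k)

noise1 = 1 - t1                                   # strategy 1
noise2 = 2 - t2                                   # strategy 2
noise3_vs1 = (k - 1) * t1 ** (1 / (1 - gamma))    # strategy 3 matched to theta^(1)
noise3_vs2 = (k - 1) * t2 ** (1 / (1 - gamma))    # strategy 3 matched to theta^(2)
print(noise1, noise3_vs1)   # strategies 1 and 3 nearly equivalent
print(noise2, noise3_vs2)   # strategy 3 retains more noise than strategy 2
```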
27 EXPLORATORY PHASE. The procedure is intended to be largely exploratory. Data on the reduced set of variables should be subjected to a thorough exploratory analysis, e.g. inspecting interaction plots, using significance tests as an informal guide for identifying potential interactions.
28 FINAL PHASE: SETS OF WELL-FITTING MODELS
29 A POST-SELECTION INFERENCE PROBLEM. Confidence sets of models comprise all low-dimensional subsets that pass a likelihood-ratio test against the comprehensive model. The comprehensive model is selected in the light of the data, and so fits better than an arbitrary model nesting the one to be tested; $H_0$ is therefore rejected too often in hypothetical repeated sampling, and conditional coverage probabilities of model confidence sets are lower than nominal. A wasteful solution: split the sample.
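A toy version of the final phase (a sketch, not the talk's setup: a normal linear model with known error variance $\sigma^2 = 1$, so the likelihood-ratio statistic for a nested model reduces to the difference in residual sums of squares; the $\chi^2$ critical values are hard-coded):

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)

# Toy post-reduction setting: 6 retained variables, the first two are signal.
n, d = 200, 6
X = rng.standard_normal((n, d))
y = X[:, 0] + X[:, 1] + rng.standard_normal(n)   # known sigma^2 = 1

def rss(cols):
    """Residual sum of squares from least squares on the given columns."""
    Xs = X[:, list(cols)]
    bhat, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ bhat
    return r @ r

rss_full = rss(range(d))                         # comprehensive model

# Upper 1% points of chi-squared, indexed by degrees of freedom (hard-coded).
chi2_99 = {1: 6.63, 2: 9.21, 3: 11.34, 4: 13.28, 5: 15.09}

# Confidence set of models: all 2-variable subsets whose LR statistic
# against the comprehensive model is below the 1% critical value.
conf_set = [s for s in itertools.combinations(range(d), 2)
            if rss(s) - rss_full <= chi2_99[d - 2]]
print(conf_set)
```

Because the comprehensive model is itself chosen in the light of the data, the coverage of such a set is in general below nominal, which is the problem this slide describes; sample splitting restores validity at a cost in efficiency.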
30 A SIMPLE EXPERIMENT. In each Monte Carlo replication: generate $n = 10^2$ replicates of $v = 10^3$ variables from $N(0, P\Sigma P^{-1})$, $P$ a permutation matrix, where $\Sigma$ is an identity matrix with one diagonal block replaced by a correlation matrix of dimension $v_{S0} + v_{C0}$ with equal correlation $\rho$.
31 A SIMPLE EXPERIMENT (CONTINUED). For each $i = 1, \ldots, n$, $v_{S0}$ of the $v_{S0} + v_{C0}$ correlated variables are multiplied by a constant signal and added, together with standard normal noise: this is $Y_i$. Arrange the variable indices in a cube and reduce by strategy 2, followed by strategy 3 at the 0.1% level. Likelihood-ratio test against the comprehensive model.
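The data-generating step of the experiment can be sketched as follows (one illustrative cell of the $2^4$ design; the particular values of $v_{S0}$, $v_{C0}$, $\rho$ and the signal strength are assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)

n, v = 100, 1000
v_s0, v_c0, rho, signal = 5, 10, 0.5, 2.0   # illustrative factor levels

# Sigma: identity with one diagonal block replaced by an equicorrelation
# matrix of dimension v_s0 + v_c0.
b = v_s0 + v_c0
Sigma = np.eye(v)
Sigma[:b, :b] = rho + (1 - rho) * np.eye(b)

# Draw rows from N(0, Sigma), then permute the columns: a symmetric
# permutation of Sigma, i.e. covariance P Sigma P^{-1}.
L = np.linalg.cholesky(Sigma)
perm = rng.permutation(v)
X = (rng.standard_normal((n, v)) @ L.T)[:, perm]

# The first v_s0 of the correlated variables carry the signal; after the
# permutation they sit at these column positions.
signal_idx = np.flatnonzero(perm < v_s0)
y = signal * X[:, signal_idx].sum(axis=1) + rng.standard_normal(n)
print(X.shape, signal_idx)
```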
32 MC ESTIMATES: $2^4$ FACTORIAL EXPERIMENT. [Table of Monte Carlo estimates omitted: for each of the 16 combinations of $v_{S0}$, $v_{C0}$, $\rho$ and signal/noise, estimates of $\mathrm{pr}(S \subseteq \hat{S})$, $\mathrm{pr}(S \in \mathcal{M})$ and $E|\mathcal{M} \setminus S|$ under full-sample and split-sample analyses, with a lasso comparison; the numerical entries are garbled in this transcription.] Table caption: $S$ is the true set of signal variables, $\hat{S}$ is the set of variables surviving the reduction phase, $\mathcal{M}$ is the set of low-dimensional models whose likelihood-ratio test against the comprehensive model is not rejected at the 1% level. Empirical standard errors in parentheses. The lasso is the full-sample version, tuned to pick as many variables as are retained through the reduction phase.
33 ESTIMATED MAIN EFFECTS.

outcome | $v_{S0}$ | $v_{C0}$ | $\rho$ | signal/noise
$1\{S \subseteq \hat{S}\}$ | 1.88 (0.11) | 0.19 (0.09) | 0.28 (0.09) | 3.94 (0.26)
$1\{S \in \mathcal{M}\}$ | 1.56 (0.10) | 0.19 (0.09) | 0.26 (0.09) | 2.64 (0.14)
$|\mathcal{M} \setminus S|$ | 0.51 (0.00) | 0.48 (0.00) | 2.67 (0.01) | 1.35 (0.01)

Table: Monte Carlo estimates of the effects of $v_{S0}$, $v_{C0}$, $\rho$ and signal/noise (dichotomized to aid interpretation) on the logit scale for the binary outcomes (first two rows) and on the log scale for the counts (third row).
34 SUMMARY. If the data are compatible with several alternative reasonable explanations, one should aim to specify as many as is feasible. This view is in contraposition to that implicit in the use of the lasso and similar methods. Any choice between well-fitting models must be based on subject-matter expertise or additional data.
35 REFERENCES.
Cox, D. R. and Battey, H. S. (2017). Large numbers of explanatory variables, a semi-descriptive analysis. Proc. Nat. Acad. Sci., 114.
Battey, H. S. and Cox, D. R. (2018). Large numbers of explanatory variables: a probabilistic assessment. Proc. R. Soc. Lond. A, 474.
SOFTWARE. Matlab code is available at hbattey/softwarecube. An R package, HCmodelSets, written by H. H. Hoeltgebaum, is available on the Comprehensive R Archive Network.
More informationAnalysing categorical data using logit models
Analysing categorical data using logit models Graeme Hutcheson, University of Manchester The lecture notes, exercises and data sets associated with this course are available for download from: www.research-training.net/manchester
More informationGlossary. The ISI glossary of statistical terms provides definitions in a number of different languages:
Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the
More informationReview of Statistics 101
Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods
More informationApproximate Bayesian Computation
Approximate Bayesian Computation Michael Gutmann https://sites.google.com/site/michaelgutmann University of Helsinki and Aalto University 1st December 2015 Content Two parts: 1. The basics of approximate
More informationMIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE
MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE March 28, 2012 The exam is closed book. You are allowed a double sided one page cheat sheet. Answer the questions in the spaces provided on
More information8 Nominal and Ordinal Logistic Regression
8 Nominal and Ordinal Logistic Regression 8.1 Introduction If the response variable is categorical, with more then two categories, then there are two options for generalized linear models. One relies on
More informationPrediction & Feature Selection in GLM
Tarigan Statistical Consulting & Coaching statistical-coaching.ch Doctoral Program in Computer Science of the Universities of Fribourg, Geneva, Lausanne, Neuchâtel, Bern and the EPFL Hands-on Data Analysis
More informationAnalysis Methods for Supersaturated Design: Some Comparisons
Journal of Data Science 1(2003), 249-260 Analysis Methods for Supersaturated Design: Some Comparisons Runze Li 1 and Dennis K. J. Lin 2 The Pennsylvania State University Abstract: Supersaturated designs
More informationRecent Developments in Post-Selection Inference
Recent Developments in Post-Selection Inference Yotam Hechtlinger Department of Statistics yhechtli@andrew.cmu.edu Shashank Singh Department of Statistics Machine Learning Department sss1@andrew.cmu.edu
More informationTutorial on Approximate Bayesian Computation
Tutorial on Approximate Bayesian Computation Michael Gutmann https://sites.google.com/site/michaelgutmann University of Helsinki Aalto University Helsinki Institute for Information Technology 16 May 2016
More informationTuning Parameter Selection in L1 Regularized Logistic Regression
Virginia Commonwealth University VCU Scholars Compass Theses and Dissertations Graduate School 2012 Tuning Parameter Selection in L1 Regularized Logistic Regression Shujing Shi Virginia Commonwealth University
More informationFundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur
Fundamentals to Biostatistics Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur Statistics collection, analysis, interpretation of data development of new
More informationA Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models
A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models Jingyi Jessica Li Department of Statistics University of California, Los
More informationLinear Regression Models P8111
Linear Regression Models P8111 Lecture 25 Jeff Goldsmith April 26, 2016 1 of 37 Today s Lecture Logistic regression / GLMs Model framework Interpretation Estimation 2 of 37 Linear regression Course started
More informationNormal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification,
Likelihood Let P (D H) be the probability an experiment produces data D, given hypothesis H. Usually H is regarded as fixed and D variable. Before the experiment, the data D are unknown, and the probability
More informationExtended Bayesian Information Criteria for Model Selection with Large Model Spaces
Extended Bayesian Information Criteria for Model Selection with Large Model Spaces Jiahua Chen, University of British Columbia Zehua Chen, National University of Singapore (Biometrika, 2008) 1 / 18 Variable
More informationLinear discriminant functions
Andrea Passerini passerini@disi.unitn.it Machine Learning Discriminative learning Discriminative vs generative Generative learning assumes knowledge of the distribution governing the data Discriminative
More informationHigh-dimensional statistics and data analysis Course Part I
and data analysis Course Part I 3 - Computation of p-values in high-dimensional regression Jérémie Bigot Institut de Mathématiques de Bordeaux - Université de Bordeaux Master MAS-MSS, Université de Bordeaux,
More information