Big Data, Causal Modeling, and Estimation

1 Big Data, Causal Modeling, and Estimation The Center for Interdisciplinary Studies in Security and Privacy Summer Workshop Sherri Rose NSF Mathematical Sciences Postdoctoral Research Fellow Department of Biostatistics Johns Hopkins Bloomberg School of Public Health drsherrirose.com targetedlearningbook.com August 30, 2012 CRISSP (NYU-POLY) Big Data/Causal Modeling/Estimation August 30, / 58

2 General Research Areas
Robust estimation; causal inference; high-dimensional longitudinal data methods for complex observational data; sequential decision theory (e.g., dynamic regimes); ensemble machine learning in prediction and causal inference.
Most of my applications have been in medicine, public health, and biology, but these methods are very general and can be used in many disparate fields.

3 Albert Einstein
"To raise new questions, new possibilities, to regard old problems from a new angle, requires creative imagination and marks real advance in science."

4 Motivation
Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2(8): e124. Open access, freely available online.

Summary: There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.

Excerpts from the essay:
Published research findings are sometimes refuted by subsequent evidence, with ensuing confusion and disappointment. Refutation and controversy is seen across the range of [...] factors that influence this problem and some corollaries thereof.

Modeling the Framework for False Positive Findings. Several methodologists have pointed out [9-11] that the high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than [...]. Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values. Research findings are defined here as any relationship reaching formal statistical significance, e.g., effective interventions, informative predictors, risk factors, or associations. "Negative" research is also very useful. "Negative" is actually a misnomer, and the misinterpretation is widespread. However, here we will target relationships that investigators claim exist, rather than null findings. As has been shown previously, the [...]

It can be proven that most claimed research findings are false.

[...] is characteristic of the field and can vary a lot depending on whether the field targets highly likely relationships or searches for only one or a few true relationships among thousands and millions of hypotheses that may be postulated. Let us also consider, for computational simplicity, circumscribed fields where either there is only one true relationship (among many that can be hypothesized) or the power is similar to find any of the several existing true relationships. The pre-study probability of a relationship being true is $R/(R + 1)$. The probability of a study finding a true relationship reflects the power $1 - \beta$ (one minus the Type II error rate). The probability of claiming a relationship when none truly exists reflects the Type I error rate, $\alpha$. Assuming that c relationships are being probed in the field, the expected values of the 2 × 2 table are given in Table 1. After a research finding has been claimed based on achieving formal statistical significance, the post-study probability that it is true is the positive predictive value, PPV. The PPV is also the complementary probability of what Wacholder et al. have called the false positive report probability [10]. According to the 2 × 2 table, one gets $PPV = (1 - \beta)R / (R - \beta R + \alpha)$. A research finding is thus [...]
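The two quantities in the excerpt, the pre-study probability R/(R + 1) and PPV = (1 − β)R / (R − βR + α), are easy to evaluate numerically. The sketch below is only an illustration: the function names and the input values for R, power, and α are assumptions chosen for the example, not figures from the essay.

```python
def pre_study_probability(R):
    """Pre-study probability that a probed relationship is true: R / (R + 1)."""
    return R / (R + 1.0)


def ppv(R, power, alpha):
    """Post-study probability that a claimed finding is true.

    R     : ratio of true to no relationships among those probed
    power : 1 - beta, where beta is the Type II error rate
    alpha : Type I error rate
    """
    beta = 1.0 - power
    return (1.0 - beta) * R / (R - beta * R + alpha)


# Illustrative inputs only (hypothetical, not taken from the essay).
for R in (1.0, 0.1, 0.01):
    print(f"R={R:<5} pre-study={pre_study_probability(R):.3f} "
          f"PPV={ppv(R, power=0.8, alpha=0.05):.3f}")
```

With power and α held fixed, the PPV falls as R falls, which is the sense in which fields that probe many unlikely relationships produce more false claims.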

5 Motivation
Debate over Hormone Replacement Therapy (HRT)
Professional groups gave HRT their stamp of approval 15 years ago. Studies indicated that HRT was protective against osteoporosis and heart disease. In 1998, a study demonstrated increased risk of heart attack among women with heart disease taking HRT. In 2002, a study showed increased risk for breast cancer, heart disease, and stroke, among other ailments, for women on HRT.
Why were there inconsistencies in the study results?

6 Motivation
Debate over mammography
Mammography gained widespread acceptance as an effective tool for breast cancer screening in the 1980s. The Health Insurance Plan trial and Swedish Two-County trial demonstrated that mammography saved lives. In 2009, new recommendations from the U.S. Preventive Services Task Force came as a surprise. Among women without a family history, mammography is now recommended for women aged 50 to 74. Previous guidelines started at age 40.
Why was there a seemingly sudden paradigm shift?

7 [Big Data]

8 What Role Does Big Data Play in Biostatistics?
Many of the data-related problems biostatisticians face in the modern era involve Big Data. Examples: imaging data, post-market safety analysis, environmental health, medicine, genomics, ...

9 Imaging Data
Understanding the unique complexities of imaging data is no small task!
Ani Eloyan, PhD, Johns Hopkins University: "Brain imaging data mostly consist of collections of three-dimensional arrays collected over time resulting in a four-dimensional array for each subject. The first major issue in analyzing these data is the simple fact that our brains are very different in size, shape and so on. In many cases the transformation of the matrices into a common space, a form in which they can be compared to each other, is still an open problem which is hindering the analysis of the data."

10 Imaging Data: Competitions
Dr. Eloyan was part of the Johns Hopkins team that won a recent prediction contest examining attention deficit hyperactivity disorder (ADHD), the 2011 ADHD-200 Global Competition. They used neuroimaging data and other information to categorize subjects into neurotypical, ADHD primary inattentive type, or ADHD combined type diagnoses.
Eloyan et al. Automated Diagnoses of Attention Deficit Hyperactive Disorder using Magnetic Resonance Imaging. Frontiers in Systems Neuroscience, in press. Preprint:

11 Competitions Continued!
Public competitions involving the analysis of large databases are making continued mainstream appearances, following the $1 million Netflix Prize, where teams developed algorithms to improve upon the content provider's existing recommendation system for movies. Next up: the $3 million Heritage Health Prize Competition, where the goal is to predict future hospitalizations using existing high-dimensional patient data.

12 Medical Databases Continued: Safety Analysis
A behemoth example of a massive clinical database is the US Food and Drug Administration's Sentinel Initiative, which aims to monitor drugs and medical devices for safety over time. The end result of this program will be a national electronic system, and the new system already has access to 100 million people and their medical records. Consider the volume of medical data that one person can accumulate over a few years: repeated measurements of blood pressure, lung function, antibody concentrations, scans, etc. Multiply that by 100 million and you get an idea of the size of the database.

13 Medical Databases Continued: Safety Analysis
The sheer scale of this project and its longitudinal nature provide significant challenges. One complexity is accurately defining the data. For example: One must acknowledge that subjects drop out and are not followed for the entire time period, and this dropout is often not random and is due to a specific issue such as drug side effects. Traditional assumptions of parametric modeling are not likely to be supported by what is known about how the data were generated.

14 Safety Analysis
Mark van der Laan, PhD, UC Berkeley: "We need to use the state-of-the-art in estimation without relying on restrictive assumptions; we need methods that aim to learn from these large data sets as much as the data allow."

15 Electronic Health Records +
Electronic medical records are only part of big data. They are increasingly being combined with other big data sets. Example: environmental health issues such as air quality.

16 Air Quality
Studying air quality brings in an additional component: geography. Different regions have different particulate matter in the air.
Cory Zigler, PhD, Harvard: "There are satellites measuring markers of ambient air quality at increasingly fine spatial and temporal resolutions. But all the data in the world won't change some of the salient issues such as the fact that people who live near one another share many things in common in addition to the air they breathe. Teasing out the health effects of air pollution from other factors requires thoughtful statistical reasoning throughout the entire process: you must define the right question, choose the right spatial and temporal resolution of the data, ultimately apply the right analytical methods and interpret them correctly. This must be a combined effort from people with a wide array of quantitative skills."

17 Big Data and Medicine
Alessio Fasano, PhD, University of Maryland School of Medicine: "Imagine that you have in your hands the ability to unveil the secrets of human biology, to establish how the human host interacts and communicates with the parallel civilization of bacteria living in symbiosis with us, to understand the yin and yang between tolerance and immune response, and the ability to turn on and off autoimmune diseases at will. Imagine, in other words, that you have the power to decipher the secrets of complex diseases, so that innovative preventive and therapeutic interventions can be developed. All this is theoretically possible with celiac disease, the only autoimmune disease for which the environmental trigger is known." [Continued next slide...]

18 Big Data and Medicine
Alessio Fasano, PhD, University of Maryland School of Medicine: "However, these goals are achievable only if robust statistical methodologies are applied to elaborate the enormous amount of data that we have recently acquired, thanks to advances in our knowledge about celiac disease pathogenesis. Trying to make sense of the complexity of celiac disease without fundamentals in statistics is like trying to decipher Egyptian hieroglyphics without having the key to interpret them."
Dr. Fasano is leading innovative new projects studying the introduction of gluten in infants and their microbial environment, among other projects.

19 Genomics
Steven Salzberg, PhD, Johns Hopkins University: "Next-generation sequencing technology can now generate more data in a single day than the entire Human Genome Project generated in 12 years. It has transformed biomedical science. Simply moving this data around presents major challenges to many scientists and institutions: their networks just aren't fast enough. Analyzing the data is a much bigger problem. With such large data sets, it is all too easy to find rare statistical anomalies and to confuse them with real phenomena."

20 Genes and Privacy
Just this week, a new article came out in the New York Times: "Genes Now Tell Doctors Secrets They Can't Utter" by Gina Kolata. A quick synopsis is that subjects in studies of disease risk submitting samples typically sign a waiver that they wish to remain anonymous. Serious ethical issues arise when researchers, who are not clinicians, discover important findings in a subject's genes with substantial implications for the subject and/or biological family members.

21 A Few Specific Applied Projects...
The causal effect of leisure-time physical activity on mortality in the elderly. New prediction functions for mortality in elderly populations. Finding quantitative trait loci genes. When to initiate combined antiretroviral therapy in HIV-infected persons in the United States. Success of an in vitro fertilization program in a longitudinal study population. ...

22 [Causal Modeling]

23 Causal Modeling
MAIN TAKE HOME MESSAGE: Causal assumptions allow us to interpret the parameter of interest as a causal effect. These additional assumptions are untestable; we cannot use the data to verify their accuracy.
The causal modeling assumptions are separate from the chosen estimation procedure. A so-called causal estimation method is simply a statistical estimation method when causal assumptions are not made. The interpretation of the parameter will differ; it now has a statistical interpretation, but not a causal one.

24 Let's Step Back...
When we ask scientific questions, we frequently collect data in an attempt to answer these questions. In many areas of research, we are often interested in causal effects. That is to say, we prefer not to merely conclude that there is an association or correlation between two variables. Instead, we want to know that X causes Y.

25 Defining the Question of Interest
The first step is accurately defining the question of interest. This includes a clear description of the data, model, and parameter.

26 Defining the Question of Interest
DATA: Our study is an experiment where we draw a random variable from our population n times. The data we observe are realizations of these n random variables, and the random variables have an underlying probability distribution.
Formally: The data consist of n i.i.d. copies of random variable $O \sim P_0$, where $P_0$ is the true underlying probability distribution for O.
In this talk we'll explore a simple case, where O is defined as $O = (W, A, Y) \sim P_0$. W is a vector of baseline (first time point) variables, A is some intervention (often a treatment or exposure in biostatistics), and Y is an outcome.

27 Defining the Question of Interest
STATISTICAL MODEL: A statistical model in general represents the set of possible probability distributions of the data. Our statistical model should represent our knowledge about the data. We may wish to assume a nonparametric statistical model. Then we are saying that we know the data are comprised of observations on n independent and identically distributed random variables, which is a real assumption, but we make no other assumptions.
A parametric statistical model would assume that the probability distribution underlying the data is known (up to a certain number of parameters). Our statistical model makes no such assumption, as, in practice, it is widely known that nonsaturated parametric statistical models are wrong.

28 Defining the Question of Interest
CAUSAL ASSUMPTIONS: Now, we've made only those assumptions in our nonparametric model that are supported by the data. But there is nothing about the statistical model that allows us to interpret our parameter as causal... yet. We can make additional causal assumptions, and these assumptions combined with our statistical model are referred to simply as the model for the observed data.

29 More on Causal Assumptions
We can assume a structural causal model (SCM) (Pearl 2009), comprised of endogenous variables $X = (X_j : j)$ and exogenous variables $U = (U_{X_j} : j)$. The SCM describes that each $X_j$ is a deterministic function of other endogenous variables and an exogenous error $U_{X_j}$. The errors U are never observed. For each $X_j$ we characterize its parents from among X with $Pa(X_j)$. For example, in our simple study, $X = (W, A, Y)$ and $Pa(A) = W$. We know this due to the time ordering of the variables.

30 More on Causal Assumptions
Thus we can now write $X_j = f_{X_j}(Pa(X_j), U_{X_j})$, $j = 1, \ldots, J$, and the functional form of $f_{X_j}$ is often unspecified. An SCM can be fully parametric, but we do not do that here as our background knowledge does not support the assumptions involved.

31 More on Causal Assumptions: Our Example
We could specify the following SCM:
$W = f_W(U_W)$,
$A = f_A(W, U_A)$,
$Y = f_Y(W, A, U_Y)$.
Recall that we assume for the full data: (1) for each $X_j$, $X_j = f_{X_j}(Pa(X_j), U_{X_j})$ depends on the other endogenous variables only through the parents $Pa(X_j)$; (2) the exogenous variables have a particular joint distribution $P_U$.
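To make the SCM for O = (W, A, Y) concrete, the sketch below simulates one hypothetical instance of it. The particular functions f_W, f_A, f_Y, the binary coding of all three variables, and the independence of the exogenous errors are assumptions made purely for illustration; the talk leaves the functional forms unspecified. Because every variable is a deterministic function of its parents and its error, the counterfactual outcomes Y_1 and Y_0 are obtained by replacing the input a in f_Y while holding U_Y fixed.

```python
import random


def f_W(u_w):
    # Baseline covariate: depends only on its exogenous error U_W.
    return 1 if u_w < 0.5 else 0


def f_A(w, u_a):
    # Intervention: depends on its parent W and the exogenous error U_A.
    return 1 if u_a < (0.8 if w == 1 else 0.2) else 0


def f_Y(w, a, u_y):
    # Outcome: depends on its parents (W, A) and the exogenous error U_Y.
    return 1 if u_y < 0.2 + 0.1 * a + 0.5 * w else 0


def draw_unit(rng):
    """Draw one unit: observed (w, a, y) plus counterfactuals (y1, y0)."""
    u_w, u_a, u_y = rng.random(), rng.random(), rng.random()  # independent errors
    w = f_W(u_w)
    a = f_A(w, u_a)
    y = f_Y(w, a, u_y)
    y1 = f_Y(w, 1, u_y)  # outcome had A been set to 1, same U_Y
    y0 = f_Y(w, 0, u_y)  # outcome had A been set to 0, same U_Y
    return w, a, y, y1, y0


rng = random.Random(0)
units = [draw_unit(rng) for _ in range(20000)]

# Causal risk difference E(Y_1) - E(Y_0); only computable because we simulated.
causal_rd = sum(y1 - y0 for _, _, _, y1, y0 in units) / len(units)

# Unadjusted comparison of observed outcomes, which ignores confounding by W.
treated = [y for _, a, y, _, _ in units if a == 1]
untreated = [y for _, a, y, _, _ in units if a == 0]
naive_diff = sum(treated) / len(treated) - sum(untreated) / len(untreated)

print(f"causal RD = {causal_rd:.3f}, unadjusted difference = {naive_diff:.3f}")
```

In this made-up example the causal risk difference is 0.1 by construction, while the unadjusted contrast is far larger because W drives both A and Y.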

32 Causal Graph
Figure: Causal graphs (a)-(d) on the nodes W, A, Y with exogenous errors $U_W$, $U_A$, $U_Y$, under various assumptions about the distribution $P_U$.

33 A Note on Causal Assumptions
We could alternatively use the Neyman-Rubin Causal Model and assume (1) randomization ($A \perp Y_a \mid W$) and (2) the stable unit treatment value assumption (SUTVA; no interference between subjects and consistency assumption).

34 Defining the Question of Interest
PARAMETER: One possible target parameter, the risk difference:
$\psi_{RD} = \Psi(P) = E[E(Y \mid A = 1, W) - E(Y \mid A = 0, W)] = E(Y_1) - E(Y_0) = P(Y_1 = 1) - P(Y_0 = 1)$
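The middle equality in the risk difference, from the statistical quantity to the counterfactual means, is where the causal assumptions enter. A short derivation for a fixed level a is sketched below. It uses the randomization and consistency assumptions named in the talk, plus positivity ($P(A = a \mid W) > 0$), which is stated here as an added assumption so that the conditional expectations are defined.

```latex
\begin{align*}
E(Y_a) &= E\left[ E(Y_a \mid W) \right]
          && \text{(iterated expectation)} \\
       &= E\left[ E(Y_a \mid A = a, W) \right]
          && \text{(randomization: } A \perp Y_a \mid W \text{)} \\
       &= E\left[ E(Y \mid A = a, W) \right]
          && \text{(consistency: } Y = Y_a \text{ when } A = a \text{)}
\end{align*}
```

Taking the difference of this expression at a = 1 and a = 0 gives $E(Y_1) - E(Y_0) = E[E(Y \mid A = 1, W) - E(Y \mid A = 0, W)]$, and for a binary outcome $E(Y_a) = P(Y_a = 1)$.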

35 [Estimation]

36 The Need for Targeted Learning in Semiparametric Models
MLE/machine learning are not targeted for effect parameters. For that, we need a subsequent targeted bias-reduction step: Targeted MLE.
Targeted Learning: avoid reliance on human art and unrealistic (parametric) models; define interesting parameters; target the fit of the data-generating distribution to the parameter of interest; incorporate machine learning; statistical inference.

37 Targeted Maximum Likelihood Learning
Two-step procedure that incorporates estimates of the probability of the outcome given intervention and covariates as well as an estimate of the probability of intervention given covariates.

38 Targeted Maximum Likelihood Learning
Super Learner (van der Laan, Polley, and Hubbard 2007): Allows researchers to use multiple algorithms to outperform a single algorithm in semiparametric statistical models. It is related to stacking algorithms.
TMLE (van der Laan and Rubin 2006): With an initial estimate of the relevant part of the data-generating distribution obtained using super learning, the second stage of TMLE updates this initial fit in a step targeted toward making an optimal bias-variance tradeoff for the parameter of interest.

39 TMLE
TMLE: Double Robust. Removes the asymptotic residual bias of the initial estimator for the target parameter if it uses a consistent estimator of the intervention mechanism. If the initial estimator was consistent for the target parameter, the additional fitting of the data in the targeting step may remove finite sample bias, and it preserves the consistency property of the initial estimator.
TMLE: Efficiency. If the initial estimator and the intervention estimator are both consistent, then the TMLE is also asymptotically efficient according to semiparametric statistical model efficiency theory.

40 TMLE
TMLE: In Practice. Allows the incorporation of machine learning methods for the estimation of the outcome regression and the intervention mechanism, so that we do not make assumptions about the probability distribution $P_0$ that we do not believe. Thus, every effort is made to achieve minimal bias and the asymptotic semiparametric efficiency bound for the variance.
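The two-step procedure and the double robustness property can be sketched for the risk difference with a binary outcome. This is a minimal illustration, not the author's implementation: the simulated data, the function names, and the choice of estimators are all assumptions. The initial outcome fit deliberately ignores W (so it is inconsistent), while the intervention mechanism g(W) = P(A = 1 | W) is estimated by stratum proportions (so it is consistent). The targeting step uses the standard clever covariate H(A, W) = A/g(W) − (1 − A)/(1 − g(W)) and fits a single fluctuation parameter ε by logistic regression with the logit of the initial fit as offset.

```python
import math
import random


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def mean(values):
    return sum(values) / len(values)


def simulate(n, seed=1):
    """Hypothetical data: binary W confounds binary A and Y; true RD = 0.1."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        w = 1 if rng.random() < 0.5 else 0
        a = 1 if rng.random() < (0.8 if w else 0.2) else 0
        y = 1 if rng.random() < 0.2 + 0.1 * a + 0.5 * w else 0
        data.append((w, a, y))
    return data


def tmle_risk_difference(data):
    # Step 1a: initial outcome fit, deliberately misspecified (ignores W).
    q_init = {a: mean([y for _, ai, y in data if ai == a]) for a in (0, 1)}
    # Step 1b: intervention mechanism g(w) = P(A = 1 | W = w), by strata of W.
    g = {w: mean([a for wi, a, _ in data if wi == w]) for w in (0, 1)}

    def clever(a, w):
        return a / g[w] - (1 - a) / (1.0 - g[w])

    # Step 2: fit epsilon by maximum likelihood (Newton's method) in the
    # model logit Q*(A, W) = logit Q_init(A, W) + epsilon * H(A, W).
    eps = 0.0
    for _ in range(100):
        score, info = 0.0, 0.0
        for w, a, y in data:
            h = clever(a, w)
            p = expit(logit(q_init[a]) + eps * h)
            score += h * (y - p)
            info += h * h * p * (1.0 - p)
        step = score / info
        eps += step
        if abs(step) < 1e-10:
            break

    def q_star(a, w):
        return expit(logit(q_init[a]) + eps * clever(a, w))

    # Substitution estimators based on the initial and the updated fit.
    psi_init = q_init[1] - q_init[0]
    psi_tmle = mean([q_star(1, w) - q_star(0, w) for w, _, _ in data])
    return psi_init, psi_tmle


psi_init, psi_tmle = tmle_risk_difference(simulate(20000))
print(f"initial: {psi_init:.3f}  targeted: {psi_tmle:.3f}  truth: 0.100")
```

In this toy setting the initial plug-in estimate inherits the confounding bias, and the targeting step removes it because the intervention mechanism is estimated consistently. In practice both fits would come from more flexible learners such as the super learner.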

41 TMLE Algorithm
Figure: TMLE algorithm. Inputs: the observed data random variables $O_1, \ldots, O_n$ and the target parameter map $\Psi()$. Statistical model: the set of possible probability distributions of the data, containing the initial estimator of the probability distribution of the data $P_n^0$, the targeted estimator $P_n^*$, and the true probability distribution $P_0$. Values of target parameter: the initial estimator $\Psi(P_n^0)$, the targeted estimator $\Psi(P_n^*)$, and the true value (estimand) $\Psi(P_0)$, mapped to the real line with better estimates closer to the truth.

42 Landscape: Other Estimators
Maximum-Likelihood-Based Estimators: Maximum-likelihood-based substitution estimators will be of the type
$\psi_n = \Psi(Q_n) = \frac{1}{n} \sum_{i=1}^{n} \{ \bar{Q}_n(1, W_i) - \bar{Q}_n(0, W_i) \}$,
where this estimate is obtained by plugging $Q_n = (\bar{Q}_n, Q_{W,n})$ into the mapping $\Psi$, and $\bar{Q}_n(A = a, W_i) = E_n(Y \mid A = a, W_i)$.
Estimating-Equation-Based Methods: An estimating function is a function of the data O and the parameter of interest. If $D(\psi)(O)$ is an estimating function, then we can define a corresponding estimating equation,
$0 = \sum_{i=1}^{n} D(\psi)(O_i)$,
with solution $\psi_n$ satisfying $\sum_{i=1}^{n} D(\psi_n)(O_i) = 0$.
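Both estimator types can be computed by hand on a small data set. The sketch below uses a made-up sample of 12 observations (w, a, y). The substitution estimator plugs in cell means for the outcome regression. For the estimating-equation approach, the slide does not name a specific D, so the inverse-probability-weighted estimating function D(ψ)(O) = [A/g(W) − (1 − A)/(1 − g(W))] Y − ψ is used here as an assumed example; because it is linear in ψ, the estimating equation has a closed-form root.

```python
def mean(values):
    return sum(values) / len(values)


# Hypothetical toy data: tuples (w, a, y).
data = (
    [(0, 1, 1), (0, 1, 0)]
    + [(0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 0, 1)]
    + [(1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 0)]
    + [(1, 0, 1), (1, 0, 0)]
)


def fit_q_bar(data):
    """Outcome regression by cell means: q_bar[(a, w)] = E_n(Y | A=a, W=w)."""
    cells = {}
    for w, a, y in data:
        cells.setdefault((a, w), []).append(y)
    return {cell: mean(ys) for cell, ys in cells.items()}


def substitution_estimate(data):
    """psi_n = (1/n) * sum_i { q_bar(1, W_i) - q_bar(0, W_i) }."""
    q_bar = fit_q_bar(data)
    return mean([q_bar[(1, w)] - q_bar[(0, w)] for w, _, _ in data])


def fit_g(data):
    """Intervention mechanism by stratum proportions: g[w] = P_n(A=1 | W=w)."""
    strata = {}
    for w, a, _ in data:
        strata.setdefault(w, []).append(a)
    return {w: mean(a_values) for w, a_values in strata.items()}


def estimating_function(psi, obs, g):
    """Assumed IPW estimating function D(psi)(O)."""
    w, a, y = obs
    return (a / g[w] - (1 - a) / (1.0 - g[w])) * y - psi


def solve_estimating_equation(data):
    g = fit_g(data)
    # D is linear in psi, so the root of sum_i D(psi)(O_i) = 0 is the
    # sample mean of the weighted outcomes, i.e. of D(0)(O_i).
    return mean([estimating_function(0.0, obs, g) for obs in data])


psi_sub = substitution_estimate(data)
psi_ee = solve_estimating_equation(data)
print(psi_sub, psi_ee)
```

On this toy sample both approaches give a risk difference of 0.25 (up to floating-point rounding), which happens here because both nuisance fits are saturated in the binary W. With smoothed or misspecified fits the two estimators generally differ.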

43 Effect Estimation vs. Prediction
Both effect and prediction research questions are inherently estimation questions, but they are distinct in their goals.
Effect: Interested in estimating the effect of exposure on outcome adjusted for covariates (includes causal effects).
Prediction: Interested in generating a function to input covariates and predict a value for the outcome.
Effect parameters where no causal assumptions are made may be referred to as variable importance measures (VIMs), especially when one is creating a ranked list of effect measures.

44 The Prediction Estimation Problem
A loss function assigns a measure of performance to a candidate function (e.g., $\bar{Q}$) when applied to an observation O. We define our parameter of interest, $\bar{Q}_0 = E_0(Y \mid A, W)$, as the minimizer of the expected squared error loss:
$\bar{Q}_0 = \arg\min_{\bar{Q}} E_0 L(O, \bar{Q})$, where $L(O, \bar{Q}) = (Y - \bar{Q}(A, W))^2$.
$E_0 L(O, \bar{Q})$, which we want to be small, evaluates the candidate $\bar{Q}$, and it is minimized at the optimal choice $\bar{Q}_0$. We refer to expected loss as the risk.
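The squared error loss and its risk can be evaluated directly for candidate functions. The sketch below uses a made-up sample of 12 observations and two hypothetical candidates: a constant prediction and the cell means of Y within each (A, W) combination. It computes the empirical risk, the sample analogue of the expected loss, on the same data used for fitting; an honest comparison of candidates would use cross-validation instead.

```python
def mean(values):
    return sum(values) / len(values)


# Hypothetical toy data: tuples (w, a, y).
data = (
    [(0, 1, 1), (0, 1, 0)]
    + [(0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 0, 1)]
    + [(1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 0)]
    + [(1, 0, 1), (1, 0, 0)]
)


def squared_error_loss(obs, q_bar):
    """L(O, Q-bar) = (Y - Q-bar(A, W))^2 for one observation."""
    w, a, y = obs
    return (y - q_bar(a, w)) ** 2


def empirical_risk(data, q_bar):
    """Average loss over the sample (in-sample estimate of the risk)."""
    return mean([squared_error_loss(obs, q_bar) for obs in data])


# Candidate 1: ignore (A, W) and always predict the overall mean of Y.
overall = mean([y for _, _, y in data])


def q_constant(a, w):
    return overall


# Candidate 2: the sample conditional mean of Y within each (A, W) cell.
cells = {}
for w, a, y in data:
    cells.setdefault((a, w), []).append(y)
cell_means = {cell: mean(ys) for cell, ys in cells.items()}


def q_cell(a, w):
    return cell_means[(a, w)]


risk_constant = empirical_risk(data, q_constant)
risk_cell = empirical_risk(data, q_cell)
print(f"constant: {risk_constant:.4f}  cell means: {risk_cell:.4f}")
```

The cell-mean candidate has the smaller empirical risk, mirroring the statement that the conditional mean minimizes the expected squared error loss.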

45 Revisiting Super Learner
Super Learner: Suppose a researcher is interested in using several different parametric statistical models to estimate $E_0(Y \mid A, W)$. We can use these algorithms to build a library of algorithms consisting of all weighted averages of the algorithms. One of these weighted averages might perform better than any one of the algorithms alone. It is this principle that allows us to map a collection of algorithms into a library of weighted averages of these algorithms.

46 Super Learner

47 A few elevator pitches...
Risk Score Prediction in Elderly Populations

48 Background
Risk scores are calculated to identify those patients at the highest level of risk for disease or death. In some cases, interventions are implemented for patients at high risk. Prediction has been used most notably to generate tables for risk of heart disease and breast cancer. Standard practice for risk score prediction relies heavily on regression in parametric statistical models, assuming a functional form that is not known.

49 Background
In high-dimensional data, researchers often have dozens, hundreds, or even thousands of potential covariates to include in their parametric statistical model. Not only does this make it an impossible challenge to correctly specify the parametric statistical model for the conditional mean, but the complexity of the parametric statistical model may also increase to the point that there are more unknown parameters than observations. A fully saturated parametric statistical model will often result in a gross overfit of the data.

50 Background
Recent medical and epidemiologic studies for prediction have employed newer machine learning methods. Researchers are then left with questions such as, "When should I use random forest instead of standard regression techniques?"
Example of Opposite Findings for the Better Algorithm:
Austin et al. Logistic regression had superior performance compared with regression trees for predicting in-hospital mortality in patients hospitalized with heart failure. J Clin Epidemiol. 2010; 63(10):
Peng et al. Random forest can predict 30-day mortality of spontaneous intracerebral hemorrhage with remarkable discrimination. Eur J Neurol. 2010;17(7):

51 Elevator Pitch: Risk Score Prediction
Kaiser Permanente Database: Nested case-control sample (n = 27,012) from a Kaiser Permanente database of persons over the age of 65 in Northern California. Outcome was death. Covariates were 184 medical flags covering a variety of diseases, treatments, and conditions as well as gender and age. Generally weak signal with $R^2$ =
Rose, Fireman, van der Laan. Nested case-control risk score prediction. In: van der Laan, Rose. Targeted Learning: Causal Inference for Observational and Experimental Data. New York, NY: Springer, 2011:

52 Super Learner
1. Input data and the collection of 16 algorithms.
2. Split data into 10 blocks.
3. Fit each of the 16 algorithms on the training set (nonshaded blocks).
4. Predict the probabilities of death (Z) using the validation set (shaded block) for each algorithm, based on the corresponding training set fit.
5. Calculate estimated MSE within each validation set for each algorithm using Z. Average the risks across validation sets, resulting in one estimated cross-validated MSE for each algorithm.
6. Propose a family of weighted combinations of the 16 algorithms indexed by a weight vector α.
7. Use the probabilities (Z) to predict the outcome Y and estimate the vector α, thereby determining the combination that minimizes the cross-validated risk over the family of weighted combinations.
8. Fit each of the 16 algorithms on the complete data set. Combine these fits with the weights obtained in the previous step to generate the super learner predictor function.
Super learner function: $P_n(Y = 1 \mid Z) = \text{expit}(\alpha_{a,n} Z_a + \alpha_{b,n} Z_b + \ldots + \alpha_{p,n} Z_p)$
Figure: The collection of 16 algorithms (a, b, ..., p) is fit within each of the 10 blocks, giving cross-validated predictions $Z_{1,a}, \ldots, Z_{10,p}$ and one cross-validated MSE per algorithm ($\text{CV MSE}_a, \ldots, \text{CV MSE}_p$), followed by the family of weighted combinations.
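The eight steps can be sketched in a few dozen lines. The version below is a simplified illustration under stated assumptions, not the procedure used in the Kaiser Permanente analysis: it uses three toy algorithms instead of 16, simulated data with two binary covariates, and a grid search over convex weights in place of the logistic regression of Y on Z shown in the super learner function above. All function names are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def simulate(n, seed=0):
    """Hypothetical data: (x, y) with x = (x0, x1) binary, y binary."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = (int(rng.random() < 0.5), int(rng.random() < 0.5))
        y = int(rng.random() < 0.1 + 0.3 * x[0] + 0.4 * x[1])
        data.append((x, y))
    return data


def fit_overall_mean(train):
    m = mean([y for _, y in train])
    return lambda x: m


def make_stratified(index):
    """Algorithm that predicts the mean of Y within levels of x[index]."""
    def fit(train):
        overall = mean([y for _, y in train])
        groups = {}
        for x, y in train:
            groups.setdefault(x[index], []).append(y)
        means = {level: mean(ys) for level, ys in groups.items()}
        return lambda x: means.get(x[index], overall)
    return fit


ALGORITHMS = [fit_overall_mean, make_stratified(0), make_stratified(1)]


def cross_validated_predictions(data, algorithms, n_folds=10):
    """Steps 2-4: z[i][k] is algorithm k's prediction for observation i,
    from a fit that excluded the block containing observation i."""
    z = [[None] * len(algorithms) for _ in data]
    for fold in range(n_folds):
        train = [obs for i, obs in enumerate(data) if i % n_folds != fold]
        fits = [fit(train) for fit in algorithms]
        for i, (x, _) in enumerate(data):
            if i % n_folds == fold:
                z[i] = [f(x) for f in fits]
    return z


def cv_mse(data, z, weights):
    """Step 5 (and 7): cross-validated MSE of a weighted combination."""
    return mean([(y - sum(wk * zk for wk, zk in zip(weights, zi))) ** 2
                 for (_, y), zi in zip(data, z)])


def simplex_grid(k, steps=10):
    """Step 6: weight vectors with entries in {0, 0.1, ..., 1} summing to 1."""
    def rec(k, remaining):
        if k == 1:
            return [[remaining]]
        return [[j] + rest for j in range(remaining + 1)
                for rest in rec(k - 1, remaining - j)]
    return [[j / steps for j in combo] for combo in rec(k, steps)]


def super_learner(data, algorithms, n_folds=10):
    z = cross_validated_predictions(data, algorithms, n_folds)
    # Step 7: pick the combination minimizing the cross-validated risk.
    weights = min(simplex_grid(len(algorithms)),
                  key=lambda w: cv_mse(data, z, w))
    # Step 8: refit every algorithm on the complete data and combine.
    fits = [fit(data) for fit in algorithms]
    return weights, (lambda x: sum(wk * f(x) for wk, f in zip(weights, fits))), z


data = simulate(1000)
weights, predictor, z = super_learner(data, ALGORITHMS)
print("weights:", weights, "prediction for x=(1, 1):", round(predictor((1, 1)), 3))
```

Because the weight grid contains each single algorithm as a special case, the selected combination has a cross-validated MSE no larger than that of any individual algorithm in the collection.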

53 Elevator Pitch: Risk Score Prediction
Sonoma Data Set: Cohort study of n = 2,066 residents of Sonoma, CA aged 54 and over. Outcome was death. Covariates were gender, age, self-rated health, leisure-time physical activity, smoking status, cardiac event history, and chronic health condition status. Almost two-fold improvement ($R^2 = 0.200$) with less than 10% of the subjects and less than 10% of the number of covariates.
Rose. Mortality risk score prediction in an elderly population using machine learning. Am J Epid, in press.

54 Prediction Discussion
Previous literature indicates that perception of health in elderly adults may be as important as less subjective measures when assessing later outcomes (Idler & Benyamini 1997, Blazer 2008). Likewise, benefits of physical activity in older populations have also been shown (Denaei et al. 2009).
Even when the result is a negligible improvement relative to the best algorithms in the collection, the super learner provides a tool to run many algorithms and return a prediction function with the best (or equal) cross-validated MSE, avoiding the need to commit to a single algorithm.

55 Summary: The Road Map
DEFINING THE RESEARCH QUESTION
DATA: The data are n i.i.d. observations of random variable O. O has probability distribution $P_0$.
MODEL: The statistical model M is a set of possible probability distributions of O. $P_0$ is in M. The model is a statistical model for $P_0$ augmented with possible additional nontestable causal assumptions.
TARGET PARAMETER: The parameter $\Psi(P_0)$ is a particular feature of $P_0$, where $\Psi$ maps the probability distribution $P_0$ into the target parameter of interest.
ESTIMATION
SUPER LEARNER: The first step in our estimation procedure is an initial estimate of the relevant part $Q_0$ of $P_0$ using the machine learning algorithm super learner.
TARGETED MAXIMUM LIKELIHOOD ESTIMATION: With an initial estimate of the relevant part of the data-generating distribution obtained using super learning, the second stage of TMLE updates this initial fit in a step targeted toward making an optimal bias-variance tradeoff for the parameter of interest, now denoted $\Psi(Q_0)$, instead of the overall probability distribution.
INFERENCE
Standard errors are calculated for the estimator of the target parameter using the influence curve or resampling-based methods to assess the uncertainty in the estimator.
INTERPRETATION: The target parameter can be interpreted as a purely statistical parameter or as a causal parameter under possible additional nontestable assumptions in our model.
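For the risk difference, the influence-curve-based standard error mentioned in the road map can be sketched as follows. The talk does not write out the influence curve, so the standard efficient influence curve for this parameter, IC(O) = H(A, W)(Y − Q̄(A, W)) + Q̄(1, W) − Q̄(0, W) − ψ with H(A, W) = A/g(W) − (1 − A)/(1 − g(W)), is stated here as an assumption. The toy data, the fitted values, and the function name are hypothetical, and the normal-approximation interval is only meaningful for much larger samples than this one.

```python
import math


def influence_curve_inference(data, q_star, g, psi):
    """Standard error and 95% Wald interval from the estimated influence curve.

    data   : list of (w, a, y) tuples
    q_star : function (a, w) -> fitted outcome regression
    g      : function w -> fitted P(A = 1 | W = w)
    psi    : point estimate of the risk difference
    """
    ic = []
    for w, a, y in data:
        h = a / g(w) - (1 - a) / (1.0 - g(w))
        ic.append(h * (y - q_star(a, w)) + q_star(1, w) - q_star(0, w) - psi)
    n = len(ic)
    ic_mean = sum(ic) / n
    variance = sum((v - ic_mean) ** 2 for v in ic) / n
    se = math.sqrt(variance / n)
    return se, (psi - 1.96 * se, psi + 1.96 * se)


# Hypothetical toy data (w, a, y) with fits given by cell means and proportions.
toy = (
    [(0, 1, 1), (0, 1, 0)]
    + [(0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 0, 1)]
    + [(1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 0)]
    + [(1, 0, 1), (1, 0, 0)]
)
Q_STAR = {(1, 0): 0.5, (0, 0): 0.25, (1, 1): 0.75, (0, 1): 0.5}  # keyed by (a, w)
G = {0: 1.0 / 3.0, 1: 2.0 / 3.0}
psi = 0.25

se, ci = influence_curve_inference(
    toy, lambda a, w: Q_STAR[(a, w)], lambda w: G[w], psi)
print(f"psi = {psi:.3f}, SE = {se:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

With only 12 observations the interval is very wide, which is expected; the point of the sketch is the computation, namely the sample variance of the estimated influence curve divided by n.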

56 Targeted Learning Book (targetedlearningbook.com) van der Laan & Rose, Targeted Learning: Causal Inference for Observational and Experimental Data. New York: Springer.

57 Additional References
Pearl. Causality. New York: Cambridge University Press, 2nd edition.
Rose. Big data and the future. Significance, 9(4): 47-48. [The Big Data quotes were originally published in this article.]
Rose, Starmans, van der Laan. Targeted learning for causality and statistical analysis in medical research. In Qian Meng, Zhongguo Zheng, eds. Statistics: Discovering Your Future Power. Beijing: China Statistics Press.
van der Laan, Polley, Hubbard. Super Learner. SAGMB, 6(1):Article 25.
van der Laan, Rubin. Targeted maximum likelihood learning. Int J Biostat, 2(1):Article 11.

58 Acknowledgments
Johns Hopkins: Michael Rosenblum
UC Berkeley: Mark van der Laan
Funding: National Science Foundation, DMS (PI: S. Rose)

Targeted Learning. Sherri Rose. April 24, Associate Professor Department of Health Care Policy Harvard Medical School

Targeted Learning. Sherri Rose. April 24, Associate Professor Department of Health Care Policy Harvard Medical School Targeted Learning Sherri Rose Associate Professor Department of Health Care Policy Harvard Medical School Slides: drsherrirosecom/short-courses Code: githubcom/sherrirose/cncshortcourse April 24, 2017

More information

Targeted Maximum Likelihood Estimation in Safety Analysis

Targeted Maximum Likelihood Estimation in Safety Analysis Targeted Maximum Likelihood Estimation in Safety Analysis Sam Lendle 1 Bruce Fireman 2 Mark van der Laan 1 1 UC Berkeley 2 Kaiser Permanente ISPE Advanced Topics Session, Barcelona, August 2012 1 / 35

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2008 Paper 241 A Note on Risk Prediction for Case-Control Studies Sherri Rose Mark J. van der Laan Division

More information

Randomly Significant

Randomly Significant 1/66 Randomly Significant Why most science reporting is misleading Peter Hoff Statistics, University of Washington 2/66 Breakthroughs in Science Studies recently in the news: Female hurricanes are deadlier

More information

Targeted Learning for High-Dimensional Variable Importance

Targeted Learning for High-Dimensional Variable Importance Targeted Learning for High-Dimensional Variable Importance Alan Hubbard, Nima Hejazi, Wilson Cai, Anna Decker Division of Biostatistics University of California, Berkeley July 27, 2016 for Centre de Recherches

More information

Fair Inference Through Semiparametric-Efficient Estimation Over Constraint-Specific Paths

Fair Inference Through Semiparametric-Efficient Estimation Over Constraint-Specific Paths Fair Inference Through Semiparametric-Efficient Estimation Over Constraint-Specific Paths for New Developments in Nonparametric and Semiparametric Statistics, Joint Statistical Meetings; Vancouver, BC,

More information

Empirical Bayes Moderation of Asymptotically Linear Parameters

Empirical Bayes Moderation of Asymptotically Linear Parameters Empirical Bayes Moderation of Asymptotically Linear Parameters Nima Hejazi Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi nimahejazi.org twitter/@nshejazi github/nhejazi

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2014 Paper 327 Entering the Era of Data Science: Targeted Learning and the Integration of Statistics

More information

Targeted Group Sequential Adaptive Designs

Targeted Group Sequential Adaptive Designs Targeted Group Sequential Adaptive Designs Mark van der Laan Department of Biostatistics, University of California, Berkeley School of Public Health Liver Forum, May 10, 2017 Targeted Group Sequential

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2015 Paper 334 Targeted Estimation and Inference for the Sample Average Treatment Effect Laura B. Balzer

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2011 Paper 290 Targeted Minimum Loss Based Estimation of an Intervention Specific Mean Outcome Mark

More information

SIMPLE EXAMPLES OF ESTIMATING CAUSAL EFFECTS USING TARGETED MAXIMUM LIKELIHOOD ESTIMATION

SIMPLE EXAMPLES OF ESTIMATING CAUSAL EFFECTS USING TARGETED MAXIMUM LIKELIHOOD ESTIMATION Johns Hopkins University, Dept. of Biostatistics Working Papers 3-3-2011 SIMPLE EXAMPLES OF ESTIMATING CAUSAL EFFECTS USING TARGETED MAXIMUM LIKELIHOOD ESTIMATION Michael Rosenblum Johns Hopkins Bloomberg

More information

Construction and statistical analysis of adaptive group sequential designs for randomized clinical trials

Construction and statistical analysis of adaptive group sequential designs for randomized clinical trials Construction and statistical analysis of adaptive group sequential designs for randomized clinical trials Antoine Chambaz (MAP5, Université Paris Descartes) joint work with Mark van der Laan Atelier INSERM

More information

PubH 7470: STATISTICS FOR TRANSLATIONAL & CLINICAL RESEARCH

PubH 7470: STATISTICS FOR TRANSLATIONAL & CLINICAL RESEARCH PubH 7470: STATISTICS FOR TRANSLATIONAL & CLINICAL RESEARCH The First Step: SAMPLE SIZE DETERMINATION THE ULTIMATE GOAL The most important, ultimate step of any of clinical research is to do draw inferences;

More information

Comparative effectiveness of dynamic treatment regimes

Comparative effectiveness of dynamic treatment regimes Comparative effectiveness of dynamic treatment regimes An application of the parametric g- formula Miguel Hernán Departments of Epidemiology and Biostatistics Harvard School of Public Health www.hsph.harvard.edu/causal

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2011 Paper 282 Super Learner Based Conditional Density Estimation with Application to Marginal Structural

More information

Targeted Maximum Likelihood Estimation for Adaptive Designs: Adaptive Randomization in Community Randomized Trial

Targeted Maximum Likelihood Estimation for Adaptive Designs: Adaptive Randomization in Community Randomized Trial Targeted Maximum Likelihood Estimation for Adaptive Designs: Adaptive Randomization in Community Randomized Trial Mark J. van der Laan 1 University of California, Berkeley School of Public Health laan@berkeley.edu

More information

Statistical Inference for Data Adaptive Target Parameters

Statistical Inference for Data Adaptive Target Parameters Statistical Inference for Data Adaptive Target Parameters Mark van der Laan, Alan Hubbard Division of Biostatistics, UC Berkeley December 13, 2013 Mark van der Laan, Alan Hubbard ( Division of Biostatistics,

More information

Causal Inference with Big Data Sets

Causal Inference with Big Data Sets Causal Inference with Big Data Sets Marcelo Coca Perraillon University of Colorado AMC November 2016 1 / 1 Outlone Outline Big data Causal inference in economics and statistics Regression discontinuity

More information

Empirical Bayes Moderation of Asymptotically Linear Parameters

Empirical Bayes Moderation of Asymptotically Linear Parameters Empirical Bayes Moderation of Asymptotically Linear Parameters Nima Hejazi Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi nimahejazi.org twitter/@nshejazi github/nhejazi

More information

Combining multiple observational data sources to estimate causal eects

Combining multiple observational data sources to estimate causal eects Department of Statistics, North Carolina State University Combining multiple observational data sources to estimate causal eects Shu Yang* syang24@ncsuedu Joint work with Peng Ding UC Berkeley May 23,

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2010 Paper 260 Collaborative Targeted Maximum Likelihood For Time To Event Data Ori M. Stitelman Mark

More information

Causal Inference. Prediction and causation are very different. Typical questions are:

Causal Inference. Prediction and causation are very different. Typical questions are: Causal Inference Prediction and causation are very different. Typical questions are: Prediction: Predict Y after observing X = x Causation: Predict Y after setting X = x. Causation involves predicting

More information

Statistical Models for Causal Analysis

Statistical Models for Causal Analysis Statistical Models for Causal Analysis Teppei Yamamoto Keio University Introduction to Causal Inference Spring 2016 Three Modes of Statistical Inference 1. Descriptive Inference: summarizing and exploring

More information

Mediation for the 21st Century

Mediation for the 21st Century Mediation for the 21st Century Ross Boylan ross@biostat.ucsf.edu Center for Aids Prevention Studies and Division of Biostatistics University of California, San Francisco Mediation for the 21st Century

More information

Lecture 01: Introduction

Lecture 01: Introduction Lecture 01: Introduction Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South Carolina Lecture 01: Introduction

More information

Core Courses for Students Who Enrolled Prior to Fall 2018

Core Courses for Students Who Enrolled Prior to Fall 2018 Biostatistics and Applied Data Analysis Students must take one of the following two sequences: Sequence 1 Biostatistics and Data Analysis I (PHP 2507) This course, the first in a year long, two-course

More information

Compare Predicted Counts between Groups of Zero Truncated Poisson Regression Model based on Recycled Predictions Method

Compare Predicted Counts between Groups of Zero Truncated Poisson Regression Model based on Recycled Predictions Method Compare Predicted Counts between Groups of Zero Truncated Poisson Regression Model based on Recycled Predictions Method Yan Wang 1, Michael Ong 2, Honghu Liu 1,2,3 1 Department of Biostatistics, UCLA School

More information

Causal Inference for Case-Control Studies. Sherri Rose. A dissertation submitted in partial satisfaction of the. requirements for the degree of

Causal Inference for Case-Control Studies. Sherri Rose. A dissertation submitted in partial satisfaction of the. requirements for the degree of Causal Inference for Case-Control Studies By Sherri Rose A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Biostatistics in the Graduate Division

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2010 Paper 259 Targeted Maximum Likelihood Based Causal Inference Mark J. van der Laan University of

More information

CONTENTS OF DAY 2. II. Why Random Sampling is Important 10 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE

CONTENTS OF DAY 2. II. Why Random Sampling is Important 10 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE 1 2 CONTENTS OF DAY 2 I. More Precise Definition of Simple Random Sample 3 Connection with independent random variables 4 Problems with small populations 9 II. Why Random Sampling is Important 10 A myth,

More information

Clinical Trials. Olli Saarela. September 18, Dalla Lana School of Public Health University of Toronto.

Clinical Trials. Olli Saarela. September 18, Dalla Lana School of Public Health University of Toronto. Introduction to Dalla Lana School of Public Health University of Toronto olli.saarela@utoronto.ca September 18, 2014 38-1 : a review 38-2 Evidence Ideal: to advance the knowledge-base of clinical medicine,

More information

Probability and Probability Distributions. Dr. Mohammed Alahmed

Probability and Probability Distributions. Dr. Mohammed Alahmed Probability and Probability Distributions 1 Probability and Probability Distributions Usually we want to do more with data than just describing them! We might want to test certain specific inferences about

More information

Counterfactual Reasoning in Algorithmic Fairness

Counterfactual Reasoning in Algorithmic Fairness Counterfactual Reasoning in Algorithmic Fairness Ricardo Silva University College London and The Alan Turing Institute Joint work with Matt Kusner (Warwick/Turing), Chris Russell (Sussex/Turing), and Joshua

More information

Lecture 7: Interaction Analysis. Summer Institute in Statistical Genetics 2017

Lecture 7: Interaction Analysis. Summer Institute in Statistical Genetics 2017 Lecture 7: Interaction Analysis Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 39 Lecture Outline Beyond main SNP effects Introduction to Concept of Statistical Interaction

More information

Deductive Derivation and Computerization of Semiparametric Efficient Estimation

Deductive Derivation and Computerization of Semiparametric Efficient Estimation Deductive Derivation and Computerization of Semiparametric Efficient Estimation Constantine Frangakis, Tianchen Qian, Zhenke Wu, and Ivan Diaz Department of Biostatistics Johns Hopkins Bloomberg School

More information

Adaptive Trial Designs

Adaptive Trial Designs Adaptive Trial Designs Wenjing Zheng, Ph.D. Methods Core Seminar Center for AIDS Prevention Studies University of California, San Francisco Nov. 17 th, 2015 Trial Design! Ethical:!eg.! Safety!! Efficacy!

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2011 Paper 288 Targeted Maximum Likelihood Estimation of Natural Direct Effect Wenjing Zheng Mark J.

More information

Causal Inference Basics

Causal Inference Basics Causal Inference Basics Sam Lendle October 09, 2013 Observed data, question, counterfactuals Observed data: n i.i.d copies of baseline covariates W, treatment A {0, 1}, and outcome Y. O i = (W i, A i,

More information

Machine Learning (CS 567) Lecture 2

Machine Learning (CS 567) Lecture 2 Machine Learning (CS 567) Lecture 2 Time: T-Th 5:00pm - 6:20pm Location: GFS118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol

More information

Estimation of Optimal Treatment Regimes Via Machine Learning. Marie Davidian

Estimation of Optimal Treatment Regimes Via Machine Learning. Marie Davidian Estimation of Optimal Treatment Regimes Via Machine Learning Marie Davidian Department of Statistics North Carolina State University Triangle Machine Learning Day April 3, 2018 1/28 Optimal DTRs Via ML

More information

Semi-Nonparametric Inferences for Massive Data

Semi-Nonparametric Inferences for Massive Data Semi-Nonparametric Inferences for Massive Data Guang Cheng 1 Department of Statistics Purdue University Statistics Seminar at NCSU October, 2015 1 Acknowledge NSF, Simons Foundation and ONR. A Joint Work

More information

Targeted Maximum Likelihood Estimation for Dynamic Treatment Regimes in Sequential Randomized Controlled Trials

Targeted Maximum Likelihood Estimation for Dynamic Treatment Regimes in Sequential Randomized Controlled Trials From the SelectedWorks of Paul H. Chaffee June 22, 2012 Targeted Maximum Likelihood Estimation for Dynamic Treatment Regimes in Sequential Randomized Controlled Trials Paul Chaffee Mark J. van der Laan

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2004 Paper 155 Estimation of Direct and Indirect Causal Effects in Longitudinal Studies Mark J. van

More information

Propensity Score Weighting with Multilevel Data

Propensity Score Weighting with Multilevel Data Propensity Score Weighting with Multilevel Data Fan Li Department of Statistical Science Duke University October 25, 2012 Joint work with Alan Zaslavsky and Mary Beth Landrum Introduction In comparative

More information

Weighting. Homework 2. Regression. Regression. Decisions Matching: Weighting (0) W i. (1) -å l i. )Y i. (1-W i 3/5/2014. (1) = Y i.

Weighting. Homework 2. Regression. Regression. Decisions Matching: Weighting (0) W i. (1) -å l i. )Y i. (1-W i 3/5/2014. (1) = Y i. Weighting Unconfounded Homework 2 Describe imbalance direction matters STA 320 Design and Analysis of Causal Studies Dr. Kari Lock Morgan and Dr. Fan Li Department of Statistical Science Duke University

More information

The International Journal of Biostatistics

The International Journal of Biostatistics The International Journal of Biostatistics Volume 2, Issue 1 2006 Article 2 Statistical Inference for Variable Importance Mark J. van der Laan, Division of Biostatistics, School of Public Health, University

More information

Decision-making, inference, and learning theory. ECE 830 & CS 761, Spring 2016

Decision-making, inference, and learning theory. ECE 830 & CS 761, Spring 2016 Decision-making, inference, and learning theory ECE 830 & CS 761, Spring 2016 1 / 22 What do we have here? Given measurements or observations of some physical process, we ask the simple question what do

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2014 Paper 330 Online Targeted Learning Mark J. van der Laan Samuel D. Lendle Division of Biostatistics,

More information

Data splitting. INSERM Workshop: Evaluation of predictive models: goodness-of-fit and predictive power #+TITLE:

Data splitting. INSERM Workshop: Evaluation of predictive models: goodness-of-fit and predictive power #+TITLE: #+TITLE: Data splitting INSERM Workshop: Evaluation of predictive models: goodness-of-fit and predictive power #+AUTHOR: Thomas Alexander Gerds #+INSTITUTE: Department of Biostatistics, University of Copenhagen

More information

Probability. Introduction to Biostatistics

Probability. Introduction to Biostatistics Introduction to Biostatistics Probability Second Semester 2014/2015 Text Book: Basic Concepts and Methodology for the Health Sciences By Wayne W. Daniel, 10 th edition Dr. Sireen Alkhaldi, BDS, MPH, DrPH

More information

Probability. We will now begin to explore issues of uncertainty and randomness and how they affect our view of nature.

Probability. We will now begin to explore issues of uncertainty and randomness and how they affect our view of nature. Probability We will now begin to explore issues of uncertainty and randomness and how they affect our view of nature. We will explore in lab the differences between accuracy and precision, and the role

More information

Reading for Lecture 6 Release v10

Reading for Lecture 6 Release v10 Reading for Lecture 6 Release v10 Christopher Lee October 11, 2011 Contents 1 The Basics ii 1.1 What is a Hypothesis Test?........................................ ii Example..................................................

More information

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics

More information

Chapter 11. Correlation and Regression

Chapter 11. Correlation and Regression Chapter 11. Correlation and Regression The word correlation is used in everyday life to denote some form of association. We might say that we have noticed a correlation between foggy days and attacks of

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2010 Paper 269 Diagnosing and Responding to Violations in the Positivity Assumption Maya L. Petersen

More information

What Causality Is (stats for mathematicians)

What Causality Is (stats for mathematicians) What Causality Is (stats for mathematicians) Andrew Critch UC Berkeley August 31, 2011 Introduction Foreword: The value of examples With any hard question, it helps to start with simple, concrete versions

More information

Harvard University. A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome. Eric Tchetgen Tchetgen

Harvard University. A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome. Eric Tchetgen Tchetgen Harvard University Harvard University Biostatistics Working Paper Series Year 2014 Paper 175 A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome Eric Tchetgen Tchetgen

More information

Ignoring the matching variables in cohort studies - when is it valid, and why?

Ignoring the matching variables in cohort studies - when is it valid, and why? Ignoring the matching variables in cohort studies - when is it valid, and why? Arvid Sjölander Abstract In observational studies of the effect of an exposure on an outcome, the exposure-outcome association

More information

Causal Inference. Miguel A. Hernán, James M. Robins. May 19, 2017

Causal Inference. Miguel A. Hernán, James M. Robins. May 19, 2017 Causal Inference Miguel A. Hernán, James M. Robins May 19, 2017 ii Causal Inference Part III Causal inference from complex longitudinal data Chapter 19 TIME-VARYING TREATMENTS So far this book has dealt

More information

where Female = 0 for males, = 1 for females Age is measured in years (22, 23, ) GPA is measured in units on a four-point scale (0, 1.22, 3.45, etc.

where Female = 0 for males, = 1 for females Age is measured in years (22, 23, ) GPA is measured in units on a four-point scale (0, 1.22, 3.45, etc. Notes on regression analysis 1. Basics in regression analysis key concepts (actual implementation is more complicated) A. Collect data B. Plot data on graph, draw a line through the middle of the scatter

More information

COMP61011! Probabilistic Classifiers! Part 1, Bayes Theorem!

COMP61011! Probabilistic Classifiers! Part 1, Bayes Theorem! COMP61011 Probabilistic Classifiers Part 1, Bayes Theorem Reverend Thomas Bayes, 1702-1761 p ( T W ) W T ) T ) W ) Bayes Theorem forms the backbone of the past 20 years of ML research into probabilistic

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

PEARL VS RUBIN (GELMAN)

PEARL VS RUBIN (GELMAN) PEARL VS RUBIN (GELMAN) AN EPIC battle between the Rubin Causal Model school (Gelman et al) AND the Structural Causal Model school (Pearl et al) a cursory overview Dokyun Lee WHO ARE THEY? Judea Pearl

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2010 Paper 267 Optimizing Randomized Trial Designs to Distinguish which Subpopulations Benefit from

More information

When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Panel Data?

When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Panel Data? When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Panel Data? Kosuke Imai Department of Politics Center for Statistics and Machine Learning Princeton University Joint

More information

Selective Inference for Effect Modification

Selective Inference for Effect Modification Inference for Modification (Joint work with Dylan Small and Ashkan Ertefaie) Department of Statistics, University of Pennsylvania May 24, ACIC 2017 Manuscript and slides are available at http://www-stat.wharton.upenn.edu/~qyzhao/.

More information

Technical Track Session I: Causal Inference

Technical Track Session I: Causal Inference Impact Evaluation Technical Track Session I: Causal Inference Human Development Human Network Development Network Middle East and North Africa Region World Bank Institute Spanish Impact Evaluation Fund

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview

Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations

More information

BIOS 2041: Introduction to Statistical Methods

BIOS 2041: Introduction to Statistical Methods BIOS 2041: Introduction to Statistical Methods Abdus S Wahed* *Some of the materials in this chapter has been adapted from Dr. John Wilson s lecture notes for the same course. Chapter 0 2 Chapter 1 Introduction

More information

On the Use of the Bross Formula for Prioritizing Covariates in the High-Dimensional Propensity Score Algorithm

On the Use of the Bross Formula for Prioritizing Covariates in the High-Dimensional Propensity Score Algorithm On the Use of the Bross Formula for Prioritizing Covariates in the High-Dimensional Propensity Score Algorithm Richard Wyss 1, Bruce Fireman 2, Jeremy A. Rassen 3, Sebastian Schneeweiss 1 Author Affiliations:

More information

The Lady Tasting Tea. How to deal with multiple testing. Need to explore many models. More Predictive Modeling

The Lady Tasting Tea. How to deal with multiple testing. Need to explore many models. More Predictive Modeling The Lady Tasting Tea More Predictive Modeling R. A. Fisher & the Lady B. Muriel Bristol claimed she prefers tea added to milk rather than milk added to tea Fisher was skeptical that she could distinguish

More information

Causality II: How does causal inference fit into public health and what it is the role of statistics?

Causality II: How does causal inference fit into public health and what it is the role of statistics? Causality II: How does causal inference fit into public health and what it is the role of statistics? Statistics for Psychosocial Research II November 13, 2006 1 Outline Potential Outcomes / Counterfactual

More information

An Introduction to Mplus and Path Analysis

An Introduction to Mplus and Path Analysis An Introduction to Mplus and Path Analysis PSYC 943: Fundamentals of Multivariate Modeling Lecture 10: October 30, 2013 PSYC 943: Lecture 10 Today s Lecture Path analysis starting with multivariate regression

More information

A Sampling of IMPACT Research:

A Sampling of IMPACT Research: A Sampling of IMPACT Research: Methods for Analysis with Dropout and Identifying Optimal Treatment Regimes Marie Davidian Department of Statistics North Carolina State University http://www.stat.ncsu.edu/

More information

Topic 3: Hypothesis Testing

Topic 3: Hypothesis Testing CS 8850: Advanced Machine Learning Fall 07 Topic 3: Hypothesis Testing Instructor: Daniel L. Pimentel-Alarcón c Copyright 07 3. Introduction One of the simplest inference problems is that of deciding between

More information

Welcome! Webinar Biostatistics: sample size & power. Thursday, April 26, 12:30 1:30 pm (NDT)

Welcome! Webinar Biostatistics: sample size & power. Thursday, April 26, 12:30 1:30 pm (NDT) . Welcome! Webinar Biostatistics: sample size & power Thursday, April 26, 12:30 1:30 pm (NDT) Get started now: Please check if your speakers are working and mute your audio. Please use the chat box to

More information

Estimation of direct causal effects.

Estimation of direct causal effects. University of California, Berkeley From the SelectedWorks of Maya Petersen May, 2006 Estimation of direct causal effects. Maya L Petersen, University of California, Berkeley Sandra E Sinisi Mark J van

More information

Comparing Adaptive Interventions Using Data Arising from a SMART: With Application to Autism, ADHD, and Mood Disorders

Comparing Adaptive Interventions Using Data Arising from a SMART: With Application to Autism, ADHD, and Mood Disorders Comparing Adaptive Interventions Using Data Arising from a SMART: With Application to Autism, ADHD, and Mood Disorders Daniel Almirall, Xi Lu, Connie Kasari, Inbal N-Shani, Univ. of Michigan, Univ. of

More information

Previous lecture. P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing.

Previous lecture. P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing. Previous lecture P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing. Interaction Outline: Definition of interaction Additive versus multiplicative

More information

Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources

Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources Yi-Hau Chen Institute of Statistical Science, Academia Sinica Joint with Nilanjan

More information

A Decision Theoretic Approach to Causality

A Decision Theoretic Approach to Causality A Decision Theoretic Approach to Causality Vanessa Didelez School of Mathematics University of Bristol (based on joint work with Philip Dawid) Bordeaux, June 2011 Based on: Dawid & Didelez (2010). Identifying

More information

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3 University of California, Irvine 2017-2018 1 Statistics (STATS) Courses STATS 5. Seminar in Data Science. 1 Unit. An introduction to the field of Data Science; intended for entering freshman and transfers.

More information

Application of Time-to-Event Methods in the Assessment of Safety in Clinical Trials

Application of Time-to-Event Methods in the Assessment of Safety in Clinical Trials Application of Time-to-Event Methods in the Assessment of Safety in Clinical Trials Progress, Updates, Problems William Jen Hoe Koh May 9, 2013 Overview Marginal vs Conditional What is TMLE? Key Estimation

More information

CHL 5225H Advanced Statistical Methods for Clinical Trials: Multiplicity

CHL 5225H Advanced Statistical Methods for Clinical Trials: Multiplicity CHL 5225H Advanced Statistical Methods for Clinical Trials: Multiplicity Prof. Kevin E. Thorpe Dept. of Public Health Sciences University of Toronto Objectives 1. Be able to distinguish among the various

Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 14

CS 70. Introduction: One of the key properties of coin flips is independence: if you flip a fair coin ten times and get ten

An Introduction to Path Analysis

PRE 905: Multivariate Analysis, Lecture 10: April 15, 2014. Today's Lecture: Path analysis starting with multivariate regression then arriving

Structural Nested Mean Models for Assessing Time-Varying Effect Moderation. Daniel Almirall

Daniel Almirall, Center for Health Services Research, Durham VAMC & Dept. of Biostatistics, Duke University Medical Joint work

Personalized Treatment Selection Based on Randomized Clinical Trials. Tianxi Cai Department of Biostatistics Harvard School of Public Health

Outline: Motivation; A systematic approach to separating subpopulations

University of California, Berkeley

U.C. Berkeley Division of Biostatistics Working Paper Series, Year 2015, Paper 341. The Statistics of Sensitivity Analyses. Alexander R. Luedtke, Ivan Diaz, Mark J. van der

Marginal versus conditional effects: does it make a difference? Mireille Schnitzer, PhD Université de Montréal

Overview: In observational and experimental studies, the goal may be to estimate the effect

Individualized Treatment Effects with Censored Data via Nonparametric Accelerated Failure Time Models

Nicholas C. Henderson, Thomas A. Louis, Gary Rosner, Ravi Varadhan. Johns Hopkins University, July 31, 2018

Estimating direct effects in cohort and case-control studies

Ghent University. Direct effects: Introduction; Motivation; The problem of standard approaches; Controlled direct effect models. In many research

The Future of Healthcare? What Does the Future Hold? The Empowered Consumer

Dr. Anne W. Snowdon, RN, BScN, MSc, PhD, Chair, World Health Innovation Network, Odette School

CAUSAL INFERENCE IN THE EMPIRICAL SCIENCES. Judea Pearl University of California Los Angeles (www.cs.ucla.edu/~judea)

Outline: Inference: Statistical vs. Causal distinctions and mental barriers; Formal semantics

Applied Machine Learning Annalisa Marsico

OWL RNA Bioinformatics group, Max Planck Institute for Molecular Genetics, Free University of Berlin. 22 April, SoSe 2015. Goals: Feature Selection rather than Feature

A new approach to hierarchical data analysis: Targeted maximum likelihood estimation for the causal effect of a cluster-level exposure

arXiv:1706.02675v2 [stat.ME] 2 Apr 2018. Laura B. Balzer, Wenjing Zheng,

Econometrics I. Professor William Greene Stern School of Business Department of Economics 1-1/40. Part 1: Introduction

http://people.stern.nyu.edu/wgreene/econometrics/econometrics.htm Overview: This is an intermediate

FACTORIZATION MACHINES AS A TOOL FOR HEALTHCARE CASE STUDY ON TYPE 2 DIABETES DETECTION

SunLab: Enlighten the World. Ioakeim (Kimis) Perros and Jimeng Sun, perros@gatech.edu, jsun@cc.gatech.edu. COMPUTATIONAL
