Big Data, Causal Modeling, and Estimation


1 Big Data, Causal Modeling, and Estimation
The Center for Interdisciplinary Studies in Security and Privacy Summer Workshop
Sherri Rose, NSF Mathematical Sciences Postdoctoral Research Fellow, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health
drsherrirose.com, targetedlearningbook.com
CRISSP (NYU-POLY) Big Data/Causal Modeling/Estimation August 30, 2012

2 General Research Areas
Robust estimation
Causal inference
High-dimensional longitudinal data methods for complex observational data
Sequential decision theory (e.g., dynamic regimes)
Ensemble machine learning in prediction and causal inference
Most of my applications have been in the areas of medicine, public health, and biology, but these methods are very general and can be used in many disparate fields.

3 Albert Einstein
"To raise new questions, new possibilities, to regard old problems from a new angle, requires creative imagination and marks real advance in science."

4 Motivation
[Slide reproduces the first page of: Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2(8): e124. Open access, freely available online.]
Summary: There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.
Pull quote: "It can be proven that most claimed research findings are false."
Excerpt: Published research findings are sometimes refuted by subsequent evidence, with ensuing confusion and disappointment. Refutation and controversy is seen across the range of [...] factors that influence this problem and some corollaries thereof.
Modeling the Framework for False Positive Findings: Several methodologists have pointed out [9-11] that the high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values. Research findings are defined here as any relationship reaching formal statistical significance, e.g., effective interventions, informative predictors, risk factors, or associations. Negative research is also very useful. Negative is actually a misnomer, and the misinterpretation is widespread. However, here we will target relationships that investigators claim exist, rather than null findings. As has been shown previously, the [...] is characteristic of the field and can vary a lot depending on whether the field targets highly likely relationships or searches for only one or a few true relationships among thousands and millions of hypotheses that may be postulated. Let us also consider, for computational simplicity, circumscribed fields where either there is only one true relationship (among many that can be hypothesized) or the power is similar to find any of the several existing true relationships. The pre-study probability of a relationship being true is $R/(R + 1)$. The probability of a study finding a true relationship reflects the power $1 - \beta$ (one minus the Type II error rate). The probability of claiming a relationship when none truly exists reflects the Type I error rate, $\alpha$. Assuming that $c$ relationships are being probed in the field, the expected values of the 2 x 2 table are given in Table 1. After a research finding has been claimed based on achieving formal statistical significance, the post-study probability that it is true is the positive predictive value, PPV. The PPV is also the complementary probability of what Wacholder et al. have called the false positive report probability [10]. According to the 2 x 2 table, one gets $\mathrm{PPV} = (1 - \beta)R/(R - \beta R + \alpha)$. A research finding is thus [...]
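To make the quoted relationship concrete, here is a minimal Python sketch of the two quantities in the excerpt: the pre-study probability R/(R + 1) and PPV = (1 - β)R/(R - βR + α). The function names and the default values α = 0.05 and β = 0.2 are illustrative assumptions, not taken from the slide.

```python
def pre_study_probability(R):
    """Pre-study probability that a probed relationship is true, R / (R + 1)."""
    return R / (R + 1)


def ppv(R, alpha=0.05, beta=0.2):
    """Post-study probability that a claimed finding is true.

    R     : ratio of true relationships to no relationships in the field
    alpha : Type I error rate (default is an assumed, conventional value)
    beta  : Type II error rate, so power is 1 - beta (default is assumed)
    """
    power = 1 - beta
    return power * R / (R - beta * R + alpha)


if __name__ == "__main__":
    # Compare a field with even odds to fields with long odds.
    for R in (1.0, 0.1, 0.01):
        print(f"R = {R}: pre-study = {pre_study_probability(R):.3f}, PPV = {ppv(R):.3f}")
```

Under this formula a claimed finding is more likely true than false only when (1 - β)R exceeds α, which is why fields that probe many unlikely relationships can end up with a low PPV even at conventional error rates.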

5 Motivation: Debate over Hormone Replacement Therapy (HRT)
Professional groups gave HRT their stamp of approval 15 years ago. Studies indicated HRT was protective against osteoporosis and heart disease.
In 1998, a study demonstrated increased risk of heart attack among women with heart disease taking HRT.
In 2002, a study showed increased risk for breast cancer, heart disease, and stroke, among other ailments, for women on HRT.
Why were there inconsistencies in the study results?

6 Motivation: Debate over mammography
Mammography gained widespread acceptance as an effective tool for breast cancer screening in the 1980s. The Health Insurance Plan trial and Swedish Two-County trial demonstrated mammography saved lives.
In 2009, there was surprise over new recommendations from the U.S. Preventive Services Task Force. Among women without a family history, mammography is now recommended for women aged 50 to 74. Previous guidelines started at age 40.
Why was there a seemingly sudden paradigm shift?

7 [Big Data]

8 What Role Does Big Data Play in Biostatistics?
Many of the data-related problems biostatisticians face in the modern era involve Big Data.
Examples: imaging data, post-market safety analysis, environmental health, medicine, genomics, ...

9 Imaging Data
Understanding the unique complexities of imaging data is no small task!
Ani Eloyan, PhD, Johns Hopkins University: "Brain imaging data mostly consist of collections of three-dimensional arrays collected over time, resulting in a four-dimensional array for each subject. The first major issue in analyzing these data is the simple fact that our brains are very different in size, shape and so on. In many cases the transformation of the matrices into a common space, a form in which they can be compared to each other, is still an open problem which is hindering the analysis of the data."

10 Imaging Data: Competitions
Dr. Eloyan was part of the Johns Hopkins team that won a recent prediction contest examining attention deficit hyperactivity disorder (ADHD), the 2011 ADHD-200 Global Competition. They used neuroimaging data and other information to categorize subjects into neurotypical, ADHD primary inattentive type, or ADHD combined type diagnoses.
Eloyan et al. Automated Diagnoses of Attention Deficit Hyperactive Disorder using Magnetic Resonance Imaging. Frontiers in Systems Neuroscience, in press.

11 Competitions Continued!
Public competitions involving the analysis of large databases are making continued mainstream appearances, following the $1 million Netflix Prize, where teams developed algorithms to improve upon the content provider's existing recommendation system for movies.
Next up: the $3 million Heritage Health Prize Competition, where the goal is to predict future hospitalizations using existing high-dimensional patient data.

12 Medical Databases Continued: Safety Analysis
A behemoth example of a massive clinical database is the US Food and Drug Administration's Sentinel Initiative, which aims to monitor drugs and medical devices for safety over time. The end result of this program will be a national electronic system, and the new system already has access to 100 million people and their medical records.
Consider the volume of medical data that one person can accumulate over a few years: repeated measurements of blood pressure, lung function, antibody concentrations, scans, etc. Multiply that by 100 million and you get an idea of the size of the database.

13 Medical Databases Continued: Safety Analysis
The sheer scale of this project and its longitudinal nature provide significant challenges. One complexity is accurately defining the data.
For example: One must acknowledge that subjects drop out and are not followed for the entire time period, and this dropout is often not random but due to a specific issue such as drug side effects. Traditional assumptions of parametric modeling are not likely to be supported by what is known about how the data were generated.

14 Safety Analysis
Mark van der Laan, PhD, UC Berkeley: "We need to use the state-of-the-art in estimation without relying on restrictive assumptions; we need methods that aim to learn from these large data sets as much as the data allow."

15 Electronic Health Records +
Electronic medical records are only part of big data. They are increasingly being combined with other big data sets.
Example: Environmental health issues such as air quality.

16 Air Quality
Studying air quality brings in an additional component: geography. Different regions have different particulate matter in the air.
Cory Zigler, PhD, Harvard: "There are satellites measuring markers of ambient air quality at increasingly fine spatial and temporal resolutions. But all the data in the world won't change some of the salient issues, such as the fact that people who live near one another share many things in common in addition to the air they breathe. Teasing out the health effects of air pollution from other factors requires thoughtful statistical reasoning throughout the entire process: you must define the right question, choose the right spatial and temporal resolution of the data, ultimately apply the right analytical methods and interpret them correctly. This must be a combined effort from people with a wide array of quantitative skills."

17 Big Data and Medicine
Alessio Fasano, PhD, University of Maryland School of Medicine: "Imagine that you have in your hands the ability to unveil the secrets of human biology, to establish how the human host interacts and communicates with the parallel civilization of bacteria living in symbiosis with us, to understand the yin and yang between tolerance and immune response, and the ability to turn on and off autoimmune diseases at will. Imagine, in other words, that you have the power to decipher the secrets of complex diseases, so that innovative preventive and therapeutic interventions can be developed. All this is theoretically possible with celiac disease, the only autoimmune disease for which the environmental trigger is known." [Continued next slide...]

18 Big Data and Medicine
Alessio Fasano, PhD, University of Maryland School of Medicine: "However, these goals are achievable only if robust statistical methodologies are applied to elaborate the enormous amount of data that we have recently acquired, thanks to advances in our knowledge about celiac disease pathogenesis. Trying to make sense of the complexity of celiac disease without fundamentals in statistics is like trying to decipher Egyptian hieroglyphics without having the key to interpret them."
Dr. Fasano is leading innovative new projects studying the introduction of gluten in infants and their microbial environment, among other projects.

19 Genomics
Steven Salzberg, PhD, Johns Hopkins University: "Next-generation sequencing technology can now generate more data in a single day than the entire Human Genome Project generated in 12 years. It has transformed biomedical science. Simply moving this data around presents major challenges to many scientists and institutions: their networks just aren't fast enough. Analyzing the data is a much bigger problem. With such large data sets, it is all too easy to find rare statistical anomalies and to confuse them with real phenomena."

20 Genes and Privacy
Just this week, a new article came out in the New York Times: "Genes Now Tell Doctors Secrets They Can't Utter" by Gina Kolata.
A quick synopsis is that subjects in studies of disease risk who submit samples typically sign a waiver stating that they wish to remain anonymous. Serious ethical issues arise when researchers, who are not clinicians, discover important findings in a subject's genes with substantial implications for the subject and/or biological family members.

21 A Few Specific Applied Projects...
The causal effect of leisure-time physical activity on mortality in the elderly.
New prediction functions for mortality in elderly populations.
Finding quantitative trait loci genes.
When to initiate combined antiretroviral therapy in HIV-infected persons in the United States.
Success of an in vitro fertilization program in a longitudinal study population.
...

22 [Causal Modeling]

23 Causal Modeling
MAIN TAKE HOME MESSAGE: Causal assumptions allow us to interpret the parameter of interest as a causal effect. These additional assumptions are untestable; we cannot use the data to verify their accuracy.
The causal modeling assumptions are separate from the chosen estimation procedure. A so-called causal estimation method is simply a statistical estimation method when causal assumptions are not made. The interpretation of the parameter will differ; it now has a statistical interpretation, but not a causal one.

24 Let's Step Back...
When we ask scientific questions, we frequently collect data in an attempt to answer these questions.
In many areas of research, we are often interested in causal effects. That is to say, we prefer not to merely conclude that there is an association or correlation between two variables. Instead, we want to know that X causes Y.

25 Defining the Question of Interest
The first step is accurately defining the question of interest. This includes a clear description of the data, model, and parameter.

26 Defining the Question of Interest
DATA: Our study is an experiment where we draw a random variable from our population n times. The data we observe are realizations of these n random variables, and the random variables have an underlying probability distribution.
Formally: The data consist of n i.i.d. copies of random variable $O \sim P_0$, where $P_0$ is the true underlying probability distribution for O.
In this talk we'll explore a simple case, where O is defined as $O = (W, A, Y) \sim P_0$. W is a vector of baseline (first time point) variables, A is some intervention (often a treatment or exposure in biostatistics), and Y is an outcome.

27 Defining the Question of Interest
STATISTICAL MODEL: A statistical model in general represents the set of possible probability distributions of the data. Our statistical model should represent our knowledge about the data.
We may wish to assume a nonparametric statistical model. Then we are saying that we know the data are comprised of observations on n independent and identically distributed random variables, which is a real assumption, but we make no other assumptions.
A parametric statistical model would assume that the probability distribution underlying the data is known (up to a certain number of parameters). Our statistical model makes no such assumption, as, in practice, it is widely known that nonsaturated parametric statistical models are wrong.

28 Defining the Question of Interest
CAUSAL ASSUMPTIONS: Now, we've made only those assumptions in our nonparametric model that are supported by what we know about the data. But there is nothing about the statistical model that allows us to interpret our parameter as causal... yet.
We can make additional causal assumptions, and these assumptions combined with our statistical model are referred to simply as the model for the observed data.

29 More on Causal Assumptions
We can assume a structural causal model (SCM) (Pearl 2009), comprised of endogenous variables $X = (X_j : j)$ and exogenous variables $U = (U_{X_j} : j)$. The SCM describes that each $X_j$ is a deterministic function of other endogenous variables and an exogenous error $U_{X_j}$. The errors U are never observed. For each $X_j$ we characterize its parents from among X with $Pa(X_j)$.
For example, in our simple study, $X = (W, A, Y)$, and $Pa(A) = W$. We know this due to the time ordering of the variables.

30 More on Causal Assumptions
Thus we can now write:
$$X_j = f_{X_j}(Pa(X_j), U_{X_j}), \quad j = 1, \ldots, J,$$
and the functional form of $f_{X_j}$ is often unspecified. An SCM can be fully parametric, but we do not do that here as our background knowledge does not support the assumptions involved.

31 More on Causal Assumptions: Our Example
We could specify the following SCM:
$$W = f_W(U_W), \quad A = f_A(W, U_A), \quad Y = f_Y(W, A, U_Y).$$
Recall that we assume for the full data:
1. for each $X_j$, $X_j = f_{X_j}(Pa(X_j), U_{X_j})$ depends on the other endogenous variables only through the parents $Pa(X_j)$,
2. the exogenous variables have a particular joint distribution $P_U$.
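A small simulation may help make the SCM concrete. The sketch below assumes specific functional forms for f_W, f_A, and f_Y and independent uniform exogenous errors; all of these choices are hypothetical and chosen only for illustration, since the slide deliberately leaves the functions unspecified. Setting A by intervention (the `set_a` argument) replaces f_A while leaving the other equations and the errors untouched.

```python
import random

# Hypothetical structural equations (the talk leaves f_W, f_A, f_Y unspecified).
def f_W(u_w):
    return 1 if u_w < 0.4 else 0

def f_A(w, u_a):
    # Treatment is more likely when W = 1, so W confounds the A-Y relationship.
    return 1 if u_a < (0.7 if w == 1 else 0.3) else 0

def f_Y(w, a, u_y):
    # P(Y = 1 | W, A) = 0.2 + 0.3 W + 0.2 A, so the true risk difference is 0.2.
    return 1 if u_y < 0.2 + 0.3 * w + 0.2 * a else 0

def simulate(n, seed=0, set_a=None):
    """Draw n units from the SCM; if set_a is given, intervene to set A = set_a."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # Independent exogenous errors (an assumption about P_U).
        u_w, u_a, u_y = rng.random(), rng.random(), rng.random()
        w = f_W(u_w)
        a = f_A(w, u_a) if set_a is None else set_a
        data.append((w, a, f_Y(w, a, u_y)))
    return data

def mean_outcome(data):
    return sum(y for _, _, y in data) / len(data)

def naive_difference(data):
    """Unadjusted difference in mean outcome between observed treatment groups."""
    y1 = [y for _, a, y in data if a == 1]
    y0 = [y for _, a, y in data if a == 0]
    return sum(y1) / len(y1) - sum(y0) / len(y0)

if __name__ == "__main__":
    observed = simulate(20000)
    causal = mean_outcome(simulate(20000, set_a=1)) - mean_outcome(simulate(20000, set_a=0))
    print("E(Y_1) - E(Y_0) under intervention:", round(causal, 3))
    print("Unadjusted observed difference:   ", round(naive_difference(observed), 3))
```

Because the same seed reuses the same exogenous errors, the two intervened data sets are counterfactual outcomes for the same units. The unadjusted observed contrast differs from the interventional one because W affects both A and Y.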

32 Causal Graph
[Figure: four causal graphs, panels (a) through (d), each over the nodes W, A, Y with exogenous errors $U_W$, $U_A$, $U_Y$.]
Figure: Causal graphs with various assumptions about the distribution of $P_U$

33 A Note on Causal Assumptions
We could alternatively use the Neyman-Rubin Causal Model and assume (1) randomization ($A \perp Y_a \mid W$) and (2) the stable unit treatment value assumption (SUTVA; no interference between subjects and the consistency assumption).

34 Defining the Question of Interest
PARAMETER: One possible target parameter, the risk difference:
$$\psi_{RD} = \Psi(P) = E[E(Y \mid A = 1, W) - E(Y \mid A = 0, W)] = E(Y_1) - E(Y_0) = P(Y_1 = 1) - P(Y_0 = 1)$$
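As a worked illustration of the statistical side of this parameter, E[E(Y | A = 1, W) - E(Y | A = 0, W)], the following sketch computes it by stratifying on a single binary W and averaging the stratum-specific differences over the empirical distribution of W. The cell counts are made up for illustration, and the approach assumes every stratum of W contains both treated and untreated subjects.

```python
from collections import defaultdict

# Made-up cell counts keyed by (w, a, y); not data from the talk.
COUNTS = {
    (0, 0, 1): 8,  (0, 0, 0): 32,   # W=0, A=0: 8/40  = 0.20
    (0, 1, 1): 8,  (0, 1, 0): 12,   # W=0, A=1: 8/20  = 0.40
    (1, 0, 1): 5,  (1, 0, 0): 5,    # W=1, A=0: 5/10  = 0.50
    (1, 1, 1): 21, (1, 1, 0): 9,    # W=1, A=1: 21/30 = 0.70
}

def expand(counts):
    """Turn cell counts into a list of (w, a, y) observations."""
    data = []
    for cell, k in counts.items():
        data.extend([cell] * k)
    return data

def risk_difference(data):
    """Plug-in estimate of E[E(Y | A=1, W) - E(Y | A=0, W)] for discrete W."""
    total = defaultdict(int)
    events = defaultdict(int)
    for w, a, y in data:
        total[(w, a)] += 1
        events[(w, a)] += y

    def q_bar(a, w):
        # Empirical E(Y | A = a, W = w); requires the (w, a) cell to be non-empty.
        return events[(w, a)] / total[(w, a)]

    # Average the stratum-specific difference over the empirical distribution of W.
    return sum(q_bar(1, w) - q_bar(0, w) for w, _, _ in data) / len(data)

if __name__ == "__main__":
    print(round(risk_difference(expand(COUNTS)), 4))
```

With these counts the stratum-specific differences are 0.2 in both strata, so the standardized risk difference is 0.2 even though the unadjusted difference in means is 29/50 - 13/50 = 0.32. Whether the 0.2 can be read as E(Y_1) - E(Y_0) depends on the untestable causal assumptions, not on the computation.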

35 [Estimation]

36 The Need for Targeted Learning in Semiparametric Models
MLE/machine learning are not targeted for effect parameters. For that, we need a subsequent targeted bias-reduction step: Targeted MLE.
Targeted Learning:
Avoid reliance on human art and unrealistic (parametric) models
Define interesting parameters
Target the fit of data-generating distribution to the parameter of interest
Incorporate machine learning
Statistical inference

37 Targeted Maximum Likelihood Learning
A two-step procedure that incorporates an estimate of the probability of the outcome given intervention and covariates as well as an estimate of the probability of intervention given covariates.

38 Targeted Maximum Likelihood Learning
Super Learner (van der Laan, Polley, and Hubbard 2007): Allows researchers to use multiple algorithms to outperform a single algorithm in semiparametric statistical models. It is related to stacking algorithms.
TMLE (van der Laan and Rubin 2006): With an initial estimate of the relevant part of the data-generating distribution obtained using super learning, the second stage of TMLE updates this initial fit in a step targeted toward making an optimal bias-variance tradeoff for the parameter of interest.

39 TMLE
TMLE: Double Robust
Removes the asymptotic residual bias of the initial estimator for the target parameter, if it uses a consistent estimator of the intervention mechanism.
If the initial estimator was consistent for the target parameter, the additional fitting of the data in the targeting step may remove finite sample bias, and it preserves the consistency property of the initial estimator.
TMLE: Efficiency
If the initial estimator and the intervention estimator are both consistent, then TMLE is also asymptotically efficient according to semiparametric statistical model efficiency theory.

40 TMLE
TMLE: In Practice
Allows the incorporation of machine learning methods for the estimation of the outcome regression and the intervention mechanism, so that we do not make assumptions about the probability distribution $P_0$ that we do not believe.
Thus, every effort is made to achieve minimal bias and the asymptotic semiparametric efficiency bound for the variance.
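The two-step idea can be sketched in a few lines for a binary outcome, binary intervention, and a single binary covariate. The slides do not spell out the targeting step, so the sketch assumes the commonly described logistic fluctuation with covariate H(A, W) = A/g(W) - (1 - A)/(1 - g(W)); the data are made-up counts, the initial outcome fit deliberately ignores W, and the intervention mechanism g is estimated by stratified proportions. None of this is the super-learner-based implementation the talk has in mind.

```python
import math
from collections import defaultdict

# Made-up cell counts keyed by (w, a, y), used only for illustration.
COUNTS = {(0, 0, 1): 8, (0, 0, 0): 32, (0, 1, 1): 8, (0, 1, 0): 12,
          (1, 0, 1): 5, (1, 0, 0): 5, (1, 1, 1): 21, (1, 1, 0): 9}
DATA = [cell for cell, k in COUNTS.items() for _ in range(k)]

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def fit_q_ignoring_w(data):
    """Step 1 (initial estimate): a misspecified outcome fit E(Y | A) that ignores W."""
    mean_by_a = {}
    for arm in (0, 1):
        ys = [y for _, a, y in data if a == arm]
        mean_by_a[arm] = sum(ys) / len(ys)
    return lambda a, w: mean_by_a[a]

def plug_in(data, q):
    """Substitution estimate of the risk difference based on outcome fit q."""
    return sum(q(1, w) - q(0, w) for w, _, _ in data) / len(data)

def tmle_risk_difference(data, q_init, tol=1e-10, max_iter=50):
    # Intervention mechanism g(W) = P(A = 1 | W), estimated within strata of W.
    n_w, n_w1 = defaultdict(int), defaultdict(int)
    for w, a, _ in data:
        n_w[w] += 1
        n_w1[w] += a
    g = {w: n_w1[w] / n_w[w] for w in n_w}

    def h(a, w):
        return a / g[w] - (1 - a) / (1 - g[w])

    # Step 2 (targeting): fit eps in logit Q*(A,W) = logit Q0(A,W) + eps * H(A,W)
    # by maximum likelihood, using Newton-Raphson on the one-dimensional score.
    eps = 0.0
    for _ in range(max_iter):
        score = info = 0.0
        for w, a, y in data:
            p = expit(logit(q_init(a, w)) + eps * h(a, w))
            score += h(a, w) * (y - p)
            info += h(a, w) ** 2 * p * (1 - p)
        if abs(score) < tol:
            break
        eps += score / info

    def q_star(a, w):
        return expit(logit(q_init(a, w)) + eps * h(a, w))

    return plug_in(data, q_star), eps

if __name__ == "__main__":
    q0 = fit_q_ignoring_w(DATA)
    psi, eps = tmle_risk_difference(DATA, q0)
    print("initial plug-in:", round(plug_in(DATA, q0), 4))
    print("targeted estimate:", round(psi, 4), "with eps =", round(eps, 4))
```

In this toy setting the initial plug-in reproduces the unadjusted difference, and the targeting step moves the estimate to the W-adjusted value, which illustrates the double robustness claim: the outcome fit is wrong, but the intervention mechanism is estimated consistently.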

41 TMLE Algorithm
[Figure: schematic of the TMLE algorithm. INPUTS: observed data random variables $O_1, \ldots, O_n$ and the target parameter map $\Psi()$. STATISTICAL MODEL (set of possible probability distributions of the data): contains the initial estimator $P_n^0$ of the probability distribution of the data, the targeted estimator $P_n^*$, and the true probability distribution $P_0$. VALUES OF TARGET PARAMETER (values mapped to the real line, with better estimates closer to the truth): initial estimator $\Psi(P_n^0)$, targeted estimator $\Psi(P_n^*)$, and true value (estimand) $\Psi(P_0)$.]

42 Landscape: Other Estimators
Maximum-Likelihood-Based Estimators
Maximum-likelihood-based substitution estimators will be of the type
$$\psi_n = \Psi(Q_n) = \frac{1}{n} \sum_{i=1}^{n} \{ \bar{Q}_n(1, W_i) - \bar{Q}_n(0, W_i) \},$$
where this estimate is obtained by plugging $Q_n = (\bar{Q}_n, Q_{W,n})$ into the mapping $\Psi$, with $\bar{Q}_n(A = a, W_i) = E_n(Y \mid A = a, W_i)$.
Estimating-Equation-Based Methods
An estimating function is a function of the data O and the parameter of interest. If $D(\psi)(O)$ is an estimating function, then we can define a corresponding estimating equation
$$0 = \sum_{i=1}^{n} D(\psi)(O_i),$$
with solution $\psi_n$ satisfying $\sum_{i=1}^{n} D(\psi_n)(O_i) = 0$.
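As one concrete instance of an estimating function (an illustration that is not on the slide), consider inverse probability of treatment weighting for the risk difference. Here $g_n(a \mid W)$ denotes an estimate of the intervention mechanism $P_0(A = a \mid W)$, a symbol introduced for this example. Because this estimating function is linear in $\psi$, the estimating equation has a closed-form solution:

```latex
\begin{align*}
D(\psi)(O) &= \left( \frac{I(A = 1)}{g_n(1 \mid W)} - \frac{I(A = 0)}{g_n(0 \mid W)} \right) Y - \psi, \\
0 = \sum_{i=1}^{n} D(\psi_n)(O_i)
\quad &\Longrightarrow \quad
\psi_n = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{I(A_i = 1)}{g_n(1 \mid W_i)} - \frac{I(A_i = 0)}{g_n(0 \mid W_i)} \right) Y_i .
\end{align*}
```

Unlike the substitution estimator, this solution is not obtained by plugging a fitted distribution into $\Psi$, so nothing forces it to respect the bounds of the parameter (for a risk difference, the interval from -1 to 1).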

43 Effect Estimation vs. Prediction
Both effect and prediction research questions are inherently estimation questions, but they are distinct in their goals.
Effect: Interested in estimating the effect of exposure on outcome adjusted for covariates (includes causal effects).
Prediction: Interested in generating a function to input covariates and predict a value for the outcome.
Effect parameters where no causal assumptions are made may be referred to as variable importance measures (VIMs), especially when one is creating a ranked list of effect measures.

44 The Prediction Estimation Problem
A loss function assigns a measure of performance to a candidate function (e.g., $\bar{Q}$) when applied to an observation O. We define our parameter of interest, $\bar{Q}_0 = E_0(Y \mid A, W)$, as the minimizer of the expected squared error loss:
$$\bar{Q}_0 = \arg\min_{\bar{Q}} E_0 L(O, \bar{Q}),$$
where $L(O, \bar{Q}) = (Y - \bar{Q}(A, W))^2$. $E_0 L(O, \bar{Q})$, which we want to be small, evaluates the candidate $\bar{Q}$, and it is minimized at the optimal choice $\bar{Q}_0$. We refer to expected loss as the risk.
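A short sketch of how the squared error loss ranks candidate functions: the code below computes the empirical risk (the sample average of the loss) for two hypothetical candidates on made-up counts. One candidate predicts the overall mean of Y, the other predicts the mean of Y within each (A, W) cell. The candidates and data are assumptions for illustration, and the risk here is computed in-sample rather than by cross-validation.

```python
# Made-up cell counts keyed by (w, a, y), used only for illustration.
COUNTS = {(0, 0, 1): 8, (0, 0, 0): 32, (0, 1, 1): 8, (0, 1, 0): 12,
          (1, 0, 1): 5, (1, 0, 0): 5, (1, 1, 1): 21, (1, 1, 0): 9}
DATA = [cell for cell, k in COUNTS.items() for _ in range(k)]

def empirical_risk(data, q_bar):
    """Sample average of the squared error loss L(O, Q) = (Y - Q(A, W))^2."""
    return sum((y - q_bar(a, w)) ** 2 for w, a, y in data) / len(data)

def constant_candidate(data):
    """Candidate that ignores A and W and predicts the overall mean of Y."""
    m = sum(y for _, _, y in data) / len(data)
    return lambda a, w: m

def stratified_candidate(data):
    """Candidate that predicts the mean of Y within each (A, W) cell."""
    total, events = {}, {}
    for w, a, y in data:
        total[(a, w)] = total.get((a, w), 0) + 1
        events[(a, w)] = events.get((a, w), 0) + y
    return lambda a, w: events[(a, w)] / total[(a, w)]

if __name__ == "__main__":
    print("risk, constant:  ", round(empirical_risk(DATA, constant_candidate(DATA)), 4))
    print("risk, stratified:", round(empirical_risk(DATA, stratified_candidate(DATA)), 4))
```

The stratified candidate has the smaller empirical risk here because it is the empirical conditional mean, which minimizes squared error in the sample. With richer candidates, in-sample risk rewards overfitting, which is why the super learner compares candidates on cross-validated risk instead.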

45 Revisiting Super Learner
Suppose a researcher is interested in using several different parametric statistical models to estimate $E_0(Y \mid A, W)$.
We can use these algorithms to build a library of algorithms consisting of all weighted averages of the algorithms.
One of these weighted averages might perform better than any one of the algorithms alone.
It is this principle that allows us to map a collection of algorithms into a library of weighted averages of these algorithms.

46 Super Learner

47 A few elevator pitches...
Risk Score Prediction in Elderly Populations

48 Background
Risk scores are calculated to identify those patients at the highest level of risk for disease or death. In some cases, interventions are implemented for patients at high risk.
Prediction has been used most notably to generate tables for risk of heart disease and breast cancer.
Standard practice for risk score prediction relies heavily on regression in parametric statistical models, assuming a functional form that is not known.

49 Background
In high-dimensional data, researchers often have dozens, hundreds, or even thousands of potential covariates to include in their parametric statistical model.
Not only does this make it practically impossible to correctly specify the parametric statistical model for the conditional mean, but the complexity of the parametric statistical model may also increase to the point that there are more unknown parameters than observations.
A fully saturated parametric statistical model will often result in a gross overfit of the data.

50 Background
Recent medical and epidemiologic studies for prediction have employed newer machine learning methods.
Researchers are then left with questions such as, "When should I use random forest instead of standard regression techniques?"
Example of Opposite Findings for the Better Algorithm:
Austin et al. Logistic regression had superior performance compared with regression trees for predicting in-hospital mortality in patients hospitalized with heart failure. J Clin Epidemiol. 2010; 63(10):
Peng et al. Random forest can predict 30-day mortality of spontaneous intracerebral hemorrhage with remarkable discrimination. Eur J Neurol. 2010; 17(7):

51 Elevator Pitch: Risk Score Prediction
Kaiser Permanente Database
Nested case-control sample (n = 27,012) from a Kaiser Permanente database of persons over the age of 65 in Northern California.
Outcome was death. Covariates were 184 medical flags covering a variety of diseases, treatments, and conditions as well as gender and age.
Generally weak signal with $R^2$ =
Rose, Fireman, van der Laan. Nested case-control risk score prediction. In: van der Laan, Rose. Targeted Learning: Causal Inference for Observational and Experimental Data. New York, NY: Springer, 2011:

52 Super Learner
1. Input data and the collection of 16 algorithms.
2. Split data into 10 blocks.
3. Fit each of the 16 algorithms on the training set (nonshaded blocks).
4. Predict the probabilities of death (Z) using the validation set (shaded block) for each algorithm, based on the corresponding training set fit.
5. Calculate estimated MSE within each validation set for each algorithm using Z. Average the risks across validation sets, resulting in one estimated cross-validated MSE for each algorithm.
6. Propose a family of weighted combinations of the 16 algorithms indexed by a weight vector $\alpha$.
7. Use the probabilities (Z) to predict the outcome Y and estimate the vector $\alpha$, thereby determining the combination that minimizes the cross-validated risk over the family of weighted combinations.
8. Fit each of the 16 algorithms on the complete data set. Combine these fits with the weights obtained in the previous step to generate the super learner predictor function.
Super learner function: $P_n(Y = 1 \mid Z) = \operatorname{expit}(\alpha_{a,n} Z_a + \alpha_{b,n} Z_b + \ldots + \alpha_{p,n} Z_p)$
[Figure: the data are split into 10 blocks; the collection of 16 algorithms (algorithm a, algorithm b, ..., algorithm p) produces validation-set predictions $Z_{1,a}, \ldots, Z_{10,p}$ and cross-validated MSEs $\mathrm{CV\,MSE}_a, \ldots, \mathrm{CV\,MSE}_p$, which feed the family of weighted combinations.]
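The eight steps can be sketched with the standard library alone. This is a toy version under several assumptions that depart from the slide: the data are simulated from a hypothetical model, the collection holds three simple algorithms rather than 16, and the weights are chosen by a grid search over convex combinations instead of the expit-linear combination shown on the slide. All function names are illustrative.

```python
import itertools
import random

def simulate(n, seed=1):
    """Hypothetical data-generating process for (W, A, Y); not from the talk."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        w = 1 if rng.random() < 0.4 else 0
        a = 1 if rng.random() < (0.7 if w else 0.3) else 0
        y = 1 if rng.random() < 0.2 + 0.3 * w + 0.2 * a else 0
        data.append((w, a, y))
    return data

def fit_mean_by(key):
    """Build an algorithm that predicts the training mean of Y within groups."""
    def fit(train):
        overall = sum(y for _, _, y in train) / len(train)
        total, events = {}, {}
        for w, a, y in train:
            k = key(w, a)
            total[k] = total.get(k, 0) + 1
            events[k] = events.get(k, 0) + y
        # Fall back to the overall mean for groups unseen in training.
        return lambda w, a: events[key(w, a)] / total[key(w, a)] if key(w, a) in total else overall
    return fit

# Step 1: the collection of algorithms (three toy learners).
ALGORITHMS = [fit_mean_by(lambda w, a: 0),        # overall mean
              fit_mean_by(lambda w, a: a),        # mean by A
              fit_mean_by(lambda w, a: (w, a))]   # mean by (W, A)

def risk(data, z, weights):
    """Mean squared error of the weighted combination of predictions z."""
    return sum((y - sum(al * zk for al, zk in zip(weights, zi))) ** 2
               for (_, _, y), zi in zip(data, z)) / len(data)

def super_learner(data, algorithms, v=10, grid=20):
    k = len(algorithms)
    z = [None] * len(data)
    for fold in range(v):                                   # step 2: v blocks
        train = [o for i, o in enumerate(data) if i % v != fold]
        fits = [alg(train) for alg in algorithms]           # step 3: training fits
        for i, (w, a, _) in enumerate(data):
            if i % v == fold:                               # step 4: validation predictions
                z[i] = [f(w, a) for f in fits]
    # Step 5: cross-validated MSE per algorithm (pooled over validation sets).
    cv_mse = [risk(data, z, [1.0 if j == m else 0.0 for j in range(k)]) for m in range(k)]
    # Steps 6-7: search convex weights on a grid for the smallest cross-validated risk.
    candidates = [[c / grid for c in combo]
                  for combo in itertools.product(range(grid + 1), repeat=k)
                  if sum(combo) == grid]
    weights = min(candidates, key=lambda wts: risk(data, z, wts))
    # Step 8: refit every algorithm on all data and combine with the chosen weights.
    full_fits = [alg(data) for alg in algorithms]
    predict = lambda w, a: sum(al * f(w, a) for al, f in zip(weights, full_fits))
    return weights, cv_mse, risk(data, z, weights), predict

if __name__ == "__main__":
    weights, cv_mse, sl_risk, predict = super_learner(simulate(1000), ALGORITHMS)
    print("CV MSE per algorithm:", [round(m, 4) for m in cv_mse])
    print("weights:", weights, "CV risk of combination:", round(sl_risk, 4))
```

Because the weight grid includes each single algorithm as a corner case, the selected combination can never have a larger cross-validated risk than the best individual algorithm in the collection, which mirrors the "best (or equal) cross-validated MSE" point made later in the talk.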

53 Elevator Pitch: Risk Score Prediction
Sonoma Data Set
Cohort study of n = 2,066 residents of Sonoma, CA aged 54 and over.
Outcome was death. Covariates were gender, age, self-rated health, leisure-time physical activity, smoking status, cardiac event history, and chronic health condition status.
Almost two-fold improvement ($R^2 = 0.200$) with less than 10% of the subjects and less than 10% the number of covariates.
Rose. Mortality risk score prediction in an elderly population using machine learning. Am J Epid, in press.

54 Prediction Discussion
Previous literature indicates that perception of health in elderly adults may be as important as less subjective measures when assessing later outcomes (Idler & Benyamini 1997, Blazer 2008).
Likewise, benefits of physical activity in older populations have also been shown (Denaei et al. 2009).
Even when the result is a negligible improvement relative to the best algorithms in the collection, the super learner provides a tool to run many algorithms and return a prediction function with the best (or equal) cross-validated MSE, avoiding the need to commit to a single algorithm.

55 Summary: The Road Map
DEFINING THE RESEARCH QUESTION
DATA: The data are n i.i.d. observations of random variable O. O has probability distribution $P_0$.
MODEL: The statistical model M is a set of possible probability distributions of O. $P_0$ is in M. The model is a statistical model for $P_0$ augmented with possible additional nontestable causal assumptions.
TARGET PARAMETER: The parameter $\Psi(P_0)$ is a particular feature of $P_0$, where $\Psi$ maps the probability distribution $P_0$ into the target parameter of interest.
ESTIMATION
SUPER LEARNER: The first step in our estimation procedure is an initial estimate of the relevant part $Q_0$ of $P_0$ using the machine learning algorithm super learner.
TARGETED MAXIMUM LIKELIHOOD ESTIMATION: With an initial estimate of the relevant part of the data-generating distribution obtained using super learning, the second stage of TMLE updates this initial fit in a step targeted toward making an optimal bias-variance tradeoff for the parameter of interest, now denoted $\Psi(Q_0)$, instead of the overall probability distribution.
INFERENCE
INFERENCE: Standard errors are calculated for the estimator of the target parameter using the influence curve or resampling-based methods to assess the uncertainty in the estimator.
INTERPRETATION: The target parameter can be interpreted as a purely statistical parameter or as a causal parameter under possible additional nontestable assumptions in our model.


56 Targeted Learning Book (targetedlearningbook.com)
van der Laan & Rose, Targeted Learning: Causal Inference for Observational and Experimental Data. New York: Springer, 2011.

57 Additional References
Pearl, Causality. New York: Cambridge University Press, 2nd edition, 2009.
Rose, Big data and the future. Significance, 9(4): 47-48. [The Big Data quotes were originally published in this article.]
Rose, Starmans, van der Laan. Targeted learning for causality and statistical analysis in medical research. In Qian Meng, Zhongguo Zheng, eds. Statistics: Discovering Your Future Power. Beijing: China Statistics Press.
van der Laan, Polley, Hubbard. Super Learner. SAGMB, 6(1):Article 25, 2007.
van der Laan, Rubin. Targeted maximum likelihood learning. Int J Biostat, 2(1):Article 11, 2006.

58 Acknowledgments
Johns Hopkins: Michael Rosenblum
UC Berkeley: Mark van der Laan
Funding: National Science Foundation, DMS (PI: S. Rose)
