Stat 587: Key points and formulae Week 15

Odds ratios to compare two proportions

The difference, p_1 − p_2, has issues when applied to many populations.
Vit. C example: P[cold | Placebo] = 0.82, P[cold | Vit. C] = 0.74, so the estimated difference is 8%.
But what about a year or place where colds are infrequent, e.g., 6% get colds on placebo? Should Vit. C then give −2%???
Odds ratios quantify the relationship between two proportions in a way that applies across a wide range of baseline proportions.

Odds = π / (1 − π)
- related to betting: a horse is a 2:1 favorite
- range from 0 (Prob = 0) to ∞ (Prob = 1); Odds = 1 when Prob = 0.5
Statistical analysis commonly uses the log odds
- range from −∞ (Prob = 0) to ∞ (Prob = 1); log Odds = 0 when Prob = 0.5

Odds ratio: a comparison between two groups
Odds_1 / Odds_2 = π_1 (1 − π_2) / ((1 − π_1) π_2)
Commonly use the log odds ratio, which = 0 when Odds_1 = Odds_2, i.e., when π_1 = π_2.

Inference for the log odds ratio, from a 2×2 table with placebo counts O_11 (no cold), O_12 (cold) and Vit. C counts O_21 (no cold), O_22 (cold):
estimate = log[(O_12 O_21) / (O_11 O_22)]
This is the log odds of cold on placebo minus the log odds of cold on Vit. C:
- odds of cold on placebo = O_12 / O_11
- odds of cold on Vit. C = O_22 / O_21
Easy to misinterpret as the odds of "no cold" on placebo vs. on Vit. C, or as the odds of cold on Vit. C vs. on placebo; those versions differ only in sign (+ or −). If it matters, I check the sign against the proportions.
se = √(1/O_11 + 1/O_12 + 1/O_21 + 1/O_22)
ci for the log odds ratio: estimate ± z_{1−α/2} se; for a 95% ci, z_{0.975} = 1.96.
Exponentiate the endpoints to get a ci for the odds ratio.
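
A minimal sketch of this calculation in Python (numpy and scipy); the 2×2 counts below are illustrative values chosen to roughly match the quoted proportions, not taken from the notes:

```python
import numpy as np
from scipy import stats

# 2x2 table laid out as in the notes:
#               no cold   cold
#   placebo       O11      O12
#   Vit C         O21      O22
# Illustrative counts, chosen to roughly match the quoted proportions
# P[cold | Placebo] = 0.82 and P[cold | Vit. C] = 0.74.
O11, O12 = 76, 335    # placebo: 335/411 with a cold
O21, O22 = 105, 302   # Vit C:   302/407 with a cold

# log odds ratio: log odds of cold on placebo minus log odds on Vit C
est = np.log((O12 * O21) / (O11 * O22))
se = np.sqrt(1/O11 + 1/O12 + 1/O21 + 1/O22)

z = stats.norm.ppf(0.975)              # 1.96 for a 95% ci
lo, hi = est - z * se, est + z * se
print(f"log odds ratio: {est:.3f} (se {se:.3f})")
print(f"95% ci for the log odds ratio: ({lo:.3f}, {hi:.3f})")
print(f"95% ci for the odds ratio: ({np.exp(lo):.2f}, {np.exp(hi):.2f})")
```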

Sampling models: how to obtain the data in the 2×2 table above; three common ways.

Prospective Binomial sample: e.g., the Vit. C data.
- Two (or more) groups; observe events; can estimate P[event | group].
Retrospective Binomial sample: e.g., a case-control study. Especially useful when the event is rare.
- Sample C_1 events and C_2 non-events; observe the group for each.
- Can estimate P[group | event] but not P[event | group] (without more info); that is not very useful.
- Can estimate the odds ratio; this is the only useful statistic.
Multinomial sample: e.g., a genetic linkage study.
- Observe 4 groups defined by the row and column labels.
- The question concerns independence of the row and column classifications.

The three sampling models differ by what is fixed by the design:
- Prospective Binomial: the number in each group
- Retrospective Binomial: the number of events and non-events
- Multinomial: the total number of subjects
There are still more sampling models, but they are much less frequently used.
Theory: when the sample size is large, use the Chi-square test for all three. The small-sample methods differ among the three.

Material covered on the final ends here.

The semester in 20 minutes:
A lot of the semester has been details:
- understanding how a statistical method works
- understanding how to interpret specific results
Here are the key concepts.

Almost always, you have to choose how to analyze your data.
My conceptual model: Question → Data → Answers.
- Everything starts with your Question.
- The study design should obtain data that can answer that question.
- The choice of statistical method should provide justifiable answers to that question.
- Appropriate methods and conclusions answer the question and honor the study design.

An analysis is not right or wrong, really; there are 3 categories:
- wrong: doesn't answer the question, violates a critical assumption (e.g., independence), or ignores the study design (e.g., ignoring pairing, or causal claims from observational data)
- ok, but perhaps could be improved: e.g., errors not normally distributed (perhaps unequal variance) but hard to fix
- reasonable

Most analyses are based on a model for the data. That model should:
- honor the design (e.g., include blocks)
- provide quantities that answer the question; e.g., in a study with quantitative treatments, regression or pairwise comparisons

My opinion: estimates and confidence intervals, especially for a-priori contrasts, are more informative than a test. A p-value only tells you "not zero", except when the question is genuinely general: do all treatments have the same mean?

The fastest way to get into ethical trouble: explore the data for interesting patterns (fine, so long as clearly labeled exploratory), but then treat them as a-priori hypotheses or questions.

Fun things that could be useful to know about:
- Paired yes/no data
- Logistic regression: regression with a yes/no response
- Poisson regression: regression with a count (0, 1, 2, 3, ...) response
- Smoothing splines
- CART models and random forests
For all of these: few details. My goal is for you to know these exist.

Paired yes/no data: what if the Vit. C study had been conducted differently?
Find households with 2 adults. Within each household, one adult gets Vit. C and the other gets Placebo.
The data are paired (a yes/no response doesn't change that aspect of the design).
My experience is that this pairing often gets forgotten.
- The Chi-square test is wrong (the observations are not independent).
- The se of the log odds ratio is wrong.
There are methods that explicitly account for pairing, analogs of the paired t-test for continuous responses. One simple one is McNemar's test (see the sketch below).
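
A minimal sketch of McNemar's test for this paired design, computed by hand; the household counts are hypothetical:

```python
from scipy import stats

# Paired outcomes, one entry per household (hypothetical counts):
#                        Vit C: cold   Vit C: no cold
#   placebo: cold             a=20          b=35
#   placebo: no cold          c=15          d=30
a, b, c, d = 20, 35, 15, 30

# Only the discordant pairs (b and c), where exactly one adult
# in the household got a cold, carry information about treatment.
chi2 = (b - c) ** 2 / (b + c)
p = stats.chi2.sf(chi2, df=1)   # chi-square with 1 df
print(f"McNemar chi-square = {chi2:.2f}, p = {p:.4f}")
```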

Regression with a yes/no response:
The simple situation: a yes/no response and 1 X variable.
We want to model P[yes] as a function of the X variable.
Y_i is 0 or 1, π_i is P[yes] for the i-th observation, and X_i is the X value for the i-th observation.
P[yes] is an average, so we could try π_i = β_0 + β_1 X_i + ε_i.
Issue (besides unequal variance and non-normal errors): the predicted values can be < 0 or > 1. Not good!

Logistic regression: model the log odds as a function of X_i:
log(π_i / (1 − π_i)) = β_0 + β_1 X_i
We observe Y_i, which is 1 with probability π_i.
If you draw π_i vs. X_i, the curve is sigmoid, the same shape as the logistic population growth curve in ecology.
β_1 is the increase in log odds when X increases by 1; exp(β_1) is the odds ratio comparing X + 1 to X.
This extends in the obvious way to multiple X variables.

How to estimate the coefficients?
We can not use least squares (that is for continuous responses); use maximum likelihood.
Likelihood: the generalization of least squares to any statistical distribution.
When Y_i = 0 or 1, Y_i has a Bernoulli distribution.
The likelihood expresses how well a particular set of parameters fits the data; maximum likelihood finds the β_0, β_1 that fit best.
It also provides standard errors for the estimates, which give tests and confidence intervals.
Implemented as a Generalized Linear Model (glm); this is genmod in SAS, because SAS uses glm for the general linear model.
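
A minimal sketch of the maximum likelihood fit in Python using statsmodels, with simulated data since the notes include no dataset:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulate yes/no data from a known model:
# log odds = -1 + 0.8 x, so the true odds ratio per unit of x is exp(0.8).
x = rng.uniform(0, 5, size=200)
pi = 1 / (1 + np.exp(-(-1 + 0.8 * x)))
y = rng.binomial(1, pi)

# Maximum likelihood fit as a GLM with the Binomial (Bernoulli) family
# and logit link; the analog of PROC GENMOD in SAS.
X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(fit.summary())

# exp(beta_1) is the odds ratio comparing X + 1 to X.
print("estimated odds ratio per unit of x:", np.exp(fit.params[1]))
```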

Regression with count responses. Two types:
- fixed maximum: Binomial distribution
- unlimited maximum: Poisson distribution

Example with a fixed maximum: a modification of the Vit. C study.
The response is whether or not you had a cold in each of Nov, Dec, Jan, Feb, and Mar, so the response is 0, 1, 2, 3, 4, or 5 (5 if you had a cold in every month).
Define π_i as the probability of having a cold in a month.
Y_i, the number of months with a cold for person i, has a Binomial(5, π_i) distribution.
The number of possible events can be the same or different for each individual.
Model the log odds, log(π_i / (1 − π_i)), as a function of the X variable.

Example with an unlimited maximum: another modification of the Vit. C study.
The response is the number of colds you had during the winter season.
There is no fixed limit; the values are 0, 1, 2, ..., possibly large.
Define λ_i = the average (or predicted) number of events for person i.
λ_i can not be negative, so model log λ_i = β_0 + β_1 X_i.
Y_i, the number of colds, has a Poisson(λ_i) distribution.
Fit either model by maximum likelihood.

Overdispersion in models for count data:
For both the Binomial and Poisson distributions, Var Y_i depends on the mean of Y_i, i.e., on π_i or λ_i.
Sometimes the data are more variable than they should be; this is known as overdispersion.
Account for it by using a more complicated distribution for the data:
- fixed maximum: the Beta-binomial distribution instead of the Binomial
- unlimited maximum: the Negative binomial distribution instead of the Poisson
My experience is that most ag/bio count data are overdispersed, and the analysis of these data must account for the overdispersion. The only exception is bird eggs per clutch, which are less variable than expected.
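
A minimal sketch of Poisson regression in statsmodels, plus a negative binomial refit as one way to handle overdispersion; the data are simulated, and the fixed dispersion value alpha=1.0 is an arbitrary illustration, not from the notes:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Simulate counts from log(lambda) = 0.2 + 0.5 x.
x = rng.uniform(0, 2, size=300)
y = rng.poisson(np.exp(0.2 + 0.5 * x))
X = sm.add_constant(x)

# Poisson regression: GLM with the Poisson family and log link.
pois = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(pois.summary())

# Quick overdispersion check: Pearson chi-square / df should be near 1
# for a well-fitting Poisson model.
print("dispersion:", pois.pearson_chi2 / pois.df_resid)

# If the data were overdispersed, a negative binomial GLM is one fix.
nb = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=1.0)).fit()
print(nb.summary())
```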

Smoothing splines:
Goal: model the relationship between Y and X without specifying its details.
We've seen the linear model, E Y_i = β_0 + β_1 X_i, and the quadratic, E Y_i = β_0 + β_1 X_i + β_2 X_i².
If the curve needs to bend in a different way, we could use higher-order polynomial models.
Adding more terms allows the curve to wiggle more, but only in the ways allowed by some specific polynomial.
Smoothing splines let the data tell the model where and how much to bend.
The model is Y_i = f(X_i) + ε_i, where f(X_i) is an arbitrary function estimated from the data.
A simple version of f is a series of line segments joined together: it bends where the data bend and is straight where the data are straight. That f is continuous, but not smooth (the 1st and 2nd derivatives are not continuous).
More useful: join cubic polynomials so the result is continuous with continuous 1st and 2nd derivatives. This looks smooth and is the most common choice for splines.
Practical issue: how wiggly should the fitted curve be?
This is a model selection issue: we want to fit the data, but not overfit. A curve that wiggles a lot probably overfits the data.
The common solution borrows a model selection idea: minimize Fit + a penalty for wiggliness (= complexity).
What can this be used for?
- Learning: understand the relationship between Y and X.
- Interpolation: predict Y for X's inside the data. Splines do not extrapolate well (there are no data to estimate the curve outside the observed X's).
- Evaluating a specific model: overlay the predicted values from the model and from a spline.
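
A minimal sketch using scipy's UnivariateSpline, one readily available smoothing spline; its parameter s controls how closely the curve follows the data, standing in for the wiggliness penalty discussed above (the value used here is a guess):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(2)

# Noisy observations around an unknown smooth curve.
x = np.sort(rng.uniform(0, 10, size=100))
y = np.sin(x) + rng.normal(scale=0.3, size=100)

# Cubic smoothing spline (k=3): continuous 1st and 2nd derivatives.
# Larger s gives a smoother (less wiggly) fit; choosing s is exactly
# the model selection issue described above.
spline = UnivariateSpline(x, y, k=3, s=10)

# Sensible for interpolation inside the data; do not extrapolate.
x_grid = np.linspace(0.5, 9.5, 5)
print(spline(x_grid))
```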

CART models: Classification and Regression Trees.
Really useful for multiple X variables with complicated interaction effects.
The predictor is a dichotomous key, like a key for classifying species.
Example: predicting SAT score.
Like splines, a tree estimates f(X) from the data, but the prediction is the average of a group of observations.
From the SAT data: if ltakers ≥ 3.205, predict 877.4; if not, predict 1002. Those values are the means of the groups with ltakers ≥ 3.205 and ltakers < 3.205.

The algorithm:
- Consider all variables and all possible split points.
- Find the variable and split point that best separates two groups; the details of "best" depend on the nature of the data (continuous, count, yes/no).
- Split the data and consider each subgroup separately.
- Find the best split for subgroup 1, giving two sub-sub-groups, and for subgroup 2, giving 2 more sub-sub-groups.
- Keep splitting until there is no effective split, or the groups are too small to split further (a user-specified limit).
Practical experience is that a tree tends to overfit the data, so prune the tree by removing the lowest branches; the pruning decision is often made using cross-validation.
To make a prediction, work down the tree: at each split, decide which way the observation goes; when you reach the end, the prediction is the average for that leaf.
CART models:
- require quite a bit of data (> 100 observations; > 250 is better)
- are really effective with contingent relationships; in the SAT example, rank only matters when ltakers < 3.205
- are not as useful when a simple model (e.g., linear) is sufficient
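
A minimal sketch of a regression tree in scikit-learn on simulated data with a contingent relationship; max_depth and min_samples_leaf stand in for the stopping and pruning rules described above:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(3)

# Simulate a contingent relationship: x2 only matters when x1 < 0.
X = rng.normal(size=(500, 2))
y = np.where(X[:, 0] < 0, 5 * X[:, 1], 2.0) + rng.normal(scale=0.5, size=500)

# max_depth and min_samples_leaf play the role of the
# "too small to split" limit and of pruning.
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=20).fit(X, y)

# Each leaf's prediction is the mean response of its group,
# and the printout reads like a dichotomous key.
print(export_text(tree, feature_names=["x1", "x2"]))
```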

Random Forest: an extension of a CART model.
The idea is to create many trees by resampling the original data; 500 trees is common, often more.
Prediction: given the covariate values for a new observation, use the covariates to make a prediction from each tree, so 500 predictions, and report their average as the prediction for that observation.
This is an example of ensemble prediction: using many models.
It seems weird, but it works extremely well.
Random Forests are the best "out of the box" method for making predictions; that is my experience and opinion, shared by lots of others.
"Out of the box" means they work on lots of different types of problems without requiring a lot of problem-specific tweaking.
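
A minimal sketch of a random forest in scikit-learn, reusing the same kind of simulated data; n_estimators=500 matches the "500 trees is common" rule of thumb:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)

# Same kind of simulated data as in the CART sketch.
X = rng.normal(size=(500, 2))
y = np.where(X[:, 0] < 0, 5 * X[:, 1], 2.0) + rng.normal(scale=0.5, size=500)

# 500 trees, each fit to a bootstrap resample of the data; the forest's
# prediction is the average of the 500 individual tree predictions.
forest = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

x_new = np.array([[-1.0, 0.5], [1.0, 0.5]])
print(forest.predict(x_new))
```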