Stat 587: Key points and formulae Week 15

Size: px

Start display at page:

Download "Stat 587: Key points and formulae Week 15"

Doris Morton
5 years ago
Views:

1 Odds ratios to compare two proportions: Difference, p 1 p 2, has issues when applied to many populations Vit. C: P[cold Placebo] = 0.82, P[cold Vit. C] = 0.74, Estimated diff. is 8% What if a year or place when colds infrequent, e.g. 6% on placebo. Vit.C = 2%??? Odds ratios quantify relationship between two proportions that is applicable across a wide range of baseline proportions Odds = π/(1 π). related to betting: horse is a 2:1 favorite. range from 0 (Prob = 0) to (Prob = 1) Odds = 1 Prob = 0.5 Statistical analysis commonly uses log odds range from (Prob = 0) to (Prob = 1) log Odds = 0 Prob = 0.5 Odds ratio: comparison between two groups Odds 1 /Odds 2 = π 1(1 π 2 ) (1 π 1 )π2 commonly use log odds ratio: = 0 when Odds 1 = Odds 2 or π 1 = π 2 Inference for log odds ratio estimate = log O 12 O 21 O 11 O 22 This is log odds of cold in placebo - log odds of cold in VitC odds of cold in placebo = O 12 /O 11 odds of cold in Vit C = O 22 /O 21 Easy to misinterpret as odds of not cold in placebo vs in vitc or odds of cold in Vit C vs in placebo Sign difference (+ or -1). If it matters, I check against proportions 1 se O O O O 22 ci for log odds ratio: estimate ±z 1 α/2 se 95% ci: z = 1.96 exponentiate to get ci for odds ratio Sampling models: How to obtain the data in above table; three common ways Prospective Binomial sample: e.g., Vit C data Two (or more) groups, observe events, estimate P[event group] Retrospective Binomial sample: e.g., case-control study. Especially useful when event is rare Sample C 1 events and C 2, observe group for each can estimate P[group event] but not P[event group] (without more info) this is not very useful Can estimate odds ratio - this is the only useful statistic Multinomial sample: e.g. genetic linkage study observe 4 groups defined by row and column labels Q concerns independence of the row and column classifications The three sampling models differ by what is fixed by the design Prospective Binomial: Number in each group Retrospective Binomial: Number of events and non-events 1

2 Multinomial: Total number of subjects There are still more sampling models, but they are much less frequently used Theory when sample size large, use Chi-square test for all three Different small sample methods Material covered on the final ends here The semester in 20 minutes: A lot of the semester has been details Understanding how a statistical method works Understanding how to interpret specific results Here are the key concepts Almost always, you have to choose how to analyze your data My conceptual model: Question Data Answers Everything starts with your Question Study design should obtain data that can answer that question Choice of statistical method should provide justifiable answers to that question Appropriate methods and conclusions answer the question and honor the study design An analysis is not right or wrong, really 3 categories wrong: doesn t answer the question, violates critical assumption (e.g., independence) ignores study design (e.g., ignoring pairing, causal claims from obs. data) ok, perhaps could be improved errors not normally distributed (perhaps unequal variance) but hard to fix or reasonable Most analyses based on a model for the data. That model should: honor the design (e.g., include blocks) provide quantities that answer the question e.g., study with quantitative treatments - regression or pairwise comparisons My opinion: prefer estimates and confidence intervals especially for a-priori contrasts are more informative than a test p-value only tells you not zero except when question is general: do all trts have same mean? Fastest way to get into ethical trouble: explore data for interesting patterns (fine, so long as clearly exploratory) but then treat as a-priori hypotheses or questions 2

3 Fun things that could be useful to know about: Paired Yes/No data Logistic Regression: regression with yes/no response Poisson Regression: regression with count (0, 1, 2, 3 ) response Smoothing splines CART models and random forests For all - few details. My goal is for you to know these exist Paired yes/no data: Vit C study: what if conducted differently? Find households with 2 adults. Within each household, one adult gets Vit C, other gets Placebo Data are paired (yes/no response doesn t change that aspect of the design) My experience is that this pairing often gets forgotten Chi-square test is wrong (obs. are not independent) se of log odds ratio is wrong There are methods that explicitly account for pairing analog of the paired t-test for continuous responses one simple one is McNemar s test Regression with yes/no response: Simple situation: yes/no response, 1 X variable Want to model P[yes] as a function of the X variable Y i is 0 or 1, π i is P[yes] for i th obs, X i is X value for i th obs P[yes] is an average, so could try π i = β 0 + β 1 X i + ε i Issue: (besides unequal variance, non-normal errors) Predicted values can be < 0 or > 1: not good! Logistic regression: model log odds as function of X i log π i 1 π i = β 0 + β 1 X i Observe Y i which is 1 with probability π i If draw π i vs X i, curve is sigmoid same shape as logistic population growth curve in ecology β 1 is increase in log odds when X increases by 1 exp β 1 is odds ratio comparing X + 1 to X Extends in obvious way to multiple X variables How to estimate the coefficients? Can not use least squares (for continuous responses) Use maximum likelihood Likelihood: Generalization of least squares to any statistical distribution When Y i = 0 or 1, Y i has a Bernoulli distribution Likelihood expresses how well a particular set of parameters fits the data Maximum likelihood which β 0, β 1 fits best 3

4 Provides standard errors for estimates which give tests and confidence intervals Implemented as a Generalized Linear Model (glm) which is a genmod in SAS because glm used for general linear model Regression with count responses Two types: Fixed maximum: Binomial distribution unlimited maximum: Poisson distribution Example of fixed maximum: Modification of Vit C study Response is whether or not you had a cold in Nov, Dec, Jan, Feb or Mar Response is 0, 1, 2, 3, 4 or 5, (5 if had a cold in at each month) Define π i as probability have a cold in a month Y i is number of months for person i, has Binomial(5, π i ) distribution # possible events can be same or different for each individual model log odds (log π i /(1 π i )) as a function of the X variable Example of unlimited maximum: another modification of the Vit C study Response is number of colds you had during the winter season No fixed limit, has values 0, 1, 2,, possibly large # Define λ i = average (or predicted) number of events for person i λ i can not be negative So model log λ i = β 0 + β i Y i is number of colds, has Poisson(λ i ) distribution Fit either model by maximum likelihood Overdispersion in models for count data: Both Binomial and Poisson distributions: Var Y i depends on mean Y i i.e., π i or λ i Sometimes the data are more variable than they should be This is known as overdispersion Account for it by using a more complicated distribution for the data Fixed maximum: Beta binomial distribution instead of Binomial Unlimited maximum: Negative binomial distribution instead of Poisson My experience is that most ag/bio count data is overdispersed Analysis of these data must account for overdispersion Only exception is bird eggs/clutch, which are less variable than expected Smoothing splines: Goal: model relationship between Y and X without specifying the details We ve seen linear: E Y i = β 0 + β 1 X i and quadratic E Y i = β 0 + β 1 X i + β 2 Xi 2 models If curve needs to bend in a different way, could use higher order polynomial models Adding more terms allows curve to wiggle more. But only in the ways allowed by some specific polynomial Smoothing splines let the data tell the model where/how much to bend Model is Y i = f(x i ) + ε i 4

5 where f(x i ) is an arbitrary function estimated from the data Simple model is a series of line segments joined together bends where data bents, straight where data straight Continuous, but not smooth (1st and 2nd derivatives not continuous) More useful: join cubic polynomials that are continuous with continuous 1st and 2nd derivatives Looks smooth, most common choice for splines Practical issue: how wiggly is the fitted curve? A model selection issue: want to fit the data, but not overfit A curve that wiggles a lot probably overfits the data Common solution: borrow a model selection idea combine Fit + penalty for wiggliness (= complexity) What can this be used for? Learning: understand relationship between Y and X Interpolation: predict Y for X s inside the data Splines do not extrapolate well (no data to estimate the curve) Evaluate a specific model Overlay predicted values from the model and from a spline CART models: Classification and Regression Trees Really useful for multiple X variables with complicated interaction effects Predictor is a dichotomous key like classifying species Example: predicting SAT score Like splines: estimate f(x) from data but prediction is the average of a group of observations Example from SAT data: if ltakers 3.205, predict 877.4; if not, predict 1002 Those values are the means of the groups with ltakers and ltakers < Algorithm: Consider all variables, and all possible split points Find the variable and split point that best separates two groups details of best depend on nature of the data (continuous, count, yes/no) Split the data, consider each subgroup separately Find the best split for subgroup 1, result is two sub-sub-groups And for subgroup 2, giving 2 more sub-sub-groups Keep splitting until: no effective split or groups too small to split further (user specified limit) Practical experience is that a tree tends to overfit the data prune the tree by removing lowest branches decision often made using cross-validation Make prediction by working down the tree at each split, decide which way the observation goes when get to end, prediction is the average for that leaf CART models: require quite a bit of data (> 100 observations, >250 is better) 5

6 are really effective with contingent relationships SAT: rank only matters when ltakers < not as useful when a simple model (e.g., linear) is sufficient Random Forest: Extension of a CART model Idea is to create many trees by resampling the original data 500 trees is common, often more. Prediction: have covariate values Use covariates to make a prediction from each model, so 500 predictions report their average as the prediction for that observation An example of ensemble prediction: using many models Seems wierd, but works extremely well Random Forests are the best out of the box method to make predictions my experience and opinion, shared by lots of others out of the box means they work on lots of different types of problems without requiring a lot of problem-specific tweaking. 6

Machine Learning, Fall 2009: Midterm

Machine Learning, Fall 2009: Midterm 10-601 Machine Learning, Fall 009: Midterm Monday, November nd hours 1. Personal info: Name: Andrew account: E-mail address:. You are permitted two pages of notes and a calculator. Please turn off all