ASSOCIATION: CONTINGENCY, CORRELATION, AND REGRESSION
Chapter 3

Learning Objectives
3.1 The Association between Two Categorical Variables
  1. Identify variable type: response or explanatory
  2. Define association
  3. Contingency tables
  4. Calculate proportions and conditional proportions
  (Image: jenniferechols.files.wordpress.com)

Response and Explanatory Variables
  Response variable (dependent, y): the outcome variable
  Explanatory variable (independent, x): defines the groups
  Response/Explanatory examples:
  1. Blood alcohol level / number of beers consumed
  2. Grade on test / amount of study time
  3. Yield of corn / amount of rainfall

Association
  Association: when a value for one variable is more likely with certain values of the other variable
  Data analysis with two variables:
  1. Tell whether there is an association, and
  2. Describe that association
  (Image: tickets.worldcafelive.com)

Contingency Table
  Displays two categorical variables
  The rows list the categories of one variable; the columns list the categories of the other
  Entries in the table are frequencies
  What is the response variable? What is the explanatory variable?
  (Image: www1.pictures.fp.zimbio.com)

Proportions & Conditional Proportions
  What proportion of organic foods contain pesticides? Of conventionally grown foods? Of all foods?
  (Image: i.treehugger.com)
  Side-by-side bar charts show conditional proportions and allow for easy comparison
  If there were no association, the conditional proportions would be the same in each group
  Since there is an association, the conditional proportions differ
  (Image: www.vitalchoice.com)
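
A minimal sketch of how the conditional proportions asked about above could be computed from a contingency table. The pesticide table itself did not survive in these notes, so the counts below are hypothetical and chosen only for illustration; the function names are likewise assumptions.

```python
# Hypothetical counts for illustration only; they are NOT taken from the slides.
# Rows: explanatory variable (food type). Columns: response (pesticide status).
table = {
    "organic": {"present": 30, "absent": 70},
    "conventional": {"present": 150, "absent": 50},
}


def conditional_proportions(table):
    """For each row, return the proportion of that row's total in each column."""
    result = {}
    for row, counts in table.items():
        row_total = sum(counts.values())
        result[row] = {col: n / row_total for col, n in counts.items()}
    return result


def overall_proportion(table, column):
    """Proportion of all observations, ignoring rows, that fall in `column`."""
    grand_total = sum(sum(counts.values()) for counts in table.values())
    column_total = sum(counts[column] for counts in table.values())
    return column_total / grand_total


if __name__ == "__main__":
    for row, props in conditional_proportions(table).items():
        print(row, {col: round(p, 2) for col, p in props.items()})
    print("all foods:", round(overall_proportion(table, "present"), 2))
```

With these made-up counts the proportion with pesticides present is 0.30 for organic and 0.75 for conventional foods. Because the two conditional proportions differ, the table would show an association.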

Learning Objectives: 3.2 The Association between Two Quantitative Variables
  1. Constructing scatterplots
  2. Interpreting a scatterplot
  3. Correlation
  4. Calculating correlation
  (Image: onlinestatbook.com)

Internet Usage & GDP Data Set
  (Image: www.knitwareblog.com)

Country          INTERNET     GDP
Algeria              0.65    6.09
Argentina           10.08   11.32
Australia           37.14   25.37
Austria             38.7    26.73
Belgium             31.04   25.52
Brazil               4.66    7.36
Canada              46.66   27.13
Chile               20.14    9.19
China                2.57    4.02
Denmark             42.95   29
Egypt                0.93    3.52
Finland             43.03   24.43
France              26.38   23.99
Germany             37.36   25.35
Greece              13.21   17.44
India                0.68    2.84
Iran                 1.56    6
Ireland             23.31   32.41
Israel              27.66   19.79
Japan               38.42   25.13
Malaysia            27.31    8.75
Mexico               3.62    8.43
Netherlands         49.05   27.19
New Zealand         46.12   19.16
Nigeria              0.1     0.85
Norway              46.38   29.62
Pakistan             0.34    1.89
Philippines          2.56    3.84
Russia               2.93    7.1
Saudi Arabia         1.34   13.33
South Africa         6.49   11.29
Spain               18.27   20.15
Sweden              51.63   24.18
Switzerland         30.7    28.1
Turkey               6.04    5.89
United Kingdom      32.96   24.16
United States       50.15   34.32
Vietnam              1.24    2.07
Yemen                0.09    0.79

Scatterplot
  Graph of two quantitative variables:
  Horizontal axis: explanatory variable, x
  Vertical axis: response variable, y
  (Plotted for the Internet Usage & GDP data set above)

Interpreting Scatterplots
  The overall pattern includes the trend, direction, and strength of the relationship
  Trend: linear, curved, clusters, no pattern
  Direction: positive, negative, no direction
  Strength: how closely the points fit the trend
  Also look for outliers from the overall trend

Used-car Dealership
  What association would we expect between the age of the car and mileage?
  a) Positive
  b) Negative
  c) No association
  (Image: www.pritchettcartoons.com)

Linear Correlation, r
  Measures the strength and direction of the linear association between x and y

Correlation Coefficient: Measuring Strength & Direction of a Linear Relationship
  Positive r => positive association
  Negative r => negative association
  r close to +1 or -1 indicates a strong linear association
  r close to 0 indicates a weak linear association

Learning Objectives
3.3 Can We Predict the Outcome of a Variable?
  1. Define regression line
  2. Predict with regression equation
  3. Interpret slope and y-intercept
  4. Identify least-squares regression line
  5. Calculate least-squares regression line
  6. Compare explanatory and response variables
  7. Calculate and interpret r²
  (Image: www.cabnr.unr.edu)

What Is Regression?
  Regress: to withdraw; the act of reasoning backward
  Regression curve: the curve with the best possible fit for the data; describes how y changes with x
  (Image: graphics8.nytimes.com)

Regression Line
  Predicts y, given x: ŷ = a + bx
  The y-intercept and slope are a and b
  Only an estimate; actual data vary
  Describes the relationship between x and the estimated means of y
  (Image: farm4.static.flickr.com)
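
The correlation r described above can be sketched in a few lines. The slides do not show a formula, so this example assumes the common z-score form, r = (1/(n - 1)) Σ z_x z_y, with sample standard deviations. The inputs are the first five rows of the Internet Usage & GDP data set (Algeria through Belgium); the function names are illustrative.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(x, y):
    """r = (1 / (n - 1)) * sum of (z-score of x) * (z-score of y)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mx, my = mean(x), mean(y)
    sx, sy = sample_sd(x), sample_sd(y)
    z_products = (((xi - mx) / sx) * ((yi - my) / sy) for xi, yi in zip(x, y))
    return sum(z_products) / (len(x) - 1)


# First five rows of the Internet Usage & GDP data set:
# Algeria, Argentina, Australia, Austria, Belgium.
internet = [0.65, 10.08, 37.14, 38.7, 31.04]
gdp = [6.09, 11.32, 25.37, 26.73, 25.52]

if __name__ == "__main__":
    print("r =", round(correlation(gdp, internet), 3))
```

Swapping the two arguments gives the same r, and the value always lies between -1 and +1. Five rows are used only to keep the sketch short; the full data set would be passed in the same way.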

Residuals
  Prediction errors: the vertical distance between a data point and the regression line
  A large residual indicates an unusual observation
  Each residual is: y − ŷ
  For the least-squares line, the sum of the residuals is always zero
  (Image: www.chem.utoronto.ca)

Least Squares Method
  Goal: minimize the distance from the data to the regression line
  Residual sum of squares: Σ(residuals)² = Σ(y − ŷ)²
  The least-squares regression line minimizes the sum of squared vertical distances between the points and their predictions
  (Image: msenux.redwoods.edu)

Regression Analysis
  Identify the response and explanatory variables
  The response variable is y
  The explanatory variable is x

Anthropologists Predict Height Using Remains?
  Regression equation: ŷ = 61.4 + 2.4x
  ŷ is the predicted height (cm) and x is the length of the femur, or thighbone (cm)
  Predict the height for a femur length of 50 cm
  (Image: www.geektoysgamesandgadgets.com; caption: Bones)

Interpreting the y-intercept and slope
  y-intercept: the y-value when x = 0; helps plot the line
  Slope: the change in y for a 1-unit increase in x
  A 1 cm increase in femur length means a 2.4 cm increase in predicted height
  ŷ = 61.4 + 2.4x

Slope Values: Positive, Negative, Zero
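
The prediction asked for above, and the least-squares idea behind it, can be sketched as follows. The slides do not give the fitting formulas, so the closed form below, b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄, is an assumption about the method. The femur and height data are hypothetical and only serve to show that the residuals of a fitted line sum to zero.

```python
def least_squares_line(x, y):
    """Return (a, b) for y-hat = a + b*x minimizing the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict(a, b, x):
    """Predicted response y-hat for explanatory value x."""
    return a + b * x


def residuals(a, b, x, y):
    """Observed minus predicted value, y - y-hat, for every observation."""
    return [yi - predict(a, b, xi) for xi, yi in zip(x, y)]


# Prediction from the slide's own equation, y-hat = 61.4 + 2.4x, at x = 50 cm.
predicted_height = predict(61.4, 2.4, 50)

# Hypothetical femur lengths and heights (cm), invented for illustration.
femur = [40.0, 43.0, 45.0, 48.0, 52.0]
height = [158.0, 164.0, 170.0, 176.0, 187.0]
a, b = least_squares_line(femur, height)
res = residuals(a, b, femur, height)

if __name__ == "__main__":
    print("predicted height at 50 cm:", round(predicted_height, 1))
    print("fitted intercept and slope:", round(a, 2), round(b, 2))
    print("sum of residuals:", round(sum(res), 10))
```

For a 50 cm femur the slide's equation gives 61.4 + 2.4(50) = 181.4 cm. The fitted intercept and slope for the made-up data have no connection to the 61.4 and 2.4 on the slide.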

Slope and Correlation
  Slope, b:
    Doesn't tell the strength of the association
    Has units
    Changes if x and y are swapped
  Correlation, r:
    Describes strength
    No units
    Same if x and y are swapped

Squared Correlation, r²
  Proportional reduction in error, r²
  Proportion of the variation in y-values explained by the linear relationship of y to x
  A correlation, r, of .9 means r² = .9² = .81 => 81% of the variation in y is explained by x

Learning Objectives: 3.4 What Are Some Cautions in Analyzing Associations?
  1. Extrapolation
  2. Outliers and influential observations
  3. Correlation does not imply causation
  4. Lurking variables and confounding
  5. Simpson's Paradox
  (Image: www.bio.uu.nl)

Extrapolation
  Extrapolation: predicting y for x-values outside the range of the data
  Riskier the farther from the range of x
  No guarantee the trend holds

Outliers and Influential Points
  A regression outlier lies far away from the trend of the rest of the data
  An observation is influential if both:
  1. Its x-value is low or high compared to the rest of the data
  2. It is a regression outlier
  (Images: Neil Weiss, Elementary Statistics, 7th Edition; www2.selu.edu)
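
Returning to the Slope and Correlation and Squared Correlation slides above, the sketch below checks two of their claims numerically: r is unchanged when x and y are swapped while the slope is not, and r² equals the proportional reduction in squared prediction error from using the regression line instead of the mean of y. The study-hours and test-score values are hypothetical, and the function names are illustrative.

```python
from math import sqrt


def sums(x, y):
    """Means and sums of squares/cross-products used by the functions below."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return mx, my, sxx, syy, sxy


def slope(x, y):
    """Least-squares slope for predicting y from x."""
    _, _, sxx, _, sxy = sums(x, y)
    return sxy / sxx


def correlation(x, y):
    _, _, sxx, syy, sxy = sums(x, y)
    return sxy / sqrt(sxx * syy)


def proportional_reduction_in_error(x, y):
    """(error using mean of y - error using the line) / error using mean of y."""
    mx, my, sxx, syy, sxy = sums(x, y)
    b = sxy / sxx
    a = my - b * mx
    line_error = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return (syy - line_error) / syy


# Hypothetical study hours and test scores, invented for illustration.
hours = [1, 2, 3, 4, 5, 6]
score = [52, 60, 57, 70, 68, 80]

if __name__ == "__main__":
    r = correlation(hours, score)
    print("r (x, y) and r (y, x):", round(r, 4), round(correlation(score, hours), 4))
    print("slope (x, y) and slope (y, x):",
          round(slope(hours, score), 4), round(slope(score, hours), 4))
    print("r squared:", round(r ** 2, 4))
    print("proportional reduction in error:",
          round(proportional_reduction_in_error(hours, score), 4))
```

The two slopes differ because each carries the units of its own response variable. Their product equals r², so one slope is the reciprocal of the other only when r = ±1.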

Correlation Does Not Imply Causation
  A strong correlation between x and y means:
    A strong linear association between the variables
    It does not mean that x causes y
  (Image: www.teachbabymusic.com)

Chicago Fires of Last Year
  x = number of firefighters at the fire
  y = cost of damages
  1. Is the correlation +, -, or 0?
  2. Do more firefighters cause damages to be worse?
  3. What else might cause the association?
    a. Distance from station
    b. Intensity of fire
    c. Size of fire
  (Image: pixdaus.com)

Lurking Variables & Confounding
  Confounding: two explanatory variables are both associated with the response variable and with each other
  Lurking variables: not measured in the study, but may confound
  1. Ice cream sales & drowning => temperature
  2. Reading level & shoe size => age
  (Image: image3.examiner.com)

Simpson's Paradox
  Simpson's Paradox: the association between two variables reverses after a third variable is included
  (Image: www.jewsinalabama.com; caption: Homer (not really the right) Simpson)

Simpson's Paradox Example
  Probability of death of smoker = 139/582 = 24%
  Probability of death of nonsmoker = 230/732 = 31%
  (Image: blogs.smh.com.au; caption: Greta Garbo)

Simpson's Paradox Example
  Break out data by age
  (Image: streetpulse.files.wordpress.com)
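
A sketch of how breaking the smoking data out by age can reverse the overall comparison. The totals (139 deaths among 582 smokers, 230 among 732 nonsmokers) are from the slide above. The age-group table did not survive in these notes, so the per-age counts below are illustrative assumptions, chosen only so that they add up to those totals.

```python
# (deaths, group size) per age group. Totals match the slide; the split by
# age group is an ASSUMPTION made for illustration, not data from the slides.
smokers = {
    "18-34": (5, 179),
    "35-54": (41, 239),
    "55-64": (51, 115),
    "65+": (42, 49),
}
nonsmokers = {
    "18-34": (6, 219),
    "35-54": (19, 199),
    "55-64": (40, 121),
    "65+": (165, 193),
}


def overall_rate(groups):
    """Death rate with all age groups pooled together."""
    deaths = sum(d for d, _ in groups.values())
    total = sum(n for _, n in groups.values())
    return deaths / total


def rates_by_group(groups):
    """Death rate within each age group separately."""
    return {age: d / n for age, (d, n) in groups.items()}


if __name__ == "__main__":
    print("overall: smokers", round(overall_rate(smokers), 2),
          "nonsmokers", round(overall_rate(nonsmokers), 2))
    s, ns = rates_by_group(smokers), rates_by_group(nonsmokers)
    for age in smokers:
        print(age, "smokers", round(s[age], 3), "nonsmokers", round(ns[age], 3))
```

Pooled, smokers show the lower death rate (24% versus 31%). Within every assumed age group the smokers' rate is at least as high as the nonsmokers'. In these counts the reversal arises because the nonsmokers are concentrated in the oldest group, where death rates are highest for everyone.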

Simpson's Paradox Example
  Associations look quite different after adjusting for a third variable
  (Image: www.straitstimes.com)

Image Sources
  Statistics: The Art and Science of Learning from Data, 2nd Edition, Agresti and Franklin
  http://jenniferechols.files.wordpress.com/2007/08/two-friends-hugging.jpg
  http://www.hessdesignworks.com/illustrations/corn.jpg
  http://tickets.worldcafelive.com/uplimage/beercircle.jpg
  http://i.treehugger.com/images/2007/10/24/pesticide-jj-001.jpg
  http://onlinestatbook.com/chapter12/graphics/reg_error.gif
  http://www.knitwareblog.com/wp-content/uploads/2008/06/firefox-3-download-map.jpg
  http://www.pritchettcartoons.com/jeremy/used_cars.gif
  http://scienceaid.co.uk/psychology/approaches/images/correlation.jpg
  http://nrtwq.usgs.gov/images/methods/sscvsturb.png
  http://www.agdesktop.com/wallpapers%5ctelefilm%5cbones%5ctemperance_bones_brennan-seeley%20booth-001.jpg
  http://farm3.static.flickr.com/2706/4441460977_61dfcc3e6e.jpg
  http://msenux.redwoods.edu/math/r/graphics/regression1.gif
  http://graphics8.nytimes.com/images/2009/04/27/world/27withdraw.xlarge12.jpg
  http://farm4.static.flickr.com/3311/3577858126_093b727095.jpg
  http://www.chem.utoronto.ca/coursenotes/analsci/stats/images/linreggraph.gif
  http://thumb11.shutterstock.com.edgesuite.net/display_pic_with_logo/66811/66811,1179456692,1/stock-photo-an-upwardgraph-on-a-green-chalkboard-3322824.jpg
  http://www.bio.uu.nl/~biostat/outlier.gif
  Neil Weiss, Elementary Statistics, 7th Edition
  http://www2.selu.edu/academics/faculty/dgurney/math241/stattopics/scatanal_files/image004.gif
  http://pixdaus.com/pics/1221327421sxkwkqy.jpg
  http://www.teachbabymusic.com/img/piano_r3_c1.jpg
  http://image3.examiner.com/images/blog/exid30987/images/resized_child_eating_ice_cream.jpg
  http://blogs.smh.com.au/girlsguide/garbo313.jpg
  http://www.straitstimes.com/sti/stimedia/image/20100317/smoker-reuters.jpg
  http://www1.pictures.fp.zimbio.com/marcia+cross+running+errands+brentwood+65f2awtuzjal.jpg