Big Data Analysis with Apache Spark UC#BERKELEY


This Lecture: Relation between Variables
An association; a trend
» Positive association or negative association
A pattern
» Could be any discernible shape
» Could be linear or non-linear
Visualize, then quantify, but be cautious!

The Rhine Paradox*
Joseph Rhine was a parapsychologist in the 1950s
» Experiment: subjects guess whether 10 hidden cards were red or blue
He found that about 1 person in 1,000 had Extra Sensory Perception!
» They could correctly guess the color of all 10 cards
*Example from Jeff Ullman/Anand Rajaraman

The Rhine Paradox
Called back the psychic subjects and had them repeat the test
» They all failed
Concluded that the act of telling psychics that they have psychic abilities causes them to lose it (!)
Q: What's wrong with his conclusion?

Rhine's Error
Q: What's wrong with his conclusion?
» 2^10 = 1,024 combinations of red and blue of length 10
» 0.98 probability that at least 1 subject in 1,000 will guess correctly
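This arithmetic is easy to check. A minimal sketch (note that the exact probability depends on how many subjects are tested: with 1,000 independent subjects the chance comes out near 0.62, and a figure near 0.98 corresponds to roughly 4,000 subjects):

```python
# Each subject guesses 10 cards, so an all-correct run has
# probability 1 / 2**10 = 1/1024 for a subject with no ESP.
p_all_correct = 1 / 2**10

def prob_at_least_one(n_subjects):
    """Chance that at least one of n independent subjects
    guesses all 10 cards correctly by luck alone."""
    return 1 - (1 - p_all_correct) ** n_subjects

print(round(prob_at_least_one(1000), 3))   # about 0.62
print(round(prob_at_least_one(4000), 3))   # about 0.98
```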

The Correlation Coefficient r
Pearson product-moment correlation coefficient
» Measures linear association between X and Y
» Based on standard units (standard deviations)
Ranges from -1 to 1
» r = 1: scatter is a perfect straight line sloping up
» r = -1: scatter is a perfect straight line sloping down
» r = 0: no linear association; uncorrelated

Correlation Examples (scatter plots from https://commons.wikimedia.org/wiki/user:kiatdd)
» +1 Correlation: total positive correlation
» -1 Correlation: total negative correlation
» 0 Correlation: no correlation
» -1 < r < 1: intermediate correlation

Definition of the Correlation Coefficient r
» Average of the products of (x in standard units) and (y in standard units)
» Measures how clustered the scatter is around a straight line
Further properties of r
» r is a pure number with no units
» r is not affected by changing units of measurement
» r is not affected by switching the x and y axes
Remember: correlation is not causation
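The definition above translates directly into code. A minimal sketch, using the population standard deviation for standard units and made-up data:

```python
from statistics import mean, pstdev

def standard_units(values):
    """Convert values to standard units: (value - mean) / SD."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

def correlation(xs, ys):
    """r = average of the products of x and y in standard units."""
    return mean(sx * sy for sx, sy in zip(standard_units(xs), standard_units(ys)))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]          # perfectly linear, sloping up
print(correlation(xs, ys))     # → 1.0 (up to floating-point rounding)
```

Because r is built from standard units, rescaling either variable or swapping the axes leaves it unchanged.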

Graph of Averages
A visualization of x and y pairs
» Group each x with a representative x value (e.g., by rounding)
» Average the corresponding y values for each group
» One point per representative x value and average y value
If the association between x and y is linear, then points in the graph of averages tend to fall on the regression line
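A sketch of the grouping step, using rounding as the representative-value rule (the data values here are invented for illustration):

```python
from collections import defaultdict
from statistics import mean

def graph_of_averages(xs, ys):
    """Group each x with a representative value (rounding here),
    then average the corresponding y values in each group."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[round(x)].append(y)
    # One (representative x, average y) point per group, in x order.
    return sorted((rep, mean(group)) for rep, group in groups.items())

xs = [0.9, 1.1, 1.8, 2.2, 3.0]
ys = [2.0, 2.4, 3.9, 4.1, 6.0]
print(graph_of_averages(xs, ys))   # → [(1, 2.2), (2, 4.0), (3, 6.0)]
```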

Regression to the Mean
A statement about x and y pairs
» Measured in standard units
» Describing the deviation of x from 0 (the average of the x's)
» And the deviation of y from 0 (the average of the y's)
On average, y deviates from 0 less than x deviates from 0
Not true for all points; a statement about averages

Slope & Intercept
In standard units, the regression line passes through (0, 0):
(estimate of y in standard units) = r × (x in standard units)
Lines can be expressed by slope & intercept; in original units, the regression line has the equation:
estimate of y = slope × x + intercept

Regression Line
(Figure: scatter plots of the regression line. In original units, the line passes through (Average x, Average y) with slope r × SD y / SD x; in standard units, it passes through (0, 0).)

The Regression Model
A model is a set of assumptions about the data; the regression model justifies using the regression line as a predictor
For each point:
» Sample an x
» Find its y on the regression line (signal)
» Add a sample of deviation (noise)
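The model's two steps, signal plus noise, can be simulated directly. A sketch with made-up true-line parameters (the constants below are illustrative, not from the lecture):

```python
import random

random.seed(0)

TRUE_SLOPE, TRUE_INTERCEPT = 2.0, 1.0   # hypothetical true line
NOISE_SD = 0.5                          # hypothetical noise level

def generate_point():
    """One draw from the regression model: sample an x, take its y on
    the true line (signal), then add normally distributed noise."""
    x = random.uniform(0, 10)
    signal = TRUE_SLOPE * x + TRUE_INTERCEPT
    noise = random.gauss(0, NOISE_SD)
    return x, signal + noise

sample = [generate_point() for _ in range(5)]
print(sample)
```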

Predicting with Regression
» Justified by assuming that x and y are linearly related
» Only reasonable within the range of the observed data
» Predictions vary by sample, with more variability at the extremes
» Predictions are average values, not perfect guesses

Errors: Evaluating Prediction Accuracy
Error: the difference between estimated and actual values
» Errors can be positive or negative
» Errors depend on the data and the line chosen
Common metric: Root Mean Squared Error (or Root Mean Squared Deviation)
RMSE = √(mean of the squared errors)
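A direct implementation of the metric; the actual and estimated values below are made up for illustration:

```python
from math import sqrt

def rmse(actual, estimated):
    """Root Mean Squared Error: the square root of the
    mean of the squared errors."""
    errors = [a - e for a, e in zip(actual, estimated)]
    return sqrt(sum(err ** 2 for err in errors) / len(errors))

actual = [3, 5, 7, 9]
estimated = [2.5, 5.5, 7, 9.5]   # hypothetical predictions
print(rmse(actual, estimated))   # → about 0.433
```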

Regression by Minimizing Errors
The regression line is the one that minimizes the MSE over a collection of paired values
Minimizing any of these quantities yields equivalent results:
» Root Mean Squared Error
» Mean Squared Error
» Total Squared Error
The minimizing slope and intercept are unique for regression
Numerical minimization is approximate but effective
Lots of machine learning involves minimizing error
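A quick check that the closed-form line (slope r × SDy/SDx, passing through the means) really is the MSE minimizer: perturbing it in any direction only increases the error. The data are made up:

```python
from statistics import mean, pstdev

def mse(xs, ys, slope, intercept):
    """Mean squared error of the line y = slope*x + intercept."""
    return mean((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]   # made-up data with an upward trend

# Closed-form least-squares line.
mx, my = mean(xs), mean(ys)
sx, sy = pstdev(xs), pstdev(ys)
r = mean(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
slope = r * sy / sx
intercept = my - slope * mx

# Any nearby perturbed line has a larger mean squared error.
best = mse(xs, ys, slope, intercept)
for ds in (-0.1, 0.1):
    for di in (-0.1, 0.1):
        assert mse(xs, ys, slope + ds, intercept + di) > best
print(slope, intercept, best)
```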

Multiple Linear Regression
Simple regression: one input, one output
Multiple regression: many inputs, one output
GPA = a_days × days + a_contributions × contributions + b
Find the a's and b by minimizing RMSE
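The same minimize-the-error idea extends to two inputs. A sketch using plain gradient descent on MSE (minimizing MSE and minimizing RMSE give the same coefficients); the GPA, days, and contributions names mirror the slide, and the data are fabricated:

```python
def fit_two_input_line(x1s, x2s, ys, steps=20000, lr=0.01):
    """Fit y ≈ a1*x1 + a2*x2 + b by gradient descent
    on the mean squared error."""
    a1 = a2 = b = 0.0
    n = len(ys)
    for _ in range(steps):
        residuals = [y - (a1 * x1 + a2 * x2 + b)
                     for x1, x2, y in zip(x1s, x2s, ys)]
        a1 += lr * 2 / n * sum(r * x1 for r, x1 in zip(residuals, x1s))
        a2 += lr * 2 / n * sum(r * x2 for r, x2 in zip(residuals, x2s))
        b  += lr * 2 / n * sum(residuals)
    return a1, a2, b

# Fabricated data generated from GPA = 0.5*days + 0.2*contributions + 1
days          = [1, 2, 3, 4, 5, 6]
contributions = [2, 1, 4, 3, 6, 5]
gpa = [0.5 * d + 0.2 * c + 1 for d, c in zip(days, contributions)]
print(fit_two_input_line(days, contributions, gpa))
```

With noiseless data, the descent recovers the generating coefficients; real data would only approximate them.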

Statistics Terminology
» Inference: making conclusions from random samples
» Population: the entire set that is the subject of interest
» Parameter: a quantity computed for the entire population
» Sample: a subset of the population
» Random sample: a sample where we know, in advance, the chance that any subset of the population will enter the sample
» Statistic: a quantity computed for a particular sample

Estimating a Parameter
1. Describe the population and a parameter of interest
2. Acquire a random sample
3. Compute statistics
4. Pick an estimate
5. Draw conclusions

Empirical Distributions, Statistics & Parameters
A reasonable way to estimate a parameter (e.g., population average, max, median, ...) is to compute the corresponding statistic for a sample
Different samples will lead to different estimates: Population (fixed) → Sample (random) → Statistic (random)
Goal: infer the variability of a statistic, using only a sample

Anatomy of a Sample: Sample Variability
A sample contains not just a statistic, but a whole data set!
The same sample can be used for multiple purposes:
» Compute a statistic that is an estimate of a parameter
» Approximate the shape of the population distribution

Confidence Intervals: A Margin of Error
Estimation is a process with a random outcome: Population (fixed) → Sample (random) → Statistic (random)
Instead of picking a single estimate of the parameter, we can pick a whole interval: lower bound to upper bound
A 95% Confidence Interval is an interval constructed so that it will contain the parameter for 95% of samples
For a particular sample, the interval either contains the parameter or it doesn't: the process works 95% of the time

Resampling Inferential idea: When we wish we could sample again from the population, we instead sample from the sample

Intervals
If an interval around the parameter contains the estimate, then a (reflected) interval of the same width around the estimate contains the parameter (and vice versa)

Resampled Confidence Interval
Inferential idea: the variability of the sampled distribution is a useful proxy for the variability of the original distribution
» Collect a random sample
» Compute your estimate (e.g., the sample average)
» Resample K samples from the sample, with replacement
» Compute the same statistic for each resampled sample
» Take percentiles of the deviations from the estimate
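The resampling recipe can be sketched end-to-end for the sample average. This uses the common percentile variant, which takes percentiles of the resampled statistics directly rather than of their deviations from the estimate; the data values are invented:

```python
import random

def bootstrap_ci(sample, statistic, k=5000, level=95):
    """Percentile bootstrap: resample with replacement k times,
    then take the middle `level`% of the resampled statistics."""
    stats = []
    for _ in range(k):
        resample = [random.choice(sample) for _ in sample]
        stats.append(statistic(resample))
    stats.sort()
    tail = (100 - level) / 200            # e.g. 2.5% in each tail
    return stats[int(tail * k)], stats[int((1 - tail) * k) - 1]

random.seed(42)
sample = [23, 19, 31, 25, 22, 27, 30, 18, 26, 24]   # hypothetical sample
print(bootstrap_ci(sample, lambda s: sum(s) / len(s)))
```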

Verifying Intervals
When all you have is a sample, it is impossible to verify empirically whether the interval you computed is correct
If you have the whole population, then you can check how often the intervals are correct

Simple Linear Regression
» Simple: one explanatory or predictor variable x
» Can be used to estimate the response variable y based on x
» The strength of the linear relation between x and y is measured by the correlation r

Regression Line
Estimate of y = slope × x + intercept
slope = r × (SD of y) / (SD of x)
intercept = (average of y) − slope × (average of x)
Best among all straight lines for estimating y based on x
"Best": minimizes the RMSE of estimation

Tyche, the Goddess of Chance

A Model: What Tyche Does
» Each point is moved off the true line by a distance drawn at random from a normal distribution with mean 0
» Each point's distance is drawn independently from the same normal distribution

Plan for Prediction
If our model is good:
» The regression line is close to Tyche's true line
» Given a new value of x, predict y by finding the point on the regression line at that x
» Bootstrap the scatter plot
» Get a new prediction using the regression line that goes through the resampled plot
» Repeat the two steps above many times
» Get an interval of predictions of y for the given x

Predictions at Different Values of x
Since y is correlated with x, the predicted values of y depend on the value of x
The width of the prediction interval also depends on x
» Typically, intervals are wider for values of x that are further away from the mean of x

Rain on the Prediction Parade We observed a positive slope and used it to make our predictions But what if the scatter plot got its positive slope just by chance? What if the true line is actually FLAT?

Confidence Interval for the True Slope
Steps:
» Bootstrap the scatter plot
» Find the slope of the regression line through the bootstrapped plot
» Repeat
» Draw the empirical histogram of all the generated slopes
» Take the middle 95% interval
That's an approximate 95% confidence interval for the slope of the true line
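These steps can be sketched end-to-end. The scatter here is synthetic, generated with a known true slope of 2, so we know what the interval should bracket:

```python
import random

random.seed(7)

def fit_slope(points):
    """Least-squares slope, computed as covariance / variance
    (equivalent to r * SDy / SDx)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cov = sum((x - mx) * (y - my) for x, y in points)
    var = sum((x - mx) ** 2 for x, _ in points)
    return cov / var

# Synthetic scatter: true slope 2, plus normal noise.
points = [(x, 2 * x + random.gauss(0, 3)) for x in range(30)]

slopes = []
for _ in range(2000):
    resample = [random.choice(points) for _ in points]
    slopes.append(fit_slope(resample))
slopes.sort()
lo, hi = slopes[int(0.025 * 2000)], slopes[int(0.975 * 2000) - 1]
print((lo, hi))   # approximate 95% CI for the true slope
print(0 < lo)     # interval excludes 0 here, so reject a flat true line
```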

Inference for the True Slope
Null hypothesis: the slope of the true line is 0
Alternative hypothesis: no, it's not
Method:
» Construct a bootstrap confidence interval for the true slope
» If the interval doesn't contain 0, reject the null hypothesis
» If the interval does contain 0, there isn't enough evidence to reject the null hypothesis

Confidence Intervals for Testing
Null hypothesis: a parameter is equal to a specified value
Alternative hypothesis: no, it's not
Method:
» Construct a confidence interval for the parameter
» If the specified value isn't in the interval, reject the null hypothesis
» If the specified value is in the interval, there isn't enough evidence to reject the null hypothesis