CS 361: Probability & Statistics

January 24, 2018 CS 361: Probability & Statistics Relationships in data

Standard coordinates: If we have two quantities of interest in a dataset, we might like to plot their histograms and compare the two quantities, even if they have different units or measure entirely different things. If we wanted to compare histograms of internship earnings and GPA to see if there's something similar about them, how would we proceed?

Standard coordinates: A transformation of the dataset. Create a new dataset with standardized location (subtract the mean from each data item) and standardized scale (divide each data item by the standard deviation): $\hat{x}_i = \frac{x_i - \mathrm{mean}(\{x\})}{\mathrm{std}(\{x\})}$. The new dataset is dimensionless, has 0 mean, and unit standard deviation.

Standard coordinates Recipe for computing standard coordinates from your dataset
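The recipe can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the function name `standardize` and the sample GPA values are made up, and it assumes the population standard deviation (dividing by N).

```python
from statistics import fmean, pstdev

def standardize(xs):
    """Return xs in standard coordinates: subtract the mean, divide by the std."""
    mu = fmean(xs)
    sigma = pstdev(xs)  # assumption: population standard deviation (divide by N)
    if sigma == 0:
        raise ValueError("standard deviation is zero; standard coordinates are undefined")
    return [(x - mu) / sigma for x in xs]

gpa = [2.0, 3.0, 4.0]        # hypothetical data
gpa_hat = standardize(gpa)   # dimensionless, mean 0, standard deviation 1
```

Because the result is dimensionless, two standardized quantities (say, earnings and GPA) can be compared on the same axes.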

Standard normal data A wide variety of data, when standardized, will have a particular look and even fit a particular mathematical curve.

Boxplots Another type of visualization

How to make a box plot: The height of the box runs from q1 to q3; the width is whatever makes it look nice. Identify the median. Use a rule for outliers: for example, bigger than q3 + 1.5(iqr) or smaller than q1 - 1.5(iqr). Whiskers extend to the largest data item which isn't an outlier and the smallest data item which isn't an outlier. Outlier data points are indicated.
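The numbers a box plot needs can be computed before anything is drawn. The sketch below is one possible reading of the steps above: the helper name `boxplot_stats` and the data are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since tools differ on how q1 and q3 are interpolated.

```python
from statistics import median, quantiles

def boxplot_stats(xs):
    """Summarize xs with the quantities a box plot draws."""
    # Quartile conventions differ between tools; "inclusive" is one choice (an assumption).
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr    # anything smaller is treated as an outlier
    high_fence = q3 + 1.5 * iqr   # anything bigger is treated as an outlier
    inliers = [x for x in xs if low_fence <= x <= high_fence]
    outliers = [x for x in xs if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": median(xs),
        "q3": q3,
        "whisker_low": min(inliers),    # smallest item that isn't an outlier
        "whisker_high": max(inliers),   # largest item that isn't an outlier
        "outliers": outliers,
    }

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up data with one extreme value
stats = boxplot_stats(data)
```

With this data the value 100 falls outside the upper fence, so the upper whisker stops at 9 and 100 is marked separately.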

Chapter 2: Relationships in data

Relationships: Most of what we have covered so far has been about visualizing or describing a single dimension of a dataset. We may be interested in how two or more dimensions of a dataset relate to one another. We might expect there to be a relationship between temperature and latitude, height and weight, hours spent studying and GPA, etc.

Visualizing relationships

Plotting 2d categorical data: We could try to collapse two categorical variables with m and n different values into a single variable with mn values and use a bar chart. What could go wrong? mn could be too large for the chart to be readable.
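As a small illustration of the collapsing idea, the sketch below counts combined categories for a made-up dataset. The rows, the category names, and the `" / "` label format are all hypothetical.

```python
from collections import Counter

# Hypothetical rows: (class year, major)
rows = [
    ("freshman", "CS"),
    ("freshman", "Math"),
    ("senior", "CS"),
    ("freshman", "CS"),
]

# Collapse the two categorical variables into a single combined variable.
combined = Counter(f"{year} / {major}" for year, major in rows)

m = len({year for year, _ in rows})    # distinct values of the first variable
n = len({major for _, major in rows})  # distinct values of the second variable
max_bars = m * n                       # upper bound on the number of bars
```

Each key of `combined` would become one bar. With m and n in the dozens, `max_bars` quickly exceeds what a readable bar chart can hold.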

Pie charts: An alternative is a pie chart. Small differences may be hard to judge, though.

Stacked bar charts

Heat Maps

3D bar charts

3D bar charts occlusion

Time series and spatial data

Time series data: Sometimes there is something in the data that indicates an ordering in time for the data (month, day, year, etc.). Plot the points and connect them with lines. Look for trends: upwards, downwards, periodic? Ask yourself what's happening before and after the cutoffs chosen.

Time series

Relationships in time series: We can visually inspect a relationship between two variables in a dataset with these plots. Why might the number of pelts be periodic? The price?

Plotting spatial data

Spatial data and Cholera

Spatial data

Scatterplots

Visualizing two variables: scatterplots. A 2D plot with one numerical variable on each axis.

Scatterplots: Choose 2 of the d variables in your dataset that you're interested in investigating for some relationship. Call one of the variables x and the other y, creating a new dataset of (x, y) pairs. Then plot a mark on a graph for each data item at the (x, y) coordinate given by the 2 variables you've chosen to look at. It doesn't really matter which is x and which is y (what if we flipped them?).

Including more information: It's possible to use point size, point color, or different types of points (x's and o's, for instance) to indicate the values of other variables in the plot. Is there any difference between sex 1 and 2? Is there a relationship between temp and HR?

Scale matters: The same dataset, outliers removed, at two different axis scales.

Normalization: The top figure shows the time series data; the bottom two show the same data on a scatter plot. In the bottom left, it's hard to see the law of supply and demand. Normalization reveals that this is a scaling artifact.

Scatterplots summary: A good first choice when dealing with 2 numerical dimensions of data. Scale matters, so it's a good idea to use standard coordinates. You may want to remove outliers or unusual data.

Correlation

Correlation: Broadly, if x changes, what does y do? If small values of x tend to occur with small values of y (and large values of x with large values of y), we say there is positive correlation between x and y.

Correlation: If small values of x tend to occur with large values of y, and large values of x tend to occur with small values of y, we say that x and y are negatively correlated.

Correlation: When there is no tendency for x and y to be either large or small together, we say there is zero correlation. Our data will look more like a blob.

Examples: Lines of code in a codebase and number of bugs? Body temperature and height? GPA and hours spent playing video games? Earnings and happiness?

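Correlation coefficient: Suppose we have N data items that are each 2-vectors $(x_i, y_i)$. Normalize the data to standard coordinates $(\hat{x}_i, \hat{y}_i)$. The correlation coefficient of x and y is the mean of the product, i.e. $\mathrm{corr}(\{(x, y)\}) = \frac{1}{N} \sum_{i=1}^{N} \hat{x}_i \hat{y}_i$.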
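The definition translates directly into code: standardize each variable, then take the mean of the products. This is a minimal sketch with hypothetical function names and made-up data, assuming the population standard deviation (dividing by N).

```python
from statistics import fmean, pstdev

def standardize(xs):
    """Convert xs to standard coordinates (mean 0, standard deviation 1)."""
    mu, sigma = fmean(xs), pstdev(xs)  # assumption: population std (divide by N)
    if sigma == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return [(x - mu) / sigma for x in xs]

def correlation(xs, ys):
    """Mean of the product of the standardized x and y values."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    x_hat, y_hat = standardize(xs), standardize(ys)
    return fmean([a * b for a, b in zip(x_hat, y_hat)])

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
r_up = correlation(xs, [2.0, 4.0, 6.0, 8.0, 10.0])    # y rises with x
r_down = correlation(xs, [10.0, 8.0, 6.0, 4.0, 2.0])  # y falls as x rises
```

Here `r_up` comes out at 1 and `r_down` at -1 (up to floating-point rounding), since each y is an exact linear function of x.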

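Properties of correlation: The correlation coefficient is symmetric: $\mathrm{corr}(\{(x, y)\}) = \mathrm{corr}(\{(y, x)\})$. It is not changed by translating the data: $\mathrm{corr}(\{(x + c, y)\}) = \mathrm{corr}(\{(x, y)\})$. Scaling may change the sign: $\mathrm{corr}(\{(ax, y)\}) = \mathrm{sign}(a)\,\mathrm{corr}(\{(x, y)\})$ for $a \neq 0$.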
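These properties are easy to check numerically. The sketch below uses a hypothetical `correlation` helper (mean of the product of standardized values, population standard deviation assumed) and made-up data.

```python
from statistics import fmean, pstdev

def correlation(xs, ys):
    """Mean of the product of x and y in standard coordinates."""
    mx, sx = fmean(xs), pstdev(xs)  # assumption: population std (divide by N)
    my, sy = fmean(ys), pstdev(ys)
    return fmean([((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)])

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0]

r = correlation(xs, ys)
r_swapped = correlation(ys, xs)                                        # symmetry
r_shifted = correlation([x + 10 for x in xs], [y - 3 for y in ys])     # translation
r_scaled = correlation([2 * x for x in xs], ys)                        # positive scaling
r_flipped = correlation([-2 * x for x in xs], ys)                      # negative scaling
```

Up to rounding, `r_swapped`, `r_shifted`, and `r_scaled` all equal `r`, while `r_flipped` equals `-r`: only scaling by a negative factor changes the result, and it changes only the sign.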

Properties of correlation How do these two conceptualizations line up?

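Properties of correlation: The largest possible correlation is 1 and happens when $\hat{x}_i = \hat{y}_i$ for every i (in standard coordinates). The smallest possible correlation is -1 and happens when $\hat{x}_i = -\hat{y}_i$ for every i.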

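Proving correlation bounds: Proposition: $-1 \le \mathrm{corr}(\{(x, y)\}) \le 1$. First note that the correlation can be written as a dot product of two vectors. Let $u = \frac{1}{\sqrt{N}}[\hat{x}_1, \ldots, \hat{x}_N]$ and $v = \frac{1}{\sqrt{N}}[\hat{y}_1, \ldots, \hat{y}_N]$. We have $\mathrm{corr}(\{(x, y)\}) = \frac{1}{N}\sum_i \hat{x}_i \hat{y}_i = \sum_i \frac{\hat{x}_i}{\sqrt{N}} \frac{\hat{y}_i}{\sqrt{N}}$, which is $u \cdot v$. Recall $u \cdot v = \|u\| \, \|v\| \cos\theta$, where $\theta$ is the angle between u and v.

Proof: Since $\hat{x}$ is our standardized dataset, we have $\|u\|^2 = \frac{1}{N}\sum_i \hat{x}_i^2 = \mathrm{std}(\{\hat{x}\})^2 = 1$. Similar reasoning applies for y, so $\|v\| = 1$. So we know that $\mathrm{corr}(\{(x, y)\}) = u \cdot v = \cos\theta$. And since $-1 \le \cos\theta \le 1$, we've shown that $-1 \le \mathrm{corr}(\{(x, y)\}) \le 1$.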

Using correlation to predict: One useful task is to take what we know about the data we have and make predictions about data we don't yet have, or about measurements we have that are incomplete. Example: we might like to go into the fur pelt business and have a bunch of historical data on supply and prices. We know the price today and would like to guess the total supply. That is, we have a bunch of pairs (x, y) for prices and supply, but our state of knowledge today might be (x_0, ???). Correlation will be useful for this task.

Prediction: We want a predictor that we can apply to any x. We want it to behave well on our existing data. We can choose the predictor by considering the error the predictor will have.

Prediction: Since it's possible to convert to and from standard coordinates, and we know standard coordinates have nice properties like 0 mean and 1 standard deviation, we will first convert to standard coordinates. We will write $\hat{y}_i^p$ to indicate our predicted value of $\hat{y}_i$ for the point $\hat{x}_i$.
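One way to make "choose the predictor by considering the error" concrete is sketched below. It is an illustration under stated assumptions, not the method from the lecture: it assumes a predictor of the form prediction = a * (standardized x), tries a handful of candidate slopes `a`, and keeps the one with the smallest mean squared error on the existing data. The price and supply numbers, the candidate list, and all function names are made up.

```python
from statistics import fmean, pstdev

def to_standard(v, mu, sigma):
    """Convert a value to standard coordinates."""
    return (v - mu) / sigma

def from_standard(v_hat, mu, sigma):
    """Convert a value in standard coordinates back to original units."""
    return v_hat * sigma + mu

def mean_squared_error(a, x_hat, y_hat):
    """Error of the assumed predictor y_hat_p = a * x_hat on existing data."""
    return fmean([(y - a * x) ** 2 for x, y in zip(x_hat, y_hat)])

prices = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical historical prices
supply = [9.0, 8.5, 6.0, 4.5, 2.0]   # hypothetical historical supply

mx, sx = fmean(prices), pstdev(prices)  # assumption: population std (divide by N)
my, sy = fmean(supply), pstdev(supply)
x_hat = [to_standard(x, mx, sx) for x in prices]
y_hat = [to_standard(y, my, sy) for y in supply]

# Compare a few candidate slopes by their error on the data we already have.
candidates = [-1.0, -0.5, 0.0, 0.5, 1.0]
best_a = min(candidates, key=lambda a: mean_squared_error(a, x_hat, y_hat))

# Predict supply for a new price x_0, then convert back to original units.
x0 = 3.5
y0_pred = from_standard(best_a * to_standard(x0, mx, sx), my, sy)
```

A coarse grid of slopes is used only to keep the sketch short. Working in standard coordinates means the same candidate slopes make sense regardless of the units of price and supply.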