Data Analysis and Statistical Methods Statistics 651


http://www.stat.tamu.edu/~suhasini/teaching/
Suhasini Subba Rao

Review

In the previous lecture we looked at the statistics of M&Ms. This example illustrates several important concepts:
- What a population and a sample are.
- What an estimator is. A population parameter (such as the mean, the average over everyone in the population) is based on the whole population. An estimator is an estimate of the population parameter and is based on the sample.
- Roughly, what a probability is.
- How the sample size (the number of observations in a sample) can influence the quality of the estimator.

A representative sample

When inferring something about a population based on a sample, we need to ensure that the sample is representative of the population. For example, if we want to infer something about the mean height of students at A&M (the population is all students at A&M) based on a sample containing only women, it is likely that our sample will be biased. The female students form a subpopulation of the population of all students. Our sample is better suited to making inference about the subpopulation of female students than about the entire population.

Is this class a representative sample of students at A&M?

Designing an experiment well is extremely important, but it is something we shall not cover in this course.

Different types of variables

Usually it is not the population itself that we are interested in, but certain measurements (variables) in that population. For example, if the population is the human population and you are a nutritionist, you may be interested in the heights or weights of the individuals. On the other hand, if you are a demographer you may be interested in the age, gender or ethnic group of the individuals.

What are variables?

Variables are what we measure in the population (and in the sample). For example, in a bag of M&Ms we may be interested in the majority colour, the number of M&Ms, the weight of the bag, the type of M&M (chocolate or peanut), etc.

bag no.   majority colour   number of M&Ms   weight of bag   type
1         blue              18               2.2 ounces      chocolate
2         brown             19               2.3 ounces      chocolate
3         red               12               2.1 ounces      peanut

Types of Variables

From the above we can see that variables come in several different types:
- Numerical: e.g. weight (2.2 ounces)
- Binary: e.g. type (chocolate/peanut, coded 0/1)
- Categorical: e.g. majority colour (blue/brown/red/green)

Numerical variables can be further partitioned into continuous numerical variables (such as the weight of an M&M bag) and discrete (count) variables, such as the number of M&Ms in a bag, or ordinal data (such as satisfaction ratings on a scale from 1 to 6).

In statistics we treat different types of variables in different ways. During the course we will consider different methods for treating different types of variables.

Examples of variables

What type of variables are the following?
- The gender of a randomly chosen person (we can use M/F or 0/1)?
- The make of bicycle of a randomly chosen person?
- The number of bicycles owned by a randomly chosen person?
- The height of a person?
- Whether a randomly selected person responds to a drug?
- The prediction of Paul the octopus (win or lose)?

Statistical Analysis comes in three stages

(1) Data description. When starting a data analysis, first use a graphical method to represent the data (Chapter 3, Ott and Longnecker), i.e. histograms, pie charts, line graphs, box and whisker plots, etc.

(2) Summary statistics: average (mean), median, variance, quantiles, etc. These describe the data set (which can be large) in a few numbers and also give us an idea of the spread of the data.

(3) Quantitative techniques (this will be the main focus of the course, Chapters 3-11, Ott and Longnecker). We can evaluate an average, but what does this average tell us about the true population average (usually called the population mean)? How close is the sample average to the population average? We will be finding out a few weeks from now.
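As a small illustration of stage (2), the sketch below computes a few summary statistics for the "number of M&Ms" column of the three-bag table. It uses Python's standard `statistics` module; the variable names are illustrative and do not come from the lecture, and three observations are of course far too few for a serious analysis.

```python
import statistics

# "number of M&Ms" column from the three-bag table in the lecture
counts = [18, 19, 12]

mean = statistics.mean(counts)          # sample average
median = statistics.median(counts)      # middle value after sorting
variance = statistics.variance(counts)  # sample variance (divides by n - 1)

print(f"mean = {mean:.2f}, median = {median}, variance = {variance:.2f}")
```

The median (18) sits above the mean here because the single small bag (12) pulls the average down, which is the kind of information about spread that stage (2) is meant to expose.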

The start of any statistical analysis: Data description

There are several ways to represent data. For example, the Antarctic Peninsula temperature data, observed monthly between 1951 and 2005, can be plotted against time. This is usually called a time series.

[Figure 1: Plot of minimum monthly temperatures (min temp) against time.]

What can you see from it? It seems to be seasonal. Is there a slow increase? Can we explain any changes using external factors? We shall be answering some of these questions later in the course.

But the main point is: a good plot can tell more than a thousand words! There are many useful plotting tools, such as time series plots, pie charts, etc. (see Ott and Longnecker, Chapter 3). Always start any statistical analysis with some plots and summary statistics (the sample mean, etc.). An important plotting tool is the histogram, which we now define.

Data description: Histograms

An important graphical tool in Statistics is the histogram. The histogram is a plotting device for checking the frequency of observations in a certain interval.

Some definitions:
- Frequency is the number (or percentage) of data points lying in an interval.
- Range is the interval that starts at the smallest of all the observations and ends at the largest. E.g. if 22, 23, 39, 37, 31, 24, 24, 26, 27, 41 are the observations, then the smallest value is 22 and the largest is 41. The range is the interval [22, 41].

Plotting a Histogram: The Recipe (through an Example)

Data: weight of 10 M&M bags
22, 23, 39, 37, 31, 24, 24, 26, 27, 41.

Range of weights: [22, 41].

Divide an interval which contains the range of weights into subintervals (usually of the same length). The interval [20, 44] clearly contains the interval [22, 41]:

Subintervals: [20, 24] [25, 29] [30, 34] [35, 39] [40, 44]

The length of a subinterval is the bin width. Written with whole-number endpoints, each subinterval above spans 4 units (e.g. 24 - 20 = 4), but consecutive subintervals start 5 apart, so on a continuous scale the bins are [20, 25), [25, 30), and so on, and the bin width is 5.
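The first steps of the recipe (find the range, then cut a covering interval into subintervals) can be sketched in Python. The covering interval [20, 45) and the spacing of 5 between bin starts are assumptions chosen to reproduce the subintervals listed above; the recipe itself does not prescribe them, and the variable names are illustrative.

```python
# Weights of the 10 M&M bags from the example above
weights = [22, 23, 39, 37, 31, 24, 24, 26, 27, 41]

# Range: from the smallest to the largest observation
lo, hi = min(weights), max(weights)
print(f"range: [{lo}, {hi}]")

# Assumed covering interval and spacing (not prescribed by the recipe)
start, stop, step = 20, 45, 5
edges = list(range(start, stop + 1, step))  # 20, 25, ..., 45

# Whole-number labels for the subintervals: (20, 24), (25, 29), ...
bins = [(a, b - 1) for a, b in zip(edges, edges[1:])]
print("subintervals:", bins)

# Every observation must fall inside the covering interval
assert all(start <= w < stop for w in weights)
```

Changing `step` is the quickest way to experiment with other bin widths on the same data.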

Compute the percentage of observations in each interval:

interval   [20-24]   [25-29]   [30-34]   [35-39]   [40-44]
count      4         2         1         2         1
percent    40%       20%       10%       20%       10%

The Histogram (either using the count or the percentage):

[Figure: Histogram of data.age. Horizontal axis: data.age, from 20 to 45; vertical axis from 0 to 4.]

The general recipe for making a relative frequency plot

- Choose an interval which contains the range of observations. In the previous example the interval [20, 45] contained the range [22, 41].
- Divide the interval into subintervals (the bins).
- Calculate the number of observations in each subinterval (this is called the frequency).
- Calculate the relative frequency. That is,
  relative frequency = (number of observations in the subinterval, i.e. the frequency) / (total number of observations)
- Plot the relative frequency against the subintervals.

We observe that the relative frequency is like a probability, or the chance of drawing an observation from inside that interval.

What can we see from a histogram?

We often plot the relative frequency against the subintervals rather than the frequency against the subintervals. This is because the relative frequency does not depend on the sample size, only on the proportion of observations in each subinterval. In other words, if we made relative frequency plots of the data sets

data 1: 22, 23, 40, 37, 31, 25, 25, 26, 27, 41
data 2: 22, 23, 40, 37, 31, 25, 25, 26, 27, 41, 22, 23, 40, 37, 31, 25, 25, 26, 27, 41

we would get identical plots (since the second data set is just the first one repeated twice).

A histogram allows us to see:
- In what interval a variable most frequently arises.
- The spread of the data, and where the data is mainly concentrated.

Warning: The histogram depends heavily on the bin width. In practice, it is useful to plot several histograms with different bin widths and compare the plots. (How to choose the bin width is a difficult statistical question; we shall not concern ourselves with it in this course.) The slides that follow illustrate these two features.
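The recipe and the claim about doubled data can be checked with a short sketch. It assumes bins of the form [20, 25), [25, 30), ..., [40, 45), matching the interval [20, 45] used in the recipe; the function name `relative_frequencies` and its parameters are hypothetical, introduced only for this illustration.

```python
from collections import Counter

def relative_frequencies(data, start=20, step=5, n_bins=5):
    """Relative frequency of each bin [start + k*step, start + (k+1)*step)."""
    # Map every observation to the index of the bin it falls into
    counts = Counter((x - start) // step for x in data)
    # Frequency divided by the total number of observations
    return [counts[k] / len(data) for k in range(n_bins)]

data1 = [22, 23, 40, 37, 31, 25, 25, 26, 27, 41]
data2 = data1 * 2  # the same observations, repeated twice

rf1 = relative_frequencies(data1)
rf2 = relative_frequencies(data2)

print(rf1)
print("identical:", rf1 == rf2)
```

The frequencies of `data2` are twice those of `data1`, but dividing by the doubled sample size cancels this out, so the two relative frequency plots coincide.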

Features: Different bin widths, different histograms

[Figure: two panels, each titled "Histogram of population", with horizontal axis "population".]

Here we have plotted the histogram of the same sample using two different bin widths.

Using a histogram to compare populations

A histogram is a very useful tool for comparing samples and seeing whether they come from the same population or from different populations. We will learn more quantitative methods of comparison later in the course. What we do now is just a RULE OF THUMB.

Example

We could expect the temperatures in January in the Antarctic to be higher than those in May (recall that in the Antarctic, January is summer and May is winter). Below are plots from a sample of temperatures taken in January and a sample taken in May. What do you think? It is clear that the two plots are very different.

Comparing temperatures in the Antarctic

[Figure: "Histogram of jan.faraday" (top panel, horizontal axis jan.faraday) and "Histogram of may.faraday" (bottom panel, horizontal axis may.faraday).]

The top plot shows the summer temperatures and the lower plot shows the winter temperatures in the Antarctic between 1951 and 2005. What do you notice?

We see that the two sample histograms seem to have different centers. How can we quantify this difference? There are several ways to do this. One way is to consider a numerical value which describes a feature of the data, and to compare the numerical values from each sample. From the point of view of statistical inference, it is much easier to compare numerical values than graphs.

One way to measure where the histograms are centered is to consider their sample means. Later we shall consider methods which compare the sample means of two populations.

The population and the distribution plots

The histogram is a very important way of visually studying the distribution of data. We can use it to find:
- where values arise most often;
- what the spread of the data is, etc.

The height of the bars in a histogram is very important, as it gives the (sample) frequency of the variable.

Usually the distribution of categorical data (such as colour, gender or subject) is represented with a histogram (for both samples and populations).

The story is a little different for numerical, continuous variables. Usually the population distribution of a numerical variable is not represented with a histogram, but with a closely related cousin called the density function.

The density plot

As we do not observe the population, it is very hard to make a histogram of it (there are certain technical reasons why it cannot be done). However, we often assume a priori that the population distribution has some characteristics. These characteristics are best represented using a density plot and not a histogram.

The density plot is a little different from the histogram, in the sense that now the area under the graph represents frequency.

The histogram and the density plot are related, and using calculus-based arguments one can go from one to the other.

Look at the handwritten handout called density function.pdf.

We see that the density function can have several different shapes. The density function will form an important component of this course. We will be returning to it a little later in the course.

Example 1

Data on the age at time of job turnover and on the reason for the job turnover are displayed here for 250 jobs in a large corporation.

Reason for turnover   ≤ 29   30-39   40-49   ≥ 50   Total
Resigned              30     6       4       20     60
Transferred           12     45      4       5      66
Retired/fired         8      9       52      55     124
Total                 50     60      60      80     250

For each reason, plot a relative frequency histogram for the ages. Compare the three histograms.
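One possible way to set up Example 1 is sketched below: it computes the relative frequencies behind each of the three histograms and leaves the plotting and the comparison to the reader. The age-group labels assume the four columns are "29 or under", "30-39", "40-49" and "50 or over", since the column headings are not fully legible in the source; the variable names are illustrative.

```python
# Age groups (labels are an assumption, see the note above)
age_groups = ["<=29", "30-39", "40-49", ">=50"]

# Counts from the Example 1 table, one row per reason for turnover
turnover = {
    "Resigned": [30, 6, 4, 20],
    "Transferred": [12, 45, 4, 5],
    "Retired/fired": [8, 9, 52, 55],
}

# Relative frequency within each reason: count / row total
rel_freq = {}
for reason, counts in turnover.items():
    total = sum(counts)
    rel_freq[reason] = [c / total for c in counts]

for reason, freqs in rel_freq.items():
    cells = ", ".join(f"{g}: {f:.2f}" for g, f in zip(age_groups, freqs))
    print(f"{reason:<14} {cells}")
```

Dividing by the row total rather than by 250 puts the three reasons on the same scale, so the histograms can be compared even though the groups have different sizes (60, 66 and 124 jobs).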