Part III: Unstructured Data. Lecture timetable. Analysis of data. Data Retrieval: III.1 Unstructured data and data retrieval

Similar documents
Part III: Unstructured Data

Introduction to Statistics

Do students sleep the recommended 8 hours a night on average?

Introduction to Basic Statistics Version 2

Chapter 3 Statistics for Describing, Exploring, and Comparing Data. Section 3-1: Overview. 3-2 Measures of Center. Definition. Key Concept.

Lecture Notes 2: Variables and graphics

Using SPSS for One Way Analysis of Variance

Experimental Design, Data, and Data Summary

MATH 1150 Chapter 2 Notation and Terminology

BNG 495 Capstone Design. Descriptive Statistics

Glossary for the Triola Statistics Series

Variables, distributions, and samples (cont.) Phil 12: Logic and Decision Making Fall 2010 UC San Diego 10/18/2010

Draft Proof - Do not copy, post, or distribute

Introduction to Statistical Data Analysis Lecture 1: Working with Data Sets

Introduction to Statistics for Traffic Crash Reconstruction

Answer keys for Assignment 10: Measurement of study variables (The correct answer is underlined in bold text)

3.1 Measure of Center

Lecture 3. The Population Variance. The population variance, denoted σ 2, is the sum. of the squared deviations about the population

A is one of the categories into which qualitative data can be classified.

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Introduction to Statistics

Lecture Lecture 5

Business Analytics and Data Mining Modeling Using R Prof. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee

Mathematics Grade 7. Solve problems involving scale drawings.

Numerical Methods Lecture 7 - Statistics, Probability and Reliability

Concepts in Statistics

What is Statistics? Statistics is the science of understanding data and of making decisions in the face of variability and uncertainty.

Stochastic calculus for summable processes 1

Ø Set of mutually exclusive categories. Ø Classify or categorize subject. Ø No meaningful order to categorization.

STATISTICS ANCILLARY SYLLABUS. (W.E.F. the session ) Semester Paper Code Marks Credits Topic

Chapter Four. Numerical Descriptive Techniques. Range, Standard Deviation, Variance, Coefficient of Variation

Review of Statistics 101

MARLBORO CENTRAL SCHOOL DISTRICT CURRICULUM MAP. Unit 1: Integers & Rational Numbers

Lecture 2. Descriptive Statistics: Measures of Center

Chapter 2: Tools for Exploring Univariate Data

All the men living in Turkey can be a population. The average height of these men can be a population parameter

CHOOSING THE RIGHT SAMPLING TECHNIQUE FOR YOUR RESEARCH. Awanis Ku Ishak, PhD SBM

Data Collection: What Is Sampling?

Chapter 3. Data Description

Math 90 Lecture Notes Chapter 1

Rational Numbers and Exponents

SESSION 5 Descriptive Statistics

Lecture Slides. Elementary Statistics Tenth Edition. by Mario F. Triola. and the Triola Statistics Series. Slide 1

Mapping Common Core State Standard Clusters and. Ohio Grade Level Indicator. Grade 7 Mathematics

IENG581 Design and Analysis of Experiments INTRODUCTION

Units. Exploratory Data Analysis. Variables. Student Data

Unit 4 Probability. Dr Mahmoud Alhussami

ADMS2320.com. We Make Stats Easy. Chapter 4. ADMS2320.com Tutorials Past Tests. Tutorial Length 1 Hour 45 Minutes

Ø Set of mutually exclusive categories. Ø Classify or categorize subject. Ø No meaningful order to categorization.

Histograms, Central Tendency, and Variability

104 Business Research Methods - MCQs

P8130: Biostatistical Methods I

Vehicle Freq Rel. Freq Frequency distribution. Statistics

MIDTERM EXAMINATION (Spring 2011) STA301- Statistics and Probability

Introduction to statistics

ST Presenting & Summarising Data Descriptive Statistics. Frequency Distribution, Histogram & Bar Chart

Social Studies 201 October 11, 2006 Standard Deviation and Variance See text, section 5.9, pp Note: The examples in these notes may be

DETAILED CONTENTS PART I INTRODUCTION AND DESCRIPTIVE STATISTICS. 1. Introduction to Statistics

Grade 7 Honors Yearlong Mathematics Map

Institute of Actuaries of India

SOL Study Book Fifth Grade Scientific Investigation, Reasoning, and Logic

Subject CS1 Actuarial Statistics 1 Core Principles

Statistics 301: Probability and Statistics Introduction to Statistics Module

Learning Objectives for Stat 225

Sampling, Frequency Distributions, and Graphs (12.1)

Sampling Populations limited in the scope enumerate

Inferential Statistics

Marquette University MATH 1700 Class 5 Copyright 2017 by D.B. Rowe

Last Lecture. Distinguish Populations from Samples. Knowing different Sampling Techniques. Distinguish Parameters from Statistics

Probability and Statistics

Probability Experiments, Trials, Outcomes, Sample Spaces Example 1 Example 2

West Windsor-Plainsboro Regional School District Algebra Grade 8

MATH 10 INTRODUCTORY STATISTICS

Overview. INFOWO Statistics lecture S1: Descriptive statistics. Detailed Overview of the Statistics track. Definition

UNIVERSITY OF TORONTO MISSISSAUGA. SOC222 Measuring Society In-Class Test. November 11, 2011 Duration 11:15a.m. 13 :00p.m.

Final Exam STAT On a Pareto chart, the frequency should be represented on the A) X-axis B) regression C) Y-axis D) none of the above

Grade 7 Math Spring 2017 Item Release

Chapter 1. Looking at Data

AP Final Review II Exploring Data (20% 30%)

It s Just a Lunar Phase

The science of learning from data.

Chapter 2 Class Notes Sample & Population Descriptions Classifying variables

Mathematics Table of Contents

Descriptive Statistics Solutions COR1-GB.1305 Statistics and Data Analysis

Seismic Analysis of Structures Prof. T.K. Datta Department of Civil Engineering Indian Institute of Technology, Delhi. Lecture 03 Seismology (Contd.

Review for Exam #1. Chapter 1. The Nature of Data. Definitions. Population. Sample. Quantitative data. Qualitative (attribute) data

CHAPTER 8 INTRODUCTION TO STATISTICAL ANALYSIS

Background to Statistics

Elementary Statistics

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI

Curriculum Scope & Sequence Subject/Grade Level: MATHEMATICS/GRADE 7 Course: MATH 7

Lecture 1: Description of Data. Readings: Sections 1.2,

Pacing 7th Grade Advanced Math 1st Nine Weeks Introduction to Proportional Relationships. Proportional Relationships and Percentages

a table or a graph or an equation.

7 th Grade Remediation Guide

Descriptive Univariate Statistics and Bivariate Correlation

Purposes of Data Analysis. Variables and Samples. Parameters and Statistics. Part 1: Probability Distributions

APS Eighth Grade Math District Benchmark Assessment NM Math Standards Alignment

Describing Data: Numerical Measures

Section 5.4. Ken Ueda

Transcription:

Inf1-DA 2010 20 III: 28 / 89 Part III Unstructured Data Data Retrieval: III.1 Unstructured data and data retrieval Statistical Analysis of Data: III.2 Data scales and summary statistics III.3 Hypothesis testing and correlation III.4 χ 2 and collocations Inf1-DA 2010 20 III: 29 / 89 Lecture timetable Tuesday 8 March Lecture Tutorial 7, Information Retrieval Friday March No Lecture Assignment due to ITO 4pm Tuesday 15 March Lecture Tutorial 8, Statistical Analysis Friday 18 March No Lecture Tuesday 22 March Lecture Tutorial 9, assignment feedback Inf1-DA 2010 20 III: 30 / 89 Analysis of data There are many reasons to analyse data. Two common goals of analysis: Discover implicit structure in the data. E.g., find patterns in empirical data (such as experimental data). Confirm or refute a hypothesis about the data. E.g., confirm or refute an experimental hypothesis. Statistics provides a powerful and ubiquitous toolkit for performing such analyses.

Inf1-DA 2010 20 III: 31 / 89 Data scales The type of analysis performed (obviously) depends on: The reason for wishing to carry out the analysis. The type of data to hand: for example, the data may be quantitative (i.e., numerical), or it may be qualitative (i.e., descriptive). One important aspect of the kind of data is the form of data scale it belongs to: Categorical (also called nominal) and Ordinal scales (for qualitative data). Interval and ratio scales (for quantitative data). This affects the ways in which we can manipulate data. Inf1-DA 2010 20 III: 32 / 89 Categorical scales Data belongs to a categorical scale if each datum (i.e., data item ) is classified as belonging to one of a fixed number categories. Example: The British Government might classify Visa applications according to the nationality of the applicant. This classification is a categorical scale: the categories are the different possible nationalities. Example: Insurance companies classify some insurance applications (e.g., home, possessions, car) according to the postcode of the applicant (since different postcodes have different risk assessments). Categorical scales are sometimes called nominal scales, especially in cases in which the value of a datum is a name. Inf1-DA 2010 20 III: 33 / 89 Ordinal scales Data belongs to an ordinal scale if it has an associated ordering but arithmetic transformations on the data are not meaningful. Example: The Beaufort wind force scale classifies wind speeds on a scale from 0 (calm) to 12 (hurricane). This has an obvious associated ordering, but it does not make sense to perform arithmetic operations on this scale. E.g., it does not make much sense to say that scale 6 (strong breeze) is the average of calm and hurricane force. Example: In many institutions, exam results are recorded as grades (e.g., A,B,..., G) rather than as marks. Again the ordering is clear, but one does not perform arithmetic operations on the scale.

Inf1-DA 2010 20 III: 34 / 89 Interval scales An interval scale is a numerical scale (usually with real number values) in which we are interested in relative value rather than absolute value. Example: Points in time are given relative to an arbitrarily chosen zero point. We can make sense of comparisons such as: moment x is 2009 years later than moment y. But it does not make sense to say: moment x is twice as large as moment z. Mathematically, interval scales support the operations of subtraction (returning a real number for this) and weighted average. Interval scales do not support the operations of addition and multiplication. Inf1-DA 2010 20 III: 35 / 89 Ratio scales A ratio scale is a numerical scale (again usually with real number values) in which there is a notion of absolute value. Example: Most physical quantities such as mass, energy and length are measured on ratio scales. So is temperature if measured in kelvins (i.e. relative to absolute zero). Like interval scales, ratio scales support the operations of subtraction and weighted average. They also support the operations of addition and of multiplication by a real number. Question for physics students: Is time a ratio scale if one uses the Big Bang as its zero point? Inf1-DA 2010 20 III: 36 / 89 Visualising data It is often helpful to visualise data by drawing a chart or plotting a graph of the data. Visualisations can help us guess properties of the data, whose existence we can then explore mathematically using statistical tools. For a collection of data of a categorical or ordinal scale, a natural visual representation is a histogram (or bar chart), which, for each category, displays the number of occurrences of the category in the data. For a collection of data from an interval or ratio scale, one plots a graph with the data scale as the x-axis and the frequency as the y-axis. It is very common for such a graph to take a bell-shaped appearance.

Inf1-DA 2010 20 III: 37 / 89 Normal distribution In a normal distribution, the data is clustered symmetrically around a central value (zero in the graph below), and takes the bell-shaped appearance below. Inf1-DA 2010 20 III: 38 / 89 Normal distribution (continued) There are two crucial values associated with the normal distribution. The mean, µ, is the central value around which the data is clustered. In the example, we have µ = 0. The standard deviation, σ, is the distance from the mean to the point at which the curve changes from being convex to being concave. In the example, we have σ = 1. The larger the standard deviation, the larger the spread of data. The general equation for a normal distribution is y = c e (x µ) 2 2σ 2 (You do not need to remember this formula.) Inf1-DA 2010 20 III: 39 / 89 Statistic(s) A statistic is a (usually numerical) value that captures some property of data. For example, the mean of a normal distribution is a statistic that captures the value around which the data is clustered. Similarly, the standard deviation of a normal distribution is a statistic that captures the degree of spread of the data around its mean. The notion of mean and standard deviation generalise to data that is not normally distributed. There are also other, mode and median, which are alternatives to the mean for capturing the focal point of data.

Inf1-DA 2010 20 III: 40 / 89 Mode Summary statistics summarise a property of a data set in a single value. Given data values x 1, x 2,..., x N, the mode (or modes) is the value (or values) x that occurs most often in x 1, x 2,..., x N. Example: Given data: 6, 2, 3, 6, 1, 5, 1, 7, 2, 5, 6, the mode is 6, which is the only value to occur three times. The mode makes sense for all types of data scale. However, it is not particularly informative for real-number-valued quantitative data, where it is unlikely for the same data value to occur more than once. (This is an instance of a more general phenomenon. In many circumstances, it is neither useful nor meaningful to compare real-number values for equality.) Inf1-DA 2010 20 III: 41 / 89 Median Given data values x 1, x 2,..., x N, written in non-decreasing order, the median is the middle value x ( N+1 assuming N is odd. If N is even, then 2 ) any data value between x ( N 2 ) and x ( N 2 +1) inclusive is a possible median. Example: Given data: 6, 2, 3, 6, 1, 5, 1, 7, 2, 5, 6, we write this in non-decreasing order: The middle value is the sixth value 5. 1, 1, 2, 2, 3, 5, 5, 6, 6, 6, 7 The median makes sense for ordinal data and for interval and ratio data. It does not make sense for categorical data, because categorical data has no associated order. Inf1-DA 2010 20 III: 42 / 89 Mean Given data values x 1, x 2,..., x N, the mean µ is the value: µ = N i=1 x i N Example: Given data: 6, 2, 3, 6, 1, 5, 1, 7, 2, 5, 6, the mean is 6 + 2 + 3 + 6 + 1 + 5 + 1 + 7 + 2 + 5 + 6 = 4. Although the formula for the mean involves a sum, the mean makes sense for both interval and ratio scales. The reason it makes sense for data on an interval scale is that interval scales support weighted averages, and a mean is simply an equally-weighted average (all weights are set as 1 N ). The mean does not make sense for categorical and ordinal data.

Inf1-DA 2010 20 III: 43 / 89 Variance and standard deviation Given data values x 1, x 2,..., x N, with mean µ, the variance, written Var or σ 2, is the value: Var = N i=1 (x i µ) 2 N The standard deviation, written σ, is defined by: σ = Var = N i=1 (x i µ) 2 N Like the mean, the standard deviation makes sense for both interval and ratio data. (The values that are squared are real numbers, so, even with interval data, there is no issue about performing the multiplication.) Inf1-DA 2010 20 III: 44 / 89 Variance and standard deviation (example) Given data: 6, 2, 3, 6, 1, 5, 1, 7, 2, 5, 6, we have µ = 4. Var = 22 + 2 2 + 1 2 + 2 2 + 3 2 + 1 2 + 3 2 + 3 2 + 2 2 + 1 2 + 2 2 = 4 + 4 + 1 + 4 + 9 + 1 + 9 + 9 + 4 + 1 + 4 = 50 σ = = 4.55 (to 2 decimal places) 50 = 2.13 (to 2 decimal places) Inf1-DA 2010 20 III: 45 / 89 Populations and samples The discussion of statistics so far has been all about computing various statistics for a given set of data. Very often, however, one is interested in knowing the value of the statistic for a whole population from which our data is just a sample. Examples: Experiments in social sciences where one wants to discover some general property of a section of the population (e.g., teenagers). Surveys (e.g., marketing surveys, opinion polls, etc.). In software design, understanding requirements of users, based on questioning a sample of potential users. In such cases it is totally impracticable to obtain exhaustive data about the population as a whole. So we are forced to obtain data about a sample.

Inf1-DA 2010 20 III: 46 / 89 Sampling There are important guidelines to follow in choosing a sample from a population. The sample should be chosen randomly from the population. The sample should be as large as is practically possible (given constraints on gathering data, storing data and calculating with data). These two guidelines are designed to improve the likelihood that the sample is representative of the population. In particular, they minimise the chance of accidentally building a bias into the sample. Given a sample, one calculates statistical properties of the sample, and uses these to infer likely statistical properties of the whole population. Important topics in statistics (beyond the scope of D&A) are maximising and quantifying the reliability of such techniques. Inf1-DA 2010 20 III: 47 / 89 Estimating statistics for a population given a sample Typically one has a (hopefully representative) sample x 1,..., x n from a population of size N where n << N (i.e., n is much smaller that N). We use the sample x 1,..., x n to estimate statistical values for the whole population. Sometimes the calculation is the expected one, sometimes it isn t. The best estimate m of the mean µ of the population is: n i=1 m = x i n As expected, this is just the mean of the sample. Inf1-DA 2010 20 III: 48 / 89 Estimating variance and standard deviation of population To estimate the variance of the population, calculate n i=1 (x i m) 2 n 1 The best estimate s of the standard deviation σ of the population, is: s = n i=1 (x i m) 2 n 1 N.B. These values are not simply the variance and standard deviation of the sample. In both cases, the expected denominator of n has been replaced by n 1. This gives a better estimate in general when n << N.

Inf1-DA 2010 20 III: 49 / 89 Caution The use of samples to estimate statistics of populations is so common that the formula on the previous slide is very often the one needed when calculating standard deviations. Its usage is so widespread that sometimes it is wrongly given as the definition of standard deviation. The existence of two different formulas for calculating the standard deviation in different circumstances can lead to confusion. So one needs to take care. Sometimes calculators make both formulas available via two buttons: σ n for the formula with denominator n; and σ n 1 for the formula with denominator n 1. Inf1-DA 2010 20 III: 50 / 89 Further reading There are many, many, many books on statistics. Two very gentle books, intended mainly for social science students, are: P. Hinton Statistics Explained Routledge, London, 1995 These are good for the formula-shy reader. First Steps in Statistics D. B. Wright SAGE publications, 2002 Two entertaining books (the first a classic, the second rather recent), full of examples of how statistics are often misused in practice, are: D. Huff How to Lie with Statistics Victor Gollancz, 1954 M. Blastland and A. Dilnot The Tiger That Isn t Profile Books, 2008