Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 03

Similar documents
P8130: Biostatistical Methods I

Chapter 4. Displaying and Summarizing. Quantitative Data

Midrange: mean of highest and lowest scores. easy to compute, rough estimate, rarely used

Last Lecture. Distinguish Populations from Samples. Knowing different Sampling Techniques. Distinguish Parameters from Statistics

Σ x i. Sigma Notation

ECLT 5810 Data Preprocessing. Prof. Wai Lam

CS 147: Computer Systems Performance Analysis

Descriptive Data Summarization

200 participants [EUR] ( =60) 200 = 30% i.e. nearly a third of the phone bills are greater than 75 EUR

Tastitsticsss? What s that? Principles of Biostatistics and Informatics. Variables, outcomes. Tastitsticsss? What s that?

1-1. Chapter 1. Sampling and Descriptive Statistics by The McGraw-Hill Companies, Inc. All rights reserved.

1. Exploratory Data Analysis

Ø Set of mutually exclusive categories. Ø Classify or categorize subject. Ø No meaningful order to categorization.

STAT 200 Chapter 1 Looking at Data - Distributions

Histograms allow a visual interpretation

ADMS2320.com. We Make Stats Easy. Chapter 4. ADMS2320.com Tutorials Past Tests. Tutorial Length 1 Hour 45 Minutes

Histograms, Central Tendency, and Variability

Week 1: Intro to R and EDA

Stat 101 Exam 1 Important Formulas and Concepts 1

Preliminary Statistics course. Lecture 1: Descriptive Statistics

STP 420 INTRODUCTION TO APPLIED STATISTICS NOTES

MATH 10 INTRODUCTORY STATISTICS

IAM 530 ELEMENTS OF PROBABILITY AND STATISTICS LECTURE 3-RANDOM VARIABLES

AP Final Review II Exploring Data (20% 30%)

Chapter 4.notebook. August 30, 2017

Summary statistics. G.S. Questa, L. Trapani. MSc Induction - Summary statistics 1

QUANTITATIVE DATA. UNIVARIATE DATA data for one variable

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Descriptive Statistics-I. Dr Mahmoud Alhussami

3.1 Measure of Center

Module 1. Identify parts of an expression using vocabulary such as term, equation, inequality

Ø Set of mutually exclusive categories. Ø Classify or categorize subject. Ø No meaningful order to categorization.

Lecture Slides. Elementary Statistics Tenth Edition. by Mario F. Triola. and the Triola Statistics Series. Slide 1

Unit 2. Describing Data: Numerical

Introduction to Basic Statistics Version 2

Chapter 2 Descriptive Statistics

Class 11 Maths Chapter 15. Statistics

SUMMARIZING MEASURED DATA. Gaia Maselli

Chapter 2: Tools for Exploring Univariate Data

Index I-1. in one variable, solution set of, 474 solving by factoring, 473 cubic function definition, 394 graphs of, 394 x-intercepts on, 474

Sociology 6Z03 Review I

Stat 20: Intro to Probability and Statistics

MIDTERM EXAMINATION (Spring 2011) STA301- Statistics and Probability

Statistical Methods. by Robert W. Lindeman WPI, Dept. of Computer Science

Unit Two Descriptive Biostatistics. Dr Mahmoud Alhussami

2/2/2015 GEOGRAPHY 204: STATISTICAL PROBLEM SOLVING IN GEOGRAPHY MEASURES OF CENTRAL TENDENCY CHAPTER 3: DESCRIPTIVE STATISTICS AND GRAPHICS

Chapter Four. Numerical Descriptive Techniques. Range, Standard Deviation, Variance, Coefficient of Variation

2.0 Lesson Plan. Answer Questions. Summary Statistics. Histograms. The Normal Distribution. Using the Standard Normal Table

TOPIC: Descriptive Statistics Single Variable

Descriptive Statistics

ANÁLISE DOS DADOS. Daniela Barreiro Claro

Mathematics Standards for High School Algebra I

Descriptive Univariate Statistics and Bivariate Correlation

Lecture Notes 2: Variables and graphics

Lecture 2 and Lecture 3

Elementary Statistics

Keystone Exams: Algebra

3rd Quartile. 1st Quartile) Minimum

Chapter I: Introduction & Foundations

What is Statistics? Statistics is the science of understanding data and of making decisions in the face of variability and uncertainty.

1 Measures of the Center of a Distribution

Exploring, summarizing and presenting data. Berghold, IMI, MUG

Summarizing and Displaying Measurement Data/Understanding and Comparing Distributions

Summarizing Measured Data

Descriptive Statistics

Units. Exploratory Data Analysis. Variables. Student Data

3.1 Measures of Central Tendency: Mode, Median and Mean. Average a single number that is used to describe the entire sample or population

Introduction to Statistics

Chapter2 Description of samples and populations. 2.1 Introduction.

Chapter 3. Data Description

FLORIDA STANDARDS TO BOOK CORRELATION

Observations Homework Checkpoint quizzes Chapter assessments (Possibly Projects) Blocks of Algebra

Quantitative Tools for Research

Example 2. Given the data below, complete the chart:

Chapter 1. Looking at Data

Stat Lecture Slides Exploring Numerical Data. Yibi Huang Department of Statistics University of Chicago

MATH 117 Statistical Methods for Management I Chapter Three

Types of Information. Topic 2 - Descriptive Statistics. Examples. Sample and Sample Size. Background Reading. Variables classified as STAT 511

Statistics and parameters

Lecture 6: Chapter 4, Section 2 Quantitative Variables (Displays, Begin Summaries)

are the objects described by a set of data. They may be people, animals or things.

Lesson Plan. Answer Questions. Summary Statistics. Histograms. The Normal Distribution. Using the Standard Normal Table

1.0 Continuous Distributions. 5.0 Shapes of Distributions. 6.0 The Normal Curve. 7.0 Discrete Distributions. 8.0 Tolerances. 11.

Algebra I Number and Quantity The Real Number System (N-RN)

BNG 495 Capstone Design. Descriptive Statistics

Chapter 2 Class Notes Sample & Population Descriptions Classifying variables

Curriculum Summary 8 th Grade Algebra I

3 GRAPHICAL DISPLAYS OF DATA

Algebra 1 Mathematics: to Hoover City Schools

Foundations of Algebra/Algebra/Math I Curriculum Map

Lecture 3B: Chapter 4, Section 2 Quantitative Variables (Displays, Begin Summaries)

Dublin City Schools Mathematics Graded Course of Study Algebra I Philosophy

Statistics in medicine

COMMON CORE STATE STANDARDS TO BOOK CORRELATION

Biostatistics for biomedical profession. BIMM34 Karin Källen & Linda Hartman November-December 2015

Probabilities and Statistics Probabilities and Statistics Probabilities and Statistics

ST Presenting & Summarising Data Descriptive Statistics. Frequency Distribution, Histogram & Bar Chart

Describing distributions with numbers

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency

AP Statistics Cumulative AP Exam Study Guide

Transcription:

Meelis Kull meelis.kull@ut.ee Autumn 2017 1

Demo: Data science mini-project CRISP-DM: cross-industrial standard process for data mining Data understanding: Types of data Data understanding: First look at attributes Types of attributes First look at a nominal attribute First look at a ordinal attribute First look at a numeric attribute 2

Demo: Data science mini-project CRISP-DM: cross-industrial standard process for data mining Data understanding: Types of data Data understanding: First look at attributes Types of attributes First look at a nominal attribute First look at a ordinal attribute First look at a numeric attribute 3

4

All the same applies as for nominal attributes Need to make sure that the order is retained in histograms Additional ways to look at the attribute: Calculate the min, max, median of the attribute 5

All the same applies as for nominal attributes Need to make sure that the order is retained in histograms Additional ways to look at the attribute: Calculate the min, max, median of the attribute More generally, calculate quantiles 6

There are 3 quartiles: lower quartile, median, upper quartile They divide a sorted data set into 4 equal parts Percentiles divide into 100 equal parts Lower quartile is the 25 th percentile Median is the 50 th percentile Deciles divide into 10 equal parts Quantiles generalise to any location over sorted data: Lower quartile = 25 th percentile = quantile at 0.25 4 th decile = quantile at 0.4 7

A. 1 st and 2 nd decile B. 2 nd and 3 rd decile C. 3 rd and 4 th decile D. 4 th and 5 th decile E. 5 th and 6 th decile F. 6 th and 7 th decile G. 7 th and 8 th decile H. 8 th and 9 th decile 8

A. Nominal attributes B. Ordinal attributes C. Both D. Neither E. Not sure 9

Same aspects as in nominal attributes Additionally: Is the order specified in the documentation? (meta-data) If all possible values specified in the meta-data: Check if all values present in the data If not, make sure that the empty bars in histograms are in the right location corresponding to the order 10

Demo: Data science mini-project CRISP-DM: cross-industrial standard process for data mining Data understanding: Types of data Data understanding: First look at attributes Types of attributes First look at a nominal attribute First look at a ordinal attribute First look at a numeric attribute 11

12

Is it really numeric? E.g. if contains only values 0.0 and 1.0 then perhaps nominal is more appropriate? Is it discrete (only integers)? E.g. if contains integers 0 to 100 then can have a first look as if it were an ordinal attribute Calculate quantiles as for ordinal attributes Calculate the mean (arithmetic average): 13

Is it really numeric? Is it discrete (only integers)? Calculate quantiles as for ordinal attributes Calculate the mean (arithmetic average) Plot a histogram (with default binning) 14

Check if some value occurs many times With real numbers often no number occurs more than once If a value is frequent then: Possibly can denote missingness (e.g. value 0), should be treated as N/A (not available, missing) Perhaps can denote an extremal value (more extremal values represented as this value) Check for rounding E.g., if all numbers rounded to 2 nd digit and one has more digits then it might be a typing error 15

A. Nominal attributes B. Ordinal attributes C. Numeric attributes D. All of the above E. None of the above F. Not sure 16

Demo: Data science mini-project CRISP-DM: cross-industrial standard process for data mining Data understanding: Types of data Data understanding: First look at attributes Types of attributes First look at a nominal attribute First look at a ordinal attribute First look at a numeric attribute 17

Data understanding: distribution of attributes Types of histograms How to describe probability distributions? Some standard probability distributions More ways to visualise distributions Visualising relations of attributes Are the attributes related? 18

Data understanding: distribution of attributes Types of histograms How to describe probability distributions? Some standard probability distributions More ways to visualise distributions Visualising relations of attributes Are the attributes related? 19

20

We have now looked at each attribute A key question we can now answer: How do the items in this dataset look like? Now we have at least 3 answers to this: 21

How do the items in this dataset look like? Let us just pick up one as an example: Age = 22 Workclass = Private Education = 11 th Occupation = Other-service Capital.gain = 0... 22

How do the items in this dataset look like? Let us just provide the ranges of attributes Age: {17,18,19,,90} Workclass: {Federal-gov, Local-gov, Private, } Education: {1 st -4 th,5 th -6 th,, Doctorate) Occupation: {Adm-clerical, Exec-managerial, } Capital.gain: [0,99999]... 23

How do the items in this dataset look like? Let us just provide the histograms Age: <histogram> Workclass: <histogram> Education: <histogram> Occupation: <histogram> Capital.gain: <histogram>... 24

Data understanding: distribution of attributes Types of histograms How to describe probability distributions? Some standard probability distributions More ways to visualise distributions Visualising relations of attributes Are the attributes related? 25

26

Histograms on discrete data Nominal Ordinal Numeric with few different values E.g. small number of different integers Histograms on continuous data Numeric with many different values 27

Frequency histogram Frequency = count of items with each value ggplot(data) + geom_histogram(aes(x=age),stat="count") 28

Frequency histogram Frequency = count of items with each value ggplot(data) + geom_histogram(aes(x=workclass),stat="count") 29

Relative frequency histogram Relative frequency = proportion (0..1) or percentage (0..100%) of items with each value Heights of bars sum up to 1 ggplot(data) + geom_bar(aes(x=workclass,..count../sum(..count..)),stat="count") 30

Depends on the goal Frequency histogram Gives actual counts in the data Relative frequency histogram Gives proportions in the data Interpretable as probability distribution of a randomly chosen item 31

Continuous attribute: Usually value is different in each item Need to introduce bins (a.k.a. intervals, ranges) Histogram not informative without bins: ggplot(data) + geom_histogram(aes(x=factor(salaries)),stat="count") 32

Frequency histogram of binned data Frequency = count of items in each bin ggplot(data) + geom_histogram(aes(x=salaries),stat="bin",boundary=0,binwidth=10000) 33

Relative frequency histogram of binned data Relative frequency = proportion of items in bins Heights of bars add up to 1 ggplot(data) + geom_histogram(aes(x=salaries,..count../sum(..count..)),stat="bin",boundary=0,binwidth=10000) 34

Density histogram of binned data Density = Y-axis such that areas of bars in the histogram add up to 1 Density scale is invariant to the sizes of bins ggplot(data) + geom_histogram(aes(x=salaries,..density..),stat="bin",boundary=0,binwidth=10000) 35

Density histogram of binned data Density = Y-axis such that areas of bars in the histogram add up to 1 Density scale is invariant to the sizes of bins + geom_histogram(aes(x=salaries,..density..),stat="bin",boundary=0,binwidth=1000,alpha=0,colour="red") 36

Density histogram of binned data Density = Y-axis such that areas of bars in the histogram add up to 1 Density scale is invariant to the sizes of bins + geom_histogram(aes(x=salaries),stat="density",colour="blue") 37

Represented by the probability density function (pdf) Area under the curve is equal to 1 Areas represent probabilities Area = P(a<X<b) = probability that X is between a and b 38

Data understanding: distribution of attributes Types of histograms How to describe probability distributions? Some standard probability distributions More ways to visualise distributions Visualising relations of attributes Are the attributes related? 39

40

Statistic measure calculated from all values Mode of the distribution: The most probable value (or values) The most frequent value if discrete Value with highest density if continuous Median and other quantiles We have defined earlier Mean of the distribution: Average value of the attribute Arithmetic average if discrete Expected value if continuous Centre of mass of the distribution 41

Consider attribute with values: 1,2,2,2,3,3,4,7 Mode: 2 because occurs three times Median 2.5 because 2 & 3 are in the middle, (2+3)/2=2.5 Mean: 3.0 because (1+2+2+2+3+3+4+7)/8 = 24/8 = 3.0 42

A. 0 B. 0.5 C. 1 D. 1.5 E. None of the above 43

A. 0 B. 0.5 C. 1 D. 1.5 E. None of the above 44

A. 0 B. 0.5 C. 1 D. 1.5 E. None of the above 45

Variance of a distribution is: average squared deviation from the mean Standard deviation is square root of variance Quadratic average deviation from the mean Example: Values: 1,2,2,2,3,3,4,7 Mean: 3.0 Variance: ((1-3)^2+(2-3)^2+ +(7-3)^2) / 8 = 3.0 Standard deviation: sqrt(3.0)=1.732 46

Symmetric Skewed / right-skewed / left-skewed Heavy-tailed Bimodal Multi-modal Capped / right-capped / left-capped 47

Simple definitions, not fully correct: Unimodal 1 mode Bimodal 2 modes Multimodal multiple modes Actually: Bimodal usually means that the density (pdf) has two local maxima: Multimodal means pdf has multiple maxima (2 or more) 48

Symmetric distribution: Symmetric around a vertical axis of symmetry Left and right side are mirror-images of each other Left- (or right-)skewed distribution: Non-symmetric and mean below (above) mode Usually used only for unimodal distributions sym negati positiv m vely ely et ske ske ri wed we c d 49

Data understanding: distribution of attributes Types of histograms How to describe probability distributions? Some standard probability distributions More ways to visualise distributions Visualising relations of attributes Are the attributes related? 50

51

Statisticians have names for many different families of probability distributions Why need to know some of them? In practice these are used to communicate the distribution without having to visualise it We will talk about: Uniform distributions Normal distributions Power law distributions 52

All options are equally probable 53

All values in [a,b] are equally probable: The pdf is constant between a and b, and 0 elsewhere 54

Represent data dispersion, spread Represent central tendency 55

N(mean,variance) The most common non-uniform continuous distribution Why? Sum of many independent and identically distributed (i.i.d.) random variables is approximately normally distributed Standard normal distribution: N(0,1) 56

Heavy-tailed (or long-tailed) distribution Very high or low values are likely More likely than in case of normal distribution Technical definition: Tails are not exponentially bounded 57

Alternative names: scale-free / scale-independent Distributions on positive real values The probability of M times bigger value is K times smaller (power law; M, K - parameters) Linearly descending pdf when drawn in loglog-scale (both x and y logarithmic) Examples: Number of connections in many real-world graphs Frequencies of words in languages Sizes of craters on the moon 58

A. Uniform B. Unimodal symmetric C. Unimodal right-skewed D. Unimodal left-skewed E. Bimodal symmetric F. Bimodal asymmetric G. Multi-modal symmetric H. Multi-modal asymmetric I. Normally distributed J. Power-law distributed 59

A. Uniform B. Unimodal symmetric C. Unimodal right-skewed D. Unimodal left-skewed E. Bimodal symmetric F. Bimodal asymmetric G. Multi-modal symmetric H. Multi-modal asymmetric I. Normally distributed J. Power-law distributed 60

A. Uniform B. Unimodal symmetric C. Unimodal right-skewed D. Unimodal left-skewed E. Bimodal symmetric F. Bimodal asymmetric G. Multi-modal symmetric H. Multi-modal asymmetric I. Normally distributed J. Power-law distributed 61

10-2 10 0 10 1 10 2 10-3 10-4 A. Uniform B. Unimodal symmetric C. Unimodal right-skewed D. Unimodal left-skewed E. Bimodal symmetric F. Bimodal asymmetric G. Multi-modal symmetric H. Multi-modal asymmetric I. Normally distributed J. Power-law distributed 62

Data understanding: distribution of attributes Types of histograms How to describe probability distributions? Some standard probability distributions More ways to visualise distributions Visualising relations of attributes Are the attributes related? 63

64

Compactly visualise many distributions Density plots rotated by 90º and mirrored Distribution of salaries for each education level ggplot(data) + geom_violin(aes(x=education,y=salaries)) 65

Compact visualise many distributions marked median, upper and lower quartile, and outliers (R ggplot outlier = more than 1.5x interquartile range from quartile) ggplot(data) + geom_boxplot(aes(x=education,y=salaries)) 66

Violin plots and box plots can also be shown together: ggplot(data) + geom_violin(aes(x=education_factor,y=salaries)) + geom_boxplot(aes(x=education_factor,y=salaries),alpha=0.3) 67

Often too crowded, but sometimes provide extra insights compared to box plots and violin plots Over-crowded with too many points ggplot(data) + geom_point(aes(x=education_factor,y=salaries)) 68

Often too crowded, but sometimes provide extra insights compared to box plots and violin plots More useful, here only 100 points ggplot(data) + geom_point(aes(x=education_factor,y=salaries)) 69

Data understanding: distribution of attributes Types of histograms How to describe probability distributions? Some standard probability distributions More ways to visualise distributions Visualising relations of attributes Are the attributes related? 70

71

2 categorical attributes: Cross-table (workclass & education) table(data$education,data$workclass) 72

2 categorical attributes: Cross-table (workclass & education) Heatmap (workclass & education) ggplot(data %>% group_by(education,workclass) %>% summarise(count=length(education))) + geom_tile(aes(y=education,x=workclass,fill=count)) + scale_fill_gradient(low='white',high='black',trans="log",breaks=c(1,10,100,1000)) + theme_bw() 73

2 categorical attributes: Cross-table (workclass & education) Heatmap (workclass & education) Cross-table and heatmap combined + geom_text(aes(x=workclass,y=education,label=count),color="red") 74

Scatter plot, box plot, violin plot 75

Scatter plot ggplot(data) + geom_point(aes(x=age,y=salaries)) 76

Scatter plot ggplot(data) + geom_point(aes(x=capital.gain,y=salaries)) 77

Scatter plot Discretise one (or both) of the attributes Discretise = make into categorical For instance, introduce bins library(arules) data$capital.gain.discretised = discretize(data$capital.gain,method="fixed",categories=c(0,5,10,20,50,100)*1000) ggplot(data) + geom_boxplot(aes(x=capital.gain.discretised,y=salaries)) 78

Make all pairwise visualisations and organise in a cross-table Perform dimensionality reduction Introduced later in the course Projects the data into a 2-dimensional space Then visualise Use colours, shapes, etc. 79

Four attributes visualised together ggplot(data) + geom_point(aes(x=capital.gain,y=salaries,color=workclass,shape=gender)) + scale_colour_brewer(type="qual") 80

Four attributes visualised together Often hard to read when many attributes visualised this way ggplot(data) + geom_point(aes(x=capital.gain,y=salaries,color=workclass,shape=gender)) + scale_colour_brewer(type="qual") 81

Data understanding: distribution of attributes Types of histograms How to describe probability distributions? Some standard probability distributions More ways to visualise distributions Visualising relations of attributes Are the attributes related? 82

83

There are many statistical tests about this, we will cover some in the next lecture Visual inspection can reveal some relations For example, university graduates have higher median salaries than others 84

There are many statistical tests about this, we will cover some in the next lecture Visual inspection can reveal some relations For example, university graduates have higher median salaries than others Correlation is a statistic on 2 numeric attributes, quantifying linear relationships 85

Mean (actually sample mean as explained in next lecture) x 1 n n i 1 x i Correlation (Pearson correlation coefficient) 86

Ranges from -1 to +1: R=-1: perfectly anti-correlated R=+1: perfectly correlated R=0: absolutely uncorrelated Positively correlated Negatively correlated 87

Uncorrelated Uncorrelated Uncorrelated 88

89

Several sets of (x, y) points, with the Pearson correlation coefficient of x and y for each set. Note that the correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0 but in that case the correlation coefficient is undefined because the variance of Y is zero. 90

91

A. All equal B. All different C. Some equal, some different 92

93

94

95

96

97

405 - Feb 2017 1200 1000 800 600 400 200 0 0 500 1000 1500 2000 2500 3000 3500 4000 4500 CO2 Noise Pressure 98

1200 405 - Feb 2017 Changed the scale of noise 80 1000 70 60 800 50 600 40 400 30 20 200 10 0 0 0 500 1000 1500 2000 2500 3000 3500 4000 4500 CO2 Pressure Noise 99

80 70 CO2 vs Noise in 405 y = 0.0465x + 19.693 R² = 0.65625 60 Noise 50 40 30 20 10 0 CO2 0 200 400 600 800 1000 1200 100

80 70 Not the same as correlation! CO2 vs Noise in 405 y = 0.0465x + 19.693 R² = 0.65625 60 Noise 50 40 30 20 10 0 CO2 0 200 400 600 800 1000 1200 101

r Pearson correlation coefficient R 2 coefficient of determination Proportion of variance in the target attribute that is predictable from the source attribute Basically, measures how well the data follow the regression line In simple linear regression R 2 is equal to the square of r 102

Source: https://www.valuewalk.com/2016/08/usd-vs-gbp-vs-eur-a-leveling-of-the-playing-field/ 103

Source: https://www.valuewalk.com/2016/08/usd-vs-gbp-vs-eur-a-leveling-of-the-playing-field/ 104

????????? Source: http://www.tylervigen.com/spurious-correlations 105

Source: http://www.tylervigen.com/spurious-correlations 106

????????? 107

108

Relation Correlation Causation 109

Relation Correlation Causation Spurious correlations! Multiple testing problem (next lecture) 110

Data understanding: distribution of attributes Types of histograms How to describe probability distributions? Some standard probability distributions More ways to visualise distributions Visualising relations of attributes Are the attributes related? 111

112

The most merciful thing in the world... is the inability of the human mind to correlate all its contents. H. P. Lovecraft The correlation of quality of life and cost of energy is huge Sam Altman Visualization is daydreaming with a purpose. Bo Bennett Source: https://www.brainyquote.com 113