Descriptive Data Summarization
- Beverly Conley
- 6 years ago
Transcription
1 Descriptive Data Summarization Descriptive data summarization gives the general characteristics of the data and identifies the presence of noise or outliers, which is useful for successful data cleaning and data integration. For data preprocessing tasks, we want to learn about data characteristics regarding both the central tendency and the dispersion of the data. Measures of central tendency include mean, median, mode, and midrange. Measures of data dispersion include quartiles, interquartile range (IQR), and variance. These descriptive statistics are of great help in understanding the distribution of the data. Data Mining 1
2 Measuring Central Tendency: Mean The most common and most effective numerical measure of the center of a set of data is the arithmetic mean. Arithmetic mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. Sometimes each value $x_i$ in a set is associated with a weight $w_i$. The weights reflect the significance, importance, or occurrence frequency attached to their respective values. Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$.
3 Measuring Central Tendency: Mean Although the mean is the single most useful quantity for describing a data set, it is not always the best way of measuring the center of the data. A major problem with the mean is its sensitivity to extreme (outlier) values. Even a small number of extreme values can corrupt the mean. To offset the effect caused by a small number of extreme values, we can instead use the trimmed mean, which is the mean obtained after chopping off values at the high and low extremes.
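The three variants of the mean described on these two slides can be sketched as follows. This is a minimal illustration, not part of the original slides: the function names, the `fraction` parameter (the share of values dropped at each end), and the sample data are all assumptions.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values at each extreme.

    `fraction` must be below 0.5 so that at least one value remains.
    """
    ordered = sorted(values)
    k = int(len(ordered) * fraction)       # values dropped per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 100]
print(mean(data))               # pulled upward by the extreme value
print(trimmed_mean(data, 0.2))  # drops 1 and 100 before averaging
print(weighted_mean([1, 2, 3], [1, 1, 2]))
```

On this sample the single extreme value dominates the plain mean, while the trimmed mean stays near the bulk of the data.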
4 Measuring Central Tendency: Median Another measure of the center of data is the median. Suppose that a given data set of N distinct values is sorted in numerical order. If N is odd, the median is the middle value of the ordered set; if N is even, the median is the average of the middle two values. For grouped data, the median can be estimated by interpolation: $\text{median} = L_1 + \left(\frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$, where $L_1$ is the lower boundary of the median interval, $N$ is the number of values in the entire data set, $(\sum \text{freq})_l$ is the sum of the frequencies of all of the intervals that are lower than the median interval, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the width of the median interval.
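The grouped-data estimate above can be sketched in a few lines. The representation of an interval as a `(lower_boundary, width, frequency)` tuple and the sample intervals are assumptions made for illustration.

```python
def grouped_median(intervals):
    """Estimate the median of grouped data by interpolation.

    `intervals` is a list of (lower_boundary, width, frequency) tuples,
    sorted by lower boundary.
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("no observations")
    cumulative = 0  # sum of frequencies of intervals below the current one
    for lower, width, freq in intervals:
        if cumulative + freq >= n / 2:
            # This is the median interval: interpolate inside it.
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq


# Hypothetical grouped data: 5 values in [0, 10), 10 in [10, 20), 5 in [20, 30).
print(grouped_median([(0, 10, 5), (10, 10, 10), (20, 10, 5)]))
```

Here N = 20, so the median interval is [10, 20), with 5 observations below it; the estimate falls at the middle of that interval.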
5 Measuring Central Tendency: Mode Another measure of central tendency is the mode. The mode for a set of data is the value that occurs most frequently in the set. It is possible for the greatest frequency to correspond to several different values, which results in more than one mode. Data sets with one, two, or three modes are called unimodal, bimodal, and trimodal, respectively. At the other extreme, if each data value occurs only once, then there is no mode. For unimodal frequency curves that are moderately skewed (asymmetrical), we have the following empirical relation: $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$.
6 Symmetric vs. Skewed Data [Figure: median, mean, and mode of symmetric, positively skewed, and negatively skewed data]
7 Measuring the Dispersion of Data The degree to which numerical data tend to spread is called the dispersion, or variance, of the data. The most common measures of data dispersion are the range, the five-number summary (based on quartiles), the interquartile range, and the standard deviation. Boxplots can be plotted based on the five-number summary and are a useful tool for identifying outliers.
8 Range, Quartiles, Outliers Range: the difference between the largest and smallest values. Quartiles: Q1 (25th percentile) and Q3 (75th percentile). Interquartile range: IQR = Q3 - Q1. Five-number summary: Minimum, Q1, Median, Q3, Maximum. Outliers: a common rule of thumb for identifying suspected outliers is to single out values falling at least 1.5 × IQR above the third quartile or below the first quartile.
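A minimal sketch of the five-number summary and the 1.5 × IQR rule follows. Quartile conventions differ between textbooks and libraries; the choice of `statistics.quantiles` with `method="inclusive"` here is an assumption, as are the function names and the sample data.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


def suspected_outliers(values, k=1.5):
    """Values at least k * IQR above Q3 or below Q1."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x <= low or x >= high]


# Hypothetical data with one suspiciously large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(five_number_summary(data))
print(suspected_outliers(data))
```

The same five numbers are what a boxplot draws, so this is also the computation behind the box ends, the median line, and the individually plotted outliers.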
9 Boxplot Analysis Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the five-number summary: the ends of the box are at the quartiles, so that the box length is the interquartile range, IQR. The median is marked by a line within the box. Two lines outside the box extend to the smallest and largest observations. Outliers are points beyond a specified outlier threshold and are plotted individually.
10 Variance and Standard Deviation Variance of N observations: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$, where $\mu$ is the mean value of the observations. The standard deviation $\sigma$ is the square root of the variance $\sigma^2$. The basic properties of the standard deviation are: $\sigma$ measures spread about the mean and should be used only when the mean is chosen as the measure of center; $\sigma = 0$ only when there is no spread, that is, when all observations have the same value. Otherwise $\sigma > 0$.
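The variance formula above divides by N (the population form). A short sketch, with function names and sample data chosen for illustration:

```python
import math


def population_variance(values):
    """Variance with divisor N, as in the formula on this slide."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def population_std(values):
    """Standard deviation: square root of the variance."""
    return math.sqrt(population_variance(values))


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations, mean 5
print(population_variance(data))
print(population_std(data))
print(population_std([3, 3, 3]))  # no spread, so the result is zero
```

The standard library functions `statistics.pvariance` and `statistics.pstdev` compute the same quantities.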
11 Properties of Normal Distribution Curve The normal (distribution) curve (μ: mean, σ: standard deviation). From μ-σ to μ+σ: contains about 68% of the measurements. From μ-2σ to μ+2σ: contains about 95% of them. From μ-3σ to μ+3σ: contains about 99.7% of them.
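The three percentages can be checked numerically from the normal cumulative distribution function. This sketch uses `statistics.NormalDist` from the Python standard library; the helper name `coverage` is an assumption.

```python
from statistics import NormalDist


def coverage(k, mu=0.0, sigma=1.0):
    """Probability that a normal variable lies within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage(k):.4f}")
```

The result does not depend on the particular values of μ and σ, only on k.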
12 Graphic Displays of Basic Statistical Descriptions Boxplot: graphic display of the five-number summary. Histogram: the x-axis shows values, the y-axis represents frequencies. Quantile plot: each value $x_i$ is paired with $f_i$, indicating that approximately $100 f_i$% of the data are $\le x_i$. Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another. Scatter plot: each pair of values is a pair of coordinates and is plotted as a point in the plane.
13 Histogram Analysis Histogram: a graphical display of tabulated frequencies, shown as bars. It shows what proportion of cases fall into each of several categories. It differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width. The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent.
14 Histogram Analysis In an equal-width histogram, each bucket represents an equal-width range of a numerical attribute.
15 Histograms Often Tell More than Boxplots The two histograms shown on the left may have the same boxplot representation, that is, the same values for min, Q1, median, Q3, and max, but they have rather different data distributions.
16 Quantile Plot Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences). Plots quantile information: for data $x_i$ sorted in increasing order, $f_i$ indicates that approximately $100 f_i$% of the data are below or equal to the value $x_i$.
17 Quantile-Quantile (Q-Q) Plot Graphs the quantiles of one univariate distribution against the corresponding quantiles of another. It is a powerful visualization tool in that it allows the user to view whether there is a shift in going from one distribution to another. Suppose that we have two sets of observations for the variable unit price, taken from two different branch locations. Let $x_1, \dots, x_N$ be the data from the first branch, and $y_1, \dots, y_M$ be the data from the second, where each data set is sorted in increasing order. If M = N (i.e., the number of points in each set is the same), then we simply plot $y_i$ against $x_i$, where $y_i$ and $x_i$ are both quantiles of their respective data sets. If M < N (i.e., the second branch has fewer observations than the first), there can be only M points on the q-q plot.
18 Quantile-Quantile (Q-Q) Plot A quantile-quantile plot for unit price data of items sold at two different branches. Each point corresponds to the same quantile for each data set and shows the unit price of items sold at branch 1 versus branch 2 for that quantile. For example, the lowest point in the left corner corresponds to the 0.03 quantile. (For comparison, a straight line represents the case in which, for each given quantile, the unit price at each branch is the same. The darker points correspond to the data for Q1, the median, and Q3, respectively.) The unit price of items sold at branch 1 was slightly less than that at branch 2.
19 Scatter Plot A scatter plot is one of the most effective graphical methods for determining whether there appears to be a relationship, pattern, or trend between two numerical attributes. To construct a scatter plot, each pair of values is treated as a pair of coordinates in an algebraic sense and plotted as a point in the plane.
20 Scatter Plot: Positively and Negatively Correlated Data [Figure panels: Negatively Correlated, Positively Correlated] The left-half fragment is positively correlated. The right-half fragment is negatively correlated.
21 Scatter Plot: Uncorrelated Data [Figure: uncorrelated scatter plot examples]
22 Loess Curve A loess curve is another graphic aid that adds a smooth curve to a scatter plot to provide better perception of the pattern of dependence. The word loess is short for local regression.
23 Similarity and Dissimilarity Similarity: the similarity between two objects is a numerical measure of the degree to which the two objects are alike. Similarities are higher for pairs of objects that are more alike. Similarities are usually non-negative and are often between 0 (no similarity) and 1 (complete similarity). Dissimilarity: the dissimilarity between two objects is a numerical measure of the degree to which the two objects are different. Dissimilarities are lower for more similar pairs of objects. The term distance is used as a synonym for dissimilarity, although distance is often used to refer to a special class of dissimilarities. Dissimilarities sometimes fall in the interval [0, 1], but it is also common for them to range from 0 to ∞. Proximity refers to either a similarity or a dissimilarity.
24 Similarity/Dissimilarity for Simple Attributes The proximity of objects with a number of attributes is typically defined by combining the proximities of individual attributes. Consider objects described by one nominal attribute. What would it mean for two such objects to be similar? Here p and q are the attribute values for two data objects.
25 Euclidean Distance Dissimilarities between data objects. Euclidean distance: $dist = \sqrt{\sum_{k=1}^{n}(p_k - q_k)^2}$, where $n$ is the number of dimensions (attributes) and $p_k$ and $q_k$ are, respectively, the $k$-th attributes (components) of data objects p and q. Normally the attributes are numeric. Standardization is necessary if scales differ.
26 Euclidean Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
[Figure: the points p1 to p4 plotted in the plane] [Distance matrix: pairwise Euclidean distances between p1, p2, p3, and p4]
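The pairwise Euclidean distances for the four points listed on this slide can be computed as follows. The helper name `distance_matrix` is an assumption; `math.dist` is the standard-library Euclidean distance (Python 3.8 and later).

```python
import math

# Points as listed on the slide.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}


def distance_matrix(pts):
    """Return a list of rows of pairwise Euclidean distances."""
    coords = list(pts.values())
    return [[math.dist(a, b) for b in coords] for a in coords]


matrix = distance_matrix(points)
for name, row in zip(points, matrix):
    print(name, " ".join(f"{d:6.3f}" for d in row))
```

The matrix is symmetric with zeros on the diagonal, which is why slides often show only its lower triangle.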
27 Minkowski Distance Minkowski distance is a generalization of Euclidean distance: $dist = \left(\sum_{k=1}^{n} |p_k - q_k|^r\right)^{1/r}$, where $r$ is a parameter, $n$ is the number of dimensions (attributes), and $p_k$ and $q_k$ are the $k$-th attributes of data objects p and q. Note that Minkowski distance is Euclidean distance when r = 2.
28 Minkowski Distance: Examples r = 1: city block (Manhattan, taxicab, $L_1$ norm) distance. A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors. r = 2: Euclidean distance. r → ∞: supremum ($L_{max}$ norm, $L_\infty$ norm) distance. This is the maximum difference between any component of the vectors. Do not confuse r with n; all these distances are defined for all numbers of dimensions.
29 Minkowski Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
[Distance matrices: pairwise Manhattan ($L_1$), Euclidean ($L_2$), and supremum ($L_\infty$) distances between p1, p2, p3, and p4]
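A sketch of the Minkowski distance for the three values of r discussed above, applied to the points on this slide. The function name and the convention of passing `math.inf` for the supremum distance are assumptions.

```python
import math


def minkowski(p, q, r):
    """Minkowski distance with parameter r; r = math.inf gives the supremum distance."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)


p1, p2, p4 = (0, 2), (2, 0), (5, 1)
for label, r in (("L1", 1), ("L2", 2), ("Lmax", math.inf)):
    print(label, minkowski(p1, p2, r), minkowski(p1, p4, r))
```

For any pair of points the supremum distance is no larger than the Euclidean distance, which in turn is no larger than the Manhattan distance.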
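30 (Metric) Properties of Distances Distances, such as the Euclidean distance, have some well-known properties. If d(x, y) is the distance between x and y, then the following properties hold. 1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y (positivity). 2. d(x, y) = d(y, x) for all x and y (symmetry). 3. d(x, z) ≤ d(x, y) + d(y, z) for all x, y, and z (triangle inequality). Measures that satisfy all three properties are known as metrics. Some dissimilarities do not satisfy one or more of the metric properties. Examples: set difference, time difference.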
31 Common Properties of a Similarity Similarities also have some well-known properties. 1. s(p, q) = 1 (or maximum similarity) only if p = q (0 ≤ s ≤ 1). 2. s(p, q) = s(q, p) for all p and q (symmetry). Here s(p, q) is the similarity between points (data objects) p and q.
32 Similarity Between Binary Vectors: Simple Matching and Jaccard Coefficients Similarity measures between objects that contain only binary attributes are called similarity coefficients. A common situation is that objects p and q have only binary attributes. Compute similarities using the following quantities: $M_{01}$ = the number of attributes where p was 0 and q was 1; $M_{10}$ = the number of attributes where p was 1 and q was 0; $M_{00}$ = the number of attributes where p was 0 and q was 0; $M_{11}$ = the number of attributes where p was 1 and q was 1. The Simple Matching Coefficient counts both presences and absences equally: SMC = number of matches / number of attributes = $(M_{11} + M_{00}) / (M_{01} + M_{10} + M_{11} + M_{00})$. The Jaccard Coefficient is frequently used for asymmetric binary attributes: J = number of 11 matches / number of not-both-zero attribute values = $M_{11} / (M_{01} + M_{10} + M_{11})$.
33 SMC versus Jaccard Coefficient: Example For two binary vectors p and q with 10 attributes each: $M_{01}$ = 2 (the number of attributes where p was 0 and q was 1), $M_{10}$ = 1 (the number of attributes where p was 1 and q was 0), $M_{00}$ = 7 (the number of attributes where p was 0 and q was 0), $M_{11}$ = 0 (the number of attributes where p was 1 and q was 1). SMC = $(M_{11} + M_{00}) / (M_{01} + M_{10} + M_{11} + M_{00})$ = (0 + 7) / (2 + 1 + 0 + 7) = 0.7. J = $M_{11} / (M_{01} + M_{10} + M_{11})$ = 0 / (2 + 1 + 0) = 0.
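The two coefficients can be sketched directly from the four counts. The vectors `p` and `q` below are hypothetical: they are not taken from the transcript and were chosen only because they reproduce the counts quoted on this slide. The function names, and the choice to return 0.0 for the Jaccard coefficient when both vectors are all zeros, are also assumptions.

```python
def match_counts(p, q):
    """Return (M01, M10, M00, M11) for two equal-length binary vectors."""
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    return m01, m10, m00, m11


def smc(p, q):
    """Simple Matching Coefficient: matches / number of attributes."""
    m01, m10, m00, m11 = match_counts(p, q)
    return (m11 + m00) / (m01 + m10 + m11 + m00)


def jaccard(p, q):
    """Jaccard coefficient: 1-1 matches / attributes that are not both zero."""
    m01, m10, _, m11 = match_counts(p, q)
    denominator = m01 + m10 + m11
    return m11 / denominator if denominator else 0.0  # assumed convention


# Hypothetical vectors consistent with M01 = 2, M10 = 1, M00 = 7, M11 = 0.
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(match_counts(p, q), smc(p, q), jaccard(p, q))
```

The gap between the two results shows why the Jaccard coefficient suits asymmetric attributes: the seven shared absences make the vectors look similar under SMC even though they share no presence.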
34 Cosine Similarity Cosine similarity is a common measure for document similarity. If $d_1$ and $d_2$ are two document vectors, then $\cos(d_1, d_2) = (d_1 \cdot d_2) / (\|d_1\| \, \|d_2\|)$, where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector $d$. A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.
35 Cosine Similarity: Example $\cos(d_1, d_2) = (d_1 \cdot d_2) / (\|d_1\| \, \|d_2\|)$, where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector $d$. Find the similarity between documents 1 and 2. $d_1$ = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0), $d_2$ = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1). $d_1 \cdot d_2$ = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25. $\|d_1\|$ = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.48. $\|d_2\|$ = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.12. $\cos(d_1, d_2)$ = 0.94.
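The worked example on this slide can be reproduced with a short function. The vectors are the ones given on the slide; the function name is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))
```

Scaling either vector by a positive constant leaves the result unchanged, since only the angle between the vectors matters.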
36 Cosine Similarity Cosine similarity really is a measure of the (cosine of the) angle between x and y. If the cosine similarity is 1, the angle between x and y is 0°, and x and y are the same except for magnitude (length). If the cosine similarity is 0, then the angle between x and y is 90°, and they do not share any terms. Cosine similarity can be written as $\cos(x, y) = \frac{x}{\|x\|} \cdot \frac{y}{\|y\|}$. Dividing x and y by their lengths normalizes them to have a length of 1. This means that cosine similarity does not take the magnitude of the two data objects into account when computing similarity. Euclidean distance might be a better choice when magnitude is important.
37 Extended Jaccard Coefficient (Tanimoto Coefficient) The extended Jaccard coefficient can be used for document data and reduces to the Jaccard coefficient in the case of binary attributes. It is also known as the Tanimoto coefficient.
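38 Correlation Correlation measures the linear relationship between objects. Pearson's correlation coefficient between two data objects x and y: $corr(x, y) = \frac{s_{xy}}{s_x \, s_y}$, where $s_{xy} = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \bar{x})(y_k - \bar{y})$ is the covariance of x and y, $s_x = \sqrt{\frac{1}{n-1}\sum_{k=1}^{n}(x_k - \bar{x})^2}$ and $s_y = \sqrt{\frac{1}{n-1}\sum_{k=1}^{n}(y_k - \bar{y})^2}$ are their standard deviations, and $\bar{x}$ and $\bar{y}$ are their means.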
39 Correlation: Perfect Correlation Correlation is always in the range -1 to 1. A correlation of 1 (-1) means that x and y have a perfect positive (negative) linear relationship. A perfect negative linear relationship (correlation: -1): x = (-3, 6, 0, 3, -6), y = (1, -2, 0, -1, 2), $s_{xy}$ = -7.5, $s_x$ = 4.74, $s_y$ = 1.58, corr(x, y) = -1. A perfect positive linear relationship (correlation: +1): x = (3, 6, 0, 3, 6), y = (1, 2, 0, 1, 2), $s_{xy}$ = 2.1, $s_x$ = 2.51, $s_y$ = 0.84, corr(x, y) = +1.
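The two examples on this slide can be checked with a small implementation of Pearson's correlation. The sketch assumes the sample forms of covariance and standard deviation (divisor n - 1), which reproduce the covariance values -7.5 and 2.1 quoted on the slide; the function names are assumptions.

```python
import math


def covariance(x, y):
    """Sample covariance with divisor n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def std(x):
    """Sample standard deviation with divisor n - 1."""
    return math.sqrt(covariance(x, x))


def correlation(x, y):
    """Pearson's correlation: covariance divided by both standard deviations."""
    return covariance(x, y) / (std(x) * std(y))


x_neg, y_neg = (-3, 6, 0, 3, -6), (1, -2, 0, -1, 2)
x_pos, y_pos = (3, 6, 0, 3, 6), (1, 2, 0, 1, 2)
print(covariance(x_neg, y_neg), correlation(x_neg, y_neg))
print(covariance(x_pos, y_pos), correlation(x_pos, y_pos))
```

In both examples x is exactly a multiple of y (by -3 and by 3), which is what makes the linear relationship perfect.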
40 Visually Evaluating Correlation [Figure: scatter plots showing the similarity from -1 to 1]
41 Issues in Proximity Calculation Important issues related to proximity measures: (1) how to handle the case in which attributes have different scales and/or are correlated, (2) how to calculate proximity between objects that are composed of different types of attributes, e.g., quantitative and qualitative, (3) how to handle proximity calculation when attributes have different weights, i.e., when not all attributes contribute equally to the proximity of objects.
42 Standardization and Correlation for Distance Measures An important issue with distance measures is how to handle the situation when attributes do not have the same range of values. This situation is often described by saying that "the variables have different scales." Example: Euclidean distance is used to measure the distance between people based on two attributes, age and income. Unless these two attributes are standardized, the distance between two people will be dominated by income. We have to rescale both attributes so that they have the same range (e.g., 0 to 1). A related issue is how to compute distance when there is correlation between some of the attributes. A generalization of Euclidean distance, the Mahalanobis distance, is useful when attributes are correlated, have different ranges of values, and the distribution of the data is approximately Gaussian.
43 Z-score: Standardizing Numeric Data Z-score: $z = \frac{x - \mu}{\sigma}$, where $x$ is the raw score to be standardized, $\mu$ is the mean of the population, and $\sigma$ is the standard deviation. The z-score is the distance between the raw score and the population mean in units of the standard deviation; it is negative when the raw score is below the mean and positive when it is above. An alternative way: calculate the mean absolute deviation $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$, where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)$, and use the standardized measure (z-score) $z_{if} = \frac{x_{if} - m_f}{s_f}$. Using the mean absolute deviation is more robust than using the standard deviation.
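Both standardizations on this slide can be sketched for a single attribute. The function names and the sample values are assumptions; the first function uses the population standard deviation (divisor n).

```python
import math


def z_scores(values):
    """Standardize with the mean and the population standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    return [(x - mu) / sigma for x in values]


def z_scores_mad(values):
    """Standardize with the mean and the mean absolute deviation s_f."""
    n = len(values)
    m = sum(values) / n
    s = sum(abs(x - m) for x in values) / n
    return [(x - m) / s for x in values]


ages = [2, 4, 6, 8]  # hypothetical attribute values
print(z_scores(ages))
print(z_scores_mad(ages))
```

Because deviations are not squared in the second version, a single extreme value inflates the divisor less, so its own standardized score stays more visibly extreme.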
44 Mahalanobis Distance $mahalanobis(p, q) = (p - q)\,\Sigma^{-1}\,(p - q)^T$, where $\Sigma$ is the covariance matrix of the input data $X$, with entries $\Sigma_{j,k} = \frac{1}{n-1}\sum_{i=1}^{n}(X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)$, and $\Sigma^{-1}$ is the inverse of the covariance matrix. Note that the covariance matrix is the matrix whose $jk$-th entry is the covariance of the $j$-th and $k$-th attributes. [Figure: scatter of the data with two red points marked] For the red points, the Euclidean distance is 14.7 and the Mahalanobis distance is 6.
45 Mahalanobis Distance [Figure: points A, B, and C plotted over the data, together with the covariance matrix of the data] A: (0.5, 0.5), B: (0, 1), C: (1.5, 1.5). Mahalanobis(A, B) = 5. Mahalanobis(A, C) = 4.
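A two-dimensional sketch of the Mahalanobis computation follows, using the form without a square root given on the previous slide. The covariance matrix in the code is an assumption: its entries are not preserved in this transcript, and the values below were chosen only because they reproduce the two distances quoted here. The function name is also an assumption.

```python
def mahalanobis_2d(p, q, cov):
    """(p - q) * inverse(cov) * (p - q)^T for 2-D points and a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx, dy = p[0] - q[0], p[1] - q[1]
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))


# Assumed covariance matrix (not from the transcript).
sigma = [[0.3, 0.2], [0.2, 0.3]]
A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
print(mahalanobis_2d(A, B, sigma))
print(mahalanobis_2d(A, C, sigma))
```

With this matrix the attributes are positively correlated, so B, which lies across the direction of correlation, ends up farther from A than C does, even though C is farther in Euclidean terms.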
46 General Approach for Combining Similarities Sometimes attributes are of many different types, but an overall similarity is needed. The following algorithm is effective for computing an overall similarity between two objects, x and y, with different types of attributes.
47 Using Weights to Combine Similarities We may not want to treat all attributes the same. Use weights $w_k$ that are between 0 and 1 and sum to 1. The Minkowski distance is modified to include these weights.
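The modified formula itself is not preserved in this transcript. One common weighted form, assumed here, multiplies each term of the Minkowski sum by its weight: dist = (Σ w_k |p_k - q_k|^r)^(1/r). The function name and the sample weights are also assumptions.

```python
def weighted_minkowski(p, q, weights, r):
    """Minkowski distance with one weight per attribute (assumed form)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights are expected to sum to 1")
    total = sum(w * abs(a - b) ** r for a, b, w in zip(p, q, weights))
    return total ** (1 / r)


p, q = (0, 2), (2, 0)
print(weighted_minkowski(p, q, (0.5, 0.5), 2))  # both attributes count equally
print(weighted_minkowski(p, q, (0.9, 0.1), 2))  # same result here: both differences are 2
print(weighted_minkowski((0, 0), (3, 1), (0.9, 0.1), 1))  # first attribute dominates
```

Weights only change the result when the per-attribute differences are unequal, as the last call shows.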
48 Selecting the Right Proximity Measure For many types of dense, continuous data, metric distance measures such as Euclidean distance are often used. The cosine, Jaccard, and extended Jaccard measures are appropriate for sparse, asymmetric data, in which most objects have only a few of the characteristics described by the attributes and thus are highly similar in terms of the characteristics they do not have. In some cases, transformation or normalization of the data is important for obtaining a proper similarity measure, since such transformations are not always present in proximity measures. The proper choice of a proximity measure can be a time-consuming task that requires careful consideration of both domain knowledge and the purpose for which the measure is being used. A number of different similarity measures may need to be evaluated to see which ones produce results that make the most sense.
More informationSUMMARIZING MEASURED DATA. Gaia Maselli
SUMMARIZING MEASURED DATA Gaia Maselli maselli@di.uniroma1.it Computer Network Performance 2 Overview Basic concepts Summarizing measured data Summarizing data by a single number Summarizing variability
More informationDescribing distributions with numbers
Describing distributions with numbers A large number or numerical methods are available for describing quantitative data sets. Most of these methods measure one of two data characteristics: The central
More informationLecture Slides. Elementary Statistics Twelfth Edition. by Mario F. Triola. and the Triola Statistics Series. Section 3.1- #
Lecture Slides Elementary Statistics Twelfth Edition and the Triola Statistics Series by Mario F. Triola Chapter 3 Statistics for Describing, Exploring, and Comparing Data 3-1 Review and Preview 3-2 Measures
More informationClass 11 Maths Chapter 15. Statistics
1 P a g e Class 11 Maths Chapter 15. Statistics Statistics is the Science of collection, organization, presentation, analysis and interpretation of the numerical data. Useful Terms 1. Limit of the Class
More informationCS145: INTRODUCTION TO DATA MINING
CS145: INTRODUCTION TO DATA MINING 2: Vector Data: Prediction Instructor: Yizhou Sun yzsun@cs.ucla.edu October 8, 2018 TA Office Hour Time Change Junheng Hao: Tuesday 1-3pm Yunsheng Bai: Thursday 1-3pm
More informationChapter 1 - Lecture 3 Measures of Location
Chapter 1 - Lecture 3 of Location August 31st, 2009 Chapter 1 - Lecture 3 of Location General Types of measures Median Skewness Chapter 1 - Lecture 3 of Location Outline General Types of measures What
More informationLecture 1: Descriptive Statistics
Lecture 1: Descriptive Statistics MSU-STT-351-Sum 15 (P. Vellaisamy: MSU-STT-351-Sum 15) Probability & Statistics for Engineers 1 / 56 Contents 1 Introduction 2 Branches of Statistics Descriptive Statistics
More informationLast Lecture. Distinguish Populations from Samples. Knowing different Sampling Techniques. Distinguish Parameters from Statistics
Last Lecture Distinguish Populations from Samples Importance of identifying a population and well chosen sample Knowing different Sampling Techniques Distinguish Parameters from Statistics Knowing different
More informationCHAPTER 5: EXPLORING DATA DISTRIBUTIONS. Individuals are the objects described by a set of data. These individuals may be people, animals or things.
(c) Epstein 2013 Chapter 5: Exploring Data Distributions Page 1 CHAPTER 5: EXPLORING DATA DISTRIBUTIONS 5.1 Creating Histograms Individuals are the objects described by a set of data. These individuals
More informationSection 3. Measures of Variation
Section 3 Measures of Variation Range Range = (maximum value) (minimum value) It is very sensitive to extreme values; therefore not as useful as other measures of variation. Sample Standard Deviation The
More informationData preprocessing. DataBase and Data Mining Group 1. Data set types. Tabular Data. Document Data. Transaction Data. Ordered Data
Elena Baralis and Tania Cerquitelli Politecnico di Torino Data set types Record Tables Document Data Transaction Data Graph World Wide Web Molecular Structures Ordered Spatial Data Temporal Data Sequential
More informationStat 101 Exam 1 Important Formulas and Concepts 1
1 Chapter 1 1.1 Definitions Stat 101 Exam 1 Important Formulas and Concepts 1 1. Data Any collection of numbers, characters, images, or other items that provide information about something. 2. Categorical/Qualitative
More informationSummarizing Measured Data
Summarizing Measured Data 12-1 Overview Basic Probability and Statistics Concepts: CDF, PDF, PMF, Mean, Variance, CoV, Normal Distribution Summarizing Data by a Single Number: Mean, Median, and Mode, Arithmetic,
More informationCIVL 7012/8012. Collection and Analysis of Information
CIVL 7012/8012 Collection and Analysis of Information Uncertainty in Engineering Statistics deals with the collection and analysis of data to solve real-world problems. Uncertainty is inherent in all real
More informationDescribing Distributions
Describing Distributions With Numbers April 18, 2012 Summary Statistics. Measures of Center. Percentiles. Measures of Spread. A Summary Statement. Choosing Numerical Summaries. 1.0 What Are Summary Statistics?
More informationChapter 1:Descriptive statistics
Slide 1.1 Chapter 1:Descriptive statistics Descriptive statistics summarises a mass of information. We may use graphical and/or numerical methods Examples of the former are the bar chart and XY chart,
More informationCHAPTER 2: Describing Distributions with Numbers
CHAPTER 2: Describing Distributions with Numbers The Basic Practice of Statistics 6 th Edition Moore / Notz / Fligner Lecture PowerPoint Slides Chapter 2 Concepts 2 Measuring Center: Mean and Median Measuring
More informationChapter Four. Numerical Descriptive Techniques. Range, Standard Deviation, Variance, Coefficient of Variation
Chapter Four Numerical Descriptive Techniques 4.1 Numerical Descriptive Techniques Measures of Central Location Mean, Median, Mode Measures of Variability Range, Standard Deviation, Variance, Coefficient
More informationCS 147: Computer Systems Performance Analysis
CS 147: Computer Systems Performance Analysis Summarizing Variability and Determining Distributions CS 147: Computer Systems Performance Analysis Summarizing Variability and Determining Distributions 1
More informationDescriptive Statistics-I. Dr Mahmoud Alhussami
Descriptive Statistics-I Dr Mahmoud Alhussami Biostatistics What is the biostatistics? A branch of applied math. that deals with collecting, organizing and interpreting data using well-defined procedures.
More informationChapter 3. Measuring data
Chapter 3 Measuring data 1 Measuring data versus presenting data We present data to help us draw meaning from it But pictures of data are subjective They re also not susceptible to rigorous inference Measuring
More informationDescriptive Univariate Statistics and Bivariate Correlation
ESC 100 Exploring Engineering Descriptive Univariate Statistics and Bivariate Correlation Instructor: Sudhir Khetan, Ph.D. Wednesday/Friday, October 17/19, 2012 The Central Dogma of Statistics used to
More informationWhat is Statistics? Statistics is the science of understanding data and of making decisions in the face of variability and uncertainty.
What is Statistics? Statistics is the science of understanding data and of making decisions in the face of variability and uncertainty. Statistics is a field of study concerned with the data collection,
More informationADMS2320.com. We Make Stats Easy. Chapter 4. ADMS2320.com Tutorials Past Tests. Tutorial Length 1 Hour 45 Minutes
We Make Stats Easy. Chapter 4 Tutorial Length 1 Hour 45 Minutes Tutorials Past Tests Chapter 4 Page 1 Chapter 4 Note The following topics will be covered in this chapter: Measures of central location Measures
More informationDEPARTMENT OF QUANTITATIVE METHODS & INFORMATION SYSTEMS QM 120. Spring 2008
DEPARTMENT OF QUANTITATIVE METHODS & INFORMATION SYSTEMS Introduction to Business Statistics QM 120 Chapter 3 Spring 2008 Measures of central tendency for ungrouped data 2 Graphs are very helpful to describe
More informationMeasures of center. The mean The mean of a distribution is the arithmetic average of the observations:
Measures of center The mean The mean of a distribution is the arithmetic average of the observations: x = x 1 + + x n n n = 1 x i n i=1 The median The median is the midpoint of a distribution: the number
More informationMeelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 03
Meelis Kull meelis.kull@ut.ee Autumn 2017 1 Demo: Data science mini-project CRISP-DM: cross-industrial standard process for data mining Data understanding: Types of data Data understanding: First look
More information200 participants [EUR] ( =60) 200 = 30% i.e. nearly a third of the phone bills are greater than 75 EUR
Ana Jerončić 200 participants [EUR] about half (71+37=108) 200 = 54% of the bills are small, i.e. less than 30 EUR (18+28+14=60) 200 = 30% i.e. nearly a third of the phone bills are greater than 75 EUR
More informationare the objects described by a set of data. They may be people, animals or things.
( c ) E p s t e i n, C a r t e r a n d B o l l i n g e r 2016 C h a p t e r 5 : E x p l o r i n g D a t a : D i s t r i b u t i o n s P a g e 1 CHAPTER 5: EXPLORING DATA DISTRIBUTIONS 5.1 Creating Histograms
More informationIntroduction to Basic Statistics Version 2
Introduction to Basic Statistics Version 2 Pat Hammett, Ph.D. University of Michigan 2014 Instructor Comments: This document contains a brief overview of basic statistics and core terminology/concepts
More informationTOPIC: Descriptive Statistics Single Variable
TOPIC: Descriptive Statistics Single Variable I. Numerical data summary measurements A. Measures of Location. Measures of central tendency Mean; Median; Mode. Quantiles - measures of noncentral tendency
More informationMATH 117 Statistical Methods for Management I Chapter Three
Jubail University College MATH 117 Statistical Methods for Management I Chapter Three This chapter covers the following topics: I. Measures of Center Tendency. 1. Mean for Ungrouped Data (Raw Data) 2.
More informationCS249: ADVANCED DATA MINING
CS249: ADVANCED DATA MINING Clustering Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu May 2, 2017 Announcements Homework 2 due later today Due May 3 rd (11:59pm) Course project
More informationStatistics I Chapter 2: Univariate data analysis
Statistics I Chapter 2: Univariate data analysis Chapter 2: Univariate data analysis Contents Graphical displays for categorical data (barchart, piechart) Graphical displays for numerical data data (histogram,
More informationChapter2 Description of samples and populations. 2.1 Introduction.
Chapter2 Description of samples and populations. 2.1 Introduction. Statistics=science of analyzing data. Information collected (data) is gathered in terms of variables (characteristics of a subject that
More informationBIOL 51A - Biostatistics 1 1. Lecture 1: Intro to Biostatistics. Smoking: hazardous? FEV (l) Smoke
BIOL 51A - Biostatistics 1 1 Lecture 1: Intro to Biostatistics Smoking: hazardous? FEV (l) 1 2 3 4 5 No Yes Smoke BIOL 51A - Biostatistics 1 2 Box Plot a.k.a box-and-whisker diagram or candlestick chart
More informationNumerical Measures of Central Tendency
ҧ Numerical Measures of Central Tendency The central tendency of the set of measurements that is, the tendency of the data to cluster, or center, about certain numerical values; usually the Mean, Median
More information3.1 Measure of Center
3.1 Measure of Center Calculate the mean for a given data set Find the median, and describe why the median is sometimes preferable to the mean Find the mode of a data set Describe how skewness affects
More information2/2/2015 GEOGRAPHY 204: STATISTICAL PROBLEM SOLVING IN GEOGRAPHY MEASURES OF CENTRAL TENDENCY CHAPTER 3: DESCRIPTIVE STATISTICS AND GRAPHICS
Spring 2015: Lembo GEOGRAPHY 204: STATISTICAL PROBLEM SOLVING IN GEOGRAPHY CHAPTER 3: DESCRIPTIVE STATISTICS AND GRAPHICS Descriptive statistics concise and easily understood summary of data set characteristics
More informationCHAPTER 1. Introduction
CHAPTER 1 Introduction Engineers and scientists are constantly exposed to collections of facts, or data. The discipline of statistics provides methods for organizing and summarizing data, and for drawing
More informationIntroduction to Statistics
Introduction to Statistics Data and Statistics Data consists of information coming from observations, counts, measurements, or responses. Statistics is the science of collecting, organizing, analyzing,
More information3rd Quartile. 1st Quartile) Minimum
EXST7034 - Regression Techniques Page 1 Regression diagnostics dependent variable Y3 There are a number of graphic representations which will help with problem detection and which can be used to obtain
More informationPerformance of fourth-grade students on an agility test
Starter Ch. 5 2005 #1a CW Ch. 4: Regression L1 L2 87 88 84 86 83 73 81 67 78 83 65 80 50 78 78? 93? 86? Create a scatterplot Find the equation of the regression line Predict the scores Chapter 5: Understanding
More informationUnit 2: Numerical Descriptive Measures
Unit 2: Numerical Descriptive Measures Summation Notation Measures of Central Tendency Measures of Dispersion Chebyshev's Rule Empirical Rule Measures of Relative Standing Box Plots z scores Jan 28 10:48
More informationMgtOp 215 Chapter 3 Dr. Ahn
MgtOp 215 Chapter 3 Dr. Ahn Measures of central tendency (center, location): measures the middle point of a distribution or data; these include mean and median. Measures of dispersion (variability, spread):
More informationTypes of Information. Topic 2 - Descriptive Statistics. Examples. Sample and Sample Size. Background Reading. Variables classified as STAT 511
Topic 2 - Descriptive Statistics STAT 511 Professor Bruce Craig Types of Information Variables classified as Categorical (qualitative) - variable classifies individual into one of several groups or categories
More informationStatistics I Chapter 2: Univariate data analysis
Statistics I Chapter 2: Univariate data analysis Chapter 2: Univariate data analysis Contents Graphical displays for categorical data (barchart, piechart) Graphical displays for numerical data data (histogram,
More informationLecture 2. Quantitative variables. There are three main graphical methods for describing, summarizing, and detecting patterns in quantitative data:
Lecture 2 Quantitative variables There are three main graphical methods for describing, summarizing, and detecting patterns in quantitative data: Stemplot (stem-and-leaf plot) Histogram Dot plot Stemplots
More informationUnit Two Descriptive Biostatistics. Dr Mahmoud Alhussami
Unit Two Descriptive Biostatistics Dr Mahmoud Alhussami Descriptive Biostatistics The best way to work with data is to summarize and organize them. Numbers that have not been summarized and organized are
More informationLecture 2. Descriptive Statistics: Measures of Center
Lecture 2. Descriptive Statistics: Measures of Center Descriptive Statistics summarize or describe the important characteristics of a known set of data Inferential Statistics use sample data to make inferences
More information