ANÁLISE DOS DADOS. Daniela Barreiro Claro

Similar documents
Chapter I: Introduction & Foundations

ECLT 5810 Data Preprocessing. Prof. Wai Lam

Getting To Know Your Data

MSCBD 5002/IT5210: Knowledge Discovery and Data Minig

Data Mining: Concepts and Techniques

Descriptive Data Summarization

CS249: ADVANCED DATA MINING

Similarity and Dissimilarity

Data Mining 4. Cluster Analysis

CSE5243 INTRO. TO DATA MINING

Data Exploration Slides by: Shree Jaswal

CS145: INTRODUCTION TO DATA MINING

Type of data Interval-scaled variables: Binary variables: Nominal, ordinal, and ratio variables: Variables of mixed types: Attribute Types Nominal: Pr

Data preprocessing. DataBase and Data Mining Group 1. Data set types. Tabular Data. Document Data. Transaction Data. Ordered Data

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Similarity Numerical measure of how alike two data objects are Value is higher when objects are more alike Often falls in the range [0,1]

CISC 4631 Data Mining

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Clustering Lecture 1: Basics. Jing Gao SUNY Buffalo

Tastitsticsss? What s that? Principles of Biostatistics and Informatics. Variables, outcomes. Tastitsticsss? What s that?

P8130: Biostatistical Methods I

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

What is Statistics? Statistics is the science of understanding data and of making decisions in the face of variability and uncertainty.

1-1. Chapter 1. Sampling and Descriptive Statistics by The McGraw-Hill Companies, Inc. All rights reserved.

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Topic 3: Introduction to Statistics. Algebra 1. Collecting Data. Table of Contents. Categorical or Quantitative? What is the Study of Statistics?!

CS6220: DATA MINING TECHNIQUES

1. Exploratory Data Analysis

Chapter 3. Data Description

CS6220: DATA MINING TECHNIQUES

STAT 200 Chapter 1 Looking at Data - Distributions

Chapter 4. Displaying and Summarizing. Quantitative Data

Chapter 2: Tools for Exploring Univariate Data

3.1 Measures of Central Tendency: Mode, Median and Mean. Average a single number that is used to describe the entire sample or population

BIOL 51A - Biostatistics 1 1. Lecture 1: Intro to Biostatistics. Smoking: hazardous? FEV (l) Smoke

AP Statistics Cumulative AP Exam Study Guide

Chapter 1. Looking at Data

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 03

Unit 2. Describing Data: Numerical

Sets and Set notation. Algebra 2 Unit 8 Notes

Data Mining and Analysis

College Mathematics

Lecture 1: Descriptive Statistics

Descriptive Univariate Statistics and Bivariate Correlation

Comparing Measures of Central Tendency *

Descriptive statistics Maarten Jansen

CIVL 7012/8012. Collection and Analysis of Information

1 Measures of the Center of a Distribution

Introduction to Statistics

Performance of fourth-grade students on an agility test

Clinical Research Module: Biostatistics

University of Florida CISE department Gator Engineering. Clustering Part 1

Descriptive Statistics

Chapter 1:Descriptive statistics

Announcements. Lecture 1 - Data and Data Summaries. Data. Numerical Data. all variables. continuous discrete. Homework 1 - Out 1/15, due 1/22

AP Final Review II Exploring Data (20% 30%)

Stat 101 Exam 1 Important Formulas and Concepts 1

A is one of the categories into which qualitative data can be classified.

proximity similarity dissimilarity distance Proximity Measures:

CHAPTER 1. Introduction

Histograms allow a visual interpretation

Glossary for the Triola Statistics Series

MATH 117 Statistical Methods for Management I Chapter Three

ST Presenting & Summarising Data Descriptive Statistics. Frequency Distribution, Histogram & Bar Chart

Chapter 1 - Lecture 3 Measures of Location

Types of Information. Topic 2 - Descriptive Statistics. Examples. Sample and Sample Size. Background Reading. Variables classified as STAT 511

Statistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1

Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes

Marquette University MATH 1700 Class 5 Copyright 2017 by D.B. Rowe

Σ x i. Sigma Notation

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Review for Exam #1. Chapter 1. The Nature of Data. Definitions. Population. Sample. Quantitative data. Qualitative (attribute) data

STP 420 INTRODUCTION TO APPLIED STATISTICS NOTES

University of Jordan Fall 2009/2010 Department of Mathematics

Chapter 4.notebook. August 30, 2017

Chapter Four. Numerical Descriptive Techniques. Range, Standard Deviation, Variance, Coefficient of Variation

Lecture Notes 2: Variables and graphics

DEPARTMENT OF QUANTITATIVE METHODS & INFORMATION SYSTEMS QM 120. Spring 2008

Lecture Slides. Elementary Statistics Tenth Edition. by Mario F. Triola. and the Triola Statistics Series. Slide 1

Biostatistics for biomedical profession. BIMM34 Karin Källen & Linda Hartman November-December 2015

Summarizing and Displaying Measurement Data/Understanding and Comparing Distributions

Lecture 2 and Lecture 3

QUANTITATIVE DATA. UNIVARIATE DATA data for one variable

Units. Exploratory Data Analysis. Variables. Student Data

Assignment 3: Chapter 2 & 3 (2.6, 3.8)

Last Lecture. Distinguish Populations from Samples. Knowing different Sampling Techniques. Distinguish Parameters from Statistics

Ø Set of mutually exclusive categories. Ø Classify or categorize subject. Ø No meaningful order to categorization.

MgtOp 215 Chapter 3 Dr. Ahn

Chapter2 Description of samples and populations. 2.1 Introduction.

SUMMARIZING MEASURED DATA. Gaia Maselli

Mathematics Grade 7 Transition Alignment Guide (TAG) Tool

Pre Algebra. Curriculum (634 topics)

Measurement and Data

Data Mining: Data. Lecture Notes for Chapter 2 Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

Statistics for Managers using Microsoft Excel 6 th Edition

Week 1: Intro to R and EDA

Statistics in medicine

REVIEW: Midterm Exam. Spring 2012

2011 Pearson Education, Inc

Transcription:

ANÁLISE DOS DADOS Daniela Barreiro Claro

Outline Data types Graphical Analysis Proimity measures Prof. Daniela Barreiro Claro

Types of Data Sets Record Ordered Relational records Video data: sequence of images Data matri, e.g., numerical matri, crosstabs Document data: tet documents: term-frequency vector Transaction data Graph and network World Wide Web Social or information networks Temporal data: time-series Sequential Data: transaction sequences Genetic sequence data Spatial, image and multimedia: Spatial data: maps Image data: Video data: Molecular Structures

Important Characteristics of Structured Data Dimensionality Curse of dimensionality Sparsity Only presence counts; 0 contributes to sparsity Resolution Patterns depend on the scale Distribution Centrality and dispersion

Data Objects Data sets are made up of data objects. A data object represents an entity. Eamples: sales database: customers, store items, sales medical database: patients, treatments university database: students, professors, courses Also called samples, eamples, instances, data points, objects, tuples. Data objects are described by attributes. Database rows -> data objects; columns ->attributes.

Attributes Attribute (or dimensions, features, variables): a data field, representing a characteristic or feature of a data object. E.g., customer _ID, name, address Types: Nominal Binary Ordinary Numeric: quantitative Interval-scaled Ratio-scaled

Attribute Types Nominal: categories, states, or names of things Binary Hair_color = {auburn, black, blond, brown, grey, red, white} marital status, occupation, ID numbers, zip codes Nominal attribute with only 2 states (0 and 1) Ordinal Symmetric binary: both outcomes equally important e.g., gender Asymmetric binary: outcomes not equally important. e.g., medical test (positive vs. negative) Convention: assign 1 to most important outcome (e.g., HIV positive) Values have a meaningful order (ranking) but magnitude between successive values is not known. Size = {small, medium, large}, grades, army rankings

8 Numeric Attribute Types Numeric(integer or real-valued) Interval Ratio Measured on a scale of equal-sized units Values have order E.g., temperature in C or F, calendar dates No true zero-point Inherent zero-point We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K ). e.g., temperature in Kelvin, length, counts, monetary quantities

9 Discrete vs. Continuous Attributes Discrete Attribute Has only a finite or countably infinite set of values E.g., hair color, profession, or the set of words in a collection of documents Sometimes, represented as integer variables E.g. zip codes, customer_id Note: Binary attributes are a special case of discrete attributes Continuous Attribute Has real numbers as attribute values E.g., temperature, height, or weight Practically, real values can only be measured and represented using a finite number of digits Continuous attributes are typically represented as floating-point variables

10 Basic Statistical Descriptions of Data Motivation To better understand the data: central tendency, variation and spread Data dispersion characteristics median, ma, min, quantiles, outliers, variance, etc. Numerical dimensions correspond to sorted intervals Data dispersion: analyzed with multiple granularities of precision Boplot or quantile analysis on sorted intervals Dispersion analysis on computed measures Folding measures into numerical dimensions Boplot or quantile analysis on the transformed cube

Measuring the Central Tendency Mean (algebraic measure) (sample vs. population): Note: n is sample size and N is population size. n wi i i1 Weighted arithmetic mean: n wi i1 Trimmed mean: chopping etreme values 1 n n i1 i 1 2.. N N Median: Middle value is used for skewed (assymetric) data Mode It separates the higher half of a data from the lower half Estimated by interpolation (for grouped data): median L 1 n / 2 ( ( freq freq) l ) width median Value that occurs most frequently in the data Unimodal, bimodal, trimodal Empirical formula: mean mode 3( mean median)

Symmetric vs. Skewed Data Median, mean and mode of symmetric, positively and negatively skewed data positively skewed Symmetric data negatively skewed 12

Measuring the Dispersion of Data Quartiles, outliers and boplots Quantiles are points taken at regular interval of data distribution Quartiles: Q 1 (25 th percentile), Q 3 (75 th percentile) Inter-quartile range: IQR = Q 3 Q 1 Five number summary: min, Q 1, median, Q 3, ma Boplot: ends of the bo are the quartiles; median is marked; add whiskers, and plot outliers individually Outlier: usually, a value higher/lower than 1.5 IQR Variance and standard deviation are measures for data dispersion Indicate how the spread out a data distribution is Variance: (algebraic, scalable computation) 2 1 N n i1 2 i 2 13

Boplot Analysis Five-number summary of a distribution Minimum, Q1, Median, Q3, Maimum Boplot Data is represented with a bo The ends of the bo are at the first and third quartiles, i.e., the height of the bo is IQR The median is marked by a line within the bo Whiskers: two lines outside the bo etended to Minimum and Maimum Outliers: points beyond a specified outlier threshold, plotted individually 14

Graphic Displays of Basic Statistical Descriptions Boplot: graphic display of five-number summary Histogram: -ais are values, y-ais repres. frequencies Quantile plot: each value i is paired with f i indicating that approimately 100 f i % of data are i Quantile-quantile (q-q) plot: graphs the quantiles of one univariant distribution against the corresponding quantiles of another Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane 15

Histogram Analysis Histogram: Graph display of tabulated frequencies, shown as bars It shows what proportion of cases fall into each of several categories Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width The categories are usually specified as nonoverlapping intervals of some variable. The categories (bars) must be adjacent

Histograms Often Tell More than Boplots The two histograms shown in the left may have the same boplot representation The same values for: min, Q1, median, Q3, ma But they have rather different data distributions

Histogram or Bar Graph? 18

Quantile Plot Quantile Quantileplot 19 Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences) Plots quantile information For a data i data sorted in increasing order, f i indicates that approimately 100 f i % of the data are below or equal to the value i Plots quantile quantile Eample shows unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2. Quantile - quantile plot Quantile plot

Scatter plot Provides a first look at bivariate data to see clusters of points, outliers, etc Each pair of values is treated as a pair of coordinates and plotted as points in the plane positively correlated negative correlated

Eample of Uncorrelated Data 21

Similarity and Dissimilarity Similarity Numerical measure of how alike two data objects are Value is higher when objects are more alike Often falls in the range [0,1] Dissimilarity (e.g., distance) Numerical measure of how different two data objects are Lower when objects are more alike Minimum dissimilarity is often 0 Upper limit varies Proimity refers to a similarity or dissimilarity

Data Matri and Dissimilarity Matri Data matri n data points with p dimensions Two modes 11 i1 n1 1f if nf 1p ip np Dissimilarity matri n data points, but registers only the distance A triangular matri Single mode 0 d(2,1) d(3,1) : d( n,1) 0 d(3,2) : d( n,2) 0 : 0

Proimity Measure for Nominal Attributes Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute) Method 1: Simple matching m: #(number) of matches (the number of attributes for which i and j are in the same state p: total #(number) of attributes describing the objects Method 2: Use a large number of binary attributes d( i, j) p p m creating a new binary attribute for each of the M nominal states 0 d(2,1) d(3,1) : d( n,1) Object ID 1 Code A 2 Code B 3 Code C 4 Code A 0 1 1 0 d(3,2) d( n,2) 0 1 1 0 : Test1(nominal) 0 1 0 : 1 0 0, if the object i and j match, and 1 if the objects differ.

Proimity Measure for Binary Attributes A contingency table for binary data Object j Distance measure for symmetric binary variables: Object i Distance measure for asymmetric binary variables: t is unimportant and is thus ignored Jaccard coefficient (similarity measure for asymmetric binary variables): 1 d( i, j) Note: Jaccard coefficient is the same as coherence :

Dissimilarity between Binary Variables Eample is set to 1 Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4 Jack M Y N P N N N Mary F Y N P N P N Jim M Y P N N N N Gender is a symmetric attribute The remaining attributes are asymmetric binary Let the values Y and P be 1, and the value N = 0 0 1 d( jack, mary) 0.33 2 0 1 1 1 d( jack, jim) 0.67 1 1 1 1 2 d( jim, mary) 0.75 1 1 2 Object i Object j

Proimity measure for Numeric Attributes Euclidean Distance Manhattan Distance Minkowski Distance It is a generalization of the Euclidean and Manhattan distances 2 2 2 2 2 1 1 ) ( ) ( ) ( ), ( p jp i j i j i j i d 2 2 2 2 1 1 ), ( jp ip j i j i j i d h p p h h j i j i j i h j i d ), ( 2 2 1 1

Proimity measure for Numeric Attributes Dissimilarity Matri (with Euclidean Distance) 1 2 3 4 1 0 2 3.61 0 3 2.24 5.1 0 4 4.24 1 5.39 0 Dissimilarity Matri (with Manhattan Distance) point attribute1 attribute2 1 1 2 2 3 5 3 2 0 4 4 5 Data Matri 1 2 3 4 1 0 2 5 0 3 3 6 0 4 6 1 7 0

Proimity measure for Numeric Attributes Minkowski distance: A popular distance measure where i = ( i1, i2,, ip ) and j = ( j1, j2,, jp ) are two p-dimensional data objects, and h is the order (the distance so defined is also called L-h norm) Properties d(i, j) > 0 if i j, and d(i, i) = 0 (Positive definiteness) d(i, j) = d(j, i) (Symmetry) d(i, j) d(i, k) + d(k, j) (Triangle Inequality) A distance that satisfies these properties is a metric

Eamples point attribute 1 attribute 2 1 1 2 2 3 5 3 2 0 4 4 5 Dissimilarity Matrices Manhattan (L 1 ) L 1 2 3 4 1 0 2 5 0 3 3 6 0 4 6 1 7 0 Euclidean (L 2 ) L2 1 2 3 4 1 0 2 3.61 0 3 2.24 5.1 0 4 4.24 1 5.39 0 Supremum L 1 2 3 4 1 0 2 3 0 3 2 5 0 4 3 1 5 0

Proimity measure for Ordinal Attributes An ordinal variable can be discrete or continuous Order is important, e.g., rank Can be treated like interval-scaled replace if by their rank r { 1,, M } if f map the range of each variable onto [0, 1] by replacing the rank (r if ) of the i-th object in the f-th attribute by r 1 if z if M 1 f compute the dissimilarity using methods for interval-scaled attributes

Attributes of Mied Type A database may contain all attribute types Nominal, symmetric binary, asymmetric binary, numeric, ordinal One may use a weighted formula to combine their effects f is binary or nominal: d (f) ij = 0 if if = jf, or d (f) ij = 1 otherwise f is numeric: use the normalized distance f is ordinal Compute ranks r if and Treat z if as interval-scaled

Cosine Similarity A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document. Term-frequent vectors are typically very long and sparse (they have many 0 values) Other vector objects: gene features in micro-arrays, Applications: information retrieval, biologic taonomy, gene feature mapping, Cosine measure: If d 1 and d 2 are two vectors (e.g., term-frequency vectors), then cos(d 1, d 2 ) = (d 1 d 2 ) / d 1 d 2, where indicates vector dot product, d : the length of vector d

Eample: Cosine Similarity cos(d 1, d 2 ) = (d 1 d 2 ) / d 1 d 2, where indicates vector dot product, d : the length of vector d E: Find the similarity between documents 1 and 2. d 1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0) d 2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1) d 1 d 2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*1+2*1+0*0+0*1 = 25 d 1 = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0) 0.5 =(42) 0.5 = 6.481 d 2 = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1) 0.5 =(17) 0.5 = 4.12 cos(d 1, d 2 ) = 0.94

References T. Dasu and T. Johnson. Eploratory Data Mining and Data Cleaning. John Wiley, 2003 L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990. E. R. Tufte. The Visual Display of Quantitative Information, 2 nd ed., Graphics Press, 2001 J. Han et al, Data Mining Concepts and Techniques, 3 rd. Ed. Morgan Kauffman, 2012

Prof. Daniela Barreiro Claro Email: dclaro@ufba.br Our course: www.dcc.ufba.br/~dclaro Course id: MATE04 (2016.1) Semantic Applications and Formalisms Research Group www.formas.ufba.br /formasresearchgroup /formasresearch