Data Exploration Slides by: Shree Jaswal

Size: px
Start display at page:

Download "Data Exploration Slides by: Shree Jaswal"

Transcription

1 Data Exploration Slides by: Shree Jaswal

2 Topics to be covered Types of Attributes; Statistical Description of Data; Data Visualization; Measuring similarity and dissimilarity. Chp2 Slides by: Shree Jaswal 2

3 Which Chapter from which Text Book? Chapter 2: Getting to know your data from Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann 3nd Edition Chapter 2: Data from P. N. Tan, M. Steinbach, Vipin Kumar, Introduction to Data Mining, Pearson Education Chapter 3: Exploring data from P. N. Tan, M. Steinbach, Vipin Kumar, Introduction to Data Mining, Pearson Education Chp2 Slides by: Shree Jaswal 3

4 Course Outcome achieved TEITC604.2: Able to prepare the data needed for data mining algorithms in terms of attributes and class inputs, training, validating, and testing files. Chp2 Slides by: Shree Jaswal 4

5 10 What is Data? Collection of data objects and their attributes An attribute is a property or characteristic of an object o Examples: eye color of a person, temperature, etc. o Attribute is also known as variable, field, characteristic, or feature A collection of attributes describe an object o Object is also known as record, point, case, sample, entity, or instance Objects Attributes T id R e fu n d M a r ita l T a x a b le S ta tu s In c o m e C h e a t 1 Y e s S in g le 125K No 2 No M a rrie d 100K No 3 No S in g le 70K No 4 Y e s M a rrie d 120K No 5 No D iv o rc e d 95K Y e s 6 No M a rrie d 60K No 7 Y e s D iv o rc e d 220K No 8 No S in g le 85K Y e s 9 No M a rrie d 75K No 10 No S in g le 90K Y e s Chp2 Slides by: Shree Jaswal 5

6 Attribute Values Attribute values are numbers or symbols assigned to an attribute Distinction between attributes and attribute values o Same attribute can be mapped to different attribute values Example: height can be measured in feet or meters o Different attributes can be mapped to the same set of values Example: Attribute values for ID and age are integers But properties of attribute values can be different o ID has no limit but age has a maximum and minimum value Chp2 Slides by: Shree Jaswal 6

7 Types of Attributes There are different types of attributes o Nominal Examples: ID numbers, eye color, zip codes o Binary Symmetric and Asymmetric types o Ordinal Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short} o Interval Examples: calendar dates, temperatures in Celsius or Fahrenheit. o Ratio Examples: temperature in Kelvin, length, time, counts Chp2 Slides by: Shree Jaswal 7

8 Properties of Attribute Values The type of an attribute depends on which of the following properties it possesses: o Distinctness: o Order: < > = o Addition: + - o Multiplication: * / o Nominal attribute: distinctness o Ordinal attribute: distinctness & order o Interval attribute: distinctness, order & addition o Ratio attribute: all 4 properties Chp2 Slides by: Shree Jaswal 8

9 Attribute Type Description Examples Operations Nominal Ordinal The values of a nominal attribute are zip codes, just different names, i.e., nominal employee ID attributes provide only enough numbers, eye color, information to distinguish one object sex: {male, female} from another. (=, ) The values of an ordinal hardness of minerals, attribute provide enough {good, better, best}, information to order objects. grades, street numbers (<, >) mode, entropy, contingency correlation, 2 test median, percentiles, rank correlation, run tests, sign tests Interval Ratio For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists.(+, - ) For ratio variables, both differences and ratios are meaningful. (*, /) calendar dates, temperature in Celsius or Fahrenheit temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current mean, standard deviation, Pearson's correlation, t and F tests geometric mean, harmonic mean, percent variation Chp2 Slides by: Shree Jaswal 9

10 Attribute Level Transformation Comments Nominal Any permutation of values If all employee ID numbers were reassigned, would it make any difference? Ordinal Interval An order preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function. new_value =a * old_value + b where a and b are constants An attribute encompassing the notion of good, better best can be represented equally well by the values {1, 2, 3} or by { 0.5, 1, 10}. Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree). Ratio new_value = a * old_value Length can be measured in meters or feet. Chp2 Slides by: Shree Jaswal 10

11 Discrete and Continuous Attributes Discrete Attribute o Has only a finite or countably infinite set of values o Examples: zip codes, counts, or the set of words in a collection of documents o Often represented as integer variables. o Note: binary attributes are a special case of discrete attributes Continuous Attribute o Has real numbers as attribute values o Examples: temperature, height, or weight. o Practically, real values can only be measured and represented using a finite number of digits. o Continuous attributes are typically represented as floatingpoint variables. Chp2 Slides by: Shree Jaswal 11

12 Record o Data Matrix o Document Data o Transaction Data Graph Types of data sets o World Wide Web o Molecular Structures Ordered o Spatial Data o Temporal Data o Sequential Data o Genetic Sequence Data Chp2 Slides by: Shree Jaswal 12

13 Important Characteristics of Structured Data o Dimensionality Curse of Dimensionality o Sparsity Only presence counts o Resolution Patterns depend on the scale Chp2 Slides by: Shree Jaswal 13

14 10 Record Data Data that consists of a collection of records, each of which consists of a fixed set of attributes T id R e fu n d M a r ita l S ta tu s T a x a b le In c o m e C h e a t 1 Y e s S in g le 125K No 2 No M a rrie d 100K No 3 No S in g le 70K No 4 Y e s M a rrie d 120K No 5 No D iv o rc e d 95K Y e s 6 No M a rrie d 60K No 7 Y e s D iv o rc e d 220K No 8 No S in g le 85K Y e s 9 No M a rrie d 75K No 10 No S in g le 90K Y e s Chp2 Slides by: Shree Jaswal 14

15 Transaction Data A special type of record data, where o each record (transaction) involves a set of items. o For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitute a transaction, while the individual products that were purchased are the items. T I D I tem s 1 B r e a d, C o k e, M ilk 2 B e e r, B r e a d 3 B e e r, C o k e, D ia p e r, M ilk 4 B e e r, B r e a d, D ia p e r, M ilk 5 C o k e, D ia p e r, M ilk Chp2 Slides by: Shree Jaswal 15

16 Data Matrix If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute Such data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute Projection of x Load Projection of y load Distance Load Thickness Chp2 Slides by: Shree Jaswal 16

17 Document Data Each document becomes a `term' vector, o each term is a component (attribute) of the vector, o the value of each component is the number of times the corresponding term occurs in the document. season timeout lost wi n game score ball pla y coach team Document Document Document Chp2 Slides by: Shree Jaswal 17

18 Graph Data Examples: Generic graph and HTML Links <a href="papers/papers.html#bbbb"> Data Mining </a> <li> <a href="papers/papers.html#aaaa"> Graph Partitioning </a> <li> <a href="papers/papers.html#aaaa"> Parallel Solution of Sparse Linear System of Equations </a> <li> <a href="papers/papers.html#ffff"> N-Body Computation and Dense Linear System Solvers Chp2 Slides by: Shree Jaswal 18

19 Chemical Data Benzene Molecule: C 6 H 6 Chp2 Slides by: Shree Jaswal 19

20 Ordered Data Sequences of transactions Items/Events An element of the sequence Chp2 Slides by: Shree Jaswal 20

21 Ordered Data Genomic sequence data GGTTCCGCCTTCAGCCCCGCGCC CGCAGGGCCCGCCCCGCGCCGTC GAGAAGGGCCCGCCTGGCGGGCG GGGGGAGGCGGGGCCGCCCGAGC CCAACCGAGTCCGACCAGGTGCC CCCTCTGCTCGGCCTAGACCTGA GCTCATTAGGCGGCAGCGGACAG GCCAAGTAGAACACGCGAAGCGC TGGGCTGCCTGCTGCGACCAGGG Chp2 Slides by: Shree Jaswal 21

22 Spatio-Temporal Data Ordered Data Average Monthly Temperature of land and ocean Chp2 Slides by: Shree Jaswal 22

23 Data Quality What kinds of data quality problems? How can we detect problems with the data? What can we do about these problems? Examples of data quality problems: o Noise and outliers o missing values o duplicate data Chp2 Slides by: Shree Jaswal 23

24 Noise Noise refers to modification of original values o Examples: distortion of a person s voice when talking on a poor phone and snow on television screen Two Sine Waves Two Sine Waves + Noise Chp2 Slides by: Shree Jaswal 24

25 Outliers Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set Chp2 Slides by: Shree Jaswal 25

26 Missing Values Reasons for missing values o Information is not collected (e.g., people decline to give their age and weight) o Attributes may not be applicable to all cases (e.g., annual income is not applicable to children) Handling missing values o Eliminate Data Objects o Estimate Missing Values o Ignore the Missing Value During Analysis o Replace with all possible values (weighted by their probabilities) Chp2 Slides by: Shree Jaswal 26

27 Duplicate Data Data set may include data objects that are duplicates, or almost duplicates of one another o Major issue when merging data from heterogeneous sources Examples: o Same person with multiple addresses Data cleaning o Process of dealing with duplicate data issues Chp2 Slides by: Shree Jaswal 27

28 Mining Data Descriptive Characteristics Motivation o To better understand the data: central tendency, variation and spread Data dispersion characteristics o median, max, min, quantiles, outliers, variance, etc. Numerical dimensions correspond to sorted intervals o o Data dispersion: analyzed with multiple granularities of precision Boxplot or quantile analysis on sorted intervals Dispersion analysis on computed measures o o Folding measures into numerical dimensions Boxplot or quantile analysis on the transformed cube Chp2 Slides by: Shree Jaswal 28

29 Measuring the Central Tendency Mean (algebraic measure) (sample vs. population): o o Weighted arithmetic mean: Trimmed mean: chopping extreme values Median: A holistic measure o o Mode o o o i 1 Middle value if odd number of values, or average of the middle x two values otherwise Estimated by interpolation (for grouped data): median Value that occurs most frequently in the data L 1 x n i 1 n w i w x i 1 n N / 2 ( ( freq i n i 1 x i median freq ) l ) width Unimodal, bimodal, trimodal Empirical formula: mean mode 3 ( mean median ) N x Chp2 Slides by: Shree Jaswal 29

30 Symmetric vs. Skewed Data Median, mean and mode of symmetric, positively and negatively skewed data symmetric positively skewed negatively skewed Chp2 Slides by: Shree Jaswal 30

31 Measuring the Dispersion of Data Quartiles, outliers and boxplots o Quartiles: Q 1 (25 th percentile), Q 3 (75 th percentile) o Inter-quartile range: IQR = Q 3 Q 1 o Five number summary: min, Q 1, M, Q 3, max o Boxplot: ends of the box are the quartiles, median is marked, whiskers, and plot outlier individually o Outlier: usually, a value higher/lower than 1.5 x IQR Variance and standard deviation (sample: s, population: σ) o Variance: (algebraic, scalable computation) o Standard deviation s (or σ) is the square root of variance s 2 ( or σ 2) n i n i i i n i i x n x n x x n s ] ) ( 1 [ 1 1 ) ( 1 1 n i i n i i x N x N ) ( 1 Chp2 Slides by: Shree Jaswal 31

32 Boxplot Analysis Five-number summary of a distribution: Boxplot Minimum, Q1, M, Q3, Maximum o Data is represented with a box o The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR o The median is marked by a line within the box o Whiskers: two lines outside the box extend to Minimum and Maximum Chp2 Slides by: Shree Jaswal 32

33 Visualization Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported. Visualization of data is one of the most powerful and appealing techniques for data exploration. o Humans have a well developed ability to analyze large amounts of information that is presented visually o Can detect general patterns and trends o Can detect outliers and unusual patterns Chp2 Slides by: Shree Jaswal 33

34 Data Visualization and Its Methods Why data visualization? o Gain insight into an information space by mapping data onto graphical primitives o Provide qualitative overview of large data sets o Search for patterns, trends, structure, irregularities, relationships among data o Help find interesting regions and suitable parameters for further quantitative analysis o Provide a visual proof of computer representations derived Typical visualization methods: o Geometric techniques o Icon-based techniques o Hierarchical techniques Chp2 Slides by: Shree Jaswal 34

35 Iris Sample Data Set Many of the exploratory data techniques are illustrated with the Iris Plant data set. o Can be obtained from the UCI Machine Learning Repository o From the statistician Douglas Fisher o Three flower types (classes): Setosa Virginica Versicolour o Four (non-class) attributes Sepal width and length Petal width and length Virginica. Robert H. Mohlenbrock. USDA NRCS Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Chp2 Slides by: Shree Jaswal 35 Science Institute.

36 Attribute Information: 1. sepal length in cm 2. sepal width in cm 3. petal length in cm 4. petal width in cm 5. class: -- Iris Setosa -- Iris Versicolour -- Iris Virginica Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Chp2 Slides by: Shree Jaswal 36

37 Example: Sea Surface Temperature The following shows the Sea Surface Temperature (SST) for July 1982 o Tens of thousands of data points are summarized in a single figure Chp2 Slides by: Shree Jaswal 37

38 Representation Is the mapping of information to a visual format Data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors. Example: o Objects are often represented as points o Their attribute values can be represented as the position of the points or the characteristics of the points, e.g., color, size, and shape o If position is used, then the relationships of points, i.e., whether they form groups or a point is an outlier, is easily perceived. Chp2 Slides by: Shree Jaswal 38

39 Arrangement Is the placement of visual elements within a display Can make a large difference in how easy it is to understand the data Example: Chp2 Slides by: Shree Jaswal 39

40 Selection Is the elimination or the de-emphasis of certain objects and attributes Selection may involve the choosing a subset of attributes o Dimensionality reduction is often used to reduce the number of dimensions to two or three (in case of few dimensions) o Alternatively, pairs of attributes can be considered Selection may also involve choosing a subset of objects o A region of the screen can only show so many points o Can sample, but want to preserve points in sparse areas Chp2 Slides by: Shree Jaswal 40

41 Visualization Techniques: Histograms Histogram o Usually shows the distribution of values of a single variable o Divide the values into bins and show a bar plot of the number of objects in each bin. o The height of each bar indicates the number of objects o Shape of histogram depends on the number of bins Example: Petal Width (10 and 20 bins, respectively) Chp2 Slides by: Shree Jaswal 41

42 Two-Dimensional Histograms Show the joint distribution of the values of two attributes Example: petal width and petal length Chp2 Slides by: Shree Jaswal 42

43 Visualization Techniques: Box Plots Box Plots o Invented by J. Tukey o Another way of displaying the distribution of data o Following figure shows the basic part of a box plot outlier 10 th percentile 75 th percentile 50 th percentile 25 th percentile 10 th percentile Chp2 Slides by: Shree Jaswal 43

44 Example of Box Plots Box plots can be used to compare attributes Chp2 Slides by: Shree Jaswal 44

45 Visualization of Data Dispersion: 3-D Boxplots Chp2 Slides by: Shree Jaswal 45

46 Histogram Analysis Graph displays of basic statistical class descriptions o Frequency histograms A univariate graphical method Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data Chp2 Slides by: Shree Jaswal 46

47 Histograms Often Tells More than Boxplots The two histograms shown in the left may have the same boxplot representation The same values for: min, Q1, median, Q3, max But they have rather different data distributions Chp2 Slides by: Shree Jaswal 47

48 Quantile Plot Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences) Plots quantile information o For a data x i data sorted in increasing order, f i indicates that approximately 100 f i % of the data are below or equal to the value x i Chp2 Slides by: Shree Jaswal 48

49 Quantile-Quantile (Q-Q) Plot Graphs the quantiles of one univariate distribution against the corresponding quantiles of another Allows the user to view whether there is a shift in going from one distribution to another Chp2 Slides by: Shree Jaswal 49

50 Scatter plot Provides a first look at bivariate data to see clusters of points, outliers, etc Each pair of values is treated as a pair of coordinates and plotted as points in the plane Chp2 Slides by: Shree Jaswal 50

51 Visualization Techniques: Scatter Plots Scatter plots o Attributes values determine the position o Two-dimensional scatter plots most common, but can have three-dimensional scatter plots o Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects o It is useful to have arrays of scatter plots can compactly summarize the relationships of several pairs of attributes See example on the next slide Chp2 Slides by: Shree Jaswal 51

52 Scatter Plot Array of Iris Attributes Chp2 Slides by: Shree Jaswal 52

53 Positively and Negatively Correlated Data The left half fragment is positively correlated The right half is negative correlated Chp2 Slides by: Shree Jaswal 53

54 Not Correlated Data Chp2 Slides by: Shree Jaswal 54

55 Visualization Techniques: Contour Plots Contour plots o Useful when a continuous attribute is measured on a spatial grid o They partition the plane into regions of similar values o The contour lines that form the boundaries of these regions connect points with equal values o The most common example is contour maps of elevation o Can also display temperature, rainfall, air pressure, etc. An example for Sea Surface Temperature (SST) is provided on the next slide Chp2 Slides by: Shree Jaswal 55

56 Contour Plot Example: SST Dec, 1998 Celsius Chp2 Slides by: Shree Jaswal 56

57 Visualization Techniques: Matrix Plots Matrix plots o Can plot the data matrix o This can be useful when objects are sorted according to class o Typically, the attributes are normalized to prevent one attribute from dominating the plot o Plots of similarity or distance matrices can also be useful for visualizing the relationships between objects o Examples of matrix plots are presented on the next slide Chp2 Slides by: Shree Jaswal 57

58 Visualization of the Iris Data Matrix standard Chp2 Slides by: Shree Jaswal deviation 58

59 Visualization Techniques: Parallel Coordinates Parallel Coordinates o Used to plot the attribute values of high-dimensional data o Instead of using perpendicular axes, use a set of parallel axes o The attribute values of each object are plotted as a point on each corresponding coordinate axis and the points are connected by a line o Thus, each object is represented as a line o Often, the lines representing a distinct class of objects group together, at least for some attributes o Ordering of attributes is important in seeing such groupings Chp2 Slides by: Shree Jaswal 59

60 Parallel Coordinates Plots for Iris Data Chp2 Slides by: Shree Jaswal 60

61 Other Visualization Techniques Star Plots o Similar approach to parallel coordinates, but axes radiate from a central point o The line connecting the values of an object is a polygon Chernoff Faces o Approach created by Herman Chernoff o This approach associates each attribute with a characteristic of a face o The values of each attribute determine the appearance of the corresponding facial characteristic o Each object becomes a separate face o Relies on human s ability to distinguish faces Chp2 Slides by: Shree Jaswal 61

62 Star Plots for Iris Data Setosa Versicolour Virginica Chp2 Slides by: Shree Jaswal 62

63 Chernoff Faces for Iris Data Setosa Versicolour Virginica Chp2 Slides by: Shree Jaswal 63

64 Stick Figures census data showing age, income, sex, education, etc. Chp2 Slides by: Shree Jaswal 64

65 Hierarchical Techniques Visualization of the data using a hierarchical partitioning into subspaces. Methods o Dimensional Stacking o Worlds-within-Worlds o Treemap o Cone Trees o InfoCube Chp2 Slides by: Shree Jaswal 65

66 Tree-Map Screen-filling method which uses a hierarchical partitioning of the screen into regions depending on the attribute values The x- and y-dimension of the screen are partitioned alternately according to the attribute values (classes) MSR Netscan Image Chp2 Slides by: Shree Jaswal 66

67 Chp2 Slides by: Shree Jaswal 67

68 Chp2 Slides by: Shree Jaswal 68

69 Similarity and Dissimilarity Similarity o Numerical measure of how alike two data objects are o Value is higher when objects are more alike o Often falls in the range [0,1] Dissimilarity (i.e., distance) o Numerical measure of how different are two data objects o Lower when objects are more alike o Minimum dissimilarity is often 0 o Upper limit varies Proximity refers to a similarity or dissimilarity Chp2 Slides by: Shree Jaswal 69

70 Data Matrix and Dissimilarity Matrix Data matrix o n data points with p dimensions o Two modes x x i1... x n x 1f... x if... x nf x 1p... x ip... x np Dissimilarity matrix o n data points, but registers only the distance o A triangular matrix o Single mode 0 d(2,1) d(3,1 ) : d ( n,1 ) d d 0 ( 3,2 ) : ( n,2 ) 0 : Chp2 Slides by: Shree Jaswal 70

71 Nominal Variables A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green Method 1: Simple matching o m: # of matches, p: total # of variables d ( i, j) p m p Method 2: Use a large number of binary variables o creating a new binary variable for each of the M nominal states Chp2 Slides by: Shree Jaswal 71

72 Binary Variables 1 Object j 0 sum A contingency table for binary data Object i 1 0 a c b d a b c d Distance measure for symmetric binary variables: Distance measure for asymmetric binary variables: Jaccard coefficient (similarity measure for asymmetric binary variables): d ( i, j) b c a b c d d ( i, j) sim Jaccard sum a c b d p b c a b c ( i, j) a a b c Chp2 Slides by: Shree Jaswal 72

73 Dissimilarity between Binary Variables Example N a m e G e n d e r F e v e r C o u g h T e s t-1 T e s t-2 T e s t-3 T e s t-4 J a c k M Y N P N N N M a ry F Y N P N P N J im M Y P N N N N o gender is a symmetric attribute o the remaining attributes are asymmetric binary o let the values Y and P be set to 1, and the value N be set to 0 d ( jack, mary )? d ( jack, jim )? d ( jim, mary )? Chp2 Slides by: Shree Jaswal 73

74 d d d ( ( ( jack jack jim,,, mary jim mary ) ) ) Jim and marry are unlikely to have a similar disease whereas Jack and Mary are most likely to have a similar disease Chp2 Slides by: Shree Jaswal 74

75 Euclidean Distance Euclidean Distance dist n ( k 1 p k q k ) 2 Where n is the number of dimensions (attributes) and p k and q k are, respectively, the k th attributes (components) of data objects p and q. Standardization is necessary, if scales differ. Chp2 Slides by: Shree Jaswal 75

76 Euclidean Distance p1 p2 p3 p point x y p1 0 2 p2 2 0 p3 3 1 p4 5 1 p1 p2 p3 p4 p p p p Distance Matrix Chp2 Slides by: Shree Jaswal 76

77 Manhattan Distance It is named so because it is the distance in blocks between any two points in a city A popular distance measure It is defined as: d ( i, j) x i x j x i p p where i = (x i1, x i2,, x ip ) and j = (x j1, x j2,, x jp ) are two p-dimensional data objects x j x i x j Chp2 Slides by: Shree Jaswal 77

78 Minkowski Distance Minkowski distance: A generalization of Euclidean and Manhattan distances d ( i, j) where i = (x i1, x i2,, x ip ) and j = (x j1, x j2,, x jp ) are two p- dimensional data objects, and q is the order Properties q ( x i 1 x j 1 q x i 2 x j 2 q... x i p x j p q ) o o o d(i, j) > 0 if i j, and d(i, i) = 0 (Positive definiteness) d(i, j) = d(j, i) (Symmetry) d(i, j) d(i, k) + d(k, j) (Triangle Inequality) A distance that satisfies these properties is a metric Chp2 Slides by: Shree Jaswal 78

79 Special Cases of Minkowski Distance q = 1: Manhattan (city block, L 1 norm) distance o E.g., the Hamming distance: the number of bits that are different between two binary vectors d ( i, j) x i q= 2: (L 2 norm) Euclidean distance x j x i x j... x i p p x j d ( i, j) ( x i x j 2 x i... x i p p q. supremum (L max norm, L norm) distance. o This is the maximum difference between any component of the vectors Do not confuse q with n, i.e., all these distances are defined for all numbers of dimensions. Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures x j 2 x j 2 ) Chp2 Slides by: Shree Jaswal 79

80 Example: Minkowski Distance point x y p1 0 2 p2 2 0 p3 3 1 p4 5 1 L1 p1 p2 p3 p4 p p p p L2 p1 p2 p3 p4 p p p p L p1 p2 p3 p4 p p p p Distance Matrix Chp2 Slides by: Shree Jaswal 80

81 Common Properties of a Distance Distances, such as the Euclidean distance, have some well known properties. 1. d(p, q) 0 for all p and q and d(p, q) = 0 only if p = q. (Positive definiteness) 2. d(p, q) = d(q, p) for all p and q. (Symmetry) 3. d(p, r) d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality) where d(p, q) is the distance (dissimilarity) between points (data objects), p and q. A distance that satisfies these properties is a metric Chp2 Slides by: Shree Jaswal 81

82 Common Properties of a Similarity Similarities, also have some well known properties. 1. s(p, q) = 1 (or maximum similarity) only if p = q. 2. s(p, q) = s(q, p) for all p and q. (Symmetry) where s(p, q) is the similarity between points (data objects), p and q. Chp2 Slides by: Shree Jaswal 82

83 Ordinal Variables An ordinal variable can be discrete or continuous Order is important, e.g., rank Can be treated like interval-scaled o replace x if by their rank o map the range of each variable onto [0, 1] by replacing i-th object in the f-th variable by z if o compute the dissimilarity using methods for interval-scaled variables r M if f 1 r 1 if {1,..., M f } Chp2 Slides by: Shree Jaswal 83

84 Numeric Variables d ( ij f ) max x if hxhf x if min hxhf Chp2 Slides by: Shree Jaswal 84

85 Variables of Mixed Types A database may contain all the six types of variables o symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio One may use a weighted formula to combine their effects p ( f ) ( f ) d f 1 ij ij d ( i, j) p ( f ) o f is binary or nominal: d ij (f) = 0 if x if = x jf, or d ij (f) = 1 otherwise o f is interval-based: use the normalized distance o f is ordinal or ratio-scaled Compute ranks r if and Treat z if as interval-scaled f 1 ij z if r if M 1 f 1 Chp2 Slides by: Shree Jaswal 86

86 Similarity/Dissimilarity for Simple Attributes p and q are the attribute values for two data objects. Chp2 Slides by: Shree Jaswal 87

87 Vector Objects: Cosine Similarity Vector objects: keywords in documents, gene features in micro-arrays, Applications: information retrieval, biologic taxonomy,... Cosine similarity measure: If d 1 and d 2 are two vectors, then sim(d 1, d 2 ) = (d 1 d 2 ) / d 1 d 2, where indicates vector dot product, d : the length of vector d A cosine value of 0 indicates that two vectors are at 90 degrees to each other(orthogonal) and have no match. The closer the cosine value to 1, the smaller the angle and greater the match between vectors Chp2 Slides by: Shree Jaswal 88

88 Example: d 1 = d 2 = d 1 d 2 = 3*1+2*0+0*0+5*0+0*0+0*0+0*0+2*1+0*0+0*2 = 5 d 1 = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0) 0.5 =(4 2) 0.5 = d 2 = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2) 0.5 =(6 ) 0.5 = sim( d 1, d 2 ) =.3150 Chp2 Slides by: Shree Jaswal 89

CISC 4631 Data Mining

CISC 4631 Data Mining CISC 4631 Data Mining Lecture 02: Data Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) 1 10 What is Data? Collection of data objects and their attributes An attribute

More information

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar 1 Types of data sets Record Tables Document Data Transaction Data Graph World Wide Web Molecular Structures

More information

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 10 What is Data? Collection of data objects

More information

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar 10 What is Data? Collection of data objects and their attributes Attributes An attribute is a property

More information

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 10 What is Data? Collection of data objects

More information

ANÁLISE DOS DADOS. Daniela Barreiro Claro

ANÁLISE DOS DADOS. Daniela Barreiro Claro ANÁLISE DOS DADOS Daniela Barreiro Claro Outline Data types Graphical Analysis Proimity measures Prof. Daniela Barreiro Claro Types of Data Sets Record Ordered Relational records Video data: sequence of

More information

Data preprocessing. DataBase and Data Mining Group 1. Data set types. Tabular Data. Document Data. Transaction Data. Ordered Data

Data preprocessing. DataBase and Data Mining Group 1. Data set types. Tabular Data. Document Data. Transaction Data. Ordered Data Elena Baralis and Tania Cerquitelli Politecnico di Torino Data set types Record Tables Document Data Transaction Data Graph World Wide Web Molecular Structures Ordered Spatial Data Temporal Data Sequential

More information

Chapter I: Introduction & Foundations

Chapter I: Introduction & Foundations Chapter I: Introduction & Foundations } 1.1 Introduction 1.1.1 Definitions & Motivations 1.1.2 Data to be Mined 1.1.3 Knowledge to be discovered 1.1.4 Techniques Utilized 1.1.5 Applications Adapted 1.1.6

More information

Getting To Know Your Data

Getting To Know Your Data Getting To Know Your Data Road Map 1. Data Objects and Attribute Types 2. Descriptive Data Summarization 3. Measuring Data Similarity and Dissimilarity Data Objects and Attribute Types Types of data sets

More information

Descriptive Data Summarization

Descriptive Data Summarization Descriptive Data Summarization Descriptive data summarization gives the general characteristics of the data and identify the presence of noise or outliers, which is useful for successful data cleaning

More information

Data Mining: Data. Lecture Notes for Chapter 2 Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

Data Mining: Data. Lecture Notes for Chapter 2 Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler Data Mining: Data Lecture Notes for Chapter 2 Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler Look for accompanying R code on the course web site. 1 Topics Attributes/Features Types of Data

More information

ECLT 5810 Data Preprocessing. Prof. Wai Lam

ECLT 5810 Data Preprocessing. Prof. Wai Lam ECLT 5810 Data Preprocessing Prof. Wai Lam Why Data Preprocessing? Data in the real world is imperfect incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate

More information

MSCBD 5002/IT5210: Knowledge Discovery and Data Minig

MSCBD 5002/IT5210: Knowledge Discovery and Data Minig MSCBD 5002/IT5210: Knowledge Discovery and Data Minig Instructor: Lei Chen Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by Jiawei Han, Micheline Kamber, and Jian Pei and

More information

Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Data Mining: Concepts and Techniques Chapter 2 1 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser University 2011 Han, Kamber, and Pei. All rights reserved.

More information

Clustering Lecture 1: Basics. Jing Gao SUNY Buffalo

Clustering Lecture 1: Basics. Jing Gao SUNY Buffalo Clustering Lecture 1: Basics Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced topics Clustering

More information

Similarity and Dissimilarity

Similarity and Dissimilarity 1//015 Similarity and Dissimilarity COMP 465 Data Mining Similarity of Data Data Preprocessing Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed.

More information

Type of data Interval-scaled variables: Binary variables: Nominal, ordinal, and ratio variables: Variables of mixed types: Attribute Types Nominal: Pr

Type of data Interval-scaled variables: Binary variables: Nominal, ordinal, and ratio variables: Variables of mixed types: Attribute Types Nominal: Pr Foundation of Data Mining i Topic: Data CMSC 49D/69D CSEE Department, e t, UMBC Some of the slides used in this presentation are prepared by Jiawei Han and Micheline Kamber Data Data types Quality of data

More information

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar Similarity and Dissimilarity Similarity Numerical measure of how alike two data objects are. Is higher

More information

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar (modified by Predrag Radivojac, 2017) 10 What is Data? Collection of data objects and their attributes

More information

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar (modified by Predrag Radivojac, 2018) 10 What is Data? Collection of data objects and their attributes

More information

Data Mining 4. Cluster Analysis

Data Mining 4. Cluster Analysis Data Mining 4. Cluster Analysis 4.2 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Data Structures Interval-Valued (Numeric) Variables Binary Variables Categorical Variables Ordinal Variables Variables

More information

University of Florida CISE department Gator Engineering. Clustering Part 1

University of Florida CISE department Gator Engineering. Clustering Part 1 Clustering Part 1 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville What is Cluster Analysis? Finding groups of objects such that the objects

More information

CS249: ADVANCED DATA MINING

CS249: ADVANCED DATA MINING CS249: ADVANCED DATA MINING Clustering Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu May 2, 2017 Announcements Homework 2 due later today Due May 3 rd (11:59pm) Course project

More information

DATA MINING LECTURE 4. Similarity and Distance Sketching, Locality Sensitive Hashing

DATA MINING LECTURE 4. Similarity and Distance Sketching, Locality Sensitive Hashing DATA MINING LECTURE 4 Similarity and Distance Sketching, Locality Sensitive Hashing SIMILARITY AND DISTANCE Thanks to: Tan, Steinbach, and Kumar, Introduction to Data Mining Rajaraman and Ullman, Mining

More information

Data Mining and Analysis

Data Mining and Analysis 978--5-766- - Data Mining and Analysis: Fundamental Concepts and Algorithms CHAPTER Data Mining and Analysis Data mining is the process of discovering insightful, interesting, and novel patterns, as well

More information

Similarity Numerical measure of how alike two data objects are Value is higher when objects are more alike Often falls in the range [0,1]

Similarity Numerical measure of how alike two data objects are Value is higher when objects are more alike Often falls in the range [0,1] Similarity and Dissimilarity Similarity Numerical measure of how alike two data objects are Value is higher when objects are more alike Often falls in the range [0,1] Dissimilarity (e.g., distance) Numerical

More information

Data Prepara)on. Dino Pedreschi. Anna Monreale. Università di Pisa

Data Prepara)on. Dino Pedreschi. Anna Monreale. Università di Pisa Data Prepara)on Anna Monreale Dino Pedreschi Università di Pisa KDD Process Interpretation and Evaluation Data Consolidation Selection and Preprocessing Warehouse Data Mining Prepared Data p(x)=0.02 Patterns

More information

CSE5243 INTRO. TO DATA MINING

CSE5243 INTRO. TO DATA MINING CSE5243 INTRO. TO DATA MINING Data & Data Preprocessing Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Data & Data Preprocessing What is Data:

More information

proximity similarity dissimilarity distance Proximity Measures:

proximity similarity dissimilarity distance Proximity Measures: Similarity Measures Similarity and dissimilarity are important because they are used by a number of data mining techniques, such as clustering nearest neighbor classification and anomaly detection. The

More information

P8130: Biostatistical Methods I

P8130: Biostatistical Methods I P8130: Biostatistical Methods I Lecture 2: Descriptive Statistics Cody Chiuzan, PhD Department of Biostatistics Mailman School of Public Health (MSPH) Lecture 1: Recap Intro to Biostatistics Types of Data

More information

What is Statistics? Statistics is the science of understanding data and of making decisions in the face of variability and uncertainty.

What is Statistics? Statistics is the science of understanding data and of making decisions in the face of variability and uncertainty. What is Statistics? Statistics is the science of understanding data and of making decisions in the face of variability and uncertainty. Statistics is a field of study concerned with the data collection,

More information

Chapter 3. Data Description

Chapter 3. Data Description Chapter 3. Data Description Graphical Methods Pie chart It is used to display the percentage of the total number of measurements falling into each of the categories of the variable by partition a circle.

More information

Tastitsticsss? What s that? Principles of Biostatistics and Informatics. Variables, outcomes. Tastitsticsss? What s that?

Tastitsticsss? What s that? Principles of Biostatistics and Informatics. Variables, outcomes. Tastitsticsss? What s that? Tastitsticsss? What s that? Statistics describes random mass phanomenons. Principles of Biostatistics and Informatics nd Lecture: Descriptive Statistics 3 th September Dániel VERES Data Collecting (Sampling)

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 2: Vector Data: Prediction Instructor: Yizhou Sun yzsun@cs.ucla.edu October 8, 2018 TA Office Hour Time Change Junheng Hao: Tuesday 1-3pm Yunsheng Bai: Thursday 1-3pm

More information

QUANTITATIVE DATA. UNIVARIATE DATA data for one variable

QUANTITATIVE DATA. UNIVARIATE DATA data for one variable QUANTITATIVE DATA Recall that quantitative (numeric) data values are numbers where data take numerical values for which it is sensible to find averages, such as height, hourly pay, and pulse rates. UNIVARIATE

More information

1. Exploratory Data Analysis

1. Exploratory Data Analysis 1. Exploratory Data Analysis 1.1 Methods of Displaying Data A visual display aids understanding and can highlight features which may be worth exploring more formally. Displays should have impact and be

More information

Assignment 3: Chapter 2 & 3 (2.6, 3.8)

Assignment 3: Chapter 2 & 3 (2.6, 3.8) Neha Aggarwal Comp 578 Data Mining Fall 8 9-12-8 Assignment 3: Chapter 2 & 3 (2.6, 3.8) 2.6 Q.18 This exercise compares and contrasts some similarity and distance measures. (a) For binary data, the L1

More information

Math Literacy. Curriculum (457 topics)

Math Literacy. Curriculum (457 topics) Math Literacy This course covers the topics shown below. Students navigate learning paths based on their level of readiness. Institutional users may customize the scope and sequence to meet curricular

More information

Lecture Slides. Elementary Statistics Tenth Edition. by Mario F. Triola. and the Triola Statistics Series. Slide 1

Lecture Slides. Elementary Statistics Tenth Edition. by Mario F. Triola. and the Triola Statistics Series. Slide 1 Lecture Slides Elementary Statistics Tenth Edition and the Triola Statistics Series by Mario F. Triola Slide 1 Chapter 3 Statistics for Describing, Exploring, and Comparing Data 3-1 Overview 3-2 Measures

More information

Chapter 2: Tools for Exploring Univariate Data

Chapter 2: Tools for Exploring Univariate Data Stats 11 (Fall 2004) Lecture Note Introduction to Statistical Methods for Business and Economics Instructor: Hongquan Xu Chapter 2: Tools for Exploring Univariate Data Section 2.1: Introduction What is

More information

1-1. Chapter 1. Sampling and Descriptive Statistics by The McGraw-Hill Companies, Inc. All rights reserved.

1-1. Chapter 1. Sampling and Descriptive Statistics by The McGraw-Hill Companies, Inc. All rights reserved. 1-1 Chapter 1 Sampling and Descriptive Statistics 1-2 Why Statistics? Deal with uncertainty in repeated scientific measurements Draw conclusions from data Design valid experiments and draw reliable conclusions

More information

Chapter 1. Looking at Data

Chapter 1. Looking at Data Chapter 1 Looking at Data Types of variables Looking at Data Be sure that each variable really does measure what you want it to. A poor choice of variables can lead to misleading conclusions!! For example,

More information

CIVL 7012/8012. Collection and Analysis of Information

CIVL 7012/8012. Collection and Analysis of Information CIVL 7012/8012 Collection and Analysis of Information Uncertainty in Engineering Statistics deals with the collection and analysis of data to solve real-world problems. Uncertainty is inherent in all real

More information

1 Measures of the Center of a Distribution

1 Measures of the Center of a Distribution 1 Measures of the Center of a Distribution Qualitative descriptions of the shape of a distribution are important and useful. But we will often desire the precision of numerical summaries as well. Two aspects

More information

Statistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1 Week 2 Based in part on slides from textbook, slides of Susan Holmes October 3, 2012 1 / 1 Part I Other datatypes, preprocessing 2 / 1 Other datatypes Document data You might start with a collection of

More information

Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes

Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes Week 2 Based in part on slides from textbook, slides of Susan Holmes Part I Other datatypes, preprocessing October 3, 2012 1 / 1 2 / 1 Other datatypes Other datatypes Document data You might start with

More information

Topic 3: Introduction to Statistics. Algebra 1. Collecting Data. Table of Contents. Categorical or Quantitative? What is the Study of Statistics?!

Topic 3: Introduction to Statistics. Algebra 1. Collecting Data. Table of Contents. Categorical or Quantitative? What is the Study of Statistics?! Topic 3: Introduction to Statistics Collecting Data We collect data through observation, surveys and experiments. We can collect two different types of data: Categorical Quantitative Algebra 1 Table of

More information

F78SC2 Notes 2 RJRC. If the interest rate is 5%, we substitute x = 0.05 in the formula. This gives

F78SC2 Notes 2 RJRC. If the interest rate is 5%, we substitute x = 0.05 in the formula. This gives F78SC2 Notes 2 RJRC Algebra It is useful to use letters to represent numbers. We can use the rules of arithmetic to manipulate the formula and just substitute in the numbers at the end. Example: 100 invested

More information

Algebra 2. Curriculum (524 topics additional topics)

Algebra 2. Curriculum (524 topics additional topics) Algebra 2 This course covers the topics shown below. Students navigate learning paths based on their level of readiness. Institutional users may customize the scope and sequence to meet curricular needs.

More information

BIOL 51A - Biostatistics 1 1. Lecture 1: Intro to Biostatistics. Smoking: hazardous? FEV (l) Smoke

BIOL 51A - Biostatistics 1 1. Lecture 1: Intro to Biostatistics. Smoking: hazardous? FEV (l) Smoke BIOL 51A - Biostatistics 1 1 Lecture 1: Intro to Biostatistics Smoking: hazardous? FEV (l) 1 2 3 4 5 No Yes Smoke BIOL 51A - Biostatistics 1 2 Box Plot a.k.a box-and-whisker diagram or candlestick chart

More information

Psych Jan. 5, 2005

Psych Jan. 5, 2005 Psych 124 1 Wee 1: Introductory Notes on Variables and Probability Distributions (1/5/05) (Reading: Aron & Aron, Chaps. 1, 14, and this Handout.) All handouts are available outside Mija s office. Lecture

More information

Ø Set of mutually exclusive categories. Ø Classify or categorize subject. Ø No meaningful order to categorization.

Ø Set of mutually exclusive categories. Ø Classify or categorize subject. Ø No meaningful order to categorization. Statistical Tools in Evaluation HPS 41 Dr. Joe G. Schmalfeldt Types of Scores Continuous Scores scores with a potentially infinite number of values. Discrete Scores scores limited to a specific number

More information

Foundations of Algebra/Algebra/Math I Curriculum Map

Foundations of Algebra/Algebra/Math I Curriculum Map *Standards N-Q.1, N-Q.2, N-Q.3 are not listed. These standards represent number sense and should be integrated throughout the units. *For each specific unit, learning targets are coded as F for Foundations

More information

Lecture Slides. Elementary Statistics Twelfth Edition. by Mario F. Triola. and the Triola Statistics Series. Section 3.1- #

Lecture Slides. Elementary Statistics Twelfth Edition. by Mario F. Triola. and the Triola Statistics Series. Section 3.1- # Lecture Slides Elementary Statistics Twelfth Edition and the Triola Statistics Series by Mario F. Triola Chapter 3 Statistics for Describing, Exploring, and Comparing Data 3-1 Review and Preview 3-2 Measures

More information

Summarizing and Displaying Measurement Data/Understanding and Comparing Distributions

Summarizing and Displaying Measurement Data/Understanding and Comparing Distributions Summarizing and Displaying Measurement Data/Understanding and Comparing Distributions Histograms, Mean, Median, Five-Number Summary and Boxplots, Standard Deviation Thought Questions 1. If you were to

More information

Introduction to Basic Statistics Version 2

Introduction to Basic Statistics Version 2 Introduction to Basic Statistics Version 2 Pat Hammett, Ph.D. University of Michigan 2014 Instructor Comments: This document contains a brief overview of basic statistics and core terminology/concepts

More information

Anomaly Detection. Jing Gao. SUNY Buffalo

Anomaly Detection. Jing Gao. SUNY Buffalo Anomaly Detection Jing Gao SUNY Buffalo 1 Anomaly Detection Anomalies the set of objects are considerably dissimilar from the remainder of the data occur relatively infrequently when they do occur, their

More information

Distances and similarities Based in part on slides from textbook, slides of Susan Holmes. October 3, Statistics 202: Data Mining

Distances and similarities Based in part on slides from textbook, slides of Susan Holmes. October 3, Statistics 202: Data Mining Distances and similarities Based in part on slides from textbook, slides of Susan Holmes October 3, 2012 1 / 1 Similarities Start with X which we assume is centered and standardized. The PCA loadings were

More information

Bemidji Area Schools Outcomes in Mathematics Algebra 2 Applications. Based on Minnesota Academic Standards in Mathematics (2007) Page 1 of 7

Bemidji Area Schools Outcomes in Mathematics Algebra 2 Applications. Based on Minnesota Academic Standards in Mathematics (2007) Page 1 of 7 9.2.1.1 Understand the definition of a function. Use functional notation and evaluate a function at a given point in its domain. For example: If f x 1, find f(-4). x2 3 Understand the concept of function,

More information

Ohio Department of Education Academic Content Standards Mathematics Detailed Checklist ~Grade 9~

Ohio Department of Education Academic Content Standards Mathematics Detailed Checklist ~Grade 9~ Ohio Department of Education Academic Content Standards Mathematics Detailed Checklist ~Grade 9~ Number, Number Sense and Operations Standard Students demonstrate number sense, including an understanding

More information

CS570 Introduction to Data Mining

CS570 Introduction to Data Mining CS570 Introduction to Data Mining Department of Mathematics and Computer Science Li Xiong Data Exploration and Data Preprocessing Data and Attributes Data exploration Data pre-processing Data cleaning

More information

NEW YORK ALGEBRA TABLE OF CONTENTS

NEW YORK ALGEBRA TABLE OF CONTENTS NEW YORK ALGEBRA TABLE OF CONTENTS CHAPTER 1 NUMBER SENSE & OPERATIONS TOPIC A: Number Theory: Properties of Real Numbers {A.N.1} PART 1: Closure...1 PART 2: Commutative Property...2 PART 3: Associative

More information

Algebra Topic Alignment

Algebra Topic Alignment Preliminary Topics Absolute Value 9N2 Compare, order and determine equivalent forms for rational and irrational numbers. Factoring Numbers 9N4 Demonstrate fluency in computations using real numbers. Fractions

More information

Descriptive Univariate Statistics and Bivariate Correlation

Descriptive Univariate Statistics and Bivariate Correlation ESC 100 Exploring Engineering Descriptive Univariate Statistics and Bivariate Correlation Instructor: Sudhir Khetan, Ph.D. Wednesday/Friday, October 17/19, 2012 The Central Dogma of Statistics used to

More information

Measurement and Data. Topics: Types of Data Distance Measurement Data Transformation Forms of Data Data Quality

Measurement and Data. Topics: Types of Data Distance Measurement Data Transformation Forms of Data Data Quality Measurement and Data Topics: Types of Data Distance Measurement Data Transformation Forms of Data Data Quality Importance of Measurement Aim of mining structured data is to discover relationships that

More information

Middle Grades Mathematics (203)

Middle Grades Mathematics (203) Middle Grades Mathematics (203) NES, the NES logo, Pearson, the Pearson logo, and National Evaluation Series are trademarks, in the U.S. and/or other countries, of Pearson Education, Inc. or its affiliate(s).

More information

Data Mining and Analysis: Fundamental Concepts and Algorithms

Data Mining and Analysis: Fundamental Concepts and Algorithms Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA Department

More information

Pre Algebra. Curriculum (634 topics)

Pre Algebra. Curriculum (634 topics) Pre Algebra This course covers the topics shown below. Students navigate learning paths based on their level of readiness. Institutional users may customize the scope and sequence to meet curricular needs.

More information

Histogram, cumulative frequency, frequency, 676 Horizontal number line, 6 Hypotenuse, 263, 301, 307

Histogram, cumulative frequency, frequency, 676 Horizontal number line, 6 Hypotenuse, 263, 301, 307 INDEX A Abscissa, 76 Absolute value, 6 7, 55 Absolute value function, 382 386 transformations of, reflection, 386 scaling, 386 translation, 385 386 Accuracy, 31 Acute angle, 249 Acute triangle, 263 Addition,

More information

Σ x i. Sigma Notation

Σ x i. Sigma Notation Sigma Notation The mathematical notation that is used most often in the formulation of statistics is the summation notation The uppercase Greek letter Σ (sigma) is used as shorthand, as a way to indicate

More information

Lecture 1: Descriptive Statistics

Lecture 1: Descriptive Statistics Lecture 1: Descriptive Statistics MSU-STT-351-Sum 15 (P. Vellaisamy: MSU-STT-351-Sum 15) Probability & Statistics for Engineers 1 / 56 Contents 1 Introduction 2 Branches of Statistics Descriptive Statistics

More information

Introduction to Statistics

Introduction to Statistics Introduction to Statistics Data and Statistics Data consists of information coming from observations, counts, measurements, or responses. Statistics is the science of collecting, organizing, analyzing,

More information

Sets and Set notation. Algebra 2 Unit 8 Notes

Sets and Set notation. Algebra 2 Unit 8 Notes Sets and Set notation Section 11-2 Probability Experimental Probability experimental probability of an event: Theoretical Probability number of time the event occurs P(event) = number of trials Sample

More information

3.1 Measure of Center

3.1 Measure of Center 3.1 Measure of Center Calculate the mean for a given data set Find the median, and describe why the median is sometimes preferable to the mean Find the mode of a data set Describe how skewness affects

More information

College Mathematics

College Mathematics Wisconsin Indianhead Technical College 10804107 College Mathematics Course Outcome Summary Course Information Description Instructional Level Total Credits 3.00 Total Hours 48.00 This course is designed

More information

Keystone Exams: Algebra

Keystone Exams: Algebra KeystoneExams:Algebra TheKeystoneGlossaryincludestermsanddefinitionsassociatedwiththeKeystoneAssessmentAnchorsand Eligible Content. The terms and definitions included in the glossary are intended to assist

More information

Types of Information. Topic 2 - Descriptive Statistics. Examples. Sample and Sample Size. Background Reading. Variables classified as STAT 511

Types of Information. Topic 2 - Descriptive Statistics. Examples. Sample and Sample Size. Background Reading. Variables classified as STAT 511 Topic 2 - Descriptive Statistics STAT 511 Professor Bruce Craig Types of Information Variables classified as Categorical (qualitative) - variable classifies individual into one of several groups or categories

More information

Glossary for the Triola Statistics Series

Glossary for the Triola Statistics Series Glossary for the Triola Statistics Series Absolute deviation The measure of variation equal to the sum of the deviations of each value from the mean, divided by the number of values Acceptance sampling

More information

1. Introduction to Multivariate Analysis

1. Introduction to Multivariate Analysis 1. Introduction to Multivariate Analysis Isabel M. Rodrigues 1 / 44 1.1 Overview of multivariate methods and main objectives. WHY MULTIVARIATE ANALYSIS? Multivariate statistical analysis is concerned with

More information

BNG 495 Capstone Design. Descriptive Statistics

BNG 495 Capstone Design. Descriptive Statistics BNG 495 Capstone Design Descriptive Statistics Overview The overall goal of this short course in statistics is to provide an introduction to descriptive and inferential statistical methods, with a focus

More information

are the objects described by a set of data. They may be people, animals or things.

are the objects described by a set of data. They may be people, animals or things. ( c ) E p s t e i n, C a r t e r a n d B o l l i n g e r 2016 C h a p t e r 5 : E x p l o r i n g D a t a : D i s t r i b u t i o n s P a g e 1 CHAPTER 5: EXPLORING DATA DISTRIBUTIONS 5.1 Creating Histograms

More information

Grade 6 - SBA Claim 1 Example Stems

Grade 6 - SBA Claim 1 Example Stems Grade 6 - SBA Claim 1 Example Stems This document takes publicly available information about the Smarter Balanced Assessment (SBA) in Mathematics, namely the Claim 1 Item Specifications, and combines and

More information

Bemidji Area Schools Outcomes in Mathematics Algebra 2A. Based on Minnesota Academic Standards in Mathematics (2007) Page 1 of 6

Bemidji Area Schools Outcomes in Mathematics Algebra 2A. Based on Minnesota Academic Standards in Mathematics (2007) Page 1 of 6 9.2.1.1 Understand the definition of a function. Use functional notation and evaluate a function at a given point in its domain. For example: If f x 1, find f(-4). x2 3 10, Algebra Understand the concept

More information

CHAPTER 1. Introduction

CHAPTER 1. Introduction CHAPTER 1 Introduction Engineers and scientists are constantly exposed to collections of facts, or data. The discipline of statistics provides methods for organizing and summarizing data, and for drawing

More information

After completing this chapter, you should be able to:

After completing this chapter, you should be able to: Chapter 2 Descriptive Statistics Chapter Goals After completing this chapter, you should be able to: Compute and interpret the mean, median, and mode for a set of data Find the range, variance, standard

More information

Foundations of High School Math

Foundations of High School Math Foundations of High School Math This course covers the topics shown below. Students navigate learning paths based on their level of readiness. Institutional users may customize the scope and sequence to

More information

Elementary Statistics

Elementary Statistics Elementary Statistics Q: What is data? Q: What does the data look like? Q: What conclusions can we draw from the data? Q: Where is the middle of the data? Q: Why is the spread of the data important? Q:

More information

BENCHMARKS GRADE LEVEL INDICATORS STRATEGIES/RESOURCES

BENCHMARKS GRADE LEVEL INDICATORS STRATEGIES/RESOURCES GRADE OHIO ACADEMIC CONTENT STANDARDS MATHEMATICS CURRICULUM GUIDE Ninth Grade Number, Number Sense and Operations Standard Students demonstrate number sense, including an understanding of number systems

More information

ST Presenting & Summarising Data Descriptive Statistics. Frequency Distribution, Histogram & Bar Chart

ST Presenting & Summarising Data Descriptive Statistics. Frequency Distribution, Histogram & Bar Chart ST2001 2. Presenting & Summarising Data Descriptive Statistics Frequency Distribution, Histogram & Bar Chart Summary of Previous Lecture u A study often involves taking a sample from a population that

More information

Histograms allow a visual interpretation

Histograms allow a visual interpretation Chapter 4: Displaying and Summarizing i Quantitative Data s allow a visual interpretation of quantitative (numerical) data by indicating the number of data points that lie within a range of values, called

More information

Units. Exploratory Data Analysis. Variables. Student Data

Units. Exploratory Data Analysis. Variables. Student Data Units Exploratory Data Analysis Bret Larget Departments of Botany and of Statistics University of Wisconsin Madison Statistics 371 13th September 2005 A unit is an object that can be measured, such as

More information

Introductory Mathematics

Introductory Mathematics Introductory Mathematics 1998 2003 1.01 Identify subsets of the real number system. 1.02 Estimate and compute with rational Grade 7: 1.02 numbers. 1.03 Compare, order, and convert among Grade 6: 1.03 fractions,

More information

2/2/2015 GEOGRAPHY 204: STATISTICAL PROBLEM SOLVING IN GEOGRAPHY MEASURES OF CENTRAL TENDENCY CHAPTER 3: DESCRIPTIVE STATISTICS AND GRAPHICS

2/2/2015 GEOGRAPHY 204: STATISTICAL PROBLEM SOLVING IN GEOGRAPHY MEASURES OF CENTRAL TENDENCY CHAPTER 3: DESCRIPTIVE STATISTICS AND GRAPHICS Spring 2015: Lembo GEOGRAPHY 204: STATISTICAL PROBLEM SOLVING IN GEOGRAPHY CHAPTER 3: DESCRIPTIVE STATISTICS AND GRAPHICS Descriptive statistics concise and easily understood summary of data set characteristics

More information

CHAPTER 5: EXPLORING DATA DISTRIBUTIONS. Individuals are the objects described by a set of data. These individuals may be people, animals or things.

CHAPTER 5: EXPLORING DATA DISTRIBUTIONS. Individuals are the objects described by a set of data. These individuals may be people, animals or things. (c) Epstein 2013 Chapter 5: Exploring Data Distributions Page 1 CHAPTER 5: EXPLORING DATA DISTRIBUTIONS 5.1 Creating Histograms Individuals are the objects described by a set of data. These individuals

More information

Admin. Assignment 1 is out (due next Friday, start early). Tutorials start Monday: Office hours: Sign up for the course page on Piazza.

Admin. Assignment 1 is out (due next Friday, start early). Tutorials start Monday: Office hours: Sign up for the course page on Piazza. Admin Assignment 1 is out (due next Friday, start early). Tutorials start Monday: 11am, 2pm, and 4pm in DMP 201. New tutorial section: 5pm in DMP 101. Make sure you sign up for one. No requirement to attend,

More information

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 03

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 03 Meelis Kull meelis.kull@ut.ee Autumn 2017 1 Demo: Data science mini-project CRISP-DM: cross-industrial standard process for data mining Data understanding: Types of data Data understanding: First look

More information

Module 1. Identify parts of an expression using vocabulary such as term, equation, inequality

Module 1. Identify parts of an expression using vocabulary such as term, equation, inequality Common Core Standards Major Topic Key Skills Chapters Key Vocabulary Essential Questions Module 1 Pre- Requisites Skills: Students need to know how to add, subtract, multiply and divide. Students need

More information

ROSLYN PUBLIC SCHOOLS INTEGRATED ALGEBRA CURRICULUM. Day(s) Topic Textbook Workbook Additional Worksheets

ROSLYN PUBLIC SCHOOLS INTEGRATED ALGEBRA CURRICULUM. Day(s) Topic Textbook Workbook Additional Worksheets I. Review 1 and 2 Review p3-19 1. Screening Test 2. Review sheet II. Real Numbers A.N.1 Identify and apply the properties of real numbers (closure, commutative, associative, distributive, identity, inverse)

More information

Pre Algebra and Introductory Algebra

Pre Algebra and Introductory Algebra Pre Algebra and Introductory Algebra This course covers the topics outlined below and is available for use with integrated, interactive ebooks. You can customize the scope and sequence of this course to

More information

STP 420 INTRODUCTION TO APPLIED STATISTICS NOTES

STP 420 INTRODUCTION TO APPLIED STATISTICS NOTES INTRODUCTION TO APPLIED STATISTICS NOTES PART - DATA CHAPTER LOOKING AT DATA - DISTRIBUTIONS Individuals objects described by a set of data (people, animals, things) - all the data for one individual make

More information