
978--5-766- - Data Mining and Analysis: Fundamental Concepts and Algorithms

CHAPTER 1: Data Mining and Analysis

Data mining is the process of discovering insightful, interesting, and novel patterns, as well as descriptive, understandable, and predictive models from large-scale data. We begin this chapter by looking at basic properties of data modeled as a data matrix. We emphasize the geometric and algebraic views, as well as the probabilistic interpretation of data. We then discuss the main data mining tasks, which span exploratory data analysis, frequent pattern mining, clustering, and classification, laying out the roadmap for the book.

1.1 DATA MATRIX

Data can often be represented or abstracted as an n × d data matrix, with n rows and d columns, where rows correspond to entities in the dataset, and columns represent attributes or properties of interest. Each row in the data matrix records the observed attribute values for a given entity. The n × d data matrix is given as

           X_1      X_2     ...   X_d
    x_1  ( x_{11}   x_{12}  ...   x_{1d} )
D = x_2  ( x_{21}   x_{22}  ...   x_{2d} )
    ...  (  ...      ...    ...    ...   )
    x_n  ( x_{n1}   x_{n2}  ...   x_{nd} )

where x_i denotes the ith row, which is a d-tuple given as x_i = (x_{i1}, x_{i2}, ..., x_{id}), and X_j denotes the jth column, which is an n-tuple given as X_j = (x_{1j}, x_{2j}, ..., x_{nj}). Depending on the application domain, rows may also be referred to as entities, instances, examples, records, transactions, objects, points, feature-vectors, tuples, and so on. Likewise, columns may also be called attributes, properties, features, dimensions, variables, fields, and so on. The number of instances n is referred to as the size of

the data, whereas the number of attributes d is called the dimensionality of the data. The analysis of a single attribute is referred to as univariate analysis, whereas the simultaneous analysis of two attributes is called bivariate analysis, and the simultaneous analysis of more than two attributes is called multivariate analysis.

Table 1.1. Extract from the Iris dataset

         Sepal    Sepal    Petal    Petal
         length   width    length   width   Class
         X_1      X_2      X_3      X_4     X_5
  x_1    5.9      3.0      4.2      1.5     Iris-versicolor
  x_2    6.9      3.1      4.9      1.5     Iris-versicolor
  x_3    6.6      2.9      4.6      1.3     Iris-versicolor
  x_4    4.6      3.2      1.4      0.2     Iris-setosa
  x_5    6.0      2.2      4.0      1.0     Iris-versicolor
  x_6    4.7      3.2      1.3      0.2     Iris-setosa
  x_7    6.5      3.0      5.8      2.2     Iris-virginica
  x_8    5.8      2.7      5.1      1.9     Iris-virginica
  ...
  x_149  7.7      3.8      6.7      2.2     Iris-virginica
  x_150  5.1      3.4      1.5      0.2     Iris-setosa

Example 1.1. Table 1.1 shows an extract of the Iris dataset; the complete data forms a 150 × 5 data matrix. Each entity is an Iris flower, and the attributes include sepal length, sepal width, petal length, and petal width in centimeters, and the type or class of the Iris flower. The first row is given as the 5-tuple

x_1 = (5.9, 3.0, 4.2, 1.5, Iris-versicolor)

Not all datasets are in the form of a data matrix. For instance, more complex datasets can be in the form of sequences (e.g., DNA and protein sequences), text, time-series, images, audio, video, and so on, which may need special techniques for analysis. However, in many cases even if the raw data is not a data matrix it can usually be transformed into that form via feature extraction. For example, given a database of images, we can create a data matrix in which rows represent images and columns correspond to image features such as color, texture, and so on. Sometimes, certain attributes may have special semantics associated with them requiring special treatment. For instance, temporal or spatial attributes are often treated differently. It is also worth noting that traditional data analysis assumes that each entity or instance is independent. However, given the interconnected nature of the world we live in, this assumption may not always hold. Instances may be connected to other instances via various kinds of relationships, giving rise to a data graph, where a node represents an entity and an edge represents the relationship between two entities.
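As an illustrative aside (not part of the original text), the numeric attributes of the first eight rows of the Iris extract can be stored as a NumPy data matrix, with rows as instances and columns as attributes; the class column is kept out since a data matrix here is numeric.

```python
import numpy as np

# First eight rows of the Iris extract: sepal length/width, petal length/width.
D = np.array([
    [5.9, 3.0, 4.2, 1.5],
    [6.9, 3.1, 4.9, 1.5],
    [6.6, 2.9, 4.6, 1.3],
    [4.6, 3.2, 1.4, 0.2],
    [6.0, 2.2, 4.0, 1.0],
    [4.7, 3.2, 1.3, 0.2],
    [6.5, 3.0, 5.8, 2.2],
    [5.8, 2.7, 5.1, 1.9],
])

n, d = D.shape      # n instances (rows), d attributes (columns)
x1 = D[0]           # row x_1: attribute values of the first flower
X1 = D[:, 0]        # column X_1: sepal length across all instances

print(n, d)         # 8 4
print(x1)           # [5.9 3.  4.2 1.5]
```

Slicing a row gives one instance's d-tuple, slicing a column gives one attribute's n-tuple, mirroring the x_i / X_j notation.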

1.2 ATTRIBUTES

Attributes may be classified into two main types depending on their domain, that is, depending on the types of values they take on.

Numeric Attributes
A numeric attribute is one that has a real-valued or integer-valued domain. For example, Age with domain(Age) = N, where N denotes the set of natural numbers (non-negative integers), is numeric, and so is petal length in Table 1.1, with domain(petal length) = R+ (the set of all positive real numbers). Numeric attributes that take on a finite or countably infinite set of values are called discrete, whereas those that can take on any real value are called continuous. As a special case of discrete, if an attribute has as its domain the set {0, 1}, it is called a binary attribute. Numeric attributes can be classified further into two types:

Interval-scaled: For these kinds of attributes only differences (addition or subtraction) make sense. For example, attribute temperature measured in °C or °F is interval-scaled. If it is 20 °C on one day and 10 °C on the following day, it is meaningful to talk about a temperature drop of 10 °C, but it is not meaningful to say that it is twice as cold as the previous day.

Ratio-scaled: Here one can compute both differences as well as ratios between values. For example, for attribute Age, we can say that someone who is 20 years old is twice as old as someone who is 10 years old.

Categorical Attributes
A categorical attribute is one that has a set-valued domain composed of a set of symbols. For example, Sex and Education could be categorical attributes with their domains given as

domain(Sex) = {M, F}
domain(Education) = {HighSchool, BS, MS, PhD}

Categorical attributes may be of two types:

Nominal: The attribute values in the domain are unordered, and thus only equality comparisons are meaningful. That is, we can check only whether the value of the attribute for two given instances is the same or not. For example, Sex is a nominal attribute. Also class in Table 1.1 is a nominal attribute with domain(class) = {Iris-setosa, Iris-versicolor, Iris-virginica}.

Ordinal: The attribute values are ordered, and thus both equality comparisons (is one value equal to another?) and inequality comparisons (is one value less than or greater than another?) are allowed, though it may not be possible to quantify the difference between values. For example, Education is an ordinal attribute because its domain values are ordered by increasing educational qualification.
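The nominal/ordinal distinction can be made concrete in code; the following is an illustrative sketch (not from the text), with Education levels mirroring the domain above.

```python
# Nominal: unordered symbols; only equality tests are meaningful.
sex_a, sex_b = "M", "F"
same_sex = (sex_a == sex_b)   # False; asking whether "M" < "F" would be meaningless

# Ordinal: ordered symbols; inequality tests are meaningful, but the "gap"
# between adjacent levels is not quantified.
EDUCATION = ["HighSchool", "BS", "MS", "PhD"]  # ordered by increasing qualification

def edu_rank(level):
    """Map an Education level to its position in the ordering."""
    return EDUCATION.index(level)

print(same_sex)                          # False
print(edu_rank("MS") > edu_rank("BS"))   # True
```

Note that edu_rank encodes only the order; the numeric difference between ranks (e.g., PhD minus MS) carries no meaning, which is exactly the ordinal restriction.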

1.3 DATA: ALGEBRAIC AND GEOMETRIC VIEW

If the d attributes or dimensions in the data matrix D are all numeric, then each row can be considered as a d-dimensional point,

x_i = (x_{i1}, x_{i2}, ..., x_{id}) ∈ R^d

or equivalently, each row may be considered as a d-dimensional column vector (all vectors are assumed to be column vectors by default):

x_i = (x_{i1}, x_{i2}, ..., x_{id})^T ∈ R^d

where T is the matrix transpose operator.

The d-dimensional Cartesian coordinate space is specified via the d unit vectors, called the standard basis vectors, along each of the axes. The jth standard basis vector e_j is the d-dimensional unit vector whose jth component is 1 and whose remaining components are 0:

e_j = (0, ..., 1_j, ..., 0)^T

Any other vector in R^d can be written as a linear combination of the standard basis vectors. For example, each of the points x_i can be written as the linear combination

x_i = x_{i1} e_1 + x_{i2} e_2 + ... + x_{id} e_d = Σ_{j=1}^{d} x_{ij} e_j

where the scalar value x_{ij} is the coordinate value along the jth axis or attribute.

Example 1.2. Consider the Iris data in Table 1.1. If we project the entire data onto the first two attributes, then each row can be considered as a point or a vector in 2-dimensional space. For example, the projection of the 5-tuple x_1 = (5.9, 3.0, 4.2, 1.5, Iris-versicolor) onto the first two attributes is shown in Figure 1.1a. Figure 1.2 shows the scatterplot of all the n = 150 points in the 2-dimensional space spanned by the first two attributes. Likewise, Figure 1.1b shows x_1 as a point and vector in 3-dimensional space, obtained by projecting the data onto the first three attributes. The point (5.9, 3.0, 4.2) can be seen as specifying the coefficients in the linear combination of the standard basis vectors in R^3:

x_1 = 5.9 e_1 + 3.0 e_2 + 4.2 e_3 = 5.9 (1, 0, 0)^T + 3.0 (0, 1, 0)^T + 4.2 (0, 0, 1)^T = (5.9, 3.0, 4.2)^T

Figure 1.1. Row x_1 as a point and vector in (a) R^2 and (b) R^3. [Plots omitted: (a) shows x_1 = (5.9, 3.0) in the X_1–X_2 plane; (b) shows x_1 = (5.9, 3.0, 4.2) in X_1–X_2–X_3 space.]

Figure 1.2. Scatterplot: sepal length (X_1) versus sepal width (X_2). The solid circle shows the mean point. [Plot omitted.]

Each numeric column or attribute can also be treated as a vector in an n-dimensional space R^n:

X_j = (x_{1j}, x_{2j}, ..., x_{nj})^T
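The linear-combination view of a point can be checked numerically; the following is a small NumPy sketch (not from the text), using the first three coordinates of row x_1 from the Iris extract.

```python
import numpy as np

# The rows of the 3x3 identity matrix are the standard basis vectors of R^3.
e1, e2, e3 = np.eye(3)

# x_1 projected onto the first three attributes, written as a linear
# combination of the basis vectors: coefficients are the coordinates.
x1 = 5.9 * e1 + 3.0 * e2 + 4.2 * e3

print(x1)   # components (5.9, 3.0, 4.2)
```

The coefficients of the combination are exactly the coordinate values along each axis, as the text states.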

If all attributes are numeric, then the data matrix D is in fact an n × d matrix, also written as D ∈ R^{n×d}, given as

      ( x_{11}  x_{12}  ...  x_{1d} )   ( x_1^T )
D  =  ( x_{21}  x_{22}  ...  x_{2d} ) = ( x_2^T ) = ( X_1  X_2  ...  X_d )
      (  ...     ...    ...   ...   )   (  ...  )
      ( x_{n1}  x_{n2}  ...  x_{nd} )   ( x_n^T )

As we can see, we can consider the entire dataset as an n × d matrix, or equivalently as a set of n row vectors x_i^T ∈ R^d, or as a set of d column vectors X_j ∈ R^n.

Distance and Angle

Treating data instances and attributes as vectors, and the entire dataset as a matrix, enables one to apply both geometric and algebraic methods to aid in the data mining and analysis tasks.

Let a, b ∈ R^m be two m-dimensional vectors given as

a = (a_1, a_2, ..., a_m)^T        b = (b_1, b_2, ..., b_m)^T

Dot Product
The dot product between a and b is defined as the scalar value

a^T b = a_1 b_1 + a_2 b_2 + ... + a_m b_m = Σ_{i=1}^{m} a_i b_i

Length
The Euclidean norm or length of a vector a ∈ R^m is defined as

‖a‖ = √(a^T a) = √(a_1^2 + a_2^2 + ... + a_m^2) = √(Σ_{i=1}^{m} a_i^2)

The unit vector in the direction of a is given as

u = a/‖a‖ = (1/‖a‖) a
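The dot product, length, and unit vector can be sketched directly in NumPy (an illustrative aside, not part of the original text), using two small vectors.

```python
import numpy as np

a = np.array([5.0, 3.0])
b = np.array([1.0, 4.0])

dot = a @ b                 # a^T b = 5*1 + 3*4 = 17
length = np.sqrt(a @ a)     # ||a|| = sqrt(5^2 + 3^2) = sqrt(34) ≈ 5.831
u = a / length              # unit vector in the direction of a

print(dot)                  # 17.0
print(np.linalg.norm(u))    # ≈ 1.0 (u is normalized by construction)
```

Dividing a vector by its own norm always yields a vector of length 1, which is why u can stand in for a whenever only direction matters.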

By definition, u has length ‖u‖ = 1; it is also called a normalized vector, which can be used in lieu of a in some analysis tasks.

The Euclidean norm is a special case of a general class of norms, known as the L_p-norm, defined as

‖a‖_p = (|a_1|^p + |a_2|^p + ... + |a_m|^p)^{1/p} = (Σ_{i=1}^{m} |a_i|^p)^{1/p}

for any p ≥ 1. Thus, the Euclidean norm corresponds to the case p = 2.

Distance
From the Euclidean norm we can define the Euclidean distance between a and b as follows:

δ(a, b) = ‖a − b‖ = √((a − b)^T (a − b)) = √(Σ_{i=1}^{m} (a_i − b_i)^2)        (1.1)

Thus, the length of a vector is simply its distance from the zero vector 0, all of whose elements are 0; that is, ‖a‖ = ‖a − 0‖ = δ(a, 0).

From the general L_p-norm we can define the corresponding L_p-distance function, given as follows:

δ_p(a, b) = ‖a − b‖_p        (1.2)

If p is unspecified, as in Eq. (1.1), it is assumed to be p = 2 by default.

Angle
The cosine of the smallest angle between vectors a and b, also called the cosine similarity, is given as

cos θ = a^T b / (‖a‖ ‖b‖) = (a/‖a‖)^T (b/‖b‖)        (1.3)

Thus, the cosine of the angle between a and b is given as the dot product of the unit vectors a/‖a‖ and b/‖b‖.

The Cauchy–Schwartz inequality states that for any vectors a and b in R^m

|a^T b| ≤ ‖a‖ · ‖b‖

It follows immediately from the Cauchy–Schwartz inequality that −1 ≤ cos θ ≤ 1.
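The L_p-norm, L_p-distance, and cosine similarity defined above can be written as one-line NumPy functions; this is an illustrative sketch (not from the text), evaluated on the vectors a = (5, 3)^T and b = (1, 4)^T used in the worked example that follows.

```python
import numpy as np

def lp_norm(a, p=2):
    """L_p-norm: (sum_i |a_i|^p)^(1/p); p = 2 gives the Euclidean norm."""
    return np.sum(np.abs(a) ** p) ** (1.0 / p)

def lp_dist(a, b, p=2):
    """L_p-distance: the L_p-norm of the difference vector a - b."""
    return lp_norm(a - b, p)

def cosine(a, b):
    """Cosine similarity: dot product of the unit vectors of a and b."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([5.0, 3.0])
b = np.array([1.0, 4.0])

print(lp_dist(a, b))        # Euclidean distance sqrt(17) ≈ 4.123
print(cosine(a, b))         # 1/sqrt(2) ≈ 0.7071
print(lp_dist(a, b, p=3))   # (|4|^3 + |-1|^3)^(1/3) = 65^(1/3) ≈ 4.021
```

NumPy's built-in np.linalg.norm(a - b, ord=p) computes the same quantity; the explicit functions are kept here only to mirror the definitions term by term.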

Figure 1.3. Distance and angle between a = (5, 3)^T and b = (1, 4)^T. Unit vectors are shown in gray. [Plot omitted.]

Because the smallest angle θ ∈ [0°, 180°] and because cos θ ∈ [−1, 1], the cosine similarity value ranges from +1, corresponding to an angle of 0°, to −1, corresponding to an angle of 180° (or π radians).

Orthogonality
Two vectors a and b are said to be orthogonal if and only if a^T b = 0, which in turn implies that cos θ = 0, that is, the angle between them is 90° or π/2 radians. In this case, we say that they have no similarity.

Example 1.3 (Distance and Angle). Figure 1.3 shows the two vectors

a = (5, 3)^T        b = (1, 4)^T

Using Eq. (1.1), the Euclidean distance between them is given as

δ(a, b) = √((5 − 1)^2 + (3 − 4)^2) = √(16 + 1) = √17 = 4.12

The distance can also be computed as the magnitude of the vector a − b:

a − b = (5, 3)^T − (1, 4)^T = (4, −1)^T

because ‖a − b‖ = √(4^2 + (−1)^2) = √17 = 4.12.

The unit vector in the direction of a is given as

u_a = a/‖a‖ = (1/√(5^2 + 3^2)) (5, 3)^T = (1/√34) (5, 3)^T = (0.86, 0.51)^T

The unit vector in the direction of b can be computed similarly:

u_b = b/‖b‖ = (1/√17) (1, 4)^T = (0.24, 0.97)^T

These unit vectors are also shown in gray in Figure 1.3.

By Eq. (1.3) the cosine of the angle between a and b is given as

cos θ = (5·1 + 3·4) / (√34 · √17) = 17/√578 = 1/√2

We can get the angle by computing the inverse of the cosine:

θ = cos⁻¹(1/√2) = 45°

Let us consider the L_p-norm for a with p = 3; we get

‖a‖_3 = (5^3 + 3^3)^{1/3} = (152)^{1/3} = 5.34

The distance between a and b using Eq. (1.2) for the L_p-norm with p = 3 is given as

‖a − b‖_3 = ‖(4, −1)^T‖_3 = (|4|^3 + |−1|^3)^{1/3} = (65)^{1/3} = 4.02

Mean and Total Variance

Mean
The mean of the data matrix D is the vector obtained as the average of all the points:

mean(D) = μ = (1/n) Σ_{i=1}^{n} x_i

Total Variance
The total variance of the data matrix D is the average squared distance of each point from the mean:

var(D) = (1/n) Σ_{i=1}^{n} δ(x_i, μ)^2 = (1/n) Σ_{i=1}^{n} ‖x_i − μ‖^2        (1.4)

Simplifying Eq. (1.4), we obtain

var(D) = (1/n) Σ_{i=1}^{n} (‖x_i‖^2 − 2 x_i^T μ + ‖μ‖^2)
       = (1/n) ( Σ_{i=1}^{n} ‖x_i‖^2 − 2n μ^T ((1/n) Σ_{i=1}^{n} x_i) + n ‖μ‖^2 )

       = (1/n) ( Σ_{i=1}^{n} ‖x_i‖^2 − 2n μ^T μ + n ‖μ‖^2 )
       = ( (1/n) Σ_{i=1}^{n} ‖x_i‖^2 ) − ‖μ‖^2

The total variance is thus the difference between the average of the squared magnitude of the data points and the squared magnitude of the mean (average of the points).

Centered Data Matrix
Often we need to center the data matrix by making the mean coincide with the origin of the data space. The centered data matrix is obtained by subtracting the mean from all the points:

                  ( x_1^T − μ^T )   ( z_1^T )
Z = D − 1·μ^T  =  ( x_2^T − μ^T ) = ( z_2^T )        (1.5)
                  (     ...     )   (  ...  )
                  ( x_n^T − μ^T )   ( z_n^T )

where z_i = x_i − μ represents the centered point corresponding to x_i, and 1 ∈ R^n is the n-dimensional vector all of whose elements have value 1. The mean of the centered data matrix Z is 0 ∈ R^d, because we have subtracted the mean μ from all the points x_i.

Orthogonal Projection
Often in data mining we need to project a point or vector onto another vector, for example, to obtain a new point after a change of the basis vectors. Let a, b ∈ R^m be two m-dimensional vectors. An orthogonal decomposition of the vector b in the direction of another vector a

Figure 1.4. Orthogonal projection (p = b∥, r = b⊥). [Plot omitted.]
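The mean, total variance, and centering operations can be sketched in NumPy (an illustrative aside, not part of the original text), using the numeric attributes of the eight Iris rows from earlier; the final assertion checks the simplified closed form of the variance derived above.

```python
import numpy as np

# Numeric attributes of the first eight rows of the Iris extract.
D = np.array([
    [5.9, 3.0, 4.2, 1.5],
    [6.9, 3.1, 4.9, 1.5],
    [6.6, 2.9, 4.6, 1.3],
    [4.6, 3.2, 1.4, 0.2],
    [6.0, 2.2, 4.0, 1.0],
    [4.7, 3.2, 1.3, 0.2],
    [6.5, 3.0, 5.8, 2.2],
    [5.8, 2.7, 5.1, 1.9],
])

mu = D.mean(axis=0)                                  # mean(D): column-wise average

# Total variance: average squared Euclidean distance of each point from the mean.
total_var = np.mean(np.sum((D - mu) ** 2, axis=1))

# Simplified closed form: average squared magnitude minus squared mean magnitude.
alt = np.mean(np.sum(D ** 2, axis=1)) - mu @ mu

Z = D - mu                                           # centered data matrix (broadcasts mu over rows)

print(np.isclose(total_var, alt))                    # True: both forms agree
print(np.allclose(Z.mean(axis=0), 0.0))              # True: centered data has zero mean
```

The subtraction D - mu relies on NumPy broadcasting to subtract μ^T from every row, which is exactly the D − 1·μ^T operation in Eq. (1.5).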