Data Mining and Analysis: Fundamental Concepts and Algorithms

Size: px

Start display at page:

Download "Data Mining and Analysis: Fundamental Concepts and Algorithms"

Spencer Harrell
5 years ago
Views:

1 Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA Department of Computer Science Universidade Federal de Minas Gerais, Belo Horizonte, Brazil Chapter 1: Data Mining and Analysis Zaki & Meira Jr. (RPI and UFMG) Data Mining and Analysis Chapter 1: Data Mining and Analysis 1 / 3

2 Data Matrix Data can often be represented or abstracted as an n d data matrix, with n rows and d columns, given as X 1 X X d x 1 x 11 x 1 x 1d D = x x 1 x x d x n x n1 x n x nd Rows: Also called instances, examples, records, transactions, objects, points, feature-vectors, etc. Given as a d-tuple x i = (x i1, x i,...,x id ) Columns: Also called attributes, properties, features, dimensions, variables, fields, etc. Given as an n-tuple X j = (x 1j, x j,...,x nj ) Zaki & Meira Jr. (RPI and UFMG) Data Mining and Analysis Chapter 1: Data Mining and Analysis / 3

3 Iris Dataset Extract Sepal Sepal Petal Petal Class length width length width X 1 X X 3 X 4 X 5 x Iris-versicolor x Iris-versicolor x Iris-versicolor x Iris-setosa x Iris-versicolor x Iris-setosa x Iris-virginica x Iris-virginica x Iris-virginica x Iris-setosa Zaki & Meira Jr. (RPI and UFMG) Data Mining and Analysis Chapter 1: Data Mining and Analysis 3 / 3

4 Attributes Attributes may be classified into two main types Numeric Attributes: real-valued or integer-valued domain Interval-scaled: only differences are meaningful e.g., temperature Ratio-scaled: differences and ratios are meaningful e..g,age Categorical Attributes: set-valued domain composed of a set of symbols Nominal: only equality is meaningful e.g., domain(sex) = {M, F} Ordinal: both equality (are two values the same?) and inequality (is one value less than another?) are meaningful e.g., domain(education) = {High School,BS,MS,PhD} Zaki & Meira Jr. (RPI and UFMG) Data Mining and Analysis Chapter 1: Data Mining and Analysis 4 / 3

5 Data: Algebraic and Geometric View For numeric data matrix D, each row or point is a d-dimensional column vector: x i1 x i x i =. = ( x i1 x i ) T x id R d x id whereas each column or attribute is a n-dimensional column vector: X j = ( x 1j x j x nj ) T R n X3 X x 1 = (5.9, 3.0) T 3 x1 = (5.9, 3.0, 4.) T (a) R X 1 X (b) R 3 X Figure: Projections of x 1 = (5.9, 3.0, 4., 1.5) T in D and 3D Zaki & Meira Jr. (RPI and UFMG) Data Mining and Analysis Chapter 1: Data Mining and Analysis 5 / 3

6 Scatterplot: D Iris Dataset sepal length versussepal width. Visualizing Iris dataset as points/vectors in D Solid circle shows the mean point X: sepal width X 1 : sepal length Zaki & Meira Jr. (RPI and UFMG) Data Mining and Analysis Chapter 1: Data Mining and Analysis 6 / 3

7 Numeric Data Matrix If all attributes are numeric, then the data matrix D is an n d matrix, or equivalently a set of n row vectors x T i R d or a set of d column vectors X j R n x 11 x 1 x 1d x T 1 x 1 x x d D = = x T. = X 1 X X d x n1 x n x nd x T n The mean of the data matrix D is the average of all the points: mean(d) = µ = 1 n x i n i=1 The centered data matrix is obtained by subtracting the mean from all the points: x T 1 µ T x T 1 µ T z T 1 x T Z = D 1 µ T =. µ T. = x T µ T. = z T. x T n µ T x T n µ T z T n (1) where z i = x i µ is a centered point, and 1 R n is the vector of ones. Zaki & Meira Jr. (RPI and UFMG) Data Mining and Analysis Chapter 1: Data Mining and Analysis 7 / 3

8 Norm, Distance and Angle Given two points a, b R m, their dot product is defined as the scalar a T b = a 1 b 1 + a b + +a mb m m = a i b i i=1 The Euclidean norm or length of a vector a is defined as a = a T a = m i=1 The unit vector in the direction of a is u = a with a = 1. a a i Distance between a and b is given as a b = m (a i b i ) i=1 Angle between a and b is given as ( ) T ( ) cosθ = at b a b a b = a b X (1, 4) b θ a a b (5, 3) X1 Zaki & Meira Jr. (RPI and UFMG) Data Mining and Analysis Chapter 1: Data Mining and Analysis 8 / 3

9 Orthogonal Projection Two vectors a and b are orthogonal iff a T b = 0, i.e., the angle between them is 90. Orthogonal projection of b on a comprises the vector p = b parallel to a, and r = b perpendicular or orthogonal to a, given as b = b + b = p+r where X p = b = ( a T ) b a T a a 4 b 3 r = b a 1 p = b Zaki & Meira Jr. (RPI and UFMG) Data Mining and Analysis Chapter 1: Data Mining and Analysis 9 / 3 X 1

10 Projection of Centered Iris Data Onto a Line l. l X X 1 Zaki & Meira Jr. (RPI and UFMG) Data Mining and Analysis Chapter 1: Data Mining and Analysis 10 / 3

11 Data: Probabilistic View The sample space of A is O = [4.3, 7.9], and its range is {0, 1}. Thus, A is discrete. Zaki & Meira Jr. (RPI and UFMG) Data Mining and Analysis Chapter 1: Data Mining and Analysis 11 / 3 A random variable X is a function X: O R, where O is the set of all possible outcomes of the experiment, also called the sample space. A discrete random variable takes on only a finite or countably infinite number of values, whereas a continuous random variable if it can take on any value in R. By default, a numeric attribute X j is considered as the identity random variable given as X(v) = v for all v O. Here O = R. Discrete Variable: Long Sepal Length Define random variable A, denoting long sepal length (7cm or more) as follows: { 0 if v < 7 A(v) = 1 if v 7

12 Probability Mass Function If X is discrete, the probability mass function of X is defined as f(x) = P(X = x) for all x R f must obey the basic rules of probability. That is, f must be non-negative: f(x) 0 and the sum of all probabilities should add to 1: f(x) = 1 x Intuitively, for a discrete variable X, the probability is concentrated or massed at only discrete values in the range of X, and is zero for all other values. Zaki & Meira Jr. (RPI and UFMG) Data Mining and Analysis Chapter 1: Data Mining and Analysis 1 / 3

13 Sepal Length: Bernoulli Distribution Iris Dataset Extract: sepal length (in centimeters) Define random variable A as follows: A(v) = { 0 if v < 7 1 if v 7 We find that only 13 Irises have sepal length of at least 7 cm. Thus, the probability mass function of A can be estimated as: f(1) = P(A = 1) = = = p and f(0) = P(A = 0) = 137 = = 1 p 150 A has a Bernoulli distribution with parameter p [0, 1], which denotes the probability of a success, that is, the probability of picking an Iris with a long sepal length at random from the set of all points. Zaki & Meira Jr. (RPI and UFMG) Data Mining and Analysis Chapter 1: Data Mining and Analysis 13 / 3

14 Sepal Length: Binomial Distribution Define discrete random variable B, denoting the number of Irises with long sepal length in m independent Bernoulli trials with probability of success p. In this case, B takes on the discrete values [0, m], and its probability mass function is given by the Binomial distribution f(k) = P(B = k) = ( m k ) p k (1 p) m k Binomial distribution for long sepal length (p = 0.087) for m = 10 trials P(B=k) k Zaki & Meira Jr. (RPI and UFMG) Data Mining and Analysis Chapter 1: Data Mining and Analysis 14 / 3

15 Probability Density Function If X is continuous, the probability density function of X is defined as P ( X [a, b] ) = b a f(x) dx f must obey the basic rules of probability. That is, f must be non-negative: f(x) 0 and the sum of all probabilities should add to 1: f(x) dx = 1 Note that P(X = v) = 0 for all v R since there are infinite possible values in the sample space. What it means is that the probability mass is spread so thinly over the range of values that it can be measured only over intervals [a, b] R, rather than at specific points. Zaki & Meira Jr. (RPI and UFMG) Data Mining and Analysis Chapter 1: Data Mining and Analysis 15 / 3

16 Sepal Length: Normal Distribution We model sepal length via the Gaussian or normal density function, given as { } 1 (x µ) f(x) = exp πσ σ where µ = 1 n n i=1 x i is the mean value, and σ = 1 n n i=1 (x i µ) is the variance. Normal distribution for sepal length: µ = 5.84, σ = f(x) 0.5 µ±ǫ x Zaki & Meira Jr. (RPI and UFMG) Data Mining and Analysis Chapter 1: Data Mining and Analysis 16 / 3

17 Cumulative Distribution Function For random variable X, its cumulative distribution function (CDF) F : R [0, 1], gives the probability of observing a value at most some given value x: F(x) = P(X x) for all < x < When X is discrete, F is given as F(x) = P(X x) = u x f(u) CDF for binomial distribution (p = 0.087, m = 10) F(x) CDF for the normal distribution (µ = 5.84,σ = 0.681) F(x) x When X is continuous, F is given as F(x) = P(X x) = x f(u) du (µ, F(µ)) = (5.84, 0.5) x Zaki & Meira Jr. (RPI and UFMG) Data Mining and Analysis Chapter 1: Data Mining and Analysis 17 / 3

18 Bivariate Random Variable: Joint Probability Mass Function Iris: joint PMF for long sepal length and Define discrete random variables sepal width { 1 ifv 7 f (0, 0) = P(X 1 = 0, X = 0) = 116/150 = long sepal length:x 1 (v) = 0 otherwise f (0, 1) = P(X 1 = 0, X = 1) = 1/150 = { f (1, 0) = P(X 1 = 1, X = 0) = 10/150 = ifv 3.5 long sepal width:x (v) = f (1, 1) = P(X 0 otherwise 1 = 1, X = 1) = 3/150 = 0.00 The bivariate random variable ( ) X1 X = X has the joint probability mass function f(x) = P(X = x) i.e., f(x 1, x ) = P(X 1 = x 1, X = x ) X f(x) X Zaki & Meira Jr. (RPI and UFMG) Data Mining and Analysis Chapter 1: Data Mining and Analysis 18 / 3

19 Bivariate Random Variable: Probability Density Function Bivariate Normal: modeling joint distribution for long sepal length (X 1 ) and sepal width (X ) 1 f(x µ,σ) = π exp Σ (x µ)t Σ 1 (x µ) Bivariate Normal µ = (5.843, 3.054) T ( ) Σ = where µ and Σ specify the D mean and covariance matrix: ( ) µ = (µ 1,µ ) T σ Σ = 1 σ 1 σ 1 σ f(x) with mean µ i = 1 n n k=1 x ki and covariance σ ij = 1 n k=1 (x ki µ i )(x kj µ j ). Also, X1 σi = σ ii X Zaki & Meira Jr. (RPI and UFMG) Data Mining and Analysis Chapter 1: Data Mining and Analysis 19 / 3

20 Random Sample and Statistics Given a random variable X, a random sample of size n from X is defined as a set of n independent and identically distributed (IID) random variables S 1, S,...,S n The S i s have the same probability distribution as X, and are statistically independent. Two random variables X 1 and X are (statistically) independent if, for every W 1 R and W R, we have which also implies that P(X 1 W 1 and X W ) = P(X 1 W 1 ) P(X W ) F(x) = F(x 1, x ) = F 1 (x 1 ) F (x ) f(x) = f(x 1, x ) = f 1 (x 1 ) f (x ) where F i is the cumulative distribution function, and f i is the probability mass or density function for random variable X i. Zaki & Meira Jr. (RPI and UFMG) Data Mining and Analysis Chapter 1: Data Mining and Analysis 0 / 3

21 Multivariate Sample Given dataset D, the n data points x i (with 1 i n) constitute a d-dimensional multivariate random sample drawn from the vector random variable X = (X 1, X,...,X d ). Since the x i are assumed to be independent and identically distributed, their joint distribution is given as f(x 1, x,...,x n ) = n f X (x i ) where f X is the probability mass or density function for X. Assuming that the d attributes X 1, X,...,X d are statistically independent, the joint distribution for the entire dataset is given as: f(x 1, x,...,x n ) = i=1 n f(x i ) = i=1 n i=1 j=1 d f Xj (x ij ) Zaki & Meira Jr. (RPI and UFMG) Data Mining and Analysis Chapter 1: Data Mining and Analysis 1 / 3

22 Sample Statistics Let {S i } m i=1 be a random sample of size m drawn from a (multivariate) random variable X. A statistic ˆθ is a function ˆθ: (S 1, S,...,S m) R The statistic is an estimate of the corresponding population parameterθ, where the population refers to the entire universe of entities under study. The statistic is itself a random variable. The sample mean is a statistic, defined as the average ˆµ = 1 n n i=1 x i For sepal length, we have ˆµ = 5.84, which is an estimator for the (unknown) true population mean sepal length. Zaki & Meira Jr. (RPI and UFMG) Data Mining and Analysis Chapter 1: Data Mining and Analysis / 3

23 Sample Statistics: Variance The sample variance is a statistic ˆσ = 1 n n (x i µ) i=1 For sepal length, we have ˆσ = The total variance is a multivariate statistic var(d) = 1 n n x i µ For the Iris data (with 4 attributes: sepal length and width, petal length and width), we have var(d) = i=1 Zaki & Meira Jr. (RPI and UFMG) Data Mining and Analysis Chapter 1: Data Mining and Analysis 3 / 3

Data Mining and Analysis

978--5-766- - Data Mining and Analysis: Fundamental Concepts and Algorithms CHAPTER Data Mining and Analysis Data mining is the process of discovering insightful, interesting, and novel patterns, as well