Noise & Data Reduction


- Paired Sample t Test
- Data Transformation - Overview
- From Covariance Matrix to PCA and Dimension Reduction
- Fourier Analysis - Spectrum
- Dimension Reduction

Remember: Central Limit Theorem

The sampling distribution of the mean of samples of size N approaches a normal (Gaussian) distribution as N approaches infinity. If the samples are drawn from a population with mean $\mu$ and standard deviation $\sigma$, then the mean of the sampling distribution is $\mu$ and its standard deviation is $\sigma_{\bar{x}} = \sigma / \sqrt{N}$. These statements hold irrespective of the shape of the original distribution.

Z Test (population standard deviation known):

$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{N}}, \qquad \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2}$$

t Test (population standard deviation unknown, small samples):

$$t = \frac{\bar{x} - \mu}{s / \sqrt{N}}, \qquad s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2}$$

Here $\mu$ is the population mean and $\bar{x}$ the sample mean.

p Values

Commonly we reject H0 when the probability of obtaining the sample statistic given the null hypothesis is low, say < 0.05. The null hypothesis is then rejected, but it might still be true. We find the probabilities by looking them up in tables, or statistics packages provide them. The probability of obtaining a particular sample given the null hypothesis is called the p value. By convention, one usually does not reject the null hypothesis unless p < 0.05 (statistically significant).

Example: Five parked cars have a mean price of 20,270 with a sample standard deviation of 5,811. The mean price of cars in town (the population) is 12,000. H0: the parked cars are as expensive as the cars in town.

$$t = \frac{20270 - 12000}{5811 / \sqrt{5}} = 3.18$$

For N - 1 = 4 degrees of freedom, t = 3.18 corresponds to a p value below 0.025, so reject H0.
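To reproduce this, a minimal Python sketch (assuming only the summary statistics from the slide are available, the t statistic and a one-sided p value are computed from them with scipy rather than from raw prices):

```python
# A minimal sketch of the parked-cars example, computed from the slide's
# summary statistics (no raw prices are available).
import math
from scipy import stats

n = 5                    # sample size (five parked cars)
sample_mean = 20270.0    # mean price of the parked cars
sample_sd = 5811.0       # sample standard deviation
pop_mean = 12000.0       # population mean price under H0

t = (sample_mean - pop_mean) / (sample_sd / math.sqrt(n))
p_one_sided = stats.t.sf(t, df=n - 1)   # P(T > t) under H0, df = N - 1

print(f"t = {t:.2f}, one-sided p = {p_one_sided:.4f}")
# t comes out around 3.18 and p around 0.017, below 0.025, so H0 is rejected.
```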

Paired Sample t Test

Given a set of paired observations (from two normal populations):

A: x1 x2 x3 x4 x5
B: y1 y2 y3 y4 y5
δ = A - B: x1-y1 x2-y2 x3-y3 x4-y4 x5-y5

Calculate the mean $\bar{x}_\delta$ and the standard deviation $s_\delta$ of the differences δ.

H0: $\mu_\delta = 0$ (no difference)
H0: $\mu_\delta = k$ (difference is a constant)

$$t_\delta = \frac{\bar{x}_\delta - \mu_\delta}{\hat{\sigma}_{\bar{\delta}}}, \qquad \hat{\sigma}_{\bar{\delta}} = \frac{s_\delta}{\sqrt{N_\delta}}$$
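As an illustration, a minimal sketch on made-up paired measurements (the arrays a and b are hypothetical, not from the slides); scipy's ttest_rel computes the same statistic as the formula above:

```python
# A minimal sketch of a paired-sample t test on hypothetical paired data.
import numpy as np
from scipy import stats

a = np.array([23.1, 20.4, 25.0, 22.8, 21.7])   # hypothetical measurements A
b = np.array([22.5, 19.9, 24.1, 22.9, 20.8])   # paired measurements B

delta = a - b                                   # differences δ = A - B
t_manual = delta.mean() / (delta.std(ddof=1) / np.sqrt(len(delta)))

# scipy computes the same statistic for H0: mean difference = 0
t_scipy, p_value = stats.ttest_rel(a, b)
print(t_manual, t_scipy, p_value)
```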

Confidence Intervals

σ known: the standard error follows from the population standard deviation,

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}$$

and the 95 percent confidence interval for a normal distribution is centered about the mean:

$$\bar{x} \pm 1.96 \, \sigma_{\bar{x}}$$

σ unknown: the standard error is estimated from the sample standard deviation,

$$\hat{\sigma}_{\bar{x}} = \frac{s}{\sqrt{N}}$$

and the 95 percent confidence interval uses the t distribution ($t_{0.025}$ from a table):

$$\bar{x} \pm t_{0.025} \, \hat{\sigma}_{\bar{x}}$$

Previous example: the parked-cars data above; an interval for it is sketched in the code below.
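A small sketch of both interval formulas, applied to the parked-cars summary statistics from the earlier example (assuming those values; the normal-based interval is shown only for comparison):

```python
# A minimal sketch of 95% confidence intervals from summary statistics.
import math
from scipy import stats

n, mean, s = 5, 20270.0, 5811.0
se = s / math.sqrt(n)                     # estimated standard error

# sigma unknown: use the t distribution with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)     # ~2.776 for df = 4
lo, hi = mean - t_crit * se, mean + t_crit * se
print(f"95% t interval: [{lo:.0f}, {hi:.0f}]")

# sigma known (illustration only): the normal-based interval uses 1.96
z_lo, z_hi = mean - 1.96 * se, mean + 1.96 * se
print(f"95% normal interval: [{z_lo:.0f}, {z_hi:.0f}]")
```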

Overview

- Data Transformation
- Reduce Noise
- Reduce Data

Data Transformation

- Smoothing: remove noise from data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scale values to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction: new attributes constructed from the given ones

Data Transformation: Normalization

Min-max normalization to $[\text{new\_min}_A, \text{new\_max}_A]$:

$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$$

Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to

$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$

Z-score normalization ($\mu_A$: mean, $\sigma_A$: standard deviation of attribute A):

$$v' = \frac{v - \mu_A}{\sigma_A}$$

Ex. Let μ = 54,000 and σ = 16,000. Then

$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$

Normalization by decimal scaling: divide by the smallest power of 10 that brings all values into (-1, 1).

How to Handle Noisy Data? (How to Reduce Features?)

- Binning: first sort the data and partition it into (equal-frequency) bins, then smooth by bin means, bin medians, bin boundaries, etc.
- Regression: smooth by fitting the data to regression functions
- Clustering: detect and remove outliers
- Combined computer and human inspection: detect suspicious values and have a human check them (e.g., deal with possible outliers)
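A minimal sketch of min-max and z-score normalization; apart from the income value 73,600 and the slide's μ and σ, the data values are made up:

```python
# A minimal sketch of min-max and z-score normalization on a small
# hypothetical income column.
import numpy as np

income = np.array([12_000, 35_000, 54_000, 73_600, 98_000], dtype=float)

# Min-max normalization to [0.0, 1.0]
mn, mx = income.min(), income.max()
minmax = (income - mn) / (mx - mn) * (1.0 - 0.0) + 0.0

# Z-score normalization with the slide's mu = 54,000 and sigma = 16,000
zscore = (income - 54_000) / 16_000

print(minmax)   # 73,600 maps to ~0.716
print(zscore)   # 73,600 maps to 1.225
```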

Data Reduction Strategies

A data warehouse may store terabytes of data, and complex data analysis/mining may take a very long time to run on the complete data set. Data reduction obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.

Data reduction strategies:
- Data cube aggregation
- Dimensionality reduction: remove unimportant attributes
- Data compression
- Numerosity reduction: fit the data to models
- Discretization and concept hierarchy generation

Simple Discretization Methods: Binning

- Equal-width (distance) partitioning: divides the range into N intervals of equal size (a uniform grid). If A and B are the lowest and highest values of the attribute, the interval width is W = (B - A)/N. The most straightforward method, but outliers may dominate the presentation and skewed data is not handled well.
- Equal-depth (frequency) partitioning: divides the range into N intervals, each containing approximately the same number of samples. Gives good data scaling, but managing categorical attributes can be tricky.

Binning Methods for Data Smoothing

* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries (the min and max of each bin are identified, and each bin value is replaced by the closest boundary value):
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34

Cluster Analysis

(Figure omitted.)
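The binning example can be reproduced with a short sketch; np.array_split is used here as a simple stand-in for equal-frequency partitioning:

```python
# A minimal sketch of equi-depth binning with smoothing by bin means and by
# bin boundaries, reproducing the price example above.
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)
n_bins = 3
bins = np.array_split(np.sort(prices), n_bins)   # equal-frequency partition

smoothed_means, smoothed_bounds = [], []
for b in bins:
    # smoothing by bin means: every value becomes the (rounded) bin mean
    smoothed_means.append(np.round(b.mean()) * np.ones_like(b))
    # smoothing by bin boundaries: every value snaps to the nearer of min/max
    lo, hi = b.min(), b.max()
    smoothed_bounds.append(np.where(b - lo <= hi - b, lo, hi))

print(smoothed_means)    # bins of 9s, 23s, 29s (as floats)
print(smoothed_bounds)   # [4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]
```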

Regression

Data can be smoothed by fitting it to a function, e.g. a regression line. (Figure omitted: scatter plot with the fitted line y = x + 1 through the points, axes x and y, marked values X1 and Y1.)

Feature Space

A sample is a d-dimensional feature vector

$$\vec{x} = (x_1, x_2, \ldots, x_d)^T$$

and a data set is a collection of such vectors $\{\vec{x}^{(1)}, \vec{x}^{(2)}, \ldots, \vec{x}^{(k)}, \ldots, \vec{x}^{(n)}\}$. The Euclidean distance between two feature vectors is

$$\|\vec{x} - \vec{y}\| = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$$

Scaling

A well-known scaling method consists of subtracting the mean and dividing by the standard deviation of each feature:

$$y_i = \frac{x_i - m_i}{s_i}$$

where $m_i$ is the sample mean and $s_i$ the sample standard deviation of feature i. According to the scaled metric, the scaled feature vector has norm

$$\|\vec{y}_s\| = \sqrt{\sum_{i=1}^{n} \frac{(x_i - m_i)^2}{s_i^2}}$$

Scaling shrinks features with large variance ($s_i > 1$) and stretches features with low variance ($s_i < 1$). It fails to preserve distances when a general linear transformation is applied!

Covariance

Covariance measures the tendency of two features $x_i$ and $x_j$ to vary in the same direction. The covariance between features $x_i$ and $x_j$ is estimated over n patterns as

$$c_{ij} = \frac{\sum_{k=1}^{n} \left(x_i^{(k)} - m_i\right)\left(x_j^{(k)} - m_j\right)}{n - 1}$$

and the covariances are collected in the covariance matrix

$$C = \begin{pmatrix} c_{11} & c_{12} & \ldots & c_{1d} \\ c_{21} & c_{22} & \ldots & c_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ c_{d1} & c_{d2} & \ldots & c_{dd} \end{pmatrix}$$
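A minimal sketch of estimating C on a small, made-up data matrix (n patterns as rows, d features as columns), checked against numpy's estimator:

```python
# A minimal sketch: estimating the covariance matrix of a hypothetical
# 5 x 2 data set.
import numpy as np

X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])             # hypothetical data, n = 5, d = 2

m = X.mean(axis=0)                      # per-feature sample means
C_manual = (X - m).T @ (X - m) / (X.shape[0] - 1)

# numpy's estimator uses the same (n - 1) denominator
C_numpy = np.cov(X, rowvar=False)
print(C_manual)
print(np.allclose(C_manual, C_numpy))   # True
```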

Correlation

Covariances are symmetric: $c_{ij} = c_{ji}$. Covariance is related to correlation by

$$r_{ij} = \frac{\sum_{k=1}^{n} \left(x_i^{(k)} - m_i\right)\left(x_j^{(k)} - m_j\right)}{(n-1)\, s_i s_j} = \frac{c_{ij}}{s_i s_j} \in [-1, 1]$$

Principal Component Analysis

Intuition: find the axis that shows the greatest variation, and project all points onto this axis. (Figure omitted: data points in the original axes f1, f2 with the principal axes e1, e2 overlaid.)
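Continuing the sketch above, the correlation matrix follows from the covariance matrix by dividing by the per-feature standard deviations (same made-up data):

```python
# A minimal sketch relating covariance and correlation: r_ij = c_ij / (s_i s_j).
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0]])           # hypothetical data

C = np.cov(X, rowvar=False)
s = X.std(axis=0, ddof=1)                        # per-feature sample std devs
R_manual = C / np.outer(s, s)

print(R_manual)
print(np.allclose(R_manual, np.corrcoef(X, rowvar=False)))   # True
```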

Karhunen-Loève Transformation

The covariance matrix C (a d × d matrix) is symmetric and positive definite, so it can be diagonalized:

$$U^T C U = \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$$

The eigenvalues and eigenvectors solve

$$(\lambda I - C)\,\vec{u} = 0$$

There are d eigenvalues and eigenvectors, with

$$C \vec{u}_i = \lambda_i \vec{u}_i$$

where $\lambda_i$ is the ith eigenvalue of C and $\vec{u}_i$, the ith column of U, is the ith eigenvector. The eigenvectors are orthogonal, so U is an orthonormal matrix: $U U^T = U^T U = I$. U defines the K-L transformation, and the transformed features are given by the linear transformation

$$\vec{y} = U^T \vec{x}$$

The K-L transformation rotates the feature space into alignment, yielding uncorrelated features.
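A minimal sketch of the K-L transformation on the same made-up data: diagonalize the covariance matrix and rotate the centered samples onto the eigenvector axes; the transformed features come out uncorrelated:

```python
# A minimal sketch of the K-L transformation via eigendecomposition of C.
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0]])   # hypothetical data

Xc = X - X.mean(axis=0)                  # center the features
C = np.cov(Xc, rowvar=False)             # d x d covariance matrix

# eigh is appropriate for symmetric matrices; eigenvalues come out ascending
eigvals, U = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
eigvals, U = eigvals[order], U[:, order]

Y = Xc @ U                               # y = U^T x for every sample
print(np.cov(Y, rowvar=False).round(6))  # ~diagonal: features are uncorrelated
```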

Example " C = 2 1 % $ ' "I # C = 0 " 2 # 3" +1= 0 # 1 1& λ 1 =2.618 λ 2 =0.382 #"0.618 "1 &# % ( u & 1 % ( = 0 $ "1 1.618' $ ' u 2 u (1) =[1 0.618] u (2) =[-1 1.618] U=[u (1),u (2) ] 15

PCA (Principal Components Analysis)

The new features y are uncorrelated: their covariance matrix is the diagonal matrix Λ. Each eigenvector $\vec{u}_i$ is associated with a variance given by $\lambda_i$. Uncorrelated features with higher variance (represented by $\lambda_i$) contain more information.

Idea: retain only the significant eigenvectors $\vec{u}_i$.

Example: with $U = [\vec{u}^{(1)}, \vec{u}^{(2)}]$, λ₁ = 2.618 and λ₂ = 0.382, keep $U^* = [\vec{u}^{(1)}]$ and project

$$\vec{y} = U^{*T} \vec{x}$$

Dimension Reduction

How many eigenvectors (and corresponding eigenvalues) to retain? The Kaiser criterion discards eigenvectors whose eigenvalues are below 1.
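A short sketch of this dimension-reduction step with the 2 × 2 example covariance matrix; the feature vector x is a made-up illustration:

```python
# A minimal sketch: keep only the eigenvector with the largest variance and
# project a sample onto it.
import numpy as np

C = np.array([[2.0, 1.0],
              [1.0, 1.0]])              # example covariance matrix from the slides

eigvals, U = np.linalg.eigh(C)          # ascending eigenvalues
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]
print(eigvals)                          # ~[2.618, 0.382]

U_star = U[:, :1]                       # retain only the significant eigenvector
x = np.array([1.0, 2.0])                # a hypothetical feature vector
y = U_star.T @ x                        # reduced 1-D representation
print(y)
```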

Problems

- Principal components are linear transformations of the original features; it is difficult to attach any semantic meaning to them.
- When new data is added to the data set, the PCA has to be recomputed.

Exercise: Suppose we have the covariance matrix

$$C = \begin{pmatrix} 3 & 1 \\ 1 & 21 \end{pmatrix}$$

What is the matrix of the K-L transformation?

First we have to compute the eigenvalues. The system has to become linearly dependent (singular), i.e. the determinant has to vanish:

$$|\lambda I - C| = 0 \quad\Rightarrow\quad \lambda^2 - 24\lambda + 62 = 0$$

Solving it we get λ₁ = 2.94461 and λ₂ = 21.05538.

Now, let's compute the two eigenvectors. To do it we have to solve two singular, dependent systems. For the first eigenvalue λ₁ = 2.94461:

$$(\lambda_1 I - C)\,\vec{u}_1 = 0 \qquad\text{or equivalently}\qquad C\vec{u}_1 = \lambda_1 \vec{u}_1$$

And for the second eigenvalue λ₂ = 21.05538:

$$(\lambda_2 I - C)\,\vec{u}_2 = 0 \qquad\text{or equivalently}\qquad C\vec{u}_2 = \lambda_2 \vec{u}_2$$

For λ₁ = 2.94461, with $\vec{u}_1 = (u_1, u_2)$:

$$\left(\begin{pmatrix} 2.94461 & 0 \\ 0 & 2.94461 \end{pmatrix} - \begin{pmatrix} 3 & 1 \\ 1 & 21 \end{pmatrix}\right)\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = 0$$

We have to find a nontrivial solution; the trivial solution is u = [0, 0].

$$\begin{pmatrix} -0.05539 & -1 \\ -1 & -18.055 \end{pmatrix}\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = 0$$

Because the system is linearly dependent, one column is a multiple of the other, so there are infinitely many solutions. We only have to determine the direction of the eigenvectors $\vec{u}_1$ and $\vec{u}_2$. But be careful: the normalized vectors have to be orthogonal to each other, $\langle \vec{u}_1, \vec{u}_2 \rangle = 0$.

Let u₁ = 1; then we have to determine u₂:

$$\begin{pmatrix} -0.05539 & -1 \\ -1 & -18.055 \end{pmatrix}\begin{pmatrix} 1 \\ u_2 \end{pmatrix} = 0 \quad\Rightarrow\quad \begin{pmatrix} -0.05539 \\ -1 \end{pmatrix} = \begin{pmatrix} 1 \\ 18.055 \end{pmatrix} u_2$$

so $\vec{u}_1 = [u_1, u_2] = [1, -0.05539]$.

For λ₂ = 21.05538, again with $\vec{u}_2 = (u_1, u_2)$:

$$\left(\begin{pmatrix} 21.05538 & 0 \\ 0 & 21.05538 \end{pmatrix} - \begin{pmatrix} 3 & 1 \\ 1 & 21 \end{pmatrix}\right)\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = 0$$

We have to find a nontrivial solution (the trivial solution is u = [0, 0]). Déjà vu?

$$\begin{pmatrix} 18.055 & -1 \\ -1 & 0.05538 \end{pmatrix}\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = 0$$

Let u₁ = 1 again; then we have to determine u₂:

$$\begin{pmatrix} 18.055 \\ -1 \end{pmatrix} = \begin{pmatrix} 1 \\ -0.05538 \end{pmatrix} u_2$$

so $\vec{u}_2 = [u_1, u_2] = [1, 18.055]$.

Summary:
- $\vec{u}_1 = [1, -0.05539]$ for λ₁ = 2.94461
- $\vec{u}_2 = [1, 18.055]$ for λ₂ = 21.05538

Orthogonal? Yes: $\langle \vec{u}_1, \vec{u}_2 \rangle = 0$ (up to rounding).

Which of the two eigenvectors is more significant? $\vec{u}_2$, because λ₁ = 2.94461 < λ₂ = 21.05538.
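The worked example can be checked numerically; note that numpy returns unit-length eigenvectors, so they are rescaled here to have first component 1 for comparison with the values above:

```python
# A minimal sketch verifying the worked example for C = [[3, 1], [1, 21]].
import numpy as np

C = np.array([[3.0, 1.0],
              [1.0, 21.0]])

eigvals, U = np.linalg.eigh(C)
print(eigvals)                     # ~[2.94461, 21.05539]

# Rescale each eigenvector so its first component is 1, matching the slides
u1 = U[:, 0] / U[0, 0]             # ~[1, -0.05539]
u2 = U[:, 1] / U[0, 1]             # ~[1, 18.055]
print(u1, u2)
print(np.dot(u1, u2))              # ~0: the directions are orthogonal

# The K-L transformation matrix uses the normalized eigenvectors (columns of U)
print(U)
```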

Fourier Analysis

It is always possible to decompose a complex periodic waveform into a set of sinusoidal waveforms: any periodic waveform can be approximated by adding together a number of sinusoids. Fourier analysis tells us which particular set of sinusoids goes together to make up a particular complex waveform.

Spectrum

In the Fourier analysis of a complex waveform, the amplitude of each sinusoidal component depends on the shape of the particular complex wave. The amplitude of a wave is its maximum (or minimum) deviation from the zero line. If T is the duration of one period, the frequency is

$$f = \frac{1}{T}$$

Noise Reduction or Dimension Reduction

It is difficult to identify the frequency components by looking at the original signal, so we convert it to the frequency domain.

- For dimension reduction: store only a fraction of the frequencies (those with high amplitude).
- For noise reduction: remove high frequencies (fast changes, i.e. smoothing) or remove low frequencies (slow changes, i.e. removing global trends).

The signal is then reconstructed with the inverse discrete Fourier transform.
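A minimal sketch of this idea on a hypothetical noisy sine signal: transform with the FFT, zero the high-frequency bins, and reconstruct with the inverse transform (the cutoff of 10 bins is an arbitrary choice for illustration):

```python
# A minimal sketch of noise reduction in the frequency domain.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 256, endpoint=False)
signal = np.sin(2 * np.pi * 3 * t) + 0.3 * rng.standard_normal(t.size)

spectrum = np.fft.rfft(signal)          # frequency-domain representation
cutoff = 10                             # keep only the lowest 10 frequency bins
spectrum[cutoff:] = 0                   # remove high frequencies (fast changes)
smoothed = np.fft.irfft(spectrum, n=t.size)

# For dimension reduction one would instead store only the few coefficients
# with the largest amplitude and reconstruct from those.
print(signal[:5].round(3), smoothed[:5].round(3))
```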

Automatic Concept Hierarchy Generation

Some hierarchies can be automatically generated based on the number of distinct values per attribute in the data set: the attribute with the most distinct values is placed at the lowest level of the hierarchy. There are exceptions, e.g., weekday, month, quarter, year.

- country: 15 distinct values
- province_or_state: 365 distinct values
- city: 3,567 distinct values
- street: 674,339 distinct values
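A small sketch of the distinct-value heuristic using the counts from the slide (the underlying data set is not available, so the counts are hard-coded):

```python
# A minimal sketch: order attributes from fewest to most distinct values to
# propose a concept hierarchy (top level first).
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3_567,
    "street": 674_339,
}

hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(" > ".join(hierarchy))   # country > province_or_state > city > street
```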

- Paired Sample t Test
- Data Transformation - Overview
- From Covariance Matrix to PCA and Dimension Reduction
- Fourier Analysis - Spectrum
- Dimension Reduction
- Mining Association Rules: Apriori Algorithm (Chapter 6, Han and Kamber)