
Descriptive Statistics
DS GA 1002 Probability and Statistics for Data Science
http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall17
Carlos Fernandez-Granda

Descriptive statistics
- Techniques to visualize and summarize data
- Can often be interpreted within a probabilistic framework
- Often the probabilistic assumptions do not hold, but the techniques are still useful
- We describe them from a deterministic point of view

Outline
- Histogram
- Empirical mean and variance
- Order statistics
- Empirical covariance
- Empirical covariance matrix

Histogram
- Technique to visualize one-dimensional data
- Bin the range of the data, then count the number of instances in each bin
- The width of the bins can be adjusted to yield higher or lower resolution
- If the data are i.i.d., the histogram is an approximation to their pmf or pdf
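As a minimal sketch of the technique (the dataset and bin count below are hypothetical choices, not the Oxford or GDP data), a histogram can be computed and plotted with numpy and matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=20, scale=2, size=500)  # hypothetical one-dimensional dataset

# Bin the range of the data and count the instances in each bin;
# adjusting `bins` yields higher or lower resolution.
counts, bin_edges = np.histogram(data, bins=30)

plt.hist(data, bins=30)
plt.xlabel("Value")
plt.ylabel("Count")
plt.show()
```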

Temperature in Oxford
[Figure: histograms of daily temperatures in January and August; x-axis in degrees Celsius.]

GDP per capita of different countries
[Figure: histogram of GDP per capita; x-axis in thousands of dollars.]

Empirical mean and variance

Empirical mean
Let $\{x_1, x_2, \dots, x_n\}$ be a set of real-valued data. The empirical mean is defined as
$$\operatorname{av}(x_1, x_2, \dots, x_n) := \frac{1}{n} \sum_{i=1}^{n} x_i$$
Temperature data: 6.73 °C in January and 21.3 °C in August. GDP per capita: $16 500.

Empirical mean
Let $\{\vec{x}_1, \vec{x}_2, \dots, \vec{x}_n\}$ be a set of $d$-dimensional real-valued data. The empirical mean is defined as
$$\operatorname{av}(\vec{x}_1, \vec{x}_2, \dots, \vec{x}_n) := \frac{1}{n} \sum_{i=1}^{n} \vec{x}_i$$

Centering
Let $\{\vec{x}_1, \vec{x}_2, \dots, \vec{x}_n\}$ be a set of $d$-dimensional real-valued data. To center the data set we:
1. Compute the empirical mean
2. Subtract it from each vector:
$$\vec{y}_i := \vec{x}_i - \operatorname{av}(\vec{x}_1, \dots, \vec{x}_n), \quad 1 \leq i \leq n$$
The vectors $\vec{y}_1, \dots, \vec{y}_n$ are centered at the origin.
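A minimal numpy sketch of this procedure (the data matrix `X`, one row per data point, is hypothetical):

```python
import numpy as np

# Hypothetical d-dimensional dataset, one row per data point.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])

mean = X.mean(axis=0)  # 1. compute the empirical mean
Y = X - mean           # 2. subtract it from each vector
assert np.allclose(Y.mean(axis=0), 0.0)  # the centered data average to the origin
```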

Centering
[Figure: the same scatter plot before (uncentered) and after (centered) subtracting the empirical mean.]

Empirical variance
Let $\{x_1, x_2, \dots, x_n\}$ be a set of real-valued data. The empirical variance is defined as
$$\operatorname{var}(x_1, x_2, \dots, x_n) := \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \operatorname{av}(x_1, \dots, x_n))^2$$
The empirical standard deviation is the square root of the empirical variance.
Empirical standard deviations: 1.99 °C in January and 1.73 °C in August for the temperature data; $25 300 for GDP per capita.
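A quick numpy check of these definitions on hypothetical data; note `ddof=1`, which selects the $1/(n-1)$ normalization used above:

```python
import numpy as np

x = np.array([5.1, 6.8, 7.2, 6.4, 8.0])  # hypothetical real-valued data

mean = x.mean()
var = x.var(ddof=1)  # empirical variance with 1/(n-1) normalization
std = x.std(ddof=1)  # empirical standard deviation
assert np.isclose(var, ((x - mean) ** 2).sum() / (len(x) - 1))
```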

Order statistics

Temperature dataset
In January the temperature in Oxford is around 6.73 °C, give or take 2 °C.

GDP dataset
Countries typically have a GDP per capita of about $16 500, give or take $25 300.

Quantiles and percentiles
Let $x_{(1)} \leq x_{(2)} \leq \dots \leq x_{(n)}$ denote the ordered elements of a dataset $\{x_1, x_2, \dots, x_n\}$. The $q$ quantile of the data, for $0 < q < 1$, is $x_{([q(n+1)])}$, where $[q(n+1)]$ denotes the closest integer to $q(n+1)$. The $q$ quantile is also known as the $100\,q$ percentile.

Quartiles and median
The 0.25 and 0.75 quantiles are the first and third quartiles. The 0.5 quantile is the empirical median. If $n$ is even, the empirical median is usually set to
$$\frac{x_{(n/2)} + x_{(n/2+1)}}{2}$$
The difference between the third and first quartiles is the interquartile range (IQR).
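A sketch with numpy on hypothetical data; note that `np.quantile` interpolates between order statistics by default, so its output can differ slightly from the rounding convention $x_{([q(n+1)])}$ above:

```python
import numpy as np

x = np.array([130.0, 1960.0, 6350.0, 20100.0, 188000.0])  # hypothetical data

q1, median, q3 = np.quantile(x, [0.25, 0.50, 0.75])  # first quartile, median, third quartile
iqr = q3 - q1                                        # interquartile range
five_number_summary = (x.min(), q1, median, q3, x.max())
```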

Quartiles and median
Temperature data (January): sample mean 6.73 °C, median 6.80 °C, interquartile range 2.9 °C.
Temperature data (August): sample mean 21.3 °C, median 21.2 °C, interquartile range 2.1 °C.

Quartiles and median
GDP per capita: sample mean $16 500 (71% of the countries have a lower GDP per capita!), median $6 350, interquartile range $18 200.
Five-number summary: $130, $1 960, $6 350, $20 100, $188 000.

Boxplot of temperature data
[Figure: boxplots of Oxford temperatures for January, April, August and November; y-axis in degrees Celsius.]

Boxplot of GDP data
[Figure: boxplot of GDP per capita; y-axis in thousands of dollars.]

Empirical covariance

Multidimensional data
Each dimension represents a feature. We can visualize two-dimensional data using scatter plots.

Scatter plot
[Figure: scatter plot of April temperatures against August temperatures.]

Scatter plot
[Figure: scatter plot of minimum daily temperatures against maximum daily temperatures.]

Empirical covariance
Data: $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$. The empirical covariance is defined as
$$\operatorname{cov}((x_1, y_1), \dots, (x_n, y_n)) := \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \operatorname{av}(x_1, \dots, x_n))(y_i - \operatorname{av}(y_1, \dots, y_n))$$

Empirical correlation coefficient
Data: $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$. The empirical correlation coefficient is defined as
$$\rho((x_1, y_1), \dots, (x_n, y_n)) := \frac{\operatorname{cov}((x_1, y_1), \dots, (x_n, y_n))}{\operatorname{std}(x_1, \dots, x_n)\,\operatorname{std}(y_1, \dots, y_n)}$$
Cauchy-Schwarz inequality: for any $\vec{a}, \vec{b}$,
$$-1 \leq \frac{\vec{a}^T \vec{b}}{\|\vec{a}\|_2 \|\vec{b}\|_2} \leq 1$$
Consequence: $-1 \leq \rho((x_1, y_1), \dots, (x_n, y_n)) \leq 1$.
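Both definitions can be checked in numpy as below (hypothetical paired data; `np.cov` also uses the $1/(n-1)$ normalization):

```python
import numpy as np

x = np.array([16.0, 18.5, 21.0, 24.2, 27.8])  # hypothetical paired data
y = np.array([8.3, 10.1, 12.9, 15.0, 18.2])

cov = np.cov(x, y)[0, 1]                     # empirical covariance
rho = cov / (x.std(ddof=1) * y.std(ddof=1))  # empirical correlation coefficient
assert np.isclose(rho, np.corrcoef(x, y)[0, 1])
assert -1.0 <= rho <= 1.0                    # guaranteed by Cauchy-Schwarz
```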

$\rho = 0.269$
[Figure: scatter plot of April temperatures against August temperatures.]

$\rho = 0.962$
[Figure: scatter plot of minimum daily temperatures against maximum daily temperatures.]

Empirical covariance matrix

Empirical covariance matrix
Data: $\{\vec{x}_1, \vec{x}_2, \dots, \vec{x}_n\}$ ($d$ features). The empirical covariance matrix is defined as
$$\Sigma(\vec{x}_1, \dots, \vec{x}_n) := \frac{1}{n-1} \sum_{i=1}^{n} (\vec{x}_i - \operatorname{av}(\vec{x}_1, \dots, \vec{x}_n))(\vec{x}_i - \operatorname{av}(\vec{x}_1, \dots, \vec{x}_n))^T$$
The $(i, j)$ entry, $1 \leq i, j \leq d$, is given by
$$\Sigma(\vec{x}_1, \dots, \vec{x}_n)_{ij} = \begin{cases} \operatorname{var}((\vec{x}_1)_i, \dots, (\vec{x}_n)_i) & \text{if } i = j, \\ \operatorname{cov}(((\vec{x}_1)_i, (\vec{x}_1)_j), \dots, ((\vec{x}_n)_i, (\vec{x}_n)_j)) & \text{if } i \neq j. \end{cases}$$
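A numpy sketch of this definition on hypothetical data, checked against `np.cov`:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # hypothetical data: n = 100 points, d = 3 features

centered = X - X.mean(axis=0)
Sigma = centered.T @ centered / (len(X) - 1)  # d x d empirical covariance matrix
assert np.allclose(Sigma, np.cov(X, rowvar=False))
```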

Empirical variance in a certain direction
Let $\vec{v}$ be a unit-norm vector aligned with a direction of interest. Then
$$\begin{aligned}
\operatorname{var}\left(\vec{v}^T\vec{x}_1, \dots, \vec{v}^T\vec{x}_n\right)
&= \frac{1}{n-1} \sum_{i=1}^{n} \left(\vec{v}^T\vec{x}_i - \operatorname{av}\left(\vec{v}^T\vec{x}_1, \dots, \vec{v}^T\vec{x}_n\right)\right)^2 \\
&= \frac{1}{n-1} \sum_{i=1}^{n} \left(\vec{v}^T\left(\vec{x}_i - \operatorname{av}\left(\vec{x}_1, \dots, \vec{x}_n\right)\right)\right)^2 \\
&= \vec{v}^T \left(\frac{1}{n-1} \sum_{i=1}^{n} \left(\vec{x}_i - \operatorname{av}\left(\vec{x}_1, \dots, \vec{x}_n\right)\right)\left(\vec{x}_i - \operatorname{av}\left(\vec{x}_1, \dots, \vec{x}_n\right)\right)^T\right) \vec{v} \\
&= \vec{v}^T\, \Sigma\left(\vec{x}_1, \dots, \vec{x}_n\right)\, \vec{v}
\end{aligned}$$

Eigendecomposition of the covariance matrix
$$\Sigma\left(\vec{x}_1, \dots, \vec{x}_n\right) = U \Lambda U^T = \begin{bmatrix} \vec{u}_1 & \vec{u}_2 & \cdots & \vec{u}_d \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_d \end{bmatrix} \begin{bmatrix} \vec{u}_1 & \vec{u}_2 & \cdots & \vec{u}_d \end{bmatrix}^T$$

Eigendecomposition of the covariance matrix
For any symmetric matrix $A \in \mathbb{R}^{n \times n}$ with normalized eigenvectors $\vec{u}_1, \vec{u}_2, \dots, \vec{u}_n$ and corresponding eigenvalues $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_n$:
$$\lambda_1 = \max_{\|\vec{v}\|_2 = 1} \vec{v}^T A \vec{v}, \qquad \vec{u}_1 = \arg\max_{\|\vec{v}\|_2 = 1} \vec{v}^T A \vec{v}$$
$$\lambda_k = \max_{\|\vec{v}\|_2 = 1,\; \vec{v} \perp \vec{u}_1, \dots, \vec{u}_{k-1}} \vec{v}^T A \vec{v}, \qquad \vec{u}_k = \arg\max_{\|\vec{v}\|_2 = 1,\; \vec{v} \perp \vec{u}_1, \dots, \vec{u}_{k-1}} \vec{v}^T A \vec{v}$$
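A small numerical experiment consistent with this characterization, on a hypothetical random symmetric matrix: no random unit-norm direction attains more than $\lambda_1$, and the top eigenvector attains it exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = (B + B.T) / 2                     # hypothetical symmetric matrix

eigvals, eigvecs = np.linalg.eigh(A)  # eigenvalues in ascending order
lambda1, u1 = eigvals[-1], eigvecs[:, -1]
assert np.isclose(u1 @ A @ u1, lambda1)

V = rng.normal(size=(10000, 4))
V /= np.linalg.norm(V, axis=1, keepdims=True)  # random unit-norm directions
# v^T A v for every row v of V, all bounded by lambda_1 (up to roundoff):
assert (np.einsum("ij,jk,ik->i", V, A, V) <= lambda1 + 1e-9).all()
```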

Principal component analysis
Compute the eigenvectors of the empirical covariance matrix to determine the directions of maximum variation.
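A minimal PCA sketch via the eigendecomposition of the empirical covariance matrix (the 2D data here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])  # hypothetical correlated data

Xc = X - X.mean(axis=0)                   # center the data first
Sigma = np.cov(Xc, rowvar=False)
eigvals, U = np.linalg.eigh(Sigma)        # ascending eigenvalues
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]  # u_1 = U[:, 0]: direction of maximum variation

# The empirical variance in the direction u_1 equals lambda_1:
assert np.isclose(U[:, 0] @ Sigma @ U[:, 0], eigvals[0])
```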

Example: 2D data
[Figure: data with principal directions $\vec{u}_1$, $\vec{u}_2$; $\sigma_1/\sqrt{n} = 0.705$, $\sigma_2/\sqrt{n} = 0.690$.]

Example: 2D data
[Figure: data with principal directions $\vec{u}_1$, $\vec{u}_2$; $\sigma_1/\sqrt{n} = 0.9832$, $\sigma_2/\sqrt{n} = 0.3559$.]

Example: 2D data
[Figure: data with principal directions $\vec{u}_1$, $\vec{u}_2$; $\sigma_1/\sqrt{n} = 1.3490$, $\sigma_2/\sqrt{n} = 0.1438$.]

Centering is important!
[Figure: PCA on uncentered data; $\sigma_1/\sqrt{n} = 5.077$, $\sigma_2/\sqrt{n} = 0.889$; the first direction points toward the empirical mean rather than the direction of maximum variation.]

Centering is important!
[Figure: PCA on the same data after centering; $\sigma_1/\sqrt{n} = 1.261$, $\sigma_2/\sqrt{n} = 0.139$.]

Dimensionality reduction
Projection of the data onto a lower-dimensional space.
Applications: visualization, computational efficiency, denoising.
Example: seeds from 3 varieties of wheat (Kama, Rosa and Canadian), with 7 features: area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient and length of kernel groove.
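A sketch of the projection step on a hypothetical stand-in of the same shape (random numbers, not the actual seeds measurements):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(210, 7))  # hypothetical stand-in: 210 points with 7 features

Xc = X - X.mean(axis=0)
eigvals, U = np.linalg.eigh(np.cov(Xc, rowvar=False))
U = U[:, np.argsort(eigvals)[::-1]]  # principal components, decreasing variance

Z = Xc @ U[:, :2]  # project each point onto the first two PCs (e.g., for a scatter plot)
```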

PCA dimensionality reduction
[Figure: projection of the seeds data onto the first and second principal components.]

PCA dimensionality reduction
[Figure: projection of the seeds data onto the $(d-1)$th and $d$th principal components.]

Whitening
- Preprocessing procedure
- Linear transformation to eliminate skew in the data
- Enhances nonlinear structure
- After whitening, the data are uncorrelated

Whitening
Let $\vec{x}_1, \dots, \vec{x}_n$ be a set of $d$-dimensional centered data with a full-rank covariance matrix. To whiten the data we:
1. Compute the eigendecomposition of the empirical covariance matrix, $\Sigma\left(\vec{x}_1, \dots, \vec{x}_n\right) = U \Lambda U^T$
2. For $i = 1, \dots, n$ set
$$\vec{y}_i := \sqrt{\Lambda}^{-1} U^T \vec{x}_i, \qquad \sqrt{\Lambda} := \begin{bmatrix} \sqrt{\lambda_1} & 0 & \cdots & 0 \\ 0 & \sqrt{\lambda_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sqrt{\lambda_d} \end{bmatrix}$$
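A numpy sketch of this procedure on hypothetical data; the final check confirms that the whitened data have identity covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.2]])  # hypothetical data
X -= X.mean(axis=0)  # whitening assumes centered data

eigvals, U = np.linalg.eigh(np.cov(X, rowvar=False))  # step 1: eigendecomposition
Y = (X @ U) / np.sqrt(eigvals)                        # step 2: y_i = sqrt(Lambda)^{-1} U^T x_i

assert np.allclose(np.cov(Y, rowvar=False), np.eye(2))
```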

Whitening
The covariance matrix of the whitened data is the identity:
$$\begin{aligned}
\Sigma\left(\vec{y}_1, \dots, \vec{y}_n\right)
&:= \frac{1}{n-1} \sum_{i=1}^{n} \vec{y}_i \vec{y}_i^T \\
&= \frac{1}{n-1} \sum_{i=1}^{n} \left(\sqrt{\Lambda}^{-1} U^T \vec{x}_i\right) \left(\sqrt{\Lambda}^{-1} U^T \vec{x}_i\right)^T \\
&= \sqrt{\Lambda}^{-1} U^T \left(\frac{1}{n-1} \sum_{i=1}^{n} \vec{x}_i \vec{x}_i^T\right) U \sqrt{\Lambda}^{-1} \\
&= \sqrt{\Lambda}^{-1} U^T\, \Sigma\left(\vec{x}_1, \dots, \vec{x}_n\right)\, U \sqrt{\Lambda}^{-1} \\
&= \sqrt{\Lambda}^{-1} U^T U \sqrt{\Lambda} \sqrt{\Lambda}\, U^T U \sqrt{\Lambda}^{-1} \\
&= I
\end{aligned}$$
(The first equality uses that the data are centered.)

[Figures: whitening illustrated in three panels: the original data $\vec{x}$, the rotated data $U^T \vec{x}$, and the whitened data $\sqrt{\Lambda}^{-1} U^T \vec{x}$.]