Pollution Sources Detection via Principal Component Analysis and Rotation

Vanessa Kuentz¹, in collaboration with Marie Chavent¹, Hervé Guégan², Brigitte Patouille¹ and Jérôme Saracco¹,³

¹ IMB, Université Bordeaux 1, Talence, France (e-mail: vanessa.kuentz@math.u-bordeaux1.fr)
² ARCANE-CENBG, Gradignan, France
³ GREThA, Université Bordeaux 4, Pessac, France

Applied Stochastic Models and Data Analysis, Crete, 2007

Outline

1. Factor Analysis: Notations; Recall about PCA; Factor Analysis model; Link between FA and PCA
2. Motivation; Criterion; Remark on rotation in PCA
3. A three steps process; Statistical treatment; Some confirmation of the results

Issue

- Air pollution = small particles + liquid droplets
- High concentrations of particles are a danger to human health, especially fine particles
- Developing air quality control strategies requires the identification of pollution sources
- Receptor modeling: the most commonly used method is Principal Component Analysis
- Part of a scientific project initiated by the French Ministry of Ecology

Notations

- X = (x_i^j)_{n,p}: numerical data matrix, where n objects are described by p < n variables x^1, ..., x^p
- X̃ = (x̃_i^j)_{n,p}: standardized data matrix, with x̃_i^j = (x_i^j - x̄^j) / s^j (x̄^j and s^j: sample mean and standard deviation of x^j)
- R: sample correlation matrix of x^1, ..., x^p, R = X̃' M X̃ where M = (1/m) I_n (with m = n or n - 1)
- We have R = Z'Z with Z = M^{1/2} X̃
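
To make the notation concrete, here is a minimal numpy sketch; the data matrix X is a hypothetical stand-in (the sizes n = 61, p = 16 are borrowed from the application later in the talk) and the m = n convention is assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 61, 16                         # sizes borrowed from the application slides
X = rng.normal(size=(n, p))           # hypothetical stand-in for the data matrix

m = n                                 # the slides allow m = n or m = n - 1; m = n is used here
X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized data (std with divisor m = n)
M = (1.0 / m) * np.eye(n)             # M = (1/m) I_n
Z = np.sqrt(1.0 / m) * X_std          # Z = M^{1/2} X_std

R = X_std.T @ M @ X_std               # sample correlation matrix R = X_std' M X_std
assert np.allclose(R, Z.T @ Z)        # R = Z'Z
assert np.allclose(np.diag(R), 1.0)   # diagonal of a correlation matrix
```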

Recall about PCA

Goal: replace the p variables by q ≤ p uncorrelated components.
The SVD of Z (of rank r ≤ p) is Z = U Λ^{1/2} V' with:
- Λ: diagonal (r, r) matrix of the non-null eigenvalues λ_1, ..., λ_r of Z'Z (in decreasing order)
- U: orthonormal (n, r) matrix of eigenvectors of ZZ' associated with the first r eigenvalues
- V: orthonormal (p, r) matrix of eigenvectors of Z'Z associated with the first r eigenvalues
Decomposition of X̃: X̃ = M^{-1/2} U Λ^{1/2} V'
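
Continuing the same hypothetical sketch, the SVD of Z and the resulting decomposition of X̃ can be checked numerically:

```python
# Thin SVD of Z (here Z has full column rank, so r = p)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)   # Z = U diag(s) V', with diag(s) = Lambda^{1/2}
V = Vt.T
Lam = np.diag(s**2)                                # eigenvalues of Z'Z, in decreasing order

assert np.allclose(Z.T @ Z @ V, V @ Lam)           # columns of V: eigenvectors of Z'Z
assert np.allclose(Z @ Z.T @ U, U @ Lam)           # columns of U: eigenvectors of ZZ'

# Decomposition of the standardized data: X_std = M^{-1/2} U Lambda^{1/2} V'
M_inv_sqrt = np.sqrt(m) * np.eye(n)
assert np.allclose(X_std, M_inv_sqrt @ U @ np.diag(s) @ Vt)
```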

Recall about PCA (continued)

q = r: in PCA, the equation X̃ = M^{-1/2} U Λ^{1/2} V' is written X̃ = Ψ V' with Ψ = M^{-1/2} U Λ^{1/2}.
We have Ψ = X̃ V (since V'V = I_r). The columns of Ψ are the principal components ψ_k, k = 1, ..., q.
q < r: in practice, the user retains only the first q < r eigenvalues of Λ.
Approximation of X̃: X̂_q = Ψ_q V_q' (Ψ_q, V_q: matrices Ψ and V reduced to their first q columns).
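
Continuing the sketch, the principal components and the rank-q approximation (q = 5 anticipates the application):

```python
# Principal components: Psi = X_std V = M^{-1/2} U Lambda^{1/2}
Psi = X_std @ V
assert np.allclose(Psi, M_inv_sqrt @ U @ np.diag(s))

# In practice only the first q components are retained
q = 5
Psi_q, V_q = Psi[:, :q], V[:, :q]
X_hat_q = Psi_q @ V_q.T               # rank-q approximation of X_std
```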

Factor Analysis model

FA: underlying common factors + dimension reduction.
The model is written x = A f + e, where x is (p × 1), A is (p × q), f is (q × 1) and e is (p × 1), with:
- x = (x^1, ..., x^p)': random vector of R^p
- A: loading (or pattern) matrix
- f = (f^1, ..., f^q)': random vector of q common factors
- e = (e^1, ..., e^p)': random vector of p unique factors
Each x^j is a linear combination of the common factors plus a unique factor.

The FA model can be estimated by various methods:
- Principal factor method
- Maximum likelihood method
- Principal component method
- ...
The most commonly used method is PCA:
- Easy implementation
- Reasonable results
- Default method in statistical software

Link between FA and PCA

q = r: in FA (with PCA estimation), the equation X̃ = M^{-1/2} U Λ^{1/2} V' is written X̃ = F A' with:
- F = M^{-1/2} U: factor scores matrix
- A = V Λ^{1/2}: loading (or pattern) matrix
We have F = X̃ V Λ^{-1/2} (since V'V = I_r).
q < r: in practice, the user retains only the first q < r eigenvalues of Λ.
Approximation of X̃: X̂_q = F_q A_q' (F_q, A_q: matrices F and A reduced to their first q columns).
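
The same decomposition written in the FA form, still within the hypothetical sketch:

```python
# FA with PCA estimation: X_std = F A' with F = M^{-1/2} U and A = V Lambda^{1/2}
F = M_inv_sqrt @ U                    # factor scores matrix
A = V * s                             # loading matrix: scales column k of V by sqrt(lambda_k)
assert np.allclose(X_std, F @ A.T)
assert np.allclose(F, X_std @ V / s)  # F = X_std V Lambda^{-1/2}

# Keeping the first q factors gives the same rank-q approximation as Psi_q V_q'
F_q, A_q = F[:, :q], A[:, :q]
assert np.allclose(F_q @ A_q.T, X_hat_q)
```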

We have:
- PCA: Ψ = X̃ V
- FA:  F = X̃ V Λ^{-1/2}
so F = Ψ Λ^{-1/2}: the factors f_k correspond to the standardized principal components, f_k = ψ_k / √λ_k, k = 1, ..., q.
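
A short numerical check of this correspondence, continuing the sketch:

```python
# Factors are the standardized principal components: f_k = psi_k / sqrt(lambda_k)
assert np.allclose(F, Psi / s)               # divides column k of Psi by sqrt(lambda_k)
assert np.allclose(F.T @ M @ F, np.eye(p))   # factors have unit variance and are uncorrelated (w.r.t. M)
```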

Motivation

Loading matrix A_q: a_j^k = corr(x^j, f^k) → identification of groups of correlated elements.
Problem: intermediate values make the association between variables and factors difficult.
Objective:
- In each column of A_q: values close to 0 or 1
- In each row of A_q: only one value close to 1
Solution:
- Exploit the non-uniqueness of the FA solution: rotation of the factors
- Orthogonal transformation matrix T (TT' = T'T = I_q)
- Ă_q = A_q T: correlations of the variables with the rotated factors f̆_k
- How to determine T?
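
Continuing the sketch, the interpretation of the loadings as correlations can be verified directly (this holds with the m = n convention used above):

```python
# Loadings are correlations between original variables and factors: a_j^k = corr(x^j, f^k)
for j in range(p):
    for k in range(q):
        r_jk = np.corrcoef(X[:, j], F_q[:, k])[0, 1]
        assert np.isclose(A_q[j, k], r_jk)
```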

Criterion

Several criteria exist; the most widely used is varimax. It maximizes the empirical variance of the squared elements of each column of Ă_q = (ă_j^α)_{p,q}:

sum_{α=1..q} [ (1/p) sum_{j=1..p} (ă_j^α)^4 − ( (1/p) sum_{j=1..p} (ă_j^α)^2 )^2 ]

with ă_j^α = a_j' t^α (a_j: jth row of A_q, t^α: αth column of T), under the constraints (t^α)' t^α = 1 and (t^l)' t^k = 0 for k ≠ l.
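
The talk does not detail the maximization algorithm; as an illustration only, here is a compact sketch of the standard SVD-based varimax iteration used in many statistical packages, applied to the hypothetical A_q from the sketch:

```python
def varimax(A, max_iter=1000, tol=1e-8):
    """Orthogonal varimax rotation of a (p, q) loading matrix A.
    Returns the rotated loadings A @ T and the orthogonal matrix T (T T' = T' T = I_q)."""
    p, q = A.shape
    T = np.eye(q)
    crit = 0.0
    for _ in range(max_iter):
        B = A @ T                                      # current rotated loadings
        G = A.T @ (B**3 - B * (B**2).mean(axis=0))     # gradient-like matrix of the varimax criterion
        u_, s_, vt_ = np.linalg.svd(G)
        T = u_ @ vt_                                   # projection back onto orthogonal matrices
        crit_new = s_.sum()
        if crit_new < crit * (1 + tol):                # stop when the criterion no longer improves
            break
        crit = crit_new
    return A @ T, T

A_rot, T = varimax(A_q)
assert np.allclose(T @ T.T, np.eye(A_q.shape[1]))      # T is indeed orthogonal
```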

Remark on rotation in PCA

Rotation is possible in PCA as in FA, but T must be applied to the right matrices in order to keep the components uncorrelated.
One must not write:
  X̃ = (M^{-1/2} U Λ^{1/2} T)(T' V')        [rotation applied to Ψ]
but:
  X̃ = (M^{-1/2} U T)(T' Λ^{1/2} V')        [rotation applied to F]
i.e. it is the standardized principal components (the factors of FA) that must be rotated.
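
Continuing the sketch, this remark can be checked numerically: rotating F_q (and A_q) with the same orthogonal T leaves the reconstruction unchanged and keeps the factors uncorrelated, whereas rotating Ψ_q would not:

```python
# Apply the same orthogonal T to the factor scores and to the loadings
F_rot = F_q @ T
assert np.allclose(F_rot @ (A_q @ T).T, F_q @ A_q.T)   # rank-q reconstruction is unchanged
assert np.allclose(F_rot.T @ M @ F_rot, np.eye(q))     # rotated factors stay standardized and uncorrelated

# Rotating the (unstandardized) principal components instead loses uncorrelatedness:
C = (Psi_q @ T).T @ M @ (Psi_q @ T)                    # = T' Lambda_q T, in general not diagonal
print(np.allclose(C, np.diag(np.diag(C))))             # typically False for a non-trivial T
```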

A three steps process

Part of a scientific project of the French Ministry of Ecology: improving air quality through the identification and quantification of pollution sources.
Three steps:
- Collection of fine particles at a French urban site (Anglet): n = 61 samples (Dec. 2005)
- Measurement of p = 16 chemical species by the PIXE method (Particle Induced X-ray Emission)
- Statistical treatment of X = (x_i^j)_{n,p}, where x_i^j is the concentration of the jth chemical compound in the ith sample

Statistical treatment

- FA model with PCA estimation
- Choice of the number of factors: q = 5
- Orthogonal rotation of the factors with the varimax criterion
- Detection of groups of correlated chemical compounds → identification of air pollution sources
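
Putting the previous pieces together, a minimal sketch of this treatment (FA with PCA estimation, q factors, varimax rotation); the function name and the data are hypothetical, and it reuses the varimax() sketch above:

```python
def fa_pca_varimax(X, q):
    """FA with PCA estimation followed by a varimax rotation, as outlined above.
    X: (n, p) raw data matrix. Returns rotated loadings (p, q) and rotated factor scores (n, q)."""
    n, p = X.shape
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    Z = X_std / np.sqrt(n)                       # m = n convention
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    F_q = np.sqrt(n) * U[:, :q]                  # factor scores F_q = M^{-1/2} U_q
    A_q = Vt.T[:, :q] * s[:q]                    # loadings A_q = V_q Lambda_q^{1/2}
    A_rot, T = varimax(A_q)                      # varimax() defined in the earlier sketch
    return A_rot, F_q @ T

# Hypothetical usage on a (61, 16) matrix of chemical concentrations:
loadings, scores = fa_pca_varimax(X, q=5)
```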

Table: Loading matrix before rotation, A_5

          f1     f2     f3     f4     f5
Al2O3    0.67  -0.66   0.22  -0.19   0.02
SiO2     0.65  -0.67   0.25  -0.20  -0.06
P        0.68  -0.64   0.22  -0.24   0.01
SO4      0.59   0.45  -0.41  -0.22   0.18
Cl      -0.47  -0.28   0.19   0.68   0.34
K        0.89  -0.15  -0.21   0.00   0.28
Ca       0.64  -0.41   0.10   0.40  -0.19
Mn       0.38   0.78   0.18   0.09  -0.25
Fe2O3    0.79   0.32   0.04   0.27  -0.36
Cu       0.80   0.25  -0.10   0.23  -0.36
Zn       0.35   0.66   0.59  -0.07   0.25
Br       0.75  -0.18  -0.13   0.35   0.36
Pb       0.43   0.66   0.52  -0.09   0.30
C-Org    0.60   0.30  -0.61  -0.02   0.22

Many intermediate values → association between variables and factors is difficult.

Table: Rotated loading matrix Ă_5

          f1     f2     f3     f4     f5
Al2O3    0.98   0.09  -0.04   0.07  -0.04
SiO2     0.98   0.01  -0.06   0.10  -0.07
P        0.97   0.09  -0.02   0.07  -0.09
SO4     -0.03   0.77   0.25   0.18  -0.35
Cl      -0.15  -0.27  -0.14  -0.18   0.88
K        0.60   0.72   0.11   0.23   0.03
Ca       0.61   0.09  -0.11   0.56   0.27
Mn      -0.28   0.12   0.60   0.58  -0.24
Fe2O3    0.20   0.28   0.29   0.85  -0.11
Cu       0.21   0.36   0.16   0.82  -0.15
Zn      -0.03   0.05   0.98   0.13  -0.04
Br       0.49   0.62   0.10   0.28   0.39
Pb       0.00   0.16   0.97   0.13  -0.05
C-Org   -0.02   0.89   0.02   0.22  -0.16

Table: Pollution sources

Factor 1   Soil dust
Factor 2   Combustion
Factor 3   Industry
Factor 4   Vehicle
Factor 5   Sea

Obvious usefulness of the rotation: clear associations. For example, Zn and Pb are highly correlated with f3 → industrial pollution source.

Some confirmation of the results

Rotated factors: columns of F̆_5 = F_5 T = (f̆_i^k)_{n,5}.
The score f̆_i^k is the relative contribution of the kth source to the ith sample.
The rotated factors are confronted with external parameters:
- Meteorological data (wind, temperature, ...)
- Day/night periodicity
- Other chemical compounds

Figure: Evolution of the vehicle pollution source.
Stronger contribution during the day than at night → confirmation of the vehicle pollution source.

Figure: Evolution of the heating pollution source and of temperatures.
In the middle of the period, an increase in the contribution coincides with a decrease in temperature → confirmation of the heating pollution source.

Figure: Map of the sampling site and correlations between wind and sea pollution.
Strong correlation with the north-west wind, consistent with the location facing the Atlantic Ocean → validation of the sea pollution source.

Conclusion

- FA (with PCA estimation) followed by rotation: identification of five sources of particulate emission
- No prior knowledge: number and composition of the sources? More complex sampling sites
- Quantification of the sources: percentage of the total fine dust mass attributable to each source (JDS, Angers, 2007)