MS-E2112 Multivariate Statistical Analysis (5cr) Lecture 1: Introduction, Multivariate Location and Scatter


Contents

Lecturer: Pauliina Ilmonen, pauliina.ilmonen(a)aalto.fi. Lectures on Mondays 12.15-14.00 (2.1. - 6.2., 20.2. - 27.3.), U147 (U5). Exercises by Niko Lietzen, niko.lietzen(a)aalto.fi, and Sami Helander, sami.helander(a)aalto.fi. There are two exercise groups; choose the one that fits your schedule better. Weekly exercises Group 1 on Thursdays 16.15-18.00 (5.1. - 9.2., 23.2. - 30.3.), U344 (Sami). Weekly exercises Group 2 on Fridays 10.15-12.00 (6.1. - 10.2., 24.2. - 31.3.), U257 (Niko). Book: K. V. Mardia, J. T. Kent, J. M. Bibby: Multivariate Analysis, Academic Press, London, 2003 (reprint of 1979).

Self Study Make sure that you know how to calculate the univariate means, medians, variances, and max and min values. Familiarize yourself with the correlation coefficients and with common graphical presentations of data (boxplots, scatter plots, histograms, bar plots, pie charts). Make sure that you know what a cumulative distribution function, a probability density function, and a probability mass function are. Make sure that you know what the expected value of a random variable is. Read about univariate and multivariate normal distributions and about elliptical distributions. Make sure that you know what is meant by centrally symmetric distributions and by skewed distributions.

How to pass this course? You are expected to Attend the lectures and be active - not compulsory, no points, but highly recommended. Submit your project work on time - THIS IS COMPULSORY - max 6 points. Take the exam - max 24 points. (The next examinations are on Thursday 6.4. at 9-12 and on Wednesday 24.5. at 16.30-19.30.) Submit your homework exercises on time - not compulsory, but highly recommended - max 3 points. Participate in the weekly exercises (group 1 OR group 2) - not compulsory, but highly recommended - max 3 points. Max total points = 6 + 24 + 3 + 3 = 36. You need at least 16 points in order to pass the course.

How to get a good grade? Attend the lectures and be active! Work hard on your project work. Be active in exercises! Study for the exam! Grading is based on the total points as follows: 16p -> 1, 20p -> 2, 24p -> 3, 28p -> 4, 32p -> 5.

Project Work Find a multivariate (at least 3-variate) dataset (Tilastokeskus, OECD, collect yourself,...), set a research question, and perform multivariate analysis. Write a report (max 10 pages), print it, and submit it to Sami, Niko, or Pauliina by Wed 5.4.2017 at the latest! Goals of the project work: description of the research questions, description of the dataset, univariate and bivariate statistical analysis to present the variables, application of multivariate statistical methods to answer the research questions (justification and output), conclusions and answers to the questions raised at the beginning, and critical evaluation of the analysis. Remember that "no findings" is also a finding! Note that you will automatically get 0 points from the exam if you do not return your project work on time!

The first step of all statistical analysis is univariate and bivariate analysis. First calculate the univariate means, medians, variances, and max and min values. Then calculate the correlation coefficients. And take a look at your data literally! Make histograms of continuous variables and pie charts of categorical variables. Make boxplots to detect univariate outliers, and make scatter plots to detect bivariate structures. Note that visualization is not always easy when the data contains a large number of individuals, but do not skip plotting your data! It is very important that you get familiar with your data before you conduct any large multivariate analysis.
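As a rough illustration (not part of the original slides), the following Python sketch carries out this kind of univariate and bivariate look at a dataset; the file name data.csv, the column contents, and the use of pandas/matplotlib are assumptions made only for this example.

```python
# A minimal exploratory-analysis sketch. The file "data.csv" and its numeric
# columns are hypothetical; replace them with your own project data.
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("data.csv")

# Univariate summaries: means, medians, variances, min and max values.
print(data.describe())
print(data.median(numeric_only=True))
print(data.var(numeric_only=True))

# Bivariate summaries: correlation coefficients between numeric variables.
print(data.corr(numeric_only=True))

# Look at the data literally: histograms, boxplots, and scatter plots.
data.hist()                       # histograms of the continuous variables
data.plot(kind="box")             # boxplots to detect univariate outliers
pd.plotting.scatter_matrix(data)  # scatter plots to detect bivariate structure
plt.show()
```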

Multivariate Data Let $x$ denote a $p$-variate random vector with cumulative distribution function $F_x$, and let $X = [x_1 \; \cdots \; x_n]$, where $x_1, \ldots, x_n$ are i.i.d. observations from the distribution $F_x$.

Location Functionals Definition: A $p \times 1$ vector-valued functional $T(F_x)$, which is affine equivariant in the sense that $T(F_{Ax+b}) = A\, T(F_x) + b$ for all nonsingular $p \times p$ matrices $A$ and for all $p$-vectors $b$, is called a location functional.

Scatter Functionals Definition: A $p \times p$ matrix-valued functional $S(F_x)$, which is positive definite and affine equivariant in the sense that $S(F_{Ax+b}) = A\, S(F_x)\, A^T$ for all nonsingular $p \times p$ matrices $A$ and for all $p$-vectors $b$, is called a scatter functional.

Location Estimates The corresponding sample statistics are obtained when the functionals are applied to the empirical cumulative distribution function $F_n$ based on a sample $x_1, x_2, \ldots, x_n$. The notation $T(F_n)$ and $S(F_n)$, or $T(X)$ and $S(X)$, is used for the sample statistics. The location and scatter sample statistics then also satisfy $T(AX + b\,1_n^T) = A\, T(X) + b$ and $S(AX + b\,1_n^T) = A\, S(X)\, A^T$ for all nonsingular $p \times p$ matrices $A$ and for all $p$-vectors $b$.

Scatter Functionals Scatter matrix functionals are usually standardized so that, in the case of the standard multivariate normal distribution, $S(F_x) = I$.

Shape Functionals Definition: If a positive definite $p \times p$ matrix-valued functional $S(F_x)$ satisfies that $S(F_{Ax+b})$ is proportional to $A\, S(F_x)\, A^T$ for all nonsingular $p \times p$ matrices $A$ and for all $p$-vectors $b$, then $S(F_x)$ is called a shape functional.

Traditional Functionals The first examples of location and scatter functionals are the mean vector and the regular covariance matrix: $T_1(F_x) = E(x)$ and $S_1(F_x) = \mathrm{Cov}(F_x) = E\big((x - E(x))(x - E(x))^T\big)$.

The Sample Mean Vector and the Sample Covariance Matrix Traditional estimates of the mean vector and the covariance matrix are calculated as follows: $T_1(X) = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $S_1(X) = \mathrm{Cov}(X) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - T_1(X))(x_i - T_1(X))^T$.
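As a small numerical sketch (not from the slides), the following Python code computes $T_1(X)$ and $S_1(X)$ for a data matrix with observations as columns and checks the affine equivariance property stated earlier; the helper names and the simulated data are my own choices for illustration.

```python
# Sketch: sample mean vector T1(X), sample covariance matrix S1(X), and a
# numerical check of affine equivariance. X has p rows and n columns.
import numpy as np

def T1(X):
    # T1(X) = (1/n) * sum_i x_i
    return X.mean(axis=1)

def S1(X):
    # S1(X) = 1/(n-1) * sum_i (x_i - T1(X)) (x_i - T1(X))^T
    centered = X - T1(X)[:, None]
    return centered @ centered.T / (X.shape[1] - 1)

rng = np.random.default_rng(0)
p, n = 3, 500
X = rng.normal(size=(p, n))

# Affine transformation AX + b 1_n^T with an (almost surely) nonsingular A.
A = rng.normal(size=(p, p))
b = rng.normal(size=p)
Y = A @ X + b[:, None]

print(np.allclose(T1(Y), A @ T1(X) + b))    # T(AX + b 1^T) = A T(X) + b
print(np.allclose(S1(Y), A @ S1(X) @ A.T))  # S(AX + b 1^T) = A S(X) A^T
```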

Why do we need other location and scatter measures???

Scatter Functionals There are several other location and scatter functionals, even families of them, having different desirable properties (robustness, efficiency, limiting multivariate normality, fast computations, etc).

Moment Based Functionals Location and scatter functionals can be based on the third and fourth moments as well. A location functional based on third moments is $T_2(F_x) = \frac{1}{p}\, E\big((x - E(x))^T \mathrm{Cov}(F_x)^{-1} (x - E(x))\, x\big)$, and a scatter matrix functional based on fourth moments is $S_2(F_x) = \frac{1}{p+2}\, E\big((x - E(x))(x - E(x))^T \mathrm{Cov}(F_x)^{-1} (x - E(x))(x - E(x))^T\big)$.
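A hedged numpy sketch (not from the slides) of the corresponding sample statistics, obtained by replacing the expectations with sample averages over the columns of the data matrix; the function names T2, S2, and _mahalanobis_sq are my own.

```python
# Sketch: sample versions of the third-moment location T2 and the fourth-moment
# scatter S2. X has p rows (variables) and n columns (observations).
import numpy as np

def _mahalanobis_sq(X):
    # d2_i = (x_i - mean)^T Cov(X)^{-1} (x_i - mean) for each observation i
    mean = X.mean(axis=1)
    centered = X - mean[:, None]
    cov_inv = np.linalg.inv(np.cov(X))
    return centered, np.einsum("ji,jk,ki->i", centered, cov_inv, centered)

def T2(X):
    p, n = X.shape
    _, d2 = _mahalanobis_sq(X)
    # (1/p) * average of d2_i * x_i over the observations
    return (d2 * X).mean(axis=1) / p

def S2(X):
    p, n = X.shape
    centered, d2 = _mahalanobis_sq(X)
    # (1/(p+2)) * average of d2_i * (x_i - mean)(x_i - mean)^T
    return (centered * d2) @ centered.T / (n * (p + 2))
```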

Example 1: Bivariate Normal Distribution In this example we consider the bivariate normal distribution $N(\mu, A)$, where $A = \begin{pmatrix} 4 & 2 \\ 2 & 3 \end{pmatrix}$ and $\mu = \begin{pmatrix} 0 \\ 10 \end{pmatrix}$.

Example 1: Bivariate Normal Distribution [Figure: scatter plot of the simulated data ($x_1$ vs. $x_2$) with the location estimates $T_1(X)$ and $T_2(X)$ marked.]

Example 1: Bivariate Normal Distribution We simulated 100 samples from $N(\mu, A)$ and then calculated the sample mean vector $T_1(X)$, the location vector based on third moments $T_2(X)$, the sample covariance matrix $S_1(X)$, and the scatter matrix based on fourth moments $S_2(X)$ for each sample. In order to compare $T_1(X)$, $T_2(X)$, $S_1(X)$, and $S_2(X)$, we calculated the means of the estimates: $T_1(X): \begin{pmatrix} 0.006703295 \\ 10.001765054 \end{pmatrix}$, $T_2(X): \begin{pmatrix} 0.01626947 \\ 9.99082058 \end{pmatrix}$, $S_1(X): \begin{pmatrix} 4.029396 & 2.034711 \\ 2.034711 & 2.968536 \end{pmatrix}$, $S_2(X): \begin{pmatrix} 3.9197916 & 2.003406 \\ 2.003406 & 2.924344 \end{pmatrix}$. Both location estimates seem to estimate the parameter $\mu$, and both scatter estimates seem to estimate the parameter $A$.
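A sketch (not from the slides) of how such a simulation could be reproduced, assuming the T1, S1, T2, and S2 helper functions sketched above are in scope; the per-replication sample size n = 1000 is an assumption, since the slide does not state it.

```python
# Sketch of the Example 1 simulation: repeated samples from N(mu, A), with the
# four estimates averaged over the replications. Sample size n is assumed.
import numpy as np

mu = np.array([0.0, 10.0])
A = np.array([[4.0, 2.0],
              [2.0, 3.0]])

rng = np.random.default_rng(1)
n, reps = 1000, 100

t1s, t2s, s1s, s2s = [], [], [], []
for _ in range(reps):
    X = rng.multivariate_normal(mu, A, size=n).T   # p x n data matrix
    t1s.append(T1(X)); t2s.append(T2(X))
    s1s.append(S1(X)); s2s.append(S2(X))

print(np.mean(t1s, axis=0))   # both location estimates should be close to mu
print(np.mean(t2s, axis=0))
print(np.mean(s1s, axis=0))   # both scatter estimates should be close to A
print(np.mean(s2s, axis=0))
```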

Example 2: Independent Components, Skewed Distributions In this example we consider Gamma($\alpha$, $\beta$) and $\chi^2(k)$ distributions, where $\alpha = 2$, $\beta = 0.5$, and $k = 3$.

Example 2: Independent Components, Skewed Distributions [Figure: scatter plot of the simulated data ($y_1$ vs. $y_2$) with the location estimates $T_1(Y)$ and $T_2(Y)$ marked.]

Example 2: Independent Components, Skewed Distributions As in Example 1, we ran the simulation 100 times and calculated the means: $T_1(Y): \begin{pmatrix} 4.031022 \\ 2.964918 \end{pmatrix}$, $T_2(Y): \begin{pmatrix} 5.944029 \\ 4.740199 \end{pmatrix}$, $S_1(Y): \begin{pmatrix} 8.16111692 & 0.04234064 \\ 0.04234064 & 5.76640662 \end{pmatrix}$, $S_2(Y): \begin{pmatrix} 13.4080726 & 0.1142734 \\ 0.1142734 & 9.8194396 \end{pmatrix}$. Here the location estimates differ significantly from each other. Also the scatter estimates differ significantly from each other. Note also that the off-diagonal elements of both scatter estimates are small.
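A similar sketch (not from the slides) for generating the Example 2 data with independent skewed components, again assuming the earlier helper functions are in scope; interpreting $\beta$ as a rate parameter (so numpy's scale argument is $1/\beta$) is my assumption, chosen because it matches the reported means of about 4 and 3.

```python
# Sketch of the Example 2 data: independent skewed components
# y1 ~ Gamma(alpha = 2, beta = 0.5) and y2 ~ chi^2(3). Treating beta as a rate
# parameter (scale = 1/beta) is an assumption.
import numpy as np

rng = np.random.default_rng(2)
n = 1000
y1 = rng.gamma(shape=2.0, scale=1.0 / 0.5, size=n)
y2 = rng.chisquare(df=3, size=n)
Y = np.vstack([y1, y2])        # 2 x n data matrix with independent components

# With skewed marginals, T1(Y) and T2(Y) no longer agree, and S1(Y) and S2(Y)
# differ as well, while the off-diagonal elements stay small (independence).
print(T1(Y), T2(Y))
print(S1(Y))
print(S2(Y))
```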

Location Functionals Under Symmetry Assumptions We now consider the behavior of scatter and location functionals under some symmetry assumptions.

Theorem Under the assumption of central symmetry, all location functionals are equal to the center of symmetry. Proof:...
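The proof is left open on the slide; a sketch of the standard argument (my own wording, relying only on the affine equivariance in the definition of a location functional) is given below in LaTeX.

```latex
% Sketch of the standard argument (not from the slides).
% Central symmetry about $\mu$ means that $x - \mu$ and $-(x - \mu)$ have the
% same distribution, i.e. $F_{-x + 2\mu} = F_x$. Applying affine equivariance
% with $A = -I_p$ and $b = 2\mu$ gives
\[
  T(F_x) = T(F_{-x + 2\mu}) = -\,T(F_x) + 2\mu
  \quad\Longrightarrow\quad
  T(F_x) = \mu .
\]
```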

Theorem Under the assumption of multivariate elliptical distribution, all scatter functionals are proportional. Proof:...
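Again the proof is left open on the slide; a sketch of the standard argument (my own wording, assuming a full-rank elliptical distribution) is given below in LaTeX.

```latex
% Sketch of the standard argument (not from the slides).
% Write the elliptically distributed $x$ as $x = \mu + A z$ with $A$
% nonsingular and $z$ spherically symmetric, so that $F_{Oz} = F_z$ for every
% orthogonal $O$. Affine equivariance then gives
\[
  S(F_z) = S(F_{Oz}) = O\, S(F_z)\, O^T \quad \text{for all orthogonal } O,
\]
% and the only symmetric positive definite matrices with this property are
% multiples of the identity, $S(F_z) = c_S I$ with $c_S > 0$ depending on the
% functional $S$. Hence
\[
  S(F_x) = S(F_{A z + \mu}) = A\, S(F_z)\, A^T = c_S\, A A^T ,
\]
% so any two scatter functionals are proportional to each other (each is
% proportional to $A A^T$).
```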

Note that, in general, different location functionals do not measure the same population quantities. The same is true for scatter functionals: different scatter functionals do not necessarily measure the same population quantities!

Next Week Next week we will talk about principal component analysis (PCA).

References I K. V. Mardia, J. T. Kent, J. M. Bibby, Multivariate Analysis, Academic Press, London, 2003 (reprint of 1979). H. Oja, Multivariate Nonparametric Methods With R, Springer-Verlag, New York, 2010.

References II R. V. Hogg, J. W. McKean, A. T. Craig, Introduction to Mathematical Statistics, Pearson Education, Upper Saddle River, 2005. R. A. Horn, C. R. Johnson, Matrix Analysis, Cambridge University Press, New York, 1985. R. A. Horn, C. R. Johnson, Topics in Matrix Analysis, Cambridge University Press, New York, 1991.