Statistical methods and data analysis

Similar documents
* Tuesday 17 January :30-16:30 (2 hours) Recored on ESSE3 General introduction to the course.

HANDBOOK OF APPLICABLE MATHEMATICS

Data Analysis. with Excel. An introduction for Physical scientists. LesKirkup university of Technology, Sydney CAMBRIDGE UNIVERSITY PRESS

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Stat 5101 Lecture Notes

Kumaun University Nainital

Sigmaplot di Systat Software

Applied Probability and Stochastic Processes

Index. Cambridge University Press Data Analysis for Physical Scientists: Featuring Excel Les Kirkup Index More information

Subject CS1 Actuarial Statistics 1 Core Principles

DETAILED CONTENTS PART I INTRODUCTION AND DESCRIPTIVE STATISTICS. 1. Introduction to Statistics

THE PRINCIPLES AND PRACTICE OF STATISTICS IN BIOLOGICAL RESEARCH. Robert R. SOKAL and F. James ROHLF. State University of New York at Stony Brook

Experimental Design and Data Analysis for Biologists

Methods of statistical and numerical analysis (integrated course). Part I

Chapter 1 Statistical Inference

Introduction to Statistical Analysis

Index. Geostatistics for Environmental Scientists, 2nd Edition R. Webster and M. A. Oliver 2007 John Wiley & Sons, Ltd. ISBN:

FRANKLIN UNIVERSITY PROFICIENCY EXAM (FUPE) STUDY GUIDE

Psychology 282 Lecture #4 Outline Inferences in SLR

Statistical Methods for Astronomy

Linear Models in Statistics

Multivariate Regression

Department of Mathematics Faculty of Natural Science, Jamia Millia Islamia, New Delhi-25

Practical Statistics for the Analytical Scientist Table of Contents

Exam details. Final Review Session. Things to Review

Deccan Education Society s FERGUSSON COLLEGE, PUNE (AUTONOMOUS) SYLLABUS UNDER AUTOMONY. SECOND YEAR B.Sc. SEMESTER - III

Fundamental Probability and Statistics

Statistical Methods for Astronomy

Syllabus for MATHEMATICS FOR INTERNATIONAL RELATIONS

Contents. Preface to Second Edition Preface to First Edition Abbreviations PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1

Dr. Junchao Xia Center of Biophysics and Computational Biology. Fall /1/2016 1/46

STAT 501 EXAM I NAME Spring 1999

sphericity, 5-29, 5-32 residuals, 7-1 spread and level, 2-17 t test, 1-13 transformations, 2-15 violations, 1-19

TABLE OF CONTENTS CHAPTER 1 COMBINATORIAL PROBABILITY 1

Statistical Methods in HYDROLOGY CHARLES T. HAAN. The Iowa State University Press / Ames

3 Joint Distributions 71

Data modelling Parameter estimation

Irr. Statistical Methods in Experimental Physics. 2nd Edition. Frederick James. World Scientific. CERN, Switzerland

Contents. Acknowledgments. xix

IDL Advanced Math & Stats Module

Statistical techniques for data analysis in Cosmology

LECTURE NOTES FYS 4550/FYS EXPERIMENTAL HIGH ENERGY PHYSICS AUTUMN 2013 PART I A. STRANDLIE GJØVIK UNIVERSITY COLLEGE AND UNIVERSITY OF OSLO

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Statistical Methods in Particle Physics

PART I INTRODUCTION The meaning of probability Basic definitions for frequentist statistics and Bayesian inference Bayesian inference Combinatorics

Do not copy, post, or distribute

Prentice Hall Stats: Modeling the World 2004 (Bock) Correlated to: National Advanced Placement (AP) Statistics Course Outline (Grades 9-12)

From Practical Data Analysis with JMP, Second Edition. Full book available for purchase here. About This Book... xiii About The Author...

Introduction to the Mathematical and Statistical Foundations of Econometrics Herman J. Bierens Pennsylvania State University

Practical Statistics

Institute of Actuaries of India

Ma 3/103: Lecture 24 Linear Regression I: Estimation

Stat 5102 Final Exam May 14, 2015

My data doesn t look like that..

Statistical Analysis of Engineering Data The Bare Bones Edition. Precision, Bias, Accuracy, Measures of Precision, Propagation of Error

GROUPED DATA E.G. FOR SAMPLE OF RAW DATA (E.G. 4, 12, 7, 5, MEAN G x / n STANDARD DEVIATION MEDIAN AND QUARTILES STANDARD DEVIATION

Generalized, Linear, and Mixed Models

Exam Applied Statistical Regression. Good Luck!

Inferences for Regression

p(z)

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Course in Data Science

Linear Models 1. Isfahan University of Technology Fall Semester, 2014

Data Fitting and Uncertainty

Hypothesis testing:power, test statistic CMS:

Asymptotic Statistics-VI. Changliang Zou

Correlation and Simple Linear Regression

Statistics. Lent Term 2015 Prof. Mark Thomson. 2: The Gaussian Limit

Nonparametric Bayesian Methods (Gaussian Processes)

Confidence Intervals, Testing and ANOVA Summary

Econometric Analysis of Cross Section and Panel Data

Math 562 Homework 1 August 29, 2006 Dr. Ron Sahoo

Research Note: A more powerful test statistic for reasoning about interference between units

SPSS Guide For MMI 409

HANDBOOK OF APPLICABLE MATHEMATICS

Chapter Fifteen. Frequency Distribution, Cross-Tabulation, and Hypothesis Testing

Correlation and Regression

EEC 686/785 Modeling & Performance Evaluation of Computer Systems. Lecture 19

Chapter 14. Linear least squares

Introduction to Econometrics

Statistical Distribution Assumptions of General Linear Models

UNIVERSITY OF TORONTO Faculty of Arts and Science

Measuring relationships among multiple responses

Basic Statistical Analysis

Generalized Linear Models (GLZ)

Outline. Simulation of a Single-Server Queueing System. EEC 686/785 Modeling & Performance Evaluation of Computer Systems.

QUANTITATIVE TECHNIQUES

Statistics and Data Analysis

Econometrics. 4) Statistical inference

Lecture 1: Brief Review on Stochastic Processes

covariance function, 174 probability structure of; Yule-Walker equations, 174 Moving average process, fluctuations, 5-6, 175 probability structure of

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014

Measurement And Uncertainty

Chemometrics. Matti Hotokka Physical chemistry Åbo Akademi University

Hypothesis Testing for Var-Cov Components

Applied Statistics and Econometrics

Small n, σ known or unknown, underlying nongaussian

Acknowledgements. Outline. Marie Diener-West. ICTR Leadership / Team INTRODUCTION TO CLINICAL RESEARCH. Introduction to Linear Regression

Nonparametric Statistics. Leah Wright, Tyler Ross, Taylor Brown

Goodness of Fit Test and Test of Independence by Entropy

Transcription:

Statistical methods and data analysis Teacher Stefano Siboni Aim The aim of the course is to illustrate the basic mathematical tools for the analysis and modelling of experimental data, particularly concerning the main statistical methods. The idea is not simply to describe and discuss the "cooking recipes" for the statistical treatment of data, but also to introduce and analyse in a critical way the statistical models implemented by the computation procedures, stressing their limitations and the actual interpretation of the results. Period of lectures February-March 2015 Duration 32 hours (4 credits) Conditions for obtaining credits For the first part, participants are required to pass a written examination consisting in the application of some statistical techniques (calculation of confidence intervals, hypothesis testing, calculation of correlation coefficients, linear regressions by chi-square or least squares fitting --- regression straight lines). For the topics of the second part, homework and oral discussion. Programme of the course (tentative) (1) Introduction to experimental measurements Errors of measurement. Systematic and random errors. Absolute and relative errors. Accuracy and precision of a measurement. Postulate of the statistical population; the result of a measurement as an outcome of a random variable. (2) Random variables Discrete and continuous random variables. Cumulative frequency theoretical distribution of a random variable. Frequency theoretical distribution (probability distribution) of a random variable, in the continuous and in the discrete case. Mean, variance, skewness of a probability distribution. Tchebyshev s theorem. Examples of discrete probability distributions: Bernoulli s, binomial, Poisson s. Examples of continuous probability distributions:

uniform, normal or Gaussian, chi-square with n degrees of freedom (X2), Student s with n degrees of freedom (t), Fisher s with n1 and n2 degrees of freedom (F). Central Limit Theorem. Multivariate random variables and related probability distributions; (stochastically) dependent and independent random variables; covariance matrix; correlation matrix; uncorrelated random variables; stochastic independence as a sufficient but not necessary condition to the lack of correlation; multivariate normal random variables, stochastic independence as a condition equivalent to the lack of correlation for multivariate normal random variables. (3) Functions of random variables Functions of random variables and indirect measurements. Linear combination of random variables: calculation of mean and variance, case of uncorrelated random variables; normal random variables, calculation of the joint probability distribution (the linear combination is again a normal random variable). Quadratic forms of standard normal random variables: Craig s theorem on the stochastic independence of positive semidefinite quadratic forms of independent standard normals; theorem for the characterization of quadratic forms of independent standard normals which obey a chi-square probability distribution. Illustrative examples. Error propagation in the indirect measurements: law of propagation of random errors in the indirect measurements (Gauss law) example of application of Gauss law; method of the logarithmic differential, principle of equivalent effects, example of application of the logarithmic differential; error propagation in the solution of a linear set of algebraic equations, condition number, example of application to the solution of a system of 2 equations in 2 variables; probability distribution of a function of random variables by Monte-Carlo methods. (4) Sample theory. Sample estimates of mean and variance Estimate of the parameters of the statistical population (mean and variance). Confidence intervals for the mean and the variance in the case of a normal population. Example of calculation of the CI for mean and variance of a normal population. Remark on the probabilistic meaning of the confidence intervals. Large samples of an arbitrary statistical population: approximate CI for the mean, approximate CI for the variance, example of calculation of the CI for the mean of a non-normal population by using a large sample. (5) Hypothesis testing Null hypothesis, type I errors, type II errors; Tests based on the rejection of the null hypothesis. Example illustrating the general concepts of hypothesis testing.

Summary of the main hypothesis tests and their typologies (parametric and non parametric). X 2 -test to fit a sample to a given probability distribution. Example of X 2 -test applied to a uniform distribution in the unit interval [0,1]. Example of X 2 -test applied to a normal distribution of unknown parameters. Kolmogorov-Smirnov test to fit a sample to a given continuous probability distribution. Example of KS test applied to a uniform distribution in the unit interval [0,1]. Remark on the calculation of the KS test statistics. Example of KS test applied to a standard normal distribution. T-test to check if the mean of a normal population is equal to, smaller than or greater than an assigned value, respectively. Example of application of the t-test on the mean of a normal population, relation between the t-test on the mean and the CI for the mean. X 2 -test to check whether the variance of a normal population is equal to, smaller than or greater than a given value, respectively. Example of application of the X 2 -test on the variance of a normal population, alternative form of the test based on the CI for the variance. Z-test to compare the means of two independent normal populations of known variances. Example of application of the Z-test to compare the means of two independent normal populations of known variances. F-test to compare the variances of two independent normal populations. Example of application of the F-test to compare the variances of two independent normal populations. Unpaired t-test to compare the means of two independent normal populations with unknown variances: case of equal variances; case of unequal variances. Example of application of the unpaired t-test for the comparison of the means of two normal populations (equal unknown variances). Example of application of the unpaired t-test for the comparison of the means of two normal populations (unequal unknown variances). Paired t-test to compare the means of two normal populations of unknown variances. Example of application of the paired t-test to compare the means of two normal populations. A remark about the Excel implementation of the paired t-test: relation with Pearson s linear correlation coefficient. Sign test for the median of a population. Sign test to check if two independent samples belong to the same unknown statistical population. Example of application of the sign test to check if two samples belong to the same unknown population. Detection of outliers non belonging to a statistical normal population by Chauvenet criterion. Example of application of the Chauvenet criterion. (6) Pairs of random variables Pairs of random variables, covariance and linear correlation coefficient (Pearson r). Probability distributions for the linear correlation coefficient r and test to check the hypothesis of stochastic independence based on r: case of independent random variables with convergent moments of sufficiently high order, for sufficiently large samples (z-test); case of normal random variables

- approximation for large samples, Fisher transformation (z-test), - independent normal random variables (t-test). Example of application of the linear correlation coefficient to check the stochastic independence of two normal random variables. Remark: linear correlation coefficient versus regression analysis. END OF THE FIRST PART OF THE COURSE A. F-test for the comparison of means. Introduction to ANOVA. Typical problem addressed by (1-factor) ANOVA and its matematical formulation. Factors and levels: 1-factor ANOVA, 2-factor ANOVA without replication, 2-factor ANOVA with replication, 2-factor ANOVA, etc. General mathematical formulation of ANOVA. A.1 1-factor ANOVA (one-way ANOVA) A.1.1 Basic definitions A.1.2 Statistical model of 1-factor ANOVA Fundamental theorem of 1-factor ANOVA and related remarks. F-test for the dependence of the means on the group. Example of application of 1-factor ANOVA. Remark. 1-factor ANOVA for groups with unequal numbers of data. A.1.3 Confidence intervals for the difference of two means: Student s approach and Scheffé s approach. A.2 2-factor ANOVA without interaction. A.2.1 Basic definitions of 2-factor ANOVA without interaction. A.2.2 Statistical model of 2-factor ANOVA without interaction. Fundamental theorem of 2-factor ANOVA without interaction and related remarks. F-test for the dependence of the means on the index i (first factor). F-test for the dependence of the means on the index j (second factor). Example of application of the 2-factor ANOVA without interaction. A.3 2-factor ANOVA with interaction A.3.1 Basic definitions of 2-factor ANOVA with interaction. A.3.2 Statistical model of 2-factor ANOVA with interaction. Fundamental theorem of 2-factor ANOVA with interaction and related remarks (hints). F-test for the dependence of the means on the index i (first factor). F-test for the dependence of the means on the index j (second factor). F-test for the occurrence of interactions between the factors. Example of application of the 2-factor ANOVA with interaction. R. Data modeling. Introduction to regression analysis General setup of the problem: models, model parameters. Fitting of model parameters to the data of the small sample. Merit function. Calculation of the best values of the model parameters (best-fit). Maximum likelihood method for the definition of the merit function. Some important cases: least-squares method; weighted least-squares method, or chi-square method; robust fitting methods for non-gaussian data, general notion of outlier. Linear regression by the chi-square method.

Linear regression by the least-squares method as a particular case. Determination of the best-fit parameters by the normal equation. The best-fit parameters as normal random variables. Mean and covariance matrix of the best-fit parameters. The normalized sum of squares of residuals around the regression (NSSAR) as a chisquare random variable. Stochastic independence of the estimated parameters. Goodness of fit Q of the regression model. Extreme values of the goodness of fit Q. Case of equal and a priori unknown standard deviations, estimate of the standard deviations of the data by the chi-square method. F-test for the goodness of fit (in the presence of an additive parameter). F-test with repeated observations: case of unequal, known variances; case of equal, possibly unknown variances. Incremental F-test on the introduction of a further fitting parameter. Self-consistency tests of the regression model for homoskedastic systems with unknown variance: coefficient of determination and adjusted coefficient of determination, line fit plot, residuals and residual plot, standardized residuals and normal probability plot. Confidence intervals for the regression parameters. Confidence intervals for predictions. Important case. Regression straight line: regression straight line, basic model; regression straight line, alternative model; confidence intervals for the regression parameters, basic model; confidence intervals for the regression parameters, alternative model; confidence interval for predictions, alternative model, confidence region; confidence interval for predictions, basic model. Example of regression straight line in a homoscedastic model, confidence intervals for the intercept and the slope, confidence region, confidence interval for the prediction at a given value of the abscissa, goodness of fit for an assigned value of the variance. Maple implementation of the linear regression. Excel implementation of the linear regression. Excel implementation of the linear regression for the case of no additive parameter. Linear regression with more variables. Example of multiple linear regression. Topics treated in the lecture notes: Chi-square fitting by SVD. Singular Value Decomposition of a matrix. SVD and covariance matrix of the estimated best-fit parameters. Example of SVD by Maple. N. Nonlinear regression (on having enough time) N.1 Nonlinear models N.2 Maple implementation of nonlinear regression. Example of a nonlinear regression model reducible to a linear one. N.3 Example of a nonlinear regression model in one dependent variable. N.4 Example of a nonlinear regression model with two independent variables. N.5 Robust fitting. N.5.1 Estimate of the best-fit parameters of a model by local M-estimates.

N.5.2 Excel and Maple implementation of robust fitting N.5.3 Example of robust regression straight line N.6 Multiple regression Principal Component Analysis (PCA) (on having enough time) Basic ideas. Principal component model. Principal component model for predictions. Relationship between PCA and SVD. Use of PCA for the grouping of data. PCA for nonb-centred data (General PCA). Implementation of PCA by R. Illustrative example of application of PCA. Further example of PCA. References: [1] E. Lloyd Ed., Handbook of applicable mathematics, Volume VI- Part A: Statistics, John Wiley & Sons, New York, 1984, p. 57 [2] Handbook VI-A (op. cit.), p. 41 [3] C. Capiluppi, D. Postpischl, P. Randi, Introduzione alla elaborazione dei dati sperimentali, CLUEB, Bologna, 1978, pagg. 89-98 (media) e 100-102 (varianza) [4] T.H.Wannacott, R.J.Wannacott, Introduzione alla statistica, Franco Angeli, Milano, 1998, p. 205-228 [5] W.H. Press, B.P. Flannery, S.A. Teukolsky, W.T.Vetterling, Numerical Recipes, Cambridge University Press, Cambridge, 1989, p. 470-471 and Capiluppi (op. cit.), p. 115-117 [6] C. Capiluppi (op.cit.), p. 120-121 [7] C. Capiluppi (op.cit.), p. 121-122 [8] Numerical Recipes (op. cit.), p. 468, Handbook VI-A, p. 264 [9] C. Capiluppi (op.cit.), p.123-124 (equal variances) and Numerical Recipes (op. cit.), p. 466-467 (different variances) [10] http://ishtar.df.unibo.it/stat/avan/misure/criteri/chauvenet.html [11] Numerical Recipes (op. cit.), p. 484-487 and for further reference, in the case of nonzero p, see Handbook VI-A (op. cit.), p. 53-54 General references: John R. Taylor, Introduzione all'analisi degli errori, Zanichelli, Bologna, 1986 Elena S. Ventsel, Teoria delle probabilità, Ed. MIR, 1983 Sheldon M. Ross, Introduction fo probability and statistics for engineers and scientists, 2 nd edt., Academic Press, New York, 2000 Sheldon M. Ross, Probabilità e statistica per l ingegneria e le scienze, APOGEO, Milano, 2003 http://ishtar.df.unibo.it/stat/avan/temp.html (in Italian) http://www.asp.ucar.edu/colloquium/1992/notes/part1/ (in English, very good)