An automatic report for the dataset: affairs


(A very basic version of) The Automatic Statistician

Abstract

This is a report analysing the dataset affairs. Three simple strategies for building linear models have been compared using cross validation on half of the data. The strategy with the lowest cross-validated prediction error has then been used to train a model on the same half of the data. This model is then described, displaying the most influential components first. Model criticism techniques have then been applied to attempt to find discrepancies between the model and the data.

1 Brief description of data set

To confirm that I have interpreted the data correctly, a short summary of the data set follows. The target of the regression analysis is the column affairs. The input columns are summarised, together with the target, in table 1.

Table 1: Summary statistics of the data (name, minimum, median and maximum of each variable)

2 Summary of model construction

I have compared a number of different model construction techniques by computing cross-validated root-mean-squared errors (RMSE). I have also expressed these errors as a proportion of variance explained. These figures are summarised in table 2.

Method              Cross-validated RMSE    Cross-validated variance explained (%)
Full linear model   3.                      7.
BIC stepwise        3.31                    .1
LASSO               3.33                    1.

Table 2: Summary of model construction methods and cross-validated errors

The method Full linear model has the lowest cross-validated error, so I have used this method to train a model on half of the data. In the rest of this report I have described this model and have attempted to falsify it using held-out test data.
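The cross-validated comparison above can be sketched as follows. This is a minimal illustration using plain NumPy and an ordinary least-squares fit for the full linear model; the function names and the fit/predict interface are my own, and the report's BIC stepwise and LASSO strategies would plug into the same loop.

```python
import numpy as np

def cv_rmse(X, y, fit, predict, k=10, seed=0):
    """k-fold cross-validated root-mean-squared error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    sq_errors = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train], y[train])
        sq_errors.append((predict(model, X[test]) - y[test]) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(sq_errors))))

def ols_fit(X, y):
    # full linear model: least squares with an intercept column
    Xa = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(Xa, y, rcond=None)[0]

def ols_predict(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta

def cv_variance_explained(rmse, y):
    # cross-validated error re-expressed as a percentage of variance explained
    return 100.0 * (1.0 - rmse ** 2 / np.var(y))
```

On the affairs data the report runs this kind of comparison on half of the rows, keeping the other half for the model criticism stage.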

3 Model description

In this section I have described the model I have constructed to explain the data. A quick summary is below, followed by quantification of the model with accompanying plots of model fit and residuals.

3.1 Summary

The output affairs:

decreases linearly with input
decreases linearly with input
increases linearly with input
decreases linearly with input
increases linearly with input
decreases linearly with input

3.2 Detailed plots

Decrease with input

The correlation between the data and the input is -. (see figure 1a). Accounting for the rest of the model, this changes slightly to a part correlation of -. (see figure 1b).

Figure 1: a) affairs plotted against the input. b) Residuals (data minus the rest of the model) and fit of this component. c) Residuals (data minus the full model).

Decrease with input

The correlation between the data and the input is -. (see figure 2a). This correlation does not change when accounting for the rest of the model (see figure 2b).

Figure 2: a) affairs plotted against the input. b) Residuals (data minus the rest of the model) and fit of this component. c) Residuals (data minus the full model).
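The part correlation quoted in each subsection is the correlation between the target and the component of one input that is orthogonal to the remaining inputs. A minimal NumPy sketch (the function name and interface are illustrative, not from the report):

```python
import numpy as np

def part_correlation(y, X, j):
    """Semi-partial correlation: correlate y with the residual of
    column j of X after regressing it on the remaining columns."""
    others = np.delete(X, j, axis=1)
    Xa = np.column_stack([np.ones(len(y)), others])
    beta = np.linalg.lstsq(Xa, X[:, j], rcond=None)[0]
    unique_part = X[:, j] - Xa @ beta  # what x_j adds beyond the rest
    return float(np.corrcoef(y, unique_part)[0, 1])
```

When an input is nearly independent of the other inputs, the raw correlation and the part correlation agree, which is why several of the subsections below report no change after accounting for the rest of the model.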

Increase with input

The correlation between the data and the input is .1 (see figure 3a). Accounting for the rest of the model, this changes slightly to a part correlation of . (see figure 3b).

Figure 3: a) affairs plotted against the input. b) Residuals (data minus the rest of the model) and fit of this component. c) Residuals (data minus the full model).

Decrease with input

The correlation between the data and the input is . (see figure 4a). Accounting for the rest of the model, this changes moderately to a part correlation of -.1 (see figure 4b).

Figure 4: a) affairs plotted against the input. b) Residuals (data minus the rest of the model) and fit of this component. c) Residuals (data minus the full model).

Increase with input

The correlation between the data and the input is .1 (see figure 5a). This correlation does not change when accounting for the rest of the model (see figure 5b).

Figure 5: a) affairs plotted against the input. b) Residuals (data minus the rest of the model) and fit of this component. c) Residuals (data minus the full model).

Decrease with input

The correlation between the data and the input is . (see figure 6a). Accounting for the rest of the model, this changes slightly to a part correlation of -. (see figure 6b).

Figure 6: a) affairs plotted against the input. b) Residuals (data minus the rest of the model) and fit of this component. c) Residuals (data minus the full model).

4 Model criticism

In this section I have attempted to falsify the model presented above, to understand what aspects of the data it is not capturing well. This has been achieved by comparing the model with data I held out from the model fitting stage. In particular, I have searched for correlations and dependencies within the data that are unexpectedly large or small. I have also compared the distribution of the residuals with that assumed by the model (a normal distribution). There are other tests I could perform, but I will hopefully notice any particularly obvious failings of the model. Below is a list of the discrepancies that I have found, with the most surprising first. Note, however, that some discrepancies may be due to chance; on average a proportion of the listed discrepancies will be due to chance.

High dependence between residuals and model fit

There is an unexpectedly high dependence between the residuals and the model fit (see figure 7a). The dependence, as measured by the randomised dependency coefficient (RDC), has a substantially larger value of . compared to its median value under the proposed model of .1 (see figure 7b).

Figure 7: a) Test set and model sample residuals plotted against the model fit. b) Histogram of the RDC evaluated on random samples from the model (values that would be expected if the data was generated by the model) and its value on the test data (dashed line).
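The randomised dependency coefficient used here can be sketched as follows. This is a compact reading of the RDC of Lopez-Paz et al. (empirical copula transform, random sinusoidal features, then the largest canonical correlation); the feature count k and projection scale s below are illustrative choices of mine, not settings from the report.

```python
import numpy as np

def rdc(x, y, k=10, s=3.0, seed=0):
    """Randomised dependency coefficient between two samples (a sketch)."""
    rng = np.random.default_rng(seed)

    def random_features(v):
        # empirical copula transform, then k random sinusoidal projections
        ranks = np.argsort(np.argsort(v)) / float(len(v))
        u = np.column_stack([ranks, np.ones(len(v))])
        return np.sin(u @ rng.normal(0.0, s, (2, k)))

    fx, fy = random_features(x), random_features(y)

    def orthonormalise(a):
        return np.linalg.qr(a - a.mean(axis=0))[0]

    # canonical correlations are the singular values of Qx^T Qy
    sv = np.linalg.svd(orthonormalise(fx).T @ orthonormalise(fy),
                       compute_uv=False)
    return float(sv[0])
```

Because the inputs are rank-transformed and the features are nonlinear, the RDC can detect nonmonotone dependence (such as residual spread varying with the model fit) that an ordinary correlation would miss.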
Low negative deviation between quantiles of test residuals and model

There is an unexpectedly low negative deviation from equality between the quantiles of the test residuals and the assumed noise model (see figure 8a). The minimum of this deviation occurs at the 99th percentile, indicating that the test residuals have unexpectedly light positive tails. The minimum value of this deviation is -.7, which is moderately lower than its median value under the proposed model of -1.9 (see figure 8c). To demonstrate the discrepancy in simpler terms, I have plotted histograms of the test set residuals and those expected under the model in figure 8b.

Figure 8: a) Quantile-quantile plot of test set residuals against model sample quantiles. b) Histograms of test set residuals and model residuals. c) Histogram of the min QQ difference evaluated on random samples from the model (values that would be expected if the data was generated by the model) and its value on the test data (dashed line).
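The min QQ difference statistic and its reference histogram can be reproduced along these lines. This is a hedged sketch assuming a Gaussian noise model; the function names are illustrative, and only the standard library and NumPy are used.

```python
import numpy as np
from statistics import NormalDist

def min_qq_difference(residuals, sigma=1.0):
    """Most negative gap between the sorted residuals and the matching
    quantiles of the assumed Gaussian noise model."""
    n = len(residuals)
    nd = NormalDist(0.0, sigma)
    probs = (np.arange(1, n + 1) - 0.5) / n
    model_q = np.array([nd.inv_cdf(p) for p in probs])
    return float(np.min(np.sort(residuals) - model_q))

def reference_histogram(n, sigma=1.0, reps=500, seed=0):
    """Distribution of the statistic for data truly drawn from the model."""
    rng = np.random.default_rng(seed)
    return np.array([min_qq_difference(rng.normal(0.0, sigma, n), sigma)
                     for _ in range(reps)])

def surprise(observed, reference):
    """One-sided tail probability: how often the model itself produces a
    value at least as extreme as the one observed on held-out data."""
    return float(np.mean(reference <= observed))
```

A value of `surprise` near zero flags quantiles of the test residuals falling well below the corresponding model quantiles, which is the kind of discrepancy ranked in this section.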