FEEG6017 lecture: Akaike's information criterion; model reduction. Brendan Neville

FEEG6017 lecture: Akaike's information criterion; model reduction Brendan Neville bjn1c13@ecs.soton.ac.uk

Occam's razor William of Occam, 1288-1348. All else being equal, the simplest explanation is the best one.

Occam's razor In statistics, this means a model with fewer parameters is to be preferred to one with more. Of course, this needs to be weighed against the ability of the model to actually predict anything...

Why reduce models? In keeping with Occam's razor, the idea is to trim complicated multi-variable models down to a reasonable size. This is most obvious when we look at multiple regression. Do I need all 12 of these predictors? Would a model with only 6 predictors be almost as accurate and thus preferable?

Why reduce models? The same logic applies to even the simplest statistical tests. A two-sample t-test asks whether the model that says the two samples come from populations with different means could be pared down to the simpler model that says they come from a single population with a common mean.
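
This is easy to check in R. A minimal sketch with invented data: for the equal-variance t-test, comparing the two-group model against the single-mean model via an F-test gives the equivalent answer (the F statistic is the square of the t statistic).

    # Illustrative sketch with simulated data (not part of the lecture materials).
    set.seed(1)
    group <- factor(rep(c("a", "b"), each = 20))
    y <- rnorm(40, mean = ifelse(group == "a", 10.0, 11.0))

    # The pooled two-sample t-test...
    t.test(y ~ group, var.equal = TRUE)

    # ...asks the same question as comparing "one common mean" vs "two group means".
    null.model  <- lm(y ~ 1)
    group.model <- lm(y ~ group)
    anova(null.model, group.model)   # F here equals the squared t statistic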

Over-fitting How greedy can we get? In other words, how many predictors, or degrees of freedom, can a model reasonably have? A useful absolute ceiling to think about is a model with N-1 binary categorical predictor variables, where N is the sample size.


Over-fitting N-1 predictors would be enough to assign a unique value to each case. That would allow the model to explain all variance in the data, but it's clearly an absurd model. We "explain" the data by just reading it back to ourselves in full. No compression or explanation is achieved.
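
A quick sketch in R (with simulated data, not from the lecture) makes the point: give a model one parameter per case and the fit is perfect but tells us nothing.

    # Saturated-model sketch: one factor level per observation.
    set.seed(2)
    n <- 10
    y <- rnorm(n)
    id <- factor(1:n)              # a unique label for each case
    saturated <- lm(y ~ id)        # effectively N-1 dummy predictors plus an intercept
    summary(saturated)$r.squared   # 1: all variance "explained"
    residuals(saturated)           # essentially zero everywhere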

Under-fitting Thus we want to have fewer predictors in our model than there are cases in the data: usually a lot fewer. How minimal can we get? The minimal model describes the outcome measure by nothing more than its mean.


Under-fitting If you had no good model of height, and I asked you the height of the next person you see, your best response is 1.67m (i.e., the UK average). All of the variance in the outcome measure remains unexplained, but at least we can say our one-parameter model is economical!
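
In R this minimal model is the intercept-only model, and its single fitted parameter is just the sample mean. A small sketch with invented height data:

    # Intercept-only ("null") model: it predicts the mean for everyone.
    set.seed(3)
    height <- rnorm(50, mean = 1.67, sd = 0.1)   # fictional heights in metres
    null.model <- lm(height ~ 1)
    coef(null.model)   # the one parameter: the intercept, equal to mean(height)
    mean(height)       # same value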

A compromise We want a model that is as simple as possible, but no simpler. A reasonable amount of explanatory power traded off against model size (number of predictors). How do we measure that?

The old way of doing it People used to do model reduction through a series of F-tests asking whether a model with one extra predictor explained significantly more of the variance in the dependent or outcome variable. This was called "stepwise model reduction", and was done either by pruning from a full model (backwards) or building up from a null model (forwards).
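
One backwards step of that procedure can be sketched in R as a nested-model F-test; the data below are simulated purely for illustration.

    # Does dropping x3 lose a significant amount of explained variance?
    set.seed(4)
    dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
    dat$y <- 2 * dat$x1 + 0.5 * dat$x2 + rnorm(100)   # x3 is pure noise here

    full    <- lm(y ~ x1 + x2 + x3, data = dat)
    reduced <- lm(y ~ x1 + x2, data = dat)
    anova(reduced, full)   # F-test: a large p-value suggests x3 can be dropped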

The old way of doing it It wasn't a bad way to do it, but one problem was that the model you ended up with could be different depending on the order in which you examined candidate variables. To get to a better method we have to look briefly at information theory...

Kullback-Leibler divergence Roughly speaking this is a measure of the informational distance between two probability distributions. The K-L distance between a real-world distribution and a model distribution tells us how much information is lost by summarizing the phenomenon with that model. Minimizing the K-L distance is a good plan.
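
For two discrete distributions the divergence is just a probability-weighted sum of log-ratios, D(p || q) = Σ p_i log(p_i / q_i). A tiny sketch in R with made-up numbers:

    # K-L divergence from a model distribution q to a "true" distribution p.
    p <- c(0.5, 0.3, 0.2)   # invented "real-world" distribution
    q <- c(0.4, 0.4, 0.2)   # invented model distribution
    sum(p * log(p / q))     # small positive number; zero only when p and q match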

Maximum likelihood estimation A likelihood function gives the probability of observing the data given a certain set of model parameters: L(θ | x) = P(x | θ). It's not the same as the probability of the model being true; it's just a measure of how strange the data would be given a particular model. Maximum likelihood estimation chooses the parameter values that maximize this likelihood given the observed data.

Coin tossing example We throw a coin three times: it comes up heads twice and tails once. We have two competing theories about the nature of the coin: (A) it's a fair coin, p(heads) = 0.5; (B) it's a biased coin, p(heads) = 0.8. There are 3 distinct ways to get 2 heads and 1 tail in 3 throws: HHT, HTH, THH.

Coin tossing example Under model A, each of those possibilities has p = 0.125, and the total probability of getting two heads and one tail (i.e., the data) is 0.375. P(HHT | A) = 0.5 × 0.5 × 0.5 = 0.125. P(2H and 1T | A) = 3 × 0.125 = 0.375. L(A | 2H and 1T) = 0.375.

Coin tossing example Under model B, each of those possibilities has p = 0.128, and the total probability of two heads, one tail is 0.384. P(HHT | B) = 0.8 × 0.8 × 0.2 = 0.128. P(2H and 1T | B) = 3 × 0.128 = 0.384. The likelihood function is maximized by model B in this case: L(B | 2H and 1T) > L(A | 2H and 1T), i.e. the unfair-coin parameters are more likely given that we saw two heads and one tail.

Coin tossing example We would therefore prefer model B (narrowly) to model A, because it's the model that renders the observed data "less surprising". Note that in this case models A and B have the same number of parameters, so there's nothing between them on simplicity, only accuracy.
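
The same numbers drop straight out of R's binomial density, which is a convenient way to check the arithmetic:

    # Likelihood of each coin model given 2 heads in 3 throws.
    dbinom(2, size = 3, prob = 0.5)   # model A: 0.375
    dbinom(2, size = 3, prob = 0.8)   # model B: 0.384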

"Akaike's information criterion" Hirotugu Akaike, 1927-2009. In the 1970s he used information theory to build a numerical equivalent of Occam's razor.

Akaike's information criterion The idea is that if we knew the true distribution F, and we had two models G1 and G2, we could figure out which model we preferred by noting which had a lower K-L distance from F. We don't know F in real cases, but we can estimate the relative K-L distances from F to G1 and from F to G2 using our data.

Akaike's information criterion That's what AIC is. The model with the lowest AIC value is the preferred one. The formula is remarkably simple: AIC = 2K - 2log(L)... where K is the number of parameters in the model (roughly, the number of predictors) and L is the maximized likelihood value.
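
You can check the formula against R's own AIC() on any fitted model. One caveat: the parameter count reported by logLik() includes the residual variance of a linear model, so it can be one more than a simple count of predictors. A small sketch with simulated data:

    # Hand-computed AIC versus R's AIC().
    set.seed(5)
    x <- rnorm(100)
    y <- 3 * x + rnorm(100)
    m <- lm(y ~ x)

    ll <- logLik(m)              # maximized log-likelihood
    k  <- attr(ll, "df")         # number of estimated parameters R counts
    2 * k - 2 * as.numeric(ll)   # AIC by hand
    AIC(m)                       # same value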

Akaike's information criterion The "2K" part of the formula is effectively a penalty for including extra predictors in the model. The "-2log(L)" part rewards the fit between the model and the data. Likelihood values in real cases will be very small probabilities, so "-2log(L)" will be a large positive number.

How do I use this in R? AIC is spectacularly easy to use in R. The command is AIC(model1, model2, model3,...) This lists the AIC values for all the named models; simply pick the lowest. drop1(model) is also very useful. It gives the AIC value for the models reached by dropping each predictor in turn from this one.
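
A minimal worked example, using simulated data rather than the lecture's Oscars set:

    # Compare candidate models by AIC and see what drop1() suggests cutting.
    set.seed(6)
    dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
    dat$y <- 1.5 * dat$x1 + 0.4 * dat$x2 + rnorm(200)   # x3 is irrelevant by construction

    model1 <- lm(y ~ x1 + x2 + x3, data = dat)
    model2 <- lm(y ~ x1 + x2, data = dat)
    model3 <- lm(y ~ x1, data = dat)

    AIC(model1, model2, model3)   # pick the row with the lowest AIC
    drop1(model1)                 # AIC after dropping each predictor in turn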

Additional resources You can have a play about with AIC. Use the Oscars data set from the previous lecture on logistic regression. That's a logistic regression example, but AIC works equally well with linear regression and with any model for which a maximum likelihood estimate exists.

Additional materials The Python code for generating graphs and the fictional data set used here; also the Python code for generating the fictional Oscars data set. The fictional Oscars data set as a text file. An R script for analyzing the fictional data set.

Additional materials If you want to reproduce the R session used in the lecture, load the above data file and R script into your working directory, and then type this command: source("aicscript.txt", echo=TRUE)