Minimum Description Length (MDL)


Minimum Description Length (MDL)
Lyle Ungar

AIC: Akaike Information Criterion
BIC: Bayesian Information Criterion
RIC: Risk Inflation Criterion

MDL
- Sender and receiver both know X
- Want to send y using the minimum number of bits
- Send y in two parts:
  - the code (the model)
  - the residual (training error, i.e., "in-sample error")
- Decision tree: code = the tree; residual = the misclassifications
- Linear regression: code = the weights; residual = the prediction errors
- The MDL model is of optimal complexity: it trades off bias and variance
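
To make the two-part code concrete, here is a minimal sketch (my toy setup, not the lecture's scheme) that scores polynomial models by total description length, with an assumed fixed bit cost per coefficient for the model part and a Gaussian code for the residual part:

```python
# A minimal sketch of the two-part code (my toy setup, not the lecture's
# scheme): total description length = bits(model) + bits(residual).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
y = 2.0 * x - 1.0 + rng.normal(0, 0.3, 50)      # true model has degree 1

def description_length(x, y, degree, bits_per_coef=16):
    """Two-part code: fixed-precision coefficients + Gaussian-coded residuals."""
    coefs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coefs, x)
    model_bits = bits_per_coef * (degree + 1)    # assumed cost per coefficient
    sigma2 = max(resid.var(), 1e-12)
    # bits to code the residuals under a Gaussian model (entropy per sample)
    resid_bits = 0.5 * len(y) * np.log2(2 * np.pi * np.e * sigma2)
    return model_bits + resid_bits

for d in range(6):
    print(d, round(description_length(x, y, d), 1))
# Higher degree shrinks the residual bits but costs more model bits:
# the total is smallest near the true complexity (bias vs. variance).
```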

Complexity of a decision tree
- Need to code the structure of the tree and the values on the leaves (homework!)

Complexity of Classification Error
- Have two sequences y and ŷ; code the difference between them:
  - y     = (0, 1, 1, 0, 1, 1, 1, 0, 0)
  - ŷ     = (0, 0, 1, 1, 1, 1, 1, 0, 0)
  - y − ŷ = (0, 1, 0, −1, 0, 0, 0, 0, 0)
- How do we code the differences?
- How many bits would the best possible code take? (See the sketch below.)
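
One way to bound the answer (a small sketch, not the lecture's answer key): Shannon's source-coding theorem says the best possible code needs about the entropy of the symbol distribution, so we can compute the empirical entropy of the difference sequence:

```python
# Sketch: Shannon's bound says the best possible code needs about the
# empirical entropy of the difference symbols, in bits per symbol.
from collections import Counter
from math import log2

y    = (0, 1, 1, 0, 1, 1, 1, 0, 0)
yhat = (0, 0, 1, 1, 1, 1, 1, 0, 0)
diff = tuple(a - b for a, b in zip(y, yhat))     # symbols in {-1, 0, 1}

n = len(diff)
entropy = -sum(c / n * log2(c / n) for c in Counter(diff).values())
print(diff)                                      # (0, 1, 0, -1, 0, 0, 0, 0, 0)
print(f"{entropy:.3f} bits/symbol, ~{entropy * n:.1f} bits total")
```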

MDL for linear regression
- Need to code the model
  - For example: y = 3.1 x₁ + 97.2 x₃₂₁ − 17.4 x₅₂₀₄
  - Two-part code for the model:
    - which features are in the model
    - the coefficients of those features
- Need to code the residual: Σᵢ (yᵢ − ŷᵢ)²
- Class exercise: how do we code a real number? How do we code which features are in the model? How do we code the feature values?

Complexity of a real number
- A real number could require an infinite number of bits
- So we need to decide what accuracy to code it to:
  - code the sum of squared errors (SSE) to the accuracy given by the irreducible variance (bits ~ 1/σ²)
  - code each wᵢ to its accuracy (bits ~ √n)
- We know that y and the wᵢ are both normally distributed:
  - y ~ N(x·w, σ²)
  - ŵⱼ ~ N(wⱼ, σ²/n)
- Code a real value by dividing its range into regions of equal probability mass
  - the optimal coding length is given by the entropy: −∫ p(y) log p(y) dy
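
A minimal sketch of the equal-probability-mass idea (standard library only; the bin count and test values are mine): the CDF maps the value to a uniform (0, 1) variable, so cutting that interval into 2^b equal pieces gives 2^b equally likely bins, i.e., a b-bit code.

```python
# Sketch of coding a real value with b bits by cutting its distribution
# into 2**b equal-probability regions (standard library only).
from statistics import NormalDist

def quantize(value, dist, bits):
    """Index of the equal-probability-mass bin containing `value`."""
    m = 2 ** bits
    u = dist.cdf(value)            # CDF maps the value to uniform (0, 1)
    return min(int(u * m), m - 1)  # all m bins equally likely: a b-bit code

dist = NormalDist(mu=0.0, sigma=1.0)
for v in (-1.5, 0.0, 0.4, 2.0):
    print(v, "->", quantize(v, dist, bits=3))
```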

MDL regression penalty
- Code the residual (the data log-likelihood):
  −log(likelihood) = −log Πᵢ p(yᵢ | xᵢ)
                   = −log Πᵢ (1/(√(2π) σ)) exp(−(yᵢ − ŷᵢ)² / (2σ²))
                   = n ln(√(2π) σ) + Err_q / (2σ²)
- Code the model. For each feature:
  - is it in the model?
  - if it is included, what is its coefficient?
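
A quick numeric check of the algebra above (a sketch; the residuals are simulated): the Gaussian negative log-likelihood really does split into n·ln(√(2π)σ) plus Err_q/(2σ²).

```python
# Numeric check of the identity above on simulated residuals.
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 100, 0.5
resid = rng.normal(0.0, sigma, n)            # y_i - yhat_i
err_q = np.sum(resid ** 2)                   # Err_q = SSE

nll_direct = -np.sum(-np.log(np.sqrt(2 * np.pi) * sigma)
                     - resid ** 2 / (2 * sigma ** 2))
nll_split = n * np.log(np.sqrt(2 * np.pi) * sigma) + err_q / (2 * sigma ** 2)
print(np.isclose(nll_direct, nll_split))     # True
```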

Code the residual
- The residual / data log-likelihood:
  −log(likelihood) = −log Πᵢ p(yᵢ | xᵢ) = n ln(√(2π) σ) + Err_q / (2σ²)
- But we don't know σ²: σ² = E[(y − ŷ)²] = (1/n) Err
- Option 1: use the estimate from the previous iteration, σ̂² = Err_{q−1}/n
  - pay Err_q / (2 Err_{q−1}/n)
- Option 2: use the estimate from the current iteration, σ̂² = Err_q/n
  - pay n ln(√(2π) √(Err_q/n)); the Err_q/(2σ̂²) term is then the constant n/2

For each feature: is it in the model?
- If you expect q of the p features to enter the model, each comes in with probability π = q/p
- The total cost is then p [−(q/p) log(q/p) − ((p−q)/p) log((p−q)/p)]
- If q/p = 1/2, the total cost is p bits, i.e., 2 bits per included feature
- If q = 1, the total cost is roughly log(p), i.e., log(p) bits for the included feature (see the sketch below)
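
A short sketch of these two cases (the (q, p) values are made up): the total presence-coding cost p·H(q/p) and the resulting cost per included feature.

```python
# Sketch: total cost p*H(q/p) of coding feature presence, and the cost
# per included feature, for the two cases above ((q, p) values are mine).
from math import log2

def presence_bits(q, p):
    pi = q / p
    return p * (-pi * log2(pi) - (1 - pi) * log2(1 - pi))

for q, p in [(50, 100), (1, 1024)]:
    total = presence_bits(q, p)
    print(f"q={q}, p={p}: total={total:.1f} bits, "
          f"per included feature={total / q:.1f} bits")
# q/p = 1/2 gives 2 bits per included feature;
# q = 1 gives roughly log2(p) = 10 bits (plus ~1.44 from the absent features).
```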

Code each coefficient
- Code each coefficient with accuracy proportional to √n: (1/2) log(n) bits per feature

MDL regression penalty
- Minimize Err_q / (2σ²) + λ‖w‖₀ — an L0 penalty on the coefficients
  - Err_q: the SSE with the ‖w‖₀ = q included features
  - penalty λ = −log(π) + (1/2) log(n)
- Code each feature's presence using −log(π) bits per feature
  - π = q/p; assume q ≪ p, so log(1 − π) is near 0
  - if q = 1, then −log(π) = log(p)
- Code each coefficient with accuracy proportional to √n: (1/2) log(n) bits per feature
- Notation: n observations, q = ‖w‖₀ actual features, p potential features

MDL regression penalty (aside)
- Entropy of features being present or absent:
  Σⱼ [−π log(π) − (1−π) log(1−π)] = p [−π log(π) − (1−π) log(1−π)]
                                  = −q log(π) − (p−q) log(1−π)
- If π ≪ 1, then log(1−π) is roughly 0
- −q log(π) is the cost of coding the q features, so each feature costs −log(π) bits
- π = q/p: the probability of a feature being selected
- Notation: n observations, q = ‖w‖₀ actual features, p potential features

Regression penalty methods
- Minimize Err_q / (2σ²) + λ‖w‖₀ — an L0 penalty on the coefficients; Err_q is the SSE with the ‖w‖₀ = q features

  Method | penalty (λ)  | coding interpretation
  AIC    | 1            | code each coefficient using 1 bit
  BIC    | (1/2) log(n) | code each coefficient to accuracy √n ((1/2) log(n) bits)
  RIC    | log(p)       | code feature presence/absence (prior: one feature will come in)

- How do you estimate σ²?
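
The following sketch scores nested regression models under all three penalties; the simulated data, the q range, and treating σ² = 1 as known are my assumptions, not the lecture's.

```python
# Sketch: score nested regression models under the three penalties.
# The simulated data, the q range, and treating sigma^2 = 1 as known
# are assumptions made for this illustration.
import numpy as np

rng = np.random.default_rng(2)
n, p, q_true = 200, 50, 3
X = rng.normal(size=(n, p))
w = np.zeros(p)
w[:q_true] = [3.0, -2.0, 1.5]
y = X @ w + rng.normal(0.0, 1.0, n)

sigma2 = 1.0                          # see "How do you estimate sigma^2?"
penalties = {"AIC": 1.0, "BIC": 0.5 * np.log(n), "RIC": np.log(p)}

for q in range(8):
    cols = list(range(q))             # nested models: first q features
    yhat = (X[:, cols] @ np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
            if q else np.zeros(n))
    err_q = np.sum((y - yhat) ** 2)
    scores = {m: err_q / (2 * sigma2) + lam * q for m, lam in penalties.items()}
    print(q, {m: round(s, 1) for m, s in scores.items()})
# All three scores bottom out near q = 3 here; they disagree when n or p
# makes the per-feature penalty matter (e.g., huge p favors RIC's caution).
```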


Mallows Cp and AIC as MDL: Err_q / (2σ²) + q
- Mallows Cp: C_p = Err_q / σ² + 2q − n
  - the −n term doesn't depend on the model, so it doesn't affect the optimization
- AIC:
  AIC = −2 log(likelihood) + 2q
      = −2 log[Πᵢ (1/(√(2π) σ)) · exp(−Err_q / (2σ²))] + 2q
  But the best estimate we have is σ̂² = Err_q / n, so
  AIC ≈ 2n log(√(Err_q/n)) − 2 log exp(−Err_q / (2 Err_q/n)) + 2q
      ≈ n log(Err_q/n) + 2q (dropping terms that don't depend on q)
- q = ‖w‖₀ features in the model
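
A numeric check (with made-up (q, Err_q) pairs) that substituting σ̂² = Err_q/n into −2 log(likelihood) + 2q leaves n·log(Err_q/n) + 2q plus a q-independent constant:

```python
# Check: plugging sigma^2 = Err_q/n into -2 log(likelihood) + 2q leaves
# n*log(Err_q/n) + 2q plus a constant that does not depend on q.
# The (q, Err_q) pairs below are made up.
import numpy as np

n = 120
for q, err_q in [(1, 300.0), (2, 180.0), (5, 130.0)]:
    sigma2 = err_q / n
    neg2loglik = n * np.log(2 * np.pi * sigma2) + err_q / sigma2
    aic_exact = neg2loglik + 2 * q
    aic_slide = n * np.log(err_q / n) + 2 * q
    print(q, round(aic_exact - aic_slide, 6))   # same value every time
```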

BIC is MDL (as n → ∞): Err_q / (2σ²) + (1/2) log(n) q
- BIC = −2 log(likelihood) + 2 log(√n) q = n log(Err_q/n) + log(n) q, using exactly the same derivation as before
- q = ‖w‖₀ features in the model

Why does MDL work?
- We want to tune model complexity: how many bits should we use to code the model?
- Minimize: test error = training error + penalty = bias + variance
  - training error = bias = bits to code the residual:
    Σᵢ (yᵢ − ŷᵢ)² / (2σ²) = −log p(y | X), ignoring the irreducible uncertainty
  - penalty = variance = bits to code the model

What you should know
- How to code (in the information-theory sense):
  - decision trees, regression models
  - classification and prediction errors
- AIC, BIC, and RIC:
  - the assumptions behind them
  - why they are useful

You think maybe 10 out of 100,000 features will be significant. Use:
A) L2 with CV
B) L1 with CV
C) L0 with AIC
D) L0 with BIC
E) L0 with RIC

You think maybe 500 out of 1,000 features will be significant. Do not use:
A) L2 with CV
B) L1 with CV
C) L0 with AIC
D) L0 with BIC
E) L0 with RIC

Regression penalty methods
Minimize Err / (2σ²) + λ‖w‖₀. Err is:
A) Σᵢ (yᵢ − ŷᵢ)²
B) (1/n) Σᵢ (yᵢ − ŷᵢ)²
C) √((1/n) Σᵢ (yᵢ − ŷᵢ)²)
D) something else
Where does the 2σ² come from?

Bonus: p-values
- p-value: the probability of getting a false positive
- If I check 1,000 univariate correlations between xⱼ and some y, and accept those with p < 0.01, I should expect roughly ___ false positives:
A) 0
B) 1
C) 10
D) 100
E) 1,000
- How would you fix this?

Bonus: p-values
- Bonferroni (p = the number of features; 1/p = the prior probability):
  - require the p-value to be p times smaller: p-value < 0.01 · (1/p)
- Simes: sequential feature selection
  - sort the features by their p-values
  - for the first feature to be accepted, use Bonferroni: p-value < 0.01 · (1/p); if nothing passes, stop
  - if it is accepted, the threshold for the second feature is p-value < 0.01 · (2/p); if nothing passes, stop
  - if it is accepted, the threshold for the third feature is p-value < 0.01 · (3/p)
  - ...
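
A sketch of both corrections as described above, with α = 0.01 as in the slide; the p-values are invented for illustration.

```python
# Sketch of Bonferroni and the sequential (Simes-style) scheme above.
def bonferroni(pvals, alpha=0.01):
    p = len(pvals)
    return [j for j, pv in enumerate(pvals) if pv < alpha / p]

def simes_sequential(pvals, alpha=0.01):
    """Accept features in p-value order against thresholds alpha*k/p,
    stopping at the first failure."""
    p = len(pvals)
    order = sorted(range(p), key=lambda j: pvals[j])
    accepted = []
    for k, j in enumerate(order, start=1):
        if pvals[j] < alpha * k / p:
            accepted.append(j)
        else:
            break
    return accepted

pvals = [0.0005, 0.002, 0.003, 0.02, 0.5, 0.8, 0.9, 0.95]
print(bonferroni(pvals))        # [0]       -- only the strongest survives
print(simes_sequential(pvals))  # [0, 1, 2] -- relaxed thresholds admit more
```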