IEOR165 Discussion Week 5

Sheng Liu, University of California, Berkeley
Feb 19, 2016

Outline
1. 1st Homework
2. Revisit Maximum A Posteriori
3. Regularization

About the 1st Homework
- For the method of moments, understand the difference between the population moments $\mu_i$, the sample moments $\hat\mu_i$, the parameters $\theta_i$, and their estimates $\hat\theta_i$.
- For the uniform distribution on $(\theta_1, \theta_2)$, understand why $\hat\theta_2 > \hat\mu_1 > \hat\theta_1$.
- Solutions will be posted online.
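
As a quick check of that inequality (a sketch, not from the slides): for $X \sim \mathrm{Unif}(\theta_1, \theta_2)$ the first two moments give $\mathbb{E}X = (\theta_1 + \theta_2)/2$ and $\mathrm{Var}(X) = (\theta_2 - \theta_1)^2/12$, so matching the sample moments $\hat\mu_1, \hat\mu_2$ yields
\[
\hat\theta_1 = \hat\mu_1 - \sqrt{3(\hat\mu_2 - \hat\mu_1^2)}, \qquad
\hat\theta_2 = \hat\mu_1 + \sqrt{3(\hat\mu_2 - \hat\mu_1^2)},
\]
and hence $\hat\theta_2 > \hat\mu_1 > \hat\theta_1$ whenever the sample variance $\hat\mu_2 - \hat\mu_1^2$ is positive.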

Revisit Maximum A Posteriori (MAP)
The posterior density is
\[
f(\theta \mid X) = \frac{f(X \mid \theta)\, g(\theta)}{f(X)}, \qquad \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}.
\]
MLE maximizes the likelihood $f(X \mid \theta)$:
\[
\hat\theta_{ML} = \arg\max_\theta f(X \mid \theta) = \arg\max_\theta \log f(X \mid \theta).
\]
MAP maximizes the posterior $f(\theta \mid X)$:
\[
\hat\theta_{MAP} = \arg\max_\theta f(\theta \mid X) = \arg\max_\theta \log f(\theta \mid X) = \arg\max_\theta \left\{ \log f(X \mid \theta) + \log g(\theta) \right\},
\]
where the last equality holds because the evidence $f(X)$ does not depend on $\theta$.

MLE vs MAP (based on Avinash Kak, 2014)
Let $X_1, \dots, X_n$ be a random sample. For each $i$, the value of $X_i$ is either Clinton or Sanders. We want to estimate the probability $p$ that a Democrat will vote for Clinton in the primary. Given $p$, each $X_i$ follows a Bernoulli distribution:
\[
f(X_i = \text{Clinton} \mid p) = p, \qquad f(X_i = \text{Sanders} \mid p) = 1 - p.
\]
What is the MLE?
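
As a worked step (not spelled out on the slide), let $t$ denote the number of Clinton responses in the sample. Then the log-likelihood and its maximizer are
\[
\log f(X \mid p) = t \log p + (n - t) \log(1 - p), \qquad
\frac{d}{dp} \log f(X \mid p) = \frac{t}{p} - \frac{n - t}{1 - p} = 0
\;\Longrightarrow\; \hat p_{ML} = \frac{t}{n}.
\]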

MLE vs MAP
Now consider the MAP: what should the prior be?
- The prior should be supported on the interval $[0, 1]$ (common knowledge).
- Different people can have different beliefs about the prior: where should it peak? What should its variance be?
Here we take a Beta prior,
\[
p \sim \mathrm{Beta}(\alpha, \beta): \quad g(p) = \frac{1}{B(\alpha, \beta)} p^{\alpha - 1} (1 - p)^{\beta - 1},
\]
where $B(\alpha, \beta)$ is the beta function. The mode of the Beta distribution is
\[
\frac{\alpha - 1}{\alpha + \beta - 2},
\]
and its variance is
\[
\frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}.
\]
We choose the prior parameters $\alpha = \beta = 5$, so the prior peaks at $1/2$ with variance $25/1100 \approx 0.023$.
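
A minimal numerical check of these prior quantities (a sketch assuming SciPy is available, not part of the slides):

    # Inspect the Beta(5, 5) prior numerically.
    from scipy.stats import beta

    a, b = 5, 5
    prior = beta(a, b)

    print("mode    :", (a - 1) / (a + b - 2))   # (alpha-1)/(alpha+beta-2) = 0.5
    print("mean    :", prior.mean())            # 0.5
    print("variance:", prior.var())             # alpha*beta / ((alpha+beta)^2 (alpha+beta+1)) ~ 0.0227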

MLE vs MAP
Derive the MAP. Suppose we have a sample of size $n = 100$ in which 60 respondents say they will vote for Sanders. What is the difference between the estimates of $p$ obtained by MLE and by MAP?
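
A minimal numerical sketch of this comparison (not part of the slides): for a Bernoulli likelihood with a Beta$(\alpha, \beta)$ prior, the MAP estimate works out to $(t + \alpha - 1)/(n + \alpha + \beta - 2)$, where $t$ is the number of Clinton responses.

    # Compare MLE and MAP for the Clinton/Sanders example.
    # Uses the closed forms p_ML = t/n and p_MAP = (t + a - 1)/(n + a + b - 2).
    n = 100          # sample size
    t = n - 60       # 60 say Sanders, so t = 40 say Clinton
    a = b = 5        # Beta prior parameters from the slides

    p_mle = t / n
    p_map = (t + a - 1) / (n + a + b - 2)

    print("MLE:", p_mle)             # 0.40
    print("MAP:", round(p_map, 4))   # 44/108 ~ 0.4074, pulled toward the prior mode 0.5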

Motivation
Why do we want to impose regularization on OLS?
- Tradeoff between bias and variance: OLS is unbiased, but its variance may be high.
  - When $n < p$, i.e. there are not enough observations, OLS may fail.
  - Collinearity: when predictors are correlated, the variance of OLS is significantly inflated.
  - Adding regularization introduces bias but lowers the variance.
- Model interpretability:
  - Adding more predictors is not always good; it increases the complexity of the model and makes it harder to extract useful information.
  - Regularization (shrinkage) pushes some coefficients toward zero and selects the most influential coefficients (and the corresponding predictors) in the model.

Regularization
Ridge regression: $\ell_2$-norm regularization
\[
\hat\beta = \arg\min_\beta \|Y - X\beta\|_2^2 + \lambda \|\beta\|_2^2
          = \arg\min_\beta \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^p \beta_j^2
\]
Lasso: $\ell_1$-norm regularization
\[
\hat\beta = \arg\min_\beta \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1
          = \arg\min_\beta \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^p |\beta_j|
\]
Elastic net: combination of $\ell_1$-norm and $\ell_2$-norm regularization
\[
\hat\beta = \arg\min_\beta \|Y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 + \mu \|\beta\|_1
\]
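
As a hedged illustration (not from the slides), the three penalized fits above can be sketched with scikit-learn. Note that scikit-learn parameterizes the penalties with alpha and l1_ratio rather than the $\lambda$ and $\mu$ above, and it scales the least-squares term differently, so the values are not directly comparable.

    # Ridge, lasso, and elastic net on synthetic data with scikit-learn.
    # alpha / l1_ratio play the role of the lambda / mu penalties on the slide,
    # up to scikit-learn's own scaling conventions.
    import numpy as np
    from sklearn.linear_model import Ridge, Lasso, ElasticNet

    rng = np.random.default_rng(0)
    n, p = 100, 10
    X = rng.normal(size=(n, p))
    beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
    y = X @ beta_true + rng.normal(size=n)

    ridge = Ridge(alpha=1.0).fit(X, y)                     # l2 penalty
    lasso = Lasso(alpha=0.1).fit(X, y)                     # l1 penalty
    enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mix of l1 and l2

    print("ridge coefficients:", np.round(ridge.coef_, 2))
    print("lasso coefficients:", np.round(lasso.coef_, 2))  # some exactly zero
    print("enet coefficients: ", np.round(enet.coef_, 2))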

Credit Data Example (Gareth James et al., 2013)
Given a dataset that records:
- Balance
- Age
- Cards (number of credit cards)
- Education (years of education)
- Income
- Limit (credit limit)
- Rating (credit rating)
- Gender
- Student (whether a student or not)
- Marital status
- Ethnicity
Let Balance be the response and all other variables be predictors.
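
A sketch of how coefficient paths like those in the following figures could be computed, assuming a local copy of the Credit data from James et al. (2013) saved as Credit.csv (a hypothetical path); categorical predictors are one-hot encoded and all predictors standardized before fitting.

    # Ridge and lasso coefficient paths on the Credit data.
    # "Credit.csv" is a hypothetical local copy of the ISLR Credit data set.
    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import Ridge, lasso_path

    credit = pd.read_csv("Credit.csv")
    y = credit["Balance"].to_numpy()
    X = pd.get_dummies(credit.drop(columns=["Balance"]), drop_first=True)
    X = StandardScaler().fit_transform(X)

    # Ridge path: refit over a grid of penalty values.
    lambdas = np.logspace(-2, 5, 100)
    ridge_coefs = np.array([Ridge(alpha=lam).fit(X, y).coef_ for lam in lambdas])

    # Lasso path: computed in one sweep by coordinate descent.
    alphas, lasso_coefs, _ = lasso_path(X, y)

    print(ridge_coefs.shape, lasso_coefs.shape)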

Credit Data Example
[Figure from the slides, not reproduced in the transcription.]

Ridge Regression
[Figure: ridge coefficient paths on the Credit data, plotted against $\lambda$ and against $\|\hat\beta^R_\lambda\|_2$.]

Lasso
[Figure: lasso coefficient paths on the Credit data, plotted against $\lambda$ and against $\|\hat\beta^L_\lambda\|_1$.]
The lasso performs variable selection and gives a sparse model.

Elastic Net
[Figure: elastic net coefficient paths on the Credit data as a function of $\lambda$, with $\mu = 0.3$.]
The paths for Limit and Rating are very similar.

Choosing λ
Use cross-validation: fit the model over a grid of λ values and choose the value that minimizes the cross-validated error.
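
A brief sketch of choosing the penalty by cross-validation (not from the slides), using scikit-learn's cross-validated estimators; the fold count and penalty grid are illustrative choices.

    # Pick the penalty level by K-fold cross-validation.
    import numpy as np
    from sklearn.linear_model import LassoCV, RidgeCV

    rng = np.random.default_rng(0)
    n, p = 100, 10
    X = rng.normal(size=(n, p))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

    lasso_cv = LassoCV(cv=5).fit(X, y)                            # alpha grid chosen automatically
    ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 50)).fit(X, y)   # efficient leave-one-out CV by default

    print("lasso: best alpha =", lasso_cv.alpha_)
    print("ridge: best alpha =", ridge_cv.alpha_)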