Data Mining Stat 588
Lecture 9: Basis Expansions
Department of Statistics & Biostatistics, Rutgers University
Nov 01, 2011

Regression and Classification

Linear regression: $E(Y \mid X) = f(X)$. We want to learn $f(\cdot)$ from the training set $(x_1, y_1), \ldots, (x_N, y_N)$.

Logistic regression: $\log \frac{P(G = 1 \mid X)}{P(G = 0 \mid X)} = f(X)$. We want to learn $f(\cdot)$ from the training set $(x_1, g_1), \ldots, (x_N, g_N)$.

Move Beyond Linearity

We have seen models that are linear in the input features, both for regression and classification. To move beyond linearity, we can augment or replace the vector of inputs $X$ with additional variables that are transformations of $X$, and then use linear models in this new space of derived input features. Examples of such transformations:

- $h_m(X) = X_m$, $m = 1, \ldots, p$ (the original inputs).
- $h_m(X) = X_j^2$ or $h_m(X) = X_j X_k$.
- $h_m(X) = \log(X_j)$ or $\sqrt{X_j}$.
- $h_m(X) = I\{L_m \le X_j < U_m\}$.

Model $f(X)$ as a linear basis expansion in $X$:
$$f(X) = \sum_{m=1}^{M} \beta_m h_m(X).$$
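
To make the idea concrete, here is a minimal sketch (not from the slides) that builds a few derived features $h_m(X)$ by hand and fits an ordinary least-squares model in the expanded feature space; the data and the particular basis choices are invented for illustration.

```python
import numpy as np

# Synthetic one-dimensional data for illustration only.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + 0.1 * x**2 + rng.normal(scale=0.3, size=200)

# A few hand-picked basis functions: intercept, identity, square, log, indicator.
H = np.column_stack([
    np.ones_like(x),           # h_1(X) = 1
    x,                         # h_2(X) = X
    x**2,                      # h_3(X) = X^2
    np.log1p(x),               # h_4(X) = log(1 + X)
    (x >= 5.0).astype(float),  # h_5(X) = I{5 <= X}
])

# Linear least squares in the derived feature space: f(X) = sum_m beta_m h_m(X).
beta, *_ = np.linalg.lstsq(H, y, rcond=None)
fitted = H @ beta
```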

Dictionary Methods

Have a dictionary $\mathcal{D}$ consisting of a typically very large number $|\mathcal{D}|$ of basis functions, for example:

- Piecewise polynomials and splines.
- Smoothing splines.
- Trigonometric functions.
- Wavelet bases.

We need to control the complexity, via restriction methods, selection methods, or regularization methods. Assume $X$ is one-dimensional for today, unless otherwise specified.

Piecewise Polynomials

[Figure (Elements of Statistical Learning, 2nd Ed., Hastie, Tibshirani & Friedman 2009, Chap. 5): panels showing a piecewise-constant fit, a piecewise-linear fit, a continuous piecewise-linear fit, and the piecewise-linear basis function $(X - \xi_1)_+$, with knots $\xi_1$ and $\xi_2$.]

Piecewise Cubic Polynomials

[Figure (ESL Chap. 5): piecewise cubic fits that are discontinuous, continuous, continuous in the first derivative, and continuous in the second derivative, with knots $\xi_1$ and $\xi_2$.]

Spline

Order $M = \text{degree} + 1$. Number of knots $K$. Placement of knots $\xi_1, \ldots, \xi_K$. The domain of $X$ is divided into $K + 1$ intervals:
$$(-\infty, \xi_1), [\xi_1, \xi_2), \ldots, [\xi_{K-1}, \xi_K), [\xi_K, \infty).$$
A spline is an order-$M$ (degree-$(M-1)$) polynomial on each interval. At each knot $\xi_j$, there is one polynomial on its left-hand side and one on its right-hand side. These two polynomials agree in value and in all derivatives up to order $M - 2$ at $\xi_j$ (for a cubic spline: the value, the first derivative, and the second derivative). A cubic spline has order $M = 4$; it is called cubic because its degree is 3. Cubic splines are the lowest-order splines for which the discontinuity at the knots is not visible to the human eye.

Truncated-Power Basis

With a specified order, number of knots, and knot placement, the splines form a class of functions. On each interval, $M$ parameters are needed to determine an order-$M$ polynomial: $M(K + 1)$ parameters in total. At each knot there are $M - 1$ constraints: $(M - 1)K$ constraints in total. This leaves $K + M$ free parameters, the degrees of freedom. In fact, this class is a $(K + M)$-dimensional linear subspace of the space of all functions over the domain of $X$. The truncated-power basis is given by
$$h_j(X) = X^{j-1}, \quad j = 1, \ldots, M,$$
$$h_{M+l}(X) = (X - \xi_l)_+^{M-1}, \quad l = 1, \ldots, K.$$
Every spline in this class can be represented as
$$f(X) = \sum_{j=1}^{M} \beta_j h_j(X) + \sum_{l=1}^{K} \beta_{M+l} h_{M+l}(X).$$
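
A minimal sketch of this basis for a one-dimensional input; the helper name truncated_power_basis is my own.

```python
import numpy as np

def truncated_power_basis(x, knots, M=4):
    """Truncated-power basis for an order-M spline (cubic: M = 4).

    Columns are h_j(X) = X^(j-1) for j = 1..M and
    h_{M+l}(X) = (X - xi_l)_+^(M-1) for each knot xi_l.
    """
    x = np.asarray(x, dtype=float)
    cols = [x**j for j in range(M)]                              # 1, X, ..., X^(M-1)
    cols += [np.maximum(x - xi, 0.0) ** (M - 1) for xi in knots]  # truncated powers
    return np.column_stack(cols)                                  # shape (N, M + K)
```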

Fit the Spline

Specify the number of knots, or equivalently the number of basis functions or degrees of freedom; this can be done empirically or by cross-validation. Set the placement of the knots, e.g. at appropriate percentiles of the inputs. Set $q = M + K$ and $\beta = (\beta_1, \ldots, \beta_q)$. For each training point $(x_i, y_i)$ or $(x_i, g_i)$, evaluate the $q$ basis functions at the input value $x_i$ to obtain $h(x_i) = (h_1(x_i), \ldots, h_q(x_i))^T$. Then fit whatever linear model is appropriate using the derived inputs $h(x_i)$.

Linear and Logistic Regressions

Linear regression:
$$E(Y \mid X) = f(X) = \sum_{j=1}^{q} \beta_j h_j(X).$$
Learn $f(\cdot)$ from the training set $(x_1, y_1), \ldots, (x_N, y_N)$:
$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{N} \left[ y_i - \beta^T h(x_i) \right]^2.$$

Logistic regression:
$$\log \frac{P(G = 1 \mid X)}{P(G = 0 \mid X)} = f(X) = \sum_{j=1}^{q} \beta_j h_j(X).$$
Learn $f(\cdot)$ from the training set $(x_1, g_1), \ldots, (x_N, g_N)$:
$$\hat{\beta} = \arg\max_{\beta} \sum_{i=1}^{N} \left\{ g_i \log[p(x_i)] + (1 - g_i) \log[1 - p(x_i)] \right\},$$
where
$$p(x_i) = \frac{\exp\{\sum_{j=1}^{q} \beta_j h_j(x_i)\}}{1 + \exp\{\sum_{j=1}^{q} \beta_j h_j(x_i)\}}.$$
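
Putting the last two slides together, a hedged sketch of the least-squares fit: knots are placed at percentiles of the input as suggested above, the hypothetical truncated_power_basis() from the earlier sketch gives the derived inputs, and $\hat{\beta}$ comes from ordinary least squares. The data are synthetic; for the logistic case one would feed the same basis matrix to any standard logistic-regression solver.

```python
import numpy as np

# Synthetic data for illustration only.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, size=100))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=100)

# Knots at percentiles of the input; K = 3 interior knots, cubic spline M = 4.
knots = np.percentile(x, [25, 50, 75])
H = truncated_power_basis(x, knots, M=4)   # from the earlier sketch; shape (N, M + K)

# Least-squares estimate of beta and the fitted spline values.
beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
f_hat = H @ beta_hat
```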

Boundary Effect

[Figure (ESL Sec. 5.2): pointwise variance curves for four models fit to the same data: global linear, global cubic polynomial, cubic spline with 2 knots, and natural cubic spline with 6 knots, plotted against $X \in [0, 1]$.]

Natural Cubic Spline

A natural cubic spline is linear beyond the boundary knots ($\xi_1$ and $\xi_K$). This frees up four degrees of freedom relative to the cubic spline, but will increase bias near the boundaries. The set of all natural cubic splines with fixed knots $\xi_1, \ldots, \xi_K$ is a $K$-dimensional linear subspace, with basis $\{N_j(X) : 1 \le j \le K\}$:
$$N_1(X) = 1, \quad N_2(X) = X, \quad N_{k+2}(X) = d_k(X) - d_{K-1}(X),$$
where
$$d_k(X) = \frac{(X - \xi_k)_+^3 - (X - \xi_K)_+^3}{\xi_K - \xi_k}.$$
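
A short sketch of this basis (the helper name is my own, following the formulas above):

```python
import numpy as np

def natural_cubic_basis(x, knots):
    """Natural cubic spline basis N_1(X), ..., N_K(X) for the given knots."""
    x = np.asarray(x, dtype=float)
    xi = np.asarray(knots, dtype=float)
    K = len(xi)

    def d(k):
        # d_k(X) = [(X - xi_k)_+^3 - (X - xi_K)_+^3] / (xi_K - xi_k)
        return (np.maximum(x - xi[k], 0.0) ** 3
                - np.maximum(x - xi[-1], 0.0) ** 3) / (xi[-1] - xi[k])

    cols = [np.ones_like(x), x]                      # N_1 = 1, N_2 = X
    cols += [d(k) - d(K - 2) for k in range(K - 2)]  # N_{k+2} = d_k - d_{K-1}
    return np.column_stack(cols)                     # shape (N, K)
```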

Example: South African Heart Disease

A retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa. There are roughly two controls per case of CHD. These data are taken from a larger dataset, described in Rousseauw et al., 1983, South African Medical Journal. Variables:

- sbp: systolic blood pressure
- tobacco: cumulative tobacco (kg)
- ldl: low-density lipoprotein cholesterol
- adiposity
- famhist: family history of heart disease (Present, Absent)
- typea: type-A behavior
- obesity
- alcohol: current alcohol consumption
- age: age at onset
- chd: response, coronary heart disease

[Figure: fitted natural-spline functions for the South African heart disease model: $\hat{f}(\text{sbp})$, $\hat{f}(\text{tobacco})$, $\hat{f}(\text{ldl})$, $\hat{f}(\text{famhist})$, $\hat{f}(\text{obesity})$, and $\hat{f}(\text{age})$, each plotted against the corresponding predictor.]

Smoothing Splines

Among all functions $f(x)$ with two continuous derivatives, find the one that minimizes the penalized residual sum of squares
$$\mathrm{RSS}(f, \lambda) = \sum_{i=1}^{N} [y_i - f(x_i)]^2 + \lambda \int [f''(t)]^2 \, dt.$$
If $\lambda = 0$, $f$ can be any function that interpolates the data. If $\lambda = \infty$, $f$ must be linear: the least-squares fit. Assuming the inputs $x_1, \ldots, x_N$ are all different, there is a unique minimizer $\hat{f}$, which is a natural cubic spline with $N$ knots at $x_1, \ldots, x_N$. It may seem that this leads to over-fitting; however, the penalty term shrinks the spline coefficients toward the linear fit.

Example: Bone Mineral Density Data

Relative spinal bone mineral density measurements on 261 North American adolescents. Each value is the difference in spnbmd taken on two consecutive visits, divided by the average. The age is the average age over the two visits. Variables:

- idnum: identifies the child, and hence the repeat measurements
- age: average age of the child when the measurements were taken
- gender: male or female
- spnbmd: relative spinal bone mineral density measurement

Degrees of Freedom

The solution takes the form
$$f(x) = \sum_{j=1}^{N} N_j(x) \theta_j.$$
The criterion reduces to
$$\mathrm{RSS}(\theta, \lambda) = (y - \mathbf{N}\theta)^T (y - \mathbf{N}\theta) + \lambda \theta^T \Omega \theta,$$
where $\mathbf{N}_{ij} = N_j(x_i)$ and $\Omega_{jk} = \int N_j''(t) N_k''(t) \, dt$. The solution is given by
$$\hat{\theta} = (\mathbf{N}^T \mathbf{N} + \lambda \Omega)^{-1} \mathbf{N}^T y.$$
The fitted values are
$$\hat{f} = \mathbf{N}(\mathbf{N}^T \mathbf{N} + \lambda \Omega)^{-1} \mathbf{N}^T y =: S_\lambda y.$$
The effective degrees of freedom of a smoothing spline is defined as
$$\mathrm{df}_\lambda = \operatorname{trace}(S_\lambda).$$
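
A sketch of these formulas, assuming the basis matrix $\mathbf{N}$ and the penalty matrix $\Omega$ have already been computed (e.g. by numerical integration of the second derivatives of the natural-spline basis):

```python
import numpy as np

def smoothing_spline_fit(N_mat, Omega, y, lam):
    """Penalized least squares for a smoothing spline.

    N_mat : (n, n) basis matrix with N_ij = N_j(x_i)
    Omega : (n, n) penalty matrix with Omega_jk = integral N_j''(t) N_k''(t) dt
    lam   : smoothing parameter lambda >= 0
    """
    A = N_mat.T @ N_mat + lam * Omega
    theta_hat = np.linalg.solve(A, N_mat.T @ y)     # (N'N + lam*Omega)^{-1} N'y
    S = N_mat @ np.linalg.solve(A, N_mat.T)         # smoother matrix S_lambda
    f_hat = S @ y                                   # fitted values
    df = np.trace(S)                                # effective degrees of freedom
    return f_hat, theta_hat, df
```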

Eigenvalues and Eigenvectors of the Smoother Matrix

[Figure (ESL 2nd Ed., Hastie, Tibshirani & Friedman 2009, Fig. 5.7): smoothing-spline fits of ozone concentration versus Daggot pressure gradient for df = 5 and df = 11, together with the eigenvalues (plotted against order) and selected eigenvectors of the corresponding smoother matrices.]

Selection of the Smoothing Parameter

Fixing the degrees of freedom: try a few different values of df and select one based on approximate $F$-tests, residual plots, and other more subjective criteria. Bias-variance tradeoff: cross-validation, generalized cross-validation, $C_p$, etc.
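
One concrete way to carry out the cross-validation route is generalized cross-validation (GCV), which needs only the trace of the smoother matrix. A hedged sketch, reusing the quantities from the previous slide; the grid of $\lambda$ values is illustrative:

```python
import numpy as np

def gcv_score(N_mat, Omega, y, lam):
    """Generalized cross-validation score GCV(lambda) for a smoothing spline."""
    n = len(y)
    A = N_mat.T @ N_mat + lam * Omega
    S = N_mat @ np.linalg.solve(A, N_mat.T)          # smoother matrix S_lambda
    resid = y - S @ y
    return np.mean((resid / (1.0 - np.trace(S) / n)) ** 2)

# Pick lambda on a grid by minimizing GCV:
# lambdas = np.logspace(-6, 2, 50)
# best_lam = min(lambdas, key=lambda lam: gcv_score(N_mat, Omega, y, lam))
```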

[Figure (ESL Fig. 5.9): cross-validation for the smoothing parameter: EPE($\lambda$) and CV($\lambda$) curves plotted against $\mathrm{df}_\lambda$, together with smoothing-spline fits at $\mathrm{df}_\lambda = 5$, $9$, and $15$.]

Nonparametric Logistic Regression

Consider the penalized log-likelihood criterion
$$\hat{\beta} = \arg\max_{\beta} \left\{ \sum_{i=1}^{N} \left\{ g_i \log[p(x_i)] + (1 - g_i) \log[1 - p(x_i)] \right\} - \frac{\lambda}{2} \int [f''(t)]^2 \, dt \right\},$$
where
$$p(x_i) = \frac{\exp\{f(x_i)\}}{1 + \exp\{f(x_i)\}} \quad \text{and} \quad f(t) = \sum_{j=1}^{q} \beta_j h_j(t).$$
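
This criterion is usually maximized by penalized iteratively reweighted least squares (Newton steps). A minimal sketch under that approach, assuming the basis matrix and penalty matrix are given and the response vector g is coded 0/1; the function name is my own:

```python
import numpy as np

def penalized_logistic_spline(N_mat, Omega, g, lam, n_iter=25):
    """Penalized IRLS for nonparametric logistic regression (sketch).

    Iterates theta <- (N'WN + lam*Omega)^{-1} N'W z with working response
    z = f + W^{-1}(g - p), where p = sigmoid(f) and W = diag(p(1 - p)).
    """
    n, q = N_mat.shape
    theta = np.zeros(q)
    for _ in range(n_iter):
        f = N_mat @ theta
        p = 1.0 / (1.0 + np.exp(-f))               # fitted probabilities p(x_i)
        w = p * (1.0 - p)                           # IRLS weights
        z = f + (g - p) / np.maximum(w, 1e-10)      # working response
        A = N_mat.T @ (w[:, None] * N_mat) + lam * Omega
        theta = np.linalg.solve(A, N_mat.T @ (w * z))
    return theta
```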

Wavelet Smoothing

Wavelets typically use a complete orthonormal basis to represent functions, but then shrink and select the coefficients toward a sparse representation. They are able to represent both smooth and locally bumpy functions in an efficient way (time and frequency localization). Fit the coefficients for this basis by least squares, and then threshold (discard, filter) the smaller coefficients. Wavelets are very popular in signal processing and compression.
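
A minimal denoising sketch along these lines, assuming the PyWavelets (pywt) package is available; the synthetic signal, the symmlet-8 wavelet, and the universal threshold $\sigma\sqrt{2\log N}$ are illustrative choices, not prescribed by the slides.

```python
import numpy as np
import pywt  # assumption: the PyWavelets package is installed

# Synthetic noisy signal of length 1024 (for illustration only).
rng = np.random.default_rng(2)
signal = np.concatenate([np.zeros(256), np.ones(256), np.zeros(512)])
noisy = signal + 0.2 * rng.normal(size=signal.size)

# Orthonormal wavelet transform, soft-threshold the detail coefficients,
# then invert the transform to get the shrunken ("WaveShrunk") signal.
coeffs = pywt.wavedec(noisy, 'sym8')
sigma = np.median(np.abs(coeffs[-1])) / 0.6745        # rough noise-scale estimate
thresh = sigma * np.sqrt(2.0 * np.log(noisy.size))    # universal threshold
shrunk = [coeffs[0]] + [pywt.threshold(c, thresh, mode='soft') for c in coeffs[1:]]
denoised = pywt.waverec(shrunk, 'sym8')
```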

[Figure: Haar wavelets and symmlet-8 wavelets: basis functions $\psi_{j,k}$ at several scales and translations (e.g. $\psi_{1,0}$ through $\psi_{6,35}$), plotted against time on $[0, 1]$.]

Example: NMR Signal

[Figure (ESL Sec. 5.9): an NMR (nuclear magnetic resonance) signal, its wavelet-transform coefficients arranged by level ($V_4$ and $W_4$ through $W_9$), and the corresponding coefficients of the WaveShrunk (thresholded) signal.]