Data Mining Stat 588, Lecture 9: Basis Expansions
Department of Statistics & Biostatistics, Rutgers University
Nov 01, 2011
Regression and Classification

Linear Regression. $E(Y \mid X) = f(X)$. We want to learn $f(\cdot)$ from the training set $(x_1, y_1), \ldots, (x_N, y_N)$.

Logistic Regression. $\log \frac{P(G = 1 \mid X)}{P(G = 0 \mid X)} = f(X)$. We want to learn $f(\cdot)$ from the training set $(x_1, g_1), \ldots, (x_N, g_N)$.
Move Beyond Linearity

We have seen models linear in the input features, both for regression and classification. To move beyond linearity, we can augment or replace the vector of inputs $X$ with additional variables, which are transformations of $X$, and then use linear models in this new space of derived input features. Examples of basis functions:
- $h_m(X) = X_m$, $m = 1, \ldots, p$ (the original linear model).
- $h_m(X) = X_j^2$ or $h_m(X) = X_j X_k$.
- $h_m(X) = \log(X_j)$ or $\sqrt{X_j}$.
- $h_m(X) = I\{L_m \le X_j < U_m\}$.
Model $f(X)$ as a linear basis expansion in $X$:
$$f(X) = \sum_{m=1}^{M} \beta_m h_m(X).$$
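To make this concrete, here is a minimal sketch in Python (simulated data; the particular transformations and variable names are illustrative, not from the lecture) of building derived inputs $h_m(X)$ and fitting by least squares in the expanded space:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=100)                      # one-dimensional input
y = np.sin(2 * X) + 0.3 * rng.standard_normal(100)   # nonlinear signal + noise

# Derived inputs h_m(X): intercept, identity, square, log, and an indicator
H = np.column_stack([
    np.ones_like(X),             # h_1(X) = 1
    X,                           # h_2(X) = X
    X**2,                        # h_3(X) = X^2
    np.log1p(X),                 # h_4(X) = log(1 + X)
    (X >= 2.0).astype(float),    # h_5(X) = I{2 <= X}
])

# f(X) = sum_m beta_m h_m(X): ordinary least squares in the derived space
beta, *_ = np.linalg.lstsq(H, y, rcond=None)
y_hat = H @ beta
```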
Dictionary Methods

We have a dictionary $\mathcal{D}$ consisting of a typically very large number $|\mathcal{D}|$ of basis functions:
- Piecewise polynomials and splines.
- Smoothing splines.
- Trigonometric functions.
- Wavelet bases.
We need to control the complexity, via:
- Restriction methods.
- Selection methods.
- Regularization methods.
Assume $X$ is one-dimensional for today, unless otherwise specified.
Piecewise Polynomials

[Figure (ESL 2nd Ed., Hastie, Tibshirani & Friedman 2009, Chap. 5): four panels showing a piecewise constant fit, a piecewise linear fit, a continuous piecewise linear fit, and the piecewise-linear basis function $(X - \xi_1)_+$, each with knots $\xi_1$ and $\xi_2$.]
Piecewise Cubic Polynomials

[Figure (ESL Chap. 5): piecewise cubic polynomial fits with knots $\xi_1$ and $\xi_2$ that are, in turn, discontinuous, continuous, continuous in the first derivative, and continuous in the second derivative.]
Splines

- Order $M = \text{degree} + 1$.
- Number of knots $K$.
- Placement of knots $\xi_1, \ldots, \xi_K$.
The domain of $X$ is divided into $K + 1$ contiguous intervals:
$$(-\infty, \xi_1),\ [\xi_1, \xi_2),\ \ldots,\ [\xi_{K-1}, \xi_K),\ [\xi_K, \infty).$$
A spline is an order-$M$ (degree-$(M-1)$) polynomial on each interval. At each knot $\xi_j$, there is one polynomial on its left-hand side and one on its right-hand side. These two polynomials have the same value and the same derivatives up to order $M - 2$ at $\xi_j$. A cubic spline has order $M = 4$; it is called cubic because the degree is 3. Cubic splines are the lowest-order splines for which the discontinuity at the knots is not visible to the human eye.
Truncated-Power Basis

With a specified order, number of knots and their placement, there is a class of splines.
- On each interval, we need $M$ parameters to determine an order-$M$ polynomial: $M(K + 1)$ in total.
- At each knot, there are $M - 1$ constraints: $(M - 1)K$ in total.
- This leaves $M(K + 1) - (M - 1)K = K + M$ free parameters: the degrees of freedom.
In fact, this class is a $(K + M)$-dimensional linear subspace of the space of all functions over the domain of $X$. The truncated-power basis is given by
$$h_j(X) = X^{j-1}, \quad j = 1, \ldots, M,$$
$$h_{M+l}(X) = (X - \xi_l)_+^{M-1}, \quad l = 1, \ldots, K.$$
Every spline in this class can be represented as
$$f(X) = \sum_{j=1}^{M} \beta_j h_j(X) + \sum_{l=1}^{K} \beta_{M+l} h_{M+l}(X).$$
A construction of this basis in code is sketched below.
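A minimal sketch of the truncated-power basis in Python (the function name and defaults are mine, not from the lecture):

```python
import numpy as np

def truncated_power_basis(x, knots, M=4):
    """Truncated-power basis for an order-M spline (M = 4 gives cubic).

    Columns: 1, x, ..., x^(M-1), then (x - xi_l)_+^(M-1) for each knot,
    giving K + M basis functions in total.
    """
    x = np.asarray(x, dtype=float)
    cols = [x**j for j in range(M)]                         # h_j(X) = X^(j-1)
    cols += [np.maximum(x - xi, 0.0)**(M - 1) for xi in knots]
    return np.column_stack(cols)
```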
Fit the Spline

- Specify the number of knots, or equivalently the number of basis functions or degrees of freedom. This can be done empirically or by cross-validation.
- Set the placement of the knots, e.g. at appropriate percentiles of the inputs.
- Set $q = M + K$ and $\beta = (\beta_1, \ldots, \beta_q)$. For each training point $(x_i, y_i)$ or $(x_i, g_i)$, evaluate the $q$ basis functions at the input value $x_i$ to obtain $h(x_i) = (h_1(x_i), \ldots, h_q(x_i))^T$.
- Fit any linear model with the derived inputs $h(x_i)$; see the sketch below.
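Continuing the earlier sketch (simulated data; reuses truncated_power_basis from above):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 4, 100))
y = np.sin(2 * x) + 0.3 * rng.standard_normal(100)

knots = np.percentile(x, [25, 50, 75])      # K = 3 knots at the quartiles
H = truncated_power_basis(x, knots, M=4)    # q = M + K = 7 basis functions
beta, *_ = np.linalg.lstsq(H, y, rcond=None)
f_hat = H @ beta                            # fitted cubic spline at the inputs
```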
Linear and Logistic Regressions

Linear Regression.
$$E(Y \mid X) = f(X) = \sum_{j=1}^{q} \beta_j h_j(X).$$
Learn $f(\cdot)$ from the training set $(x_1, y_1), \ldots, (x_N, y_N)$:
$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{N} \left[ y_i - \beta^T h(x_i) \right]^2.$$

Logistic Regression.
$$\log \frac{P(G = 1 \mid X)}{P(G = 0 \mid X)} = f(X) = \sum_{j=1}^{q} \beta_j h_j(X).$$
Learn $f(\cdot)$ from the training set $(x_1, g_1), \ldots, (x_N, g_N)$:
$$\hat{\beta} = \arg\max_{\beta} \sum_{i=1}^{N} \left\{ g_i \log[p(x_i)] + (1 - g_i) \log[1 - p(x_i)] \right\},$$
where
$$p(x_i) = \frac{\exp\{\sum_{j=1}^{q} \beta_j h_j(x_i)\}}{1 + \exp\{\sum_{j=1}^{q} \beta_j h_j(x_i)\}}.$$
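For the logistic case, a hedged sketch using scikit-learn (assumed available; the simulated labels and the large C, which approximates the unpenalized maximum-likelihood fit, are my choices, not from the lecture):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 4, 200)
g = rng.binomial(1, 1 / (1 + np.exp(-3 * np.sin(2 * x))))  # binary labels

# Cubic truncated-power basis with knots at the quartiles
knots = np.percentile(x, [25, 50, 75])
H = np.column_stack([x**j for j in range(4)] +
                    [np.maximum(x - xi, 0.0)**3 for xi in knots])

# H already contains the intercept column x**0, hence fit_intercept=False
clf = LogisticRegression(fit_intercept=False, C=1e6).fit(H, g)
p_hat = clf.predict_proba(H)[:, 1]          # fitted probabilities p(x_i)
```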
Boundary Effect

[Figure (ESL Fig. 5.3): pointwise variance curves over $X \in [0, 1]$ for four fits: global linear, global cubic polynomial, cubic spline with 2 knots, and natural cubic spline with 6 knots. The variance of the polynomial and cubic-spline fits grows sharply near the boundaries.]
Natural Cubic Spline

A natural cubic spline is linear beyond the boundary knots ($\xi_1$ and $\xi_K$). This frees up four degrees of freedom relative to the cubic spline, at the price of increased bias near the boundaries. The set of all natural cubic splines with fixed knots $\xi_1, \ldots, \xi_K$ is a $K$-dimensional linear subspace, with basis $\{N_j(X) : 1 \le j \le K\}$:
$$N_1(X) = 1, \quad N_2(X) = X, \quad N_{k+2}(X) = d_k(X) - d_{K-1}(X),$$
where
$$d_k(X) = \frac{(X - \xi_k)_+^3 - (X - \xi_K)_+^3}{\xi_K - \xi_k}.$$
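A minimal sketch of this basis in Python (the function name is mine):

```python
import numpy as np

def natural_cubic_basis(x, knots):
    """Natural cubic spline basis N_1, ..., N_K for knots xi_1 < ... < xi_K."""
    x = np.asarray(x, dtype=float)
    xi = np.asarray(knots, dtype=float)
    K = len(xi)

    def d(k):
        # d_k(X) = [(X - xi_k)_+^3 - (X - xi_K)_+^3] / (xi_K - xi_k)
        return (np.maximum(x - xi[k], 0.0)**3
                - np.maximum(x - xi[-1], 0.0)**3) / (xi[-1] - xi[k])

    cols = [np.ones_like(x), x]                       # N_1 = 1, N_2 = X
    cols += [d(k) - d(K - 2) for k in range(K - 2)]   # N_{k+2} = d_k - d_{K-1}
    return np.column_stack(cols)                      # shape (len(x), K)
```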
Example: South African Heart Disease

A retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa. There are roughly two controls per case of CHD. These data are taken from a larger dataset, described in Rousseauw et al., 1983, South African Medical Journal. Variables:
- sbp: systolic blood pressure
- tobacco: cumulative tobacco (kg)
- ldl: low density lipoprotein cholesterol
- adiposity
- famhist: family history of heart disease (Present, Absent)
- typea: type-A behavior
- obesity
- alcohol: current alcohol consumption
- age: age at onset
- chd: response, coronary heart disease
[Figure (ESL Fig. 5.4): fitted natural-spline functions $\hat{f}(\text{sbp})$, $\hat{f}(\text{tobacco})$, $\hat{f}(\text{ldl})$, $\hat{f}(\text{famhist})$, $\hat{f}(\text{obesity})$, and $\hat{f}(\text{age})$ for the South African heart disease data.]
Smoothing Splines

Among all functions $f(x)$ with two continuous derivatives, find the one that minimizes the penalized residual sum of squares
$$\text{RSS}(f, \lambda) = \sum_{i=1}^{N} [y_i - f(x_i)]^2 + \lambda \int [f''(t)]^2 \, dt.$$
- If $\lambda = 0$, $f$ can be any function that interpolates the data.
- If $\lambda = \infty$, $f$ must be linear: the least-squares fit.
Assuming the inputs $x_1, \ldots, x_N$ are all different, there is a unique minimizer $\hat{f}$, which is a natural cubic spline with $N$ knots at $x_1, \ldots, x_N$. It may seem that this leads to over-fitting; however, the penalty term shrinks the spline coefficients toward the linear fit.
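In practice one rarely solves this variational problem by hand. A hedged sketch using SciPy's UnivariateSpline (assumed available); note it is parameterized by a residual bound s rather than by $\lambda$ directly:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 4, 100))
y = np.sin(2 * x) + 0.3 * rng.standard_normal(100)

# k=3 gives a cubic smoothing spline; larger s yields a smoother fit
spl = UnivariateSpline(x, y, k=3, s=3.0)
grid = np.linspace(0, 4, 200)
f_hat = spl(grid)
```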
Example: Bone Mineral Density Data

Relative spinal bone mineral density measurements on 261 North American adolescents. Each value is the difference in spnbmd taken on two consecutive visits, divided by the average. The age is the average age over the two visits. Variables:
- idnum: identifies the child, and hence the repeat measurements
- age: average age of child when measurements were taken
- gender: male or female
- spnbmd: relative spinal bone mineral density measurement
Degrees of Freedom

The solution takes the form
$$f(x) = \sum_{j=1}^{N} N_j(x) \theta_j.$$
The criterion reduces to
$$\text{RSS}(\theta, \lambda) = (y - N\theta)^T (y - N\theta) + \lambda \theta^T \Omega \theta,$$
where $\{N\}_{ij} = N_j(x_i)$ and $\{\Omega\}_{jk} = \int N_j''(t) N_k''(t) \, dt$. The solution is given by
$$\hat{\theta} = (N^T N + \lambda \Omega)^{-1} N^T y.$$
The fitted values are
$$\hat{f} = N (N^T N + \lambda \Omega)^{-1} N^T y =: S_\lambda y.$$
The effective degrees of freedom of a smoothing spline is defined as
$$\text{df}_\lambda = \text{trace}(S_\lambda).$$
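A rough numerical sketch of $\text{df}_\lambda = \text{trace}(S_\lambda)$, reusing natural_cubic_basis from the earlier sketch; the lecture does not give $\Omega$ in closed form, so it is approximated here by finite differences on a fine grid:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 4, 50))
N = natural_cubic_basis(x, x)               # one knot per data point

# Approximate Omega_jk = int N_j''(t) N_k''(t) dt on a fine grid
grid = np.linspace(x[0], x[-1], 2000)
Ng = natural_cubic_basis(grid, x)
d2 = np.gradient(np.gradient(Ng, grid, axis=0), grid, axis=0)
Omega = np.trapz(d2[:, :, None] * d2[:, None, :], grid, axis=0)

lam = 0.01
S = N @ np.linalg.solve(N.T @ N + lam * Omega, N.T)   # smoother matrix
df = np.trace(S)                            # effective degrees of freedom
```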
Eigenvalues and Eigenvectors: Example

[Figure (ESL Fig. 5.7): smoothing-spline fits of ozone concentration versus Daggot pressure gradient for df = 5 and df = 11 (top), and the first 25 eigenvalues of the corresponding smoother matrices $S_\lambda$ plotted against order (bottom).]
Selection of the Smoothing Parameter

- Fixing the degrees of freedom: try a couple of different values of df, and select one based on approximate $F$-tests, residual plots and other more subjective criteria.
- Bias-variance tradeoff: cross-validation, generalized cross-validation, $C_p$, etc. A leave-one-out sketch follows below.
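A sketch of leave-one-out cross-validation for a linear smoother, using the standard shortcut $\text{CV}(\lambda) = \frac{1}{N} \sum_i \left[ \frac{y_i - \hat{f}_\lambda(x_i)}{1 - \{S_\lambda\}_{ii}} \right]^2$; it reuses x, N, and Omega from the degrees-of-freedom sketch, with simulated responses:

```python
import numpy as np

y = np.sin(2 * x) + 0.3 * np.random.default_rng(1).standard_normal(len(x))

def loocv(lam):
    # Leave-one-out CV without refitting, via the smoother-matrix diagonal
    S = N @ np.linalg.solve(N.T @ N + lam * Omega, N.T)
    resid = y - S @ y
    return np.mean((resid / (1 - np.diag(S)))**2)

lams = 10.0 ** np.linspace(-4, 2, 25)       # grid of candidate lambdas
best_lam = min(lams, key=loocv)
```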
[Figure (ESL Fig. 5.9): cross-validation. EPE($\lambda$) and CV($\lambda$) curves plotted against $\text{df}_\lambda$, alongside smoothing-spline fits of $y$ on $X$ for $\text{df}_\lambda = 5$, $9$, and $15$.]
Nonparametric Logistic Regression

Consider the penalized log-likelihood criterion
$$\hat{\beta} = \arg\max_{\beta} \left\{ \sum_{i=1}^{N} \left\{ g_i \log[p(x_i)] + (1 - g_i) \log[1 - p(x_i)] \right\} - \frac{1}{2} \lambda \int [f''(t)]^2 \, dt \right\},$$
where
$$p(x_i) = \frac{\exp\{f(x_i)\}}{1 + \exp\{f(x_i)\}} \quad \text{and} \quad f(t) = \sum_{j=1}^{q} \beta_j h_j(t).$$
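The lecture does not give a fitting algorithm here. A common choice is penalized iteratively reweighted least squares, sketched below for the smoothing-spline parameterization $f = N\theta$ (reusing N, Omega, and x from the earlier sketches, with simulated binary labels):

```python
import numpy as np

g = (np.sin(2 * x) >
     np.random.default_rng(2).uniform(-1, 1, len(x))).astype(float)

lam = 0.1
theta = np.zeros(N.shape[1])
for _ in range(25):                       # Newton-Raphson iterations
    f = N @ theta
    p = 1 / (1 + np.exp(-f))              # p(x_i) = exp(f)/(1 + exp(f))
    W = p * (1 - p)                       # IRLS weights
    grad = N.T @ (g - p) - lam * Omega @ theta
    hess = N.T @ (N * W[:, None]) + lam * Omega
    theta += np.linalg.solve(hess, grad)

f_hat = N @ theta                         # fitted log-odds
```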
Wavelet Smoothing

Wavelets typically use a complete orthonormal basis to represent functions, but then shrink and select the coefficients toward a sparse representation. They are able to represent both smooth and locally bumpy functions in an efficient way: time and frequency localization. Fit the coefficients for this basis by least squares, and then threshold (discard, filter out) the smaller coefficients. Very popular in signal processing and compression.
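A hedged sketch of wavelet shrinkage with PyWavelets (assumed installed; the symmlet-8 wavelet matches the basis shown on the next slide, and the universal threshold $\sigma\sqrt{2 \log n}$ is a standard default, not specified in the lecture):

```python
import numpy as np
import pywt  # PyWavelets

rng = np.random.default_rng(0)
n = 1024
t = np.linspace(0, 1, n)
signal = (np.where(t > 0.5, np.sin(16 * np.pi * t), 0.0)
          + 0.3 * rng.standard_normal(n))   # bumpy signal + noise

# Symmlet-8 transform, soft-threshold the detail coefficients, invert
coeffs = pywt.wavedec(signal, 'sym8', level=5)
sigma = np.median(np.abs(coeffs[-1])) / 0.6745   # noise scale, finest level
thresh = sigma * np.sqrt(2 * np.log(n))          # universal threshold
coeffs = [coeffs[0]] + [pywt.threshold(c, thresh, mode='soft')
                        for c in coeffs[1:]]
denoised = pywt.waverec(coeffs, 'sym8')
```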
[Figure: Haar wavelets (left) and symmlet-8 wavelets (right), showing basis functions $\psi_{j,k}$ at several scales and translations over time in $[0, 1]$.]
A NMR (Nuclear Magnetic Resonance) Signal

[Figure (ESL Sec. 5.9): an NMR signal plotted over 1024 time points.]
Wavelet Transform

[Figure: wavelet transforms of the original NMR signal (left) and of the WaveShrunk signal (right), showing detail coefficients $W_4, \ldots, W_9$ and smooth coefficients $V_4$ over 1024 time points.]