5.6 Nonparametric Logistic Regression



1 5.6 Nonparametric Logistic Regression. Dmitri Dranishnikov, University of Florida. Statistical Learning

2 Nonparametric Logistic Regression
Nonparametric? This does not mean that there are no parameters. It means only that the number and form of the parameters are not assumed or fixed in advance, but depend on the data.
Logistic? Why would we want to do this? Primarily to constrain the output to lie in [0,1] and to use the model for classification.
For simplicity we consider only 1-D input.

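3 Nonparametric Logistic Regression: The Model
$$p(x) := P(Y = 1 \mid X = x) = \frac{e^{f(x)}}{1 + e^{f(x)}} = \frac{1}{1 + e^{-f(x)}}$$
Y := class indicator. The text assumes 2 classes; the generalization to K classes is the same as in 4.4.
Construct the log-likelihood:
$$\log p(y \mid f) = \log \prod_{i=1}^{N} p(x_i)^{y_i} \, (1 - p(x_i))^{1 - y_i} = \sum_{i=1}^{N} \left[ y_i \log p(x_i) + (1 - y_i) \log(1 - p(x_i)) \right] = \sum_{i=1}^{N} \left[ y_i f(x_i) - \log\left(1 + e^{f(x_i)}\right) \right]$$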
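The last equality in the log-likelihood can be checked numerically. The following is a minimal sketch, assuming NumPy; the function names and the sample values of f and y are hypothetical and serve only as an illustration.

```python
import numpy as np

def log_likelihood_from_p(f, y):
    """Bernoulli log-likelihood written in terms of p(x_i)."""
    p = 1.0 / (1.0 + np.exp(-f))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def log_likelihood_from_f(f, y):
    """Same quantity written in terms of f(x_i) only."""
    # log(1 + e^f) is computed stably as logaddexp(0, f)
    return np.sum(y * f - np.logaddexp(0.0, f))

# Hypothetical values of f(x_i) and class indicators y_i
f_vals = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
y_vals = np.array([0.0, 0.0, 1.0, 1.0, 1.0])

ll_p = log_likelihood_from_p(f_vals, y_vals)
ll_f = log_likelihood_from_f(f_vals, y_vals)
print(ll_p, ll_f)  # the two forms should agree up to rounding
```

The second form is usually preferred in practice because it avoids taking the logarithm of probabilities that are numerically 0 or 1.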

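4 Nonparametric Logistic Regression: The Model
Maximize the penalized log-likelihood:
$$L(f; \lambda) = \sum_{i=1}^{N} \left[ y_i f(x_i) - \log\left(1 + e^{f(x_i)}\right) \right] - \frac{1}{2} \lambda \int \{ f''(t) \}^2 \, dt$$
As before, the optimal f can be shown to be a finite-dimensional natural spline:
$$f(x) = \sum_{j=1}^{N} N_j(x) \, \theta_j$$
Differentiating and calculating the Hessian H, we then apply Newton-Raphson:
$$\theta^{new} = \theta^{old} - H^{-1} \frac{\partial L}{\partial \theta}$$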

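5 Nonparametric Logistic Regression
The resulting update equation is similar to IRLS:
$$\theta^{new} = (N^T W N + \lambda \Omega)^{-1} N^T W \left( N \theta^{old} + W^{-1} (y - p) \right) = (N^T W N + \lambda \Omega)^{-1} N^T W z$$
$$f^{new} = N (N^T W N + \lambda \Omega)^{-1} N^T W \left( f^{old} + W^{-1} (y - p) \right) = N (N^T W N + \lambda \Omega)^{-1} N^T W z = S_{\lambda, w} \, z$$
where $z := f^{old} + W^{-1}(y - p)$ is the adjusted response, and we have:
$$W_{i,i} := p(x_i) (1 - p(x_i)), \qquad p := (p(x_1), \ldots, p(x_N))^T$$
And as before:
$$N_{i,j} := N_j(x_i), \qquad \Omega_{jk} := \int N_j''(t) \, N_k''(t) \, dt$$
It can be shown that the update fits a weighted smoothing spline to the adjusted response z (Exercise).
Questions?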
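The update can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the method from the slides: the function name `penalized_irls`, the toy data, and the cubic polynomial basis (a stand-in for the natural spline basis) are all hypothetical. For that polynomial basis on [-2, 2] the penalty matrix of integrated second-derivative products works out to diag(0, 0, 16, 192).

```python
import numpy as np

def penalized_irls(N, Omega, y, lam, n_iter=25):
    """Iterate theta <- (N^T W N + lam * Omega)^{-1} N^T W z."""
    theta = np.zeros(N.shape[1])
    for _ in range(n_iter):
        f = N @ theta
        p = 1.0 / (1.0 + np.exp(-f))
        w = np.clip(p * (1.0 - p), 1e-10, None)  # diagonal of W
        z = f + (y - p) / w                      # adjusted response
        NtW = N.T * w                            # N^T W for diagonal W
        theta = np.linalg.solve(NtW @ N + lam * Omega, NtW @ z)
    return theta

# Hypothetical 1-D data: class 1 mostly for x > 0, with a few flipped labels
x = np.linspace(-2.0, 2.0, 40)
y = (x > 0).astype(float)
y[[10, 18]] = 1.0
y[[22, 30]] = 0.0

# Stand-in basis (assumption): cubic polynomial instead of natural splines
N = np.column_stack([np.ones_like(x), x, x**2, x**3])
# Omega_jk = integral over [-2, 2] of N_j''(t) N_k''(t) dt for this basis
Omega = np.diag([0.0, 0.0, 16.0, 192.0])
lam = 1.0

theta_hat = penalized_irls(N, Omega, y, lam)
p_hat = 1.0 / (1.0 + np.exp(-(N @ theta_hat)))
# Gradient of the penalized log-likelihood; it should vanish at the optimum
grad = N.T @ (y - p_hat) - lam * Omega @ theta_hat
print(np.max(np.abs(grad)))
```

Each pass is a penalized weighted least-squares fit to z, which is the sense in which the update resembles IRLS.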

6 Kernel Methods (material from "Pattern Recognition and Machine Learning", Bishop): Summary
So far we have considered regression/classification models in terms of sums of different (sometimes non-linear) functions of the input, weighted by parameters:
$$f(x) = \sum_{i=1}^{M} w_i \, \phi_i(x)$$
Fundamentally, the inputs with which we train/optimize these parameters are discarded when making predictions. We can, however, retain all or a portion of the training data through use of a kernel function, which can be viewed as an inner product in a transformed feature space.
"Transformed feature space" = the space of all possible feature vectors $\phi(x)$.

7 Kernel Methods: Summary
Indeed, if our model is a fixed-basis function model we can define
$$k(x_i, x_j) := \phi(x_i)^T \phi(x_j)$$
Now, if we can rewrite this model and the subsequent optimization problem completely in terms of this kernel function (without any other functions of the input), we can then "choose" a different functional form for the kernel in order to obtain a different non-linear model.
This (alternate) kernel representation of the problem in question is known as the dual representation.
The process of kernel replacement is often called the kernel trick or kernel substitution.
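A small numerical example of a kernel acting as an inner product in feature space, assuming NumPy. The quadratic kernel and the explicit feature map `phi` below are a standard textbook pairing chosen for illustration; they are not taken from the slides.

```python
import numpy as np

def phi(x):
    """Explicit feature map for 2-D input (hypothetical choice)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

def quad_kernel(x, y):
    """k(x, y) = (x^T y)^2, evaluated without ever forming phi."""
    return float(np.dot(x, y)) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

via_features = float(phi(a) @ phi(b))  # inner product in feature space
via_kernel = quad_kernel(a, b)         # same value from the kernel alone
print(via_features, via_kernel)
```

Swapping `quad_kernel` for another valid kernel changes the implied feature space without changing any code that only calls the kernel, which is the point of kernel substitution.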

8 Kernel Methods: Summary
The model we consider then becomes:
$$f(x) = \sum_{j=1}^{N} \alpha_j \, k(x, x_j)$$
The kernel trick is useful because we can now deal with feature spaces that are extremely high dimensional (M >> N), or even infinite dimensional, by working with kernels instead. We no longer need to compute the function $\phi$. We can still perform regression/classification tasks (implicitly) within the feature space.
This can also be viewed as a generalization of the model posed for smoothing splines.

9 5.8 Reproducing Kernel Hilbert Spaces
Hilbert Space: A Hilbert space is an inner product space (a vector space) that is complete. The Hilbert spaces we will be dealing with are function spaces, whose elements are real-valued functions $f : X \to \mathbb{R}$. We will denote our Hilbert space by $H(X)$ or $H_K$.
Reproducing Kernel Property:
$$\forall x \in X \;\; \exists K_x \in H_K \text{ such that } \forall f \in H_K : \; f(x) = \langle f, K_x \rangle_{H_K}$$
Then we can define a kernel function:
$$K(x, y) := K_x(y)$$
Note that this implies:
$$K(x, y) = \langle K_x, K_y \rangle_{H_K}$$
It can be shown that every symmetric positive definite kernel function defines a unique Reproducing Kernel Hilbert Space (RKHS), and vice versa (by definition).

10 Reproducing Kernel Hilbert Spaces
We would like to solve the following regularization problem:
$$\min_{f \in H_K} \left[ \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda \langle f, f \rangle_{H_K} \right]$$
Here L is a "loss function" (essentially a cost function), the $x_i$ and $y_i$ are the data and outputs respectively, and $\langle \cdot, \cdot \rangle$ denotes the inner product. First we must select a kernel function and derive some properties of f and of our Hilbert space H(X).

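11 Reproducing Kernel Hilbert Spaces
Select a kernel function
$$K(x, y) : [a, b]^p \times [a, b]^p \to \mathbb{R}$$
There is a unique Hilbert space $H_K$ with this K as its reproducing kernel. After selection, we will consider functions of the form
$$f = \sum_m \alpha_m K(\cdot, y_m) \in H_K$$
Mercer's Theorem tells us we can write:
$$K(x, y) = \sum_{i=1}^{\infty} \gamma_i \, \phi_i(x) \, \phi_i(y)$$
If p = 1, the $\phi_i$ are eigenfunctions of
$$[T_K \phi](x) = \int_a^b K(x, y) \, \phi(y) \, dy$$
We can rewrite f:
$$f(x) = \sum_{i=1}^{\infty} \gamma_i \left( \sum_m \alpha_m \phi_i(y_m) \right) \phi_i(x) = \sum_{i=1}^{\infty} c_i \, \phi_i(x), \qquad c_i := \gamma_i \sum_m \alpha_m \phi_i(y_m)$$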

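12 Reproducing Kernel Hilbert Spaces
Now, consider the problem from before:
$$\min_{f \in H_K} \left[ \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda \langle f, f \rangle_{H_K} \right]$$
L is a "loss function" (essentially a cost function). The $x_i$ and $y_i$ are the data and outputs respectively.
We can now establish that
$$\langle f, f \rangle_{H_K} := \sum_m \sum_n \alpha_m \alpha_n \langle K(\cdot, y_m), K(\cdot, y_n) \rangle = \sum_m \sum_n \alpha_m \alpha_n K(y_n, y_m) = \sum_i \gamma_i \left( \sum_n \alpha_n \phi_i(y_n) \right) \left( \sum_m \alpha_m \phi_i(y_m) \right) = \sum_i \frac{c_i \, c_i}{\gamma_i} = \sum_{i=1}^{\infty} \frac{c_i^2}{\gamma_i}$$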

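13 Reproducing Kernel Hilbert Spaces
The problem becomes:
$$\min_{\{c_j\}} \left[ \sum_{i=1}^{N} L\left(y_i, f_{\{c_j\}}(x_i)\right) + \lambda \sum_{j=1}^{\infty} \frac{c_j^2}{\gamma_j} \right]$$
Solution (Exercise): the solution is finite-dimensional, and has the form
$$f(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i), \qquad \langle f, f \rangle_{H_K} = \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j K(x_i, x_j) = \alpha^T K \alpha$$
The problem is now easier:
$$\min_{\alpha} \; L(y, K\alpha) + \lambda \, \alpha^T K \alpha$$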

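14 Reproducing Kernel Hilbert Spaces
What have we done? We have reduced an infinite-dimensional problem in function space to a finite-dimensional one in Euclidean space by using the kernel. This is good, but if our dataset is large the problem is still complicated. It would be nice if solutions were sparse, that is, if the solution depended on only some of the data points.
Examples: with squared-error loss the problem becomes
$$\min_{\{c_j\}} \left[ \sum_{i=1}^{N} \left( y_i - f_{\{c_j\}}(x_i) \right)^2 + \lambda \sum_{j=1}^{\infty} \frac{c_j^2}{\gamma_j} \right]$$
Solution:
$$\min_{\alpha} \; (y - K\alpha)^T (y - K\alpha) + \lambda \, \alpha^T K \alpha \quad \Rightarrow \quad \hat{\alpha} = (K^T K + \lambda K)^{-1} K^T y$$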

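15 Reproducing Kernel Hilbert Spaces: Examples
$$\hat{\alpha} = (K^T K + \lambda K)^{-1} K^T y$$
Does this look familiar? Recall that the smoothing spline solution was:
$$\hat{\theta} = (N^T N + \lambda \Omega)^{-1} N^T y$$
But K is symmetric, and also invertible (when it is strictly positive definite), so
$$\hat{\alpha} = \left( K^{-1} (K^2 + \lambda K) \right)^{-1} y = (K + \lambda I)^{-1} y$$
$$\hat{f} = K \hat{\alpha} = K (K + \lambda I)^{-1} y, \qquad \hat{f}(x) = \sum_{i=1}^{N} \hat{\alpha}_i K(x, x_i)$$
Now it is just a matter of choosing the "best" kernel function.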
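The simplification of the solution can be verified numerically. The following is a minimal sketch assuming NumPy and a Gaussian kernel with a hypothetical scale parameter `gamma`; the grid of inputs, the sine targets, and the helper names `rbf_gram` and `predict` are illustrative assumptions.

```python
import numpy as np

def rbf_gram(X, Y, gamma=1.0):
    """Gram matrix with entries exp(-gamma * ||x_i - y_j||^2)."""
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

# Hypothetical 1-D training set on a unit-spaced grid
X = np.arange(-3.0, 4.0).reshape(-1, 1)
y = np.sin(X[:, 0])
lam = 0.1

K = rbf_gram(X, X)
# Long form from the normal equations
alpha_long = np.linalg.solve(K.T @ K + lam * K, K.T @ y)
# Simplified form, valid because K is symmetric and invertible here
alpha_short = np.linalg.solve(K + lam * np.eye(len(y)), y)

def predict(X_new):
    """f(x) = sum_i alpha_i K(x, x_i)."""
    return rbf_gram(X_new, X) @ alpha_short

fitted = predict(X)
print(np.max(np.abs(alpha_long - alpha_short)))
```

The short form only requires solving one N x N system with a well-conditioned matrix, so it is the one normally used in practice.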

16 Reproducing Kernel Hilbert Spaces: Radial Basis Function (RBF) Kernel
RBF: depends only on the difference between two points (through its norm), and is therefore invariant under translation of the input space.
"Gaussian Kernel":
$$K(x, y) = e^{-(x - y)^T (x - y)}$$
Probably the most commonly used kernel in practice.
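Translation invariance of the Gaussian kernel is easy to check directly. This sketch assumes NumPy; the scale parameter `gamma` is an added assumption (with `gamma=1.0` it matches the unscaled form), and the test points are hypothetical.

```python
import numpy as np

def gaussian_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * (x - y)^T (x - y))."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * (d @ d)))

u = np.array([0.5, -1.0])
v = np.array([1.5, 0.25])
shift = np.array([10.0, -7.0])

k_plain = gaussian_kernel(u, v)
k_shifted = gaussian_kernel(u + shift, v + shift)  # both inputs translated
print(k_plain, k_shifted)
```

Because only `x - y` enters the formula, shifting both arguments by the same vector leaves the kernel value unchanged.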

17 Reproducing Kernel Hilbert Spaces
$$K = \Phi D \Phi^T = \Phi D^{1/2} D^{T/2} \Phi^T = H H^T, \qquad H := \Phi D^{1/2}$$
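The factorization can be illustrated with an eigendecomposition of a small Gram matrix. This is a sketch assuming NumPy, and assuming the factors are the eigenvector matrix (called `Phi` here) and the diagonal matrix of eigenvalues D; the inputs are hypothetical.

```python
import numpy as np

# Hypothetical 1-D inputs and their Gaussian Gram matrix
x = np.arange(5.0)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)

# Symmetric eigendecomposition: K = Phi diag(eigvals) Phi^T
eigvals, Phi = np.linalg.eigh(K)

# H = Phi D^{1/2}; clipping guards against tiny negative rounding errors
H = Phi * np.sqrt(np.clip(eigvals, 0.0, None))

reconstructed = H @ H.T
print(np.max(np.abs(reconstructed - K)))
```

The rows of H act as finite-dimensional feature vectors whose ordinary inner products reproduce the kernel values on the training points.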
