Cheng Soon Ong & Christian Walder. Canberra, February to June 2018

Cheng Soon Ong & Christian Walder
Research Group and College of Engineering and Computer Science
Canberra, February to June 2018
Outlines: Overview, Introduction, Linear Algebra, Probability, Linear Regression 1, Linear Regression 2, Linear 1, Linear 2, Sparse, Mixture Models and EM 1, Mixture Models and EM 2, Neural Networks 1, Neural Networks 2, Principal Component Analysis, Autoencoders, Graphical Models 1, Graphical Models 2, Graphical Models 3, Sampling, Sequential Data 1, Sequential Data 2
(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

Part XI

Original Input versus Feature Space
- So far we have used the input $x$ directly.
- All classification algorithms also work if we first apply a fixed nonlinear transformation of the inputs using a vector of basis functions $\phi(x)$.
- Example: use two Gaussian basis functions centred at the green crosses in the input space.
[Figure: left panel, input space with axes $x_1$, $x_2$; right panel, feature space with axes $\phi_1$, $\phi_2$.]

Original Input versus Feature Space
- Linear decision boundaries in the feature space correspond to nonlinear decision boundaries in the input space.
- Classes that are not linearly separable in the input space can become linearly separable in the feature space.
- But: if classes overlap in input space, they will also overlap in feature space.
- Nonlinear features $\phi(x)$ cannot remove the overlap, and they may even increase it.
[Figure: left panel, input space with axes $x_1$, $x_2$; right panel, feature space with axes $\phi_1$, $\phi_2$.]
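The mapping from input space to feature space can be sketched in a few lines. This is only an illustration: the centres in `CENTRES` and the width `s` are hypothetical values, not the ones used for the figure.

```python
import math

def gaussian_basis(x, centre, s=0.5):
    """Gaussian basis function exp(-||x - centre||^2 / (2 s^2))."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
    return math.exp(-sq_dist / (2 * s ** 2))

# Hypothetical centres standing in for the "green crosses" (assumption).
CENTRES = [(-0.5, -0.5), (0.5, 0.5)]

def feature_map(x):
    """Map a 2-D input x to the feature vector (phi_1(x), phi_2(x))."""
    return tuple(gaussian_basis(x, c) for c in CENTRES)

# A point sitting on the second centre: phi_2 is exactly 1, phi_1 is small.
phi = feature_map((0.5, 0.5))
print(phi)
```

A linear classifier trained on `feature_map(x)` instead of `x` then has a nonlinear decision boundary in the original input space.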

Where are we?
- Basis function models (regression, classification)
- Flexible basis function models (neural networks): after the semester break

Where are we going?
- Why not use all training data to make predictions for the test inputs?
- Basic ideas:
  - Continuity: targets mostly do not change abruptly.
  - Similarity: each training pair (input, target) tells us something about the possible targets in the neighbourhood of the input.
- Kernels formalise these ideas.

How are we going there?
- Kernels for density estimation
- Nonparametric density estimation
- Kernels for classification
- Basis functions and the kernel trick
- Constructing kernels
Warning: the term kernel is also used for the set of all vectors that a matrix $A$ maps to zero (its null space). This is a different concept. Don't get confused!

Density
- Suppose we observe data points $\{x_n\}_{n=1}^N$, e.g. just $N$ real numbers.
- Suppose we believe these are drawn independently from some distribution $p(x)$, e.g. $p(x)$ is Gaussian with unknown mean and variance.
- Density estimation problem: estimate $p(x)$ from the data.

Nonparametric Density Histogram
- Partition the space into bins of width $\Delta_i$.
- Count the number $n_i$ of samples falling into each bin $i$.
- Normalise: $p_i = \frac{n_i}{N \Delta_i}$
[Figure: histograms of 50 data points generated from the distribution shown by the green curve, for common bin widths $\Delta = 0.04$, $\Delta = 0.08$ and $\Delta = 0.25$.]

Nonparametric Density Histogram
Advantages:
- Data can be discarded after calculating the $p_i$.
- The algorithm can be applied to sequentially arriving data.
Disadvantages:
- Dependency on the bin width $\Delta_i$.
- Discontinuities due to the bin edges.
- Exponential scaling with the dimensionality $D$ of the data: $M^D$ bins are needed for $D$ dimensions and $M$ bins per dimension.
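A minimal sketch of the histogram estimator $p_i = n_i / (N \Delta_i)$ in one dimension, assuming a common bin width and samples that lie in $[lo, hi]$. The function name and the sample values are illustrative only.

```python
def histogram_density(samples, bin_width, lo=0.0, hi=1.0):
    """Return the density estimate p_i = n_i / (N * bin_width) for each bin."""
    n_bins = round((hi - lo) / bin_width)
    counts = [0] * n_bins
    for x in samples:
        # Index of the bin containing x; a sample equal to hi goes in the last bin.
        i = min(int((x - lo) / bin_width), n_bins - 1)
        counts[i] += 1
    n = len(samples)
    return [c / (n * bin_width) for c in counts]

# Illustrative data (assumption): eight points in [0, 1], four bins of width 0.25.
samples = [0.05, 0.1, 0.3, 0.35, 0.4, 0.6, 0.8, 0.95]
density = histogram_density(samples, bin_width=0.25)
print(density)
```

Because each count is divided by $N \Delta$, the bin heights times the bin width sum to one. Once `density` is computed the samples themselves are no longer needed.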

Nonparametric Density - Refined
- Draw data from some unknown probability distribution $p(x)$ in a $D$-dimensional space.
- Consider a small region $R$ containing $x$. The probability mass associated with this region is
  $$P = \int_R p(x)\, dx$$
- Data set of $N$ observations drawn from $p(x)$.
- The total number $K$ of points found inside $R$ is distributed according to the binomial distribution
  $$\mathrm{Bin}(K \mid N, P) = \frac{N!}{K!\,(N-K)!}\, P^K (1-P)^{N-K}$$
- Mean fraction of points in $R$: $E[K/N] = P$
- Variance of this fraction: $\mathrm{var}[K/N] = P(1-P)/N$

Nonparametric Density - Refined
- Mean fraction of points in $R$: $E[K/N] = P$
- Variance of this fraction: $\mathrm{var}[K/N] = P(1-P)/N$
- For large $N$, the distribution will be sharply peaked and therefore $K \approx NP$.
- Assuming also that the region has volume $V$ and is small enough for $p(x)$ to be roughly constant over it, then $P \approx p(x)\, V$.
- Combining two contradictory assumptions:
  - Region $R$ is small enough for $p(x)$ to be roughly constant.
  - Region $R$ is large enough for enough points $K$ to fall into it, so that the binomial distribution is sharply peaked.
- Result:
  $$p(x) \approx \frac{K}{NV}$$

Nonparametric Density - Refined
Two ways to exploit $p(x) \approx \frac{K}{NV}$:
1. Fix $K$ and determine the volume $V$ from the data: K-nearest-neighbours density estimation
2. Fix $V$ and determine $K$ from the data: kernel density estimation

Nonparametric Nearest Neighbour
- Fix $K$ and find an appropriate value for $V$.
- Consider a small sphere around $x$ and allow its radius to increase until it contains exactly $K$ data points.
- Calculate the density estimate as $p(x) \approx \frac{K}{NV}$.
[Figure: nearest neighbour density model for $K = 1$, $K = 5$ and $K = 30$.]
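A one-dimensional sketch of the K-nearest-neighbour estimate, assuming that the "sphere" around $x$ is the interval $[x - r, x + r]$ of volume $V = 2r$, where $r$ is the distance to the $K$-th nearest sample. The function name and data are illustrative.

```python
def knn_density(x, samples, k):
    """Estimate p(x) as K / (N * V), with V set by the k-th nearest sample."""
    dists = sorted(abs(x - xn) for xn in samples)
    radius = dists[k - 1]          # smallest radius containing k samples
    volume = 2 * radius            # length of the interval [x - r, x + r]
    # Note: radius is zero if x coincides with a sample and k = 1.
    return k / (len(samples) * volume)

samples = [0.0, 1.0, 2.0, 4.0]     # illustrative data (assumption)
estimate = knn_density(1.0, samples, k=3)
print(estimate)
```

Small $K$ gives a spiky estimate and large $K$ a smooth one, which mirrors the role of the bin width in the histogram.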

Nonparametric Parzen Estimator
- Define the region $R$ to be a small hypercube around $x$.
- Define the Parzen window (kernel function)
  $$k(u) = \begin{cases} 1, & |u_i| \le 1/2, \quad i = 1, \dots, D \\ 0, & \text{otherwise} \end{cases}$$
- Total number of data points inside the hypercube centred at $x$ with side length $h$:
  $$K = \sum_{n=1}^N k\left( \frac{x - x_n}{h} \right)$$
- Density estimate for $p(x)$:
  $$p(x) \approx \frac{K}{NV} = \frac{1}{N} \sum_{n=1}^N \frac{1}{h^D}\, k\left( \frac{x - x_n}{h} \right)$$
- Interpret this as a sum over $N$ cubes centred at each of the $x_n$.
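The hypercube estimator can be sketched directly from the two formulas above. Points are represented as tuples of length $D$; the function names and the sample data are illustrative.

```python
def parzen_window(u):
    """k(u) = 1 if every component satisfies |u_i| <= 1/2, else 0."""
    return 1.0 if all(abs(ui) <= 0.5 for ui in u) else 0.0

def parzen_density(x, samples, h):
    """p(x) = K / (N * h^D), where K counts samples in the cube of side h at x."""
    d = len(x)
    k_total = sum(
        parzen_window([(xi - xni) / h for xi, xni in zip(x, xn)])
        for xn in samples
    )
    return k_total / (len(samples) * h ** d)

# Illustrative 2-D data (assumption): two points near the origin, two far away.
samples = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (2.0, 2.0)]
estimate = parzen_density((0.0, 0.0), samples, h=0.5)
print(estimate)
```

Here two of the four samples fall in the cube of volume $0.5^2$, so the estimate is $2 / (4 \cdot 0.25)$.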

Nonparametric Parzen Estimator
- Remaining problem: discontinuities because of the hypercube (a point is either in or out).
- Choose a smoother kernel function (and normalise correctly).
- Common choice: Gaussian kernel
  $$p(x) = \frac{1}{N} \sum_{n=1}^N \frac{1}{(2\pi h^2)^{D/2}} \exp\left\{ -\frac{\|x - x_n\|^2}{2h^2} \right\}$$
- Can choose any other kernel function $k(u)$ obeying
  $$k(u) \ge 0, \qquad \int k(u)\, du = 1$$

Nonparametric Parzen Estimator
- Gaussian kernel
  $$p(x) = \frac{1}{N} \sum_{n=1}^N \frac{1}{(2\pi h^2)^{D/2}} \exp\left\{ -\frac{\|x - x_n\|^2}{2h^2} \right\}$$
- $h$ controls the trade-off between sensitivity to noise and over-smoothing.
[Figure: kernel density model with Gaussian kernel for $h = 0.005$, $h = 0.07$ and $h = 0.2$.]
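A one-dimensional ($D = 1$) sketch of the Gaussian kernel density estimate, evaluated for the three bandwidths shown in the figure. The sample values are made up for illustration.

```python
import math

def gaussian_kde(x, samples, h):
    """p(x) = (1/N) * sum_n N(x | x_n, h^2) for scalar x (D = 1)."""
    norm = math.sqrt(2 * math.pi * h ** 2)
    total = sum(math.exp(-(x - xn) ** 2 / (2 * h ** 2)) / norm for xn in samples)
    return total / len(samples)

samples = [0.2, 0.5, 0.9]          # illustrative data (assumption)
for h in (0.005, 0.07, 0.2):
    # Small h: sharp spikes at the samples. Large h: one broad bump.
    print(h, gaussian_kde(0.5, samples, h))
```

Each sample contributes a normalised Gaussian bump, so the estimate integrates to one for any $h > 0$.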

The Role of Training Data
- Parametric methods: learn the model parameter $w$ from the training data $t$, then discard the training data $t$.
- Nonparametric methods: use the training data directly for prediction. Example: k-nearest neighbours uses the $k$ closest data points from the training set for classification.
- Kernel methods: base the prediction on a linear combination of kernel functions evaluated at the training data.

Features
- A feature is a measurable property of a phenomenon being observed, or any derived property thereof.
  - Raw features: the original data.
  - Derived features: mappings of the original features to some other space (possibly high- or infinite-dimensional, e.g. basis functions).
- Feature selection: which features matter for the problem at hand?
  - Redundant features
  - Problem dependent
- Feature extraction: can we combine the important features into a smaller set of new features?
  - Compact representation versus ability to explain to a human

Very simple example - XOR
  x1   x2   y = x1 xor x2
  -1   -1    1
  -1    1   -1
   1   -1   -1
   1    1    1
- Not linearly separable (why?)
- Raw features: {(-1, -1), (-1, 1), (1, -1), (1, 1)}
[Figure: the four data points plotted in the (x1, x2) plane.]

Very simple example - XOR
  x1   x2   x_new = x1 x2   y = x1 xor x2
  -1   -1    1               1
  -1    1   -1              -1
   1   -1   -1              -1
   1    1    1               1
- Feature extraction: $x_{new} = x_1 x_2$
- The data is now separable!
- All classification algorithms also work if we first apply a fixed nonlinear transformation of the inputs using a vector of basis functions $\phi(x)$.
[Figure: the data plotted along the new feature $x_{new}$.]
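The table can be checked in a few lines: with the derived feature $x_{new} = x_1 x_2$, a single threshold at zero classifies all four points. The names `extract_feature` and `predict` are illustrative.

```python
# The four XOR data points with targets as given in the table.
data = [((-1, -1), 1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]

def extract_feature(x):
    """Derived feature x_new = x1 * x2."""
    return x[0] * x[1]

def predict(x):
    """Linear classifier in the 1-D feature space: threshold x_new at zero."""
    return 1 if extract_feature(x) > 0 else -1

print(all(predict(x) == y for x, y in data))
```

No such threshold exists on a linear function of the raw features $(x_1, x_2)$, which is what "not linearly separable" means here.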

Kernel methods in one slide
- Consider a labelled training set $\{x_i, t_i\}_{i=1}^N$.
- On a new point $x$, we will predict
  $$y(x) = \sum_{i=1}^N \alpha_i\, K(x, x_i)\, t_i$$
  where $\{\alpha_i\}_{i=1}^N$ are weights to be determined, and $K(\cdot, \cdot)$ is a kernel function.
- The kernel function measures the similarity between any two examples.
- The prediction is a weighted average of the training targets.
- The weights depend on the similarity of $x$ to each training example.

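Intuition
Suppose we perform linear regression with a feature matrix $\Phi$ and target vector $t$, where
$$\Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \dots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \dots & \phi_{M-1}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \dots & \phi_{M-1}(x_N) \end{pmatrix} = \begin{pmatrix} \phi(x_1)^T \\ \phi(x_2)^T \\ \vdots \\ \phi(x_N)^T \end{pmatrix}$$
Recall that the optimal (regularised) $w$ is
$$w = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t$$
Thus, the prediction for a new point $x$ with feature vector $\phi(x)$ is
$$y(x) = \phi(x)^T w = \phi(x)^T (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t$$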

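Intuition
Prediction with the optimal (regularised) $w$:
$$y(x) = \phi(x)^T w = \phi(x)^T (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t$$
Suppose that $M$ is very large. Then the inverse of the $M \times M$ matrix above will be expensive to compute. Consider however the following trick:
$$\phi(x)^T (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t = \phi(x)^T \Phi^T (\lambda I + \Phi \Phi^T)^{-1} t$$
This follows from $(\lambda I + \Phi^T \Phi)\, \Phi^T = \Phi^T (\lambda I + \Phi \Phi^T)$, where the identity matrix is $M \times M$ on the left and $N \times N$ on the right.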

Intuition
- We have thus written the prediction as
  $$y(x) = \phi(x)^T \Phi^T (\lambda I + \Phi \Phi^T)^{-1} t$$
- Now our prediction is determined by an $N \times N$ rather than an $M \times M$ matrix.
- $\Phi \Phi^T$ is known as the kernel matrix of the training data.
- Intuitively, it measures the similarities between the training instances.
- Why? Because the inner product between two points is a measure of similarity: for a fixed $u$, the unit vector $v$ that maximises $\langle u, v \rangle$ is the one pointing in the same direction as $u$.
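The identity between the $M \times M$ and the $N \times N$ form of the prediction can be checked numerically. The sketch below uses NumPy with random matrices; the sizes and the value of $\lambda$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 5, 8, 0.1                 # arbitrary sizes and regulariser (assumption)
Phi = rng.normal(size=(N, M))         # feature matrix, one row per training point
t = rng.normal(size=N)                # target vector
phi_x = rng.normal(size=M)            # feature vector of a new point

# Primal form: solve an M x M system.
primal = phi_x @ np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Dual form: solve an N x N system involving the kernel matrix Phi Phi^T.
dual = phi_x @ Phi.T @ np.linalg.solve(lam * np.eye(N) + Phi @ Phi.T, t)

print(np.isclose(primal, dual))
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a standard numerical choice and does not change the result being compared.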

Consider a linear regression model with regularised sum-of-squares error
$$J(w) = \frac{1}{2} \sum_{n=1}^N \left( w^T \phi(x_n) - t_n \right)^2 + \frac{\lambda}{2} w^T w$$
where $\lambda \ge 0$. We could also write this in more compact form as
$$J(w) = \frac{1}{2} (t - \Phi w)^T (t - \Phi w) + \frac{\lambda}{2} w^T w$$
with the target vector $t = (t_1, \dots, t_N)^T$ and the design matrix
$$\Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \dots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \dots & \phi_{M-1}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \dots & \phi_{M-1}(x_N) \end{pmatrix}$$

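Critical points of
$$J(w) = \frac{1}{2} (t - \Phi w)^T (t - \Phi w) + \frac{\lambda}{2} w^T w$$
satisfy
$$w = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T t$$
$$(\Phi^T \Phi + \lambda I)\, w = \Phi^T t$$
$$\lambda w = \Phi^T (t - \Phi w)$$
$$w = \Phi^T a = \sum_{n=1}^N \phi(x_n)\, a_n$$
where $a = (a_1, \dots, a_N)^T$ with components
$$a_n = -\frac{1}{\lambda} \left\{ w^T \phi(x_n) - t_n \right\}$$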

Now express $J(w)$ as a function of the new variable $a$ instead of $w$ via the relation $w = \Phi^T a$:
$$J(a) = \frac{1}{2} a^T \Phi \Phi^T \Phi \Phi^T a - a^T \Phi \Phi^T t + \frac{1}{2} t^T t + \frac{\lambda}{2} a^T \Phi \Phi^T a$$
where again $t = (t_1, \dots, t_N)^T$. This is known as the dual representation.
Define the $N \times N$ Gram matrix $K = \Phi \Phi^T$ with elements
$$K_{nm} = \phi(x_n)^T \phi(x_m) = k(x_n, x_m).$$
Express $J(a)$ now as
$$J(a) = \frac{1}{2} a^T K K a - a^T K t + \frac{1}{2} t^T t + \frac{\lambda}{2} a^T K a.$$

Critical Points of J(a)
Let's calculate the critical points of
$$J(a) = \frac{1}{2} a^T K K a - a^T K t + \frac{1}{2} t^T t + \frac{\lambda}{2} a^T K a.$$
The directional derivative
$$DJ(a)(\xi) = \xi^T K K a - \xi^T K t + \lambda\, \xi^T K a$$
should be zero in all possible directions $\xi$. Therefore $K (K a - t + \lambda a) = 0$, which is satisfied by
$$a = (K + \lambda I_N)^{-1} t.$$
Second directional derivative (using $K = \Phi \Phi^T$):
$$D^2 J(a)(\xi, \xi) = \xi^T K K \xi + \lambda\, \xi^T K \xi = \|K \xi\|^2 + \lambda \|\Phi^T \xi\|^2 \ge 0.$$
Hence $J$ is convex and $a = (K + \lambda I_N)^{-1} t$ minimises $J(a)$.
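As a sanity check, the dual solution $a = (K + \lambda I_N)^{-1} t$ should reproduce the primal weights through $w = \Phi^T a$, and should agree with $a_n = -\frac{1}{\lambda}\{w^T \phi(x_n) - t_n\}$. A small NumPy sketch with random data (sizes and $\lambda$ chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, lam = 6, 3, 0.5                 # arbitrary sizes and regulariser (assumption)
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)

# Primal solution: w = (Phi^T Phi + lam I)^{-1} Phi^T t
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

# Dual solution: a = (K + lam I_N)^{-1} t with Gram matrix K = Phi Phi^T
K = Phi @ Phi.T
a = np.linalg.solve(K + lam * np.eye(N), t)

w_from_a = Phi.T @ a                  # w = Phi^T a
a_from_w = -(Phi @ w - t) / lam       # a_n = -(1/lam) (w^T phi(x_n) - t_n)

print(np.allclose(w_from_a, w), np.allclose(a_from_w, a))
```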

Prediction for the Linear Regression Model
Inserting the argument $a$ which minimises the error $J(a)$ into the prediction model for linear regression, we get the prediction
$$y(x) = w^T \phi(x) = a^T \Phi\, \phi(x) = (\Phi\, \phi(x))^T a = k(x)^T (K + \lambda I_N)^{-1} t$$
where we defined the vector $k(x)$ with elements
$$k_n(x) = k(x_n, x) = \phi(x_n)^T \phi(x).$$
The prediction $y(x)$ can be expressed entirely in terms of the kernel function $k(x, x')$ evaluated at the training and test data.
Looks familiar? See Bayesian Linear Regression.
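Putting the pieces together gives a small kernel regression routine. The sketch below assumes a Gaussian kernel with an arbitrary width `sigma` and toy sinusoidal data; the function names are illustrative and not part of the slides.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.2):
    """Matrix of k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def fit_dual(X, t, lam):
    """Dual coefficients a = (K + lam I_N)^{-1} t."""
    K = gaussian_kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(len(X)), t)

def predict(X_train, a, X_new):
    """y(x) = k(x)^T a, where k_n(x) = k(x_n, x)."""
    return gaussian_kernel(X_new, X_train) @ a

# Toy data (assumption): 20 noise-free samples of a sine curve on [0, 1].
X = np.linspace(0.0, 1.0, 20)[:, None]
t = np.sin(2 * np.pi * X[:, 0])
lam = 1e-3
a = fit_dual(X, t, lam)
y = predict(X, a, X)                  # predictions at the training inputs
print(np.max(np.abs(y - t)))
```

No feature vector $\phi(x)$ is ever formed: training and prediction only touch the data through kernel evaluations, at the cost of solving an $N \times N$ system.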

The Kernel Function
- The kernel function is defined over two points, $x$ and $x'$, of the input space.
- $k(x, x')$ is symmetric.
- $k(x, x') = \phi(x)^T \phi(x')$.
- It is an inner product of two vectors of basis functions: $k(x, x') = \langle \phi(x), \phi(x') \rangle$.
- For prediction, the kernel function will be evaluated at the training data points. (See next slides.)

Dual Representation
What have we gained by the dual representation?
- We now need to invert an $N \times N$ matrix, where $N$ is the number of data points. This can be large!
- In the parameter space formulation, we only needed to invert an $M \times M$ matrix, where $M$ was the number of basis functions.
- But a kernel corresponds to an inner product of basis functions, so we can use a large number of basis functions, even infinitely many.
- We can construct new valid kernels directly from given ones (whatever the corresponding basis functions of the new kernel might be).
- As a kernel defines a kind of similarity between two points in the input space, we can define kernels over graphs, sets, strings, and text documents.

Kernels from Basis Functions
1. Choose a set of basis functions $\{\phi_1, \dots, \phi_M\}$.
2. Find a new kernel as an inner product between vectors of basis functions evaluated at $x$ and $x'$:
   $$k(x, x') = \phi(x)^T \phi(x') = \sum_{i=1}^M \phi_i(x)\, \phi_i(x')$$
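The two steps translate directly into code. The polynomial basis $\phi_i(x) = x^i$ used here is a hypothetical choice for illustration; any set of basis functions works the same way.

```python
def phi(x, m=4):
    """Hypothetical polynomial basis: phi_i(x) = x**i for i = 1, ..., m."""
    return [x ** i for i in range(1, m + 1)]

def kernel(x, x_prime, m=4):
    """k(x, x') = sum_i phi_i(x) * phi_i(x')."""
    return sum(a * b for a, b in zip(phi(x, m), phi(x_prime, m)))

# 2*3 + 4*9 + 8*27 + 16*81
print(kernel(2.0, 3.0))
```

Symmetry of the kernel is immediate from the construction, since swapping `x` and `x_prime` only swaps the factors in each product.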

Kernels from Basis Functions
Polynomial basis functions
[Figure: polynomial basis functions and the corresponding kernel $k(x, x')$ as a function of $x$ for $x' = 0.5$ (red cross).]

Kernels from Basis Functions
Gaussian basis functions
[Figure: Gaussian basis functions and the corresponding kernel $k(x, x')$ as a function of $x$ for $x' = 0.0$ (red cross).]

Kernels from Basis Functions
Logistic sigmoid basis functions
[Figure: logistic sigmoid basis functions and the corresponding kernel $k(x, x')$ as a function of $x$ for $x' = 0.0$ (red cross).]

Kernels by Guessing a Kernel Function
1. Choose a mapping from two points of the input space to a real number which is symmetric in its arguments, e.g.
   $$k(x, z) = (x^T z)^2 = k(z, x)$$
2. Try to write this as an inner product of a vector-valued function evaluated at the arguments $x$ and $z$, e.g.
   $$k(x, z) = (x^T z)^2 = (x_1 z_1 + x_2 z_2)^2$$
   $$= x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2$$
   $$= (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)\, (z_1^2, \sqrt{2}\, z_1 z_2, z_2^2)^T$$
   $$= \phi(x)^T \phi(z)$$
   with the feature mapping $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)^T$.
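The expansion can be verified numerically for any pair of 2-D points. The two test points below are arbitrary.

```python
import math

def k(x, z):
    """Kernel evaluated directly: (x^T z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit feature map (x1^2, sqrt(2) x1 x2, x2^2)."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, z = (1.0, 2.0), (3.0, -1.0)        # arbitrary points (assumption)
lhs = k(x, z)                                      # (3 - 2)^2
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))   # 9 - 12 + 4, up to rounding
print(math.isclose(lhs, rhs))
```

Evaluating `k` needs one 2-D inner product, whereas the explicit route builds two 3-D feature vectors first. For higher degrees and dimensions this gap grows quickly.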

New Kernels From Theory
- A necessary and sufficient condition for $k(x, x')$ to be a valid kernel is that the kernel matrix $K$, whose elements are $k(x_n, x_m)$, should be positive semidefinite for all possible choices of the set $\{x_n\}$.
- Previously, we constructed $K = \Phi \Phi^T$, which is automatically positive semidefinite (why?).
- If we can explicitly construct the kernel via basis functions, we are good.
- Even if we cannot find the basis functions easily, we may be able to deduce that $k(x, x')$ is a valid kernel.
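Positive semidefiniteness of a given Gram matrix can be checked through its eigenvalues. The sketch below is only a spot check on one random point set: passing it is necessary for a valid kernel but does not prove validity, since the condition must hold for all point sets. The helper `is_psd` and its tolerance are assumptions made for illustration.

```python
import numpy as np

def is_psd(K, tol=1e-8):
    """Return True if the symmetric matrix K has no eigenvalue below -tol."""
    return bool(np.linalg.eigvalsh(K).min() >= -tol)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))           # six arbitrary 2-D points (assumption)

K_linear = X @ X.T                    # Gram matrix of the linear kernel
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_gauss = np.exp(-sq_dists / 2.0)     # Gaussian kernel with sigma = 1

# The negated linear kernel k(x, x') = -x^T x' fails the test.
print(is_psd(K_linear), is_psd(K_gauss), is_psd(-K_linear))
```

For $K = \Phi \Phi^T$ the answer to "why?" is visible in one line: $\xi^T K \xi = \|\Phi^T \xi\|^2 \ge 0$ for every $\xi$.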

New Kernels From Other Kernels
Given valid kernels $k_1(x, x')$ and $k_2(x, x')$, the following kernels are also valid:
- $k(x, x') = c\, k_1(x, x')$, with $c > 0$ a constant
- $k(x, x') = f(x)\, k_1(x, x')\, f(x')$, with $f(\cdot)$ any function
- $k(x, x') = q(k_1(x, x'))$, with $q(\cdot)$ a polynomial with nonnegative coefficients
- $k(x, x') = \exp(k_1(x, x'))$
- $k(x, x') = k_1(x, x') + k_2(x, x')$
- $k(x, x') = k_1(x, x')\, k_2(x, x')$
- $k(x, x') = k_3(\phi(x), \phi(x'))$, with $\phi(x)$ any function to $\mathbb{R}^M$ and $k_3(\cdot, \cdot)$ a valid kernel in $\mathbb{R}^M$
- $k(x, x') = x^T A x'$, with $A = A^T$, $A \succeq 0$
- $k(x, x') = k_a(x_a, x_a') + k_b(x_b, x_b')$, with $x = (x_a, x_b)$
- $k(x, x') = k_a(x_a, x_a')\, k_b(x_b, x_b')$, with $x = (x_a, x_b)$
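Some of these rules can be written as small combinators that take kernels and return kernels. This is a sketch with illustrative names (`scale`, `add`, `multiply`, `exponentiate`); the final eigenvalue check is a numerical spot check on one random point set, not a proof.

```python
import numpy as np

def linear(x, z):
    return float(x @ z)

def gaussian(x, z, sigma=1.0):
    return float(np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2)))

# Combinators corresponding to four of the rules above.
def scale(k, c):
    return lambda x, z: c * k(x, z)              # c * k1, c > 0

def add(k1, k2):
    return lambda x, z: k1(x, z) + k2(x, z)      # k1 + k2

def multiply(k1, k2):
    return lambda x, z: k1(x, z) * k2(x, z)      # k1 * k2

def exponentiate(k):
    return lambda x, z: float(np.exp(k(x, z)))   # exp(k1)

def gram(k, X):
    """Gram matrix with entries k(x_n, x_m)."""
    return np.array([[k(x, z) for z in X] for x in X])

composite = add(scale(linear, 2.0), multiply(gaussian, exponentiate(linear)))

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(5, 2))          # arbitrary points (assumption)
K = gram(composite, X)
print(np.linalg.eigvalsh(K).min() >= -1e-8)
```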

New Kernels From Other Kernels
Further examples of kernels:
- $k(x, x') = (x^T x')^M$: only terms of degree $M$
- $k(x, x') = (x^T x' + c)^M$: all terms up to degree $M$
- $k(x, x') = \exp\left( -\|x - x'\|^2 / 2\sigma^2 \right)$: Gaussian kernel
- $k(x, x') = \tanh\left( a\, x^T x' + b \right)$: sigmoidal kernel
Generally, we call
- $k(x, x') = x^T x'$ a linear kernel
- $k(x, x') = k(x - x')$ a stationary kernel
- $k(x, x') = k(\|x - x'\|)$ a homogeneous kernel

Kernels over Graphs, Sets, Strings, Texts
- We only need an appropriate similarity measure $k(x, x')$ which is a kernel.
- Example: given a set $A$, consider the set of all subsets of $A$, called the power set $P(A)$.
- For two subsets $A_1, A_2 \in P(A)$, denote the number of elements of the intersection of $A_1$ and $A_2$ by $|A_1 \cap A_2|$.
- Then it can be shown that
  $$k(A_1, A_2) = 2^{|A_1 \cap A_2|}$$
  corresponds to an inner product in a feature space. Therefore, $k(A_1, A_2)$ is a valid kernel function.
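One feature space that works, sketched here as an assumption rather than the construction intended on the slide, has one coordinate per subset $U$ of $A$, equal to 1 if $U \subseteq A_1$ and 0 otherwise. The inner product then counts the subsets of $A_1 \cap A_2$, of which there are $2^{|A_1 \cap A_2|}$.

```python
from itertools import combinations

def powerset(s):
    """All subsets of s, as frozensets, in a fixed order."""
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def set_kernel(a1, a2):
    """k(A1, A2) = 2^{|A1 intersect A2|}."""
    return 2 ** len(a1 & a2)

def phi(a1, universe):
    """One indicator per subset U of the universe: 1 if U is a subset of a1."""
    return [1 if u <= a1 else 0 for u in powerset(universe)]

A = {1, 2, 3, 4}                      # illustrative universe (assumption)
A1, A2 = frozenset({1, 2, 3}), frozenset({2, 3, 4})
inner = sum(p * q for p, q in zip(phi(A1, A), phi(A2, A)))
print(set_kernel(A1, A2), inner)
```

The explicit feature vector has $2^{|A|}$ entries, so it is only practical for tiny sets, whereas the kernel itself needs just one set intersection.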

Kernels from Probabilistic Generative Models
- Given $p(x)$, we can define a kernel
  $$k(x, x') = p(x)\, p(x'),$$
  which means two inputs $x$ and $x'$ are similar if they both have high probabilities.
- Include a weighting function $p(i)$ and extend the kernel to
  $$k(x, x') = \sum_i p(x \mid i)\, p(x' \mid i)\, p(i).$$
- For a continuous variable $z$:
  $$k(x, x') = \int p(x \mid z)\, p(x' \mid z)\, p(z)\, dz.$$
- Example: hidden Markov model with sequences of length $L$.
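A small sketch of the discrete version, assuming a hypothetical two-component Gaussian mixture in one dimension. The weights, means and standard deviations in `COMPONENTS` are made up for illustration.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Hypothetical mixture: (p(i), mean_i, std_i) for each component i.
COMPONENTS = [(0.3, -1.0, 0.5), (0.7, 2.0, 1.0)]

def generative_kernel(x, x_prime):
    """k(x, x') = sum_i p(x | i) p(x' | i) p(i)."""
    return sum(w * normal_pdf(x, m, s) * normal_pdf(x_prime, m, s)
               for w, m, s in COMPONENTS)

# Two points under the same component versus points under different components.
print(generative_kernel(-1.0, -1.0), generative_kernel(-1.0, 2.0))
```

Two inputs score highly when some component $i$ assigns high probability to both, which is the sense in which this kernel measures similarity.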

Kernels: Summary
- Pick a suitable kernel function $k(x, x')$, e.g. by computing the inner product of some basis functions.
- Make predictions by suitably combining $k(x, x_n)$ for each training example $x_n$: implicitly, a linear model in some high-dimensional space.
- For linear regression, we go from
  $$y(x) = \phi(x)^T (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t$$
  to
  $$y(x) = k(x)^T (K + \lambda I_N)^{-1} t$$
- We can plug in a suitable kernel function to implicitly perform a nonlinear transformation.

Kernels: Summary
Working with a nonlinear kernel, we are implicitly performing a nonlinear transformation of our data.
[Figure: result with the linear kernel $k(x, x') = x^T x'$.]

Kernels: Summary
Working with a nonlinear kernel, we are implicitly performing a nonlinear transformation of our data.
[Figure: result with the nonlinear kernel $k(x, x') = (x^T x')^2$.]
