Lecture 4: Neural Networks (NN) and Introduction to Kernel Methods
Thomas Schön, Division of Automatic Control, Linköping University, Linköping, Sweden.
Email: schon@isy.liu.se, Phone: 013-281373, Office: House B, Entrance 27.

Outline lecture 4 2(26)
1. Summary of lecture 3
2. Generalize the linear model to a nonlinear function expansion
3. Training of neural networks
4. Successful examples of NN in real life: system identification and handwritten digit classification
5. Introducing kernel methods
6. Different ways of constructing kernels

Summary of lecture 3 (I/III) 3(26)
Investigated one linear discriminant (a function that takes an input and assigns it to one of K classes) method in detail: least squares. Modelled each class as y_k(x) = w_k^T x + w_{k,0} and solved the LS problem, resulting in ŵ = (X^T X)^{-1} X^T T.
Showed how probabilistic generative models can be built for classification using the strategy:
1. Model p(x | C_k) (a.k.a. the class-conditional density).
2. Model p(C_k).
3. Use ML to find the parameters in p(x | C_k) and p(C_k).
4. Use Bayes' rule to find p(C_k | x).

Summary of lecture 3 (II/III) 4(26)
The direct method called logistic regression was introduced. Start by stating the model
p(C_1 | φ) = σ(w^T φ) = 1 / (1 + e^{-w^T φ}),
which results in the log-likelihood function
L(w) = ln p(t | w) = Σ_{n=1}^N ( t_n ln(y_n) + (1 - t_n) ln(1 - y_n) ),
where y_n = p(C_1 | φ_n) = σ(w^T φ_n). Note that this is a nonlinear but concave function of w. Hence, we can easily find its global maximum (equivalently, the global minimum of the negative log-likelihood) using Newton's method, resulting in an algorithm known as IRLS (iteratively reweighted least squares).
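
A minimal base-MATLAB sketch of the IRLS iteration summarized above, run on synthetic data; the data, the true weight vector wtrue and the fixed iteration count are assumptions made purely for illustration.

    % Minimal IRLS sketch for logistic regression on synthetic data (illustration only).
    rng(1);
    N = 200;  M = 3;
    Phi = [ones(N, 1), randn(N, M-1)];                     % design matrix with a bias column
    wtrue = [0.5; 1.5; -2];                                 % assumed "true" weights
    t = double(rand(N, 1) < 1 ./ (1 + exp(-Phi*wtrue)));    % synthetic 0/1 labels
    w = zeros(M, 1);
    for iter = 1:10
        y = 1 ./ (1 + exp(-Phi*w));                         % y_n = sigma(w' phi_n)
        R = diag(y .* (1 - y));                             % IRLS weighting matrix
        g = Phi' * (y - t);                                 % gradient of the negative log-likelihood
        H = Phi' * R * Phi;                                 % Hessian
        w = w - H \ g;                                      % Newton update
    end
    disp([wtrue, w])                                        % true vs estimated weights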

Summary of lecture 3 (III/III) 5(26)
The likelihood function for logistic regression is
p(t | w) = ∏_{n=1}^N σ(w^T φ_n)^{t_n} (1 - σ(w^T φ_n))^{1 - t_n}.
Hence, computing the posterior density p(w | t) = p(t | w) p(w) / p(t) is intractable, and we considered the Laplace approximation for solving this. The Laplace approximation is a simple (local) approximation that is obtained by fitting a Gaussian centered around the (MAP) mode of the distribution.

(Very) brief on evaluating classifiers 6(26)
Let us define the following concepts:
True positives (tp): items correctly classified as belonging to the class.
True negatives (tn): items correctly classified as not belonging to the class.
False positives (fp): items incorrectly classified as belonging to the class; in other words, false alarms.
False negatives (fn): items not classified as belonging to the class even though they should have been; in other words, missed detections.
Precision = tp / (tp + fp), Recall = tp / (tp + fn).

Two examples of neural networks in use 7(26)
1. System identification
2. Handwritten digit classification
These examples provide a glimpse into a few real-life applications of models based on nonlinear function expansions (i.e., neural networks), both for regression and classification problems. We provide references for a more complete treatment.

NN example 1 - system identification (I/V) 8(26)
Neural networks are one of the standard models used in nonlinear system identification.
Problem background: the task here is to identify a dynamical model of a magnetorheological (MR) fluid damper. The MR fluid (typically some kind of oil) greatly increases its so-called apparent viscosity when the fluid is subjected to a magnetic field. MR fluid dampers are semi-active control devices which are used to reduce vibrations.
Input signal: velocity v(t) [cm/s] of the damper.
Output signal: damping force f(t) [N].
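
Before continuing with the system-identification example, here is a minimal MATLAB sketch of the precision and recall definitions from slide 6; the binary labels and predictions are made up for illustration only.

    % Precision/recall on made-up labels (1 = belongs to the class) and predictions.
    ytrue = [1 1 0 0 1 0 1 0];
    ypred = [1 0 0 1 1 0 1 0];
    tp = sum(ypred == 1 & ytrue == 1);        % true positives
    fp = sum(ypred == 1 & ytrue == 0);        % false positives (false alarms)
    fn = sum(ypred == 0 & ytrue == 1);        % false negatives (missed detections)
    precision = tp / (tp + fp)                % 3/4 for these labels
    recall    = tp / (tp + fn)                % 3/4 for these labels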

NN example 1 - system identification (II/V) 9(26)
Have a look at the data. [Figure: measured velocity v [cm/s] and damping force f [N] plotted against time for the estimation data ze and the validation data zv.]
As usual, we try simple things first, that is, a linear model. The best linear model turns out to be an output error (OE) model, which gives a 51% fit on validation data.
    LinMod2 = oe(ze, [4 2 1]);   % OE model y = B/F u + e

NN example 1 - system identification (III/V) 10(26)
Try a neural network: a sigmoidal neural network using 10 hidden units.
    Options = {'MaxIter', 50, 'SearchMethod', 'LM'};
    narx1 = nlarx(ze, [2 4 1], sigmoidnet, Options{:});
This model already gives a 72% fit on validation data.
    compare(zv, narx1);

NN example 1 - system identification (IV/V) 11(26)
Using 12 hidden units and only making use of some of the regressors,
    Sig = sigmoidnet('NumberOfUnits', 12);   % create a SIGMOIDNET object
    narx5 = nlarx(ze, [2 3 1], Sig, 'NonlinearRegressors', [1 3 4], Options{:});
the performance can be increased to an 85% fit on validation data. Of course, this model needs further validation, but the improvement from the 51% fit of the best linear model is substantial.

NN example 1 - system identification (V/V) 12(26)
This example is borrowed from
Wang, J., Sano, A., Chen, T. and Huang, B. Identification of Hammerstein systems without explicit parameterization of nonlinearity. International Journal of Control, 82(5):937-952, May 2009.
and it is used as one of the examples illustrating Lennart's toolbox,
http://www.mathworks.com/products/sysid/demos.html?file=/products/demos/shipping/ident/idnlbbdemo_damper.html
More about the use of neural networks in system identification can be found in
Sjöberg, J., Zhang, Q., Ljung, L., Benveniste, A., Delyon, B., Glorennec, P-Y., Hjalmarsson, H. and Juditsky, A. Nonlinear black-box modeling in system identification: a unified overview. Automatica, 31(12):1691-1724, December 1995.
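
For readers without the System Identification Toolbox, the following base-MATLAB sketch illustrates the kind of model sigmoidnet builds: a one-hidden-layer sigmoidal function expansion, fitted here with plain batch gradient descent on synthetic 1-D data. The data, the number of hidden units and the learning rate are assumptions; the toolbox itself uses far better optimizers (e.g. Levenberg-Marquardt), so this is only an illustration of the model class.

    % One-hidden-layer sigmoid network y(x) = sum_k c_k*sigma(a_k*x + b_k) + d,
    % trained by plain gradient descent on synthetic data (illustration only).
    rng(0);
    N = 200;  x = linspace(-1, 1, N)';  t = sin(3*x) + 0.05*randn(N, 1);
    K = 10;                                   % number of hidden units
    a = randn(K, 1);  b = randn(K, 1);        % hidden-layer weights and biases
    c = 0.1*randn(K, 1);  d = 0;              % output-layer weights and bias
    eta = 0.5;                                % learning rate
    sigm = @(z) 1 ./ (1 + exp(-z));
    for it = 1:5000
        Z = sigm(x*a' + ones(N, 1)*b');       % N x K hidden-unit activations
        y = Z*c + d;                          % network output
        e = y - t;                            % residuals
        % gradients of 0.5*sum(e.^2) via the chain rule (backpropagation)
        gc = Z'*e;                 gd = sum(e);
        gpre = (e*c') .* Z .* (1 - Z);        % back through the output layer and sigmoid
        ga = (x'*gpre)';           gb = sum(gpre, 1)';
        a = a - eta*ga/N;  b = b - eta*gb/N;  c = c - eta*gc/N;  d = d - eta*gd/N;
    end
    yfit = sigm(x*a' + ones(N, 1)*b')*c + d;
    fprintf('RMS error after training: %.3f\n', sqrt(mean((yfit - t).^2)));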

NN example 2 - digit classification (I/IV) 13(26)
You have tried solving this problem using linear methods before. Let us see what can be done if we generalize to nonlinear function expansions (neural networks) instead.
Let us now investigate four nonlinear models and one linear model solving the same task:
Net-1: No hidden layer (equivalent to logistic regression).
Net-2: One hidden layer, 12 hidden units, fully connected.
Net-3: Two hidden layers, locally connected.
Net-4: Two hidden layers, locally connected with weight sharing.
Net-5: Two hidden layers, locally connected with two levels of weight sharing.

NN example 2 - digit classification (II/IV) 14(26)
Local connectivity (Net-3 to Net-5) means that each hidden unit is connected only to a small number of units in the layer before. It makes use of the key property that nearby pixels are more strongly correlated than distant pixels.

NN example 2 - digit classification (III/IV) 15(26)
Network architecture                              Links   Weights   Correct
Net-1: no hidden layer                             2570     2570     80.0%
Net-2: one hidden layer                            3214     3214     87.0%
Net-3: two hidden layers, locally connected        1226     1226     88.5%
Net-4: two hidden layers, constrained network      2266     1132     94.0%
Net-5: two hidden layers, constrained network      5194     1060     98.4%
Copyright IEEE 1989.
Net-4 and Net-5 are referred to as convolutional networks. This example illustrates that knowledge about the problem at hand can be very useful in order to improve the performance!

NN example 2 - digit classification (IV/IV) 16(26)
This example is borrowed from Section 11.7 in HTF, who in turn borrowed it from
LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998.
The best performance as of today is provided by a deep neural network published in last year's CVPR (99.77%):
Ciresan, D., Meier, U. and Schmidhuber, J. Multi-column deep neural networks for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, Rhode Island, USA, June 2012.
Recall the page mentioned during lecture 1 on this problem: http://yann.lecun.com/exdb/mnist/
There is plenty of software support for neural networks, for example MATLAB's Neural Network Toolbox (for general problems) and the System Identification Toolbox (for system identification).
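
To make the links-versus-weights distinction in the table concrete, here is a rough back-of-the-envelope calculation in MATLAB for a single locally connected layer with one shared filter; the image size, receptive-field size and hidden-layer width are generic assumptions, not the exact LeCun architectures above.

    % Why weight sharing gives far fewer weights than links (generic numbers).
    img = 16;  patch = 5;  hidden = 12;       % 16x16 input, 5x5 local receptive field
    fm = img - patch + 1;                     % 12x12 feature map from one shared filter
    links_local    = fm^2 * patch^2;          % every feature-map unit connects to 25 pixels
    weights_shared = patch^2 + 1;             % but they all share one filter (+ bias)
    links_full     = img^2 * hidden;          % a fully connected layer, for comparison
    fprintf('local links: %d, shared weights: %d, fully connected links: %d\n', ...
            links_local, weights_shared, links_full);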

The deep learning hype 17(26)
It even made it into the New York Times in November 2012:
http://www.nytimes.com/2012/11/24/science/scientists-see-advances-in-deep-learning-a-part-of-artificial-intelligence.html?pagewanted=1&_r=0
Deep learning is a very active topic at NIPS. A good entry point (tutorial papers, talks, groups, etc.) into the deep learning hype is provided by http://deeplearning.net/
Interesting topic for a project!

Introducing kernel methods (I/III) 18(26)
Let us introduce kernel methods via an equivalent formulation of the linear regression problem. Recall that linear regression models the relationship between a continuous target variable t and a function φ(x) of the input variable x,
t_n = w^T φ(x_n) + ε_n, with y(x_n, w) = w^T φ(x_n).
From lecture 2 we have that the posterior distribution is given by
p(w | T) = N(w | m_N, S_N),
m_N = β S_N Φ^T T,
S_N = (α I + β Φ^T Φ)^{-1}.

Introducing kernel methods (II/III) 19(26)
Inserting this into y(x, w) = w^T φ(x) provides the following expression for the predictive mean,
y(x, m_N) = m_N^T φ(x) = φ(x)^T m_N = β φ(x)^T S_N Φ^T T = Σ_{n=1}^N β φ(x)^T S_N φ(x_n) t_n = Σ_{n=1}^N k(x, x_n) t_n,
where
k(x, x′) = β φ(x)^T S_N φ(x′)
is referred to as the equivalent kernel. This suggests an alternative approach to regression where, instead of introducing a set of basis functions, we directly make use of a localized kernel.

Introducing kernel methods (III/III) 20(26)
Kernel methods constitute a class of algorithms where the training data (or a subset thereof) is kept also during the prediction phase. Many linear methods can be re-cast into an equivalent dual representation where the predictions are based on linear combinations of kernel functions (one example is provided above).
A general property of kernels is that they are inner products,
k(x, z) = ψ(x)^T ψ(z).
(In the linear regression example, ψ(x) = β^{1/2} S_N^{1/2} φ(x).)
The above development suggests the following idea: in an algorithm where the input data x enters only in the form of scalar products, we can replace this scalar product with another choice of kernel! This is referred to as the kernel trick.
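
A minimal MATLAB sketch of the equivalent kernel k(x, x′) = β φ(x)^T S_N φ(x′) for Gaussian basis functions; the basis centres, widths and the values of α and β are assumptions, chosen only to show that the kernel is localized around x.

    % Equivalent kernel for Bayesian linear regression with Gaussian bases (illustration).
    alpha = 2;  beta = 25;
    centres = linspace(0, 1, 9);  s = 0.1;        % M = 9 Gaussian basis functions
    X = linspace(0, 1, 50)';                      % training inputs x_1, ..., x_N
    Phi = exp(-(X - centres).^2 / (2*s^2));       % N x M design matrix
    SN  = inv(alpha*eye(numel(centres)) + beta*(Phi'*Phi));
    phi = @(x) exp(-(x - centres).^2 / (2*s^2))'; % M x 1 feature vector
    keq = @(x, xp) beta * phi(x)' * SN * phi(xp); % equivalent kernel k(x, x')
    [keq(0.5, 0.55), keq(0.5, 0.95)]              % large near x, close to zero far away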

Kernel representation of l2-reg. LS (I/II) 21(26)
Inserting the solution ŵ = Φ^T â = Φ^T (K + λ I_N)^{-1} T into y(x, w) provides the following prediction for a new input x,
y(x, ŵ) = ŵ^T φ(x) = â^T Φ φ(x) = ((K + λ I_N)^{-1} T)^T Φ φ(x)
= φ(x)^T Φ^T (K + λ I_N)^{-1} T
= φ(x)^T ( φ(x_1) φ(x_2) ... φ(x_N) ) (K + λ I_N)^{-1} T
= ( k(x, x_1) k(x, x_2) ... k(x, x_N) ) (K + λ I_N)^{-1} T,
where we have made use of the definition of a kernel function, k(x, z) = φ(x)^T φ(z).
Hence, the solution to the l2-regularized least squares problem is expressed in terms of the kernel function k(x, z).

Kernel representation of l2-reg. LS (II/II) 22(26)
Note (again) that the prediction at x is given by a linear combination of the target values from the training set (expensive). Furthermore, we are required to invert an N × N matrix (compared to an M × M matrix in the original formulation), where typically N ≫ M.
Relevant question: so what is the point?
The fact that the solution is expressed only using the kernel function k(x, z) implies that we can work entirely with kernels and avoid introducing the basis functions φ(x) explicitly. This in turn allows us to implicitly use basis functions of high, even infinite, dimension (M → ∞).

Constructing kernels 23(26)
Techniques for constructing new kernels:
1. Choose a feature mapping φ(x) and then use this to find the corresponding kernel,
k(x, z) = φ(x)^T φ(z) = Σ_{i=1}^M φ_i(x) φ_i(z).
2. Choose a kernel function directly. In this case it is important to verify that it is in fact a kernel (we will see two examples of this). A function k(x, z) is a kernel if and only if the Gram matrix K is positive semi-definite for all possible inputs.
3. Form new kernels from simpler kernels.
4. Start from probabilistic generative models.

Techniques for constructing new kernels 24(26)
Given valid kernels k_1(x, z) and k_2(x, z), the following are also valid kernels:
k(x, z) = c k_1(x, z),
k(x, z) = q(k_1(x, z)),
k(x, z) = k_1(x, z) + k_2(x, z),
k(x, z) = k_3(φ(x), φ(z)),
k(x, z) = f(x) k_1(x, z) f(z),
k(x, z) = exp(k_1(x, z)),
k(x, z) = k_1(x, z) k_2(x, z),
k(x, z) = x^T A z,
where c > 0 is a constant, f is a function, q is a polynomial with nonnegative coefficients, φ(x) is a function from x to R^M, k_3 is a valid kernel on R^M, and A is a symmetric positive semi-definite matrix.
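
A minimal MATLAB sketch of the kernel form of l2-regularized least squares above (kernel ridge regression), using a Gaussian kernel; the data, λ and the length scale are assumptions for illustration, and note that it is the N × N system (K + λ I_N) a = T that is solved, rather than an M × M one.

    % Kernel ridge regression with a Gaussian kernel (illustration only).
    rng(2);
    N = 50;
    x = linspace(0, 1, N)';                        % training inputs
    t = sin(2*pi*x) + 0.1*randn(N, 1);             % noisy training targets
    lambda = 1e-2;  ell = 0.1;                     % regularization and length scale
    k = @(xa, xb) exp(-(xa - xb').^2 / (2*ell^2)); % kernel, evaluated blockwise
    K = k(x, x);                                   % N x N Gram matrix
    a = (K + lambda*eye(N)) \ t;                   % dual variables a = (K + lambda*I_N)^{-1} T
    xs = linspace(0, 1, 200)';                     % new inputs
    ypred = k(xs, x) * a;                          % y(x) = sum_n k(x, x_n) a_n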

Example - polynomial kernel 25(26)
Let us investigate whether the polynomial kernel
k(x, z) = (x^T z + c)^n, c > 0,
is a kernel for the special case n = 2 and a 2D input space x = (x_1, x_2)^T, where
k(x, z) = (x^T z + c)^2 = (x_1 z_1 + x_2 z_2 + c)^2
= x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2 + 2c x_1 z_1 + 2c x_2 z_2 + c^2
= φ(x)^T φ(z),
φ(x) = ( x_1^2, √2 x_1 x_2, x_2^2, √(2c) x_1, √(2c) x_2, c )^T.
Hence, the feature mapping contains all possible terms (constant, linear and quadratic) up to order 2. (A small numerical sanity check is sketched after the summary below.)

A few concepts to summarize lecture 4 26(26)
Neural networks: a nonlinear function (as a function expansion) from a set of input variables {x_i} to a set of output variables {y_k}, controlled by a vector w of parameters.
Backpropagation: computing the gradients by making use of the chain rule, combined with clever reuse of information that is needed for more than one gradient.
Convolutional neural networks: the hidden units take their inputs from a small part of the available inputs, and all units have the same weights (called weight sharing).
Kernel function: a kernel function k(x, z) is defined as an inner product k(x, z) = φ(x)^T φ(z), where φ(x) is a fixed mapping.
Kernel trick (a.k.a. kernel substitution): in an algorithm where the input data x enters only in the form of scalar products, we can replace this scalar product with another choice of kernel.
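
The sanity check promised on slide 25: a few lines of MATLAB verifying numerically that (x^T z + c)^2 = φ(x)^T φ(z) for the stated feature map; the particular values of x, z and c are arbitrary.

    % Numerical check of the polynomial-kernel factorization (arbitrary test values).
    c = 1;
    x = [0.3; -1.2];   z = [2.0; 0.5];
    phi = @(u) [u(1)^2; sqrt(2)*u(1)*u(2); u(2)^2; sqrt(2*c)*u(1); sqrt(2*c)*u(2); c];
    kernel_value  = (x'*z + c)^2;
    feature_value = phi(x)'*phi(z);
    disp([kernel_value, feature_value])           % the two numbers agree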