Course 10. Kernel methods. Classical and deep neural networks.


Kernel methods in similarity-based learning. Following (Ionescu, 2018).

The Vector Space Model
- The representation of a set of objects as vectors in a common vector space is known as the vector space model.
- Dimensions are features (words, similarities between a sample and a training sample, TF-IDF, etc.).
- Queries are also vectors. From (Verberne, 2018).

Dot product based similarity
- A vector V(d) is derived from the object d, with one component for each feature.
- The value of a component can be a raw count, a tf-idf weighting score, or anything else.
- How do we quantify the similarity between two objects in this vector space?
- 1st attempt: compute the magnitude of the vector difference between the vectors of the two objects (open to criticism!).
- A better solution: compute the cosine similarity: sim(d1, d2) = <V(d1), V(d2)> / (|V(d1)| |V(d2)|).

Properties of the inner (dot) product
- <., .>: V x V → R with the properties:
- symmetry: <x, y> = <y, x>
- linearity: <ax, y> = a<x, y>; <x + y, z> = <x, z> + <y, z>
- positive definiteness: <x, x> ≥ 0; <x, x> = 0 ⟺ x = 0
- For real vectors, <x, y> = x^T y, where x^T is the transpose of x.

Computing similarity
- Dot product: <x, y> = Σ_i x_i y_i
- Euclidean length: |x| = sqrt(<x, x>)
- Unit vectors: x / |x| and y / |y|
- Similarity: sim(x, y) = <x, y> / (|x| |y|)
- Similarity = cosine of the angle between the two objects
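
A minimal sketch of this computation in Python with NumPy (the function name cosine_similarity and the example vectors are illustrative, not taken from the slides):

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors: <x, y> / (|x| |y|)."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Example: two feature vectors (e.g., tf-idf weights)
d1 = np.array([1.0, 2.0, 0.0, 3.0])
d2 = np.array([2.0, 1.0, 1.0, 3.0])
print(cosine_similarity(d1, d2))  # value in [-1, 1]; 1 means identical direction
```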

Inner product space (Hilbert space)
- A vector space X over the set of real numbers R is an inner product space if there exists a real-valued symmetric bilinear (linear in each argument) map <., .> that satisfies <x, x> ≥ 0 for all x ∈ X.
- The bilinear map is known as the inner product, dot product or scalar product.
- An inner product space is sometimes referred to as a Hilbert space; R^m is a Hilbert space.

Motivation for using kernels: feature embedding. Nonlinear patterns are converted into linear ones: the function φ embeds the data into a feature space where the nonlinear relations now appear linear. Machine learning methods can easily detect such linear relations.

Kernels in inner product spaces
- A kernel method performs a mapping into an embedding or feature space. An embedding map (or feature map) is a function φ: X → F, where F is an inner product feature space.
- A kernel is a function k that for all x, z ∈ X satisfies k(x, z) = <φ(x), φ(z)>.
- The choice of the map φ aims to convert the nonlinear relations from X into linear relations in the embedding space F.
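
As a small illustration (a common textbook example, not from the slides), a degree-2 polynomial feature map and the corresponding kernel can be checked numerically; the names phi and poly_kernel below are hypothetical:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input: (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel computed directly in the input space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])
# The kernel equals the inner product in the feature space: k(x, z) = <phi(x), phi(z)>
print(poly_kernel(x, z), np.dot(phi(x), phi(z)))  # both are 25.0
```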

Kernel methods
- Kernel-based learning algorithms work by embedding the data into a Hilbert space and by searching for linear relations in that space, using a learning algorithm.
- To embed data in a Hilbert space means: whatever the input data looks like, embed its representation in a multidimensional space (> 3 dimensions).
- The embedding is performed implicitly, i.e. by specifying the inner product between each pair of points rather than by giving their coordinates explicitly.
- A kernel = the inner product in a Hilbert space.
- Another interpretation: the pairwise similarity between samples.

Why is the kernel function important?
- Kernel methods (through kernel functions) offer the power to naturally handle input data that is not in the form of numerical vectors, such as strings, images, or even video and audio files.
- The kernel function captures the intuitive notion of similarity between objects in a specific domain and can be any function defined on the respective domain that is symmetric and positive definite.
- Applications in computational biology and computational linguistics (Shawe-Taylor and Cristianini, 2004).
- Applications in image interpretation (Lazebnik et al., 2006; Grauman and Darrell, 2005).

The kernel matrix
- Given a set of vectors {x_1, ..., x_n} and a kernel function k employed to evaluate the inner products in a feature space with feature map φ, the kernel matrix is defined as the n x n matrix K with entries given by: K_ij = <φ(x_i), φ(x_j)> = k(x_i, x_j).
- A function k: X x X → R which is either continuous or has a finite domain can be decomposed into a feature map φ into a Hilbert space F applied to both its arguments, followed by the evaluation of the inner product in F, as follows: k(x, z) = <φ(x), φ(z)>, if and only if it is finitely positive semi-definite.
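
A minimal sketch of building a kernel matrix with the linear kernel in NumPy (assuming the rows of X are the samples x_1, ..., x_n; the helper name kernel_matrix is ours):

```python
import numpy as np

def linear_kernel(x, z):
    """Linear kernel: the inner product <x, z>."""
    return np.dot(x, z)

def kernel_matrix(X, k):
    """n x n matrix with entries K[i, j] = k(x_i, x_j)."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = k(X[i], X[j])
    return K

X = np.array([[1.0, 2.0], [3.0, 1.0], [0.0, 1.0]])
K = kernel_matrix(X, linear_kernel)
print(K)                        # symmetric, positive semi-definite
print(np.allclose(K, X @ X.T))  # for the linear kernel, K = X X^T
```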

Example and properties of kernel functions
- Linear kernel: obtained by computing the inner product of two vectors. The map function in this case is φ(x) = x.
- Example: if x = (1, 2, 4, 1) and z = (5, 1, 2, 3) are two vectors in R^4, then k(x, z) = <x, z> = 1*5 + 2*1 + 4*2 + 1*3 = 18.
- Let k1 and k2 be two kernels over X x X, a ∈ R_+, f(.) a real-valued function on X, and M a symmetric positive semi-definite n x n matrix. Then the following functions are also kernels:
  k(x, z) = k1(x, z) + k2(x, z)
  k(x, z) = a k1(x, z)
  k(x, z) = k1(x, z) * k2(x, z)
  k(x, z) = f(x) * f(z)
  k(x, z) = x^T M z
- M is called positive semi-definite if x^T M x ≥ 0 for all x in R^n.
- See (Ionescu, 2018), pp. 44-45, for more examples of kernel functions.
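
The worked example above can be checked directly (a quick sketch using only the numbers already given in the slide):

```python
import numpy as np

x = np.array([1, 2, 4, 1])
z = np.array([5, 1, 2, 3])
print(np.dot(x, z))  # 1*5 + 2*1 + 4*2 + 1*3 = 18
```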

Kernel classifiers
- In binary classification problems, kernel-based learning algorithms look for a discriminant function, which assigns +1 to examples that belong to one class and -1 to examples that belong to the other class.
- A linear function in the feature space: g(x) = <w, φ(x)> + b, with the predicted class given by sign(g(x)).

Kernel normalisation
- Given a kernel k(x, z) that corresponds to the feature map φ, the normalized kernel k̂(x, z) corresponds to the feature map φ(x) / ||φ(x)||.
- Normalization helps to improve machine learning performance.
- Examples:
- Feature normalization makes the values of each feature in the data have zero mean and unit variance.
- Kernel matrix normalization divides each component by the square root of the product of the two corresponding diagonal components: K̂_ij = K_ij / sqrt(K_ii K_jj).
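
A sketch of the second normalization, applied to a whole kernel matrix K (dividing each entry by the square root of the product of the corresponding diagonal entries; the function name normalize_kernel is illustrative):

```python
import numpy as np

def normalize_kernel(K):
    """K_hat[i, j] = K[i, j] / sqrt(K[i, i] * K[j, j]); the diagonal becomes 1."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

K = np.array([[4.0, 2.0],
              [2.0, 9.0]])
print(normalize_kernel(K))  # [[1.0, 0.333...], [0.333..., 1.0]]
```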

A brief review of neural networks. Following: (Nielsen, 2015).

Neural networks in AI
- Inspired by medicine, where the term refers to networks of biological neurons.
- The artificial neuron is called a perceptron.
- The concept was introduced in the 1950s and 1960s (by Frank Rosenblatt, inspired by earlier work of Warren McCulloch and Walter Pitts).
- The more modern variant: the sigmoid neuron.
[Diagram: a perceptron with inputs weighted by w_1, w_2, w_3]

Networks with and without feedback
- Neural networks where the output of a layer is used as input to the next layer => feedforward neural networks.
- There are no loops in the network: the information is always fed forward, never backwards.
- If the output is fed back to internal nodes => recurrent networks.

Another way to write the network equation: use w . x (the inner product of the weight vector and the input vector). Noting b = -threshold, the perceptron output is 1 if w . x + b > 0 and 0 otherwise.
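
A minimal sketch of the perceptron rule in this form (the weight, bias and input values are arbitrary illustrative numbers):

```python
import numpy as np

def perceptron(w, b, x):
    """Output 1 if w . x + b > 0, else 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

w = np.array([0.5, -1.0, 2.0])
b = -1.0                       # b = -threshold
x = np.array([1.0, 0.0, 1.0])
print(perceptron(w, b, x))     # 1, since 0.5 + 2.0 - 1.0 = 1.5 > 0
```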

[Network diagram showing weights and biases, from (Ionescu, 2018), p. 47]

How can a neural network learn?
- A small change in the weight of an arc of the network...
- can cause a complete flip of the output, e.g. from 0 to 1;
- even if a certain output would now be corrected in the way we want, the behaviour on the rest of the inputs may be severely affected in a way that is difficult to control.
- => We need a way to gradually change weights and biases so the network gets closer to the desired behavior:
- a network where switching is not radical, unlike the one given by a threshold function...

The sigmoid function
- output = σ(w . x + b), where σ(z) = 1 / (1 + e^(-z)).
[Plot: the step function of the perceptron vs. the smooth curve of the sigmoid neuron]
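
A small sketch comparing the sigmoid neuron with the hard-threshold perceptron (same illustrative weights as above):

```python
import numpy as np

def sigmoid(z):
    """Smooth squashing function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.0, 2.0])
b = -1.0
x = np.array([1.0, 0.0, 1.0])
z = np.dot(w, x) + b
print(sigmoid(z))              # about 0.82, instead of the hard 0/1 of the perceptron
```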

To learn = to minimize an error function (cost, objective, loss): C(w, b) = 1/(2n) Σ_x ||y(x) - a(w, b, x)||^2, where:
- w = weights matrix
- b = biases vector
- n = number of training inputs x
- x = input
- y(x) = desired output
- a(w, b, x) = computed output of the network

Gradient descent rule
- Assuming a cost C that depends on a set of variables v = (v_1, ..., v_m)^T, the variation ΔC of C produced by a small variation of the input Δv = (Δv_1, ..., Δv_m)^T is: ΔC ≈ ∇C . Δv, where ∇C = (∂C/∂v_1, ..., ∂C/∂v_m)^T is the gradient.
- If we choose Δv = -η ∇C, with η > 0, then we ensure that the variation ΔC is negative (because we want the error to become smaller and smaller at each step).
- Update rule: v → v' = v - η ∇C (the gradient descent algorithm).
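
A minimal gradient descent sketch on a simple quadratic cost (the cost and its gradient are illustrative, not the network cost from the previous slide):

```python
import numpy as np

def C(v):
    """Toy cost: C(v) = v1^2 + 2*v2^2."""
    return v[0]**2 + 2 * v[1]**2

def grad_C(v):
    """Gradient of the toy cost: (2*v1, 4*v2)."""
    return np.array([2 * v[0], 4 * v[1]])

v = np.array([3.0, -2.0])
eta = 0.1                      # learning rate
for _ in range(100):
    v = v - eta * grad_C(v)    # update rule: v -> v - eta * grad C
print(v, C(v))                 # v approaches (0, 0), the minimum of C
```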

How to apply gradient descent?
- Gradient descent can be seen as a way of taking small steps in the direction that makes C decrease fastest.
- The gradient vector has the components ∂C/∂w_k and ∂C/∂b_l, such that the rules that make the network learn are: w_k → w_k - η ∂C/∂w_k and b_l → b_l - η ∂C/∂b_l.
- But this means that the output influences the network. Still, this is not a recurrent network (because only the weights and biases are influenced, not the inputs of the neurons).

Stochastic gradient descent
- In practice, the number of training inputs can be extremely high, so computing the gradient over all of them at each step may be unrealistic.
- At each training step we randomly choose a small number of inputs X_1, ..., X_m and estimate ∇C only from them, by averaging the results, a process similar to opinion polls: ∇C ≈ (1/m) Σ_j ∇C_{X_j}.
- Summarizing: w_k → w_k - (η/m) Σ_j ∂C_{X_j}/∂w_k and b_l → b_l - (η/m) Σ_j ∂C_{X_j}/∂b_l, with a different random sample (mini-batch) at each step.
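
A sketch of the stochastic (mini-batch) variant on a simple least-squares problem; the data, batch size and variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))            # 1000 training inputs
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=1000)

w = np.zeros(3)
eta, m = 0.1, 32                          # learning rate, mini-batch size
for step in range(500):
    idx = rng.choice(len(X), size=m, replace=False)  # random mini-batch
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / m       # gradient estimated on the batch only
    w -= eta * grad
print(w)                                  # close to w_true
```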

Multilayer (deep) networks
- Perceptrons in the second layer make decisions by weighing the results of the first layer;
- these are decisions at a more complex and abstract level than those in the first layer;
- even more complex decisions can be taken by the perceptrons in the third layer;
- a network with many layers of perceptrons can thus engage in sophisticated decision making.

Feedforward through a deep neural network
- The loss L(ŷ, y) depends on the network output ŷ and the target y.
- To obtain the total cost J, a regularizer Ω(θ), where θ contains all parameters (weights and biases), can be added to the loss. From (Goodfellow et al., 2016).
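
A sketch of one forward pass through a two-layer network with a squared-error loss and an L2 regularizer added to obtain J (the layer sizes, weight values and the factor lambda are illustrative assumptions, not from the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)    # hidden layer parameters
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)    # output layer parameters

x, y = np.array([0.5, -1.0, 2.0]), np.array([1.0])

h = sigmoid(W1 @ x + b1)                 # hidden activations
y_hat = sigmoid(W2 @ h + b2)             # network output
L = 0.5 * np.sum((y_hat - y) ** 2)       # loss L(y_hat, y)
omega = sum(np.sum(W ** 2) for W in (W1, W2))    # L2 regularizer Omega(theta)
lam = 0.01
J = L + lam * omega                      # total cost J = L + lambda * Omega(theta)
print(y_hat, J)
```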

Deep learning
- Suppose we want to find out whether a picture contains a human face:

Decompose the problem into simpler sub-problems
- Is there an eye in the top left? Is there an eye in the top right? Is there a nose in the middle? Is there a mouth down in the middle? Is there hair above? And so on.

And these, in turn, into even simpler ones
- But how do you solve the problem automatically, by proposing 5-10 intermediate (hidden) layers of neurons that address increasingly general concepts?...

Bibliography
- Goodfellow, Ian, Bengio, Yoshua and Courville, Aaron (2016). Deep Learning, MIT Press. http://www.deeplearningbook.org
- Grauman, Kristen and Darrell, Trevor (2005). The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features. In Proceedings of ICCV, volume 2, pp. 1458-1465.
- Ionescu, Radu Tudor (2018). Knowledge Transfer between Computer Vision, Text Mining and Computational Biology: New Chapters. Habilitation thesis, University of Bucharest.
- Lazebnik, Svetlana, Schmid, Cordelia, and Ponce, Jean (2006). Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In Proceedings of CVPR, volume 2, pp. 2169-2178, Washington, IEEE Computer Society.
- Nielsen, Michael A. (2015). Neural Networks and Deep Learning, Determination Press. http://neuralnetworksanddeeplearning.com/chap1.html
- Shawe-Taylor, John and Cristianini, Nello (2004). Kernel Methods for Pattern Analysis, Cambridge University Press.
- Verberne, Suzan (2018). Word2Vec Tutorial, Leiden University, March. https://twitter.com/suzan/status/977158060371316736