Bits of Machine Learning Part 1: Supervised Learning

Bits of Machine Learning Part 1: Supervised Learning
Alexandre Proutiere and Vahan Petrosyan, KTH (The Royal Institute of Technology)

Outline of the Course
1. Supervised Learning: Regression and Classification; Neural Networks (Deep Learning)
2. Unsupervised Learning: Clustering
3. Reinforcement Learning

Learning
Supervised (or Predictive) learning: learn a mapping from inputs x to outputs y, given a labeled set of input-output pairs (the training set) $D_n = \{(X_i, Y_i), i = 1, \ldots, n\}$.
We learn the classification function: f = 1 if versicolor, f = -1 if virginica.

Learning
Unsupervised (or Descriptive) learning: find interesting patterns in the data $D_n = \{X_i, i = 1, \ldots, n\}$.
We learn there are 2 distinct types of iris and how to distinguish them!

Examples
Supervised learning: digit, flower, picture, ... recognition; music classification; predicting prices; predicting the outcome of chemical reactions; source separation (identify the instruments present in a recording); ...
Unsupervised learning: classification without a training set (spam filters, news aggregation as in Google News); identifying structures and causes in the data, e.g. J. Snow's cholera deaths vs. pollution; image segmentation; ...

Outline of the Course
1. Supervised Learning: Regression and Classification; Neural Networks (Deep Learning)
2. Unsupervised Learning: Clustering
3. Reinforcement Learning

Supervised Learning: Outline
1. A statistical framework
2. Algorithms using Local Averaging (k-NN, kernels)
3. Linear regression
4. Support Vector Machines
5. The kernel trick
6. Neural Networks

1. Supervised learning
Training set: $D_n = \{(X_i, Y_i), i = 1, \ldots, n\}$
- Input features: $X_i \in \mathbb{R}^d$
- Output: $Y_i \in \mathcal{Y}$, with $\mathcal{Y} = \mathbb{R}$ for regression (price, position, etc.) and $\mathcal{Y}$ finite for classification (type, mode, etc.)
y is a non-deterministic and complicated function of x, i.e., $y = f(x) + z$ where z is unknown (e.g. noise). Goal: learn f.
Learning algorithm: (diagram)

A Statistical Approach
$(X_i, Y_i), i = 1, \ldots, n$, are i.i.d. random variables with joint distribution P.
Model: look for an estimate of f in a class of functions $\mathcal{F}$, e.g. linear, polynomial, ...
Learning algorithm: $A : D_n \mapsto \hat{f}_n \in \mathcal{F}$, where $\hat{f}_n$ estimates the true function f.
Performance of predictions is defined through a loss function $\ell$:
- Example 1. Regression, Least Squares (LS): $\mathcal{Y} = \mathbb{R}$, $\ell(y, y') = \frac{1}{2}|y - y'|^2$
- Example 2. Classification in $\mathcal{Y} = \{0, 1\}$: $\ell(y, y') = 1_{y \neq y'}$

Risks
The risk of a prediction function g is $R(g) := \mathbb{E}_P[\ell(g(X), Y)]$ (often referred to as out-of-sample error).
Target function: $f^\star := \arg\min_{g \in \mathcal{F}} R(g)$
Empirical risk: $\hat{R}_n(g) := \frac{1}{n} \sum_{i=1}^n \ell(g(X_i), Y_i)$ (often referred to as in-sample error)
Law of large numbers: $\lim_{n \to \infty} \hat{R}_n(g) = R(g)$ almost surely
Central limit theorem: $\sqrt{n}\,(\hat{R}_n(g) - R(g)) \Rightarrow \mathcal{N}(0, \mathrm{var}_P(\ell(g(X), Y)))$
Consistent algorithm: $\lim_{n \to \infty} \mathbb{E}[R(\hat{f}_n)] = R(f^\star)$
Warning. Consistency does not imply a good algorithm ($\mathcal{F}$ can be poor, $f^\star \neq f$).
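
A minimal numerical illustration (not from the slides) of the law-of-large-numbers statement above: for a fixed prediction function g, the empirical risk $\hat{R}_n(g)$ converges to the risk R(g) as n grows. The toy model Y = 2X + noise, the squared loss and the predictor g(x) = 1.5x are illustrative assumptions.

```python
# Empirical risk hat{R}_n(g) of a fixed predictor g converges to R(g) as n grows.
import numpy as np

rng = np.random.default_rng(0)

def loss(pred, y):            # squared loss l(y', y) = 0.5 * |y' - y|^2
    return 0.5 * (pred - y) ** 2

def g(x):                     # a fixed (suboptimal) prediction function
    return 1.5 * x

for n in [10, 100, 10_000, 1_000_000]:
    X = rng.uniform(0, 1, size=n)
    Y = 2 * X + rng.normal(0, 0.1, size=n)   # the unknown relation y = f(x) + z
    emp_risk = loss(g(X), Y).mean()          # hat{R}_n(g)
    print(n, emp_risk)
# The printed values approach R(g) = 0.5*(0.25*E[X^2] + 0.01) ~ 0.047 as n grows.
```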

LS regression and binary classification
Example 1. Regression, Least Squares (LS): risk $R(g) = \frac{1}{2}\mathbb{E}[|g(X) - Y|^2]$, target function $f^\star(x) = \mathbb{E}[Y \mid X = x]$.
Example 2. Binary classification, $\mathcal{Y} = \{-1, 1\}$: risk $R(g) = \mathbb{P}(g(X) \neq Y)$, target function $f^\star(x) = \arg\max_{k \in \{-1, 1\}} \mathbb{P}(Y = k \mid X = x)$.

Model selection - Choosing $\mathcal{F}$
Avoid overfitting! A few principles:
- Occam's razor principle: choose the simplest of two models if they explain the data equally well
- Hadamard's well-posed problems: unique solution, smooth in the parameters
- Regularization: minimize $\mathrm{error}(\hat{f}_n) + \Omega(\hat{f}_n)$, where $\Omega$ penalizes the model complexity

2. Local averaging algorithms
Principle: predict for x based on neighbours of x in the training data:
$\hat{f}_n(x) = \Psi\left(\sum_{i=1}^n W_i(x) Y_i\right)$, with $\sum_{i=1}^n W_i(x) = 1$
$\Psi(u) = u$ for regression, $\Psi(u) = \mathrm{sgn}(u)$ for binary classification.

Examples
Prediction function: $\hat{f}_n(x) = \Psi(\sum_{i=1}^n W_i(x) Y_i)$, with $\sum_{i=1}^n W_i(x) = 1$
k-NN: $W_i(x) = 1/k$ if $X_i$ is one of the k nearest neighbours of x, and 0 otherwise.
Nadaraya-Watson algorithm: $W_i(x) \propto K\left(\frac{\|x - X_i\|}{h}\right)$, e.g. $K(u) = e^{-u^2}$.
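
A minimal sketch (not the lecture's code) of the two local-averaging predictors above, here for regression (so Ψ is the identity); the toy data, the number of neighbours k and the bandwidth h are illustrative choices.

```python
# Local averaging: k-NN and Nadaraya-Watson regression at a query point x.
import numpy as np

def knn_predict(x, X, Y, k=5):
    """k-NN regression: average the labels of the k nearest training points."""
    dists = np.linalg.norm(X - x, axis=1)
    neighbours = np.argsort(dists)[:k]       # indices of the k nearest neighbours
    return Y[neighbours].mean()              # weights W_i = 1/k on the neighbours

def nadaraya_watson_predict(x, X, Y, h=0.3):
    """Nadaraya-Watson: kernel-weighted average with K(u) = exp(-u^2)."""
    u = np.linalg.norm(X - x, axis=1) / h
    W = np.exp(-u ** 2)
    W = W / W.sum()                          # normalise so the weights sum to 1
    return np.dot(W, Y)

# Toy data: Y = sin(2*pi*X) + noise on [0, 1]
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 1))
Y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.1, size=200)

x0 = np.array([0.25])
print(knn_predict(x0, X, Y), nadaraya_watson_predict(x0, X, Y), np.sin(2 * np.pi * 0.25))
```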

Performance analysis
Stone's theorem: conditions on the $W_i$ under which the corresponding algorithm is consistent.
Example: LS target function $f^\star(x) = \mathbb{E}[Y \mid X = x]$, distribution P on (X, Y). Assume that:
1. X lies in a bounded set A almost surely under P
2. For all $x \in A$, $\mathrm{var}(Y \mid X = x) \leq \sigma^2$
3. $f^\star$ is L-Lipschitz
Then the k-NN estimator satisfies: $\mathbb{E}[R(\hat{f}_n)] - R(f^\star) \leq \frac{\sigma^2}{k} + c L^2 \left(\frac{k}{n}\right)^{2/d}$
The k-NN algorithm is consistent with P if k goes to infinity sublinearly with n.

3. Linear regression: Minimizing the empirical risk
$\mathcal{F}$ = set of linear functions from $\mathbb{R}^d$ ($X_i \in \mathbb{R}^d$) to $\mathbb{R}$ ($Y_i \in \mathbb{R}$)
Function $f_\theta \in \mathcal{F}$ parametrized by $\theta \in \mathbb{R}^{d+1}$ (by convention $x_0 = 1$): $f_\theta(x) = \theta_0 + \theta_1 x_1 + \ldots + \theta_d x_d = \theta^\top x$
Least squares model. Empirical risk of θ: $\hat{R}_n(\theta) = \frac{1}{2n} \sum_{i=1}^n (f_\theta(X_i) - Y_i)^2$
Minimal risk achieved for $\theta^\star = (X^\top X)^{-1} X^\top y$, where $y = [Y_1 \ldots Y_n]^\top$ and X is the matrix whose i-th row is $X_i^\top$.
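
A minimal sketch (not the lecture's code) of the closed-form least-squares solution $\theta^\star = (X^\top X)^{-1} X^\top y$; the synthetic data is an illustrative assumption, and a column of ones is prepended to X so that $\theta_0$ plays the role of the intercept ($x_0 = 1$).

```python
# Closed-form least squares via the normal equations.
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 3
X_raw = rng.normal(size=(n, d))
theta_true = np.array([1.0, 2.0, -3.0, 0.5])        # [theta_0, theta_1, ..., theta_d]
y = theta_true[0] + X_raw @ theta_true[1:] + rng.normal(0, 0.1, size=n)

X = np.hstack([np.ones((n, 1)), X_raw])             # convention x_0 = 1
theta_star = np.linalg.solve(X.T @ X, X.T @ y)      # solves the normal equations
print(theta_star)                                   # close to theta_true

def f_theta(x_raw, theta):
    """Prediction f_theta(x) = theta^T [1, x]."""
    return theta[0] + x_raw @ theta[1:]

print(f_theta(X_raw[0], theta_star), y[0])          # prediction vs. observed output
```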

Gradient Descents for Regression
Alternative methods to find $\theta^\star$: sequential algorithms.
Batch gradient descent. Repeat: for all $k \in \{0, 1, \ldots, d\}$, $\theta_k := \theta_k + \alpha \sum_{i=1}^n (Y_i - f_\theta(X_i)) X_{ik}$
Stochastic gradient descent. Repeat:
1. Select a sample i uniformly at random
2. Perform a descent step using $(X_i, Y_i)$ only: for all $k \in \{0, 1, \ldots, d\}$, $\theta_k := \theta_k + \alpha (Y_i - f_\theta(X_i)) X_{ik}$
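
A minimal sketch (not the lecture's code) of the stochastic gradient descent update above for the least-squares model; the synthetic data, learning rate α and iteration count are illustrative choices.

```python
# Stochastic gradient descent for linear regression (LMS-style updates).
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])   # column of ones: x_0 = 1
theta_true = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ theta_true + rng.normal(0, 0.1, size=n)

alpha, theta = 0.01, np.zeros(d + 1)
for _ in range(50_000):
    i = rng.integers(n)                    # sample selection, uniform at random
    residual = y[i] - X[i] @ theta         # Y_i - f_theta(X_i)
    theta += alpha * residual * X[i]       # theta_k := theta_k + alpha*(Y_i - f_theta(X_i))*X_ik
print(theta)                               # close to theta_true
```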

Linear regression: Regularization
For d > n, $X^\top X \in \mathbb{R}^{d \times d}$ is not invertible, so $\theta^\star$ is not uniquely defined. We need additional constraints on the model to get a well-posed problem.
Regularization: add a cost for the magnitude of θ: $\min_\theta \frac{1}{2n} \sum_{i=1}^n (f_\theta(X_i) - Y_i)^2 + \lambda \Omega(\theta)$
- Ridge: $\Omega(\theta) = \|\theta\|_2^2$
- LASSO: $\Omega(\theta) = \|\theta\|_1$
- $\ell_p$: $\Omega(\theta) = \|\theta\|_p^p$; p small (< 1): sparse solutions but hard optimization problems; p large ($\geq 1$): less sparse solutions but convex optimization problems

Linear regression: Regularization
Regularization: add a cost for the magnitude of θ: $\min_\theta \frac{1}{2n} \sum_{i=1}^n (f_\theta(X_i) - Y_i)^2 + \lambda \Omega(\theta)$
λ > 0 controls the bias-variance trade-off:
- High value of λ: the data has a low weight in the objective function (low variance but high bias)
- Low value of λ: the data has a high weight in the objective function (low bias but high variance)
Solution for Ridge regression: $\theta^\star = (X^\top X + n\lambda I)^{-1} X^\top y$
Prediction: for all $x \in \mathbb{R}^d$, $\hat{f}_\lambda(x) = y^\top (X X^\top + n\lambda I)^{-1} X x$
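
A minimal sketch (not the lecture's code) of ridge regression in the two equivalent forms above: the solution $\theta^\star = (X^\top X + n\lambda I)^{-1} X^\top y$ and the prediction $y^\top (X X^\top + n\lambda I)^{-1} X x$; the d > n synthetic data and the value of λ are illustrative assumptions.

```python
# Ridge regression: primal solution and the equivalent "kernel-style" prediction.
import numpy as np

rng = np.random.default_rng(3)
n, d = 30, 50                                 # d > n: X^T X is singular, ridge still works
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(0, 0.1, size=n)
lam = 0.1

theta_star = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

x_new = rng.normal(size=d)
pred_primal = theta_star @ x_new
pred_kernel = y @ np.linalg.solve(X @ X.T + n * lam * np.eye(n), X @ x_new)
print(pred_primal, pred_kernel)               # identical up to numerical precision
```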

4. Support Vector Machines
SVM is a classification algorithm aiming at maximizing the margin of the separating hyperplane. The data points (or vectors) realizing the minimum distance to this hyperplane are the support vectors (as they carry the hyperplane).

Support Vector Machines: Derivation
Separating hyperplane $H_{\theta,b}$: $\theta^\top x + b = 0$
θ and b are normalized so that $|\theta^\top X_{sv} + b| = 1$, where $X_{sv}$ is any support vector, i.e., any $X_i$ achieving $\min_i d(X_i, H_{\theta,b})$. With this choice, the margin is simply $2/\|\theta\|_2$.
Max margin problem: $\max_{\theta,b} 1/\|\theta\|_2$ s.t. $\min_i |\theta^\top X_i + b| = 1$, equivalently $\min_{\theta,b} \|\theta\|_2^2$ s.t. $Y_i(\theta^\top X_i + b) \geq 1, \forall i$
A simple quadratic program...

Support Vector Machines: Derivation
Lagrangian: $L(\theta, b, \alpha) = \frac{1}{2}\theta^\top\theta - \sum_i \alpha_i \left(Y_i(\theta^\top X_i + b) - 1\right)$
Dual problem: $\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j Y_i Y_j X_i^\top X_j \alpha_i \alpha_j$ s.t. $\alpha_i \geq 0\ \forall i$, $\sum_i \alpha_i Y_i = 0$
A quadratic program: given the solution $\alpha^\star$, we have the following:
- (complementary slackness) $\alpha_i^\star > 0$ iff $X_i$ is a support vector
- (hyperplane) $\theta^\star = \sum_i \alpha_i^\star Y_i X_i$, and $b^\star$ satisfies $Y_i(\theta^{\star\top} X_i + b^\star) = 1$ whenever $\alpha_i^\star > 0$
- (prediction function) $\hat{f}_n(x) = \mathrm{sgn}\left(\sum_i \alpha_i^\star Y_i X_i^\top x + b^\star\right)$
Remark. The solution and the prediction can be computed by just evaluating inner products $X_i^\top X_j$ and $X_i^\top x$.
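
A minimal sketch (not the lecture's code) that solves the hard-margin dual above numerically. It uses scipy's general-purpose SLSQP solver rather than a dedicated QP package, and a linearly separable 2D toy data set; both are illustrative choices.

```python
# Hard-margin SVM: solve the dual QP, recover theta*, b* and the support vectors.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([-2, -2], 0.5, size=(20, 2)),
               rng.normal([2, 2], 0.5, size=(20, 2))])
Y = np.array([-1] * 20 + [1] * 20)
n = len(Y)

Q = (Y[:, None] * X) @ (Y[:, None] * X).T            # Q_ij = Y_i Y_j X_i^T X_j

def neg_dual(alpha):                                  # maximize the dual = minimize its negative
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(neg_dual, np.zeros(n), method="SLSQP",
               bounds=[(0, None)] * n,
               constraints=[{"type": "eq", "fun": lambda a: a @ Y}])
alpha = res.x

sv = alpha > 1e-6                                     # support vectors: alpha_i > 0
theta = (alpha * Y) @ X                               # theta* = sum_i alpha_i Y_i X_i
b = np.mean(Y[sv] - X[sv] @ theta)                    # from Y_i (theta^T X_i + b) = 1 on SVs
predict = lambda x: np.sign(x @ theta + b)
print(np.sum(sv), np.all(predict(X) == Y))            # few support vectors, perfect separation
```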

Support Vector Machines vs. Local Averaging
The predictor resembles that of a local averaging algorithm: $g(x) = \mathrm{sgn}\left(\sum_i W_i(x) Y_i + b\right)$ with $W_i(x) = \alpha_i^\star X_i^\top x$.
But $W_i(x)$ is not local anymore... as the support vectors span the entire data set.

SVM: Extensions
What if the data is not linearly separable?
- Data almost linearly separable: soft-margin SVM (impose a cost of misclassification)
- Data not linearly separable at all: go to a higher-dimensional space where the data becomes separable - the kernel trick

5. SVM with the kernel trick
Non-linear transformation: $\phi : \mathbb{R}^d \to \mathbb{R}^e$ where $e \gg d$. Apply the SVM in $\mathbb{R}^e$ with the training data $(\phi(X_i), Y_i)_{i=1,\ldots,n}$.
The optimal prediction can be constructed by only considering scalar products $K(X_i, X_j) = \phi(X_i)^\top \phi(X_j)$ and $K(X_i, x) = \phi(X_i)^\top \phi(x)$.
K is called a kernel (by definition, if all its Gram matrices are positive semi-definite).
Kernel trick: the knowledge of K is enough; we can use spaces with arbitrarily large dimension!

SVM with the kernel trick
SVM with kernel K:
1. Using QP packages, find $\alpha^\star$, the solution of: $\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j Y_i Y_j K(X_i, X_j) \alpha_i \alpha_j$ s.t. $\alpha_i \geq 0\ \forall i$, $\sum_i \alpha_i Y_i = 0$
2. Prediction function: $\hat{f}_K(x) = \mathrm{sgn}\left(\sum_i \alpha_i^\star Y_i K(X_i, x) + b^\star\right)$

Kernels
K is a kernel iff there exists a Hilbert space and a map φ such that $K(x, x') = \phi(x)^\top \phi(x')$ (Aronszajn's theorem).
Examples:
- Polynomial kernel: $K(x, x') = (1 + x^\top x')^p$ ($e = \binom{d+p}{p}$)
- Gaussian kernel: $K(x, x') = \exp(-\|x - x'\|^2 / h)$ ($e = \infty$)
- Kernel cookbook: http://www.cs.toronto.edu/~duvenaud/cookbook/index.html
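
A minimal sketch (not the lecture's code) of a kernelized SVM with the Gaussian kernel, using scikit-learn's SVC (a large C is used to approximate the hard margin); it then recomputes the prediction by hand with $\mathrm{sgn}(\sum_i \alpha_i^\star Y_i K(X_i, x) + b^\star)$ to check that the two coincide. The toy data is an illustrative assumption.

```python
# Gaussian-kernel SVM and the explicit kernelized prediction formula.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
# A non linearly separable toy problem: label depends on the distance to the origin.
X = rng.normal(size=(200, 2))
Y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)

h = 1.0
clf = SVC(kernel="rbf", gamma=1.0 / h, C=1e6).fit(X, Y)   # large C ~ hard margin

def K(a, b):                                              # K(x, x') = exp(-||x - x'||^2 / h)
    return np.exp(-np.linalg.norm(a - b, axis=-1) ** 2 / h)

x_new = np.array([0.2, 0.3])
# dual_coef_ stores alpha_i * Y_i for the support vectors, intercept_ stores b*
manual = np.sign(np.sum(clf.dual_coef_[0] * K(clf.support_vectors_, x_new)) + clf.intercept_[0])
print(manual, clf.predict(x_new[None])[0])                # the two predictions coincide
```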

Outline of the Course
1. Supervised Learning: Regression and Classification; Neural Networks (Deep Learning)
2. Unsupervised Learning: Clustering
3. Reinforcement Learning

6. Neural networks
Loosely inspired by how the brain works (McCulloch-Pitts, 1943): construct a network of simplified neurones, with the hope of approximating and learning any possible function.

The perceptron
The first artificial neural network, with one layer and $\sigma(x) = \mathrm{sgn}(x)$ (classification).
Input $x \in \mathbb{R}^d$, output in $\{-1, 1\}$. Can represent separating hyperplanes.

Multilayer perceptrons
They can represent any function from $\mathbb{R}^d$ to $\{-1, 1\}$... but the structure depends on the unknown target function f, and is difficult to optimise.

From perceptrons to neural networks
... and the number of layers can rapidly grow with the complexity of the function.
A key idea to make neural networks practical: soft-thresholding...

Soft-thresholding
Replace the hard-thresholding function σ by smoother functions.
Theorem (Cybenko 1989). Any continuous function f from $[0, 1]^d$ to $\mathbb{R}$ can be approximated by functions of the form $\sum_{j=1}^N \alpha_j \sigma(w_j^\top x + b_j)$, where σ is any sigmoid function.
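
A small numerical illustration (not Cybenko's construction, which is non-constructive): fix N random sigmoid units $\sigma(w_j x + b_j)$ and fit only the output weights $\alpha_j$ by least squares; the target function, N and the weight distributions are arbitrary choices.

```python
# One hidden layer of sigmoids approximating a continuous function on [0, 1].
import numpy as np

rng = np.random.default_rng(6)
N = 100
w = rng.normal(0, 10, size=N)                 # random input weights
b = rng.uniform(-10, 10, size=N)              # random biases
sigma = lambda s: 1.0 / (1.0 + np.exp(-s))    # a sigmoid

x = np.linspace(0, 1, 500)
target = np.sin(2 * np.pi * x)

Phi = sigma(np.outer(x, w) + b)               # Phi[i, j] = sigma(w_j x_i + b_j)
alpha, *_ = np.linalg.lstsq(Phi, target, rcond=None)
approx = Phi @ alpha                          # sum_j alpha_j sigma(w_j x + b_j)
print(np.max(np.abs(approx - target)))        # small for N large enough
```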

Soft-thresholding
Cybenko's theorem tells us that f can be represented using a single-hidden-layer network... but the proof is non-constructive: how many neurones do we need? It might depend on f...

Neural networks
A feedforward layered network (deep learning = enough layers).

Deep Learning and the ILSVR challenge
Deep learning has outperformed all other techniques in all major machine learning competitions (image classification, speech recognition and natural language processing).
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC):
1. Training: 1.2 million images (227 x 227), each labeled with one out of 1000 categories
2. Test: 100,000 images (227 x 227)
3. Error measure: the teams have to predict 5 (out of 1000) classes, and an image is considered correctly classified if at least one of the predictions is the ground truth.

ILSVR challenge (figure from Stanford CS231n lecture notes)

Architectures

Computing with neural networks
Layer 0: inputs $x = (x^{(0)}_1, \ldots, x^{(0)}_d)$, with the convention $x^{(0)}_0 = 1$
Layers $1, \ldots, L-1$: hidden layer $l$ has $d^{(l)} + 1$ nodes; $x^{(l)}_i$ is the state of node $i$, with $x^{(l)}_0 = 1$
Layer $L$: output $y = x^{(L)}_1$
Signal at node $k$ of layer $l$: $s^{(l)}_k = \sum_{i=0}^{d^{(l-1)}} w^{(l)}_{ik} x^{(l-1)}_i$
State at node $k$: $x^{(l)}_k = \sigma(s^{(l)}_k)$
Output: the state of the output node, $y = x^{(L)}_1$
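
A minimal sketch (not the lecture's code) of the forward pass with the notation above; the layer sizes and the tanh activation are illustrative choices, and the bias of each layer is carried by the constant node $x_0 = 1$.

```python
# Forward pass: compute states x^{(l)} and signals s^{(l)} layer by layer.
import numpy as np

sigma = np.tanh

def forward(x, W):
    """Return the states x^{(l)} and signals s^{(l)} of every layer for input x."""
    states, signals = [np.concatenate(([1.0], x))], [None]   # layer 0, with x_0 = 1
    for l, Wl in enumerate(W, start=1):
        s = states[-1] @ Wl                                   # s^{(l)}_k = sum_i w^{(l)}_{ik} x^{(l-1)}_i
        xl = sigma(s)                                         # x^{(l)}_k = sigma(s^{(l)}_k)
        if l < len(W):                                        # hidden layers keep the constant node
            xl = np.concatenate(([1.0], xl))
        signals.append(s)
        states.append(xl)
    return states, signals

# Example: d = 3 inputs, one hidden layer with 4 nodes, one output node
rng = np.random.default_rng(7)
W = [rng.normal(size=(3 + 1, 4)), rng.normal(size=(4 + 1, 1))]
states, signals = forward(rng.normal(size=3), W)
print(states[-1])                                             # the network output y = x^{(L)}_1
```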

Training neural networks
The output of the network is a function of $w = (w^{(l)}_{ij})_{i,j,l}$: $y = f_w(x)$. We wish to optimise over w to find the most accurate estimation of the target function.
Training data: $(X_1, Y_1), \ldots, (X_n, Y_n) \in \mathbb{R}^d \times \{-1, 1\}$
Objective: find w minimising the empirical risk $E(w) := \hat{R}_n(f_w) = \frac{1}{2n} \sum_{l=1}^n |f_w(X_l) - Y_l|^2$

Stochastic Gradient Descent
$E(w) = \frac{1}{2n} \sum_{l=1}^n E_l(w)$, where $E_l(w) := |f_w(X_l) - Y_l|^2$. In each iteration of the SGD algorithm, only one function $E_l$ is considered.
Parameter: learning rate α > 0
1. Initialization. $w := w_0$
2. Sample selection. Select l uniformly at random in $\{1, \ldots, n\}$
3. GD iteration. $w := w - \alpha \nabla E_l(w)$, go to 2.
Is there an efficient way of computing $\nabla E_l(w)$?

Backpropagation
We fix l and introduce $e(w) = E_l(w)$. Let us compute $\nabla e(w)$:
$\frac{\partial e}{\partial w^{(l)}_{ij}} = \frac{\partial e}{\partial s^{(l)}_j} \cdot \frac{\partial s^{(l)}_j}{\partial w^{(l)}_{ij}} = \delta^{(l)}_j \, x^{(l-1)}_i$, where $\delta^{(l)}_j := \frac{\partial e}{\partial s^{(l)}_j}$ and $\frac{\partial s^{(l)}_j}{\partial w^{(l)}_{ij}} = x^{(l-1)}_i$.
The sensitivity of the error w.r.t. the signal at node j can be computed recursively...

Backward recursion
Output layer: $\delta^{(L)}_1 := \frac{\partial e}{\partial s^{(L)}_1}$ and $e(w) = (\sigma(s^{(L)}_1) - Y_l)^2$, so $\delta^{(L)}_1 = 2(x^{(L)}_1 - Y_l)\,\sigma'(s^{(L)}_1)$.
From layer l to layer l-1: $\delta^{(l-1)}_i := \frac{\partial e}{\partial s^{(l-1)}_i} = \sum_{j=1}^{d^{(l)}} \frac{\partial e}{\partial s^{(l)}_j} \cdot \frac{\partial s^{(l)}_j}{\partial x^{(l-1)}_i} \cdot \frac{\partial x^{(l-1)}_i}{\partial s^{(l-1)}_i} = \sum_{j=1}^{d^{(l)}} \delta^{(l)}_j \, w^{(l)}_{ij} \, \sigma'(s^{(l-1)}_i)$.
Summary: $\frac{\partial E_l}{\partial w^{(l)}_{ij}} = \delta^{(l)}_j x^{(l-1)}_i$ and $\delta^{(l-1)}_i = \sum_{j=1}^{d^{(l)}} \delta^{(l)}_j w^{(l)}_{ij} \, \sigma'(s^{(l-1)}_i)$.

Backpropagation algorithm
Parameter: learning rate α > 0. Input: $(X_1, Y_1), \ldots, (X_n, Y_n) \in \mathbb{R}^d \times \{-1, 1\}$
1. Initialization. $w := w_0$
2. Sample selection. Select l uniformly at random in $\{1, \ldots, n\}$
3. Gradient of $E_l$. Set $x^{(0)}_i := X_{li}$ for all $i = 1, \ldots, d$.
Forward propagation: compute the state and signal $(x^{(l)}_i, s^{(l)}_i)$ at each node.
Backward propagation: propagate back $Y_l$ to compute $\delta^{(l)}_i$ at each node and the partial derivatives $\frac{\partial E_l}{\partial w^{(l)}_{ij}}$.
4. GD iteration. $w := w - \alpha \nabla E_l(w)$, go to 2.
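
A minimal sketch (not the lecture's code) of the backpropagation algorithm above for a network with tanh units and squared error, repeating the forward pass for self-containment; the toy classification data, network size, learning rate and iteration count are illustrative choices.

```python
# Backpropagation + stochastic gradient descent for a one-hidden-layer tanh network.
import numpy as np

sigma, dsigma = np.tanh, lambda s: 1.0 - np.tanh(s) ** 2

def forward(x, W):
    states, signals = [np.concatenate(([1.0], x))], [None]
    for l, Wl in enumerate(W, start=1):
        s = states[-1] @ Wl
        xl = sigma(s)
        if l < len(W):
            xl = np.concatenate(([1.0], xl))
        signals.append(s); states.append(xl)
    return states, signals

def backprop(x, y, W):
    """Gradients dE_l/dw for one sample, via the backward recursion on the deltas."""
    states, signals = forward(x, W)
    delta = 2.0 * (states[-1] - y) * dsigma(signals[-1])       # delta^{(L)}
    grads = [None] * len(W)
    for l in range(len(W), 0, -1):                             # layers L, L-1, ..., 1
        grads[l - 1] = np.outer(states[l - 1], delta)          # dE/dw^{(l)}_{ij} = x^{(l-1)}_i delta^{(l)}_j
        if l > 1:                                              # propagate to layer l-1 (skip the constant node)
            delta = (W[l - 1][1:] @ delta) * dsigma(signals[l - 1])
    return grads

# SGD training loop on a toy binary classification problem (labels in {-1, +1})
rng = np.random.default_rng(8)
X = rng.normal(size=(200, 2)); Y = np.sign(X[:, 0] * X[:, 1])  # a non-linear (XOR-like) rule
W = [rng.normal(0, 0.5, size=(3, 8)), rng.normal(0, 0.5, size=(9, 1))]
alpha = 0.05
for _ in range(20_000):
    i = rng.integers(len(Y))                                   # sample selection
    grads = backprop(X[i], Y[i], W)
    for Wl, Gl in zip(W, grads):
        Wl -= alpha * Gl                                       # w := w - alpha * grad E_l(w)
preds = np.array([np.sign(forward(x, W)[0][-1][0]) for x in X])
print((preds == Y).mean())                                     # training accuracy
```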

Example: tensorflow
http://playground.tensorflow.org/