Bits of Machine Learning Part 1: Supervised Learning

Alexandre Proutiere and Vahan Petrosyan, KTH (The Royal Institute of Technology)

Outline of the Course
1. Supervised Learning
- Regression and Classification
- Neural Networks (Deep Learning)
2. Unsupervised Learning
- Clustering
- Reinforcement Learning

Learning
Supervised (or predictive) learning: learn a mapping from inputs $x$ to outputs $y$, given a labeled set of input-output pairs (the training set) $D_n = \{(X_i, Y_i),\ i = 1, \dots, n\}$.
Example: we learn the classification function $f = 1$ if versicolor, $f = -1$ if virginica.

Learning
Unsupervised (or descriptive) learning: find interesting patterns in the data $D_n = \{X_i,\ i = 1, \dots, n\}$.
Example: we learn that there are 2 distinct types of iris and how to distinguish them!

Examples
Supervised learning:
- digit, flower, picture, ... recognition
- music classification
- predicting prices
- predicting the outcome of chemical reactions
- source separation: identify the instruments present in a recording
- ...
Unsupervised learning:
- classification (without training set): spam filter, news (Google News), ...
- identifying structures and causes in the data: e.g. J. Snow, cholera deaths vs pollution
- image segmentation
- ...

Supervised Learning: Outline
1. A statistical framework
2. Algorithms using Local Averaging (k-n.n., kernels)
3. Linear regression
4. Support Vector Machines
5. The kernel trick
6. Neural Networks

1. Supervised learning
Training set: $D_n = \{(X_i, Y_i),\ i = 1, \dots, n\}$
- Input features: $X_i \in \mathbb{R}^d$
- Output: $Y_i \in \mathcal{Y}$, where $\mathcal{Y} = \mathbb{R}$ for regression (price, position, etc.) and $\mathcal{Y}$ is finite for classification (type, mode, etc.)
$y$ is a non-deterministic and complicated function of $x$, i.e., $y = f(x) + z$ where $z$ is unknown (e.g. noise).
Goal: learn $f$.
[Figure: the learning algorithm]

A Statistical Approach
$(X_i, Y_i),\ i = 1, \dots, n$ are i.i.d. random variables with joint distribution $P$.
Model: look for an estimate of $f$ in a class of functions $\mathcal{F}$, e.g. linear, polynomial, ...
Learning algorithm: $A : D_n \mapsto \hat{f}_n \in \mathcal{F}$, where $\hat{f}_n$ estimates the true function $f$.
Performance of predictions is defined through a loss function $\ell$:
- Example 1. Regression, Least Squares (LS): $\mathcal{Y} = \mathbb{R}$, $\ell(y, y') = \frac{1}{2} |y - y'|^2$
- Example 2. Classification in $\mathcal{Y} = \{0, 1\}$: $\ell(y, y') = 1_{y \neq y'}$

Risks
The risk of a prediction function $g$ is $R(g) := E_P[\ell(g(X), Y)]$ (often referred to as the out-of-sample error).
Target function: $f^\star := \arg\min_{g \in \mathcal{F}} R(g)$
Empirical risk: $\hat{R}_n(g) := \frac{1}{n} \sum_{i=1}^n \ell(g(X_i), Y_i)$ (often referred to as the in-sample error).
Law of large numbers: $\lim_{n \to \infty} \hat{R}_n(g) = R(g)$ almost surely.
Central limit theorem: $\sqrt{n}\,(\hat{R}_n(g) - R(g)) \to \mathcal{N}(0, \mathrm{var}_P(\ell(g(X), Y)))$ in distribution.
Consistent algorithm: an algorithm is consistent if $\lim_{n \to \infty} E[R(\hat{f}_n)] = R(f^\star)$.
Warning. Consistency does not imply a good algorithm ($\mathcal{F}$ can be poor, $f^\star \neq f$).
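
The empirical risk above is just an average of losses over the training set. A minimal sketch, assuming NumPy and small made-up arrays (the data and the two candidate predictors `g` and `g_c` are hypothetical, not from the slides):

```python
import numpy as np

def squared_loss(y_pred, y):
    # LS loss: l(y, y') = 1/2 |y - y'|^2
    return 0.5 * (y_pred - y) ** 2

def zero_one_loss(y_pred, y):
    # Classification loss: l(y, y') = 1 if y != y', else 0
    return (y_pred != y).astype(float)

def empirical_risk(g, X, Y, loss):
    # \hat R_n(g) = (1/n) sum_i l(g(X_i), Y_i)
    return float(np.mean(loss(g(X), Y)))

# Hypothetical toy data (illustration only)
X = np.array([0.0, 1.0, 2.0, 3.0])
Y_reg = np.array([0.0, 1.0, 2.0, 5.0])   # regression targets
Y_cls = np.array([0, 0, 1, 0])           # labels in {0, 1}

g = lambda x: x                           # candidate regression function
g_c = lambda x: (x >= 1.5).astype(int)    # candidate classifier

risk_ls = empirical_risk(g, X, Y_reg, squared_loss)
risk_01 = empirical_risk(g_c, X, Y_cls, zero_one_loss)
```

Here only the last regression point is mispredicted (by 2), and only the last label is misclassified, which is what the two averages reflect.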

LS regression and binary classification
Example 1. Regression, Least Squares (LS)
- Risk: $R(g) = \frac{1}{2} E[|g(X) - Y|^2]$
- Target function: $f^\star(x) = E[Y \mid X = x]$
Example 2. Binary classification: $\mathcal{Y} = \{-1, 1\}$
- Risk: $R(g) = P(g(X) \neq Y)$
- Target function: $f^\star(x) = \arg\max_{k \in \{-1, 1\}} P(Y = k \mid X = x)$

Model selection - Choosing $\mathcal{F}$
Avoid overfitting! A few principles:
- Occam's razor principle: choose the simplest of two models if they explain the data equally well
- Hadamard's well-posed problems: unique solution, smooth in the parameters
- Regularization: minimize $\mathrm{error}(\hat{f}_n) + \Omega(\hat{f}_n)$, where $\Omega$ penalizes the model complexity

2. Local averaging algorithms
Principle: predict for $x$ based on the neighbours of $x$ in the training data:
$$\hat{f}_n(x) = \Psi\Big(\sum_{i=1}^n W_i(x) Y_i\Big), \qquad \sum_{i=1}^n W_i(x) = 1$$
$\Psi$ is the identity for regression, and $\Psi(\cdot) = \mathrm{sgn}(\cdot)$ for binary classification.

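Examples
Prediction function: $\hat{f}_n(x) = \Psi\big(\sum_{i=1}^n W_i(x) Y_i\big)$, with $\sum_{i=1}^n W_i(x) = 1$
- k-n.n.: $W_i(x) = 1/k$ if $X_i$ is one of the $k$ nearest neighbours of $x$, and $W_i(x) = 0$ otherwise
- Nadaraya-Watson algorithm: $W_i(x) \propto K\big(\frac{x - X_i}{h}\big)$, e.g. $K(u) = e^{-\|u\|^2}$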
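
The two weighting schemes above can be sketched in a few lines. This is an illustrative sketch, assuming NumPy, Euclidean distances, and the Gaussian-type kernel $K(u) = e^{-\|u\|^2}$; the function names and toy data are hypothetical:

```python
import numpy as np

def knn_weights(x, X, k):
    # W_i(x) = 1/k for the k nearest training points, 0 otherwise
    dist = np.linalg.norm(X - x, axis=1)
    w = np.zeros(len(X))
    w[np.argsort(dist)[:k]] = 1.0 / k
    return w

def nadaraya_watson_weights(x, X, h):
    # W_i(x) proportional to K((x - X_i)/h) with K(u) = exp(-|u|^2),
    # normalized so that the weights sum to 1
    u = np.linalg.norm(X - x, axis=1) / h
    kernel = np.exp(-u ** 2)
    return kernel / kernel.sum()

def predict(weights, Y, classify=False):
    # Psi = identity for regression, Psi = sgn for binary classification
    s = float(weights @ Y)
    return float(np.sign(s)) if classify else s

# Hypothetical toy data: the last point is a far-away outlier
X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
Y_train = np.array([0.0, 1.0, 2.0, 10.0])
x0 = np.array([1.0])

w_knn = knn_weights(x0, X_train, k=3)
y_knn = predict(w_knn, Y_train)
w_nw = nadaraya_watson_weights(x0, X_train, h=1.0)
y_nw = predict(w_nw, Y_train)
```

In both cases the outlier at 10 gets (almost) no weight when predicting at `x0`, which is the sense in which the averaging is local.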

Performance analysis
Stone's theorem: conditions on the weights $W_i$ under which the corresponding algorithm is consistent!
Example: LS target function $f^\star(x) = E[Y \mid X = x]$, distribution $P$ on $(X, Y)$. Assume that:
1. $X$ lies in a bounded set $A$ a.s. under $P$
2. For all $x \in A$, $\mathrm{var}(Y \mid X = x) \le \sigma^2$
3. $f^\star$ is $L$-Lipschitz
Then the k-n.n. estimator satisfies:
$$E[R(\hat{f}_n)] - R(f^\star) \le \frac{\sigma^2}{k} + c L^2 \Big(\frac{k}{n}\Big)^{2/d}$$
The k-n.n. algorithm is consistent with $P$ if $k$ goes to infinity sublinearly with $n$ (i.e., $k \to \infty$ and $k/n \to 0$).

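3. Linear regression: Minimizing the empirical risk
$\mathcal{F}$ = set of linear functions from $\mathbb{R}^d$ ($X_i \in \mathbb{R}^d$) to $\mathbb{R}$ ($Y_i \in \mathbb{R}$)
Function $f_\theta \in \mathcal{F}$ parametrized by $\theta \in \mathbb{R}^{d+1}$ (by convention $x_0 = 1$):
$$f_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_d x_d = \theta^\top x$$
Least squares model. Empirical risk of $\theta$:
$$\hat{R}_n(\theta) = \frac{1}{2n} \sum_{i=1}^n (f_\theta(X_i) - Y_i)^2$$
Minimal risk achieved for $\theta^\star = (X^\top X)^{-1} X^\top y$, where $y = [Y_1 \dots Y_n]^\top$ and $X$ is the matrix whose $i$-th row is $X_i^\top$.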
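
The closed-form minimizer can be checked numerically. A minimal sketch, assuming NumPy and synthetic noiseless data generated from a known parameter (`theta_true` and the data are hypothetical); it solves the normal equations rather than forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3

# Design matrix with the convention x_0 = 1 (first column of ones)
features = rng.normal(size=(n, d))
X = np.hstack([np.ones((n, 1)), features])

# Hypothetical ground-truth parameter, noiseless outputs
theta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ theta_true

# theta* = (X^T X)^{-1} X^T y, computed by solving (X^T X) theta = X^T y
theta_star = np.linalg.solve(X.T @ X, X.T @ y)
```

With noiseless data and $n > d$, `theta_star` recovers `theta_true` up to rounding error.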

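Gradient Descent for Regression
Alternative methods to find $\theta^\star$: sequential algorithms.
Batch gradient descent. Repeat:
$$\forall k \in \{0, 1, \dots, d\}, \quad \theta_k := \theta_k + \alpha \sum_{i=1}^n (Y_i - f_\theta(X_i)) X_{ik}$$
Stochastic gradient descent. Repeat:
1. Select a sample $i$ uniformly at random
2. Perform a descent step using $(X_i, Y_i)$ only:
$$\forall k \in \{0, 1, \dots, d\}, \quad \theta_k := \theta_k + \alpha (Y_i - f_\theta(X_i)) X_{ik}$$
Here $\alpha > 0$ is the learning rate; the factor $1/n$ of the empirical risk is absorbed into $\alpha$.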
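
Both update rules can be sketched as follows. This is an illustrative sketch, assuming NumPy, synthetic noiseless data, and hand-picked step sizes and iteration counts (all hypothetical); unlike the slide, the batch update divides by $n$ explicitly instead of absorbing it into $\alpha$:

```python
import numpy as np

def batch_gd(X, y, alpha, n_iters):
    # theta_k := theta_k + (alpha / n) * sum_i (Y_i - f_theta(X_i)) X_ik
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        residual = y - X @ theta
        theta = theta + alpha * (X.T @ residual) / len(y)
    return theta

def sgd(X, y, alpha, n_iters, rng):
    # Same update, but using a single uniformly sampled pair (X_i, Y_i)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        i = rng.integers(len(y))
        theta = theta + alpha * (y[i] - X[i] @ theta) * X[i]
    return theta

# Hypothetical synthetic data with x_0 = 1
rng = np.random.default_rng(0)
n, d = 50, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
theta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ theta_true

theta_batch = batch_gd(X, y, alpha=0.1, n_iters=2000)
theta_sgd = sgd(X, y, alpha=0.02, n_iters=20000, rng=rng)
```

On noisy data, SGD with a constant step size would only hover around the minimizer; the clean convergence here relies on the data being noiseless.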

Linear regression: Regularization
For $d > n$, $X^\top X \in \mathbb{R}^{d \times d}$ is not invertible, so $\theta^\star$ is not uniquely defined. We need to put additional constraints on the model to get a well-posed problem.
Regularization: add a cost for the magnitude of $\theta$:
$$\min_\theta \frac{1}{2n} \sum_{i=1}^n (f_\theta(X_i) - Y_i)^2 + \lambda \Omega(\theta)$$
- Ridge: $\Omega(\theta) = \|\theta\|_2^2$
- LASSO: $\Omega(\theta) = \|\theta\|_1$
- $\ell_p$: $\Omega(\theta) = \|\theta\|_p^p$
$p$ small ($< 1$): sparse solutions but hard optimization problems.
$p$ large ($\ge 1$): less sparse solutions but convex optimization problems.

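Linear regression: Regularization
Regularization: add a cost for the magnitude of $\theta$:
$$\min_\theta \frac{1}{2n} \sum_{i=1}^n (f_\theta(X_i) - Y_i)^2 + \lambda \Omega(\theta)$$
$\lambda > 0$ controls the bias-variance trade-off:
- High value of $\lambda$: the data has a low weight in the objective function (low variance but high bias)
- Low value of $\lambda$: the data has a high weight in the objective function (low bias but high variance)
Solution for Ridge regression: $\theta^\star = (X^\top X + n\lambda I)^{-1} X^\top y$ (this form corresponds to the penalty $\frac{\lambda}{2} \|\theta\|_2^2$; with $\Omega(\theta) = \|\theta\|_2^2$ exactly, $n\lambda$ becomes $2n\lambda$).
Prediction: for all $x \in \mathbb{R}^d$, $\hat{f}_\lambda(x) = y^\top (X X^\top + n\lambda I)^{-1} X x$, which depends on the data only through inner products.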
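
The ridge solution and the inner-product form of the prediction give the same value, which can be checked numerically. A minimal sketch, assuming NumPy and random data with $d > n$ (the sizes, seed and `lam` are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 5, 8, 0.1          # d > n: X^T X is rank-deficient
X = rng.normal(size=(n, d))    # i-th row is X_i^T
y = rng.normal(size=n)
x_new = rng.normal(size=d)

# Without regularization the d x d matrix X^T X has rank at most n < d
rank_gram = np.linalg.matrix_rank(X.T @ X)

# Primal form: theta* = (X^T X + n*lam*I_d)^{-1} X^T y, prediction theta*^T x
theta = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)
pred_primal = theta @ x_new

# Inner-product form: y^T (X X^T + n*lam*I_n)^{-1} X x
pred_dual = y @ np.linalg.solve(X @ X.T + n * lam * np.eye(n), X @ x_new)
```

The second form only needs the $n \times n$ matrix of inner products $X_i^\top X_j$ and the vector of inner products $X_i^\top x$.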

4. Support Vector Machines
SVM is a classification algorithm that aims at maximizing the margin of the separating hyperplane.
The data points (or vectors) realizing the minimum distance to this hyperplane are the support vectors (as they carry the hyperplane).

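Support Vector Machines: Derivation
Separating hyperplane $H_{\theta,b}$: $\theta^\top x + b = 0$
$\theta$ and $b$ are normalized so that $|\theta^\top X_{sv} + b| = 1$, where $X_{sv}$ is any support vector, i.e., $X_{sv} \in \arg\min_i d(X_i, H_{\theta,b})$. With this choice, the margin is simply $2/\|\theta\|_2$.
Max margin problem:
$$\max_{\theta,b} \frac{1}{\|\theta\|_2} \ \text{ s.t. } \min_i |\theta^\top X_i + b| = 1 \quad \Longleftrightarrow \quad \min_{\theta,b} \|\theta\|_2^2 \ \text{ s.t. } Y_i(\theta^\top X_i + b) \ge 1, \ \forall i$$
A simple quadratic program...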

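Support Vector Machines: Derivation
Lagrangian: $L(\theta, b, \alpha) = \frac{1}{2} \theta^\top \theta - \sum_i \alpha_i \big(Y_i(\theta^\top X_i + b) - 1\big)$
Dual problem:
$$\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j Y_i Y_j X_i^\top X_j \alpha_i \alpha_j \quad \text{s.t. } \alpha_i \ge 0 \ \forall i, \quad \sum_i \alpha_i Y_i = 0$$
A quadratic program: given the solution $\alpha^\star$, we have the following:
- (complementary slackness) $\alpha_i^\star > 0$ iff $X_i$ is a support vector
- (hyperplane) $\theta^\star = \sum_i \alpha_i^\star Y_i X_i$, and $b^\star$ is given by $Y_i(\theta^{\star\top} X_i + b^\star) = 1$ if $\alpha_i^\star > 0$
- (prediction function) $\hat{f}_n(x) = \mathrm{sgn}\big(\sum_i \alpha_i^\star Y_i X_i^\top x + b^\star\big)$
Remark. The solution and the prediction can be computed by just evaluating the inner products $X_i^\top X_j$ and $X_i^\top x$.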
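
To see how the hyperplane and the prediction are recovered from a dual solution, here is a small sketch, assuming NumPy and a hypothetical two-point training set $X_1 = (1, 0)$, $Y_1 = 1$, $X_2 = (-1, 0)$, $Y_2 = -1$. For this set the constraint $\sum_i \alpha_i Y_i = 0$ forces $\alpha_1 = \alpha_2 = a$, the dual objective reduces to $2a - 2a^2$, and its maximizer is $a = 1/2$; this value is plugged in by hand instead of calling a QP solver:

```python
import numpy as np

# Hypothetical toy training set (two points, linearly separable)
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
Y = np.array([1.0, -1.0])

# Dual solution derived by hand for this toy set (not computed by a solver)
alpha = np.array([0.5, 0.5])

# Hyperplane: theta* = sum_i alpha_i Y_i X_i
theta = (alpha * Y) @ X

# Offset: Y_i (theta^T X_i + b) = 1 for support vectors, and Y_i in {-1, 1},
# so b = Y_i - theta^T X_i; averaged over all support vectors
sv = np.flatnonzero(alpha > 1e-12)
b = float(np.mean(Y[sv] - X[sv] @ theta))

def predict(x):
    # sgn(sum_i alpha_i Y_i X_i^T x + b): only inner products X_i^T x are used
    inner = X @ x
    return float(np.sign((alpha * Y) @ inner + b))

label_right = predict(np.array([2.0, 3.0]))
label_left = predict(np.array([-0.3, 5.0]))
```

The recovered hyperplane is the vertical axis $x_1 = 0$, and both training points sit exactly on the margin.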

Support Vector Machines vs. Local Averaging
The predictor resembles that of a local averaging algorithm:
$$g(x) = \mathrm{sgn}\Big(\sum_i W_i(x) Y_i + b\Big) \quad \text{with } W_i(x) = \alpha_i X_i^\top x$$
But: $W_i(x)$ is not local anymore, as the support vectors span the entire data.

SVM: Extensions
What if the data is not linearly separable?
- Data almost linearly separable: soft-margin SVM (impose a cost of misclassification)
- Data not linearly separable at all: go to a higher dimensional space where the data becomes separable (the kernel trick)

5. SVM with the kernel trick
Non-linear transformation: $\phi : \mathbb{R}^d \to \mathbb{R}^e$ where $e \gg d$
Apply the SVM in $\mathbb{R}^e$ with the training data $(\phi(X_i), Y_i)_{i=1,\dots,n}$.
The optimal prediction can be constructed by only considering the scalar products $K(X_i, X_j) = \phi(X_i)^\top \phi(X_j)$ and $K(X_i, x) = \phi(X_i)^\top \phi(x)$.
$K$ is called a kernel (by definition, if all its matrices $(K(x_i, x_j))_{i,j}$ are positive semi-definite).
Kernel trick: the knowledge of $K$ is enough; we can use spaces with arbitrarily large dimension!

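SVM with the kernel trick
SVM with kernel $K$.
1. Using QP packages, find $\alpha^\star$, solution of:
$$\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j Y_i Y_j K(X_i, X_j) \alpha_i \alpha_j \quad \text{s.t. } \alpha_i \ge 0 \ \forall i, \quad \sum_i \alpha_i Y_i = 0$$
2. Prediction function: $\hat{f}_K(x) = \mathrm{sgn}\big(\sum_i \alpha_i^\star Y_i K(X_i, x) + b^\star\big)$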

Kernels
$K$ is a kernel iff there exists a Hilbert space and a map $\phi$ such that $K(x, x') = \phi(x)^\top \phi(x')$ (Aronszajn's theorem).
Examples:
- Polynomial kernel: $K(x, x') = (1 + x^\top x')^p$ ($e = \binom{d+p}{p}$, which equals $1 + d$ for $p = 1$)
- Gaussian kernel: $K(x, x') = \exp(-\|x - x'\|^2 / h)$ ($e = \infty$)
- Kernel cookbook: http://www.cs.toronto.edu/~duvenaud/cookbook/index.html
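
Both example kernels can be checked numerically against the definition. A minimal sketch, assuming NumPy and random points (the helper names, seed and sizes are hypothetical); it verifies that the Gram matrices have no negative eigenvalue beyond rounding error, and that for $p = 2$, $d = 2$ the polynomial kernel matches an explicit 6-dimensional feature map:

```python
import numpy as np

def polynomial_kernel(x, xp, p=2):
    # K(x, x') = (1 + x^T x')^p
    return (1.0 + x @ xp) ** p

def gaussian_kernel(x, xp, h=1.0):
    # K(x, x') = exp(-|x - x'|^2 / h)
    return np.exp(-np.sum((x - xp) ** 2) / h)

def gram_matrix(kernel, X):
    # Matrix of all pairwise kernel evaluations K(X_i, X_j)
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def phi_poly2(x):
    # Explicit feature map of the polynomial kernel for p = 2, d = 2
    x1, x2 = x
    r2 = np.sqrt(2.0)
    return np.array([1.0, r2 * x1, r2 * x2, x1 ** 2, x2 ** 2, r2 * x1 * x2])

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))

# Smallest eigenvalue of each (symmetric) Gram matrix
min_eig_poly = np.linalg.eigvalsh(gram_matrix(polynomial_kernel, X)).min()
min_eig_gauss = np.linalg.eigvalsh(gram_matrix(gaussian_kernel, X)).min()

# Kernel value vs. inner product of explicit features
u, v = np.array([0.5, -1.0]), np.array([2.0, 0.3])
k_direct = polynomial_kernel(u, v, p=2)
k_features = phi_poly2(u) @ phi_poly2(v)
```

The kernel evaluation costs one inner product in $\mathbb{R}^d$, whereas the explicit feature map grows quickly with $p$ and $d$, which is the point of the kernel trick.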

6. Neural networks
Loosely inspired by how the brain works (McCulloch-Pitts, 1943).
Construct a network of simplified neurones, with the hope of approximating and learning any possible function.

The perceptron
The first artificial neural network, with one layer and $\sigma(x) = \mathrm{sgn}(x)$ (classification).
Input $x \in \mathbb{R}^d$, output in $\{-1, 1\}$. Can represent separating hyperplanes.

Multilayer perceptrons
They can represent any function from $\mathbb{R}^d$ to $\{-1, 1\}$...
... but the structure depends on the unknown target function $f$, and is difficult to optimise.

From perceptrons to neural networks
... and the number of layers can rapidly grow with the complexity of the function.
A key idea to make neural networks practical: soft-thresholding...

Soft-thresholding
Replace the hard-thresholding function $\sigma$ by smoother functions.
Theorem (Cybenko 1989). Any continuous function $f$ from $[0, 1]^d$ to $\mathbb{R}$ can be approximated by a function of the form $\sum_{j=1}^N \alpha_j \sigma(w_j^\top x + b_j)$, where $\sigma$ is any sigmoid function.

Soft-thresholding
Cybenko's theorem tells us that $f$ can be represented using a single hidden layer network...
A non-constructive proof: how many neurones do we need? Might depend on $f$...

Neural networks
A feedforward layered network (deep learning = enough layers)

Deep Learning and the ILSVRC challenge
Deep learning has outperformed other techniques in all major machine learning competitions (image classification, speech recognition and natural language processing).
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC):
1. Training: 1.2 million images (227 × 227), each labeled with one out of 1000 categories
2. Test: 100,000 images (227 × 227)
3. Error measure: the teams have to predict 5 (out of 1000) classes, and a prediction for an image is considered correct if at least one of the 5 classes is the ground truth.

ILSVRC challenge
[Figure from the Stanford CS231n lecture notes]

Architectures

Computing with neural networks
Layer 0: inputs $x = (x_1^{(0)}, \dots, x_d^{(0)})$ and $x_0^{(0)} = 1$
Layers $1, \dots, L-1$: hidden layer $l$ has $d^{(l)} + 1$ nodes; $x_i^{(l)}$ is the state of node $i$, with $x_0^{(l)} = 1$
Layer $L$: output $y = x_1^{(L)}$
Signal at node $k$ of layer $l$: $s_k^{(l)} = \sum_{i=0}^{d^{(l-1)}} w_{ik}^{(l)} x_i^{(l-1)}$
State at node $k$ of layer $l$: $x_k^{(l)} = \sigma(s_k^{(l)})$
Output: the state $y = x_1^{(L)}$

Training neural networks
The output of the network is a function of $w = (w_{ij}^{(l)})_{i,j,l}$: $y = f_w(x)$
We wish to optimise over $w$ to find the most accurate estimation of the target function.
Training data: $(X_1, Y_1), \dots, (X_n, Y_n) \in \mathbb{R}^d \times \{-1, 1\}$
Objective: find $w$ minimising the empirical risk:
$$E(w) := \hat{R}_n(f_w) = \frac{1}{2n} \sum_{l=1}^n |f_w(X_l) - Y_l|^2$$

Stochastic Gradient Descent
$E(w) = \frac{1}{2n} \sum_{l=1}^n E_l(w)$ where $E_l(w) := |f_w(X_l) - Y_l|^2$
In each iteration of the SGD algorithm, only one function $E_l$ is considered...
Parameter. Learning rate $\alpha > 0$
1. Initialization. $w := w_0$
2. Sample selection. Select $l$ uniformly at random in $\{1, \dots, n\}$
3. GD iteration. $w := w - \alpha \nabla E_l(w)$, go to 2.
Is there an efficient way of computing $\nabla E_l(w)$?

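Backpropagation
We fix $l$, and introduce $e(w) = E_l(w)$. Let us compute $\nabla e(w)$:
$$\frac{\partial e}{\partial w_{ij}^{(l)}} = \underbrace{\frac{\partial e}{\partial s_j^{(l)}}}_{:= \delta_j^{(l)}} \ \underbrace{\frac{\partial s_j^{(l)}}{\partial w_{ij}^{(l)}}}_{= x_i^{(l-1)}}$$
The sensitivity of the error w.r.t. the signal at node $j$ can be computed recursively...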

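Backward recursion
Output layer. $\delta_1^{(L)} := \frac{\partial e}{\partial s_1^{(L)}}$ and $e(w) = (\sigma(s_1^{(L)}) - Y_l)^2$, hence
$$\delta_1^{(L)} = 2 (x_1^{(L)} - Y_l) \, \sigma'(s_1^{(L)})$$
From layer $l$ to layer $l-1$.
$$\delta_i^{(l-1)} := \frac{\partial e}{\partial s_i^{(l-1)}} = \sum_{j=1}^{d^{(l)}} \underbrace{\frac{\partial e}{\partial s_j^{(l)}}}_{:= \delta_j^{(l)}} \ \underbrace{\frac{\partial s_j^{(l)}}{\partial x_i^{(l-1)}}}_{= w_{ij}^{(l)}} \ \underbrace{\frac{\partial x_i^{(l-1)}}{\partial s_i^{(l-1)}}}_{= \sigma'(s_i^{(l-1)})}$$
Summary.
$$\frac{\partial E_l}{\partial w_{ij}^{(l)}} = \delta_j^{(l)} x_i^{(l-1)}, \qquad \delta_i^{(l-1)} = \sum_{j=1}^{d^{(l)}} \delta_j^{(l)} w_{ij}^{(l)} \sigma'(s_i^{(l-1)})$$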

Backpropagation algorithm
Parameter. Learning rate $\alpha > 0$
Input. $(X_1, Y_1), \dots, (X_n, Y_n) \in \mathbb{R}^d \times \{-1, 1\}$
1. Initialization. $w := w_0$
2. Sample selection. Select $l$ uniformly at random in $\{1, \dots, n\}$
3. Gradient of $E_l$.
- $x_i^{(0)} := X_{li}$ for all $i = 1, \dots, d$
- Forward propagation: compute the state and signal at each node, $(x_i^{(l)}, s_i^{(l)})$
- Backward propagation: propagate back $Y_l$ to compute $\delta_i^{(l)}$ at each node and the partial derivatives $\frac{\partial E_l}{\partial w_{ij}^{(l)}}$
4. GD iteration. $w := w - \alpha \nabla E_l(w)$, go to 2.
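
The forward and backward passes can be sketched for a small network and checked against finite differences. This is an illustrative sketch, not a reference implementation: it assumes NumPy, $\sigma = \tanh$ at every layer (so $\sigma' = 1 - \tanh^2$), a single sample, and a hypothetical 2-3-1 architecture; `weights[l - 1]` stores $w^{(l)}$ with row $i$ for the source node (row 0 is the constant node) and column $j$ for the destination node:

```python
import numpy as np

def forward(weights, x):
    # Returns states x^(l) (constant 1 prepended) and signals s^(l) for each layer
    states = [np.concatenate(([1.0], x))]
    signals = [None]
    for W in weights:
        s = states[-1] @ W                       # s_k = sum_i w_ik x_i
        signals.append(s)
        states.append(np.concatenate(([1.0], np.tanh(s))))
    return states, signals

def sample_error(weights, x, y):
    # E_l(w) = (x_1^(L) - Y_l)^2
    states, _ = forward(weights, x)
    return (states[-1][1] - y) ** 2

def backprop(weights, x, y):
    states, signals = forward(weights, x)
    L = len(weights)
    grads = [None] * L
    # Output layer: delta^(L) = 2 (x^(L) - Y) sigma'(s^(L))
    delta = 2.0 * (states[L][1:] - y) * (1.0 - np.tanh(signals[L]) ** 2)
    for l in range(L, 0, -1):
        # dE/dw_ij^(l) = delta_j^(l) x_i^(l-1)
        grads[l - 1] = np.outer(states[l - 1], delta)
        if l > 1:
            # delta_i^(l-1) = sum_j delta_j^(l) w_ij^(l) sigma'(s_i^(l-1));
            # the constant node (row 0) has no incoming signal and is skipped
            delta = (weights[l - 1][1:] @ delta) * (1.0 - np.tanh(signals[l - 1]) ** 2)
    return grads

def numerical_grad(weights, x, y, eps=1e-6):
    # Central finite differences, used only to validate backprop
    grads = [np.zeros_like(W) for W in weights]
    for l, W in enumerate(weights):
        for idx in np.ndindex(*W.shape):
            old = W[idx]
            W[idx] = old + eps
            e_plus = sample_error(weights, x, y)
            W[idx] = old - eps
            e_minus = sample_error(weights, x, y)
            W[idx] = old
            grads[l][idx] = (e_plus - e_minus) / (2 * eps)
    return grads

# Hypothetical 2-3-1 network and a single training pair
rng = np.random.default_rng(3)
sizes = [2, 3, 1]
weights = [rng.normal(scale=0.5, size=(sizes[l] + 1, sizes[l + 1]))
           for l in range(len(sizes) - 1)]
x, y = np.array([0.3, -0.7]), 1.0

analytic = backprop(weights, x, y)
numeric = numerical_grad(weights, x, y)
max_gap = max(np.abs(a - b).max() for a, b in zip(analytic, numeric))

# One GD iteration on this sample: w := w - alpha * grad E_l(w)
alpha = 0.1
weights = [W - alpha * g for W, g in zip(weights, analytic)]
```

One backward pass costs about as much as one forward pass, whereas the finite-difference check needs two forward passes per weight, which is why backpropagation is the efficient way to obtain $\nabla E_l(w)$.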

Example: TensorFlow Playground, http://playground.tensorflow.org/