Learning From Data: Modelling as an Optimisation Problem

Learning From Data: Modelling as an Optimisation Problem Iman Shames April 2017 1 / 31

You should be able to... Identify and formulate a regression problem; Appreciate the utility of regularisation; Identify and formulate a binary classification problem; Recognise a generic supervised learning problem as an optimisation problem; Formulate different instances of unsupervised learning problems; Observe that many learning problems are special instances of optimisation theory. 2 / 31

Some Optimisation History If the facts don't fit the theory, change the facts. -A. Einstein A family of problems arising in machine learning, viewed through the looking glass of optimisation theory, will be discussed. Supervised learning: fitting a model to given response data. Unsupervised learning: building a model for the data without a particular response in mind. Denote by [t_1, ..., t_m] ∈ R^{n×m} a generic matrix of data points; t_i ∈ R^n, i = 1, ..., m, is the i-th data point, aka example. A particular entry of a data point is known as a feature, e.g. temperature, price, blood pressure, signal strength, ... 3 / 31

Supervised Learning In supervised learning the goal is to build a model of an unknown function: t ↦ y(t). We are given a set of observations (examples), that is, a number of input-output pairs (t_i, y_i), i = 1, ..., m; y = [y_1, ..., y_m] is called the response vector. These examples are used to learn a model of the function t ↦ ŷ(t; x), where x is a vector of model parameters. The goal is to use the data and the information in y to form a model. In turn the model can be used to predict a value ŷ for a (yet unseen) new test point t ∈ R^n. 4 / 31

Supervised Learning In the most general form: ŷ(t; x) = x^T φ(t), where φ(t) is a given nonlinear function. If φ(t) = (1, t) we retrieve the affine relationship. Example: Demand Prediction for Inventory Management A store needs to predict demand for a specific item from customers in an area. It assumes that the logarithm of the demand is an arbitrary real number and depends linearly on a number of features: time of year, type of item, number of items sold the day before, etc. The problem is to predict the demand the next day based on the observations of feature-demand pairs in the past. 5 / 31

Supervised Learning If the response is binary, e.g. y(t) ∈ {−1, 1} for every t, it is then referred to as a label. In this case a sign-(non)linear model can be used: ŷ(t; x) = sign(x^T φ(t)). Letting φ(t) = [1, t] yields the sign-linear model. Example: Binary Classification of Credit Card Applicants A credit card company receives thousands of applications. Each application contains information about an applicant: age, marital status, annual salary, outstanding debts, etc. How can we set up a system that classifies the applicants into two categories: approved and not approved? 6 / 31

Supervised Learning The learning (or training) problem is to find the best model coefficient vector x such that ŷ(t_i; x) ≈ y_i, i = 1, ..., m. A certain measure of mismatch between y_i and ŷ(t_i; x) needs to be minimised. An important point is to assign a measure of reliability to the predicted values. This is the subject of statistical learning; see the following: The one that tells you what happens: Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning (Vol. 1). Berlin: Springer Series in Statistics. The one that asks you to believe: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 6). New York: Springer. 7 / 31

Least Squares Via Polynomials Learning has its roots in this theorem: Theorem (Weierstrass Approximation Theorem): Suppose f is a continuous real-valued function defined on the real interval [a, b]. For every ε > 0, there exists a polynomial p such that for all x in [a, b], we have |f(x) − p(x)| < ε. Let's assume a basic linear model for the data: y(t) = x_1 + t x_2 + δ = x^T φ(t) + δ, where φ(t) = [1, t] and δ is the error term. 8 / 31

Least Squares Via Polynomials To find the coefficient (weight) vector x, we use a least-squares approach to minimise the training error: min_x (1/m) Σ_{i=1}^m (y_i − φ_i^T x)^2, where φ_i = φ(t_i), i = 1, ..., m. For y = [y_1, ..., y_m] and Φ a matrix with columns φ_i: min_x ‖Φ^T x − y‖_2^2. This is the learning (training) problem. Once this problem is solved, one can use the coefficients to predict values of y for unseen t. 9 / 31
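
A minimal sketch of this training problem in Python, assuming synthetic data and the affine feature map φ(t) = [1, t]; all names below are illustrative and not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
t = rng.uniform(-1.0, 1.0, size=m)                 # data points t_i
y = 2.0 - 3.0 * t + 0.1 * rng.standard_normal(m)   # noisy responses y_i

Phi_T = np.column_stack([np.ones(m), t])           # rows are phi(t_i)^T = [1, t_i]
x_hat, *_ = np.linalg.lstsq(Phi_T, y, rcond=None)  # solves min_x ||Phi^T x - y||_2^2

t_new = 0.3                                        # unseen test point
y_pred = np.array([1.0, t_new]) @ x_hat            # predicted response y_hat
```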

Least Squares Via Polynomials If the linear models don't explain the data well, one can fit a higher-order model: y(t) = x_1 + t x_2 + t^2 x_3 + ... + t^k x_{k+1} + δ. The fitting problem is still a least-squares problem: min_x ‖Φ^T x − y‖_2^2, with each φ_i = [1, t_i, t_i^2, ..., t_i^k], i = 1, ..., m. The choice of k may seem magical; it requires cross-validation: leave a bunch of data points out and evaluate the performance of the predictor on them. Typically, as k increases the training accuracy gets better, but the cross-validation error initially decreases (or remains constant) and then increases (over-fitting). 10 / 31
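
A sketch of choosing k by a simple hold-out cross-validation, under the same assumptions as the previous snippet (synthetic data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 80
t = np.sort(rng.uniform(-1.0, 1.0, size=m))
y = np.sin(2.0 * np.pi * t) + 0.2 * rng.standard_normal(m)

idx = rng.permutation(m)
train, valid = idx[:60], idx[60:]                  # hold out 20 points

def design(t, k):
    """Rows phi_i = [1, t_i, t_i^2, ..., t_i^k]."""
    return np.vander(t, N=k + 1, increasing=True)

for k in range(1, 11):
    x_hat, *_ = np.linalg.lstsq(design(t[train], k), y[train], rcond=None)
    err_tr = np.mean((design(t[train], k) @ x_hat - y[train]) ** 2)
    err_cv = np.mean((design(t[valid], k) @ x_hat - y[valid]) ** 2)
    print(f"k={k:2d}  train MSE={err_tr:.4f}  validation MSE={err_cv:.4f}")
```

The training error typically keeps shrinking as k grows, while the validation error eventually starts to increase, matching the over-fitting behaviour described above.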

Least Squares Via Polynomials Example: Polynomial Model for Wage Versus Age Data Income and demographic information for males in the central Atlantic region of the United States. The model is assumed to be a degree-4 polynomial: ŷ(t; x) = x_1 + x_2 t + x_3 t^2 + x_4 t^3 + x_5 t^4, with training objective f(x) = Σ_{i=1}^{62} (φ_i^T x − y(t_i))^2. [Figure: scatter plot of Wage (50-300) against Age (20-80) with the fitted degree-4 polynomial overlaid.] Note that m = 62 and t_i is known perfectly. 11 / 31

Regularisation Assume that the degree is chosen to be some k. Not all polynomials of degree k are equal; some might vary wildly, with large derivatives. Such large variations might result in unreliability. The size of the coefficients is the other factor (other than the degree) that determines the behaviour of the model. If there is some bound on the input, then: Theorem: For a polynomial p_x(t) = x_1 + x_2 t + ... + x_{k+1} t^k, we have ∀ t ∈ [−1, 1]: |dp_x(t)/dt| ≤ k^{3/2} ‖x‖_2, and ∀ t ∈ [−1, 1]: |dp_x(t)/dt| ≤ k ‖x‖_1. 12 / 31

Regularisation Thus, one might want to have some control over the size of the derivatives through minimising ‖x‖_2 or ‖x‖_1. Two different objective functions need to be minimised... multi-objective optimisation. Let's have a quick excursion... 13 / 31

Multi-objective or Vector Optimisation It seems that there is a wrong but dominant perception that multi-objective optimisation is inherently different from other types of optimisation: min_x F(x) = (f_1(x), ..., f_m(x)) s.t. x ∈ X. Well, that is not the case! They are not particularly different from any other type of optimisation problem. It is all about defining what solution you are after. A trivial (and impossible in practice) solution is the case where a point x* ∈ X minimises all f_i(x), i = 1, ..., m. Two fundamental approaches: Scalarisation: min_x U(f_1(x), ..., f_m(x)) s.t. x ∈ X. Pareto approaches: x* is optimal if there is no x ∈ X with F(x) ≤ F(x*) componentwise and F(x) ≠ F(x*). 14 / 31

Multi-objective Optimisation In scalarisation, U(f_1(x), ..., f_m(x)) = Σ_{i=1}^m λ_i f_i(x). The choices of λ_i matter because of relative magnitudes, physical dimensions, etc. Scalarisation has a close relative: for each choice of λ_i, i = 1, ..., m, there exist some γ_i, i = 1, ..., m, and an index j such that the solution of the scalarised problem is the same as that of min_x f_j(x) s.t. f_i(x) ≤ γ_i, i ∈ {1, ..., m} \ {j}, x ∈ X. This latter formulation is easier to interpret from an application point of view. In the end we solve a bunch of optimisation problems. 15 / 31

Regularisation Thus, one might want to have some control over the size of the derivatives through minimising ‖x‖_2 or ‖x‖_1. The ℓ_2-norm case: min_x ‖Φ^T x − y‖_2^2 + λ ‖x‖_2^2. This is known as Tikhonov or ridge regularisation and is easy to solve: the objective is smooth. 16 / 31
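
A closed-form sketch of the ridge solution, assuming the same design-matrix convention as before (rows of Phi_T are φ_i^T); illustrative code, not from the lecture:

```python
import numpy as np

def ridge(Phi_T, y, lam):
    """Minimise ||Phi^T x - y||_2^2 + lam * ||x||_2^2 in closed form.

    The smooth, strongly convex objective gives the normal equations
    (Phi Phi^T + lam I) x = Phi y, solved by a single linear solve.
    """
    n = Phi_T.shape[1]
    A = Phi_T.T @ Phi_T + lam * np.eye(n)
    return np.linalg.solve(A, Phi_T.T @ y)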

Regularisation and Sparsity The one with the ℓ_1-norm: min_x ‖Φ^T x − y‖_2^2 + λ ‖x‖_1. This is the LASSO; solving it requires a bit of attention due to nonsmoothness. The LASSO, even with a small λ, results in a sparse solution. 17 / 31
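
One simple way to handle the nonsmooth ℓ_1 term is proximal gradient descent (ISTA), sketched below; this is one of several possible algorithms, not necessarily the one intended in the lecture, and the names are illustrative:

```python
import numpy as np

def lasso_ista(Phi_T, y, lam, n_iter=500):
    """Minimise ||Phi^T x - y||_2^2 + lam * ||x||_1 by ISTA:
    a gradient step on the smooth term followed by soft-thresholding,
    which is the proximal operator of the l1 penalty."""
    n = Phi_T.shape[1]
    x = np.zeros(n)
    step = 0.5 / np.linalg.norm(Phi_T, 2) ** 2     # <= 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = 2.0 * Phi_T.T @ (Phi_T @ x - y)
        z = x - step * grad
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return x
```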

Binary Classification When predicting a label in {−1, 1}, we can use the following prediction rule: sign(t^T x + b). Let's find x and b such that the average number of errors on a training set is minimised. The classification rule sign(t_i^T x + b) is wrong for some t_i if y_i(t_i^T x + b) < 0 (i.e. the two do not have the same sign). Define this average error using the following function: E(w) = 1 if w < 0, and E(w) = 0 otherwise. 18 / 31

Binary Classification Minimising the average of E(y_i(t_i^T x + b)) directly is a hard combinatorial problem, so it is replaced by the convex hinge surrogate: min_{x,b} (1/m) Σ_{i=1}^m max(0, 1 − y_i(t_i^T x + b)). It has a smooth reformulation: min_{e,x,b} (1/m) Σ_{i=1}^m e_i s.t. e ≥ 0, e_i ≥ 1 − y_i(t_i^T x + b), i = 1, ..., m. This formulation is the building block of Support Vector Machines (SVMs). There might be a need to control the complexity/sensitivity of the model: regularisation. 19 / 31
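
The smooth reformulation is a linear program, so it can be handed to a generic LP solver. A sketch using scipy.optimize.linprog; the stacking of the decision vector [x, b, e] and all names are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import linprog

def hinge_min_lp(T, y):
    """LP reformulation of min_{x,b} (1/m) sum_i max(0, 1 - y_i (t_i^T x + b)).

    T has rows t_i^T, y has entries in {-1, +1}.
    Decision vector z = [x (n), b (1), e (m)]."""
    m, n = T.shape
    c = np.concatenate([np.zeros(n + 1), np.ones(m) / m])    # objective: (1/m) sum_i e_i
    # constraint  e_i >= 1 - y_i (t_i^T x + b)  rewritten as
    #             -y_i t_i^T x - y_i b - e_i <= -1
    A_ub = np.hstack([-y[:, None] * T, -y[:, None], -np.eye(m)])
    b_ub = -np.ones(m)
    bounds = [(None, None)] * (n + 1) + [(0, None)] * m      # x, b free; e >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:n], res.x[n]                               # x, b
```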

Binary Classification An upper bound on the magnitude of the (sub)gradient is given by ‖x‖. If all data points lie in a (Euclidean) sphere of radius R: max_{t,t': ‖t‖_2 ≤ R, ‖t'‖_2 ≤ R} x^T(t − t') ≤ 2R ‖x‖_2. The learning problem becomes: min_{x,b} (1/m) Σ_{i=1}^m max(0, 1 − y_i(t_i^T x + b)) + λ ‖x‖_2^2, where λ ≥ 0 is a regularisation parameter to be chosen via cross-validation. 20 / 31

Binary Classification An upper bound on the magnitude of the (sub)gradient is given by ‖x‖. If all data points lie in a box of size R: max_{t,t': ‖t‖_∞ ≤ R, ‖t'‖_∞ ≤ R} x^T(t − t') ≤ 2R ‖x‖_1. The learning problem becomes: min_{x,b} (1/m) Σ_{i=1}^m max(0, 1 − y_i(t_i^T x + b)) + λ ‖x‖_1, where λ ≥ 0 is a regularisation parameter to be chosen via cross-validation. The ℓ_1-norm above encourages sparsity: only a few elements of t will be involved in the classification. This allows identifying key features. 21 / 31

Geometric Interpretations of SVM min_{x,b} (1/m) Σ_{i=1}^m max(0, 1 − y_i(t_i^T x + b)). Consider the case where the error in training is zero. It is equivalent to the existence of x and b such that y_i(x^T t_i + b) ≥ 0, i = 1, ..., m. The data points are linearly separable by the hyperplane {t : x^T t + b = 0}. 22 / 31

Geometric Interpretations of SVM Consider the case where the data is strictly separable: y_i(x^T t_i + b) ≥ β > 0, i = 1, ..., m. Normalising x and b (dividing both by β) yields y_i(x^T t_i + b) ≥ 1, i = 1, ..., m. The two important hyperplanes {t : x^T t + b = ±1} form a separating slab. 23 / 31

Geometric Interpretations of (Maximum Margin) SVM The choice of separating hyperplane (and consequently the slab) is not unique. One can maximise the width of the slab, i.e. the distance between the hyperplanes, known as the separation margin. The distance between the two hyperplanes is given by 2/‖x‖_2. Thus, to maximise the margin: min_{x,b} ‖x‖_2, s.t. y_i(x^T t_i + b) ≥ 1, i = 1, ..., m. The points that lie on the hyperplanes are called support vectors. 24 / 31
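
A sketch of this maximum-margin problem using the cvxpy modelling package, assuming the data is strictly linearly separable; names and data layout are illustrative:

```python
import cvxpy as cp
import numpy as np

def max_margin_svm(T, y):
    """Maximum-margin hyperplane: min ||x||_2 s.t. y_i (x^T t_i + b) >= 1.

    T has rows t_i^T and y has entries in {-1, +1}."""
    m, n = T.shape
    x = cp.Variable(n)
    b = cp.Variable()
    constraints = [cp.multiply(y, T @ x + b) >= 1]
    prob = cp.Problem(cp.Minimize(cp.norm(x, 2)), constraints)
    prob.solve()
    return x.value, b.value
```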

SVM with Non-separable Data We introduce slack variables to capture constraint violations: e ≥ 0, y_i(x^T t_i + b) ≥ 1 − e_i, i = 1, ..., m. Ideally, we would like to minimise the number of nonzero entries of e, i.e. ‖e‖_0. However, this is a hard (nonconvex) problem. Instead, we can minimise ‖e‖_1. This is the same as the basic SVM problem we saw before! 25 / 31

A Generic Supervised Learning Problem min_x L(Φ^T x, y) + λ p(x). The loss function is usually assumed to be decomposable as a sum: L(z, y) = Σ_{i=1}^m l(z_i, y_i). Euclidean squared: L(z, y) = ‖z − y‖_2^2; ℓ_1: L(z, y) = ‖z − y‖_1; ℓ_∞: L(z, y) = ‖z − y‖_∞; hinge: l(z, y) = max(0, 1 − yz); logistic: l(z, y) = log(1 + e^{−zy}). The choice of the loss function depends on the task, the data, and practical implementation considerations. 26 / 31
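
A sketch of the decomposable per-example losses listed above, written as plain NumPy functions (the ℓ_∞ loss is not decomposable, so it is omitted; summing any of these over i gives L(z, y)):

```python
import numpy as np

def squared_loss(z, y):
    return (z - y) ** 2

def absolute_loss(z, y):           # summing over i gives the l1 loss ||z - y||_1
    return np.abs(z - y)

def hinge_loss(z, y):              # labels y in {-1, +1}
    return np.maximum(0.0, 1.0 - y * z)

def logistic_loss(z, y):           # labels y in {-1, +1}
    return np.log1p(np.exp(-y * z))
```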

A Generic Supervised Learning Problem The penalty function can be the ℓ_1-norm, the ℓ_2-norm, or an approximation of an indicator function to capture constraints: min_x L(Φ^T x, y) s.t. x ∈ X is the same as min_x L(Φ^T x, y) + λ p(x), where p(x) = 0 if x ∈ X and p(x) = +∞ if x ∉ X. Sometimes, we add an ℓ_2-norm penalty to ensure uniqueness of the solution: strong convexity. 27 / 31

Unsupervised Learning In unsupervised learning, the data points t_i ∈ R^n, i = 1, ..., m, do not come with assigned labels or responses. The task is to learn the structure of the data. Principal component analysis (PCA): the idea is to discover the most important directions in a data set, those along which the data vary the most. (In the example scatter plot: big variations along the 45° direction and nearly none along 135°.) PCA is a way to generalise this intuition. 28 / 31

PCA Let t_i ∈ R^n, i = 1, ..., m, be the data points with average t̄ = (1/m) Σ_{i=1}^m t_i, and let Θ = [t_1 − t̄, ..., t_m − t̄] be the centred data matrix. The goal is to find normalised directions z such that the variance of the projections of the centred data points along z is maximised. The component of the centred data along z: α_i = (t_i − t̄)^T z. The mean-square variation of the data along z: (1/m) Σ_{i=1}^m α_i^2 = (1/m) z^T Θ Θ^T z. The direction z along which the data has the largest variation is obtained from max_z z^T Θ Θ^T z s.t. ‖z‖_2 = 1; z* is the normalised eigenvector of Θ Θ^T corresponding to its largest eigenvalue. 29 / 31
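
A sketch of this eigenvector computation with NumPy, following the column-wise data convention above (illustrative names):

```python
import numpy as np

def pca_directions(T_pts, num_components=1):
    """Leading principal directions of data stored column-wise (n x m):
    centre the data, form Theta Theta^T, and take the eigenvectors
    associated with the largest eigenvalues."""
    t_bar = T_pts.mean(axis=1, keepdims=True)
    Theta = T_pts - t_bar                        # centred data matrix
    C = Theta @ Theta.T / T_pts.shape[1]         # (1/m) Theta Theta^T
    eigvals, eigvecs = np.linalg.eigh(C)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]            # sort descending
    return eigvecs[:, order[:num_components]], eigvals[order[:num_components]]
```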

Sparse PCA In sparse PCA a constraint is added to limit the number of nonzero elements in the decision variable: max_z z^T Θ Θ^T z s.t. ‖z‖_2 = 1, ‖z‖_0 ≤ k. For small-size problems one can enumerate the solutions and consider all possible combinations for the entries of z. Let X be a matrix to be approximated by a rank-1 matrix: min_{p ∈ R^n, q ∈ R^m} ‖X − p q^T‖_F. If the objective can be made small, X ≈ p q^T, and we may interpret p as a typical data point and q as a typical feature. However, if X corresponds to positive data points, we cannot draw the same conclusion if p and q are not positive. 30 / 31

Non-negative Matrix Factorisation (NNMF) Thus, sign constraints need to be enforced: min_{p ∈ R^n, q ∈ R^m} ‖X − p q^T‖_F s.t. p ≥ 0, q ≥ 0. The interpretation is that each column of X is proportional to a single vector q with weights in p. Hence, each data point follows the same profile q. More generally, NNMF: min_{P ∈ R^{n×k}, Q ∈ R^{m×k}} ‖X − P Q^T‖_F s.t. P ≥ 0, Q ≥ 0. Here the data points follow a linear combination of k profiles given by the columns of Q. 31 / 31
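
A sketch of NNMF via the classical multiplicative updates; this is one of several algorithms for this nonconvex problem, not necessarily the method intended in the lecture, and X is assumed entrywise non-negative:

```python
import numpy as np

def nnmf(X, k, n_iter=200, eps=1e-9):
    """Approximate X (n x m, non-negative) by P Q^T with P >= 0, Q >= 0,
    P of size n x k and Q of size m x k, using multiplicative updates
    that keep the factors non-negative at every iteration."""
    n, m = X.shape
    rng = np.random.default_rng(0)
    P = rng.random((n, k))
    Q = rng.random((m, k))
    for _ in range(n_iter):
        P *= (X @ Q) / (P @ (Q.T @ Q) + eps)     # update P with Q fixed
        Q *= (X.T @ P) / (Q @ (P.T @ P) + eps)   # update Q with P fixed
    return P, Q
```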