Learning From Data: Modelling as an Optimisation Problem Iman Shames April 2017 1 / 31
You should be able to... Identify and formulate a regression problem; Appreciate the utility of regularisation; Identify and formulate a binary classification problem; Recognise a generic supervised learning problem as an optimisation problem; Formulate different instances of unsupervised learning problems; Observe that many learning problems are special instances of problems from optimisation theory. 2 / 31
Some Optimisation History "If the facts don't fit the theory, change the facts." -A. Einstein A family of problems arising in machine learning, viewed through the looking glass of optimisation theory, will be discussed. Supervised learning: fitting a model to given response data. Unsupervised learning: building a model for the data without a particular response in mind. Denote by [t_1, ..., t_m] ∈ R^{n×m} a generic matrix of data points; t_i ∈ R^n, i = 1, ..., m, is the i-th data point, aka example. A particular entry of a data point is known as a feature, e.g. temperature, price, blood pressure, signal strength, ... 3 / 31
Supervised Learning In supervised learning the goal is to build a model of an unknown function: t ↦ y(t). We are given a set of observations (examples), that is, a number of input-output pairs (t_i, y_i), i = 1, ..., m; y = [y_1, ..., y_m] is called the response vector. These examples are used to learn a model of the function, t ↦ ŷ(t; x), where x is a vector of model parameters. The goal is to use the data and the information in y to form a model. In turn the model can be used to predict a value ŷ for a (yet unseen) new test point t ∈ R^n. 4 / 31
Supervised Learning In the most general form: ŷ(t; x) = x⊤φ(t), where φ(t) is a given nonlinear function. If φ(t) = (1, t) we retrieve the affine relationship. Example: Demand Prediction for Inventory Management A store needs to predict demand for a specific item from customers in an area. It assumes that the logarithm of the demand is an arbitrary real number and depends linearly on a number of features: time of year, type of item, number of items sold the day before, etc. The problem is to predict the demand the next day based on observations of feature-demand pairs in the past. 5 / 31
Supervised Learning If the response is binary, e.g. y(t) ∈ {−1, 1} for every t, it is then referred to as a label. In this case a sign-(non)linear model can be used: ŷ(t; x) = sign(x⊤φ(t)). Letting φ(t) = [1, t] yields the sign-linear model. Example: Binary Classification of Credit Card Applicants A credit card company receives thousands of applications. Each application contains information about an applicant: age, marital status, annual salary, outstanding debts, etc. How can we set up a system that classifies the applicants into two categories: approved and not approved? 6 / 31
Supervised Learning The learning (or training) problem is to find the best model coefficient vector x such that ŷ(t_i; x) ≈ y_i, i = 1, ..., m. A certain measure of mismatch between y_i and ŷ(t_i; x) needs to be minimised. An important point is to assign a measure of reliability to the predicted values. This is the subject of statistical learning; see the following: The one that tells you what happens: Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning (Vol. 1). Berlin: Springer Series in Statistics. The one that asks you to believe: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 6). New York: Springer. 7 / 31
Least Squares Via Polynomials Learning has its roots in this theorem: Theorem (Weierstrass Approximation Theorem): Suppose f is a continuous real-valued function defined on the real interval [a, b]. For every ε > 0, there exists a polynomial p such that for all x in [a, b], we have |f(x) − p(x)| < ε. Let's assume a basic linear model for the data: y(t) = x_1 + t x_2 + δ = x⊤φ(t) + δ, where φ(t) = [1, t] and δ is the error term. 8 / 31
Least Squares Via Polynomials To find the coefficient (weight) vector x, we use a least-squares approach to minimise the training error: min_x (1/m) Σ_{i=1}^m (y_i − φ_i⊤x)², where φ_i = φ(t_i), i = 1, ..., m. For y = [y_1, ..., y_m] and Φ the matrix with columns φ_i: min_x ‖Φ⊤x − y‖₂². This is the learning (training) problem. Once this problem is solved, one can use the coefficients to predict values of y for unseen t. 9 / 31
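As a rough numerical sketch of this training step (not part of the slides), the least-squares problem can be solved with numpy; the data below is synthetic and the variable names are purely illustrative.

    import numpy as np

    # Hypothetical data: m = 50 noisy samples of y(t) = 2 + 3t
    rng = np.random.default_rng(0)
    t = rng.uniform(-1.0, 1.0, size=50)
    y = 2.0 + 3.0 * t + 0.1 * rng.standard_normal(50)

    # Columns of Phi are phi_i = [1, t_i]; solve min_x ||Phi^T x - y||_2^2
    Phi = np.vstack([np.ones_like(t), t])              # shape (2, m)
    x_hat, *_ = np.linalg.lstsq(Phi.T, y, rcond=None)

    # Use the learned coefficients to predict at an unseen test point
    t_new = 0.3
    y_new = x_hat @ np.array([1.0, t_new])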
Least Squares Via Polynomials If the linear models don't explain the data well, one can use a higher-order model: y(t) = x_1 + t x_2 + t²x_3 + ... + t^k x_{k+1} + δ. The fitting problem is still a least-squares problem: min_x ‖Φ⊤x − y‖₂², where now φ_i = [1, t_i, t_i², ..., t_i^k], i = 1, ..., m. The choice of k is magical and requires cross validation: leave a bunch of data points out, fit on the rest, and evaluate the performance of the predictor on the held-out points. Typically, as k increases the accuracy on the training data gets better, but the cross-validation error initially decreases (or remains constant) and then increases (over-fitting). 10 / 31
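A minimal sketch of choosing the degree k by holding out part of the data, assuming numpy; the data-generating cubic and all names are invented here for illustration.

    import numpy as np

    def poly_features(t, k):
        # phi_i = [1, t_i, t_i^2, ..., t_i^k] stacked as the columns of Phi
        return np.vstack([t**j for j in range(k + 1)])

    def holdout_error(t, y, k, n_train):
        # Fit on the first n_train points, evaluate on the held-out rest
        x_hat, *_ = np.linalg.lstsq(poly_features(t[:n_train], k).T,
                                    y[:n_train], rcond=None)
        resid = poly_features(t[n_train:], k).T @ x_hat - y[n_train:]
        return np.mean(resid**2)

    rng = np.random.default_rng(1)
    t = rng.uniform(-1.0, 1.0, size=80)
    y = 1.0 - 2.0 * t + 0.5 * t**3 + 0.05 * rng.standard_normal(80)
    for k in range(1, 10):                     # training error keeps falling,
        print(k, holdout_error(t, y, k, 60))   # hold-out error eventually rises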
Least Squares Via Polynomials Example: Polynomial Model for Wage Versus Age Data Income and demographic information for males in the central Atlantic region of the United States. The model is assumed to be a degree-4 polynomial: ŷ(t; x) = x_1 + x_2 t + x_3 t² + x_4 t³ + x_5 t⁴, with training objective f(x) = Σ_{i=1}^{62} (φ_i⊤x − y(t_i))². [Figure: scatter of Wage (50-300) against Age (20-80) with the fitted degree-4 polynomial.] Note that m = 62 and t_i is known perfectly. 11 / 31
Regularisation Assume that the degree is chosen to be some k. Not all polynomials of degree k are equal; some might vary wildly, with large derivatives. Such large variations might result in unreliability. The size of the coefficients is the other factor (other than the degree) that determines the behaviour of the model. If there is some bound on the input, then: Theorem: For a polynomial p_x(t) = x_1 + x_2 t + ... + x_{k+1} t^k, we have ∀t ∈ [−1, 1]: |dp_x(t)/dt| ≤ k^{3/2} ‖x‖₂, and ∀t ∈ [−1, 1]: |dp_x(t)/dt| ≤ k ‖x‖₁. 12 / 31
Regularisation Thus, one might want to have some control over the size of the derivatives through minimising ‖x‖₂ or ‖x‖₁. Two different objective functions need to be minimised... multi-objective optimisation. Let's have a quick excursion... 13 / 31
Multi-objective or Vector Optimisation It seems that there is a wrong but dominant perception that multi-objective optimisation, min_x F(x) = (f_1(x), ..., f_m(x)) s.t. x ∈ X, is inherently different from other types of optimisation. Well, it is not the case! Such problems are not particularly different from any other type of optimisation problem. It is all about defining what solution you are after. A trivial (and impossible in practice) solution is the case where a single point x ∈ X minimises all f_i(x), i = 1, ..., m. Two fundamental approaches: Scalarisation: min_x U(f_1(x), ..., f_m(x)) s.t. x ∈ X. Pareto approaches: x* is optimal if there is no x ∈ X with F(x) ≤ F(x*) and F(x) ≠ F(x*). 14 / 31
Multi-objective Optimisation In scalarisation, U(f_1(x), ..., f_m(x)) = Σ_{i=1}^m λ_i f_i(x). The choices of λ_i matter because of relative magnitudes, physical dimensions, etc. Scalarisation has a close relative: for each choice of λ_i, i = 1, ..., m, there exist some γ_i, i = 1, ..., m, and an index j such that the solution of the scalarised problem is the same as that of min_x f_j(x) s.t. f_i(x) ≤ γ_i, i ∈ {1, ..., m} \ {j}, x ∈ X. This latter formulation is easier to interpret from an application point of view. In the end we solve a bunch of optimisation problems. 15 / 31
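A toy illustration (not from the slides) of scalarisation and its constrained relative, sketched with numpy on a hypothetical bi-objective problem over X = [0, 1]; the objectives f1, f2, the weights, and the grid are all invented for illustration.

    import numpy as np

    f1 = lambda x: (x - 0.2) ** 2          # two competing objectives
    f2 = lambda x: (x - 0.9) ** 2
    X = np.linspace(0.0, 1.0, 1001)        # crude discretisation of X

    # Weighted-sum scalarisation for a few choices of (lambda_1, lambda_2)
    for lam1, lam2 in [(1.0, 0.0), (0.5, 0.5), (0.1, 0.9)]:
        print(lam1, lam2, X[np.argmin(lam1 * f1(X) + lam2 * f2(X))])

    # The close relative: minimise f1 subject to f2(x) <= gamma
    gamma = 0.1
    feasible = X[f2(X) <= gamma]
    print("constrained:", feasible[np.argmin(f1(feasible))])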
Regularisation Thus, one might want to have some control over the size of the derivatives through minimising ‖x‖₂ or ‖x‖₁. The ℓ₂-norm case: min_x ‖Φ⊤x − y‖₂² + λ‖x‖₂². This is known as Tikhonov or ridge regularisation and is easy to solve: the objective is smooth. 16 / 31
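Since the ridge objective is smooth (and strongly convex in x for λ > 0), setting its gradient to zero gives the linear system (ΦΦ⊤ + λI)x = Φy. A small numpy sketch, with an illustrative function name:

    import numpy as np

    def ridge(Phi, y, lam):
        # Minimiser of ||Phi^T x - y||_2^2 + lam * ||x||_2^2:
        # gradient = 0  gives  (Phi Phi^T + lam I) x = Phi y
        n = Phi.shape[0]
        return np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), Phi @ y)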
Regularisation and Sparsity The one with the ℓ₁-norm: min_x ‖Φ⊤x − y‖₂² + λ‖x‖₁. This one is the LASSO; solving it requires a bit of attention due to nonsmoothness. The LASSO, even with small λ, tends to produce a sparse solution (larger λ gives sparser solutions). 17 / 31
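One common way to handle the nonsmooth ℓ₁ term is proximal gradient (ISTA) with soft thresholding; the numpy sketch below is one such approach, not the only way to solve the LASSO, and the names are illustrative.

    import numpy as np

    def soft_threshold(v, tau):
        return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

    def lasso_ista(Phi, y, lam, n_iter=500):
        # Proximal gradient for min_x ||Phi^T x - y||_2^2 + lam * ||x||_1
        x = np.zeros(Phi.shape[0])
        step = 1.0 / (2.0 * np.linalg.norm(Phi, 2) ** 2)   # 1 / Lipschitz constant
        for _ in range(n_iter):
            grad = 2.0 * Phi @ (Phi.T @ x - y)
            x = soft_threshold(x - step * grad, lam * step)
        return x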
Binary Classification When predicting a label in {−1, 1}, we can use the following prediction rule: sign(t⊤x + b). Let's find x and b such that the average number of errors on a training set is minimised. The classification rule sign(t_i⊤x + b) is wrong for a data point t_i if y_i(t_i⊤x + b) < 0 (i.e. the two do not have the same sign). Define this average error using the function E(w) = 1 if w < 0 and E(w) = 0 otherwise. 18 / 31
Binary Classification Minimising the average 0/1 error directly is hard; instead we minimise the hinge loss, a convex upper bound on it: min_{x,b} (1/m) Σ_{i=1}^m max(0, 1 − y_i(t_i⊤x + b)). It has a smooth reformulation: min_{e,x,b} (1/m) Σ_{i=1}^m e_i s.t. e ≥ 0, e_i ≥ 1 − y_i(t_i⊤x + b), i = 1, ..., m. This formulation is the building block of Support Vector Machines (SVM). There might be a need to control the complexity/sensitivity of the model: regularisation. 19 / 31
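A small sketch comparing the 0/1 training error with its hinge-loss upper bound for a candidate (x, b); here, unlike the slides' convention, the data points are assumed to be stored as the rows of a matrix T (numpy assumed, names illustrative).

    import numpy as np

    def zero_one_error(x, b, T, y):
        # Fraction of training points with y_i (t_i^T x + b) < 0
        return np.mean(y * (T @ x + b) < 0)

    def hinge_loss(x, b, T, y):
        # (1/m) sum_i max(0, 1 - y_i (t_i^T x + b)); upper-bounds the 0/1 error
        return np.mean(np.maximum(0.0, 1.0 - y * (T @ x + b)))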
Binary Classification The sensitivity of the prediction to the input is governed by the magnitude of the (sub)gradient of t ↦ t⊤x + b, i.e. by x. If all data points are in a (Euclidean) ball of radius R: max_{t,t': ‖t‖₂ ≤ R, ‖t'‖₂ ≤ R} x⊤(t − t') = 2R‖x‖₂. The learning problem becomes: min_{x,b} (1/m) Σ_{i=1}^m max(0, 1 − y_i(t_i⊤x + b)) + λ‖x‖₂². Here λ ≥ 0 is a regularisation parameter to be chosen via cross validation. 20 / 31
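One simple way to attack this regularised problem is subgradient descent; the sketch below is a heuristic under the same assumptions as before (numpy, data points as rows of T, illustrative names), not a method prescribed by the slides.

    import numpy as np

    def svm_subgradient(T, y, lam, n_iter=2000, step=0.1):
        # Subgradient descent on (1/m) sum_i max(0, 1 - y_i(t_i^T x + b)) + lam ||x||_2^2
        m, n = T.shape
        x, b = np.zeros(n), 0.0
        for it in range(1, n_iter + 1):
            active = y * (T @ x + b) < 1.0        # points with a nonzero hinge term
            gx = -(T[active] * y[active, None]).sum(axis=0) / m + 2.0 * lam * x
            gb = -y[active].sum() / m
            x -= (step / it) * gx                 # diminishing step size
            b -= (step / it) * gb
        return x, b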
Binary Classification Again, the sensitivity of the prediction is governed by x. If all data points are in a box of size R: max_{t,t': ‖t‖∞ ≤ R, ‖t'‖∞ ≤ R} x⊤(t − t') = 2R‖x‖₁. The learning problem becomes: min_{x,b} (1/m) Σ_{i=1}^m max(0, 1 − y_i(t_i⊤x + b)) + λ‖x‖₁. Here λ ≥ 0 is a regularisation parameter to be chosen via cross validation. The ℓ₁-norm above encourages sparsity: only a few elements of t will be involved in the classification. This allows identifying key features. 21 / 31
Geometric Interpretations of SVM Recall the training problem min_{x,b} (1/m) Σ_{i=1}^m max(0, 1 − y_i(t_i⊤x + b)). Consider the case where the error in training is zero. It is equivalent to the existence of x and b such that y_i(x⊤t_i + b) ≥ 0, i = 1, ..., m. The data points are then linearly separable by the hyperplane {t : x⊤t + b = 0}. 22 / 31
Geometric Interpretations of SVM Consider the case where the data is strictly separable: y_i(x⊤t_i + b) ≥ β > 0, i = 1, ..., m. Normalising x and b (dividing both by β) yields y_i(x⊤t_i + b) ≥ 1, i = 1, ..., m. The two important hyperplanes {t : x⊤t + b = ±1} form a separating slab. 23 / 31
Geometric Interpretations of (Maximum Margin) SVM The choice of separating hyperplane (and consequently the slab) is not unique. One can maximise the width of the slab, i.e. the distance between the hyperplanes, known as the separation margin. The distance between the two hyperplanes is given by 2/‖x‖₂. Thus, to maximise the margin: min_{x,b} ‖x‖₂ s.t. y_i(x⊤t_i + b) ≥ 1, i = 1, ..., m. The points that lie on the hyperplanes are called support vectors. 24 / 31
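Assuming the data is linearly separable and that a convex-optimisation package such as cvxpy is available, the maximum-margin problem can be written almost verbatim (data points as rows of T, labels in {−1, +1}; names illustrative).

    import cvxpy as cp
    import numpy as np

    def max_margin_svm(T, y):
        # min ||x||_2  s.t.  y_i (t_i^T x + b) >= 1  (infeasible if not separable)
        m, n = T.shape
        x, b = cp.Variable(n), cp.Variable()
        constraints = [cp.multiply(y, T @ x + b) >= 1]
        cp.Problem(cp.Minimize(cp.norm(x, 2)), constraints).solve()
        return x.value, b.value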
SVM with Non-separable Data We introduce slack variables to capture constraint violations: e ≥ 0, y_i(x⊤t_i + b) ≥ 1 − e_i, i = 1, ..., m. Ideally, we would like to minimise the number of nonzero entries of e, i.e. ‖e‖₀. However, this is a hard (nonconvex) problem. Instead, we can minimise ‖e‖₁. This is the same as the basic SVM problem we saw before! 25 / 31
A Generic Supervised Learning Problem min_x L(Φ⊤x, y) + λ p(x). The loss function is usually assumed to be decomposable as a sum: L(z, y) = Σ_{i=1}^m l(z_i, y_i). Euclidean squared: L(z, y) = ‖z − y‖₂². ℓ₁: L(z, y) = ‖z − y‖₁. ℓ∞: L(z, y) = ‖z − y‖∞. Hinge: l(z, y) = max(0, 1 − yz). Logistic: l(z, y) = log(1 + e^{−yz}). The choice of the loss function depends on the task, the data, and practical implementation considerations. 26 / 31
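For concreteness, the listed per-sample losses can be sketched in numpy as follows (the hinge and logistic losses assume labels in {−1, +1}; names illustrative):

    import numpy as np

    squared  = lambda z, y: (z - y) ** 2
    absolute = lambda z, y: np.abs(z - y)
    hinge    = lambda z, y: np.maximum(0.0, 1.0 - y * z)
    logistic = lambda z, y: np.log1p(np.exp(-y * z))

    def total_loss(l, z, y):
        # Decomposable loss L(z, y) = sum_i l(z_i, y_i)
        return np.sum(l(z, y))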
A Generic Supervised Learning Problem The penalty function can be the ℓ₁-norm, the ℓ₂-norm, or an approximation of an indicator function used to capture constraints: min_x L(Φ⊤x, y) s.t. x ∈ X is the same as min_x L(Φ⊤x, y) + λ p(x), where p(x) = 0 if x ∈ X and p(x) = +∞ if x ∉ X. Sometimes, we add an ℓ₂-norm penalty to ensure uniqueness of the solution: strong convexity. 27 / 31
Unsupervised Learning In unsupervised learning, the data points t_i ∈ R^n, i = 1, ..., m, do not come with assigned labels or responses. The task is to learn the structure of the data. Principal component analysis (PCA): the idea is to discover the most important directions in a data set, those along which the data vary the most. In a two-dimensional example one might see big variations along the 45° direction and nearly none along the 135° direction; PCA is a way to generalise this intuition. 28 / 31
PCA Let t_i ∈ R^n, i = 1, ..., m, be the data points with average t̄ = (1/m) Σ_{i=1}^m t_i, and let Θ = [t̃_1 ... t̃_m] be the centred data matrix, where t̃_i = t_i − t̄. The goal is to find normalised directions z such that the variance of the projections of the centred data points along z is maximised. The component of the centred data along z: α_i = t̃_i⊤z. The mean-square variation of the data along z: (1/m) Σ_{i=1}^m α_i² = (1/m) Σ_{i=1}^m z⊤t̃_i t̃_i⊤z = (1/m) z⊤ΘΘ⊤z. The direction z along which the data has the largest variation is obtained from: max_z z⊤ΘΘ⊤z s.t. ‖z‖₂ = 1. The solution z* is the normalised eigenvector of ΘΘ⊤ corresponding to its largest eigenvalue. 29 / 31
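A direct numpy sketch of the computation above (function name illustrative): centre the data, form ΘΘ⊤, and take the eigenvector associated with the largest eigenvalue.

    import numpy as np

    def first_principal_direction(T):
        # T is n x m with the data points t_i as its columns, as in the slides
        Theta = T - T.mean(axis=1, keepdims=True)     # centred data matrix
        w, V = np.linalg.eigh(Theta @ Theta.T)        # eigenvalues in ascending order
        z = V[:, -1]                                  # eigenvector of the largest one
        return z / np.linalg.norm(z)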
Sparse PCA In sparse PCA a constraint is added to limit the number of nonzero elements in the decision variable: max_z z⊤ΘΘ⊤z s.t. ‖z‖₂ = 1, ‖z‖₀ ≤ k. For small-size problems one can enumerate the solutions, considering all possible support patterns for the entries of z. Let X be a matrix to be approximated by a rank-1 matrix: min_{p ∈ R^n, q ∈ R^m} ‖X − pq⊤‖_F. If the objective can be made small, X ≈ pq⊤, and we may interpret p as a typical data point and q as a typical feature. However, if X corresponds to positive data points, we cannot draw the same conclusion if p and q are not positive. 30 / 31
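A sketch of the enumeration idea for sparse PCA, viable only for small n (numpy assumed, helper name illustrative): for each support of size k, the best z restricted to that support is the top eigenvector of the corresponding principal submatrix of ΘΘ⊤.

    import numpy as np
    from itertools import combinations

    def sparse_pca_enumerate(Theta, k):
        # Brute force over all supports of size k
        S = Theta @ Theta.T
        n = S.shape[0]
        best_val, best_z = -np.inf, None
        for support in combinations(range(n), k):
            idx = list(support)
            w, V = np.linalg.eigh(S[np.ix_(idx, idx)])
            if w[-1] > best_val:
                best_val = w[-1]
                best_z = np.zeros(n)
                best_z[idx] = V[:, -1]
        return best_z, best_val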
Non-negative Matrix Factorisation (NNMF) Thus, sign constraints need to be enforced: min_{p ∈ R^n, q ∈ R^m} ‖X − pq⊤‖_F s.t. p ≥ 0, q ≥ 0. The interpretation is that each column of X is proportional to the single vector p, with the weights given by the entries of q; hence each data point follows the same profile p. More generally, NNMF: min_{P ∈ R^{n×k}, Q ∈ R^{m×k}} ‖X − PQ⊤‖_F s.t. P ≥ 0, Q ≥ 0. Here the data points follow a non-negative linear combination of k profiles, given by the columns of P. 31 / 31
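NNMF is usually tackled with heuristics; one classical choice (not dictated by the slides) is the multiplicative-update scheme, sketched below in numpy for an entrywise non-negative X, with illustrative names.

    import numpy as np

    def nnmf(X, k, n_iter=500, eps=1e-9):
        # Multiplicative updates for min ||X - P Q^T||_F, P >= 0, Q >= 0;
        # X must be entrywise non-negative; P is n x k, Q is m x k
        rng = np.random.default_rng(0)
        n, m = X.shape
        P, Q = rng.random((n, k)), rng.random((m, k))
        for _ in range(n_iter):
            Q *= (X.T @ P) / (Q @ (P.T @ P) + eps)
            P *= (X @ Q) / (P @ (Q.T @ Q) + eps)
        return P, Q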