Statistical Machine Learning Hilary Term 2018

1 Statistical Machine Learning Hilary Term 2018. Pier Francesco Palamara, Department of Statistics, University of Oxford. Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/sml18.html. January 31, 2018

2 Supervised Learning. Unsupervised learning: visualize, summarize and compress data; extract structure and postulate hypotheses about the data generating process from unlabelled observations $x_1, \ldots, x_N$. Supervised learning: in addition to the observations of $X$, we have access to their response variables / labels $Y \in \mathcal{Y}$: we observe $\{(x_i, y_i)\}_{i=1}^N$. Types of supervised learning: regression, where a numerical value is observed and $\mathcal{Y} = \mathbb{R}$; classification, where responses are discrete, e.g. $\mathcal{Y} = \{+1, -1\}$ or $\{1, \ldots, K\}$. The goal is to accurately predict the response $Y$ on new observations of $X$, i.e., to learn a function $f : \mathbb{R}^p \to \mathcal{Y}$ such that $f(X)$ will be close to the true response $Y$.

3 Supervised Learning. Regression example: house prices. Retrieve historical sales records.

4 Supervised Learning. Features used to predict: we will use properties of the house, e.g. square meters, distance from the train station, etc. Goal: predict the price of another house given these properties.

5 Supervised Learning. Classification example: lymphoma. We have gene expression measurements $X$ of $N = 62$ patients for $p = 4026$ genes. For each patient, $Y \in \{0, 1\}$ denotes one of two subtypes of cancer. In R, str(x) shows a data.frame of 62 obs. of 4026 numeric variables ($ Gene 1 : num ..., $ Gene 2 : num ..., and so on), and str(y) shows a numeric vector of length 62. Goal: predict the cancer subtype given the gene expressions of a new patient.

6 Supervised Learning. Regression vs. classification. [Figure contrasting a regression problem with a classification problem.]

7 Supervised Learning. Loss and risk: the loss function. Suppose we made a prediction $\hat{Y} = f(X) \in \mathcal{Y}$ based on an observation of $X$. How good is the prediction? We can use a loss function $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ to formalize the quality of the prediction. Typical loss functions:
Squared loss for regression: $L(Y, f(X)) = (f(X) - Y)^2$.
Absolute loss for regression: $L(Y, f(X)) = |f(X) - Y|$.
Misclassification loss (or 0-1 loss) for classification: $L(Y, f(X)) = 0$ if $f(X) = Y$ and $1$ if $f(X) \neq Y$.
Many other choices are possible, e.g., a weighted misclassification loss. In classification, if estimated probabilities $\hat{p}(k)$ for each class $k \in \mathcal{Y}$ are returned, the log-likelihood loss (or log loss) $L(Y, \hat{p}) = -\log \hat{p}(Y)$ is often used.
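
Not part of the original slides: a minimal R sketch of these loss functions, with helper names of our own choosing.

```r
# Minimal sketch of the losses above (helper names are our own)
squared_loss  <- function(y, fx) (fx - y)^2
absolute_loss <- function(y, fx) abs(fx - y)
zero_one_loss <- function(y, fx) as.numeric(fx != y)
log_loss      <- function(y, p_hat) -log(p_hat[y])  # p_hat: named vector of class probabilities

squared_loss(3, 2.5)                # 0.25
log_loss("a", c(a = 0.9, b = 0.1))  # -log(0.9), about 0.105
```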

8 Supervised Learning. Loss and risk: risk. The paired observations $\{(x_i, y_i)\}_{i=1}^N$ are viewed as i.i.d. realizations of a random variable $(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$ with joint distribution $P_{XY}$. For a given loss function $L$, the risk $R$ of a learned function $f$ is given by the expected loss $R(f) = \mathbb{E}_{P_{XY}}[L(Y, f(X))]$, where the expectation is with respect to the true (unknown) joint distribution of $(X, Y)$. The risk is unknown, but we can compute the empirical risk: $R_N(f) = \frac{1}{N} \sum_{i=1}^N L(y_i, f(x_i))$.
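
A short sketch of the empirical risk as an average loss over the training set (our own code, reusing squared_loss from the sketch above):

```r
# Empirical risk: average loss over the training set (reusing squared_loss above)
empirical_risk <- function(f, x, y, loss) mean(loss(y, f(x)))

f_const <- function(x) rep(5, length(x))  # a constant predictor
empirical_risk(f_const, x = c(1, 2, 3), y = c(4, 5, 6), loss = squared_loss)  # 2/3
```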

9 Supervised Learning. Loss and risk: hypothesis space and empirical risk minimization. The hypothesis space $\mathcal{H}$ is the space of functions $f$ under consideration. Inductive bias: the necessary assumptions on plausible hypotheses. Find the best function in the hypothesis space $\mathcal{H}$ by minimizing the risk: $f^* = \operatorname{argmin}_{f \in \mathcal{H}} \mathbb{E}_{X,Y}[L(Y, f(X))]$. Empirical Risk Minimization (ERM): minimize the empirical risk instead, since we typically do not know $P_{X,Y}$: $\hat{f} = \operatorname{argmin}_{f \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^N L(y_i, f(x_i))$. How complex should we allow functions $f$ to be? If the hypothesis space $\mathcal{H}$ is too large, ERM will overfit. The function $\hat{f}(x) = y_i$ if $x = x_i$ and $0$ otherwise will have zero empirical risk, but is useless for generalization, since it has simply memorized the dataset (see the sketch below).
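
An illustrative sketch (ours, not the course's) of the memorizing predictor described above:

```r
# The "memorizer": zero empirical risk, useless for generalization
make_memorizer <- function(x_train, y_train) {
  function(x) ifelse(x %in% x_train, y_train[match(x, x_train)], 0)
}
f_hat <- make_memorizer(c(1, 2, 3), c(4, 5, 6))
f_hat(2)    # 5: training point recalled exactly
f_hat(2.5)  # 0: no idea about unseen inputs
```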

10 Linear Regression We will use the framework of linear regression, which should be familiar to you, to illustrate some of the key concepts of supervised learning.

11 Linear regression: predicting the sale price of a house. We will use the house price example. (This will be our training data.)

12 Correlation between square footage and sale price. The size of a house is a good predictor of its price. [Scatter plot of sale price against square footage; the colors are not important here.]

13 Roughly linear relationship. The size of a house is a good predictor of its price: sale price ≈ price_per_sqft × square_footage + fixed_expense.

14 Algorithm: linear regression (ordinary least squares). Setup. Input: $x \in \mathbb{R}^D$ (covariates, predictors, features, etc.). Output: $y \in \mathbb{R}$ (responses, targets, outcomes, outputs, etc.). Hypotheses: $h_{\theta, \theta_0} : x \mapsto y$, with $h_{\theta, \theta_0}(x) = \theta_0 + \sum_d \theta_d x_d = \theta_0 + \theta^T x$, where $\theta = [\theta_1\ \theta_2\ \cdots\ \theta_D]^T$ are the weights (parameters) and $\theta_0$ is the intercept (also called the bias). Training data: $\mathcal{D} = \{(x_n, y_n),\ n = 1, 2, \ldots, N\}$. We will use the squared loss (differentiable): (sale price $-$ prediction)$^2 = (y_n - h_\theta(x_n))^2$. We could use other loss functions, e.g. the absolute loss: |sale price $-$ prediction| $= |y_n - h_\theta(x_n)|$.
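
A small R sketch (ours; the feature values and parameters are made up) of the linear hypothesis and the squared loss on one example:

```r
# Linear hypothesis h_theta(x) = theta_0 + theta^T x and the squared loss on one example
h_theta <- function(x, theta0, theta) theta0 + sum(theta * x)

x_n  <- c(120, 3.5)  # e.g. square meters, km from the station (made-up numbers)
y_n  <- 250000
pred <- h_theta(x_n, theta0 = 50000, theta = c(1500, -10000))
(y_n - pred)^2       # squared loss for this sale
```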

15 Algorithm. How do we learn the parameters? Minimize the prediction error on the training data. Hypothesis: $y = h_\theta(x) = \theta_0 + \theta_1 x$. We chose to minimize the squared loss. Empirical risk: $R_N(\theta) = \frac{1}{N} \sum_{n=1}^N (y_n - h_\theta(x_n))^2$.

16-19 Algorithm: intuition behind the squared loss. [Figures: for $x \in \mathbb{R}$, a sequence of candidate lines $h_\theta(x)$ and the corresponding empirical risk $R_N(\theta_1)$ as the slope $\theta_1$ varies.]

20-24 Algorithm: intuition behind the squared loss. [Figures: candidate fits $h_\theta(x)$ and the empirical risk surface $R_N(\theta_0, \theta_1)$ over both the intercept and the slope.]

25 Univariate solution. A simple case: $x$ is just one-dimensional ($D = 1$). Squared loss (dropping the $1/N$ for simplicity): $R_N(\theta) = \sum_n [y_n - h_\theta(x_n)]^2 = \sum_n [y_n - (\theta_0 + \theta_1 x_n)]^2$. Analytical solution: for linear regression, the minimization can be done in closed form. Identify the stationary points by taking the derivatives with respect to the parameters and setting them to zero: $\frac{\partial R_N(\theta)}{\partial \theta_0} = 0 \Rightarrow -2 \sum_n [y_n - (\theta_0 + \theta_1 x_n)] = 0$ and $\frac{\partial R_N(\theta)}{\partial \theta_1} = 0 \Rightarrow -2 \sum_n [y_n - (\theta_0 + \theta_1 x_n)] x_n = 0$.

26 Linear regression: univariate solution. Simplifying the two stationarity conditions gives the normal equations: $\sum_n y_n = N \theta_0 + \theta_1 \sum_n x_n$ and $\sum_n x_n y_n = \theta_0 \sum_n x_n + \theta_1 \sum_n x_n^2$. We have two equations and two unknowns. Solving, we get $\theta_1 = \frac{\sum_n (x_n - \bar{x})(y_n - \bar{y})}{\sum_n (x_n - \bar{x})^2}$ and $\theta_0 = \bar{y} - \theta_1 \bar{x}$, where $\bar{x} = \frac{1}{N} \sum_n x_n$ and $\bar{y} = \frac{1}{N} \sum_n y_n$.
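
A sketch of the closed-form univariate solution in R, following the slide's formulas (the helper name ols_univariate and the toy data are our own):

```r
# Closed-form univariate least squares, following the formulas above
ols_univariate <- function(x, y) {
  theta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
  theta0 <- mean(y) - theta1 * mean(x)
  c(theta0 = theta0, theta1 = theta1)
}

x <- c(1, 2, 3, 4); y <- c(2.1, 3.9, 6.2, 7.8)
ols_univariate(x, y)
coef(lm(y ~ x))  # R's built-in fit agrees
```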

27 Probabilistic interpretation. Why is minimizing $R_N$ sensible? Noisy observation model: $Y = \theta_0 + \theta_1 X + \eta$, where $\eta \sim N(0, \sigma^2)$ is a Gaussian random variable. Likelihood of one training sample $(x_n, y_n)$: $p(y_n \mid x_n; \theta) = N(\theta_0 + \theta_1 x_n, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{[y_n - (\theta_0 + \theta_1 x_n)]^2}{2\sigma^2}}$.

28 Probabilistic interpretation (cont'd). Log-likelihood of the training data $\mathcal{D}$ (assuming i.i.d.): $LL(\theta) = \log P(\mathcal{D}) = \log \prod_{n=1}^N p(y_n \mid x_n) = \sum_n \log p(y_n \mid x_n) = \sum_n \left\{ -\frac{[y_n - (\theta_0 + \theta_1 x_n)]^2}{2\sigma^2} - \log \sqrt{2\pi}\,\sigma \right\} = -\frac{1}{2\sigma^2} \sum_n [y_n - (\theta_0 + \theta_1 x_n)]^2 - \frac{N}{2} \log \sigma^2 - \frac{N}{2} \log 2\pi = -\frac{1}{2} \left\{ \sigma^{-2} \sum_n [y_n - (\theta_0 + \theta_1 x_n)]^2 + N \log \sigma^2 \right\} + \text{const}$. What is the relationship between minimizing $R_N$ and maximizing the log-likelihood?

29 Probabilistic interpretation: maximum likelihood estimation. Estimating $\sigma$, $\theta_0$ and $\theta_1$ can be done in two steps. Maximize over $\theta_0$ and $\theta_1$: $\max \log P(\mathcal{D}) \Leftrightarrow \min \sum_n [y_n - (\theta_0 + \theta_1 x_n)]^2$, which is $R_N(\theta)$! Then maximize over $s = \sigma^2$ (we could also estimate $\sigma$ directly): $\log P(\mathcal{D}) = -\frac{1}{2} \left\{ s^{-1} \sum_n [y_n - (\theta_0 + \theta_1 x_n)]^2 + N \log s \right\} + \text{const}$, so $\frac{\partial \log P(\mathcal{D})}{\partial s} = -\frac{1}{2} \left\{ -s^{-2} \sum_n [y_n - (\theta_0 + \theta_1 x_n)]^2 + N s^{-1} \right\} = 0 \Rightarrow \sigma^2 = s = \frac{1}{N} \sum_n [y_n - (\theta_0 + \theta_1 x_n)]^2$.
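
Continuing the sketch above (reusing x, y and ols_univariate), the maximum likelihood estimate of the noise variance:

```r
# MLE of the noise variance, reusing x, y and ols_univariate from the sketch above
theta      <- ols_univariate(x, y)
residuals  <- y - (theta["theta0"] + theta["theta1"] * x)
sigma2_hat <- mean(residuals^2)  # (1/N) * sum of squared residuals
sigma2_hat
```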

30 Multivariate solution in matrix form. Linear regression when $x$ is $D$-dimensional.

31 Multivariate solution in matrix form. $R_N(\theta)$ in matrix form: $R_N(\theta) = \sum_n [y_n - (\theta_0 + \sum_d \theta_d x_{nd})]^2 = \sum_n [y_n - \theta^T x_n]^2$, where we have redefined some variables (by augmenting): $x \leftarrow [1\ x_1\ x_2\ \ldots\ x_D]^T$ and $\theta \leftarrow [\theta_0\ \theta_1\ \theta_2\ \ldots\ \theta_D]^T$. This leads to $R_N(\theta) = \sum_n (y_n - \theta^T x_n)(y_n - x_n^T \theta) = \sum_n \left\{ \theta^T x_n x_n^T \theta - 2 y_n x_n^T \theta \right\} + \text{const} = \theta^T \left( \sum_n x_n x_n^T \right) \theta - 2 \left( \sum_n y_n x_n^T \right) \theta + \text{const}$.

32 Linear regression: multivariate solution in matrix form. $R_N(\theta)$ in the new notation. Design matrix and target vector: $X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix} \in \mathbb{R}^{N \times (D+1)}$, $y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$. Compact expression: $R_N(\theta) = \|X\theta - y\|_2^2 = \theta^T X^T X \theta - 2 (X^T y)^T \theta + \text{const}$.

33 Multivariate solution in matrix form: solution in matrix form. Compact expression: $R_N(\theta) = \|X\theta - y\|_2^2 = \theta^T X^T X \theta - 2 (X^T y)^T \theta + \text{const}$. Gradients of linear and quadratic functions: $\nabla_x (b^T x) = b$ and $\nabla_x (x^T A x) = 2 A x$ (for symmetric $A$). Normal equation: $\nabla_\theta R_N(\theta) \propto X^T X \theta - X^T y = 0$. This leads to the linear regression solution $\theta = (X^T X)^{-1} X^T y$. (Also see the PRML book for a geometric interpretation.)
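
A matrix-form sketch in R on simulated data (ours; solve() is applied to the normal equations $X^T X \theta = X^T y$ rather than forming the explicit inverse):

```r
# Normal-equation solution on simulated data; solve() on X^T X theta = X^T y
# avoids forming the explicit inverse
set.seed(0)
N <- 100; D <- 3
X <- cbind(1, matrix(rnorm(N * D), N, D))  # augmented design matrix (first column of 1s)
theta_true <- c(2, 1, -1, 0.5)
y <- X %*% theta_true + rnorm(N, sd = 0.1)
theta_hat <- solve(t(X) %*% X, t(X) %*% y)
theta_hat  # close to theta_true
```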

34 Multivariate solution in matrix form: mini-summary. Linear regression is a linear combination of features: $f : x \mapsto y$, with $f(x) = \theta_0 + \sum_d \theta_d x_d = \theta_0 + \theta^T x$. If we minimize the residual sum of squares as our learning objective, we get a closed-form solution for the parameters. Probabilistic interpretation: this is maximum likelihood estimation if we assume the residuals are Gaussian distributed. The $D$-dimensional case leads to compact expressions in matrix form.

35 Nonlinear basis functions. Can we learn non-linear functions? [Figure: noisy samples $t$ of a nonlinear function of $x$.] We can use a nonlinear mapping $\phi(x) : x \in \mathbb{R}^D \mapsto z \in \mathbb{R}^M$, where $M$ is the dimensionality of the new feature/input $z$ (or $\phi(x)$). Note that $M$ could be greater than, less than, or equal to $D$.

36 Nonlinear basis functions. Can we learn non-linear functions? We can use a nonlinear mapping $\phi(x) : x \in \mathbb{R}^D \mapsto z \in \mathbb{R}^M$. For instance, we could use polynomials of increasing order: $\phi_k(x_i) = x_i^k$. With the new features, we can apply our learning techniques to minimize our errors on the transformed training data; for linear methods, prediction is still based on $\theta^T \phi(x)$.

37 Nonlinear basis functions: regression with nonlinear basis functions. Residual sum of squares: $\sum_n [\theta^T \phi(x_n) - y_n]^2$, where $\theta \in \mathbb{R}^M$ has the same dimensionality as the transformed features $\phi(x)$. The linear regression solution can be formulated with the new design matrix $\Phi = \begin{pmatrix} \phi(x_1)^T \\ \phi(x_2)^T \\ \vdots \\ \phi(x_N)^T \end{pmatrix} \in \mathbb{R}^{N \times M}$: $\theta_{LMS} = (\Phi^T \Phi)^{-1} \Phi^T y$.
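
A sketch of polynomial basis regression in R (poly_design is our own helper; the noisy-sine data mirrors the slides' example):

```r
# Polynomial basis phi(x) = (1, x, ..., x^M) and the least-squares fit
poly_design <- function(x, M) outer(x, 0:M, `^`)  # N x (M+1) design matrix Phi

set.seed(1)
x <- seq(0, 1, length.out = 10)
y <- sin(2 * pi * x) + rnorm(10, sd = 0.2)        # noisy sine samples, as in the slides
Phi <- poly_design(x, M = 3)
theta_lms <- solve(t(Phi) %*% Phi, t(Phi) %*% y)  # (Phi^T Phi)^{-1} Phi^T y
```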

38 Nonlinear basis functions: regression with nonlinear basis functions. Polynomial basis functions: $\phi(x) = [1, x, x^2, \ldots, x^M]^T$, so that $f(x) = \theta_0 + \sum_{m=1}^M \theta_m x^m$. Fitting samples from a sine function: underfitting, as $f(x)$ is too simple. [Figures: fits with $M = 0$ and $M = 1$.]

39 Nonlinear basis functions: adding higher-order terms. With $M = 3$ the fit is good; with $M = 9$ the model overfits. [Figures: fits with $M = 3$ and $M = 9$.] More complex features lead to better results on the training data, but potentially worse results on new data, e.g., test data! (See the sketch below.)
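
An illustrative train-versus-test comparison in R (ours), reusing poly_design, x and y from the sketch above:

```r
# Training vs. test error as M grows, reusing poly_design, x and y from above
set.seed(2)
x_test <- runif(100)
y_test <- sin(2 * pi * x_test) + rnorm(100, sd = 0.2)
for (M in c(0, 1, 3, 9)) {
  Phi   <- poly_design(x, M)
  theta <- qr.solve(Phi, y, tol = 1e-12)  # least squares; low tol since the
                                          # degree-9 Vandermonde is ill-conditioned
  train_mse <- mean((Phi %*% theta - y)^2)
  test_mse  <- mean((poly_design(x_test, M) %*% theta - y_test)^2)
  cat(sprintf("M = %d: train MSE %.4f, test MSE %.4f\n", M, train_mse, test_mse))
}
```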

40 Nonlinear basis functions: overfitting. The fitted parameters for higher-order polynomials are very large. [Table: coefficients $\theta_0, \ldots, \theta_9$ of the fits for $M = 0, 1, 3, 9$; the numerical values did not survive transcription.]

41 Nonlinear basis functions: overfitting can be quite disastrous. Fitting the housing price data with $M = 3$: the predicted price would go to zero (or even negative) for large enough houses! This is called poor generalization / overfitting.
