Fundamentals of Machine Learning


Mohammad Emtiyaz Khan
AIP (RIKEN), Tokyo
http://icapeople.epfl.ch/mekhan/ | emtiyaz@gmail.com
Jan 2, 2017

Goals

Understand (some) fundamentals of machine learning.

Part I: Understand the basic set-up to analyze data under a machine-learning framework.
1. Before Machine Learning
2. ML Problem: Regression
3. Model: Linear Regression
4. Cost Function: MSE
5. Algorithm: Gradient Descent

Part II: Understand what can go wrong when learning from data and how to correct it.
6. Challenge: Overfitting
7. Solutions: Regularization
8. Bias-Variance Decomposition
9. Recent Advances

Some figures are taken from Hastie, Tibshirani, and Friedman's book on statistical learning, and from Chris Bishop's machine-learning book.

1 Before Machine Learning

Acquiring data

Data is the most important component of modern machine learning. There are many steps that can have a huge impact on the performance of a machine-learning system. To name a few: data collection, cleaning, validation, pre-processing, and storage. (Picture taken from "Doing Data Science".)

Defining an ML problem

Once we have some data, the next step is to restate the real-world problem in the context of the data, and then to convert it into a machine-learning problem. ML problems can be categorized into three main types: supervised, unsupervised, and reinforcement learning. In practice, a successful end-to-end system might require a combination of these.

2 ML Problem: Regression

What is regression?

Regression relates input variables to an output variable, either to predict outputs for new inputs or to understand the effect of the inputs on the output.

Dataset for regression

In regression, the data consists of pairs (y_n, x_n), where y_n is the n-th output and x_n is a vector of D inputs. The number of pairs N is the data size and D is the dimensionality.

Examples of regression

(a) Height is correlated with weight. (Figure taken from "Machine Learning for Hackers".)

(b) How does advertisement on TV, radio, and newspaper affect sales? (Taken from the book "An Introduction to Statistical Learning".)

Two goals of regression

In prediction, we wish to predict the output for a new input vector, e.g. what is the weight of a person who is 170 cm tall? In interpretation, we wish to understand the effect of the inputs on the output, e.g. are taller people also heavier?

The regression function

For both goals, we need to find a function that approximates the output well enough given the inputs:

    y_n \approx f(x_n), \quad \text{for all } n

Additional Notes

Prediction vs. interpretation

Some questions to think about: are these prediction tasks or interpretation tasks?
1. What is the life expectancy of a person who has been smoking for years?
2. Does smoking cause cancer?
3. When the number of packs a smoker smokes per day doubles, does their predicted life span get cut in half?
4. A massive-scale earthquake will occur in California within the next 30 years.
5. More than 300 bird species in North America could lose half or more of their habitat by 2080.

3 Model: Linear Regression

What is it?

Linear regression is a model that assumes a linear relationship between the inputs and the output.

Why learn about linear regression?

Plenty of reasons: it is simple, easy to understand, the most widely used model, and easily generalized to non-linear models. Most importantly, you can learn almost all of the fundamental concepts of ML with regression alone.

Simple linear regression

With only one input dimension, it is simple linear regression:

    y_n \approx f(x_n) := \beta_0 + \beta_1 x_n

Here, β_0 and β_1 are the parameters of the model.

Multiple linear regression

With multiple input dimensions, it is multiple linear regression:

    y_n \approx f(x_n) := \beta_0 + \beta_1 x_{n1} + \ldots + \beta_D x_{nD} = x_n^T \beta    (1)

Learning/estimation/fitting

Given data, we would like to find β = [β_0, β_1, ..., β_D]. This is called learning or estimating the parameters, or fitting the model.
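To make the notation concrete, here is a minimal NumPy sketch of the model in equation (1). The function name `predict` and the toy numbers are illustrative, not part of the notes; the only assumption is that a column of ones is prepended to the inputs so that β_0 acts as the offset.

```python
import numpy as np

def predict(X, beta):
    """Multiple linear regression: f(x_n) = beta_0 + beta_1 x_n1 + ... + beta_D x_nD.

    X    : (N, D) matrix of inputs, one row per example.
    beta : (D + 1,) parameter vector [beta_0, beta_1, ..., beta_D].
    """
    X_tilde = np.column_stack([np.ones(len(X)), X])  # prepend a column of ones for beta_0
    return X_tilde @ beta

# Tiny example: N = 3 examples, D = 2 input dimensions.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
beta = np.array([0.5, 1.0, -1.0])
print(predict(X, beta))  # [-0.5, -0.5, -0.5]
```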

Additional Notes

The p > n problem

Consider the following simple situation: you have N = 1 and you want to fit y_1 ≈ β_0 + β_1 x_1, i.e. you want to find β_0 and β_1 given just one pair (y_1, x_1). Is it possible to find such a line? This is related to what is called the p > n problem. In our notation, this would be the D > N problem, i.e. the number of parameters exceeds the number of data examples. Similar issues arise when we use gradient descent or least squares to fit a linear model. These problems are all addressed by regularization, which we will learn about later. A small numerical check of the N = 1 case is sketched below.
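This is an illustrative sketch (the specific numbers are made up): any slope β_1 fits the single data point exactly once β_0 is set accordingly, and least squares merely picks one of the infinitely many solutions.

```python
import numpy as np

# One data point (x_1, y_1) and a 2-parameter model y ~ beta_0 + beta_1 * x.
x1, y1 = 2.0, 3.0

# Any beta_1 works: set beta_0 = y1 - beta_1 * x1 and the single point is fit exactly.
for beta1 in [-1.0, 0.0, 5.0]:
    beta0 = y1 - beta1 * x1
    print(beta0, beta1, beta0 + beta1 * x1)   # the prediction always equals y1 = 3.0

# Least squares picks one of these infinitely many solutions (the minimum-norm one).
X_tilde = np.array([[1.0, x1]])
beta, *_ = np.linalg.lstsq(X_tilde, np.array([y1]), rcond=None)
print(beta)   # [0.6, 1.2], which also satisfies beta_0 + beta_1 * x1 = 3.0
```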

4 Cost Function: MSE

Motivation

Consider the following models.

    1-parameter model: y_n \approx \beta_0
    2-parameter model: y_n \approx \beta_0 + \beta_1 x_n

How can we estimate (or guess) values of β given the data D?

What is a cost function?

Cost functions (also called utilities or energies) are used to learn parameters that explain the data well. They define how costly our mistakes are.

Two desirable properties of cost functions

When y is real-valued, it is desirable that the cost is symmetric around 0, since positive and negative errors should be penalized equally. Also, our cost function should penalize large mistakes and very large mistakes almost equally.

Mean Square Error (MSE)

MSE is one of the most popular cost functions:

    MSE(\beta) := \frac{1}{2N} \sum_{n=1}^{N} [y_n - f(x_n)]^2

Does it have both of the properties above?

An exercise for MSE

Compute the MSE for the 1-parameter model:

    L(\beta_0) := \frac{1}{2N} \sum_{n=1}^{N} [y_n - \beta_0]^2    (2)

In the table below, each row contains a value of y_n and each column a value of β_0. First, compute the MSE for y_n = {1, 2, 3, 4} and draw the MSE as a function of β_0 (by adding the first four rows). Then add y_n = 12 to it and redraw the MSE. What do you observe, and why?

    y_n \ β_0 |  1   2   3   4   5   6   7
    1         |
    2         |
    3         |
    4         |
    MSE 1     |
    12        |
    MSE 2     |
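A short sketch of this exercise, assuming the values reconstructed above (y_n = {1, 2, 3, 4} plus the extra observation 12); the helper name `mse_1param` is illustrative.

```python
import numpy as np

def mse_1param(y, beta0):
    """MSE of the 1-parameter model f(x_n) = beta0, as in equation (2)."""
    y = np.asarray(y, dtype=float)
    return np.mean((y - beta0) ** 2) / 2.0

beta0_grid = np.arange(1, 8)                      # the columns 1..7 of the table
y_small = [1, 2, 3, 4]
y_with_outlier = [1, 2, 3, 4, 12]

print([round(mse_1param(y_small, b), 2) for b in beta0_grid])
print([round(mse_1param(y_with_outlier, b), 2) for b in beta0_grid])
# The continuous minimizer of the MSE is the mean: 2.5 for {1, 2, 3, 4} and 4.4 once 12 is
# added, so the squared error is pulled strongly toward the large mistake.
```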

Additional Notes

A question about cost functions

Is there an automatic way to define loss functions?

Nasty cost functions: visualization

See Andrej Karpathy's Tumblr post for many cost functions gone wrong for neural networks: http://lossfunctions.tumblr.com/.

5 Algorithm: Gradient Descent

Learning/estimation/fitting

Given a cost function L(β), we wish to find the β that minimizes the cost:

    \min_{\beta} L(\beta), \quad \text{subject to } \beta \in \mathbb{R}^{D+1}

This is learning posed as an optimization problem. We will use an algorithm to solve it.

Grid search

Grid search is one of the simplest algorithms: we compute the cost over a grid (of, say, M points per parameter) to find the minimum. This is extremely simple and works for any kind of loss when we have very few parameters and the loss is easy to compute. For a large number of parameters, however, grid search requires too many nested for-loops, resulting in computational complexity that is exponential in the number of parameters. Choosing a good range of values is another problem. Are there any other issues?
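As a concrete illustration of grid search on the 2-parameter model, here is a hedged sketch in NumPy; the toy data, grid range, and function names are choices made for this example, not prescribed by the notes.

```python
import itertools
import numpy as np

def mse(beta, X_tilde, y):
    """MSE cost L(beta) = (1/2N) * sum (y_n - x_n^T beta)^2."""
    r = y - X_tilde @ beta
    return 0.5 * np.mean(r ** 2)

def grid_search(X_tilde, y, grid):
    """Evaluate the cost on every combination of parameter values: M^(D+1) points."""
    best_beta, best_cost = None, np.inf
    for beta in itertools.product(grid, repeat=X_tilde.shape[1]):
        c = mse(np.array(beta), X_tilde, y)
        if c < best_cost:
            best_beta, best_cost = np.array(beta), c
    return best_beta, best_cost

# 2-parameter toy example: y ~ 1 + 2x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(50)
X_tilde = np.column_stack([np.ones_like(x), x])
grid = np.linspace(-3, 3, 61)          # M = 61 points per parameter, so 61^2 evaluations
print(grid_search(X_tilde, y, grid))   # approximately [1.0, 2.0]
```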

Follow the gradient

A gradient (at a point) is the slope of the tangent (at that point). It points in the direction of largest increase of the function. For the 2-parameter model, this is shown below (computed for a small dataset of three (x_n, y_n) pairs).

[Figure: MSE surface as a function of β_0 and β_1.]

Batch gradient descent

To minimize the function, take a step in the direction opposite to the gradient:

    \beta^{(k+1)} \leftarrow \beta^{(k)} - \alpha \frac{\partial L(\beta^{(k)})}{\partial \beta}

where α > 0 is the step size (or learning rate).

Gradient descent for the 1-parameter model to minimize the MSE:

    \beta_0^{(k+1)} = (1 - \alpha)\beta_0^{(k)} + \alpha \bar{y}

where ȳ = Σ_n y_n / N. When is this sequence guaranteed to converge?

Gradients for MSE

    L(\beta) = \frac{1}{2N} \sum_{n=1}^{N} (y_n - x_n^T \beta)^2    (3)

The gradient is then given by

    \frac{\partial L}{\partial \beta} = -\frac{1}{N} \sum_{n=1}^{N} (y_n - x_n^T \beta)\, x_n    (4)

What is the computational complexity of batch gradient descent?
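A minimal implementation of batch gradient descent for the MSE, following equation (4); the step size, number of iterations, and toy data are illustrative choices.

```python
import numpy as np

def mse_gradient(beta, X_tilde, y):
    """Gradient of L(beta) = (1/2N) sum (y_n - x_n^T beta)^2, equation (4)."""
    N = len(y)
    residual = y - X_tilde @ beta
    return -(X_tilde.T @ residual) / N

def gradient_descent(X_tilde, y, alpha=0.1, num_steps=500):
    """Batch gradient descent; each step costs O(N * D) for the full gradient."""
    beta = np.zeros(X_tilde.shape[1])
    for _ in range(num_steps):
        beta = beta - alpha * mse_gradient(beta, X_tilde, y)
    return beta

# Toy data: y ~ 1 + 2x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(100)
X_tilde = np.column_stack([np.ones_like(x), x])
print(gradient_descent(X_tilde, y))   # approximately [1.0, 2.0]
```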

Stochastic gradient descent

When N is large, choose a random pair (x_i, y_i) from the training set and approximate the gradient:

    \frac{\partial L}{\partial \beta} \approx \frac{1}{N} \left[ -N (y_i - x_i^T \beta)\, x_i \right]    (5)

Using this stochastic gradient, take a step:

    \beta^{(k+1)} = \beta^{(k)} + \alpha^{(k)} (y_i - x_i^T \beta^{(k)})\, x_i

What is the computational complexity? For convergence, α^{(k)} must be decreased appropriately. One such condition, the Robbins-Monro condition, suggests taking α^{(k)} such that:

    \sum_{k=1}^{\infty} \alpha^{(k)} = \infty, \qquad \sum_{k=1}^{\infty} (\alpha^{(k)})^2 < \infty    (6)

One way to obtain such a sequence is α^{(k)} = 1/(1 + k)^r with r ∈ (0.5, 1).
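The corresponding stochastic version, sketched under the same toy-data assumptions, using the Robbins-Monro step sizes α^{(k)} = 1/(1 + k)^r suggested above.

```python
import numpy as np

def sgd(X_tilde, y, r=0.75, num_steps=5000, seed=0):
    """Stochastic gradient descent with Robbins-Monro step sizes alpha_k = 1/(1+k)^r."""
    rng = np.random.default_rng(seed)
    N, D1 = X_tilde.shape
    beta = np.zeros(D1)
    for k in range(num_steps):
        i = rng.integers(N)                            # pick one random example
        alpha_k = 1.0 / (1.0 + k) ** r                 # satisfies condition (6) for r in (0.5, 1)
        residual = y[i] - X_tilde[i] @ beta
        beta = beta + alpha_k * residual * X_tilde[i]  # O(D) per step, independent of N
    return beta

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=10_000)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(10_000)
X_tilde = np.column_stack([np.ones_like(x), x])
print(sgd(X_tilde, y))    # roughly [1.0, 2.0]
```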

6 Challenge: Overfitting

Motivation

The linear model can easily be modified to obtain a more powerful non-linear model. We can use basis-function expansion to get a non-linear regression model, and then use a sequence of these models to construct a deep model.

Consider simple linear regression. Given a one-dimensional input x_n, we can generate a polynomial basis:

    \phi(x_n) = [1, x_n, x_n^2, x_n^3, \ldots, x_n^M]

Then we fit a linear model using the original and the generated features:

    y_n \approx \beta_0 + \beta_1 x_n + \beta_2 x_n^2 + \ldots + \beta_M x_n^M

[Figure: a non-linear fit obtained with a polynomial basis.]
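A small sketch of the polynomial basis expansion; the fit uses plain least squares (mentioned earlier as an alternative to gradient descent), and the noisy sine-wave data is an illustrative stand-in for the data in the figures.

```python
import numpy as np

def poly_basis(x, M):
    """Polynomial basis expansion phi(x_n) = [1, x_n, x_n^2, ..., x_n^M] for a 1-D input."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, M + 1, increasing=True)   # column j is x**j, j = 0..M

def fit_least_squares(Phi, y):
    """Least-squares fit of the expanded linear model (lstsq handles rank deficiency)."""
    beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return beta

# Fit a degree-3 polynomial to noisy samples of a sine wave.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)
Phi = poly_basis(x, M=3)
beta = fit_least_squares(Phi, y)
print(beta.shape)                                     # (4,): coefficients beta_0 .. beta_3
print(np.sqrt(np.mean((Phi @ beta - y) ** 2)))        # training RMSE of the degree-3 fit
```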

Overfitting and underfitting

Overfitting is fitting the noise in addition to the signal. Underfitting is not fitting the signal well. In reality, it is very difficult to tell the signal from the noise.

Which is a better fit? Consider a real situation. Below, the y-axis is the frequency of an event and the x-axis is its magnitude. It is clear that as the magnitude increases, the frequency decreases. This example is taken from Nate Silver's book.

Which model is a better fit: blue or red? Another example: which model is a better fit, black or red? Data points are denoted by circles.

Complex models overfit easily

Circles are data points, the green line is the truth, and the red line is the model fit. M is the maximum degree in the generated polynomial basis.

[Figure: polynomial fits (t vs. x) for M = 0, M = 1, M = 3, and M = 9.]

If you increase the amount of data, overfitting might reduce.

[Figure: the M = 9 fit with N = 15 and N = 100 data points.]

Occam's razor

One solution is dictated by Occam's razor, which states that simpler models are better in the absence of certainty. Sometimes, if you increase the amount of data, you might reduce overfitting. But, when unsure, choose a simple model over a complicated one.

Additional Notes

Read about overfitting in the paper by Pedro Domingos (sections 3 and 5 of "A Few Useful Things to Know About Machine Learning"). You can also read Nate Silver's book "The Signal and the Noise" (the earthquake example is taken from this book).

7 Solutions: Regularization

What is regularization?

Through regularization, we can penalize complex models and favor simpler ones:

    \min_{\beta} L(\beta) + \frac{\lambda}{2N} \sum_{j=1}^{M} \beta_j^2

The second term is a regularizer (with λ > 0). The main point here is that an input variable weighted by a small β_j will have less influence on the output.

Regularization parameter

The parameter λ can be tuned to reduce overfitting. But how do you choose λ?

The generalization error

The generalization error of a learning method is the expected prediction error on unseen data, i.e. the mistakes made on data that we are going to see in the future. This quantifies how well the method generalizes.
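A sketch of gradient descent on the regularized cost, under the assumption (matching the sum starting at j = 1) that the offset β_0 is not penalized; the λ values, step size, and data are illustrative.

```python
import numpy as np

def ridge_gradient(beta, X_tilde, y, lam):
    """Gradient of L(beta) + (lam / 2N) * sum_{j>=1} beta_j^2."""
    N = len(y)
    grad = -(X_tilde.T @ (y - X_tilde @ beta)) / N     # gradient of the MSE term
    penalty = lam / N * beta
    penalty[0] = 0.0                                   # the sum in the text starts at j = 1
    return grad + penalty

def fit_ridge_gd(X_tilde, y, lam, alpha=0.1, num_steps=2000):
    beta = np.zeros(X_tilde.shape[1])
    for _ in range(num_steps):
        beta = beta - alpha * ridge_gradient(beta, X_tilde, y, lam)
    return beta

# Larger lambda shrinks the (non-offset) coefficients toward zero.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)
Phi = np.vander(x, 10, increasing=True)               # degree-9 polynomial features
for lam in [0.0, 0.1, 10.0]:
    beta = fit_ridge_gd(Phi, y, lam)
    print(lam, np.round(np.abs(beta[1:]).max(), 2))   # largest non-offset coefficient
```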

Simulating the future

Ideally, we should choose λ to minimize the mistakes that will be made in the future. Obviously, we do not have the future data, but we can always simulate the future using the data at hand.

Splitting the data

For this purpose, we split the data into training and validation sets, e.g. 80% as training data and 20% as validation data. We pretend that the validation set is the future data. We fit our model on the training set and compute a prediction error on the validation set. This gives us an estimate of the generalization error (one instance of the future).

[Figure: train RMSE and validation RMSE as functions of lambda.]
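A sketch of the 80/20 split for choosing λ, using a closed-form minimizer of the regularized cost above (an assumption of this example; the notes themselves proceed via gradient descent). The offset is again left unpenalized.

```python
import numpy as np

def fit_ridge(Phi, y, lam):
    """Closed-form minimizer of (1/2N)||y - Phi beta||^2 + (lam/2N) sum_{j>=1} beta_j^2."""
    D1 = Phi.shape[1]
    penalty = lam * np.eye(D1)
    penalty[0, 0] = 0.0                     # do not penalize the offset beta_0
    return np.linalg.solve(Phi.T @ Phi + penalty, Phi.T @ y)

def rmse(Phi, y, beta):
    return np.sqrt(np.mean((y - Phi @ beta) ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(50)
Phi = np.vander(x, 10, increasing=True)     # degree-9 polynomial features

# Random 80% / 20% split into training and validation sets.
perm = rng.permutation(len(y))
tr, va = perm[:40], perm[40:]

for lam in [0.0, 1e-4, 1e-2, 1.0, 100.0]:
    beta = fit_ridge(Phi[tr], y[tr], lam)
    print(lam, round(rmse(Phi[tr], y[tr], beta), 3), round(rmse(Phi[va], y[va], beta), 3))
# Train RMSE increases with lambda, while validation RMSE is typically lowest in between.
```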

Cross-validation

Random splitting is not an efficient method. K-fold cross-validation lets us do this more efficiently. We randomly partition the data into K groups. We train on K - 1 groups and test on the remaining group. We repeat this until we have tested on all K groups, and then average the results.

[Figure: K-fold cross-validation, runs 1 to 4.]

Cross-validation returns an estimate of the generalization error.

Additional Notes

Details on cross-validation are in Chapter 7 of the book by Hastie, Tibshirani, and Friedman (HTF). You can also read about the bootstrap in Section 7.11 of the HTF book. This method is related to random splitting and is a very popular method.
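A compact sketch of K-fold cross-validation; the fitting routine is passed in as a function, and plain least squares is used in the usage example purely for illustration.

```python
import numpy as np

def kfold_indices(N, K, rng):
    """Randomly partition indices 0..N-1 into K roughly equal groups."""
    return np.array_split(rng.permutation(N), K)

def cross_validation_error(Phi, y, fit, error, K=4, seed=0):
    """Average held-out error over K folds: train on K-1 groups, test on the remaining one."""
    rng = np.random.default_rng(seed)
    folds = kfold_indices(len(y), K, rng)
    errors = []
    for k in range(K):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        beta = fit(Phi[train_idx], y[train_idx])
        errors.append(error(Phi[test_idx], y[test_idx], beta))
    return np.mean(errors)

# Example usage with plain least squares as the fitting routine.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=40)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(40)
Phi = np.vander(x, 4, increasing=True)   # degree-3 polynomial features

fit = lambda P, t: np.linalg.lstsq(P, t, rcond=None)[0]
error = lambda P, t, b: np.sqrt(np.mean((t - P @ b) ** 2))
print(cross_validation_error(Phi, y, fit, error, K=4))   # estimate of the generalization error
```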

8 Bias-Variance Decomposition

What is bias-variance?

One natural question is: how does the test error vary with respect to λ? When λ is high, the model underfits, while when λ is small, the model overfits. Therefore, a good value is somewhere in between. The bias-variance decomposition explains the shape of this curve.

Generalization error

Given training data D_tr of size N, we would like to estimate the expected error made in future predictions. This error is the generalization error. Below is a definition, supposing that we have infinite test data D_te:

    \text{teErr}(D_{tr}) := E_{D_{te}}\left[ \{y - f(x)\}^2 \right]

The generalization error is different from the training error, which measures how well you fit the data:

    \text{trErr}(D_{tr}) := \frac{1}{N} \sum_{n=1}^{N} \{y_n - f(x_n)\}^2

Errors vs. model complexity

As we increase the model complexity, how do these errors vary? The blue line shows the training error for a dataset with N = 50, while the red line shows the generalization error for that dataset. Simple models have high training and generalization error since they have high bias, while complex models have low training but high generalization error because they have high variance.

[Figure 7.1 from The Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman: behavior of test-sample and training-sample prediction error as model complexity (df) varies, from high bias / low variance to low bias / high variance.]

Bias-variance decomposition

The shape of these curves can be explained using the bias-variance decomposition. The following four points can be explained using the decomposition:
1. Both bias and variance contribute to the generalization error.
2. For bias, both model-bias and estimation-bias are important.
3. When we increase model complexity, we increase the generalization error due to increased variance.
4. Regularization increases estimation bias while reducing variance.

9 Recent Advances

Deep learning & overfitting

Deep learning has shown a new (but old) way to combat overfitting. For many applications, more data and a deep architecture, combined with stochastic gradient descent, can get us to a good minimum that generalizes well.

Challenges

There are many challenges ahead. Learning from nasty, unreliable data still remains a challenge (e.g. small sample sizes, redundant data, non-stationary data, sequential learning). On the other hand, living beings, even young ones, are very good at dealing with such data. How do they do it, and how can we design ML methods that learn like them?