Machine Learning in just a few minutes. Jan Peters, Gerhard Neumann


Machine Learning in just a few minutes. Jan Peters, Gerhard Neumann 1

Purpose of this Lecture: Foundations of machine learning tools for robotics. We focus on regression methods and general principles, as they are often needed in robotics. 2

Content of this Lecture Math and Statistics Refresher What is Machine Learning? Model-Selection Linear Regression Frequentist Approach Bayesian Approach 3

Statistics Refresher: Sweet memories from high school... What is a random variable? A variable whose value x is subject to variations due to chance. What is a distribution? It describes the probability that the random variable takes a certain value. What is an expectation? 4

Statistics Refresher: Sweet memories from high school... What is a joint, a conditional, and a marginal distribution? What is independence of random variables? What does marginalization mean? And finally, what is Bayes' theorem? 5

Math Refresher: Some more fancy math. From now on, matrices are your friends, and derivatives too. We also need some more matrix calculus. Need more? See Wikipedia on Matrix Calculus. 6

Math Refresher: Inverse of matrices. How can we invert a matrix J that is not square? Left pseudo-inverse: works if J has full column rank. Right pseudo-inverse: works if J has full row rank. 7
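The pseudo-inverse formulas themselves are not in the transcript; the standard definitions the slide refers to (with $J$ the non-square matrix) are:

$$ J^{\dagger}_{\text{left}} = (J^\top J)^{-1} J^\top, \qquad J^{\dagger}_{\text{left}} J = I \quad \text{(requires full column rank)} $$

$$ J^{\dagger}_{\text{right}} = J^\top (J J^\top)^{-1}, \qquad J J^{\dagger}_{\text{right}} = I \quad \text{(requires full row rank)} $$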

Statistics Refresher: Meet some old friends. The Gaussian distribution: the covariance matrix captures linear correlations; the product of Gaussians stays Gaussian (up to normalization); the mean is also the mode. 8
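For reference, the density of the multivariate Gaussian mentioned on this slide (a standard formula, not part of the transcript):

$$ \mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \, |\Sigma|}} \exp\!\Big( -\tfrac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \Big) $$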

Statistics Refresher: Meet some old friends. The joint Gaussian from marginal and conditional; the marginal and conditional Gaussian from the joint. 9

Statistics Refresher: Meet some old friends. Bayes' theorem for Gaussians; the damped pseudo-inverse. 10
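The formulas on these two slides are not in the transcript; the standard Gaussian identities they refer to (see, e.g., Bishop, Pattern Recognition and Machine Learning, Sec. 2.3.3) are: given a marginal $p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1})$ and a conditional $p(y \mid x) = \mathcal{N}(y \mid Ax + b, L^{-1})$,

$$ p(y) = \mathcal{N}\big(y \mid A\mu + b,\; L^{-1} + A \Lambda^{-1} A^\top\big) $$

$$ p(x \mid y) = \mathcal{N}\big(x \mid \Sigma\,(A^\top L (y - b) + \Lambda \mu),\; \Sigma\big), \qquad \Sigma = (\Lambda + A^\top L A)^{-1} $$

The second identity is Bayes' theorem for Gaussians. The damped pseudo-inverse replaces $J^\top (J J^\top)^{-1}$ by $J^\top (J J^\top + \lambda I)^{-1}$, which stays well-conditioned even when $J J^\top$ is nearly singular.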

May I introduce you? The good old logarithm. It is monotonic but not boring: products become easy, division is a piece of cake, and exponents too. 11
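In formulas, the logarithm rules used throughout the lecture are:

$$ \log(ab) = \log a + \log b, \qquad \log\frac{a}{b} = \log a - \log b, \qquad \log a^{b} = b \log a $$

Because the logarithm is monotonically increasing, $\arg\max_\theta f(\theta) = \arg\max_\theta \log f(\theta)$, which is what makes the log-trick used later work.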

Content of this Lecture What is Machine Learning? Model-Selection Linear Regression Frequentist Approach Bayesian Approach 12

Why Machine Learning? "We are drowning in information and starving for knowledge." (John Naisbitt) Era of big data: in 2008 there were about 1 trillion web pages; 20 hours of video are uploaded to YouTube every minute; Walmart handles more than 1M transactions per hour and has databases containing more than 2.5 petabytes (2.5 × 10^15 bytes) of information. No human being can deal with the data avalanche! 13

Why Machine Learning? "I keep saying the sexy job in the next ten years will be statisticians and machine learners. People think I'm joking, but who would've guessed that computer engineers would've been the sexy job of the 1990s? The ability to take data, to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it, that's going to be a hugely important skill in the next decades." Hal Varian (2009), Chief Economist at Google. 14

Types of Machine Learning: predictive (supervised), descriptive (unsupervised), active (reinforcement learning). 15

Prediction Problem (= Supervised Learning): What will be the CO2 concentration in the future? 16

Prediction Problem (= Supervised Learning): What will be the CO2 concentration in the future? Different prediction models are possible: Linear. 17

Prediction Problem (= Supervised Learning): What will be the CO2 concentration in the future? Different prediction models are possible: Linear; Exponential with seasonal trends. 18

Formalization of Predictive Problems: In predictive problems, we have a data set of input-output pairs, $D = \{(x_i, y_i)\}_{i=1}^N$. The two most prominent examples are: 1. Classification: discrete outputs or labels; we predict the most likely class, $\arg\max_c p(c \mid x)$. 2. Regression: continuous outputs or labels; we predict the expected output, $\mathbb{E}[y \mid x]$. 19

Examples of Classification: document classification, e.g., spam filtering; image classification: classifying flowers, face detection, face recognition, handwriting recognition, ... 20

Examples of Regression: Predict tomorrow's stock market price given current market conditions and other possible side information. Predict the amount of prostate-specific antigen (PSA) in the body as a function of a number of different clinical measurements. Predict the temperature at any location inside a building using weather data, time, door sensors, etc. Predict the age of a viewer watching a given video on YouTube. Many problems in robotics can be addressed by regression! 21

Types of Machine Learning: predictive (supervised), descriptive (unsupervised), active (reinforcement learning). 22

Formalization of Descriptive Problems: In descriptive problems, we have only inputs and no labels, $D = \{x_i\}_{i=1}^N$. Three prominent examples are: 1. Clustering: find groups of data points which belong together. 2. Dimensionality Reduction: find the latent dimensions of your data. 3. Density Estimation: find the probability of your data... 23

Old Faithful data (duration of eruption vs. time to next eruption): the data points form groups. This is called Clustering! 24

Dimensionality Reduction: the original data, its 2D projection, and its 1D projection. This is called Dimensionality Reduction! 25

Dimensionality Reduction Example: Eigenfaces. How many faces do you need to characterize these? 26

27 Example: Eigenfaces

Example: density of glu (plasma glucose concentration) for diabetes patients. Estimate the relative occurrence of a data point. This is called Density Estimation! 28

The bigger picture... "When we're learning to see, nobody's telling us what the right answers are, we just look. Every so often, your mother says 'that's a dog', but that's very little information. You'd be lucky if you got a few bits of information, even one bit per second, that way. The brain's visual system has 10^14 neural connections. And you only live for 10^9 seconds. So it's no use learning one bit per second. You need more like 10^5 bits per second. And there's only one place you can get that much information: from the input itself." Geoffrey Hinton, 1996. 29

Types of Machine Learning Machine Learning predictive (supervised) descriptive (unsupervised) Active (reinforcement learning) That will be the main topic of the lecture! 30

How to attack a machine learning problem? Machine learning problems are essentially always about two entities: (i) data and model assumptions: understand your problem, generate good features which make the problem easier, determine the model class, pre-process your data; (ii) algorithms that can deal with (i): estimate the parameters of your model. We are going to do this for regression... 31

Content of this Lecture What is Machine Learning? Model-Selection Linear Regression Frequentist Approach Bayesian Approach 32

Important Questions: What does the data look like? Are you really learning a function? What data types do our outputs have? Outliers: are there data points in China? What is our model (relationship between inputs and outputs)? Do you have features? What type of noise / what distribution models our outputs? How many parameters? Is your model sufficiently rich? Is it robust to overfitting? 33

Important Questions: Requirements for the solution: accurate, efficient to obtain (computation/memory), interpretable. 34

Example Problem: a data set. Task: describe the outputs as a function of the inputs (regression). 35

Model Assumptions: Noise + Features. Additive Gaussian noise: $y = f(x) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Equivalent probabilistic model: $p(y \mid x) = \mathcal{N}(y \mid f(x), \sigma^2)$. Let's keep it simple: $f$ is linear in the features, $f(x) = \phi(x)^\top \theta$. 36

Important Questions: What does the data look like? What data types do our outputs have? Outliers: are there data points in China? NO. Are you really learning a function? YES. What is our model? Do you have features? What type of noise / what distribution models our outputs? How many parameters? Is your model sufficiently rich? Is it robust to overfitting? 37

Let us fit our model... We need to answer: How many parameters? Is your model sufficiently rich? Is it robust to overfitting? We assume a model class: polynomials of degree n 38

39 Fitting an Easy Model: n=0

40 Add a Feature: n=1

41 More features...

42 More features...

More features: n=200 (zoomed in). This leads to overfitting and numerical problems. 43
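A minimal sketch of the polynomial-fitting experiment on these slides; the data set and the degrees below are illustrative assumptions, not the lecture's actual data. Increasing the degree drives the training error down while the test error eventually grows again, which is the overfitting regime shown for n=200.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: noisy samples of a smooth function (illustrative, not the lecture's data set)
x_train = np.sort(rng.uniform(0.0, 1.0, 15))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, x_train.size)
x_test = np.linspace(0.0, 1.0, 200)
y_test = np.sin(2 * np.pi * x_test)

def poly_features(x, degree):
    """Design matrix with columns x^0, x^1, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

for degree in (0, 1, 3, 9):
    Phi = poly_features(x_train, degree)
    # Least-squares fit; lstsq is numerically safer than the explicit normal equations
    theta, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
    train_mse = np.mean((Phi @ theta - y_train) ** 2)
    test_mse = np.mean((poly_features(x_test, degree) @ theta - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```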

Prominent example of overfitting... Distinguish between Soviet (left) and US (right) tanks. DARPA Neural Network Study (1988-89), AFCEA International Press. 44

Test Error vs. Training Error (regimes: underfitting, about right, overfitting). Does a small training error lead to a good model? NO! We need to do model selection. 45

Occam's Razor and Model Selection. Model selection: How can we choose the number of features/parameters? Choose the type of features? Prevent overfitting? Some insight: always choose the model that fits the data and has the smallest model complexity. This is called Occam's razor. 46

Bias-Variance Tradeoff: typically, you cannot minimize both! Bias / structure error: error because our model cannot do better. Variance / approximation error: error because we estimate the parameters on a limited data set. 47

How do we choose the model? Goal: find a good model (e.g., a good set of features). Split the dataset into: Training, Validation, Test. Training set: fit the parameters. Validation set: choose the model class or single parameters. Test set: estimate the prediction error of the trained model. The error needs to be estimated on an independent set! 48

Model Selection: K-fold Cross Validation. Partition the data into K sets; use K-1 sets as the training set and 1 set as the validation set. For all possible ways of partitioning, compute the validation error J (computationally expensive!). Choose the model with the smallest average validation error. 49
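A minimal K-fold cross-validation sketch; it is illustrative, reuses the polynomial features from the previous snippet, and uses standard K-fold splitting rather than literally all possible partitionings.

```python
import numpy as np

def kfold_validation_error(x, y, degree, K=5, seed=0):
    """Average validation MSE of a degree-`degree` polynomial over K folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), K)
    errors = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        Phi_tr = np.vander(x[train], degree + 1, increasing=True)
        Phi_va = np.vander(x[val], degree + 1, increasing=True)
        theta, *_ = np.linalg.lstsq(Phi_tr, y[train], rcond=None)
        errors.append(np.mean((Phi_va @ theta - y[val]) ** 2))
    return np.mean(errors)

# Model selection: pick the degree with the smallest average validation error,
# e.g., using x_train, y_train from the previous sketch:
# best_degree = min(range(10), key=lambda d: kfold_validation_error(x_train, y_train, d))
```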

Content of this Lecture What is Machine Learning? Model-Selection Linear Regression Frequentist Approach Bayesian Approach 50

How to find the parameters? Frequentist view (Fisher): Let's minimize the error of our prediction. The objective is defined by minimizing a certain cost function. 51

Frequentist view: Least Squares. The classical cost function is the least-squares cost $E(\theta) = \sum_i \big(y_i - \phi(x_i)^\top \theta\big)^2$. Using the feature matrix $\Phi$, we can rewrite it as a scalar product, $E(\theta) = (y - \Phi\theta)^\top (y - \Phi\theta)$, and solve it: $\theta^* = (\Phi^\top \Phi)^{-1} \Phi^\top y$. The least-squares solution contains the left pseudo-inverse. 52
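As a sanity check, the closed-form least-squares solution can be computed directly; the feature matrix and targets below are illustrative placeholders, not the lecture's data.

```python
import numpy as np

# Illustrative feature matrix (15 inputs, cubic polynomial features) and targets
x = np.linspace(0.0, 1.0, 15)
Phi = np.vander(x, 4, increasing=True)      # N x d feature matrix
y = np.sin(2 * np.pi * x)                   # N targets

theta_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # (Phi^T Phi)^{-1} Phi^T y
theta_pinv = np.linalg.pinv(Phi) @ y                     # SVD-based left pseudo-inverse
assert np.allclose(theta_normal, theta_pinv)
```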

Physical Interpretation: energy of springs ~ squared lengths; least squares minimizes the energy of the system. 53

Geometric Interpretation: least squares minimizes the projection error; the fit is the orthogonal projection of the targets onto the space spanned by the features, while the true (unknown) function value remains out of reach. 54

Robotics Example: Rigid-Body Dynamics with Known Features. The torque is composed of inertial forces, Coriolis forces, centripetal forces, and gravity. 55

Robotics Example: Rigid-Body Dynamics. We realize that rigid-body dynamics is linear in the parameters: we can rewrite it as features (accelerations, velocities, sin and cos terms) multiplied by parameters (masses, lengths, inertias, ...). For finding the parameters, we can apply even the first machine learning method that comes to mind: least-squares regression. 56
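In equations (a standard result from the robotics literature; the symbols are the usual ones, not taken from the slide):

$$ \tau = M(q)\,\ddot q + c(q, \dot q) + g(q) = \Phi(q, \dot q, \ddot q)\,\theta $$

where the feature matrix $\Phi$ depends only on the measured joint positions, velocities, and accelerations, and the parameter vector $\theta$ stacks the masses, lengths, and inertia terms, so $\theta$ can be estimated by least squares.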

Cost Function II: Ridge Regression. We punish the magnitude of the parameters, which controls the model complexity: $E(\theta) = \|y - \Phi\theta\|^2 + \lambda \|\theta\|^2$. This yields the ridge regression solution $\theta^* = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top y$, where $\lambda$ is called the regularizer. Numerically, this is much more stable. 57
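A minimal sketch of the ridge solution, using the same illustrative feature-matrix convention as in the snippets above:

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Ridge regression: theta = (Phi^T Phi + lam * I)^{-1} Phi^T y."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

# lam = 0 recovers ordinary least squares; a larger lam shrinks the parameters
# and stabilizes the solve when Phi^T Phi is badly conditioned (e.g., n = 200).
```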

Ridge regression: n=15 Influence of the regularization constant 58

MAP: Back to the Overfitting Problem (regimes: overfitting, about right, underfitting). We can also scale the model complexity with the regularization parameter! A smaller lambda means a higher model complexity. 59

Content of this Lecture What is Machine Learning? Model-Selection Linear Regression Frequentist Approach Bayesian Approach 60

Maximum-Likelihood (ML) estimate. Let's look at the problem from a probabilistic perspective. Alternatively, we can maximize the likelihood of the parameters, $\theta_{\mathrm{ML}} = \arg\max_\theta p(y \mid X, \theta)$. Maximizing the product of likelihood terms directly is hard, so we do the log-trick and maximize the sum of log-likelihoods instead, which is easy. The least-squares solution is equivalent to the ML solution with Gaussian noise!! 61
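Written out, the log-trick for the Gaussian noise model gives:

$$ \theta_{\mathrm{ML}} = \arg\max_\theta \prod_{i=1}^N \mathcal{N}\big(y_i \mid \phi(x_i)^\top \theta, \sigma^2\big) = \arg\max_\theta \sum_{i=1}^N \log \mathcal{N}\big(y_i \mid \phi(x_i)^\top \theta, \sigma^2\big) = \arg\min_\theta \sum_{i=1}^N \big(y_i - \phi(x_i)^\top \theta\big)^2 $$

since the logarithm turns the product into a sum and the terms that do not depend on $\theta$ can be dropped.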

Maximum a posteriori (MAP) estimate. Put a prior on our parameters, e.g., that they should be small: a zero-mean Gaussian prior $p(\theta) = \mathcal{N}(\theta \mid 0, \alpha^{-1} I)$. Find the parameters that maximize the posterior, and do the log-trick again.

Maximum a posteriori (MAP) estimate. In the log-domain, the prior is just an additive cost. Let's put in our model: ridge regression is equivalent to the MAP estimate with a Gaussian prior. 63
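With the zero-mean Gaussian prior from the previous slide, the log-posterior becomes (up to constants):

$$ \theta_{\mathrm{MAP}} = \arg\max_\theta \Big[ \log p(y \mid X, \theta) + \log p(\theta) \Big] = \arg\min_\theta \sum_{i=1}^N \big(y_i - \phi(x_i)^\top \theta\big)^2 + \lambda \|\theta\|^2, \qquad \lambda = \sigma^2 \alpha $$

which is exactly the ridge regression cost, with the regularizer determined by the noise variance and the prior precision.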

Predictions with the Model. We found an amazing parameter set (a point estimate, e.g., ML or MAP); let's do predictions! For a test input we obtain a predicted function value with a predictive mean and a predictive variance. 64
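For a point estimate $\hat\theta$ (ML or MAP), the predictive distribution at a test input $x^*$ is simply:

$$ p(y^* \mid x^*, \hat\theta) = \mathcal{N}\big(y^* \mid \phi(x^*)^\top \hat\theta,\; \sigma^2\big) $$

i.e., the predictive mean is $\phi(x^*)^\top \hat\theta$ and the predictive variance is just the constant noise variance $\sigma^2$; the uncertainty about $\hat\theta$ itself is ignored.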

Comparing different data sets with same input data, but different output values (due to noise): 65

Comparing different data sets with same input data, but different output values (due to noise): 66

Comparing different data sets with same input data, but different output values (due to noise): our parameter estimate is also noisy! It depends on the noise in the data. 67

Comparing different data sets: Can we also estimate our uncertainty in the parameters? Compute the probability of the parameters given the data. 68

How to find the parameters? Bayesian view (Bayes): Let's compute the probability of the parameters, $p(\theta \mid D) = p(D \mid \theta)\, p(\theta) \,/\, p(D)$, i.e., posterior = likelihood times prior over evidence. Intuition: if you assign each parameter estimate a probability of being right, the average over these parameter estimates will be better than a single one. 69

How to get the posterior? Use Bayes' theorem for Gaussians. For our model we need the prior over the parameters and the data likelihood, and we obtain the posterior over the parameters. 70
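The closed-form posterior the slide refers to is the standard Bayesian linear regression result, assuming the prior $p(\theta) = \mathcal{N}(\theta \mid 0, \alpha^{-1} I)$ and the likelihood $p(y \mid X, \theta) = \mathcal{N}(y \mid \Phi\theta, \sigma^2 I)$:

$$ p(\theta \mid X, y) = \mathcal{N}(\theta \mid \mu_N, \Sigma_N), \qquad \Sigma_N = \big(\sigma^{-2} \Phi^\top \Phi + \alpha I\big)^{-1}, \qquad \mu_N = \sigma^{-2}\, \Sigma_N \Phi^\top y $$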

What to do with the posterior? We could sample from it to estimate uncertainty 71

What to do with the posterior? We could sample from it to estimate uncertainty 72

What to do with the posterior? We could sample from it to estimate uncertainty 73

Full Bayesian Regression. We can also do that in closed form: integrate out all possible parameters, weighting the likelihood of the predicted function value at the test input by the parameter posterior given the training data. The predictive distribution is again a Gaussian for a Gaussian likelihood and a Gaussian parameter posterior. It has a state-dependent variance! 74
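Using the posterior from above, integrating out the parameters gives the standard predictive distribution:

$$ p(y^* \mid x^*, X, y) = \int p(y^* \mid x^*, \theta)\, p(\theta \mid X, y)\, d\theta = \mathcal{N}\big(y^* \mid \phi(x^*)^\top \mu_N,\; \sigma^2 + \phi(x^*)^\top \Sigma_N\, \phi(x^*)\big) $$

The second variance term depends on the test input, which is the state-dependent variance mentioned on the slide.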

Integrating out the parameters Variance depends on the information in the data! 75

Quick Summary. Models that are linear in the parameters: overfitting is bad; model selection (e.g., leave-one-out cross validation); parameter estimation: frequentist vs. Bayesian; least squares ~ maximum likelihood estimation (ML); ridge regression ~ maximum a posteriori estimation (MAP); Bayesian regression integrates out the parameters when predicting, giving state-dependent uncertainty. 76