Cheng Soon Ong & Christian Walder, Machine Learning Research Group and College of Engineering and Computer Science, Canberra, February – June 2018. (Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning".)

Part II: Introduction

Flavour of this course:
- Formalise intuitions about problems
- Use the language of mathematics to express models
- Geometry, vectors, linear algebra for reasoning
- Probabilistic models to capture uncertainty
- Design and analysis of algorithms
- Numerical algorithms in Python
- Understand the choices when designing machine learning methods

What is Machine Learning? Definition (Mitchell, 1998): A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Some artificial data, created from the function sin(2πx) plus random noise, for x = 0, ..., 1. (Figure: targets t plotted against inputs x.)
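As a concrete illustration, here is a minimal sketch (my own, not from the slides) of how such a toy dataset could be generated in Python with NumPy; the noise standard deviation of 0.3 is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 10
x = np.linspace(0, 1, N)                                   # inputs in [0, 1]
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)  # noisy targets
```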

Input Specification: N = 10 observations,
x ≡ (x_1, ..., x_N)^T,  t ≡ (t_1, ..., t_N)^T,
with x_i ∈ R and t_i ∈ R for i = 1, ..., N.

Model Specification: M is the order of the polynomial,
y(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = Σ_{m=0}^{M} w_m x^m,
a nonlinear function of x, but a linear function of the unknown model parameters w. How can we find good parameters w = (w_0, w_1, ..., w_M)^T?
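To make the linear-in-w structure explicit, here is a small sketch (an illustration, not course code) that evaluates y(x, w) through a Vandermonde-style design matrix; the helper names `design_matrix` and `predict` are mine.

```python
import numpy as np

def design_matrix(x, M):
    """Columns are x^0, x^1, ..., x^M, so that y = Phi @ w is linear in w."""
    return np.vander(np.asarray(x), M + 1, increasing=True)

def predict(x, w):
    """Evaluate the polynomial y(x, w) = sum_m w_m * x^m at the inputs x."""
    return design_matrix(x, len(w) - 1) @ w
```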

Learning is Improving Performance. (Figure: the vertical error bars between the targets t_n and the model predictions y(x_n, w).) Performance measure: the error between the targets and the model predictions on the training data,
E(w) = (1/2) Σ_{n=1}^{N} (y(x_n, w) − t_n)^2,
which has a unique minimum in its argument w.
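Because E(w) is quadratic in w, its minimiser can be computed in closed form. A minimal sketch (my illustration, not course code) using NumPy's linear least-squares solver, assuming x and t are arrays like those in the earlier data-generation sketch:

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Minimise E(w) = 0.5 * sum_n (y(x_n, w) - t_n)^2 by linear least squares."""
    Phi = np.vander(np.asarray(x), M + 1, increasing=True)  # design matrix
    w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w_star

# Example: fit a cubic (M = 3) to the toy data.
# w3 = fit_polynomial(x, t, M=3)
```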

Model Comparison or Model Selection. y(x, w) = Σ_{m=0}^{M} w_m x^m; with M = 0, y(x, w) = w_0. (Figure: the M = 0 fit to the data.)

Model Comparison or Model Selection. With M = 1, y(x, w) = w_0 + w_1 x. (Figure: the M = 1 fit to the data.)

Model Comparison or Model Selection. With M = 3, y(x, w) = w_0 + w_1 x + w_2 x^2 + w_3 x^3. (Figure: the M = 3 fit to the data.)

Model Comparison or Model Selection: overfitting. With M = 9, y(x, w) = w_0 + w_1 x + ... + w_8 x^8 + w_9 x^9. (Figure: the M = 9 fit to the data.)

Testing the Model. Train the model to obtain w*. Get 100 new data points. Root-mean-square (RMS) error:
E_RMS = sqrt(2 E(w*) / N)
(Figure: E_RMS on the training and test sets against the polynomial order M = 0, ..., 9.)
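A small companion sketch (again my own, under the same assumptions) of the RMS error used to compare training and test performance:

```python
import numpy as np

def rms_error(x, t, w):
    """E_RMS = sqrt(2 * E(w) / N), with E(w) = 0.5 * sum of squared residuals."""
    residuals = np.vander(np.asarray(x), len(w), increasing=True) @ w - t
    E = 0.5 * np.sum(residuals ** 2)
    return np.sqrt(2 * E / len(t))
```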

Testing the Model.

Table: Coefficients w* for polynomials of various order.

           M = 0    M = 1     M = 3          M = 9
  w_0       0.19     0.82      0.31           0.35
  w_1               -1.27      7.99         232.37
  w_2                        -25.43       -5321.83
  w_3                         17.37       48568.31
  w_4                                   -231639.30
  w_5                                    640042.26
  w_6                                  -1061800.52
  w_7                                   1042400.18
  w_8                                   -557682.99
  w_9                                    125201.43

More Data: N = 15. (Figure: the polynomial fit with N = 15 data points.)

More Data: N = 100. Heuristic: have no fewer than 5 to 10 times as many data points as parameters. But the number of parameters is not necessarily the most appropriate measure of model complexity! Later: the Bayesian approach. (Figure: the polynomial fit with N = 100 data points.)

Regularisation. How can we constrain the growth of the coefficients w? Add a regularisation term to the error function:
Ẽ(w) = (1/2) Σ_{n=1}^{N} (y(x_n, w) − t_n)^2 + (λ/2) ||w||^2,
where the squared norm of the parameter vector w is ||w||^2 ≡ w^T w = w_0^2 + w_1^2 + ... + w_M^2.
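Setting the gradient of Ẽ(w) to zero gives the modified normal equations (Φ^T Φ + λ I) w = Φ^T t, which the following sketch (my illustration, not course code) solves directly:

```python
import numpy as np

def fit_polynomial_regularised(x, t, M, lam):
    """Minimise 0.5 * ||Phi w - t||^2 + 0.5 * lam * ||w||^2 in closed form."""
    Phi = np.vander(np.asarray(x), M + 1, increasing=True)
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)

# Example: M = 9 with a small regulariser, e.g. lam = np.exp(-18).
# w_reg = fit_polynomial_regularised(x, t, M=9, lam=np.exp(-18))
```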

Regularisation, M = 9, ln λ = −18. (Figure: the regularised M = 9 fit.)

Regularisation, M = 9, ln λ = 0. (Figure: the M = 9 fit with ln λ = 0.)

Regularisation, M = 9. (Figure: E_RMS on the training and test sets against ln λ, from −35 to −20.)

What is Machine Learning? Definition (Mitchell, 1998): A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
- Task: regression
- Experience: input examples x, output labels t
- Performance: squared error
- Model choice
- Regularisation
- Do not train on the test set!

(Figure: samples from a joint distribution p(X, Y), where Y takes the values Y = 1 and Y = 2, plotted against X.)

Joint counts of Y versus X:

  Y vs. X    a   b   c   d   e   f   g   h   i   sum
  Y = 2      0   0   0   1   4   5   8   6   2    26
  Y = 1      3   6   8   8   5   3   1   0   0    34
  sum        3   6   8   9   9   8   9   6   2    60

(Figure: the same data shown as samples from p(X, Y).)

Sum Rule (using the table above):
p(X = d, Y = 1) = 8/60
p(X = d) = p(X = d, Y = 2) + p(X = d, Y = 1) = 1/60 + 8/60
p(X = d) = Σ_Y p(X = d, Y)
p(X) = Σ_Y p(X, Y)

Sum Rule (marginals of the table above):
p(X) = Σ_Y p(X, Y)
p(Y) = Σ_X p(X, Y)
(Figure: histograms of the marginals p(X) and p(Y).)

Product Rule (using the table above). Conditional probability: p(X = d | Y = 1) = 8/34. Calculate p(Y = 1): p(Y = 1) = Σ_X p(X, Y = 1) = 34/60. Then
p(X = d, Y = 1) = p(X = d | Y = 1) p(Y = 1)
p(X, Y) = p(X | Y) p(Y)

Product Rule:
p(X) = Σ_Y p(X, Y)
p(X, Y) = p(X | Y) p(Y)
(Figure: histograms of p(X) and of the conditional p(X | Y = 1).)

Sum Rule and Product Rule.
Sum Rule: p(X) = Σ_Y p(X, Y)
Product Rule: p(X, Y) = p(X | Y) p(Y)

Bayes' Theorem. Use the product rule twice: p(X, Y) = p(X | Y) p(Y) = p(Y | X) p(X). Bayes' theorem follows:
p(Y | X) = p(X | Y) p(Y) / p(X),
only defined for p(X) > 0, with
p(X) = Σ_Y p(X, Y)            (sum rule)
     = Σ_Y p(X | Y) p(Y)      (product rule)
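The sum rule, product rule, and Bayes' theorem can all be checked numerically on the count table above. A small sketch (my own illustration) in NumPy:

```python
import numpy as np

# Joint counts from the table: rows are Y = 1 and Y = 2, columns are X = a, ..., i.
counts = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
                   [0, 0, 0, 1, 4, 5, 8, 6, 2]])  # Y = 2
p_xy = counts / counts.sum()           # joint distribution p(X, Y)

p_x = p_xy.sum(axis=0)                 # sum rule: p(X) = sum_Y p(X, Y)
p_y = p_xy.sum(axis=1)                 # sum rule: p(Y) = sum_X p(X, Y)

d = 3                                  # column index of X = d (a=0, b=1, c=2, d=3)
p_x_given_y1 = p_xy[0] / p_y[0]        # conditional p(X | Y = 1)

# Product rule: p(X = d, Y = 1) = p(X = d | Y = 1) * p(Y = 1) = 8/60.
assert np.isclose(p_x_given_y1[d] * p_y[0], p_xy[0, d])

# Bayes' theorem: p(Y = 1 | X = d) = p(X = d | Y = 1) * p(Y = 1) / p(X = d).
print(p_x_given_y1[d] * p_y[0] / p_x[d])   # 8/9: of the 9 points at X = d, 8 have Y = 1
```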

For a real-valued variable x ∈ R, the probability of x falling in the interval (x, x + δx) is given by p(x) δx for infinitesimally small δx, and
p(x ∈ (a, b)) = ∫_a^b p(x) dx.
(Figure: a density p(x), its cumulative distribution P(x), and an interval of width δx.)

Constraints on p(x):
- Nonnegative: p(x) ≥ 0
- Normalisation: ∫_{−∞}^{∞} p(x) dx = 1
(Figure: density p(x) and cumulative distribution P(x).)

Cumulative distribution function:
P(x) = ∫_{−∞}^{x} p(z) dz, or equivalently (d/dx) P(x) = p(x).
(Figure: density p(x) and cumulative distribution P(x).)
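As a quick numerical illustration of this relation (my own sketch, using SciPy's standard normal as the example density), differentiating the CDF recovers the density:

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-3, 3, 7)
h = 1e-5
# Finite-difference derivative of the CDF P(x) approximates the density p(x).
dP_dx = (norm.cdf(x + h) - norm.cdf(x - h)) / (2 * h)
print(np.allclose(dP_dx, norm.pdf(x), atol=1e-6))  # True
```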

Multivariate Probability Density. For a vector x ≡ (x_1, ..., x_D)^T:
- Nonnegative: p(x) ≥ 0
- Normalisation: ∫ p(x) dx = 1, which means ∫ ... ∫ p(x) dx_1 ... dx_D = 1.

Sum and Product Rule for Probability Densities.
Sum Rule: p(x) = ∫ p(x, y) dy
Product Rule: p(x, y) = p(y | x) p(x)

Expectations. The weighted average of a function f(x) under the probability distribution p(x):
E[f] = Σ_x p(x) f(x)      (discrete distribution p(x))
E[f] = ∫ p(x) f(x) dx     (probability density p(x))

How to approximate E[f]: given a finite number N of points x_n drawn from the probability distribution p(x), approximate the expectation by a finite sum,
E[f] ≈ (1/N) Σ_{n=1}^{N} f(x_n).
How do we draw points from a probability distribution p(x)? A later lecture covers sampling.
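A minimal Monte Carlo sketch (my own example) approximating E[f] for f(x) = x^2 under a standard normal, whose exact value is 1:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100_000
samples = rng.standard_normal(N)    # x_n drawn from p(x), here a standard normal
estimate = np.mean(samples ** 2)    # E[f] ~= (1/N) * sum_n f(x_n)
print(estimate)                     # close to the exact value E[x^2] = 1
```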

Expectation of a function of several variables. For an arbitrary function f(x, y):
E_x[f(x, y)] = Σ_x p(x) f(x, y)      (discrete distribution p(x))
E_x[f(x, y)] = ∫ p(x) f(x, y) dx     (probability density p(x))
Note that E_x[f(x, y)] is a function of y.

Conditional Expectation. For an arbitrary function f(x):
E_x[f | y] = Σ_x p(x | y) f(x)      (discrete distribution p(x))
E_x[f | y] = ∫ p(x | y) f(x) dx     (probability density p(x))
Note that E_x[f | y] is a function of y. Other notation used in the literature: E_{x|y}[f].
What is E[E[f(x) | y]]? Can we simplify it? This must mean E_y[E_x[f(x) | y]]. (Why? Because E_x[f(x) | y] is a function of y, so the remaining expectation must be taken over y.) Then
E_y[E_x[f(x) | y]] = Σ_y p(y) E_x[f | y]
                   = Σ_y p(y) Σ_x p(x | y) f(x)
                   = Σ_{x,y} f(x) p(x, y)
                   = Σ_x f(x) p(x)
                   = E_x[f(x)]
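This law of total expectation can be verified numerically on the discrete table from earlier. A small sketch (my own, with an arbitrary choice of f):

```python
import numpy as np

# Joint distribution p(X, Y) from the count table (rows: Y = 1, Y = 2).
p_xy = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],
                 [0, 0, 0, 1, 4, 5, 8, 6, 2]]) / 60.0
p_x = p_xy.sum(axis=0)
p_y = p_xy.sum(axis=1)

f = np.arange(9) ** 2                      # an arbitrary f(x), indexed by X = a, ..., i

E_f = p_x @ f                              # E_x[f(x)]
E_f_given_y = (p_xy / p_y[:, None]) @ f    # E_x[f | y], one value per y
print(np.isclose(p_y @ E_f_given_y, E_f))  # True: E_y[E_x[f | y]] = E_x[f]
```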

Variance. For an arbitrary function f(x):
var[f] = E[(f(x) − E[f(x)])^2] = E[f(x)^2] − E[f(x)]^2
Special case f(x) = x:
var[x] = E[(x − E[x])^2] = E[x^2] − E[x]^2

Covariance. For two random variables x ∈ R and y ∈ R:
cov[x, y] = E_{x,y}[(x − E[x])(y − E[y])] = E_{x,y}[x y] − E[x] E[y]
With E[x] = a and E[y] = b:
cov[x, y] = E_{x,y}[(x − a)(y − b)]
          = E_{x,y}[x y] − E_{x,y}[x b] − E_{x,y}[a y] + E_{x,y}[a b]
          = E_{x,y}[x y] − b E_{x,y}[x] − a E_{x,y}[y] + a b E_{x,y}[1]
          = E_{x,y}[x y] − a b − a b + a b
          = E_{x,y}[x y] − a b
          = E_{x,y}[x y] − E[x] E[y]
using E_{x,y}[x] = E_x[x] = a, E_{x,y}[y] = E_y[y] = b, and E_{x,y}[1] = 1. The covariance expresses how strongly x and y vary together. If x and y are independent, their covariance vanishes.
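A quick numerical check of the identity cov[x, y] = E[xy] − E[x] E[y] on correlated samples (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal(100_000)
y = 0.5 * x + rng.standard_normal(100_000)   # y is correlated with x

cov_definition = np.mean((x - x.mean()) * (y - y.mean()))
cov_identity = np.mean(x * y) - x.mean() * y.mean()
print(np.isclose(cov_definition, cov_identity))   # True; both are about 0.5
```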

Covariance for Vector-Valued Variables. For two random variables x ∈ R^D and y ∈ R^D:
cov[x, y] = E_{x,y}[(x − E[x])(y^T − E[y^T])] = E_{x,y}[x y^T] − E[x] E[y^T]
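A minimal sketch (my own) of the vector-valued case, estimating the D × D matrix E[x y^T] − E[x] E[y^T] from samples:

```python
import numpy as np

rng = np.random.default_rng(0)

D, N = 3, 100_000
x = rng.standard_normal((N, D))
y = x @ rng.standard_normal((D, D)) + rng.standard_normal((N, D))  # correlated with x

# cov[x, y] = E[x y^T] - E[x] E[y^T], estimated from the samples.
cov_xy = (x.T @ y) / N - np.outer(x.mean(axis=0), y.mean(axis=0))
print(cov_xy.shape)   # (3, 3)
```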