COMP90051 Statistical Machine Learning

COMP90051 Statistical Machine Learning, Semester 2, 2017. Lecturer: Trevor Cohn. Lecture 2: Statistical Schools. Adapted from slides by Ben Rubinstein.

Statistical Schools of Thought
The remainder of the lecture provides intuition into how the algorithms in this subject come about and how they inter-relate. Based on Berkeley CS 294-34 tutorial slides by Ariel Kleiner.

Frequentist Statistics
Abstract problem:
* Given: $X_1, X_2, \ldots, X_n$ drawn i.i.d. from some distribution
* Want to: identify the unknown distribution
Parametric approach ("parameter estimation"):
* Class of models $\{p_\theta(x) : \theta \in \Theta\}$ indexed by parameters $\Theta$ (could be a real number, or vector, or ...)
* Select $\hat\theta(x_1, \ldots, x_n)$, some function (or statistic) of the data; the hat means "estimate" or "estimator"
Examples:
* Given $n$ coin flips, determine the probability of landing heads
* Building a classifier is a very related problem

How do Frequentists Evaluate Estimators?
* Bias: $B_\theta(\hat\theta) = E_\theta[\hat\theta(X_1, \ldots, X_n)] - \theta$
* Variance: $\mathrm{Var}_\theta(\hat\theta) = E_\theta[(\hat\theta - E_\theta[\hat\theta])^2]$
* Efficiency: the estimate has minimal variance
* Square loss vs bias-variance: $E_\theta[(\theta - \hat\theta)^2] = [B_\theta(\hat\theta)]^2 + \mathrm{Var}_\theta(\hat\theta)$
* Consistency: $\hat\theta(X_1, \ldots, X_n)$ converges to $\theta$ as $n$ gets big (more on this later in the subject)
The subscript $\theta$ means the data really comes from $p_\theta$; $\hat\theta$ is still a function of the data.
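
Below is a minimal simulation sketch (not from the slides; the true parameter, sample size, and number of repetitions are assumptions made for illustration) that estimates the bias and variance of the sample-mean estimator for $N(\theta, 1)$ data and checks the bias-variance decomposition of the square loss.

```python
# Sketch: bias/variance of the sample-mean estimator under repeated sampling.
import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = 2.0, 30, 10_000   # assumed true parameter, sample size, repetitions

estimates = np.array([rng.normal(theta, 1.0, size=n).mean() for _ in range(trials)])

bias = estimates.mean() - theta            # B_theta(theta_hat)
variance = estimates.var()                 # Var_theta(theta_hat)
mse = ((estimates - theta) ** 2).mean()    # square loss, approx bias^2 + variance

print(f"bias={bias:.4f}  variance={variance:.4f}  bias^2+var={bias**2 + variance:.4f}  mse={mse:.4f}")
```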

Is this Just Theoretical™?
Recall Lecture 1: those evaluation metrics? They're just estimators of a performance parameter. Bias, variance, etc. indicate the quality of the approximation.
[From L1, Evaluation (Supervised Learners):] How you measure quality depends on your problem! Typical process:
* Pick an evaluation metric comparing label vs prediction (example: error)
* Procure an independent, labelled test set
* Average the evaluation metric over the test set
Example evaluation metrics: accuracy, contingency table, precision-recall, ROC curves. When data is poor, cross-validate.
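
As a hedged illustration of that point (simulated 0/1 errors for a hypothetical classifier, not from the slides): the averaged test-set metric is itself an estimator of an underlying performance parameter, here the true error rate.

```python
# Sketch: test-set error as an estimator of the true error rate.
import numpy as np

rng = np.random.default_rng(3)
true_error_rate = 0.2                                 # the performance "parameter"
errors = rng.binomial(1, true_error_rate, size=500)   # 1 where the (hypothetical) classifier errs on a test item
print(errors.mean())                                  # averaged metric: an estimate of true_error_rate
```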

Maximum-Likelihood Estimation
A general principle for designing estimators. Involves optimisation:
$\hat\theta(x_1, \ldots, x_n) = \mathrm{argmax}_{\theta \in \Theta} \prod_{i=1}^{n} p_\theta(x_i)$
* Question: Why a product?
[Portrait: R. A. Fisher]

Example I: Normal
We know the data comes from a Normal distribution with variance 1 but unknown mean; find the mean.
MLE for the mean:
* $p_\theta(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}(x - \theta)^2\right)$
* Maximising the likelihood yields $\hat\theta = \frac{1}{n} \sum_{i=1}^{n} x_i$
Exercise: derive the MLE for the variance $\sigma^2$ based on $p_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)$ with $\theta = (\mu, \sigma^2)$.
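
A quick numerical check of this result, sketched with assumed simulated data (it uses scipy's generic scalar optimiser, not anything prescribed by the subject): minimising the negative log-likelihood recovers the sample mean.

```python
# Sketch: the sample mean maximises the Normal(theta, 1) likelihood.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.normal(3.0, 1.0, size=100)   # assumed observations

def neg_log_likelihood(theta):
    # -log prod_i p_theta(x_i) for a Normal with unit variance
    return 0.5 * np.sum((x - theta) ** 2) + 0.5 * len(x) * np.log(2 * np.pi)

result = minimize_scalar(neg_log_likelihood)
print(result.x, x.mean())            # the two agree up to numerical tolerance
```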

Example II: Bernoulli
We know the data comes from a Bernoulli distribution with unknown parameter (e.g., a biased coin); find the mean.
MLE for the mean:
* $p_\theta(x) = \begin{cases} \theta, & \text{if } x = 1 \\ 1 - \theta, & \text{if } x = 0 \end{cases} = \theta^x (1 - \theta)^{1 - x}$ (note: $p_\theta(x) = 0$ for all other $x$)
* Maximising the likelihood yields $\hat\theta = \frac{1}{n} \sum_{i=1}^{n} x_i$
(Corrected typo after lecture, 27/7/17)
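
A hedged worked step that the slide leaves implicit (the standard log-likelihood derivation, added here for completeness):

```latex
% Log-likelihood of n Bernoulli observations and its stationary point:
\log L(\theta) = \sum_{i=1}^{n}\big[x_i \log\theta + (1 - x_i)\log(1 - \theta)\big]
% Differentiate, set to zero, and solve:
\frac{d}{d\theta}\log L(\theta)
  = \frac{\sum_i x_i}{\theta} - \frac{n - \sum_i x_i}{1 - \theta} = 0
  \;\Longrightarrow\;
  \hat\theta = \frac{1}{n}\sum_{i=1}^{n} x_i
```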

MLE algorithm
1. Given data $x_1, \ldots, x_n$, define the probability distribution $p_\theta$ assumed to have generated the data
2. Express the likelihood of the data, $\prod_{i=1}^{n} p_\theta(x_i)$ (usually its logarithm; why?)
3. Optimise to find the best (most likely) parameters $\hat\theta$:
   1. take partial derivatives of the log-likelihood w.r.t. $\theta$
   2. set to 0 and solve (failing that, use an iterative gradient method, as in the sketch below)
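
The sketch below runs this recipe on the earlier Normal-mean example using the "iterative gradient method" fallback of step 3.2; the data, learning rate, and iteration count are assumptions made for illustration.

```python
# Sketch: gradient ascent on the Normal(theta, 1) log-likelihood.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(-1.5, 1.0, size=200)      # step 1: data assumed drawn from Normal(theta, 1)

def grad_log_likelihood(theta):
    # d/dtheta of sum_i log p_theta(x_i) = sum_i (x_i - theta)
    return np.sum(x - theta)

theta, lr = 0.0, 1e-3                    # initial guess and step size (assumed)
for _ in range(1000):                    # step 3.2: iterate instead of solving in closed form
    theta += lr * grad_log_likelihood(theta)

print(theta, x.mean())                   # converges to the closed-form MLE (the sample mean)
```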

Bayesian Statistics
Probabilities correspond to beliefs. Parameters:
* Modelled as random variables having distributions
* Prior belief in $\theta$ encoded by a prior distribution $P(\theta)$
* Write the likelihood of the data $P(X)$ as the conditional $P(X \mid \theta)$
* Rather than a point estimate $\hat\theta$, Bayesians update the belief $P(\theta)$ with observed data to $P(\theta \mid X)$, the posterior distribution
[Portrait: Laplace]

More Detail (Probabilistic Inference)
Bayesian machine learning:
* Start with prior $P(\theta)$ and likelihood $P(X \mid \theta)$
* Observe data $X = x$
* Update the prior to the posterior $P(\theta \mid X = x)$
We'll later cover tools to get the posterior:
* Bayes' Theorem: reverses the order of conditioning, $P(\theta \mid X = x) = \frac{P(X = x \mid \theta)\, P(\theta)}{P(X = x)}$
* Marginalisation: eliminates unwanted variables, $P(X = x) = \sum_t P(X = x, \theta = t)$
[Portrait: Bayes]
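
The following sketch applies both tools numerically on a discretised parameter grid, for a hypothetical biased-coin model with a uniform prior (the data and the grid are assumptions, not from the slides).

```python
# Sketch: Bayes' theorem and marginalisation on a parameter grid.
import numpy as np

flips = np.array([1, 0, 1, 1, 0, 1, 1, 1])              # hypothetical observed data X = x
theta_grid = np.linspace(0.01, 0.99, 99)                 # candidate values t for theta
prior = np.full_like(theta_grid, 1 / len(theta_grid))    # uniform prior P(theta)

# Likelihood P(X = x | theta) for each candidate theta
likelihood = np.prod(theta_grid[None, :] ** flips[:, None]
                     * (1 - theta_grid[None, :]) ** (1 - flips[:, None]), axis=0)

evidence = np.sum(likelihood * prior)        # marginalisation: P(X = x)
posterior = likelihood * prior / evidence    # Bayes' theorem: P(theta | X = x)

print(posterior.sum())                       # a proper distribution over the grid (sums to 1)
```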

Example
We model $X \mid \theta$ as $N(\theta, 1)$ with prior $N(0, 1)$. Suppose we observe $X = 1$; then update the posterior:
$P(\theta \mid X = 1) = \frac{P(X = 1 \mid \theta)\, P(\theta)}{P(X = 1)} \propto P(X = 1 \mid \theta)\, P(\theta) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(1 - \theta)^2}{2}\right) \cdot \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\theta^2}{2}\right) \propto N(0.5, 0.5)$
NB: we are allowed to push constants out front and ignore them, as these get taken care of by normalisation.
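
For completeness, here is the algebra the slide compresses (an added step, standard completing-the-square, with constants in $\theta$ dropped throughout):

```latex
\begin{aligned}
P(\theta \mid X = 1)
  &\propto \exp\!\Big(-\tfrac{1}{2}(1-\theta)^2\Big)\exp\!\Big(-\tfrac{1}{2}\theta^2\Big)
   = \exp\!\Big(-\tfrac{1}{2}\big(2\theta^2 - 2\theta + 1\big)\Big) \\
  &\propto \exp\!\big(-(\theta^2 - \theta)\big)
   = \exp\!\Big(-\big(\theta - \tfrac{1}{2}\big)^2 + \tfrac{1}{4}\Big)
   \propto \exp\!\Big(-\frac{(\theta - \tfrac{1}{2})^2}{2 \cdot \tfrac{1}{2}}\Big)
\end{aligned}
% ... which is the kernel of a Normal with mean 0.5 and variance 0.5, i.e. N(0.5, 0.5).
```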

How Bayesians Make Point Estimates
They don't, unless forced at gunpoint!
* The posterior carries full information, why discard it?
But there are common approaches:
* Posterior mean: $E_{\theta \mid X}[\theta] = \int \theta\, P(\theta \mid X)\, d\theta$
* Posterior mode: $\mathrm{argmax}_\theta P(\theta \mid X)$ (maximum a posteriori, or MAP)
[Figure: a posterior density $P(\theta \mid X)$ over $\theta$, with its mode marked $\theta_{\mathrm{MAP}}$]
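
Continuing the hypothetical coin example (self-contained, with an assumed data set and grid), both point estimates can be read off a grid-approximated posterior:

```python
# Sketch: posterior mean and MAP from a grid-approximated posterior (uniform prior).
import numpy as np

flips = np.array([1, 0, 1, 1, 0, 1, 1, 1])     # hypothetical data
theta = np.linspace(0.01, 0.99, 9801)          # fine parameter grid
like = np.prod(theta[None, :] ** flips[:, None]
               * (1 - theta[None, :]) ** (1 - flips[:, None]), axis=0)

dtheta = theta[1] - theta[0]
posterior = like / (like.sum() * dtheta)       # normalised as a density on the grid

posterior_mean = (theta * posterior).sum() * dtheta   # approximates the integral E[theta | X]
theta_map = theta[np.argmax(posterior)]               # argmax_theta P(theta | X)

print(f"posterior mean = {posterior_mean:.3f}, MAP = {theta_map:.3f}")
```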

MLE in Bayesian context
MLE formulation: find the parameters that best fit the data,
$\hat\theta = \mathrm{argmax}_\theta P(X = x \mid \theta)$
Consider the MAP estimate under a Bayesian formulation:
$\hat\theta = \mathrm{argmax}_\theta P(\theta \mid X = x) = \mathrm{argmax}_\theta \frac{P(X = x \mid \theta)\, P(\theta)}{P(X = x)} = \mathrm{argmax}_\theta P(X = x \mid \theta)\, P(\theta)$
The difference is the prior $P(\theta)$; it is assumed uniform for MLE.
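
One way to make the comparison explicit, written in log space (an added step, not on the slide):

```latex
\hat\theta_{\mathrm{MAP}}
  = \operatorname*{argmax}_{\theta}\big[\log P(X = x \mid \theta) + \log P(\theta)\big]
% With a uniform prior, \log P(\theta) is constant in \theta, so the argmax reduces to
% \operatorname*{argmax}_{\theta} \log P(X = x \mid \theta), which is the MLE.
```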

Parametric vs Non-Parametric Models
* Parametric: determined by a fixed, finite number of parameters; limited flexibility; efficient statistically and computationally.
* Non-parametric: number of parameters grows with the data, potentially infinite; more flexible; less efficient.
Examples to come! There are parametric and non-parametric models in both the frequentist and Bayesian schools.

Generative vs. Discriminative Models
Xs are instances, Ys are labels (supervised setting!)
* Given: i.i.d. data $(X_1, Y_1), \ldots, (X_n, Y_n)$
* Find a model that can predict the $Y$ of a new $X$
Generative approach:
* Model the full joint $P(X, Y)$
Discriminative approach:
* Model the conditional $P(Y \mid X)$ only
Both have pros and cons. Examples to come! There are generative and discriminative models in both the frequentist and Bayesian schools.

Summary
* Philosophies: frequentist vs. Bayesian
* Principles behind many learners: MLE; probabilistic inference, MAP
* Discriminative vs. generative models