Statistical learning. Chapter 20, Sections 1–3


Outline
- Bayesian learning
- Maximum a posteriori and maximum likelihood learning
- Bayes net learning
  - ML parameter learning with complete data
  - linear regression

Full Bayesian learning
View learning as Bayesian updating of a probability distribution over the hypothesis space.
H is the hypothesis variable, with values h_1, h_2, ... and prior P(H).
The jth observation d_j gives the outcome of the random variable D_j; the training data are d = d_1, ..., d_N.
Given the data so far, each hypothesis has a posterior probability:
    P(h_i | d) = α P(d | h_i) P(h_i)
where P(d | h_i) is called the likelihood.
Predictions use a likelihood-weighted average over the hypotheses:
    P(X | d) = Σ_i P(X | d, h_i) P(h_i | d) = Σ_i P(X | h_i) P(h_i | d)
(the second equality holds because each hypothesis h_i fully determines the distribution of X).
No need to pick one best-guess hypothesis!

Example
Suppose there are five kinds of bags of candies:
- 10% are h_1: 100% cherry candies
- 20% are h_2: 75% cherry candies + 25% lime candies
- 40% are h_3: 50% cherry candies + 50% lime candies
- 20% are h_4: 25% cherry candies + 75% lime candies
- 10% are h_5: 100% lime candies
Then we observe candies drawn from some bag. What kind of bag is it? What flavour will the next candy be?

Posterior probability of hypotheses
[Figure: the posterior probabilities P(h_1 | d), ..., P(h_5 | d) plotted against the number of samples in d (0 to 10).]

Prediction probability
[Figure: P(next candy is lime | d) plotted against the number of samples in d (0 to 10); the y-axis runs from 0.4 to 1.]
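To make the updating concrete, here is a minimal Python sketch for the candy example above (not part of the slides; the all-lime observation sequence is an assumption chosen to mirror the two plots). It recomputes the posteriors P(h_i | d) and the prediction P(next candy is lime | d) after each observation.

```python
import numpy as np

# Prior over the five bag hypotheses h1..h5
prior = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
# P(lime | h_i) for each hypothesis
p_lime = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

def update(posterior, observation):
    """One Bayesian update: P(h_i | d_1..j) is proportional to P(d_j | h_i) P(h_i | d_1..j-1)."""
    likelihood = p_lime if observation == "lime" else 1.0 - p_lime
    unnormalized = likelihood * posterior
    return unnormalized / unnormalized.sum()

posterior = prior.copy()
for n, obs in enumerate(["lime"] * 10, start=1):    # assumed all-lime observations
    posterior = update(posterior, obs)
    p_next_lime = float(np.dot(p_lime, posterior))  # likelihood-weighted average
    print(f"after {n} candies: P(h|d) = {np.round(posterior, 3)}, "
          f"P(next = lime | d) = {p_next_lime:.3f}")
```

With an all-lime sequence, the posterior mass shifts toward h_5 and the predicted lime probability rises toward 1, which is the trend the plots above show.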

MAP approximation
Summing over the hypothesis space is often intractable: for example, there are 2^(2^6) = 18,446,744,073,709,551,616 Boolean functions of 6 attributes.
Maximum a posteriori (MAP) learning: choose h_MAP maximizing P(h_i | d),
i.e., maximize P(d | h_i) P(h_i), or equivalently log P(d | h_i) + log P(h_i).
The log terms can be viewed as (the negative of) the number of bits needed to encode the data given the hypothesis, plus the bits needed to encode the hypothesis.
This is the basic idea of minimum description length (MDL) learning.
For deterministic hypotheses, P(d | h_i) is 1 if h_i is consistent with the data and 0 otherwise, so MAP = simplest consistent hypothesis (cf. science).
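A minimal sketch of MAP selection for the candy example above, assuming hypothetical counts of 3 limes and 0 cherries observed so far:

```python
import numpy as np

prior = np.array([0.1, 0.2, 0.4, 0.2, 0.1])     # P(h_i)
p_lime = np.array([0.0, 0.25, 0.5, 0.75, 1.0])  # P(lime | h_i)
n_lime, n_cherry = 3, 0                          # hypothetical counts observed so far

# Unnormalized posterior P(d | h_i) P(h_i); its argmax is the same as the
# argmax of log P(d | h_i) + log P(h_i).
unnorm_post = p_lime**n_lime * (1.0 - p_lime)**n_cherry * prior
h_map = int(np.argmax(unnorm_post)) + 1
print(f"MAP hypothesis after {n_lime} limes and {n_cherry} cherries: h{h_map}")  # h5
```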

ML approximation
For large data sets, the prior becomes irrelevant.
Maximum likelihood (ML) learning: choose h_ML maximizing P(d | h_i),
i.e., simply get the best fit to the data; identical to MAP for a uniform prior (which is reasonable if all hypotheses are of the same complexity).
ML is the standard (non-Bayesian) statistical learning method.
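A brief sketch (again with hypothetical counts) of the point that MAP with a uniform prior reduces to ML:

```python
import numpy as np

p_lime = np.array([0.0, 0.25, 0.5, 0.75, 1.0])  # P(lime | h_i)
n_lime, n_cherry = 3, 0                          # hypothetical counts, as above

likelihood = p_lime**n_lime * (1.0 - p_lime)**n_cherry   # P(d | h_i)
uniform_prior = np.full(5, 0.2)

# With a uniform prior the posterior is proportional to the likelihood,
# so argmax_i P(h_i | d) coincides with argmax_i P(d | h_i).
assert np.argmax(likelihood) == np.argmax(likelihood * uniform_prior)
print("h_ML = h" + str(int(np.argmax(likelihood)) + 1))  # h5
```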

ML parameter learning in Bayes nets
Bag from a new manufacturer; what fraction θ of the candies are cherry?
Any θ is possible: a continuum of hypotheses h_θ.
θ is a parameter for this simple (binomial) family of models.
[Figure: one-node Bayes net, Flavor, with P(F = cherry) = θ.]
Suppose we unwrap N candies, c cherries and l = N − c limes.
These are i.i.d. (independent, identically distributed) observations, so
    P(d | h_θ) = ∏_{j=1}^{N} P(d_j | h_θ) = θ^c (1 − θ)^l
Maximize this w.r.t. θ, which is easier for the log-likelihood:
    L(d | h_θ) = log P(d | h_θ) = Σ_{j=1}^{N} log P(d_j | h_θ) = c log θ + l log(1 − θ)
    dL(d | h_θ)/dθ = c/θ − l/(1 − θ) = 0   ⇒   θ = c/(c + l) = c/N
Seems sensible, but causes problems with 0 counts!
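A minimal sketch of this estimator; the add-alpha (Laplace) smoothing variant is not from the slides but is one standard way around the 0-count problem mentioned above:

```python
def ml_theta(c, l):
    """Maximum-likelihood estimate: theta = c / (c + l) = c / N."""
    return c / (c + l)

def smoothed_theta(c, l, alpha=1.0):
    """Add-alpha (Laplace) smoothing, one standard fix for zero counts."""
    return (c + alpha) / (c + l + 2 * alpha)

print(ml_theta(3, 7))         # 0.3
print(ml_theta(0, 10))        # 0.0 -- the zero-count problem
print(smoothed_theta(0, 10))  # ~0.083
```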

Multiple parameters
Red/green wrapper depends probabilistically on flavor.
[Figure: two-node Bayes net Flavor → Wrapper, with P(F = cherry) = θ and P(W = red | F) = θ_1 for cherry, θ_2 for lime.]
Likelihood for, e.g., a cherry candy in a green wrapper:
    P(F = cherry, W = green | h_θ,θ1,θ2)
      = P(F = cherry | h_θ,θ1,θ2) P(W = green | F = cherry, h_θ,θ1,θ2)
      = θ (1 − θ_1)
With N candies, r_c red-wrapped cherry candies, g_c green-wrapped cherry candies, etc.:
    P(d | h_θ,θ1,θ2) = θ^c (1 − θ)^l θ_1^{r_c} (1 − θ_1)^{g_c} θ_2^{r_l} (1 − θ_2)^{g_l}
    L = [c log θ + l log(1 − θ)] + [r_c log θ_1 + g_c log(1 − θ_1)] + [r_l log θ_2 + g_l log(1 − θ_2)]

Multiple parameters contd.
The derivatives of L contain only the relevant parameter:
    ∂L/∂θ   = c/θ − l/(1 − θ) = 0        ⇒  θ   = c/(c + l)
    ∂L/∂θ_1 = r_c/θ_1 − g_c/(1 − θ_1) = 0  ⇒  θ_1 = r_c/(r_c + g_c)
    ∂L/∂θ_2 = r_l/θ_2 − g_l/(1 − θ_2) = 0  ⇒  θ_2 = r_l/(r_l + g_l)
With complete data, parameters can be learned separately.
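A small sketch of these closed-form estimates on a made-up complete data set of (flavor, wrapper) observations; the counts c, l, r_c, g_c, r_l, g_l are read off the data and each parameter is estimated separately:

```python
from collections import Counter

# Hypothetical complete data: observed (flavor, wrapper) pairs
data = [("cherry", "red"), ("cherry", "red"), ("cherry", "green"),
        ("lime", "green"), ("lime", "green"), ("lime", "red")]

counts = Counter(data)
r_c, g_c = counts[("cherry", "red")], counts[("cherry", "green")]
r_l, g_l = counts[("lime", "red")], counts[("lime", "green")]
c, l = r_c + g_c, r_l + g_l

# Each ML estimate depends only on the counts for its own CPT entry
theta   = c / (c + l)        # P(F = cherry)
theta_1 = r_c / (r_c + g_c)  # P(W = red | F = cherry)
theta_2 = r_l / (r_l + g_l)  # P(W = red | F = lime)
print(theta, theta_1, theta_2)   # 0.5 0.666... 0.333...
```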

Example: linear Gaussian model
[Figure: the conditional density P(y | x) for the linear Gaussian model, and a plot of (x, y) data points in the unit square.]
Maximizing
    P(y | x) = 1/(√(2π) σ) · e^{ −(y − (θ_1 x + θ_2))^2 / (2σ^2) }
w.r.t. θ_1, θ_2
    = minimizing E = Σ_{j=1}^{N} (y_j − (θ_1 x_j + θ_2))^2
That is, minimizing the sum of squared errors gives the ML solution for a linear fit, assuming Gaussian noise of fixed variance.
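A short sketch of this equivalence on synthetic data (the true parameters, noise level, and sample size are made up): the least-squares solution is computed directly, and the Gaussian log-likelihood it attains is, up to an additive constant, just −E/(2σ^2), so no other line can have higher likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
theta1_true, theta2_true, sigma = 1.0, 0.5, 0.1   # made-up ground truth
x = rng.uniform(0, 1, size=50)
y = theta1_true * x + theta2_true + rng.normal(0.0, sigma, size=50)

# Least-squares fit: minimize E = sum_j (y_j - (theta_1 x_j + theta_2))^2
A = np.column_stack([x, np.ones_like(x)])
(theta1, theta2), *_ = np.linalg.lstsq(A, y, rcond=None)

# The Gaussian log-likelihood differs from -E/(2 sigma^2) only by a constant,
# so the least-squares parameters also maximize the likelihood.
def log_likelihood(t1, t2):
    resid = y - (t1 * x + t2)
    return -0.5 * np.sum(resid**2) / sigma**2 - len(x) * np.log(np.sqrt(2 * np.pi) * sigma)

print(theta1, theta2, log_likelihood(theta1, theta2))
```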

Summary
Full Bayesian learning gives the best possible predictions but is intractable.
MAP learning balances complexity with accuracy on the training data.
Maximum likelihood assumes a uniform prior, OK for large data sets.
1. Choose a parameterized family of models to describe the data (requires substantial insight and sometimes new models).
2. Write down the likelihood of the data as a function of the parameters (may require summing over hidden variables, i.e., inference).
3. Write down the derivative of the log likelihood w.r.t. each parameter.
4. Find the parameter values such that the derivatives are zero (may be hard/impossible; modern optimization techniques help; a short numerical sketch follows below).
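When solving the zero-derivative equations by hand is impractical (step 4), a numerical optimizer can maximize the log-likelihood directly. A minimal sketch using the single-parameter candy model, where the closed-form answer θ = c/N serves as a check (the counts and the use of scipy.optimize are assumptions, not part of the slides):

```python
import numpy as np
from scipy.optimize import minimize_scalar

c, l = 3, 7   # hypothetical counts: 3 cherries, 7 limes

# Step 2: likelihood of the data as a function of the parameter theta.
# Steps 3-4: instead of solving dL/dtheta = 0 by hand, maximize L numerically.
def neg_log_likelihood(theta):
    return -(c * np.log(theta) + l * np.log(1.0 - theta))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x, c / (c + l))   # both are approximately 0.3
```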