Announcements: You should turn in a PDF and Python file(s). The figure for problem 9 should be in the PDF. Please do not zip these files when submitting (unless there are more than 5 files).

Bayesian Methods. Machine Learning CSE546, Kevin Jamieson, University of Washington. September 28, 2017.

MLE Recap - coin flips. Data: a sequence D = (HHTHT...) with k heads out of n flips. Hypothesis: P(Heads) = θ, P(Tails) = 1 − θ, so $P(D \mid \theta) = \theta^{k}(1-\theta)^{n-k}$. Maximum likelihood estimation (MLE): choose the θ that maximizes the probability of the observed data, $\hat{\theta}_{MLE} = \arg\max_\theta P(D \mid \theta) = \arg\max_\theta \log P(D \mid \theta)$, which gives $\hat{\theta}_{MLE} = k/n$.
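A minimal numerical check (not from the slides; the counts below are hypothetical) that the closed-form answer k/n indeed maximizes the binomial log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical data: k heads out of n flips
k, n = 7, 10

def neg_log_likelihood(theta):
    # log P(D | theta) = k*log(theta) + (n-k)*log(1-theta), up to a constant
    return -(k * np.log(theta) + (n - k) * np.log(1 - theta))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("numerical maximizer:", res.x)   # close to 0.7
print("closed form k/n:    ", k / n)   # 0.7
```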

MLE Recap - Gaussians. MLE: $\log P(D \mid \mu, \sigma) = -n \log(\sigma\sqrt{2\pi}) - \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{2\sigma^2}$, which gives $\hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu}_{MLE})^2$. The MLE for the variance of a Gaussian is biased, $E[\hat{\sigma}^2_{MLE}] \ne \sigma^2$. Unbiased variance estimator: $\hat{\sigma}^2_{unbiased} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \hat{\mu}_{MLE})^2$.
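The bias of the variance MLE is easy to see in simulation. A short sketch (sample size and true variance below are hypothetical) comparing the 1/n and 1/(n−1) estimators over many repetitions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, true_var = 5, 100_000, 4.0

mle_estimates, unbiased_estimates = [], []
for _ in range(trials):
    x = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=n)
    mu_hat = x.mean()
    mle_estimates.append(np.sum((x - mu_hat) ** 2) / n)             # divides by n (biased)
    unbiased_estimates.append(np.sum((x - mu_hat) ** 2) / (n - 1))  # divides by n-1

print("true variance:        ", true_var)
print("average MLE estimate: ", np.mean(mle_estimates))       # roughly (n-1)/n * 4 = 3.2
print("average unbiased est.:", np.mean(unbiased_estimates))  # roughly 4.0
```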

MLE Recap. Learning is:
- Collect some data, e.g., coin flips.
- Choose a hypothesis class or model, e.g., binomial.
- Choose a loss function, e.g., data likelihood.
- Choose an optimization procedure, e.g., set the derivative to zero to obtain the MLE.
- Justify the accuracy of the estimate, e.g., Hoeffding's inequality.

What about a prior? Billionaire: Wait, I know that the coin is close to 50-50. What can you do for me now? You say: I can learn it the Bayesian way.

Bayesian vs Frequentist. Data: D. Estimator: $\hat{\theta} = t(D)$; loss: $\ell(t(D), \theta)$. Frequentists treat the unknown θ as fixed and the data D as random. Bayesians treat the data D as fixed and the unknown θ as random.

Bayesian Learning. Use Bayes' rule: $P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$. Or equivalently: $P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)$.

Bayesian Learning for Coins. The likelihood function is simply binomial: $P(D \mid \theta) = \theta^{\alpha_H}(1-\theta)^{\alpha_T}$, where $\alpha_H$ and $\alpha_T$ are the numbers of heads and tails. What about the prior? It represents expert knowledge. Conjugate priors give a closed-form representation of the posterior; for the binomial, the conjugate prior is the Beta distribution.

Beta prior distribution: $P(\theta) = \frac{\theta^{\beta_H - 1}(1-\theta)^{\beta_T - 1}}{B(\beta_H, \beta_T)} \sim \text{Beta}(\beta_H, \beta_T)$. Mean: $\frac{\beta_H}{\beta_H + \beta_T}$. Mode: $\frac{\beta_H - 1}{\beta_H + \beta_T - 2}$. [Plots of the Beta(2,3) and Beta(20,30) densities.] Likelihood function: $P(D \mid \theta) = \theta^{\alpha_H}(1-\theta)^{\alpha_T}$. Posterior: $P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)$.

Posterior distribution. Prior: $\text{Beta}(\beta_H, \beta_T)$. Data: $\alpha_H$ heads and $\alpha_T$ tails. Posterior distribution: $P(\theta \mid D) \propto \theta^{\alpha_H + \beta_H - 1}(1-\theta)^{\alpha_T + \beta_T - 1}$, i.e. $\text{Beta}(\alpha_H + \beta_H, \alpha_T + \beta_T)$. [Plots of the Beta(2,3) and Beta(20,30) densities.]
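A minimal sketch of the conjugate update (the prior and counts below are hypothetical): the Beta parameters simply accumulate the observed head and tail counts, and scipy summarizes the resulting posterior.

```python
from scipy.stats import beta

# Hypothetical prior Beta(beta_H, beta_T) expressing a mild belief near 0.5
beta_H, beta_T = 2, 3
# Hypothetical data: alpha_H heads, alpha_T tails
alpha_H, alpha_T = 18, 27

# Conjugacy: posterior is Beta(beta_H + alpha_H, beta_T + alpha_T)
posterior = beta(beta_H + alpha_H, beta_T + alpha_T)

print("posterior mean:       ", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```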

Using the Bayesian posterior. Posterior distribution: $P(\theta \mid D)$. Bayesian inference: we no longer work with a single parameter value; quantities of interest are expectations under the posterior, e.g. $E[f(\theta) \mid D] = \int f(\theta)\, P(\theta \mid D)\, d\theta$. The integral is often hard to compute.
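For the Beta posterior this expectation has a closed form, but the generic route is to evaluate the integral numerically, and that is what becomes expensive for richer models. A sketch with hypothetical posterior parameters:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import beta

a, b = 20, 30                      # hypothetical posterior Beta(a, b)
posterior = beta(a, b)

# E[theta | D] = integral over [0, 1] of theta * p(theta | D)
numeric_mean, _ = quad(lambda t: t * posterior.pdf(t), 0.0, 1.0)

print("numerical posterior mean:", numeric_mean)
print("closed form a/(a+b):     ", a / (a + b))
```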

MAP: maximum a posteriori approximation. As more data is observed, the Beta posterior becomes more certain. MAP: use the most likely parameter, $\hat{\theta}_{MAP} = \arg\max_\theta P(\theta \mid D)$.

MAP for the Beta distribution. MAP: use the most likely parameter: $\hat{\theta}_{MAP} = \arg\max_\theta P(\theta \mid D) = \frac{\alpha_H + \beta_H - 1}{\alpha_H + \alpha_T + \beta_H + \beta_T - 2}$. The Beta prior is equivalent to extra coin flips. As $N \to \infty$, the prior is forgotten. But for small sample sizes, the prior is important!
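A small sketch (hypothetical prior and counts) comparing MLE and MAP: with few flips the prior pulls the estimate toward 0.5, while with many flips the two nearly agree.

```python
def mle(alpha_H, alpha_T):
    return alpha_H / (alpha_H + alpha_T)

def map_beta(alpha_H, alpha_T, beta_H, beta_T):
    # Mode of the Beta(alpha_H + beta_H, alpha_T + beta_T) posterior
    return (alpha_H + beta_H - 1) / (alpha_H + alpha_T + beta_H + beta_T - 2)

# Hypothetical prior encoding "close to 50-50": Beta(50, 50)
beta_H, beta_T = 50, 50

print(mle(3, 1),     map_beta(3, 1, beta_H, beta_T))      # 0.75 vs about 0.51
print(mle(750, 250), map_beta(750, 250, beta_H, beta_T))  # 0.75 vs about 0.73
```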

Recap for Bayesian learning. Learning is:
- Collect some data, e.g., coin flips.
- Choose a hypothesis class or model, e.g., binomial, plus a prior based on expert knowledge.
- Choose a loss function, e.g., the parameter's posterior likelihood.
- Choose an optimization procedure, e.g., set the derivative to zero to obtain the MAP estimate.
- Justify the accuracy of the estimate, e.g., if the model is correct, you are doing the best possible.

Recap for Bayesian learning. Bayesians are optimists: if we model it correctly, we output the most likely answer. This assumes one can accurately model the observations and their link to the unknown parameter θ, i.e. $p(x \mid \theta)$, as well as the distribution and structure of the unknown θ, i.e. $p(\theta)$. Frequentists are pessimists: all models are wrong, so prove to me your estimate is good. They make very few assumptions, e.g. $E[X^2] < \infty$, construct an estimator (e.g., the median of means of disjoint subsets of the data), and prove a guarantee such as $E[(\hat{\theta} - \theta)^2] \le \epsilon$ under any hypothetical true θ.

Linear Regression. Machine Learning CSE546, Kevin Jamieson, University of Washington. Oct 3, 2017.

The regression problem. Given past sales data on zillow.com, predict y = house sale price from x = {# sq. ft., zip code, date of sale, etc.}. Training data: $\{(x_i, y_i)\}_{i=1}^{n}$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. [Scatter plot: Sale Price vs. # square feet.]

The regression problem. Given past sales data on zillow.com, predict y = house sale price from x = {# sq. ft., zip code, date of sale, etc.}. Training data: $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. Hypothesis: linear, $y_i \approx x_i^T w$. Loss: least squares, $\min_w \sum_{i=1}^{n} (y_i - x_i^T w)^2$. [Scatter plot: Sale Price vs. # square feet, with the best linear fit.]

The regression problem in matrix notation: stack the data as $y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}$ and $X = \begin{bmatrix} x_1^T \\ \vdots \\ x_n^T \end{bmatrix}$, so that $\hat{w}_{LS} = \arg\min_w \sum_{i=1}^{n} (y_i - x_i^T w)^2 = \arg\min_w (y - Xw)^T (y - Xw)$.

The regression problem in matrix notation: $\hat{w}_{LS} = \arg\min_w \|y - Xw\|_2^2 = \arg\min_w (y - Xw)^T (y - Xw)$.

The regression problem in matrix notation: $\hat{w}_{LS} = \arg\min_w \|y - Xw\|_2^2 = (X^T X)^{-1} X^T y$. What about an offset? $\hat{w}_{LS}, \hat{b}_{LS} = \arg\min_{w,b} \sum_{i=1}^{n} (y_i - (x_i^T w + b))^2 = \arg\min_{w,b} \|y - (Xw + \mathbf{1}b)\|_2^2$.
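A minimal sketch of the closed-form solution on synthetic data (dimensions and noise level are hypothetical); np.linalg.solve avoids forming the inverse explicitly, and np.linalg.lstsq serves as a cross-check.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Closed form: w_hat = (X^T X)^{-1} X^T y, computed by solving the normal equations
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with the library least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_hat)
print(w_lstsq)
```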

Dealing with an offset: $\hat{w}_{LS}, \hat{b}_{LS} = \arg\min_{w,b} \|y - (Xw + \mathbf{1}b)\|_2^2$. Setting the gradient to zero gives the normal equations $X^T X \hat{w}_{LS} + \hat{b}_{LS} X^T \mathbf{1} = X^T y$ and $\mathbf{1}^T X \hat{w}_{LS} + \hat{b}_{LS}\, \mathbf{1}^T \mathbf{1} = \mathbf{1}^T y$. If $X^T \mathbf{1} = 0$ (i.e., if each feature is mean-zero), then $\hat{w}_{LS} = (X^T X)^{-1} X^T y$ and $\hat{b}_{LS} = \frac{1}{n}\sum_{i=1}^{n} y_i$.
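A sketch of the centering trick on synthetic data (all names and numbers here are illustrative): subtract each feature's mean so that $X^T \mathbf{1} = 0$, solve for the weights, then recover the offset from the means.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 2
X = rng.normal(loc=5.0, size=(n, d))                     # features deliberately not mean-zero
y = X @ np.array([2.0, -1.0]) + 3.0 + 0.1 * rng.normal(size=n)

x_mean, y_mean = X.mean(axis=0), y.mean()
Xc = X - x_mean                                          # now Xc^T 1 = 0

w_hat = np.linalg.solve(Xc.T @ Xc, Xc.T @ y)             # weights from centered features
b_hat = y_mean - x_mean @ w_hat                          # offset w.r.t. the original features

print(w_hat, b_hat)                                      # roughly [2, -1] and 3
```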

The regression problem in matrix notation: $\hat{w}_{LS} = \arg\min_w \|y - Xw\|_2^2 = (X^T X)^{-1} X^T y$. But why least squares? Consider $y_i = x_i^T w + \epsilon_i$ where the $\epsilon_i$ are i.i.d. $\mathcal{N}(0, \sigma^2)$. Then $P(y \mid x, w, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(y - x^T w)^2}{2\sigma^2}\right)$.

Maximizing the log-likelihood. Maximize: $\log P(D \mid w, \sigma) = \log\!\left[\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} \prod_{i=1}^{n} e^{-\frac{(y_i - x_i^T w)^2}{2\sigma^2}}\right] = -n\log(\sigma\sqrt{2\pi}) - \sum_{i=1}^{n} \frac{(y_i - x_i^T w)^2}{2\sigma^2}$, so maximizing over w is the same as minimizing the sum of squared residuals.

MLE is LS under the linear model. $\hat{w}_{LS} = \arg\min_w \sum_{i=1}^{n} (y_i - x_i^T w)^2$ and $\hat{w}_{MLE} = \arg\max_w P(D \mid w, \sigma)$. If $y_i = x_i^T w + \epsilon_i$ with the $\epsilon_i$ i.i.d. $\mathcal{N}(0, \sigma^2)$, then $\hat{w}_{LS} = \hat{w}_{MLE} = (X^T X)^{-1} X^T y$.
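A quick numerical confirmation on synthetic data (the setup is hypothetical): minimizing the Gaussian negative log-likelihood over w returns the same vector as the least-squares closed form.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, d, sigma = 100, 3, 0.5
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, 2.0, -1.0]) + sigma * rng.normal(size=n)

def neg_log_likelihood(w):
    # Terms of the Gaussian log-likelihood that do not involve w are constant
    resid = y - X @ w
    return np.sum(resid ** 2) / (2 * sigma ** 2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(d)).x
w_ls = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(w_mle, w_ls, atol=1e-4))  # True
```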

The regression problem. Given past sales data on zillow.com, predict y = house sale price from x = {# sq. ft., zip code, date of sale, etc.}. Training data: $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. Hypothesis: linear, $y_i \approx x_i^T w$. Loss: least squares, $\min_w \sum_{i=1}^{n} (y_i - x_i^T w)^2$. [Scatter plot: Sale Price vs. date of sale, with the best linear fit.]

The regression problem. Training data: $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. Hypothesis: linear, $y_i \approx x_i^T w$. Loss: least squares, $\min_w \sum_{i=1}^{n} (y_i - x_i^T w)^2$. Transformed data: $h : \mathbb{R}^d \to \mathbb{R}^p$ maps the original features to a rich, possibly high-dimensional space, $h(x) = \begin{bmatrix} h_1(x) \\ h_2(x) \\ \vdots \\ h_p(x) \end{bmatrix}$. In d = 1 one can take $h(x) = \begin{bmatrix} x \\ x^2 \\ \vdots \\ x^p \end{bmatrix}$; for d > 1, generate $\{u_j\}_{j=1}^{p} \subset \mathbb{R}^d$ and use, e.g., $h_j(x) = \frac{1}{1 + \exp(u_j^T x)}$, $h_j(x) = (u_j^T x)^2$, or $h_j(x) = \cos(u_j^T x)$.
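A sketch of two such feature maps (the particular choices, degrees, and data below are illustrative, not from the slides): a polynomial map for d = 1 and random cosine features for d > 1, each followed by ordinary least squares on the transformed features.

```python
import numpy as np

def poly_features(x, p):
    """d = 1: h(x) = [x, x^2, ..., x^p] for each scalar x."""
    return np.vander(x, N=p + 1, increasing=True)[:, 1:]

def cos_features(X, p, rng):
    """d > 1: h_j(x) = cos(u_j^T x) with random directions u_j."""
    U = rng.normal(size=(X.shape[1], p))
    return np.cos(X @ U)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-2, 2, size=50))
y = np.sin(2 * x) + 0.1 * rng.normal(size=50)   # a nonlinear target

H = poly_features(x, p=5)                       # transformed data h(x_i)
w, *_ = np.linalg.lstsq(H, y, rcond=None)       # least squares in the transformed space
print("training MSE:", np.mean((H @ w - y) ** 2))
```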

The regression problem. Training data: $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. Transformed data: $h(x) = \begin{bmatrix} h_1(x) \\ h_2(x) \\ \vdots \\ h_p(x) \end{bmatrix}$. Hypothesis: linear, $y_i \approx h(x_i)^T w$ with $w \in \mathbb{R}^p$. Loss: least squares, $\min_w \sum_{i=1}^{n} (y_i - h(x_i)^T w)^2$.

The regression problem. The same transformed-feature hypothesis $y_i \approx h(x_i)^T w$, $w \in \mathbb{R}^p$, and least-squares loss $\min_w \sum_{i=1}^{n} (y_i - h(x_i)^T w)^2$ as above. [Plot: Sale Price vs. date of sale, with the best linear fit.]

The regression problem, same setup. [Plot: Sale Price vs. date of sale, with a small-p fit.]

The regression problem, same setup. [Plot: Sale Price vs. date of sale, with a large-p fit.] What's going on here?
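What the last two plots are hinting at is model complexity: too small a p underfits, while a very large p can drive the training error to nearly zero by fitting the noise, at the cost of error on new data. A sketch on synthetic data (degrees and sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def poly(x, p):
    return np.vander(x, N=p + 1, increasing=True)   # [1, x, ..., x^p]

f = lambda x: np.sin(4 * x)
x_train = np.sort(rng.uniform(-1, 1, size=30))
x_test = np.sort(rng.uniform(-1, 1, size=200))
y_train = f(x_train) + 0.2 * rng.normal(size=x_train.size)
y_test = f(x_test) + 0.2 * rng.normal(size=x_test.size)

for p in (1, 5, 25):
    w, *_ = np.linalg.lstsq(poly(x_train, p), y_train, rcond=None)
    train_mse = np.mean((poly(x_train, p) @ w - y_train) ** 2)
    test_mse = np.mean((poly(x_test, p) @ w - y_test) ** 2)
    print(f"p={p:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```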