Announcements: Proposals graded.

Bayesian Methods. Machine Learning CSE546, Kevin Jamieson, University of Washington. November 1, 2018.

MLE Recap - coin flips. Data: a sequence D = (HHTHT...), k heads out of n flips. Hypothesis: P(Heads) = θ, P(Tails) = 1 - θ, so P(D | θ) = θ^k (1 - θ)^{n-k}. Maximum likelihood estimation (MLE): choose the θ that maximizes the probability of the observed data, θ̂_MLE = arg max_θ P(D | θ) = arg max_θ log P(D | θ), which gives θ̂_MLE = k/n.
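
As a quick sanity check (not from the slides), a minimal Python sketch that compares a numerical maximizer of the log-likelihood against the closed form k/n:

    import numpy as np
    from scipy.optimize import minimize_scalar

    n, k = 25, 23   # k heads out of n flips (toy numbers)

    def neg_log_likelihood(theta):
        # negative log of theta^k (1 - theta)^(n - k), dropping the constant binomial coefficient
        return -(k * np.log(theta) + (n - k) * np.log(1 - theta))

    res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
    print(res.x, k / n)   # the numerical maximizer agrees with the closed-form MLE k/n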

What about a prior? Billionaire: "Wait, I know that the coin is close to 50-50. What can you do for me now?" You say: "I can learn it the Bayesian way."

Bayesian Learning. Use Bayes' rule: P(θ | D) = P(D | θ) P(θ) / P(D), or equivalently P(θ | D) ∝ P(D | θ) P(θ).

Bayesian Learning for Coins. The likelihood function is simply Binomial: P(D | θ) = θ^k (1 - θ)^{n-k}. What about the prior? It represents expert knowledge. Conjugate priors give a closed-form representation of the posterior; for the Binomial likelihood, the conjugate prior is the Beta distribution.

Beta prior distribution. P(θ) = Beta(β_H, β_T) ∝ θ^{β_H - 1} (1 - θ)^{β_T - 1}. Mean: β_H / (β_H + β_T). Mode: (β_H - 1) / (β_H + β_T - 2). [Figure: density plots of Beta(2, 3) and Beta(20, 30).] Likelihood function: P(D | θ) = θ^k (1 - θ)^{n-k}. Posterior: P(θ | D) ∝ θ^{k + β_H - 1} (1 - θ)^{n - k + β_T - 1}.

Posterior distribution. Prior: Beta(β_H, β_T). Data: k heads and (n - k) tails. Posterior distribution: P(θ | D) = Beta(k + β_H, (n - k) + β_T). [Figure: prior P(θ) and posterior P(θ | D) for β_H = β_T = 1, β_H = β_T = 10, and β_H = β_T = 50, with k = 23, n = 25.]
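
The conjugate update is just parameter arithmetic; a minimal sketch (assuming scipy) for the k = 23, n = 25 example in the figure:

    from scipy.stats import beta

    k, n = 23, 25
    for bH, bT in [(1, 1), (10, 10), (50, 50)]:
        post = beta(k + bH, (n - k) + bT)                 # posterior Beta(k + beta_H, (n - k) + beta_T)
        print(bH, bT, post.mean(), post.interval(0.95))   # posterior mean and 95% credible interval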

Using the Bayesian posterior. Posterior distribution: P(θ | D) = Beta(k + β_H, (n - k) + β_T). Bayesian inference: estimate the mean E[θ] = ∫_0^1 θ P(θ | D) dθ, or estimate an arbitrary function f via E[f(θ)] = ∫_0^1 f(θ) P(θ | D) dθ. For arbitrary f the integral is often hard to compute.
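
When the integral has no closed form, a standard workaround is Monte Carlo over posterior samples; a small sketch with an arbitrary example choice of f (the log-odds):

    import numpy as np
    from scipy.stats import beta

    k, n, bH, bT = 23, 25, 10, 10
    samples = beta(k + bH, (n - k) + bT).rvs(size=100_000, random_state=0)

    f = lambda t: np.log(t / (1 - t))    # example f(theta): log-odds of heads
    print(f(samples).mean())             # Monte Carlo estimate of E[f(theta) | D]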

MAP: Maximum a posteriori approximation. P(θ | D) = Beta(k + β_H, (n - k) + β_T). As more data is observed, the Beta posterior becomes more concentrated. MAP: use the most likely parameter, θ̂_MAP = arg max_θ P(θ | D).

MAP for the Beta distribution. P(θ | D) ∝ θ^{k + β_H - 1} (1 - θ)^{n - k + β_T - 1}. MAP: use the most likely parameter, θ̂_MAP = arg max_θ P(θ | D) = (k + β_H - 1)/(n + β_H + β_T - 2). The Beta prior is equivalent to extra coin flips. As n → ∞, the prior is forgotten; but for small sample sizes, the prior is important!
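
To make the "extra coin flips" interpretation concrete, a tiny sketch comparing the MLE, the MAP estimate, and the posterior mean on made-up numbers:

    k, n = 3, 5          # small sample, so the prior matters
    bH, bT = 50, 50      # strong prior belief that the coin is near fair

    mle = k / n                                  # 0.6
    map_est = (k + bH - 1) / (n + bH + bT - 2)   # pulled toward 0.5 by the "extra flips"
    post_mean = (k + bH) / (n + bH + bT)
    print(mle, map_est, post_mean)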

Bayesian vs. Frequentist. Data: D. Frequentists treat the unknown θ as fixed and the data D as random, and construct an estimator θ̂ = t(D). Bayesians treat the data D as fixed and the unknown θ as random, and reason with the posterior P(θ | D).

Recap for Bayesian learning. Bayesians are optimists: if we model it correctly, we quantify uncertainty exactly, and the posterior answers all questions simultaneously. This assumes one can accurately model the observations and their link to the unknown parameter θ, p(x | θ), as well as the distribution and structure of the unknown θ, p(θ). Frequentists are pessimists: all models are wrong, so prove to me your estimate is good. Each question is answered with a separately analyzed estimator that makes very few assumptions, e.g. E[X²] < ∞, and constructs an estimator (e.g., the median of means of disjoint subsets of the data).
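
For the parenthetical example, a minimal sketch (my own illustration) of a median-of-means estimator; the block count is an arbitrary choice:

    import numpy as np

    def median_of_means(x, num_blocks=10, seed=0):
        # Shuffle, split into disjoint blocks, average each block, take the median of the block means.
        x = np.random.default_rng(seed).permutation(np.asarray(x))
        return np.median([b.mean() for b in np.array_split(x, num_blocks)])

    data = np.random.default_rng(1).standard_t(df=2, size=1000)   # heavy-tailed data
    print(data.mean(), median_of_means(data))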

Nearest Neighbor. Machine Learning CSE546, Kevin Jamieson, University of Washington. November 1, 2018.

Some data, Bayes Classifier. [Figure: training data with true labels +1 and -1, together with the predicted labels of the optimal Bayes classifier, whose decision boundary is the set where P(Y = 1 | X = x) = 1/2. Figures stolen from Hastie et al.]

Linear Decision Boundary. Learned: a linear decision boundary x^T w + b = 0. [Figure: training data with true labels +1 / -1 and the predicted labels on either side of the boundary. Figures stolen from Hastie et al.]

15 Nearest Neighbor Boundary. Learned: the 15-nearest-neighbor decision boundary (majority vote). [Figure: training data with true labels +1 / -1 and the predicted labels.]

1 Nearest Neighbor Boundary. Learned: the 1-nearest-neighbor decision boundary (majority vote). [Figure: training data with true labels +1 / -1 and the predicted labels.]

k-Nearest Neighbor Error. Choosing k is a bias-variance tradeoff. As k → ∞ (while k/n → 0): bias is the best possible, since the prediction converges to the Bayes rule, and variance is small, since each prediction averages many labels. As k → 1: variance is large, since each prediction depends on a single noisy label, and the asymptotic error is only guaranteed to be within a factor of two of the Bayes error (next slides).
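
One way to see the tradeoff empirically: a sketch using scikit-learn on synthetic two-moons data (my own illustration, not from the slides):

    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_moons(n_samples=2000, noise=0.3, random_state=0)
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)

    for k in [1, 5, 15, 51, 201]:
        clf = KNeighborsClassifier(n_neighbors=k).fit(Xtr, ytr)
        # k = 1 drives training error to 0 (variance); very large k over-smooths (bias).
        print(k, 1 - clf.score(Xtr, ytr), 1 - clf.score(Xte, yte))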

Notable distance metrics (and their level sets): the L2 norm, the L1 norm (taxi-cab), the Mahalanobis distance, and the L-infinity (max) norm.

1 nearest neighbor. One can draw the nearest-neighbor regions in input space. Compare Dist(x_i, x_j) = (x_{i,1} - x_{j,1})² + (x_{i,2} - x_{j,2})² with Dist(x_i, x_j) = (x_{i,1} - x_{j,1})² + (3x_{i,2} - 3x_{j,2})²: the relative scalings in the distance metric affect the shapes of the regions.

1 nearest neighbor guarantee. Given {(x_i, y_i)}_{i=1}^n with x_i ∈ R^d and y_i ∈ {1, ..., k}: as n → ∞, assume the x_i's become dense in R^d and P(Y = j | X = x) is smooth, so that as x_a → x_b we have P(Y_a = j) → P(Y_b = j) for all j. If p_ℓ = P(Y_a = ℓ) = P(Y_b = ℓ) and ℓ* = arg max_{ℓ=1,...,k} p_ℓ, then the Bayes error is 1 - p_{ℓ*}, while the 1-nearest-neighbor error is P(Y_a ≠ Y_b) = Σ_{ℓ=1}^k P(Y_a = ℓ, Y_b ≠ ℓ) = Σ_{ℓ=1}^k p_ℓ (1 - p_ℓ) ≤ 2(1 - p_{ℓ*}) - (k/(k-1))(1 - p_{ℓ*})². So as n → ∞, the 1-NN rule's error is at most twice the Bayes error! [Cover, Hart, 1967]
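
A quick simulation of this guarantee on a toy problem of my own where P(Y = 1 | X = x) = 0.8 everywhere, so the Bayes error is exactly 0.2:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    n = 20_000
    # Bayes error = 0.2; the asymptotic 1-NN error is 2 * 0.8 * 0.2 = 0.32.
    Xtr, Xte = rng.uniform(size=(n, 2)), rng.uniform(size=(n, 2))
    ytr, yte = (rng.uniform(size=n) < 0.8).astype(int), (rng.uniform(size=n) < 0.8).astype(int)

    err = 1 - KNeighborsClassifier(n_neighbors=1).fit(Xtr, ytr).score(Xte, yte)
    print(err)   # roughly 0.32: above the Bayes error 0.2 but below twice the Bayes error 0.4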

Curse of dimensionality, Ex. 1. X is uniformly distributed over [0, 1]^p. What is P(X ∈ [0, r]^p) for a sub-cube of side length r? (Answer: r^p, which shrinks exponentially in the dimension p for any r < 1.)

Curse of dimensionality, Ex. 2. {X_i}_{i=1}^n are uniformly distributed over [-0.5, 0.5]^p. What is the median distance from a point at the origin to its nearest neighbor?
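
The slide leaves this as an exercise; a Monte Carlo sketch (my own) shows the median nearest-neighbor distance growing with p when n is held fixed:

    import numpy as np

    rng = np.random.default_rng(0)
    n, trials = 500, 200
    for p in [1, 2, 5, 10, 20, 50]:
        dists = [np.linalg.norm(rng.uniform(-0.5, 0.5, size=(n, p)), axis=1).min()
                 for _ in range(trials)]
        print(p, np.median(dists))   # with n = 500 fixed, the "nearest" neighbor drifts far away as p grows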

Nearest neighbor regression. Given {(x_i, y_i)}_{i=1}^n, let N_k(x_0) denote the k nearest neighbors of x_0 and predict f̂(x_0) = (1/k) Σ_{x_i ∈ N_k(x_0)} y_i.

Nearest neighbor regression. Why are far-away neighbors weighted the same as close neighbors? Kernel smoothing: pick a kernel K(x, y) and replace the plain average with the weighted average f̂(x_0) = Σ_{i=1}^n K(x_0, x_i) y_i / Σ_{i=1}^n K(x_0, x_i).

Nearest neighbor regression. And why just average the neighbors' labels at all? Local linear regression: fit a weighted least-squares line around x_0, (w(x_0), b(x_0)) = arg min_{w,b} Σ_{i=1}^n K(x_0, x_i) (y_i - (b + w^T x_i))², and predict f̂(x_0) = b(x_0) + w(x_0)^T x_0.
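
A compact numpy sketch of the three estimators above (k-NN average, kernel-weighted average, local linear fit) on made-up 1-d data, using a Gaussian kernel:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0, 1, size=200))
    y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(200)

    def knn_average(x0, k=10):
        idx = np.argsort(np.abs(x - x0))[:k]            # k nearest neighbors of x0
        return y[idx].mean()

    def kernel_smooth(x0, sigma=0.05):
        w = np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))   # Gaussian kernel weights
        return w @ y / w.sum()

    def local_linear(x0, sigma=0.05):
        w = np.sqrt(np.exp(-(x - x0) ** 2 / (2 * sigma ** 2)))
        A = np.stack([np.ones_like(x), x], axis=1)      # design matrix [1, x]
        (b, slope), *_ = np.linalg.lstsq(A * w[:, None], y * w, rcond=None)
        return b + slope * x0

    x0 = 0.3
    print(knn_average(x0), kernel_smooth(x0), local_linear(x0), np.sin(2 * np.pi * x0))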

Nearest Neighbor Overview. Very simple to explain and implement. No training! But finding the nearest neighbors of a test point in a large dataset can be computationally demanding (kd-trees help). You can use other forms of distance (not just Euclidean). Smoothing with kernels and local linear regression can improve performance (at the cost of higher variance). With a lot of data, local methods have strong, simple theoretical guarantees; without a lot of data, neighborhoods aren't local and the methods suffer.
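
For the kd-tree remark, a minimal sketch (not course code) of building and querying one with scikit-learn's KDTree:

    import numpy as np
    from sklearn.neighbors import KDTree

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100_000, 3))   # kd-trees work best in low dimensions
    tree = KDTree(X)                        # built once up front

    dist, ind = tree.query(rng.standard_normal((5, 3)), k=15)   # 15 nearest neighbors of 5 query points
    print(ind.shape, dist.shape)            # (5, 15) and (5, 15)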

Kernels. Machine Learning CSE546, Kevin Jamieson, University of Washington. October 26, 2018.

Machine Learning Problems. We have a bunch of i.i.d. data of the form {(x_i, y_i)}_{i=1}^n with x_i ∈ R^d, y_i ∈ R. Learning a model's parameters means minimizing Σ_{i=1}^n ℓ_i(w), where each ℓ_i(w) is convex. Hinge loss: ℓ_i(w) = max{0, 1 - y_i x_i^T w}. Logistic loss: ℓ_i(w) = log(1 + exp(-y_i x_i^T w)). Squared error loss: ℓ_i(w) = (y_i - x_i^T w)². All in terms of inner products! Even nearest neighbor can use inner products!

What if the data is not linearly separable? Use features of features of features of features: φ(x) : R^d → R^p. The feature space can get really large really quickly!

Dot-product of polynomials of degree exactly d (for u, v ∈ R²). d = 1: φ(u) = [u_1, u_2]^T, so ⟨φ(u), φ(v)⟩ = u_1 v_1 + u_2 v_2. d = 2: φ(u) = [u_1², u_2², u_1 u_2, u_2 u_1]^T, so ⟨φ(u), φ(v)⟩ = u_1² v_1² + u_2² v_2² + 2 u_1 u_2 v_1 v_2 = (u^T v)². General d: the dimension of φ(u) is roughly p^d if u ∈ R^p.
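
A quick numerical check of the d = 2 identity above (my own sanity check):

    import numpy as np

    u, v = np.array([1.5, -2.0]), np.array([0.3, 0.7])
    phi = lambda z: np.array([z[0] ** 2, z[1] ** 2, z[0] * z[1], z[1] * z[0]])
    print(phi(u) @ phi(v), (u @ v) ** 2)   # equal: the kernel (u.v)^2 never builds phi explicitly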

Kernel Trick. Ridge regression: ŵ = arg min_w Σ_{i=1}^n (y_i - x_i^T w)² + λ‖w‖₂². There exists an α ∈ R^n such that ŵ = Σ_{i=1}^n α_i x_i. Why? Substituting, α̂ = arg min_α Σ_{i=1}^n (y_i - Σ_{j=1}^n α_j ⟨x_j, x_i⟩)² + λ Σ_{i=1}^n Σ_{j=1}^n α_i α_j ⟨x_i, x_j⟩ = arg min_α Σ_{i=1}^n (y_i - Σ_{j=1}^n α_j K(x_i, x_j))² + λ Σ_{i=1}^n Σ_{j=1}^n α_i α_j K(x_i, x_j) = arg min_α ‖y - Kα‖₂² + λ α^T K α, where K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ and K is the n × n matrix with entries K(x_i, x_j).
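
Setting the gradient of the last objective to zero (and using K ≻ 0, as the next slide notes) gives the closed form α̂ = (K + λI)^{-1} y; a short numpy sketch of kernel ridge regression with an RBF kernel, where the bandwidth and λ are arbitrary illustrative choices:

    import numpy as np

    def rbf_kernel(A, B, sigma=0.2):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
        return np.exp(-sq / (2 * sigma ** 2))

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, size=(60, 1))
    y = np.sin(2 * np.pi * X[:, 0]) + 0.2 * rng.standard_normal(60)

    lam = 1e-2
    alpha = np.linalg.solve(rbf_kernel(X, X) + lam * np.eye(len(X)), y)   # alpha-hat = (K + lam*I)^{-1} y

    Xtest = np.linspace(0, 1, 5)[:, None]
    print(rbf_kernel(Xtest, X) @ alpha)    # predictions f-hat(x) = sum_i alpha_i K(x, x_i)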

Why regularization? Typically K ≻ 0, so what happens if λ = 0? Then α̂ = arg min_α ‖y - Kα‖₂² + λ α^T K α reduces to α̂ = K^{-1} y: unregularized kernel least squares can (over)fit any data!

Common kernels. Polynomials of degree exactly d. Polynomials of degree up to d. The Gaussian (squared exponential) kernel K(u, v) = exp(-‖u - v‖² / (2σ²)). Sigmoid.

Mercer's Theorem. When do we have a valid kernel K(x, x′)? Definition 1: when it is an inner product, K(x, x′) = ⟨φ(x), φ(x′)⟩. Mercer's theorem: K(x, x′) is a valid kernel if and only if K is positive semi-definite, in the following sense: ∫∫ h(x) K(x, x′) h(x′) dx dx′ ≥ 0 for all h : R^d → R with ∫ h(x)² dx < ∞.

RBF Kernel. K(u, v) = exp(-‖u - v‖₂² / (2σ²)). Note that this is like placing weighted bumps on each point, as in kernel smoothing, but now we learn the weights.

RBF Kernel. K(u, v) = exp(-‖u - v‖₂² / (2σ²)). The bandwidth σ has an enormous effect on the fit f̂(x) = Σ_{i=1}^n α̂_i K(x_i, x). [Figure: fits for several (σ, λ) combinations, e.g. σ ∈ {10^-2, 10^-1, 10^0} with λ = 10^-4.]
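
A self-contained numpy sketch of the bandwidth effect (illustrative data and values): a tiny σ lets kernel ridge interpolate the noise, while a large σ oversmooths:

    import numpy as np

    def rbf_kernel(A, B, sigma):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma ** 2))

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, size=(60, 1))
    y = np.sin(2 * np.pi * X[:, 0]) + 0.2 * rng.standard_normal(60)

    lam = 1e-4
    for sigma in [1e-2, 1e-1, 1e0]:
        K = rbf_kernel(X, X, sigma)
        alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
        print(sigma, np.mean((K @ alpha - y) ** 2))   # tiny sigma fits the noise; large sigma oversmooths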

RBF kernel and random features. Useful identities: 2 cos(α) cos(β) = cos(α + β) + cos(α - β), and e^{jz} = cos(z) + j sin(z). Recall HW1, where we used the feature map φ(x) = [√2 cos(w_1^T x + b_1), ..., √2 cos(w_p^T x + b_p)]^T with w_k ~ N(0, σ² I) and b_k ~ Uniform(0, 2π). Then E[(1/p) φ(x)^T φ(y)] = (1/p) Σ_{k=1}^p E[2 cos(w_k^T x + b_k) cos(w_k^T y + b_k)] = E_{w,b}[2 cos(w^T x + b) cos(w^T y + b)] = e^{-σ²‖x - y‖²/2}. [Rahimi, Recht, NIPS 2007; NIPS Test of Time Award, 2018]
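
A small numerical check of this random-feature approximation; the N(0, σ²I) draw for w and the Uniform(0, 2π) draw for b follow the reconstruction above and are assumptions about the exact HW1 setup:

    import numpy as np

    rng = np.random.default_rng(0)
    d, p, sigma = 5, 20_000, 0.5
    x, y = rng.standard_normal(d), rng.standard_normal(d)

    W = rng.normal(0, sigma, size=(p, d))       # w_k ~ N(0, sigma^2 I)
    b = rng.uniform(0, 2 * np.pi, size=p)       # b_k ~ Uniform(0, 2*pi)
    phi = lambda z: np.sqrt(2) * np.cos(W @ z + b)

    print(phi(x) @ phi(y) / p,                                    # random-feature approximation
          np.exp(-sigma ** 2 * np.linalg.norm(x - y) ** 2 / 2))   # exact RBF kernel value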

RBF Classification (kernelized SVM). ŵ, b̂ = arg min_{w,b} Σ_{i=1}^n max{0, 1 - y_i (b + x_i^T w)} + λ‖w‖₂², which with w = Σ_j α_j x_j becomes min_{α,b} Σ_{i=1}^n max{0, 1 - y_i (b + Σ_{j=1}^n α_j ⟨x_i, x_j⟩)} + λ Σ_{i,j=1}^n α_i α_j ⟨x_i, x_j⟩, where each inner product ⟨x_i, x_j⟩ can be replaced by K(x_i, x_j), e.g. the RBF kernel.

Wait, infinite dimensions? Isn't everything separable there? How are we not overfitting? Regularization! (Fat-shattering dimension scales like (R/margin)².)

String Kernels. Example from Efron and Hastie, 2016. Amino acid sequences of different lengths, x1 and x2: represent each by the counts of all subsequences of length 3 (out of the 20 possible amino acids), and take inner products of those count vectors.
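
A minimal sketch of such a substring-count ("spectrum") kernel; the sequences below are made up:

    from collections import Counter

    def spectrum_counts(seq, k=3):
        # counts of all length-k contiguous substrings of the sequence
        return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

    def string_kernel(s1, s2, k=3):
        c1, c2 = spectrum_counts(s1, k), spectrum_counts(s2, k)
        return sum(c1[sub] * c2[sub] for sub in c1)   # inner product of the sparse count vectors

    x1, x2 = "MKVLAAGVKVLAA", "GVKVLAAMKV"            # made-up amino acid sequences
    print(string_kernel(x1, x2))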