Announcements. Proposals graded


1 Announcements Proposals graded Kevin Jamieson

2 Bayesian Methods. Machine Learning CSE546, Kevin Jamieson, University of Washington. November 1, 2018. Kevin Jamieson 2

3 MLE Recap - coin flips. Data: sequence D = (HHTHT...), k heads out of n flips. Hypothesis: P(Heads) = θ, P(Tails) = 1 - θ, so P(D | θ) = θ^k (1 - θ)^(n-k). Maximum likelihood estimation (MLE): choose the θ that maximizes the probability of the observed data: θ̂_MLE = arg max_θ P(D | θ) = arg max_θ log P(D | θ), which gives θ̂_MLE = k / n. 2018 Kevin Jamieson 3
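To make the recap concrete, here is a minimal sketch (not from the slides) that computes θ̂_MLE = k/n from a sequence of flips encoded as 0/1; the data below is the slide's HHTHT example.

```python
import numpy as np

def mle_coin(flips):
    """MLE for P(Heads): flips is an array of 0/1 outcomes (1 = heads); theta_hat = k / n."""
    flips = np.asarray(flips, dtype=float)
    return flips.mean()

flips = np.array([1, 1, 0, 1, 0])   # HHTHT: k = 3 heads out of n = 5 flips
print(mle_coin(flips))              # 0.6
```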

4 What about a prior? Billionaire: Wait, I know that the coin is close to 50-50. What can you do for me now? You say: I can learn it the Bayesian way. 2018 Kevin Jamieson 4

5 Bayesian Learning. Use Bayes rule: P(θ | D) = P(D | θ) P(θ) / P(D). Or equivalently: P(θ | D) ∝ P(D | θ) P(θ). 2018 Kevin Jamieson 5

6 Bayesian Learning for Coins. The likelihood function is simply Binomial: P(D | θ) = θ^k (1 - θ)^(n-k). What about the prior? It represents expert knowledge. Conjugate priors give a closed-form representation of the posterior; for the Binomial, the conjugate prior is the Beta distribution. 2018 Kevin Jamieson 6

7 Beta prior distribution: P(θ) = Beta(β_H, β_T) ∝ θ^(β_H - 1) (1 - θ)^(β_T - 1). Mean: β_H / (β_H + β_T). Mode: (β_H - 1) / (β_H + β_T - 2). [Figure: densities of Beta(2,3) and Beta(20,30).] Likelihood function: P(D | θ) = θ^k (1 - θ)^(n-k). Posterior: proportional to prior times likelihood. 2018 Kevin Jamieson 7

8 Posterior distribution. Prior: Beta(β_H, β_T). Data: k heads and (n - k) tails. Posterior distribution: P(θ | D) = Beta(k + β_H, (n - k) + β_T). [Figure: prior P(θ) and posterior P(θ | D) for β_H = β_T = 1, β_H = β_T = 10, and β_H = β_T = 50, with k = 23 heads.] 2018 Kevin Jamieson 8
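A hedged sketch of the conjugate update stated above, using scipy's Beta distribution; the prior strengths and the data size below are illustrative (the slide's value of n is truncated in the transcription).

```python
from scipy.stats import beta

def beta_posterior(k, n, beta_H=1.0, beta_T=1.0):
    """Posterior Beta(k + beta_H, (n - k) + beta_T) after observing k heads in n flips."""
    return beta(k + beta_H, (n - k) + beta_T)

post = beta_posterior(k=23, n=50, beta_H=10, beta_T=10)   # illustrative values
print(post.mean())            # posterior mean of theta
print(post.interval(0.95))    # central 95% credible interval
```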

9 Using the Bayesian posterior. Posterior distribution: P(θ | D) = Beta(k + β_H, (n - k) + β_T). Bayesian inference: estimate the mean E[θ] = ∫_0^1 θ P(θ | D) dθ, or estimate an arbitrary function f of θ. For arbitrary f the integral is often hard to compute. 2018 Kevin Jamieson 9

10 MAP: Maximum a posteriori approximation. P(θ | D) = Beta(k + β_H, (n - k) + β_T). As more data is observed, the Beta posterior becomes more concentrated. MAP: use the most likely parameter, θ̂_MAP = arg max_θ P(θ | D). 2018 Kevin Jamieson 10

11 MAP for the Beta distribution. P(θ | D) ∝ θ^(k + β_H - 1) (1 - θ)^(n - k + β_T - 1). MAP: use the most likely parameter, θ̂_MAP = arg max_θ P(θ | D). 2018 Kevin Jamieson 11

12 MAP for the Beta distribution. P(θ | D) ∝ θ^(k + β_H - 1) (1 - θ)^(n - k + β_T - 1). MAP: use the most likely parameter, θ̂_MAP = (k + β_H - 1) / (n + β_H + β_T - 2), the mode of the Beta posterior. The Beta prior is equivalent to extra coin flips. As n → ∞, the prior is forgotten. But for small sample sizes, the prior is important! 2018 Kevin Jamieson 12
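A small sketch of the MAP formula above; the numbers are illustrative. With a uniform prior (β_H = β_T = 1) the MAP estimate reduces to the MLE k/n, while a strong prior pulls it toward 1/2.

```python
def map_coin(k, n, beta_H, beta_T):
    """Mode of the Beta(k + beta_H, (n - k) + beta_T) posterior."""
    return (k + beta_H - 1) / (n + beta_H + beta_T - 2)

print(map_coin(k=3, n=5, beta_H=1, beta_T=1))     # 0.6, same as the MLE
print(map_coin(k=3, n=5, beta_H=50, beta_T=50))   # ~0.505, dominated by the prior
```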

13 Bayesian vs. Frequentist. Data: D. Frequentists treat the unknown θ as fixed and the data D as random: θ̂ = t(D). Bayesians treat the data D as fixed and the unknown θ as random: P(θ | D). 2018 Kevin Jamieson 13

14 Recap for Bayesian learning. Bayesians are optimists: if we model it correctly, we quantify uncertainty exactly, and all questions are answered simultaneously by the posterior probability. This assumes one can accurately model the observations and their link to the unknown parameter θ, p(x | θ), and the distribution/structure of the unknown θ, p(θ). Frequentists are pessimists: all models are wrong, prove to me your estimate is good. Each question is answered with a separately analyzed estimator. Very few assumptions are made, e.g. E[X^2] < ∞, and an estimator is constructed (e.g., median of means of disjoint subsets of the data). 2018 Kevin Jamieson 14

15 Nearest Neighbor. Machine Learning CSE546, Kevin Jamieson, University of Washington. November 1, 2018. Kevin Jamieson 15

16 Some data, Bayes Classifier. Training data: true label +1, true label -1. Optimal Bayes classifier: decision boundary where P(Y = 1 | X = x) = 1/2, with predicted label +1 on one side and -1 on the other. Figures stolen from Hastie et al. 2018 Kevin Jamieson 16

17 Linear Decision Boundary. Training data: true label +1, true label -1. Learned: linear decision boundary x^T w + b = 0, with predicted label +1 on one side and -1 on the other. Figures stolen from Hastie et al. 2018 Kevin Jamieson 17

18 15-Nearest Neighbor Boundary. Training data: true label +1, true label -1. Learned: 15-nearest-neighbor decision boundary (majority vote), with predicted label +1 on one side and -1 on the other. 2018 Kevin Jamieson 18

19 1-Nearest Neighbor Boundary. Training data: true label +1, true label -1. Learned: 1-nearest-neighbor decision boundary (majority vote), with predicted label +1 on one side and -1 on the other. 2018 Kevin Jamieson 19
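The boundaries above come from majority votes over neighbors. A minimal k-NN classifier sketch follows; the synthetic data is an assumption (the slides use a dataset from Hastie et al.).

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=15):
    """Majority-vote k-NN under Euclidean distance, for labels in {-1, +1}."""
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)     # distance to every training point
        nn_labels = y_train[np.argsort(d)[:k]]      # labels of the k nearest neighbors
        preds.append(1 if nn_labels.sum() >= 0 else -1)
    return np.array(preds)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
print(knn_predict(X, y, X[:5], k=15))   # compare against y[:5]
print(y[:5])
```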

20 k-Nearest Neighbor Error. Bias-variance tradeoff. As k → ∞: bias grows (the prediction tends toward the overall majority label) while variance shrinks. As k → 1: bias shrinks while variance grows. [Figure: error versus k, with a horizontal line marking the best possible (Bayes) error.] 2018 Kevin Jamieson 20

21 Notable distance metrics (and their level sets): L2 norm; L1 norm (taxi-cab); Mahalanobis; L-infinity (max) norm. Kevin Jamieson

22 1-nearest neighbor. One can draw the nearest-neighbor regions in input space. Dist(x_i, x_j) = (x_i,1 - x_j,1)^2 + (x_i,2 - x_j,2)^2 versus Dist(x_i, x_j) = (x_i,1 - x_j,1)^2 + (3 x_i,2 - 3 x_j,2)^2. The relative scalings in the distance metric affect the region shapes. Kevin Jamieson
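A tiny numerical illustration of that last point (the points are made up): rescaling the second coordinate by 3 changes which of two points is the nearest neighbor of the origin.

```python
import numpy as np

def sq_dist(u, v, scale):
    """Squared distance after multiplying each coordinate by the given scale factors."""
    return float(np.sum((scale * (u - v)) ** 2))

x = np.array([0.0, 0.0])
a = np.array([1.0, 0.1])            # far in coordinate 1, close in coordinate 2
b = np.array([0.2, 0.9])            # close in coordinate 1, far in coordinate 2
unscaled = np.array([1.0, 1.0])
scaled   = np.array([1.0, 3.0])     # weight coordinate 2 three times as much

print(sq_dist(x, a, unscaled), sq_dist(x, b, unscaled))   # 1.01 vs 0.85: b is nearer
print(sq_dist(x, a, scaled),   sq_dist(x, b, scaled))     # 1.09 vs 7.33: a is nearer
```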

23 1-nearest neighbor guarantee. Data {(x_i, y_i)}_{i=1}^n with x_i ∈ R^d and y_i ∈ {1, ..., k}. As n → ∞, assume the x_i's become dense in R^d and P(Y = j | X = x) is smooth. Then as x_a → x_b we have P(Y_a = j) → P(Y_b = j) for all j. 2018 Kevin Jamieson 23

24 1-nearest neighbor guarantee. Data {(x_i, y_i)}_{i=1}^n with x_i ∈ R^d and y_i ∈ {1, ..., k}. As n → ∞, assume the x_i's become dense in R^d and P(Y = j | X = x) is smooth, so that as x_a → x_b we have P(Y_a = j) → P(Y_b = j) for all j. If p_ℓ = P(Y_a = ℓ) = P(Y_b = ℓ) and ℓ* = arg max_ℓ p_ℓ, then Bayes error = 1 - p_ℓ*. 2018 Kevin Jamieson 24

25 1-nearest neighbor guarantee. Data {(x_i, y_i)}_{i=1}^n with x_i ∈ R^d and y_i ∈ {1, ..., k}. As n → ∞, assume the x_i's become dense in R^d and P(Y = j | X = x) is smooth, so that as x_a → x_b we have P(Y_a = j) → P(Y_b = j) for all j. If p_ℓ = P(Y_a = ℓ) = P(Y_b = ℓ) and ℓ* = arg max_ℓ p_ℓ, then Bayes error = 1 - p_ℓ*, and the 1-nearest-neighbor error = P(Y_a ≠ Y_b) = Σ_{ℓ=1}^k P(Y_a = ℓ, Y_b ≠ ℓ). 2018 Kevin Jamieson 25

26 1-nearest neighbor guarantee. Data {(x_i, y_i)}_{i=1}^n with x_i ∈ R^d and y_i ∈ {1, ..., k}. As n → ∞, assume the x_i's become dense in R^d and P(Y = j | X = x) is smooth, so that as x_a → x_b we have P(Y_a = j) → P(Y_b = j) for all j. If p_ℓ = P(Y_a = ℓ) = P(Y_b = ℓ) and ℓ* = arg max_ℓ p_ℓ, then Bayes error = 1 - p_ℓ*, and the 1-nearest-neighbor error = P(Y_a ≠ Y_b) = Σ_{ℓ=1}^k P(Y_a = ℓ, Y_b ≠ ℓ) = Σ_{ℓ=1}^k p_ℓ (1 - p_ℓ) ≤ 2(1 - p_ℓ*) - (k/(k-1))(1 - p_ℓ*)^2. As n → ∞, the 1-NN rule's error is at most twice the Bayes error! [Cover, Hart, 1967] 2018 Kevin Jamieson 26
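A quick numerical check of the comparison above, with made-up class probabilities at a single point: the Bayes error is 1 - max_ℓ p_ℓ and the asymptotic 1-NN error Σ_ℓ p_ℓ(1 - p_ℓ) never exceeds twice it.

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])            # illustrative class probabilities at a point
bayes_error = 1.0 - p.max()              # 0.4
nn_error = float(np.sum(p * (1 - p)))    # 0.54
print(bayes_error, nn_error, nn_error <= 2 * bayes_error)   # True
```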

27 Curse of dimensionality, Ex. 1. X is uniformly distributed over [0, 1]^p. For a sub-cube of side length r, what is P(X ∈ [0, r]^p)? 2018 Kevin Jamieson 27

28 Curse of dimensionality, Ex. 2. {X_i}_{i=1}^n are uniformly distributed over [-0.5, 0.5]^p. What is the median distance from a point at the origin to its 1-NN? 2018 Kevin Jamieson 28
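A Monte Carlo sketch of Ex. 2 (the sample sizes and dimensions below are illustrative): as the dimension p grows, the median distance from the origin to its nearest neighbor grows toward the scale of the cube itself, so neighborhoods stop being local.

```python
import numpy as np

def median_nn_distance(n=500, p=2, trials=200, seed=0):
    """Median over trials of the distance from the origin to the closest of n uniform points."""
    rng = np.random.default_rng(seed)
    closest = []
    for _ in range(trials):
        X = rng.uniform(-0.5, 0.5, size=(n, p))
        closest.append(np.linalg.norm(X, axis=1).min())
    return float(np.median(closest))

for p in [1, 2, 10, 100]:
    print(p, median_nn_distance(p=p))
```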

29 Nearest neighbor regression. Data {(x_i, y_i)}_{i=1}^n. Let N_k(x_0) = the k nearest neighbors of x_0; then f̂(x_0) = (1/k) Σ_{x_i ∈ N_k(x_0)} y_i. Kevin Jamieson

30 Nearest neighbor regression. Data {(x_i, y_i)}_{i=1}^n, with N_k(x_0) = the k nearest neighbors of x_0 and f̂(x_0) = (1/k) Σ_{x_i ∈ N_k(x_0)} y_i. Why are far-away neighbors weighted the same as close neighbors? Kernel smoothing with a kernel K(x, y): f̂(x_0) = Σ_{i=1}^n K(x_0, x_i) y_i / Σ_{i=1}^n K(x_0, x_i). Kevin Jamieson
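A minimal sketch of the kernel-smoothing estimator above; the Gaussian kernel and the synthetic 1-d data are assumptions, not from the slides.

```python
import numpy as np

def kernel_smooth(x0, X, y, sigma=0.5):
    """Weighted average of y with Gaussian weights K(x0, x_i) = exp(-||x0 - x_i||^2 / (2 sigma^2))."""
    K = np.exp(-np.sum((X - x0) ** 2, axis=1) / (2 * sigma ** 2))
    return float(np.sum(K * y) / np.sum(K))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
print(kernel_smooth(np.array([1.0]), X, y))   # roughly sin(1) ~ 0.84
```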

31 Nearest neighbor regression. Data {(x_i, y_i)}_{i=1}^n, with N_k(x_0) = the k nearest neighbors of x_0, f̂(x_0) = (1/k) Σ_{x_i ∈ N_k(x_0)} y_i, and the kernel-smoothed estimator f̂(x_0) = Σ_{i=1}^n K(x_0, x_i) y_i / Σ_{i=1}^n K(x_0, x_i). Kevin Jamieson

32 Nearest neighbor regression. Data {(x_i, y_i)}_{i=1}^n, with N_k(x_0) = the k nearest neighbors of x_0, f̂(x_0) = (1/k) Σ_{x_i ∈ N_k(x_0)} y_i, and the kernel-smoothed estimator f̂(x_0) = Σ_{i=1}^n K(x_0, x_i) y_i / Σ_{i=1}^n K(x_0, x_i). Why just average them? Kevin Jamieson

33 Nearest neighbor regression. Data {(x_i, y_i)}_{i=1}^n, with N_k(x_0) = the k nearest neighbors of x_0, f̂(x_0) = (1/k) Σ_{x_i ∈ N_k(x_0)} y_i, and the kernel-smoothed estimator f̂(x_0) = Σ_{i=1}^n K(x_0, x_i) y_i / Σ_{i=1}^n K(x_0, x_i). Local Linear Regression: (w(x_0), b(x_0)) = arg min_{w,b} Σ_{i=1}^n K(x_0, x_i) (y_i - (b + w^T x_i))^2, and f̂(x_0) = b(x_0) + w(x_0)^T x_0. Kevin Jamieson
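A sketch of the local linear regression estimator above, fit by kernel-weighted least squares; the Gaussian kernel and synthetic data are the same assumptions as in the smoothing sketch.

```python
import numpy as np

def local_linear(x0, X, y, sigma=0.5):
    """Solve min_{b,w} sum_i K(x0, x_i) (y_i - (b + w^T x_i))^2, return b(x0) + w(x0)^T x0."""
    K = np.exp(-np.sum((X - x0) ** 2, axis=1) / (2 * sigma ** 2))
    A = np.hstack([np.ones((len(X), 1)), X])        # design matrix with an intercept column
    KW = A * K[:, None]                             # kernel-weighted rows of A
    coef = np.linalg.solve(A.T @ KW, KW.T @ y)      # weighted normal equations
    return float(coef[0] + coef[1:] @ x0)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
print(local_linear(np.array([1.0]), X, y))          # close to sin(1)
```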

34 Nearest Neighbor Overview. Very simple to explain and implement. No training! But finding nearest neighbors in a large dataset at test time can be computationally demanding (kd-trees help). 2018 Kevin Jamieson 34

35 Nearest Neighbor Overview. Very simple to explain and implement. No training! But finding nearest neighbors in a large dataset at test time can be computationally demanding (kd-trees help). You can use other forms of distance (not just Euclidean). Smoothing with kernels and local linear regression can improve performance (at the cost of higher variance). 2018 Kevin Jamieson 35

36 Nearest Neighbor Overview. Very simple to explain and implement. No training! But finding nearest neighbors in a large dataset at test time can be computationally demanding (kd-trees help). You can use other forms of distance (not just Euclidean). Smoothing with kernels and local linear regression can improve performance (at the cost of higher variance). With a lot of data, local methods have strong, simple theoretical guarantees. Without a lot of data, neighborhoods aren't local and these methods suffer. 2018 Kevin Jamieson 36

37 Kernels. Machine Learning CSE546, Kevin Jamieson, University of Washington. October 26. Kevin Jamieson 37

38 Machine Learning Problems. Have a bunch of iid data of the form {(x_i, y_i)}_{i=1}^n with x_i ∈ R^d and y_i ∈ R. Learning a model's parameters: minimize Σ_{i=1}^n ℓ_i(w), where each ℓ_i(w) is convex. Hinge loss: ℓ_i(w) = max{0, 1 - y_i x_i^T w}. Logistic loss: ℓ_i(w) = log(1 + exp(-y_i x_i^T w)). Squared error loss: ℓ_i(w) = (y_i - x_i^T w)^2. All in terms of inner products! Even nearest neighbor can use inner products! Kevin Jamieson
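The three losses above as small functions of the inner product x_i^T w, with made-up values just to show what each one computes:

```python
import numpy as np

def hinge(w, x, y):    return max(0.0, 1.0 - y * float(x @ w))
def logistic(w, x, y): return float(np.log1p(np.exp(-y * (x @ w))))
def squared(w, x, y):  return float((y - x @ w) ** 2)

w = np.array([1.0, -2.0])
x = np.array([0.5, 0.3])
y = 1.0                                  # labels in {-1, +1} for hinge/logistic
print(hinge(w, x, y), logistic(w, x, y), squared(w, x, y))
```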

39 What if the data is not linearly separable? Use features of features of features of features: φ(x): R^d → R^p. The feature space can get really large really quickly! 2018 Kevin Jamieson 39

40 Dot-product of polynomials of degree exactly d. d = 1: φ(u) = [u_1, u_2]^T, so ⟨φ(u), φ(v)⟩ = u_1 v_1 + u_2 v_2. 2017 Kevin Jamieson 40

41 Dot-product of polynomials of degree exactly d. d = 1: φ(u) = [u_1, u_2]^T, ⟨φ(u), φ(v)⟩ = u_1 v_1 + u_2 v_2. d = 2: φ(u) = [u_1^2, u_2^2, u_1 u_2, u_2 u_1]^T, ⟨φ(u), φ(v)⟩ = u_1^2 v_1^2 + u_2^2 v_2^2 + 2 u_1 u_2 v_1 v_2. 2017 Kevin Jamieson 41

42 Dot-product of polynomials of degree exactly d. d = 1: φ(u) = [u_1, u_2]^T, ⟨φ(u), φ(v)⟩ = u_1 v_1 + u_2 v_2. d = 2: φ(u) = [u_1^2, u_2^2, u_1 u_2, u_2 u_1]^T, ⟨φ(u), φ(v)⟩ = u_1^2 v_1^2 + u_2^2 v_2^2 + 2 u_1 u_2 v_1 v_2. General d: the dimension of φ(u) is roughly p^d if u ∈ R^p. 2017 Kevin Jamieson 42
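A quick numerical check of the d = 2 case above: the inner product of the explicit quadratic feature maps equals (u^T v)^2, so the kernel value can be computed without ever forming φ.

```python
import numpy as np

def phi2(u):
    """Explicit degree-2 feature map for u in R^2, as on the slide."""
    return np.array([u[0] ** 2, u[1] ** 2, u[0] * u[1], u[1] * u[0]])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])
print(phi2(u) @ phi2(v))    # 1.0
print((u @ v) ** 2)         # 1.0, the same value without building phi
```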

43 Kernel Trick. ŵ = arg min_w Σ_{i=1}^n (y_i - x_i^T w)^2 + λ ||w||_2^2. There exists an α ∈ R^n such that ŵ = Σ_{i=1}^n α_i x_i. Why? So α̂ = arg min_α Σ_{i=1}^n (y_i - Σ_{j=1}^n α_j ⟨x_j, x_i⟩)^2 + λ Σ_{i=1}^n Σ_{j=1}^n α_i α_j ⟨x_i, x_j⟩. 2017 Kevin Jamieson 43

44 Kernel Trick. ŵ = arg min_w Σ_{i=1}^n (y_i - x_i^T w)^2 + λ ||w||_2^2. There exists an α ∈ R^n such that ŵ = Σ_{i=1}^n α_i x_i. Why? So α̂ = arg min_α Σ_{i=1}^n (y_i - Σ_{j=1}^n α_j ⟨x_j, x_i⟩)^2 + λ Σ_i Σ_j α_i α_j ⟨x_i, x_j⟩ = arg min_α Σ_{i=1}^n (y_i - Σ_{j=1}^n α_j K(x_i, x_j))^2 + λ Σ_i Σ_j α_i α_j K(x_i, x_j) = arg min_α ||y - Kα||^2 + λ α^T K α, where K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩. 2017 Kevin Jamieson 44
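Setting the gradient of ||y - Kα||^2 + λ α^T K α to zero gives α̂ = (K + λI)^{-1} y when K is invertible. A minimal kernel ridge regression sketch follows; the RBF kernel and the synthetic data are assumptions, not part of the slide.

```python
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

def kernel_ridge_fit(X, y, lam=1e-3, sigma=1.0):
    """alpha-hat = (K + lam I)^{-1} y, the minimizer of ||y - K alpha||^2 + lam alpha^T K alpha."""
    K = rbf_gram(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0])
alpha = kernel_ridge_fit(X, y)
print(rbf_gram(np.array([[1.0]]), X) @ alpha)   # prediction f(1.0), close to sin(1)
```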

45 Why regularization? Typically K ≻ 0. What if λ = 0? α̂ = arg min_α ||y - Kα||^2 + λ α^T K α. 2017 Kevin Jamieson 45

46 Why regularization? Typically K ≻ 0. What if λ = 0? α̂ = arg min_α ||y - Kα||^2 + λ α^T K α. Unregularized kernel least squares can (over)fit any data: α̂ = K^{-1} y. 2017 Kevin Jamieson 46

47 Common kernels. Polynomials of degree exactly d. Polynomials of degree up to d. Gaussian (squared exponential) kernel: K(u, v) = exp(-||u - v||_2^2 / (2σ^2)). Sigmoid. 2017 Kevin Jamieson 47

48 Mercer's Theorem. When do we have a valid kernel K(x, x')? Definition 1: when it is an inner product. Mercer's Theorem: K(x, x') is a valid kernel if and only if K is positive semi-definite, in the following sense: ∫∫ h(x) K(x, x') h(x') dx dx' ≥ 0 for all h: R^d → R with ∫ h(x)^2 dx < ∞. 2017 Kevin Jamieson 48
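A finite-sample analogue of that condition, as a sanity check rather than a proof: for a valid kernel, the Gram matrix built from any finite set of points is positive semi-definite, so its smallest eigenvalue is nonnegative up to round-off.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * X @ X.T
K = np.exp(-d2 / 2.0)                 # RBF Gram matrix with sigma = 1
print(np.linalg.eigvalsh(K).min())    # >= 0 up to floating-point error
```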

49 RBF Kernel. K(u, v) = exp(-||u - v||_2^2 / (2σ^2)). Note that this is like placing weighted bumps on each point, as in kernel smoothing, but now we learn the weights. 2017 Kevin Jamieson 49

50 RBF Kernel. K(u, v) = exp(-||u - v||_2^2 / (2σ^2)). The bandwidth σ has an enormous effect on the fit: f̂(x) = Σ_{i=1}^n α̂_i K(x_i, x). [Figure: fits for several settings of the bandwidth σ and regularization λ.] 2017 Kevin Jamieson 50

51 RBF Kernel. K(u, v) = exp(-||u - v||_2^2 / (2σ^2)). The bandwidth σ has an enormous effect on the fit: f̂(x) = Σ_{i=1}^n α̂_i K(x_i, x). [Figure: additional fits across a wider range of bandwidth σ and regularization λ settings.] 2017 Kevin Jamieson 51

52 RBF kernel and random features. Identities: 2 cos(α) cos(β) = cos(α + β) + cos(α - β), and e^{jz} = cos(z) + j sin(z). Recall HW1 where we used the feature map φ(x) = [√2 cos(w_1^T x + b_1), ..., √2 cos(w_p^T x + b_p)]^T with w_k ~ N(0, σ^2 I) and b_k ~ Uniform(0, 2π). Then E[(1/p) φ(x)^T φ(y)] = (1/p) Σ_{k=1}^p E[2 cos(w_k^T x + b_k) cos(w_k^T y + b_k)] = E_{w,b}[2 cos(w^T x + b) cos(w^T y + b)]. 2017 Kevin Jamieson 52

53 RBF kernel and random features. Identities: 2 cos(α) cos(β) = cos(α + β) + cos(α - β), and e^{jz} = cos(z) + j sin(z). Recall HW1 where we used the feature map φ(x) = [√2 cos(w_1^T x + b_1), ..., √2 cos(w_p^T x + b_p)]^T with w_k ~ N(0, σ^2 I) and b_k ~ Uniform(0, 2π). Then E[(1/p) φ(x)^T φ(y)] = (1/p) Σ_{k=1}^p E[2 cos(w_k^T x + b_k) cos(w_k^T y + b_k)] = E_{w,b}[2 cos(w^T x + b) cos(w^T y + b)] = e^{-σ^2 ||x - y||_2^2 / 2}. [Rahimi, Recht, NIPS 2007; NIPS Test of Time Award, 2017.] 2017 Kevin Jamieson 53
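A Monte Carlo check of the identity above, under the convention assumed in this transcription (w ~ N(0, σ²I), b ~ Uniform(0, 2π), limiting kernel exp(-σ²||x - y||²/2)); the points and feature count are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.3, -0.1])
y = np.array([-0.2, 0.4])
sigma, p = 1.0, 200_000

W = rng.normal(scale=sigma, size=(p, 2))      # w_k ~ N(0, sigma^2 I)
b = rng.uniform(0.0, 2.0 * np.pi, size=p)     # b_k ~ Uniform(0, 2 pi)
approx = np.mean(2.0 * np.cos(W @ x + b) * np.cos(W @ y + b))
exact = np.exp(-sigma**2 * np.sum((x - y) ** 2) / 2.0)
print(approx, exact)                          # the average approaches the kernel value as p grows
```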

54 RBF Classification. ŵ, b̂ = arg min_{w,b} Σ_{i=1}^n max{0, 1 - y_i (b + x_i^T w)} + λ ||w||_2^2. Kernelized: min_{α,b} Σ_{i=1}^n max{0, 1 - y_i (b + Σ_{j=1}^n α_j ⟨x_i, x_j⟩)} + λ Σ_{i,j=1}^n α_i α_j ⟨x_i, x_j⟩. 2017 Kevin Jamieson 54

55 Wait, infinite dimensions? Isn't everything separable there? How are we not overfitting? Regularization! The fat-shattering dimension scales like (R/margin)^2. 2017 Kevin Jamieson 55

56 String Kernels. Example from Efron and Hastie, 2016. Amino acid sequences of different lengths: x1, x2. Features: all subsequences of length 3 (of the 20 possible amino acids). 2017 Kevin Jamieson 56
