Introduction to Machine Learning
|
|
- George Benson
- 6 years ago
- Views:
Transcription
1 Introduction to Machine Learning 3. Instance Based Learning Alex Smola Carnegie Mellon University
2 Outline Parzen Windows Kernels, algorithm Model selection Crossvalidation, leave one out, bias variance Watson-Nadaraya estimator Classification, regression, novelty detection Nearest Neighbor estimator Limit case of Parzen Windows
3 Parzen Windows Parzen
4 Density Estimation Observe some data x i Want to estimate p(x) Find unusual observations (e.g. security) Find typical observations (e.g. prototypes) Classifier via Bayes Rule p(y x) = p(x, y) p(x) = P p(x y)p(y) y 0 p(x y 0 )p(y 0 ) Need tool for computing p(x) easily
5 Bin Counting Discrete random variables, e.g. English, Chinese, German, French,... Male, Female Bin counting (record # of occurrences) 25 English Chinese German French Spanish male female
6 Bin Counting Discrete random variables, e.g. English, Chinese, German, French,... Male, Female Bin counting (record # of occurrences) 25 English Chinese German French Spanish male female
7 Bin Counting Discrete random variables, e.g. English, Chinese, German, French,... Male, Female Bin counting (record # of occurrences) 25 English Chinese German French Spanish male female
8 Bin Counting Discrete random variables, e.g. English, Chinese, German, French,... Male, Female not enough data Bin counting (record # of occurrences) 25 English Chinese German French Spanish male female
9 Curse of dimensionality (lite) Discrete random variables, e.g. English, Chinese, German, French,... Male, Female ZIP code Day of the week Operating system... Continuous random variables Income Bandwidth Time #bins grows exponentially need many bins per dimension
10 Curse of dimensionality (lite) Discrete random variables, e.g. English, Chinese, German, French,... Male, Female ZIP code Day of the week Operating system... Continuous random variables Income Bandwidth Time #bins grows exponentially need many bins per dimension
11 Density Estimation sample 0.04 underlying density Continuous domain = infinite number of bins Curse of dimensionality 10 bins on [0, 1] is probably good bins on [0, 1] 10 requires high accuracy in estimate: probability mass per cell also decreases by
12 Bin Counting
13 Bin Counting
14 Bin Counting
15 Bin Counting can t we just go and smooth this out?
16 What is happening? Hoeffding s theorem Pr ( E[x] 1 m mx i=1 x i > ) apple 2e 2m 2 For any average of [0,1] iid random variables. Bin counting Random variables x i are events in bins Apply Hoeffding s theorem to each bin Take the union bound over all bins to guarantee that all estimates converge
17 Density Estimation Hoeffding s theorem Pr ( E[x] 1 m mx i=1 x i > ) apple 2e 2m 2 Applying the union bound and Hoeffding Pr sup a2a ˆp(a) p(a) apple X a2a Pr ( ˆp(a) p(a) ) apple2 A exp 2m 2 good news Solving for error probability 2 A apple exp( m 2 )=) apple r log 2 A 2m log
18 Density Estimation Hoeffding s theorem Pr ( E[x] 1 m mx i=1 x i > ) apple 2e 2m 2 Applying the union bound and Hoeffding Pr sup a2a ˆp(a) p(a) apple X a2a Pr ( ˆp(a) p(a) ) apple2 A exp Solving for error probability 2m but 2 not good good enough news 2 A apple exp( m 2 )=) apple r log 2 A 2m log
19 Density Estimation Hoeffding s theorem Pr ( E[x] 1 m mx i=1 x i > ) apple 2e 2m 2 bins not Applying the union bound and independent Hoeffding Pr sup a2a ˆp(a) p(a) apple X a2a Pr ( ˆp(a) p(a) ) apple2 A exp Solving for error probability 2m but 2 not good good enough news 2 A apple exp( m 2 )=) apple r log 2 A 2m log
20 Bin Counting
21 Bin Counting can t we just go and smooth this out?
22 Parzen Windows Naive approach Use empirical density (delta distributions) p emp (x) = 1 m mx i=1 x i (x) This breaks if we see slightly different instances Kernel density estimate Smear out empirical density with a nonnegative smoothing kernel k x (x ) satisfying Z X k x (x 0 )dx 0 = 1 for all x
23 Parzen Windows Density estimate p emp (x) = 1 m ˆp(x) = 1 m mx i=1 mx x i (x) k xi (x) Smoothing kernels i= Gauss Laplace Epanechikov Uniform (2 ) 1 2 e 1 2 x2 1 2 e x 3 4 max(0, 1 x2 ) [ 1,1] (x)
24 Smoothing
25 Smoothing dist = norm(x - x * ones(1,m),'columns'); p = (1/m) * ((2 * pi)**(-d/2)) * sum(exp(-0.5 * dist.**2))
26 Smoothing
27 Smoothing
28 Size matters
29 Size matters Shape matters mostly in theory Kernel width Too narrow overfits Too wide smoothes with constant distribution How to choose? k xi (x) =r d h x r xi
30 Model Selection
31 Maximum Likelihood Need to measure how well we do For density estimation we care about Pr {X} = my p(x i ) i=1 Finding a that maximizes P(X) will peak at all data points since x i explains x i best... Maxima are delta functions on data. Overfitting!
32 Overfitting Likelihood on training set is much higher than typical
33 Overfitting density 0 Likelihood on training set is much higher than typical. density
34 Underfitting Likelihood on training set is very similar to typical one. Too simple
35 Model Selection Validation Use some of the data to estimate density. Use other part to evaluate how well it works Pick the parameter that works best Learning Theory L(X 0 X) := 1 n 0 Use data to build model 0 Xn i=1 log ˆp(x 0 i) Measure complexity and use this to bound nx 1 log ˆp(x i ) E x [log ˆp(x)] n i=1
36 Model Selection Validation Use some of the data to estimate density. Use other part to evaluate how well it works Pick the parameter that works best Learning Theory L(X 0 X) := 1 n 0 Use data to build model 0 Xn i=1 log ˆp(x 0 i) Measure complexity and use this to bound nx 1 log ˆp(x i ) E x [log ˆp(x)] n i=1 easy
37 Model Selection Validation Use some of the data to estimate density. Use other part to evaluate how well it works Pick the parameter that works best Learning Theory L(X 0 X) := 1 n 0 Use data to build model 0 Xn i=1 log ˆp(x 0 i) Measure complexity and use this to bound nx 1 log ˆp(x i ) E x [log ˆp(x)] n i=1 easy wasteful
38 Model Selection Validation Use some of the data to estimate density. Use other part to evaluate how well it works Pick the parameter that works best Learning Theory L(X 0 X) := 1 n 0 Use data to build model 0 Xn i=1 log ˆp(x 0 i) Measure complexity and use this to bound nx 1 log ˆp(x i ) E x [log ˆp(x)] n difficult i=1 easy wasteful
39 Model Selection Leave-one-out Crossvalidation Use almost all data to estimate density. Use single instance to estimate how well it works 1 X log p(x i X\x i ) = log k(x i,x j ) n 1 This has huge variance Average over estimates for all training data Pick the parameter that works best Simple implementation 1 n nx i=1 log apple n n 1 p(x i) j6=i 1 n 1 k(x i,x i ) where p(x) = 1 n nx k(x i,x) i=1
40 Leave-one out estimate
41 Optimal estimate
42 Model Selection k-fold Crossvalidation Partition data into k blocks (typically 10) Use all but one block to compute estimate Use remaining block as validation set Average over all validation estimates kx 1 l(p(x i X\X i )) k i=1 Almost unbiased (e.g. via Luntz and Brailovski, 1969) (error is for (k-1)/k sized set) Pick best parameter (why must we not check too many?)
43 Watson Nadaraya Estimator Geoff Watson
44 From density estimation to classification Binary classification Estimate p(x y = 1) and p(x y = 1) Use Bayes rule p(y x) = p(x y)p(y) p(x) = P 1 m y y i =y k(x i,x) my m P 1 m i k(x i,x) Decision boundary p(y =1 x) p(y = 1 x) = P j y jk(x j,x) P i k(x i,x) = X j y j k(x j,x) P i k(x i,x) local weights
45
46 Watson-Nadaraya Classifier
47 Watson-Nadaraya Classifier dist = norm(x - x * ones(1,m),'columns'); f = sum(y.* exp(-0.5 * dist.**2));
48 Watson Nadaraya Regression Binary classification p(y =1 x) p(y = 1 x) = P j y jk(x j,x) P i k(x i,x) = X j y j k(x j,x) P i k(x i,x) Regression - use same weighted expansion ŷ(x) = X j y j k(x j,x) P i k(x i,x) labels local weights
49
50 Watson-Nadaraya regression estimate
51 Silverman s Rule Bernard Silverman
52 Silverman s rule Chicken and egg problem Want wide kernel for low density region Want narrow kernel where we have much data Need density estimate to estimate density Simple hack Use average distance from k nearest neighbors r i = r k X kx i xk x2nn(x i,k) Nonuniform bandwidth for smoother.
53 Density true density
54 non adaptive estimate
55 adaptive estimate
56 distance distribution
57 Nearest Neighbor
58 Nearest Neighbors Table lookup For previously seen instance remember label Nearest neighbor Pick label of most similar neighbor Slight improvement - use k-nearest neighbors For regression average Really useful baseline! Easy to implement for small amounts of data.
59 Relation to Watson Nadaraya Watson Nadaraya estimator ŷ(x) = X j y j k(x i,x) P y j w j (x) i k(x i,x) = X j Nearest neighbor estimator ŷ(x) = X j y j k(x j,x) P y j w j (x) i k(x i,x) = X j Neighborhood function is hard threshold.
60 1-Nearest Neighbor
61 4-Nearest Neighbors
62 4-Nearest Neighbors Sign
63 If we get more data 1 Nearest Neighbor Converges to perfect solution if separation Twice the minimal error rate 2p(1-p) for noisy problems k-nearest Neighbor Converges to perfect solution if separation (but needs more data) Converges to minimal error min(p,1-p) for noisy problems (use increasing k)
64 1 Nearest Neighbor For given point x take ϵ neighborhood N with probability mass > d/n Probability that at least one point of n is in this neighborhood is 1-e -d so we can make this small Assume that probability mass doesn t change much in neighborhood Probability that labels of query and point do not match is 2p(1-p) (up to some approximation error in neighborhood)
65 k Nearest Neighbor For given point x take ϵ neighborhood N with probability mass > dk/n Small probability that we don t have at least k points in neighborhood. Assume that probability mass doesn t change much in neighborhood Bound probability that majority of points doesn t match majority for p (e.g. via Hoeffding s theorem for tail). Show that it vanishes Error is therefore min(p, 1-p), i.e. Bayes optimal error.
66 Fast lookup KD trees (Moore et al.) Partition space (one dimension at a time) Only search for subset that contains point Cover trees (Beygelzimer et al.) Hierarchically partition space with distance guarantees No need for nonoverlapping sets Bounded number of paths to follow (logarithmic time lookup)
67 Summary Parzen Windows Kernels, algorithm Model selection Crossvalidation, leave one out, bias variance Watson-Nadaraya estimator Classification, regression, novelty detection Nearest Neighbor estimator Limit case of Parzen Windows
68 Further Reading Cover tree homepage (paper & code) (kd trees, original paper) (Andrew Moore s tutorial from his PhD thesis) Nadaraya s regression estimator (1964) Watson s regression estimator (1964) Watson-Nadaraya regression package in R Stone s k-nn regression consistency proof Cover and Hart s k-nn classification consistency proof Tom Cover s rate analysis for k-nn Rates of Convergence for Nearest Neighbor Procedures. Sanjoy Dasgupta s analysis for k-nn estimation with selective sampling Multiedit & Condense (Dasarathy, Sanchez, Townsend) Geometric approximation via core sets
An Introduction to Machine Learning
An Introduction to Machine Learning L2: Instance Based Estimation Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune, January
More informationAnnouncements. Proposals graded
Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:
More informationNearest Neighbor. Machine Learning CSE546 Kevin Jamieson University of Washington. October 26, Kevin Jamieson 2
Nearest Neighbor Machine Learning CSE546 Kevin Jamieson University of Washington October 26, 2017 2017 Kevin Jamieson 2 Some data, Bayes Classifier Training data: True label: +1 True label: -1 Optimal
More informationProbabilistic modeling. The slides are closely adapted from Subhransu Maji s slides
Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework
More informationprobability of k samples out of J fall in R.
Nonparametric Techniques for Density Estimation (DHS Ch. 4) n Introduction n Estimation Procedure n Parzen Window Estimation n Parzen Window Example n K n -Nearest Neighbor Estimation Introduction Suppose
More informationPAC-learning, VC Dimension and Margin-based Bounds
More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based
More informationInstance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows Kn-Nearest
More informationBias-Variance Tradeoff
What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff
More informationComputer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)
Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori
More informationLearning Theory. Machine Learning CSE546 Carlos Guestrin University of Washington. November 25, Carlos Guestrin
Learning Theory Machine Learning CSE546 Carlos Guestrin University of Washington November 25, 2013 Carlos Guestrin 2005-2013 1 What now n We have explored many ways of learning from data n But How good
More informationMachine Learning. Nonparametric Methods. Space of ML Problems. Todo. Histograms. Instance-Based Learning (aka non-parametric methods)
Machine Learning InstanceBased Learning (aka nonparametric methods) Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Non parametric CSE 446 Machine Learning Daniel Weld March
More informationCOMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017
COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University FEATURE EXPANSIONS FEATURE EXPANSIONS
More informationBayes Classifiers. CAP5610 Machine Learning Instructor: Guo-Jun QI
Bayes Classifiers CAP5610 Machine Learning Instructor: Guo-Jun QI Recap: Joint distributions Joint distribution over Input vector X = (X 1, X 2 ) X 1 =B or B (drinking beer or not) X 2 = H or H (headache
More informationAdvanced Introduction to Machine Learning CMU-10715
Advanced Introduction to Machine Learning CMU-10715 Risk Minimization Barnabás Póczos What have we seen so far? Several classification & regression algorithms seem to work fine on training datasets: Linear
More informationMachine Learning. Kernels. Fall (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang. (Chap. 12 of CIML)
Machine Learning Fall 2017 Kernels (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang (Chap. 12 of CIML) Nonlinear Features x4: -1 x1: +1 x3: +1 x2: -1 Concatenated (combined) features XOR:
More informationRecap from previous lecture
Recap from previous lecture Learning is using past experience to improve future performance. Different types of learning: supervised unsupervised reinforcement active online... For a machine, experience
More informationPAC-learning, VC Dimension and Margin-based Bounds
More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based
More informationStochastic Gradient Descent
Stochastic Gradient Descent Machine Learning CSE546 Carlos Guestrin University of Washington October 9, 2013 1 Logistic Regression Logistic function (or Sigmoid): Learn P(Y X) directly Assume a particular
More informationIntroduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones
Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 Last week... supervised and unsupervised methods need adaptive
More informationPattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions
Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite
More informationSupport Vector Machines
Support Vector Machines Jordan Boyd-Graber University of Colorado Boulder LECTURE 7 Slides adapted from Tom Mitchell, Eric Xing, and Lauren Hannah Jordan Boyd-Graber Boulder Support Vector Machines 1 of
More informationBAYESIAN DECISION THEORY
Last updated: September 17, 2012 BAYESIAN DECISION THEORY Problems 2 The following problems from the textbook are relevant: 2.1 2.9, 2.11, 2.17 For this week, please at least solve Problem 2.3. We will
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More informationIntroduction to Machine Learning
Introduction to Machine Learning Machine Learning: Jordan Boyd-Graber University of Maryland SUPPORT VECTOR MACHINES Slides adapted from Tom Mitchell, Eric Xing, and Lauren Hannah Machine Learning: Jordan
More informationMachine Learning. Lecture 9: Learning Theory. Feng Li.
Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell
More informationKernel Methods. Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton
Kernel Methods Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton Alexander J. Smola Statistical Machine Learning Program Canberra,
More informationSUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION
SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology
More informationMachine Learning. Theory of Classification and Nonparametric Classifier. Lecture 2, January 16, What is theoretically the best classifier
Machine Learning 10-701/15 701/15-781, 781, Spring 2008 Theory of Classification and Nonparametric Classifier Eric Xing Lecture 2, January 16, 2006 Reading: Chap. 2,5 CB and handouts Outline What is theoretically
More informationAn Introduction to Machine Learning
An Introduction to Machine Learning L6: Structured Estimation Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune, January
More informationMachine Learning Practice Page 2 of 2 10/28/13
Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationMachine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall
Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 1 The Generative Model POV We think of the data as being generated from some process. We assume
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem
More information12 - Nonparametric Density Estimation
ST 697 Fall 2017 1/49 12 - Nonparametric Density Estimation ST 697 Fall 2017 University of Alabama Density Review ST 697 Fall 2017 2/49 Continuous Random Variables ST 697 Fall 2017 3/49 1.0 0.8 F(x) 0.6
More informationMachine Learning Midterm Exam March 4, 2015
Name: Andrew ID: Instructions Anything on paper is OK in arbitrary shape size, and quantity. Electronic devices are not acceptable. This includes ipods, ipads, Android tablets, Blackberries, Nokias, Windows
More informationStephen Scott.
1 / 35 (Adapted from Ethem Alpaydin and Tom Mitchell) sscott@cse.unl.edu In Homework 1, you are (supposedly) 1 Choosing a data set 2 Extracting a test set of size > 30 3 Building a tree on the training
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationMachine Learning: Chenhao Tan University of Colorado Boulder LECTURE 9
Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 9 Slides adapted from Jordan Boyd-Graber Machine Learning: Chenhao Tan Boulder 1 of 39 Recap Supervised learning Previously: KNN, naïve
More informationMidterm exam CS 189/289, Fall 2015
Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points
More informationMachine Learning for Computational Advertising
Machine Learning for Computational Advertising L1: Basics and Probability Theory Alexander J. Smola Yahoo! Labs Santa Clara, CA 95051 alex@smola.org UC Santa Cruz, April 2009 Alexander J. Smola: Machine
More informationNon-parametric Methods
Non-parametric Methods Machine Learning Torsten Möller Möller/Mori 1 Reading Chapter 2 of Pattern Recognition and Machine Learning by Bishop (with an emphasis on section 2.5) Möller/Mori 2 Outline Last
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem
More informationCPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017
CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class
More informationCS4495/6495 Introduction to Computer Vision. 8C-L3 Support Vector Machines
CS4495/6495 Introduction to Computer Vision 8C-L3 Support Vector Machines Discriminative classifiers Discriminative classifiers find a division (surface) in feature space that separates the classes Several
More informationExploiting k-nearest Neighbor Information with Many Data
Exploiting k-nearest Neighbor Information with Many Data 2017 TEST TECHNOLOGY WORKSHOP 2017. 10. 24 (Tue.) Yung-Kyun Noh Robotics Lab., Contents Nonparametric methods for estimating density functions Nearest
More informationReminders. Thought questions should be submitted on eclass. Please list the section related to the thought question
Linear regression Reminders Thought questions should be submitted on eclass Please list the section related to the thought question If it is a more general, open-ended question not exactly related to a
More informationPATTERN RECOGNITION AND MACHINE LEARNING
PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality
More informationBayesian Learning (II)
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University January 26, 2015 Today: Bayes Classifiers Conditional Independence Naïve Bayes Readings: Mitchell: Naïve Bayes
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric
More informationNonparametric Methods Lecture 5
Nonparametric Methods Lecture 5 Jason Corso SUNY at Buffalo 17 Feb. 29 J. Corso (SUNY at Buffalo) Nonparametric Methods Lecture 5 17 Feb. 29 1 / 49 Nonparametric Methods Lecture 5 Overview Previously,
More informationMachine Learning Lecture 5
Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory
More informationMidterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 7301: Advanced Machine Learning Vibhav Gogate The University of Texas at Dallas Supervised Learning Issues in supervised learning What makes learning hard Point Estimation: MLE vs Bayesian
More informationMachine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University September 20, 2012 Today: Logistic regression Generative/Discriminative classifiers Readings: (see class website)
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationOutline. Supervised Learning. Hong Chang. Institute of Computing Technology, Chinese Academy of Sciences. Machine Learning Methods (Fall 2012)
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Linear Models for Regression Linear Regression Probabilistic Interpretation
More informationLogistic Regression. Machine Learning Fall 2018
Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes
More informationMaximum Mean Discrepancy
Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia
More informationBrief Introduction to Machine Learning
Brief Introduction to Machine Learning Yuh-Jye Lee Lab of Data Science and Machine Intelligence Dept. of Applied Math. at NCTU August 29, 2016 1 / 49 1 Introduction 2 Binary Classification 3 Support Vector
More informationCSE446: non-parametric methods Spring 2017
CSE446: non-parametric methods Spring 2017 Ali Farhadi Slides adapted from Carlos Guestrin and Luke Zettlemoyer Linear Regression: What can go wrong? What do we do if the bias is too strong? Might want
More informationDecision Trees. Machine Learning CSEP546 Carlos Guestrin University of Washington. February 3, 2014
Decision Trees Machine Learning CSEP546 Carlos Guestrin University of Washington February 3, 2014 17 Linear separability n A dataset is linearly separable iff there exists a separating hyperplane: Exists
More informationLearning Theory Continued
Learning Theory Continued Machine Learning CSE446 Carlos Guestrin University of Washington May 13, 2013 1 A simple setting n Classification N data points Finite number of possible hypothesis (e.g., dec.
More informationCurve Fitting Re-visited, Bishop1.2.5
Curve Fitting Re-visited, Bishop1.2.5 Maximum Likelihood Bishop 1.2.5 Model Likelihood differentiation p(t x, w, β) = Maximum Likelihood N N ( t n y(x n, w), β 1). (1.61) n=1 As we did in the case of the
More informationPattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods
Pattern Recognition and Machine Learning Chapter 6: Kernel Methods Vasil Khalidov Alex Kläser December 13, 2007 Training Data: Keep or Discard? Parametric methods (linear/nonlinear) so far: learn parameter
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationAd Placement Strategies
Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January
More informationMachine Learning Gaussian Naïve Bayes Big Picture
Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University January 27, 2011 Today: Naïve Bayes Big Picture Logistic regression Gradient ascent Generative discriminative
More informationGenerative v. Discriminative classifiers Intuition
Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationProbability and Statistical Decision Theory
Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/ Probability and Statistical Decision Theory Many slides attributable to: Erik Sudderth (UCI) Prof. Mike Hughes
More informationCOMS 4771 Introduction to Machine Learning. Nakul Verma
COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW1 due next lecture Project details are available decide on the group and topic by Thursday Last time Generative vs. Discriminative
More information10-701/ Machine Learning - Midterm Exam, Fall 2010
10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam
More informationNaïve Bayes classification
Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss
More informationThe Bayes classifier
The Bayes classifier Consider where is a random vector in is a random variable (depending on ) Let be a classifier with probability of error/risk given by The Bayes classifier (denoted ) is the optimal
More informationA first model of learning
A first model of learning Let s restrict our attention to binary classification our labels belong to (or ) We observe the data where each Suppose we are given an ensemble of possible hypotheses / classifiers
More informationNonparametric Bayesian Methods (Gaussian Processes)
[70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent
More informationCase Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday!
Case Study 1: Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 4, 017 1 Announcements:
More informationNon-parametric Methods
Non-parametric Methods Machine Learning Alireza Ghane Non-Parametric Methods Alireza Ghane / Torsten Möller 1 Outline Machine Learning: What, Why, and How? Curve Fitting: (e.g.) Regression and Model Selection
More informationNonparametric probability density estimation
A temporary version for the 2018-04-11 lecture. Nonparametric probability density estimation Václav Hlaváč Czech Technical University in Prague Czech Institute of Informatics, Robotics and Cybernetics
More informationGenerative Learning. INFO-4604, Applied Machine Learning University of Colorado Boulder. November 29, 2018 Prof. Michael Paul
Generative Learning INFO-4604, Applied Machine Learning University of Colorado Boulder November 29, 2018 Prof. Michael Paul Generative vs Discriminative The classification algorithms we have seen so far
More informationGenerative Clustering, Topic Modeling, & Bayesian Inference
Generative Clustering, Topic Modeling, & Bayesian Inference INFO-4604, Applied Machine Learning University of Colorado Boulder December 12-14, 2017 Prof. Michael Paul Unsupervised Naïve Bayes Last week
More informationUnderstanding Generalization Error: Bounds and Decompositions
CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the
More informationIntroduction to Machine Learning
Introduction to Machine Learning Vapnik Chervonenkis Theory Barnabás Póczos Empirical Risk and True Risk 2 Empirical Risk Shorthand: True risk of f (deterministic): Bayes risk: Let us use the empirical
More informationNaïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 18 Oct. 31, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Naïve Bayes Matt Gormley Lecture 18 Oct. 31, 2018 1 Reminders Homework 6: PAC Learning
More informationTufts COMP 135: Introduction to Machine Learning
Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/ Logistic Regression Many slides attributable to: Prof. Mike Hughes Erik Sudderth (UCI) Finale Doshi-Velez (Harvard)
More informationClassification: The rest of the story
U NIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN CS598 Machine Learning for Signal Processing Classification: The rest of the story 3 October 2017 Today s lecture Important things we haven t covered yet Fisher
More informationProbability Models for Bayesian Recognition
Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIAG / osig Second Semester 06/07 Lesson 9 0 arch 07 Probability odels for Bayesian Recognition Notation... Supervised Learning for Bayesian
More informationComputer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression
Group Prof. Daniel Cremers 4. Gaussian Processes - Regression Definition (Rep.) Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
More informationIntro. ANN & Fuzzy Systems. Lecture 15. Pattern Classification (I): Statistical Formulation
Lecture 15. Pattern Classification (I): Statistical Formulation Outline Statistical Pattern Recognition Maximum Posterior Probability (MAP) Classifier Maximum Likelihood (ML) Classifier K-Nearest Neighbor
More informationClassification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative
More informationIntroduction to Machine Learning
Introduction to Machine Learning 12. Gaussian Processes Alex Smola Carnegie Mellon University http://alex.smola.org/teaching/cmu2013-10-701 10-701 The Normal Distribution http://www.gaussianprocess.org/gpml/chapters/
More informationComputer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression
Group Prof. Daniel Cremers 9. Gaussian Processes - Regression Repetition: Regularized Regression Before, we solved for w using the pseudoinverse. But: we can kernelize this problem as well! First step:
More informationIntroduction Dual Representations Kernel Design RBF Linear Reg. GP Regression GP Classification Summary. Kernel Methods. Henrik I Christensen
Kernel Methods Henrik I Christensen Robotics & Intelligent Machines @ GT Georgia Institute of Technology, Atlanta, GA 30332-0280 hic@cc.gatech.edu Henrik I Christensen (RIM@GT) Kernel Methods 1 / 37 Outline
More informationMachine Learning
Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University February 1, 2011 Today: Generative discriminative classifiers Linear regression Decomposition of error into
More informationA Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007
Decision Trees, cont. Boosting Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 1 st, 2007 1 A Decision Stump 2 1 The final tree 3 Basic Decision Tree Building Summarized
More informationCSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18
CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$
More information