1 CMU-Q LECTURE 24: SUPERVISED LEARNING
Teacher: Gianni A. Di Caro
2 SUPERVISED LEARNING
[Slide diagram: hypothesis space, hypothesis function, labeled data, errors, performance criteria]
Given a collection of input features and outputs $(x^{(i)}, y^{(i)})$, $i = 1, \dots, m$, and a hypothesis function $h_\theta$, find parameter values $\theta$ that minimize the average empirical error:
$$\min_\theta \; \frac{1}{m} \sum_{i=1}^m \ell\big(h_\theta(x^{(i)}), y^{(i)}\big)$$
We need to specify:
1. The hypothesis class $H$, with $h_\theta \in H$
2. The loss function $\ell$
3. The algorithm for solving the optimization problem (often approximately)
4. A complete ML design: from data processing to learning to validation and testing
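To make the objective concrete, here is a minimal Python sketch of the average-empirical-error computation with a squared-error loss and a linear hypothesis; the data, names, and parameter values are illustrative, not from the lecture:

```python
import numpy as np

def empirical_risk(theta, X, y, loss):
    """Average empirical error: (1/m) * sum_i loss(h_theta(x_i), y_i)."""
    preds = X @ theta  # linear hypothesis h_theta(x) = theta^T x
    return np.mean([loss(p, t) for p, t in zip(preds, y)])

squared_loss = lambda pred, target: (pred - target) ** 2

# Synthetic example data: m = 4 samples, intercept column plus 2 features.
X = np.array([[1.0, 0.5, 1.2], [1.0, 1.1, 0.3],
              [1.0, 0.2, 2.0], [1.0, 1.5, 0.9]])
y = np.array([1.0, 0.8, 1.9, 1.4])
theta = np.zeros(3)
print(empirical_risk(theta, X, y, squared_loss))
```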
3 CLASSIFICATION AND REGRESSION
Features: width, lightness.
Classification: (width, lightness) → {salmon, sea bass} (discrete): $h_\theta : X \subseteq \mathbb{R}^2 \to Y = \{0, 1\}$
Regression: (width, lightness) → weight (continuous): $h_\theta : X \subseteq \mathbb{R}^2 \to Y \subseteq \mathbb{R}$
Which hypothesis class $H$? Complex boundaries and relations may be needed.
4 PROBABILISTIC MODELS: DISCRIMINATIVE VS. GENERATIVE
Regression and classification problems can be stated in probabilistic terms (more on this later). The mapping $y = h_\theta(x)$ that we are learning can be naturally interpreted as the probability of the output being $y$ given the input data $x$ (under the selected hypothesis $h$ and the learned parameter vector $\theta$).
Discriminative models: directly learn $p(y \mid x)$ with a parametric hypothesis; they allow us to discriminate between classes / predicted outputs.
Generative models / probability distributions: learn $p(x, y)$, the probabilistic model that describes the data, then use Bayes' rule:
$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)} = \frac{p(x, y)}{p(x)}$$
They allow us to generate any relevant data.
[Slide figure: scatter plots of sea bass vs. salmon in the $(x_1, x_2)$ feature plane]
5 GENERATIVE MODELS
A discriminative model, one that learns $p(y \mid x; \theta)$, can be used to label the data, to discriminate the data, but not to generate the data.
- E.g., a discriminative approach tries to find out which (linear, in this case) decision boundary allows for the best classification based on the training data, and takes decisions accordingly.
- It learns the mapping from $X$ to $Y$ directly.
A generative approach would proceed as follows:
1. By looking at the feature data about salmon, build a model of a salmon.
2. By looking at the feature data about sea bass, build a model of a sea bass.
3. To classify a new fish based on its features $x$, match it against the salmon and sea bass models, to see whether it looks more like the salmon or more like the sea bass we had seen in the training set.
Steps 1-3 are equivalent to modeling $p(x \mid y)$, where $y \in \{\omega_1, \omega_2\}$: the conditional probability that the observed features $x$ are those of a salmon or a sea bass.
6 GENERATIVE MODELS
$p(x \mid y = \omega_1)$ models the distribution of salmon features; $p(x \mid y = \omega_2)$ models the distribution of sea bass features.
$p(y)$ can be derived from the dataset or from other sources.
- E.g., $p(\omega_1)$ = ratio of salmon in the dataset, $p(\omega_2)$ = ratio of sea bass.
Bayes' rule:
$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)} = \frac{p(x, y)}{p(x)}, \qquad \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$
$$p(x) = p(x \mid y = \omega_1)\, p(y = \omega_1) + p(x \mid y = \omega_2)\, p(y = \omega_2)$$
To make a prediction:
$$\arg\max_y p(y \mid x) = \arg\max_y \frac{p(x \mid y)\, p(y)}{p(x)} = \arg\max_y p(x \mid y)\, p(y)$$
Equivalently: decide $\omega_1$ if $p(\omega_1 \mid x) > p(\omega_2 \mid x)$, otherwise decide $\omega_2$.
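A minimal sketch of this generative pipeline, assuming 1-D Gaussian class-conditional models fitted to made-up feature values (all numbers and names here are illustrative, not from the lecture):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D training features (e.g., fish width) for each class.
x_salmon  = np.array([2.1, 2.4, 2.0, 2.3])
x_seabass = np.array([3.0, 3.4, 3.1, 2.9])

# Steps 1-2: build a (Gaussian) model of each class from its data.
mu1, s1 = x_salmon.mean(), x_salmon.std(ddof=1)
mu2, s2 = x_seabass.mean(), x_seabass.std(ddof=1)

# Priors p(y) taken as class frequencies in the dataset.
p1 = len(x_salmon) / (len(x_salmon) + len(x_seabass))
p2 = 1.0 - p1

def classify(x):
    # argmax_y p(x|y) p(y); the evidence p(x) cancels in the comparison.
    post1 = norm.pdf(x, mu1, s1) * p1
    post2 = norm.pdf(x, mu2, s2) * p2
    return "salmon" if post1 > post2 else "sea bass"

print(classify(2.2), classify(3.2))
```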
7 GENERATIVE MODELS AND BAYES DECISION RULE
Decide $\omega_1$ if $p(x \mid \omega_1)\, p(\omega_1) > p(x \mid \omega_2)\, p(\omega_2)$, otherwise decide $\omega_2$.
Equivalently, in terms of the likelihood ratio: decide $\omega_1$ if
$$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \frac{p(\omega_2)}{p(\omega_1)},$$
otherwise decide $\omega_2$.
[Slide figure: class-conditional densities yielding two disconnected decision regions for class 2]
8 GENERATIVE MODELS
Given the joint distribution, we can derive any conditional or marginal probability.
Sample from $p(x, y)$ to obtain labeled data points: given the prior $p(y)$, sample a class (or a predictor value); then, given the class $y$, sample instance data from $p(x \mid y)$ for that class (or, given a predictor variable, sample an expected output).
Downside: higher complexity, more parameters to learn. This is a density estimation problem:
- Parametric (e.g., Gaussian densities)
- Non-parametric (full density estimation)
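A sketch of sampling labeled points from the joint $p(x, y) = p(x \mid y)\, p(y)$: first draw a class from the prior, then an instance from that class's conditional. All parameter values below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative priors p(y) and Gaussian class-conditionals p(x|y).
priors = {"salmon": 0.6, "sea bass": 0.4}
params = {"salmon": (2.2, 0.2), "sea bass": (3.1, 0.2)}  # (mean, std) per class

def sample_labeled_point():
    # Ancestral sampling: class from p(y), then instance from p(x|y).
    y = rng.choice(list(priors), p=list(priors.values()))
    mu, sigma = params[y]
    return rng.normal(mu, sigma), y

print([sample_labeled_point() for _ in range(3)])
```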
9 LET'S GO BACK TO LINEAR REGRESSION
Linear model as hypothesis:
$$y = h(x; w) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d = w^T x, \qquad x = (1, x_1, x_2, \dots, x_d)$$
Find $w$ that minimizes the deviation from the desired answers $y^{(i)} - h(x^{(i)})$, for each $i$ in the dataset.
Loss function: mean squared error (MSE),
$$\ell = \frac{1}{m} \sum_{i=1}^m \big(y^{(i)} - h(x^{(i)})\big)^2$$
This model does not try to explain the variation in the observed $y$'s for the data.
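A minimal sketch of the least-squares fit on synthetic data; `np.linalg.lstsq` performs the MSE minimization, and the data-generating parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 50, 2
X_raw = rng.normal(size=(m, d))
true_w = np.array([1.0, 2.0, -0.5])             # [w0, w1, w2], illustrative
X = np.hstack([np.ones((m, 1)), X_raw])         # prepend the constant feature x0 = 1
y = X @ true_w + rng.normal(scale=0.1, size=m)  # noisy linear targets

w, *_ = np.linalg.lstsq(X, y, rcond=None)       # minimizes the MSE loss
mse = np.mean((y - X @ w) ** 2)
print(w, mse)
```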
10 STATISTICAL MODEL FOR LINEAR REGRESSION
A statistical model of linear regression:
$$y = w^T x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2) \;\Rightarrow\; y \sim N(w^T x, \sigma^2)$$
This model does explain the variation in the observed $y$'s for the data, in terms of additive white Gaussian noise.
The conditional distribution of $y$ given $x$:
$$p(y \mid x; w, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2\sigma^2}\big(y - w^T x\big)^2\right)$$
This is the probability of the output being $y$ given the predictor $x$, with $E[y \mid x] = w^T x$.
11 STATISTICAL MODEL FOR LINEAR REGRESSION
Consider the entire dataset $D$, and assume that all samples are independent and identically distributed (i.i.d.) random variables. What is the joint probability of all training data, i.e., the probability of observing all the outputs $y$ in $D$ given $w$ and $\sigma$? By the i.i.d. assumption:
$$p\big(y^{(1)}, y^{(2)}, \dots, y^{(m)} \mid x^{(1)}, x^{(2)}, \dots, x^{(m)}; w, \sigma\big) = \prod_{i=1}^m p\big(y^{(i)} \mid x^{(i)}; w, \sigma\big)$$
$$L(D, w, \sigma) = \prod_{i=1}^m p\big(y^{(i)} \mid x^{(i)}; w, \sigma\big)$$
This is the likelihood function of the predictions: the probability of observing the outputs $y$ in $D$ given $w$ and $\sigma$.
Maximum likelihood estimation of the parameters $w$: the parameter values maximizing the likelihood of the predictions, i.e., the values such that the probability of observing the data in $D$ is maximized:
$$w^* = \arg\max_w L(D, w, \sigma)$$
12 STATISTICAL MODEL FOR LINEAR REGRESSION
Log-likelihood:
$$l(D, w, \sigma) = \log L(D, w, \sigma) = \log \prod_{i=1}^m p\big(y^{(i)} \mid x^{(i)}; w, \sigma\big) = \sum_{i=1}^m \log p\big(y^{(i)} \mid x^{(i)}; w, \sigma\big)$$
Using the conditional density $p(y \mid x; w, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2\sigma^2}(y - w^T x)^2\right)$:
$$l(D, w, \sigma) = \sum_{i=1}^m \log \left[\frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2\sigma^2}\big(y^{(i)} - w^T x^{(i)}\big)^2\right)\right] = -\frac{1}{2\sigma^2} \sum_{i=1}^m \big(y^{(i)} - w^T x^{(i)}\big)^2 + c(\sigma)$$
Does it look familiar? Maximizing the predictive log-likelihood with respect to $w$ is equivalent to minimizing the MSE loss function:
$$\max_w l(D, w, \sigma) \;\sim\; \min_w \text{MSE}$$
More generally, a least-squares linear fit under Gaussian noise corresponds to the maximum likelihood estimator of the data.
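A quick numeric check of this equivalence on synthetic data, assuming the noise level $\sigma$ is known: minimizing the negative log-likelihood and minimizing the MSE return the same $w$ (up to optimizer tolerance):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = np.hstack([np.ones((30, 1)), rng.normal(size=(30, 1))])
y = X @ np.array([0.5, 1.5]) + rng.normal(scale=0.3, size=30)

sigma = 0.3  # assumed known noise level (illustrative)

def neg_log_likelihood(w):
    r = y - X @ w
    # -l(D, w, sigma) = sum r^2 / (2 sigma^2) + m * log(sigma * sqrt(2 pi))
    return np.sum(r**2) / (2 * sigma**2) + len(y) * np.log(sigma * np.sqrt(2 * np.pi))

def mse(w):
    return np.mean((y - X @ w) ** 2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
w_mse = minimize(mse, x0=np.zeros(2)).x
print(np.allclose(w_mle, w_mse, atol=1e-4))  # True: same minimizer
```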
13 NON-LINEAR, ADDITIVE REGRESSION MODELS
14 NON-LINEAR PROBLEMS?
Two options:
- Design a non-linear regressor / classifier
- Modify the input data to make the problem linear
15 MAP DATA INTO HIGHER-DIMENSIONAL FEATURE SPACES
16 MAP DATA INTO HIGHER-DIMENSIONAL FEATURE SPACES
The hyperplane is found in z-space, then projected back into x-space, where it is an ellipse.
The form of the SVM solution (expressed in terms of dot products between feature vectors) makes it easy to define a kernel function that implicitly performs the desired transformation, allowing us to keep using linear classifiers.
17 NON-LINEAR, ADDITIVE REGRESSION MODELS
Main idea to model nonlinearities: replace the inputs to the linear units with $b$ feature (basis) functions $\phi_j(x)$, $j = 1, \dots, b$, where each $\phi_j(x)$ is an arbitrary function of $x$:
$$y = h(x; w) = w_0 + w_1 \phi_1(x) + w_2 \phi_2(x) + \dots + w_b \phi_b(x) = w^T \phi(x)$$
The $\phi_j$ map the original feature input to a new input, on which the model is still linear in $w$.
18 EXAMPLES OF FEATURE FUNCTIONS
Higher-order polynomial with one-dimensional input, $x = (x)$:
$$\phi_1(x) = x, \quad \phi_2(x) = x^2, \quad \phi_3(x) = x^3, \;\dots$$
Quadratic polynomial with two-dimensional inputs, $x = (x_1, x_2)$:
$$\phi_1(x) = x_1, \quad \phi_2(x) = x_1^2, \quad \phi_3(x) = x_2, \quad \phi_4(x) = x_2^2, \quad \phi_5(x) = x_1 x_2$$
Transcendental functions:
$$\phi_1(x) = \sin(x), \quad \phi_2(x) = \cos(x)$$
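The three families of feature functions above, written as a small sketch (the input shapes are assumptions):

```python
import numpy as np

def poly_1d(x, degree=3):
    """phi(x) = (x, x^2, ..., x^degree) for scalar x."""
    return np.array([x**j for j in range(1, degree + 1)])

def quadratic_2d(x):
    """phi(x) = (x1, x1^2, x2, x2^2, x1*x2) for x = (x1, x2)."""
    x1, x2 = x
    return np.array([x1, x1**2, x2, x2**2, x1 * x2])

def trig(x):
    """phi(x) = (sin x, cos x)."""
    return np.array([np.sin(x), np.cos(x)])

print(poly_1d(2.0), quadratic_2d((1.0, 3.0)), trig(0.0))
```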
19 SOLUTION USING FEATURE FUNCTIONS
The same techniques (analytical gradient + system of equations, or gradient descent) used for the plain linear case with MSE as the loss function apply:
$$h(x^{(i)}; w) = w_0 + w_1 \phi_1(x^{(i)}) + \dots + w_b \phi_b(x^{(i)}) = w^T \phi(x^{(i)}), \qquad \phi(x^{(i)}) = \big(1, \phi_1(x^{(i)}), \phi_2(x^{(i)}), \dots, \phi_b(x^{(i)})\big)$$
$$\ell = \frac{1}{m} \sum_{i=1}^m \big(y^{(i)} - h(x^{(i)})\big)^2$$
To find $\min_w \ell$ we have to look where $\nabla_w \ell = 0$:
$$\nabla_w \ell = -\frac{2}{m} \sum_{i=1}^m \big(y^{(i)} - h(x^{(i)})\big)\, \phi(x^{(i)}) = 0$$
This results in a system of linear equations, for $j = 1, \dots, b$:
$$w_0 \sum_{i=1}^m \phi_j(x^{(i)}) + w_1 \sum_{i=1}^m \phi_1(x^{(i)})\, \phi_j(x^{(i)}) + \dots + w_b \sum_{i=1}^m \phi_b(x^{(i)})\, \phi_j(x^{(i)}) = \sum_{i=1}^m y^{(i)}\, \phi_j(x^{(i)})$$
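A sketch of assembling the design matrix $\Phi$ (rows $\phi(x^{(i)})$) and solving the resulting normal equations directly, on synthetic 1-D data with cubic polynomial features:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=40)
y = np.sin(2 * x) + rng.normal(scale=0.1, size=40)  # synthetic nonlinear targets

# Row i of Phi is phi(x_i) = (1, x_i, x_i^2, x_i^3).
Phi = np.vander(x, N=4, increasing=True)

# Setting the gradient of the MSE to zero yields the normal equations
# (Phi^T Phi) w = Phi^T y, solved here directly.
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(w)
```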
20 EXAMPLE OF SGD WITH FEATURE FUNCTIONS
One-dimensional feature vectors and a high-order polynomial: $x = (x)$, $\phi_j(x) = x^j$:
$$h(x; w) = w_0 + w_1 \phi_1(x) + w_2 \phi_2(x) + \dots + w_b \phi_b(x) = w_0 + \sum_{j=1}^b w_j x^j$$
On-line, single-sample $(x^{(i)}, y^{(i)})$ gradient update, for $j = 1, \dots, b$:
$$w_j \leftarrow w_j + \alpha\, \big(y^{(i)} - h(x^{(i)})\big)\, \phi_j(x^{(i)})$$
This has the same form as in the linear regression model, with $x_j^{(i)} \to \phi_j(x^{(i)})$.
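A sketch of this on-line update with polynomial features on synthetic data; the learning rate, epoch count, and data below are arbitrary choices:

```python
import numpy as np

def phi(x, b):
    """phi(x) = (1, x, x^2, ..., x^b) for scalar x."""
    return np.array([x**j for j in range(b + 1)])

def sgd_poly(xs, ys, b=3, alpha=0.05, epochs=200):
    w = np.zeros(b + 1)
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            f = phi(x, b)
            w += alpha * (y - w @ f) * f  # w_j <- w_j + alpha (y - h(x)) phi_j(x)
    return w

rng = np.random.default_rng(4)
xs = rng.uniform(-1, 1, 30)
ys = xs**3 - xs + rng.normal(scale=0.05, size=30)  # synthetic data
print(sgd_poly(xs, ys))
```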
21 ELECTRICITY EXAMPLE
New data: it doesn't look linear anymore.
22 NEW HYPOTHESIS
The complexity of the model grows: one parameter for each feature, transformed according to a polynomial of order 2 (at least 3 parameters vs. the 2 of the original hypothesis).
23 NEW HYPOTHESIS
At least 5 parameters. (If we had multiple predicting features, all their order-$d$ products would have to be considered, resulting in a number of additional parameters.)
24 NEW HYPOTHESIS
The number of parameters is now larger than the number of data points, so the polynomial can fit the data almost precisely: overfitting.
25 SELECTING MODEL COMPLEXITY
Dataset with 10 points and 1-D features: which hypothesis class should we use?
- Linear regression: $y = h(x; w) = w_0 + w_1 x$
- Cubic polynomial regression: $y = h(x; w) = w_0 + w_1 x + w_2 x^2 + w_3 x^3$
With MSE as the loss function, which model would give the smaller error in terms of MSE / least-squares fit?
26 SELECTING MODEL COMPLEXITY
Cubic regression provides a better fit to the data, and a smaller MSE. Should we stick with the hypothesis $h(x; w) = w_0 + w_1 x + w_2 x^2 + w_3 x^3$? Since a higher-order polynomial seems to provide a better fit, why don't we use a polynomial of order higher than 3? What is the highest order that makes sense for the given problem?
27 SELECTING MODEL COMPLEXITY
For 10 data points, a degree-9 polynomial gives a perfect fit (Lagrange interpolation): the training error is zero. Is it always good to minimize (even reduce to zero) the training error? A related (and more important) question: how will we perform on new, unseen data?
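A sketch reproducing this effect on 10 synthetic points: as the degree grows, the training MSE shrinks (reaching essentially zero at degree 9), while the error on fresh points from the same process does not:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=10)    # synthetic data
x_new = rng.uniform(0, 1, 100)                                # unseen inputs
y_new = np.sin(2 * np.pi * x_new) + rng.normal(scale=0.1, size=100)

for degree in (1, 3, 9):
    # Degree 9 on 10 points interpolates exactly (and is poorly conditioned).
    coeffs = np.polyfit(x, y, degree)  # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(degree, train_mse, test_mse)  # degree 9: train ~ 0, test large
```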
28 OVERFITTING
The degree-9 polynomial model totally fails the prediction for the new point!
Overfitting: the situation when the training error is low and the generalization error is high. Causes of the phenomenon:
- A highly complex hypothesis model, with a large number of parameters (degrees of freedom)
- A small data size (compared to the complexity of the model)
The learned function has enough degrees of freedom to (over)fit all the data perfectly.
29 OVERFITTING
Empirical loss vs. generalization loss.
30 TRAINING AND VALIDATION LOSS
31 SPLITTING THE DATASET IN TWO
32 PERFORMANCE ON VALIDATION SET
33 PERFORMANCE ON VALIDATION SET
34 INCREASING MODEL COMPLEXITY
In this case, the small size of the dataset makes it easy to overfit by increasing the degree of the polynomial (i.e., the hypothesis complexity). For a large multi-dimensional dataset this effect is less strong / evident.
35 TRAINING VS. VALIDATION LOSS
36 MODEL SELECTION AND EVALUATION PROCESS
1. Break all available data into training and testing sets (e.g., 70% / 30%)
2. Break the training set into training and validation sets (e.g., 70% / 30%)
3. Loop:
   i. Set a hyperparameter value (e.g., the degree of the polynomial, i.e., model complexity)
   ii. Train the model using the training set
   iii. Validate the model using the validation set
   iv. Exit the loop if validation errors keep growing while training errors go to zero
4. Choose hyperparameters using the validation-set results: the hyperparameter values corresponding to the lowest validation errors
5. (Optional) With the selected hyperparameters, retrain the model using all training data
6. Evaluate (generalization) performance on the testing set (more on this next time)
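A sketch of steps 1-6 with the polynomial degree as the hyperparameter; the early-exit test in step 3.iv is simplified here to a plain search over a small degree grid, and the data and split sizes are synthetic:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=200)  # synthetic data

# Steps 1-2: 70% train / 30% test, then 70/30 again inside the training set.
idx = rng.permutation(200)
test, train = idx[:60], idx[60:]
val, inner = train[:42], train[42:]

def val_mse(degree):
    coeffs = np.polyfit(x[inner], y[inner], degree)  # 3.ii: train on inner set
    return np.mean((np.polyval(coeffs, x[val]) - y[val]) ** 2)  # 3.iii: validate

# Steps 3-4: try each hyperparameter value, keep the lowest validation error.
best_degree = min(range(1, 10), key=val_mse)

# Step 5: retrain on all training data with the selected hyperparameter.
coeffs = np.polyfit(x[train], y[train], best_degree)

# Step 6: estimate generalization performance on the untouched test set.
test_mse = np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2)
print(best_degree, test_mse)
```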
37 MODEL SELECTION AND EVALUATION PROCESS
[Slide diagram: the dataset is split into a training set and a testing set; the training set is further split into an internal training set and a validation set; models 1..n are each learned on the internal training set and validated on the validation set, and the best model is selected and retrained.]