GI01/M055: Supervised Learning

1 GI01/M055: Supervised Learning 1. Introduction to Supervised Learning October 5, 2009 John Shawe-Taylor 1

2 Course information 1. When: Mondays, 14:00-17:00 Where: Room 1.20, Engineering Building, Malet Place 2. Course webpage: 3. Mailing list: 4. Office: 8.14, CS Building, Malet Place 2

3 Assessment 1. Homework (25%) and Exam (75%) 2. 3 homework assignments (submit them on time; late submissions are penalised) 3. To pass the course, you must obtain an average of at least 50% when the homework and exam components are weighted together. 3

4 Material Lecture notes Reference book Hastie, Tibshirani, & Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001 Additional material (see webpage for more info) General: Duda, Hart & Stork; Mitchell; Bishop;... Bayesian methods in ML: MacKay;... Kernel methods: Shawe-Taylor & Cristianini; Schölkopf & Smola;... Learning theory: Devroye, Lugosi & Györfi; Vapnik;... 4

5 Prerequisites Calculus (real-valued functions, limits, derivatives, Taylor series, integrals,...) Elements of probability theory (random variables, expectation, variance, conditional probabilities, Bayes rule,...) Fundamentals of linear algebra (vectors, angles, matrices, eigenvectors/eigenvalues,...), A bit of optimization theory (convex functions, Lagrange multipliers) 5

6 Provisional course outline (Week 1) Key concepts (probabilistic formulation of learning from examples, error functionals; loss function, Bayes rule, learning algorithm, overfitting and underfitting, model selection, cross validation); Some basic learning algorithms (linear regression, k NN); (Week 2) Discriminative vs. Generative approach (Week 3) Optimization and learning algorithms (Week 4) Regularisation, Kernels 6

7 Provisional course outline (Week 6) Elements of Learning Theory (PAC-learning, Complexity of hypothesis space, VC-bounds) (Week 7) Support Vector Machines/Bayesian Interpretation (Week 8) Tree-based Learning Algorithms and Boosting Acknowledgments: The lectures are based on earlier years' courses. Thanks to Massi Pontil for the course notes. He also inherited notes from Fernando Perez-Cruz, Iain Murray and Ed Snelson of the Gatsby Unit at UCL. 7

8 Today's plan Supervised learning problem Why and when learning? Regression and classification Two learning algorithms: least squares and k-NN Probabilistic model, error functional, optimal solutions Hypothesis space, overfitting and underfitting Choice of the learning algorithm (Model selection) 8

9 Supervised learning problem Given a set of input/output pairs (training set) we wish to compute the functional relationship between the input and the output: x → f → y Ex1 (people detection) given an image we wish to say if it depicts a person or not The output is one of 2 possible categories Ex2 (pose estimation) we wish to predict the pose of a face image The output is a continuous number (here a real number describing the face rotation angle) In both problems the input is a high dimensional vector x representing pixel intensity/color 9

10 Why and when learning? The aim of learning is to develop software systems able to perform particular tasks such as people detection. The standard software engineering approach would be to specify the problem, develop an algorithm to compute the solution, and then implement it efficiently. The problem with this approach is developing the algorithm: No known criterion for distinguishing the images; In many cases humans have no difficulty; Typically the difficulty is specifying the problem in logical terms. 10

11 Learning approach Learning attempts to infer the algorithm from a set of (labelled) examples, in much the same way that children learn by being shown a set of examples (eg sports/non sports car). Attempts to isolate underlying structure from a set of examples. The approach should be stable: finds something that is not a chance feature of the particular set of examples efficient: infers the solution in time polynomial in the size of the data robust: should not be too sensitive to mislabelled/noisy examples 11

12 People detection example 12

13 People detection example (cont.) 13

14 Notation X: input space (eg, X ⊆ IR^d), with elements x, x′, x_i, ... Y: output space, with elements y, y′, y_i, ... Classification (qualitative output): Y = {c_1, ..., c_k}; binary classification: k = 2 Regression (quantitative output): Y ⊆ IR S = {(x_i, y_i)}_{i=1}^m: training set (set of I/O examples) Supervised learning problem: compute a function which best describes the I/O relationship 14

15 Learning algorithm Training set: S = {(x_i, y_i)}_{i=1}^m ⊂ X × Y A learning algorithm is a mapping S ↦ f_S A new input x is predicted as f_S(x) In the course we mainly deal with deterministic algorithms but we'll also comment on some randomized ones Today: we describe two simple learning algorithms: linear regression and k nearest neighbours 15

16 Some important questions How is the data collected? (need assumptions!) How do we represent the inputs? (may require a preprocessing step) How accurate is f_S on new data? / How do we evaluate the performance of the learning algorithm on unseen data? (study of generalization error) How complex is a learning task? (computational complexity, sample complexity) Given two different learning algorithms, f_S and f′_S, which one should we choose? (model selection problem) 16

17 Some difficulties/aspects of the learning process New inputs differ from the ones in the training set (look-up tables do not work!) Inputs are measured with noise The output is not a deterministic function of the input The input is high dimensional but some components/variables may be irrelevant Whenever prior knowledge is available it should be used 17

18 More examples / applications Optical digit recognition (useful for identifying the numbers in a ZIP code from a digitized image) (Computer Vision) Predicting house prices based on sq. feet, number of rooms, distance from central London,... (Marketing) Estimate the amount of glucose in the blood of a diabetic person (Medicine) Detect spam e-mails (Information retrieval) Predict protein functions / structures (Bioinformatics) Speaker identification / sound recognition (Speech recognition) 18

19 Binary classification: an example We describe two basic learning algorithms/models for classification which can be easily adapted to regression as well. We choose: X = IR^2, x = (x_1, x_2) and Y = {green, red} Our first learning algorithm computes a linear function (perceptron), w·x + b, and classifies an input x as f(x) = red if w·x + b > 0, green if w·x + b ≤ 0 19
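
A minimal sketch of this decision rule in Python/NumPy; the weight vector w and offset b below are placeholder values (they have not been learned from data):

    import numpy as np

    def linear_classify(x, w, b):
        """Classify x as 'red' if w.x + b > 0, 'green' otherwise."""
        return "red" if np.dot(w, x) + b > 0 else "green"

    # Hypothetical parameters and a test point, purely for illustration.
    w = np.array([1.0, -2.0])
    b = 0.5
    print(linear_classify(np.array([3.0, 1.0]), w, b))  # 'red', since 1*3 - 2*1 + 0.5 > 0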

20 Least squares How do we compute the parameters w and b? We do not lose generality if we set b = 0 (we can change the data representation as x → (x, 1) and w → (w, b)) We find w by minimizing the residual sum of squares on the data R(w) = Σ_{i=1}^m (y_i − w·x_i)^2 To compute the minimum we need to solve the system of equations ∇R(w) = 0 (recall that ∇ denotes the gradient, ∇ = (∂/∂w_1, ∂/∂w_2)) 20

21 Normal equations Note that, since R(w) = Σ_{i=1}^m (y_i − w·x_i)^2, ∂R/∂w_k = 2 Σ_{i=1}^m (w·x_i − y_i) ∂(w·x_i)/∂w_k = 2 Σ_{i=1}^m (w·x_i − y_i) x_{ik} Hence, to find w = (w_1, w_2) we need to solve the linear system of equations Σ_{i=1}^m (x_{ik} x_{i1} w_1 + x_{ik} x_{i2} w_2) = Σ_{i=1}^m x_{ik} y_i, k = 1, 2 21

22 Normal equations (cont.) In vector notation: Σ_{i=1}^m x_i x_i^T w = Σ_{i=1}^m x_i y_i In matrix notation: X^T X w = X^T y, where X is the m × d matrix whose rows are x_1^T, ..., x_m^T (entries X_{ij} = x_{ij}) and y = (y_1, ..., y_m)^T Note: here d = 2 but, clearly, the result holds for any d ∈ IN 22

23 Least squares solution X^T X w = X^T y For the time being we will assume that the matrix X^T X is invertible, so we conclude that w = (X^T X)^{-1} X^T y Otherwise, the solution may not be unique... (we will deal with the case that X^T X is not invertible later) 23
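
A short NumPy sketch of this closed-form solution on synthetic data made up for illustration; it assumes X^T X is invertible and uses np.linalg.solve rather than forming the inverse explicitly:

    import numpy as np

    rng = np.random.default_rng(0)
    m, d = 50, 2
    X = rng.normal(size=(m, d))               # rows of X are the inputs x_i
    w_true = np.array([1.5, -0.7])
    y = X @ w_true + 0.1 * rng.normal(size=m)

    # Normal equations: (X^T X) w = X^T y
    w_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(w_hat)                              # close to w_true

    # To recover the offset b, augment each input with a constant 1
    # (the substitution x -> (x, 1), w -> (w, b) mentioned on slide 20).
    X_aug = np.hstack([X, np.ones((m, 1))])
    wb_hat = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
    print(wb_hat)                             # last component estimates b (about 0 here)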

24 Going back to b Substituting x by (x, 1) and w by (w, b), the above system of equations can be expressed in matrix form as (exercise): (X^T X) w + X^T 1 b = X^T y, 1^T X w + m b = 1^T y, that is [ X^T X  X^T 1 ; 1^T X  m ] [ w ; b ] = [ X^T y ; 1^T y ], where 1 = (1, 1, ..., 1)^T is the m × 1 vector of ones 24

25 A different approach: k nearest neighbours Let N(x; k) be the set of k nearest training inputs to x and I_x = {i : x_i ∈ N(x; k)} the corresponding index set f(x) = red if (1/k) Σ_{i ∈ I_x} y_i > 1/2, green otherwise Closeness is measured using a metric (eg, Euclidean dist.) Local rule (compute a local majority vote) Decision boundary is non-linear Note: for regression we set f(x) = (1/k) Σ_{i ∈ I_x} y_i (a local mean) 25
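
A minimal k-NN sketch using Euclidean distance and 0/1 labels (so the majority vote is just the local mean thresholded at 1/2); the toy data are made up for illustration:

    import numpy as np

    def knn_predict(x, X_train, y_train, k, classification=True):
        """k-NN prediction: local mean of the labels of the k nearest training inputs."""
        dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distances to all x_i
        idx = np.argsort(dists)[:k]                  # index set I_x
        local_mean = y_train[idx].mean()
        if classification:
            return 1 if local_mean > 0.5 else 0      # majority vote with labels in {0, 1}
        return local_mean                            # regression: the local mean itself

    # Toy training set: label 1 ("red") in the upper half-plane, 0 ("green") below.
    X_train = np.array([[0., 1.], [1., 2.], [2., 1.], [0., -1.], [1., -2.], [2., -1.]])
    y_train = np.array([1, 1, 1, 0, 0, 0])
    print(knn_predict(np.array([1.0, 1.5]), X_train, y_train, k=3))  # 1 ("red")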

26 k NN: the effect of k The smaller k, the more irregular the decision boundary (figure: decision boundaries for k = 15 and k = 1) How to choose k? later... 26

27 Linear regression vs. k NN (informal) Global vs. local Linear vs. non-linear Bias / variance considerations: LR relies heavily on the linear assumption (may have large bias); k-NN does not LR is stable (the solution does not change much if the data are perturbed); 1-NN isn't! k NN is sensitive to the input dimension d: if d is high, the inputs tend to be far away from each other! 27

28 Supervised learning model (back to Q1) We assume that the data is obtained by sampling i.i.d. from a fixed but unknown probability density P(x, y) Expected error: E(f) := E[(y − f(x))^2] = ∫ (y − f(x))^2 dP(x, y) Our goal is to minimize E Optimal solution: f* := argmin_f E(f) But: in order to compute f* we need to know P! Note: for binary classification with Y = {0, 1} and f : X → Y, E(f) counts the average number of mistakes of f (aka expected misclassification error) 28
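
The expected error can only be evaluated when P is known. In a simulation where we pick P ourselves, E(f) can be approximated by averaging the squared loss over a large i.i.d. sample; the distribution and the candidate f below are invented purely for illustration:

    import numpy as np

    rng = np.random.default_rng(1)

    # A made-up P(x, y): x ~ Uniform(0, 1), y = sin(2*pi*x) + Gaussian noise.
    def sample_P(n):
        x = rng.uniform(0.0, 1.0, size=n)
        y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)
        return x, y

    f = lambda x: 0.0 * x                 # a (poor) candidate predictor

    x, y = sample_P(1_000_000)            # a large i.i.d. sample stands in for the integral
    E_hat = np.mean((y - f(x)) ** 2)      # Monte Carlo estimate of E(f)
    print(E_hat)                          # about 0.59 = E[sin^2] + noise variance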

29 Regression function... Let us compute the optimal solution f* for regression (Y = IR) and binary classification (Y = {0, 1}) Using the decomposition P(y, x) = P(y|x) P(x) we have E(f) = ∫_X { ∫_Y (y − f(x))^2 dP(y|x) } dP(x), so we see that, for regression, f* (called the regression function) is f*(x) = argmin_{c ∈ IR} ∫_Y (y − c)^2 dP(y|x) = (exercise) = ∫_Y y dP(y|x) 29
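
For completeness, a brief sketch of the step labelled (exercise), writing μ(x) = ∫_Y y dP(y|x) for the conditional mean (in LaTeX):

    \int_Y (y-c)^2\,dP(y|x)
      = \int_Y (y-\mu(x))^2\,dP(y|x)
        + 2(\mu(x)-c)\int_Y (y-\mu(x))\,dP(y|x)
        + (\mu(x)-c)^2

The middle integral vanishes by the definition of μ(x), so the objective is minimized exactly at c = μ(x), giving f*(x) = ∫_Y y dP(y|x).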

30 ... and Bayes classifier for binary classification, f* (called the Bayes classifier) is f*(x) = 1 if P(1|x) > 1/2, 0 otherwise To see this note, as before, that f*(x) = argmin_{c ∈ {0,1}} ∫_Y (y − c)^2 dP(y|x) = (1 − c)^2 P(1|x) + c^2 P(0|x), and use P(0|x) + P(1|x) = 1 30

31 k-nn revised k-nn attempts to estimate/approximate P (1 x) as 1 k i I x y i Expectation is replaced by averaging over sample data Conditioning at x is relaxed to conditioning on some region close to x One can show that E(f S ) E(f ) as m provided that: k(m) and k(m) m 0 Weakness: the approximation (rate of convergence) depends critically on the input dimension... 31

32 Least squares revisited P(x, y) is unknown ⇒ cannot compute f* = argmin_f E(f) We are only given a sample (training set) from P A natural approach: we approximate the expected error E(f) by the empirical error E_emp(f) := (1/m) Σ_{i=1}^m (y_i − f(x_i))^2 If we minimize E_emp over all possible functions, we can always find a function with zero empirical error! Note: sometimes the (empirical) error is called the (empirical) risk 32
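
To see why zero empirical error is trivially achievable (and uninformative about new inputs), here is a small illustrative sketch of a look-up-table "learner"; the data are synthetic:

    import numpy as np

    rng = np.random.default_rng(2)
    x_train = rng.uniform(size=10)
    y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.normal(size=10)

    def lookup_table(x):
        """Return the stored y_i if x exactly matches a training input, else 0."""
        for xi, yi in zip(x_train, y_train):
            if x == xi:
                return yi
        return 0.0

    E_emp = np.mean([(yi - lookup_table(xi)) ** 2 for xi, yi in zip(x_train, y_train)])
    print(E_emp)                  # exactly 0: the table memorises the training set
    print(lookup_table(0.123))    # but on a new input it is useless (returns 0)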

33 Least squares revisited (cont.) Proposed solution: we introduce a restrictive class of functions H, called the hypothesis space, possibly with a prior distribution over the functions in H We minimize E_emp within H. That is, our learning algorithm is: f_S = argmin_{f ∈ H} E_emp(f) This approach is usually called empirical error minimization Linear regression: H = {f(x) = w·x + b : w ∈ IR^d, b ∈ IR} 33

34 Summary Data S sampled i.i.d. from P (fixed but unknown) f* is what we want, f_S is what we get Different approaches to attempt to estimate/approximate f*: Minimize E_emp in some restricted space of functions (eg, linear) Compute a local approximation of f* (k-NN) Estimate P and then use Bayes rule... (we'll discuss this next week) 34

35 Perspectives Theoretical / methodological aspects involved in supervised learning Function representation and approximation to describe H Optimization/numerical methods to compute f_S Probabilistic methods to study the generalization error of f_S or to infer the likelihood of f_S 35

36 Additive noise model Consider the regression problem. Assume that the output is computed as y = f(x) + ε, where ε is a zero mean r.v., E[ε] = 0 Hence we can write P(y, x) = P(y|x) P(x) = P_ε(y − f(x)) P(x) Noise-free model: y is deterministically computed from x (ε ≡ 0) 36

37 Additive noise model (cont.) y = f(x) + ε, P(y, x) = P(y|x) P(x) = P_ε(y − f(x)) P(x) The training data is obtained, for i = 1, ..., m, as follows: sample x_i from P_x; sample ε_i from P_ε; set y_i = f(x_i) + ε_i Note: P(x) ≡ P_x(x) (just use different notation when needed) 37

38 Additive noise model (cont.) P(y, x) = P(y|x) P_x(x) = P_ε(y − f(x)) P_x(x) So, since ε has zero mean, we have that f*(x) := ∫ y dP(y|x) = ∫ y dP_ε(y − f(x)) = ∫ (f(x) + ε) dP_ε(ε) = f(x) A common choice for the noise distribution P_ε is a Gaussian: P_ε(ε) = (1/√(2πσ^2)) exp(−ε^2/(2σ^2)) 38
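
A short sketch of this generative procedure; the regression function f, the input distribution P_x and the noise level are arbitrary choices made for illustration:

    import numpy as np

    rng = np.random.default_rng(3)
    m, sigma = 100, 0.2
    f = lambda x: np.sin(2 * np.pi * x)    # a made-up "true" regression function

    x = rng.uniform(0.0, 1.0, size=m)      # sample x_i from P_x (here Uniform(0, 1))
    eps = sigma * rng.normal(size=m)       # sample eps_i from P_eps = N(0, sigma^2)
    y = f(x) + eps                         # set y_i = f(x_i) + eps_i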

39 Maximum likelihood What is the probability of the data S given that the underlying function is f? P(S|f) = Π_{i=1}^m P(y_i, x_i|f) = Π_{i=1}^m P(y_i|x_i, f) P(x_i) = A Π_{i=1}^m P(y_i|x_i, f), where A = Π_{i=1}^m P(x_i) = P(x_1, ..., x_m) We define the likelihood of f as L(f; S) = P(S|f) = A Π_{i=1}^m P(y_i|x_i, f) 39

40 Maximum likelihood (cont.) Maximum likelihood principle: compute f by maximizing L(f; S) If we use linear functions and additive Gaussian noise, we have L(w; S) = A Π_{i=1}^m (2πσ^2)^{-1/2} exp{−(y_i − w·x_i)^2 / (2σ^2)} In particular the log-likelihood is (note: since the log function is strictly increasing, maximizing L or log L is the same) log L(w; S) = −(1/(2σ^2)) Σ_{i=1}^m (y_i − w·x_i)^2 + const Hence maximizing the likelihood is equivalent to least squares! 40
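
A small numerical check of this equivalence under the linear-plus-Gaussian-noise assumption: minimising the negative log-likelihood with a generic optimiser recovers the closed-form least-squares solution (SciPy is assumed to be available):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(4)
    m, d, sigma = 200, 3, 0.5
    X = rng.normal(size=(m, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + sigma * rng.normal(size=m)

    def neg_log_lik(w):
        # Up to constants, -log L(w; S) = (1/(2*sigma^2)) * sum_i (y_i - w.x_i)^2
        return np.sum((y - X @ w) ** 2) / (2 * sigma ** 2)

    w_ml = minimize(neg_log_lik, x0=np.zeros(d)).x   # maximum likelihood estimate
    w_ls = np.linalg.solve(X.T @ X, X.T @ y)         # least squares estimate
    print(np.allclose(w_ml, w_ls, atol=1e-4))        # True: the two coincide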

41 Supervised learning as function approximation Given the training data y_i = f*(x_i) + ε_i, the goal is to compute an approximation of f* in some region of IR^d. We look for an approximant of f* within a prescribed hypothesis space H Unless prior knowledge is available on f* (eg, f* is linear) we cannot expect f* ∈ H Choosing H very large leads to overfitting! (we'll see an example of this in a moment) 41

42 Polynomial fitting As an example of hypothesis spaces of increasing complexity consider regression in one dimension: H_0 = {f(x) = b : b ∈ IR}, H_1 = {f(x) = ax + b : a, b ∈ IR}, H_2 = {f(x) = a_1 x + a_2 x^2 + b : a_1, a_2, b ∈ IR}, ..., H_n = {f(x) = Σ_{l=1}^n a_l x^l + b : a_1, ..., a_n, b ∈ IR} Let us minimize the empirical error in H_r (r = polynomial degree) 42

43 Polynomial fitting (simulation) (figure: six panels showing the fitted polynomial for r = 0, 1, 2, 3, 4, 5) As r increases the fit to the data improves (empirical error decreases) 43

44 Overfitting vs. underfitting Compare the empirical error (solid line) with the expected error (dashed line), plotted against the polynomial degree r small: underfitting r large: overfitting The larger r, the lower the empirical error of f_S! We cannot rely on the training error! 44
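
A sketch of the polynomial-fitting experiment: compare the empirical (training) error with the error on a large independent sample as the degree r grows; the data and noise level are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(5)
    f = lambda x: np.sin(2 * np.pi * x)

    def sample(n):
        x = rng.uniform(0.0, 1.0, size=n)
        return x, f(x) + 0.2 * rng.normal(size=n)

    x_tr, y_tr = sample(15)        # a small training set
    x_te, y_te = sample(10_000)    # a large independent sample approximates the expected error

    for r in range(0, 7):
        coeffs = np.polyfit(x_tr, y_tr, deg=r)     # empirical error minimization in H_r
        train_err = np.mean((y_tr - np.polyval(coeffs, x_tr)) ** 2)
        test_err = np.mean((y_te - np.polyval(coeffs, x_te)) ** 2)
        print(f"r={r}: train {train_err:.3f}, test {test_err:.3f}")
    # Training error decreases as r grows; test error typically starts increasing (overfitting).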

45 k NN: the effect of k The smaller k, the more irregular the decision boundary and the smaller the empirical error (figure: decision boundaries for k = 15 and k = 1) 45

46 k NN: the effect of k m/k large: overfitting m/k small: underfitting 46

47 Model selection How to choose k in k-NN? How to choose the degree r for polynomial regression? The simplest approach is to use part of the training data (say 2/3) for training and the rest as a validation set Another approach is K-fold cross-validation (see next slide) In all cases, use this proxy for the test error to select the best k (or r) We will come back to model selection later in the course 47

48 Cross-validation We split the data into K parts (of roughly equal sizes); repeatedly train on K − 1 parts and test on the part left out; average the errors of the K tests to give the so-called cross-validation error Smaller K is less expensive but gives a poorer estimate, as the size of each training set is smaller and random fluctuations are larger For a dataset of size m, m-fold cross-validation is referred to as leave-one-out (LOO) testing 48
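
A sketch of K-fold cross-validation used to select the polynomial degree r from slide 42; the data are synthetic and K = 5 is an arbitrary choice:

    import numpy as np

    rng = np.random.default_rng(6)
    m, K = 60, 5
    x = rng.uniform(0.0, 1.0, size=m)
    y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=m)

    folds = np.array_split(rng.permutation(m), K)        # split the data into K parts

    def cv_error(r):
        errs = []
        for i in range(K):
            test = folds[i]                              # the part left out
            train = np.concatenate([folds[j] for j in range(K) if j != i])
            coeffs = np.polyfit(x[train], y[train], deg=r)   # train on K-1 parts
            errs.append(np.mean((y[test] - np.polyval(coeffs, x[test])) ** 2))
        return np.mean(errs)                             # average over the K tests

    best_r = min(range(0, 8), key=cv_error)
    print(best_r)    # the degree with the smallest cross-validation error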

49 Other learning paradigms Supervised learning is not the only learning setup! Semi-supervised learning: the learning environment may give us access to many input examples but only a few of them are labeled Active learning: we are given many inputs and we can choose which ones to ask the label for Unsupervised learning: we have only input examples. Here we may want to find data clusters, estimate the probability density of the data, find important features/variables (dimensionality reduction problem), detect anomalies, etc. 49
