Machine Learning Lecture 5


 Percival Howard
 1 years ago
 Views:
Transcription
1 Machine Learning Lecture 5 Linear Discriminant Functions Bastian Leibe RWTH Aachen
2 Course Outline Fundamentals Bayes Decision Theory Probability Density Estimation Classification Approaches Linear Discriminants Support Vector Machines Ensemble Methods & Boosting Randomized Trees, Forests & Ferns Deep Learning Foundations Convolutional Neural Networks Recurrent Neural Networks 2
3 Recap: Mixture of Gaussians (MoG) Generative model j p(j) = ¼ j Weight of mixture component p(x) p(xjµ j ) Mixture component x p(x) p(xjµ) = Mixture density MX p(xjµ j )p(j) Slide credit: Bernt Schiele x j=1 3
4 Recap: Estimating MoGs Iterative Strategy Assuming we knew the values of the hidden variable ML for Gaussian #1 ML for Gaussian #2 assumed known j h(j = 1jx n ) = h(j = 2jx n ) = ¹ 1 = P N n=1 h(j = 1jx n)x n P N i=1 h(j = 1jx n) ¹ 2 = P N n=1 h(j = 2jx n)x n P N i=1 h(j = 2jx n) Slide credit: Bernt Schiele 4
5 Recap: Estimating MoGs Iterative Strategy Assuming we knew the mixture components assumed known p(j = 1jx) p(j = 2jx) j Bayes decision rule: Decide j = 1 if p(j = 1jx n ) > p(j = 2jx n ) Slide credit: Bernt Schiele 5
6 Recap: KMeans Clustering Iterative procedure 1. Initialization: pick K arbitrary centroids (cluster means) 2. Assign each sample to the closest centroid. 3. Adjust the centroids to be the means of the samples assigned to them. 4. Go to step 2 (until no change) Algorithm is guaranteed to converge after finite #iterations. Local optimum Final result depends on initialization. Slide credit: Bernt Schiele 6
7 Recap: EM Algorithm ExpectationMaximization (EM) Algorithm EStep: softly assign samples to mixture components j (x n ) Ã ¼ jn(x n j¹ j ; j ) P N 8j = 1; : : : ; K; k=1 ¼ kn(x n j¹ k ; k ) n = 1; : : : ; N MStep: reestimate the parameters (separately for each mixture component) based on the soft assignments NX ^N j Ã j (x n ) = soft number of samples labeled j n=1 ^¼ new j ^¹ new j ^ new j Ã ^N j N Ã 1^Nj Ã 1^Nj Slide adapted from Bernt Schiele NX j (x n )x n n=1 NX n=1 j (x n )(x n ^¹ new j )(x n ^¹ new j ) T 7
8 Topics of This Lecture Linear discriminant functions Definition Extension to multiple classes Leastsquares classification Derivation Shortcomings Generalized linear models Connection to neural networks Generalized linear discriminants & gradient descent 8
9 Discriminant Functions Bayesian Decision Theory p(c k jx) = p(xjc k)p(c k ) p(x) Model conditional probability densities p(xjc k ) and priors p(c k ) Compute posteriors p(c k jx) (using Bayes rule) Minimize probability of misclassification by maximizing p(cjx). New approach Directly encode decision boundary Without explicit modeling of probability densities Minimize misclassification probability directly. Slide credit: Bernt Schiele 9
10 Recap: Discriminant Functions Formulate classification in terms of comparisons Discriminant functions y 1 (x); : : : ; y K (x) Classify x as class C k if y k (x) > y j (x) 8j 6= k Examples (Bayes Decision Theory) y k (x) = p(c k jx) y k (x) = p(xjc k )p(c k ) y k (x) = log p(xjc k ) + log p(c k ) Slide credit: Bernt Schiele 10
11 Discriminant Functions Example: 2 classes y 1 (x) > y 2 (x), y 1 (x) y 2 (x) > 0, y(x) > 0 Decision functions (from Bayes Decision Theory) y(x) = p(c 1 jx) p(c 2 jx) y(x) = ln p(xjc 1) p(xjc 2 ) + ln p(c 1) p(c 2 ) Slide credit: Bernt Schiele 11
12 Learning Discriminant Functions General classification problem Goal: take a new input x and assign it to one of K classes C k. Given: training set X = {x 1,, x N } with target values T = {t 1,, t N }. Learn a discriminant function y(x) to perform the classification. 2class problem Binary target values: Kclass problem 1ofK coding scheme, e.g. t n 2 f0; 1g t n = (0; 1; 0; 0; 0) T 12
13 Linear Discriminant Functions 2class problem y(x) > 0 : Decide for class C 1, else for class C 2 In the following, we focus on linear discriminant functions y(x) = w T x + w 0 weight vector bias (= threshold) If a data set can be perfectly classified by a linear discriminant, then we call it linearly separable. Slide credit: Bernt Schiele 13
14 Linear Discriminant Functions Decision boundary Normal vector: w Offset: w 0 kwk y(x) = 0 y > 0 y = 0 y < 0 defines a hyperplane y(x) = w T x + w 0 Slide credit: Bernt Schiele 14
15 Linear Discriminant Functions Notation D : Number of dimensions x = x 1 x w = w 1 w x D w D y(x) = w T x + w 0 DX = w i x i + w 0 = i=1 DX w i x i i=0 with x 0 = 1 constant Slide credit: Bernt Schiele 15
16 Extension to Multiple Classes Two simple strategies Onevsall classifiers Onevsone classifiers How many classifiers do we need in both cases? What difficulties do you see for those strategies? 16 Image source: C.M. Bishop, 2006
17 Extension to Multiple Classes Problem Both strategies result in regions for which the pure classification result (y k > 0) is ambiguous. In the onevsall case, it is still possible to classify those inputs based on the continuous classifier outputs y k > y j 8j k. Solution We can avoid those difficulties by taking K linear functions of the form y k (x) = w T k x + w k0 and defining the decision boundaries directly by deciding for C k iff y k > y j 8j k. This corresponds to a 1ofK coding scheme t n = (0; 1; 0; : : : ; 0; 0) T 17 Image source: C.M. Bishop, 2006
18 Extension to Multiple Classes Kclass discriminant Combination of K linear functions y k (x) = w T k x + w k0 Resulting decision hyperplanes: (w k w j ) T x + (w k0 w j0 ) = 0 It can be shown that the decision regions of such a discriminant are always singly connected and convex. This makes linear discriminant models particularly suitable for problems for which the conditional densities p(x w i ) are unimodal. 18 Image source: C.M. Bishop, 2006
19 Topics of This Lecture Linear discriminant functions Definition Extension to multiple classes Leastsquares classification Derivation Shortcomings Generalized linear models Connection to neural networks Generalized linear discriminants & gradient descent 19
20 General Classification Problem Classification problem Let s consider K classes described by linear models y k (x) = w T k x + w k0 ; k = 1; : : : ; K We can group those together using vector notation where y(x) = W ft ex 2 w 10 : : : w K0 w 11 : : : w K1 fw = [ ew 1 ; : : : ; ew K ] = w 1D : : : w KD The output will again be in 1ofK notation. We can directly compare it to the target value. t = [t 1 ; : : : ; t k ] T 20
21 General Classification Problem Classification problem For the entire dataset, we can write Y( e X) = e X f W and compare this to the target matrix T where fw = [ ew 1 ; : : : ; ew K ] 2 3 ex = 6 xt T = x T N t T 1... t T N Result of the comparison: ex f W T fw Goal: Choose such that this is minimal! 21
22 LeastSquares Classification Simplest approach Directly try to minimize the sumofsquares error We could write this as E(w) = 1 2 NX KX (y k (x n ; w) t kn ) 2 n=1 k=1 But let s stick with the matrix notation for now (The result will be simpler to express and we ll learn some nice matrix algebra rules along the way...) 22
23 LeastSquares Classification Multiclass case E D ( W) f = 1 n( 2 Tr X e W f T) T ( X e W f o T) Let s formulate the sumofsquares error in matrix notation X using: a 2 ij = Tr A T A ª i;j Taking the f W E D( f W) = f W Tr n ( e X f W T) T ( e X f W T) = X e W f T) T ( X e W f W f ( X e W f T) T ( X e W f T) = e X T ( e X f W T) o @Y n ( X e W f T) T ( X e W f o T) Tr fag = 25
24 LeastSquares Classification Minimizing the W f E D( W) f = X e T ( X e W f T) =! 0 ex W f = T fw = ( e X T e X) 1 e X T T = e X y T pseudoinverse We then obtain the discriminant function as y(x) = f W T ex = T T³ e X y T ex Exact, closedform solution for the discriminant function parameters. 26
25 Problems with Least Squares Leastsquares is very sensitive to outliers! The error function penalizes predictions that are too correct. 27 Image source: C.M. Bishop, 2006
26 Problems with LeastSquares Another example: 3 classes (red, green, blue) Linearly separable problem Leastsquares solution: Most green points are misclassified! Deeper reason for the failure Leastsquares corresponds to Maximum Likelihood under the assumption of a Gaussian conditional distribution. However, our binary target vectors have a distribution that is clearly nongaussian! Leastsquares is the wrong probabilistic tool in this case! 28 Image source: C.M. Bishop, 2006
27 Topics of This Lecture Linear discriminant functions Definition Extension to multiple classes Leastsquares classification Derivation Shortcomings Generalized linear models Connection to neural networks Generalized linear discriminants & gradient descent 29
28 Generalized Linear Models Linear model y(x) = w T x + w 0 Generalized linear model y(x) = g(w T x + w 0 ) g( ) is called an activation function and may be nonlinear. The decision surfaces correspond to y(x) = const:, w T x + w 0 = const: If g is monotonous (which is typically the case), the resulting decision boundaries are still linear functions of x. 30
29 Generalized Linear Models Consider 2 classes: p(c 1 jx) = = = p(xjc 1 )p(c 1 ) p(xjc 1 )p(c 1 ) + p(xjc 2 )p(c 2 ) p(xjc 2)p(C 2 ) p(xjc 1 )p(c 1 ) exp( a) g(a) Slide credit: Bernt Schiele with a = ln p(xjc 1)p(C 1 ) p(xjc 2 )p(c 2 ) 31
30 Logistic Sigmoid Activation Function g(a) exp( a) Example: Normal distributions with identical covariance p x a pxb x p a x p b x Slide credit: Bernt Schiele x 32
31 Normalized Exponential General case of K > 2 classes: p(c k jx) = p(xjc k)p(c k ) P j p(xjc j)p(c j ) = P exp(a k) j exp(a j) with a k = ln p(xjc k )p(c k ) This is known as the normalized exponential or softmax function Can be regarded as a multiclass generalization of the logistic sigmoid. Slide credit: Bernt Schiele 33
32 Relationship to Neural Networks 2Class case with x 0 = 1 constant Neural network ( singlelayer perceptron ) Slide credit: Bernt Schiele 34
33 Relationship to Neural Networks Multiclass case with x 0 = 1 constant Multiclass perceptron K Slide credit: Bernt Schiele 35
34 Logistic Discrimination If we use the logistic sigmoid activation function g(a) exp( a) y(x) = g(w T x + w 0 ) then we can interpret the y(x) as posterior probabilities! Slide adapted from Bernt Schiele 36
35 Other Motivation for Nonlinearity Recall leastsquares classification One of the problems was that data points that are too correct have a strong influence on the decision surface under a squarederror criterion. E(w) = NX (y(x n ; w) t n ) 2 n=1 Reason: the output of y(x n ;w) can grow arbitrarily large for some x n : y(x; w) = w T x + w 0 By choosing a suitable nonlinearity (e.g. a sigmoid), we can limit those influences y(x; w) = g(w T x + w 0 ) 37
36 Discussion: Generalized Linear Models Advantages The nonlinearity gives us more flexibility. Can be used to limit the effect of outliers. Choice of a sigmoid leads to a nice probabilistic interpretation. Disadvantage Leastsquares minimization in general no longer leads to a closedform analytical solution. Need to apply iterative methods. Gradient descent. 38
37 Linear Separability Up to now: restrictive assumption Only consider linear decision boundaries Classical counterexample: XOR x 2 x 1 Slide credit: Bernt Schiele 39
38 Generalized Linear Discriminants Generalization Transform vector x with M nonlinear basis functions Á j (x): y k (x) = MX w kj Á j (x) + w k0 j=1 Purpose of Á j (x): basis functions Allow nonlinear decision boundaries. By choosing the right Á j, every continuous function can (in principle) be approximated with arbitrary accuracy. Notation y k (x) = MX w kj Á j (x) with Á 0 (x) = 1 Slide credit: Bernt Schiele j=0 41
39 Generalized Linear Discriminants Model y k (x) = MX w kj Á j (x) = y k (x; w) j=0 K functions (outputs) y k (x;w) Learning in Neural Networks Singlelayer networks: Á j are fixed, only weights w are learned. Multilayer networks: both the w and the Á j are learned. We will take a closer look at neural networks from lecture 11 on. For now, let s first consider generalized linear discriminants in general Slide credit: Bernt Schiele 42
40 Gradient Descent Learning the weights w: N training data points: X = {x 1,, x N } K outputs of decision functions: y k (x n ;w) Target vector for each data point: T = {t 1,, t N } Error function (leastsquares error) of linear model E(w) = 1 2 = 1 2 NX n=1 k=1 KX (y k (x n ; w) t kn ) 2 0 NX n=1 k=1 j=1 1 MX w kj Á j (x n ) t kn A 2 Slide credit: Bernt Schiele 43
41 Gradient Descent Problem The error function can in general no longer be minimized in closed form. ldea (Gradient Descent) Iterative minimization Start with an initial guess for the parameter values. Move towards a (local) minimum by following the gradient. w ( +1) kj = w ( kj This simple scheme corresponds to a 1 st order Taylor expansion (There are more complex procedures available). w( ) : Learning rate w (0) kj 44
42 Gradient Descent Basic Strategies Batch learning w ( +1) kj = w ( kj w( ) : Learning rate Compute the gradient based on all kj Slide credit: Bernt Schiele 45
43 Gradient Descent Basic Strategies Sequential updating E(w) = NX n=1 w ( +1) kj = w ( ) kj E n kj : Learning rate w( ) Compute the gradient based on a single data point at a n kj Slide credit: Bernt Schiele 46
44 Gradient Descent Error function NX E(w) = E n (w) = 1 2 n=1 E n (w) = 1 2 n kj NX 0 n=1 k=1 0 k=1 1 MX w kj Á j (x n ) t kn A j=1 1 MX w kj Á j (x n ) t kn A j=1 1 MX w k ~j Á ~j (x n) t kn A Á j (x n ) ~j=1 2 2 = (y k (x n ; w) t kn ) Á j (x n ) Slide credit: Bernt Schiele 47
45 Gradient Descent Delta rule (=LMS rule) w ( +1) kj = w ( ) kj (y k(x n ; w) t kn ) Á j (x n ) = w ( ) kj ± kná j (x n ) where ± kn = y k (x n ; w) t kn Simply feed back the input data point, weighted by the classification error. Slide credit: Bernt Schiele 48
46 Gradient Descent Cases with differentiable, nonlinear activation function 0 y k (x) = g(a k ) = 1 MX w ki Á j (x n ) A j=0 Gradient descent Slide credit: Bernt n kj kj (y k (x n ; w) t kn ) Á j (x n ) w ( +1) kj = w ( ) kj ± kná j (x n ) ± kn kj (y k (x n ; w) t kn ) 49
47 Summary: Generalized Linear Discriminants Properties General class of decision functions. Nonlinearity g( ) and basis functions Á j allow us to address linearly nonseparable problems. Shown simple sequential learning approach for parameter estimation using gradient descent. Better 2 nd order gradient descent approaches available (e.g. NewtonRaphson). Limitations / Caveats Flexibility of model is limited by curse of dimensionality g( ) and Á j often introduce additional parameters. Models are either limited to lowerdimensional input space or need to share parameters. Linearly separable case often leads to overfitting. Several possible parameter choices minimize training error. 50
48 References and Further Reading More information on Linear Discriminant Functions can be found in Chapter 4 of Bishop s book (in particular Chapter 4.1). Christopher M. Bishop Pattern Recognition and Machine Learning Springer,
Machine Learning Lecture 7
Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant
More informationMachine Learning Lecture 2
Machine Perceptual Learning and Sensory Summer Augmented 15 Computing Many slides adapted from B. Schiele Machine Learning Lecture 2 Probability Density Estimation 16.04.2015 Bastian Leibe RWTH Aachen
More informationMachine Learning Lecture 12
Machine Learning Lecture 12 Neural Networks 30.11.2017 Bastian Leibe RWTH Aachen http://www.vision.rwthaachen.de leibe@vision.rwthaachen.de Course Outline Fundamentals Bayes Decision Theory Probability
More informationLinear Models for Classification
Linear Models for Classification Oliver Schulte  CMPT 726 Bishop PRML Ch. 4 Classification: Handwritten Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,
More informationLast updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
Last updated: Oct 22, 2012 LINEAR CLASSIFIERS Problems 2 Please do Problem 8.3 in the textbook. We will discuss this in class. Classification: Problem Statement 3 In regression, we are modeling the relationship
More informationMachine Learning Lecture 2
Announcements Machine Learning Lecture 2 Eceptional number of lecture participants this year Current count: 449 participants This is very nice, but it stretches our resources to their limits Probability
More informationLecture Machine Learning (past summer semester) If at least one Englishspeaking student is present.
Advanced Machine Learning Lecture 1 Organization Lecturer Prof. Bastian Leibe (leibe@vision.rwthaachen.de) Introduction 20.10.2015 Bastian Leibe Visual Computing Institute RWTH Aachen University http://www.vision.rwthaachen.de/
More informationNONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 MultiLayer Perceptrons The BackPropagation Learning Algorithm Generalized Linear Models Radial Basis Function
More informationIntroduction to Neural Networks
Introduction to Neural Networks Steve Renals Automatic Speech Recognition ASR Lecture 10 24 February 2014 ASR Lecture 10 Introduction to Neural Networks 1 Neural networks for speech recognition Introduction
More informationMachine Learning Lecture 14
Many slides adapted from B. Schiele, S. Roth, Z. Gharahmani Machine Learning Lecture 14 Undirected Graphical Models & Inference 23.06.2015 Bastian Leibe RWTH Aachen http://www.vision.rwthaachen.de/ leibe@vision.rwthaachen.de
More informationMachine Learning. 7. Logistic and Linear Regression
Sapienza University of Rome, Italy  Machine Learning (27/28) University of Rome La Sapienza Master in Artificial Intelligence and Robotics Machine Learning 7. Logistic and Linear Regression Luca Iocchi,
More informationParametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a
Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Some slides are due to Christopher Bishop Limitations of Kmeans Hard assignments of data points to clusters small shift of a
More informationLINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
LINEAR CLASSIFIERS Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification, the input
More informationMark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.
University of Cambridge Engineering Part IIB & EIST Part II Paper I0: Advanced Pattern Processing Handouts 4 & 5: MultiLayer Perceptron: Introduction and Training x y (x) Inputs x 2 y (x) 2 Outputs x
More informationMachine Learning Lecture 14
Machine Learning Lecture 14 Tricks of the Trade 07.12.2017 Bastian Leibe RWTH Aachen http://www.vision.rwthaachen.de leibe@vision.rwthaachen.de Course Outline Fundamentals Bayes Decision Theory Probability
More informationMachine Learning Lecture 1
Many slides adapted from B. Schiele Machine Learning Lecture 1 Introduction 18.04.2016 Bastian Leibe RWTH Aachen http://www.vision.rwthaachen.de/ leibe@vision.rwthaachen.de Organization Lecturer Prof.
More informationLinear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington
Linear Classification CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Example of Linear Classification Red points: patterns belonging
More informationMultilayer Perceptron
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Single Perceptron 3 Boolean Function Learning 4
More informationNeural Networks. Bishop PRML Ch. 5. Alireza Ghane. Feedforward Networks Network Training Error Backpropagation Applications
Neural Networks Bishop PRML Ch. 5 Alireza Ghane Neural Networks Alireza Ghane / Greg Mori 1 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of
More informationMachine Learning 2017
Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section
More informationThe classifier. Theorem. where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know
The Bayes classifier Theorem The classifier satisfies where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know Alternatively, since the maximum it is
More informationClustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26
Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar
More informationECE521 Lecture7. Logistic Regression
ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multiclass classification 2 Outline Decision theory is conceptually easy and computationally hard
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Nonparametric
More informationMidterm: CS 6375 Spring 2015 Solutions
Midterm: CS 6375 Spring 2015 Solutions The exam is closed book. You are allowed a onepage cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for an
More informationCENG 793. On Machine Learning and Optimization. Sinan Kalkan
CENG 793 On Machine Learning and Optimization Sinan Kalkan 2 Now Introduction to ML Problem definition Classes of approaches KNN Support Vector Machines Softmax classification / logistic regression Parzen
More informationNeural networks and support vector machines
Neural netorks and support vector machines Perceptron Input x 1 Weights 1 x 2 x 3... x D 2 3 D Output: sgn( x + b) Can incorporate bias as component of the eight vector by alays including a feature ith
More informationLecture 4: Perceptrons and Multilayer Perceptrons
Lecture 4: Perceptrons and Multilayer Perceptrons Cognitive Systems II  Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning Perceptrons, Artificial Neuronal Networks Lecture 4: Perceptrons
More informationLinear Regression (continued)
Linear Regression (continued) Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 1 / 39 Outline 1 Administration 2 Review of last lecture 3 Linear regression
More informationNeural Networks. Nicholas Ruozzi University of Texas at Dallas
Neural Networks Nicholas Ruozzi University of Texas at Dallas Handwritten Digit Recognition Given a collection of handwritten digits and their corresponding labels, we d like to be able to correctly classify
More informationArtificial Neural Networks 2
CSC2515 Machine Learning Sam Roweis Artificial Neural s 2 We saw neural nets for classification. Same idea for regression. ANNs are just adaptive basis regression machines of the form: y k = j w kj σ(b
More informationLearning Neural Networks
Learning Neural Networks Neural Networks can represent complex decision boundaries Variable size. Any boolean function can be represented. Hidden units can be interpreted as new features Deterministic
More informationL11: Pattern recognition principles
L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction
More informationLecture 5: Linear models for classification. Logistic regression. Gradient Descent. Secondorder methods.
Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Secondorder methods. Linear models for classification Logistic regression Gradient descent and secondorder methods
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two onepage, twosided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and
More information10701/ Machine Learning, Fall
070/578 Machine Learning, Fall 2003 Homework 2 Solution If you have questions, please contact Jiayong Zhang .. (Error Function) The sumofsquares error is the most common training
More informationIntroduction to Logistic Regression and Support Vector Machine
Introduction to Logistic Regression and Support Vector Machine guest lecturer: MingWei Chang CS 446 Fall, 2009 () / 25 Fall, 2009 / 25 Before we start () 2 / 25 Fall, 2009 2 / 25 Before we start Feel
More informationMachine Learning Linear Classification. Prof. Matteo Matteucci
Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)
More informationIntroduction to Natural Computation. Lecture 9. Multilayer Perceptrons and Backpropagation. Peter Lewis
Introduction to Natural Computation Lecture 9 Multilayer Perceptrons and Backpropagation Peter Lewis 1 / 25 Overview of the Lecture Why multilayer perceptrons? Some applications of multilayer perceptrons.
More informationCOM336: Neural Computing
COM336: Neural Computing http://www.dcs.shef.ac.uk/ sjr/com336/ Lecture 2: Density Estimation Steve Renals Department of Computer Science University of Sheffield Sheffield S1 4DP UK email: s.renals@dcs.shef.ac.uk
More informationMidterm, Fall 2003
578 Midterm, Fall 2003 YOUR ANDREW USERID IN CAPITAL LETTERS: YOUR NAME: There are 9 questions. The ninth may be more timeconsuming and is worth only three points, so do not attempt 9 unless you are
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters
More informationMachine Learning (CS 567) Lecture 5
Machine Learning (CS 567) Lecture 5 Time: TTh 5:00pm  6:20pm Location: GFS 118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol
More informationLinear models: the perceptron and closest centroid algorithms. D = {(x i,y i )} n i=1. x i 2 R d 9/3/13. Preliminaries. Chapter 1, 7.
Preliminaries Linear models: the perceptron and closest centroid algorithms Chapter 1, 7 Definition: The Euclidean dot product beteen to vectors is the expression d T x = i x i The dot product is also
More informationWhat Do Neural Networks Do? MLP Lecture 3 Multilayer networks 1
What Do Neural Networks Do? MLP Lecture 3 Multilayer networks 1 Multilayer networks Steve Renals Machine Learning Practical MLP Lecture 3 7 October 2015 MLP Lecture 3 Multilayer networks 2 What Do Single
More informationσ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
More informationMIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE
MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE March 28, 2012 The exam is closed book. You are allowed a double sided one page cheat sheet. Answer the questions in the spaces provided on
More informationNumerical Learning Algorithms
Numerical Learning Algorithms Example SVM for Separable Examples.......................... Example SVM for Nonseparable Examples....................... 4 Example Gaussian Kernel SVM...............................
More informationSlides modified from: PATTERN RECOGNITION CHRISTOPHER M. BISHOP. and: Computer vision: models, learning and inference Simon J.D.
Slides modified from: PATTERN RECOGNITION AND MACHINE LEARNING CHRISTOPHER M. BISHOP and: Computer vision: models, learning and inference. 2011 Simon J.D. Prince ClassificaLon Example: Gender ClassiﬁcaLon
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning MCMC and NonParametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is
More informationPATTERN CLASSIFICATION
PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wileylnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationSupport Vector Machine (SVM) and Kernel Methods
Support Vector Machine (SVM) and Kernel Methods CE717: Machine Learning Sharif University of Technology Fall 2015 Soleymani Outline Margin concept HardMargin SVM SoftMargin SVM Dual Problems of HardMargin
More informationLecture 2: Simple Classifiers
CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 2: Simple Classifiers Slides based on Rich Zemel s All lecture slides will be available on the course website: www.cs.toronto.edu/~jessebett/csc412
More informationLeast Mean Squares Regression
Least Mean Squares Regression Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar 1 Lecture Overview Linear classifiers What functions do linear classifiers express? Least Squares Method
More informationLab 5: 16 th April Exercises on Neural Networks
Lab 5: 16 th April 01 Exercises on Neural Networks 1. What are the values of weights w 0, w 1, and w for the perceptron whose decision surface is illustrated in the figure? Assume the surface crosses the
More informationCOMP 551 Applied Machine Learning Lecture 14: Neural Networks
COMP 551 Applied Machine Learning Lecture 14: Neural Networks Instructor: Ryan Lowe (ryan.lowe@mail.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted,
More informationSupport Vector Machines and Kernel Methods
2018 CS420 Machine Learning, Lecture 3 Hangout from Prof. Andrew Ng. http://cs229.stanford.edu/notes/cs229notes3.pdf Support Vector Machines and Kernel Methods Weinan Zhang Shanghai Jiao Tong University
More informationArtificial Neural Networks Examination, June 2005
Artificial Neural Networks Examination, June 2005 Instructions There are SIXTY questions. (The pass mark is 30 out of 60). For each question, please select a maximum of ONE of the given answers (either
More informationIntroduction to Gaussian Process
Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Nonparametric Gaussian Process (GP) GP Regression
More informationMachine Learning, Midterm Exam
10601 Machine Learning, Midterm Exam Instructors: Tom Mitchell, Ziv BarJoseph Wednesday 12 th December, 2012 There are 9 questions, for a total of 100 points. This exam has 20 pages, make sure you have
More informationLecture 4: Training a Classifier
Lecture 4: Training a Classifier Roger Grosse 1 Introduction Now that we ve defined what binary classification is, let s actually train a classifier. We ll approach this problem in much the same way as
More informationGenerative v. Discriminative classifiers Intuition
Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 20052007 Carlos Guestrin 1 Generative
More informationLecture 17: Neural Networks and Deep Learning
UVA CS 6316 / CS 4501004 Machine Learning Fall 2016 Lecture 17: Neural Networks and Deep Learning Jack Lanchantin Dr. Yanjun Qi 1 Neurons 1Layer Neural Network Multilayer Neural Network Loss Functions
More informationIntroduction to Logistic Regression
Introduction to Logistic Regression Guy Lebanon Binary Classification Binary classification is the most basic task in machine learning, and yet the most frequent. Binary classifiers often serve as the
More informationArtificial Neural Networks (ANN)
Artificial Neural Networks (ANN) Edmondo Trentin April 17, 2013 ANN: Definition The definition of ANN is given in 3.1 points. Indeed, an ANN is a machine that is completely specified once we define its:
More informationBayesian Decision Theory
Bayesian Decision Theory Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Bayesian Decision Theory Bayesian classification for normal distributions Error Probabilities
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More informationBayesian Classifiers and Probability Estimation. Vassilis Athitsos CSE 4308/5360: Artificial Intelligence I University of Texas at Arlington
Bayesian Classifiers and Probability Estimation Vassilis Athitsos CSE 4308/5360: Artificial Intelligence I University of Texas at Arlington 1 Data Space Suppose that we have a classification problem The
More informationDeep Neural Networks (1) Hidden layers; Backpropagation
Deep Neural Networs (1) Hidden layers; Bacpropagation Steve Renals Machine Learning Practical MLP Lecture 3 4 October 2017 / 9 October 2017 MLP Lecture 3 Deep Neural Networs (1) 1 Recap: Softmax single
More informationMachine Learning and Data Mining. Multilayer Perceptrons & Neural Networks: Basics. Prof. Alexander Ihler
+ Machine Learning and Data Mining Multilayer Perceptrons & Neural Networks: Basics Prof. Alexander Ihler Linear Classifiers (Perceptrons) Linear Classifiers a linear classifier is a mapping which partitions
More informationLinear Models for Classification
Linear Models for Classification Henrik I Christensen Robotics & Intelligent Machines @ GT Georgia Institute of Technology, Atlanta, GA 303320280 hic@cc.gatech.edu Henrik I Christensen (RIM@GT) Linear
More informationLecture 2. G. Cowan Lectures on Statistical Data Analysis Lecture 2 page 1
Lecture 2 1 Probability (90 min.) Definition, Bayes theorem, probability densities and their properties, catalogue of pdfs, Monte Carlo 2 Statistical tests (90 min.) general concepts, test statistics,
More informationLast Time. Today. Bayesian Learning. The Distributions We Love. CSE 446 Gaussian Naïve Bayes & Logistic Regression
CSE 446 Gaussian Naïve Bayes & Logistic Regression Winter 22 Dan Weld Learning Gaussians Naïve Bayes Last Time Gaussians Naïve Bayes Logistic Regression Today Some slides from Carlos Guestrin, Luke Zettlemoyer
More informationModeling Data with Linear Combinations of Basis Functions. Read Chapter 3 in the text by Bishop
Modeling Data with Linear Combinations of Basis Functions Read Chapter 3 in the text by Bishop A Type of Supervised Learning Problem We want to model data (x 1, t 1 ),..., (x N, t N ), where x i is a vector
More informationIntroduction to Graphical Models
Introduction to Graphical Models The 15 th Winter School of Statistical Physics POSCO International Center & POSTECH, Pohang 2018. 1. 9 (Tue.) YungKyun Noh GENERALIZATION FOR PREDICTION 2 Probabilistic
More informationLecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron
CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer
More informationMIXTURE MODELS AND EM
Last updated: November 6, 212 MIXTURE MODELS AND EM Credits 2 Some of these slides were sourced and/or modified from: Christopher Bishop, Microsoft UK Simon Prince, University College London Sergios Theodoridis,
More informationAdvanced statistical methods for data analysis Lecture 1
Advanced statistical methods for data analysis Lecture 1 RHUL Physics www.pp.rhul.ac.uk/~cowan Universität Mainz Klausurtagung des GK Eichtheorien exp. Tests... Bullay/Mosel 15 17 September, 2008 1 Outline
More informationNeural Networks. Xiaojin Zhu Computer Sciences Department University of Wisconsin, Madison. slide 1
Neural Networks Xiaoin Zhu erryzhu@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison slide 1 Terminator 2 (1991) JOHN: Can you learn? So you can be... you know. More human. Not
More informationNeural Networks Lecture 4: Radial Bases Function Networks
Neural Networks Lecture 4: Radial Bases Function Networks H.A Talebi Farzaneh Abdollahi Department of Electrical Engineering Amirkabir University of Technology Winter 2011. A. Talebi, Farzaneh Abdollahi
More informationPattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods
Pattern Recognition and Machine Learning Chapter 6: Kernel Methods Vasil Khalidov Alex Kläser December 13, 2007 Training Data: Keep or Discard? Parametric methods (linear/nonlinear) so far: learn parameter
More informationLINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning
LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES Supervised Learning Linear vs non linear classifiers In KNN we saw an example of a nonlinear classifier: the decision boundary
More informationIntroduction to Machine Learning
Introduction to Machine Learning Linear Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574 1
More informationMachine Learning: Logistic Regression. Lecture 04
Machine Learning: Logistic Regression Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Supervised Learning Task = learn an (unkon function t : X T that maps input
More informationLecture 3 Feedforward Networks and Backpropagation
Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression
More informationBayesian Networks Structure Learning (cont.)
Koller & Friedman Chapters (handed out): Chapter 11 (short) Chapter 1: 1.1, 1., 1.3 (covered in the beginning of semester) 1.4 (Learning parameters for BNs) Chapter 13: 13.1, 13.3.1, 13.4.1, 13.4.3 (basic
More informationLecture 3: Pattern Classification. Pattern classification
EE E68: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mitures and
More informationNearest Neighbors Methods for Support Vector Machines
Nearest Neighbors Methods for Support Vector Machines A. J. Quiroz, Dpto. de Matemáticas. Universidad de Los Andes joint work with María GonzálezLima, Universidad Simón Boĺıvar and Sergio A. Camelo, Universidad
More informationMachine Learning Basics Lecture 2: Linear Classification. Princeton University COS 495 Instructor: Yingyu Liang
Machine Learning Basics Lecture 2: Linear Classification Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d.
More informationThe Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017
The Kernel Trick, Gram Matrices, and Feature Extraction CS6787 Lecture 4 Fall 2017 Momentum for Principle Component Analysis CS6787 Lecture 3.1 Fall 2017 Principle Component Analysis Setting: find the
More informationCOMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS16
COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS6 Lecture 3: Classification with Logistic Regression Advanced optimization techniques Underfitting & Overfitting Model selection (Training
More informationArtificial neural networks
Artificial neural networks Chapter 8, Section 7 Artificial Intelligence, spring 203, Peter Ljunglöf; based on AIMA Slides c Stuart Russel and Peter Norvig, 2004 Chapter 8, Section 7 Outline Brains Neural
More informationMachine Learning
Machine Learning 10601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University February 2, 2015 Today: Logistic regression Generative/Discriminative classifiers Readings: (see class website)
More informationIntro. ANN & Fuzzy Systems. Lecture 15. Pattern Classification (I): Statistical Formulation
Lecture 15. Pattern Classification (I): Statistical Formulation Outline Statistical Pattern Recognition Maximum Posterior Probability (MAP) Classifier Maximum Likelihood (ML) Classifier KNearest Neighbor
More information2018 EE448, Big Data Mining, Lecture 5. (Part II) Weinan Zhang Shanghai Jiao Tong University
2018 EE448, Big Data Mining, Lecture 5 Supervised Learning (Part II) Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html Content of Supervised Learning
More informationIntroduction to Machine Learning
1, DATA11002 Introduction to Machine Learning Lecturer: Teemu Roos TAs: Ville Hyvönen and Janne Leppäaho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer
More informationMachine Learning, Fall 2012 Homework 2
060 Machine Learning, Fall 202 Homework 2 Instructors: Tom Mitchell, Ziv BarJoseph TA in charge: Selen Uguroglu email: sugurogl@cs.cmu.edu SOLUTIONS Naive Bayes, 20 points Problem. Basic concepts, 0
More informationMachine Learning. Neural Networks. (slides from Domingos, Pardo, others)
Machine Learning Neural Networks (slides from Domingos, Pardo, others) For this week, Reading Chapter 4: Neural Networks (Mitchell, 1997) See Canvas For subsequent weeks: Scaling Learning Algorithms toward
More informationArtifical Neural Networks
Neural Networks Artifical Neural Networks Neural Networks Biological Neural Networks.................................. Artificial Neural Networks................................... 3 ANN Structure...........................................
More information