DATA MINING AND MACHINE LEARNING. Lecture 4: Linear models for regression and classification Lecturer: Simone Scardapane
1 DATA MINING AND MACHINE LEARNING Lecture 4: Linear models for regression and classification Lecturer: Simone Scardapane Academic Year 2016/2017
2 Table of contents Linear models for regression: Regularized least squares; SVD analysis of ridge regression; Online estimation of LRR; Multiple outputs; Hints on the geometry of least squares; Linear expansion on random bases. Linear models for classification: Least squares classification; Logistic regression; Discriminant analysis.
3 OLS for a fixed basis expansion Going back to OLS, we can easily obtain a nonlinear model by first transforming the input $x$ via a nonlinear function $h(x) : \mathbb{R}^d \to \mathbb{R}^h$. In this case, we construct a linear model in $\mathbb{R}^h$, which is implicitly nonlinear in the original space $\mathbb{R}^d$: $f(x) = w^T h(x)$. As a motivating example, consider a one-dimensional input $x$, and a polynomial expansion given by: $h(x) = [x^0, x^1, \ldots, x^P]$, for some user-specified $P$.
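The expansion above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the helper name `poly_expand`, the sample size, the noise level, and the target $2\sin(x) - 0.4x$ (with the minus sign assumed) are choices made here, not taken from the slides.

```python
import numpy as np

def poly_expand(x, P):
    """Map a 1-D input array to the matrix H with columns x^0, x^1, ..., x^P."""
    return np.vander(x, P + 1, increasing=True)

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 20)
# Assumed target and noise level, used only to have something to fit.
y = 2.0 * np.sin(x) - 0.4 * x + 0.3 * rng.standard_normal(x.shape)

def fit_ols(x, y, P):
    """OLS in the expanded space: linear in h(x), nonlinear in x."""
    H = poly_expand(x, P)
    w, *_ = np.linalg.lstsq(H, y, rcond=None)
    return w

def train_mse(x, y, P):
    w = fit_ols(x, y, P)
    return np.mean((poly_expand(x, P) @ w - y) ** 2)

mse_linear = train_mse(x, y, P=1)  # standard OLS line
mse_cubic = train_mse(x, y, P=3)   # cubic model
```

Because the models are nested, the training error for $P = 3$ can never exceed the one for $P = 1$; whether it generalizes better is a separate question.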
4 Polynomial expansion (1) Figure 1 [axes: x, y; legend: Dataset, Real, Predicted]: Data sampled from $2\sin(x) - 0.4x$ with Gaussian noise; for $P = 1$, we obtain the standard OLS estimate.
5 Polynomial expansion (2) Figure 2 [axes: x, y; legend: Dataset, Real, Predicted]: As before, but we choose $P = 3$. The model is now nonlinear in the original space, providing a relatively good approximation to the real function.
6 Polynomial expansion (3) Figure 3 [axes: x, y; legend: Dataset, Real, Predicted]: We now increase $P$ to 7. The model complexity has increased, but its approximation capabilities have clearly worsened. This is an example of overfitting.
7 The learning curve By setting $P = N$, we can construct a predictor which passes exactly through all training points, achieving an empirical error of 0. This is reminiscent of the k-NN behavior with $k = 1$. We can visualize this behavior by plotting simultaneously the evolution of the training error and of the error on an independent test set, as in the next slide. Note that this polynomial expansion becomes exponentially more complex as we increase $d$. As an example, for $d = 3$ and $P = 2$ we already need $h = 10$ terms: $h(x) = [1, x_1, x_2, x_3, x_1^2, x_2^2, x_3^2, x_1 x_2, x_1 x_3, x_2 x_3]$.
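The growth in the number of terms can be checked by enumerating the monomials directly. The sketch below (the function name `monomial_terms` is hypothetical) lists every monomial of total degree at most $P$ in $d$ variables and compares the count with the binomial coefficient $\binom{d+P}{P}$.

```python
from itertools import combinations_with_replacement
from math import comb

def monomial_terms(d, P):
    """List all monomials of total degree <= P in d variables.

    Each monomial is a tuple of variable indices, e.g. (0, 2) stands for
    x_1 * x_3 and the empty tuple stands for the constant term 1.
    """
    terms = []
    for degree in range(P + 1):
        terms.extend(combinations_with_replacement(range(d), degree))
    return terms

terms_small = monomial_terms(3, 2)   # the d = 3, P = 2 case from the slide
n_small = len(terms_small)
n_large = len(monomial_terms(10, 5)) # already in the thousands
```

For fixed $P$ the count grows polynomially in $d$, and for $P$ growing with $d$ it grows exponentially, which is what makes the expansion impractical beyond a few dimensions.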
8 Example of overfitting Figure 4 [axes: P, Error; legend: Training error, Test error]: The training set is composed of 20 elements, the test set of 250 independent samples. We see that overfitting worsens as $P$ increases.
9 Example of overfitting (2) Figure 5 [axes: P, Error; legend: Training error, Test error]: Similar to before, but we increase the training set size to 40. The results are better for low $P$, but overfitting is still a problem for larger values of the polynomial expansion.
10 The cause of overfitting in this case Figure 6 [axes: P, Norm of w]: If we plot the norm of $w$ for varying $P$, we see that overfitting is closely linked to the magnitude of the weights. Larger weights make the functions less smooth and more prone to overfitting.
11 Regularized least squares We can try to improve the results by including in the optimization process a term favoring low-norm solutions, in line with our intuition that such solutions generalize better: $J(w) = \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - w^T h(x_i) \right)^2 + \frac{C}{2} \|w\|_2^2$, where $C$ balances the two terms. This model is called regularized least squares (RLS), or linear ridge regression (LRR). The solution can still be expressed in closed form (with the factor $N$ absorbed into $C$) as: $w = \left( H^T H + CI \right)^{-1} H^T y$, where $H$ is constructed similarly to $X$, and $I$ is the identity matrix of appropriate size.
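A minimal sketch of the closed-form LRR solution follows. The function name `ridge_fit` and the random data are illustrative assumptions; the linear system is solved directly rather than forming the inverse explicitly, which is a common numerical choice and not something the slides prescribe.

```python
import numpy as np

def ridge_fit(H, y, C):
    """Return w = (H^T H + C I)^{-1} H^T y by solving the linear system."""
    h = H.shape[1]
    return np.linalg.solve(H.T @ H + C * np.eye(h), H.T @ y)

rng = np.random.default_rng(0)
H = rng.standard_normal((30, 4))   # stand-in for the expanded inputs
y = rng.standard_normal(30)

w_weak = ridge_fit(H, y, C=0.01)    # close to the OLS solution
w_strong = ridge_fit(H, y, C=10.0)  # heavily shrunk weights
norm_weak = np.linalg.norm(w_weak)
norm_strong = np.linalg.norm(w_strong)
```

Increasing `C` shrinks the norm of the weights, which is the mechanism used in the following slides to control overfitting.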
12 Polynomial expansion with regularization (1) Figure 7 [axes: x, y; legend: Dataset, Real, Predicted]: We solve the model for $P = 7$ with a small amount of regularization (the value of $C$ is missing in this transcription). The results have improved considerably.
13 Polynomial expansion with regularization (2) Figure 8 [axes: x, y; legend: Dataset, Real, Predicted]: If we increase $C$ too much (0.1 in this case), the model is again too simple to capture anything interesting. This is an example of underfitting.
14 The learning curve with regularization Figure 9 [axes: P, Error; legend: Training error, Test error]: By including a small regularization term $C = 0.01$, the overfitting is kept under control even for large values of $P$.
15 The norm of the weights with regularization Figure 10 [axes: P, Norm of w]: If we plot the norm of the weights, we see that after a certain value of $P$, the norm is curtailed by the presence of the $\ell_2$ norm in the optimization problem.
16 Regularized risk minimization The previous idea is extremely powerful; in fact, a large set of supervised learning methods can be expressed as the solution of a regularized risk functional given by two terms: $I_{\text{reg}}[f] = \underbrace{\frac{1}{N} \sum_{i=1}^{N} L\left( y_i, f(x_i) \right)}_{\text{Training error}} + \underbrace{C \, R[f]}_{\text{Regularization}}$, where $R[f]$ embodies any a priori information we might have on the optimal function, generally trading off its complexity with respect to training accuracy. The parameter $C$ is a hyper-parameter similar to the $k$ of k-NN, allowing the user to select a complexity adequate to the problem under consideration.
18 The SVD decomposition The singular value decomposition (SVD) of the matrix $X$ is defined as: $X = UDV^T$, where: $U$ is an $N \times d$ semi-orthogonal matrix, whose columns span the column space of $X$; $V$ is a $d \times d$ orthogonal matrix, whose columns span the row space of $X$; $D$ is a $d \times d$ diagonal matrix, whose elements on the diagonal satisfy $D_{11} \geq D_{22} \geq \cdots \geq D_{dd} \geq 0$ and are called the singular values of $X$.
19 OLS with SVD decomposition The SVD decomposition can be computed in a stable fashion, and its factors have many useful properties: For $U$, due to the orthonormality of the columns, we have $U^T U = I$, so that $U^T$ acts as a left inverse of $U$. For the square matrix $V$ we have both $V^T V = I$ and $V^{-1} = V^T$. For $D$, being diagonal, we have that $D^T = D$. Also, $D^{-1}$ is a diagonal matrix with elements $1/D_{ii}$. In addition to this, we recall that for any two matrices $A, B$ of appropriate sizes: $(AB)^T = B^T A^T$, and (when both are invertible) $(AB)^{-1} = B^{-1} A^{-1}$.
20 OLS with SVD decomposition (2) We can rewrite the OLS estimate as: $w = \left( X^T X \right)^{-1} X^T y = \left( V D U^T U D V^T \right)^{-1} V D U^T y = V D^{-1} U^T y$. Note that if $X$ is singular (rank-deficient), at least one singular value is 0, leading to an undefined operation in the computation of $D^{-1}$. More generally, the operation is problematic even when $X$ is very close to singular (e.g., two features that depend strongly on each other).
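21 LRR with SVD decomposition With regularization, the matrix inversion step is modified as: $\left( V D^2 V^T + CI \right)^{-1} = \left( V D^2 V^T + C V V^T \right)^{-1} = \left( V \left( D^2 + CI \right) V^T \right)^{-1}$. From which we can write the LRR solution as: $w = V \tilde{D}^{-1} U^T y$, where $\tilde{D}^{-1}$ is the diagonal matrix with entries: $\tilde{D}^{-1}_{ii} = \frac{D_{ii}}{D_{ii}^2 + C}$.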
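The SVD route to the LRR solution can be sketched as follows. The function name `ridge_svd` and the test data are assumptions made for illustration; the code only relies on `np.linalg.svd` with `full_matrices=False`, which returns the thin factors $U$ ($N \times d$), the singular values, and $V^T$.

```python
import numpy as np

def ridge_svd(X, y, C):
    """LRR solution w = V diag(D_ii / (D_ii^2 + C)) U^T y via the thin SVD."""
    U, D, Vt = np.linalg.svd(X, full_matrices=False)
    shrink = D / (D ** 2 + C)          # diagonal of the modified inverse
    return Vt.T @ (shrink * (U.T @ y))

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
y = rng.standard_normal(40)
C = 0.5

w_svd = ridge_svd(X, y, C)
w_direct = np.linalg.solve(X.T @ X + C * np.eye(3), X.T @ y)

# A rank-deficient design: the second column duplicates the first.
X_singular = np.column_stack([X[:, 0], X[:, 0]])
w_singular = ridge_svd(X_singular, y, C)
```

With a duplicated column one singular value is (numerically) zero, yet the shrinkage factor stays finite for any `C > 0`, so the computation remains well defined.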
22 Regularization and singular values The formula above, apart from providing an efficient way to compute the LRR estimate, shows that regularization acts on the singular values of $X$. In fact, even if one singular value is 0, the corresponding diagonal entry of $D^2 + CI$ equals $C > 0$, so the inversion remains well defined. More generally, the LRR problem with $C > 0$ can be shown to be strongly convex. The following quantity is called the effective degrees of freedom of the estimator: $\text{df}(C) = \sum_{i=1}^{d} \frac{D_{ii}^2}{D_{ii}^2 + C}$. Even if we are adapting $d$ parameters, their adaptation is constrained by the value of $C$, so that the actual number of free parameters of the model (and hence its complexity) is lower.
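As a small numerical check of the effective degrees of freedom, the sketch below (with a hypothetical helper `effective_dof` and random data) computes $\text{df}(C)$ from the singular values and compares it with the trace of $X (X^T X + CI)^{-1} X^T$, which is an equivalent expression.

```python
import numpy as np

def effective_dof(X, C):
    """Effective degrees of freedom: sum_i D_ii^2 / (D_ii^2 + C)."""
    D = np.linalg.svd(X, compute_uv=False)
    return np.sum(D ** 2 / (D ** 2 + C))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
d = X.shape[1]

df_values = [effective_dof(X, C) for C in (0.0, 0.1, 1.0, 10.0, 100.0)]

# Equivalent computation through the trace of the prediction matrix.
C = 1.0
M = X @ np.linalg.solve(X.T @ X + C * np.eye(d), X.T)
df_trace = np.trace(M)
```

For a full-rank $X$ the value starts at $d$ for $C = 0$ and decreases monotonically towards 0 as $C$ grows.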
23 Effective degrees of freedom Figure 11 [axes: C, df(C)]: A plot of $\text{df}(C)$ in the previous problem, with $P = 8$.
24 QR decomposition In practice, alternative decomposition strategies are used to solve the LRR problem. As an example, the matrix $X$ (if not singular) can be factored as: $X = QR$, where $Q \in \mathbb{R}^{N \times d}$ is semi-orthogonal, while $R \in \mathbb{R}^{d \times d}$ is upper triangular. This is called the QR decomposition of $X$.
25 Solving OLS with QR decomposition Substituting in the OLS solution we obtain: $w = \left( R^T Q^T Q R \right)^{-1} R^T Q^T y = \left( R^T R \right)^{-1} R^T Q^T y = R^{-1} Q^T y$. The resulting triangular system $Rw = Q^T y$ can be solved efficiently via back substitution. The most complex step is now the QR decomposition which, however, is much more stable than inverting the original matrix.
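The QR route can be sketched as below. The back-substitution loop is written out by hand for clarity (the name `back_substitution` is chosen here); a library triangular solver would normally be used instead. `np.linalg.qr` returns the reduced factors by default.

```python
import numpy as np

def back_substitution(R, b):
    """Solve R w = b for an upper-triangular, non-singular R."""
    d = R.shape[0]
    w = np.zeros(d)
    for i in range(d - 1, -1, -1):
        # Subtract the contribution of the already-computed entries.
        w[i] = (b[i] - R[i, i + 1:] @ w[i + 1:]) / R[i, i]
    return w

def ols_qr(X, y):
    """OLS estimate w = R^{-1} Q^T y from the reduced QR factorization."""
    Q, R = np.linalg.qr(X)   # Q: N x d, R: d x d upper triangular
    return back_substitution(R, Q.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((25, 4))
y = rng.standard_normal(25)

w_qr = ols_qr(X, y)
w_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
```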
27 Online estimation In online learning, the training dataset is not available beforehand, but it is received one sample at a time in a streaming fashion. The objective is to create a sequence of estimates $f_0, f_1, \ldots, f_N$ converging to the solution of the regularized risk. Note the similarity between online learning and stochastic optimization. As a general way of tackling online learning problems, we can apply an SGD formulation using the current sample $(x_n, y_n)$ to compute an estimate of the gradient. For LRR, a step of size $\alpha_n$ along the negative gradient of $\frac{1}{2} \left( y_n - w^T x_n \right)^2 + \frac{C}{2} \|w\|_2^2$ gives the so-called least mean square algorithm: $w_{n+1} = w_n + \alpha_n \left( y_n - w_n^T x_n \right) x_n - \alpha_n C w_n$.
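A minimal sketch of the least mean square update is given below, assuming the update follows the negative gradient of $\frac{1}{2}(y_n - w^T x_n)^2 + \frac{C}{2}\|w\|_2^2$ with a constant step size. The function name `lms`, the step size, and the synthetic stream are illustrative assumptions.

```python
import numpy as np

def lms(X, y, alpha, C=0.0):
    """Process the samples one at a time with the (regularized) LMS update."""
    w = np.zeros(X.shape[1])
    for x_n, y_n in zip(X, y):
        error = y_n - w @ x_n
        # Gradient step on the instantaneous regularized squared error.
        w = w + alpha * error * x_n - alpha * C * w
    return w

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])      # hypothetical generating weights
X_stream = rng.standard_normal((4000, 3))
y_stream = X_stream @ w_true             # noiseless stream for clarity

w_lms = lms(X_stream, y_stream, alpha=0.05)
w_lms_reg = lms(X_stream, y_stream, alpha=0.05, C=0.5)
```

On this noiseless stream the unregularized estimate approaches `w_true`, while with `C > 0` the weights are pulled towards zero, mirroring the batch behavior of LRR.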
28 Sherman-Morrison formula We can obtain a more efficient formulation with some algebraic manipulations. In particular, denote by $X_n$ the matrix $X$ restricted to the first $n$ rows (and similarly for $y_n$). Defining for simplicity $P_n = X_n^T X_n + CI$, we have $P_{n+1} = P_n + x_{n+1} x_{n+1}^T$. The Sherman-Morrison formula allows us to compute $P_{n+1}^{-1}$ from the knowledge of $P_n^{-1}$: $P_{n+1}^{-1} = P_n^{-1} - \frac{P_n^{-1} x_{n+1} x_{n+1}^T P_n^{-1}}{1 + x_{n+1}^T P_n^{-1} x_{n+1}}$.
29 Recursive least-square We can write the complete LRR solution after $n+1$ samples as: $w_{n+1} = P_{n+1}^{-1} \left( X_n^T y_n + x_{n+1} y_{n+1} \right)$. Using $X_n^T y_n = P_n w_n = \left( P_{n+1} - x_{n+1} x_{n+1}^T \right) w_n$ and putting everything together we obtain: $w_{n+1} = w_n + P_{n+1}^{-1} x_{n+1} \left( y_{n+1} - w_n^T x_{n+1} \right)$, which, combined with the previous update for $P_n^{-1}$, gives us a fully iterative solution, by initializing $P_0^{-1}$ as $\frac{1}{C} I$. This is known as the recursive least squares algorithm. Note that, differently from the SGD solution, $w_n$ is the exact solution we would obtain by solving the LRR problem with the data received up to now.
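The recursive update can be sketched as follows, combining the Sherman-Morrison step for $P_n^{-1}$ with the weight update and starting from $w_0 = 0$, $P_0^{-1} = \frac{1}{C} I$. The function name `rls` and the data are assumptions for illustration.

```python
import numpy as np

def rls(X, y, C):
    """Recursive least squares: one rank-one update per incoming sample."""
    d = X.shape[1]
    P_inv = np.eye(d) / C      # P_0^{-1} = I / C
    w = np.zeros(d)
    for x_n, y_n in zip(X, y):
        Px = P_inv @ x_n
        # Sherman-Morrison update of the inverse (P_inv is symmetric).
        P_inv = P_inv - np.outer(Px, Px) / (1.0 + x_n @ Px)
        # Correct the weights using the prediction error on the new sample.
        w = w + (P_inv @ x_n) * (y_n - w @ x_n)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 4))
y = rng.standard_normal(60)
C = 0.3

w_online = rls(X, y, C)
w_batch = np.linalg.solve(X.T @ X + C * np.eye(4), X.T @ y)
```

Up to floating-point error, the online estimate coincides with the batch LRR solution on the same data, at a cost of $O(d^2)$ per sample instead of a full solve.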
31 Handling multiple outputs If we wish to estimate a vector $y \in \mathbb{R}^c$ in output, the linear model can be written as: $f(x) = W^T x$, where we need to adapt a matrix $W \in \mathbb{R}^{d \times c}$. The objective function can now be written as: $J(W) = \frac{1}{2N} \sum_{k=1}^{c} \sum_{i=1}^{N} \left( y_{ik} - f_k(x_i) \right)^2 + \frac{C}{2} \sum_{i=1}^{d} \sum_{j=1}^{c} W_{ij}^2$.
32 Handling multiple outputs (2) After some manipulations, the objective can be rewritten as: $J(W) = \frac{1}{2N} \text{Tr}\left\{ \left( Y - XW \right)^T \left( Y - XW \right) \right\} + \frac{C}{2} \|\text{vect}(W)\|_2^2$, where $Y$ is constructed by stacking row-wise the $y_i$'s; $\text{Tr}\{\cdot\}$ is the trace operator; and $\text{vect}(\cdot)$ transforms the matrix into a vector. Computing the gradient and equating it to 0 shows that the form of the solution is unchanged: $W = \left( X^T X + CI \right)^{-1} X^T Y$. In particular, the estimates for the different outputs are independent of each other.
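The independence of the outputs can be verified numerically: solving for all outputs at once gives the same columns as solving one LRR problem per output. The names and data in this sketch are illustrative.

```python
import numpy as np

def ridge_multi(X, Y, C):
    """Multi-output LRR: W = (X^T X + C I)^{-1} X^T Y, one column per output."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + C * np.eye(d), X.T @ Y)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))   # N = 50 inputs with d = 3 features
Y = rng.standard_normal((50, 2))   # c = 2 outputs stacked row-wise
C = 0.1

W = ridge_multi(X, Y, C)
# Solve each output separately for comparison.
W_separate = np.column_stack([ridge_multi(X, Y[:, k], C) for k in range(2)])
```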
34 An alternative formulation of LRR We can write the predictions of LRR as: $\hat{y} = Xw = \left( X \left( X^T X + CI \right)^{-1} X^T \right) y = My$. The LRR model can be seen as performing a linear combination of the training outputs in order to predict a value, weighting them via the corresponding rows of the hat matrix $M$. The errors of the model have a similar interpretation: $y = \hat{y} + (y - \hat{y}) = My + (I - M) y$.
35 Interpreting the geometry The desired vector $y$ defines a point in some high-dimensional space $\mathbb{R}^N$. Consider the set of all possible linear models: $\hat{y} = Xw = \sum_{i=1}^{d} w_i X_{:i}$, where $X_{:i}$ is the $i$th column of $X$. Technically, the columns of $X$ span a linear subspace (of dimension $d$ if $N \geq d$ and the matrix has full rank) of the larger space $\mathbb{R}^N$. For OLS ($C = 0$), we can think of $M$ as the orthogonal projection matrix mapping $y$ onto the closest possible point on this linear subspace.
36 A concrete example Consider a simple example with $d = 1$ and $N = 2$ (the numerical entries of $X$ and $y$ are missing in this transcription). In this case, $X$ spans a line in $\mathbb{R}^2$ formed by all multiples of $X$ itself. By working out the OLS solution, we find the closest point $\hat{y}$ on this line with respect to $y$ (its numerical value is also missing here).
37 A concrete example (2) Figure 12 [axes: $y_1$, $y_2$]: The red point is $y$, the blue line is the space spanned by $X$. Multiplying by $M$ projects $y$ onto the closest point.
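The projection view can be reproduced with made-up numbers. The values of `X` and `y` below are hypothetical (they are not the ones used in the slides); the sketch builds the OLS hat matrix for $d = 1$, $N = 2$ and checks the defining properties of an orthogonal projection.

```python
import numpy as np

# Hypothetical example values, chosen only for illustration.
X = np.array([[1.0], [2.0]])   # one feature, two samples: spans a line in R^2
y = np.array([2.0, 1.0])

# OLS hat matrix (C = 0): M = X (X^T X)^{-1} X^T.
M = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = M @ y                  # closest point to y on the line spanned by X
residual = y - y_hat           # (I - M) y
```

Here `y_hat` is the multiple of `X` with coefficient $x^T y / x^T x = 4/5$, and the residual is orthogonal to the line, which is what "closest point" means geometrically.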
39 Random bases The polynomial expansion becomes infeasible for more than a few dimensions, due to the exponential number of terms that need to be computed. Can we find some alternative expansion $h(\cdot)$ with good approximation properties? One interesting idea is to combine $h$ nonlinear transformations, whose parameters are extracted randomly from some distribution. As an example, consider a nonlinear transformation $h_i(x)$ built from $\exp\{\cdot\}$ applied to the projection $a_i^T x$ and the offset $b_i$ (the exact expression is garbled in this transcription), with $a_i, b_i$ drawn from a uniform distribution in $[-\lambda, \lambda]$.
40 Random vector functional-link networks The previous model is known as a random vector functional-link (RVFL) network. Crucially, it can be shown that the model can approximate any continuous function on a compact set, provided $h$ is large enough. Thanks to the closed-form solution, this model can be trained extremely fast and performs reasonably well in many situations. It provides a good baseline solution for many scenarios, including low-power devices. Clearly, the number of hidden bases might grow too large in some problems. Additionally, improper regularization results in an extremely large variance of the model. [1] Pao, Y.H., Park, G.H. and Sobajic, D.J., Learning and generalization characteristics of the random vector functional-link net. Neurocomputing, 6(2), pp
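A rough sketch of an RVFL-style model follows. Since the exact basis function is not recoverable from the transcription, the code assumes sigmoid bases $1 / (1 + \exp\{-(a_i^T x + b_i)\})$; this choice, the function names, the target $2\sin(x) - 0.4x$, and all hyper-parameter values are assumptions. Only the output weights are trained, through the closed-form LRR solution.

```python
import numpy as np

def rvfl_features(X, A, b):
    """Random expansion; sigmoid bases are an assumption of this sketch."""
    return 1.0 / (1.0 + np.exp(-(X @ A + b)))

def rvfl_fit(X, y, h, C, lam, rng):
    """Draw a_i, b_i uniformly in [-lam, lam], then solve LRR for the output weights."""
    d = X.shape[1]
    A = rng.uniform(-lam, lam, size=(d, h))
    b = rng.uniform(-lam, lam, size=h)
    H = rvfl_features(X, A, b)
    w = np.linalg.solve(H.T @ H + C * np.eye(h), H.T @ y)
    return A, b, w

rng = np.random.default_rng(0)
X = np.linspace(-3.0, 3.0, 100).reshape(-1, 1)
y = 2.0 * np.sin(X[:, 0]) - 0.4 * X[:, 0]   # assumed target function

A, b, w = rvfl_fit(X, y, h=50, C=1e-3, lam=2.0, rng=rng)
H = rvfl_features(X, A, b)
train_mse = np.mean((H @ w - y) ** 2)
```

The random parameters `A` and `b` are never adapted, so training reduces to a single linear solve; the price is that `h` and `C` have to be chosen with some care.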
41 Example of RVFL application Figure 13 [axes: x, y; legend: Dataset, Real, Predicted]: An example of an RVFL model, with $h = 500$ and a value of $C$ that is missing in this transcription.
43 Classification problems Let us now consider classification problems, i.e.: Binary classification: the desired output is a single value $y_i \in \{0, 1\}$, defining whether or not the corresponding input belongs to the class. Multi-class classification: the output can take one of $M$ different values. One way to tackle this problem is to consider a dummy encoding of the output variable, fit a model on the target vector, and select the class according to the largest value in output. Alternatively, we may fit $M$ different discriminant functions separately, one for each class.
44 Least squares classification If we apply an LRR model, we obtain a least squares classification algorithm. The class is obtained as: $\text{Class of } x = \arg\max_{k = 1, \ldots, M} \left\{ f_k(x) \right\}$. There are (at least) two possible drawbacks to this approach: 1. We cannot guarantee that the elements $f_k(x)$ represent valid probability distributions with respect to the classes, as they can be negative or larger than 1. 2. More importantly, for $M > 2$, there is the possibility that one class is masked by the others, as shown later in a simple 3-class example.
45 An example of least squares classification Figure 14 : With 3 perfectly separable classes, the decision boundaries found by least squares classification are correct.
46 An example of masking Figure 15 : In this case, even if the classes are linearly separable, one of the three is entirely masked by the others.
48 Squashing functions Focusing on the binary classification case, one way to obtain valid probabilities from the linear model is to squash all possible values $w^T x$ in the interval $[0, 1]$ using some scalar function $\sigma(\cdot)$. One example is the sigmoid (or logistic) function, defined as: $\sigma(s) = \frac{1}{1 + \exp\{-s\}}$. This is an example of a squashing function, i.e. a function such that: $\lim_{s \to \infty} \sigma(s) = 1$, and similarly $\lim_{s \to -\infty} \sigma(s) = 0$. The function is monotonically increasing.
49 Visualizing the sigmoid Figure 16 [axes: s, σ(s)]: A plot of the sigmoid function.
50 Inverse link function Rearranging the model, we see that it is still linear with respect to a nonlinear transformation of the output: $\log\left\{ \frac{f(x)}{1 - f(x)} \right\} = w^T x$. If we interpret $f(x)$ as the probability $p(y = 1 \mid x)$, we are constructing a linear model of the logarithm of the odds (i.e., the ratio of success probability vs. probability of failure). The nonlinear transformation above is called the logit function in statistics, and it is an example of a link function (linking probabilities with outputs). Since the sigmoid is its inverse, it is sometimes called an inverse link function.
51 Selecting a loss function It is possible to train the logistic regression model using a squared error, but if we interpret its values as probabilities, we can think of designing more appropriate loss functions. Let us consider the very natural 0/1 loss: $L(y, \hat{y}) = \begin{cases} 0 & \text{if } y = \text{rnd}(\hat{y}), \\ 1 & \text{otherwise,} \end{cases}$ where $\text{rnd}(\cdot)$ rounds the value. This loss cannot be optimized easily because it is not differentiable. We need to resort to an approximation to the 0/1 loss, generally called a surrogate loss function.
52 Logistic loss function Given two probability distributions $q_1(\cdot), q_2(\cdot)$, taking values in a discrete space $D$, their similarity can be computed by the cross-entropy function as: $H(q_1, q_2) = -\sum_{d \in D} q_1(d) \log\left( q_2(d) \right)$. Applying this to our context, we derive the cross-entropy loss: $L(y, \hat{y}) = -y \log\left( \hat{y} \right) - (1 - y) \log\left( 1 - \hat{y} \right)$. The cross-entropy loss can in principle be used for any classification model. In the case of logistic regression, it is sometimes called logistic loss.
53 Bernoulli distribution We can justify the cross-entropy loss from a probabilistic point of view. Recall that $f(x)$ is the probability of observing 1 given $x$, while $1 - f(x)$ is the probability of observing 0. Mathematically, the output follows a Bernoulli distribution: $p(y \mid f(x)) = f(x)^y \left( 1 - f(x) \right)^{1 - y}$. Since the elements of the training set are independent, the probability of having observed the entire dataset is the product of the probabilities: $P(w) = \prod_{i=1}^{N} p(y_i \mid f(x_i))$. The previous quantity is called the likelihood function.
54 Maximum likelihood A low value of $P(w)$ means that the probability of having observed the training set $S$ is small. Finding a set of parameters $w$ maximizing this criterion is called the maximum likelihood (ML) approach: $w^* = \arg\max_{w \in \mathbb{R}^d} \left\{ \prod_{i=1}^{N} p(y_i \mid f(x_i)) \right\}$. This is the first example we encounter of an approach explicitly modeling the conditional probabilities. Taking the logarithm and exchanging maximization with minimization, we obtain the cost function: $J(w) = -\sum_{i=1}^{N} \log\left( p(y_i \mid f(x_i)) \right)$.
55 Logistic regression algorithm Expanding each term in the logarithm, we obtain the logistic regression cost function: $J(w) = -\sum_{i=1}^{N} \left[ y_i \log\left( f(x_i) \right) + (1 - y_i) \log\left( 1 - f(x_i) \right) \right]$, which is equivalent to minimizing the cross-entropy loss on our training dataset. We can easily construct a regularized version of logistic regression by adding an $\ell_2$ penalty on $w$ to the previous cost function. Its probabilistic derivation is slightly more involved, and is left for a future lecture.
56 Solving the logistic regression problem After some manipulations, we express the objective function as: $J(w) = -\sum_{i=1}^{N} \left[ y_i w^T x_i - \log\left( 1 + e^{w^T x_i} \right) \right]$. In the simplest case, we can solve this using a gradient descent algorithm, where (using the chain rule): $\nabla J(w) = -\sum_{i=1}^{N} \left[ y_i x_i - \frac{1}{1 + e^{w^T x_i}} e^{w^T x_i} x_i \right] = -\sum_{i=1}^{N} \left( y_i - f(x_i) \right) x_i$, and we made use of the fact that $\frac{1}{1 + e^{-s}} = \frac{e^s}{1 + e^s}$.
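The cost and its gradient can be sketched as below, taking $J(w)$ to be the negative log-likelihood so that gradient descent subtracts the gradient. Function names, the fixed step size, and the synthetic data are assumptions of this sketch; `np.logaddexp(0, s)` computes $\log(1 + e^s)$ in a numerically stable way.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def logistic_loss(w, X, y):
    """J(w) = -sum_i [ y_i w^T x_i - log(1 + exp(w^T x_i)) ]."""
    s = X @ w
    return -np.sum(y * s - np.logaddexp(0.0, s))

def logistic_grad(w, X, y):
    """Gradient of J: -sum_i (y_i - f(x_i)) x_i = -X^T (y - p)."""
    return -X.T @ (y - sigmoid(X @ w))

def gradient_descent(X, y, step, n_iter):
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w = w - step * logistic_grad(w, X, y)
    return w

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.standard_normal((100, 2))])  # bias + 2 features
w_gen = np.array([0.5, 1.0, -1.0])          # hypothetical generating weights
y = (rng.uniform(size=100) < sigmoid(X @ w_gen)).astype(float)

w_gd = gradient_descent(X, y, step=0.005, n_iter=500)
loss_start = logistic_loss(np.zeros(3), X, y)
loss_end = logistic_loss(w_gd, X, y)
```

The step size has to be small because the gradient is a sum over all samples; dividing the cost by $N$ would allow a larger step.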
57 Newton's method More commonly, logistic regression is solved with Newton's method, an optimization algorithm that makes use of Hessian information. The method works by constructing, at each iteration, a second-order approximation to $J(w)$ around $w_n$ using a Taylor expansion: $J(w) \approx J(w_n) + \left( \nabla J(w_n) \right)^T (w - w_n) + \frac{1}{2} (w - w_n)^T \nabla^2 J(w_n) (w - w_n)$. The estimate is updated according to the analytical minimum of this approximation: $w_{n+1} = w_n - \left( \nabla^2 J(w_n) \right)^{-1} \nabla J(w_n)$.
58 Newton's method in logistic regression We can write gradient and Hessian as: $\nabla J(w) = -X^T (y - p)$ and $\nabla^2 J(w) = X^T W X$, where $p$ is the vector of predicted outputs, and $W$ is a diagonal matrix with $W_{ii} = f(x_i) \left( 1 - f(x_i) \right)$. Putting everything together: $w_{n+1} = w_n + \left( X^T W X \right)^{-1} X^T (y - p)$.
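The Newton iteration for logistic regression can be sketched as follows. This is a plain version without line search or regularization, so it assumes data that are not linearly separable; the function names and the synthetic data are illustrative.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def newton_logistic(X, y, n_iter=25):
    """Iterate w <- w + (X^T W X)^{-1} X^T (y - p), starting from w = 0."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)                 # predicted probabilities
        W_diag = p * (1.0 - p)             # diagonal of W
        hessian = X.T @ (W_diag[:, None] * X)
        w = w + np.linalg.solve(hessian, X.T @ (y - p))
    return w

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.standard_normal((200, 2))])
w_gen = np.array([0.5, 1.0, -1.0])         # hypothetical generating weights
y = (rng.uniform(size=200) < sigmoid(X @ w_gen)).astype(float)

w_newton = newton_logistic(X, y)
grad_norm = np.linalg.norm(X.T @ (y - sigmoid(X @ w_newton)))
```

On well-behaved data a handful of iterations is usually enough to drive the gradient to numerical zero, which is the main advantage over plain gradient descent.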
59 An example of logistic regression Figure 17 [panels: (a) Start, (b) 3 iterations, (c) 15 iterations]: An example of application of logistic regression.
60 Logistic regression for multi-class problems For multi-class problems with $M$ classes, we use a dummy encoding for the output $y$, where $y_k = 1$ if the pattern is of class $k$, 0 otherwise. Consider a linear model $s = Wx$. The equivalent of the sigmoid function here is the softmax function: $f_k(s) = \frac{\exp\{s_k\}}{\sum_{l=1}^{M} \exp\{s_l\}}$. This ensures that the outputs $f_1, \ldots, f_M$ form a valid probability distribution over the classes. The cross-entropy loss trivially extends: $J(w) = -\sum_{i=1}^{N} \sum_{k=1}^{M} y_{ik} \log\left( f_k(x_i) \right)$.
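A short sketch of the softmax and of the multi-class cross-entropy follows. Subtracting the row-wise maximum before exponentiating is a standard numerical safeguard added here (it does not change the result); the function names and the example scores are assumptions.

```python
import numpy as np

def softmax(S):
    """Row-wise softmax of a score matrix S (one row per sample)."""
    shifted = S - S.max(axis=1, keepdims=True)   # avoids overflow in exp
    E = np.exp(shifted)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(Y, F):
    """J = -sum_i sum_k y_ik log f_k(x_i), with Y in dummy (one-hot) encoding."""
    return -np.sum(Y * np.log(F))

# Hypothetical scores s = W x for 3 samples and M = 3 classes.
S = np.array([[2.0, 0.5, -1.0],
              [0.0, 0.0, 0.0],
              [-3.0, 1.0, 4.0]])
Y = np.eye(3)[[0, 1, 2]]      # true classes 0, 1, 2

F = softmax(S)
loss = cross_entropy(Y, F)
predicted = F.argmax(axis=1)
```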
62 Discriminant analysis Discriminant analysis (DA) is an alternative classification technique, which is very popular in statistics, also for historical reasons. In DA, we model several different probabilities: The prior probabilities $\pi_1, \ldots, \pi_M$ are the probabilities of observing a specific class, irrespective of $x$ (i.e., the relative frequencies of the classes). For each class $k$, we model the probability $p(x \mid k)$ describing the distribution of points in that class. This is our first example of a generative model. Using Bayes' theorem, we can write the probability that $x$ belongs to class $k$ as: $p(k \mid x) = \frac{p(x \mid k) \pi_k}{\sum_{t=1}^{M} p(x \mid t) \pi_t}$.
63 Modeling the probabilities Each prior probability can be modeled easily as $\pi_k = \frac{N_k}{N}$, where $N_k$ is the number of points in the dataset belonging to class $k$. For simplicity, we can model each class-conditional probability as a Gaussian distribution: $p(x \mid k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right\}$, where $\mu_k$ is the mean, and $\Sigma_k$ the covariance of class $k$.
64 Estimating the probabilities Both the mean and the covariance can be estimated empirically from the data: $\mu_k = \frac{1}{N_k} \sum_{x \in S_k} x$, $\Sigma_k = \frac{1}{N_k - 1} \sum_{x \in S_k} (x - \mu_k)(x - \mu_k)^T$, where $S_k$ is the set of input points belonging to class $k$. While this approach is extremely efficient, it requires us to estimate $1 + d + d^2$ parameters for each class, for a total of $M + Md(d+1)$ parameters, as compared to the lower number of $Md$ parameters in a logistic regression model.
65 Quadratic discriminant analysis By Bayes' theorem, we can associate the following discriminant function to each class: $\delta_k(x) = -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k$. By choosing the class corresponding to the largest $\delta_k(\cdot)$, we obtain the quadratic discriminant analysis (QDA) algorithm. If we further suppose that all classes share the same covariance matrix $\Sigma_k = \Sigma$, $k = 1, \ldots, M$, we obtain linear discriminant analysis (LDA), requiring $d^2 + dM + M$ parameters: $\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$.
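The estimation and classification steps of QDA can be sketched as follows. Function names and the two-class synthetic dataset are assumptions of this sketch; `np.cov` with `rowvar=False` uses the $1/(N_k - 1)$ normalization, and `np.linalg.slogdet` returns the sign and the logarithm of the determinant.

```python
import numpy as np

def fit_qda(X, y):
    """Estimate prior, mean and covariance for each class."""
    params = {}
    for k in np.unique(y):
        X_k = X[y == k]
        params[k] = (len(X_k) / len(X), X_k.mean(axis=0), np.cov(X_k, rowvar=False))
    return params

def qda_discriminant(x, prior, mu, Sigma):
    """delta_k(x) = -0.5 log|Sigma_k| - 0.5 (x - mu_k)^T Sigma_k^{-1} (x - mu_k) + log pi_k."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(Sigma, diff) + np.log(prior)

def predict_qda(x, params):
    """Choose the class with the largest discriminant value."""
    return max(params, key=lambda k: qda_discriminant(x, *params[k]))

rng = np.random.default_rng(0)
# Two well-separated Gaussian classes (made-up data).
X = np.vstack([rng.standard_normal((100, 2)) + [-3.0, -3.0],
               rng.standard_normal((100, 2)) + [3.0, 3.0]])
y = np.array([0] * 100 + [1] * 100)

params = fit_qda(X, y)
predictions = np.array([predict_qda(x, params) for x in X])
accuracy = np.mean(predictions == y)
```

Replacing each class covariance with a single pooled estimate would turn the same code into an LDA classifier with linear decision boundaries.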
66 Naïve Bayes Naïve Bayes is a simplified form of discriminant analysis, where we assume that all features are mutually independent given the class label, i.e.: $p(x \mid k) = \prod_{i=1}^{d} p(x_i \mid k)$. As an example, for continuous features we can model each $p(x_i \mid k)$ as a univariate Gaussian, for a total of $M + 2Md$ parameters ($M$ priors, plus a mean and a variance for each feature and each class). Naïve Bayes is particularly powerful whenever $d$ is large and we have many categorical features.
67 Visualizing QDA Figure 18 : Plot of the two Gaussian distributions for QDA in a simple binary classification problem.
68 Visualizing LDA Figure 19 : Plot of the two Gaussian distributions for LDA in the same dataset.
69 Visualizing Naive Bayes Figure 20 : Plot of the two distributions for Naive Bayes in the same dataset.
70 Further readings Most of the material in this lecture is rearranged from Chapters 3 and 4 of The Elements of Statistical Learning. An introductory article to linear expansions on random bases: [1] Scardapane, S. and Wang, D., Randomness in neural networks: An overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7(2). Two interesting articles on the Naïve Bayes algorithm: [2] Domingos, P. and Pazzani, M., On the optimality of the simple Bayesian classifier under zero-one loss. Machine learning, 29(2-3), pp [3] Lewis, D.D., Naive (Bayes) at forty: The independence assumption in information retrieval. In ECML (pp. 4-15). Springer Berlin Heidelberg.
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationLinear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington
Linear Classification CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Example of Linear Classification Red points: patterns belonging
More informationMachine Learning Linear Classification. Prof. Matteo Matteucci
Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationLogistic Regression. Machine Learning Fall 2018
Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes
More informationCPSC 340: Machine Learning and Data Mining. More PCA Fall 2017
CPSC 340: Machine Learning and Data Mining More PCA Fall 2017 Admin Assignment 4: Due Friday of next week. No class Monday due to holiday. There will be tutorials next week on MAP/PCA (except Monday).
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationGaussian and Linear Discriminant Analysis; Multiclass Classification
Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015
More informationLinear Models for Regression
Linear Models for Regression CSE 4309 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 The Regression Problem Training data: A set of input-output
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationLogistic Regression. COMP 527 Danushka Bollegala
Logistic Regression COMP 527 Danushka Bollegala Binary Classification Given an instance x we must classify it to either positive (1) or negative (0) class We can use {1,-1} instead of {1,0} but we will
More informationCMU-Q Lecture 24:
CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input
More informationLeast Squares Regression
CIS 50: Machine Learning Spring 08: Lecture 4 Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the
More informationMachine Learning. 7. Logistic and Linear Regression
Sapienza University of Rome, Italy - Machine Learning (27/28) University of Rome La Sapienza Master in Artificial Intelligence and Robotics Machine Learning 7. Logistic and Linear Regression Luca Iocchi,
More information> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel
Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation
More informationLinear Models for Regression CS534
Linear Models for Regression CS534 Prediction Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict the
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V
More informationLinear Regression and Its Applications
Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table
More informationAn Introduction to Statistical and Probabilistic Linear Models
An Introduction to Statistical and Probabilistic Linear Models Maximilian Mozes Proseminar Data Mining Fakultät für Informatik Technische Universität München June 07, 2017 Introduction In statistical learning
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationL11: Pattern recognition principles
L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction
More informationComputer Vision Group Prof. Daniel Cremers. 3. Regression
Prof. Daniel Cremers 3. Regression Categories of Learning (Rep.) Learnin g Unsupervise d Learning Clustering, density estimation Supervised Learning learning from a training data set, inference on the
More informationLinear Models for Regression. Sargur Srihari
Linear Models for Regression Sargur srihari@cedar.buffalo.edu 1 Topics in Linear Regression What is regression? Polynomial Curve Fitting with Scalar input Linear Basis Function Models Maximum Likelihood
More informationMidterm. Introduction to Machine Learning. CS 189 Spring You have 1 hour 20 minutes for the exam.
CS 189 Spring 2013 Introduction to Machine Learning Midterm You have 1 hour 20 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. Please use non-programmable calculators
More informationECE521: Inference Algorithms and Machine Learning University of Toronto. Assignment 1: k-nn and Linear Regression
ECE521: Inference Algorithms and Machine Learning University of Toronto Assignment 1: k-nn and Linear Regression TA: Use Piazza for Q&A Due date: Feb 7 midnight, 2017 Electronic submission to: ece521ta@gmailcom
More informationCSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression
CSC2515 Winter 2015 Introduction to Machine Learning Lecture 2: Linear regression All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/csc2515_winter15.html
More informationMachine Learning for NLP
Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 305 Part VII
More informationMachine Learning. Regression-Based Classification & Gaussian Discriminant Analysis. Manfred Huber
Machine Learning Regression-Based Classification & Gaussian Discriminant Analysis Manfred Huber 2015 1 Logistic Regression Linear regression provides a nice representation and an efficient solution to
More informationLecture 10. Neural networks and optimization. Machine Learning and Data Mining November Nando de Freitas UBC. Nonlinear Supervised Learning
Lecture 0 Neural networks and optimization Machine Learning and Data Mining November 2009 UBC Gradient Searching for a good solution can be interpreted as looking for a minimum of some error (loss) function
More informationMODULE -4 BAYEIAN LEARNING
MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities
More informationCS 195-5: Machine Learning Problem Set 1
CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of
More informationMachine Learning Support Vector Machines. Prof. Matteo Matteucci
Machine Learning Support Vector Machines Prof. Matteo Matteucci Discriminative vs. Generative Approaches 2 o Generative approach: we derived the classifier from some generative hypothesis about the way
More informationIntroduction to Machine Learning
Introduction to Machine Learning Thomas G. Dietterich tgd@eecs.oregonstate.edu 1 Outline What is Machine Learning? Introduction to Supervised Learning: Linear Methods Overfitting, Regularization, and the
More informationLinear Discrimination Functions
Laurea Magistrale in Informatica Nicola Fanizzi Dipartimento di Informatica Università degli Studi di Bari November 4, 2009 Outline Linear models Gradient descent Perceptron Minimum square error approach
More information10-701/ Machine Learning - Midterm Exam, Fall 2010
10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam
More informationMachine Learning for Signal Processing Bayes Classification and Regression
Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For
More informationComputational statistics
Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters
More informationNonparametric Bayesian Methods (Gaussian Processes)
[70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent
More informationCOMS 4771 Introduction to Machine Learning. James McInerney Adapted from slides by Nakul Verma
COMS 4771 Introduction to Machine Learning James McInerney Adapted from slides by Nakul Verma Announcements HW1: Please submit as a group Watch out for zero variance features (Q5) HW2 will be released
More informationLecture 3. Linear Regression II Bastian Leibe RWTH Aachen
Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression
More informationDecember 20, MAA704, Multivariate analysis. Christopher Engström. Multivariate. analysis. Principal component analysis
.. December 20, 2013 Todays lecture. (PCA) (PLS-R) (LDA) . (PCA) is a method often used to reduce the dimension of a large dataset to one of a more manageble size. The new dataset can then be used to make
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More informationCENG 793. On Machine Learning and Optimization. Sinan Kalkan
CENG 793 On Machine Learning and Optimization Sinan Kalkan 2 Now Introduction to ML Problem definition Classes of approaches K-NN Support Vector Machines Softmax classification / logistic regression Parzen
More informationIntroduction to Machine Learning
Introduction to Machine Learning Logistic Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574
More informationSupport Vector Machines
Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized
More informationBehavioral Data Mining. Lecture 7 Linear and Logistic Regression
Behavioral Data Mining Lecture 7 Linear and Logistic Regression Outline Linear Regression Regularization Logistic Regression Stochastic Gradient Fast Stochastic Methods Performance tips Linear Regression
More informationECE521 Lectures 9 Fully Connected Neural Networks
ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance
More informationRecap from previous lecture
Recap from previous lecture Learning is using past experience to improve future performance. Different types of learning: supervised unsupervised reinforcement active online... For a machine, experience
More informationStatistical Methods for SVM
Statistical Methods for SVM Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find a plane that separates the classes in feature space. If we cannot,
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and
More informationLecture 5: Logistic Regression. Neural Networks
Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture
More informationLecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data
Lecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data DD2424 March 23, 2017 Binary classification problem given labelled training data Have labelled training examples? Given
More informationLinear Regression. Aarti Singh. Machine Learning / Sept 27, 2010
Linear Regression Aarti Singh Machine Learning 10-701/15-781 Sept 27, 2010 Discrete to Continuous Labels Classification Sports Science News Anemic cell Healthy cell Regression X = Document Y = Topic X
More information1 Machine Learning Concepts (16 points)
CSCI 567 Fall 2018 Midterm Exam DO NOT OPEN EXAM UNTIL INSTRUCTED TO DO SO PLEASE TURN OFF ALL CELL PHONES Problem 1 2 3 4 5 6 Total Max 16 10 16 42 24 12 120 Points Please read the following instructions
More informationCS540 Machine learning Lecture 5
CS540 Machine learning Lecture 5 1 Last time Basis functions for linear regression Normal equations QR SVD - briefly 2 This time Geometry of least squares (again) SVD more slowly LMS Ridge regression 3
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationChap 2. Linear Classifiers (FTH, ) Yongdai Kim Seoul National University
Chap 2. Linear Classifiers (FTH, 4.1-4.4) Yongdai Kim Seoul National University Linear methods for classification 1. Linear classifiers For simplicity, we only consider two-class classification problems
More informationThe classifier. Theorem. where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know
The Bayes classifier Theorem The classifier satisfies where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know Alternatively, since the maximum it is
More informationThe classifier. Linear discriminant analysis (LDA) Example. Challenges for LDA
The Bayes classifier Linear discriminant analysis (LDA) Theorem The classifier satisfies In linear discriminant analysis (LDA), we make the (strong) assumption that where the min is over all possible classifiers.
More informationLeast Squares Regression
E0 70 Machine Learning Lecture 4 Jan 7, 03) Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in the lecture. They are not a substitute
More informationBANA 7046 Data Mining I Lecture 6. Other Data Mining Algorithms 1
BANA 7046 Data Mining I Lecture 6. Other Data Mining Algorithms 1 Shaobo Li University of Cincinnati 1 Partially based on Hastie, et al. (2009) ESL, and James, et al. (2013) ISLR Data Mining I Lecture
More informationCPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017
CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class
More informationEXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING
EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: August 30, 2018, 14.00 19.00 RESPONSIBLE TEACHER: Niklas Wahlström NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric
More informationMachine Learning (CS 567) Lecture 5
Machine Learning (CS 567) Lecture 5 Time: T-Th 5:00pm - 6:20pm Location: GFS 118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol
More informationLINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
LINEAR CLASSIFIERS Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification, the input
More informationMachine Learning and Data Mining. Linear classification. Kalev Kask
Machine Learning and Data Mining Linear classification Kalev Kask Supervised learning Notation Features x Targets y Predictions ŷ = f(x ; q) Parameters q Program ( Learner ) Learning algorithm Change q
More information