
Dipartimento di Informatica e Scienze dell'Informazione

A Statistical Learning Approach to Liver Iron Overload Estimation

Luca Baldassarre 1,2, Barbara Gianesin 1, Annalisa Barla 2 and Mauro Marinelli

1 - DIFI - Università di Genova, v. Dodecaneso 33, 16146, Genova, Italy
2 - DISI - Università di Genova, v. Dodecaneso 35, 16146, Genova, Italy

Technical Report DISI-TR
DISI, Università di Genova, v. Dodecaneso 35, Genova, Italy

A Statistical Learning Approach to Liver Iron Overload Estimation

Luca Baldassarre, Barbara Gianesin, Annalisa Barla and Mauro Marinelli

Abstract

In this work we present and discuss in detail a novel vector-valued regression technique: our approach allows for an all-at-once estimation of the vector components, as opposed to solving a number of independent scalar-valued regression tasks. Despite its general-purpose nature, the method has been designed to solve a delicate medical issue: a reliable and non-invasive assessment of body-iron overload. The Magnetic Iron Detector (MID) measures the magnetic track of a person, which depends on the anthropometric characteristics and the body-iron burden. We aim to provide an estimate of this signal in the absence of iron overload. We show how this question can be formulated as the estimation of a vector-valued function which encompasses the prior knowledge on the shape of the magnetic track. This is accomplished by designing an appropriate vector-valued feature map. We successfully applied the method to a dataset of 84 volunteers.

1 Introduction

Iron is essential to human life, but it is toxic in excessive amounts. There are several diseases characterized by liver iron overload, such as thalassemia or hereditary hemochromatosis. The assessment of body iron excess is therefore essential for managing the therapies for these diseases. Each disease is characterized by a different mechanism of iron accumulation: for thalassemic patients iron overload is induced by periodic blood transfusions, while for hemochromatosis patients it is induced by an incorrect dietary absorption of iron. The invasive liver biopsy is still considered the best way to perform iron overload evaluation but, being a local measure, it is affected by large errors due to the heterogeneous distribution of iron deposition in the liver.

Figure 1.1: Volunteer and patient magnetization signals.
Figure 1.2: Eddy current and magnetization signals of a volunteer.

Recently, Marinelli and colleagues [9] have developed a room-temperature biosusceptometer, the Magnetic Iron Detector (MID), which measures the variation of a magnetic field at different positions along the axis that crosses the patient's liver. This instrument allows the non-invasive assessment of the iron overload in the whole liver. Given an estimate of the background signal of a patient, that is, the signal that would be generated in the absence of iron overload, it is possible to recover the iron burden by subtracting the estimated signal from the measured signal. The statistical model developed by Marinelli and coworkers [8] is currently used at the E.O. Ospedali Galliera Hospital in Genoa, Italy, for assessing the iron overload. The model has been trained on a dataset of 84 healthy volunteers and it estimates the ratio R(x) between the two signals shown in subfig. 1.2, of which only the magnetization signal depends on the liver iron content. The core idea behind their approach is that the magnetization signal of a well-treated patient is indistinguishable from the one generated by a healthy volunteer with the same biometric features, see subfig. 1.1. Furthermore, they assume that the ratio R(x) of the two signals, evaluated only in the range between -8 cm and 8 cm, resembles a parabola. We reformulate this problem in the context of Statistical Learning, presenting a method to transform a curve-fitting task into a vector-valued regression model. Since the measures are always taken at fixed positions along the measurement axis, they can be thought of as components of a vector, and a high correlation among them can be assumed, because they approximately lie on a parabola. In this way we avoid directly estimating the magnetization curve.

Our vector-valued regression model simultaneously estimates the five points of the background signal. The correlation between these points is introduced by an appropriate feature map, which is linear in the biometric features and quadratic in the measurement positions. As we will show in subsec. 3.2, we can compute the corresponding matrix-valued kernel function. The method described in Sec. 3.2 can be implemented by means of iterative algorithms, see [7], such as Landweber, the ν-method or the sparsity-enforcing l1l2 regularization [3; 4].

2 Liver Iron Overload Estimation

The biosusceptometer is composed of an AC magnetic source and a pickup coil which measures the electromotive force (emf) produced by the oscillation of the magnetic field flux, as shown in subfig. 1.3. A body placed between the magnet and the pickup slightly modifies the flux and therefore the measured emf: the amount of the variation depends on the magnetic properties of the body, on its geometry and on its positioning in the field. The emf produced in the pickup by the field is about 4 V and the diamagnetic signal of a body is about 4 µV. A moderate iron overload adds a paramagnetic contribution of about 0.4 µV. The symmetry of the system and the use of synchronous detection make this difficult measurement feasible [8]. A sample of susceptibility χ, placed between the magnet and one of the pickups, generates the signal:

V = \int_{\mathrm{Volume}} χ(\vec{r}) \, g(\vec{r}) \, d\vec{r}    (1)

The weight function g(r) is reported in subfig. 1.4: all body tissues contribute to the signal generation, but the major contribution comes from those between the magnet and the pickup coil (measurement region). The patient lies supine on a stretcher, such that in the measurement region the liver center of mass crosses the magnetic field axis. The magnetic track of the patient is the complete scan of the magnetic properties of the body section; it is composed of measurements of the magnetic signal taken 4 cm apart. Position x = 0 cm corresponds to the center of the body; negative positions indicate the liver side, while positive positions the spleen side.
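
The role of the weight function in Eq. (1) can be illustrated with a small numerical sketch. Everything below (the grid, the susceptibility map χ and the sensitivity g) is a made-up placeholder, not MID calibration data; the point is only that the signal is a g-weighted volume integral dominated by the tissue lying between magnet and pickup.

```python
import numpy as np

# Toy numerical version of Eq. (1): the signal is the susceptibility map weighted by
# the sensitivity g(r) and integrated over the body volume. The grid, chi and g below
# are made-up placeholders, not MID calibration data.
x, y, z = np.meshgrid(np.linspace(-20, 20, 81),       # cm, along the measurement axis
                      np.linspace(-15, 15, 61),
                      np.linspace(-15, 15, 61), indexing="ij")
voxel = 0.5 * 0.5 * 0.5                               # cm^3 per grid cell

chi = -9e-6 * np.ones_like(x)                         # water-like diamagnetic tissue
chi[(x < -4) & (np.abs(y) < 6) & (np.abs(z) < 6)] += 2e-6   # "liver" region with extra iron

g = np.exp(-(y**2 + z**2) / 50.0)                     # sensitivity peaked between magnet and pickup

V = np.sum(chi * g) * voxel                           # discretized Eq. (1)
print(f"simulated signal (arbitrary units): {V:.3e}")
```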

Figure 1.3: Magnet and pickup configuration.
Figure 1.4: Weight function g(r) and its cross sections along directions A and B.
Figure 1: When the patient is positioned between the magnet and the pickup, an emf is produced by the magnetization of the patient's tissues. With this positioning, the whole liver falls in the region where g(r) is largest.

Subfigure 1.1 reports the magnetic tracks of a healthy volunteer and of a patient with similar anthropometric features: the liver iron overload produces an evident variation of the signal in the left part of the track. Due to the oscillating magnetic field, the magnetic signal generated by the human body has two independent sources: the magnetization signal, from the diamagnetic and paramagnetic properties of the tissues, and the eddy current signal, from their electrical conductivity. For each patient a double track is recorded (an example is shown in subfig. 1.2): only the magnetization signal depends on the iron overload. The aim of the present work is to find a model which best approximates the magnetization signal of a healthy volunteer (background signal) from the volunteer's anthropometric data and the eddy current signal. The iron overload can then be evaluated by computing the difference between the measured magnetization signal and the estimate of the background signal. Therefore, increasing the prediction accuracy of the model increases our ability to detect slight iron overloads. However, the maximum accuracy we can obtain is limited by the measurement error, mainly due to the positioning of the patient on the stretcher. This error is about 150 nV and corresponds to an iron overload of 0.4 g.

3 Statistical Learning Approach

3.1 Non-parametric regression and regularization

Given a set of input-output examples z = {(x_1, y_1), ..., (x_n, y_n)}, with x_i ∈ X and y_i ∈ Y, the aim of statistical learning is to find the deterministic function that best represents the relationship between x and y. This function is called the regression function. To tackle this problem it is necessary to define the space of candidate functions (the hypothesis space H) and the measure used to assess the goodness of a candidate (the loss function V). Ideally, one would like to find a function that minimizes the loss function on all possible input-output pairs. Since one always deals with finite sets, one has to resort to minimizing the empirical risk, that is, the loss computed only on the examples z:

E_n(f) = \frac{1}{n} \sum_{i=1}^{n} V(y_i, f(x_i)).

If the hypothesis space is large enough to accommodate almost any sensible function, it will always be possible to perfectly predict (or fit) the examples z, without any guarantee that this minimizer will perform well on unseen data. This problem is called overfitting. There are two alternative approaches to avoid overfitting: the first consists in restricting the hypothesis space, the second in favoring smooth functions. It is usually more straightforward to translate our prior information on the nature of the specific problem at hand into properties of the regression function than into a restriction of the hypothesis space; we therefore prefer to follow the second approach. Convenient hypothesis spaces are the Reproducing Kernel Hilbert Spaces (RKHS), which, for specific choices of the kernel function K, are dense in L^2(X). The norm in these spaces measures the regularity of the functions: a smaller norm corresponds to smoother functions. It is therefore natural to search for the functions that minimize the following functional:

\frac{1}{n} \sum_{i=1}^{n} V(y_i, f(x_i)) + λ \|f\|_K^2    (2)

The regularization parameter λ controls the trade-off between the two terms: as λ approaches zero we obtain the interpolating solution, while as λ grows the solution becomes smoother and smoother. This approach is called Tikhonov regularization.
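
As an illustration of Tikhonov regularization with the square loss, the following minimal sketch (our own toy example, not the MID pipeline: the Gaussian kernel, the data and the value of λ are assumptions made for illustration) solves the linear system that minimizes functional (2) for a fixed kernel, and evaluates the resulting estimator on new points.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

# Toy 1-D data: noisy samples of a smooth curve (placeholders for real examples z).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

n, lam = len(X), 1e-2
K = gaussian_kernel(X, X)
# Minimizing (2) with the square loss yields coefficients c solving (K + n*lambda*I) c = y.
c = np.linalg.solve(K + n * lam * np.eye(n), y)

X_new = np.linspace(-3, 3, 7)[:, None]
f_new = gaussian_kernel(X_new, X) @ c        # estimator evaluated on new inputs
print(np.round(f_new, 3))
# Larger lam -> smoother (flatter) estimates; lam -> 0 approaches the interpolating solution.
```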

In a RKHS, the representer theorems [5; 10] guarantee that the solution can always be written as:

f(x) = \sum_{i=1}^{n} K(x, x_i) \, c_i

where the coefficients c_i depend on the data, on the loss function, on the kernel choice and on the regularization parameter λ. Note that, if Y ⊆ ℝ, then the c_i and K(x, x_i) are scalars, whereas if Y ⊆ ℝ^d the c_i are d-dimensional vectors and K(x, x_i) is a d × d matrix. The direct approach to vector-valued regression, as in [10], is computationally expensive since it requires inverting an nd × nd matrix. To overcome this issue, we propose an extension to the vector-valued case of iterative methods originally developed for scalar regression [12; 7]. The main idea of these techniques is to start from an approximate solution and iteratively add a correction in the direction opposite to the gradient of the empirical risk. Letting the iterations go to infinity leads to an overfitted solution, with its problems of stability and generalization; by early stopping the procedure, a regularized solution is obtained. The number of iterations m plays the role of the regularization parameter.

We are also interested in studying feature selection for vector-valued functions. We implemented the l1l2 regularization, a sparsification method initially proposed by [13], studied in [3] and already applied in [6]. This method iteratively minimizes the following functional, derived from (2) with the square loss and the addition of an l1 penalty term:

\frac{1}{n} \sum_{i=1}^{n} \|y_i - f(x_i)\|_{ℝ^d}^2 + λ(1-α) \|f\|_{ℓ_2}^2 + λα \|f\|_{ℓ_1}.

3.2 Designing the feature map

For each person i = 1, ..., 84, we consider only the 5 measures y_{ik} at positions t_k ∈ {-8, -4, 0, 4, 8} cm. The measures can be thought of as the components of a five-dimensional vector and lie approximately on a parabola, hence we can model them as y_{ik} = f(x_i)_k + ε_{ik}, where x_i stands for the biometric data of volunteer i, ε_{ik} represents the noise and:

f(x_i)_k = c_0(x_i) + c_1(x_i) t_k + c_2(x_i) t_k^2 + b(t_k),    (3)

where b(t) represents a measurement offset independent of the volunteer. To avoid computing b(t_k), we choose to set the mean of the values y_{ik} to zero for each k.
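
As a quick numerical check of model (3) and of the centering step (a sketch with made-up coefficients, offsets and noise levels, not the real measurements), the snippet below generates five measurements per volunteer lying on a parabola in t_k plus a shared offset b(t_k), and verifies that subtracting the per-position mean removes the offset.

```python
import numpy as np

# Toy check of the model in Eq. (3): for every volunteer the five measurements lie on
# a parabola in t_k plus an offset b(t_k) shared by all volunteers. Coefficients,
# offsets and noise level below are made up for illustration only.
rng = np.random.default_rng(1)
t = np.array([-8.0, -4.0, 0.0, 4.0, 8.0])            # measurement positions (cm)
n = 84

C = rng.normal(size=(n, 3))                          # per-volunteer (c0, c1, c2)
b = rng.normal(size=5)                               # position-dependent offset b(t_k)
noise = 0.05 * rng.normal(size=(n, 5))

T = np.stack([np.ones_like(t), t, t**2], axis=1)     # 5 x 3 design in t
Y = C @ T.T + b + noise                              # y_ik = c0 + c1*t_k + c2*t_k^2 + b(t_k) + eps

# Subtracting the per-position mean removes b(t_k) (together with the average
# parabola contribution), so the offsets never need to be estimated explicitly.
Y_centered = Y - Y.mean(axis=0)
print(np.round(Y_centered.mean(axis=0), 6))          # ~0 at every position
```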

In our model, we assume that the coefficients c_j depend linearly on x: c_j(x) = β_j · x, j = 0, 1, 2, and introduce the vector-valued feature map ϕ : X → ℝ^{5 × 3p} (with X ⊆ ℝ^p, p = 22):

ϕ(x) = \begin{pmatrix} x & x t_1 & x t_1^2 \\ x & x t_2 & x t_2^2 \\ x & x t_3 & x t_3^2 \\ x & x t_4 & x t_4^2 \\ x & x t_5 & x t_5^2 \end{pmatrix}    (4)

Let us define β as the vector obtained by concatenating the coefficient vectors β_j; the element β_{lj} is the l-th component of the coefficient vector β_j. The vector-valued estimator can then be rewritten as a linear combination of these new features:

f(x) = ϕ(x) β,  β ∈ ℝ^{3p}.    (5)

We decided to use the quadratic loss function V, therefore the empirical risk is:

E_n(β) = \frac{1}{n} \sum_{i=1}^{n} \|y_i - ϕ(x_i) β\|_{ℝ^5}^2.

Our aim is to compare the performance of the MID parametric model with that of the vector-valued model estimated via the Landweber, ν-method and l1l2 algorithms. These methods require the computation of the gradient of the empirical risk, which, for this specific case, is:

∇E_n(β) = -\frac{2}{n} \left( ϕy - ϕ^Tϕ \, β \right)    (6)

with

(ϕy)_γ = \sum_{i=1}^{n} ⟨ϕ_γ(x_i), y_i⟩_{ℝ^5},   (ϕ^Tϕ)_{γ,γ'} = \sum_{i=1}^{n} ⟨ϕ_γ(x_i), ϕ_{γ'}(x_i)⟩_{ℝ^5},

where ϕ_γ(x) corresponds to the γ-th column of ϕ(x), ϕy ∈ ℝ^{3p} and ϕ^Tϕ ∈ ℝ^{3p × 3p}. The simplest iterative method is the Landweber approach [1]. It starts from the null solution (i.e., all the coefficients β_{lj} equal to zero), which is updated by adding the negative of the gradient multiplied by a constant step size η:

β^{m+1} = β^m - η \, ∇E_n(β^m),  β^0 = (0, ..., 0).

The number of iterations m corresponds to the inverse of the regularization parameter λ.
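
The whole construction can be sketched in a few lines. The snippet below (with randomly generated placeholder data in place of the real biometric features and measurements, and an illustrative number of iterations rather than the selected one) builds the feature map (4), precomputes the quantities ϕy and ϕ^Tϕ appearing in (6), and runs the Landweber iteration with early stopping.

```python
import numpy as np

# Sketch of the vector-valued model (4)-(6) trained with Landweber iterations.
# The biometric data X and the measurements Y are random placeholders
# (n = 84 volunteers, p = 22 features); the number of iterations is illustrative.
rng = np.random.default_rng(2)
n, p = 84, 22
t = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])            # rescaled measurement positions

X = rng.uniform(-1, 1, size=(n, p))
Y = rng.normal(size=(n, 5))
Y = Y - Y.mean(axis=0)                               # per-position centering

def feature_map(x):
    """phi(x): 5 x 3p matrix whose k-th row is [x, x*t_k, x*t_k^2], as in Eq. (4)."""
    return np.stack([np.concatenate([x, x * tk, x * tk**2]) for tk in t])

Phi = np.stack([feature_map(x) for x in X])          # n x 5 x 3p
phi_y = np.einsum("nkg,nk->g", Phi, Y)               # (phi y)_gamma
phi_t_phi = np.einsum("nkg,nkh->gh", Phi, Phi)       # (phi^T phi)_{gamma, gamma'}

eta = 1.0 / (2 * np.linalg.norm(phi_t_phi, 2))       # step size (2 ||phi^T phi||)^-1
beta = np.zeros(3 * p)
for m in range(300):                                 # early stopping: m plays the role of 1/lambda
    grad = -(2.0 / n) * (phi_y - phi_t_phi @ beta)   # gradient (6)
    beta = beta - eta * grad

Y_hat = np.einsum("nkg,g->nk", Phi, beta)            # f(x_i) = phi(x_i) beta
print("training RMSE:", float(np.sqrt(np.mean((Y - Y_hat) ** 2))))
```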

The ν-method [7] extends Landweber by using a dynamic step size and by introducing an inertial term which keeps memory of the previous update:

β^{m+1} = β^m + u (β^m - β^{m-1}) - w η \, ∇E_n(β^m),

where w and u change at each iteration. It has been shown that this algorithm performs better and faster than Landweber: in fact, the number of iterations corresponds to λ^{-1/2}. l1l2 regularization iteratively minimizes the following functional:

\frac{1}{n} \sum_{i=1}^{n} \|y_i - ϕ(x_i) β\|_{ℝ^5}^2 + λ(1-α) \|β\|_{ℓ_2}^2 + λα \|β\|_{ℓ_1}.

The l1 penalty term forces many of the coefficients β_{lj} to be zero, and the corresponding variables can be considered irrelevant to the problem and discarded. The iterations are essentially of the Landweber type, but at each step the coefficients are soft-thresholded and shrunk:

β^{m+1} = H(β^m - η \, ∇E_n(β^m), τ) / (1 + μ),   τ = λα,   μ = λ(1-α),

where H is the soft-thresholding operator, which sets to zero all coefficients within [-τ, τ] and shifts the remaining coefficients towards zero by τ. The algorithm stops according to a convergence criterion; for details see [3].
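
A corresponding sketch of the l1l2 iteration follows: a Landweber step, then the soft-thresholding and shrinkage of the update above. The data, the value of λ and the stopping tolerance are illustrative placeholders; α = 0.9 mirrors the value used later in Sec. 4.

```python
import numpy as np

# Sketch of the l1l2 iteration: a Landweber step followed by soft-thresholding and
# shrinkage, as in the update above. Data, lambda and the stopping tolerance are
# illustrative placeholders; alpha = 0.9 mirrors the choice reported in Sec. 4.
rng = np.random.default_rng(3)
n, p = 84, 22
t = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
X = rng.uniform(-1, 1, size=(n, p))
Y = rng.normal(size=(n, 5))
Y = Y - Y.mean(axis=0)

Phi = np.stack([np.stack([np.concatenate([x, x * tk, x * tk**2]) for tk in t]) for x in X])
phi_y = np.einsum("nkg,nk->g", Phi, Y)
phi_t_phi = np.einsum("nkg,nkh->gh", Phi, Phi)
eta = 1.0 / (2 * np.linalg.norm(phi_t_phi, 2))

def soft_threshold(v, tau):
    """H(v, tau): zero inside [-tau, tau], shift the remaining entries towards zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

lam, alpha = 0.1, 0.9
tau, mu = lam * alpha, lam * (1 - alpha)
beta = np.zeros(3 * p)
for _ in range(1000):                                    # in practice: stop at convergence [3]
    grad = -(2.0 / n) * (phi_y - phi_t_phi @ beta)
    beta_new = soft_threshold(beta - eta * grad, tau) / (1 + mu)
    converged = np.max(np.abs(beta_new - beta)) < 1e-6
    beta = beta_new
    if converged:
        break

print("non-zero coefficients:", int(np.count_nonzero(beta)), "of", 3 * p)
```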

From the vector-valued feature map ϕ we can compute the corresponding matrix-valued kernel. Following [2]:

(K(x, s))_{pq} = \sum_{γ=1}^{3p} ϕ_{pγ}(x) \, ϕ_{qγ}(s) = (x · s)(1 + t_p t_q + t_p^2 t_q^2).

Note: we can recast the vector-valued model into a scalar one by considering, for each volunteer, five input points (x, t_k), one for each measurement position, and using the factorized scalar kernel:

K((x, t_p), (s, t_q)) = (x · s)(1 + t_p t_q + t_p^2 t_q^2).

It is possible to extend our approach to the non-linear case by replacing the dot product with a suitable scalar kernel function. The estimator can then be written as:

f(x, t) = \sum_{i=1}^{n} \sum_{k=1}^{5} K((x, t), (x_i, t_k)) \, c_{ik}.

With this approach one can use standard scalar regularized regression techniques, but there are some considerations to be made. The first regards the i.i.d. hypothesis on the examples: reformulating the problem as scalar regression, each volunteer is associated with 5 vectors composed of two parts, the first being the biometric data of the volunteer, x_i, and the second the measurement position, t_k. Consequently, the training set has 5n elements, whose biometric and position components are not i.i.d. Furthermore, enforcing sparsity on the coefficients c_{ik} is very different from sparsifying the coefficients β_{lj}, which are directly related to the biometric features.

3.3 A naïve scalar approach

For comparison, we tested our model against a naïve approach, which consists in treating the measures at each position as independent scalar regression problems. Five scalar models are therefore trained separately and their outputs are combined to recover the background signal. The prior knowledge that the magnetic signal of each subject is roughly a parabola is no longer taken into account. We implemented standard RLS regression with polynomial and Gaussian kernels, and l1l2 regularization.

3.4 Model selection and assessment

We adopt an experimental protocol in order to select the model parameters and assess the generalization capabilities of our method in an unbiased way. We perform two nested loops of K-fold Cross Validation.

We recall that the estimate of the generalization error is the mean of the empirical errors on the K test sets. If K equals the total number of available data, the method is called Leave One Out Cross Validation (LOO). Higher values of K reduce the bias of the estimator, since the model is trained on more data, but increase its variance, since fewer data are used for testing. On the other hand, more splits imply more computations, hence more time. In some cases, for example for RLS [11], closed-form solutions of the Leave One Out error have been obtained, resulting in very fast computation. For the vector-valued model, the inner loop is a 5-fold Cross Validation and is performed to select the regularization parameter (e.g. λ or the number of iterations m). For each value of the parameter, an estimate of the generalization error is computed; the value that minimizes the error is used for training. The outer loop is a Leave One Out Cross Validation evaluating the performance of the chosen model; the estimate of the generalization error is the mean of the K = n empirical errors. For the RLS scalar models, the inner loop is a LOO CV for selecting both the kernel parameter and the regularization parameter λ, exploiting the computational advantage of the closed-form solution for the LOO error for the latter. The selection of λ for the scalar l1l2 regularization method was performed by a 5-fold CV. The evaluation of the performance of these models was carried out through LOO CV, consistently with the procedure adopted for the vector-valued model.

4 Results

The data set is composed of 84 healthy volunteers, represented by the vectors of features reported in Table 1. Note that from now on we will refer to n as the number of examples in the training set within the innermost loop of CV. The features are highly inhomogeneous and can lead to numerical problems, therefore we decided to normalize our data. We set the columns of the n × p data matrix X = (x_1, ..., x_n)^T to have zero mean and fixed range, and changed the variable t from {-8, -4, 0, 4, 8} to {-1, -0.5, 0, 0.5, 1}, since it only represents a label for the components of the vector y. Thus, each element of the three-dimensional array ϕ(X) ∈ ℝ^{n × 3p × 5}, obtained by applying the feature map to the data matrix X, belongs to [-1, 1]. In the test phase, we apply the same normalizing factors to the test data. For model selection and assessment we used the experimental protocol outlined in subsec. 3.4: the model parameters to be selected are the number of iterations m for the Landweber and ν-method algorithms and the regularization parameter λ for the l1l2 method.
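
The nested protocol can be sketched as follows. This is a toy implementation with reduced sizes, random placeholder data and an arbitrary grid of candidate iteration numbers: the outer leave-one-out loop assesses the model, while the inner 5-fold loop selects the number of Landweber iterations.

```python
import numpy as np

# Toy sketch of the nested validation protocol: the outer leave-one-out loop assesses
# the model, the inner 5-fold loop selects the number of Landweber iterations.
# Sizes, data and the candidate grid are placeholders chosen to keep the example fast.
rng = np.random.default_rng(5)
n, p = 30, 6
t = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
X = rng.uniform(-1, 1, size=(n, p))
Y = rng.normal(size=(n, 5))
Y = Y - Y.mean(axis=0)

def features(X_):
    return np.stack([np.stack([np.concatenate([x, x * tk, x * tk**2]) for tk in t]) for x in X_])

def fit(X_tr, Y_tr, n_iter):
    Phi = features(X_tr)
    phi_y = np.einsum("nkg,nk->g", Phi, Y_tr)
    phi_t_phi = np.einsum("nkg,nkh->gh", Phi, Phi)
    eta = 1.0 / (2 * np.linalg.norm(phi_t_phi, 2))
    beta = np.zeros(3 * p)
    for _ in range(n_iter):
        beta -= eta * (-(2.0 / len(X_tr)) * (phi_y - phi_t_phi @ beta))
    return beta

def predict(X_te, beta):
    return np.einsum("nkg,g->nk", features(X_te), beta)

grid = [10, 50, 100, 300]                       # candidate numbers of iterations
loo_errors = []
for i in range(n):                              # outer LOO loop (assessment)
    tr = np.delete(np.arange(n), i)
    folds = np.array_split(rng.permutation(tr), 5)
    cv_err = []
    for m in grid:                              # inner 5-fold loop (model selection)
        errs = []
        for f in folds:
            inner_tr = np.setdiff1d(tr, f)
            errs.append(np.mean((Y[f] - predict(X[f], fit(X[inner_tr], Y[inner_tr], m))) ** 2))
        cv_err.append(np.mean(errs))
    best_m = grid[int(np.argmin(cv_err))]       # parameter minimizing the inner CV error
    beta = fit(X[tr], Y[tr], best_m)
    loo_errors.append(np.mean(np.abs(Y[i] - predict(X[i:i + 1], beta)[0])))

print("LOO estimate of the error:", float(np.mean(loo_errors)))
```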

Table 1: Volunteer's features

Feature   Description
1         Eddy current at -12 cm
2         Eddy current at -8 cm
3         Eddy current at -4 cm
4         Eddy current at 0 cm
5         Eddy current at 4 cm
6         Eddy current at 8 cm
7         Eddy current at 12 cm
8         Thorax section area at 0 cm
9         Thorax section area at 18 cm
10        Thorax section area at -18 cm
11        Thorax height at 0 cm
12        Thorax height at 18 cm
13        Thorax height at -18 cm
14        Adam's apple position
15        Navel position
16        Age
17        Height
18        Weight
19        Thorax circumference
20        Circumference under ribs arch
21        BMI (Body Mass Index)
22        Body area
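
The normalization of the features of Table 1 described above can be sketched as follows; X is a random placeholder for the 84 × 22 feature matrix, and the half-range scaling is an illustrative choice.

```python
import numpy as np

# Sketch of the normalization described above: each column of the data matrix is
# shifted to zero mean and scaled to a fixed range, and the position labels are
# rescaled from centimetres to {-1, -0.5, 0, 0.5, 1}. X is a random placeholder
# for the 84 x 22 feature matrix of Table 1.
rng = np.random.default_rng(6)
X = rng.uniform(0, 100, size=(84, 22))

mean = X.mean(axis=0)
half_range = (X.max(axis=0) - X.min(axis=0)) / 2
X_norm = (X - mean) / half_range          # zero mean, range 2 for every column

t = np.array([-8, -4, 0, 4, 8]) / 8.0     # rescaled position labels

# In the test phase the same `mean` and `half_range` (computed on training data)
# would be applied to the test examples.
print(X_norm.mean(axis=0).round(3))                        # ~0 for every feature
print((X_norm.max(axis=0) - X_norm.min(axis=0)).round(3))  # 2 for every feature
print(t)
```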

In the latter case, α was set to 0.9 to enforce maximum sparsity while retaining correlated features [3]. For scalar RLS regression, we also selected the kernel parameters (the degree of the polynomial or the σ of the Gaussian) alongside the regularization parameter λ. The implemented iterative algorithms require the specification of a step size η; we chose the value η = (2 \|ϕ^Tϕ\|)^{-1}, which guarantees their convergence, see [12; 3; 7]. We report the selected parameters in Table 2 for the vector-valued model and in Table 3 for the naïve approach. Note that the values correspond to the median of the parameters selected for each model during the outer loop of the LOO cross validation.

Table 2: Selected parameters

Model        Number of iterations    λ
Landweber    397                     n.a.
ν-method     68                      n.a.
l1l2

Table 3: Selected parameters - Naïve approach

            RLS gaussian      RLS polynomial    l1l2
Position    λ       Par       λ       Par       λ       Par
x1
x2
x3
x4
x5

Figure 2 shows the boxplots of the LOO error distributions for all the models tested, compared with the model in use at the E.O. Ospedali Galliera Hospital, assessed with the same validation protocol. As expected, we observe that the LOO errors show a high variance. Table 4 summarizes the statistics of each distribution. The l1l2 algorithm performs slightly better and seems more robust to outliers, both for the vector-valued model and for the naïve approach.

Table 4: Summary statistics of the LOO error distributions

Model             I quartile    Median    III quartile
Landweber
ν-method
l1l2
MID
RLS gauss
RLS polynomial
L1L2 scalar

The accuracies obtained with these methods correspond to a precision in the iron overload estimation of about 0.8 g. An iron overload lower than 1 g is considered mild: currently no model is capable of detecting this kind of iron burden.

5 Conclusions

The proposed model is a general method for approaching vector-valued regression problems. Moreover, it can also be used to estimate a curve parametrized by a variable that is always sampled at fixed values. Prior knowledge (e.g. the shape of the curve with respect to the parametrizing variable, or the correlation among the elements of the vector-valued function to be estimated) can easily be incorporated by explicitly writing the feature map or the kernel function. Our results show that the iterative algorithms can be successfully applied to the vector-valued case. They also provide an efficient alternative to the direct computation of the inverse of an nd × nd matrix. The model selection and validation protocol adopted leads to an unbiased solution, avoiding overfitting and unreliable estimates of the performance. The Marinelli group will soon start a new data acquisition campaign on volunteers: the old features will be measured more accurately and some new ones will be introduced, for example a 3D laser scan of the volunteer's thorax. The statistical methods presented here will be applied to the new dataset and compared against a neural network system that the Marinelli group is planning to develop.

Figure 2: LOO error distributions (absolute value of the residue for each model). The first three models are obtained from the vector-valued one by the indicated algorithms. The MID model is the one currently used for diagnosis. The last three boxplots regard the naïve scalar approach.

References

[1] P. Bühlmann and B. Yu. Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association, 98.

[2] A. Caponnetto, C. Micchelli, M. Pontil, and Y. Ying. Universal kernels for multi-task learning. Journal of Machine Learning Research, submitted.

[3] C. De Mol, E. De Vito, and L. Rosasco. Sparse Tikhonov regularization for variable selection and learning. Technical report, DISI.

[4] C. De Mol, S. Mosci, M. Traskine, and A. Verri. A regularized method for selecting nested groups of relevant genes from microarray data. Technical report, DISI.

[5] E. De Vito, L. Rosasco, A. Caponnetto, M. Piana, and A. Verri. Some properties of regularized kernel methods. Journal of Machine Learning Research, 5.

[6] A. Destrero, S. Mosci, C. De Mol, A. Verri, and F. Odone. Feature selection for high dimensional data. Computational Management Science, to appear.

[7] L. Lo Gerfo, L. Rosasco, F. Odone, E. De Vito, and A. Verri. Spectral algorithms for supervised learning. Neural Computation, to appear.

[8] M. Marinelli, S. Cuneo, B. Gianesin, A. Lavagetto, M. Lamagna, E. Oliveri, G. Sobrero, L. Terenzani, and G. Forni. Non-invasive measurement of iron overload in the human body. IEEE Transactions on Applied Superconductivity, 16(2), June.

[9] M. Marinelli, B. Gianesin, M. Lamagna, A. Lavagetto, E. Oliveri, M. Saccone, G. Sobrero, L. Terenzani, and G. Forni. Whole liver iron overload measurement by a non-cryogenic magnetic susceptometer. In Proc. of New Frontiers in Biomagnetism, Vancouver, Canada.

[10] C. A. Micchelli and M. Pontil. On learning vector-valued functions. Neural Computation, 17.

[11] R. M. Rifkin and R. A. Lippert. Notes on regularized least squares. Technical report, MIT DSpace (United States).

[12] Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2), August.

[13] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2).
