CS-E3210 Machine Learning: Basic Principles
Slide 1: CS-E3210 Machine Learning: Basic Principles
Lecture 3: Regression I
slides by Markus Heinonen
Department of Computer Science, Aalto University, School of Science
Autumn (Period I) 2017
Slide 2: In a nutshell
- today and on Friday we consider regression problems
- data points $x^{(i)} \in \mathbb{R}^d$ and continuous targets $y^{(i)} \in \mathbb{R}$
- we want to learn a function with $h(x^{(i)}) \approx y^{(i)}$
- the prediction $h(x)$ is continuous; in classification both the target $y$ and $h(x)$ are binary
- a function $h(\cdot)$ is represented by parameters $w$
- the parameters $w$ need to fit the data $X = (x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})$
Slide 3: Can we predict apartment rent?
[Figure: rent prediction, output $y$: rent against input $x$: house size (sqm)]
- we observe rents $y^{(i)}$ for $i = 1, \ldots, 11$ houses $x^{(i)}$
- we learn from this data to predict the rent $h(x) \in \mathbb{R}$ given $d$ house properties $x \in \mathbb{R}^d$
- (designing a good $h(x)$ by hand is not machine learning)
Slide 4: Which features do we have access to?
[Figure: two rent-prediction panels, output $y$: rent against input $x_1$: house size (sqm) and against input $x_2$: house age]
- house size $x_{\text{size}}$ can predict a linear trend in the rent $y$
- house age $x_{\text{age}}$ gives non-linear information about $y$: new and old houses seem expensive, with little effect from the 40s to the 90s
- informative features add accuracy (e.g. location, condition)
- non-informative features add noise (e.g. house color)
Slide 5: Alternative hypotheses $h(x)$: which to choose?
[Figure: two rent-prediction panels over input $x$: house size (sqm), a linear fit $h(x) = 8.5x$ and a complex non-linear fit $h(x)$]
- linear functions are surprisingly powerful
- non-linear functions can achieve low error, but can still err badly
- a model should learn the underlying function and generalise to future data (Lectures 7 & 8)
Slide 6: Alternative hypotheses $h(x)$: which to choose?
[Figure: two rent-prediction panels, output $y$: rent against input $x_2$: house age, with linear and non-linear fits]
- a linear function cannot explain the bimodal behaviour of $x_{\text{age}}$; this motivates basis functions
Slide 7: Outline
1. Linear regression
2. Basis functions: polynomial basis, Gaussian basis
Slide 8: A regression problem
- inputs $x^{(i)} = (x^{(i)}_1, \ldots, x^{(i)}_d)^T \in \mathbb{R}^d$ with $d$ features/properties/dimensions/covariates
- a scalar target/response/output/label $y^{(i)} \in \mathbb{R}$
- a dataset of $N$ data points $X = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\} = \{x^{(i)}, y^{(i)}\}_{i=1}^N$
- in matrix form the dataset is
  $X = \begin{pmatrix} x^{(1)}_1 & \cdots & x^{(1)}_d \\ \vdots & & \vdots \\ x^{(N)}_1 & \cdots & x^{(N)}_d \end{pmatrix} = \begin{pmatrix} x^{(1)T} \\ \vdots \\ x^{(N)T} \end{pmatrix} \in \mathbb{R}^{N \times d}, \quad y = \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{pmatrix} \in \mathbb{R}^N$
- learn a function $h(\cdot): \mathbb{R}^d \to \mathbb{R}$ with $y^{(i)} \approx h(x^{(i)})$
- two questions: (1) which function family $h(x)$ to choose? (2) how to measure $h(x) \approx y$?
Slide 9: Linear regression
- for multivariate inputs $x \in \mathbb{R}^d$, linear regression defines
  $h_w(x) = \sum_{j=0}^{d} w_j x_j = w^T x$
  where $w \in \mathbb{R}^{d+1}$ are the linear weight parameters
- encode $x_0 = 1$; then $w_0$ encodes the intercept
- the hypothesis class is $\mathcal{H} = \{h_w : w \in \mathbb{R}^{d+1}\}$
- all predictions in matrix notation:
  $\begin{pmatrix} h(x^{(1)}) \\ \vdots \\ h(x^{(N)}) \end{pmatrix} = \begin{pmatrix} w^T x^{(1)} \\ \vdots \\ w^T x^{(N)} \end{pmatrix} = Xw$
- measure the prediction error by the squared error/loss $L((x^{(i)}, y^{(i)}), h(\cdot)) = (y^{(i)} - h(x^{(i)}))^2$
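To make the matrix notation concrete, a minimal NumPy sketch (not from the slides; the toy numbers are borrowed from the rent example) that computes all predictions $Xw$ and the per-point squared losses at once:

```python
import numpy as np

# toy data: N = 3 points with d = 1 feature, plus the bias trick x_0 = 1
X = np.array([[1.0, 31.0],
              [1.0, 49.0],
              [1.0, 101.0]])   # shape (N, d+1), first column encodes x_0 = 1
y = np.array([705.0, 840.0, 1200.0])

w = np.array([0.0, 9.0])       # w_0 (intercept) and w_1 (slope)

predictions = X @ w            # h(x^(i)) = w^T x^(i) for all i at once
losses = (y - predictions)**2  # squared loss per data point
print(predictions, losses)
```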
Slide 10: Can we predict apartment rent?
[Figure: rent prediction, output $y$: rent against input $x$: house size (sqm)]
   i    input x^(i)   output y^(i)
   1        31            705
   2        33            540
   3        31            650
   4        49            840
   5        53            890
   6        69            850
   7       101           1200
   8        99           1150
   9       143           1700
  10       132            900
  11       109           1550
- we observe the data $X = (x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})$ with $N = 11$
- we assume $y^{(i)} \approx f(x^{(i)})$ where $f(\cdot)$ is the true function
Slide 11: Can we predict apartment rent?
[Figure: rent prediction with the fit $h(x) = 9x + 400$ over input $x$: house size (sqm)]
   i    input x^(i)   output y^(i)   h(x^(i)) = 9 x^(i) + 400
   1        31            705              679
   2        33            540              697
   3        31            650              679
   4        49            840              841
   5        53            890              877
   6        69            850             1021
   7       101           1200             1309
   8        99           1150             1291
   9       143           1700             1687
  10       132            900             1588
  11       109           1550             1381
- linear hypothesis class $h_w(x) = w_1 x + w_0 = w^T x$
- encode $x = (x, 1)^T$ with $w = (w_1, w_0)^T$
- compute the losses $(y^{(i)} - h(x^{(i)}))^2$
Slide 12: Which parameters to choose?
[Figure: rent prediction, output $y$: rent against input $x$: house size (sqm)]
- choose the parameters to minimize the empirical risk (mean loss):
  $\hat{w} = \arg\min_w E(h(\cdot) \mid X)$, where
  $E(h(\cdot) \mid X) = \frac{1}{N} \sum_{i=1}^{N} (y^{(i)} - h(x^{(i)}))^2 = \frac{1}{N} \lVert y - Xw \rVert^2$
Slide 13: Empirical risk
[Figure: left, the rent data with the fit $h(x) = 5x$ (intercept fixed to $b = 0$); right, the empirical risk as a function of $w$, marked at $w = 5$]
- the empirical risk quantifies how well the function fits the data
- here $h(x) = w_1 x + 0$ with $w_1 = 5$
Slide 14: Empirical risk
[Figure: left, the rent data with the fits $h(x) = 5x$ and $h(x) = 11.7x$; right, the empirical risk marked at $w = 5$ and $w = 11.7$]
- the empirical risk quantifies how well the function fits the data
- here $h(x) = w_1 x + 0$ with $w_1 = 11.7$
Slide 15: Empirical risk
[Figure: left, the rent data with the fits $h(x) = 5x$, $h(x) = 11.7x$ and $h(x) = 15x$; right, the empirical risk marked at $w = 5$, $11.7$ and $15$]
- the empirical risk quantifies how well the function fits the data
- here $h(x) = w_1 x + 0$ with $w_1 = 15$
- the best hypothesis was $w_1 = 11.7$ when $w_0 = 0$ (only on this data $X$!)
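As a sketch of slides 13 to 15, the empirical risk of $h(x) = w_1 x$ can be evaluated for the three candidate slopes on the 11 rent observations from slide 10:

```python
import numpy as np

# the 11 rent observations from slide 10
x = np.array([31, 33, 31, 49, 53, 69, 101, 99, 143, 132, 109], dtype=float)
y = np.array([705, 540, 650, 840, 890, 850, 1200, 1150, 1700, 900, 1550], dtype=float)

def empirical_risk(w1):
    """Mean squared error of h(x) = w1 * x with the intercept fixed to 0."""
    return np.mean((y - w1 * x) ** 2)

for w1 in [5.0, 11.7, 15.0]:
    print(f"w1 = {w1:5.1f}   empirical risk = {empirical_risk(w1):10.1f}")
# w1 = 11.7 gives the smallest risk of the three candidates
```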
Slide 16: Empirical risk
[Figure: 2D empirical risk surface over $(w_0, w_1)$]
Slide 17: Derivatives
- let's minimize the empirical risk
- minimization of functions is based on derivatives:
  $\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$
- the negative derivative gives the direction of steepest descent
Slide 18: Derivatives
- the derivative of the empirical error with respect to $w$ is (for a 1-D problem)
  $\frac{\partial E(h(\cdot) \mid X)}{\partial w} = \frac{\partial}{\partial w} \frac{1}{N} \sum_{i=1}^{N} (y^{(i)} - w x^{(i)})^2 = \frac{2}{N} \sum_{i=1}^{N} (y^{(i)} - w x^{(i)}) \frac{\partial (y^{(i)} - w x^{(i)})}{\partial w} = -\frac{2}{N} \sum_{i=1}^{N} x^{(i)} \underbrace{(y^{(i)} - w x^{(i)})}_{i\text{th data error}}$
- the gradient of the empirical error with respect to $w = (w_1, \ldots, w_d)^T$ is
  $\nabla_w E(h_w(\cdot) \mid X) = \left( \frac{\partial E(h_{w_1,\ldots,w_d}(\cdot) \mid X)}{\partial w_1}, \ldots, \frac{\partial E(h_{w_1,\ldots,w_d}(\cdot) \mid X)}{\partial w_d} \right)^T$
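A quick way to sanity-check the derived gradient is to compare it against a finite-difference approximation; a short sketch (not from the slides), again on the rent data:

```python
import numpy as np

x = np.array([31, 33, 31, 49, 53, 69, 101, 99, 143, 132, 109], dtype=float)
y = np.array([705, 540, 650, 840, 890, 850, 1200, 1150, 1700, 900, 1550], dtype=float)

def risk(w):
    return np.mean((y - w * x) ** 2)

def analytic_grad(w):
    # dE/dw = -(2/N) sum_i x^(i) (y^(i) - w x^(i)), as derived on slide 18
    return -2.0 * np.mean(x * (y - w * x))

w, h = 5.0, 1e-4
numeric = (risk(w + h) - risk(w - h)) / (2 * h)   # central difference
print(analytic_grad(w), numeric)                   # the two values should agree closely
```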
Slide 19: Iterative gradient descent
- choose an initial parameter $w^{(0)}$ (e.g. all zeros) and a stepsize $\alpha$
- iterative gradient descent (GD): for $k = 1, \ldots, K$, update
  $w^{(k+1)} = w^{(k)} - \alpha \underbrace{\nabla_w E(h(\cdot) \mid X)}_{\text{gradient}} = w^{(k)} + \frac{2\alpha}{N} \sum_{i=1}^{N} x^{(i)} \underbrace{(y^{(i)} - w^{(k)T} x^{(i)})}_{i\text{th data point error}}$
- output: the final $K$th regression weight vector $w^{(K)}$
- the choice of the step size or learning rate $\alpha$ is crucial:
  - if $\alpha$ is too large, the iterations may not converge
  - if $\alpha$ is too small, convergence is very slow
  - $\alpha$ is usually chosen by trial and error
- the gradient $\nabla_w E(h(\cdot) \mid X)$ points in the direction of the maximal rate of increase of $E(h(\cdot) \mid X)$ at the current value $w$; subtracting the gradient from $w^{(k)}$ maximally decreases $E(h(\cdot) \mid X)$
- computational complexity: $O(KNd)$
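A minimal gradient-descent sketch on the rent data; the step size and iteration count here are trial-and-error choices of mine, not values from the slides:

```python
import numpy as np

# rent data from slide 10, with the bias trick x_0 = 1
x = np.array([31, 33, 31, 49, 53, 69, 101, 99, 143, 132, 109], dtype=float)
y = np.array([705, 540, 650, 840, 890, 850, 1200, 1150, 1700, 900, 1550], dtype=float)
X = np.column_stack([np.ones_like(x), x])

w = np.zeros(2)      # initial parameters w^(0)
alpha = 1e-4         # step size, found by trial and error; larger values diverge here
K = 200_000          # the unscaled features make the intercept direction converge slowly

for k in range(K):
    residual = y - X @ w                              # y^(i) - w^(k)T x^(i) for all i
    w = w + (2 * alpha / len(y)) * (X.T @ residual)   # gradient descent update

print(w)   # approaches the least-squares solution (w_0, w_1)
```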
Slide 20: Gradient minimization
- we use the update equation
  $w^{(k+1)} = w^{(k)} + \frac{2\alpha}{N} \sum_{i=1}^{N} x^{(i)} (y^{(i)} - w^{(k)T} x^{(i)})$
- here the stepsize $\alpha$ is good
Slide 21: Gradient minimization
- we use the update equation
  $w^{(k+1)} = w^{(k)} + \frac{2\alpha}{N} \sum_{i=1}^{N} x^{(i)} (y^{(i)} - w^{(k)T} x^{(i)})$
- with too large an $\alpha$, we are not converging
Slide 22: Stochastic gradient descent
- in gradient descent, each data point pulls on the parameters:
  $w^{(k+1)} = w^{(k)} - \alpha \nabla_w E(h(\cdot) \mid X) = w^{(k)} + \frac{2\alpha}{N} \sum_{i=1}^{N} x^{(i)} \underbrace{(y^{(i)} - w^{(k)T} x^{(i)})}_{i\text{th data point error}}$
- in stochastic gradient descent (SGD) we compute the gradient over random minibatches $I \subset \{1, \ldots, N\}$ of size $M < N$:
  $w^{(k+1)} = w^{(k)} - \alpha \nabla_w E(h(\cdot) \mid X_I) = w^{(k)} + \frac{2\alpha}{M} \sum_{i \in I} x^{(i)} (y^{(i)} - w^{(k)T} x^{(i)})$
- computational complexity: $O(KMd)$
- SGD is one of the most powerful optimizers for large models
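A matching SGD sketch; the minibatch size, step size and iteration count are again hand-picked assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# rent data from slide 10, with the bias trick x_0 = 1
x = np.array([31, 33, 31, 49, 53, 69, 101, 99, 143, 132, 109], dtype=float)
y = np.array([705, 540, 650, 840, 890, 850, 1200, 1150, 1700, 900, 1550], dtype=float)
X = np.column_stack([np.ones_like(x), x])

w = np.zeros(2)
alpha, M = 2e-5, 4                 # step size and minibatch size M < N (tuned by hand)

for k in range(100_000):
    I = rng.choice(len(y), size=M, replace=False)    # random minibatch indices I
    residual = y[I] - X[I] @ w
    w = w + (2 * alpha / M) * (X[I].T @ residual)    # gradient over the minibatch only

print(w)   # with a constant step size, SGD hovers near the least-squares optimum
```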
Slide 23: Analytical solution for linear regression
- to minimize $E(h(\cdot) \mid X)$ we can directly solve for where its gradient is zero: $\nabla_w E(h(\cdot) \mid X) = 0$
- the solution is (DL book 5.1.4)
  $\hat{w} = (X^T X)^{-1} X^T y$
- we get the global optimum, since the empirical risk (of linear regression) is convex
- $X^T X$ needs to be invertible (Regression Home Assignment)
- the matrix inverse is an $O(d^3)$ operation
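A sketch of the analytical solution; solving the normal equations $(X^T X) w = X^T y$ with np.linalg.solve is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

x = np.array([31, 33, 31, 49, 53, 69, 101, 99, 143, 132, 109], dtype=float)
y = np.array([705, 540, 650, 840, 890, 850, 1200, 1150, 1700, 900, 1550], dtype=float)
X = np.column_stack([np.ones_like(x), x])   # bias trick: x_0 = 1

# solve (X^T X) w = X^T y instead of computing (X^T X)^{-1} explicitly
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)

# equivalently, np.linalg.lstsq also handles the rank-deficient case
w_hat2, *_ = np.linalg.lstsq(X, y, rcond=None)
```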
Slide 24: ID card of linear regression
- input/feature space: $\mathcal{X} = \mathbb{R}^d$
- target space: $\mathcal{Y} = \mathbb{R}$
- function family: $h(x) = w^T x = \sum_{j=0}^{d} w_j x_j$ (bias trick: $x_0 = 1$ and $j$ starts from 0)
- loss function: $L((x, y), h(\cdot)) = (h(x) - y)^2$
- empirical risk: $E(h(\cdot) \mid X) = \frac{1}{N} \lVert Xw - y \rVert_2^2$
- empirical risk minimization leads to the parameters
  $\hat{w} = (X^T X)^{-1} X^T y$, or iteratively $w^{(k+1)} = w^{(k)} + \frac{2\alpha}{N} \sum_{i=1}^{N} x^{(i)} (y^{(i)} - w^{(k)T} x^{(i)})$
- DL book: covered in chapter 5.1
Slide 25: Case study: predict red wine quality with linear regression?
- one wants to understand what makes a wine taste good
- we have measured the chemical composition of many wines ($x$) and tasting evaluations that rate the wines ($y$)
- task: predict the wine quality $h(x)$ given its composition $x$
Slide 26: Wine measurement data
- we construct a dataset $X$ of $N = 1599$ wine measurements $x$
- we manually obtain a rating $y \in [0, 10]$ for each wine from subjective tastings
- the 11 features $x^{(i)}_1, \ldots, x^{(i)}_{11}$ are: fixed acid, volatile acid, citric acid, sugar, chlorides, free sulfur, total sulfur, density, pH, sulphates, alcohol; the target is the quality rating
  $X = \begin{pmatrix} x^{(1)T} \\ x^{(2)T} \\ \vdots \\ x^{(1599)T} \end{pmatrix}, \quad y = \begin{pmatrix} 5 \\ 5 \\ \vdots \\ 6 \end{pmatrix}$
*P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009.
Slide 27: Linear regression on wine
- linear hypothesis space $\mathcal{H} = \{h_w(x) = w^T x : w \in \mathbb{R}^{11}\}$
- the empirical risk minimizer (fits the 1599 wines best) is
  $\hat{w} = \arg\min_w \frac{1}{N} \sum_{i=1}^{N} (y^{(i)} - w^T x^{(i)})^2 = (X^T X)^{-1} X^T y$
- [Table: the fitted coefficient vector $\hat{w}$ alongside the 11 features (fixed acid, volatile acid, citric acid, sugar, chlorides, free sulfur, total sulfur, density, pH, sulphates, alcohol) and the quality targets $y$]
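A hedged end-to-end sketch of this case study, assuming the semicolon-separated UCI file winequality-red.csv from the Cortez et al. reference is available locally (the filename and column layout follow the UCI distribution and are assumptions here):

```python
import numpy as np
import pandas as pd

# assumes the UCI 'winequality-red.csv' (semicolon-separated) is in the working directory
data = pd.read_csv("winequality-red.csv", sep=";")
X = data.drop(columns="quality").to_numpy()     # N = 1599 wines, d = 11 features
y = data["quality"].to_numpy(dtype=float)

# empirical risk minimizer, no intercept, matching the slide's hypothesis space
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
mse = np.mean((X @ w_hat - y) ** 2)             # empirical risk on the training data
print(w_hat, mse)
```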
Slide 28: Predictions
- predictions: $h(x^{(i)}) = \sum_j \hat{w}_j x^{(i)}_j = \hat{w}^T x^{(i)}$
- in matrix notation: $X\hat{w} = (\hat{w}^T x^{(1)}, \hat{w}^T x^{(2)}, \ldots, \hat{w}^T x^{(1599)})^T$
- [Worked example: $h(x^{(1)})$ and $h(x^{(2)})$ expanded as explicit dot products of the fitted weights (one visible coefficient: $-1.09$) with the feature values]
Slide 29: Result on wine
- we achieve the empirical risk (mean squared error)
  $E(h(\cdot) \mid X) = \frac{1}{N} \sum_{i=1}^{N} (h(x^{(i)}) - y^{(i)})^2$
- [Example: a true rating $y = 5$ compared against its prediction from $X\hat{w}$]
Slide 30: Outline
1. Linear regression
2. Basis functions: polynomial basis, Gaussian basis
Slide 31: Non-linearity
- so far we have analysed linear models, where each feature's contribution to the output is summed independently
- most machine learning problems are non-linear:
  - non-linear effects, e.g. $\log(x_{\text{alcohol}})$
  - combined effects, e.g. $x_{\text{sugar}} \cdot x_{\text{alcohol}}$
- let's expand the feature space by considering $n$ basis functions
  $h(x) = \sum_{j=0}^{n} w_j \phi_j(x) = w^T \phi(x)$
  where $\phi(x): \mathbb{R}^d \to \mathbb{R}^n$, usually with $n > d$ and $\phi_0(x) = 1$
- the dataset is then $\Phi = (\phi(x^{(1)}), \ldots, \phi(x^{(N)}))^T \in \mathbb{R}^{N \times n}$
- risk: $\frac{1}{N} \lVert \Phi w - y \rVert_2^2$, solution: $\hat{w} = (\Phi^T \Phi)^{-1} \Phi^T y$
Slide 32: Outline
1. Linear regression
2. Basis functions: polynomial basis, Gaussian basis
Slide 33: Polynomial expansion
- map $\phi: (x_1, x_2) \mapsto (x_1, x_2, x_1^2 x_2^2)$
- the product $x_1^2 x_2^2$ solves the problem (feature expansion)
- the trivial solution is now $w_3 = 1$
Slide 34: Polynomial basis functions
- let's consider non-additive effects via $M$th-order polynomial basis functions:
  $\phi^{(M)}(x) = \{x_{j_1} x_{j_2} \cdots x_{j_M} : j_1, \ldots, j_M \in \{1, \ldots, d\}\}$
  where
  $\phi^{(0)}(x) = 1$
  $\phi^{(1)}(x) = (x_1, x_2, \ldots, x_d)^T$
  $\phi^{(2)}(x) = (x_1^2, x_1 x_2, \ldots, x_{d-1} x_d, x_d^2)^T$
- $d = 11$ features give 55 pairwise terms, 165 triplets, etc.
- basis expansion dramatically increases the hypothesis space
- the bases are precomputed to produce the $\Phi$ matrix
- basis functions result in a non-linear prediction
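For example, scikit-learn's PolynomialFeatures is one standard way to precompute the $\Phi$ matrix; a degree-2 sketch on made-up data:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])            # two points with d = 2 features

phi = PolynomialFeatures(degree=2)    # features 1, x1, x2, x1^2, x1*x2, x2^2
Phi = phi.fit_transform(X)            # the precomputed basis matrix Phi
print(phi.get_feature_names_out())
print(Phi)
```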
Slide 35: Polynomial basis example
- sample 100 points with $x^{(i)} \in [-1, 1]$ and $y^{(i)} = \sin(\pi x^{(i)}) + \varepsilon$
[Figure: black dots: 7 data points; red dots: more samples; the true curve $\sin(\pi x)$ and a degree-1 polynomial fit $h(x) = 1.37x$]
Slide 36: Polynomial regressor, M = 0
[Figure: data, the true curve $\sin(\pi x)$, and a degree-0 polynomial fit]
- $h(x) = \hat{w}_0$

Slide 37: Polynomial regressor, M = 1
[Figure: data, the true curve $\sin(\pi x)$, and a degree-1 polynomial fit]
- $h(x) = \hat{w}_0 + \hat{w}_1 x$

Slide 38: Polynomial regressor, M = 2
[Figure: data, the true curve $\sin(\pi x)$, and a degree-2 polynomial fit]
- $h(x) = \hat{w}^T \phi(x) = \hat{w}_0 + \hat{w}_1 x + \hat{w}_2 x^2$

Slide 39: Polynomial regressor, M = 3
[Figure: data, the true curve $\sin(\pi x)$, and a degree-3 polynomial fit]
- $h(x) = \hat{w}^T \phi(x) = \hat{w}_0 + \hat{w}_1 x + \hat{w}_2 x^2 + \hat{w}_3 x^3$

Slide 40: Polynomial regressor, M = 5
[Figure: data, the true curve $\sin(\pi x)$, and a degree-5 polynomial fit]
- $h(x) = \hat{w}^T \phi(x) = \hat{w}_0 + \hat{w}_1 x + \hat{w}_2 x^2 + \hat{w}_3 x^3 + \hat{w}_4 x^4 + \hat{w}_5 x^5$
Slide 41: Polynomial regressor, M = 5, with enough data
[Figure: with enough samples, the degree-5 polynomial fit closely follows $\sin(\pi x)$]
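A sketch reproducing this experiment: sample noisy $\sin(\pi x)$ data and fit polynomials of increasing degree by least squares on the $\Phi$ matrix (the noise level 0.1 is an assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = rng.uniform(-1, 1, size=N)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=N)   # noise scale 0.1 is an assumption

for M in [0, 1, 3, 5]:
    Phi = np.vander(x, M + 1, increasing=True)     # columns 1, x, ..., x^M
    w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    mse = np.mean((Phi @ w_hat - y) ** 2)
    print(f"degree {M}: training MSE = {mse:.4f}")
```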
Slide 42: Outline
1. Linear regression
2. Basis functions: polynomial basis, Gaussian basis
Slide 43: Kernel basis functions
- a kernel function $K(x, x') \in \mathbb{R}$ measures the similarity of two vectors $x, x' \in \mathbb{R}^d$
- it is the opposite concept to a distance function $D(x, x')$
- a common kernel is the Gaussian kernel
  $K(x, x') = \exp\left( -\frac{1}{2} \frac{\lVert x - x' \rVert^2}{\sigma^2} \right)$
- a kernel basis function encodes the feature $\phi_i(x)$ as the similarity to another point $m^{(i)}$: $\phi_i(x) = K(x, m^{(i)})$
- how do we choose the basis points $m^{(i)}$?
Slide 44: Feature mapping with 3 Gaussian bases
- 3 features $\phi_j(x) = e^{-\frac{(x - m^{(j)})^2}{2\sigma^2}}$ at $m^{(j)} = 50, 100, 150$
- feature mapping $\phi: x \mapsto (\phi_1(x), \phi_2(x), \phi_3(x))$
- e.g. $x = 31$ becomes $\phi(31) = (0.74, 0.02, 0.00)$
- e.g. $x = 69$ becomes $\phi(69) = (0.74, 0.46, 0.00)$
- e.g. $x = 143$ becomes $\phi(143) = (0.00, 0.22, 0.96)$
Slide 45: 3 Gaussian bases on 1D
- three Gaussian features $\phi_j(x) = e^{-\frac{(x - m^{(j)})^2}{2\sigma^2}}$ with $(m^{(1)}, m^{(2)}, m^{(3)}) = (50, 100, 150)$
- the hypothesis is a sum of weighted Gaussian features:
  $h(x) = \sum_{j=1}^{3} w_j \phi_j(x)$
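A sketch of this three-Gaussian feature map; the bandwidth $\sigma \approx 24.5$ is an assumption, chosen so that the mapped values match the slide's examples, e.g. $\phi(31) \approx (0.74, 0.02, 0.00)$:

```python
import numpy as np

centers = np.array([50.0, 100.0, 150.0])   # basis points m^(1), m^(2), m^(3)
sigma = 24.5                               # bandwidth chosen to match the slide's numbers

def phi(x):
    """Map a scalar x to three Gaussian similarity features."""
    return np.exp(-(x - centers) ** 2 / (2 * sigma ** 2))

for x in [31.0, 69.0, 143.0]:
    print(x, np.round(phi(x), 2))   # e.g. phi(31) ~ [0.74, 0.02, 0.00]
```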
Slide 46: ID card of linear basis regression
- input space: $\mathcal{X} = \mathbb{R}^d$
- feature space: $\mathcal{F} = \mathbb{R}^n$ via the basis function $\phi(x) \in \mathbb{R}^n$
- the dataset is then $\Phi = (\phi(x^{(1)}), \ldots, \phi(x^{(N)}))^T \in \mathbb{R}^{N \times n}$
- target space: $\mathcal{Y} = \mathbb{R}$
- function family: $h(x) = w^T \phi(x)$
- loss function: $L((x, y), h(\cdot)) = (h(x) - y)^2$
- empirical risk: $E(h(\cdot) \mid X) = \frac{1}{N} \lVert \Phi w - y \rVert_2^2$
- empirical risk minimization leads to the parameters $\hat{w} = (\Phi^T \Phi)^{-1} \Phi^T y$
Slide 47: Basis function summary
- basis functions $\phi: \mathbb{R}^d \to \mathbb{R}^n$ project the data into a higher-dimensional space (if $n > d$)
- linear regression on the high-dimensional data points $\phi(x)$ leads to a non-linear hypothesis $h(\phi(x))$
- selecting informative basis functions is a difficult task
- polynomial bases take combinations (products) of existing features
- Gaussian bases generate a new feature mapping
Slide 48: Next steps
- next lecture: Regression II, with kernel methods and Bayesian regression, on Friday at 10:15
- DL book: read chapters 5.1 and 5.2 on linear regression
- more information about basis functions: Hastie's book [1], chapters 3.2 & 5; Bishop's book [2], chapter 3.1
- fill out the post-lecture questionnaire in MyCourses! We read and appreciate all feedback.

[1] Elements of Statistical Learning, Springer.
[2] Pattern Recognition and Machine Learning, Springer.