1 Cheng Soon Ong & Christian Walder, Research Group and College of Engineering and Computer Science, Canberra, February-June 2018. (Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning".)
2 Part V: Linear Regression 1
3 Linear Regression Given N = 10 observations x = (x_1, ..., x_N)^T with targets t = (t_1, ..., t_N)^T, where x_i ∈ R and t_i ∈ R for i = 1, ..., N. What predictor y(x, w)? What performance measure? What is the optimal solution w? Recall: projection, inverse, eigenvalue decomposition. [Figure: the training targets t plotted against the inputs x.]
4 Probabilities, Losses: Gaussian distribution, Bayes rule, expected loss, cross validation.
5 Maximum Likelihood Consider the linear regression problem, where we have random variables x_n and t_n. We assume a conditional model for t_n given x_n, and propose a distribution parameterised by θ: t_n | x_n ∼ density(θ). For a given θ the density defines the probability of observing t_n | x_n. We are interested in finding the θ that maximises the probability of the data; viewed as a function of θ, this probability is called the likelihood.
6 Given a training data set of N observations {x_n} and target values t_n. Goal: learn to predict the value of one or more target values t given a new value of the input x. Example: polynomial curve fitting (see Introduction). [Figure: noisy training targets t plotted against x.]
7 Supervised Learning Training phase: the training data x and training targets t are used to adjust the model parameter w, fixing the most appropriate w*. Test phase: test data x is passed through the model with fixed parameter w* to predict the test target t.
8 Why Linear? Analytic solution when using the least squares loss. Well understood statistical behaviour. Efficient algorithms exist for convex losses and regularisers. But what if the relationship is non-linear?
9 Linear combination of fixed nonlinear basis functions:
y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x),
with parameters w = (w_0, ..., w_{M−1})^T and basis functions φ(x) = (φ_0(x), ..., φ_{M−1}(x))^T. By convention φ_0(x) = 1, so w_0 is the bias parameter.
10 Polynomial Basis Functions For a scalar input variable x: φ_j(x) = x^j. Limitation: polynomials are global functions of the input variable x. Extension: split the input space into regions and fit a different polynomial to each region (spline functions).
11 Gaussian Basis Functions For a scalar input variable x: φ_j(x) = exp{ −(x − μ_j)^2 / (2 s^2) }. This is not a probability distribution, and no normalisation is required; the scaling is taken care of by the model parameters w.
12 Sigmoidal Basis Functions For a scalar input variable x: φ_j(x) = σ((x − μ_j) / s), where σ(a) is the logistic sigmoid function defined by σ(a) = 1 / (1 + exp(−a)). σ(a) is related to the hyperbolic tangent tanh(a) by tanh(a) = 2σ(2a) − 1.
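To make these basis-function families concrete, here is a small illustrative sketch (not part of the original slides) that builds polynomial, Gaussian and sigmoidal design matrices with NumPy; the centres mu and scale s are arbitrary choices for the example.

```python
import numpy as np

def polynomial_basis(x, M):
    """Phi[n, j] = x_n^j for j = 0..M-1 (column j = 0 is the constant 1)."""
    return np.vander(x, M, increasing=True)

def gaussian_basis(x, centres, s):
    """Phi[n, j] = exp(-(x_n - mu_j)^2 / (2 s^2)), plus a leading bias column."""
    phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((x.size, 1)), phi])

def sigmoidal_basis(x, centres, s):
    """Phi[n, j] = sigma((x_n - mu_j) / s), plus a leading bias column."""
    phi = 1.0 / (1.0 + np.exp(-(x[:, None] - centres[None, :]) / s))
    return np.hstack([np.ones((x.size, 1)), phi])

x = np.linspace(0, 1, 10)                  # toy inputs
mu = np.linspace(0, 1, 4)                  # example centres
print(polynomial_basis(x, 4).shape)        # (10, 4)
print(gaussian_basis(x, mu, s=0.2).shape)  # (10, 5)
```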
13 Other Basis Functions Fourier basis: each basis function represents a specific frequency and has infinite spatial extent. Wavelets: localised in both space and frequency (and mutually orthogonal, which simplifies their application). Splines: piecewise polynomials restricted to regions of the input space, with additional constraints where the pieces meet (e.g. smoothness conditions on the derivatives). [Figure: linear, quadratic, cubic and quartic splines approximating the points {(0, 0), (1, 1), (2, 1), (3, 0), (4, 2), (5, 1)}.]
14 Least Squares No special assumption is made about the basis functions φ_j(x); in the simplest case, one can think of φ_j(x) = x^j, or φ(x) = x. Assume the target t is given by
t = y(x, w) + ε,
the sum of a deterministic part y(x, w) and noise ε, where ε is a zero-mean Gaussian random variable with precision (inverse variance) β. Thus
p(t | x, w, β) = N(t | y(x, w), β^{−1}).
[Figure: the conditional density p(t | x_0), centred on y(x_0), for an input x_0 on the curve y(x).]
15 The likelihood of one target t given the data x was
p(t | x, w, β) = N(t | y(x, w), β^{−1}).
Now consider a set of inputs X with corresponding target values t. Assume the data are independent and identically distributed (i.i.d.), meaning that they are drawn independently from the same distribution. The likelihood of the targets t is then
p(t | X, w, β) = Π_{n=1}^N N(t_n | y(x_n, w), β^{−1}) = Π_{n=1}^N N(t_n | w^T φ(x_n), β^{−1}).
From now on we drop the conditioning variable X from the notation, since in supervised learning we do not seek to model the distribution of the input data.
16 Consider the logarithm of the likelihood p(t | w, β) (the logarithm is a monotone function!):
ln p(t | w, β) = Σ_{n=1}^N ln N(t_n | w^T φ(x_n), β^{−1})
             = Σ_{n=1}^N ln [ √(β / 2π) exp{ −(β/2) (t_n − w^T φ(x_n))^2 } ]
             = (N/2) ln β − (N/2) ln(2π) − β E_D(w),
where the sum-of-squares error function is
E_D(w) = (1/2) Σ_{n=1}^N {t_n − w^T φ(x_n)}^2.
Hence arg max_w ln p(t | w, β) = arg min_w E_D(w).
17 Goal: find a more compact representation. Rewrite the error function as
E_D(w) = (1/2) Σ_{n=1}^N {t_n − w^T φ(x_n)}^2 = (1/2) (t − Φw)^T (t − Φw),
where t = (t_1, ..., t_N)^T and Φ is the N × M design matrix whose n-th row is φ(x_n)^T:
Φ = [ φ_0(x_1) φ_1(x_1) ... φ_{M−1}(x_1) ;
      φ_0(x_2) φ_1(x_2) ... φ_{M−1}(x_2) ;
      ...
      φ_0(x_N) φ_1(x_N) ... φ_{M−1}(x_N) ].
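As a quick numerical check of this matrix form (an illustrative sketch, not from the slides), the two expressions for E_D(w) agree; Phi here is a stand-in N × M design matrix, e.g. one built with the basis-function helpers sketched earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 4
Phi = rng.normal(size=(N, M))   # stand-in design matrix (rows are phi(x_n)^T)
t = rng.normal(size=N)          # stand-in targets
w = rng.normal(size=M)

E_sum = 0.5 * sum((t[n] - w @ Phi[n]) ** 2 for n in range(N))  # sum over data points
E_mat = 0.5 * (t - Phi @ w) @ (t - Phi @ w)                    # compact matrix form
print(np.isclose(E_sum, E_mat))                                # True: both forms agree
```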
18 The log likelihood is now
ln p(t | w, β) = (N/2) ln β − (N/2) ln(2π) − β E_D(w)
             = (N/2) ln β − (N/2) ln(2π) − β (1/2) (t − Φw)^T (t − Φw).
Find the critical points of ln p(t | w, β). The gradient with respect to w is
∇_w ln p(t | w, β) = β Φ^T (t − Φw).
Setting the gradient to zero gives 0 = Φ^T t − Φ^T Φ w, which results in
w_ML = (Φ^T Φ)^{−1} Φ^T t = Φ† t,
where Φ† is the Moore-Penrose pseudo-inverse of the matrix Φ.
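A minimal sketch (illustrative, using the same toy setup as above) of computing w_ML; in practice a least-squares solver or the pseudo-inverse is preferred over forming (Φ^T Φ)^{−1} explicitly, since it is numerically more stable.

```python
import numpy as np

def fit_least_squares(Phi, t):
    """Maximum-likelihood weights w_ML = pinv(Phi) @ t."""
    return np.linalg.pinv(Phi) @ t

# Equivalent, and the usual numerical route:
# w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
```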
19 The log likelihood at the optimal w_ML is
ln p(t | w_ML, β) = (N/2) ln β − (N/2) ln(2π) − β (1/2) (t − Φw_ML)^T (t − Φw_ML).
Finding the critical point with respect to β, ∂ ln p(t | w_ML, β) / ∂β = 0, results in
1 / β_ML = (1/N) (t − Φw_ML)^T (t − Φw_ML).
Note: we can first find the maximum likelihood solution for w, as it does not depend on β, and then use w_ML to find the maximum likelihood solution for β. Could we have chosen to optimise with respect to β first, and then with respect to w?
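Continuing the sketch (illustrative only), the noise precision follows directly from the residuals of the fitted model:

```python
import numpy as np

def fit_noise_precision(Phi, t, w_ml):
    """beta_ML from 1/beta_ML = mean squared residual of the fitted model."""
    residual = t - Phi @ w_ml
    return 1.0 / np.mean(residual ** 2)
```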
20 The target vector is t = (t_1, ..., t_N)^T ∈ R^N. The basis vectors ϕ_j = (Φ_{1j}, ..., Φ_{Nj})^T = (φ_j(x_1), ..., φ_j(x_N))^T ∈ R^N (the columns of Φ) span a subspace S. Find w such that y = (y(x_1, w), ..., y(x_N, w))^T ∈ S is closest to t. [Figure: t and its closest point y in the subspace S spanned by ϕ_1 and ϕ_2.]
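This geometric picture can be checked numerically (illustrative sketch): the fitted vector y = Φ w_ML is the orthogonal projection of t onto the column space of Φ, so the residual is orthogonal to every basis vector ϕ_j.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 10, 3
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)

w_ml = np.linalg.pinv(Phi) @ t
y = Phi @ w_ml                             # projection of t onto span of the columns of Phi
print(np.allclose(Phi.T @ (t - y), 0.0))   # True: residual is orthogonal to the subspace S
```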
21 Stochastic Gradient Descent For large data sets, calculating the maximum likelihood parameters w_ML and β_ML in one batch may be costly, and for online applications the data are never all in memory at once. Use a sequential (online) algorithm. If the error function is a sum over data points, E = Σ_n E_n, then: 1. initialise w^(0) to some starting value; 2. update the parameter vector at iteration τ + 1 by
w^(τ+1) = w^(τ) − η ∇E_n,
where E_n is the error function for the n-th data point presented, and η is the learning rate.
22 Stochastic Gradient Descent For the sum-of-squares error function, stochastic gradient descent results in
w^(τ+1) = w^(τ) + η (t_n − w^(τ)T φ(x_n)) φ(x_n).
The value of the learning rate must be chosen carefully: a learning rate that is too large may prevent the algorithm from converging, while one that is too small makes the algorithm follow the data unnecessarily slowly.
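A compact sketch of this update rule (an illustration with an arbitrary fixed learning rate, not from the slides); several passes over the data in random order are shown, and in practice a decaying η is common.

```python
import numpy as np

def sgd_least_squares(Phi, t, eta=0.05, n_epochs=50, seed=0):
    """Sequential (LMS) updates: w <- w + eta * (t_n - w.phi_n) * phi_n."""
    rng = np.random.default_rng(seed)
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_epochs):
        for n in rng.permutation(N):       # present data points in random order
            error = t[n] - w @ Phi[n]
            w = w + eta * error * Phi[n]
    return w
```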
23 Add regularisation in order to prevent overfitting: minimise
E_D(w) + λ E_W(w)
with regularisation coefficient λ. For the simple quadratic regulariser E_W(w) = (1/2) w^T w, the solution minimising the regularised error is
w = (λI + Φ^T Φ)^{−1} Φ^T t.
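A short sketch of this closed-form solution (illustrative; in practice the bias weight is often left unregularised):

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """w = (lam * I + Phi^T Phi)^{-1} Phi^T t, solved as a linear system."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```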
24 A more general regulariser is
E_W(w) = (1/2) Σ_{j=1}^M |w_j|^q.
The choice q = 1 (lasso) leads to a sparse model if λ is large enough. [Figure: contours of the regulariser for q = 0.5, q = 1, q = 2 and q = 4.]
25 Comparison of Quadratic and Lasso Regulariser Assume a sufficiently large regularisation coefficient λ. [Figure: contours of the unregularised error in the (w_1, w_2) plane, together with the constraint regions Σ_{j=1}^M w_j^2 ≤ η for the quadratic regulariser and Σ_{j=1}^M |w_j| ≤ η for the lasso; the corners of the lasso region lie on the axes, which is what produces sparse solutions.]
26 More than one target variable per data point: y becomes a vector instead of a scalar. Each output dimension can be treated with a different set of basis functions (which may be necessary if the data in the different target dimensions represent very different types of information). Here we restrict ourselves to the SAME basis functions for all outputs:
y(x, w) = W^T φ(x),
where y is a K-dimensional column vector, W is an M × K matrix of model parameters, and φ(x) = (φ_0(x), ..., φ_{M−1}(x))^T with φ_0(x) = 1, as before. Define the target matrix T whose n-th row is the target vector t_n^T.
27 Suppose the conditional distribution of the target vector is an isotropic Gaussian of the form
p(t | x, W, β) = N(t | W^T φ(x), β^{−1} I).
The log likelihood is then
ln p(T | X, W, β) = Σ_{n=1}^N ln N(t_n | W^T φ(x_n), β^{−1} I)
                = (NK/2) ln(β / 2π) − (β/2) Σ_{n=1}^N ||t_n − W^T φ(x_n)||^2.
28 Maximisation with respect to W results in
W_ML = (Φ^T Φ)^{−1} Φ^T T.
For each target variable t_k we get
w_k = (Φ^T Φ)^{−1} Φ^T t_k = Φ† t_k,
so the solution decouples between the different target variables. This holds also for a general Gaussian noise distribution with an arbitrary covariance matrix. Why? W defines the mean of the Gaussian noise distribution, and the maximum likelihood solution for the mean of a multivariate Gaussian is independent of the covariance.
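A short sketch (illustrative) showing the decoupling numerically: fitting all K outputs at once with the matrix formula gives the same result as fitting each target column separately.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, K = 20, 5, 3
Phi = rng.normal(size=(N, M))
T = rng.normal(size=(N, K))          # one target vector t_n^T per row

W_ml = np.linalg.pinv(Phi) @ T       # all outputs at once
W_cols = np.column_stack([np.linalg.pinv(Phi) @ T[:, k] for k in range(K)])
print(np.allclose(W_ml, W_cols))     # True: the K problems decouple
```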
29 Over-fitting results from a large number of basis functions combined with a relatively small training set. Regularisation can prevent overfitting, but how do we find the correct value of the regularisation constant λ? The frequentist view of model complexity is the bias-variance trade-off.
30 Choose an estimator y(x) to estimate the target value t for each input x, and a loss function L(t, y(x)) which measures the difference between the target t and the estimate y(x). The expected loss is then
E[L] = ∫∫ L(t, y(x)) p(x, t) dx dt.
A common choice is the squared loss L(t, y(x)) = {y(x) − t}^2, for which the expected loss is
E[L] = ∫∫ {y(x) − t}^2 p(x, t) dx dt.
31 Expected loss for the squared loss function:
E[L] = ∫∫ {y(x) − t}^2 p(x, t) dx dt.
E[L] is minimised by choosing the regression function
y(x) = ( ∫ t p(x, t) dt ) / p(x) = ∫ t p(t | x) dt = E_t[t | x]
(use calculus of variations to derive this result).
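A quick Monte Carlo illustration (not from the slides, using an arbitrary toy model t = sin(2πx) + Gaussian noise): the conditional mean achieves a lower empirical squared loss than a shifted predictor.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x = rng.uniform(0, 1, n)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)   # here E[t|x] = sin(2*pi*x)

def expected_squared_loss(predict):
    return np.mean((predict(x) - t) ** 2)

print(expected_squared_loss(lambda x: np.sin(2 * np.pi * x)))        # ~0.09, the noise variance
print(expected_squared_loss(lambda x: np.sin(2 * np.pi * x) + 0.2))  # larger: a biased predictor
```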
32 The regression function which minimises the expected squared loss is given by the mean of the conditional distribution p(t | x). [Figure: the regression function y(x); at an input x_0 its value y(x_0) is the mean of the conditional density p(t | x_0).]
33 Analyse the expected loss
E[L] = ∫∫ {y(x) − t}^2 p(x, t) dx dt.
Rewrite the squared loss as
{y(x) − t}^2 = {y(x) − E[t | x] + E[t | x] − t}^2
            = {y(x) − E[t | x]}^2 + {E[t | x] − t}^2 + 2 {y(x) − E[t | x]} {E[t | x] − t}.
Claim: ∫∫ {y(x) − E[t | x]} {E[t | x] − t} p(x, t) dx dt = 0.
34 Claim: ∫∫ {y(x) − E[t | x]} {E[t | x] − t} p(x, t) dx dt = 0. Separate the factor depending on t from the factor depending only on x:
∫ {y(x) − E[t | x]} ( ∫ {E[t | x] − t} p(x, t) dt ) dx.
Calculate the inner integral over t:
∫ {E[t | x] − t} p(x, t) dt = E[t | x] p(x) − p(x) ∫ t p(x, t) / p(x) dt
                          = E[t | x] p(x) − p(x) E[t | x] = 0.
35 The expected loss is now
E[L] = ∫ {y(x) − E[t | x]}^2 p(x) dx + ∫ var[t | x] p(x) dx.
The first term is minimised by choosing an appropriate y(x). The second term represents the intrinsic variability of the target data and can be regarded as noise; it is independent of the choice of y(x) and cannot be reduced by learning a better y(x).
36 Consider now the dependency on the data set D; the prediction function is now y(x; D). Consider again the squared loss, for which the optimal prediction is given by the conditional expectation h(x):
h(x) = E[t | x] = ∫ t p(t | x) dt.
BUT: we cannot know h(x) exactly, as we would need an infinite number of training data to learn it accurately. Instead, evaluate the performance of a learning algorithm by taking the expectation E_D[·] over all data sets D.
37 Taking the expectation over all data sets D,
E_D[E[L]] = ∫ E_D[{y(x; D) − h(x)}^2] p(x) dx + ∫∫ {h(x) − t}^2 p(x, t) dx dt.
Again, add and subtract the expectation E_D[y(x; D)],
{y(x; D) − h(x)}^2 = {y(x; D) − E_D[y(x; D)] + E_D[y(x; D)] − h(x)}^2,
and show that the mixed term vanishes under the expectation E_D[·].
38 The expected loss E_D[L] over all data sets D decomposes as
expected loss = (bias)^2 + variance + noise,
where
(bias)^2 = ∫ {E_D[y(x; D)] − h(x)}^2 p(x) dx,
variance = ∫ E_D[{y(x; D) − E_D[y(x; D)]}^2] p(x) dx,
noise = ∫∫ {h(x) − t}^2 p(x, t) dx dt.
Variance: how sensitive is the model to small changes in the training set? (How much do the solutions for individual data sets vary around their average?) Squared bias: how accurate is the model across different training sets? (How much does the average prediction over all data sets differ from the desired regression function?)
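A small simulation sketch of this decomposition (illustrative; the data-generating sinusoid, the noise level and the ridge-regularised polynomial model are arbitrary choices): fit the same model to many independently drawn data sets and estimate (bias)^2, variance and noise on a test grid.

```python
import numpy as np

rng = np.random.default_rng(4)
h = lambda x: np.sin(2 * np.pi * x)            # true regression function h(x)
sigma = 0.3                                    # noise standard deviation
n_sets, N, M, lam = 200, 25, 10, 1e-3          # many data sets, each small

x_test = np.linspace(0, 1, 200)
design = lambda x: np.vander(x, M, increasing=True)   # polynomial design matrix

preds = []
for _ in range(n_sets):
    x = rng.uniform(0, 1, N)
    t = h(x) + rng.normal(0, sigma, N)
    Phi = design(x)
    w = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)  # regularised fit
    preds.append(design(x_test) @ w)
preds = np.array(preds)                        # shape (n_sets, len(x_test))

avg_pred = preds.mean(axis=0)
bias2 = np.mean((avg_pred - h(x_test)) ** 2)   # (bias)^2 averaged over the test grid
variance = np.mean(preds.var(axis=0))          # variance averaged over the test grid
noise = sigma ** 2
print(bias2, variance, bias2 + variance + noise)   # approximates the expected loss
```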
39 Simple models have low variance and high bias. [Figure, for a large value of ln λ: left, the result of fitting the model to 100 data sets (only 25 shown); right, the average of the 100 fits in red and the sinusoidal function from which the data were created in green.]
40 Dependence of bias and variance on the model complexity. [Figure, for an intermediate value of ln λ: left, the result of fitting the model to 100 data sets (only 25 shown); right, the average of the 100 fits in red and the sinusoidal function from which the data were created in green.]
41 Complex models have high variance and low bias. [Figure, for a small value of ln λ: left, the result of fitting the model to 100 data sets (only 25 shown); right, the average of the 100 fits in red and the sinusoidal function from which the data were created in green.]
42 Squared bias, variance, their sum, and test error. [Figure: (bias)^2, variance, (bias)^2 + variance, and the test error plotted against ln λ.] The minimum of (bias)^2 + variance occurs close to the value of ln λ that gives the minimum test error.
43 Trade-off between bias and variance: simple models have low variance and high bias; complex models have high variance and low bias. The sum of squared bias and variance has a minimum at a certain model complexity. The expected loss E_D[L] over all data sets D is
expected loss = (bias)^2 + variance + noise.
The noise comes from the data and cannot be removed from the expected loss. To analyse the bias-variance decomposition, many data sets are needed, which are not always available.