Linear Models for Regression
|
|
- Noah Sims
- 5 years ago
- Views:
Transcription
1 Linear Models for Regression CSE 4309 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1
2 The Regression Problem Training data: A set of input-output pairs: (x n, t n ) x n is the n-th training input. t n is the target output for x n. Goal: learn a function y(x), that can predict the target value t for a new input x. So far, this is the standard definition of a generic supervised learning problem. What differentiates regression problems is that the target outputs come from a continuous space. 2
3 A Regression Example circles: training data. red curve: one possible solution: a line. green curve: another possible solution: a cubic polynomial. 3
4 Linear Models for Regression M 1 y x, w = w 0 + w j φ j (x) j=1 Functions φ j are called basis functions. You must decide what these functions should be, before you start training. They can be any functions you want. Parameters w j are weights. They are real numbers. The goal of linear regression is to estimate these weights. The output of training is the values of these weights. 4
5 The Dummy φ 0 Function M 1 y x, w = w 0 + w j φ j (x) j=1 To simplify notation, we define a "dummy" basis function φ 0 x = 1. Then, the above formula becomes: M 1 y x, w = w j φ j (x) j=0 5
6 Dot Product Version M 1 y x, w = w j φ j x j=0 = w T φ x w 0 w 1 w is a column vector of weights: w = w M 1 w T is the transpose of w: w T = [w 0, w 1,, w M 1 ]. φ(x) is a column vector: φ(x) = φ 0 (x) φ 1 (x) φ M 1 (x) 6
7 Common Choices for Basis Functions Gaussian basis functions: φ j x = e (x μ j )2 2s 2 7
8 Common Choices for Basis Functions Gaussian basis functions: φ j x = e (x μ j )2 2s 2 8
9 Common Choices for Basis Functions Sigmoidal basis functions: φ j x = σ x μ j s σ a = e a σ a is called the logistic sigmoid function. We will see it again in neural networks. 9
10 Common Choices for Basis Functions Sigmoidal basis functions: φ j x = σ x μ j s σ a = e a σ a is called the logistic sigmoid function. We will see it again in neural networks. 10
11 Common Choices for Basis Functions Polynomial basis functions. For example: powers of x: φ j x = x j If the basis functions are powers of x, then the regression process fits a polynomial to the data. In other words, the regression process estimates the parameters w j of a polynomial of degree M-1: M 1 y x, w = w j x j j=0 11
12 Global Versus Local Functions Polynomial basis functions are global functions. Their values are far from zero for most of the input range from to +. Changing a single weight w j affects the output y(x, w) in the entire input range. Gaussian functions are local functions. Their values are practically (but not mathematically) zero for most of the input range from to +. Changing a single weight w j practically does not affect the output y(x, w) except in a specific small interval. It is often easier to fit data with local basis functions. Each basis function fits a small region of the input space. 12
13 Linear Versus onlinear Functions A linear function y(x, w) produces an output that depends linearly on both x and w. ote that polynomial, Gaussian, and sigmoidal basis functions are nonlinear. If we use nonlinear basis functions, then the regression process produces a function y(x, w) which is: Linear to w. onlinear to x. It is important to remember: linear regression can be used to estimate nonlinear functions of x. It is called linear regression because y is linear to w, OT because y is linear to x. 13
14 Solving Regression Problems There are different methods for solving regression problems. We will study two approaches: Least squares: find the weights w that minimize the squared error. Regularized least squares: find the weights w that minimize the squared error, using some hand-picked regularization parameter λ. 14
15 The Gaussian oise Assumption Suppose that we want to find the most likely solution. We want to find the weights w that maximize the likelihood of the training data. In order to do that, we need to make an additional assumption about the process that generates outputs based on inputs. A common approach is to assume a Gaussian noise model. t = y x, w + ε In words, t is generated by computing y(x, w) and then adding some noise. The noise ε is a random variable from a zero-mean Gaussian distribution. 15
16 The Gaussian oise Assumption t = y x, w + ε The noise ε is a random variable from a zero-mean Gaussian distribution. Therefore: p t x, w, β = t y(x, w, β 1 ) In the above equation, β is called the precision of the Gaussian, and is defined as the inverse of the variance: β = 1 σ 2 The likelihood that input x leads to output t is a Gaussian, whose mean is y(x, w), and its variance is β 1. 16
17 Finding the Most Likely Solution What is the value of w that is most likely given the data? Suppose we have a set of training inputs: X = x 1,, x We also have a set of corresponding outputs: t = t 1,, t We assume that training inputs are independent of each other. We assume that outputs are conditionally independent of each other, given their inputs and noise parameter β. Then: p w X, β, t = p t X, w, β p(w X, β) p t X, β 17
18 Finding the Most Likely Solution p w X, β, t = p t X, w, β p(w X, β) p t X, β We assume that, given X and β, all values of w are equally likely. Then, p(w X,β) is a constant that does not depend on w. p t X,β Therefore, finding the w that maximizes p w X, β, t is the same as finding the w that maximizes p t X, w, β. p t X, w, β is the likelihood of the training data. So, to find the most likely answer w, we must find the value of w that maximizes the likelihood of the training data. 18
19 Likelihood of the Training Data What is the probability of the training data given w? We assume that outputs are conditionally independent of each other, given their inputs and noise parameter β. Then: p t X, w, β = t n y(x n, w, β 1 ) 19
20 Likelihood of the Training Data Remember that, using dot product notation: y x n, w = w T φ x n Therefore: p t X, w, β = t n y(x n, w, β 1 ) p t X, w, β = t n w T φ(x n, β 1 ) Our goal is to find the weights w that maximize p t X, w, β. 20
21 Log Likelihood of the Training Data p t X, w, β = t n w T φ(x n, β 1 ) Maximizing p t X, w, β is the same as maximizing ln(p t X, w, β ), where ln is the natural logarithm. ln p t X, w, β = ln( t n w T φ(x n, β 1 )) 21
22 Log Likelihood of the Training Data ln p t X, w, β = ln( t n w T φ(x n, β 1 )) = ln = ln = ln 1 β β t n w T φ(x n ) 2 2π e 2 β 2π β 1 2π e + ln e β t n w T φ(x n ) 2 t n w T φ(x n ) 2 2β
23 Log Likelihood of the Training Data ln p t X, w, β = ln β 2π + ln e β t n w T φ(x n ) 2 2 = ln β 2π + β t n w T φ(x n ) 2 2 ote that ln β 2π is independent of w. Therefore, to maximize ln p t X, w, β we must maximize β t n w T φ(x n )
24 Log Likelihood and Sum-of-Squares Error To maximize ln p t X, w, β we must maximize: β t n w T φ(x n ) 2 2 Remember that the sum-of-squares error is defined in the textbook as: E D w = 1 t 2 n w T φ(x n ) 2 Therefore, we want to maximize βe D w, which is the same as saying that we want to minimize E D w. Therefore, we have proven that: the w that maximizes the likelihood of the training data is the same w that minimizes the sum-of-squares error. 24
25 Minimizing Sum-of-Squares Error We want to find the w that minimizes: E D w = 1 t 2 n w T φ(x n ) 2 E D w is a function mapping an M-dimensional vector to a real number. Remember from calculus: to minimize any such function: Compute the gradient vector E D w. This gradient is an M-dimensional row vector. Finding values of w that solve equation E D w = 0. These values of w can be possible maxima or minima. 25
26 Minimizing Sum-of-Squares Error We want to find the w that minimizes: E D w = 1 2 t n w T φ(x n ) 2 To calculate E d w : First, let's calculate the gradient E D,n w of function: E D,n w = t n w T φ(x n ) 2 E D,n w is the composition f g(w) of: f x = x 2 g w = t n w T φ x n According to the chain rule: f g = f g g. 26
27 Minimizing Sum-of-Squares Error E D,n w = t n w T φ(x n ) 2 E D,n w is the composition f g(w) of: f x = x 2 g w = t n w T φ x n According to the chain rule: f g = f g g f x g w = 2x = φ(x n ) T f g(w) g w = 2 t n w T φ x n ( φ x n T ) 27
28 Minimizing Sum-of-Squares Error E D,n w = t n w T φ(x n ) 2 E D,n w = 2 t n w T φ x n φ x n T E D w = 1 2 t n w T φ(x n ) 2 = 1 2 E D,n w Therefore: E D w = 1 2 = 1 2 E D,n w 2 t n w T φ x n φ x n T 28
29 Minimizing Sum-of-Squares Error E D w = t n w T φ x n φ x n T = w T φ x n t n φ x n T = w T φ x n φ x n T t n φ x n T 29
30 Minimizing Sum-of-Squares Error E D w = w T φ x n φ x n T t n φ x n T We want to solve equation E D w = 0. We can simplify expressions using vector and matrix notation: Φ = φ 0 x 1,, φ M 1 x 1 φ 0 x 2,, φ M 1 x 2 t = φ 0 x,, φ M 1 x t 1 t 2 t 30
31 Minimizing Sum-of-Squares Error E D w = w T φ x n φ x n T t n φ x n T We want to solve equation E D w = 0. We can simplify expressions using vector and matrix notation: Φ = φ 0 x 1,, φ M 1 x 1 φ 0 x 2,, φ M 1 x 2 t = φ 0 x,, φ M 1 x t 1 t 2 t 31
32 Minimizing Sum-of-Squares Error φ x n φ x n T = φ 0 x n φ M 1 x n φ 0 x n,, φ M 1 x n = φ 0 x n 2, φ 0 x n φ 1 x n,, φ 0 x n φ M 1 x n φ 0 x n φ 1 x n, φ 1 x n 2,, φ 1 x n φ M 1 x n φ 0 x n φ M 1 x n, φ 1 x n φ M 1 x n,, φ M 1 x n 2 = Φ T Φ 32
33 Minimizing Sum-of-Squares Error E D w = w T φ x n φ x n T t n φ x n T Φ = φ 0 x 1,, φ M 1 x 1 φ 0 x 2,, φ M 1 x 2 t = φ 0 x,, φ M 1 x t 1 t 2 t 33
34 Minimizing Sum-of-Squares Error E D w = w T Φ T Φ t n φ x n T Φ = φ 0 x 1,, φ M 1 x 1 φ 0 x 2,, φ M 1 x 2 t = φ 0 x,, φ M 1 x t 1 t 2 t 34
35 Minimizing Sum-of-Squares Error E D w = w T Φ T Φ t n φ x n T Φ = φ 0 x 1,, φ M 1 x 1 φ 0 x 2,, φ M 1 x 2 t = φ 0 x,, φ M 1 x t 1 t 2 t 35
36 Minimizing Sum-of-Squares Error E D w = w T Φ T Φ t T Φ Φ = φ 0 x 1,, φ M 1 x 1 φ 0 x 2,, φ M 1 x 2 t = φ 0 x,, φ M 1 x t 1 t 2 t 36
37 Minimizing Sum-of-Squares Error E D w = w T Φ T Φ t T Φ E D w = 0 w T Φ T Φ = t T Φ Multiplying both sides by (Φ T Φ) 1 w T = t T Φ (Φ T Φ) 1 37
38 Minimizing Sum-of-Squares Error E D w = w T Φ T Φ t T Φ E D w = 0 w T Φ T Φ = t T Φ Transposing both sides w T = w = t T Φ (Φ T Φ) 1 t T Φ (Φ T Φ) 1 T 38
39 Minimizing Sum-of-Squares Error E D w = w T Φ T Φ t T Φ E D w = 0 w T Φ T Φ = t T Φ Using rule: (AB) T = B T A T w T = w = t T Φ (Φ T Φ) 1 t T Φ (Φ T Φ) 1 T w = (Φ T Φ) 1 T Φ T t 39
40 Minimizing Sum-of-Squares Error E D w = w T Φ T Φ t T Φ E D w = 0 w T Φ T Φ = t T Φ Using rule: (A T ) 1 = (A 1 ) T w T = w = t T Φ (Φ T Φ) 1 t T Φ (Φ T Φ) 1 T w = (Φ T Φ) 1 T Φ T t w = (Φ T Φ) T 1 Φ T t 40
41 Minimizing Sum-of-Squares Error E D w = w T Φ T Φ t T Φ E D w = 0 w T Φ T Φ = t T Φ Using rule: (AB) T = B T A T w T = w = t T Φ (Φ T Φ) 1 t T Φ (Φ T Φ) 1 T w = (Φ T Φ) 1 T Φ T t w = (Φ T Φ) T 1 Φ T t w = (Φ T Φ) 1 Φ T t 41
42 otation: w ML From the previous slides, we got the formula w = (Φ T Φ) 1 Φ T t We denote this value of w as w ML, because it is the maximum likelihood estimate of w. In other words, w ML is the most likely value of w given the data. So, we rewrite the formula as: w ML = (Φ T Φ) 1 Φ T t 42
43 Solving for β Remember the Gaussian oise model: p t x, w, β = t y(x, w, β 1 ) In the above equation, β is the precision of the Gaussian, and is defined as the inverse of the variance: β = 1 σ 2 Given our estimate w ML, we can also estimate the precision β and the variance σ 2 : σ ML 2 = 1 β ML = 1 t n w ML T φ(x n ) 2 43
44 Least Squares Solution - Summary w = w 0 w 1 w M 1 t = t 1 t 2 t Φ = φ 0 x 1,, φ M 1 x 1 φ 0 x 2,, φ M 1 x 2 φ 0 x,, φ M 1 x Given the above notation: σ ML 2 = 1 β ML = 1 w ML = (Φ T Φ) 1 Φ T t t n w ML T φ(x n ) 2 44
45 umerical Issues Black circles: training data. Green curve: true function y(x)=sin(2πx). Red curve: Top figure: Fitted 7-degree polynomial. Bottom figure: Fitted 8-degree polynomial. Do you see anything strange? 45
46 umerical Issues The 8-degree polynomial fits the data worse than the 7- degree polynomial. Is this possible? 46
47 umerical Issues The 8-degree polynomial fits the data worse than the 7- degree polynomial. Mathematically, this is impossible. 8-degree polynomials are a superset of 7-degree polynomials. Thus, the best-fitting 8-degree polynomial cannot fit the data worse than the best-fitting 7- degree polynomial. 47
48 umerical Issues The 8-degree polynomial fits the data worse than the 7- degree polynomial. Mathematically, this is impossible. Cause of the problem: numerical issues. Computing the inverse of Φ T Φ may produce a result that is far from the correct one. 48
49 umerical Issues Work-around in Matlab: inv(phi' * phi) leads to the incorrect 8-degree polynomial fit below. pinv(phi' * phi) leads to the correct 8-degree polynomial fit above (almost identical to the 7-degree result). 49
50 Sequential Learning The w ML estimate was obtained by processing the training data as a batch. We use all the data at once to estimate w ML. However, batch processing can be computationally expensive for large datasets. It involves matrix multiplications of Mx matrices. An alternative is sequential learning. This is also called online learning. In this scenario: We first, somehow, get an initial estimate w (0). Then, we observe training examples, one by one. Every time we observe a new training example, we update the estimate. 50
51 Sequential Learning We first, somehow, get an initial estimate w (0). That can be obtained, for example, by computing w ML using the first few training examples as a batch. Or, we initialize w to a random value. Then, we observe training examples, one by one. When we observe the n th training example, we update the estimate from w (τ) to w (τ+1). Remember that the n th training example contributes to the overall error E D w a term E D,n w defined as: E D,n w = t n w T φ(x n ) 2 51
52 Sequential Learning The n th training example contributes to the overall error E D w a term E D,n w defined as: E D,n w = t n w T φ(x n ) 2 When we observe the n th training example, we update the estimate from w (τ) to w (τ+1) as follows: w (τ+1) = w (τ) η E D,n (w) = w (τ) +η(t n w τ T φ n )φ n η is called the learning rate. It is picked manually. This whole process is called stochastic gradient descent. 52
53 Sequential Learning - Intuition When we observe the n th training example, we update the estimate from w (τ) to w (τ+1) as follows: w (τ+1) = w (τ) η E D,n (w) = w (τ) +η(t n w τ T φ n )φ n What is the intuition behind this update? The gradient E D,n (w) is a vector that points in the direction where E D,n (w) increases. Therefore, subtracting a very small amount of E D,n w from w should make E D,n (w) a little bit smaller. 53
54 Sequential Learning - Intuition When we observe the n th training example, we update the estimate from w (τ) to w (τ+1) as follows: w (τ+1) = w (τ) η E D,n (w) = w (τ) +η(t n w τ T φ n )φ n Choosing a good value for η is important. If η is too small, the minimization may happen too slowly and require too many training examples. If η is large, the update may overfit the most recent training example, and overall w may fluctuate too much from one update to the next. Unfortunately, picking a good η is more of an art than a science, and involves trial-and-error. 54
55 Regularized Least Squares We estimated w ML by minimizing the sum-of-squares error E D w : E D w = 1 2 t n w T φ(x n ) 2 We have seen in Chapter 1 the regularized version of this error: E D w = 1 2 t n w T φ(x n ) 2 + λ 2 wt w 55
56 Solving Regularized Least Squares E D w = 1 2 t n w T φ(x n ) 2 + λ 2 wt w The value of w that minimizes E D w is: w = (λi + Φ T Φ) 1 Φ T t In the above equation, I is the MxM identity matrix. 56
57 Why Use Regularization? E D w = 1 2 t n w T φ(x n ) 2 + λ 2 wt w We use regularization when we know, in advance, that some solutions should be preferred over other solutions. Then, we add to the error function a term that penalizes undesirable solutions. In the linear regression case: people know (from past experience) that high values of weights are indicative of overfitting. So, for each high value w i we add to the error a penalty (w i ) 2. 57
58 Impact of Parameter λ 58
59 Impact of Parameter λ 59
60 Impact of Parameter λ As the previous figures show: Low values of λ lead to polynomials whose values fluctuate more and more rapidly. This can lead to increased overfitting. High values of λ lead to flatter and flatter polynomials, that look more and more like straight lines. This can lead to increased underfitting, or not fitting the data sufficiently. 60
Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington
Linear Classification CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Example of Linear Classification Red points: patterns belonging
More informationThese slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop
Music and Machine Learning (IFT68 Winter 8) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop
More informationLinear Models for Regression. Sargur Srihari
Linear Models for Regression Sargur srihari@cedar.buffalo.edu 1 Topics in Linear Regression What is regression? Polynomial Curve Fitting with Scalar input Linear Basis Function Models Maximum Likelihood
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V
More informationSupport Vector Machines. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington
Support Vector Machines CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 A Linearly Separable Problem Consider the binary classification
More informationLinear Models for Regression
Linear Models for Regression Machine Learning Torsten Möller Möller/Mori 1 Reading Chapter 3 of Pattern Recognition and Machine Learning by Bishop Chapter 3+5+6+7 of The Elements of Statistical Learning
More informationCSCI567 Machine Learning (Fall 2014)
CSCI567 Machine Learning (Fall 24) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu October 2, 24 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 24) October 2, 24 / 24 Outline Review
More informationNeural Networks. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington
Neural Networks CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Perceptrons x 0 = 1 x 1 x 2 z = h w T x Output: z x D A perceptron
More informationLinear Regression (continued)
Linear Regression (continued) Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 1 / 39 Outline 1 Administration 2 Review of last lecture 3 Linear regression
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More informationMachine Learning. 7. Logistic and Linear Regression
Sapienza University of Rome, Italy - Machine Learning (27/28) University of Rome La Sapienza Master in Artificial Intelligence and Robotics Machine Learning 7. Logistic and Linear Regression Luca Iocchi,
More informationCSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression
CSC2515 Winter 2015 Introduction to Machine Learning Lecture 2: Linear regression All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/csc2515_winter15.html
More informationy(x n, w) t n 2. (1)
Network training: Training a neural network involves determining the weight parameter vector w that minimizes a cost function. Given a training set comprising a set of input vector {x n }, n = 1,...N,
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationLogistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu
Logistic Regression Review 10-601 Fall 2012 Recitation September 25, 2012 TA: Selen Uguroglu!1 Outline Decision Theory Logistic regression Goal Loss function Inference Gradient Descent!2 Training Data
More informationLinear Models in Machine Learning
CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,
More informationLinear Regression. CSL603 - Fall 2017 Narayanan C Krishnan
Linear Regression CSL603 - Fall 2017 Narayanan C Krishnan ckn@iitrpr.ac.in Outline Univariate regression Multivariate regression Probabilistic view of regression Loss functions Bias-Variance analysis Regularization
More informationLinear Regression. CSL465/603 - Fall 2016 Narayanan C Krishnan
Linear Regression CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Outline Univariate regression Multivariate regression Probabilistic view of regression Loss functions Bias-Variance analysis
More informationLinear Models for Classification
Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,
More informationBayesian Linear Regression. Sargur Srihari
Bayesian Linear Regression Sargur srihari@cedar.buffalo.edu Topics in Bayesian Regression Recall Max Likelihood Linear Regression Parameter Distribution Predictive Distribution Equivalent Kernel 2 Linear
More informationReading Group on Deep Learning Session 1
Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular
More informationLinear classifiers: Overfitting and regularization
Linear classifiers: Overfitting and regularization Emily Fox University of Washington January 25, 2017 Logistic regression recap 1 . Thus far, we focused on decision boundaries Score(x i ) = w 0 h 0 (x
More informationModeling Data with Linear Combinations of Basis Functions. Read Chapter 3 in the text by Bishop
Modeling Data with Linear Combinations of Basis Functions Read Chapter 3 in the text by Bishop A Type of Supervised Learning Problem We want to model data (x 1, t 1 ),..., (x N, t N ), where x i is a vector
More informationToday. Calculus. Linear Regression. Lagrange Multipliers
Today Calculus Lagrange Multipliers Linear Regression 1 Optimization with constraints What if I want to constrain the parameters of the model. The mean is less than 10 Find the best likelihood, subject
More informationRegression. Machine Learning and Pattern Recognition. Chris Williams. School of Informatics, University of Edinburgh.
Regression Machine Learning and Pattern Recognition Chris Williams School of Informatics, University of Edinburgh September 24 (All of the slides in this course have been adapted from previous versions
More informationLinear Regression. Robot Image Credit: Viktoriya Sukhanova 123RF.com
Linear Regression These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these
More informationComputer Vision Group Prof. Daniel Cremers. 3. Regression
Prof. Daniel Cremers 3. Regression Categories of Learning (Rep.) Learnin g Unsupervise d Learning Clustering, density estimation Supervised Learning learning from a training data set, inference on the
More informationNeural Network Training
Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationSupport Vector Machines
Support Vector Machines Hypothesis Space variable size deterministic continuous parameters Learning Algorithm linear and quadratic programming eager batch SVMs combine three important ideas Apply optimization
More informationIntroduction to Machine Learning
Introduction to Machine Learning Linear Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574 1
More informationClassification Logistic Regression
Announcements: Classification Logistic Regression Machine Learning CSE546 Sham Kakade University of Washington HW due on Friday. Today: Review: sub-gradients,lasso Logistic Regression October 3, 26 Sham
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationCase Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday!
Case Study 1: Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 4, 017 1 Announcements:
More informationECE521 lecture 4: 19 January Optimization, MLE, regularization
ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity
More informationCSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18
CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning
More informationSupport Vector Machines. CSE 4309 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington
Support Vector Machines CSE 4309 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 A Linearly Separable Problem Consider the binary classification
More informationLecture 6. Regression
Lecture 6. Regression Prof. Alan Yuille Summer 2014 Outline 1. Introduction to Regression 2. Binary Regression 3. Linear Regression; Polynomial Regression 4. Non-linear Regression; Multilayer Perceptron
More informationLinear Models for Regression CS534
Linear Models for Regression CS534 Prediction Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict the
More informationPATTERN RECOGNITION AND MACHINE LEARNING
PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality
More informationStochastic Gradient Descent
Stochastic Gradient Descent Machine Learning CSE546 Carlos Guestrin University of Washington October 9, 2013 1 Logistic Regression Logistic function (or Sigmoid): Learn P(Y X) directly Assume a particular
More informationMachine Learning and Data Mining. Linear regression. Kalev Kask
Machine Learning and Data Mining Linear regression Kalev Kask Supervised learning Notation Features x Targets y Predictions ŷ Parameters q Learning algorithm Program ( Learner ) Change q Improve performance
More informationLECTURE NOTE #NEW 6 PROF. ALAN YUILLE
LECTURE NOTE #NEW 6 PROF. ALAN YUILLE 1. Introduction to Regression Now consider learning the conditional distribution p(y x). This is often easier than learning the likelihood function p(x y) and the
More informationClassification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative
More informationIntroduction to Logistic Regression and Support Vector Machine
Introduction to Logistic Regression and Support Vector Machine guest lecturer: Ming-Wei Chang CS 446 Fall, 2009 () / 25 Fall, 2009 / 25 Before we start () 2 / 25 Fall, 2009 2 / 25 Before we start Feel
More informationWays to make neural networks generalize better
Ways to make neural networks generalize better Seminar in Deep Learning University of Tartu 04 / 10 / 2014 Pihel Saatmann Topics Overview of ways to improve generalization Limiting the size of the weights
More informationBias-Variance Tradeoff
What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff
More informationLearning from Data: Regression
November 3, 2005 http://www.anc.ed.ac.uk/ amos/lfd/ Classification or Regression? Classification: want to learn a discrete target variable. Regression: want to learn a continuous target variable. Linear
More informationCSC 411: Lecture 04: Logistic Regression
CSC 411: Lecture 04: Logistic Regression Raquel Urtasun & Rich Zemel University of Toronto Sep 23, 2015 Urtasun & Zemel (UofT) CSC 411: 04-Prob Classif Sep 23, 2015 1 / 16 Today Key Concepts: Logistic
More informationGradient Descent. Sargur Srihari
Gradient Descent Sargur srihari@cedar.buffalo.edu 1 Topics Simple Gradient Descent/Ascent Difficulties with Simple Gradient Descent Line Search Brent s Method Conjugate Gradient Descent Weight vectors
More informationError Functions & Linear Regression (1)
Error Functions & Linear Regression (1) John Kelleher & Brian Mac Namee Machine Learning @ DIT Overview 1 Introduction Overview 2 Univariate Linear Regression Linear Regression Analytical Solution Gradient
More informationLogistic Regression Logistic
Case Study 1: Estimating Click Probabilities L2 Regularization for Logistic Regression Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 10 th,
More informationMark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.
University of Cambridge Engineering Part IIB & EIST Part II Paper I0: Advanced Pattern Processing Handouts 4 & 5: Multi-Layer Perceptron: Introduction and Training x y (x) Inputs x 2 y (x) 2 Outputs x
More informationLogistic Regression & Neural Networks
Logistic Regression & Neural Networks CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Graham Neubig, Jacob Eisenstein Logistic Regression Perceptron & Probabilities What if we want a probability
More informationMachine Learning Lecture 7
Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant
More informationMachine Learning Basics III
Machine Learning Basics III Benjamin Roth CIS LMU München Benjamin Roth (CIS LMU München) Machine Learning Basics III 1 / 62 Outline 1 Classification Logistic Regression 2 Gradient Based Optimization Gradient
More informationRidge Regression 1. to which some random noise is added. So that the training labels can be represented as:
CS 1: Machine Learning Spring 15 College of Computer and Information Science Northeastern University Lecture 3 February, 3 Instructor: Bilal Ahmed Scribe: Bilal Ahmed & Virgil Pavlu 1 Introduction Ridge
More informationMachine Learning Basics
Security and Fairness of Deep Learning Machine Learning Basics Anupam Datta CMU Spring 2019 Image Classification Image Classification Image classification pipeline Input: A training set of N images, each
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More informationCSC321 Lecture 2: Linear Regression
CSC32 Lecture 2: Linear Regression Roger Grosse Roger Grosse CSC32 Lecture 2: Linear Regression / 26 Overview First learning algorithm of the course: linear regression Task: predict scalar-valued targets,
More information18.6 Regression and Classification with Linear Models
18.6 Regression and Classification with Linear Models 352 The hypothesis space of linear functions of continuous-valued inputs has been used for hundreds of years A univariate linear function (a straight
More informationLinear Regression (9/11/13)
STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter
More informationMidterm: CS 6375 Spring 2015 Solutions
Midterm: CS 6375 Spring 2015 Solutions The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for an
More informationArtificial Neural Networks
Artificial Neural Networks Stephan Dreiseitl University of Applied Sciences Upper Austria at Hagenberg Harvard-MIT Division of Health Sciences and Technology HST.951J: Medical Decision Support Knowledge
More informationLogistic Regression. Will Monroe CS 109. Lecture Notes #22 August 14, 2017
1 Will Monroe CS 109 Logistic Regression Lecture Notes #22 August 14, 2017 Based on a chapter by Chris Piech Logistic regression is a classification algorithm1 that works by trying to learn a function
More informationCMU-Q Lecture 24:
CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input
More informationMachine Learning Lecture 5
Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory
More informationMidterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 7301: Advanced Machine Learning Vibhav Gogate The University of Texas at Dallas Supervised Learning Issues in supervised learning What makes learning hard Point Estimation: MLE vs Bayesian
More informationLogistic Regression. COMP 527 Danushka Bollegala
Logistic Regression COMP 527 Danushka Bollegala Binary Classification Given an instance x we must classify it to either positive (1) or negative (0) class We can use {1,-1} instead of {1,0} but we will
More informationLast updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
Last updated: Oct 22, 2012 LINEAR CLASSIFIERS Problems 2 Please do Problem 8.3 in the textbook. We will discuss this in class. Classification: Problem Statement 3 In regression, we are modeling the relationship
More informationMachine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari
Machine Learning Basics: Stochastic Gradient Descent Sargur N. srihari@cedar.buffalo.edu 1 Topics 1. Learning Algorithms 2. Capacity, Overfitting and Underfitting 3. Hyperparameters and Validation Sets
More informationMachine Learning Practice Page 2 of 2 10/28/13
Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes
More informationECE521 Lectures 9 Fully Connected Neural Networks
ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance
More informationDeep Feedforward Networks
Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3
More informationMidterm. Introduction to Machine Learning. CS 189 Spring You have 1 hour 20 minutes for the exam.
CS 189 Spring 2013 Introduction to Machine Learning Midterm You have 1 hour 20 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. Please use non-programmable calculators
More informationIntroduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur
Introduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Module 2 Lecture 05 Linear Regression Good morning, welcome
More informationGenerative v. Discriminative classifiers Intuition
Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 24 th, 2007 1 Generative v. Discriminative classifiers Intuition Want to Learn: h:x a Y X features
More informationGenerative v. Discriminative classifiers Intuition
Logistic Regression Machine Learning 070/578 Carlos Guestrin Carnegie Mellon University September 24 th, 2007 Generative v. Discriminative classifiers Intuition Want to Learn: h:x a Y X features Y target
More informationSupport Vector Machines
Support Vector Machines INFO-4604, Applied Machine Learning University of Colorado Boulder September 28, 2017 Prof. Michael Paul Today Two important concepts: Margins Kernels Large Margin Classification
More informationCPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017
CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric
More informationLecture 7. Logistic Regression. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. December 11, 2016
Lecture 7 Logistic Regression Luigi Freda ALCOR Lab DIAG University of Rome La Sapienza December 11, 2016 Luigi Freda ( La Sapienza University) Lecture 7 December 11, 2016 1 / 39 Outline 1 Intro Logistic
More informationHidden Markov Models Part 1: Introduction
Hidden Markov Models Part 1: Introduction CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Modeling Sequential Data Suppose that
More informationReminders. Thought questions should be submitted on eclass. Please list the section related to the thought question
Linear regression Reminders Thought questions should be submitted on eclass Please list the section related to the thought question If it is a more general, open-ended question not exactly related to a
More informationCENG 793. On Machine Learning and Optimization. Sinan Kalkan
CENG 793 On Machine Learning and Optimization Sinan Kalkan 2 Now Introduction to ML Problem definition Classes of approaches K-NN Support Vector Machines Softmax classification / logistic regression Parzen
More informationVasil Khalidov & Miles Hansard. C.M. Bishop s PRML: Chapter 5; Neural Networks
C.M. Bishop s PRML: Chapter 5; Neural Networks Introduction The aim is, as before, to find useful decompositions of the target variable; t(x) = y(x, w) + ɛ(x) (3.7) t(x n ) and x n are the observations,
More informationDATA MINING AND MACHINE LEARNING. Lecture 4: Linear models for regression and classification Lecturer: Simone Scardapane
DATA MINING AND MACHINE LEARNING Lecture 4: Linear models for regression and classification Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Linear models for regression Regularized
More informationSTA414/2104. Lecture 11: Gaussian Processes. Department of Statistics
STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations
More informationMulticlass Logistic Regression
Multiclass Logistic Regression Sargur. Srihari University at Buffalo, State University of ew York USA Machine Learning Srihari Topics in Linear Classification using Probabilistic Discriminative Models
More informationRegression with Numerical Optimization. Logistic
CSG220 Machine Learning Fall 2008 Regression with Numerical Optimization. Logistic regression Regression with Numerical Optimization. Logistic regression based on a document by Andrew Ng October 3, 204
More informationCSC321 Lecture 9: Generalization
CSC321 Lecture 9: Generalization Roger Grosse Roger Grosse CSC321 Lecture 9: Generalization 1 / 26 Overview We ve focused so far on how to optimize neural nets how to get them to make good predictions
More informationClassifier Complexity and Support Vector Classifiers
Classifier Complexity and Support Vector Classifiers Feature 2 6 4 2 0 2 4 6 8 RBF kernel 10 10 8 6 4 2 0 2 4 6 Feature 1 David M.J. Tax Pattern Recognition Laboratory Delft University of Technology D.M.J.Tax@tudelft.nl
More informationGradient Descent Training Rule: The Details
Gradient Descent Training Rule: The Details 1 For Perceptrons The whole idea behind gradient descent is to gradually, but consistently, decrease the output error by adjusting the weights. The trick is
More informationAd Placement Strategies
Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad
More informationMachine Learning - MT & 5. Basis Expansion, Regularization, Validation
Machine Learning - MT 2016 4 & 5. Basis Expansion, Regularization, Validation Varun Kanade University of Oxford October 19 & 24, 2016 Outline Basis function expansion to capture non-linear relationships
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationSupport Vector Machines. Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar
Data Mining Support Vector Machines Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar 02/03/2018 Introduction to Data Mining 1 Support Vector Machines Find a linear hyperplane
More informationLinear Models for Regression CS534
Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict
More information