Machine Learning. A. Supervised Learning A.1. Linear Regression. Lars Schmidt-Thieme
Machine Learning
A. Supervised Learning
A.1. Linear Regression
Lars Schmidt-Thieme
Information Systems and Machine Learning Lab (ISMLL), Institute for Computer Science, University of Hildesheim, Germany
Outline
1. Linear Regression via Normal Equations
2. Minimizing a Function via Gradient Descent
3. Learning Linear Regression Models via Gradient Descent
4. Case Weights
Section 1. Linear Regression via Normal Equations
The Simple Linear Regression Problem
Given a set $D^{\text{train}} := \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\} \subseteq \mathbb{R} \times \mathbb{R}$ called training data, compute the parameters $(\hat\beta_0, \hat\beta_1)$ of a linear regression function
$$\hat y(x) := \hat\beta_0 + \hat\beta_1 x$$
s.t. for a set $D^{\text{test}} \subseteq \mathbb{R} \times \mathbb{R}$ called test set the test error
$$\operatorname{err}(\hat y; D^{\text{test}}) := \frac{1}{|D^{\text{test}}|} \sum_{(x,y) \in D^{\text{test}}} (y - \hat y(x))^2$$
is minimal.
Note: $D^{\text{test}}$ has (i) to be from the same data generating process and (ii) not to be available during training.
The (Multiple) Linear Regression Problem
Given a set $D^{\text{train}} := \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\} \subseteq \mathbb{R}^M \times \mathbb{R}$ called training data, compute the parameters $(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_M)$ of a linear regression function
$$\hat y(x) := \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_M x_M$$
s.t. for a set $D^{\text{test}} \subseteq \mathbb{R}^M \times \mathbb{R}$ called test set the test error
$$\operatorname{err}(\hat y; D^{\text{test}}) := \frac{1}{|D^{\text{test}}|} \sum_{(x,y) \in D^{\text{test}}} (y - \hat y(x))^2$$
is minimal.
Note: $D^{\text{test}}$ has (i) to be from the same data generating process and (ii) not to be available during training.
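The prediction function and the test error defined above can be sketched in a few lines of NumPy. The function names `predict` and `err`, the parameter layout (intercept first), and the toy numbers are assumptions made for illustration only.

```python
import numpy as np

def predict(beta, X):
    """y_hat(x) = beta_0 + beta_1 x_1 + ... + beta_M x_M for each row x of X."""
    return beta[0] + X @ beta[1:]

def err(beta, X_test, y_test):
    """Mean squared error on a held-out test set."""
    residuals = y_test - predict(beta, X_test)
    return float(np.mean(residuals ** 2))

# Hypothetical toy data generated exactly by y = 1 + 2*x_1 - x_2
beta = np.array([1.0, 2.0, -1.0])
X_test = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]])
y_test = np.array([1.0, 1.0, 6.0])

print(err(beta, X_test, y_test))  # 0.0, the model reproduces the data exactly
```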
Several predictors
Several predictor variables $x_{.,1}, x_{.,2}, \ldots, x_{.,M}$:
$$\hat y = \hat\beta_0 + \hat\beta_1 x_{.,1} + \hat\beta_2 x_{.,2} + \cdots + \hat\beta_M x_{.,M} = \hat\beta_0 + \sum_{m=1}^{M} \hat\beta_m x_{.,m}$$
with $M + 1$ parameters $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_M$.
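Linear form
Several predictor variables $x_{.,1}, x_{.,2}, \ldots, x_{.,M}$:
$$\hat y = \hat\beta_0 + \sum_{m=1}^{M} \hat\beta_m x_{.,m} = \langle \hat\beta, x_. \rangle$$
where
$$\hat\beta := \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \vdots \\ \hat\beta_M \end{pmatrix}, \quad x_. := \begin{pmatrix} 1 \\ x_{.,1} \\ \vdots \\ x_{.,M} \end{pmatrix}.$$
Thus, the intercept is handled like any other parameter, for the artificial constant predictor $x_{.,0} \equiv 1$.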
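Simultaneous equations for the whole dataset
For the whole dataset $D^{\text{train}} := \{(x_1, y_1), \ldots, (x_N, y_N)\}$:
$$y \approx \hat y := X \hat\beta$$
where
$$y := \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}, \quad \hat y := \begin{pmatrix} \hat y_1 \\ \vdots \\ \hat y_N \end{pmatrix}, \quad X := \begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix} = \begin{pmatrix} x_{1,1} & x_{1,2} & \ldots & x_{1,M} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N,1} & x_{N,2} & \ldots & x_{N,M} \end{pmatrix}$$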
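Least squares estimates
Least squares estimates $\hat\beta$ minimize
$$\operatorname{RSS}(\hat\beta, D^{\text{train}}) := \sum_{n=1}^{N} (y_n - \hat y_n)^2 = \|y - \hat y\|^2 = \|y - X\hat\beta\|^2$$
The least squares estimates $\hat\beta$ are computed via the normal equations
$$X^T X \hat\beta = X^T y$$
Proof:
$$\|y - X\hat\beta\|^2 = \langle y - X\hat\beta, y - X\hat\beta \rangle$$
$$\frac{\partial (\ldots)}{\partial \hat\beta} = -2 \langle X, y - X\hat\beta \rangle = -2 (X^T y - X^T X \hat\beta) \overset{!}{=} 0$$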
How to compute least squares estimates $\hat\beta$
Solve the $M \times M$ system of linear equations
$$X^T X \hat\beta = X^T y$$
i.e., $Ax = b$ (with $A := X^T X$, $b = X^T y$, $x = \hat\beta$).
There are several numerical methods available:
1. Gaussian elimination
2. Cholesky decomposition
3. QR decomposition
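As a rough illustration of the three methods listed above, the following NumPy sketch solves the same small least squares problem in three ways. The data is synthetic and noise-free (an assumption for the example), and the generic `np.linalg.solve` is used for the triangular systems for brevity, although a dedicated triangular solver would be cheaper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # hypothetical data: N = 50, M = 3
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true                     # noise-free targets

A = X.T @ X
b = X.T @ y

# 1. Gaussian elimination (np.linalg.solve uses an LU factorization with pivoting)
beta_ge = np.linalg.solve(A, b)

# 2. Cholesky decomposition: A = L L^T, then solve L z = b and L^T beta = z
L = np.linalg.cholesky(A)
z = np.linalg.solve(L, b)
beta_chol = np.linalg.solve(L.T, z)

# 3. QR decomposition of X itself: X = Q R, so R beta = Q^T y
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

print(beta_ge, beta_chol, beta_qr)
```

Note that the QR variant works on $X$ directly and never forms $X^T X$, which tends to be numerically more robust when $X$ is ill-conditioned.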
Learn Linear Regression via Normal Equations
procedure learn-linreg-normeq($D^{\text{train}} := \{(x_1, y_1), \ldots, (x_N, y_N)\}$)
    $X := (x_1, x_2, \ldots, x_N)^T$
    $y := (y_1, y_2, \ldots, y_N)^T$
    $A := X^T X$
    $b := X^T y$
    $\hat\beta :=$ solve-sle($A$, $b$)
    return $\hat\beta$
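A minimal NumPy sketch of the procedure above, assuming `np.linalg.solve` plays the role of solve-sle. The helper `add_intercept` (which prepends the constant predictor 1) and the toy data are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def add_intercept(X):
    """Prepend the artificial constant predictor 1 as first column (assumed helper)."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def learn_linreg_normeq(X, y):
    """Solve the normal equations X^T X beta = X^T y."""
    A = X.T @ X
    b = X.T @ y
    return np.linalg.solve(A, b)   # stands in for solve-sle(A, b)

# Hypothetical toy data generated by y = 1 + 2*x_1 - x_2
X_raw = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]])
y = 1.0 + 2.0 * X_raw[:, 0] - 1.0 * X_raw[:, 1]

beta_hat = learn_linreg_normeq(add_intercept(X_raw), y)
print(beta_hat)  # approximately [1, 2, -1]
```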
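Example
Given is the following data: [Table with columns $x_1$, $x_2$, $y$; values not recoverable.]
Predict a $y$ value for $x_1 = 3$, $x_2 = 4$.
Example / Simple Regression Models for Comparison
Two simple models are fitted for comparison, $\hat y = \hat\beta_0 + \hat\beta_1 x_1$ and $\hat y = \hat\beta_0 + \hat\beta_2 x_2$ [fitted coefficients not recoverable].
The first model predicts $\hat y(x_1 = 3) = 3.25$ [prediction $\hat y(x_2 = 4)$ of the second model not recoverable].
[Figures: data and fitted model, $y$ vs. $x_1$ and $y$ vs. $x_2$.]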
Example
Now fit $\hat y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2$ to the data via the normal equations $X^T X \hat\beta = X^T y$. [Entries of $X$, $y$, $X^T X$, $X^T y$ and of the resulting $\hat\beta$ not recoverable.]
[Figure: fitted model over $x_1$ and $x_2$ together with the data points $y$.]
Example / Visualization of Model Fit
To visually assess the model fit, one can plot the residuals $\hat\epsilon := y - \hat y$ vs. the true values $y$.
[Figure: residuals $y - \hat y$ plotted against $y$.]
Section 2. Minimizing a Function via Gradient Descent
Gradient Descent
Given a function $f : \mathbb{R}^N \to \mathbb{R}$, find $x$ with minimal $f(x)$.
Idea: start from a random $x_0$ and then improve step by step, i.e., choose $x_{i+1}$ with
$$f(x_{i+1}) \le f(x_i)$$
Choose the negative gradient $-\frac{\partial f}{\partial x}(x_i)$ as direction for descent, i.e.,
$$x_{i+1} - x_i = -\alpha_i \cdot \frac{\partial f}{\partial x}(x_i)$$
with a suitable step length $\alpha_i > 0$.
[Figure: $f(x)$ plotted against $x$.]
Gradient Descent
procedure minimize-gd-fsl($f : \mathbb{R}^N \to \mathbb{R}$, $x_0 \in \mathbb{R}^N$, $\alpha \in \mathbb{R}$, $i_{\max} \in \mathbb{N}$, $\epsilon \in \mathbb{R}^+$)
    for $i = 1, \ldots, i_{\max}$ do
        $x_i := x_{i-1} - \alpha \cdot \frac{\partial f}{\partial x}(x_{i-1})$
        if $f(x_{i-1}) - f(x_i) < \epsilon$ then
            return $x_i$
    error "not converged in $i_{\max}$ iterations"
$x_0$: start value
$\alpha$: (fixed) step length / learning rate
$i_{\max}$: maximal number of iterations
$\epsilon$: minimum stepwise improvement
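Example
Example: $f(x) := x^2$, $\frac{\partial f}{\partial x}(x) = 2x$, $x_0 := 2$, $\alpha_i :\equiv 0.25$.
Then we compute iteratively, using $x_{i+1} = x_i - \alpha_i \cdot \frac{\partial f}{\partial x}(x_i)$:
i    x_i     ∂f/∂x(x_i)    x_{i+1}
0    2       4             1
1    1       2             0.5
2    0.5     1             0.25
3    0.25    0.5           0.125
[Figure: $f(x)$ against $x$ with the iterates $x_0, x_1, x_2, x_3$ marked.]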
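The fixed-step-length procedure and the example $f(x) = x^2$ can be reproduced with a short Python sketch. Passing the gradient as a separate argument `grad_f` is an assumption of this sketch; the pseudocode only names $f$.

```python
def minimize_gd_fsl(f, grad_f, x0, alpha, i_max, eps):
    """Gradient descent with a fixed step length alpha."""
    x = x0
    for _ in range(i_max):
        x_new = x - alpha * grad_f(x)
        if f(x) - f(x_new) < eps:      # improvement below threshold: stop
            return x_new
        x = x_new
    raise RuntimeError(f"not converged in {i_max} iterations")

f = lambda x: x ** 2
grad_f = lambda x: 2 * x

# First three steps from x_0 = 2 with alpha = 0.25
x = 2.0
iterates = [x]
for _ in range(3):
    x = x - 0.25 * grad_f(x)
    iterates.append(x)
print(iterates)  # [2.0, 1.0, 0.5, 0.25]

x_min = minimize_gd_fsl(f, grad_f, x0=2.0, alpha=0.25, i_max=100, eps=1e-12)
print(x_min)     # close to the minimizer 0
```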
Step Length
Why do we need a step length? Can we set $\alpha_n \equiv 1$?
The negative gradient gives a direction of descent only in an infinitesimal neighborhood of $x_n$.
Thus, the step length may be too large, and the function value of the next point does not decrease.
[Figure: $f(x)$ against $x$ with $x_0$ and $x_1$ marked.]
Step Length
There are many different strategies to adapt the step length s.t.
1. the function value actually decreases and
2. the step length does not become too small (and thus convergence slow).
Armijo principle:
$$\alpha_n := \max\left\{\alpha \in \{2^{-j} \mid j \in \mathbb{N}_0\} \;\middle|\; f\left(x_n - \alpha \tfrac{\partial f}{\partial x}(x_n)\right) \le f(x_n) - \alpha \delta \left\langle \tfrac{\partial f}{\partial x}(x_n), \tfrac{\partial f}{\partial x}(x_n) \right\rangle \right\}$$
with $\delta \in (0, 1)$.
Armijo Step Length
procedure steplength-armijo($f : \mathbb{R}^N \to \mathbb{R}$, $x \in \mathbb{R}^N$, $d \in \mathbb{R}^N$, $\delta \in (0, 1)$)
    $\alpha := 1$
    while $f(x) - f(x + \alpha d) < \alpha \delta d^T d$ do
        $\alpha := \alpha / 2$
    return $\alpha$
$x$: last position
$d$: descent direction
$\delta$: minimum steepness ($\delta \approx 0$: any step will do)
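The Armijo step length translates almost line by line into Python. The sketch below assumes `x` and `d` are NumPy vectors and that `d` is the negative gradient; the test function $f(x) = \|x\|^2$ is chosen only for illustration.

```python
import numpy as np

def steplength_armijo(f, x, d, delta):
    """Halve alpha until the decrease in f is at least alpha * delta * d^T d."""
    alpha = 1.0
    while f(x) - f(x + alpha * d) < alpha * delta * (d @ d):
        alpha /= 2
    return alpha

f = lambda x: float(x @ x)   # f(x) = ||x||^2 with gradient 2x
x = np.array([2.0])
d = -2 * x                   # negative gradient at x

alpha = steplength_armijo(f, x, d, delta=0.5)
print(alpha)  # 0.5: the full step alpha = 1 overshoots to x = -2 without any decrease
```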
Gradient Descent
procedure minimize-gd($f : \mathbb{R}^N \to \mathbb{R}$, $x_0 \in \mathbb{R}^N$, $\alpha$, $i_{\max} \in \mathbb{N}$, $\epsilon \in \mathbb{R}^+$)
    for $i = 1, \ldots, i_{\max}$ do
        $d := -\frac{\partial f}{\partial x}(x_{i-1})$
        $\alpha_i := \alpha(f, x_{i-1}, d)$
        $x_i := x_{i-1} + \alpha_i \cdot d$
        if $f(x_{i-1}) - f(x_i) < \epsilon$ then
            return $x_i$
    error "not converged in $i_{\max}$ iterations"
$x_0$: start value
$\alpha$: step length function, e.g., steplength-armijo (with fixed $\delta$)
$i_{\max}$: maximal number of iterations
$\epsilon$: minimum stepwise improvement
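A possible Python rendering of gradient descent with a pluggable step length function is sketched below. The separate `grad_f` argument, the default `delta=0.1`, and the two-dimensional quadratic test function are assumptions made for the example.

```python
import numpy as np

def steplength_armijo(f, x, d, delta=0.1):
    """Armijo backtracking with a fixed delta (assumed default 0.1)."""
    alpha = 1.0
    while f(x) - f(x + alpha * d) < alpha * delta * (d @ d):
        alpha /= 2
    return alpha

def minimize_gd(f, grad_f, x0, steplength, i_max, eps):
    """Gradient descent where steplength(f, x, d) chooses alpha in every iteration."""
    x = np.asarray(x0, dtype=float)
    for _ in range(i_max):
        d = -grad_f(x)                    # descent direction
        alpha = steplength(f, x, d)
        x_new = x + alpha * d
        if f(x) - f(x_new) < eps:
            return x_new
        x = x_new
    raise RuntimeError(f"not converged in {i_max} iterations")

# Hypothetical test function with minimum at (1, -2)
f = lambda x: (x[0] - 1) ** 2 + 10 * (x[1] + 2) ** 2
grad_f = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])

x_min = minimize_gd(f, grad_f, [0.0, 0.0], steplength_armijo, i_max=1000, eps=1e-12)
print(x_min)  # approximately [1, -2]
```

A constant step length fits the same interface, e.g. `lambda f, x, d: 0.05`.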
Bold Driver Step Length [Bat89]
A variant of the Armijo principle with memory:
procedure steplength-bolddriver($f : \mathbb{R}^N \to \mathbb{R}$, $x \in \mathbb{R}^N$, $d \in \mathbb{R}^N$, $\alpha^{\text{old}} > 0$, $\alpha^+ > 1$, $\alpha^- \in (0, 1)$)
    $\alpha := \alpha^{\text{old}} \alpha^+$
    while $f(x) - f(x + \alpha d) \le 0$ do
        $\alpha := \alpha \alpha^-$
    return $\alpha$
$\alpha^{\text{old}}$: last step length
$\alpha^+$: step length increase factor, e.g., 1.1
$\alpha^-$: step length decrease factor
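The bold driver heuristic could look as follows in Python. The decrease factor `alpha_minus=0.5` and the safeguard `min_alpha` (which prevents an endless loop when no decrease is possible, e.g. for `d = 0`) are assumptions of this sketch and do not come from the slides.

```python
import numpy as np

def steplength_bolddriver(f, x, d, alpha_old, alpha_plus=1.1, alpha_minus=0.5,
                          min_alpha=1e-12):
    """Try a slightly larger step than last time; shrink it until f decreases."""
    alpha = alpha_old * alpha_plus
    # min_alpha is an added safeguard, not part of the original procedure
    while f(x) - f(x + alpha * d) <= 0 and alpha > min_alpha:
        alpha *= alpha_minus
    return alpha

f = lambda x: float(x @ x)       # f(x) = ||x||^2 with gradient 2x

# Single call: 1.1 overshoots (f grows from 4 to 5.76), 0.55 decreases f
first_alpha = steplength_bolddriver(f, np.array([2.0]), np.array([-4.0]), alpha_old=1.0)
print(first_alpha)               # about 0.55

# Inside a descent loop, the last step length is carried over as memory
x, alpha = np.array([2.0]), 1.0
for _ in range(20):
    d = -2 * x
    alpha = steplength_bolddriver(f, x, d, alpha)
    x = x + alpha * d
print(x)                         # close to the minimizer 0
```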
Simple Step Length Control in Machine Learning
- The Armijo and Bold Driver step lengths evaluate the objective function (including the loss) several times, and thus are often too costly and not used.
  - But they are useful for debugging, as they guarantee a decrease in $f$.
- Constant step lengths $\alpha \in (0, 1)$ are frequently used.
  - If chosen (too) small, the learning algorithm becomes slow, but usually still converges.
  - The step length becomes a hyperparameter that has to be searched.
- Regimes of shrinking step lengths are used: $\alpha_i := \alpha_{i-1} \gamma$, with $\gamma \in (0, 1)$ not too far from 1.
  - If the initial step length $\alpha_0$ is too large, later iterations will fix it.
  - If $\gamma$ is too small, GD may get stuck before convergence.
How Many Minima can a Function have?
In general, a function $f$ can have several different minima.
GD will find a random one (with small step lengths, usually one close to the starting point; GD is a local optimization method).
Convexity
A function $f : \mathbb{R}^N \to \mathbb{R}$ is called convex if
$$f(t x_1 + (1 - t) x_2) \le t f(x_1) + (1 - t) f(x_2), \quad \forall x_1, x_2 \in \mathbb{R}^N, \; t \in [0, 1]$$
A twice differentiable function is convex if its Hessian is positive semidefinite, i.e.,
$$x^T \left( \frac{\partial^2 f}{\partial x_i \partial x_j} \right)_{i=1,\ldots,N,\; j=1,\ldots,N} x \ge 0 \quad \forall x \in \mathbb{R}^N$$
For any matrix $A \in \mathbb{R}^{N \times M}$, the matrix $A^T A$ is positive semidefinite.
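The last statement follows in one line, and it is what makes least squares a convex problem. The short derivation below uses the residual sum of squares $f(\hat\beta) = \|y - X\hat\beta\|^2$ as objective; it is a sketch of the standard argument rather than a quotation from the slides.

```latex
% A^T A is positive semidefinite: for any x in R^M,
x^T A^T A x = (A x)^T (A x) = \| A x \|^2 \ge 0 .

% The Hessian of f(\hat\beta) = \| y - X \hat\beta \|^2 has this form:
\frac{\partial f}{\partial \hat\beta} = -2 X^T (y - X \hat\beta),
\qquad
\frac{\partial^2 f}{\partial \hat\beta \, \partial \hat\beta^T} = 2 X^T X .

% Hence f is convex, and every local minimum found by GD is a global minimum.
```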
Section 3. Learning Linear Regression Models via Gradient Descent
Sparse Predictors
Many problems have predictors $x \in \mathbb{R}^M$ that are
- high-dimensional: $M$ is large, and
- sparse: most $x_m$ are zero.
For example, text regression:
- task: predict the rating of a customer review.
- predictors: a text about a product, i.e., a sequence of words.
  - can be represented as a vector via bag of words: $x_m$ encodes the frequency of word $m$ in a given text.
  - dimensions: 30,000-60,000 for English texts.
  - in short texts such as reviews with a couple of hundred words, at most a couple of hundred dimensions are non-zero.
- target: the customer's rating of the product, a (real) number.
Sparse Predictors, Dense Normal Equations
Recall, the normal equations
$$X^T X \hat\beta = X^T y$$
have dimensions $M \times M$.
Even if $X$ is sparse, generally $X^T X$ will be rather dense:
$$(X^T X)_{m,l} = X_{.,m}^T X_{.,l}$$
For text regression, $(X^T X)_{m,l}$ will be non-zero for every pair of words $m, l$ that co-occur in any text.
Even worse, even if $A := X^T X$ is sparse, standard methods to solve linear systems (such as Gaussian elimination, LU (LR) decomposition etc.) do not take advantage of it.
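The fill-in effect can be observed on synthetic bag-of-words data. The sizes below (2000 texts, 1000 words, 20 distinct words per text) are arbitrary assumptions, and plain dense NumPy arrays are used only to count non-zeros; a real implementation would store $X$ in a sparse matrix format.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, words_per_text = 2000, 1000, 20    # hypothetical corpus sizes

# Binary bag-of-words matrix: each text contains 20 distinct random words
X = np.zeros((N, M))
for n in range(N):
    idx = rng.choice(M, size=words_per_text, replace=False)
    X[n, idx] = 1.0

A = X.T @ X                              # entry (m, l) counts texts containing both words

density_X = np.count_nonzero(X) / X.size
density_A = np.count_nonzero(A) / A.size
print(density_X)   # 0.02 by construction
print(density_A)   # much larger: most word pairs co-occur in at least one text
```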
Learn Linear Regression via Loss Minimization
As an alternative to learning a linear regression model by solving the linear normal equation system, one can minimize the loss directly:
$$f(\hat\beta) := \hat\beta^T X^T X \hat\beta - 2 y^T X \hat\beta + y^T y = (y - X\hat\beta)^T (y - X\hat\beta)$$
$$\frac{\partial f}{\partial \hat\beta}(\hat\beta) = -2 (X^T y - X^T X \hat\beta) = -2 X^T (y - X\hat\beta)$$
When computing $f$ and $\frac{\partial f}{\partial \hat\beta}$,
- avoid computing the (dense) $X^T X$.
- always compute the (sparse) $X$ times a (dense) vector.
Objective Function and Gradient as Sums over Instances
$$f(\hat\beta) := (y - X\hat\beta)^T (y - X\hat\beta) = \sum_{n=1}^{N} (y_n - x_n^T \hat\beta)^2 = \sum_{n=1}^{N} \epsilon_n^2$$
$$\frac{\partial f}{\partial \hat\beta}(\hat\beta) = -2 X^T (y - X\hat\beta) = -2 \sum_{n=1}^{N} (y_n - x_n^T \hat\beta) x_n = -2 \sum_{n=1}^{N} \epsilon_n x_n$$
with $\epsilon_n := y_n - x_n^T \hat\beta$.
Learn Linear Regression via Loss Minimization: GD
procedure learn-linreg-GD($D^{\text{train}} := \{(x_1, y_1), \ldots, (x_N, y_N)\}$, $\alpha$, $i_{\max} \in \mathbb{N}$, $\epsilon \in \mathbb{R}^+$)
    $X := (x_1, x_2, \ldots, x_N)^T$
    $y := (y_1, y_2, \ldots, y_N)^T$
    $\hat\beta_0 := (0, \ldots, 0)$
    $\hat\beta :=$ minimize-gd($f(\hat\beta) := (y - X\hat\beta)^T (y - X\hat\beta)$, $\hat\beta_0$, $\alpha$, $i_{\max}$, $\epsilon$)
    return $\hat\beta$
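A compact NumPy sketch of this procedure is given below. For brevity it inlines gradient descent with a constant step length (an assumption; the pseudocode allows any step length function) and computes the gradient as $-2 X^T (y - X\hat\beta)$, so only matrix-vector products with $X$ are needed. The synthetic data and the value `alpha=0.001` are illustrative choices.

```python
import numpy as np

def learn_linreg_gd(X, y, alpha, i_max, eps):
    """Minimize f(beta) = ||y - X beta||^2 by GD without ever forming X^T X."""
    beta = np.zeros(X.shape[1])
    r = y - X @ beta                  # residuals epsilon_n
    f_old = r @ r
    for _ in range(i_max):
        grad = -2 * (X.T @ r)         # matrix-vector product only
        beta = beta - alpha * grad
        r = y - X @ beta
        f_new = r @ r
        if f_old - f_new < eps:
            return beta
        f_old = f_new
    raise RuntimeError(f"not converged in {i_max} iterations")

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))         # hypothetical data
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true                     # noise-free targets

beta_hat = learn_linreg_gd(X, y, alpha=0.001, i_max=10000, eps=1e-12)
print(beta_hat)  # approximately [1, -2, 0.5]
```

With a constant step length, `alpha` has to be small relative to the largest eigenvalue of $X^T X$; otherwise the iterates diverge and the stopping test fires for the wrong reason.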
Section 4. Case Weights
Cases of Different Importance
Sometimes different cases are of different importance, e.g., if their measurements are of different accuracy or reliability.
Example: assume the leftmost point is known to be measured with lower reliability.
Thus, the model does not need to fit this point as well as it needs to fit the other points.
I.e., the residuals of this point should get a lower weight than the others.
[Figure: data and fitted model, $y$ vs. $x$.]
Case Weights
In such situations, each case $(x_n, y_n)$ is assigned a case weight $w_n \ge 0$:
- the higher the weight, the more important the case.
- cases with weight 0 should be treated as if they have been discarded from the data set.
Case weights can be managed as an additional pseudo-variable $w$ in implementations.
Weighted Least Squares Estimates
Formally, one tries to minimize the weighted residual sum of squares
$$\sum_{n=1}^{N} w_n (y_n - \hat y_n)^2 = \|W^{\frac{1}{2}} (y - \hat y)\|^2$$
with
$$W := \begin{pmatrix} w_1 & & & 0 \\ & w_2 & & \\ & & \ddots & \\ 0 & & & w_N \end{pmatrix}$$
The same argument as for the unweighted case results in the weighted least squares estimates
$$X^T W X \hat\beta = X^T W y$$
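The weighted normal equations can be solved without materializing the $N \times N$ matrix $W$, by scaling the rows of $X$ with the weights. The function name `learn_linreg_weighted` and the toy data (first point an outlier) are assumptions made for this sketch.

```python
import numpy as np

def learn_linreg_weighted(X, y, w):
    """Solve X^T W X beta = X^T W y with W = diag(w), without forming W."""
    Xw = X * w[:, None]                   # row n of X scaled by w_n, i.e. W X
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)

# Hypothetical data: intercept column plus one predictor; the first point is an outlier
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([10.0, 2.0, 3.0, 4.0])

beta_ols = learn_linreg_weighted(X, y, np.ones(4))                    # all weights 1: plain OLS
beta_w0 = learn_linreg_weighted(X, y, np.array([0.0, 1.0, 1.0, 1.0]))  # outlier weight 0

print(beta_ols)
print(beta_w0)  # approximately [0, 1], the same as fitting without the first point
```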
Weighted Least Squares Estimates / Example
To downweight the leftmost point, we assign case weights as follows: [Table with columns $w$, $x$, $y$; values not recoverable.]
[Figure: data, model fitted with weights, and model fitted without weights, $y$ vs. $x$.]
Summary
- For regression, linear models of type $\hat y = x^T \hat\beta$ can be used to predict a quantitative $y$ based on several (quantitative) $x$.
  - A bias term can be modeled as an additional predictor that is constant 1.
- The ordinary least squares estimates (OLS) are the parameters with minimal residual sum of squares (RSS).
- OLS estimates can be computed by solving the normal equations $X^T X \hat\beta = X^T y$ like any system of linear equations, e.g., via Gaussian elimination.
- Alternatively, OLS estimates can be computed iteratively via Gradient Descent.
  - Especially for high-dimensional, sparse predictors GD is advantageous, as it never has to compute the large, dense $X^T X$.
- Case weights can be handled seamlessly by both methods to model different importance of cases.
Further Readings
[JWHT13, chapter 3], [Mur12, chapter 7], [HTFF05, chapter 3].
References
[Bat89] Roberto Battiti. Accelerated backpropagation learning: Two optimization methods. Complex Systems, 3(4), 1989.
[HTFF05] Trevor Hastie, Robert Tibshirani, Jerome Friedman, and James Franklin. The elements of statistical learning: data mining, inference and prediction. 2005.
[JWHT13] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Springer, 2013.
[Mur12] Kevin P. Murphy. Machine learning: a probabilistic perspective. The MIT Press, 2012.
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Logistic Regression Matt Gormley Lecture 8 Feb. 12, 2018 1 10-601 Introduction
More informationHigh-dimensional regression modeling
High-dimensional regression modeling David Causeur Department of Statistics and Computer Science Agrocampus Ouest IRMAR CNRS UMR 6625 http://www.agrocampus-ouest.fr/math/causeur/ Course objectives Making
More informationMachine Learning CSE546 Carlos Guestrin University of Washington. October 7, Efficiency: If size(w) = 100B, each prediction is expensive:
Simple Variable Selection LASSO: Sparse Regression Machine Learning CSE546 Carlos Guestrin University of Washington October 7, 2013 1 Sparsity Vector w is sparse, if many entries are zero: Very useful
More informationA Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013)
A Survey of L 1 Regression Vidaurre, Bielza and Larranaga (2013) Céline Cunen, 20/10/2014 Outline of article 1.Introduction 2.The Lasso for Linear Regression a) Notation and Main Concepts b) Statistical
More informationNeural Networks: Optimization & Regularization
Neural Networks: Optimization & Regularization Shan-Hung Wu shwu@cs.nthu.edu.tw Department of Computer Science, National Tsing Hua University, Taiwan Machine Learning Shan-Hung Wu (CS, NTHU) NN Opt & Reg
More informationFundamentals of Machine Learning (Part I)
Fundamentals of Machine Learning (Part I) Mohammad Emtiyaz Khan AIP (RIKEN), Tokyo http://emtiyaz.github.io emtiyaz.khan@riken.jp April 12, 2018 Mohammad Emtiyaz Khan 2018 1 Goals Understand (some) fundamentals
More informationCSC 411 Lecture 6: Linear Regression
CSC 411 Lecture 6: Linear Regression Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla University of Toronto UofT CSC 411: 06-Linear Regression 1 / 37 A Timely XKCD UofT CSC 411: 06-Linear Regression
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More informationMachine Learning for OR & FE
Machine Learning for OR & FE Supervised Learning: Regression I Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com Some of the
More informationCS-E3210 Machine Learning: Basic Principles
CS-E3210 Machine Learning: Basic Principles Lecture 3: Regression I slides by Markus Heinonen Department of Computer Science Aalto University, School of Science Autumn (Period I) 2017 1 / 48 In a nutshell
More informationConditional Random Fields for Sequential Supervised Learning
Conditional Random Fields for Sequential Supervised Learning Thomas G. Dietterich Adam Ashenfelter Department of Computer Science Oregon State University Corvallis, Oregon 97331 http://www.eecs.oregonstate.edu/~tgd
More informationLinear Regression and Discrimination
Linear Regression and Discrimination Kernel-based Learning Methods Christian Igel Institut für Neuroinformatik Ruhr-Universität Bochum, Germany http://www.neuroinformatik.rub.de July 16, 2009 Christian
More informationA New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives
A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives Paul Grigas May 25, 2016 1 Boosting Algorithms in Linear Regression Boosting [6, 9, 12, 15, 16] is an extremely
More informationAdvanced Machine Learning Practical 4b Solution: Regression (BLR, GPR & Gradient Boosting)
Advanced Machine Learning Practical 4b Solution: Regression (BLR, GPR & Gradient Boosting) Professor: Aude Billard Assistants: Nadia Figueroa, Ilaria Lauzana and Brice Platerrier E-mails: aude.billard@epfl.ch,
More informationNewton s Method. Javier Peña Convex Optimization /36-725
Newton s Method Javier Peña Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, f ( (y) = max y T x f(x) ) x Properties and
More informationData Mining. 3.6 Regression Analysis. Fall Instructor: Dr. Masoud Yaghini. Numeric Prediction
Data Mining 3.6 Regression Analysis Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Straight-Line Linear Regression Multiple Linear Regression Other Regression Models References Introduction
More informationPlanning and Optimal Control
Planning and Optimal Control 1. Markov Models Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University of Hildesheim, Germany 1 / 22 Syllabus Tue.
More informationLecture 17: October 27
0-725/36-725: Convex Optimiation Fall 205 Lecturer: Ryan Tibshirani Lecture 7: October 27 Scribes: Brandon Amos, Gines Hidalgo Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These
More informationLinear Regression. Aarti Singh. Machine Learning / Sept 27, 2010
Linear Regression Aarti Singh Machine Learning 10-701/15-781 Sept 27, 2010 Discrete to Continuous Labels Classification Sports Science News Anemic cell Healthy cell Regression X = Document Y = Topic X
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 14, 2014 Today s Schedule Course Project Introduction Linear Regression Model Decision Tree 2 Methods
More informationGaussian Graphical Models and Graphical Lasso
ELE 538B: Sparsity, Structure and Inference Gaussian Graphical Models and Graphical Lasso Yuxin Chen Princeton University, Spring 2017 Multivariate Gaussians Consider a random vector x N (0, Σ) with pdf
More informationVariable Selection in Data Mining Project
Variable Selection Variable Selection in Data Mining Project Gilles Godbout IFT 6266 - Algorithmes d Apprentissage Session Project Dept. Informatique et Recherche Opérationnelle Université de Montréal
More informationLogistic Regression. William Cohen
Logistic Regression William Cohen 1 Outline Quick review classi5ication, naïve Bayes, perceptrons new result for naïve Bayes Learning as optimization Logistic regression via gradient ascent Over5itting
More informationGradient Descent. Mohammad Emtiyaz Khan EPFL Sep 22, 2015
Gradient Descent Mohammad Emtiyaz Khan EPFL Sep 22, 201 Mohammad Emtiyaz Khan 2014 1 Learning/estimation/fitting Given a cost function L(β), we wish to find β that minimizes the cost: min β L(β), subject
More informationNonparameteric Regression:
Nonparameteric Regression: Nadaraya-Watson Kernel Regression & Gaussian Process Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro,
More informationSTAT 462-Computational Data Analysis
STAT 462-Computational Data Analysis Chapter 5- Part 2 Nasser Sadeghkhani a.sadeghkhani@queensu.ca October 2017 1 / 27 Outline Shrinkage Methods 1. Ridge Regression 2. Lasso Dimension Reduction Methods
More informationA summary of Deep Learning without Poor Local Minima
A summary of Deep Learning without Poor Local Minima by Kenji Kawaguchi MIT oral presentation at NIPS 2016 Learning Supervised (or Predictive) learning Learn a mapping from inputs x to outputs y, given
More informationLEAST ANGLE REGRESSION 469
LEAST ANGLE REGRESSION 469 Specifically for the Lasso, one alternative strategy for logistic regression is to use a quadratic approximation for the log-likelihood. Consider the Bayesian version of Lasso
More informationProximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization
Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R
More informationECS171: Machine Learning
ECS171: Machine Learning Lecture 3: Linear Models I (LFD 3.2, 3.3) Cho-Jui Hsieh UC Davis Jan 17, 2018 Linear Regression (LFD 3.2) Regression Classification: Customer record Yes/No Regression: predicting
More information