Data Mining Techniques. Lecture 2: Regression

Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 2: Regression Jan-Willem van de Meent (credit: Yijun Zhao, Marc Toussaint, Bishop)

Administrativa Instructor Jan-Willem van de Meent Email: j.vandemeent@northeastern.edu Phone: +1 617 373-7696 Office Hours: 478 WVH, Wed 1.30pm - 2.30pm Teaching Assistants Yuan Zhong E-mail: yzhong@ccs.neu.edu Office Hours: WVH 462, Wed 3pm - 5pm Kamlendra Kumar E-mail: kumark@zimbra.ccs.neu.edu Office Hours: WVH 462, Fri 3pm - 5pm

Administrativa Course Website http://www.ccs.neu.edu/course/cs6220f16/sec3/ Piazza https://piazza.com/northeastern/fall2016/cs622003/home Project Guidelines (Vote next week) http://www.ccs.neu.edu/course/cs6220f16/sec3/project/

Question What would you like to get out of this course?

Linear Regression

Regression Examples (features x ⇒ continuous value y):
{age, major, gender, race} ⇒ GPA
{income, credit score, profession} ⇒ Loan Amount
{college, major, GPA} ⇒ Future Income

Example: Boston Housing Data UC Irvine Machine Learning Repository (good source for project datasets) https://archive.ics.uci.edu/ml/datasets/housing

Example: Boston Housing Data (features; a loading sketch follows below)
1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centres
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per $10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: 1000(Bk - 0.63)^2, where Bk is the proportion of African Americans by town
13. LSTAT: % lower status of the population
14. MEDV: median value of owner-occupied homes in $1000's
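To make the table concrete, here is a minimal NumPy loading sketch, assuming the raw whitespace-separated UCI file has been saved locally under the hypothetical name housing.data with the 14 columns in the order above:

```python
import numpy as np

# Column names in the order listed above (MEDV is the regression target).
COLUMNS = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

# Assumes the raw file from the UCI repository has been downloaded and saved
# locally as "housing.data" (whitespace-separated, one row per census tract).
data = np.loadtxt("housing.data")   # shape (N, 14)
X_raw = data[:, :-1]                # N x 13 feature matrix
y = data[:, -1]                     # MEDV, the target, in $1000's

print(X_raw.shape, y.shape)         # e.g. (506, 13) (506,)
```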

Example: Boston Housing Data CRIM: per capita crime rate by town

Example: Boston Housing Data CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

Example: Boston Housing Data MEDV: Median value of owner-occupied homes in $1000's

Example: Boston Housing Data N data points D features

Regression: Problem Setup. Given N observations $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, learn a function $f$ such that $y_i = f(x_i)$ for $i = 1, 2, \ldots, N$, and for a new input $x^*$ predict $y^* = f(x^*)$.

Linear Regression. Assume $f$ is a linear combination of the D features: $f(x) = w^T x$, where $x = (1, x_1, \ldots, x_D)^T$ and $w = (w_0, w_1, \ldots, w_D)^T$. For N points we write $y \approx Xw$. Learning task: estimate $w$.

Linear Regression

Error Measure. Mean Squared Error (MSE):
$E(w) = \frac{1}{N} \sum_{n=1}^{N} (w^T x_n - y_n)^2 = \frac{1}{N} \| Xw - y \|^2$
where
$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$

Minimizing the Error
$E(w) = \frac{1}{N} \| Xw - y \|^2$
$\nabla E(w) = \frac{2}{N} X^T (Xw - y) = 0$
$X^T X w = X^T y$
$w = X^{\dagger} y$, where $X^{\dagger} = (X^T X)^{-1} X^T$ is the pseudo-inverse of $X$
(See the Matrix Cookbook, on the course website, for these identities.)

Ordinary Least Squares
1. Construct the matrix $X$ and the vector $y$ from the dataset $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ (each $x$ includes the bias entry $x_0 = 1$):
$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$
2. Compute $X^{\dagger} = (X^T X)^{-1} X^T$
3. Return $w = X^{\dagger} y$
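A minimal NumPy sketch of this procedure (the data below is illustrative; in practice np.linalg.lstsq or np.linalg.pinv is numerically preferable to forming $(X^T X)^{-1}$ explicitly):

```python
import numpy as np

def ols(X_raw, y):
    """Ordinary least squares: return w minimizing ||Xw - y||^2."""
    N = X_raw.shape[0]
    X = np.hstack([np.ones((N, 1)), X_raw])   # prepend bias column x_0 = 1
    # Pseudo-inverse X† = (X^T X)^{-1} X^T (use pinv/lstsq for better stability).
    X_dagger = np.linalg.inv(X.T @ X) @ X.T
    return X_dagger @ y                        # w = X† y

# Example with synthetic data (D = 3 features, N = 100 points).
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 3))
w_true = np.array([1.0, 2.0, -1.0, 0.5])       # [bias, w_1, w_2, w_3]
y = np.hstack([np.ones((100, 1)), X_raw]) @ w_true + 0.1 * rng.normal(size=100)
print(ols(X_raw, y))                            # close to w_true
```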

Gradient Descent (figure: contours of E(w) in the (w_0, w_1) plane)

Least Mean Squares (a.k.a. gradient descent)
1. Initialize the weights w(0) at time t = 0
2. for t = 0, 1, 2, ... do
3.   Compute the gradient $g_t = \nabla E(w(t))$
4.   Set the direction to move: $v_t = -g_t$
5.   Update $w(t+1) = w(t) + \eta \, v_t$
6.   Iterate until it is time to stop
7. Return the final weights w
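A sketch of the same loop for the MSE objective $E(w) = \frac{1}{N}\|Xw - y\|^2$ with a fixed step size; the values of eta, max_iters, and the stopping tolerance are illustrative:

```python
import numpy as np

def lms_gradient_descent(X, y, eta=0.01, max_iters=1000, tol=1e-6):
    """Minimize E(w) = (1/N) ||Xw - y||^2 by batch gradient descent."""
    N, D = X.shape
    w = np.zeros(D)                                # w(0)
    for t in range(max_iters):
        g = (2.0 / N) * X.T @ (X @ w - y)          # gradient of E at w(t)
        v = -g                                     # direction to move
        w_new = w + eta * v                        # w(t+1) = w(t) + eta * v
        if np.linalg.norm(w_new - w) < tol:        # "time to stop"
            return w_new
        w = w_new
    return w
```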

Question When would you want to use OLS, when LMS?

Computational Complexity. Ordinary Least Squares (OLS) vs. Least Mean Squares (LMS): OLS is expensive when D is large, since it requires forming and inverting the D × D matrix $X^T X$, whereas LMS only needs gradient evaluations.

Effect of step size

Choosing the Stepsize. Should the step size be set proportional to the gradient $\nabla f(x)$? Small gradient, small step? Large gradient, large step? Two commonly used techniques: 1. stepsize adaptation, 2. line search.

Stepsize Adaptation
Input: initial x ∈ R^n, functions f(x) and ∇f(x), initial stepsize α, tolerance θ
Output: x
1: repeat
2:   y ← x − α ∇f(x) / |∇f(x)|
3:   if f(y) ≤ f(x) then   // step is accepted
4:     x ← y
5:     α ← 1.2 α   // increase stepsize
6:   else   // step is rejected
7:     α ← 0.5 α   // decrease stepsize
8:   end if
9: until |y − x| < θ [perhaps for 10 iterations in sequence]
(1.2 and 0.5 are "magic numbers")
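A Python sketch of this loop; the 1.2 and 0.5 factors are the slide's "magic numbers", and the normalized-gradient step is an assumption about the update rule:

```python
import numpy as np

def adaptive_gradient_descent(f, grad_f, x0, alpha=1.0, tol=1e-6, max_iters=1000):
    """Gradient descent with multiplicative stepsize adaptation."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        y = x - alpha * g / (np.linalg.norm(g) + 1e-12)   # normalized step (assumption)
        if f(y) <= f(x):              # step accepted
            if np.linalg.norm(y - x) < tol:
                return y
            x = y
            alpha *= 1.2              # increase stepsize
        else:                         # step rejected
            alpha *= 0.5              # decrease stepsize
    return x
```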

Second Order Methods Compute Hessian matrix of second derivatives

Second Order Methods
Broyden-Fletcher-Goldfarb-Shanno (BFGS) method:
Input: initial x ∈ R^n, functions f(x), ∇f(x), tolerance θ
Output: x
1: initialize H^{-1} = I_n
2: repeat
3:   compute Δ = −H^{-1} ∇f(x)
4:   perform a line search min_α f(x + αΔ)
5:   Δ ← αΔ
6:   y ← ∇f(x + Δ) − ∇f(x)
7:   x ← x + Δ
8:   update H^{-1} ← (I − Δy^T / (y^T Δ)) H^{-1} (I − yΔ^T / (y^T Δ)) + ΔΔ^T / (y^T Δ)
9: until |Δ| < θ
Memory-limited version: L-BFGS
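In practice one typically calls a library implementation rather than hand-coding the update; a sketch using SciPy's L-BFGS (scipy.optimize.minimize with method='L-BFGS-B') to minimize the MSE objective:

```python
import numpy as np
from scipy.optimize import minimize

def fit_lbfgs(X, y):
    """Fit linear regression weights by minimizing MSE with L-BFGS."""
    N, D = X.shape

    def f(w):                                  # E(w) = (1/N) ||Xw - y||^2
        r = X @ w - y
        return (r @ r) / N

    def grad_f(w):                             # ∇E(w) = (2/N) X^T (Xw - y)
        return (2.0 / N) * X.T @ (X @ w - y)

    result = minimize(f, np.zeros(D), jac=grad_f, method="L-BFGS-B")
    return result.x
```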

Stochastic Gradient Descent. What if N is really large? Batch gradient descent (evaluates all data); minibatch gradient descent (evaluates a subset per update). Converges under the Robbins-Monro conditions on the step sizes ($\sum_t \eta_t = \infty$, $\sum_t \eta_t^2 < \infty$).
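A sketch of minibatch SGD for the MSE objective, with a decaying step size $\eta_t = \eta_0 / (1 + t)$ chosen to satisfy the Robbins-Monro conditions; the batch size and schedule are illustrative choices:

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=32, eta0=0.1, n_epochs=50, seed=0):
    """Minibatch stochastic gradient descent for E(w) = (1/N) ||Xw - y||^2."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    t = 0
    for _ in range(n_epochs):
        # Shuffle the data each epoch and split it into minibatches.
        for idx in np.array_split(rng.permutation(N), max(1, N // batch_size)):
            Xb, yb = X[idx], y[idx]
            g = (2.0 / len(idx)) * Xb.T @ (Xb @ w - yb)   # minibatch gradient
            eta = eta0 / (1.0 + t)                        # Robbins-Monro schedule
            w -= eta * g
            t += 1
    return w
```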

Probabilistic Interpretation

Normal Distribution (figure: normal vs. right-skewed, left-skewed, and random distributions)

Normal Distribution. Density: $p(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y - \mu)^2}{2\sigma^2} \right)$

Central Limit Theorem (figure: distribution of the mean of N uniform random variables for N = 1, 2, 10). If $y_1, \ldots, y_N$ are 1. independent and identically distributed (i.i.d.) and 2. have finite variance $0 < \sigma_y^2 < \infty$, then the distribution of their mean approaches a normal distribution as $N \to \infty$.

Multivariate Normal. Density:
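The density appears only as an image in the slides; the standard D-dimensional form, for reference:

```latex
p(\mathbf{y} \mid \boldsymbol{\mu}, \Sigma)
  = \frac{1}{(2\pi)^{D/2} \, |\Sigma|^{1/2}}
    \exp\!\left( -\tfrac{1}{2} (\mathbf{y} - \boldsymbol{\mu})^{T}
                 \Sigma^{-1} (\mathbf{y} - \boldsymbol{\mu}) \right)
```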

Regression: Probabilistic Interpretation

Regression: Probabilistic Interpretation Joint probability of N independent data points

Regression: Probabilistic Interpretation Log joint probability of N independent data points

Regression: Probabilistic Interpretation Log joint probability of N independent data points Maximum Likelihood
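The equations on these slides are images; a standard reconstruction, assuming the model $y_n = w^T x_n + \epsilon_n$ with i.i.d. noise $\epsilon_n \sim \mathcal{N}(0, \sigma^2)$:

```latex
p(y_n \mid x_n, w, \sigma^2) = \mathcal{N}(y_n \mid w^T x_n, \sigma^2)

\log p(y \mid X, w, \sigma^2)
  = \sum_{n=1}^{N} \log \mathcal{N}(y_n \mid w^T x_n, \sigma^2)
  = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - w^T x_n)^2
    - \frac{N}{2} \log(2\pi\sigma^2)

\Rightarrow\; \arg\max_w \log p(y \mid X, w, \sigma^2)
  = \arg\min_w \frac{1}{N} \sum_{n=1}^{N} (w^T x_n - y_n)^2
```

Maximizing the log likelihood over w is therefore equivalent to minimizing the MSE from the earlier slides.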

Basis Function Regression
Linear regression: $y = w_0 + w_1 x_1 + \ldots + w_D x_D = w^T x$
Basis function regression: $y = w^T \phi(x)$ for a feature map $\phi$
Polynomial regression: $\phi(x) = (1, x, x^2, \ldots, x^M)^T$
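A sketch of polynomial basis expansion followed by a least-squares fit; the degree grid and the noisy-sine data are illustrative (they mirror Bishop's running example):

```python
import numpy as np

def polynomial_design_matrix(x, M):
    """Map scalar inputs x to basis features phi(x) = (1, x, x^2, ..., x^M)."""
    return np.vander(x, N=M + 1, increasing=True)

def fit_polynomial(x, t, M):
    """Least-squares fit of a degree-M polynomial."""
    Phi = polynomial_design_matrix(x, M)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

# Illustrative data: a noisy sine curve with 10 points.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=10)
for M in (0, 1, 3, 9):
    w = fit_polynomial(x, t, M)
    print(M, w[:3])   # low M underfits; M = 9 interpolates the points (overfits)
```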

Polynomial Regression (figure: polynomial fits of degree M = 0, 1, 3, 9 to data t plotted against x; the low-degree fits M = 0 and M = 1 underfit, while M = 9 overfits)

Regularization
L2 regularization (ridge regression) minimizes:
$E(w) = \frac{1}{N} \| Xw - y \|^2 + \lambda \| w \|_2^2$, where $\lambda \ge 0$ and $\| w \|_2^2 = w^T w$
L1 regularization (LASSO) minimizes:
$E(w) = \frac{1}{N} \| Xw - y \|^2 + \lambda \| w \|_1$, where $\lambda \ge 0$ and $\| w \|_1 = \sum_{i=1}^{D} |w_i|$

Regularization

Regularization
L2: closed-form solution $w = (X^T X + \lambda I)^{-1} X^T y$
L1: no closed-form solution; use quadratic programming: minimize $\| Xw - y \|^2$ s.t. $\| w \|_1 \le s$
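A sketch of the L2 closed-form solution; whether to penalize the bias term $w_0$ is a modeling choice, left unpenalized here as an assumption:

```python
import numpy as np

def ridge_closed_form(X_raw, y, lam=1.0):
    """Ridge regression: w = (X^T X + lam * I)^{-1} X^T y."""
    N = X_raw.shape[0]
    X = np.hstack([np.ones((N, 1)), X_raw])    # bias column x_0 = 1
    D = X.shape[1]
    penalty = lam * np.eye(D)
    penalty[0, 0] = 0.0                        # leave the bias unpenalized (assumption)
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)
```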

Review: Bias-Variance Trade-off Maximum likelihood estimator Bias-variance decomposition (expected value over possible data points)

Bias-Variance Trade-off. Often: low bias ⇒ high variance, and low variance ⇒ high bias. Trade-off:
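The decomposition itself is shown only as an image; the standard form, writing $\hat{f}_{\mathcal{D}}$ for the estimator fit on a random dataset $\mathcal{D}$ and $y = f(x) + \epsilon$ with noise variance $\sigma^2$:

```latex
\mathbb{E}_{\mathcal{D}, \epsilon}\!\left[ \big(y - \hat{f}_{\mathcal{D}}(x)\big)^2 \right]
  = \underbrace{\big( f(x) - \mathbb{E}_{\mathcal{D}}[\hat{f}_{\mathcal{D}}(x)] \big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_{\mathcal{D}}\!\left[ \big( \hat{f}_{\mathcal{D}}(x)
        - \mathbb{E}_{\mathcal{D}}[\hat{f}_{\mathcal{D}}(x)] \big)^2 \right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```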

K-fold Cross-Validation 1. Divide dataset into K folds 2. Train on all except k-th fold 3. Test on k-th fold 4. Minimize test error w.r.t. λ

K-fold Cross-Validation. Choices for K: 5, 10, or N (leave-one-out). Cost of computation: K × (number of λ values) model fits.
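A sketch of the four steps above for selecting λ in ridge regression by K-fold cross-validation; the fold count and λ grid are illustrative:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution w = (X^T X + lam I)^{-1} X^T y (bias already in X)."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

def kfold_select_lambda(X_raw, y, lambdas, K=5, seed=0):
    """Pick the lambda with the lowest average held-out MSE over K folds."""
    rng = np.random.default_rng(seed)
    X = np.hstack([np.ones((len(y), 1)), X_raw])        # add bias column
    folds = np.array_split(rng.permutation(len(y)), K)  # 1. divide into K folds
    avg_err = {}
    for lam in lambdas:
        errs = []
        for k in range(K):
            test = folds[k]
            train = np.hstack([folds[j] for j in range(K) if j != k])
            w = ridge_fit(X[train], y[train], lam)       # 2. train on all but fold k
            errs.append(np.mean((X[test] @ w - y[test]) ** 2))   # 3. test on fold k
        avg_err[lam] = np.mean(errs)
    return min(avg_err, key=avg_err.get)                 # 4. minimize test error w.r.t. lambda

# Example: best_lam = kfold_select_lambda(X_raw, y, lambdas=[0.01, 0.1, 1.0, 10.0])
```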

Learning Curve

Loss Functions