II. Linear Models (pp.47-70)

Notation: [icon] means pencil-and-paper QUIZ; [icon] means coding QUIZ.

Review quiz:
- Agree or disagree: regression can always be reduced to classification. Explain, either way!
- A certain classifier scores 98% on the training set, but only 55% on the testing set. What do you think happened?
- A certain classifier scores 59% on the training set, and 60% on the testing set. What do you think happened?
- A certain classifier scores 61% on the training set, and 85% on the testing set. What do you think happened?
- A certain classifier scores 93% on the training set, and 90% on the testing set. What do you think happened?
- Explain the difference between synthetic and empirical datasets. Why do we need both?
- Write code to create a wave dataset with 10 points. Display them in a plot.
- When extending the Boston dataset with the engineered features, the nr. of features exploded from 13 to 104. Explain! (Hint: pairs of features are multiplied.)
- Why do we call KNN a lazy classifier?

Prediction is achieved by means of a linear function of the features, i.e. a function involving only additions and multiplications by constants.

Linear models for regression

When the target feature y is predicted based on only one other feature x[0], we have the simple formula

    ŷ = w[0] * x[0] + b

This is the equation of a line, with the well-known slope (w[0]) and y-intercept (b).

1. Linear regression (ordinary least-squares = OLS)

A.k.a. ordinary least squares, because the parameters w and b are chosen so as to minimize the sum of the squares of the errors between the training data points and the values predicted by the model. Minimizing this sum is equivalent to minimizing what is known in statistics as the R.M.S. (Root Mean Square) Error, or RMSE:

    RMSE = sqrt( (1/n) * Σ_i (y_i - ŷ_i)² )

Explain on the plot above! Calculate RMSE for the 3-point dataset and the line shown.

intercept_ is always a scalar: it is the intersection point between the line/plane/hyperplane and the target axis.

coef_ holds the slope(s): a single value for a 1D model, but in general it is a 1D NumPy array with as many elements as there are non-target features in the dataset.

The relatively low train and test scores indicate underfitting, but here we have no choice, because in OLS we cannot control the model complexity.

p.49: Linear regression has no parameters [set by the user]

A better way to put it is: OLS has no hyper-parameters.
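A minimal sketch of OLS on the wave dataset (assuming mglearn and scikit-learn are available; the random_state and variable names are illustrative choices, not the book's exact code):

    import mglearn
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Synthetic 1D regression dataset
    X, y = mglearn.datasets.make_wave(n_samples=60)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # OLS: nothing to tune, we just fit
    lr = LinearRegression().fit(X_train, y_train)

    print("coef_ (slope):", lr.coef_)        # 1D array, one entry per feature
    print("intercept_:", lr.intercept_)      # scalar
    print("train R^2:", lr.score(X_train, y_train))
    print("test R^2:", lr.score(X_test, y_test))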

Boston housing dataset: What is your diagnosis? Overfitting! Why? Because the number of samples (506) is of the same order of magnitude as the number of features (104, including the derived ones). To avoid overfitting in LR, we need nr. samples >> nr. features.
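A sketch of reproducing this diagnosis on the extended Boston data (assuming an mglearn/scikit-learn version in which the Boston dataset is still available; the exact scores depend on the train/test split):

    import mglearn
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # 506 samples, 104 features (13 original + 91 pairwise products)
    X, y = mglearn.datasets.load_extended_boston()
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    lr = LinearRegression().fit(X_train, y_train)
    # A large gap between train and test scores is the classic sign of overfitting
    print("train R^2:", lr.score(X_train, y_train))
    print("test R^2:", lr.score(X_test, y_test))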

Solutions:
- Write code to create a wave dataset with 10 points. Display them in a plot. Note the behavior of the legend! (See the sketch below.)
- When extending the Boston dataset with the engineered features, the nr. of features exploded from 13 to 104. Explain! (Hint: pairs of features are multiplied.) Each feature combines with itself, as well as with all others, but each pair of features is only combined once. This gives 13 + 12 + ... + 2 + 1 = 14*13/2 = 91 new features, and 13 + 91 = 104.
- Calculate RMSE for the 3-point dataset and the line shown.
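One possible solution for the wave-dataset exercise (a sketch; the axis labels and plot style are my own choices):

    import matplotlib.pyplot as plt
    import mglearn

    # Synthetic "wave" dataset with only 10 points
    X, y = mglearn.datasets.make_wave(n_samples=10)

    plt.plot(X, y, "o", label="training points")
    plt.xlabel("Feature x[0]")
    plt.ylabel("Target y")
    plt.legend()
    plt.show()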

2. Ridge regression (RR)

Same formula, but the coefficients are calculated by minimizing not only the sum of squared errors (as in OLS), but also the sum of the squared coefficients, which is a measure of the complexity of the model:

    New error = Σ_i (y_i - ŷ_i)² + α * Σ_j w[j]²

α (alpha) is a new parameter (hyperparameter), called the regularization parameter, that the user can set in order to control the model complexity:
- small alpha → little regularization → high complexity → possible overfitting
- large alpha → ...
- default alpha: 1

The proof is in the pudding: let us use RR on the same dataset, extended_boston. Remember the previous scores from ordinary regression. Now let us play with alpha (see the sketch below): Conclusion?
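A sketch of playing with alpha on extended_boston (assuming the three alpha values are 0.1, 1 and 10, as in the book's example; exact scores depend on the split):

    import mglearn
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    X, y = mglearn.datasets.load_extended_boston()
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for alpha in [0.1, 1, 10]:
        ridge = Ridge(alpha=alpha).fit(X_train, y_train)
        print(f"alpha={alpha}: "
              f"train R^2 = {ridge.score(X_train, y_train):.2f}, "
              f"test R^2 = {ridge.score(X_test, y_test):.2f}")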

Plotting coef_ (slopes) for the three alphas above: note that a large alpha leads to small coefficients (triangles), which means a less complex model.
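A sketch of such a coefficient plot (marker shapes and labels are my own choices):

    import matplotlib.pyplot as plt
    import mglearn
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    X, y = mglearn.datasets.load_extended_boston()
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Larger alpha (triangles) -> coefficients pulled closer to zero
    for alpha, marker in [(0.1, "o"), (1, "s"), (10, "^")]:
        ridge = Ridge(alpha=alpha).fit(X_train, y_train)
        plt.plot(ridge.coef_, marker, label=f"Ridge alpha={alpha}")

    plt.xlabel("Coefficient index")
    plt.ylabel("Coefficient magnitude")
    plt.legend()
    plt.show()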

What is the effect of the size of the training set?
- The training score for RR is always smaller than for OLS. Why?
- Up to about 380 training points, OLS doesn't learn anything!
- RR does significantly better than OLS when the dataset is small.
- With enough data points, OLS eventually catches up with RR.

A little research: What exactly happens with the Linear Regression test scores below 380? I modified the mglearn code to print the actual values of the Linear Regression test scores: they turn out to be huge negative numbers. Now is a good time to go back to the definition of R^2 and understand how it is possible for it to take such huge negative values. A negative R^2 means the model predicts worse than the simple average of the target values, so when R^2 is negative it is recommended to use that average as the estimate!
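A small check of how R^2 can become negative (the toy numbers below are hypothetical, purely for illustration):

    import numpy as np
    from sklearn.metrics import r2_score

    # Hypothetical targets and predictions (illustration only)
    y_true = np.array([1.0, 2.0, 3.0])
    good_pred = np.array([1.1, 1.9, 3.2])   # close to the truth
    bad_pred = np.array([3.0, 3.0, 3.0])    # worse than predicting the mean (2.0)

    # R^2 = 1 - SS_res / SS_tot, so it goes negative when SS_res > SS_tot
    print(r2_score(y_true, good_pred))                      # close to 1
    print(r2_score(y_true, bad_pred))                       # negative (-1.5)
    print(r2_score(y_true, np.full(3, y_true.mean())))      # exactly 0 for the mean predictor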

3. Lasso regression (LaR)

Ridge: minimize Σ_i (y_i - ŷ_i)² + α * Σ_j w[j]²  (uses the Euclidean norm, a.k.a. the L2 norm).
Lasso: minimize Σ_i (y_i - ŷ_i)² + α * Σ_j |w[j]|  (uses the sum of absolute values, a.k.a. the L1 norm).

This Wikipedia picture explains why LaR sets some of the coefficients w to exactly zero, rather than just making them smaller (as RR does).[1] For this reason, LaR leads to a simpler model than RR: it identifies redundant features, which do not contribute to explaining the data, and eliminates them. Another way to look at it: it selects and retains only the important features.

[1] Source: https://en.wikipedia.org/wiki/lasso_(statistics)#geometric_interpretation

Boston dataset:
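A sketch of Lasso with the default alpha on extended_boston (feature counts and scores depend on the split):

    import numpy as np
    import mglearn
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import train_test_split

    X, y = mglearn.datasets.load_extended_boston()
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    lasso = Lasso().fit(X_train, y_train)   # default alpha = 1
    print("train R^2:", lasso.score(X_train, y_train))
    print("test R^2:", lasso.score(X_test, y_test))
    # Most coefficients are exactly zero: only a handful of features survive
    print("features used:", np.sum(lasso.coef_ != 0))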

See alpha = 1 above (blue squares). What is your diagnosis? Massive underfitting! Let us try to get closer to the optimum:
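A sketch of getting closer to the optimum by lowering alpha (the value 0.01 and the larger max_iter follow the book's example; exact scores depend on the split):

    import numpy as np
    import mglearn
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import train_test_split

    X, y = mglearn.datasets.load_extended_boston()
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Smaller alpha = weaker regularization; max_iter is raised so the solver converges
    lasso001 = Lasso(alpha=0.01, max_iter=100000).fit(X_train, y_train)
    print("train R^2:", lasso001.score(X_train, y_train))
    print("test R^2:", lasso001.score(X_test, y_test))
    print("features used:", np.sum(lasso001.coef_ != 0))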

Note: The text says that Lasso with alpha = 0.01 does slightly better than Ridge with alpha = 0.1... but in the examples we see that they both have the test score 0.77. How can we find out if this is true? A: Simply include more digits when we display the result (see the sketch at the end of this section). Conclusion: not true! (At least not on our random partition.)

Conclusions on linear regression algorithms:
- Use OLS only when the nr. of data points is >> the nr. of features. To be safe from overfitting, try to have at least 50 times more points than features, e.g. for 2 features use at least 100 data points. (Of course, the distribution of those points is also important!)
- In general, RR is the first choice, unless...
- ... the nr. of features is large (hundreds!) and we expect many of them to be irrelevant (but we don't know which ones). In this case, use LaR.
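A sketch of the more-digits check (the models and alpha values are the ones from the note above; the four-decimal format is my own choice):

    import mglearn
    from sklearn.linear_model import Lasso, Ridge
    from sklearn.model_selection import train_test_split

    X, y = mglearn.datasets.load_extended_boston()
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)
    lasso001 = Lasso(alpha=0.01, max_iter=100000).fit(X_train, y_train)

    # Two decimals hide the difference; four (or more) reveal which one is ahead
    print(f"Ridge(alpha=0.1)  test R^2: {ridge01.score(X_test, y_test):.4f}")
    print(f"Lasso(alpha=0.01) test R^2: {lasso001.score(X_test, y_test):.4f}")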