COMS 4771 Regression. Nakul Verma


Last time: Support Vector Machines. Maximum margin formulation. Constrained optimization. Lagrange duality theory. Convex optimization. SVM dual and its interpretation. How to get the optimal solution.

Learning More Sophisticated Outputs. So far we have focused on classification, f : X → {1, …, k}. What about other outputs? PM 2.5 (pollutant) particulate-matter exposure estimate: input: # cars, temperature, etc.; output: 50 ppb. Pose estimation. Sentence structure estimation.

Regression. We'll focus on problems with real-valued outputs (the regression problem). Example: next eruption time of the Old Faithful geyser (at Yellowstone).

Regression Formulation for the Example. Given x, we want to predict an estimate ŷ of y that minimizes the discrepancy (loss L) between ŷ and y. Absolute error loss: L(ŷ, y) = |ŷ − y|. Squared error loss: L(ŷ, y) = (ŷ − y)^2. A linear predictor f can be defined by the slope w and the intercept w_0, f(x) = w·x + w_0, chosen to minimize the prediction loss. How is this different from classification?

Parametric vs Non-parametric Regression. If we assume a particular form of the regressor: parametric regression. Goal: learn the parameters which yield the minimum error/loss. If no specific form of the regressor is assumed: non-parametric regression. Goal: learn the predictor directly from the input data so that it yields the minimum error/loss.

Linear Regression. We want to find a linear predictor f, i.e., a weight vector w (intercept w_0 absorbed via lifting), f(x) = w·x, which minimizes the prediction loss over the population, E[(w·X − Y)^2]. We estimate the parameters by minimizing the corresponding loss on the training data: w = argmin_w (1/n) Σ_i (w·x_i − y_i)^2. Geometrically, for squared error, this fits the hyperplane minimizing the sum of squared vertical distances to the data.

Linear Regression: Learning the Parameters. Linear predictor with squared loss: minimize over w the quantity Σ_i (x_i·w − y_i)^2 = ‖Xw − y‖^2, where X is the matrix with rows x_1, …, x_n and y is the vector of responses y_1, …, y_n. This is an unconstrained problem! We can take the gradient and examine the stationary points. Why do we not need to check the second-order conditions? (The objective is convex in w.)

Linear Regression: Learning the Parameters. Best fitting w: at a stationary point the gradient vanishes, giving the normal equations X^T X w = X^T y, so w_ols = (X^T X)^{-1} X^T y (the pseudo-inverse of X applied to y). This is also called the Ordinary Least Squares (OLS) solution. The solution is unique and stable when X^T X is invertible. What is the interpretation of this solution?
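
As a concrete illustration, here is a minimal NumPy sketch of the OLS fit; the toy data, variable names, and the use of np.linalg.lstsq (which computes the pseudo-inverse solution via an SVD) are my own choices, not from the slides.

```python
import numpy as np

def ols_fit(X, y):
    """Ordinary least squares: solve the normal equations (X^T X) w = X^T y.

    np.linalg.lstsq uses an SVD-based pseudo-inverse, so it also handles
    the case where X^T X is not invertible."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Toy data: y = 2*x + 1 plus noise, with a column of ones for the intercept.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 * x + 1 + rng.normal(0, 0.5, size=50)
X = np.column_stack([x, np.ones_like(x)])   # "lifting" absorbs the intercept w_0
print(ols_fit(X, y))                        # approximately [2.0, 1.0]
```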

Linear Regression: Geometric Viewpoint. Consider the column space view of the data X: find a w such that the linear combination Xw of the columns of X minimizes ‖Xw − y‖^2. If ŷ = X w_ols is the OLS solution, then ŷ = X (X^T X)^{-1} X^T y = Π y, where Π is the projection matrix onto the column space of X, and the residual y − ŷ is orthogonal to that space. Thus ŷ is the orthogonal projection of y onto the column space of X, and w_ols forms the coefficients of ŷ = Π(y).
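
A quick numerical check of this viewpoint, assuming X^T X is invertible; the random toy data below is chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))              # 20 samples, 3 features
y = rng.normal(size=20)

# Projection matrix onto the column space of X (assumes X^T X is invertible).
Pi = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = Pi @ y                            # orthogonal projection of y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(y_hat, X @ w_ols))      # True: y_hat = X w_ols
print(np.allclose(X.T @ (y - y_hat), 0))  # True: residual is orthogonal to the columns of X
```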

Linear Regression: Statistical Modeling View. Let's assume the data is generated from the following process: an example x_i is drawn independently from the data space X; y_clean is computed as w·x_i from a fixed unknown w; y_i is obtained by corrupting y_clean with independent Gaussian noise N(0, σ^2); (x_i, y_i) is revealed as the i-th sample.

Linear Regression: Statistical Modeling View. How can we determine w from the Gaussian-noise-corrupted observations? Observation: y_i | x_i ~ N(w·x_i, σ^2), with parameter w. How do we estimate the parameters of a Gaussian? Let's try Maximum Likelihood Estimation! Ignoring terms independent of w, maximizing the log-likelihood amounts to minimizing Σ_i (w·x_i − y_i)^2, so optimizing for w yields the same OLS result! What happens if we model each y_i with independent noise of a different variance?
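
A small simulation sketch of this generative process; the dimensions, noise level, and true w below are arbitrary choices for the demo. Since maximizing the Gaussian log-likelihood in w is the same as minimizing the squared error, the MLE here is computed with the OLS formula.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 200, 3, 0.3
w_true = np.array([1.0, -2.0, 0.5])

X = rng.normal(size=(n, d))
y = X @ w_true + rng.normal(0, sigma, size=n)   # y_i = w.x_i + N(0, sigma^2)

# MLE under Gaussian noise = OLS: solve the normal equations.
w_mle = np.linalg.solve(X.T @ X, X.T @ y)
print(w_mle)                                    # close to w_true
```

If each y_i instead had its own noise variance σ_i^2, the same derivation would weight each squared residual by 1/σ_i^2, i.e., the MLE becomes weighted least squares.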

Linear Regression for Classification? Linear regression seems general; can we use it to derive a binary classifier? Let's study 1-d data with labels Y = 0 and Y = 1. Problem #1: where is the real-valued y for regression? Problem #2: the relationship between x and y is not really linear! Perhaps it is linear in some transformed coordinates?

Linear Regression for Classification. A sigmoid is a better model! Binary predictor. Interpretation: for an event that occurs with probability P, the odds of that event are P / (1 − P). For an event with P = 0.9, odds = 9; but for an event with P = 0.1, odds ≈ 0.11 (very asymmetric). Consider instead the log of the odds, log(P / (1 − P)): for P = 0.9 and P = 0.1 the log-odds are +2.2 and −2.2. Symmetric!

Logistic Regression. Model the log-odds, or logit, with a linear function: log( P(Y = 1 | x) / (1 − P(Y = 1 | x)) ) = w·x, which gives P(Y = 1 | x) = 1 / (1 + e^{−w·x}), the sigmoid! OK, we have a model; how do we learn the parameters?

Logistic Regression: Learning Parameters. Given samples (x_i, y_i) with y_i ∈ {0, 1} binary, each y_i is a Bernoulli (binomial) outcome with success probability p_i = P(Y = 1 | x_i), so the likelihood is Π_i p_i^{y_i} (1 − p_i)^{1 − y_i}. Now use the logistic model p_i = 1 / (1 + e^{−w·x_i}) and maximize the log-likelihood. We can take the derivative and analyze the stationary points, but unfortunately there is no closed-form solution (use iterative methods like gradient descent to find the solution).
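
A minimal gradient-ascent sketch for fitting the logistic model; the learning rate, iteration count, and toy data are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, y, lr=0.1, n_iters=2000):
    """Maximize the Bernoulli log-likelihood by (batch) gradient ascent.

    The gradient of the log-likelihood with respect to w is X^T (y - sigmoid(Xw))."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w += lr * X.T @ (y - sigmoid(X @ w)) / len(y)
    return w

# Toy 1-d example with an intercept column; true model: logit = 2x - 1.
rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=200)
y = (rng.uniform(size=200) < sigmoid(2 * x - 1)).astype(float)
X = np.column_stack([x, np.ones_like(x)])
print(logistic_regression_gd(X, y))          # roughly [2, -1]
```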

Linear Regression: Other Variations. Back to ordinary least squares (OLS): how can we additionally incorporate prior knowledge? OLS is often poorly behaved when X^T X is not invertible. Perhaps we want w to be sparse: Lasso regression. Perhaps we want a simple (small-norm) w: Ridge regression.

Ridge Regression Objective: w_ridge = argmin_w ‖Xw − y‖^2 + λ‖w‖_2^2, the reconstruction error plus an L2 penalty with regularization parameter λ ≥ 0. The regularization helps avoid overfitting and always results in a unique solution. Equivalent to the following constrained optimization problem: minimize ‖Xw − y‖^2 subject to ‖w‖_2^2 ≤ B (for some B depending on λ). Geometrically, the constraint region is the L2 ball. Why?
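
A hedged NumPy sketch of the closed-form ridge solution; the function name and the toy usage (more features than samples, so X^T X is singular) are my own choices.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y.

    For lam > 0 the matrix X^T X + lam*I is positive definite, hence invertible,
    so the solution is unique even when X^T X is singular."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Tiny usage example with more features than samples (X^T X is singular).
rng = np.random.default_rng(7)
X = rng.normal(size=(10, 20))
y = rng.normal(size=10)
print(ridge_fit(X, y, lam=1.0))   # still a unique, finite solution
```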

Lasso Regression Objective: w_lasso = argmin_w ‖Xw − y‖^2 + λ‖w‖_1, the reconstruction error plus the lasso (L1) penalty; there is no closed-form solution. Lasso regularization encourages sparse solutions. Equivalent to the following constrained optimization problem: minimize ‖Xw − y‖^2 subject to ‖w‖_1 ≤ B. Geometrically, the constraint region is the L1 ball, whose corners lie on the coordinate axes. Why? How can we find the solution?
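
One standard way to solve the lasso problem, not necessarily the method covered in lecture, is proximal gradient descent (ISTA): each step takes a gradient step on the smooth squared-error term and then applies soft-thresholding, the proximal operator of the L1 penalty. A minimal sketch, with the step size, iteration count, and toy data as assumptions:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=5000):
    """Minimize ||Xw - y||^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
    step = 0.5 / np.linalg.norm(X, 2) ** 2     # 1/L, L = Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ w - y)           # gradient of the squared-error term
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Toy sparse problem: only 2 of 10 coordinates of the true w are nonzero.
rng = np.random.default_rng(8)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[0], w_true[3] = 3.0, -2.0
y = X @ w_true + rng.normal(0, 0.1, size=100)
print(np.round(lasso_ista(X, y, lam=5.0), 2))  # most coordinates driven exactly to 0
```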

What About Optimality? Linear regression (and its variants) is great, but what can we say about the best possible estimate? Can we construct an estimator for real-valued outputs that parallels the Bayes classifier for discrete outputs?

Optimal L2 Regressor. Best possible regression estimate at x: f*(x) = E[Y | X = x]. Theorem: for any regression estimate g(x), E[(g(X) − Y)^2] ≥ E[(f*(X) − Y)^2]. Similar to the Bayes classifier, but for regression. The proof is straightforward.

Proof. Consider the L2 error of g(x): E[(g(X) − Y)^2] = E[(g(X) − f*(X) + f*(X) − Y)^2] = E[(g(X) − f*(X))^2] + 2 E[(g(X) − f*(X))(f*(X) − Y)] + E[(f*(X) − Y)^2]. Why does the cross term vanish? Conditioning on X, E[(g(X) − f*(X))(f*(X) − Y) | X] = (g(X) − f*(X))(f*(X) − E[Y | X]) = 0. Therefore E[(g(X) − Y)^2] = E[(g(X) − f*(X))^2] + E[(f*(X) − Y)^2], which is minimized when g(x) = f*(x)!
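
A quick Monte Carlo illustration of the theorem; the data-generating process, with f*(x) = sin(x) and noise variance 0.25, is an assumption made just for this demo.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(-2, 2, size=100_000)
Y = np.sin(X) + rng.normal(0, 0.5, size=X.size)   # so f*(x) = E[Y | X = x] = sin(x)

def risk(g):
    """Monte Carlo estimate of the expected squared error of estimator g."""
    return np.mean((g(X) - Y) ** 2)

print(risk(np.sin))           # ~0.25, the noise variance: the best achievable
print(risk(lambda x: x))      # larger: any other estimate has higher L2 error
print(risk(lambda x: 0 * x))  # also larger
```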

Non-parametric Regression. Linear regression (and its variants) is great, but what if we don't know a parametric form of the relationship between the independent and dependent variables? How can we predict the value at a new test point x without model assumptions? Idea: ŷ = f(x) = the average of the observed y values in a local neighborhood of x!

Kernel Regression. We want weights that emphasize local observations. Consider example localization functions: the box kernel, the Gaussian kernel, the triangle kernel. Then define the weighted average ŷ = f(x) = Σ_i K((x − x_i)/h) y_i / Σ_j K((x − x_j)/h), where the bandwidth h controls the width of the neighborhood.
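
A minimal Nadaraya-Watson style sketch with a Gaussian kernel; the bandwidth h and the toy sine data are illustrative choices.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2)

def kernel_regression(x_query, x_train, y_train, h):
    """Weighted average of the observed y values, with weights K((x - x_i)/h)
    that emphasize observations near the query point."""
    weights = gaussian_kernel((x_query - x_train) / h)
    return np.sum(weights * y_train) / np.sum(weights)

# Toy data: noisy sine curve, no parametric model assumed.
rng = np.random.default_rng(5)
x_train = np.sort(rng.uniform(0, 2 * np.pi, size=200))
y_train = np.sin(x_train) + rng.normal(0, 0.2, size=200)
print(kernel_regression(np.pi / 2, x_train, y_train, h=0.3))   # close to sin(pi/2) = 1
```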

Consistency Theorem. Recall the best possible regression estimate at x: f*(x) = E[Y | X = x]. Theorem: as n → ∞ and the bandwidth h → 0 slowly enough (so that each neighborhood still contains more and more samples, e.g., n·h^d → ∞), E[(f_n(X) − f*(X))^2] → 0, where f_n is the kernel regressor, for most localization kernels. The proof is a bit tedious.

Proof Sketch. Prove the result for a fixed x and then integrate over x (just like before). Use the bias-variance decomposition: the error splits into the squared bias of f_n(x) plus the variance of f_n(x). The squared bias shrinks as the bandwidth h → 0, while the variance shrinks as the effective number of samples in the neighborhood grows. Pick h to balance the squared bias and variance terms.

Kernel Regression. Advantages: does not assume any parametric form of the regression function; kernel regression is consistent. Disadvantages: evaluation time complexity is O(dn) per test point, and we need to keep all the data around! How can we address the shortcomings of kernel regression?

k-d Trees: Speeding Up Nonparametric Regression. k-d trees to the rescue! Idea: partition the data into cells organized in a tree-based hierarchy (just like before). To return an estimate for a query point, return the average y value of the training data in its cell!
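
As a rough stand-in for the cell-average idea, the sketch below uses SciPy's k-d tree to find nearby training points quickly and averages their y values; averaging the k nearest neighbors rather than a fixed tree cell is my simplification, and the toy data and k are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(6)
x_train = rng.uniform(0, 2 * np.pi, size=(5000, 1))
y_train = np.sin(x_train[:, 0]) + rng.normal(0, 0.2, size=5000)

tree = cKDTree(x_train)                  # build once: roughly O(n log n)

def kd_regress(x_query, k=25):
    """Average the y values of the k nearest training points, found via the tree."""
    _, idx = tree.query([x_query], k=k)
    return y_train[idx].mean()

print(kd_regress(np.pi / 2))             # close to sin(pi/2) = 1
```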

What We Learned. Linear regression. Parametric vs non-parametric regression. Logistic regression for classification. Ridge and Lasso regression. Kernel regression. Consistency of kernel regression. Speeding up non-parametric regression with trees.

Questions?

Next time: Statistical Theory of Learning!