COMS 4771 Introduction to Machine Learning. James McInerney Adapted from slides by Nakul Verma


Announcements. HW1: please submit as a group; watch out for zero-variance features (Q5). HW2 will be released soon.

Last time: Support Vector Machines, the Maximum Margin formulation, Constrained Optimization, Lagrange Duality Theory, Convex Optimization, the SVM dual and its interpretation, and how to get the optimal solution.

Learning more Sophisticated Outputs. So far we have focused on classification, f : X → {1, …, k}. What about other outputs? PM 2.5 (particulate matter pollutant) exposure estimate: Input: # cars, temperature, etc. Output: 50 ppb. Pose estimation. Sentence structure estimate.

Regression. We'll focus on problems with real-number outputs (regression problems). Example: next eruption time of the Old Faithful geyser (at Yellowstone).

Regression Formulation for the Example. Given x, we want to predict an estimate ŷ of y which minimizes the discrepancy (loss L) between ŷ and y, e.g., absolute error loss or squared error loss. A linear predictor f can be defined by the slope w and the intercept w_0, chosen to minimize the prediction loss. How is this different from classification?
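For concreteness, the two losses and the linear predictor referenced above are (standard definitions, written out since the slide shows them only as formulas in the figure):

Absolute error:  L(ŷ, y) = |ŷ − y|
Squared error:   L(ŷ, y) = (ŷ − y)²
Linear predictor (1-d): f(x) = w x + w_0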

Parametric vs Non-parametric Regression. If we assume a particular form of the regressor: parametric regression. Goal: learn the parameters which yield the minimum error/loss. If no specific form of the regressor is assumed: non-parametric regression. Goal: learn, directly from the input data, the predictor that yields the minimum error/loss.

Linear Regression. We want to find a linear predictor f, i.e., a weight vector w (the intercept w_0 absorbed via lifting), which minimizes the prediction loss over the population. Since the population is unavailable, we estimate the parameters by minimizing the corresponding loss on the training data. Geometrically, for squared error, this is the sum of squared vertical distances between the fit and the observations.
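In symbols (standard empirical-risk-minimization notation; the symbols R and R̂ are mine, not from the slides):

Predictor:        f(x) = w·x
Population risk:  R(w) = E_(x,y) [ (w·x − y)² ]
Empirical risk:   R̂(w) = (1/n) Σ_{i=1}^n (w·x_i − y_i)²,   ŵ = argmin_w R̂(w)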

Linear Regression: Learning the Parameters. Linear predictor with squared loss: minimize over w

Σ_{i=1}^n (x_i^T w − y_i)² = ||Xw − y||²,

where X is the n×d matrix with rows x_1, …, x_n and y = (y_1, …, y_n). This is an unconstrained problem! We can take the gradient and examine the stationary points. Why do we not need to check the second-order conditions? (The objective is convex in w.)

Linear Regression: Learning the Parameters. Best-fitting w: at a stationary point the gradient vanishes, giving the normal equations X^T X w = X^T y, so w_ols = (X^T X)^{-1} X^T y (the pseudo-inverse of X applied to y). This is also called Ordinary Least Squares (OLS). The solution is unique and stable when X^T X is invertible. What is the interpretation of this solution?
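A minimal NumPy sketch of the OLS solution (illustrative only; the data and variable names here are my own, not from the lecture):

```python
import numpy as np

def ols_fit(X, y):
    """Ordinary least squares: solve the normal equations X^T X w = X^T y.

    X: (n, d) design matrix (intercept absorbed as a column of ones),
    y: (n,) targets. lstsq falls back to the pseudo-inverse when
    X^T X is not invertible.
    """
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Example: recover y ≈ 2*x + 1 from noisy 1-d data.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
X = np.column_stack([x, np.ones_like(x)])   # lifting: append an intercept column
y = 2 * x + 1 + rng.normal(0, 0.5, size=100)
print(ols_fit(X, y))                        # approximately [2.0, 1.0]
```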

Linear Regression: Geometric Viewpoint. Consider the column-space view of the data matrix X. We want a w such that the linear combination Xw of the columns of X minimizes ||Xw − y||². Let ŷ = Xw_ols be the OLS solution; then ŷ = X(X^T X)^{-1} X^T y = Py, where P is the projection matrix onto the column space of X, and y − ŷ is the residual, orthogonal to that space. Thus ŷ is the orthogonal projection of y onto the column space of X, and w_ols gives the coefficients of ŷ = Py in terms of the columns of X.

Linear Regression: Statistical Modeling View. Let's assume that the data is generated from the following process: an example x_i is drawn independently from the data space X; y_clean is computed as w·x_i, from a fixed unknown w; y_i is obtained by corrupting y_clean with independent Gaussian noise N(0, σ²); (x_i, y_i) is revealed as the i-th sample.
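Written out, this generative process is (the standard Gaussian noise model, with the densities filled in for reference):

y_i = w·x_i + ε_i,   ε_i ~ N(0, σ²) independently
⟹  p(y_i | x_i, w) = (1 / √(2πσ²)) exp( −(y_i − w·x_i)² / (2σ²) )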

Linear Regression: Statistical Modeling View. How can we determine w from the Gaussian-noise-corrupted observations? Observation: each y_i is Gaussian with mean x_i^T w, so w enters as a parameter of a Gaussian. How do we estimate the parameters of a Gaussian? Let's try Maximum Likelihood Estimation! The log-likelihood is

Σ_{i=1}^n log p(y_i | w, x_i) = − Σ_{i=1}^n (y_i − x_i^T w)² / (2σ²) + constant.

Ignoring terms independent of w, optimizing for w yields the same OLS result! What happens if we model each y_i with independent noise of different variance?

Linear Regression for Classification? Linear regression seems general; can we use it to derive a binary classifier? Let's study 1-d data with labels Y = 0 and Y = 1 plotted against X. Problem #1: where is the real-valued y for regression? (Use the 0/1 labels as y.) Problem #2: the relationship is not really linear! Perhaps it is linear in some transformed coordinates?

Linear Regression for Classification. A sigmoid is a better model for 0/1 labels than a straight line. Binary predictor: threshold the sigmoid output. Interpretation: for an event that occurs with probability P, the odds of that event are P/(1 − P); consider the log of the odds. For an event with P = 0.9, odds = 9, but for an event with P = 0.1, odds ≈ 0.11 (very asymmetric). The log-odds are about +2.2 and −2.2 respectively: symmetric!

Logistic Regression. Model the log-odds (logit) with a linear function; inverting it gives the sigmoid! OK, we have a model, how do we learn the parameters?
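Concretely (the standard logistic model, written out since the slide shows it only graphically):

log [ p(x) / (1 − p(x)) ] = w·x,   where p(x) = P(Y = 1 | x)
⟺  p(x) = σ(w·x) = 1 / (1 + e^{−w·x})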

Logistic Regression: Learning Parameters. Given samples (x_i, y_i) with binary y_i ∈ {0, 1} (Bernoulli/Binomial likelihood), the log-likelihood is

Σ_{i=1}^n [ y_i log p(x_i) + (1 − y_i) log(1 − p(x_i)) ]
  = Σ_{i=1}^n log(1 − p(x_i)) + Σ_{i=1}^n y_i log [ p(x_i) / (1 − p(x_i)) ]
  = − Σ_{i=1}^n log(1 + e^{w·x_i}) + Σ_{i=1}^n y_i (w·x_i),

where the last step uses the logistic model p(x) = 1 / (1 + e^{−w·x}). We can take the derivative and analyze stationary points; unfortunately there is no closed-form solution (use iterative methods like gradient descent to find the solution).
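A small NumPy sketch of fitting w by gradient ascent on this log-likelihood (illustrative only; the step size, iteration count, and toy data are arbitrary choices, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """Maximize sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)] by gradient ascent.

    The gradient of the log-likelihood is X^T (y - p), with p = sigmoid(X w).
    X: (n, d) design matrix (intercept column included), y: (n,) in {0, 1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        p = sigmoid(X @ w)
        grad = X.T @ (y - p)     # ascent direction
        w += lr * grad / n       # averaged gradient step
    return w

# Tiny usage example with separable 1-d data (second column is the intercept).
X = np.array([[0.5, 1.0], [1.0, 1.0], [3.0, 1.0], [4.0, 1.0]])
y = np.array([0, 0, 1, 1])
w = fit_logistic(X, y)
print(sigmoid(X @ w))            # probabilities close to [0, 0, 1, 1]
```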

Linear Regression: Other Variations. Back to ordinary least squares (OLS): how can we additionally incorporate prior knowledge? OLS is often poorly behaved when X^T X is not invertible. Perhaps we want w to be sparse: Lasso regression. Perhaps we want a simple (small-norm) w: Ridge regression.

Ridge Regression Objective: minimize over w the reconstruction error plus an L2 penalty, ||Xw − y||² + λ||w||², where λ ≥ 0 is the regularization parameter. The regularization helps avoid overfitting and always results in a unique solution. It is equivalent to the constrained optimization problem of minimizing ||Xw − y||² subject to ||w||² ≤ B for some budget B depending on λ. Geometrically: Why?
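A minimal NumPy sketch of the ridge solution, using the standard closed form w = (X^T X + λI)^{-1} X^T y (shown for illustration; lam stands for λ):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Ridge regression: w = (X^T X + lam * I)^{-1} X^T y.

    Adding lam * I makes the matrix invertible even when X^T X is singular,
    which is why ridge always has a unique solution.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```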

Lasso Regression Objective: minimize over w the reconstruction error plus an L1 (lasso) penalty, ||Xw − y||² + λ||w||₁; there is no closed-form solution. Lasso regularization encourages sparse solutions. It is equivalent to the constrained problem of minimizing ||Xw − y||² subject to ||w||₁ ≤ B. Geometrically: Why? How can we find the solution?
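One common way to solve it is proximal gradient descent (ISTA) with soft-thresholding; this is a generic solver sketch under that choice, not necessarily the method covered in the course:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each coordinate toward 0 by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_fit(X, y, lam=1.0, n_iters=500):
    """ISTA for min_w ||Xw - y||^2 + lam * ||w||_1 (sketch)."""
    n, d = X.shape
    w = np.zeros(d)
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ w - y)               # gradient of the smooth part
        w = soft_threshold(w - step * grad, step * lam)
    return w
```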

What About Optimality? Linear regression (and its variants) is great, but what can we say about the best possible estimate? Can we construct an estimator for real-valued outputs that parallels the Bayes classifier for discrete outputs?

Optimal L2 Regressor. Best possible regression estimate at x: f*(x) = E[Y | X = x]. Theorem: for any regression estimate g(x), E[(g(X) − Y)²] ≥ E[(f*(X) − Y)²]. Similar to the Bayes classifier, but for regression. The proof is straightforward.

Proof. Consider the L2 error of g(x):

E[(g(X) − Y)²] = E[(g(X) − f*(X) + f*(X) − Y)²]
             = E[(g(X) − f*(X))²] + 2 E[(g(X) − f*(X))(f*(X) − Y)] + E[(f*(X) − Y)²].

Why does the cross term vanish? Conditioning on X, E[f*(X) − Y | X] = f*(X) − E[Y | X] = 0, so the cross term is 0. Therefore

E[(g(X) − Y)²] = E[(g(X) − f*(X))²] + E[(f*(X) − Y)²] ≥ E[(f*(X) − Y)²],

which is minimized when g(x) = f*(x)!

Non-parametric Regression. Linear regression (and its variants) is great, but what if we don't know the parametric form of the relationship between the independent and dependent variables? How can we predict the value at a new test point x without model assumptions? Idea: ŷ = f(x) = the average of the observed Y values in a local neighborhood of x.

Kernel Regression. We want weights that emphasize local observations. Consider example localization (kernel) functions: box kernel, Gaussian kernel, triangle kernel. Then define the prediction as a weighted average of the observed outputs:

ŷ = f(x) = Σ_i K(x, x_i) y_i / Σ_j K(x, x_j).
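A compact NumPy sketch of this weighted average with a Gaussian kernel (the bandwidth h and the toy data are illustrative choices of mine):

```python
import numpy as np

def kernel_regression(x_query, X_train, y_train, h=1.0):
    """Kernel regression (weighted average) with a Gaussian kernel.

    Prediction at x is sum_i K(x, x_i) y_i / sum_j K(x, x_j), where
    K(x, x_i) = exp(-||x - x_i||^2 / (2 h^2)). Cost is O(d n) per query,
    and all training data must be kept around.
    """
    dists = np.linalg.norm(X_train - x_query, axis=1)
    weights = np.exp(-dists ** 2 / (2 * h ** 2))
    return weights @ y_train / weights.sum()

# Usage: 1-d training data, predict at x = 2.5.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([1.0, 4.0, 9.0, 16.0])
print(kernel_regression(np.array([2.5]), X_train, y_train, h=0.5))
```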

Kernel Regression. Advantages: does not assume any parametric form of the regression function; kernel regression is consistent. Disadvantages: evaluation time complexity is O(dn) per query; need to keep all the data around!

What We Learned Linear Regression Parametric vs Nonparametric regression Logistic Regression for classification Ridge and Lasso Regression Kernel Regression

Questions?