Lecture 02 Linear classification methods I

Lecture 02 Linear classification methods I 22 January 2016 Taylor B. Arnold Yale Statistics STAT 365/665 1/32

Course website: A copy of the whole course syllabus, including a more detailed description of the topics I plan to cover, is on the course website. http://www.stat.yale.edu/~tba3/stat665/ This is where all lecture notes, homeworks, and other references will appear. The first problem set is now posted. 2/32

Some thoughts from the course survey

1. several people mentioned Matlab as an alternative to R or Python
2. equal interest in CV and NLP
3. requests for finance and biological applications
4. several requests for explaining methods for large, distributed computations (Hadoop, Spark, MPI)
5. a diverse set of backgrounds in statistics, computer science, and data analysis

3/32

Today, we will consider 1-dimensional non-parametric regression. Specifically, we will consider observing n pairs (x_i, y_i) such that:

$$y_i = g(x_i) + \epsilon_i$$

For an unknown function g and random variable ϵ_i, where the mean of the random variable is zero. You are free to think of the errors as being independent, identically distributed, and with finite variances. You may even assume that they are normally distributed.

4/32

Our goal is to estimate the function g. More specifically, we want to estimate the value of g at the input point x_new, denoted as ĝ(x_new). The value of x_new may or may not be in the original set; however, we assume that ϵ_new is an entirely different random sample. Next class I will cover more of the details about training, testing, validation, and other techniques for evaluating how well a predictive model is performing. Today we will just spend some time introducing several standard techniques and gaining some intuition about them.

5/32

The specific methods I will introduce are:

1. k-nearest neighbors (knn)
2. kernel smoothing (Nadaraya-Watson estimator)
3. linear regression (OLS)
4. local regression (LOESS)

6/32

Here is an example dataset that we will use to illustrate these four techniques:

[Scatter plot of the example dataset; x roughly in (0, 6), y roughly in (0, 4)]

7/32
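The plot itself is not reproduced here, but a dataset of this general shape is easy to simulate. The following is a minimal sketch, assuming a made-up true function g and noise level (not the ones used to produce the actual example); the arrays x and y are reused in the sketches below.

```python
import numpy as np

# A minimal simulation of a dataset like the example; the true function g
# and the noise level are hypothetical, not those used in the lecture.
rng = np.random.default_rng(0)

def g(x):
    return np.sin(2 * x) + x / 3          # hypothetical smooth "true" function

n = 100
x = rng.uniform(0, 6, size=n)             # inputs on roughly the plotted range
eps = rng.normal(scale=0.3, size=n)       # mean-zero noise
y = g(x) + eps                            # y_i = g(x_i) + eps_i
```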

k-nearest neighbors (knn)

One of the most straightforward estimators, knn simply sets ĝ(x_new) to be the average of the observed y_j values for the k points x_j closest to x_new.

8/32
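As a concrete illustration, here is a minimal sketch of this estimator in the 1-dimensional setting (not code from the course; knn_smoother is a hypothetical helper, and x, y are the simulated arrays from the sketch above):

```python
import numpy as np

def knn_smoother(x, y, x_new, k):
    """Average the y values of the k training points closest to x_new."""
    dist = np.abs(x - x_new)            # 1-d distances to the prediction point
    nearest = np.argsort(dist)[:k]      # indices of the k closest observations
    return y[nearest].mean()

# e.g. ghat = knn_smoother(x, y, x_new=2.5, k=10)
```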

[Plots: knn fits to the example dataset for several values of k] (slides 9/32 to 12/32)

k-nearest neighbors (knn)

How might you expect the optimal parameter k to change as n increases?

13/32

Kernel smoother

Rather than averaging the k closest points, kernel smoothers average observations in the dataset using weights that decrease with the distance from the prediction point.

14/32

Kernel smoothers

As an example, consider this estimator:

$$\hat{g}(x_{new}) = \frac{\sum_i y_i \, \phi(\|x_i - x_{new}\|_2^2)}{\sum_i \phi(\|x_i - x_{new}\|_2^2)}$$

where ϕ is defined as the density function of a standard normal distribution:

$$\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$$

15/32

Kernel smoothers

The function ϕ can be replaced with any other function that you would like to use. It is often replaced by a truncated variant, as observations more than a few standard deviations away do not have a noticeable impact on the result.

16/32

Kernel smoothers - bandwidth

What is the main tuning parameter for kernel smoothers? We need to modify our estimator slightly to include the bandwidth parameter h:

$$\hat{g}(x_{new}) = \frac{\sum_i y_i \, \phi(\|x_i - x_{new}\|_2^2 / h)}{\sum_i \phi(\|x_i - x_{new}\|_2^2 / h)}$$

What does the bandwidth control? Why might we think of the bandwidth as a standard deviation?

17/32
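A direct transcription of this estimator into code might look like the sketch below (again hypothetical, reusing the simulated x and y; the standard normal density is written out rather than imported):

```python
import numpy as np

def phi(z):
    """Density of a standard normal distribution."""
    return np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi)

def kernel_smoother(x, y, x_new, h):
    """Weighted average with weights phi(|x_i - x_new|^2 / h), as on the slide."""
    w = phi((x - x_new) ** 2 / h)       # squared distances, scaled by the bandwidth h
    return np.sum(w * y) / np.sum(w)

# Small h: only very close points carry weight (wiggly fit).
# Large h: the estimate approaches a global average (very smooth fit).
```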

[Plots: kernel smoother fits to the example dataset for several bandwidths] (slides 18/32 to 20/32)

Linear regression

Hopefully you have already seen linear regression in some context. Recall that here we assume the data are generated by a process that looks something like this:

$$y_i = x_{1,i}\beta_1 + x_{2,i}\beta_2 + \cdots + x_{p,i}\beta_p + \epsilon_i$$

For some random variable ϵ_i and fixed (but unknown) parameters β_j. The parameters β_j are most commonly estimated by finding the values that minimize the squared residuals (known as OLS, or ordinary least squares).

21/32

Linear regression

In our one-dimensional case, even including an intercept, this does not seem very interesting, as we have only two parameters to use to fit the data:

$$y_i = \beta_0 + x_i \beta_1 + \epsilon_i$$

22/32

[Plot: ordinary least squares fit to the example dataset] 23/32

Linear regression

Why might ordinary least squares be useful in non-parametric regression? Basis functions! We can expand the model to include non-linear terms. For example, we can write y with a power basis:

$$y_i = \beta_0 + \sum_{j=1}^{p} x_i^j \beta_j + \epsilon_i$$

Or with a Fourier basis:

$$y_i = \beta_0 + \sum_{j=1}^{p} \left[ \sin(x_i j)\, \beta_{2j} + \cos(x_i j)\, \beta_{2j+1} \right] + \epsilon_i$$

24/32

Linear regression

With enough terms, this basis expansion can approximate nearly all functional forms of g.

25/32
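For instance, a power-basis fit reduces to ordinary least squares on an expanded design matrix. A minimal sketch (hypothetical helper names, using numpy's built-in least squares solver):

```python
import numpy as np

def power_basis_fit(x, y, p):
    """OLS fit of y on the power basis 1, x, x^2, ..., x^p."""
    X = np.vander(x, N=p + 1, increasing=True)     # columns x^0, x^1, ..., x^p
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def power_basis_predict(beta, x_new):
    """Evaluate the fitted basis expansion at new points."""
    p = len(beta) - 1
    X_new = np.vander(np.atleast_1d(x_new), N=p + 1, increasing=True)
    return X_new @ beta
```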

Local regression (LOESS)

Local regression combines the ideas of kernel smoothers and linear regression. A separate linear regression is fit at each input point x_new at which ĝ is to be evaluated. The regression uses sample weights based on how far each sample is from x_new. One can use linear regression or higher-order polynomials; typically at most cubic functions are used.

26/32
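A stripped-down version of this idea, using Gaussian sample weights and a degree-one (linear) local fit, is sketched below. Real LOESS implementations differ in details such as the kernel (tricube) and the nearest-neighbor span, so treat this only as a sketch of the weighted-regression step, reusing the simulated x and y:

```python
import numpy as np

def local_linear_fit(x, y, x_new, h):
    """Weighted least squares line around x_new, evaluated at x_new."""
    w = np.exp(-((x - x_new) ** 2) / (2 * h ** 2))    # Gaussian sample weights
    X = np.column_stack([np.ones_like(x), x])         # intercept and slope columns
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # weighted normal equations
    return beta[0] + beta[1] * x_new
```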

[Plots: local regression fits to the example dataset] (slides 27/32 to 29/32)

Linear smoothers

How are these four techniques all related? All of them can be written as:

$$\hat{y}_{new} = \sum_i w_i y_i$$

where the weights w_i depend only on the values x_i (and x_new) and not on the values of y_i. Anything that can be written in this form is called a linear smoother.

30/32

Linear smoothers, cont.

For example, in the case of knn, the weights are 1/k for the k closest values and 0 for the remainder. We have already seen the weights for the kernel smoother. The weights for ordinary least squares are given by our derived ordinary least squares estimator:

$$w = x_{new} (X^t X)^{-1} X^t$$

And LOWESS is simply a sample-weighted variant of this.

31/32
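As a quick numerical check (reusing the hypothetical simulated x and y and an arbitrary x_new), the OLS prediction can be computed either from the fitted coefficients or directly from these weights; both routes give the same number:

```python
import numpy as np

X = np.column_stack([np.ones_like(x), x])        # design matrix with an intercept
x_new_row = np.array([1.0, 2.5])                 # [1, x_new] for a hypothetical x_new = 2.5

# Smoother weights w = x_new (X^t X)^{-1} X^t; they depend only on the x values.
w = x_new_row @ np.linalg.solve(X.T @ X, X.T)

beta = np.linalg.solve(X.T @ X, X.T @ y)         # the usual OLS coefficients
assert np.isclose(w @ y, x_new_row @ beta)       # same prediction either way
```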

Still to come

- computational issues of computing them
- how to pick tuning parameters and choose amongst them
- what happens when we have higher dimensional spaces

32/32