COS 424: Interacting with Data

Lecturer: Rob Schapire          Lecture #14
Scribe: Zia Khan                April 3, 2007

Recall from the previous lecture that in regression we are trying to predict a real value given our data. Specifically, in the New Jersey school district example discussed in class, we looked at a linear model $\hat{f}(x_i) = w x_i$, where $x_i$ is the enrollment in district $i$, and we tried to predict the budget $y_i$ for district $i$ based on $w$, the money spent per student. In general, we would like to handle multiple dimensions or fit a more complex model to the data.

1 Linear Regression

To generalize our original model we start with a vector $x \in \mathbb{R}^n$, where $n$ is the dimensionality of the data. We use the notation $x_i(j)$ to denote the $j$th component of the $i$th vector $x_i$. The components of $x_i = \langle x_i(1), \ldots, x_i(n) \rangle$ are referred to as variables, predictors, features, or attributes, depending on the data we are interested in modelling. In general, the linear model looks as follows:

    $\hat{f}(x) = w_1 x(1) + w_2 x(2) + \cdots + w_n x(n) = w \cdot x$

where $n$ is the dimensionality of the data. To fit our model we want to minimize

    $\sum_{i=1}^m \big( \hat{f}(x_i) - y_i \big)^2 = \sum_{i=1}^m (w \cdot x_i - y_i)^2$

where $m$ is the number of data points. The process of fitting this model is called linear regression.

If we look back at the New Jersey school district data, we might want to model $w_0$, the fixed cost for a district, as well as $w_1$, the cost per student. Consequently, we should fit the model $\hat{f}(x_i) = w_0 + w_1 x_i$ to our data set, where $x_i$ is the enrollment in district $i$. To fit this model we use a trick: we replace each data point with the 2-dimensional vector $x_i \mapsto \langle 1, x_i \rangle$. It is easy to see that when we add 1 as the first component of each vector $x_i$, we obtain the model above. When we fit the model to the data, we obtain $w_0 = 3{,}540{,}476$ for the fixed cost and $w_1 = 12{,}054$ for the cost per student. The fixed cost number makes little sense. We can fix this by allowing for larger variance in the cost for school districts, and we then obtain $w_0 = 99{,}138$ for the fixed cost and $w_1 = 12{,}054$ for the cost per student. This illustrates one of the many ways we can adjust our data so that we obtain a model that is a better fit.

We can take this further and consider a model that accounts for the higher cost of special education students,

    $\hat{f}(x_i) = w_0 + w_1 x_i(1) + w_2 x_i(2)$

where $x_i(1)$ is the number of regular students and $x_i(2)$ is the number of special education students in district $i$. When we fit this model we again prepend a one to each vector, $\langle x_i(1), x_i(2) \rangle \mapsto \langle 1, x_i(1), x_i(2) \rangle$, to account for the fixed cost $w_0$. With this higher-dimensional model we get $w_0 = 154{,}192$ for the fixed cost, $w_1 = 8{,}495$ for the cost per regular student, and a larger $w_2 = 35{,}288$ cost per special education student.
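The notes contain no code, but the intercept trick above is easy to demonstrate. The following sketch (the enrollment and budget numbers are made up for illustration, not the actual New Jersey data, and NumPy is assumed) prepends a 1 to each data point and fits $\langle w_0, w_1 \rangle$ by least squares.

    import numpy as np

    # Hypothetical enrollment (x_i) and budget (y_i) figures; not the actual NJ data.
    enrollment = np.array([500.0, 1200.0, 3000.0, 4500.0, 8000.0])
    budget = np.array([6.2e6, 14.9e6, 36.5e6, 54.7e6, 96.8e6])

    # The intercept trick: replace each x_i with the 2-dimensional vector <1, x_i>.
    X = np.column_stack([np.ones_like(enrollment), enrollment])

    # Least-squares fit of f(x) = w0 + w1 * x.
    (w0, w1), *_ = np.linalg.lstsq(X, budget, rcond=None)
    print(f"fixed cost w0 = {w0:,.0f}, cost per student w1 = {w1:,.0f}")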

2 Linear Model

Now we consider the general problem of fitting a weight vector $w = (w_1, \ldots, w_n)^\top$ to data. To do so we define a matrix $X$ with $m$ rows and $n$ columns, where $m$ is the number of data points and $n$ is the dimensionality of the data; the $i$th row of $X$ is the data vector $x_i$. We also define the vector $y = (y_1, \ldots, y_m)^\top$, where $y_i$ is the observed value associated with the $i$th data vector $x_i$. With these definitions we can rephrase the minimization in terms of matrices and vectors:

    $Xw - y = \begin{pmatrix} w \cdot x_1 - y_1 \\ \vdots \\ w \cdot x_m - y_m \end{pmatrix}$

The Euclidean length squared, or squared $L_2$-norm, of the vector on the right-hand side is just our original objective for linear regression:

    $\sum_{i=1}^m (w \cdot x_i - y_i)^2 = \|Xw - y\|_2^2$    (1)

We can minimize this objective by multiplying out the norm,

    $\min_w \, (Xw - y)^\top (Xw - y) = \min_w \, \left( w^\top X^\top X w - 2 w^\top X^\top y + y^\top y \right),$

computing the gradient with respect to $w$,

    $\nabla_w = 2 X^\top X w - 2 X^\top y,$

and setting it equal to zero. We can then solve for $w$:

    $X^\top X w = X^\top y$
    $w = (X^\top X)^{-1} X^\top y$

The $n$-by-$m$ matrix $(X^\top X)^{-1} X^\top$ is referred to as the pseudo-inverse. Of course, there is no guarantee that the pseudo-inverse will exist. It exists only when $X^\top X$ is non-singular, and in that case the solution $w$ is unique. However, even when $X^\top X$ is singular, there are techniques for computing the minimum of equation (1).
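As a small illustration (not part of the original notes; synthetic data, NumPy assumed), the sketch below computes $w$ both from the normal equations and with a least-squares solver; the two agree whenever $X^\top X$ is non-singular.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 100, 3
    X = rng.normal(size=(m, n))                  # m data points in n dimensions
    w_true = np.array([2.0, -1.0, 0.5])
    y = X @ w_true + 0.1 * rng.normal(size=m)    # noisy targets

    # Normal equations: solve (X^T X) w = X^T y (assumes X^T X is non-singular).
    w_normal = np.linalg.solve(X.T @ X, X.T @ y)

    # Equivalent but numerically safer: a least-squares solver.
    w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(np.allclose(w_normal, w_lstsq))        # expect True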

It is not all that limiting to use just a linear model; many problems can be reduced to the linear case. In this lecture we consider two: polynomial models and linear splines.

2.1 Fitting Polynomials

We can use linear models to fit polynomial models. In Figure 1, we fit a cubic model

    $\hat{f}(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3$

by mapping each data point to the vector $x_i \mapsto \langle 1, x_i, x_i^2, x_i^3 \rangle$ and fitting a linear model as described above.

Figure 1: We can use linear regression to do polynomial regression. To fit a cubic polynomial, each data point $x_i$ is mapped to the vector $\langle 1, x_i, x_i^2, x_i^3 \rangle$.

If the data is in $n > 1$ dimensions and we want to fit a polynomial, we use a similar mapping. For instance, for $n = 2$ we can fit a quadratic

    $\hat{f}(x_1, x_2) = a + b x_1 + c x_2 + d x_1 x_2 + e x_1^2 + f x_2^2$

using the mapping $\langle x_1, x_2 \rangle \mapsto \langle 1, x_1, x_2, x_1 x_2, x_1^2, x_2^2 \rangle$. In general, if we start with $n$ dimensions and want to fit a degree-$k$ polynomial, we will need to map our data up to $O(n^k)$ dimensions. As with SVMs, we can avoid explicitly mapping data vectors into higher dimensions by using the kernel trick, where we replace an inner product $x_i \cdot x_j$ with a kernel function $K(x_i, x_j)$. For the polynomial model we would use the kernel $K(x_i, x_j) = (1 + x_i \cdot x_j)^k$. To obtain a procedure called kernel regression, our original objective $\min_w \sum_{i=1}^m (w \cdot x_i - y_i)^2$ must be rewritten in terms of inner products. Of course, we face the same problem in kernel regression as we do with SVMs: as we increase the dimensionality of the data points, we require more data.
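Purely as an illustrative sketch (synthetic data, NumPy assumed, not from the lecture), here is the cubic case done with the explicit mapping: each $x_i$ is mapped to $\langle 1, x_i, x_i^2, x_i^3 \rangle$ and the linear model is fit exactly as before.

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(-2.0, 2.0, 50)
    y = 1.0 - 0.5 * x + 0.3 * x**3 + 0.2 * rng.normal(size=x.size)  # synthetic cubic data

    # Map each x_i to <1, x_i, x_i^2, x_i^3> and fit the linear model as before.
    Phi = np.column_stack([np.ones_like(x), x, x**2, x**3])
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

    # Predict at a new point by applying the same mapping.
    x_new = 1.5
    y_hat = w @ np.array([1.0, x_new, x_new**2, x_new**3])
    print(w, y_hat)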

2.2 Linear Splines

Linear splines offer another example of how to reduce a problem to linear regression. Here we are trying to fit a piece-wise linear function to the data (see Figure 2). For our discussion, we assume the knots, $k_1$ and $k_2$ in the example in Figure 2, are known in advance. We could imagine splitting the data points into three groups using the given knots $k_1$ and $k_2$ and solving three separate regression problems, but this does not assure continuity. If we want the curve to be continuous, we can reduce the problem to straight linear regression by noticing that any linear spline is a simple linear combination of basis functions.

In the example in Figure 3, given knots $k_1$ and $k_2$, we can build up a piece-wise linear function step by step. We first start with a linear function

    $\hat{f}(x) = a + bx$

which accounts for points before the first knot $k_1$ (the green line in Figure 3). Next, we add a second line, the blue line in Figure 3, that is zero up until the first knot $k_1$:

    $\hat{f}(x) = a + bx + c(x - k_1)_+$

where

    $(z)_+ = \begin{cases} z & \text{if } z \ge 0 \\ 0 & \text{if } z < 0 \end{cases}$

Last, we add a third line, the purple line in Figure 3, that is zero up until the second knot $k_2$:

    $\hat{f}(x) = a + bx + c(x - k_1)_+ + d(x - k_2)_+$

The coefficients of this function can be found by mapping each data point to the vector $\langle 1, x, (x - k_1)_+, (x - k_2)_+ \rangle$ and solving the standard linear regression problem. This approach generalizes to higher-order splines, can incorporate additional constraints, and generalizes to higher dimensions.

Figure 2: The aim of linear splines is to fit a piece-wise linear function such as the one shown in red. We can imagine splitting the data points into three groups using the given knots $k_1$ and $k_2$ and solving three separate regression problems; this, however, does not assure continuity.

Figure 3: Any linear spline can be written as a simple combination of linear basis functions. To fit linear splines given the knots $k_1$ and $k_2$, we map each data point $x_i$ to the vector $\langle 1, x_i, (x_i - k_1)_+, (x_i - k_2)_+ \rangle$ and solve the standard linear regression problem.
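The basis-function construction translates directly into code. The following sketch (illustrative only; the knots, data, and coefficients are made up, and NumPy is assumed) maps each point to $\langle 1, x, (x - k_1)_+, (x - k_2)_+ \rangle$ and solves the standard linear regression problem.

    import numpy as np

    def plus(z):
        # (z)_+ = z if z >= 0, and 0 otherwise.
        return np.maximum(z, 0.0)

    rng = np.random.default_rng(2)
    x = np.sort(rng.uniform(0.0, 10.0, 80))
    k1, k2 = 3.0, 7.0                            # knots, assumed known in advance
    y = (1.0 + 0.5 * x + 2.0 * plus(x - k1)
         - 3.0 * plus(x - k2) + 0.3 * rng.normal(size=x.size))

    # Map each x_i to <1, x_i, (x_i - k1)_+, (x_i - k2)_+> and fit by least squares.
    B = np.column_stack([np.ones_like(x), x, plus(x - k1), plus(x - k2)])
    a, b, c, d = np.linalg.lstsq(B, y, rcond=None)[0]
    print(a, b, c, d)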

3 Overfitting

3.1 Feature Selection

As we increase the dimensionality of our data, we run into the problem of overfitting, which is at least as bad in regression as it is in classification. The aim of feature selection is to find a subset of the features/dimensions/variables that is best, in some sense, for our data. One can try to enumerate all $2^n$ subsets of features, but for a large number of features this is computationally expensive. Often one instead uses a greedy search method, in which features are added one by one to improve the objective, or removed one by one to improve the objective.

3.2 Regularization

Another approach is to blame overfitting on weights that are too large: with large weight vectors it is easier to overfit our data. The idea is therefore to constrain the weight vector in some way. Regularization refers to any technique that adds a penalty term involving the weight vector; shrinkage refers to any technique that shrinks the weights in some way. One example of a regularization technique is ridge regression, in which we optimize the objective

    $\min_w \sum_{i=1}^m (w \cdot x_i - y_i)^2 + \lambda \|w\|_2^2$

where the smoothing constant $\lambda$ is chosen in advance. Larger values of $\lambda$ tend to smooth the fitted curve by penalizing large weight values through the $L_2$ norm of the weight vector $w$. Finding the weight vector $w$ for ridge regression is not hard. There is a closed-form solution in which we modify the pseudo-inverse,

    $w = (X^\top X + \lambda I)^{-1} X^\top y$

where $I$ is the $n$-by-$n$ identity matrix.
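As a final sketch (again not part of the notes; synthetic data, NumPy assumed), the ridge solution can be computed directly from the modified pseudo-inverse above, and increasing $\lambda$ shrinks the weight vector relative to the unregularized fit.

    import numpy as np

    rng = np.random.default_rng(3)
    m, n = 50, 20
    X = rng.normal(size=(m, n))
    y = X @ rng.normal(size=n) + 0.5 * rng.normal(size=m)

    lam = 10.0                                   # smoothing constant, chosen in advance
    I = np.eye(n)                                # n-by-n identity matrix
    w_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)

    # Compare with the unregularized least-squares fit: ridge shrinks the weights.
    w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    print(np.linalg.norm(w_ridge), np.linalg.norm(w_ols))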