Last Lecture Recap. UVA CS / Introduc8on to Machine Learning and Data Mining. Lecture 6: Regression Models with Regulariza8on

Size: px

Start display at page:

Download "Last Lecture Recap. UVA CS / Introduc8on to Machine Learning and Data Mining. Lecture 6: Regression Models with Regulariza8on"

Corey Daniels
5 years ago
Views:

1 UVA CS 45 - / 65 7 Introduc8on to Machine Learning and Data Mining Lecture 6: Regression Models with Regulariza8on Yanun Qi / Jane University of Virginia Department of Computer Science Last Lecture Recap Linear model is an approximajon Three ways to moving beyond linearity LR with non- linear basis funcjons Locally weighted linear regression Regression trees and MulJlinear InterpolaJon (later)

g. polynomial regression Issue: Overfi]ng OR underfi]ng y =θ + θx y 5 = θ + θx + θx = = y θ x Generalisation: learn funcjon / hypothesis from past data

2 () LR with non- linear basis funcjons LR does not mean we can only deal with linear relajonships T y = θ θ + ( x) ( x) = φ = θ φ We are free to design (non- linear) features (e.g., basis funcjon derived) under LR m where the φ (x) are fixed basis funcjons (also define φ (x) = ). E.g.: polynomial regression: 3 [, x, x x ] φ( x ) : =, 3 () LR With basis funcjons e.g. polynomial regression Issue: Overfi]ng OR underfi]ng y =θ + θx y 5 = θ + θx + θx = = y θ x Generalisation: learn funcjon / hypothesis from past data in order to explain, predict, model or control new data examples K- fold Cross ValidaJon!!!! 4

3 () Locally weighted linear regression The algorithm: Instead of minimizing now we fit θ to minimize Where do w i 's come from? n T J ( θ ) = ( xi θ yi ) i= n T J ( θ ) = w i ( xi θ yi ) i= ( x ) exp i x τ w i = where x is the query point for which we'd like to know its corresponding y à EssenJally we put higher weights on (errors on) training examples that are close to the query point (than those that are further away from the query) 5 Parametric vs. non- parametric Locally weighted linear regression is a non- parametric algorithm. The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm because it has a fixed, finite number of parameters (the \theta), which are fit to the data; Once we've fit the \theta and stored them away, we no longer need to keep the training data around to make future predicjons. In contrast, to make predicjons using locally weighted linear regression, we need to keep the enjre training set around. The term "non- parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis grows with linearly the size of the training set. 6 3

Today q A bit more about Linear Regression Extension q Linear regression with predefined RBF basis q Locally weighted regression q An Exemplar ApplicaJon of Regression q

4 Today q A bit more about Linear Regression Extension q Linear regression with predefined RBF basis q Locally weighted regression q An Exemplar ApplicaJon of Regression q Linear Regression Models with RegularizaJon 7 () Linear regression with predefined RBF basis funcjons ϕ(x) := [, k(x,,), k(x,,), k(x, 4,) ] Dr. Nando de Freitas s tutorial slide 8 4

5 Issue: Choices of Basis FuncJons: è Good and Bad RBFs A good D RBF A bad D RBFs 9 () Locally weighted regression aka locally weighted regression, locally linear regression, LOESS, 5

6 () Locally weighted linear regression () Locally weighted linear regression Kernel e.g. when Methods for only one feature variable Separate weighted least squares at each target N point x : min K ( x, x )[ y α( x ) β ( x ) x ] λ i i α ( x ), β ( x ) i= f ˆ ( x ) =α ˆ( x) + ˆ( β x) b(x) T =(,x); B: Nx regression matrix with i-th row b(x) T ; W x ) = diag K ( x, x ), i =, x ( ) N N N ( λ i, ˆf (x ) = b(x ) T (B T W (x )B) B T W (x )y i versus LR ˆf (x q ) = (x q ) T θ * = (x q ) T ( X T X) X T! y 6

3 TYPICAL MACHINE LEARNING SYSTEM Optimization e.g.

7 (). One More è Local Weighted Polynomial Regression Local polynomial fits of any degree d α ( x ), β ( x ), =,, d min fˆ( N K ( x, x ) y α( x ) λ i i i= = d x) = ˆ( α x) + ˆ β = ( x x ) d Kernel Methods β ( x ) x i 3 TYPICAL MACHINE LEARNING SYSTEM Optimization e.g. Data Cleaning Task-relevant Low-level sensing Preprocessing Feature Extract Feature Select Inference, Prediction, Recognition Label Collection Evaluation 8/6/4 4 7

q An Exemplar ApplicaJon of Regression q Linear Regression Models

8 Today q A bit more about Linear Regression Extension q Linear regression with predefined RBF basis q Locally weighted regression q An Exemplar ApplicaJon of Regression q Linear Regression Models with RegularizaJon 5 e.g. A PracJcal ApplicaJon of Regression Model 6 8

7 Movie Reviews and Revenues: An Experiment in

earnings, denoted y, based on features x

9 7 Movie Reviews and Revenues: An Experiment in Text Regression, Proceedings of HLT ' Human Language Technologies: Use linear regression to directly predict the opening weekend gross earnings, denoted y, based on features x extracted from the movie metadata and/or the text of the reviews. 8 9

10 Movie Reviews and Revenues: An Experiment in Text Regression, Proceedings of HLT ' Human Language Technologies: e.g. counts of a ngram in the text 9 HW The feature weights can be directly interpreted as U.S. dollars contributed to the predicted value yˆ by each occurrence of the feature. to movies

11 Today q A bit more about Linear Regression Extension q Linear regression with predefined RBF basis q Locally weighted regression q An Exemplar ApplicaJon of Regression q Linear Regression Model with RegularizaJons q Ridge Regression q Lasso Regression Review: Vector norms A norm of a vector x is informally a measure of the length of the vector. Common norms: L, L (Euclidean) L infinity 8/8/4

12 Review: Vector Norm (L, when p=) 8/8/4 3 Review: Normal equajon for LR Write the cost funcjon in matrix form: n T J ( θ ) = ( xi θ yi ) i=! = =! T ( Xθ y) ( Xθ y) T T T T!! T! T! ( θ X Xθ θ X y y Xθ + y y) To minimize J(θ), take derivajve and set to zero: 9//4 X Xθ = X y T T! The normal equa8ons θ * = ( X T X) X T! y " T % $ x ' $ T x X = ' $ ' $!!! ' $ T # x n ' &! # # Y = # # "# Assume that X T X is inverjble 4 y y! y n $ & & & & %&

() Ridge Regression / L If not inverjble, a solujon is to

ˆβ ridge = argmin(y Xβ) T (y Xβ)+ λβ T β Equivalently ˆ β

FuncJon s Contour lines from Ridge Regression Ridge

13 () Ridge Regression / L If not inverjble, a solujon is to add a small element ^ ^ ^ ^ to diagonal Y = β + β x +! + β p x p Basic Model, ( ) X T! y β * = X T X + λi HW The ridge estimator is solution from ˆβ ridge = argmin(y Xβ) T (y Xβ)+ λβ T β Equivalently ˆ β ridge = arg min( y Xβ ) subect to β T ( y Xβ ) s 5 ObecJve FuncJon s Contour lines from Ridge Regression Ridge Regression solujon Normal equajon solujon s Elements of StaJsJcal Learning, by HasJe, Tibshirani and Friedman 6 3

14 Linear Methods for Regression () Ridge Regression / L The parameter > penalizes to its size Solution is where I is the identity matrix. Note λ β λ ˆ β λ β = ( X X + λi) T proportional = gives the least squares estimator; X T y if λ, then βˆ 7 Linear Methods for Regression () Lasso (least absolute shrinkage and selection operator) / L The lasso is a shrinkage method like ridge, but acts in a nonlinear manner on the outcome y. The lasso is defined by ˆβ lasso = argmin(y Xβ) T (y Xβ) subect to β s 8 4

15 Linear Methods for Regression Lasso (least absolute shrinkage and selection operator) Notice that ridge penalty by β β is replaced Due to the nature of the constraint, if tuning parameter is chosen small enough, then the lasso will set some coefficients exactly to zero. 9 Lasso (least absolute shrinkage and selection operator) Suppose in dimension β= (β, β ) β + β =const β + - β =const -β + β =const -β + -β =const s Elements of StaJsJcal Learning, by HasJe, Tibshirani and Friedman 3 5

16 Elements of StaJsJcal Learning, by HasJe, Tibshirani and Friedman Linear Methods for Regression (3) A family of shrinkage estimators β = arg min subect N β i = to ( y s for q >=, contours of constant value of are shown for the case of two inputs. i β x β ) q T i β q 3 In the example: Hybrid of Ridge and Lasso 3 6

17 Lasso EsJmator Elements of StaJsJcal Learning, by HasJe, Tibshirani and Friedman Ridge Regression s s 33 due to the nature of L_ norm, the viable solujons are limited to the corners, which are on one axis only - in the above case x. Value of x =. This means that the solujon has eliminated the role of x leading to sparsity 34 7

18 Summary: Regularized multivariate linear regression Model: LR esjmajon: Ridge regression esjmajon: ^ LASSO esjmajon: ^ ^ Y = β + β x +! + β min SSE = min SSE = ^ n i= p x p ^ Y Y ^ Y Y + p = β n p " % min SSE = $ Y Y^ ' + β # & i= = 35/54 Extra Not required, though roughly covered during class Subgradient Coordinate descent based learning for Lasso 36 8

19 Elements of StaJsJcal Learning, by HasJe, Tibshirani and Friedman Ridge Regression β λ λ = q 37 Elements of StaJsJcal Learning, by HasJe, Tibshirani and Friedman Lasso EsJmator λ λ = 38 9

20 Today s Recap q A bit more about Linear Regression Extension q Linear regression with predefined RBF basis q Locally weighted regression q An Exemplar ApplicaJon of Regression q Text based movie open weekend revenue predicjon q Linear Regression Models with RegularizaJon q Ridge Regression q Lasso 39 References Big thanks to Prof. Eric CMU for allowing me to reuse some of his slides q Elements of StaJsJcal Learning, by HasJe, Tibshirani and Friedman (page 6-69) 4

UVA CS 6316/4501 Fall 2016 Machine Learning. Lecture 6: Linear Regression Model with RegularizaEons. Dr. Yanjun Qi. University of Virginia

UVA CS 6316/4501 Fall 2016 Machine Learning Lecture 6: Linear Regression Model with RegularizaEons Dr. Yanjun Qi University of Virginia Department of Computer Science 1 Where are we? è Five major secgons