Machine Learning 4771

1 Machine Learning 4771
Instructor: Tony Jebara

2 Topic 2
Regression
Empirical Risk Minimization
Least Squares
Higher Order Polynomials
Under-fitting / Over-fitting
Cross-Validation

3 Regression
[Figure: taxonomy of learning problems]
Supervised: Classification; Regression, $f(x)=y$
Unsupervised: Density/Structure Estimation; Clustering; Feature Selection; Anomaly Detection

4 Function Approximation
Start with training dataset $\mathcal{X} = \{(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N)\}$, with $x \in \mathbb{R}^D$ and $y \in \mathbb{R}$
Have N (input, output) pairs
Find a function $f(x)$ to predict $y$ from $x$
That fits the training data well
Example: predict the price of house in dollars $y$ using $x$ = [#rooms; latitude; longitude; ...]
Need:
a) Way to evaluate how good a fit we have
b) Class of functions in which to search for $f(x)$

5 Empirical Risk Minimization
Idea: minimize loss on the training data set
Empirical = use the training set to find the best fit
Define a loss function of how good we fit a single point: $L(y, f(x))$
Empirical Risk = the average loss over the dataset:
$$R = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, f(x_i)\big)$$
Simplest loss: squared error from y value
$$L\big(y_i, f(x_i)\big) = \tfrac{1}{2}\big(y_i - f(x_i)\big)^2$$
Other possible loss: absolute error
$$L\big(y_i, f(x_i)\big) = \big|y_i - f(x_i)\big|$$
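As a small illustration of the definitions on this slide, the sketch below computes the empirical risk of a fixed predictor under both losses. The toy data, the candidate predictor `f`, and the factor of 1/2 in the squared loss are assumptions made for the example, not part of the lecture.

```python
def squared_loss(y, prediction):
    """Squared error; the 1/2 factor is a convention assumed here."""
    return 0.5 * (y - prediction) ** 2


def absolute_loss(y, prediction):
    """Absolute error."""
    return abs(y - prediction)


def empirical_risk(f, xs, ys, loss):
    """Average loss of predictor f over the dataset (xs, ys)."""
    return sum(loss(y, f(x)) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data and a hypothetical candidate predictor.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 8.0]
f = lambda x: 2.0 * x + 1.0

risk_sq = empirical_risk(f, xs, ys, squared_loss)
risk_abs = empirical_risk(f, xs, ys, absolute_loss)
```

Only the last point is mispredicted (by 1), so the two risks differ only in how that single error is penalized.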

6 Linear Function Classes
Linear is simplest class of functions to search over:
$$f(x;\theta) = \theta^T x + \theta_0 = \sum_{d=1}^{D} \theta_d\, x(d) + \theta_0$$
Start with $x$ being 1-dimensional (D=1):
$$f(x;\theta) = \theta_1 x + \theta_0$$
Plug in the above & minimize empirical risk over $\theta$:
$$R(\theta) = \frac{1}{2N}\sum_{i=1}^{N}\big(y_i - \theta_1 x_i - \theta_0\big)^2$$
Note: minimum occurs when $R(\theta)$ gets flat (not always!)
Note: when $R(\theta)$ is flat, gradient $\nabla_\theta R = 0$

7-12 Min by Gradient=0
Gradient=0 means the partial derivatives are all 0
$$\nabla_\theta R = \begin{bmatrix} \partial R/\partial\theta_0 \\ \partial R/\partial\theta_1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$
Take partials of empirical risk:
$$R(\theta) = \frac{1}{2N}\sum_{i=1}^{N}\big(y_i - \theta_1 x_i - \theta_0\big)^2$$
$$\frac{\partial R}{\partial\theta_0} = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - \theta_1 x_i - \theta_0\big)(-1) = 0
\;\Rightarrow\; \theta_0 = \frac{1}{N}\sum_i y_i - \theta_1\frac{1}{N}\sum_i x_i$$
$$\frac{\partial R}{\partial\theta_1} = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - \theta_1 x_i - \theta_0\big)(-x_i) = 0
\;\Rightarrow\; \theta_1\sum_i x_i^2 = \sum_i y_i x_i - \theta_0\sum_i x_i$$
Substituting $\theta_0$ into the second equation:
$$\theta_1 = \frac{\sum_i x_i y_i - \frac{1}{N}\sum_i x_i \sum_i y_i}{\sum_i x_i^2 - \frac{1}{N}\sum_i x_i \sum_i x_i}$$
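The closed-form solution obtained by setting both partial derivatives to zero can be sketched in a few lines. This is a minimal illustration, not the course's code; the function name `fit_line` and the toy data are assumptions.

```python
def fit_line(xs, ys):
    """Least-squares fit of f(x) = theta1 * x + theta0 for scalar x.

    Uses the two conditions dR/dtheta0 = 0 and dR/dtheta1 = 0.
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    denom = sum_xx - sum_x * sum_x / n
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undetermined")
    theta1 = (sum_xy - sum_x * sum_y / n) / denom
    theta0 = sum_y / n - theta1 * sum_x / n
    return theta0, theta1


# Hypothetical noise-free data lying on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = fit_line(xs, ys)
```

On data that lies exactly on a line, the fit recovers that line's intercept and slope.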

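13 Properties of the Solution
Setting $\theta^*$ as before gives least squared error
Define error on each data point as: $e_i = y_i - \theta_1^* x_i - \theta_0^*$
Note property #1: average error is zero
$$\frac{\partial R}{\partial\theta_0} = 0 \;\Rightarrow\; \frac{1}{N}\sum_{i=1}^{N}\big(y_i - \theta_1^* x_i - \theta_0^*\big) = 0 \;\Rightarrow\; \frac{1}{N}\sum_i e_i = 0 \quad (\mathbf{e}^T\mathbf{1} = 0)$$
Note property #2: error not correlated with data
$$\frac{\partial R}{\partial\theta_1} = 0 \;\Rightarrow\; \frac{1}{N}\sum_{i=1}^{N}\big(y_i - \theta_1^* x_i - \theta_0^*\big)x_i = 0 \;\Rightarrow\; \frac{1}{N}\sum_i e_i x_i = 0 \quad (\mathbf{e}^T\mathbf{x} = 0)$$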
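Both properties can be checked numerically. The sketch below fits a line to hypothetical noisy data (using the mean-centered form of the least-squares solution, which is algebraically equivalent to solving the two gradient conditions) and then inspects the residuals.

```python
# Hypothetical noisy data, roughly following y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
n = len(xs)

# Least-squares line in mean-centered form.
mean_x = sum(xs) / n
mean_y = sum(ys) / n
theta1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
theta0 = mean_y - theta1 * mean_x

# Residuals e_i = y_i - theta1 * x_i - theta0.
errors = [y - theta1 * x - theta0 for x, y in zip(xs, ys)]

mean_error = sum(errors) / n                                # property 1
error_dot_x = sum(e * x for e, x in zip(errors, xs)) / n    # property 2
```

Both quantities come out as zero up to floating-point rounding, even though the individual residuals are nonzero.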

14-16 Multi-Dimensional Regression
More elegant/general to do $\nabla_\theta R = 0$ with linear algebra
Rewrite empirical risk in vector-matrix notation:
$$R(\theta) = \frac{1}{2N}\sum_{i=1}^{N}\big(y_i - \theta_1 x_i - \theta_0\big)^2
= \frac{1}{2N}\left\| \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} - \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix} \right\|^2
= \frac{1}{2N}\left\| \mathbf{y} - X\theta \right\|^2$$
Can add more dimensions by adding columns to X matrix and rows to $\theta$ vector:
$$X = \begin{bmatrix} 1 & x_1(1) & \cdots & x_1(D) \\ \vdots & \vdots & & \vdots \\ 1 & x_N(1) & \cdots & x_N(D) \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_D \end{bmatrix}$$

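17 Multi-Dimensional Regression
More realistic dataset: many measurements
Have N apartments each with D measurements
Each row of X is [#rooms; latitude; longitude; ...]
$$X = \begin{bmatrix} 1 & x_1(1) & \cdots & x_1(D) \\ \vdots & \vdots & & \vdots \\ 1 & x_N(1) & \cdots & x_N(D) \end{bmatrix}$$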

18-23 Multi-Dimensional Regression
Solving gradient=0:
$$\nabla_\theta R = 0$$
$$\nabla_\theta \left( \frac{1}{2N}\left\| \mathbf{y} - X\theta \right\|^2 \right) = 0$$
$$\frac{1}{2N}\nabla_\theta \Big( (\mathbf{y} - X\theta)^T(\mathbf{y} - X\theta) \Big) = 0$$
$$\frac{1}{2N}\nabla_\theta \Big( \mathbf{y}^T\mathbf{y} - 2\,\mathbf{y}^T X\theta + \theta^T X^T X\theta \Big) = 0$$
$$\frac{1}{2N}\Big( -2\,\mathbf{y}^T X + 2\,\theta^T X^T X \Big) = 0$$
$$X^T X\theta = X^T\mathbf{y}$$
$$\theta^* = \left( X^T X \right)^{-1} X^T\mathbf{y}$$
Gradient identities used:
$$\frac{\partial\, \mathbf{u}^T\theta}{\partial\theta} = \mathbf{u}^T, \qquad \frac{\partial\, \theta^T\theta}{\partial\theta} = 2\,\theta^T, \qquad \frac{\partial\, \theta^T A\theta}{\partial\theta} = \theta^T\left( A + A^T \right)$$
In Matlab: t=pinv(X)*y or t=X\y or t=inv(X'*X)*X'*y
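For readers without Matlab, the normal equations $X^T X\theta = X^T\mathbf{y}$ can be solved with a short standard-library Python sketch. The helper names (`solve_linear_system`, `least_squares`) and the toy data are assumptions for illustration; a numerical library's least-squares routine would normally be preferred, and this sketch only covers the case where $X^T X$ is invertible.

```python
def solve_linear_system(a, b):
    """Solve the square system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix copy so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; the solution is not unique")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - s) / m[i][i]
    return z


def least_squares(rows, ys):
    """Return theta solving X^T X theta = X^T y, with a column of ones prepended to X."""
    X = [[1.0] + list(r) for r in rows]
    d = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(d)] for i in range(d)]
    xty = [sum(x[i] * y for x, y in zip(X, ys)) for i in range(d)]
    return solve_linear_system(xtx, xty)


# Hypothetical noise-free data generated from y = 1 + 2*x(1) + 3*x(2).
rows = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
ys = [1.0, 3.0, 4.0, 6.0, 8.0]
theta = least_squares(rows, ys)  # ordered as [theta_0, theta_1, theta_2]
```

Forming $X^T X$ explicitly mirrors the derivation but squares the condition number of the problem, which is one reason `X\y` or `pinv(X)*y` is usually the safer choice in practice.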

24 Multi-Dimensional Regression
Solving gradient=0:
$$X^T X\theta = X^T\mathbf{y}$$
$$\theta^* = \left( X^T X \right)^{-1} X^T\mathbf{y}$$
In Matlab: t=pinv(X)*y or t=X\y or t=inv(X'*X)*X'*y
If the matrix X is skinny, the solution is probably unique
If X is fat (more dimensions than points) we get multiple solutions for theta which give zero error. The pseudoinverse (pinv(X)) returns the theta with zero error and which has the smallest norm:
$$\min_\theta \left\| \theta \right\|^2 \quad \text{such that} \quad X\theta = \mathbf{y}$$
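A tiny worked example may make the fat case concrete. The numbers are hypothetical, and the formula $X^T(XX^T)^{-1}$ for the pseudoinverse assumes X has full row rank.

```latex
% One data point (N = 1), two parameters: X is 1 x 2, so X is fat.
X = \begin{bmatrix} 1 & 2 \end{bmatrix}, \qquad \mathbf{y} = 5
% Every theta on the line theta_0 + 2 theta_1 = 5 gives zero error,
% for example (5, 0) or (1, 2).
% The minimum-norm solution returned by the pseudoinverse is
\theta = X^T \left( X X^T \right)^{-1} \mathbf{y}
       = \begin{bmatrix} 1 \\ 2 \end{bmatrix} \cdot \frac{1}{5} \cdot 5
       = \begin{bmatrix} 1 \\ 2 \end{bmatrix},
\qquad \|\theta\|^2 = 5 < 25 = \left\| (5, 0) \right\|^2
```

Here $X^T X$ is a singular 2-by-2 matrix, so the formula $\left(X^T X\right)^{-1}X^T\mathbf{y}$ cannot be applied directly.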

25 2D Linear Regression
Once best $\theta^*$ is found, we can plug it into the function:
$$f(x;\theta^*) = \theta_1^*\, x(1) + \theta_2^*\, x(2) + \theta_0^*$$
What would a fat X look like?

26 Polynomial Function Classes
Back to 1-dim $x$ (D=1) BUT Nonlinear
Polynomial:
$$f(x;\theta) = \sum_{p=1}^{P} \theta_p x^p + \theta_0$$
Writing Risk:
$$R(\theta) = \frac{1}{2N}\left\| \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} - \begin{bmatrix} 1 & x_1 & \cdots & x_1^P \\ \vdots & \vdots & & \vdots \\ 1 & x_N & \cdots & x_N^P \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_P \end{bmatrix} \right\|^2$$
Order-P polynomial regression fitting for 1D variable is same as P-dimensional linear regression!
Construct a multidim x-vector from x scalar:
$$\mathbf{x}_i = \begin{bmatrix} x_i^0 & x_i^1 & x_i^2 & x_i^3 \end{bmatrix}^T$$
More generally any
$$\mathbf{x}_i = \begin{bmatrix} \phi_0(x_i) & \phi_1(x_i) & \phi_2(x_i) & \phi_3(x_i) \end{bmatrix}^T$$
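The feature construction described on this slide can be sketched as follows. The names `poly_features` and `predict`, the coefficient vector, and the sample inputs are assumptions for illustration.

```python
def poly_features(x, P):
    """Map a scalar x to the vector [x^0, x^1, ..., x^P]."""
    return [x ** p for p in range(P + 1)]


def predict(theta, x):
    """Evaluate f(x; theta) as a linear function of the polynomial features."""
    phi = poly_features(x, len(theta) - 1)
    return sum(t * f for t, f in zip(theta, phi))


# Each scalar input becomes one row of the design matrix (here P = 3).
design = [poly_features(x, 3) for x in [0.0, 1.0, 2.0]]

# Hypothetical coefficients [theta_0, theta_1, theta_2] for f(x) = 1 + 2 x^2.
theta = [1.0, 0.0, 2.0]
value = predict(theta, 3.0)
```

Because the prediction is linear in `theta`, the same least-squares machinery used for multi-dimensional linear regression applies once the rows of `design` replace the raw inputs.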

27 Underfitting/Overfitting
Try varying P. Higher P fits a more complex function class
Observe $R(\theta^*)$ drops with bigger P
[Figure: six panels showing polynomial fits to the same data at increasing order P]

28 Evaluating The Regression
Unfair to use empirical to find best order P
High P (vs. N) can overfit, even linear case!
min $R(\theta^*)$ not on training but on future data
Want model to Generalize to future data
True loss:
$$R_{true}(\theta) = \int P(x,y)\, L\big(y, f(x;\theta)\big)\, dx\, dy$$
One approach: split data into training / testing portion
$$\{(x_1,y_1),\ldots,(x_N,y_N)\} \qquad \{(x_{N+1},y_{N+1}),\ldots,(x_{N+M},y_{N+M})\}$$
Estimate $\theta^*$ with training loss:
$$R_{train}(\theta) = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, f(x_i;\theta)\big)$$
Evaluate P with testing loss:
$$R_{test}(\theta) = \frac{1}{M}\sum_{i=N+1}^{N+M} L\big(y_i, f(x_i;\theta)\big)$$
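The training/testing split can be sketched for the simplest case of a straight-line fit. The data, the split size `N = 4`, and the helper names are assumptions for illustration; the squared loss is taken with a 1/2 factor, which is also an assumption here.

```python
def fit_line(pairs):
    """Least-squares line through (x, y) pairs; returns (theta0, theta1)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    theta1 = sxy / sxx
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


def risk(theta, pairs):
    """Average squared loss (with the 1/2 factor) of the line theta on pairs."""
    theta0, theta1 = theta
    return sum(0.5 * (y - theta1 * x - theta0) ** 2 for x, y in pairs) / len(pairs)


# Hypothetical dataset of N + M = 6 points.
data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.9), (4.0, 4.1), (5.0, 5.2)]
N = 4
train, test = data[:N], data[N:]   # first N points train, remaining M test

theta_star = fit_line(train)       # theta* is estimated from training data only
r_train = risk(theta_star, train)
r_test = risk(theta_star, test)    # held-out estimate of the true loss
```

The key design point is that `test` is never used when estimating `theta_star`, so `r_test` is not biased by the fitting procedure.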

29 Crossvalidation
Try fitting with different polynomial order P
Select P which gives lowest $R_{test}(\theta^*)$
[Figure: Loss versus P, with curves $R_{test}(\theta^*)$ and $R_{train}(\theta^*)$; underfitting and overfitting regions lie on either side of the best P]
Think of P as a measure of the complexity of the model
Higher order polynomials are more flexible and complex
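The selection procedure can be sketched end to end with the standard library. Everything specific here is an assumption for illustration: the synthetic quadratic data, the noise level, the interleaved train/test split, the range of orders tried, and the helper names. The polynomial is fit by solving the normal equations, which is adequate for the low orders used here.

```python
import random


def solve(a, b):
    """Solve the square system a z = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        z[i] = (m[i][n] - sum(m[i][j] * z[j] for j in range(i + 1, n))) / m[i][i]
    return z


def fit_poly(xs, ys, P):
    """Order-P polynomial least squares via the normal equations."""
    phi = [[x ** p for p in range(P + 1)] for x in xs]
    d = P + 1
    a = [[sum(r[i] * r[j] for r in phi) for j in range(d)] for i in range(d)]
    b = [sum(r[i] * y for r, y in zip(phi, ys)) for i in range(d)]
    return solve(a, b)


def risk(theta, xs, ys):
    """Average squared loss (with a 1/2 factor) of the polynomial theta."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = sum(t * x ** p for p, t in enumerate(theta))
        total += 0.5 * (y - pred) ** 2
    return total / len(xs)


# Hypothetical data: a quadratic plus Gaussian noise, with a fixed seed.
rng = random.Random(0)
xs = [i / 8.0 - 1.0 for i in range(17)]
ys = [1.0 - 2.0 * x + 3.0 * x * x + rng.gauss(0.0, 0.1) for x in xs]
train_x, train_y = xs[0::2], ys[0::2]   # 9 training points
test_x, test_y = xs[1::2], ys[1::2]     # 8 testing points

orders = range(0, 5)
train_risk, test_risk = {}, {}
for P in orders:
    theta = fit_poly(train_x, train_y, P)
    train_risk[P] = risk(theta, train_x, train_y)
    test_risk[P] = risk(theta, test_x, test_y)

best_P = min(orders, key=lambda P: test_risk[P])  # lowest testing loss
```

The training risk can only decrease as P grows because each polynomial class contains the previous one, which is why the choice of P is made with `test_risk` rather than `train_risk`.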
