
Linear Feature Engineering

2 Least-Squares

2.1 Simple least-squares

Consider the following dataset. We have a bunch of inputs x_i and corresponding outputs y_i. The particular values in this dataset are

    x_i    y_i
    0.23   0.19
    0.88   0.96
    0.21   0.33
    0.92   0.80
    0.49   0.46
    0.62   0.45
    0.77   0.67
    0.52   0.32
    0.30   0.38
    0.19   0.37

Now, suppose we'd like to fit a line to this data, of the form y = ax. A standard way to fit such a line is by least-squares, where we measure the sum of the squared differences between each data point and the line with

    R = \sum_i (a x_i - y_i)^2.

Exercise 8. If we use a = 0.7, what is the error R?

This can be visualized as:

[Figure: the data points and the line y = ax, with vertical error lines from each point to the line.]

Notice that we want to minimize the sum of the squares of the lengths of the error lines. There are various justifications for this, but it is worth noting that one could alternatively minimize the lengths of the lines themselves, the cubes of them, etc. The square is by far the most common, however, leads to simpler algorithms, and will be our focus here.

Now, how can we find the best a?

Theorem 9. For R = \sum_i (a x_i - y_i)^2, the minimum value of R is obtained for

    a = \frac{\sum_i x_i y_i}{\sum_i x_i^2}.    (2.1)

Proof. We can simply calculate the derivative of R and set it to zero:

    \frac{dR}{da} = 2 \sum_i x_i (a x_i - y_i) = 0
    a \sum_i x_i^2 = \sum_i x_i y_i
    a = \frac{\sum_i x_i y_i}{\sum_i x_i^2}.

Exercise 10. Write a program that takes a bunch of values (x_i, y_i), along with a as input, and returns R. Next, write a program to compute a from Eq. 2.1. Test that, on the above dataset, this program correctly returns a = .920. Compute the values of R for a range of a and see that this is the right value.

We can plot R for a range of a and see that this is the true minimum.

[Figure: R plotted against a, for a ranging from 0 to 2; the minimum lies near a = 0.92.]

We can plot the data with the curve for this optimal a as follows.

[Figure: the data points together with the fitted line y = .920 x.]
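Below is a minimal sketch of the two programs Exercise 10 asks for. The notes do not fix a programming language, so Python and the names used here (residual, best_a, xs, ys) are illustrative assumptions rather than part of the notes.

```python
# Sketch for Exercise 10 (Python assumed; names are illustrative).
xs = [0.23, 0.88, 0.21, 0.92, 0.49, 0.62, 0.77, 0.52, 0.30, 0.19]
ys = [0.19, 0.96, 0.33, 0.80, 0.46, 0.45, 0.67, 0.32, 0.38, 0.37]

def residual(a, xs, ys):
    """R = sum_i (a*x_i - y_i)^2."""
    return sum((a * x - y) ** 2 for x, y in zip(xs, ys))

def best_a(xs, ys):
    """Closed-form minimizer from Eq. 2.1: a = sum(x*y) / sum(x^2)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

a = best_a(xs, ys)
print(round(a, 3))           # should print 0.92
print(residual(a, xs, ys))   # R at the minimizer
# Scanning a grid of a values should show no smaller R:
print(min(residual(0.01 * k, xs, ys) for k in range(201)) >= residual(a, xs, ys) - 1e-12)
```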

2.2 Least-squares with a bias term

It is common to add a bias term and, rather than fitting the linear function y = ax, fit the affine function y = ax + b. As before, we can use the basic calculus technique of setting derivatives to zero in order to find the best a and b.

Theorem 11. For R = \sum_i (a x_i + b - y_i)^2, the minimum value of R is obtained for

    b = \frac{\bar{y} - \bar{x}\,\overline{xy}/\overline{x^2}}{1 - \bar{x}^2/\overline{x^2}},
    a = \frac{1}{\bar{x}} (\bar{y} - b),

where

    \bar{y} = \frac{1}{N} \sum_i y_i,    \bar{x} = \frac{1}{N} \sum_i x_i,    \overline{x^2} = \frac{1}{N} \sum_i x_i^2,    \overline{xy} = \frac{1}{N} \sum_i x_i y_i.

Proof. Start with the derivatives

    \frac{dR}{da} = 2 \sum_i x_i (a x_i + b - y_i) = 0
    \frac{dR}{db} = 2 \sum_i (a x_i + b - y_i) = 0,

which give

    a \sum_i x_i^2 + b \sum_i x_i = \sum_i x_i y_i
    a \sum_i x_i + b \sum_i 1 = \sum_i y_i.

Now, making use of the mean notation, we can write the linear system

    a \overline{x^2} + b \bar{x} = \overline{xy}
    a \bar{x} + b = \bar{y}.

To eliminate a, we can subtract \bar{x}/\overline{x^2} times the first equation from the second, yielding

    b - b \frac{\bar{x}^2}{\overline{x^2}} = \bar{y} - \frac{\bar{x}}{\overline{x^2}} \overline{xy},

which we can solve for

    b = \frac{\bar{y} - \bar{x}\,\overline{xy}/\overline{x^2}}{1 - \bar{x}^2/\overline{x^2}}.

Once we have computed this, we can recover a from

    a = \frac{1}{\bar{x}} (\bar{y} - b).

Not the prettiest theorem or proof in the world, but it works. Adding a bias term slightly improves the quality of the fit.

Exercise 12. Write a program that takes a bunch of values (x_i, y_i), and finds the optimal a and b for an affine fit y = ax + b. Verify that on the example dataset, you correctly find a = .765 and b = .100.
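A minimal sketch of the affine fit of Exercise 12, directly transcribing the formulas of Theorem 11 (Python and the helper names are assumptions, as before).

```python
# Sketch for Exercise 12 (Python assumed; names are illustrative).
xs = [0.23, 0.88, 0.21, 0.92, 0.49, 0.62, 0.77, 0.52, 0.30, 0.19]
ys = [0.19, 0.96, 0.33, 0.80, 0.46, 0.45, 0.67, 0.32, 0.38, 0.37]

def affine_fit(xs, ys):
    """Return (a, b) minimizing R = sum_i (a*x_i + b - y_i)^2, using Theorem 11."""
    N = len(xs)
    x_bar = sum(xs) / N
    y_bar = sum(ys) / N
    x2_bar = sum(x * x for x in xs) / N
    xy_bar = sum(x * y for x, y in zip(xs, ys)) / N
    b = (y_bar - x_bar * xy_bar / x2_bar) / (1 - x_bar ** 2 / x2_bar)
    a = (y_bar - b) / x_bar
    return a, b

a, b = affine_fit(xs, ys)
print(round(a, 3), round(b, 3))   # should print roughly 0.765 0.1
```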

2.3 More powerful least squares

Now, suppose we want to fit a quadratic y = d x^2 + a x + b, or something crazy like y = a \sin(x) + b e^x + d \sqrt{x}. It would be an option to derive an algorithm along the lines of the previous theorem, but one tends to shudder at the thought given how messy it was just to fit an affine function. Instead, we will define an abstraction. Suppose that the input z is a vector, and we would like to make predictions by

    y = w^T z = \sum_j w(j) z(j).

Theorem 13. For R = \sum_i (w^T z_i - y_i)^2, the minimum value of R is obtained for

    w = \Big( \sum_i z_i z_i^T \Big)^{-1} \sum_i z_i y_i.

Proof.

    \frac{dR}{dw} = 0 = 2 \sum_i z_i (w^T z_i - y_i)
    \sum_i z_i (w^T z_i) = \sum_i z_i y_i
    \sum_i z_i (z_i^T w) = \sum_i z_i y_i
    \Big( \sum_i z_i z_i^T \Big) w = \sum_i z_i y_i

To actually implement this, we would use something like the following.

Algorithm (Least-Squares)
    Input {z_i}, {y_i}
    s <- \sum_i z_i y_i
    M <- \sum_i z_i z_i^T
    Solve M w = s for w using Gaussian Elimination
    Output w

Exercise 14. Implement the above algorithm for solving a least-squares system. Use your previous algorithm for solving linear systems. Check that, on the dataset

    z(1)   0.1   0.2   0.3   0.4
    z(2)   0.5   0.6   0.7   0.7
    y      0.9   1.0   1.1   1.2

it correctly returns w = (-.159, 1.739). Have your function also return the residual error R. Check that this is R = 0.00927. For testing, you might find it useful to check that

    M = \begin{bmatrix} .3 & .66 \\ .66 & 1.59 \end{bmatrix},    s = \begin{bmatrix} 1.1 \\ 2.66 \end{bmatrix}.

Note that this dataset means z_1 = (.1, .5), z_2 = (.2, .6), etc.
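A sketch of the least-squares algorithm for Exercise 14. The notes ask you to reuse your own linear-system solver; the solve_linear routine below is a minimal stand-in for it (Python assumed, names illustrative).

```python
# Sketch for Exercise 14 (Python assumed; solve_linear stands in for your own solver).

def solve_linear(M, s):
    """Solve M w = s by Gaussian elimination with partial pivoting."""
    n = len(s)
    A = [row[:] + [si] for row, si in zip(M, s)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (A[r][n] - sum(A[r][c] * w[c] for c in range(r + 1, n))) / A[r][r]
    return w

def least_squares(zs, ys):
    """M = sum_i z_i z_i^T, s = sum_i z_i y_i, then solve M w = s. Also return R."""
    d = len(zs[0])
    M = [[sum(z[j] * z[k] for z in zs) for k in range(d)] for j in range(d)]
    s = [sum(z[j] * y for z, y in zip(zs, ys)) for j in range(d)]
    w = solve_linear(M, s)
    R = sum((sum(wj * zj for wj, zj in zip(w, z)) - y) ** 2 for z, y in zip(zs, ys))
    return w, R

zs = [(0.1, 0.5), (0.2, 0.6), (0.3, 0.7), (0.4, 0.7)]
ys = [0.9, 1.0, 1.1, 1.2]
w, R = least_squares(zs, ys)
print([round(wi, 3) for wi in w], R)   # roughly [-0.159, 1.739] with R close to 0.00927
```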

2.4 Basis expansion

Suppose that we can fit vector-valued least-squares systems of the form y = w^T z. Now the question is: what is the connection between all the measured data x and the input vector z?

First, take our old scalar dataset. Suppose we would like to fit an equation of the form

    y = ax + b.

How could we do this? An idea would be to use what is called a basis expansion. Rather than fitting to the dataset {(x_i, y_i)}, fit to {(z_i, y_i)}, where

    z_i = \begin{bmatrix} x_i \\ 1 \end{bmatrix}.    (2.2)

That is, we simply take the input dataset, and replace each input scalar by an input vector of length two, consisting of the original scalar plus a constant of one. Then, we would have that

    y = w^T z = w_1 x + w_2,

which is of exactly the same form as y = ax + b. For example, if we started with the dataset

    x   1   2   5
    y   2   7   9

then, after basis expansion, we would have the dataset

    z(1)   1   2   5
    z(2)   1   1   1
    y      2   7   9

Exercise 15. Make a basis expansion for the original dataset of the form in Eq. 2.2. Plug it into your least-squares solver from the previous exercise. Check that you correctly find w = (.765, .100) and the residual error R = 0.1128.

Now, we have room for personal creativity. There is nothing that prevents us from fitting more advanced functions. For example, suppose we want to fit a function of the form

    y = ax + b + cx^2.

What basis expansion should we use? The answer is

    z_i = \begin{bmatrix} x_i \\ 1 \\ x_i^2 \end{bmatrix}.

This is called a quadratic basis expansion.

Exercise 16. Make a quadratic basis expansion for the original dataset. Plug it into your least-squares solver from the previous exercise. Check that you correctly find w = (-.721, .406, 1.37) and the residual error R = 0.0632.

If we are in a strange mood, we could even fit a function of the form

    y = a \sin(2x) + b \log(x) + c \sqrt{x}

by simply using the basis expansion

    z_i = \begin{bmatrix} \sin(2 x_i) \\ \log(x_i) \\ \sqrt{x_i} \end{bmatrix}.

Exercise 17. Implement this last basis expansion. Check that you correctly find w = (-1.44, 0.134, 2.41) and the residual error R = 0.0672.
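A sketch of the basis expansions of Exercises 15-17 (Python assumed; numpy.linalg.solve stands in for the hand-written linear solver the notes ask you to reuse).

```python
# Sketch for Exercises 15-17 (Python assumed; numpy replaces your own solver).
import math
import numpy as np

xs = [0.23, 0.88, 0.21, 0.92, 0.49, 0.62, 0.77, 0.52, 0.30, 0.19]
ys = [0.19, 0.96, 0.33, 0.80, 0.46, 0.45, 0.67, 0.32, 0.38, 0.37]

def least_squares(zs, ys):
    """w = (sum_i z_i z_i^T)^{-1} sum_i z_i y_i, plus the residual R."""
    Z = np.array(zs, dtype=float)   # one expanded input per row
    y = np.array(ys, dtype=float)
    w = np.linalg.solve(Z.T @ Z, Z.T @ y)
    R = float(np.sum((Z @ w - y) ** 2))
    return w, R

# Exercise 15: affine basis z_i = (x_i, 1).
w, R = least_squares([(x, 1.0) for x in xs], ys)
print(np.round(w, 3), round(R, 4))   # roughly [0.765 0.1] 0.1128

# Exercise 16: quadratic basis z_i = (x_i, 1, x_i^2).
w, R = least_squares([(x, 1.0, x * x) for x in xs], ys)
print(np.round(w, 3), round(R, 4))   # roughly [-0.721 0.406 1.37] 0.0632

# Exercise 17: z_i = (sin(2 x_i), log(x_i), sqrt(x_i)).
w, R = least_squares([(math.sin(2 * x), math.log(x), math.sqrt(x)) for x in xs], ys)
print(np.round(w, 3), round(R, 4))   # roughly [-1.44 0.134 2.41] 0.0672
```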

The output of the above two exercises will look like:

[Figure: the fitted curves from Exercise 16 and Exercise 17 plotted over the data.]

2.4.1 Matrix notation

Often, when working with a dataset of a bunch of vectors z_1, z_2, ..., z_M, it is convenient to put them together into a single matrix Z by simply placing the vectors next to each other,

    Z = \begin{bmatrix} z_1 & z_2 & \cdots & z_M \end{bmatrix}.    (2.3)

It is not hard to see that

Theorem 18. As defined in Eq. 2.3,

    Z y = \sum_i z_i y_i.

A similar, but somewhat less obvious result is that

Theorem 19. As defined in Eq. 2.3,

    Z Z^T = \sum_i z_i z_i^T.

All this means that we can write w in the more compact notation

    Z Z^T w = Z y.

Exercise 20. Write a program that takes inputs Z and y and finds the least-squares fit. (Use your previous routine to solve the linear system.) Check that, on an input of

    Z = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 3 & 4 & 5 \end{bmatrix},    y = (1, 2, 4, 5),

it correctly returns w = (-1.9, 1.4).
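A sketch of Exercise 20 in the matrix notation of this subsection, with the z_i stored as the columns of Z so that Z Z^T w = Z y can be used directly (Python assumed; numpy again stands in for your own solver).

```python
# Sketch for Exercise 20 (Python assumed; numpy replaces your own linear solver).
import numpy as np

def least_squares_matrix(Z, y):
    """Solve Z Z^T w = Z y, with the data vectors z_i stored as the columns of Z."""
    return np.linalg.solve(Z @ Z.T, Z @ y)

Z = np.array([[1.0, 1.0, 1.0, 1.0],
              [2.0, 3.0, 4.0, 5.0]])
y = np.array([1.0, 2.0, 4.0, 5.0])
print(least_squares_matrix(Z, y))   # roughly [-1.9  1.4]
```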

2.4.2 Linear or nonlinear?

Notice that, with a few of the above examples, we are actually fitting the output y using (sometimes highly nonlinear) functions with respect to the input x. However, once the data are given, the function is linear with respect to the unknown coefficients / weights w, and the optimal value of w is obtained with a simple least-squares solve. This type of technique is known as linear regression. More formally, "linear" regression refers to the regression function's dependence on the regression coefficients w, not on the input data x.

The crucial point here is that it is easy to do nonlinear things using linear regression with

    y = w^T z,    where    z = \begin{bmatrix} f_1(x) \\ f_2(x) \\ \vdots \\ f_s(x) \end{bmatrix},

and the f_i(x) can be of various forms, as shown in the previous examples; they can be thought of as features extracted from the given data x (see the sketch at the end of this section).

In summary, clever hand-engineered features and linear regression (or linear classification) capture a huge percentage of how machine learning is done in the real world. The advantages of linear methods like this are:

- Simplicity
- Reliability
- Interpretability
- Speed

There are also some disadvantages:

- Limited modeling power. Imposing a simple linear form y = w^T z is a huge assumption, and some datasets simply don't have such a relationship between inputs and outputs.
- You need to find good features. The previous issue can be reduced by coming up with good features, but this process is not automated.

Lnear Feature Engneerng 21 methods try to adapt to the structure of the functon wthout the user specfyng t. Some, methods such as neural networks, can be understood as fndng the features at the same tme they ft w. There s recent work on dscoverng such features n an unsupervsed way.