ORIE 4741: Learning with Big Messy Data. Feature Engineering


1 ORIE 4741: Learning with Big Messy Data
Feature Engineering
Professor Udell
Operations Research and Information Engineering
Cornell
October 2, 2017

2 Agenda
announcements:
- Vote today! + voter registration deadline in October (for the Nov 2017 election, or to change parties for the 2018 election)
- Homework 1 due this morning; Homework 2 out tonight, due in two weeks
- always use Julia v0.6 (when you use Julia)
- section this week: Github + projects; no section next Tuesday
- lecture videos: http://cornell.videonote.com/channels/1036/videos
- midterm exam in class on October 5, covers through decision trees
then:
- the role of confusion in learning
- feature engineering
- project discussion

3 Linear models
To fit a linear model (= linear in the parameters w):
- pick a transformation φ : X → R^d
- predict y using a linear function of φ(x):
  h(x) = w^T φ(x) = ∑_{i=1}^d w_i (φ(x))_i
- we want h(x_i) ≈ y_i for every i = 1, ..., n
Q: why do we want a model linear in the parameters w?
A: because the optimization problems are easy to solve! e.g., just use least squares.
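As a concrete illustration of this recipe (not part of the original slides), here is a minimal Julia sketch that builds a design matrix from a feature map φ and solves the least squares problem; the data and the choice of φ are made up, and the code targets current Julia rather than the course's v0.6.

```julia
using LinearAlgebra

# hypothetical data: n scalar inputs with a noisy linear response
n = 50
x = randn(n)
y = 2 .* x .+ 1 .+ 0.1 .* randn(n)

φ(x) = [x, 1.0]                     # feature map: the input plus an offset
Φ = vcat([φ(xi)' for xi in x]...)   # n × d design matrix, one row per example
w = Φ \ y                           # least squares: minimize ‖Φw - y‖²
h(x) = dot(w, φ(x))                 # learned predictor h(x) = wᵀφ(x)
```

Every feature map on the slides that follow can be dropped into this same skeleton; only the definition of φ changes.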

6 Feature engineering
How to pick φ : X → R^d?
- so the response y will depend linearly on φ(x)
- so d is not too big
if you think this looks like a hack: you're right

8 Feature engineering examples:
- adding offset
- standardizing features
- polynomial fits
- products of features
- autoregressive models
- local linear regression
- transforming images
- transforming Booleans
- transforming ordinals
- transforming nominals
- transforming sentences
- concatenating data
- all of the above

9 Adding offset
X = R^{d-1}
let φ(x) = (x, 1)
now h(x) = w^T φ(x) = w_{1:d-1}^T x + w_d

10 Fitting a polynomial
X = R
let φ(x) = (1, x, x^2, x^3, ..., x^{d-1}) be the vector of all monomials in x of degree < d
now h(x) = w^T φ(x) = w_1 + w_2 x + w_3 x^2 + ... + w_d x^{d-1}
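In the least-squares sketch above, this feature map is a one-liner; the degree and the data below are hypothetical:

```julia
# monomials of degree < d for scalar input x
d = 4
φ(x) = [x^j for j in 0:d-1]                      # (1, x, x², x³)

xs = randn(30)                                   # hypothetical data
ys = 1 .- xs .+ 0.5 .* xs.^3 .+ 0.1 .* randn(30)
Φ = vcat([φ(x)' for x in xs]...)                 # Vandermonde-style matrix
w = Φ \ ys                                       # fits w₁ + w₂x + w₃x² + w₄x³
```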

11 Demo: Linear models

12 Fitting a multivariate polynomial
X = R^2
pick a maximum degree k
let φ(x) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2, x_1^3, x_1^2 x_2, x_1 x_2^2, x_2^3, ..., x_2^k) be the vector of all monomials in x_1 and x_2 of degree ≤ k
now h(x) = w^T φ(x) can fit any polynomial of degree k in X
and similarly for X = R^d
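One way to enumerate these monomials in code; a sketch assuming x ∈ R^2 and total degree ≤ k:

```julia
# all monomials in (x₁, x₂) of total degree ≤ k
k = 2
φ(x) = [x[1]^i * x[2]^j for i in 0:k for j in 0:k-i]
# for k = 2 this gives (1, x₂, x₂², x₁, x₁x₂, x₁²): the same features
# as the slide's list, in a different order (the span is identical)
```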

14 Example: fitting a multivariate polynomial
X = R^2, Y = {-1, 1}
let φ(x) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2) be the vector of all monomials of degree ≤ 2
now let h(x) = sign(w^T φ(x))
Q: if h(x) = sign(5 - 3x_1 + 2x_2 + x_1^2 - x_1 x_2 + 2x_2^2), what is {x : h(x) = 1}?
A: an ellipse!

16 Example: fitting a multivariate polynomial
X = R^2, Y = {-1, 1}
let φ(x) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2) be the vector of all monomials of degree ≤ 2
now let h(x) = sign(w^T φ(x))
Q: if h(x) = sign(5 - 3x_1 + 2x_2 + x_1^2 - x_1 x_2 - 2x_2^2), what is {x : h(x) = 1}?
A: a hyperbola!

18 Fitting a time series
given a time series x_t ∈ R, t = 1, ..., T
want to predict the value at the next time T + 1
input: time index t and time series x_{1:t} up to time t
let φ(t, x) = (x_{t-1}, x_{t-2}, ..., x_{t-d}) (called the lagged outcomes)
now h(t, x) = w^T φ(t, x) = w_1 x_{t-1} + w_2 x_{t-2} + ... + w_d x_{t-d}
also called an auto-regressive (AR) model
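A sketch of the AR feature construction on a made-up series; the series, its length, and the lag count d are all hypothetical:

```julia
# lagged features for an AR(d) model on a series x₁, ..., x_T
T, d = 200, 3
x = cumsum(randn(T))                       # hypothetical random-walk series
ts = d+1:T                                 # times with d past values available
Φ = vcat([x[t-1:-1:t-d]' for t in ts]...)  # row for time t: (x_{t-1}, ..., x_{t-d})
y = x[ts]
w = Φ \ y                                  # least-squares AR coefficients
```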

19 New notation: boolean indicator function
define 1(statement) = 1 if the statement is true, 0 if the statement is false
examples: 1(1 < 0) = 0, 1(17 = 17) = 1

20 Fitting a local model: local neighbors
X = R^m
pick a set of points µ_i ∈ R^m, i = 1, ..., d, and a radius δ ∈ R
let φ(x) = [1(‖x - µ_1‖_2 ≤ δ), ..., 1(‖x - µ_d‖_2 ≤ δ)]
often, d = n and µ_i = x_i

21 Fitting a local model: 1 nearest neighbor
X = R^m
pick a set of points µ_i ∈ R^m, i = 1, ..., d
let δ = min(‖x - µ_1‖_2, ..., ‖x - µ_d‖_2)
let φ(x) = [1(‖x - µ_1‖_2 ≤ δ), ..., 1(‖x - µ_d‖_2 ≤ δ)]
often, d = n and µ_i = x_i

22 Fitting a local model: smoothing
X = R^m
pick a set of points µ_i ∈ R^m, i = 1, ..., d, and a parameter α ∈ R
let φ(x) = (exp(-α‖x - µ_1‖_2), ..., exp(-α‖x - µ_d‖_2))
often, d = n and µ_i = x_i
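All three local models share the same skeleton: compute the distance from x to each center µ_i, then transform it. A sketch of the smoothing version, with hypothetical data and a made-up α:

```julia
using LinearAlgebra

# Gaussian smoothing features centered at the training points
n, m = 40, 2
X = randn(n, m)                     # hypothetical training inputs, one per row
μ = [X[i, :] for i in 1:n]          # centers: μᵢ = xᵢ
α = 1.0                             # kernel width parameter
φ(x) = [exp(-α * norm(x - μi)) for μi in μ]
```

Replacing exp(-α r) with the indicator 1(r ≤ δ) of the distance r recovers the local-neighbors features, and setting δ to the minimum distance to any center gives the 1-nearest-neighbor map.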

23 Demo: Crime
preprocessing and predicting: Predicting%20crime.ipynb

24 Boolean variables
X = {true, false}
let φ(x) = 1(x)

25 Boolean expressions
X = {true, false}^2 = {(true, true), (true, false), (false, true), (false, false)}
let φ(x) = [1(x_1), 1(x_2), 1(x_1 and x_2), 1(x_1 or x_2)]
equivalent: polynomials in [1(x_1), 1(x_2)] span the same space
encodes logical expressions!
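A sketch of this map as code, with the Booleans converted to 0/1 floats:

```julia
# feature map for a pair of Booleans, with and/or as extra features
φ(x1::Bool, x2::Bool) = Float64[x1, x2, x1 && x2, x1 || x2]

φ(true, false)    # returns [1.0, 0.0, 0.0, 1.0]
```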

26 Nominal values
X = {apple, orange, banana}
let φ(x) = [1(x = apple), 1(x = orange), 1(x = banana)]
called a one-hot encoding: only one element is non-zero
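The same encoding in code; a sketch with hypothetical category labels:

```julia
# one-hot encoding of a nominal variable
levels = ["apple", "orange", "banana"]
φ(x) = Float64[x == l for l in levels]

φ("orange")       # returns [0.0, 1.0, 0.0]
```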

27 Language
X = sentences, documents, tweets, ...
let {w_1, ..., w_d} be a set of words (or hashtags, or emoji, or ...)
let φ(x) = [1(x contains w_1), ..., 1(x contains w_d)]
called the bag-of-words model: it ignores the order of the words in the sentence
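A sketch of bag-of-words features; the vocabulary and the example document are made up, and words are split on whitespace:

```julia
# bag-of-words: which vocabulary words appear in the document?
vocab = ["big", "messy", "data"]
φ(doc) = Float64[w in split(doc) for w in vocab]

φ("learning with big messy data")   # returns [1.0, 1.0, 1.0]
```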

28 Review
- linear models are linear in the parameters w
- can fit many different models by picking the feature mapping φ : X → R^d

29 What makes a good project?
- a clear outcome to predict
- linear regression should do something interesting
- a new, interesting model; not a Kaggle competition
- avoid: images, time series, NLP
