Stats 170A: Project in Data Science Predictive Modeling: Regression


1 Stats 170A: Project in Data Science Predictive Modeling: Regression Padhraic Smyth Department of Computer Science Bren School of Information and Computer Sciences University of California, Irvine

2 Reading, Homework, Lectures Reference reading: Chapters 1, 2 and 4 in Géron's text, Hands-On Machine Learning with Scikit-Learn and TensorFlow. Homework 6 will be announced shortly and will be due by 2pm Wednesday next week (Monday is a holiday). Next lectures: Today: prediction with regression. Wednesday: prediction with classification. Wednesday next week: text analysis and classification. Likely to be 1 more homework (#7) and then project mode. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 2

3 Data Science: from Data to Actions [Diagram: raw data flows through data wrangling, exploratory data analysis, predictive modeling, and data management out to consumers (external business customers, internal business customers, government, scientists); supporting skills: databases, algorithms, software engineering; machine learning, statistics; domain knowledge and business knowledge] Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 3

4 Basic Principles of Predictive Modeling Reference reading: Chapters 1, 2 and 4 in Géron's text, Hands-On Machine Learning with Scikit-Learn and TensorFlow Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 4

5 Predictive Modeling Two basic types of predictive models. Regression: predict a real-valued variable Y given input vector X. E.g., predict what a customer will spend with a merchant in the next 12 months. Classification: predict a categorical variable Y given input vector X. E.g., predict if a credit card transaction is fraudulent or not. Both problems can be addressed by statistical or machine learning approaches. Both problems share common mathematical/statistical foundations. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 5

6 Learning a Regression Model Training Data (columns: Lot Area, Lot Frontage, 1stFlr SF, GrLiv Area, Mo Sold, Yr Sold, Year Built, Sale Price). A learning algorithm learns a function that maps the feature values on the left to the target value (Sale Price) on the right. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 6

7 Learning a Regression Model Training Data (columns: Lot Area, Lot Frontage, 1stFlr SF, GrLiv Area, Mo Sold, Yr Sold, Year Built, Sale Price). Test Data (same feature columns, with Sale Price unknown, shown as ??). We can then use the model to make predictions when target values are unknown. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 7
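A minimal sketch of this train-then-predict workflow in scikit-learn, assuming the training and test tables are available as CSV files; the file names are hypothetical, and the column names follow the housing example above.

import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["LotArea", "LotFrontage", "1stFlrSF", "GrLivArea",
            "MoSold", "YrSold", "YearBuilt"]

train = pd.read_csv("housing_train.csv")   # hypothetical training file (SalePrice known)
test = pd.read_csv("housing_test.csv")     # hypothetical test file (SalePrice unknown)

model = LinearRegression()
model.fit(train[features], train["SalePrice"])    # learn the mapping from features to target
predicted_prices = model.predict(test[features])  # predict where the target is unknown
print(predicted_prices[:5])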

8 Machine Learning Notation Features x: inputs to the model. Targets y: the target value we would like to predict. Predictions ŷ = f(x, w): the model's prediction given inputs x and weights w. Parameters α or w: the weights or coefficients specifying the model. Error e(y, ŷ): the error between target and prediction (e.g., squared error). Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 8

9 Translating between Statistics and Machine Learning Statistics vs. machine learning terminology: parameter estimation = learning algorithm; coefficients, parameters = weights; covariates, independent variables = inputs, features, attributes; dependent variable = target. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 9

10 Regression Y is real-valued. E.g., Y can take values y from −∞ to +∞, or y from 0 to +∞, etc. Typical applications: Predicting how many items of a particular type Amazon will need in 6 months. Predicting how many times a Facebook user will log in in the next month. Predicting how much a house will sell for. Predicting the value of the Dow Jones Index by end of day tomorrow. And so on. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 10

11 Classification Y is categorical, e.g., Y is binary and takes values y in {0, 1}. Many applications: Predict if an email is spam or not based on its content. Predict what word a person spoke given an audio signal. Predict if a moving object in an image is a person or not. Classification is closely related to regression: similar underlying mathematical principles (sidenote: mathematically, both are trying to predict E[y | x]). In classification it is often very useful to predict p(y = 1 | x), i.e., a real-valued number between 0 and 1 (rather than just a hard decision). Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 11

12 Machine Learning Algorithms: A Component View Prediction Model + Loss Function + Optimization Method. Prediction model: the functional form of f(x; w), e.g., weighted linear sum, decision tree, ... Loss function: how we measure the quality of the model's predictions for specific parameter settings w, e.g., squared error, absolute error, ... Optimization method: the algorithm that finds the parameters w that minimize the error function, e.g., solving a set of linear equations, gradient descent, etc. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 12

13 Linear Model A linear model computes a weighted sum of the inputs. E.g., in 2 dimensions (2 input features): f(x, w) = predicted output = w_0 + w_1 x_1 + w_2 x_2. Why do we need this extra constant weight? More generally, with an arbitrary number of input features: f(x, w) = w_0 + Σ_j w_j x_j = w_0 + w^T x. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 13

14 Optimizing (Minimizing) the Error Function (for squared error and a linear model) E(w) = Σ_i squared_error(ith target, ith prediction) = Σ_i (y_i − f(x_i; w))^2 = Σ_i (y_i − (w_0 + w_1 x_i1 + w_2 x_i2))^2. How do we minimize this as a function of the weights w? Necessary condition: partial derivative for each weight = 0. Because this function is quadratic in the w's, each partial derivative is linear in w. In the example above, we get 3 simultaneous linear equations (all = 0) in 3 unknowns. In general we will get D simultaneous linear equations with D parameters. Time complexity of solving a set of D simultaneous linear equations is O(D^3 + D^2 N) (may be challenging for large D). Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 14
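A minimal sketch of this closed-form solution on synthetic data: setting the partial derivatives to zero gives the normal equations (X^T X) w = X^T y, which numpy can solve directly. The data and true weights below are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 2
X = rng.normal(size=(N, D))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=N)  # synthetic targets

X1 = np.column_stack([np.ones(N), X])     # prepend a column of 1s for the constant weight w_0
w = np.linalg.solve(X1.T @ X1, X1.T @ y)  # solve the D+1 simultaneous linear equations
print(w)                                  # approximately [3.0, 2.0, -1.0]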

15 Gradient Vectors [Figure: contours of the error function E(w) in the (w_1, w_2) plane] Consider a problem with 2 weights. E(w) is a scalar function in 2 dimensions (in this example; in general, D dimensions). For the squared error loss function, E(w) is a smooth bowl. The gradient vector is the vector of 2 partial derivatives, one for w_1 and one for w_2. It points in the direction where the error increases the most. We can define the gradient at any point in the 2d space. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 15

16 Alternative Optimization Method: Gradient Descent Algorithm A simple algorithm that uses the gradient to search for the minimum of E(w): Start at some random w location. Compute the gradient ∇E(w). Move in the direction −∇E(w) (typically take small steps). Recompute the gradient, and iterate. Repeat until there is no improvement. Theoretical properties? If the step sizes are small enough, it is guaranteed to find a (local) minimum. Simple example of gradient descent for p = 2 dimensions. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 16
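A minimal sketch of batch gradient descent for the same squared-error linear model, assuming X1 (features with a leading column of 1s) and y are defined as in the previous sketch; the learning rate and iteration count are hypothetical choices.

import numpy as np

def gradient_descent(X1, y, lr=0.01, n_iters=1000):
    w = np.zeros(X1.shape[1])                 # start at some location (here: all zeros)
    for _ in range(n_iters):
        residuals = X1 @ w - y
        grad = 2 * X1.T @ residuals / len(y)  # gradient of the mean squared error
        w = w - lr * grad                     # small step in the negative gradient direction
    return w

# e.g. w_hat = gradient_descent(X1, y) should approach the closed-form solution above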

17 Optimizing E(w) with a single weight w [Plot: a convex E(w) curve] Easy to find the minimum (convex problem with a single global minimum). Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 17

18 Optimizing E(w) with a single weight w [Plot: a convex E(w) curve] Easy to find the minimum (convex problem with a single global minimum). [Plot: a non-convex E(w) curve] Hard to find the minimum (non-convex problem with local minima). Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 18

19 Visualization of the Effect of Different Learning Rates Learning rate too small. Learning rate about right. Learning rate too large. Blue lines show the fitted line after each gradient update; red dotted lines are the initial starting point. From: Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 19

20 Other Variants of Gradient-Based Optimization Second-order derivative (Newton) methods: Use a matrix of second derivatives in determining the direction to move. Scales in computation time as O(D^3), since it needs to invert a D x D matrix of derivatives, where D is the number of features; not practical with large numbers of features. Stochastic gradient descent: compute the gradient using only a subset (a minibatch) of the data and then move in the direction of this approximate gradient. Continue to cycle through small minibatches. Can result in very large speedups on large data sets. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 20
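A minimal sketch of minibatch stochastic gradient descent, under the same assumptions as the previous sketches (X1 includes a column of 1s); the batch size, learning rate, and epoch count are hypothetical. scikit-learn's SGDRegressor provides a ready-made version of the same idea.

import numpy as np

def sgd(X1, y, lr=0.01, n_epochs=20, batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X1.shape[1])
    N = len(y)
    for _ in range(n_epochs):
        order = rng.permutation(N)                 # visit the data in a new random order each epoch
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]  # one small minibatch
            residuals = X1[idx] @ w - y[idx]
            grad = 2 * X1[idx].T @ residuals / len(idx)  # approximate gradient from the minibatch
            w = w - lr * grad
    return w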

21 Stochastic Gradient Descent (Example in 2-dimensional Parameter Space) [Figure: full gradient steps vs. stochastic gradient steps in a 2-dimensional parameter space] Empirically works very well on large data sets; some theoretical support. (See the Adam algorithm, by Kingma and Ba, ICLR, 2015.) Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 21

22 Evaluating the Quality of a Predictive Model Predictive error/accuracy on new data (machine learning perspective): We can measure the error on the training data, but a better measure of error is on held-out test data. Quality of model fit to training data (statistics perspective): How certain are we about the parameter values? Other aspects may be important in certain applications: Interpretability of the model, e.g., in law, in credit scoring, in medicine, in autonomous vehicles. Time and memory required to make predictions, e.g., for models deployed on mobile devices. Ease of updating the model, e.g., for models that must be continually updated (spam filtering, advertising). Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 22

23 Squared Error and R^2 Evaluation Metrics Mean-Squared Error: MSE = (1/N) Σ_i (y_i − f(x_i; w))^2. R-squared = (Var(y) − MSE) / Var(y). R-squared varies between 0 and 1 and is the percent of variance explained by the model. Var(y) = MSE if we predicted y with the best constant (which is the mean of y). R-squared is the relative improvement we get by using the model over the best constant predictor. Example: Var(y) = 20, MSE = 16, R-squared = (20 − 16)/20 = 20%. If MSE = 1, then R-squared = (20 − 1)/20 = 95% (a much better model). Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 23
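A minimal sketch of computing these metrics for a model's predictions; the target and predicted values below are made-up numbers, and the last line checks the (Var(y) − MSE) / Var(y) formula against sklearn's r2_score.

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([200.0, 150.0, 300.0, 250.0])  # hypothetical observed targets
y_pred = np.array([210.0, 140.0, 290.0, 260.0])  # hypothetical model predictions

mse = mean_squared_error(y_true, y_pred)
print(mse)                                       # mean of the squared residuals
print(r2_score(y_true, y_pred))                  # R-squared from scikit-learn
print((np.var(y_true) - mse) / np.var(y_true))   # same value via (Var(y) - MSE) / Var(y)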

24 Diagnosis: Plotting Predictions versus Actual Targets Housing prices data set: linear regression model with predictions on the training data. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 24

25 Example: Regression Data Set for Housing Data Columns: Lot Area, Lot Frontage, 1stFlr SF, GrLiv Area, Mo Sold, Yr Sold, Year Built, Sale Price. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 25

26 Output from linear regression using statsmodels Dep. Variable: SalePrice R-squared: Model: OLS Adj. R-squared: Method: Least Squares F-statistic: Date: Mon, 12 Feb 2018 Prob (F-statistic): 8.91e-297 Time: 05:56:48 Log-Likelihood: No. Observations: 1201 AIC: 1.265e+04 Df Residuals: 1193 BIC: 1.269e+04 Df Model: 7 Covariance Type: nonrobust Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 26

27 Output from linear regression using statsmodels coef std err t P>|t| [95.0% Conf. Int.] const LotArea LotFrontage 1stFlrSF GrLivArea MoSold YrSold YearBuilt Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 27
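A minimal sketch of how a summary like the one above can be produced with statsmodels; the CSV file name is hypothetical, and the column names follow the housing example.

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("ames_housing.csv")       # hypothetical file with the housing data
features = ["LotArea", "LotFrontage", "1stFlrSF", "GrLivArea",
            "MoSold", "YrSold", "YearBuilt"]

X = sm.add_constant(df[features])          # statsmodels does not add the intercept automatically
result = sm.OLS(df["SalePrice"], X).fit()  # ordinary least squares
print(result.summary())                    # coefficients, std errors, t, p-values, R-squared, AIC/BIC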

28 Scaling Variables Say we have learned a regression model: f(x, w) = predicted output = 2 x_1 + 200 x_2. How should we interpret the weights w_1 = 2 and w_2 = 200? At first glance it looks like w_2 is 100 times more important than w_1. But this depends on the scale of the 2 variables. Say x_2 ranges from 0 to 1 and x_1 ranges from 0 to 1000 => the min/max effect on y is [0, 2000] for x_1 and [0, 200] for x_2 (so x_1 may have more impact on y). For this reason, it can be helpful to prescale all x's (e.g., to the range [0, 1]). See the scikit-learn preprocessing documentation for examples and methods. Conclusion: when interpreting coefficients we need to consider the scales of the x's. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 28
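A minimal sketch of rescaling each feature to [0, 1] with scikit-learn; the small array is made up, and in practice the scaler would be fit on the training data and then applied unchanged to the test data.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[500.0, 0.2],
              [900.0, 0.7],
              [100.0, 0.9]])               # two features on very different scales

scaler = MinMaxScaler()                    # maps each column to the range [0, 1]
X_scaled = scaler.fit_transform(X)         # learn per-column min/max, then rescale
print(X_scaled)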

29 Collinearity of Input Variables Collinearity in regression means that some of the input x's are highly correlated with each other. Consider a model f(x, w) = w_1 x_1 + w_2 x_2, and say we find out that x_1 and x_2 are highly correlated and on the same scale. If so, we can effectively replace w_1 x_1 + w_2 x_2 with many other linear combinations of x_1 and x_2 that make essentially the same predictions. Conclusion: with collinearity we need to be careful when interpreting regression coefficients. (Collinearity can also cause numerical issues, particularly in small data sets.) Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 29

30 Encoding Categorical Variables Our features might not be real-valued but instead categorical with string values, e.g., ["male", "female"], ["from Europe", "from US", "from Asia"], ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]. Such features are often encoded as multiple binary ("one-hot") variables, i.e., if there are M string values we get M binary variables, one per value. For large M this leads to highly sparse input vectors (many 0's). Many machine learning packages are optimized to work with sparse inputs. One-hot representations are common when using machine learning with text, where each binary variable represents whether a word occurred or not in a document. See the scikit-learn preprocessing documentation for more information. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 30
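A minimal sketch of one-hot encoding a string-valued feature with pandas; the example values are taken from the list above, and scikit-learn's OneHotEncoder does the same job and can produce sparse output.

import pandas as pd

df = pd.DataFrame({"browser": ["uses Firefox", "uses Chrome",
                               "uses Safari", "uses Chrome"]})
one_hot = pd.get_dummies(df["browser"])    # one binary column per distinct string value
print(one_hot)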

31 Missing Data at Training Time Training Data (columns: Lot Area, Lot Frontage, 1stFlr SF, GrLiv Area, Mo Sold, Yr Sold, Year Built, Sale Price); some feature values are missing, shown as ??. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 31

32 Missing Data at Prediction Time Training Data and Test Data (columns: Lot Area, Lot Frontage, 1stFlr SF, GrLiv Area, Mo Sold, Yr Sold, Year Built, Sale Price); some feature values are missing in both tables, shown as ??, and Sale Price is unknown in the test data. We can then use the model to make predictions when target values are unknown. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 32

33 Dealing with Missing Data Missing at random... versus other missing mechanisms. Simple methods: Removal: remove that row from the data. Imputation: replace a missing value with the median value from that column. Note that this can be a poor way to infer a likely value, since it ignores the other values in the row. More complex methods: Use regression to predict each missing value from all the other known data. Define additional binary variables that encode "missing", i.e., treat missing as an additional categorical value that might have predictive power. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 33
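A minimal sketch of the two simple methods (removal and median imputation) with pandas; the tiny table and its values are made up for illustration.

import numpy as np
import pandas as pd

df = pd.DataFrame({"LotFrontage": [65.0, np.nan, 80.0, np.nan],   # missing values as NaN
                   "LotArea": [8450, 11250, 9600, 10400]})

dropped = df.dropna()                                             # removal: drop rows with any missing value
imputed = df.fillna({"LotFrontage": df["LotFrontage"].median()})  # imputation: fill with the column median
print(dropped)
print(imputed)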

34 Machine Learning Algorithms: A Component View Prediction Model + Loss Function + Optimization Method. Prediction model: the functional form of f(x; w), e.g., weighted linear sum, decision tree, ... Loss function: how we measure the quality of the model's predictions for specific parameter settings w, e.g., squared error, absolute error, ... Optimization method: the algorithm that finds the parameters w that minimize the error function, e.g., solving a set of linear equations, gradient descent, etc. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 34

35 Different Types of Predictive Models Linear models: f(x; α) = α_0 + α_1 x_1 + α_2 x_2 + ... + α_d x_d = α_0 + Σ_j α_j x_j. Generalized linear models: f(x; α) = g(α_0 + Σ_j α_j x_j), with a non-linear link function g, e.g., the logistic function. Neural networks: f(x; α, β) = α_0 + Σ_k α_k g_k(β_k0 + Σ_j β_kj x_j), with non-linear hidden units g_k, e.g., logistic functions. Many different types of model based on compositions of linear weighted sums, typically trained via gradient methods. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 35

36 Other Types of Models Decision-tree regression: Recursively split the input space into regions of approximately constant y values. Nearest-neighbor regression: For a new input x, find the K nearest neighbors of x in the training data, and predict the average y value of the K neighbors. Many variations on this theme use weighted neighbors, etc. An example of non-parametric regression; tends not to work well in high dimensions. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 36
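A minimal sketch of nearest-neighbor regression with scikit-learn on synthetic one-dimensional data; K = 5 and the sine-shaped data are arbitrary choices for illustration.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(100, 1))                        # synthetic inputs
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.1, size=100)  # synthetic noisy targets

knn = KNeighborsRegressor(n_neighbors=5)     # K = 5 nearest neighbors
knn.fit(X_train, y_train)
print(knn.predict([[3.0], [7.5]]))           # average target of the 5 closest training points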

37 Decision Tree Regression Example From: Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 37

38 Decision Tree Regression Example 2 Predicting gas mileage given characteristics of cars From: Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 38

39 Learning Decision Tree Regression Models Greedy search: for each input feature, find the best binary split (threshold); best = the split that results in the lowest MSE on each side of the split (like k-means with k = 2). Select the feature + split that results in the greatest decrease in average MSE. Partition the data, and recursively keep splitting. Use some heuristic to stop growing each subtree. Prediction: predict the mean value of the training data at each leaf node. Problem: trees can be unstable and less accurate. Solution: perturb the data, build multiple trees, and average their predictions. This gives a broad class of tree-averaging algorithms; one of the best known is Random Forests. Note: trees can work well with both real and categorical data. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 39
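A minimal sketch contrasting a single regression tree with a tree-averaging ensemble (a random forest) in scikit-learn; the synthetic data, tree depth, and number of trees are arbitrary illustration choices.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = X[:, 0] ** 2 - 3 * X[:, 1] + rng.normal(scale=1.0, size=200)   # synthetic targets

tree = DecisionTreeRegressor(max_depth=4).fit(X, y)                # greedy recursive splitting
forest = RandomForestRegressor(n_estimators=100).fit(X, y)         # average of many perturbed trees
print(tree.predict([[5.0, 2.0]]), forest.predict([[5.0, 2.0]]))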

40 Time-Series Prediction as Regression Time-series prediction as regression: Predict y(t+1) where X = {y(t), y(t−1), y(t−2), ...}. This is known as autoregression. Can use the same types of models and fitting techniques as regression. Can also use additional variables, e.g., X = {y(t), v(t), z(t), ...}. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 40

41 Time-Series Prediction as Regression Original time-series data = 63.1, 65.0, 80.0, 68.0, 84.0, 92.1, 98.0, ... Convert it into a format suitable for regression: Training Data with columns Month, Sales Last Month, Sales This Month, Sales Next Month. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 41
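A minimal sketch of this conversion with pandas, using the sales numbers above: lagged copies of the series become the input features and the next month's value becomes the target.

import pandas as pd

sales = pd.Series([63.1, 65.0, 80.0, 68.0, 84.0, 92.1, 98.0])
df = pd.DataFrame({"sales_last_month": sales.shift(1),    # lagged input feature
                   "sales_this_month": sales,             # current value as input feature
                   "sales_next_month": sales.shift(-1)})  # target to predict
df = df.dropna()    # the first and last rows have undefined lags
print(df)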

42 Differences between Machine Learning and Statistics Statistics: emphasis on interpreting parameters in the model; less emphasis on prediction; many models for regression; Python package: Statsmodels. Machine Learning: emphasis on prediction on new data; less emphasis on interpreting parameters in the model; many models for classification; Python package: Scikit-learn. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 42
