Stats 170A: Project in Data Science Predictive Modeling: Regression


1 Stats 170A: Project in Data Science Predictive Modeling: Regression Padhraic Smyth Department of Computer Science Bren School of Information and Computer Sciences University of California, Irvine

2 Reading, Homework, Lectures Reference reading: Chapters 1, 2 and 4 in Géron's text, Hands-On Machine Learning with Scikit-Learn and TensorFlow. Homework 6 will be announced shortly and will be due by 2pm Wednesday next week (Monday is a holiday). Next lectures: Today: prediction with regression. Wednesday: prediction with classification. Wednesday next week: text analysis and classification. Likely to be 1 more homework (#7) and then project mode. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 2

3 Data Science: from Data to Actions [Diagram: raw data flows through data wrangling, exploratory data analysis, predictive modeling, and data management out to consumers (external business customers, internal business customers, government, scientists); supporting skills: databases, algorithms, software engineering; machine learning, statistics; domain knowledge and business knowledge] Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 3

4 Basic Principles of Predictive Modeling Reference reading: Chapters 1, 2 and 4 in Géron's text, Hands-On Machine Learning with Scikit-Learn and TensorFlow Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 4

5 Predictive Modeling Two basic types of predictive models. Regression: predict a real-valued variable Y given input vector X. E.g., predict what a customer will spend with a merchant in the next 12 months. Classification: predict a categorical variable Y given input vector X. E.g., predict if a credit card transaction is fraudulent or not. Both problems can be addressed by statistical or machine learning approaches. Both problems share common mathematical/statistical foundations. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 5

6 Learning a Regression Model Training Data (columns: Lot Area, Lot Frontage, 1stFlr SF, GrLiv Area, Mo Sold, Yr Sold, Year Built, Sale Price). A learning algorithm learns a function that maps the feature values on the left to the target value (Sale Price) on the right. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 6

7 Learning a Regression Model Training Data (columns: Lot Area, Lot Frontage, 1stFlr SF, GrLiv Area, Mo Sold, Yr Sold, Year Built, Sale Price). Test Data (same feature columns, with Sale Price unknown, shown as ??). We can then use the model to make predictions when target values are unknown. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 7
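A minimal sketch of this train-then-predict workflow in scikit-learn, assuming the training and test tables are available as CSV files; the file names are hypothetical, and the column names follow the housing example above.

import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["LotArea", "LotFrontage", "1stFlrSF", "GrLivArea",
            "MoSold", "YrSold", "YearBuilt"]

train = pd.read_csv("housing_train.csv")   # hypothetical training file (SalePrice known)
test = pd.read_csv("housing_test.csv")     # hypothetical test file (SalePrice unknown)

model = LinearRegression()
model.fit(train[features], train["SalePrice"])    # learn the mapping from features to target
predicted_prices = model.predict(test[features])  # predict where the target is unknown
print(predicted_prices[:5])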

8 Machine Learning Notation Features x: inputs to the model. Targets y: the target value we would like to predict. Predictions ŷ = f(x, w): the model's prediction given inputs x and weights w. Parameters α or w: the weights or coefficients specifying the model. Error e(y, ŷ): the error between target and prediction (e.g., squared error). Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 8

9 Translating between Statistics and Machine Learning Statistics vs. machine learning terminology: parameter estimation = learning algorithm; coefficients, parameters = weights; covariates, independent variables = inputs, features, attributes; dependent variable = target. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 9

10 Regression Y is real-valued. E.g., Y can take values y from −∞ to +∞, or y from 0 to +∞, etc. Typical applications: Predicting how many items of a particular type Amazon will need in 6 months. Predicting how many times a Facebook user will log in in the next month. Predicting how much a house will sell for. Predicting the value of the Dow Jones Index by end of day tomorrow. And so on. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 10

11 Classification Y is categorical, e.g., Y is binary and takes values y in {0, 1}. Many applications: Predict if an email is spam or not based on its content. Predict what word a person spoke given an audio signal. Predict if a moving object in an image is a person or not. Classification is closely related to regression: similar underlying mathematical principles (sidenote: mathematically, both are trying to predict E[y | x]). In classification it is often very useful to predict p(y = 1 | x), i.e., a real-valued number between 0 and 1 (rather than just a hard decision). Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 11

12 Machine Learning Algorithms: A Component View Prediction Model + Loss Function + Optimization Method. Prediction model: the functional form of f(x; w), e.g., weighted linear sum, decision tree, ... Loss function: how we measure the quality of the model's predictions for specific parameter settings w, e.g., squared error, absolute error, ... Optimization method: the algorithm that finds the parameters w that minimize the error function, e.g., solving a set of linear equations, gradient descent, etc. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 12

13 Linear Model A linear model computes a weighted sum of the inputs. E.g., in 2 dimensions (2 input features): f(x, w) = predicted output = w_0 + w_1 x_1 + w_2 x_2. Why do we need this extra constant weight? More generally, with an arbitrary number of input features: f(x, w) = w_0 + Σ_j w_j x_j = w_0 + w^T x. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 13

14 Optimizing (Minimizing) the Error Function (for squared error and a linear model) E(w) = Σ_i squared_error(ith target, ith prediction) = Σ_i (y_i − f(x_i; w))^2 = Σ_i (y_i − (w_0 + w_1 x_i1 + w_2 x_i2))^2. How do we minimize this as a function of the weights w? Necessary condition: partial derivative for each weight = 0. Because this function is quadratic in the w's, each partial derivative is linear in w. In the example above, we get 3 simultaneous linear equations (all = 0) in 3 unknowns. In general we will get D simultaneous linear equations with D parameters. Time complexity of solving a set of D simultaneous linear equations is O(D^3 + D^2 N) (may be challenging for large D). Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 14
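A minimal sketch of this closed-form solution on synthetic data: setting the partial derivatives to zero gives the normal equations (X^T X) w = X^T y, which numpy can solve directly. The data and true weights below are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 2
X = rng.normal(size=(N, D))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=N)  # synthetic targets

X1 = np.column_stack([np.ones(N), X])     # prepend a column of 1s for the constant weight w_0
w = np.linalg.solve(X1.T @ X1, X1.T @ y)  # solve the D+1 simultaneous linear equations
print(w)                                  # approximately [3.0, 2.0, -1.0]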

15 Gradient Vectors [Figure: contours of the error function E(w) in the (w_1, w_2) plane] Consider a problem with 2 weights. E(w) is a scalar function in 2 dimensions (in this example; in general, D dimensions). For the squared error loss function, E(w) is a smooth bowl. The gradient vector is the vector of 2 partial derivatives, one for w_1 and one for w_2. It points in the direction where the error increases the most. We can define the gradient at any point in the 2d space. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 15

16 Alternative Optimization Method: Gradient Descent Algorithm A simple algorithm that uses the gradient to search for the minimum of E(w): Start at some random w location. Compute the gradient ∇E(w). Move in the direction −∇E(w) (typically take small steps). Recompute the gradient, and iterate. Repeat until there is no improvement. Theoretical properties? If the step sizes are small enough, it is guaranteed to find a (local) minimum. Simple example of gradient descent for p = 2 dimensions. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 16
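A minimal sketch of batch gradient descent for the same squared-error linear model, assuming X1 (features with a leading column of 1s) and y are defined as in the previous sketch; the learning rate and iteration count are hypothetical choices.

import numpy as np

def gradient_descent(X1, y, lr=0.01, n_iters=1000):
    w = np.zeros(X1.shape[1])                 # start at some location (here: all zeros)
    for _ in range(n_iters):
        residuals = X1 @ w - y
        grad = 2 * X1.T @ residuals / len(y)  # gradient of the mean squared error
        w = w - lr * grad                     # small step in the negative gradient direction
    return w

# e.g. w_hat = gradient_descent(X1, y) should approach the closed-form solution above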

17 Optimizing E(w) with a single weight w [Plot: a convex E(w) curve] Easy to find the minimum (convex problem with a single global minimum). Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 17

18 Optimizing E(w) with a single weight w [Plot: a convex E(w) curve] Easy to find the minimum (convex problem with a single global minimum). [Plot: a non-convex E(w) curve] Hard to find the minimum (non-convex problem with local minima). Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 18

19 Visualization of the Effect of Different Learning Rates Learning rate too small. Learning rate about right. Learning rate too large. Blue lines show the fitted line after each gradient update; red dotted lines are the initial starting point. From: Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 19

20 Other Variants of Gradient-Based Optimization Second-order derivative (Newton) methods: Use a matrix of second derivatives in determining the direction to move. Scales in computation time as O(D^3), since it needs to invert a D x D matrix of derivatives, where D is the number of features; not practical with large numbers of features. Stochastic gradient descent: compute the gradient using only a subset (a minibatch) of the data and then move in the direction of this approximate gradient. Continue to cycle through small minibatches. Can result in very large speedups on large data sets. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 20
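A minimal sketch of minibatch stochastic gradient descent, under the same assumptions as the previous sketches (X1 includes a column of 1s); the batch size, learning rate, and epoch count are hypothetical. scikit-learn's SGDRegressor provides a ready-made version of the same idea.

import numpy as np

def sgd(X1, y, lr=0.01, n_epochs=20, batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X1.shape[1])
    N = len(y)
    for _ in range(n_epochs):
        order = rng.permutation(N)                 # visit the data in a new random order each epoch
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]  # one small minibatch
            residuals = X1[idx] @ w - y[idx]
            grad = 2 * X1[idx].T @ residuals / len(idx)  # approximate gradient from the minibatch
            w = w - lr * grad
    return w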

21 Stochastic Gradient Descent (Example in 2-dimensional Parameter Space) [Figure: full gradient steps vs. stochastic gradient steps in a 2-dimensional parameter space] Empirically works very well on large data sets; some theoretical support. (See the Adam algorithm, by Kingma and Ba, ICLR, 2015.) Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 21

22 Evaluating the Quality of a Predictive Model Predictive error/accuracy on new data (machine learning perspective): We can measure the error on the training data, but a better measure of error is on held-out test data. Quality of model fit to training data (statistics perspective): How certain are we about the parameter values? Other aspects may be important in certain applications: Interpretability of the model, e.g., in law, in credit scoring, in medicine, in autonomous vehicles. Time and memory required to make predictions, e.g., for models deployed on mobile devices. Ease of updating the model, e.g., for models that must be continually updated (spam filtering, advertising). Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 22

23 Squared Error and R^2 Evaluation Metrics Mean-Squared Error: MSE = (1/N) Σ_i (y_i − f(x_i; w))^2. R-squared = (Var(y) − MSE) / Var(y). R-squared varies between 0 and 1 and is the percent of variance explained by the model. Var(y) = MSE if we predicted y with the best constant (which is the mean of y). R-squared is the relative improvement we get by using the model over the best constant predictor. Example: Var(y) = 20, MSE = 16, R-squared = (20 − 16)/20 = 20%. If MSE = 1, then R-squared = (20 − 1)/20 = 95% (a much better model). Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 23
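A minimal sketch of computing these metrics for a model's predictions; the target and predicted values below are made-up numbers, and the last line checks the (Var(y) − MSE) / Var(y) formula against sklearn's r2_score.

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([200.0, 150.0, 300.0, 250.0])  # hypothetical observed targets
y_pred = np.array([210.0, 140.0, 290.0, 260.0])  # hypothetical model predictions

mse = mean_squared_error(y_true, y_pred)
print(mse)                                       # mean of the squared residuals
print(r2_score(y_true, y_pred))                  # R-squared from scikit-learn
print((np.var(y_true) - mse) / np.var(y_true))   # same value via (Var(y) - MSE) / Var(y)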

24 Diagnosis: Plotting Predictions versus Actual Targets Housing prices data set: linear regression model with predictions on the training data. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 24

25 Example: Regression Data Set for Housing Data Columns: Lot Area, Lot Frontage, 1stFlr SF, GrLiv Area, Mo Sold, Yr Sold, Year Built, Sale Price. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 25

26 Output from linear regression using statsmodels Dep. Variable: SalePrice R-squared: Model: OLS Adj. R-squared: Method: Least Squares F-statistic: Date: Mon, 12 Feb 2018 Prob (F-statistic): 8.91e-297 Time: 05:56:48 Log-Likelihood: No. Observations: 1201 AIC: 1.265e+04 Df Residuals: 1193 BIC: 1.269e+04 Df Model: 7 Covariance Type: nonrobust Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 26

27 Output from linear regression using statsmodels coef std err t P>|t| [95.0% Conf. Int.] const LotArea LotFrontage 1stFlrSF GrLivArea MoSold YrSold YearBuilt Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 27
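A minimal sketch of how a summary like the one above can be produced with statsmodels; the CSV file name is hypothetical, and the column names follow the housing example.

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("ames_housing.csv")       # hypothetical file with the housing data
features = ["LotArea", "LotFrontage", "1stFlrSF", "GrLivArea",
            "MoSold", "YrSold", "YearBuilt"]

X = sm.add_constant(df[features])          # statsmodels does not add the intercept automatically
result = sm.OLS(df["SalePrice"], X).fit()  # ordinary least squares
print(result.summary())                    # coefficients, std errors, t, p-values, R-squared, AIC/BIC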

28 Scaling Variables Say we have learned a regression model: f(x, w) = predicted output = 2 x_1 + 200 x_2. How should we interpret the weights w_1 = 2 and w_2 = 200? At first glance it looks like w_2 is 100 times more important than w_1. But this depends on the scale of the 2 variables. Say x_2 ranges from 0 to 1 and x_1 ranges from 0 to 1000 => the min/max effect on y is [0, 2000] for x_1 and [0, 200] for x_2 (so x_1 may have more impact on y). For this reason, it can be helpful to prescale all x's (e.g., to the range [0, 1]). See the scikit-learn preprocessing documentation for examples and methods. Conclusion: when interpreting coefficients we need to consider the scales of the x's. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 28
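A minimal sketch of rescaling each feature to [0, 1] with scikit-learn; the small array is made up, and in practice the scaler would be fit on the training data and then applied unchanged to the test data.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[500.0, 0.2],
              [900.0, 0.7],
              [100.0, 0.9]])               # two features on very different scales

scaler = MinMaxScaler()                    # maps each column to the range [0, 1]
X_scaled = scaler.fit_transform(X)         # learn per-column min/max, then rescale
print(X_scaled)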

29 Collinearity of Input Variables Collinearity in regression means that some of the input x's are highly correlated with each other. Consider a model f(x, w) = w_1 x_1 + w_2 x_2, and say we find out that x_1 and x_2 are highly correlated and on the same scale. If so, we can effectively replace w_1 x_1 + w_2 x_2 with many other linear combinations of x_1 and x_2 that make essentially the same predictions. Conclusion: with collinearity we need to be careful when interpreting regression coefficients. (Collinearity can also cause numerical issues, particularly in small data sets.) Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 29

30 Encoding Categorical Variables Our features might not be real-valued but instead categorical with string values, e.g., ["male", "female"], ["from Europe", "from US", "from Asia"], ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]. Such features are often encoded as multiple binary ("one-hot") variables, i.e., if there are M string values we get M binary variables, one per value. For large M this leads to highly sparse input vectors (many 0's). Many machine learning packages are optimized to work with sparse inputs. One-hot representations are common when using machine learning with text, where each binary variable represents whether a word occurred or not in a document. See the scikit-learn preprocessing documentation for more information. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 30
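A minimal sketch of one-hot encoding a string-valued feature with pandas; the example values are taken from the list above, and scikit-learn's OneHotEncoder does the same job and can produce sparse output.

import pandas as pd

df = pd.DataFrame({"browser": ["uses Firefox", "uses Chrome",
                               "uses Safari", "uses Chrome"]})
one_hot = pd.get_dummies(df["browser"])    # one binary column per distinct string value
print(one_hot)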

31 Missing Data at Training Time Training Data (columns: Lot Area, Lot Frontage, 1stFlr SF, GrLiv Area, Mo Sold, Yr Sold, Year Built, Sale Price); some feature values are missing, shown as ??. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 31

32 Missing Data at Prediction Time Training Data and Test Data (columns: Lot Area, Lot Frontage, 1stFlr SF, GrLiv Area, Mo Sold, Yr Sold, Year Built, Sale Price); some feature values are missing in both tables, shown as ??, and Sale Price is unknown in the test data. We can then use the model to make predictions when target values are unknown. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 32

33 Dealing with Missing Data Missing at random... versus other missing mechanisms. Simple methods: Removal: remove that row from the data. Imputation: replace a missing value with the median value from that column. Note that this can be a poor way to infer a likely value, since it ignores the other values in the row. More complex methods: Use regression to predict each missing value from all the other known data. Define additional binary variables that encode "missing", i.e., treat missing as an additional categorical value that might have predictive power. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 33
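A minimal sketch of the two simple methods (removal and median imputation) with pandas; the tiny table and its values are made up for illustration.

import numpy as np
import pandas as pd

df = pd.DataFrame({"LotFrontage": [65.0, np.nan, 80.0, np.nan],   # missing values as NaN
                   "LotArea": [8450, 11250, 9600, 10400]})

dropped = df.dropna()                                             # removal: drop rows with any missing value
imputed = df.fillna({"LotFrontage": df["LotFrontage"].median()})  # imputation: fill with the column median
print(dropped)
print(imputed)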

34 Machine Learning Algorithms: A Component View Prediction Model + Loss Function + Optimization Method. Prediction model: the functional form of f(x; w), e.g., weighted linear sum, decision tree, ... Loss function: how we measure the quality of the model's predictions for specific parameter settings w, e.g., squared error, absolute error, ... Optimization method: the algorithm that finds the parameters w that minimize the error function, e.g., solving a set of linear equations, gradient descent, etc. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 34

35 Different Types of Predictive Models Linear models: f(x; α) = α_0 + α_1 x_1 + α_2 x_2 + ... + α_d x_d = α_0 + Σ_j α_j x_j. Generalized linear models: f(x; α) = g(α_0 + Σ_j α_j x_j), with a non-linear link function g, e.g., the logistic function. Neural networks: f(x; α, β) = α_0 + Σ_k α_k g_k(β_k0 + Σ_j β_kj x_j), with non-linear hidden units g_k, e.g., logistic functions. Many different types of model based on compositions of linear weighted sums, typically trained via gradient methods. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 35

36 Other Types of Models Decision-tree regression: Recursively split the input space into regions of approximately constant y values. Nearest-neighbor regression: For a new input x, find the K nearest neighbors of x in the training data, and predict the average y value of the K neighbors. Many variations on this theme use weighted neighbors, etc. An example of non-parametric regression; tends not to work well in high dimensions. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 36
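A minimal sketch of nearest-neighbor regression with scikit-learn on synthetic one-dimensional data; K = 5 and the sine-shaped data are arbitrary choices for illustration.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(100, 1))                        # synthetic inputs
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.1, size=100)  # synthetic noisy targets

knn = KNeighborsRegressor(n_neighbors=5)     # K = 5 nearest neighbors
knn.fit(X_train, y_train)
print(knn.predict([[3.0], [7.5]]))           # average target of the 5 closest training points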

37 Decision Tree Regression Example From: Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 37

38 Decision Tree Regression Example 2 Predicting gas mileage given characteristics of cars From: Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 38

39 Learning Decision Tree Regression Models Greedy search: for each input feature, find the best binary split (threshold); best = the split that results in the lowest MSE on each side of the split (like k-means with k = 2). Select the feature + split that results in the greatest decrease in average MSE. Partition the data, and recursively keep splitting. Use some heuristic to stop growing each subtree. Prediction: predict the mean value of the training data at each leaf node. Problem: trees can be unstable and less accurate. Solution: perturb the data, build multiple trees, and average their predictions. This gives a broad class of tree-averaging algorithms; one of the best known is Random Forests. Note: trees can work well with both real and categorical data. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 39
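A minimal sketch contrasting a single regression tree with a tree-averaging ensemble (a random forest) in scikit-learn; the synthetic data, tree depth, and number of trees are arbitrary illustration choices.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = X[:, 0] ** 2 - 3 * X[:, 1] + rng.normal(scale=1.0, size=200)   # synthetic targets

tree = DecisionTreeRegressor(max_depth=4).fit(X, y)                # greedy recursive splitting
forest = RandomForestRegressor(n_estimators=100).fit(X, y)         # average of many perturbed trees
print(tree.predict([[5.0, 2.0]]), forest.predict([[5.0, 2.0]]))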

40 Time-Series Prediction as Regression Time-series prediction as regression: Predict y(t+1) where X = {y(t), y(t−1), y(t−2), ...}. This is known as autoregression. Can use the same types of models and fitting techniques as regression. Can also use additional variables, e.g., X = {y(t), v(t), z(t), ...}. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 40

41 Time-Series Prediction as Regression Original time-series data = 63.1, 65.0, 80.0, 68.0, 84.0, 92.1, 98.0, ... Convert it into a format suitable for regression: Training Data with columns Month, Sales Last Month, Sales This Month, Sales Next Month. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 41
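A minimal sketch of this conversion with pandas, using the sales numbers above: lagged copies of the series become the input features and the next month's value becomes the target.

import pandas as pd

sales = pd.Series([63.1, 65.0, 80.0, 68.0, 84.0, 92.1, 98.0])
df = pd.DataFrame({"sales_last_month": sales.shift(1),    # lagged input feature
                   "sales_this_month": sales,             # current value as input feature
                   "sales_next_month": sales.shift(-1)})  # target to predict
df = df.dropna()    # the first and last rows have undefined lags
print(df)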

42 Differences between Machine Learning and Statistics Statistics: emphasis on interpreting parameters in the model; less emphasis on prediction; many models for regression; Python package: Statsmodels. Machine Learning: emphasis on prediction on new data; less emphasis on interpreting parameters in the model; many models for classification; Python package: Scikit-learn. Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 42
