CS6220: DATA MINING TECHNIQUES

Matrix Data: Prediction
Instructor: Yizhou Sun (yzsun@ccs.neu.edu)
September 14, 2014

Today's Schedule
- Course Project Introduction
- Linear Regression Model
- Decision Tree

Methods to Learn (by task and data type)
- Classification: Decision Tree, Naïve Bayes, Logistic Regression, SVM, kNN (matrix data); HMM (sequence data); Label Propagation (graph & network)
- Clustering: K-means, hierarchical clustering, DBSCAN, Mixture Models, kernel k-means (matrix data); SCAN, Spectral Clustering (graph & network)
- Frequent Pattern Mining: Apriori, FP-growth (set data); GSP, PrefixSpan (sequence data)
- Prediction: Linear Regression (matrix data); Autoregression (time series)
- Similarity Search: DTW (time series); P-PageRank (graph & network)
- Ranking: PageRank (graph & network)

How to Learn These Algorithms
Three levels:
- When is it applicable? Input, output, strengths, weaknesses, time complexity
- How does it work? Pseudo-code, workflows, major steps; you can work out a toy problem by pen and paper
- Why does it work? Intuition, philosophy, objective, derivation, proof

Outline
- Matrix Data
- Linear Regression Model
- Model Evaluation and Selection
- Summary

Example
A matrix of $n \times p$: $n$ data objects/points, $p$ attributes/dimensions:
$$\begin{pmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{pmatrix}$$

Attribute Types
- Numerical: e.g., height, income
- Categorical / discrete: e.g., sex, race

Categorical Attribute Types
- Nominal: categories, states, or names of things. E.g., hair_color = {auburn, black, blond, brown, grey, red, white}; marital status, occupation, ID numbers, zip codes
- Binary: a nominal attribute with only 2 states (0 and 1)
  - Symmetric binary: both outcomes equally important, e.g., gender
  - Asymmetric binary: outcomes not equally important, e.g., a medical test (positive vs. negative). Convention: assign 1 to the more important outcome (e.g., HIV positive)
- Ordinal: values have a meaningful order (ranking), but the magnitude between successive values is not known. E.g., size = {small, medium, large}, grades, army rankings

Linear Regression Model

Linear Regression
- Ordinary Least Squares Regression
- Linear Regression with Probabilistic Interpretation

The Linear Regression Problem

Formalization
- Data: $n$ independent data objects $y_i$, $i = 1, \dots, n$, and $\mathbf{x}_i = (x_{i1}, x_{i2}, \dots, x_{ip})^T$, $i = 1, \dots, n$. Usually a constant factor is included, say $x_{i0} = 1$.
- Model: $y$ is the dependent variable, $\mathbf{x}$ the explanatory variables, and $\boldsymbol\beta = (\beta_0, \beta_1, \dots, \beta_p)^T$ the weight vector:
$$y = \mathbf{x}^T \boldsymbol\beta = \beta_0 + x_1 \beta_1 + x_2 \beta_2 + \cdots + x_p \beta_p$$

Model Construction and Usage: A 2-Step Process
- Model construction: use training data to find the best parameter $\boldsymbol\beta$, denoted as $\hat{\boldsymbol\beta}$
- Model usage:
  - Model evaluation: use test data to select the best model (e.g., by feature selection)
  - Apply the model to unseen data: $\hat{y} = \mathbf{x}^T \hat{\boldsymbol\beta}$

Least Squares Estimation
Cost function:
$$J(\boldsymbol\beta) = \sum_i (\mathbf{x}_i^T \boldsymbol\beta - y_i)^2$$
Matrix form:
$$J(\boldsymbol\beta) = (X\boldsymbol\beta - \mathbf{y})^T (X\boldsymbol\beta - \mathbf{y})$$
where $X$ is an $n \times (p+1)$ matrix.

Ordinary Least Squares (OLS)
Goal: find $\boldsymbol\beta$ that minimizes $J(\boldsymbol\beta)$:
$$J(\boldsymbol\beta) = (X\boldsymbol\beta - \mathbf{y})^T (X\boldsymbol\beta - \mathbf{y}) = \boldsymbol\beta^T X^T X \boldsymbol\beta - \mathbf{y}^T X \boldsymbol\beta - \boldsymbol\beta^T X^T \mathbf{y} + \mathbf{y}^T \mathbf{y}$$
Set the first derivative of $J(\boldsymbol\beta)$ to 0:
$$\nabla J(\boldsymbol\beta) = 2\boldsymbol\beta^T X^T X - 2\mathbf{y}^T X = 0 \;\Rightarrow\; \hat{\boldsymbol\beta} = (X^T X)^{-1} X^T \mathbf{y}$$
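To make the closed form concrete, here is a minimal NumPy sketch (not from the slides; `ols_fit` and the toy data are illustrative, and the normal equations are solved directly rather than forming the inverse):

```python
import numpy as np

def ols_fit(X, y):
    """Closed-form OLS: beta_hat = (X^T X)^{-1} X^T y.
    X is n x (p+1), with a leading column of ones for the intercept.
    Solving the normal equations is preferred to explicit inversion."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data: y is roughly 1 + 2x plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1 + 2 * x + rng.normal(0, 0.5, size=50)
X = np.column_stack([np.ones_like(x), x])  # x_{i0} = 1
print(ols_fit(X, y))  # approximately [1, 2]
```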

Gradient Descent / Online Updating
Move in the direction of steepest descent:
$$\boldsymbol\beta^{(t+1)} := \boldsymbol\beta^{(t)} - \eta \, \nabla J(\boldsymbol\beta) \big|_{\boldsymbol\beta = \boldsymbol\beta^{(t)}}$$
where $J(\boldsymbol\beta) = \sum_i (\mathbf{x}_i^T \boldsymbol\beta - y_i)^2 = \sum_i J_i(\boldsymbol\beta)$ and $\nabla J_i(\boldsymbol\beta) = 2\mathbf{x}_i (\mathbf{x}_i^T \boldsymbol\beta - y_i)$.
When a new observation $i$ comes in, we only need to update:
$$\boldsymbol\beta^{(t+1)} := \boldsymbol\beta^{(t)} + 2\eta (y_i - \mathbf{x}_i^T \boldsymbol\beta^{(t)}) \mathbf{x}_i$$
If the prediction for object $i$ is smaller than the real value, $\boldsymbol\beta$ should move in the direction of $\mathbf{x}_i$.
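The online update rule translates directly into code; a sketch under the same toy setup as the OLS example (`sgd_step` is our illustrative name, not from the slides):

```python
import numpy as np

def sgd_step(beta, x_i, y_i, eta):
    """One online update: beta <- beta + 2*eta*(y_i - x_i^T beta)*x_i."""
    return beta + 2 * eta * (y_i - x_i @ beta) * x_i

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=500)
y = 1 + 2 * x + rng.normal(0, 0.5, size=500)
X = np.column_stack([np.ones_like(x), x])

beta = np.zeros(2)
for x_i, y_i in zip(X, y):  # observations arrive one at a time
    beta = sgd_step(beta, x_i, y_i, eta=0.001)
print(beta)  # moves toward [1, 2]
```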

Other Practical Issues
- What if $X^T X$ is not invertible? Add a small portion of the identity matrix, $\lambda I$, to it (ridge regression; see the sketch below)
- What if some attributes are categorical? Set dummy variables, e.g., $x = 1$ if sex = F and $x = 0$ if sex = M. For a nominal variable with multiple values, create more dummy variables for that one variable
- What if a non-linear correlation exists? Transform features, say, $x$ to $x^2$
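A sketch of the ridge fix (our own helper, assuming the same design-matrix convention as above; `lam` is the regularization strength $\lambda$):

```python
import numpy as np

def ridge_fit(X, y, lam=1e-3):
    """Ridge regression: beta_hat = (X^T X + lambda*I)^{-1} X^T y.
    Adding lambda*I makes the matrix invertible even when columns
    of X are collinear (e.g., redundant dummy variables)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```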

Probabilistic Interpretation
Model: $y_i = \mathbf{x}_i^T \boldsymbol\beta + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$, so $y_i \mid \mathbf{x}_i, \boldsymbol\beta \sim N(\mathbf{x}_i^T \boldsymbol\beta, \sigma^2)$ and $E[y_i \mid \mathbf{x}_i] = \mathbf{x}_i^T \boldsymbol\beta$.
Likelihood:
$$L(\boldsymbol\beta) = \prod_i p(y_i \mid \mathbf{x}_i, \boldsymbol\beta) = \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y_i - \mathbf{x}_i^T \boldsymbol\beta)^2}{2\sigma^2} \right\}$$
Maximum Likelihood Estimation: find $\boldsymbol\beta$ that maximizes $L(\boldsymbol\beta)$. Since $\arg\max L = \arg\min J$, this is equivalent to OLS!
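To see the equivalence explicitly (a step the slide leaves implicit), take logs of the likelihood:
$$\log L(\boldsymbol\beta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i (y_i - \mathbf{x}_i^T \boldsymbol\beta)^2 = \text{const} - \frac{1}{2\sigma^2} J(\boldsymbol\beta)$$
The first term does not depend on $\boldsymbol\beta$, so maximizing $L(\boldsymbol\beta)$ is exactly minimizing $J(\boldsymbol\beta)$.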

Model Evaluation and Selection

Model Selection Problem
Basic problem: how to choose between competing linear regression models
- Model too small: underfits the data; poor predictions; high bias, low variance
- Model too big: overfits the data; poor predictions; low bias, high variance
- Model just right: balances bias and variance to give good predictions

Bias and Variance
True predictor: $f(\mathbf{x}) = \mathbf{x}^T \boldsymbol\beta$; estimated predictor: $\hat{f}(\mathbf{x}) = \mathbf{x}^T \hat{\boldsymbol\beta}$
- Bias: $E[\hat{f}(\mathbf{x})] - f(\mathbf{x})$. How far is the expectation of the estimator from the true value? The smaller the better.
- Variance: $Var(\hat{f}(\mathbf{x})) = E[(\hat{f}(\mathbf{x}) - E[\hat{f}(\mathbf{x})])^2]$. How variable is the estimator? The smaller the better.
Reconsider the cost function $J(\boldsymbol\beta) = \sum_i (\mathbf{x}_i^T \boldsymbol\beta - y_i)^2$. Its expectation can be written as
$$E[(\hat{f}(\mathbf{x}) - f(\mathbf{x}) - \varepsilon)^2] = \text{bias}^2 + \text{variance} + \text{noise}$$
noting that $E[\varepsilon] = 0$ and $Var(\varepsilon) = \sigma^2$.
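Filling in the algebra the slide omits: since $\varepsilon$ is independent of $\hat{f}$ and $E[\varepsilon] = 0$, the cross term vanishes and
$$E[(\hat{f}(\mathbf{x}) - f(\mathbf{x}) - \varepsilon)^2] = E[(\hat{f}(\mathbf{x}) - f(\mathbf{x}))^2] + \sigma^2$$
Then, adding and subtracting $E[\hat{f}(\mathbf{x})]$ inside the first expectation,
$$E[(\hat{f}(\mathbf{x}) - f(\mathbf{x}))^2] = \big(E[\hat{f}(\mathbf{x})] - f(\mathbf{x})\big)^2 + E\big[(\hat{f}(\mathbf{x}) - E[\hat{f}(\mathbf{x})])^2\big] = \text{bias}^2 + \text{variance}$$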

Bias-Variance Trade-off

Cross-Validation
- Partition the data into K folds
- Use K-1 folds for training and 1 fold for testing
- Calculate the average accuracy based on the K training-testing pairs
Accuracy is evaluated on the validation/test dataset! Mean squared error can again be used:
$$\sum_i (\mathbf{x}_i^T \hat{\boldsymbol\beta} - y_i)^2 / n$$
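A sketch of K-fold cross-validation with MSE as the score, reusing the illustrative `ols_fit` helper from earlier (all names are ours, not the slides'):

```python
import numpy as np

def k_fold_mse(X, y, K=5, fit=ols_fit):
    """Average test MSE over K folds; `fit` maps training data
    to an estimated coefficient vector (e.g., ols_fit above)."""
    idx = np.random.default_rng(0).permutation(len(y))
    folds = np.array_split(idx, K)
    mses = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        resid = X[test] @ fit(X[train], y[train]) - y[test]
        mses.append(np.mean(resid ** 2))
    return float(np.mean(mses))
```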

AIC & BIC
AIC and BIC can be used to assess the quality of statistical models.
- AIC (Akaike Information Criterion): $AIC = 2k - 2\ln(\hat{L})$, where $k$ is the number of parameters in the model and $\hat{L}$ is the likelihood under the estimated parameters
- BIC (Bayesian Information Criterion): $BIC = k\ln(n) - 2\ln(\hat{L})$, where $n$ is the number of objects
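Under the Gaussian model above, the maximized log-likelihood has the closed form $-\frac{n}{2}(\ln(2\pi\hat\sigma^2) + 1)$, which gives a small computation sketch (our own helper; counting the noise variance as a parameter is one common convention):

```python
import numpy as np

def aic_bic(X, y, beta):
    """AIC and BIC for a Gaussian linear model."""
    n, p = X.shape
    sigma2 = np.mean((y - X @ beta) ** 2)           # MLE of noise variance
    loglik = -n / 2 * (np.log(2 * np.pi * sigma2) + 1)
    k = p + 1                                       # coefficients + variance
    return 2 * k - 2 * loglik, k * np.log(n) - 2 * loglik
```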

Stepwise Feature Selection
Avoids brute-force search over all $2^p$ feature subsets.
- Forward selection: start with the best single feature; repeatedly add the feature that most improves performance; stop when no feature further improves it (a code sketch follows below)
- Backward elimination: start with the full model; repeatedly remove the feature whose removal most improves performance; stop when removing any feature makes performance worse
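A greedy forward-selection sketch, scoring candidate subsets with the `k_fold_mse` helper above (lower MSE is better; assumes the intercept is handled inside `fit`):

```python
import numpy as np

def forward_select(X, y, K=5):
    """Greedily grow a feature set while cross-validated MSE improves."""
    p = X.shape[1]
    selected, best_mse = [], np.inf
    while len(selected) < p:
        scores = {j: k_fold_mse(X[:, selected + [j]], y, K)
                  for j in range(p) if j not in selected}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best_mse:   # no remaining feature helps
            break
        selected.append(j_best)
        best_mse = scores[j_best]
    return selected
```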


Summary
- What is matrix data? Attribute types
- Linear regression: OLS; probabilistic interpretation
- Model evaluation and selection: bias-variance trade-off, mean squared error, cross-validation, AIC, BIC, stepwise feature selection