CS145: INTRODUCTION TO DATA MINING
Lecture 2: Vector Data: Prediction
Instructor: Yizhou Sun (yzsun@cs.ucla.edu)
October 8, 2018

TA Office Hour Time Change
Junheng Hao: Tuesday 1-3pm
Yunsheng Bai: Thursday 1-3pm

Methods to Learn
Tasks by data type (vector, set, sequence, text):
- Classification: Logistic Regression; Decision Tree; KNN; SVM; NN (vector data); Naïve Bayes for Text (text data)
- Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models (vector data); PLSA (text data)
- Prediction: Linear Regression; GLM* (vector data)
- Frequent Pattern Mining: Apriori; FP-growth (set data); GSP; PrefixSpan (sequence data)
- Similarity Search: DTW (sequence data)

How to learn these algorithms? Three levels:
- When is it applicable? Input, output, strengths, weaknesses, time complexity
- How does it work? Pseudo-code, workflows, major steps; can work out a toy problem by pen and paper
- Why does it work? Intuition, philosophy, objective, derivation, proof

Outline (Vector Data: Prediction)
- Vector Data
- Linear Regression Model
- Model Evaluation and Selection
- Summary

Example: a matrix of n × p, with n data objects/points and p attributes/dimensions:

X = \begin{pmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{pmatrix}

Attribute Types
- Numerical: e.g., height, income
- Categorical / discrete: e.g., sex, race

Categorical Attribute Types
- Nominal: categories, states, or names of things, e.g., hair_color = {auburn, black, blond, brown, grey, red, white}; marital status, occupation, ID numbers, zip codes
- Binary: a nominal attribute with only 2 states (0 and 1)
  - Symmetric binary: both outcomes equally important, e.g., gender
  - Asymmetric binary: outcomes not equally important, e.g., a medical test (positive vs. negative); convention: assign 1 to the most important outcome (e.g., HIV positive)
- Ordinal: values have a meaningful order (ranking), but the magnitude between successive values is not known, e.g., size = {small, medium, large}, grades, army rankings

Basic Statistical Descriptions of Data
- Central tendency
- Dispersion of the data
- Graphic displays

Measuring the Central Tendency
- Mean (algebraic measure), sample vs. population (n is the sample size and N is the population size):

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \mu = \frac{\sum x}{N}

- Weighted arithmetic mean:

\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}

- Trimmed mean: chopping extreme values
- Median: middle value if odd number of values, or average of the middle two values otherwise
- Mode: value that occurs most frequently in the data; unimodal, bimodal, trimodal. Empirical formula: mean − mode = 3 × (mean − median)
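
As a concrete illustration, here is a minimal NumPy sketch of these measures on a made-up toy sample (the data values are hypothetical, not from the slides):

```python
import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110], dtype=float)

mean = x.mean()
median = np.median(x)

# Mode: the most frequent value
values, counts = np.unique(x, return_counts=True)
mode = values[np.argmax(counts)]

# Trimmed mean: chop 10% of the values at each extreme before averaging
k = int(0.1 * len(x))
trimmed_mean = np.sort(x)[k:len(x) - k].mean()

# Weighted arithmetic mean: down-weight the last (extreme) observation
w = np.ones_like(x)
w[-1] = 0.5
weighted_mean = (w * x).sum() / w.sum()
```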

Symmetric vs. Skewed Data
Median, mean, and mode of symmetric, positively skewed, and negatively skewed data. [Figure: three distributions labeled symmetric, positively skewed, and negatively skewed]

Measuring the Dispersion of Data
- Quartiles, outliers, and boxplots:
  - Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  - Inter-quartile range: IQR = Q3 − Q1
  - Five-number summary: min, Q1, median, Q3, max
  - Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
  - Boxplot: graphic display of the five-number summary
- Variance and standard deviation (sample: s, population: σ). Variance (algebraic, scalable computation):

s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]

\sigma^2 = E[(X - E[X])^2] = E[X^2] - (E[X])^2

- Standard deviation s (or σ) is the square root of the variance s² (or σ²)
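
Continuing the same hypothetical sample, a short NumPy sketch of the dispersion measures and the 1.5 × IQR outlier rule:

```python
import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110], dtype=float)

q1, med, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
five_number_summary = (x.min(), q1, med, q3, x.max())

# Outlier rule of thumb: more than 1.5 * IQR below Q1 or above Q3
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < low) | (x > high)]

s2 = x.var(ddof=1)       # sample variance, divides by n - 1
sigma2 = x.var(ddof=0)   # population variance, divides by n
s = np.sqrt(s2)          # sample standard deviation
```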

Graphic Displays of Basic Statistical Descriptions
- Histogram: the x-axis shows values, the y-axis shows frequencies
- Scatter plot: each pair of values is a pair of coordinates, plotted as points in the plane

Histogram Analysis
- Histogram: graph display of tabulated frequencies, shown as bars
- Shows what proportion of cases fall into each of several categories
- Differs from a bar chart in that the area of the bar, not its height, denotes the value, a crucial distinction when the categories are not of uniform width
- The categories are usually specified as non-overlapping intervals of some variable; the categories (bars) must be adjacent
[Figure: histogram over bins from 10,000 to 90,000, frequencies 0 to 40]

Scatter Plot
- Provides a first look at bivariate data, to see clusters of points, outliers, etc.
- Each pair of values is treated as a pair of coordinates and plotted as points in the plane

Positively and Negatively Correlated Data
The left half of the figure shows positively correlated data; the right half shows negatively correlated data. [Figure: scatter plots]

Uncorrelated Data [Figure: scatter plots showing no correlation]

Scatterplot Matrices (used by permission of M. Ward, Worcester Polytechnic Institute)
Matrix of scatterplots (x-y diagrams) of the k-dimensional data, for a total of k(k − 1)/2 unique scatterplots.

Outline (Vector Data: Prediction)
- Vector Data
- Linear Regression Model
- Model Evaluation and Selection
- Summary

Linear Regression
- Ordinary least squares regression: closed-form solution; gradient descent
- Linear regression with probabilistic interpretation

The Linear Regression Problem
Map any attributes to a continuous value, x → y:
- {age; major; gender; race} → GPA
- {income; credit score; profession} → loan
- {college; major; GPA} → future income

Example of House Price

Living Area (sqft)   # of Beds   Price ($1000)
2104                 3           400
1600                 3           330
2400                 3           369
1416                 2           232
3000                 4           540

x = (x_1, x_2) = (living area, # of beds), y = price; model: y = β_0 + β_1 x_1 + β_2 x_2
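
To make the model concrete, here is a minimal NumPy sketch (not from the slides) that fits y = β_0 + β_1 x_1 + β_2 x_2 to the five rows above; np.linalg.lstsq solves the least-squares problem introduced on the following slides:

```python
import numpy as np

# The five listings from the table: living area (sqft), # of beds -> price ($1000)
X = np.array([[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]], dtype=float)
y = np.array([400, 330, 369, 232, 540], dtype=float)

# Prepend the constant column x_0 = 1 to model the bias term beta_0
X1 = np.hstack([np.ones((len(X), 1)), X])

beta, *_ = np.linalg.lstsq(X1, y, rcond=None)  # least-squares fit
pred = X1 @ beta                                # y_hat = beta_0 + beta_1*x_1 + beta_2*x_2
```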

Illustration [figure]

Formalization
Data: n independent data objects:
- y_i, i = 1, …, n
- x_i = (x_{i1}, x_{i2}, …, x_{ip})^T, i = 1, …, n
- A constant factor is added to model the bias term, i.e., x_{i0} = 1; new x: x_i = (x_{i0}, x_{i1}, x_{i2}, …, x_{ip})^T
Model:
- y: dependent variable; x: explanatory variables
- β = (β_0, β_1, …, β_p)^T: weight vector
- y = x^T β = β_0 + x_1 β_1 + x_2 β_2 + … + x_p β_p

A 3-step Process
1. Model construction: use training data to find the best parameter β, denoted as β̂
2. Model selection: use validation data to select the best model, e.g., feature selection
3. Model usage: apply the model to the unseen data (test data): ŷ = x^T β̂

Least Squares Estimation
Cost function (mean squared error):

J(\beta) = \frac{1}{2n}\sum_i (x_i^T \beta - y_i)^2

Matrix form:

J(\beta) = \frac{1}{2n}(X\beta - y)^T (X\beta - y) = \frac{1}{2n}\|X\beta - y\|^2

where X is the n × (p + 1) design matrix (first column all 1s) and y is the n × 1 vector of responses:

X = \begin{pmatrix}
1 & x_{11} & \cdots & x_{1p} \\
\vdots & \vdots & & \vdots \\
1 & x_{i1} & \cdots & x_{ip} \\
\vdots & \vdots & & \vdots \\
1 & x_{n1} & \cdots & x_{np}
\end{pmatrix}, \qquad
y = \begin{pmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{pmatrix}

Ordinary Least Squares (OLS)
Goal: find β̂ that minimizes J(β):

J(\beta) = \frac{1}{2n}(X\beta - y)^T(X\beta - y) = \frac{1}{2n}\left(\beta^T X^T X \beta - y^T X \beta - \beta^T X^T y + y^T y\right)

Set the first derivative of J(β) to 0:

\frac{\partial J}{\partial \beta} = \frac{1}{n}\left(X^T X \beta - X^T y\right) = 0 \quad\Rightarrow\quad \hat{\beta} = (X^T X)^{-1} X^T y

More about matrix calculus: https://atmos.washington.edu/~dennis/matrixcalculus.pdf
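
A direct transcription of the closed form as a NumPy sketch; it assumes X already contains the leading column of 1s and that X^T X is invertible (see the ridge-regression fix later for the singular case):

```python
import numpy as np

def ols_closed_form(X, y):
    """OLS estimate: beta_hat = (X^T X)^{-1} X^T y.

    Solving the normal equations X^T X beta = X^T y with np.linalg.solve
    avoids forming the explicit inverse, which is numerically safer.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)
```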

Gradient Descent
Minimize the cost function by moving down in the steepest direction.

Batch Gradient Descent
Move in the direction of steepest descent. Repeat until convergence:

\beta^{(t+1)} := \beta^{(t)} - \eta \left.\frac{\partial J}{\partial \beta}\right|_{\beta = \beta^{(t)}}, \quad \text{e.g., } \eta = 0.01

where

J(\beta) = \frac{1}{2n}\sum_i (x_i^T \beta - y_i)^2 = \frac{1}{n}\sum_i J_i(\beta) \quad\text{and}\quad \frac{\partial J}{\partial \beta} = \frac{1}{n}\sum_i \frac{\partial J_i}{\partial \beta} = \frac{1}{n}\sum_i x_i (x_i^T \beta - y_i)
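
A minimal NumPy sketch of the batch update; note that the step size η = 0.01 from the slide assumes roughly standardized features (with raw house-price features the iteration can diverge):

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.01, n_iters=1000):
    """Minimize J(beta) = ||X beta - y||^2 / (2n) with full-batch updates."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iters):
        grad = X.T @ (X @ beta - y) / n  # (1/n) * sum_i x_i (x_i^T beta - y_i)
        beta = beta - eta * grad         # move against the gradient
    return beta
```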

Stochastic Gradient Descent
When a new observation i comes in, update the weights immediately (extremely useful for large-scale datasets):

Repeat {
  for i = 1:n {
    \beta^{(t+1)} := \beta^{(t)} + \eta\,(y_i - x_i^T \beta^{(t)})\,x_i
  }
}

If the prediction for object i is smaller than the real value, β should move in the direction of x_i.
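
The same idea, one observation at a time; a sketch that also shuffles the data each epoch (the shuffling is an addition, not on the slide, but usually helps convergence):

```python
import numpy as np

def stochastic_gradient_descent(X, y, eta=0.01, n_epochs=50, seed=0):
    """Per-observation update: beta += eta * (y_i - x_i^T beta) * x_i."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_epochs):
        for i in rng.permutation(n):  # visit objects in random order
            beta = beta + eta * (y[i] - X[i] @ beta) * X[i]
    return beta
```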

Probabilistic Interpretation
Review of the normal distribution: X ~ N(μ, σ²) with density

f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}

Probabilistic Interpretation
Model: y_i = x_i^T β + ε_i with ε_i ~ N(0, σ²), so y_i | x_i, β ~ N(x_i^T β, σ²) and E[y_i | x_i] = x_i^T β.
Likelihood:

L(\beta) = \prod_i p(y_i \mid x_i, \beta) = \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(y_i - x_i^T \beta)^2}{2\sigma^2}\right\}

Maximum likelihood estimation: find β that maximizes L(β). arg max L = arg min J: equivalent to OLS!
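
To see the equivalence, take the log of the likelihood (a short derivation, with σ² treated as a fixed constant):

```latex
\log L(\beta)
  = \sum_{i=1}^{n} \log p(y_i \mid x_i, \beta)
  = -\frac{n}{2}\log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - x_i^T\beta\right)^2
```

The first term does not depend on β, so maximizing log L(β) is exactly minimizing Σ_i (y_i − x_i^T β)², i.e., J(β).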

Other Practical Issues
- Handle different scales of numerical attributes with the z-score: z = (x − μ)/σ, where x is the raw score to be standardized, μ is the population mean, and σ is the population standard deviation.
- What if some attributes are nominal? Set dummy variables, e.g., x = 1 if sex = F and x = 0 if sex = M. For a nominal variable with multiple values, create more dummy variables, one per value.
- What if some attributes are ordinal? Replace x_if by its rank r_if ∈ {1, …, M_f}, then map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by z_if = (r_if − 1)/(M_f − 1).
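
A small NumPy sketch of the three recipes; the attribute names and values below are hypothetical:

```python
import numpy as np

# Z-score standardization of a numerical column
income = np.array([30.0, 36, 47, 50, 52, 110])
z = (income - income.mean()) / income.std()

# Dummy variable for a binary nominal attribute (sex = F -> 1, M -> 0)
sex = np.array(["F", "M", "F", "F", "M", "M"])
x_sex = (sex == "F").astype(float)

# One-hot dummies for a multi-valued nominal attribute
major = np.array(["CS", "EE", "CS", "Math", "EE", "CS"])
levels = np.unique(major)
dummies = (major[:, None] == levels[None, :]).astype(float)  # one column per level

# Ordinal attribute: map ranks r in {1, ..., M} onto [0, 1] via (r - 1) / (M - 1)
size = np.array([1, 2, 3, 2, 1, 3])  # small = 1, medium = 2, large = 3
M = size.max()
z_ord = (size - 1) / (M - 1)
```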

Other Practical Issues
- What if X^T X is not invertible? Add a small multiple of the identity matrix, λI, to it: ridge regression, i.e., linear regression with an ℓ2-norm penalty.
- What if a non-linear correlation exists? Transform features, say, x to x².
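
A one-line fix as a sketch: adding λI makes the matrix invertible for any λ > 0 (for brevity this version also penalizes the bias column, which implementations often exempt):

```python
import numpy as np

def ridge_closed_form(X, y, lam=1e-2):
    """Ridge estimate: beta_hat = (X^T X + lambda * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```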

Outline (Vector Data: Prediction)
- Vector Data
- Linear Regression Model
- Model Evaluation and Selection
- Summary

Model Selection Problem
Basic problem: how to choose between competing linear regression models.
- Model too simple: underfits the data; poor predictions; high bias, low variance
- Model too complex: overfits the data; poor predictions; low bias, high variance
- Model just right: balances bias and variance to get good predictions

Bias and Variance
True predictor: f(x) = x^T β. Estimated predictor: f̂(x) = x^T β̂.
- Bias: E[f̂(x)] − f(x). How far is the expectation of the estimator from the true value? The smaller the better.
- Variance: Var(f̂(x)) = E[(f̂(x) − E[f̂(x)])²]. How variable is the estimator? The smaller the better.
Reconsider the mean squared error, Σ_i (x_i^T β̂ − y_i)²/n: it can be considered as

E\left[\left(\hat{f}(x) - f(x) - \varepsilon\right)^2\right] = \text{bias}^2 + \text{variance} + \text{noise}

noting that E[ε] = 0 and Var(ε) = σ².
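
For completeness, the one-step decomposition behind bias² + variance + noise; the cross terms vanish because E[ε] = 0 and ε is independent of f̂(x):

```latex
E\!\left[\big(\hat{f}(x) - f(x) - \varepsilon\big)^2\right]
  = \underbrace{\big(E[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{E\!\left[\big(\hat{f}(x) - E[\hat{f}(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```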

Bias-Variance Trade-off [figure]

Example: degree d in regression (http://www.holehouse.org/mlclass/10_advice_for_applying_machine_learning.html)

Example: regularization term in regression

Cross-Validation
- Partition the data into K folds
- Use K − 1 folds for training and 1 fold for testing
- Calculate the average accuracy based on the K training-testing pairs
- This is accuracy on the validation/test dataset! Mean squared error can again be used: Σ_i (x_i^T β̂ − y_i)²/n
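
A minimal K-fold sketch using mean squared error; `fit` stands for any estimator returning β̂, e.g., the OLS or ridge functions sketched earlier:

```python
import numpy as np

def kfold_mse(X, y, fit, K=5, seed=0):
    """Average test MSE over K folds; `fit(X_train, y_train)` returns beta_hat."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    errors = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        beta = fit(X[train], y[train])
        errors.append(np.mean((X[test] @ beta - y[test]) ** 2))
    return np.mean(errors)
```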

AIC & BIC*
AIC and BIC can be used to assess the quality of statistical models.
- AIC (Akaike information criterion): AIC = 2k − 2 ln(L̂), where k is the number of parameters in the model and L̂ is the likelihood under the estimated parameters
- BIC (Bayesian information criterion): BIC = k ln(n) − 2 ln(L̂), where n is the number of objects
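
For a Gaussian linear model the maximized log-likelihood has a closed form, so both criteria reduce to the sketch below; counting conventions vary (some also count σ² as a parameter), so treat k here as illustrative:

```python
import numpy as np

def aic_bic(X, y, beta):
    """AIC and BIC for a Gaussian linear model at the fitted beta."""
    n, k = X.shape                      # k parameters: one weight per column
    resid = y - X @ beta
    sigma2 = np.mean(resid ** 2)        # MLE of the noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return 2 * k - 2 * loglik, k * np.log(n) - 2 * loglik
```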

Stepwise Feature Selection
Avoid brute-force selection over all 2^p feature subsets.
- Forward selection: start with the best single feature; repeatedly add the feature that improves performance the most; stop when no feature further improves performance
- Backward elimination: start with the full model; repeatedly remove the feature whose removal improves performance the most; stop when removing any feature worsens performance
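
A greedy forward-selection sketch; `score` is any lower-is-better criterion over a feature subset, e.g., cross-validated MSE or BIC from the snippets above:

```python
import numpy as np

def forward_selection(n_features, score):
    """Greedily add the feature that most improves `score(feature_list)`."""
    remaining = set(range(n_features))
    selected, best = [], np.inf
    while remaining:
        err, feat = min((score(selected + [j]), j) for j in remaining)
        if err >= best:          # no remaining feature improves performance
            break
        selected.append(feat)
        remaining.remove(feat)
        best = err
    return selected
```

Backward elimination is symmetric: start from the full feature set and repeatedly drop the feature whose removal lowers the score the most.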

Outline (Vector Data: Prediction)
- Vector Data
- Linear Regression Model
- Model Evaluation and Selection
- Summary

Summary
- What is vector data? Attribute types, basic statistics, visualization
- Linear regression: OLS, probabilistic interpretation
- Model evaluation and selection: bias-variance trade-off, mean squared error, cross-validation, AIC, BIC, stepwise feature selection