Data Mining. CS57300 Purdue University. Jan 11, 2018, Bruno Ribeiro


Transcription:

Data Mining. CS57300 Purdue University. Jan 11, 2018. Bruno Ribeiro

Outline: Regression; Posteriors; Working with Data

Linear Regression: Review

Linear Regression (use case A)? Interpolation (something is missing). Data: (x_1, ..., x_t) and (y_1, ..., y_t).

Auto-regression: Predicting the Next Value After t Steps

Linear regression (use case B)? Example: Google Finance stock prices. Model:

x_{t+1} = \sum_{i=t-w+1}^{t} a_i x_i + noise, where w = history window

Similar problem to linear regression: express unknowns as a linear function of knowns.

Predictions from High-Dimensional Historical Data

y[t×1] = X[t×w] a[w×1], an over-constrained problem.

a is the vector of the regression coefficients. X has the t values of the w independent variables; these independent variables can mix user characteristics with a window of past observations. y has the t values of the dependent variable. Ack: C. Faloutsos

Looking Into the Multiplication

y[t×1] = X[t×w] a[w×1]. Example: predicting corn prices over time; we may want to add social media variables as extra columns of X.

How to Estimate a?

a = (X^T X)^{-1} (X^T y)

X^+ = (X^T X)^{-1} X^T is the Moore-Penrose pseudoinverse, so equivalently a = X^+ y. This a is the vector that minimizes the Root Mean Squared Error (RMSE) of (y - X a).
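The closed form above can be sketched in NumPy as a sanity check (a minimal illustration on synthetic data; all names and values are made up):

```python
import numpy as np

# Fit a = (X^T X)^{-1} X^T y two ways: via the normal equations and
# via the Moore-Penrose pseudoinverse. Data is synthetic.
rng = np.random.default_rng(0)
t, w = 200, 3                          # t observations, w predictors
X = rng.normal(size=(t, w))
a_true = np.array([2.0, -1.0, 0.5])
y = X @ a_true + 0.01 * rng.normal(size=t)

a_normal = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations
a_pinv = np.linalg.pinv(X) @ y                 # a = X^+ y
# Both recover a vector close to a_true (noise is small).
```

In practice `np.linalg.lstsq` is preferred over forming X^T X explicitly, since it is numerically more stable.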

Details: Least Squares Optimization

Least squares cost function:

C = (1/2) \sum_{i=1}^{t} (y_i - x_i^T a)^2 = (1/2) (y - Xa)^T (y - Xa)

Find the a that minimizes the cost C:

\partial C / \partial a = (1/2) \partial/\partial a (y - Xa)^T (y - Xa) = -(y - Xa)^T X

Optimal value at \partial C / \partial a = 0 ⟹ X^T y = X^T X a ⟹ a = (X^T X)^{-1} X^T y

How to Estimate a?

a = (X^T X)^{-1} (X^T y), i.e., a = X^+ y with the Moore-Penrose pseudoinverse X^+ = (X^T X)^{-1} X^T; a is the vector that minimizes the RMSE of (y - X a).

Problems: the matrix X grows over time and the solution needs a matrix inversion:
- O(t w^2) computation
- O(t × w) storage

Recursive Least Squares

At time t we know X_t = (x_1, ..., x_t) and y_t = (y_1, ..., y_t). Least squares solves

a* = argmin_a ||X_t a - y_t||^2

which gives a* = X_t^+ y_t, where X_t^+ = (X_t^T X_t)^{-1} X_t^T.

Let Φ_t = X_t^T X_t and θ_t = X_t^T y_t. Then Φ_{t+1}^{-1} can be computed incrementally (Sherman-Morrison):

Φ_{t+1}^{-1} = (Φ_t + x_{t+1} x_{t+1}^T)^{-1} = Φ_t^{-1} - (Φ_t^{-1} x_{t+1} x_{t+1}^T Φ_t^{-1}) / (1 + x_{t+1}^T Φ_t^{-1} x_{t+1})

Recursive Least Squares Algorithm

With G_t = (X_t^T X_t)^{-1} and θ_t = X_t^T y_t, update on the new sample (x_{t+1}, y_{t+1}):

G_{t+1} = G_t - (G_t x_{t+1} x_{t+1}^T G_t) / (1 + x_{t+1}^T G_t x_{t+1})
θ_{t+1} = θ_t + x_{t+1} y_{t+1}
a_{t+1} = G_{t+1} θ_{t+1}
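A minimal sketch of these updates in Python (variable names are mine; starting from a large G_0 approximates the unregularized least-squares solution):

```python
import numpy as np

def rls_step(G, theta, x, y):
    """One recursive least-squares update; G plays the role of (X^T X)^{-1}."""
    Gx = G @ x
    G_new = G - np.outer(Gx, Gx) / (1.0 + x @ Gx)   # Sherman-Morrison
    theta_new = theta + x * y                        # X^T y accumulator
    a_new = G_new @ theta_new                        # current coefficient estimate
    return G_new, theta_new, a_new

rng = np.random.default_rng(1)
w = 3
a_true = np.array([1.0, -2.0, 0.5])
G = 1e6 * np.eye(w)      # large initial G ~ (tiny regularizer)^{-1}
theta = np.zeros(w)
for _ in range(500):
    x = rng.normal(size=w)
    y = x @ a_true + 0.1 * rng.normal()
    G, theta, a = rls_step(G, theta, x, y)
# a is now close to a_true, with no matrix inversion per step.
```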

Exponentially Weighted Recursive Least Squares Algorithm

With forgetting factor λ ∈ (0, 1] down-weighting older samples (G_t and θ_t now denote the λ-weighted versions of (X_t^T X_t)^{-1} and X_t^T y_t):

G_{t+1} = (1/λ) [ G_t - (G_t x_{t+1} x_{t+1}^T G_t) / (λ + x_{t+1}^T G_t x_{t+1}) ]
θ_{t+1} = λ θ_t + x_{t+1} y_{t+1}
a_{t+1} = G_{t+1} θ_{t+1}
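A sketch of the exponentially weighted variant, tracking a coefficient vector that drifts halfway through the stream (this is the standard forgetting-factor form of EW-RLS; the data and λ value are made up for illustration):

```python
import numpy as np

def ewrls_step(G, theta, x, y, lam=0.95):
    """Exponentially weighted RLS update with forgetting factor lam."""
    Gx = G @ x
    G_new = (G - np.outer(Gx, Gx) / (lam + x @ Gx)) / lam
    theta_new = lam * theta + x * y
    return G_new, theta_new, G_new @ theta_new

rng = np.random.default_rng(2)
G, theta = 1e6 * np.eye(2), np.zeros(2)
for t in range(600):
    # the true coefficients change at t = 300
    a_true = np.array([1.0, 0.0]) if t < 300 else np.array([0.0, 1.0])
    x = rng.normal(size=2)
    y = x @ a_true + 0.05 * rng.normal()
    G, theta, a = ewrls_step(G, theta, x, y)
# a now tracks the post-drift coefficients; plain RLS would average
# over both regimes instead.
```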

Comparison

Original least squares:
- Needs a large matrix (growing in size): O(t × w) storage
- Costly matrix operation: O(t w^2) computation

Recursive LS:
- Needs a much smaller, fixed-size O(w × w) matrix
- Fast, incremental computation: O(1 × w^2) per update, no matrix inversion

Ack: Christos Faloutsos

Posteriors

Finding (Real) Patterns in Data

The data shows that rural counties have the highest average cancer mortality rates. But rural counties also have small populations. Why do rural counties have the highest rates of cancer? A: small-sample variance. Solution? Compute the posterior P[cancer rate | data].

Posteriors (e.g., Representing Fractions)

Consider creating a model that predicts whether a soccer striker will score a goal in a game. The data includes the number of shots on goal and the number of goals during each player's career. Problems with absolute values: the number of goals (older players have larger values than young players) and the number of shots on goal (does not reflect the rate of shot-to-goal conversion). Feature: % of shots on goal resulting in a goal. Alice (novice): 1 out of 1 ⇒ 100%. Bob (senior): 300 out of 1000 ⇒ 30%. Solution? Same problem as in the previous slide (cancer rates); the solution is also Bayesian.
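The Bayesian fix can be sketched with a Beta-Binomial posterior: instead of raw percentages, compare posterior means under a common prior on conversion rates. The Beta(2, 8) prior (roughly "typical conversion is around 20%") is an assumption chosen for illustration, not from the slides:

```python
def posterior_mean(goals, shots, alpha=2.0, beta=8.0):
    """Posterior mean of a conversion rate under a Beta(alpha, beta) prior.

    Beta-Binomial conjugacy: posterior is Beta(alpha + goals,
    beta + shots - goals), whose mean is below.
    """
    return (alpha + goals) / (alpha + beta + shots)

alice = posterior_mean(1, 1)       # raw rate 100%, but only one shot
bob = posterior_mean(300, 1000)    # raw rate 30%, lots of evidence
print(round(alice, 3), round(bob, 3))  # 0.273 0.299
```

With one shot, Alice's estimate is pulled almost all the way back to the prior, so Bob now ranks above her; the same shrinkage logic tempers the small-county cancer rates of the previous slide.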

Reaching (Real) Conclusions with Data

Whatever conclusion we draw from cold data, there are always assumptions. Example: Simpson's Paradox (a.k.a. the Yule-Simpson effect). At Berkeley, women applicants overall had a lower acceptance rate than men. But if you look at every department, women are accepted at the same rate as men. How? What was our wrong assumption?

Let O_i be a random variable denoting that candidate i gets an offer from Berkeley, O_{i,j} that candidate i gets an offer from department j at Berkeley, and A_{i,j} that candidate i has applied to department j. Then

P[O_i] = \sum_j P[O_{i,j}] = \sum_j P[O_{i,j} | A_{i,j}] P[A_{i,j}]

Incorrect assumption: P[A_{i,j}] is the same for men and women.
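The decomposition above can be checked on toy numbers (made up for illustration, not Berkeley's actual data): per-department rates are identical for both groups, yet the aggregates differ because P[A_{i,j}] differs:

```python
# dept: (acceptance rate, men applicants, women applicants)
# Rates are the same for both groups within each department.
depts = {
    "A": (0.60, 800, 200),   # easy department, mostly men apply
    "B": (0.20, 200, 800),   # hard department, mostly women apply
}

def overall_rate(group):
    """Aggregate acceptance rate: sum_j P[O|A_j] * P[A_j]."""
    idx = 1 if group == "men" else 2
    applied = sum(d[idx] for d in depts.values())
    accepted = sum(d[0] * d[idx] for d in depts.values())
    return accepted / applied

men = overall_rate("men")       # (480 + 40) / 1000 = 0.52
women = overall_rate("women")   # (120 + 160) / 1000 = 0.28
```

Equal conditional rates, different application mixes: the aggregate gap is produced entirely by P[A_{i,j}].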

Working with Data

Data Issues

How we represent the data plays a huge role in the questions we can ask and the answers we get: it influences the similarity metrics and the data models we can use. Data biases (last class) and missing data also play a huge role, as does the problem of outliers in the data.

The outlier chicken-and-egg problem: outliers may skew decisions, but defining what constitutes an outlier requires choosing a model that describes the normal data, and choosing such a model requires fitting the data, which may fit the outliers.

Missing Values

Reasons for missing values:
- Information is not collected (e.g., people decline to give their age)
- Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)

Ways to handle missing values:
- Eliminate entities with missing values
- Estimate attributes with missing values
- Ignore the missing values during analysis
- Replace with all possible values (weighted by their probabilities)
- Impute missing values

Tan, Steinbach, Kumar. Introduction to Data Mining, 2004.
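Two of the strategies above, sketched on a toy NumPy array with NaNs standing in for missing values (the data is made up):

```python
import numpy as np

X = np.array([[1.0,    2.0],
              [np.nan, 4.0],
              [3.0,    6.0]])

# 1) Eliminate entities (rows) with any missing value
complete = X[~np.isnan(X).any(axis=1)]       # keeps rows 0 and 2

# 2) Impute missing entries with the per-column mean of observed values
col_mean = np.nanmean(X, axis=0)             # [2.0, 4.0]
imputed = np.where(np.isnan(X), col_mean, X)
```

Row deletion is simple but discards information; mean imputation keeps every row at the cost of shrinking the column's variance.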

Duplicate Data

A data set may include entities that are duplicates, or almost duplicates, of one another; this is a major issue when merging data from heterogeneous sources. Example: the same person with multiple email addresses. Sometimes duplication is legitimate (different users with the same feature values), but it causes issues with some models.

Data cleaning: finding and dealing with duplicate entities; finding and correcting measurement error; dealing with missing values.
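A minimal sketch of the near-duplicate case above: the records and the normalization rule (compare lowercased, stripped email addresses) are illustrative, not from the slides:

```python
records = [
    {"name": "Ann Lee", "email": "Ann.Lee@example.com"},
    {"name": "A. Lee",  "email": "ann.lee@example.com "},  # same person
    {"name": "Bo Kim",  "email": "bo.kim@example.com"},
]

def dedupe(rows):
    """Keep the first record per normalized email key."""
    seen, unique = set(), []
    for r in rows:
        key = r["email"].strip().lower()   # normalize before comparing
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

clean = dedupe(records)   # 2 records remain
```

Real entity resolution needs fuzzier keys (name similarity, blocking), but the pattern of normalize-then-match is the same.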