From statistics to data science. BAE 815 (Fall 2017) Dr. Zifei Liu


From statistics to data science. BAE 815 (Fall 2017). Dr. Zifei Liu, Zifeiliu@ksu.edu

The Data-Information-Knowledge-Wisdom hierarchy (Russell Ackoff). Data are individual facts (quantities, characters, or symbols); moving up the hierarchy answers successively deeper questions: how much and how many, then what, then how, then why.

1 exabyte = 1 billion GB = 10^18 bytes.

How do we make decisions? From experience, from data (experiments), and from statistics (probability, uncertainty); big data and data science build on all three.

Questions that you can answer with data science:
- How much? or How many? Regression algorithms.
- Is this A or B? Classification algorithms.
- Is this weird? Anomaly detection algorithms.

Correlation vs. causation. Causation is not observed but inferred. A correlation between A and B may arise because (1) A causes B, (2) B causes A, (3) a third factor C causes both A and B, (4) A and B influence each other, or (5) coincidence. Examples: social drinking vs. earnings; energy consumption vs. economic growth; debt rate vs. performance of a company; shoe size vs. reading ability; ice cream consumption vs. rate of drowning; obesity vs. diabetes (a risk factor); children who get tutored get worse grades than children who do not get tutored.

Population vs. sample. A population of size N is described by parameters; a sample of size n yields statistics. The standard error of the sample mean relates to the sample standard deviation s by SE = s / sqrt(n).

Null hypothesis (H0): A has no effect on B.

True situation            Our conclusion            Outcome
No effect (negative)      Significant (reject H0)   False positive (Type I error)
No effect (negative)      Not significant           True negative
Has an effect (positive)  Significant (reject H0)   True positive
Has an effect (positive)  Not significant           False negative (Type II error)

Control errors: the Type I error rate is controlled by the confidence level (p-value); the Type II error rate is controlled by statistical power and sample size.

Avoid confounding variables. Confounding/nuisance variables (C, D, E, F) are undesired sources of variation that affect the dependent variable (A) alongside the independent variable (B).
- If you can, fix the confounding variable (make it a constant).
- If you can't fix the confounding variable, use blocking.
- If you can neither fix nor block the confounding variable, use randomization.

Common probability distributions.

Regression analysis:
- Linear regression
- Logistic regression
- Nonlinear regression
- Stepwise regression (forward, backward)
- Ridge, LASSO & ElasticNet regression (handle multicollinear variables)
R: correlation coefficient, ranges from -1 to +1. R^2: coefficient of determination, ranges from 0 to 1.
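As a minimal sketch (plain Python, with made-up data), ordinary least squares for a single predictor, and the R and R^2 statistics listed above:

```python
# Ordinary least squares for one predictor (illustrative data, roughly y = 2x).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n

sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)

slope = sxy / sxx                    # least-squares slope
intercept = my - slope * mx          # least-squares intercept

r = sxy / (sxx * syy) ** 0.5         # correlation coefficient, -1 to +1
r_squared = r ** 2                   # coefficient of determination, 0 to 1
print(slope, intercept, round(r_squared, 3))
```

For a single predictor, R^2 is exactly the squared correlation coefficient, which is why the slide lists both on the same line.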

Machine learning.
- Learning: improve performance from experience.
- Machine learning: teach computers to make and improve predictions based on data; an approach to achieve artificial intelligence. Main tasks: classification and prediction (regression).
- Data mining: use algorithms to create knowledge from data.

Bayesian statistics for machine learning. Bayes' rule provides the tools to update the probability of a hypothesis as new evidence or information becomes available.
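Concretely, Bayes' rule states P(H | E) = P(E | H) P(H) / P(E). A small sketch with made-up numbers (a diagnostic test with 99% sensitivity and a 5% false-positive rate for a condition with 1% prevalence):

```python
# Bayes' rule: posterior = likelihood * prior / evidence (illustrative numbers).
prior = 0.01            # P(H): prevalence of the condition
sensitivity = 0.99      # P(E | H): positive test given the condition
false_pos = 0.05        # P(E | not H): positive test without the condition

# Total probability of a positive test (the evidence).
evidence = sensitivity * prior + false_pos * (1 - prior)

# Updated belief in H after seeing a positive test.
posterior = sensitivity * prior / evidence
print(round(posterior, 3))
```

Even with a very accurate test, the low prior keeps the posterior modest (about 1 in 6), which is exactly the kind of update Bayes' rule formalizes.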

Common data science algorithms.
Supervised (= predictive):
- Linear regression
- Decision tree
- Random forest
Unsupervised (= exploratory):
- Association rule mining
- K-Means clustering

Decision tree (regression tree). The attribute with the largest standard-deviation (std) reduction is chosen for the decision node (in the example at http://www.saedsayad.com/decision_tree_reg.htm, the 14-instance dataset has std 9.32, and the candidate branches have stds 10.87, 7.78, and 3.49 on subsets of 5/14, 5/14, and 4/14 instances). Stop splitting when the std for a branch becomes smaller than a certain fraction (e.g., 5%) of the std for the full dataset, or when too few instances remain in the branch.
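A minimal sketch (plain Python) of scoring one candidate split by standard-deviation reduction; the target values are chosen to match the stds quoted on the slide (9.32 for the full set; 7.78, 3.49, 10.87 for the three branches):

```python
import math

def std(values):
    """Population standard deviation."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def std_reduction(parent, branches):
    """Std of the parent minus the size-weighted std of the branches."""
    n = len(parent)
    weighted = sum(len(b) / n * std(b) for b in branches)
    return std(parent) - weighted

# Target values split by a candidate attribute with three levels.
branch_a = [25, 30, 35, 38, 48]        # 5/14 instances, std ~ 7.78
branch_b = [46, 43, 52, 44]            # 4/14 instances, std ~ 3.49
branch_c = [45, 52, 23, 46, 30]        # 5/14 instances, std ~ 10.87
targets = branch_a + branch_b + branch_c   # full set, std ~ 9.32

print(round(std_reduction(targets, [branch_a, branch_b, branch_c]), 2))
```

The split with the largest such reduction becomes the decision node; the same score is then computed recursively inside each branch.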

Classification & Regression Trees (CART) (Ankit Sharma, 2014). You can define a split point for either a categorical variable or a continuous variable (e.g., partitioning the X1-X2 feature plane), splitting the dataset based on the homogeneity of the data in each resulting subset.

Random forest. A widely used machine learning algorithm for classification: average multiple deep decision trees trained on different parts of the same training set, overcoming the overfitting problem of an individual decision tree.
- Approximately 2/3 of the total training data are selected at random (with replacement) to grow each tree.
- Predictor variables are selected at random, and the best split among them is used to split the node.
- For each tree, the leftover (~1/3) data are used to calculate the out-of-bag error rate.
- Each tree gives a classification; the forest chooses the classification having the most votes over all the trees in the forest.
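The 2/3 vs. 1/3 split mentioned above is a property of bootstrap sampling, not a tuning parameter: a bootstrap sample of n points drawn with replacement contains roughly 1 - 1/e (about 63%) of the distinct points. A quick sketch (plain Python) checking this:

```python
import random

random.seed(0)
n = 1000
in_bag_fractions = []
for _ in range(200):  # 200 simulated bootstrap samples (one per "tree")
    # Indices drawn with replacement; the set keeps only distinct points.
    sample = {random.randrange(n) for _ in range(n)}
    in_bag_fractions.append(len(sample) / n)

avg_in_bag = sum(in_bag_fractions) / len(in_bag_fractions)
print(round(avg_in_bag, 2))   # close to 1 - 1/e; the remainder is out-of-bag
```

The roughly 37% of points left out of each sample are what the out-of-bag error rate is computed on.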

Variable importance plot (example: classifying income of adults). Random forests can be used to rank the importance of variables in a regression or classification problem.
- Mean decrease accuracy: how much the model accuracy decreases if we drop that variable.
- Mean decrease Gini: a measure of variable importance based on the Gini impurity index used for the calculation of splits in trees.
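Mean decrease accuracy is usually measured by permuting a variable's values rather than dropping the column outright. A minimal sketch (plain Python, with a toy stand-in model and synthetic data, not a real random forest):

```python
import random

random.seed(1)

# Toy data: the label depends only on x1; x2 is pure noise.
data = [(random.random(), random.random()) for _ in range(500)]
labels = [1 if x1 > 0.5 else 0 for x1, _ in data]

def model(x1, x2):
    """Stand-in 'trained model': thresholds x1, ignores x2."""
    return 1 if x1 > 0.5 else 0

def accuracy(rows):
    return sum(model(x1, x2) == y for (x1, x2), y in zip(rows, labels)) / len(rows)

base = accuracy(data)

def permutation_importance(col):
    """Accuracy drop after shuffling one feature column across rows."""
    shuffled = [row[col] for row in data]
    random.shuffle(shuffled)
    permuted = [(v, x2) if col == 0 else (x1, v)
                for (x1, x2), v in zip(data, shuffled)]
    return base - accuracy(permuted)

print(round(permutation_importance(0), 2))  # large drop: x1 matters
print(round(permutation_importance(1), 2))  # no drop: x2 is noise
```

Shuffling breaks the link between a feature and the labels while keeping its marginal distribution, so the accuracy drop isolates that feature's contribution.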

Association rule mining. An association rule is a pattern stating that when X occurs, Y occurs with a certain probability (an if/then statement). Initially used for market basket analysis, to find how items purchased by customers are related.
- support(X -> Y) = count(X and Y) / n
- confidence(X -> Y) = count(X and Y) / count(X)
Goal: find all rules that satisfy the user-specified minimum support and minimum confidence.

Association rule mining (the Apriori Algorithm). Worked example with minimum support = 50% (2 of 4 transactions):

TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

1-itemset counts: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3; keep {1}, {2}, {3}, {5} ({4} is below minimum support).
2-itemset counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2; keep {1 3}, {2 3}, {2 5}, {3 5}.
3-itemset counts: {2 3 5}:2; keep {2 3 5}.
Rules from {2 3 5}: 2,3 -> 5 (confidence 100%); 3,5 -> 2 (confidence 100%); 2,5 -> 3 (confidence 67%).
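A minimal sketch (plain Python) recomputing the supports and confidences from the worked example above:

```python
# Transactions from the example above (TIDs 100, 200, 300, 400).
transactions = [
    {1, 3, 4},
    {2, 3, 5},
    {1, 2, 3, 5},
    {2, 5},
]
n = len(transactions)

def count(itemset):
    """Number of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions)

def support(itemset):
    return count(itemset) / n

def confidence(lhs, rhs):
    """Confidence of the rule lhs -> rhs."""
    return count(lhs | rhs) / count(lhs)

print(support({2, 3, 5}))          # 0.5, meets the 50% minimum support
print(confidence({2, 3}, {5}))     # 1.0 (100%)
print(confidence({2, 5}, {3}))     # 2/3 (67%)
```

Apriori's efficiency comes from the pruning step shown in the example: an itemset can only be frequent if all of its subsets are frequent, so candidates containing an infrequent subset (like {4}) are never counted.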

K-Means clustering. The algorithm works iteratively to assign each data point to one of K groups based on feature similarity (e.g., a defined distance measure): it finds the centroids of the K clusters and thereby produces labels for the training data.
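A minimal sketch of the two alternating steps (plain Python, 1-D points; initializing centroids at the first K points for determinism, whereas real implementations usually initialize randomly):

```python
def kmeans(points, k, iters=20):
    """Lloyd's algorithm on 1-D points."""
    centroids = points[:k]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two well-separated groups of 1-D values.
points = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]
centroids, clusters = kmeans(points, 2)
print(sorted(round(c, 1) for c in centroids))
```

The cluster index of each point is the "label" the slide refers to: K-means is unsupervised, so these labels are discovered from feature similarity alone rather than given in the training data.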

Open-source language for data science.


Become a data scientist? Job trends from indeed.com: demand for deep analytical talent in the U.S. is projected to be 50-60% greater than supply by 2018.