From statistics to data science
BAE 815 (Fall 2017)
Dr. Zifei Liu, Zifeiliu@ksu.edu
The Data-Information-Knowledge-Wisdom hierarchy (Russell Ackoff)
- Data: individual facts (quantities, characters, or symbols)
- Information answers "What? How much? How many?"; knowledge answers "How?"; wisdom answers "Why?"
1 exabyte = 1 billion GB = 10^18 bytes
How do we make decisions?
- From experience
- From data (experiments), via statistics (probability, uncertainty)
- From big data, via data science
Questions that you can answer with data science
- How much? How many? Regression algorithms
- What is it? Is this A or B? Classification algorithms
- Is this weird? Anomaly detection algorithms
Correlation vs. causation
Causation is not observed but inferred. If A and B are correlated, the possible explanations include:
(1) A causes B
(2) B causes A
(3) A third variable C causes both A and B
(4) A and B influence each other
(5) Coincidence
Examples: social drinking vs. earnings; energy consumption vs. economic growth; debt rate vs. performance of a company; shoe size vs. reading ability; ice cream consumption vs. rate of drowning; obesity vs. diabetes (risk factor); children who get tutored get worse grades than children who do not get tutored.
Population vs. sample
A population (size N) has parameters; a sample (size n) gives statistics.
Standard error of the mean: SE = s / √n, where s is the sample standard deviation.
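The standard-error formula above is easy to check in code; a minimal sketch in plain Python (the sample values here are made up for illustration):

```python
import math

def standard_error(sample):
    """Standard error of the mean: SE = s / sqrt(n),
    where s is the sample standard deviation (n - 1 denominator)."""
    n = len(sample)
    mean = sum(sample) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return s / math.sqrt(n)

# Hypothetical sample of n = 8 measurements
data = [4.1, 5.0, 4.8, 5.3, 4.6, 5.2, 4.9, 5.1]
print(standard_error(data))  # roughly 0.136
```

Note that SE shrinks as n grows; quadrupling the sample size roughly halves the standard error.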
Control errors
Null hypothesis (H0): A has no effect on B.
True situation: no effect (negative)
- Conclusion "not significant": true negative
- Conclusion "significant (reject H0)": false positive, Type I error; controlled by the confidence level / P value
True situation: has an effect (positive)
- Conclusion "significant (reject H0)": true positive
- Conclusion "not significant": false negative, Type II error; controlled by statistical power / sample size
Avoid confounding variables
The dependent variable (A) is affected not only by the independent variable (B) but also by confounding/nuisance variables (C, D, E, F): undesired sources of variation that affect the dependent variable.
- If you can, fix the confounding variable (make it a constant).
- If you can't fix the confounding variable, use blocking.
- If you can neither fix nor block the confounding variable, use randomization.
Common probability distributions
Regression analysis
- Linear regression
- Logistic regression
- Nonlinear regression
- Stepwise regression (forward, backward)
- Ridge, LASSO & ElasticNet regression (handle multicollinearity among variables)
R: correlation coefficient, -1 to +1
R^2: coefficient of determination, 0 to 1
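To make the R^2 definition concrete, here is a minimal single-variable least-squares fit in plain Python (the toy data are illustrative, not from the slides):

```python
def linear_fit(x, y):
    """Fit y = a + b*x by ordinary least squares; return (a, b, R^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx                    # slope
    a = my - b * mx                  # intercept
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return a, b, 1 - ss_res / ss_tot  # R^2 = 1 - SS_res / SS_tot

# Noiseless toy data y = 1 + 2x gives a perfect fit
a, b, r2 = linear_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b, r2)  # 1.0 2.0 1.0
```

With noisy data, R^2 drops below 1, quantifying how much of the variation in y the line explains.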
Machine learning
- Learning: improve performance from experience.
- Machine learning: teach computers to make and improve predictions based on data; an approach to achieve artificial intelligence. Tasks: classification, prediction (regression).
- Data mining: use algorithms to create knowledge from data.
Bayesian statistics for machine learning
Bayes' rule provides the tools to update the probability of a hypothesis as new evidence or information becomes available.
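A worked example of Bayes' rule, with made-up numbers (the prior, sensitivity, and false-positive rate below are illustrative assumptions, not from the slides):

```python
def bayes_update(prior, p_e_given_h, p_e_given_not_h):
    """Posterior P(H|E) via Bayes' rule:
    P(H|E) = P(E|H)P(H) / [P(E|H)P(H) + P(E|~H)P(~H)]"""
    num = p_e_given_h * prior
    return num / (num + p_e_given_not_h * (1 - prior))

# Hypothetical diagnostic test: 1% prior, 90% sensitivity, 5% false-positive rate
post = bayes_update(0.01, 0.90, 0.05)
print(round(post, 3))  # 0.154
```

Evidence can be applied sequentially: each posterior becomes the prior for the next observation, which is exactly the "update as more information becomes available" idea on this slide.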
Common data science algorithms
- Supervised (= predictive): linear regression, decision tree, random forest
- Unsupervised (= exploratory): association rule mining, K-means clustering
Decision tree (regression example from http://www.saedsayad.com/decision_tree_reg.htm)
[Figure: candidate splits with branch weights 5/14, 5/14, 4/14 and branch standard deviations 9.32, 10.87, 7.78, 3.49]
The attribute with the largest standard deviation reduction is chosen for the decision node. Stop when the std for a branch becomes smaller than a certain fraction (e.g., 5%) of the std for the full dataset, or when too few instances remain in the branch.
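The standard-deviation-reduction criterion can be sketched directly; the split that maximizes the reduction becomes the decision node (the toy data below are illustrative):

```python
import math

def std(values):
    """Population standard deviation, as used in the std-reduction criterion."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def std_reduction(target, feature):
    """Std of the target minus the weighted std of the branches
    after splitting on a categorical feature (larger = better split)."""
    groups = {}
    for f, t in zip(feature, target):
        groups.setdefault(f, []).append(t)
    n = len(target)
    weighted = sum(len(g) / n * std(g) for g in groups.values())
    return std(target) - weighted

# Toy example: this split separates the target into two constant branches,
# so the weighted branch std is 0 and the reduction equals the parent std.
print(std_reduction([1, 1, 5, 5], ['a', 'a', 'b', 'b']))  # 2.0
```

A useless split leaves each branch as mixed as the parent, giving a reduction near 0.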
Decision tree: Classification & Regression Trees (CART) (Ankit Sharma, 2014)
You can define a split point for either a categorical or a continuous variable (e.g., X1, X2). Split the dataset based on the homogeneity of the data.
Random forest
A widely used machine learning algorithm for classification: it averages multiple deep decision trees trained on different parts of the same training set, overcoming the overfitting problem of an individual decision tree.
- Approx. 2/3 of the training data are selected at random to grow each tree.
- Predictor variables are selected at random, and the best split is used to split the node.
- For each tree, the leftover (1/3) data are used to calculate the out-of-bag error rate.
- Each tree gives a classification; the forest chooses the classification with the most votes over all the trees in the forest.
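The "2/3 in, 1/3 out-of-bag" figure comes from bootstrap sampling: drawing n rows with replacement leaves each row out with probability (1 - 1/n)^n ≈ 1/e ≈ 0.37. A quick simulation in plain Python (illustrative, not the forest itself):

```python
import random

def oob_fraction(n, seed=0):
    """Fraction of n training rows that never appear in one
    bootstrap sample of size n (drawn with replacement)."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n) for _ in range(n)}
    return 1 - len(in_bag) / n

print(oob_fraction(100_000))  # close to 1/e ~ 0.368
```

Because each tree sees a different in-bag sample, its out-of-bag rows act as a built-in validation set, which is why random forests can estimate error without a separate holdout.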
Variable importance plot (example: classifying income of adults)
Random forests can be used to rank the importance of variables in a regression or classification problem.
- Mean decrease accuracy: how much the model accuracy decreases if we drop that variable.
- Mean decrease Gini: a measure of variable importance based on the Gini impurity index used for the calculation of splits in trees.
Association rule mining
An association rule is a pattern that states that when X occurs, Y occurs with a certain probability (an if/then statement). Initially used for Market Basket Analysis to find how items purchased by customers are related.
support(X → Y) = (X ∪ Y).count / n
confidence(X → Y) = (X ∪ Y).count / X.count
Goal: find all rules that satisfy the user-specified minimum support and minimum confidence.
Association rule mining (the Apriori algorithm), minimum support = 50%
Transactions:
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5
Candidate 1-itemsets: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3 → frequent: {1}:2, {2}:3, {3}:3, {5}:3
Candidate 2-itemsets: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2 → frequent: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2
Candidate 3-itemsets: {2 3 5}:2 → frequent: {2 3 5}:2
Rules from {2 3 5}:
2,3 → 5: confidence = 100%
3,5 → 2: confidence = 100%
2,5 → 3: confidence = 67%
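The support and confidence values in the worked example can be verified with a few lines of Python over the four transactions from the table:

```python
# Transactions for TIDs 100-400, as sets of item IDs
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    """Of the transactions containing X, the fraction that also contain Y."""
    return sum((X | Y) <= t for t in transactions) / sum(X <= t for t in transactions)

print(support({2, 3, 5}, transactions))       # 0.5 -> meets min support of 50%
print(confidence({2, 3}, {5}, transactions))  # 1.0 -> "2,3 => 5" at 100%
print(confidence({2, 5}, {3}, transactions))  # 2/3 -> "2,5 => 3" at 67%
```

The full Apriori algorithm adds the pruning step: any superset of an infrequent itemset (such as {4} here) is skipped without counting.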
K-Means clustering
The algorithm works iteratively to assign each data point to one of K groups based on feature similarity (e.g., a defined distance measure). Outputs: the centroids of the K clusters and labels for the training data.
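The iterate-assign-update loop (Lloyd's algorithm) can be sketched in a few lines of plain Python for 2-D points; the toy data and K = 2 below are illustrative:

```python
import random

def kmeans(points, k, iters=20, seed=1):
    """Lloyd's algorithm: alternate assigning points to the nearest
    centroid and moving each centroid to its cluster mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: (p[0] - centroids[j][0]) ** 2
                                      + (p[1] - centroids[j][1]) ** 2)
            clusters[nearest].append(p)
        centroids = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                     if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

# Two well-separated blobs; the centroids converge to the blob means
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(sorted(kmeans(pts, 2)))
```

Production implementations add smarter initialization (e.g., k-means++) and a convergence check instead of a fixed iteration count.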
Open-source language for data science
Become a data scientist?
Job trends from indeed.com. Demand for deep analytical talent in the U.S. is projected to be 50-60% greater than supply by 2018.