Online Supplementary Material. MetaLP: A Nonparametric Distributed Learning Framework for Small and Big Data


PI: Subhadeep Mukhopadhyay
Department of Statistics, Temple University
Philadelphia, Pennsylvania, 19122, U.S.A.
deep@temple.edu

ABSTRACT

In this note we demonstrate the viability and utility of the proposed MetaLP, a nonparametric distributed statistical learning framework, for small and big data science. We perform a proof-of-concept implementation of MetaLP-based variable selection for two data sets: (1) Titanic (an example of small data) and (2) Expedia personalized hotel search data (an example of a large data set).

1 MetaLP: Nonparametric Parallelizable Algorithm

Figure 1 provides the flowchart of our proposed MetaLP-based data analytics scheme. Here we apply this general framework to design a nonparametric distributed variable selection algorithm. Our approach can detect higher-order interactions from massive data by taking advantage of distributed data processing technologies. The four main components of our algorithm are briefly described as follows.

(1) Partition. Assign observations to different subpopulations in a reasonable manner. Random assignment is one possible partitioning scheme, but there are many other possibilities. The number of subpopulations k can be specified to manage computational efficiency (this step can be omitted if the dataset is already partitioned by some natural grouping variable).

(2) LP Map Function. We apply LP statistical modeling at each data block. We construct the LP statistic for variable selection of a mixed random variable X (either continuous or discrete) based on our specially designed score functions as follows:

$$\mathrm{LP}[j; X, Y] = \mathrm{Cor}[T_j(X; X), Y] = \mathbb{E}[T_j(X; X)\, T_1(Y; Y)]. \tag{1}$$

Using empirical process theory we can show that the sample LP-Fourier measures $\sqrt{n}\,\widehat{\mathrm{LP}}[j; X, Y]$ asymptotically converge to i.i.d. standard normal distributions (Mukhopadhyay and Parzen, 2014).
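A minimal Python sketch of Eq. (1) may help make the LP map function concrete. The note does not spell out how the score functions $T_j$ are built, so the construction below is an assumption borrowed from the LP literature: $T_1$ is taken to be the standardized mid-distribution transform, and higher-order scores are obtained by Gram-Schmidt orthonormalization of its powers. All function names (`mid_distribution`, `lp_score_functions`, `lp_statistics`) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def mid_distribution(values):
    """Mid-distribution transform F_mid(x) = F(x) - 0.5 * p(x) at each observation."""
    n = len(values)
    counts = Counter(values)
    cumulative = 0.0
    f_mid = {}
    for v in sorted(counts):
        p = counts[v] / n
        cumulative += p
        f_mid[v] = cumulative - 0.5 * p
    return [f_mid[v] for v in values]


def standardize(u):
    """Center and scale a vector to mean 0 and (population) variance 1."""
    n = len(u)
    mean = sum(u) / n
    sd = math.sqrt(sum((a - mean) ** 2 for a in u) / n)
    if sd == 0.0:
        raise ValueError("constant variable has no non-trivial score function")
    return [(a - mean) / sd for a in u]


def lp_score_functions(x, m=4):
    """Assumed construction: orthonormalize powers of T_1 (Gram-Schmidt).

    Returns at most m score vectors; a variable with A distinct values
    supports at most A - 1 non-constant orthonormal scores.
    """
    n = len(x)
    t1 = standardize(mid_distribution(x))
    basis = []
    for j in range(1, m + 1):
        v = [a ** j for a in t1]
        mean_v = sum(v) / n
        v = [a - mean_v for a in v]  # orthogonal to the constant function
        for b in basis:  # remove projections onto earlier scores
            c = sum(a * bb for a, bb in zip(v, b)) / n
            v = [a - c * bb for a, bb in zip(v, b)]
        norm = math.sqrt(sum(a * a for a in v) / n)
        if norm < 1e-8:
            break  # no further independent direction exists
        basis.append([a / norm for a in v])
    return basis


def lp_statistics(x, y, m=4):
    """Sample version of Eq. (1): the mean of T_j(X; X) * T_1(Y; Y)."""
    n = len(x)
    t1_y = standardize(mid_distribution(y))
    return [sum(a * b for a, b in zip(t, t1_y)) / n
            for t in lp_score_functions(x, m)]
```

Under this assumed construction a binary X yields a single score function, so `lp_statistics` returns one value for such a variable, while a continuous X yields up to `m` values.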
We will also show how the LP statistic unifies and systematically reproduces all the traditional and modern statistical variable selection measures for different data types of Y and X using one single computing formula. The linear LP-Fourier statistic LP[1; X, Y] measures the location difference between $f(x; X \mid Y = 1)$ and the unconditional distribution $f(x; X)$. The non-linear LP score statistics LP[j; X, Y], j > 1, detect higher-order distributional differences, such as differences in variability, skewness, or tail behavior, to identify important variables.

Figure 1: The workflow of the MetaLP-based data analytics scheme.

The LP map function outputs the corresponding Confidence Distribution (CD) for each subpopulation, $\widehat{\mathrm{LP}}_l[j; X, Y]$, $l = 1, \ldots, k$. We prefer to estimate the CD of the LP statistics, as all the traditional forms of statistical estimation and inference (e.g. point estimation, confidence intervals, hypothesis testing) can be produced in a unified way from the CD. We will derive (using empirical process and stochastic internal representation) the following form of the LP confidence distribution:

$$H_\Phi(\mathrm{LP}[j; X, Y]) = \Phi\left(\sqrt{n}\,\left(\mathrm{LP}[j; X, Y] - \widehat{\mathrm{LP}}[j; X, Y]\right)\right). \tag{2}$$

(3) τ-regularization. Run the heterogeneity $I^2$ diagnostic and compute the τ-corrected version of the LP statistics. We have omitted full details due to space constraints.

(4) Meta Reducer Step. Apply the meta-analysis formula (after incorporating the heterogeneity correction as described in Theorem 1, Eqs. (3)-(4)) to estimate the combined confidence distribution parameters for the LP statistics for each predictor variable. The output from this step is a collection of estimators and standard errors for the combined τ-corrected LP statistic parameters for all predictor variables.

Theorem 1. Setting $F_0^{-1}(t) = \Phi^{-1}(t)$ and $\alpha_l = 1/\sqrt{\tau^2 + (1/n_l)}$, where $\Phi$ is the cumulative distribution function of the standard normal distribution and $n_l$ is the size of subpopulation $l = 1, \ldots, k$, the following combined CD for LP[j; X, Y] follows:

$$H^{(c)}(\mathrm{LP}[j; X, Y]) = \Phi\left[\left(\sum_{l=1}^{k} \frac{1}{\tau^2 + (1/n_l)}\right)^{1/2} \left(\mathrm{LP}[j; X, Y] - \widehat{\mathrm{LP}}^{(c)}[j; X, Y]\right)\right] \tag{3}$$

with

$$\widehat{\mathrm{LP}}^{(c)}[j; X, Y] = \frac{\sum_{l=1}^{k} \left(\tau^2 + (1/n_l)\right)^{-1} \widehat{\mathrm{LP}}_l[j; X, Y]}{\sum_{l=1}^{k} \left(\tau^2 + (1/n_l)\right)^{-1}}, \tag{4}$$

where $\widehat{\mathrm{LP}}^{(c)}[j; X, Y]$ and $\left(\sum_{l=1}^{k} 1/(\tau^2 + (1/n_l))\right)^{-1}$ are the mean and variance, respectively, of the combined CD for LP[j; X, Y].

Figure 2: (a) Left panel: the shape of the first four LP orthonormal score functions for the variable # Siblings aboard, a discrete random variable that takes values 0, ..., 8; (b) Right panel: the shape of the LP basis for the continuous variable Passenger fare. As the number of atoms (# distinct values) A(X) of a random variable increases (moving from discrete to continuous data type), the shape of our custom-designed score polynomials automatically approaches (by construction) a universal shape, which is similar to that of the Legendre polynomials.

2 The Titanic Dataset

The Titanic data set contains information on 891 of its passengers, including which passengers survived. The goal is to identify which factors (e.g. age, gender, class, etc.) significantly influence passenger survival. Complete descriptions of all 8 variables can be found in Table 1. We will use this small data set as a demo of how the MetaLP algorithm (presented in the previous section) actually works on real data sets in a distributed manner, using a single general algorithm irrespective of the data type of each feature. One of the fundamental ingredients of our approach is the LP transformation. The shape of the piecewise-constant orthonormal LP polynomials for the variable # Siblings aboard is shown in Figure 2.

Variable Name   Type          Description                  Value
Survival        Binary        Survival                     0 = No; 1 = Yes
Pclass          Categorical   Passenger Class              1 = 1st; 2 = 2nd; 3 = 3rd
Sex             Binary        Sex                          Male; Female
Age             Numeric       Age                          0-80
Sibsp           Numeric       Number of Siblings Aboard    0-8
Parch           Numeric       Number of Children Aboard    0-6
Fare            Numeric       Passenger Fare
Embarked        Categorical   Port of Embarkation          C = Cherbourg; Q = Queenstown; S = Southampton

Table 1: Data dictionary for the Titanic dataset

The small size of the Titanic data set allows us to compare inference based on the distributed method with inference based on the traditional whole-data method. Figure 3 shows the 95% confidence intervals generated by the MetaLP algorithm for 3 repetitions of random groupings or partitions (k = 5), along with the confidence intervals generated using the whole Titanic dataset. Remarkably, the confidence intervals estimated using the MetaLP algorithm are extremely similar to the intervals estimated using the entire dataset across all variables. The effect of heterogeneity is reflected in the width of the confidence intervals due to increased between-subpopulation variance. Moreover, the point estimates for the LP statistics are almost identical. Thus our proposed distributed computational scheme successfully reproduces the whole-data results for the small data set, which means we can obtain similar statistical inference while taking advantage of the computational efficiency of parallel distributed processing.
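The meta reducer step of Theorem 1, which produces the combined estimates behind comparisons like the one above, can be sketched in a few lines of Python. This is only an illustration of Eqs. (3) and (4) under the stated within-subpopulation variance $1/n_l$; the function names `combine_lp` and `combined_cd` are hypothetical, and the inputs are assumed to be the per-subpopulation LP estimates and sizes stored by the map step.

```python
import math


def combine_lp(estimates, sizes, tau2=0.0):
    """Eq. (4): inverse-variance weighted mean of the subpopulation LP estimates.

    estimates: LP estimate from each subpopulation l = 1..k
    sizes:     subpopulation sizes n_l
    tau2:      between-subpopulation variance (0 means no heterogeneity correction)
    Returns the mean and variance of the combined confidence distribution.
    """
    weights = [1.0 / (tau2 + 1.0 / n) for n in sizes]
    total = sum(weights)
    lp_combined = sum(w * e for w, e in zip(weights, estimates)) / total
    variance = 1.0 / total
    return lp_combined, variance


def normal_cdf(z):
    """Standard normal cumulative distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def combined_cd(theta, lp_combined, variance):
    """Eq. (3): combined confidence distribution evaluated at theta."""
    return normal_cdf((theta - lp_combined) / math.sqrt(variance))
```

With `tau2 = 0` and equal subpopulation sizes the combined estimate reduces to the plain average of the subpopulation estimates, and a positive `tau2` widens the combined variance, which is consistent with the remark above about wider intervals under heterogeneity.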
Figure 3: [color online] 95% confidence intervals of the LP statistic for each variable (Age, Embarked, Fare, Parch, Pclass, Sex, SibSp) based on three MetaLP repetitions with random partitions and on the aggregated Titanic dataset.

3 Expedia Personalized Hotel Search Dataset

3.1 Data Description

The dataset contains various user characteristics (e.g. location, search history, etc.), search criteria (e.g. length of stay, number of children, room count, etc.), and hotel information (e.g. star rating, price, location, promotions, review scores, etc.) that may influence users' Expedia hotel booking behavior. In total, the training data contains 9,917,530 observations across 46 variables. The target (response) variable, booking_bool, is a binary variable that indicates whether the hotel was booked or not. The remaining 45 variables are the explanatory variables mentioned previously. Some specific examples: prop_location_score indicates the desirability of a hotel's location; prop_review_score is the mean customer review score for the hotel on a scale of 5; and price_usd is the displayed price of the hotel.

3.2 Partition

First, we randomly assign search lists, which are collections of observations from search result impressions in the dataset, to 200 different subpopulations for further processing. Random assignment of search lists rather than individual observations ensures that sets of hotels viewed in the same search session are all contained in the same subpopulation. The number of subpopulations chosen here can be adapted to meet the processing and time requirements.

On the other hand, there may be situations where we already have some kind of natural grouping in the dataset, which can be directly utilized as subpopulations. For example, consider the scenario where the available Expedia data are collected from different countries, identified by visitor_location_country_id, an indicator of the visitor's location (country). In this setting, the distributed statistical inference framework can directly utilize these predetermined subpopulations for processing rather than having to randomly assign subpopulations. However, practitioners must be careful to consider heterogeneity among subpopulations in these settings.

Figure 4: [color online] (a) $I^2$ diagnostic for randomly partitioned subpopulations; (b) predetermined grouping by visitor country ID: comparison of $I^2$ diagnostics before τ correction (red dots) and after τ correction (blue dots). The horizontal axis in both panels is the variable index.

3.3 LP Map Function

In this step, we estimate the statistics $\widehat{\mathrm{LP}}_l[j; X_i, Y]$ (the jth LP statistic for the ith variable in the lth subpopulation) and the corresponding confidence distribution for each of the 45 variables in each of the 200 random subpopulations (or the 233 predefined subpopulations defined by the grouping variable visitor_location_country_id), where $i = 1, \ldots, 45$ and $l = 1, \ldots, 200$ index the variable and the subpopulation, respectively. The estimates $\widehat{\mathrm{LP}}_l[j; X_i, Y]$ and the sizes $n_{l;i}$ (used to find the standard deviation) are stored in a matrix for use in the next step.

3.4 Heterogeneity: Diagnostic and Regularization

We then check for heterogeneity issues that may arise from partitioning this large Expedia dataset. We use the $I^2$ diagnostic to measure the severity of heterogeneity across subpopulations for each predictor variable. For the random partitioning scheme, our subpopulations are fairly homogeneous (with respect to all variables), as all $I^2$ statistics are below 40% (see Figure 4(a)). On the other hand, the predefined partitions based on visitor location country divide the data into heterogeneous subpopulations for some variables, as shown in Figure 4(b) (some variables have $I^2$ values outside the permissible range of 0 to 40%). In this scenario, we need to include $\tau^2$ regularization to handle the heterogeneity issue.
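The heterogeneity check can be illustrated with a short Python sketch. Since this note omits the details of the $I^2$ diagnostic and of the $\tau^2$ estimate, the code below is an assumption: it uses the standard meta-analysis definitions (Cochran's Q, the Higgins-Thompson $I^2$, and the DerSimonian-Laird moment estimator of $\tau^2$) with within-subpopulation variance $1/n_l$. The function name `heterogeneity` is hypothetical.

```python
def heterogeneity(estimates, sizes):
    """Return (Q, I^2, tau^2) for one variable across k subpopulations.

    Assumes the within-subpopulation variance of each LP estimate is 1/n_l,
    so the fixed-effect weight of subpopulation l is simply n_l.
    """
    k = len(estimates)
    weights = [float(n) for n in sizes]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))

    # Higgins-Thompson I^2: share of variation attributed to heterogeneity.
    i2 = max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0

    # DerSimonian-Laird moment estimator of the between-subpopulation variance.
    c = total - sum(w * w for w in weights) / total
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    return q, i2, tau2


# Illustrative (made-up) inputs: two subpopulations with clearly different estimates.
q, i2, tau2 = heterogeneity([0.0, 0.5], [100, 100])
needs_correction = i2 > 0.40  # the 40% threshold quoted in Section 3.4
```

A variable whose `i2` exceeds 0.40 would then be passed to the meta reducer with the estimated `tau2` rather than with zero.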
The $I^2$ diagnostic after $\tau^2$ regularization is shown in Figure 4(b) (blue dots), which suggests that all $I^2$ values after regularization fall within the acceptable range of 0 to 40%. The results in this section suggest that our framework is appropriate under both settings, random partitioning and predetermined partitioning, since we can always perform $\tau^2$ regularization when subpopulations appear to be heterogeneous.

3.5 Meta Reducer Step

After applying the $\tau^2$ correction for heterogeneity, we combine the confidence distributions of the LP statistics from the different subpopulations to estimate the combined confidence distribution of the LP statistic for each variable, as outlined in Theorem 1. The results can be found in Figure 5.

Variables with indexes 43, 44, and 45 have highly significant positive relationships with booking_bool, the binary response variable. Those variables are prop_location_score2, the second score outlining the desirability of a hotel's location; promotion_flag, which equals +1 if the hotel had a sale price promotion specifically displayed; and srch_query_affinity_score, the log of the probability that a hotel will be clicked on in Internet searches. Three variables have highly negative impacts on hotel booking: price_usd, the displayed price of the hotel for the given search; srch_length_of_stay, the number of nights of stay that was searched; and srch_booking_window, the number of days between the search date and the start of the hotel stay. Moreover, the confidence intervals of several variables' LP statistics include zero, which means those variables have no statistically significant influence on hotel booking. The top five most influential variables in terms of the absolute value of the LP statistic estimates are prop_location_score2, promotion_flag, price_usd, srch_length_of_stay, and prop_starring.

Figure 5: Expedia data: 95% confidence intervals for each variable's LP statistic.
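The way the combined estimates are read in this section (intervals that exclude zero, ranking by absolute value) can be sketched as follows. This is a minimal illustration, assuming each variable's combined CD is summarized by its mean and variance as in Theorem 1; the function name `summarize` and the variable names and numbers in the usage example are hypothetical placeholders, not results from the Expedia analysis.

```python
import math


def summarize(combined, z=1.959964):
    """Build 95% intervals from combined CD means and variances.

    combined: dict mapping variable name -> (combined LP estimate, combined variance)
    Returns rows sorted by decreasing absolute LP estimate; a variable is
    flagged as significant when its interval does not contain zero.
    """
    rows = []
    for name, (estimate, variance) in combined.items():
        half_width = z * math.sqrt(variance)
        lower, upper = estimate - half_width, estimate + half_width
        rows.append({
            "variable": name,
            "estimate": estimate,
            "lower": lower,
            "upper": upper,
            "significant": lower > 0 or upper < 0,
        })
    rows.sort(key=lambda r: abs(r["estimate"]), reverse=True)
    return rows


# Placeholder inputs for illustration only.
example = {"var_a": (0.05, 1e-4), "var_b": (-0.10, 1e-4), "var_c": (0.001, 1e-4)}
ranking = summarize(example)
```

In this toy input `var_b` ranks first because it has the largest absolute estimate, and `var_c` is not flagged because its interval covers zero.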


More information

STAT Chapter 9: Two-Sample Problems. Paired Differences (Section 9.3)

STAT Chapter 9: Two-Sample Problems. Paired Differences (Section 9.3) STAT 515 -- Chapter 9: Two-Sample Problems Paired Differences (Section 9.3) Examples of Paired Differences studies: Similar subjects are paired off and one of two treatments is given to each subject in

More information

Preface Introduction to Statistics and Data Analysis Overview: Statistical Inference, Samples, Populations, and Experimental Design The Role of

Preface Introduction to Statistics and Data Analysis Overview: Statistical Inference, Samples, Populations, and Experimental Design The Role of Preface Introduction to Statistics and Data Analysis Overview: Statistical Inference, Samples, Populations, and Experimental Design The Role of Probability Sampling Procedures Collection of Data Measures

More information

Basic Business Statistics, 10/e

Basic Business Statistics, 10/e Chapter 4 4- Basic Business Statistics th Edition Chapter 4 Introduction to Multiple Regression Basic Business Statistics, e 9 Prentice-Hall, Inc. Chap 4- Learning Objectives In this chapter, you learn:

More information

WELCOME! Lecture 13 Thommy Perlinger

WELCOME! Lecture 13 Thommy Perlinger Quantitative Methods II WELCOME! Lecture 13 Thommy Perlinger Parametrical tests (tests for the mean) Nature and number of variables One-way vs. two-way ANOVA One-way ANOVA Y X 1 1 One dependent variable

More information

Concepts and Applications of Kriging. Eric Krause

Concepts and Applications of Kriging. Eric Krause Concepts and Applications of Kriging Eric Krause Sessions of note Tuesday ArcGIS Geostatistical Analyst - An Introduction 8:30-9:45 Room 14 A Concepts and Applications of Kriging 10:15-11:30 Room 15 A

More information

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model 1 Linear Regression 2 Linear Regression In this lecture we will study a particular type of regression model: the linear regression model We will first consider the case of the model with one predictor

More information

Lecture 14: Introduction to Poisson Regression

Lecture 14: Introduction to Poisson Regression Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu 8 May 2007 1 / 52 Overview Modelling counts Contingency tables Poisson regression models 2 / 52 Modelling counts I Why

More information

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview Modelling counts I Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu Why count data? Number of traffic accidents per day Mortality counts in a given neighborhood, per week

More information

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F). STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) T In 2 2 tables, statistical independence is equivalent to a population

More information

STP 226 EXAMPLE EXAM #3 INSTRUCTOR:

STP 226 EXAMPLE EXAM #3 INSTRUCTOR: STP 226 EXAMPLE EXAM #3 INSTRUCTOR: Honor Statement: I have neither given nor received information regarding this exam, and I will not do so until all exams have been graded and returned. Signed Date PRINTED

More information

WISE MA/PhD Programs Econometrics Instructor: Brett Graham Spring Semester, Academic Year Exam Version: A

WISE MA/PhD Programs Econometrics Instructor: Brett Graham Spring Semester, Academic Year Exam Version: A WISE MA/PhD Programs Econometrics Instructor: Brett Graham Spring Semester, 2015-16 Academic Year Exam Version: A INSTRUCTIONS TO STUDENTS 1 The time allowed for this examination paper is 2 hours. 2 This

More information

Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model "Checking"/Diagnostics

Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model Checking/Diagnostics Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model "Checking"/Diagnostics The session is a continuation of a version of Section 11.3 of MMD&S. It concerns

More information

Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model "Checking"/Diagnostics

Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model Checking/Diagnostics Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model "Checking"/Diagnostics The session is a continuation of a version of Section 11.3 of MMD&S. It concerns

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Ad Placement Strategies

Ad Placement Strategies Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January

More information

SPSS LAB FILE 1

SPSS LAB FILE  1 SPSS LAB FILE www.mcdtu.wordpress.com 1 www.mcdtu.wordpress.com 2 www.mcdtu.wordpress.com 3 OBJECTIVE 1: Transporation of Data Set to SPSS Editor INPUTS: Files: group1.xlsx, group1.txt PROCEDURE FOLLOWED:

More information

Validation of Visual Statistical Inference, with Application to Linear Models

Validation of Visual Statistical Inference, with Application to Linear Models Validation of Visual Statistical Inference, with pplication to Linear Models Mahbubul Majumder, Heike Hofmann, Dianne Cook Department of Statistics, Iowa State University pril 2, 212 Statistical graphics

More information

Multiple Regression. Dr. Frank Wood. Frank Wood, Linear Regression Models Lecture 12, Slide 1

Multiple Regression. Dr. Frank Wood. Frank Wood, Linear Regression Models Lecture 12, Slide 1 Multiple Regression Dr. Frank Wood Frank Wood, fwood@stat.columbia.edu Linear Regression Models Lecture 12, Slide 1 Review: Matrix Regression Estimation We can solve this equation (if the inverse of X

More information

An overview of Boosting. Yoav Freund UCSD

An overview of Boosting. Yoav Freund UCSD An overview of Boosting Yoav Freund UCSD Plan of talk Generative vs. non-generative modeling Boosting Alternating decision trees Boosting and over-fitting Applications 2 Toy Example Computer receives telephone

More information

Inferences About the Difference Between Two Means

Inferences About the Difference Between Two Means 7 Inferences About the Difference Between Two Means Chapter Outline 7.1 New Concepts 7.1.1 Independent Versus Dependent Samples 7.1. Hypotheses 7. Inferences About Two Independent Means 7..1 Independent

More information

Final Exam - Solutions

Final Exam - Solutions Ecn 102 - Analysis of Economic Data University of California - Davis March 19, 2010 Instructor: John Parman Final Exam - Solutions You have until 5:30pm to complete this exam. Please remember to put your

More information

Econometrics Problem Set 6

Econometrics Problem Set 6 Econometrics Problem Set 6 WISE, Xiamen University Spring 2016-17 Conceptual Questions 1. This question refers to the estimated regressions shown in Table 1 computed using data for 1988 from the CPS. The

More information

Concepts and Applications of Kriging

Concepts and Applications of Kriging 2013 Esri International User Conference July 8 12, 2013 San Diego, California Technical Workshop Concepts and Applications of Kriging Eric Krause Konstantin Krivoruchko Outline Intro to interpolation Exploratory

More information

A Course on Advanced Econometrics

A Course on Advanced Econometrics A Course on Advanced Econometrics Yongmiao Hong The Ernest S. Liu Professor of Economics & International Studies Cornell University Course Introduction: Modern economies are full of uncertainties and risk.

More information

Do not copy, post, or distribute

Do not copy, post, or distribute 14 CORRELATION ANALYSIS AND LINEAR REGRESSION Assessing the Covariability of Two Quantitative Properties 14.0 LEARNING OBJECTIVES In this chapter, we discuss two related techniques for assessing a possible

More information

Correlation & Simple Regression

Correlation & Simple Regression Chapter 11 Correlation & Simple Regression The previous chapter dealt with inference for two categorical variables. In this chapter, we would like to examine the relationship between two quantitative variables.

More information

In-Database Factorised Learning fdbresearch.github.io

In-Database Factorised Learning fdbresearch.github.io In-Database Factorised Learning fdbresearch.github.io Mahmoud Abo Khamis, Hung Ngo, XuanLong Nguyen, Dan Olteanu, and Maximilian Schleich December 2017 Logic for Data Science Seminar Alan Turing Institute

More information

Regression With a Categorical Independent Variable

Regression With a Categorical Independent Variable Regression ith a Independent Variable ERSH 8320 Slide 1 of 34 Today s Lecture Regression with a single categorical independent variable. Today s Lecture Coding procedures for analysis. Dummy coding. Relationship

More information

ADVANCED STATISTICAL ANALYSIS OF EPIDEMIOLOGICAL STUDIES. Cox s regression analysis Time dependent explanatory variables

ADVANCED STATISTICAL ANALYSIS OF EPIDEMIOLOGICAL STUDIES. Cox s regression analysis Time dependent explanatory variables ADVANCED STATISTICAL ANALYSIS OF EPIDEMIOLOGICAL STUDIES Cox s regression analysis Time dependent explanatory variables Henrik Ravn Bandim Health Project, Statens Serum Institut 4 November 2011 1 / 53

More information

STAT 510 Final Exam Spring 2015

STAT 510 Final Exam Spring 2015 STAT 510 Final Exam Spring 2015 Instructions: The is a closed-notes, closed-book exam No calculator or electronic device of any kind may be used Use nothing but a pen or pencil Please write your name and

More information

INFERENCE FOR REGRESSION

INFERENCE FOR REGRESSION CHAPTER 3 INFERENCE FOR REGRESSION OVERVIEW In Chapter 5 of the textbook, we first encountered regression. The assumptions that describe the regression model we use in this chapter are the following. We

More information

Tutorial: Urban Trajectory Visualization. Case Studies. Ye Zhao

Tutorial: Urban Trajectory Visualization. Case Studies. Ye Zhao Case Studies Ye Zhao Use Cases We show examples of the web-based visual analytics system TrajAnalytics The case study information and videos are available at http://vis.cs.kent.edu/trajanalytics/ Porto

More information

A Statistical Look at Spectral Graph Analysis. Deep Mukhopadhyay

A Statistical Look at Spectral Graph Analysis. Deep Mukhopadhyay A Statistical Look at Spectral Graph Analysis Deep Mukhopadhyay Department of Statistics, Temple University Office: Speakman 335 deep@temple.edu http://sites.temple.edu/deepstat/ Graph Signal Processing

More information

CHAPTER 4 & 5 Linear Regression with One Regressor. Kazu Matsuda IBEC PHBU 430 Econometrics

CHAPTER 4 & 5 Linear Regression with One Regressor. Kazu Matsuda IBEC PHBU 430 Econometrics CHAPTER 4 & 5 Linear Regression with One Regressor Kazu Matsuda IBEC PHBU 430 Econometrics Introduction Simple linear regression model = Linear model with one independent variable. y = dependent variable

More information

Section 7.2 Homework Answers

Section 7.2 Homework Answers 25.5 30 Sample Mean P 0.1226 sum n b. The two z-scores are z 25 20(1.7) n 1.0 20 sum n 2.012 and z 30 20(1.7) n 1.0 0.894, 20 so the probability is approximately 0.1635 (0.1645 using Table A). P14. a.

More information

Boosting: Foundations and Algorithms. Rob Schapire

Boosting: Foundations and Algorithms. Rob Schapire Boosting: Foundations and Algorithms Rob Schapire Example: Spam Filtering problem: filter out spam (junk email) gather large collection of examples of spam and non-spam: From: yoav@ucsd.edu Rob, can you

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Basics on t-tests Independent Sample t-tests Single-Sample t-tests Summary of t-tests Multiple Tests, Effect Size Proportions. Statistiek I.

Basics on t-tests Independent Sample t-tests Single-Sample t-tests Summary of t-tests Multiple Tests, Effect Size Proportions. Statistiek I. Statistiek I t-tests John Nerbonne CLCG, Rijksuniversiteit Groningen http://www.let.rug.nl/nerbonne/teach/statistiek-i/ John Nerbonne 1/46 Overview 1 Basics on t-tests 2 Independent Sample t-tests 3 Single-Sample

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Inference with Simple Regression

Inference with Simple Regression 1 Introduction Inference with Simple Regression Alan B. Gelder 06E:071, The University of Iowa 1 Moving to infinite means: In this course we have seen one-mean problems, twomean problems, and problems

More information