Online Supplementary Material. MetaLP: A Nonparametric Distributed Learning Framework for Small and Big Data
PI: Subhadeep Mukhopadhyay
Department of Statistics, Temple University
Philadelphia, Pennsylvania, 19122, U.S.A.
deep@temple.edu

ABSTRACT

In this note we demonstrate the viability and utility of the proposed MetaLP, a nonparametric distributed statistical learning framework, for small and big data science. We perform a proof-of-concept implementation of MetaLP-based variable selection for two data sets: (1) Titanic (an example of small data) and (2) Expedia personalized hotel search data (an example of a large data set).

1 MetaLP: Nonparametric Parallelizable Algorithm

Figure 1 provides the flowchart of our proposed MetaLP-based data analytics scheme. Here we apply this general framework to design a nonparametric distributed variable selection algorithm. Our approach can detect higher-order interactions from massive data by taking advantage of distributed data processing technologies. The four main components of our algorithm are briefly described as follows.

(1) Partition. Assign observations to different subpopulations in a reasonable manner. Random assignment is one possible partitioning scheme, but there are many other possibilities. The number of subpopulations k can be specified to manage computational efficiency. This step can be omitted if the dataset is already partitioned by some natural grouping variable.

(2) LP Map Function. We apply LP statistical modeling at each data block. We construct the LP statistic for variable selection of a mixed random variable X (either continuous or discrete) based on our specially designed score functions as follows:

$$\mathrm{LP}[j; X, Y] = \mathrm{Cor}[T_j(X; X), Y] = \mathrm{E}[T_j(X; X)\, T_1(Y; Y)]. \tag{1}$$

Using empirical process theory we can show that the sample LP-Fourier measures $\sqrt{n}\,\widehat{\mathrm{LP}}[j; X, Y]$ asymptotically converge to i.i.d. standard normal distributions (Mukhopadhyay and Parzen, 2014).
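As an illustration of Eq. (1), the following is a minimal sketch of how sample LP statistics could be computed with NumPy. The note does not spell out how the score functions $T_j$ are built, so the construction used here (Gram-Schmidt orthonormalization of powers of the mid-distribution transform) is an assumption, and the names `lp_score_functions` and `lp_statistics` are hypothetical. The sketch also assumes that X and Y are numerically coded.

```python
import numpy as np

def lp_score_functions(x, m=4):
    """Return an (n, m) array of orthonormal score functions evaluated at the sample.

    Assumed construction: orthonormalize powers of the mid-distribution
    transform F(x) - 0.5 * p(x), so it works for discrete and continuous x.
    """
    x = np.asarray(x, dtype=float).ravel()
    n = x.size
    values, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    if values.size < 2:
        raise ValueError("x needs at least two distinct values")
    p = counts / n
    f_mid = (np.cumsum(p) - 0.5 * p)[inverse]      # mid-distribution transform
    m = min(m, values.size - 1)                    # at most (#atoms - 1) scores exist
    base = f_mid - f_mid.mean()
    powers = np.column_stack([base ** j for j in range(1, m + 1)])
    powers -= powers.mean(axis=0)                  # keep every column mean-zero
    q, r = np.linalg.qr(powers)                    # Gram-Schmidt via QR
    # Fix signs so each score is positively aligned with its power, then
    # rescale so every column has sample mean 0 and sample variance 1.
    return q * np.sign(np.diag(r)) * np.sqrt(n)

def lp_statistics(x, y, m=4):
    """Sample version of LP[j; X, Y] = E[T_j(X; X) T_1(Y; Y)] for j = 1..m."""
    tx = lp_score_functions(x, m)
    ty = lp_score_functions(y, 1)[:, 0]
    return tx.T @ ty / ty.size

# Synthetic example (not data from the note)
rng = np.random.default_rng(0)
x = rng.exponential(size=500)
y = (x + rng.normal(size=500) > 1.0).astype(int)
lp = lp_statistics(x, y)
z = np.sqrt(x.size) * lp   # compare against standard normal quantiles
```

Under this construction, each entry of `z` can be read against standard normal quantiles, in line with the asymptotic result quoted above.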
We will also show how the LP statistic unifies and systematically reproduces all the traditional and modern statistical variable selection measures for different data types of Y and X using one single computing formula. The linear LP-Fourier statistic LP[1; X, Y] measures the location difference between the conditional distribution $f(x; X \mid Y = 1)$ and the unconditional distribution $f(x; X)$. The nonlinear LP score statistics LP[j; X, Y], j > 1, detect higher-order distributional differences, such as differences in variability, skewness, or tail behavior, to identify important variables.

Figure 1: The workflow of the MetaLP-based data analytics scheme.

The LP map function outputs the confidence distribution (CD) of the LP statistic for each subpopulation, $\widehat{\mathrm{LP}}_l[j; X, Y]$, $l = 1, \ldots, k$. We prefer to estimate the CD of the LP statistics because all the traditional forms of statistical estimation and inference (e.g. point estimation, confidence intervals, hypothesis testing) can be produced in a unified way from the CD. We will derive (using empirical process theory and a stochastic internal representation) the following form of the LP confidence distribution:

$$H_{\Phi}(\mathrm{LP}[j; X, Y]) = \Phi\Big(\sqrt{n}\,\big(\mathrm{LP}[j; X, Y] - \widehat{\mathrm{LP}}[j; X, Y]\big)\Big). \tag{2}$$

(3) τ-regularization. Run the $I^2$ heterogeneity diagnostic and compute the τ-corrected version of the LP statistics. We have omitted full details due to space constraints.

(4) Meta Reducer Step. Apply the meta-analysis formula (after incorporating the heterogeneity correction, as described in Theorem 1, Eqs. (3)-(4)) to estimate the combined confidence distribution parameters of the LP statistics for each predictor variable. The output from this step is a collection of estimators and standard errors for the combined τ-corrected LP statistic parameters for all predictor variables.

Theorem 1. Setting $F_0^{-1}(t) = \Phi^{-1}(t)$ and $\alpha_l = 1/\sqrt{\tau^2 + (1/n_l)}$, where $\Phi$ is the cumulative distribution function of the standard normal distribution and $n_l$ is the size of subpopulation $l = 1, \ldots, k$, the following combined CD for LP[j; X, Y] follows:

$$H^{(c)}(\mathrm{LP}[j; X, Y]) = \Phi\Bigg[\bigg(\sum_{l=1}^{k} \frac{1}{\tau^2 + (1/n_l)}\bigg)^{1/2} \Big(\mathrm{LP}[j; X, Y] - \widehat{\mathrm{LP}}^{(c)}[j; X, Y]\Big)\Bigg] \tag{3}$$

with

$$\widehat{\mathrm{LP}}^{(c)}[j; X, Y] = \frac{\sum_{l=1}^{k} \big(\tau^2 + (1/n_l)\big)^{-1}\, \widehat{\mathrm{LP}}_l[j; X, Y]}{\sum_{l=1}^{k} \big(\tau^2 + (1/n_l)\big)^{-1}}, \tag{4}$$

where $\widehat{\mathrm{LP}}^{(c)}[j; X, Y]$ and $\big(\sum_{l=1}^{k} 1/(\tau^2 + (1/n_l))\big)^{-1}$ are the mean and variance, respectively, of the combined CD for LP[j; X, Y].

Figure 2: (a) Left: the shape of the first four LP orthonormal score functions for the variable # Siblings aboard, a discrete random variable that takes values 0, ..., 8; (b) Right: the shape of the LP basis for the continuous variable Passenger fare. As the number of atoms (# distinct values) of a random variable, A(X), increases (moving from discrete to continuous data type), the shape of our custom-designed score polynomials automatically approaches (by construction) a universal shape, which is similar to that of the Legendre polynomials.

2 The Titanic Dataset

The Titanic data set contains information on 891 of its passengers, including which passengers survived. The goal is to identify which factors (e.g. age, gender, class, etc.) significantly influence passenger survival. Complete descriptions of all 8 variables can be found in Table 1. We use this small data set to demonstrate how the MetaLP algorithm (presented in the previous section) works on real data in a distributed manner, using a single general algorithm irrespective of the data type of each feature. One of the fundamental ingredients of our approach is the LP transformation. The shape of the piecewise-constant orthonormal LP polynomials for the variable # Siblings aboard is shown in Figure 2.

| Variable Name | Type | Description | Value |
|---|---|---|---|
| Survival | Binary | Survival | 0 = No; 1 = Yes |
| Pclass | Categorical | Passenger Class | 1 = 1st; 2 = 2nd; 3 = 3rd |
| Sex | Binary | Sex | Male; Female |
| Age | Numeric | Age | 0-80 |
| Sibsp | Numeric | Number of Siblings Aboard | 0-8 |
| Parch | Numeric | Number of Children Aboard | 0-6 |
| Fare | Numeric | Passenger Fare | |
| Embarked | Categorical | Port of Embarkation | C = Cherbourg; Q = Queenstown; S = Southampton |

Table 1: Data dictionary for the Titanic dataset.

The small size of the Titanic data set allows us to compare inference based on the distributed method with inference based on the traditional method that uses the entire data set. Figure 3 shows the 95% confidence intervals generated by the MetaLP algorithm for 3 repetitions of random groupings or partitions (k = 5), along with the confidence intervals generated using the whole Titanic dataset. Remarkably, the confidence intervals estimated using the MetaLP algorithm are extremely similar to the intervals estimated using the entire dataset across all variables. The effect of heterogeneity is reflected in the width of the confidence intervals due to increased between-subpopulation variance. Moreover, the point estimates for the LP statistics are almost identical. Thus our proposed distributed computational scheme successfully reproduces the full-data results for this small data set, which means we can obtain similar statistical inference while taking advantage of the computational efficiency of parallel distributed processing.
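The random partitioning used for the Titanic comparison, and the 95% interval implied by a normal confidence distribution with standard deviation $1/\sqrt{n}$, can be sketched as follows. This is only an illustration using the Python standard library: the helper names `random_partition` and `cd_interval` are hypothetical, and the LP estimate passed to `cd_interval` is a made-up value, not a result from the note.

```python
import math
import random

def random_partition(n_obs, k, seed=0):
    """Randomly split the indices 0..n_obs-1 into k near-equal subpopulations."""
    idx = list(range(n_obs))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cd_interval(lp_hat, n, z=1.959964):
    """Central 95% interval from the CD Phi(sqrt(n) * (theta - lp_hat))."""
    half_width = z / math.sqrt(n)
    return lp_hat - half_width, lp_hat + half_width

# 891 Titanic passengers split into k = 5 subpopulations
blocks = random_partition(891, 5)
sizes = [len(b) for b in blocks]

# Hypothetical LP estimate from one block, for illustration only
lower, upper = cd_interval(0.30, sizes[0])
```

Repeating `random_partition` with different seeds corresponds to the repeated random groupings compared in Figure 3.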
3 Expedia Personalized Hotel Search Dataset

3.1 Data Description

The dataset contains various user characteristics (e.g. location, search history, etc.), search criteria (e.g. length of stay, number of children, room count, etc.), and hotel information (e.g. star rating, price, location, promotions, review scores, etc.) that may influence users' Expedia hotel booking behavior. In total, the training data contains 9,917,530 observations across 46 variables. The target (response) variable, booking_bool, is a binary variable that indicates whether the hotel was booked or not. The remaining 45 variables are the explanatory variables described above. Some specific examples: prop_location_score indicates the desirability of a hotel's location; prop_review_score is the mean customer review score for the hotel on a scale out of 5; and price_usd is the displayed price of the hotel.

Figure 3: [color online] 95% confidence intervals of the LP statistic for each variable (Age, Embarked, Fare, Parch, Pclass, Sex, SibSp), based on three MetaLP repetitions of random partitioning and on the aggregated Titanic dataset.

3.2 Partition

First, we randomly assign search lists, which are collections of observations from search result impressions in the dataset, to 200 different subpopulations for further processing. Random assignment of search lists rather than individual observations ensures that sets of hotels viewed in the same search session are all contained in the same subpopulation. The number of subpopulations chosen here can be adapted to meet processing and time requirements.

On the other hand, there may be situations where the dataset already has some kind of natural grouping, which can be used directly to define subpopulations. For example, consider the scenario where the available Expedia data are collected from different countries, identified by visitor_location_country_id, an indicator of the visitor's location (country). In this setting, the distributed statistical inference framework can directly use these predetermined subpopulations for processing rather than having to randomly assign subpopulations. However, practitioners must be careful to consider heterogeneity among subpopulations in these settings.

Figure 4: [color online] (a) $I^2$ diagnostic for randomly partitioned subpopulations; (b) Predetermined grouping by visitor country ID: comparison of $I^2$ diagnostics before τ correction (red dots) and after τ correction (blue dots). Both panels plot $I^2$ against the variable index.

3.3 LP Map Function

In this step, we estimate the statistics $\widehat{\mathrm{LP}}_l[j; X_i, Y]$ (the jth LP statistic for the ith variable in the lth subpopulation) and the corresponding confidence distribution of each of the 45 variables for 200 random subpopulations (or 233 predefined subpopulations defined by the grouping variable visitor_location_country_id), where $i = 1, \ldots, 45$ and $l = 1, \ldots, 200$ index the variable and the subpopulation, respectively. The estimates $\widehat{\mathrm{LP}}_l[j; X_i, Y]$ and the sample sizes $n_{l;i}$ (used to find the standard deviation) are stored in a matrix for use in the next step.

3.4 Heterogeneity: Diagnostic and Regularization

We then check for heterogeneity issues that may arise from partitioning this large Expedia dataset. We use the $I^2$ diagnostic to measure the severity of heterogeneity across subpopulations for each predictor variable. For the random partitioning scheme, our subpopulations are fairly homogeneous (with respect to all variables), as all $I^2$ statistics are below 40% (see Figure 4(a)). On the other hand, the predefined partitions based on visitor location country divide the data into subpopulations that are heterogeneous for some variables, as shown in Figure 4(b) (some variables have $I^2$ values outside the permissible range of 0 to 40%). In this scenario, we need to include $\tau^2$ regularization to handle the heterogeneity issue.
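The note omits the formulas behind the $I^2$ diagnostic and the $\tau^2$ estimate. As a hedged sketch, the block below uses two standard meta-analysis choices, which are assumptions here rather than something the note confirms: the Higgins-Thompson $I^2$ based on Cochran's Q, and the DerSimonian-Laird moment estimator of $\tau^2$. The within-subpopulation variance of each LP estimate is taken to be $1/n_l$; the function names and input values are hypothetical.

```python
import numpy as np

def i2_statistic(lp_hat, n, tau2=0.0):
    """Higgins-Thompson I^2 (as a fraction) using weights 1 / (tau2 + 1/n_l)."""
    lp_hat = np.asarray(lp_hat, dtype=float)
    n = np.asarray(n, dtype=float)
    w = 1.0 / (tau2 + 1.0 / n)
    pooled = np.sum(w * lp_hat) / np.sum(w)
    q = np.sum(w * (lp_hat - pooled) ** 2)     # Cochran's Q
    df = lp_hat.size - 1
    return max(0.0, (q - df) / q) if q > 0 else 0.0

def dl_tau2(lp_hat, n):
    """DerSimonian-Laird estimate of the between-subpopulation variance tau^2."""
    lp_hat = np.asarray(lp_hat, dtype=float)
    w = np.asarray(n, dtype=float)             # inverse of the variance 1/n_l
    pooled = np.sum(w * lp_hat) / np.sum(w)
    q = np.sum(w * (lp_hat - pooled) ** 2)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    return max(0.0, (q - (lp_hat.size - 1)) / c)

# Made-up estimates for one variable across four subpopulations
lp_hat = [0.10, 0.30, -0.10, 0.25]
n = [1000, 1000, 1000, 1000]
i2_before = i2_statistic(lp_hat, n)
tau2 = dl_tau2(lp_hat, n)
i2_after = i2_statistic(lp_hat, n, tau2=tau2)
```

In this toy input the uncorrected $I^2$ is far above 40%, and recomputing it with the estimated $\tau^2$ in the weights brings it back under that threshold, mirroring the before and after comparison described for Figure 4(b).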
The $I^2$ diagnostic after $\tau^2$ regularization is shown in Figure 4(b) (blue dots), which suggests that all $I^2$ values after regularization fall within the acceptable range of 0 to 40%. The results in this section suggest that our framework is appropriate under both settings, random partitioning and predetermined partitioning, since we can always perform $\tau^2$ regularization when subpopulations appear to be heterogeneous.

Figure 5: Expedia data: 95% confidence intervals for each variable's LP statistic.

3.5 Meta Reducer Step

After applying the $\tau^2$ correction for heterogeneity, we combine the confidence distributions of the LP statistics from different subpopulations to estimate the combined confidence distribution of the LP statistic for each variable, as outlined in Theorem 1. The results can be found in Figure 5.

Variables with indexes 43, 44, and 45 have highly significant positive relationships with booking_bool, the binary response variable. Those variables are prop_location_score2, the second score outlining the desirability of a hotel's location; promotion_flag, which is +1 if the hotel had a sale price promotion specifically displayed; and srch_query_affinity_score, the log of the probability that a hotel will be clicked on in Internet searches. Three variables have highly negative impacts on hotel booking: price_usd, the displayed price of the hotel for the given search; srch_length_of_stay, the number of nights of stay that was searched; and srch_booking_window, the number of days between the search date and the start of the hotel stay. Moreover, there are several variables whose LP statistic confidence intervals include zero, which means those variables have an insignificant influence on hotel booking. The top five most influential variables in terms of the absolute value of the LP statistic estimates are prop_location_score2, promotion_flag, price_usd, srch_length_of_stay, and prop starring.
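The combining rule of Theorem 1 amounts to an inverse-variance weighted average with weights $1/(\tau^2 + 1/n_l)$. A minimal sketch follows, in which the function name `combine_lp` and the input values are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def combine_lp(lp_hat, n, tau2=0.0, level=0.95):
    """Combine subpopulation LP estimates following Eqs. (3)-(4) of Theorem 1.

    Returns the combined estimate, its standard error, and a central
    confidence interval read off the combined normal CD.
    """
    lp_hat = np.asarray(lp_hat, dtype=float)
    n = np.asarray(n, dtype=float)
    w = 1.0 / (tau2 + 1.0 / n)                  # weights (tau^2 + 1/n_l)^(-1)
    lp_c = np.sum(w * lp_hat) / np.sum(w)       # mean of the combined CD, Eq. (4)
    se = np.sqrt(1.0 / np.sum(w))               # sqrt of the combined CD variance
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return lp_c, se, (lp_c - z * se, lp_c + z * se)

# Made-up estimates from two subpopulations of sizes 100 and 300
lp_c, se, ci = combine_lp([0.2, 0.4], [100, 300])
lp_c_tau, se_tau, ci_tau = combine_lp([0.2, 0.4], [100, 300], tau2=0.01)
```

With `tau2 = 0` the weights reduce to the subpopulation sizes, so larger blocks dominate. A positive `tau2` flattens the weights and widens the interval, which is how heterogeneity shows up in the width of the reported confidence intervals.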
More informationAd Placement Strategies
Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January
More informationSPSS LAB FILE 1
SPSS LAB FILE www.mcdtu.wordpress.com 1 www.mcdtu.wordpress.com 2 www.mcdtu.wordpress.com 3 OBJECTIVE 1: Transporation of Data Set to SPSS Editor INPUTS: Files: group1.xlsx, group1.txt PROCEDURE FOLLOWED:
More informationValidation of Visual Statistical Inference, with Application to Linear Models
Validation of Visual Statistical Inference, with pplication to Linear Models Mahbubul Majumder, Heike Hofmann, Dianne Cook Department of Statistics, Iowa State University pril 2, 212 Statistical graphics
More informationMultiple Regression. Dr. Frank Wood. Frank Wood, Linear Regression Models Lecture 12, Slide 1
Multiple Regression Dr. Frank Wood Frank Wood, fwood@stat.columbia.edu Linear Regression Models Lecture 12, Slide 1 Review: Matrix Regression Estimation We can solve this equation (if the inverse of X
More informationAn overview of Boosting. Yoav Freund UCSD
An overview of Boosting Yoav Freund UCSD Plan of talk Generative vs. non-generative modeling Boosting Alternating decision trees Boosting and over-fitting Applications 2 Toy Example Computer receives telephone
More informationInferences About the Difference Between Two Means
7 Inferences About the Difference Between Two Means Chapter Outline 7.1 New Concepts 7.1.1 Independent Versus Dependent Samples 7.1. Hypotheses 7. Inferences About Two Independent Means 7..1 Independent
More informationFinal Exam - Solutions
Ecn 102 - Analysis of Economic Data University of California - Davis March 19, 2010 Instructor: John Parman Final Exam - Solutions You have until 5:30pm to complete this exam. Please remember to put your
More informationEconometrics Problem Set 6
Econometrics Problem Set 6 WISE, Xiamen University Spring 2016-17 Conceptual Questions 1. This question refers to the estimated regressions shown in Table 1 computed using data for 1988 from the CPS. The
More informationConcepts and Applications of Kriging
2013 Esri International User Conference July 8 12, 2013 San Diego, California Technical Workshop Concepts and Applications of Kriging Eric Krause Konstantin Krivoruchko Outline Intro to interpolation Exploratory
More informationA Course on Advanced Econometrics
A Course on Advanced Econometrics Yongmiao Hong The Ernest S. Liu Professor of Economics & International Studies Cornell University Course Introduction: Modern economies are full of uncertainties and risk.
More informationDo not copy, post, or distribute
14 CORRELATION ANALYSIS AND LINEAR REGRESSION Assessing the Covariability of Two Quantitative Properties 14.0 LEARNING OBJECTIVES In this chapter, we discuss two related techniques for assessing a possible
More informationCorrelation & Simple Regression
Chapter 11 Correlation & Simple Regression The previous chapter dealt with inference for two categorical variables. In this chapter, we would like to examine the relationship between two quantitative variables.
More informationIn-Database Factorised Learning fdbresearch.github.io
In-Database Factorised Learning fdbresearch.github.io Mahmoud Abo Khamis, Hung Ngo, XuanLong Nguyen, Dan Olteanu, and Maximilian Schleich December 2017 Logic for Data Science Seminar Alan Turing Institute
More informationRegression With a Categorical Independent Variable
Regression ith a Independent Variable ERSH 8320 Slide 1 of 34 Today s Lecture Regression with a single categorical independent variable. Today s Lecture Coding procedures for analysis. Dummy coding. Relationship
More informationADVANCED STATISTICAL ANALYSIS OF EPIDEMIOLOGICAL STUDIES. Cox s regression analysis Time dependent explanatory variables
ADVANCED STATISTICAL ANALYSIS OF EPIDEMIOLOGICAL STUDIES Cox s regression analysis Time dependent explanatory variables Henrik Ravn Bandim Health Project, Statens Serum Institut 4 November 2011 1 / 53
More informationSTAT 510 Final Exam Spring 2015
STAT 510 Final Exam Spring 2015 Instructions: The is a closed-notes, closed-book exam No calculator or electronic device of any kind may be used Use nothing but a pen or pencil Please write your name and
More informationINFERENCE FOR REGRESSION
CHAPTER 3 INFERENCE FOR REGRESSION OVERVIEW In Chapter 5 of the textbook, we first encountered regression. The assumptions that describe the regression model we use in this chapter are the following. We
More informationTutorial: Urban Trajectory Visualization. Case Studies. Ye Zhao
Case Studies Ye Zhao Use Cases We show examples of the web-based visual analytics system TrajAnalytics The case study information and videos are available at http://vis.cs.kent.edu/trajanalytics/ Porto
More informationA Statistical Look at Spectral Graph Analysis. Deep Mukhopadhyay
A Statistical Look at Spectral Graph Analysis Deep Mukhopadhyay Department of Statistics, Temple University Office: Speakman 335 deep@temple.edu http://sites.temple.edu/deepstat/ Graph Signal Processing
More informationCHAPTER 4 & 5 Linear Regression with One Regressor. Kazu Matsuda IBEC PHBU 430 Econometrics
CHAPTER 4 & 5 Linear Regression with One Regressor Kazu Matsuda IBEC PHBU 430 Econometrics Introduction Simple linear regression model = Linear model with one independent variable. y = dependent variable
More informationSection 7.2 Homework Answers
25.5 30 Sample Mean P 0.1226 sum n b. The two z-scores are z 25 20(1.7) n 1.0 20 sum n 2.012 and z 30 20(1.7) n 1.0 0.894, 20 so the probability is approximately 0.1635 (0.1645 using Table A). P14. a.
More informationBoosting: Foundations and Algorithms. Rob Schapire
Boosting: Foundations and Algorithms Rob Schapire Example: Spam Filtering problem: filter out spam (junk email) gather large collection of examples of spam and non-spam: From: yoav@ucsd.edu Rob, can you
More information401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.
401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis
More informationBasics on t-tests Independent Sample t-tests Single-Sample t-tests Summary of t-tests Multiple Tests, Effect Size Proportions. Statistiek I.
Statistiek I t-tests John Nerbonne CLCG, Rijksuniversiteit Groningen http://www.let.rug.nl/nerbonne/teach/statistiek-i/ John Nerbonne 1/46 Overview 1 Basics on t-tests 2 Independent Sample t-tests 3 Single-Sample
More informationCSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18
CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$
More informationInference with Simple Regression
1 Introduction Inference with Simple Regression Alan B. Gelder 06E:071, The University of Iowa 1 Moving to infinite means: In this course we have seen one-mean problems, twomean problems, and problems
More information