Online Supplementary Material. MetaLP: A Nonparametric Distributed Learning Framework for Small and Big Data


PI: Subhadeep Mukhopadhyay
Department of Statistics, Temple University
Philadelphia, Pennsylvania, 19122, U.S.A.
deep@temple.edu

ABSTRACT

In this note we demonstrate the viability and utility of the proposed MetaLP, a nonparametric distributed statistical learning framework, for small and big data science. We perform a proof-of-concept implementation of MetaLP-based variable selection for two data sets: (1) Titanic (an example of small data) and (2) Expedia personalized hotel search data (an example of a large data set).

1 MetaLP: Nonparametric Parallelizable Algorithm

Figure 1 provides the flowchart of our proposed MetaLP-based data analytics scheme. Here we apply this general framework to design a nonparametric distributed variable selection algorithm. Our approach can detect higher-order interactions in massive data by taking advantage of distributed data processing technologies. The four main components of our algorithm are described briefly below.

(1) Partition. Assign observations to different subpopulations in a reasonable manner. Random assignment is one possible partitioning scheme, but there are many other possibilities. The number of subpopulations $k$ can be specified to manage computational efficiency (this step can be omitted if the dataset is already partitioned by some natural grouping variable).

(2) LP Map Function. We apply LP statistical modeling at each data block. We construct the LP statistic for variable selection of a mixed random variable $X$ (either continuous or discrete) based on our specially designed score functions as follows:

\[
\mathrm{LP}[j; X, Y] = \mathrm{Cor}[T_j(X; X), Y] = \mathrm{E}[T_j(X; X)\, T_1(Y; Y)]. \tag{1}
\]

Using empirical process theory we can show that the sample LP-Fourier measures $\sqrt{n}\,\widehat{\mathrm{LP}}[j; X, Y]$ asymptotically converge to i.i.d. standard normal distributions (Mukhopadhyay and Parzen, 2014).
We will also show how the LP statistic unifies and systematically reproduces all the traditional and modern statistical variable selection measures for different data types of $Y$ and $X$ using one single computing formula. The linear LP-Fourier statistic $\mathrm{LP}[1; X, Y]$ measures the location difference between $f(x; X \mid Y = 1)$ and the unconditional distribution $f(x; X)$. The nonlinear LP score statistics $\mathrm{LP}[j; X, Y]$, $j > 1$, detect higher-order distributional differences, such as differences in variability, skewness, or tail behavior, to identify important variables.

Figure 1: The workflow of the MetaLP-based data analytics scheme.

The LP map function outputs the corresponding confidence distribution (CD) for each subpopulation, $\mathrm{LP}_l[j; X, Y]$, $l = 1, \ldots, k$. We prefer to estimate the CD of the LP statistics because all the traditional forms of statistical estimation and inference (e.g. point estimation, confidence intervals, hypothesis testing) can be produced in a unified way from the CD. We will derive (using empirical process and stochastic internal representation) the following form of the LP confidence distribution:

\[
H_{\Phi}(\mathrm{LP}[j; X, Y]) = \Phi\!\left( \sqrt{n}\, \bigl( \mathrm{LP}[j; X, Y] - \widehat{\mathrm{LP}}[j; X, Y] \bigr) \right). \tag{2}
\]

(3) τ-regularization. Run the $I^2$ heterogeneity diagnostic and compute the τ-corrected version of the LP statistics. We have omitted full details due to space constraints.

(4) Meta Reducer Step. Apply the meta-analysis formula (after incorporating the heterogeneity correction as described in Theorem 1, Eqs. (3)-(4), below) to estimate the combined confidence distribution parameters of the LP statistics for each predictor variable. The output from this step is a collection of estimators and standard errors for the combined τ-corrected LP statistic parameters for all predictor variables.

Theorem 1. Setting $F_0^{-1}(t) = \Phi^{-1}(t)$ and $\alpha_l = 1/\sqrt{\tau^2 + (1/n_l)}$, where $\Phi$ is the cumulative distribution function of the standard normal distribution and $n_l$ is the size of subpopulation $l = 1, \ldots, k$, the following combined CD for $\mathrm{LP}[j; X, Y]$ follows:

\[
H^{(c)}(\mathrm{LP}[j; X, Y]) = \Phi\!\left( \left[ \sum_{l=1}^{k} \frac{1}{\tau^2 + (1/n_l)} \right]^{1/2} \bigl( \mathrm{LP}[j; X, Y] - \widehat{\mathrm{LP}}^{(c)}[j; X, Y] \bigr) \right) \tag{3}
\]

with

\[
\widehat{\mathrm{LP}}^{(c)}[j; X, Y] = \frac{\sum_{l=1}^{k} \bigl(\tau^2 + (1/n_l)\bigr)^{-1}\, \widehat{\mathrm{LP}}_l[j; X, Y]}{\sum_{l=1}^{k} \bigl(\tau^2 + (1/n_l)\bigr)^{-1}}, \tag{4}
\]

where $\widehat{\mathrm{LP}}^{(c)}[j; X, Y]$ and $\bigl( \sum_{l=1}^{k} 1/(\tau^2 + (1/n_l)) \bigr)^{-1}$ are, respectively, the mean and variance of the combined CD for $\mathrm{LP}[j; X, Y]$.

Figure 2: (a) Left panel: the shape of the first four LP orthonormal score functions for the variable # Siblings aboard, a discrete random variable taking values 0, ..., 8; (b) Right panel: the shape of the LP basis for the continuous variable Passenger fare. As the number of atoms $A(X)$ (# distinct values) of a random variable increases (moving from discrete to continuous data type), the shape of our custom-designed score polynomials automatically approaches (by construction) a universal shape, which is similar to that of the Legendre polynomials.

2 The Titanic Dataset

The Titanic data set contains information on 891 of its passengers, including which passengers survived. The goal is to identify which factors (e.g. age, gender, class, etc.) significantly influence passenger survival. Complete descriptions of all 8 variables can be found in Table 1. We use this small data set to demonstrate how the MetaLP algorithm (presented in the previous section) works on real data sets in a distributed manner, using a single general algorithm irrespective of the data type of each feature. One fundamental ingredient of our approach is the LP transformation. The shape of the piecewise-constant orthonormal LP polynomials for the variable # Siblings aboard is shown in Fig 2.

Variable Name   Type          Description                 Value
Survival        Binary        Survival                    0 = No; 1 = Yes
Pclass          Categorical   Passenger Class             1 = 1st; 2 = 2nd; 3 = 3rd
Sex             Binary        Sex                         Male; Female
Age             Numeric       Age                         0-80
Sibsp           Numeric       Number of Siblings Aboard   0-8
Parch           Numeric       Number of Children Aboard   0-6
Fare            Numeric       Passenger Fare
Embarked        Categorical   Port of Embarkation         C = Cherbourg; Q = Queenstown; S = Southampton

Table 1: Data dictionary for the Titanic dataset

The small size of the Titanic data set allows us to compare inference based on the distributed method with inference based on traditional methods that use the entire data set. Figure 3 shows the 95% confidence intervals generated by the MetaLP algorithm for 3 repetitions of random groupings or partitions ($k = 5$), along with the confidence intervals generated using the whole Titanic dataset. Remarkably, the confidence intervals estimated using the MetaLP algorithm are extremely similar to the intervals estimated using the entire dataset across all variables. The effect of heterogeneity is reflected in the width of the confidence intervals due to increased between-subpopulation variance. Moreover, the point estimates for the LP statistics are almost identical. Thus our proposed distributed computational scheme successfully reproduces the results for the small data set, which means we can obtain similar statistical inference while taking advantage of the computational efficiency of parallel distributed processing.
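The LP transformation and the LP statistic of Eq. (1) can be sketched in a few lines. The sketch below is illustrative only and is not the authors' code: it assumes that the score functions are obtained by Gram-Schmidt orthonormalization (done here through a QR decomposition) of powers of the standardized mid-distribution transform $F^{\mathrm{mid}}(x) = F(x) - 0.5\,p(x)$, which is our reading of the LP construction and is not spelled out in this note. The function names `mid_distribution`, `lp_scores`, and `lp_statistic` are hypothetical, and the demo data are synthetic.

```python
import numpy as np


def mid_distribution(x):
    """Mid-distribution transform F_mid(x) = F(x) - 0.5 * p(x); ties are handled."""
    x = np.asarray(x).ravel()
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    p = counts / x.size              # probability mass of each distinct value
    f_mid = np.cumsum(p) - 0.5 * p   # F(x) - p(x)/2 at each distinct value
    return f_mid[inverse]


def lp_scores(x, m=4):
    """Return an (n, m) array whose columns are T_1, ..., T_m evaluated at x.

    ASSUMPTION: scores are built by orthonormalizing powers of the
    standardized mid-distribution transform (Gram-Schmidt via QR).
    """
    u = mid_distribution(x)
    n_distinct = np.unique(u).size
    if n_distinct < 2:
        raise ValueError("x must take at least two distinct values")
    m = min(m, n_distinct - 1)       # at most (#atoms - 1) score functions exist
    t1 = (u - u.mean()) / u.std()
    powers = np.column_stack([t1 ** j for j in range(m + 1)])  # 1, t1, ..., t1^m
    q, r = np.linalg.qr(powers)
    # Drop the constant column, rescale to mean square 1, fix the arbitrary QR sign.
    return q[:, 1:] * np.sqrt(u.size) * np.sign(np.diag(r)[1:])


def lp_statistic(x, y, j=1):
    """Sample version of LP[j; X, Y] = E[T_j(X; X) T_1(Y; Y)] from Eq. (1)."""
    tx = lp_scores(x, j)
    if tx.shape[1] < j:
        raise ValueError("x has too few distinct values for order j")
    ty = lp_scores(y, 1)[:, 0]
    return float(np.mean(tx[:, j - 1] * ty))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=500)                          # synthetic predictor
    y = (x + rng.normal(size=500) > 0).astype(int)    # synthetic binary response
    print("LP[1]:", lp_statistic(x, y, 1))
    print("LP[2]:", lp_statistic(x, y, 2))
```

Because the number of score functions is capped at the number of distinct values minus one, the same function handles a binary variable such as Sex and a continuous variable such as Fare without any type-specific branching.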
Figure 3: [color online] 95% confidence intervals of the LP statistic for each variable (Age, Embarked, Fare, Parch, Pclass, Sex, SibSp), based on three MetaLP repetitions with random partitions and on the aggregated Titanic dataset.

3 Expedia Personalized Hotel Search Dataset

3.1 Data Description

The dataset contains various user characteristics (e.g. location, search history, etc.), search criteria (e.g. length of stay, number of children, room count, etc.), and hotel information (e.g. star rating, price, location, promotions, review scores, etc.) that may influence users' Expedia hotel booking behavior. In total, the training data contains 9,917,530 observations across 46 variables. The target variable (response variable), booking_bool, is a binary variable that indicates whether the hotel was booked or not. The remaining 45 variables are the explanatory variables mentioned previously. Some specific examples: prop_location_score indicates the desirability of a hotel's location; prop_review_score is the mean customer review score for the hotel on a scale of 5; and price_usd is the displayed price of the hotel.

3.2 Partition

First, we randomly assign search lists, which are collections of observations from search result impressions in the dataset, to 200 different subpopulations for further processing. Random assignment of search lists rather than individual observations ensures that sets of hotels viewed in the same search session are all contained in the same subpopulation. The number of subpopulations chosen here can be adapted to meet the processing and time requirements.

On the other hand, there may be situations where the dataset already contains some kind of natural grouping that can be used directly as subpopulations. For example, consider the scenario where the available Expedia data are collected from different countries and identified by visitor_location_country_id, an indicator of the visitor's location (country). In this setting, the distributed statistical inference framework can directly use these predetermined subpopulations for processing rather than randomly assigning observations to subpopulations. However, practitioners must be careful to consider heterogeneity among subpopulations in these settings.

Figure 4: [color online] $I^2$ diagnostic by variable index. (a) Randomly partitioned subpopulations; (b) predetermined grouping by visitor country ID: comparison of $I^2$ diagnostics before τ correction (red dots) and after τ correction (blue dots).

3.3 LP Map Function

In this step, we estimate the $\mathrm{LP}_l[j; X_i, Y]$ statistics (the $j$th LP statistic for the $i$th variable in the $l$th subpopulation) and the corresponding confidence distribution for each of the 45 variables in each of the 200 random subpopulations (or the 233 predefined subpopulations defined by the grouping variable visitor_location_country_id). Here $i = 1, \ldots, 45$ indexes the variables and $l = 1, \ldots, 200$ indexes the subpopulations. The estimator values $\widehat{\mathrm{LP}}_l[j; X_i, Y]$ and $n_{l;i}$ (used to find the standard deviation) are stored in a matrix for use in the next step.

3.4 Heterogeneity: Diagnostic and Regularization

We then check for heterogeneity issues that may arise from partitioning this large Expedia dataset. We use the $I^2$ diagnostic to measure the severity of heterogeneity across subpopulations for each predictor variable. For the random partitioning scheme, our subpopulations are fairly homogeneous (with respect to all variables), as all $I^2$ statistics are below 40% (see Figure 4(a)). In contrast, the predefined partitions based on visitor location country divide the data into subpopulations that are heterogeneous for some variables, as shown in Figure 4(b) (some variables have $I^2$ values outside the permissible range of 0 to 40%). In this scenario, we need to include $\tau^2$ regularization to handle the heterogeneity issue.
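Since this note omits the details of the $I^2$ diagnostic and the $\tau^2$ correction, the following is only a sketch under stated assumptions. It uses the standard meta-analysis definitions, namely Cochran's $Q$, the Higgins-Thompson $I^2 = \max\{0, (Q - (k-1))/Q\}$, and the DerSimonian-Laird moment estimator of $\tau^2$, with the within-subpopulation variance taken as $1/n_l$ as suggested by Eq. (2). Whether MetaLP uses exactly these estimators is an assumption, as is the way the "after correction" value is recomputed here (by replacing $1/n_l$ with $\tau^2 + 1/n_l$). All function names are hypothetical and the numbers are made up.

```python
import numpy as np


def cochran_q(lp, variances):
    """Cochran's Q: weighted squared deviations from the pooled estimate."""
    lp = np.asarray(lp, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)
    pooled = np.sum(w * lp) / np.sum(w)
    return float(np.sum(w * (lp - pooled) ** 2))


def i_squared(lp, variances):
    """Higgins-Thompson I^2 in [0, 1); larger values mean more heterogeneity."""
    q = cochran_q(lp, variances)
    df = len(lp) - 1
    return max(0.0, (q - df) / q) if q > 0 else 0.0


def tau2_dersimonian_laird(lp, variances):
    """DerSimonian-Laird estimate of the between-subpopulation variance tau^2."""
    w = 1.0 / np.asarray(variances, dtype=float)
    q = cochran_q(lp, variances)
    df = len(lp) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    return max(0.0, float((q - df) / c))


if __name__ == "__main__":
    lp_blocks = [0.1, 0.3, -0.2, 0.4]          # made-up per-subpopulation LP estimates
    n_blocks = np.array([400, 400, 400, 400])  # made-up subpopulation sizes
    within = 1.0 / n_blocks                    # variance 1/n_l, as in Eq. (2)

    before = i_squared(lp_blocks, within)
    tau2 = tau2_dersimonian_laird(lp_blocks, within)
    after = i_squared(lp_blocks, tau2 + within)  # ASSUMED "after correction" recipe
    print(f"I^2 before: {before:.3f}, tau^2: {tau2:.4f}, I^2 after: {after:.3f}")
```

With these definitions, a variable would be flagged for correction when its $I^2$ exceeds the 40% threshold used in this section.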
The $I^2$ diagnostic after $\tau^2$ regularization is shown in Figure 4(b) (blue dots), which suggests that all $I^2$ values after regularization fall within the acceptable range of 0 to 40%. The results in this section suggest that our framework is appropriate under both settings, random partitioning and predetermined partitioning, since we can always perform $\tau^2$ regularization when subpopulations appear to be heterogeneous.

Figure 5: Expedia data: 95% confidence intervals for each variable's LP statistic.

3.5 Meta Reducer Step

After applying the $\tau^2$ correction for heterogeneity, we combine the confidence distributions of the LP statistics from the different subpopulations to estimate the combined confidence distribution of the LP statistic for each variable, as outlined in Theorem 1. The results can be found in Figure 5.

Variables with indexes 43, 44, and 45 have highly significant positive relationships with booking_bool, the binary response variable. Those variables are prop_location_score2, the second score outlining the desirability of a hotel's location; promotion_flag, which equals +1 if the hotel had a sale price promotion specifically displayed; and srch_query_affinity_score, the log of the probability that a hotel will be clicked on in Internet searches. Three variables have highly negative impacts on hotel booking: price_usd, the displayed price of the hotel for the given search; srch_length_of_stay, the number of nights of stay that was searched; and srch_booking_window, the number of days between the search date and the start of the hotel stay.

Moreover, there are several variables whose LP statistics have confidence intervals that include zero, which means those variables have an insignificant influence on hotel booking. The top five most influential variables in terms of the absolute value of the LP statistic estimates are prop_location_score2, promotion_flag, price_usd, srch_length_of_stay, and prop_starrating.
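As a rough illustration of the meta reducer, the sketch below evaluates the combined mean and variance from Theorem 1 (Eqs. (3)-(4)) for one predictor and turns them into a confidence interval. It assumes the per-subpopulation LP estimates, sizes $n_l$, and a value of $\tau^2$ are already available; the function names `combine_lp` and `combined_cd` and the example numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

import numpy as np


def combine_lp(lp_blocks, n_blocks, tau2=0.0, level=0.95):
    """Combine per-subpopulation LP estimates as in Theorem 1.

    Returns (estimate, variance, (lower, upper)), where the estimate is Eq. (4)
    and the variance is 1 / sum_l 1 / (tau^2 + 1/n_l).
    """
    lp = np.asarray(lp_blocks, dtype=float)
    n = np.asarray(n_blocks, dtype=float)
    w = 1.0 / (tau2 + 1.0 / n)                  # weights (tau^2 + 1/n_l)^(-1)
    estimate = float(np.sum(w * lp) / np.sum(w))
    variance = float(1.0 / np.sum(w))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * sqrt(variance)
    return estimate, variance, (estimate - half_width, estimate + half_width)


def combined_cd(theta, estimate, variance):
    """Combined confidence distribution H^(c)(theta) from Eq. (3)."""
    return NormalDist().cdf((theta - estimate) / sqrt(variance))


if __name__ == "__main__":
    # Made-up inputs: two subpopulations, no heterogeneity correction.
    est, var, (lo, hi) = combine_lp([0.2, 0.4], [100, 300], tau2=0.0)
    print(f"combined LP = {est:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
    print("CI excludes zero:", lo > 0 or hi < 0)
```

With $\tau^2 = 0$ the weights reduce to the subpopulation sizes, so the combined estimate is the size-weighted average of the block estimates; a positive $\tau^2$ flattens the weights and widens the interval, which matches the wider intervals attributed to heterogeneity in the Titanic comparison.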
