arxiv: v1 [cs.cv] 22 Aug 2017


On Image Classification: Correlation v.s. Causality

ZHEYAN SHEN, Tsinghua University
PENG CUI, Tsinghua University
KUN KUANG, Tsinghua University
BO LI, Tsinghua University

Image classification is one of the fundamental problems in computer vision. Owing to the availability of large image datasets like ImageNet and YFCC100M, a plethora of research has been conducted on high-precision image classification and many remarkable achievements have been made. The success of most existing methods hinges on a basic hypothesis that the testing image set has the same distribution as the training image set (i.e. the i.i.d. hypothesis). However, in many real applications, we cannot guarantee the validity of the i.i.d. hypothesis since the testing image set is unseen. It is thus desirable to learn an image classifier that can perform well even in non-i.i.d. situations.

In this paper, we propose a novel Causally Regularized Logistic Regression (CRLR) algorithm to address the non-i.i.d. problem, without knowing testing data information, by searching for causal features. The causal features refer to characteristics truly determining whether a specific object belongs to a category or not. Identifying causal features allows us to construct classifiers adaptive to distributional changes in non-i.i.d. circumstances even when the testing set is unseen. Algorithmically, we propose a causal regularizer for causal feature identification and jointly optimize it with a logistic loss term. Assisted by the causal regularizer, we can estimate the causal contribution (causal effect) of each focal image feature (viewed as a treatment variable) by sample reweighting, which ensures that the distributions of all remaining image features are close between images with different focal feature levels. The resultant classifier is based on the estimated causal contributions of the features, rather than traditional correlation-based contributions.

To validate the effectiveness of our CRLR algorithm, we manually construct a new image dataset from YFCC100M (the Yahoo Flickr Creative Commons WebScope dataset), simulating various non-i.i.d. situations in the real world, and conduct extensive experiments for image classification. Experimental results clearly demonstrate that our CRLR algorithm outperforms the state-of-the-art methods. We further visualize the top causal features selected by our algorithm on our image dataset.

CCS Concepts: Computing methodologies → Object recognition; Regularization;

Additional Key Words and Phrases: Image Classification, Causal Inference, Non-i.i.d. Situations, Causally Regularized Logistic Regression

ACM Reference format: Zheyan Shen, Peng Cui, Kun Kuang, and Bo Li. On Image Classification: Correlation v.s. Causality. 1, 1, Article 1 (August 2017), 16 pages.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. Association for Computing Machinery. XXXX-XXXX/2017/8-ART1


[Figure 1. Panels: training data with category-level causal and correlation feature distributions (training phase, feature visualization); sample-level correlation and causal feature distributions for the i.i.d. case and the non-i.i.d. case (testing phase).]

Fig. 1. Illustration of the difference between correlation based and causality based classification methods in addressing non-i.i.d. cases.

1 INTRODUCTION

Image classification has advanced rapidly in recent years. The exponentially increased image data scale and the significantly improved model capacity ensure that the correlation between image pixels and image categories can be maximally leveraged. The success of these correlation-based image classification methods depends on a basic hypothesis that the testing image set has the same distribution as the training image set (i.e. the i.i.d. hypothesis). In this case, the correlations between image pixels and categories in the training set provide sufficient predictive power for unseen testing images.

However, a long-standing but largely ignored problem is: what if the i.i.d. hypothesis does not hold anymore? In real applications, we cannot guarantee the validity of the i.i.d. hypothesis since the testing images are unseen. The typical non-i.i.d. situations include context bias, label composition bias, or both. Context bias refers to situations where the training data and testing data have different distributions over visual contexts (e.g. for a dog category, an environment like grass is a visual context). Label composition bias is induced by different percentages of positive and negative labels in the training and testing sets for some categories. These non-i.i.d. situations pose great challenges to correlation-based methods, because the correlation patterns in the training set may differ from those in the testing set, and hence cannot be exploited for prediction on the testing set.

An example of a non-i.i.d. situation induced by context bias is illustrated in Figure 1. The classifier for dogs is trained on images that mostly show dogs on grass, and is tested on one image with a dog in a grass context (i.e. the i.i.d. case) and another image with a dog in a snow context (i.e. the non-i.i.d. case). The correlation-based method can succeed in the i.i.d. example, but fails in the non-i.i.d. example. The failure arises mainly because the grass features are assigned high weights in the classifier, since they are highly correlated with the label in the training set, but they do not appear in the testing image.

Recently, some works have addressed the non-i.i.d. problem through covariate shift correction, with the objective of minimizing the predictive loss under the testing data distribution by reweighting the training samples [Huang et al. 2007; Liu and Ziebart 2014; Shimodaira 2000; Sugiyama et al. 2007; Zadrozny 2004]. But these works need prior knowledge of the testing data distribution, which is unavailable in many non-batch applications. In this work, we aim to simulate various non-i.i.d. settings and address them without knowing the testing data distribution, which is distinct from the above work and more reasonable in real applications.

We address the non-i.i.d. problem by discriminating the causal features, which truly determine whether a specific object belongs to a category, from the correlation features. As shown in Figure 1, if the classifier is based on the causal features of the dog category, such as the fur, nose and ear features, it would be insensitive to context bias and thus avoid errors in non-i.i.d. situations.

Causal inference is a powerful statistical tool for discovering the causal relationship between two variables. A gold standard for causal analysis is to conduct controlled experiments like A/B testing. But controlled experiments are infeasible in scenarios like image classification, where micro-level features cannot be manipulated. Causal analysis methods based on observational data include propensity score matching or reweighting [Austin 2011; Bang and Robins 2005; Kuang et al. 2017], Markov blankets [Aliferis et al. 2003; Koller and Sahami 1996; Pellet and Elisseeff 2008; Yu and Liu 2004] and confounder balancing [Athey et al. 2016; Hainmueller 2011; Zubizarreta 2015]. The key problem that these methods try to solve is to remove the confounding bias induced by the different confounder distributions of the treated and control units.

In pursuit of marrying causal analysis with image classification, we successively regard each image feature (e.g. a visual word) as a treatment variable, and all other image features as confounders. A straightforward approach is to first select causal features and then learn a classifier based on the causal features. However, this approach is statistically sensitive to the threshold for causal feature selection, and the step-by-step method is difficult to optimize in practice. So how to develop a unified approach for joint causal analysis and image classification remains a challenging problem. Moreover, existing causal analysis methods are proposed for well-designed settings where typically a small number of treatment variables are considered. In image classification, by contrast, we have little prior knowledge of causal relationships, and thus have to regard all features as treatment variables. This leads to a large-scale causal effect learning problem involving a large number of potential treatment variables, to which existing methods are not directly applicable. Hence, it is highly non-trivial to design a scalable causal learning method adapted to the image classification setting.

In this paper, we propose a novel Causally Regularized Logistic Regression (CRLR) model to address the non-i.i.d. problem in image classification. The model consists of a logistic loss term and a causal regularizer. The latter makes the model prone to exploit causal features for prediction. Specifically, the causal regularizer aims to directly balance confounder distributions for each treatment feature by reweighting the samples. In order to reduce model complexity, we propose a global sample reweighting method which learns a common set of sample weights to maximally balance confounders for all treatment features. The sample weights are also taken into account when calculating the logistic loss, and, in this way, the logistic loss and causal regularizer are jointly optimized.
The notion behind the method is to resample the images so that the training images of a given category have a uniform distribution across different contexts. For example, in a dog category, the method will enforce the total weight of images in different contexts like grass, beach and car to be almost the same. In this case, minimizing the logistic loss will lead to higher weights on the dog features rather than the context features.

In order to evaluate the performance of a classification method in non-i.i.d. situations, we construct a new image dataset from YFCC100M. This dataset includes 10 categories, and the images in a category are divided into 5 contexts. For example, in the dog category, the 5 contexts are grass, beach, car, sea, and snow. With this dataset, we can easily simulate various non-i.i.d. situations by tuning the training and testing distributions.

The technical contributions of this paper are four-fold:

• We investigate a new problem of image classification in non-i.i.d. situations induced by unseen testing data, which is long-standing but largely ignored in the literature.
• We propose a novel Causally Regularized Logistic Regression model to address this problem, where the causal analysis and image classification are jointly optimized in an effective and efficient way.
• We conduct extensive experiments, and the results demonstrate the inability of existing methods to deal with non-i.i.d. situations, and the superiority of our method in such scenarios. The explainability of our method is also a notable merit.
• We have shared the new dataset we constructed for evaluating image classification in non-i.i.d. situations to promote future research in this direction.

The remaining sections are organized as follows. Section 2 reviews the related work. Section 3 describes the problem formulation and our CRLR algorithm. Section 4 optimizes the CRLR algorithm and analyzes its complexity. Section 5 gives the experimental results. Finally, Section 6 concludes the paper.

2 RELATED WORK

The previous related works can be categorized into non-i.i.d. learning methods and causal inference methods, which we briefly review and discuss in this section.

The non-i.i.d. learning problem, particularly the covariate shift problem, has been investigated in the machine learning field. The main idea of these methods is to shift the training set distribution to match the testing set distribution. Zadrozny [Zadrozny 2004] used a rejection sampling method to correct the training set distribution into that of the testing set. Huang et al. [Huang et al. 2007] presented a nonparametric method to match distributions between the training and testing sets in the feature space. Sugiyama et al.
[Sugiyama et al. 2007] designed an importance weighted cross validation for model selection under non-i.i.d. situations. Liu and Ziebart [Liu and Ziebart 2014] proposed a robust bias-aware probabilistic classifier by reweighting the training distribution via a minimax estimation formulation. Most of these methods require prior knowledge of the testing set distribution, which is infeasible in many real applications including non-batch image classification. In this work, we address the non-i.i.d. problem using only the training set, which is more reasonable in real scenarios.

Causal inference is a powerful statistical modeling tool for explanatory analysis. The major question in estimating causal effects is how to balance the distributions of confounders across different treatment levels. Rosenbaum and Rubin [Rosenbaum and Rubin 1983] proposed to achieve the balance by propensity score matching or reweighting. Methods based on propensity scores have been widely used in various fields, including economics [Stuart 2010], epidemiology [Funk et al. 2011], health care [Dos Reis and Culotta 2015], social science [Lechner 1999] and advertising [Sun et al. 2015]. But these methods can only handle one or a few treatment variables and cannot be directly applied to image classification, in which a huge number of features are viewed as potential treatment variables.

There is a growing recent literature proposing to directly optimize sample weights to balance confounder distributions. Hainmueller [Hainmueller 2011] introduced entropy balancing to directly adjust sample weights to match specified sample moments. Athey et al. [Athey et al. 2016] proposed approximate residual balancing for learning sample weights via a lasso residual regression adjustment. Zubizarreta [Zubizarreta 2015] learned stable balancing weights by minimizing their variance and adjusting them for confounder balancing. These methods provide an effective way to estimate causal effects without prior knowledge of the structure, but they reweight samples targeting a single treatment variable. We will adapt the reweighting balance technique to the large-scale causal effect exploration settings we target.

Recently, Lopez-Paz et al. [Lopez-Paz et al. 2017] combined a neural network and a causal framework to identify the causal signals in visual images. Their work has the potential to be extended to causality-based classification. But it learns the causal features one by one, and errors may accumulate fast given the huge number of features in images.

3 CAUSAL CLASSIFICATION

In this section, we first provide the formulation of the causality-based image classification problem (called causal classification later for brevity). After that, we present some preliminaries on confounder balancing to make the paper self-contained, followed by a detailed introduction to our proposed Causally Regularized Logistic Regression (CRLR) method.

3.1 Problem Formulation

As stated in the introduction, the problem we focus on in this paper is the non-i.i.d. problem in image classification, which can be formulated as follows:

Problem 1 (Non-i.i.d. problem in image classification). Given the training image set $D_{train} = (X_{train}, Y_{train})$, where $X_{train} \in \mathbb{R}^{n \times p}$ represents the image features and $Y_{train} \in \mathbb{R}^{n \times 1}$ represents the image labels, the task is to learn an image classifier $f_\theta(\cdot)$ with parameter $\theta$ to precisely predict the labels of the testing image set $D_{test} = (X_{test}, Y_{test})$, where $D_{test}$ is unseen and its distribution $\Psi(D_{test}) \neq \Psi(D_{train})$.

The non-i.i.d. problem refers to the difference in distribution between the training and testing image sets, namely $\Psi(D_{test}) \neq \Psi(D_{train})$, including a different distribution of image features, $\Psi(X_{test}) \neq \Psi(X_{train})$, a different distribution of image labels, $\Psi(Y_{test}) \neq \Psi(Y_{train})$, or both.

To address the non-i.i.d. problem, we adapt causal inference to analyze the causal contribution of each image feature to the image label and identify the causal features that truly determine whether an image belongs to a category or not. By adapting causal inference to image classification, we can regard each image feature $X_j$ as a treated variable (i.e. treatment), all the remaining features $X_{-j} = X \setminus X_j$ as confounding variables (i.e. confounders), and the image category $Y$ as the outcome variable. Given a feature and a label, when the feature occurs (or does not occur) in an image, the image becomes a treated (or control) image. To safely estimate the causal contribution of a given image feature $X_j$ to the image label $Y$, one has to remove the confounding bias induced by the different distributions of the confounders $X_{-j}$ between the treated and control image sets. After removing the confounding bias, the difference in the image label $Y$ between the treated and control image sets can be seen as the causal contribution of feature $X_j$ to the image category $Y$.

With causal analysis on image classification, we can identify the causal contribution $\beta \in \mathbb{R}^{p \times 1}$ of all image features, which is robust and insensitive to the distributional changes of the unseen testing image set. Namely, for the causal contributions of all image features we have $\beta_{train} \approx \beta_{test}$, even though the testing set is unseen and $\Psi(D_{test}) \neq \Psi(D_{train})$. We address the non-i.i.d. problem in image classification through the following causal classification problem.

Problem 2 (Causal Classification Problem). Given the training image set $D = (X, Y)$, where $X \in \mathbb{R}^{n \times p}$ represents the image features and $Y \in \mathbb{R}^{n \times 1}$ represents the image category, the task is to identify the causal contribution $\beta$ of all image features and jointly learn an image classifier $f_\beta(\cdot)$ based on $\beta$ for image classification.
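
As a minimal sketch of the treated/control view described above (not the authors' code), the snippet below splits a toy binary feature matrix by one focal feature, computes the unadjusted difference in mean label between treated and control images, and measures how unbalanced the remaining features are between the two groups. The function names and the toy data are assumptions made purely for illustration.

```python
import numpy as np

def naive_contribution(X, y, j):
    """Unadjusted difference in mean label between treated and control images."""
    treated = X[:, j] == 1          # images in which focal feature j occurs
    return float(y[treated].mean() - y[~treated].mean())

def confounder_imbalance(X, j):
    """L2 norm of the gap between treated and control means of X_{-j}."""
    treated = X[:, j] == 1
    X_rest = np.delete(X, j, axis=1)  # all remaining features (confounders)
    gap = X_rest[treated].mean(axis=0) - X_rest[~treated].mean(axis=0)
    return float(np.linalg.norm(gap))

# Hypothetical toy data: 6 images, 3 binary visual-word features.
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [0, 0, 1],
              [0, 0, 0],
              [0, 1, 0]])
y = np.array([1, 1, 1, 0, 0, 0])

print(naive_contribution(X, y, 0))
print(confounder_imbalance(X, 0))
```

A nonzero imbalance indicates that the unadjusted difference mixes the effect of the focal feature with that of its confounders, which is the bias that sample reweighting is meant to remove.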

Table 1. Symbols and definitions.

Symbol                                  Definition
$n$                                     Number of images
$p$                                     Dimension of image features
$X \in \mathbb{R}^{n \times p}$         Image features
$I \in \mathbb{R}^{n \times p}$         Indicator for treated and control
$Y \in \mathbb{R}^{n \times 1}$         Image category (outcome)
$W \in \mathbb{R}^{n \times 1}$         Sample weight
$\beta \in \mathbb{R}^{p \times 1}$     Causal contribution of feature

The key challenge in the causal classification problem is how to jointly optimize causal contribution identification and image classification. In this paper, we propose a synergistic learning algorithm composed of a causal regularizer and a logistic loss term. The causal regularizer balances the confounder distributions associated with each treatment feature by sample reweighting, and the causal contributions of all image features are estimated from logistic regression on the reweighted sample. The output logistic regression model gives a causal classification rule. Our proposed synergistic optimization program is a one-step method and can be tuned easily.

3.2 Confounder Balancing

In observational studies, the confounder distributions need to be balanced to correct for bias from non-random treatment assignments. Confounder balancing approaches exploit moments to characterize distributions, as moments can uniquely determine a distribution. Instead of balancing confounder distributions, they directly balance confounder moments by adjusting the weights of samples. The sample weights $W$ are learned by:

$$W = \arg\min_{W} \left\| \bar{X}_t - \sum_{j: T_j = 0} W_j X_j \right\|_2^2. \tag{1}$$

Given a treatment feature $T$, $\bar{X}_t$ and $\sum_{j: T_j = 0} W_j X_j$ represent the mean value of the confounders on samples with and without treatment, respectively. Note that only the first-order moment is considered in Eq. 1; higher-order moments can be easily incorporated by including more features. We will adapt the sample reweighting technique to simultaneously balance the confounder distributions associated with all treatment features. We will elaborate on this in Section 3.3.

3.3 Causally Regularized Logistic Regression

Inspired by the confounder balancing method, we propose a causal regularizer to reweight training images by successively setting each image feature as the treatment variable and all remaining features as confounders. Our proposed causal regularizer is:

$$\sum_{j=1}^{p} \left\| \frac{X_{-j}^T (W \odot I_j)}{W^T I_j} - \frac{X_{-j}^T (W \odot (1 - I_j))}{W^T (1 - I_j)} \right\|_2^2, \tag{2}$$

where $W$ is the vector of sample weights. The term $\left\| \frac{X_{-j}^T (W \odot I_j)}{W^T I_j} - \frac{X_{-j}^T (W \odot (1 - I_j))}{W^T (1 - I_j)} \right\|_2^2$ represents the loss of confounder balancing when setting image feature $j$ as the treatment variable, and $X_{-j}$ denotes all the remaining features (i.e. confounders), obtained from $X$ by replacing its $j$-th column with 0. $I_j$ is the $j$-th column of $I$, and $I_{ij}$ refers to the treatment status of unit $i$ when setting feature $j$ as the treatment variable. The notion behind this regularizer is that when we estimate the causal effect of a treatment variable, we try to keep the distributions of the other variables (confounders) as consistent as possible across the different levels of the treatment variable.

By incorporating the causal regularizer in Eq. 2, we give our Causally Regularized Logistic Regression (CRLR) algorithm, which jointly optimizes the sample weight $W$ and the causal contribution $\beta$ for causal classification based on logistic regression:

$$
\begin{aligned}
\min \quad & \sum_{i=1}^{n} W_i \cdot \log\left(1 + \exp\left((1 - 2Y_i) \cdot (x_i \beta)\right)\right), \\
\text{s.t.} \quad & \sum_{j=1}^{p} \left\| \frac{X_{-j}^T (W \odot I_j)}{W^T I_j} - \frac{X_{-j}^T (W \odot (1 - I_j))}{W^T (1 - I_j)} \right\|_2^2 \leq \lambda_1, \\
& W \succeq 0, \quad \|W\|_2^2 \leq \lambda_2, \quad \|\beta\|_2^2 \leq \lambda_3, \quad \|\beta\|_1 \leq \lambda_4, \\
& \left( \sum_{k=1}^{n} W_k - 1 \right)^2 \leq \lambda_5,
\end{aligned}
\tag{3}
$$

where $\sum_{i=1}^{n} W_i \cdot \log(1 + \exp((1 - 2Y_i) \cdot (x_i \beta)))$ represents the loss of logistic regression after sample reweighting, and $x_i$, the $i$-th row of $X$, represents the features of unit $i$. The elastic net constraints $\|\beta\|_2^2 \leq \lambda_3$ and $\|\beta\|_1 \leq \lambda_4$ help avoid overfitting. The term $W \succeq 0$ constrains each of the sample weights to be non-negative. With the norm constraint $\|W\|_2^2 \leq \lambda_2$, we can reduce the variance of the sample weights to achieve stability. The constraint $\left( \sum_{k=1}^{n} W_k - 1 \right)^2 \leq \lambda_5$ prevents all the sample weights from being zero.

In the traditional logistic regression model, the coefficients capture the correlation between the features and the category label. However, high correlation does not imply causation, due to confounding bias. The sample weights produced by the causal regularizer are capable of correcting this bias. The estimated coefficients from CRLR can thus be viewed as the causal contributions of the features, which can also be ranked after feature standardization.

4 OPTIMIZATION

In this section, we give the optimization details of our CRLR algorithm and analyze its complexity.

4.1 Algorithm

Optimizing the aforementioned model in Eq. 3 is equivalent to solving the following problem, which is to minimize $J(W, \beta)$ with a constraint on the parameters $W$ and $\beta$.
$$
\begin{aligned}
J(W, \beta) = {} & \sum_{i=1}^{n} W_i \cdot \log\left(1 + \exp\left((1 - 2Y_i) \cdot (x_i \beta)\right)\right) \\
& + \lambda_1 \sum_{j=1}^{p} \left\| \frac{X_{-j}^T (W \odot I_j)}{W^T I_j} - \frac{X_{-j}^T (W \odot (1 - I_j))}{W^T (1 - I_j)} \right\|_2^2 \\
& + \lambda_2 \|W\|_2^2 + \lambda_3 \|\beta\|_2^2 + \lambda_4 \|\beta\|_1 + \lambda_5 \left( \sum_{k=1}^{n} W_k - 1 \right)^2 \\
& \text{s.t.} \quad W \succeq 0.
\end{aligned}
\tag{4}
$$

It is difficult to get an analytical solution for the final optimization problem in Eq. 4. We solve it with an iterative optimization algorithm. First, we initialize the sample weight $W$ and the causal contribution $\beta$. Once the initial values are given, in each iteration we first update $\beta$ by fixing $W$, and then update $W$ by fixing $\beta$. These steps are described below.

Update $\beta$: When fixing $W$, problem (4) is equivalent to optimizing the following objective function:

$$J(\beta) = \sum_{i=1}^{n} W_i \cdot \log\left(1 + \exp\left((1 - 2Y_i) \cdot (x_i \beta)\right)\right) + \lambda_3 \|\beta\|_2^2 + \lambda_4 \|\beta\|_1, \tag{5}$$

which is an elastic-net regularized, sample-weighted logistic regression problem and can be solved with standard LASSO-type (or elastic net) solvers. Here, we use the proximal gradient algorithm [Parikh et al. 2014] with a proximal operator to optimize the objective function in (5).
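
The objective in Eq. 4 and the β update of Eq. 5 can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the indicator matrix is taken to be `X > 0`, every feature is assumed to be present in some images and absent in others (otherwise a denominator is zero), a fixed step size replaces any step-size search, and all function names, the toy data and the λ values are hypothetical.

```python
import numpy as np

def weighted_logistic_loss(X, y, W, beta):
    """First term of Eq. 4: sum_i W_i * log(1 + exp((1 - 2 y_i) * x_i beta))."""
    z = (1.0 - 2.0 * y) * (X @ beta)
    return float(np.sum(W * np.logaddexp(0.0, z)))

def causal_regularizer(X, W):
    """Eq. 2: squared gap between weighted treated/control confounder means."""
    I = (X > 0).astype(float)  # assumption: treated when the feature is present
    total = 0.0
    for j in range(X.shape[1]):
        X_rest = X.astype(float).copy()
        X_rest[:, j] = 0.0     # X_{-j}: column j replaced by zeros
        t, c = I[:, j], 1.0 - I[:, j]
        gap = X_rest.T @ (W * t) / (W @ t) - X_rest.T @ (W * c) / (W @ c)
        total += float(gap @ gap)
    return total

def objective(X, y, W, beta, lam):
    """J(W, beta) of Eq. 4 for lam = (lam1, ..., lam5)."""
    lam1, lam2, lam3, lam4, lam5 = lam
    return (weighted_logistic_loss(X, y, W, beta)
            + lam1 * causal_regularizer(X, W)
            + lam2 * float(W @ W)
            + lam3 * float(beta @ beta)
            + lam4 * float(np.abs(beta).sum())
            + lam5 * (W.sum() - 1.0) ** 2)

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def update_beta(X, y, W, beta, lam3, lam4, step=0.1, n_iter=200):
    """Proximal gradient on Eq. 5 with a fixed step size (an assumption)."""
    s = 1.0 - 2.0 * y
    for _ in range(n_iter):
        z = s * (X @ beta)
        sig = 0.5 * (1.0 + np.tanh(0.5 * z))            # stable sigmoid of z
        grad = X.T @ (W * sig * s) + 2.0 * lam3 * beta  # smooth part only
        beta = soft_threshold(beta - step * grad, step * lam4)
    return beta

# Hypothetical toy problem: feature 0 determines the label, feature 1 is noise.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 0.0, 1.0, 1.0])
W = np.full(4, 0.25)
lam = (1.0, 1.0, 0.01, 0.01, 1.0)
beta0 = np.zeros(2)
beta1 = update_beta(X, y, W, beta0, lam[2], lam[3])
print(objective(X, y, W, beta0, lam), objective(X, y, W, beta1, lam))
```

In this toy design the two features are already balanced under uniform weights, so the regularizer is zero and only the β update changes the objective.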

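Update $W$: By fixing $\beta$, we can obtain $W$ by optimizing (4). It is equivalent to optimizing the following objective function:

$$
\begin{aligned}
J(W) = {} & \sum_{i=1}^{n} W_i \cdot \log\left(1 + \exp\left((1 - 2Y_i) \cdot (x_i \beta)\right)\right) \\
& + \lambda_1 \sum_{j=1}^{p} \left\| \frac{X_{-j}^T (W \odot I_j)}{W^T I_j} - \frac{X_{-j}^T (W \odot (1 - I_j))}{W^T (1 - I_j)} \right\|_2^2 \\
& + \lambda_2 \|W\|_2^2 + \lambda_5 \left( \sum_{k=1}^{n} W_k - 1 \right)^2 \\
& \text{s.t.} \quad W \succeq 0.
\end{aligned}
\tag{6}
$$

To ensure the non-negativity of $W$, we let $W = \omega \odot \omega$, where $\omega \in \mathbb{R}^{n \times 1}$ and $\odot$ refers to the Hadamard product. Then problem (6) can be reformulated as:

$$
\begin{aligned}
J(\omega) = {} & \sum_{i=1}^{n} (\omega_i \cdot \omega_i) \cdot \log\left(1 + \exp\left((1 - 2Y_i) \cdot (x_i \beta)\right)\right) \\
& + \lambda_1 \sum_{j=1}^{p} \left\| \frac{X_{-j}^T (\omega \odot \omega \odot I_j)}{(\omega \odot \omega)^T I_j} - \frac{X_{-j}^T (\omega \odot \omega \odot (1 - I_j))}{(\omega \odot \omega)^T (1 - I_j)} \right\|_2^2 \\
& + \lambda_2 \|\omega \odot \omega\|_2^2 + \lambda_5 \left( \sum_{k=1}^{n} \omega_k \cdot \omega_k - 1 \right)^2.
\end{aligned}
\tag{7}
$$

The partial gradient of the term $J(\omega)$ with respect to $\omega$ is:

$$\frac{\partial J(\omega)}{\partial \omega} = \sum_{j=1}^{p} 4 \cdot \left( \frac{\partial J_b}{\partial \omega} \odot (\omega \cdot \mathbf{1}^T)^T \right)^T \cdot J_b,$$

where $\mathbf{1}^T = (1, 1, \cdots, 1)$ has $p$ entries, and

$$J_b = \frac{X_{-j}^T (\omega \odot \omega \odot I_j)}{(\omega \odot \omega)^T I_j} - \frac{X_{-j}^T (\omega \odot \omega \odot (1 - I_j))}{(\omega \odot \omega)^T (1 - I_j)},$$

$$
\begin{aligned}
\frac{\partial J_b}{\partial \omega} = {} & \frac{X_{-j}^T \odot (I_j \cdot \mathbf{1}^T)^T \cdot \left((\omega \odot \omega)^T I_j\right) - X_{-j}^T (\omega \odot \omega \odot I_j) \cdot I_j^T}{\left((\omega \odot \omega)^T I_j\right)^2} \\
& - \frac{X_{-j}^T \odot ((1 - I_j) \cdot \mathbf{1}^T)^T \cdot \left((\omega \odot \omega)^T (1 - I_j)\right) - X_{-j}^T (\omega \odot \omega \odot (1 - I_j)) \cdot (1 - I_j)^T}{\left((\omega \odot \omega)^T (1 - I_j)\right)^2}.
\end{aligned}
$$

Then we determine the step size $a$ with line search, and update $\omega$ at the $t$-th iteration as:

$$\omega^{(t)} = \omega^{(t-1)} - a \cdot \frac{\partial J(\omega^{(t-1)})}{\partial \omega^{(t-1)}},$$

and update $W^{(t)}$ at the $t$-th iteration with:

$$W^{(t)} = \omega^{(t)} \odot \omega^{(t)}.$$

We update $\beta$ and $W$ iteratively until the objective function (4) converges. The whole algorithm is summarized in Algorithm 1. Finally, with the causal contribution $\beta$ optimized by our CRLR algorithm, we can perform causal classification with interpretable insights on any testing image set.

4.2 Complexity Analysis

During the optimization procedure, the main cost is to calculate the loss $J(W, \beta)$ and to update the causal feature weights $\beta$ and the sample weights $W$. We analyze the time complexity of each of them in turn. For the calculation of the loss, the complexity is $O(np^2)$, where $n$ is the sample size and $p$ is the dimension of the variables. For updating $\beta$, this is a standard LASSO problem and its complexity is $O(np)$. For updating $W$, the complexity is dominated by the step of calculating the partial gradient of the function $J(\omega)$ with respect to the variable $\omega$. The complexity of $\frac{\partial J(\omega)}{\partial \omega}$ is $O(np^2)$.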
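
The gradient of the confounder-balancing term with respect to ω, and a single ω update, can be checked numerically with the sketch below. It is a hedged illustration, not the authors' code: it covers only the balancing term (without the λ weights or the logistic term), takes the indicator to be `X > 0`, uses a fixed step in place of the line search, and all names and toy values are hypothetical. Each feature costs O(np) work, which matches the O(np²) total stated above.

```python
import numpy as np

def _split(X, j):
    """Return X_{-j} (column j zeroed) and the treated/control indicators."""
    X_rest = X.astype(float).copy()
    X_rest[:, j] = 0.0
    t = (X[:, j] > 0).astype(float)
    return X_rest, t, 1.0 - t

def balancing_loss(X, omega):
    """Sum over j of ||J_b||^2 with W = omega * omega (elementwise)."""
    W = omega * omega
    total = 0.0
    for j in range(X.shape[1]):
        X_rest, t, c = _split(X, j)
        J_b = X_rest.T @ (W * t) / (W @ t) - X_rest.T @ (W * c) / (W @ c)
        total += float(J_b @ J_b)
    return total

def balancing_grad(X, omega):
    """Analytic gradient of balancing_loss with respect to omega."""
    W = omega * omega
    grad = np.zeros_like(omega)
    for j in range(X.shape[1]):
        X_rest, t, c = _split(X, j)
        s_t, s_c = W @ t, W @ c
        a_t, a_c = X_rest.T @ (W * t), X_rest.T @ (W * c)
        J_b = a_t / s_t - a_c / s_c
        # Jacobian of J_b with respect to W, shape (p, n), by the quotient rule.
        dJ_b = ((X_rest.T * t) * s_t - np.outer(a_t, t)) / s_t**2 \
             - ((X_rest.T * c) * s_c - np.outer(a_c, c)) / s_c**2
        # Chain rule through W = omega * omega gives the factor 4 * omega.
        grad += 4.0 * omega * (dJ_b.T @ J_b)
    return grad

def numerical_grad(f, omega, eps=1e-6):
    """Central finite differences, used only to validate the analytic gradient."""
    g = np.zeros_like(omega)
    for i in range(omega.size):
        e = np.zeros_like(omega)
        e[i] = eps
        g[i] = (f(omega + e) - f(omega - e)) / (2.0 * eps)
    return g

# Hypothetical toy data: 6 images, 3 binary features.
X = np.array([[1, 1, 0], [1, 1, 0], [1, 0, 1],
              [0, 0, 1], [0, 0, 0], [0, 1, 0]])
omega = np.array([0.9, 1.1, 1.0, 0.8, 1.2, 1.0])

analytic = balancing_grad(X, omega)
numeric = numerical_grad(lambda w: balancing_loss(X, w), omega)

omega_new = omega - 0.05 * analytic  # fixed step instead of a line search
W_new = omega_new * omega_new        # non-negative by construction
print(np.max(np.abs(analytic - numeric)))
```

Parameterizing the weights as ω ⊙ ω removes the explicit non-negativity constraint, so plain gradient steps on ω keep every sample weight non-negative.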

Algorithm 1 Causally Regularized Logistic Regression (CRLR)
Input: Tradeoff parameters $\lambda_1 > 0$, $\lambda_2 > 0$, $\lambda_3 > 0$, $\lambda_4 > 0$, $\lambda_5 > 0$, variables matrix $X$ and outcome $Y$.
Output: Causal contribution $\beta$ and sample weight $W$.
1: Calculate the indicator matrix $I$ from the variables matrix $X$.
2: Initialize the causal contribution $\beta^{(0)}$ and the sample weight $W^{(0)}$.
3: Calculate the current value of $J(W, \beta)^{(0)} = J(W^{(0)}, \beta^{(0)})$ with Equation (4).
4: Initialize the iteration variable $t \leftarrow 0$.
5: repeat
6:   $t \leftarrow t + 1$
7:   Update $\beta^{(t)}$ by solving $J(\beta^{(t-1)})$ in Equation (5).
8:   Update $W^{(t)}$ by solving $J(W^{(t-1)})$ in Equation (6).
9:   Calculate $J(W, \beta)^{(t)} = J(W^{(t)}, \beta^{(t)})$.
10: until $J(W, \beta)^{(t)}$ converges or the maximum number of iterations is reached
11: return $\beta$, $W$.

In total, the complexity of each iteration in Algorithm 1 is $O(np^2)$.

5 EXPERIMENTS

In this section, we introduce the experimental settings, present the results of a comparative study in three non-i.i.d. situations, and visualize the explainable causal features.

5.1 Experimental Settings

Dataset. In order to simulate the potential non-i.i.d. situations in the real world, we manually construct a 10-category dataset based on images from YFCC100M [Thomee et al. 2016]. The YFCC100M dataset provides 100 million images, and each image contains multiple tags. In constructing our dataset, we first select a major object tag (e.g. dog) as the category label, and select 5 other context tags which frequently co-occur with the major tag (e.g. grass, beach, car). We then combine the major tag and a context tag as the query to retrieve images from YFCC100M. After scrutinizing the image contents to guarantee their correctness w.r.t. the category and context, we get a number of images of a certain context in a category. In total, we organize 10 categories, each containing 5 contexts. The details of the dataset are described in Table 2.

Table 2. Statistics of our dataset with 10 categories, where each category has 5 contexts.

Category  Context 1            Context 2        Context 3      Context 4       Context 5     Total
bird      duck (210)           gull (200)       hawk (200)     heron (200)     parrot (190)  1000
bridge    san francisco (160)  london (110)     nyc (110)      street (100)    sydney (180)  660
car       art (114)            bmw (120)        classic (200)  ferrari (200)   racing (180)  814
cat       black (180)          house (120)      kitten (200)   tabby (200)     white (240)   940
church    basilica (94)        catholic (83)    gothic (104)   orthodox (100)  roman (81)    462
dog       beach (200)          car (150)        grass (200)    home (200)      snow (190)    940
flower    blossom (200)        lily (240)       orchid (240)   rose (220)      tulip (190)   1090
horse     dressage (260)       equestrian (206) jumping (200)  pony (50)       racing (140)  856
train     diesel (250)         locomotive (230) metro (100)    station (68)    steam (150)   798
tree      christmas (140)      leaves (220)     palm (170)     snow (160)      spring (170)  860

Image Representation. For ease of visualization and interpretation, we use visual words [Csurka et al. 2004] as features to represent images. In detail, we first detect interest points with 8*8 grids on images. For the sake of brevity and generality, we define a grid (8*8, 16*16, etc.) to extract feature points. We then use the SURF descriptor (speeded up robust features) [Bay et al. 2006] to represent each feature point as a numerical feature vector. Finally, we apply the k-means [Hartigan and Wong 1979] clustering algorithm to the feature descriptors extracted from the whole image set and get 500 visual clusters. With the k-means algorithm, we can assign each feature descriptor to one of the k mutually exclusive clusters. Each cluster center is defined as a "visual word", and each feature descriptor is assigned to a visual word by nearest neighbor. Each image is then encoded into a 500-dimensional feature vector, with each dimension being a binary variable indicating whether a visual word occurs in the image.

Baselines. We implement the following competitive baseline classifiers to compare with our algorithm.

• LR [Menard 2002]: Logistic Regression (LR) is a typical correlation-based method and has been widely used in many classification problems.
• LR+L1 [Tibshirani 1996]: To avoid overfitting of the LR model, we impose an L1 regularizer on Logistic Regression.
• SVM [Cortes and Vapnik 1995]: Support Vector Machine (SVM) is another typical classification method, and we use the SVM with the kernel method as a baseline.
• Two-Step: A straightforward two-step solution which first performs causal feature selection via confounder balancing [Athey et al. 2016] and then applies Logistic Regression. We tuned the number of top causal features, and report the results with the optimal one.
• MLP [LeCun et al. 1998]: We implement a Multi-layer Perceptron (MLP) as a baseline classifier.
After tuning the neural network structure, we adopt the structure that performs best on the validation set, with 3 hidden layers ( ), in our experiments.

We tuned the parameters of our algorithm and of the baselines via cross validation, by grid search on the validation set.

5.2 Experimental Results

In this section, we report our experimental results under three different non-i.i.d. situations: the radical context bias situation, the moderate context bias situation, and the label composition bias situation. In the moderate context bias situation, we explore a more general condition in which the distribution of contexts differs between the training and testing sets. In the label composition bias situation, we investigate a common problem in which the percentage of positive samples varies from the training set to the testing set.

5.2.1 Radical Context Bias.

Settings. In this experiment, we simulate the non-i.i.d. situation by assigning different contexts to the training, validation and testing sets. For each category, we use contexts 1, 2 and 3 for training, context 4 for validation and context 5 for testing. Moreover, we sample non-uniformly among the contexts in the training set so that contexts 1, 2 and 3 account for proportions of 0.66, 0.17 and 0.17 respectively. This setting is consistent with the natural phenomenon that visual concepts follow a power-law distribution [Clauset et al. 2009], meaning that only a few visual concepts are common and the large majority are rare. We carry this notion over to visual contexts.

Results. We report the performance in Accuracy and F1 in Table 3. From the results, we make the following observations. (1) Our CRLR model achieves the best performance in almost all categories (9/10). Since the major difference between CRLR and a standard Logistic Regression model is the causal regularizer, we can safely attribute the significant improvement to the effective confounder balancing term and its seamless integration with the logistic regression model. (2) The performance of the two-step approach is much worse than that of CRLR, which clearly demonstrates the importance of jointly optimizing causal feature selection and classification. (3) Not surprisingly, the correlation-based classification methods do not work well in this setting, mainly because they erroneously give high importance to correlational but non-causal features, which makes them sensitive to context change.

Table 3. Results of classifiers under the non-i.i.d. situation with radical context bias in data. Columns: Accuracy and F1 for each category (bird, bridge, car, cat, church, dog, flower, horse, train, tree). Rows: LR, LR+L1, SVM, Two-Step, MLP, CRLR.

Fig. 2. The relationship between our CRLR algorithm performance and context bias on each category (axes: Context Bias and Relative F1 Improvement; categories: bird, bridge, car, cat, church, dog, flower, horse, train, tree). The more context bias in data, the more relative F1 improvement of our CRLR algorithm.

Insightful Analysis. An interesting question is whether CRLR performs much better in categories where the bias is more serious. Here we quantify the bias level of a category by the Earth Mover's Distance (EMD) between the average feature vector of the training images and the average feature vector of the testing images. We also quantify the superiority of CRLR by its relative F1 improvement over the best baseline. We show the results in Figure 2. We can see that the relative F1 improvement and the category bias level are correlated to some degree. The extreme cases are more obvious. For example, the dog category is the most biased, and CRLR's relative improvement in F1 there reaches about 50%. In contrast, the bias in the church category is not obvious, which can account for CRLR's ordinary performance in the church category in Table 3.

5.2.2 Moderate Context Bias.

Settings. In this experiment, we explore a more general condition in which the training and testing sets consist of the same contexts but in different proportions. The training set is constructed in the same way as in Section 5.2.1, where the proportions of contexts 1, 2 and 3 are 0.66, 0.17 and 0.17 respectively. To simulate different levels of context bias between the training and testing sets, we also construct the testing data from contexts 1, 2 and 3, but vary the proportion of context 1 in the testing set from 0.2 to 0.8, with the remainder divided equally between contexts 2 and 3. For each setting, we run our algorithm and the baselines on all 10 categories and report the average F1.

Fig. 3. Average F1 performance on different context bias (x-axis: percentage of context 1 in the testing set; y-axis: average F1; curves: LR, LR+L1, SVM, Two-Step, MLP, CRLR).

Results. The results are shown in Figure 3. From the results, we can see that our proposed CRLR algorithm outperforms the baselines at all levels of context bias. The most competitive baseline is SVM. Comparing CRLR and SVM, we can see that CRLR's advantage is more pronounced when the proportion of context 1 is smaller. Since context 1 dominates the training set with a proportion of 0.66, this is easy to understand: a smaller proportion of context 1 in the testing set implies a larger context bias between the training and testing sets. Besides, the point between 0.6 and 0.7 in the percentage of context 1 should also be noted.
At that point, the proportions of contexts 1, 2 and 3 in the testing set are around 0.66, 0.17 and 0.17 respectively, almost the same as in the training set. This implies an i.i.d. situation, in which the performance of CRLR is almost the same as that of SVM. These results demonstrate that CRLR can significantly outperform correlation-based methods in non-i.i.d. situations, while performing as well as correlation-based methods in i.i.d. situations.

5.2.3 Label Composition Bias.

Settings. In this experiment, we consider a common situation in which the proportions of positive and negative samples differ between the training and testing sets. We use contexts 1, 2 and 3 for training, context 4 for validation and context 5 for testing. For each context, we fix the positive sample rate at 25% in the training set and vary the positive sample rate in the testing set from 0.1 to 0.9. For each percentage of positive samples in the testing set, we report the average F1 over the 10 categories.

Fig. 4. Average F1 performance on different label composition bias (x-axis: percentage of positive samples; y-axis: average F1; curves: LR, LR+L1, SVM, Two-Step, MLP, CRLR).

Results. From Figure 4, we can see that our CRLR algorithm performs best in all settings, and that its advantage over the baselines grows as the percentage of positive samples in the testing set increases. When the percentage reaches 0.9, CRLR improves the average F1 among all categories from the baselines' 0.58 to 0.73. When the training set is dominated by negative samples, traditional classifiers assign higher weights to negative features and are cautious about giving positive predictions. If the testing set is then dominated by positive samples, these classifiers cannot work well. In contrast, our CRLR model always places emphasis on causal features and is robust to scenarios where positive samples are rather sparse. This merit is fully demonstrated by Figure 4.

5.2.4 Feature Visualization and Explanation.

Another important goal of introducing causality into image classification is to make image classification models more explainable. Previous classification models, especially deep learning models, are typically black-box models that are hardly explainable. Explainable models are highly desirable in many applications, especially those in which people make the decisions. To demonstrate the interpretability of our method, we visualize the top-5 features in each category selected by CRLR and LR respectively. Due to space limitations, we only show examples from 4 categories in Figure 5. We can see that most of the features selected by CRLR are positioned on the major object.
For example, in the dog category, the features selected by CRLR indeed come from the dog's nose, ears, fur, etc., which are causal features for determining whether an image belongs to the dog category. In contrast, many of the features selected by LR are context features. From the viewpoint of explainability, CRLR can provide sufficient explanations of why it classifies an image into the dog category: it detects causal features such as the dog's nose and fur. As for LR, the correlational features are often difficult to interpret. We still find, however, that our method exploits correlational features in some cases, as depicted in Figures 5(m) and 5(o). This might be because the bias level in the train category is fairly low, which weakens the effect of the causal regularizer.
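The top-5 feature selection used for this visualization can be sketched as follows. This is only an illustrative sketch, not the authors' implementation. It assumes that features are ranked by the absolute value of their learned coefficient (the ranking rule is not spelled out in this section), and the names `top_k_features`, `boxes_for_features` and the `(x, y, width, height)` box layout are hypothetical. Because a single visual word can match several descriptors in one image, a selected feature may yield more than one box.

```python
from typing import List, Sequence, Tuple

# Hypothetical box layout: (x, y, width, height) of a grid cell.
Box = Tuple[int, int, int, int]


def top_k_features(weights: Sequence[float], k: int = 5) -> List[int]:
    """Return the indices of the k features with the largest absolute weight.

    Assumption: importance is measured by |weight|; ties go to the lower index.
    """
    order = sorted(range(len(weights)), key=lambda j: (-abs(weights[j]), j))
    return order[:k]


def boxes_for_features(word_ids: Sequence[int],
                       boxes: Sequence[Box],
                       selected: Sequence[int]) -> List[Box]:
    """Collect the boxes of all descriptors assigned to a selected visual word."""
    chosen = set(selected)
    return [box for word, box in zip(word_ids, boxes) if word in chosen]


# Toy example with 6 visual words instead of 500.
weights = [0.1, -2.0, 0.5, 3.0, 0.0, -0.7]
selected = top_k_features(weights, k=3)

# Descriptors of one image: the visual word each one was assigned to,
# and the grid cell it was extracted from.
word_ids = [1, 3, 3, 4]
boxes = [(0, 0, 8, 8), (8, 0, 8, 8), (16, 0, 8, 8), (24, 0, 8, 8)]
highlighted = boxes_for_features(word_ids, boxes, selected)
```

In this toy example, visual word 3 is selected once but contributes two boxes, which is why the number of drawn boxes can exceed the number of selected features.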

Fig. 5. Top-5 features selected by CRLR and Logistic Regression, panels (a)-(p). The red boxes indicate the features that CRLR selects and the green boxes indicate the features that Logistic Regression selects. Note that each feature represents a visual word and may correspond to multiple bounding boxes, so the numbers of red and green boxes may not be equal.

Fig. 6. Sensitivity analysis with respect to λ1 (x-axis: λ1; y-axis: average F1; curves: bird, bridge, dog, train).

5.2.5 Parameter Sensitivity.

In this section, we investigate the parameter sensitivity of our CRLR algorithm. As λ2 to λ5 are the weights of commonly used regularizers, we evaluate the effect of parameter λ1 on the results. λ1 is essentially a trade-off parameter that controls the relative weights of the logistic loss and the causal regularizer. More intrinsically, it controls the trade-off between predictive power and the degree of bias balancing. Because the results for different categories show similar trends, we use just four categories (bird, bridge, dog, train) as examples for brevity. λ1 takes values in {0.1, 0.3, 1, 3, 5, 10, 15}. We plot the results in Figure 6. We can see that the average F1 changes smoothly as λ1 varies and that there is quite a large stable region from which the optimal λ1 can be selected, demonstrating that our method is not sensitive to this parameter.

6 CONCLUSION AND DISCUSSION

In this paper, we focus on the image classification task under non-i.i.d. situations. We argue that most previous correlation-based methods can only preserve their predictive power when the training and testing sets are drawn from the same distribution, and cannot generalize well under non-i.i.d. situations. Moreover, the results produced by those correlation-based methods can hardly be interpreted. To address the non-i.i.d. challenge, we introduce causality into image classification and propose a novel Causally Regularized Logistic Regression (CRLR) model that jointly optimizes a logistic loss and a causal regularizer for causal classification on images. We construct a new dataset to simulate various non-i.i.d. situations in real-world applications and conduct extensive experiments. The experimental results demonstrate that our CRLR algorithm outperforms the traditional correlation-based methods in various settings. We also demonstrate that the top causal features selected by CRLR can provide explainable insights.

In this paper, we omit the comparison with CNN-based image classifiers, as we cannot train a CNN model with only the thousands of images in our dataset. Using a pre-trained CNN such as AlexNet would be seriously unfair, because the representations in AlexNet are trained on millions of images. Moreover, such complicated deep models are known to overfit the data; in this sense, they are also sensitive to context bias and label composition bias.
We will leave the empirical study of this problem to future work.

REFERENCES

Constantin F Aliferis, Ioannis Tsamardinos, and Alexander Statnikov. HITON: a novel Markov Blanket algorithm for optimal variable selection. In AMIA Annual Symposium Proceedings. American Medical Informatics Association, 21.
Susan Athey, Guido W Imbens, and Stefan Wager. Approximate Residual Balancing: De-Biased Inference of Average Treatment Effects in High Dimensions. arXiv preprint (2016).
Peter C Austin. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate behavioral research 46, 3 (2011).
Heejung Bang and James M Robins. Doubly robust estimation in missing data and causal inference models. Biometrics 61, 4 (2005).
Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features. Computer vision ECCV 2006 (2006).
Aaron Clauset, Cosma Rohilla Shalizi, and Mark EJ Newman. Power-law distributions in empirical data. SIAM review 51, 4 (2009).
Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning 20, 3 (1995).
Gabriella Csurka, Christopher Dance, Lixin Fan, Jutta Willamowski, and Cédric Bray. Visual categorization with bags of keypoints. In Workshop on statistical learning in computer vision, ECCV, Vol. 1. Prague, 1-2.
Virgile Landeiro Dos Reis and Aron Culotta. Using matched samples to estimate the effects of exercise on mental health from Twitter. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence.
Michele Jonsson Funk, Daniel Westreich, Chris Wiesen, Til Stürmer, M Alan Brookhart, and Marie Davidian. Doubly robust estimation of causal effects. American journal of epidemiology 173, 7 (2011).
Jens Hainmueller. Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis (2011), mpr025.
John A Hartigan and Manchek A Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics) 28, 1 (1979).
Jiayuan Huang, Alexander J Smola, Arthur Gretton, Karsten M Borgwardt, Bernhard Schölkopf, et al. Correcting sample selection bias by unlabeled data. Advances in neural information processing systems 19 (2007), 601.
Daphne Koller and Mehran Sahami. Toward optimal feature selection. Technical Report. Stanford InfoLab.
Kun Kuang, Peng Cui, Bo Li, Meng Jiang, Shiqiang Yang, and Fei Wang. Treatment effect estimation with data-driven variable decomposition. In Proceedings of the 31st AAAI Conference on Artificial Intelligence. AAAI.
Michael Lechner. Earnings and employment effects of continuous off-the-job training in East Germany after unification. Journal of Business & Economic Statistics 17, 1 (1999).
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998).
Anqi Liu and Brian Ziebart. Robust classification under sample selection bias. In Advances in Neural Information Processing Systems.
David Lopez-Paz, Robert Nishihara, Soumith Chintala, Bernhard Schölkopf, and Léon Bottou. Discovering Causal Signals in Images. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017).
Scott Menard. Applied logistic regression analysis. Number 106. Sage.
Neal Parikh, Stephen P Boyd, et al. Proximal Algorithms. Foundations and Trends in optimization 1, 3 (2014).
Jean-Philippe Pellet and André Elisseeff. Using Markov blankets for causal structure learning. Journal of Machine Learning Research 9, Jul (2008).
Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika (1983).
Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference 90, 2 (2000).
Elizabeth A Stuart. Matching methods for causal inference: A review and a look forward. Statistical science: a review journal of the Institute of Mathematical Statistics 25, 1 (2010), 1.
Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research 8, May (2007).
Wei Sun, Pengyuan Wang, Dawei Yin, Jian Yang, and Yi Chang. Causal Inference via Sparse Additive Models with Application to Online Advertising. In AAAI.
Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research. Commun. ACM 59, 2 (2016).
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) (1996).
Lei Yu and Huan Liu. Efficient feature selection via analysis of relevance and redundancy. Journal of machine learning research 5, Oct (2004).
Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the twenty-first international conference on Machine learning. ACM, 114.
José R Zubizarreta. Stable weights that balance covariates for estimation with incomplete outcome data. J. Amer. Statist. Assoc. 110, 511 (2015).
