arXiv: v2 [stat.ML] 4 Jun 2015


Visual Causal Feature Learning

Krzysztof Chalupka, Computation and Neural Systems, California Institute of Technology, Pasadena, CA, USA
Pietro Perona, Electrical Engineering, California Institute of Technology, Pasadena, CA, USA
Frederick Eberhardt, Humanities and Social Sciences, California Institute of Technology, Pasadena, CA, USA

Abstract

We provide a rigorous definition of the visual cause of a behavior that is broadly applicable to the visually driven behavior in humans, animals, neurons, robots and other perceiving systems. Our framework generalizes standard accounts of causal learning to settings in which the causal variables need to be constructed from micro-variables. We prove the Causal Coarsening Theorem, which allows us to gain causal knowledge from observational data with minimal experimental effort. The theorem provides a connection to standard inference techniques in machine learning that identify features of an image that correlate with, but may not cause, the target behavior. Finally, we propose an active learning scheme to learn a manipulator function that performs optimal manipulations on the image to automatically identify the visual cause of a target behavior. We illustrate our inference and learning algorithms in experiments based on both synthetic and real data.

1 INTRODUCTION

Visual perception is an important trigger of human and animal behavior. The visual cause of a behavior can be easy to define, say, when a traffic light turns green, or quite subtle: apparently it is the increased symmetry of features that leads people to judge some faces more attractive than others (Grammer and Thornhill, 1994). Significant scientific and economic effort is focused on visual causes in advertising, entertainment, communication, design, medicine, robotics and the study of human and animal cognition. Visual causes profoundly influence our daily activity, yet our understanding of what constitutes a visual cause lacks a theoretical basis.
In practice, it is well-known that images are composed of millions of variables (the pixels) but it is functions of the pixels (often called "features") that have meaning, rather than the pixels themselves. We present a theoretical framework and inference algorithms for visual causes in images. A visual cause is defined (more formally below) as a function (or feature) of raw image pixels that has a causal effect on the target behavior of a perceiving system of interest. We present three advances:

• We provide a definition of the visual cause of a target behavior as a macro-variable that is constructed from the micro-variables (pixels) that make up the image space. The visual cause is distinguished from other macro-variables in that it contains all the causal information about the target behavior that is available in the image. We place the visual cause within the standard framework of causal graphical models (Spirtes et al., 2000; Pearl, 2009), thereby contributing to an account of how to construct causal variables.
• We prove the Causal Coarsening Theorem (CCT), which shows how observational data can be used to learn the visual cause with minimal experimental effort. It connects the present results to standard classification tasks in machine learning.
• We describe a method to learn the manipulator function, which automatically performs perceptually optimal manipulations on the visual causes.

We illustrate our ideas using synthetic and real-data experiments. Python code that implements our algorithms, as well as reproduces some of the experimental results, is available online at http://vision.caltech.edu/~kchalupk/code.html.

We chose to develop the theory within the context of visual causes as this setting makes the definitions most intuitive and is itself of significant practical interest. However, the framework and results can be equally well applied to extract causal information from any aggregate of micro-variables on which manipulations are possible.
Examples include auditory, olfactory and other sensory stimuli; high-dimensional neural recordings; market data in finance; consumer data in marketing. There, causal feature learning is of both theoretical ("What is the cause?") and practical ("Can we automatically manipulate it?") importance.

1.1 PREVIOUS WORK

Our framework extends the theory of causal graphical models (Spirtes et al., 2000; Pearl, 2009) to a setting in which the input data consists of raw pixel (or other micro-variable) data. In contrast to the standard setting, in which the macro-variables in the statistical dataset already specify the candidate causal relata, the causal variables in our setting have to be constructed from the micro-variables they supervene on, before any causal relations can be established. We emphasize the difference between our method of causal feature learning and methods for causal feature selection (Guyon et al., 2007; Pellet and Elisseeff, 2008). The latter choose the best (under some causal criterion) features from a restricted set of plausible macro-variable candidates. In contrast, our framework efficiently searches the whole space of all the possible macro-variables that can be constructed from an image.

Our approach derives its theoretical underpinnings from computational mechanics (Shalizi and Crutchfield, 2001; Shalizi, 2001), but supports a more explicitly causal interpretation by incorporating the possibility of confounding and interventions. Since we allow for unmeasured common causes of the features in the image and the target behavior, we have to distinguish between the plain conditional probability distribution of the target behavior ($T$) given the (observed) image ($I$) and the distribution of the target behavior given that the observed image was manipulated (i.e. $P(T \mid I)$ vs. $P(T \mid do(I))$). Hoel et al. (2013), who develop a similar model to investigate the relationship between causal micro- and macro-variables, avoid this distinction by assuming that all their data was generated from what in our setting would be the manipulated distribution $P(T \mid do(I))$. We take the distinction between interventional and observational distributions to be one of the key features of a causal analysis.
The extant literature on causal learning from image or video data does not generally consider the aggregation from pixel variables into causal macro-variables, but instead starts from annotated or pre-defined features of the image (see e.g. Fire and Zhu (2013a,b)).

1.2 CAUSAL FEATURE LEARNING: AN EXAMPLE

Fig. 1 presents a paradigmatic case study in visual causal feature learning, which we will use as a running example. The contents of an image $I$ are caused by external, non-visual binary hidden variables $H_1$ and $H_2$ such that if $H_1$ is on, $I$ contains a vertical bar (v-bar¹) at a random position, and if $H_2$ is on, $I$ contains a horizontal bar (h-bar) at a random position. A target behavior $T \in \{0, 1\}$ is caused by $H_1$ and $I$, such that $T = 1$ is more likely whenever $H_1 = 1$ and whenever the image contains an h-bar.

¹ We take a v-bar (h-bar) to consist of a complete column (row) of black pixels.

We deliberately constructed this example such that the visual cause is clearly identifiable: manipulating the presence of an h-bar in the image will influence the distribution of $T$. Thus, we can call the following function $C : \mathcal{I} \to \{0, 1\}$ the causal feature of $I$ or the visual cause of $T$:

$$C(I) = \begin{cases} 1 & \text{if } I \text{ contains an h-bar} \\ 0 & \text{otherwise.} \end{cases}$$

The presence of a v-bar, on the other hand, is not a causal feature. Manipulating the presence of a v-bar in the image has no effect on $H_1$ or $T$. Still, the presence of a v-bar is as strongly correlated with the value of $T$ (via the common cause $H_1$) as the presence of an h-bar is. We will call the following function $S : \mathcal{I} \to \{0, 1\}$ the spurious correlate of $T$ in $I$:

$$S(I) = \begin{cases} 1 & \text{if } I \text{ contains a v-bar} \\ 0 & \text{otherwise.} \end{cases}$$

Both the presence of h-bars and the presence of v-bars are good individual (and even better joint) predictors of the target variable, but only one of them is a cause. Identifying the visual cause from the image thus requires the ability to distinguish, among the correlates of the target variable, those that are actually causal, even if the non-causal correlates are (possibly more strongly) correlated with the target.
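The two features of the running example can be written down directly. The following is a minimal sketch, assuming binary images stored as NumPy arrays in which 1 denotes a black pixel; the function names and the simplified sampler `sample_image` (one bar per active hidden variable plus sparse noise, rather than a uniform draw over all images of a class) are illustrative assumptions and are not taken from the authors' code.

```python
import numpy as np

def has_hbar(img):
    """C(I): 1 if some row consists entirely of black pixels (value 1), else 0."""
    return int(np.any(np.all(img == 1, axis=1)))

def has_vbar(img):
    """S(I): 1 if some column consists entirely of black pixels, else 0."""
    return int(np.any(np.all(img == 1, axis=0)))

def sample_image(h1, h2, size=10, noise=0.03, rng=None):
    """Hypothetical sampler: a v-bar iff h1 == 1, an h-bar iff h2 == 1, plus sparse noise."""
    rng = np.random.default_rng() if rng is None else rng
    while True:
        img = (rng.random((size, size)) < noise).astype(int)
        if h1:
            img[:, rng.integers(size)] = 1  # paint one full column
        if h2:
            img[rng.integers(size), :] = 1  # paint one full row
        # Reject the rare draws in which noise completes an unintended bar.
        if has_vbar(img) == h1 and has_hbar(img) == h2:
            return img
```

Under these assumptions `has_hbar` plays the role of the visual cause and `has_vbar` that of the spurious correlate; both are ordinary functions of the pixels, which is the sense in which the macro-variables are constructed from micro-variables.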
While the values of $S$ and $C$ in our example stand in a bijective correspondence to the values of $H_1$ and $H_2$, respectively, this is only to keep the illustration simple. In general, the visual cause and the spurious correlate can be probabilistic functions of any number of (not necessarily independent) hidden variables, and can share the same hidden causes.

2 A THEORY OF VISUAL CAUSAL FEATURES

In our example the identification of the visual cause with the presence of an h-bar is intuitively obvious, as the model is constructed to have an easily describable visual cause. But the example does not provide a theoretical account of what it takes to be a visual cause in the general case when we do not know what the causally relevant pixel configurations are. In this section, we provide a general account of how the visual cause is related to the pixel data.

2.1 VISUAL CAUSES AS MACRO-VARIABLES

A visual cause is a high-level random variable that is a function (or feature) of the image, which in turn is defined by the random micro-variables that determine the pixel values. The functional relation between the image and the visual cause is, in general, surjective, though in principle it could be bijective. While we are interested in identifying

the visual causes of a target behavior, the functional relation between the image pixels and the visual cause should not itself be interpreted as causal. Pixels do not cause the features of an image, they constitute them, just as the atoms of a table constitute the table (and its features). The difference between the causal and the constitutive relation is that the former requires the possibility of independent manipulation (at least to some extent), whereas by definition one cannot manipulate the visual cause without manipulating the image pixels. The probability distribution over the visual cause is induced by the probability distribution over the pixels in the image and the functional mapping from the image to the visual cause.

Figure 1: Our case study generative model. Two binary hidden (non-visual) variables $H_1$ and $H_2$ toss unbiased coins. The content of the image $I$ depends on these variables as follows. If $H_1 = H_2 = 0$, $I$ is chosen uniformly at random from all the images containing no v-bars and no h-bars. If $H_1 = 0$ and $H_2 = 1$, $I$ is chosen uniformly at random from all images containing at least one h-bar but no v-bars. If $H_1 = 1$ and $H_2 = 0$, $I$ is chosen uniformly at random from all the images containing at least one v-bar but no h-bars. Finally, if $H_1 = H_2 = 1$, $I$ is chosen from images containing at least one v-bar and at least one h-bar. The distribution of the binary behavior $T$ depends only on the presence of an h-bar in $I$ and the value of $H_1$. In observational studies, $H_1 = 1$ iff $I$ contains a v-bar. However, a manipulation of any specific image $I = i$ that introduces a v-bar (without changing $H_1$) will in general not change the probability of $T$ occurring. Thus, $T$ does not depend causally on the presence of v-bars in $I$.
But since a visual cause stands in a constitutive relation with the image, we cannot without further explanation describe interventions on the visual cause in terms of the standard do-operation (Pearl, 2009). Our goal will be to define a macro-variable $C$, which contains all the causal information available in an image about a given behavior $T$, and define its manipulation. To make the problem approachable, we introduce two (natural) assumptions about the causal relation between the image and the behavior: (i) the value of the target behavior $T$ is determined subsequently to the image in time, and (ii) the variable $T$ is in no way represented in the image. These assumptions exclude the possibility that $T$ is a cause of features in the image or that $T$ can be seen as causing itself.

2.2 GENERATIVE MODELS: FROM MICRO- TO MACRO-VARIABLES

Let $T \in \{0, 1\}$ represent a target behavior.² Let $\mathcal{I}$ be a discrete space of all the images that can influence the target behavior (in our experiments in Section 4, $\mathcal{I}$ is the space of n-dimensional black-and-white images). We use the following generative model to describe the relation between the images and the target behavior: An image is generated by a finite set of unobserved discrete variables $H_1, \dots, H_m$ (we write $H$ for short). The target behavior is then determined by the image and possibly a subset of variables $H_c \subseteq H$ that are confounders of the image and the target behavior:

$$P(T, I) = \sum_{H} P(T \mid I, H)\, P(I \mid H)\, P(H) = \sum_{H} P(T \mid I, H_c)\, P(I \mid H)\, P(H). \tag{1}$$

Independent noise that may contribute to the target behavior is marginalized and omitted for the sake of simplicity in the above equation. The noise term incorporates any hidden variables which influence the behavior but stand in no causal relation to the image. Such variables are not directly relevant to the problem. Fig. 2 shows this generative model.
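Eq. (1) can be evaluated by brute-force enumeration when the hidden variables are few. The sketch below does this for the running example of Section 1.2, with images abstracted to the class (v, h) recording whether a v-bar and an h-bar are present. The conditional table `P_T0`, keyed by (h-bar present, $H_1$), is an assumed reading of the example: it pairs the probabilities 0, .33, .66 and 1 that appear in Fig. 3 with bar configurations in the way that seems most natural, and all names are illustrative.

```python
from itertools import product

P_H1 = {0: 0.5, 1: 0.5}  # unbiased coins, as stated for the running example
P_H2 = {0: 0.5, 1: 0.5}
# ASSUMPTION: P(T=0 | h-bar present?, H1); the pairing of values to cases is inferred.
P_T0 = {(1, 1): 0.0, (1, 0): 1 / 3, (0, 1): 2 / 3, (0, 0): 1.0}

def p_image_class(v, h, h1, h2):
    """P(I in class (v, h) | H1, H2): a bar is present iff its hidden variable is on."""
    return 1.0 if (v, h) == (h1, h2) else 0.0

def observational_p_t0(v, h):
    """P(T=0 | I in class (v, h)), from the joint of Eq. (1) divided by P(I)."""
    joint = 0.0     # accumulates P(T=0, I in class)
    marginal = 0.0  # accumulates P(I in class)
    for h1, h2 in product((0, 1), repeat=2):
        weight = p_image_class(v, h, h1, h2) * P_H1[h1] * P_H2[h2]
        joint += P_T0[(h, h1)] * weight
        marginal += weight
    return joint / marginal
```

With this table the four image classes receive four distinct observational probabilities, because in observational data the v-bar reveals the value of the confounder $H_1$.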
Under this model, we can define an observational partition of the space of images $\mathcal{I}$ that groups images into classes that have the same conditional probability $P(T \mid I)$:

Definition 1 (Observational Partition, Observational Class). The observational partition $\Pi_o(T, \mathcal{I})$ of the set $\mathcal{I}$ w.r.t. behavior $T$ is the partition induced by the equivalence relation $\sim$ such that $i \sim j$ if and only if $P(T \mid I = i) = P(T \mid I = j)$. We will denote it as $\Pi_o$ when the context is clear. A cell of an observational partition is called an observational class.

² An extension of the framework to non-binary, discrete $T$ is easy but complicates the notation significantly. An extension to the continuous case is beyond the scope of this article.

In standard classification tasks in machine learning, the observational partition is associated with class labels. In our case, two images that belong to the same cell of the observational partition assign equal predictive probability to the target behavior. Thus, knowing the observational class

of an image allows us to predict the value of $T$. However, the predictive probability assigned to an image does not tell us the causal effect of the image on $T$. For example, a barometer is widely taken to be an excellent predictor of the weather. But changing the barometer needle does not cause an improvement of the weather. It is not a (visual or otherwise) cause of the weather. In contrast, seeing a particular barometer reading may well be a visual cause of whether we pack an umbrella. Our notion of a visual cause depends on the ability to manipulate the image.

Figure 2: A general model of visual causation. In our model each image $I$ is caused by a number of hidden non-visual variables $H_i$, which need not be independent. The image itself is the only observed cause of a target behavior $T$. In addition, a (not necessarily proper) subset of the hidden variables can be a cause of the target behavior. These confounders create visual spurious correlates of the behavior in $I$. (In the depicted example, $H = (H_1, \dots, H_N)$ and $H_c = (H_2, H_N)$.)

Definition 2 (Visual Manipulation). A visual manipulation is the operation $\mathrm{man}(I = i)$ that changes (the pixels of) the image to image $i \in \mathcal{I}$, while not affecting any other variables (such as $H$ or $T$). That is, the manipulated probability distribution of the generative model in Eq. (1) is given by

$$P(T \mid \mathrm{man}(I = i)) = \sum_{H_c} P(T \mid I = i, H_c)\, P(H_c).$$

The manipulation changes the values of the image pixels, but does not change the underlying "world", represented in our model by the $H_i$ that generated the image. Formally, the manipulation is similar to the do-operator for standard causal models. However, we here reserve the do-operation for interventions on causal macro-variables, such as the visual cause of $T$. We discuss the distinction in more detail below. We can now define the causal partition of the image space (with respect to the target behavior $T$) as:

Definition 3 (Causal Partition, Causal Class). The causal partition $\Pi_c(T, \mathcal{I})$ of the set $\mathcal{I}$ w.r.t. behavior $T$ is the partition induced by the equivalence relation $\sim$ defined on $\mathcal{I}$ such that $i \sim j$ if and only if $P(T \mid \mathrm{man}(I = i)) = P(T \mid \mathrm{man}(I = j))$ for $i, j \in \mathcal{I}$. When the image space and the target behavior are clear from the context, we will indicate the causal partition by $\Pi_c$. A cell of a causal partition is called a causal class.

The underlying idea is that images are considered causally equivalent with respect to $T$ if they have the same causal effect on $T$. Given the causal partition of the image space, we can now define the visual cause of $T$:

Definition 4 (Visual Cause). The visual cause $C$ of a target behavior $T$ is a random variable whose value stands in a bijective relation to the causal class of $I$.

The visual cause is thus a function over $\mathcal{I}$, whose values correspond to the post-manipulation distributions $C(i) = P(T \mid \mathrm{man}(I = i))$. We will write $C(i) = c$ to indicate that the causal class of image $i \in \mathcal{I}$ is $c$, or in other words, that in image $i$, the visual cause $C$ takes value $c$. Knowing $C$ allows us to predict the effects of a visual manipulation $P(T \mid \mathrm{man}(I = i))$, as long as we have estimated $P(T \mid \mathrm{man}(I = i_k))$ for one representative $i_k$ of each causal class $k$.

2.3 THE CAUSAL COARSENING THEOREM

Our main theorem relates the causal and observational partitions for a given $\mathcal{I}$ and $T$. It turns out that in general the causal partition is a coarsening of the observational partition. That is, the causal partition aligns with the observational partition, but the observational partition may subdivide some of the causal classes.

Theorem 5 (Causal Coarsening). Among all the generative distributions of the form shown in Fig. 2 which induce a given observational partition $\Pi_o$, almost all induce a causal partition $\Pi_c$ that is a coarsening of $\Pi_o$.

Throughout this article, we use "almost all" to mean "all except for a subset of Lebesgue measure zero". Fig. 3 illustrates the relation between the causal and the observational partition implied by the theorem. We note that the measure-zero subset where $\Pi_c$ does not coarsen $\Pi_o$ can indeed be non-empty.
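The relation between the two partitions can be checked numerically on the running example. The sketch below abstracts images to the class (v, h) of bars they contain, computes the observational and the manipulated probabilities, and tests whether one partition coarsens the other. The table `P_T0`, keyed by (h-bar present, $H_1$), is an assumed reading of the example; with it, the manipulated probabilities come out as 1/6 and 5/6, matching the values .17 and .83 shown in Fig. 3. All function names are illustrative.

```python
P_H1 = {0: 0.5, 1: 0.5}
# ASSUMPTION: P(T=0 | h-bar present?, H1), inferred from the probabilities in the figures.
P_T0 = {(1, 1): 0.0, (1, 0): 1 / 3, (0, 1): 2 / 3, (0, 0): 1.0}
CLASSES = [(v, h) for v in (0, 1) for h in (0, 1)]

def observational(v, h):
    """P(T=0 | I): in observational data a v-bar is present iff H1 = 1, so H1 = v."""
    return P_T0[(h, v)]

def manipulated(v, h):
    """P(T=0 | man(I = i)): H1 keeps its prior because the manipulation leaves it untouched."""
    return sum(P_T0[(h, h1)] * P_H1[h1] for h1 in (0, 1))

def partition(prob):
    """Group image classes into cells of equal probability."""
    cells = {}
    for cls in CLASSES:
        cells.setdefault(round(prob(*cls), 9), set()).add(cls)
    return list(cells.values())

def is_coarsening(coarse, fine):
    """True if every cell of `fine` lies inside some cell of `coarse`."""
    return all(any(cell <= big for big in coarse) for cell in fine)

obs_partition = partition(observational)  # four cells
csl_partition = partition(manipulated)    # two cells: h-bar present or absent
```

In this toy model `is_coarsening(csl_partition, obs_partition)` holds while the converse does not, which is the situation the theorem describes for almost all distributions.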
We provide counter-examples, in which $\Pi_c$ does not coarsen $\Pi_o$, in Appendix 7. We prove the CCT in Appendix 6 using a technique that extends that of Meek (1995): We show that (1) restricting the space of all the possible $P(T, H, I)$ to only the distributions compatible with a fixed observational partition puts a linear constraint on the distribution space; (2) requiring that the CCT be false puts a non-trivial polynomial constraint on this subspace, and finally, (3) it follows that the theorem holds for almost all distributions that agree with the given observational partition. The proof strategy indicates a close connection between the CCT and the faithfulness assumption (Spirtes et al., 2000).

Two points are worth noting here: First, the CCT is interesting inasmuch as the visual causes of a behavior do not contain all the information in the image that predicts the behavior. Such information, though not itself a cause of

the behavior, can be informative about the state of other non-visual causes of the target behavior. Second, the CCT allows us to take any classification problem in which the data is divided into observational classes, and assume that the causal labels do not change within each observational class. This will help us develop efficient causal inference algorithms in Section 3.

Figure 3: The Causal Coarsening Theorem. The observational probabilities of $T$ given $I$ (gray frame) induce an observational partition on the space of all the images (left, observational partition in gray). The causal probabilities (red frame) induce a causal partition, indicated on the left in red. The CCT allows us to expect that the causal partition is a coarsening of the observational partition. The observational and causal probabilities correspond to the generative model shown in Fig. 1. (Values shown in the figure: observational probabilities $P(T = 0 \mid I)$ of 0, .33, .66 and 1; causal probabilities $P(T = 0 \mid do(\cdot))$ of .17 and .83.)

2.4 VISUAL CAUSES IN A CAUSAL MODEL CONSISTING OF MACRO-VARIABLES

We can now simplify our generative model by omitting all the information in $I$ unrelated to behavior $T$. Assume that the observational partition $\Pi_o(T, \mathcal{I})$ refines the causal partition $\Pi_c(T, \mathcal{I})$. Each of the causal classes $c_1, \dots, c_K$ delineates a region in the image space $\mathcal{I}$ such that all the images belonging to that region induce the same $P(T \mid \mathrm{man}(I = i))$. Each of those regions (say, the $k$-th one) can be further partitioned into sub-regions $s^k_1, \dots, s^k_{M_k}$ such that all the images in the $m$-th sub-region of the $k$-th causal region induce the same observational probability $P(T \mid I)$. By assumption, the observational partition has a finite number of classes, and we can arbitrarily order the observational classes within each causal class. Once such an ordering is fixed, we can assign an integer $m \in \{1, 2, \dots, M_k\}$ to each image $i$ belonging to the $k$-th causal class such that $i$ belongs to the $m$-th observational class among the $M_k$ observational classes contained in $c_k$.
By construction, the integer $m$ explains all the variation of the observational class within a given causal class. This suggests the following definition:

Definition 6 (Spurious Correlate). The spurious correlate $S$ is a discrete random variable whose value differentiates between the observational classes contained in any causal class.

The spurious correlate is a well-defined function on $\mathcal{I}$, whose value ranges between 1 and $\max_k M_k$. Like $C$, the spurious correlate $S$ is a macro-variable constructed from the pixels that make up the image. $C$ and $S$ together contain all and only the visual information in $I$ relevant to $T$, but only $C$ contains the causal information:

Theorem 7 (Complete Macro-variable Description). The following two statements hold for $C$ and $S$ as defined above:
1. $P(T \mid I) = P(T \mid C, S)$.
2. Any other variable $X$ such that $P(T \mid I) = P(T \mid X)$ has Shannon entropy $H(X) \ge H(C, S)$.

Figure 4: A macro-variable model of visual causation. Using our theory of visual causation we can aggregate the information present in visual micro-variables (image pixels) into the visual cause $C$ and spurious correlate $S$. According to Theorem 7, $C$ and $S$ contain all the information about $T$ available in $I$.

We prove the theorem in Appendix 8. It guarantees that $C$ and $S$ constitute the smallest-entropy macro-variables that encompass all the information about the relationship between $T$ and $I$. Fig. 4 shows the relationship between $C$, $S$ and $T$, the image space $\mathcal{I}$ and the observational and causal partitions schematically. $C$ is now a cause of $T$, $S$ correlates with $T$ due to the unobserved common causes $H_c$, and any information irrelevant to $T$ is pushed into the independent noise variables (commonly not shown in graphical representations of structural equation models).³ The macro-variable model lends itself to the standard treatment of causal graphical models described in Pearl (2009). We can define interventions on the causal variables $\{C, S, T\}$ using the standard do-operation.
³ We note that $C$ may retain predictive information about $T$ that is not causal, i.e. it is not the case that all spurious correlations can be accounted for in $S$. See Appendix 9 for an example.

The do-operator only sets the value of the intervened variable to

the desired value, making it independent of its causes, but it does not (directly) affect the other variables in the system or the relationships between them (see the modularity assumption in Pearl (2009)). However, unlike the standard case where causal variables are separated in location (e.g. smoking and lung cancer), the causal variables in an image may involve the same pixels: $C$ may be the average brightness of the image, whereas $S$ may indicate the presence or absence of particular shapes in the image. An intervention on a causal variable using the do-operator thus requires that the underlying manipulation of the image respects the state of the other causal variables:

Definition 8 (Causal Intervention on Macro-variables). Given the set of macro-variables $\{C, S\}$ that take on values $\{c, s\}$ for an image $i \in \mathcal{I}$, an intervention $do(C = c')$ on the macro-variable $C$ is given by the manipulation of the image $\mathrm{man}(I = i')$ such that $C(i') = c'$ and $S(i') = s$. The intervention $do(S = s')$ is defined analogously as the change of the underlying image that keeps the value of $C$ constant.

In some cases it can be impossible to manipulate $C$ to a desired value without changing $S$. We do not take this to be a problem special to our case. In fact, in the standard macro-variable setting of causal analysis we would expect interventions to be much more restricted by physical constraints than we are with our interventions in the image space.

3 CAUSAL FEATURE LEARNING: INFERENCE ALGORITHMS

Given the theoretical specification of the concepts of interest in the previous section, we can now develop algorithms to learn $C$, the visual cause of a behavior. In addition, knowledge of $C$ will allow us to specify a manipulator function: a function that, given any image, can return a maximally similar image with the desired causal effect.

Definition 9 (Manipulator Function). Let $C$ be the causal variable of $T$, $\mathcal{C}$ the set of its values, and $d$ a metric on $\mathcal{I}$. The manipulator function of $C$ is a function $M_C : \mathcal{I} \times \mathcal{C} \to \mathcal{I}$ such that $M_C(i, k) = \arg\min_{\hat{\imath} \in C^{-1}(k)} d(i, \hat{\imath})$ for any $i \in \mathcal{I}$, $k \in \mathcal{C}$. In case $d(i, \cdot)$
has multiple minima, we group them together into one equivalence class and leave the choice of the representative to the manipulator function.

The manipulator searches for an image closest to $i$ among all the images with the desired causal effect $k$. The meaning of "closest" depends on the metric $d$ and is discussed further in Section 3.2 below. Note that the manipulator function can find candidates for the image manipulation underlying the desired causal manipulation $do(C = c)$, but it does not check whether other variables in the system (in particular, the spurious correlate) remain in fact unchanged. Using the closest possible image with the desired causal effect is a heuristic approach to fulfilling that requirement.

Algorithm 1: Causal Predictor Training
input : $D_{obs} = \{(i_1, p_1 = p(T \mid i_1)), \dots, (i_N, p_N = p(T \mid i_N))\}$: observational data
        $P = \{P_1, \dots, P_M\}$: the set of observational classes (so that $p_k \in P$ for all $1 \le k \le N$)
        Train: a neural net training algorithm
output: $C : \mathcal{I} \to [0, 1]$: the causal variable
1 Pick $\{i_{k_1}, \dots, i_{k_M}\} \subset \{i_1, \dots, i_N\}$ s.t. $p_{k_m} = P_m$;
2 Estimate $\hat{C}_m \leftarrow P(T \mid \mathrm{man}(I = i_{k_m}))$ for each $m$;
3 For all $k$ let $\hat{C}(i_k) \leftarrow \hat{C}_m$ if $p_k = P_m$;
4 $D_{csl} \leftarrow \{(i_1, \hat{C}(i_1)), \dots, (i_N, \hat{C}(i_N))\}$;
5 $C \leftarrow \mathrm{Train}(D_{csl})$;

There are several reasons why we might want such a manipulator function:

• If our goal is to perform causal manipulations on images, the manipulator function offers an automated solution.
• A manipulator that uses a given $C$ and produces images with the desired causal effect provides strong evidence that $C$ is indeed the visual cause of the behavior.
• Using the manipulator function we can enrich our dataset with new datapoints, in hope of achieving better generalization on both the causal and predictive learning tasks.

The problem of visual causal feature learning can now be posed as follows: Given an image space $\mathcal{I}$ and a metric $d$, learn $C$, the visual cause of $T$, and the manipulator $M_C$.
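For very small image spaces, the manipulator function of Definition 9 can be computed by exhaustive search. The following sketch assumes binary images, uses the presence of an h-bar as a toy visual cause, and takes the Hamming distance as the metric $d$; these choices, and the function names, are illustrative assumptions. The search enumerates all $2^n$ images, so it is only feasible for a handful of pixels and is not the learning procedure proposed in the text.

```python
from itertools import product
import numpy as np

def causal_class(img):
    """Toy visual cause: 1 if the image contains an h-bar (a fully black row), else 0."""
    return int(np.any(np.all(img == 1, axis=1)))

def hamming(a, b):
    """Metric d: the number of pixels in which two binary images differ."""
    return int(np.sum(a != b))

def manipulator(img, k, C=causal_class, d=hamming):
    """M_C(img, k): an image closest to img under d among all images with C = k."""
    best, best_dist = None, None
    for bits in product((0, 1), repeat=img.size):  # every binary image of this shape
        cand = np.array(bits).reshape(img.shape)
        if C(cand) != k:
            continue
        dist = d(img, cand)
        if best_dist is None or dist < best_dist:
            best, best_dist = cand, dist
    return best
```

For a 3×3 image whose top row is one pixel short of an h-bar, `manipulator(img, 1)` changes exactly one pixel. When several images are equally close, the search returns the first one found, which corresponds to leaving the choice of representative to the manipulator.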
3.1 CAUSAL EFFECT PREDICTION

A standard machine learning approach to learning the relation between $I$ and $T$ would be to take an observational dataset $D_{obs} = \{(i_k, P(T \mid i_k))\}_{k=1,\dots,N}$ and learn a predictor $f$ whose training performance guarantees a low test error (so that $f(i^*) \approx P(T \mid i^*)$ for a test image $i^*$). In causal feature learning, low test error on observational data is insufficient; it is entirely possible that $D_{obs}$ contains spurious information useful in predicting test labels which is nevertheless not causal. That is, the prediction may be highly accurate for observational data, but completely inaccurate for a prediction of the effect of a manipulation of the image (recall the barometer example). However, we can use the CCT to obtain a causal dataset from the observational data, and then train a predictor on that dataset. Algorithm 1 uses this strategy to learn a function $C$ that, presented with any image $i \in \mathcal{I}$, returns $C(i) \approx P(T \mid \mathrm{man}(I = i))$. We use a fixed neural network architecture to learn $C$, but any differentiable hypothesis class could be substituted instead. Differentiability of $C$ is necessary in Section 3.2 in order to learn the manipulator function.
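The strategy of Algorithm 1 can be sketched in a few lines. In the sketch below, `experiment` stands for whatever procedure estimates $P(T \mid \mathrm{man}(I = i))$ for a single image and `train` for any supervised learner; both are hypothetical callables supplied by the user, and observational classes are identified by rounding the observational probabilities, which is an assumption made for illustration.

```python
import numpy as np

def causal_predictor_training(images, obs_probs, experiment, train, decimals=6):
    """Sketch of Algorithm 1.

    images     : sequence of N images
    obs_probs  : sequence of N observational probabilities p(T | i_k)
    experiment : callable, image -> estimate of P(T | man(I = image))   (hypothetical)
    train      : callable, list of (image, label) pairs -> predictor    (hypothetical)
    """
    # Images with the same (rounded) observational probability share an observational class.
    keys = np.round(np.asarray(obs_probs, dtype=float), decimals)
    causal_label = {}
    for img, key in zip(images, keys):
        # Steps 1-2: run one experiment on the first representative of each class.
        if key not in causal_label:
            causal_label[key] = experiment(img)
    # Steps 3-4: give every image the causal label of its observational class.
    d_csl = [(img, causal_label[key]) for img, key in zip(images, keys)]
    # Step 5: fit a predictor of the causal label.
    return train(d_csl), d_csl
```

The number of calls to `experiment` equals the number of distinct observational classes rather than the number of images, which is where the reduction in experimental effort comes from.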

In Step 1 the algorithm picks a representative member of each observational class. The CCT tells us that the causal partition coarsens the observational one. That is, in principle (ignoring sampling issues) it is sufficient to estimate $\hat{C}_m = P(T \mid \mathrm{man}(I = i_{k_m}))$ for just one image in an observational class $m$ in order to know that $P(T \mid \mathrm{man}(I = i)) = \hat{C}_m$ for any other $i$ in the same observational class. The choice of the experimental method of estimating the causal class in Step 2 is left to the user and depends on the behaving agent and the behavior in question. If, for example, $T$ represents whether the spiking rate of a recorded neuron is above a fixed threshold, estimating $P(T \mid \mathrm{man}(I = i))$ could consist of recording the neuron's response to $i$ in a laboratory setting multiple times, and then calculating the probability of spiking from the finite sample. The causal dataset created in Step 4 consists of the observational inputs and their causal classes. The causal dataset is acquired through $O(M)$ experiments, where $M$ is the number of observational classes. The final step of the algorithm trains a neural network that predicts the causal labels on unseen images. The choice of the method of training is again left to the user.

3.2 CAUSAL FEATURE MANIPULATION

Once we have learned $C$ we can use the causal neural network to create synthetic examples of images as similar as possible to the originals, but with a different causal label. The meaning of "as similar as possible" depends on the image metric $d$ (see Definition 9). The choice of $d$ is task-specific and crucial to the quality of the manipulations. In our experiments, we use a metric induced by an $L_2$ norm. Alternatives include other $L_p$-induced metrics, distances in implicit feature spaces induced by image kernels (Harchaoui and Bach, 2007; Grauman and Darrell, 2007; Bosch et al., 2007; Vishwanathan, 2010) and distances in learned representation spaces (Bengio et al., 2013).
Algorithm 2 proposes one way to learn the manipulator function using a simple manipulation procedure that approximates the requirements of Definition 9 up to local minima. The algorithm, inspired by the active learning techniques of uncertainty sampling (Lewis and Gale, 1994) and density weighing (Settles and Craven, 2008), starts off by training a causal neural network in Step 2. If only observational data is available, this can be achieved using Algorithm 1. Next, it randomly chooses a set of images to be manipulated, and their target post-manipulation causal labels. The loop that starts in Step 5 then takes each of those images and searches for the image that, among the images with the same desired causal class, is closest to the original image. Note that the causal class boundaries are defined by the current causal neural net $C$. Since $C$ is in general a highly nonlinear function and it can be hard to find its inverse sets, we use an approximate solution. The algorithm thus finds the minimum of a weighted sum of $|C(j) - \hat{c}_{l,k}|$ (the difference of the output image $j$'s label and the desired label $\hat{c}_{l,k}$) and $d(i_{l,k}, j)$ (the distance of the output image $j$ from the original image $i_{l,k}$).

Algorithm 2: Manipulator Function Learning
input : $d : \mathcal{I} \times \mathcal{I} \to \mathbb{R}_+$: a metric on the image space
        $D_{csl} = \{(i_1, c_1), \dots, (i_N, c_N)\}$: causal data
        $\mathcal{C} = \{C_1, \dots, C_M\}$: the set of causal classes (so that $c_i \in \mathcal{C}$ for all $i$)
        Train: a neural net training algorithm
        niters: number of experiment iterations
        $Q$: number of queries per iteration
        $\alpha$: manipulation tuning parameter
        $A : \mathcal{I} \to \mathcal{C}$: an oracle for $P(T \mid do(I))$
output: $M_C : \mathcal{I} \times \mathcal{C} \to \mathcal{I}$: the manipulator function
1 for $l \leftarrow 1$ to niters do
2   $C \leftarrow \mathrm{Train}(D_{csl})$;
3   Choose manipulation starting points $\{i_{l,1}, \dots, i_{l,Q}\}$ at random from $D_{csl}$;
4   Choose manipulation targets $\{\hat{c}_{l,1}, \dots, \hat{c}_{l,Q}\}$ such that $\hat{c}_{l,k} \ne c_{l,k}$;
5   for $k \leftarrow 1$ to $Q$ do
6     $\hat{\imath}_{l,k} \leftarrow \arg\min_{j \in \mathcal{I}}\, (1 - \alpha)\,|C(j) - \hat{c}_{l,k}| + \alpha\, d(j, i_{l,k})$;
7   end
8   $D_{csl} \leftarrow D_{csl} \cup \{(\hat{\imath}_{l,1}, A(\hat{\imath}_{l,1})), \dots, (\hat{\imath}_{l,Q}, A(\hat{\imath}_{l,Q}))\}$;
9 end
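The inner search of Algorithm 2 (Step 6) can be illustrated with plain gradient descent. The sketch below makes several simplifying assumptions that are not taken from the text: the causal predictor is a logistic model $C(j) = \sigma(w \cdot j + b)$ rather than a neural network, the absolute difference $|C(j) - \hat{c}|$ is replaced by a squared difference so that the objective is smooth, the distance is the squared Euclidean norm, and images are treated as real-valued vectors. The step size and iteration count are arbitrary illustrative values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(j, i, c_target, w, b, alpha):
    """(1 - alpha) * (C(j) - c_target)^2 + alpha * ||j - i||^2, with C(j) = sigmoid(w.j + b)."""
    c = sigmoid(w @ j + b)
    return (1.0 - alpha) * (c - c_target) ** 2 + alpha * np.sum((j - i) ** 2)

def manipulate(i, c_target, w, b, alpha=0.05, lr=0.2, steps=500):
    """Gradient descent on the objective above, started from the original image i."""
    j = np.asarray(i, dtype=float).copy()
    for _ in range(steps):
        c = sigmoid(w @ j + b)
        grad_label = 2.0 * (c - c_target) * c * (1.0 - c) * w  # gradient of the label term
        grad_dist = 2.0 * (j - i)                              # gradient of the distance term
        j -= lr * ((1.0 - alpha) * grad_label + alpha * grad_dist)
    return j
```

A larger `alpha` keeps the result closer to the original image at the cost of a smaller change in the predicted causal label, which is the trade-off the tuning parameter $\alpha$ controls. Like the procedure in the text, this search only finds a local minimum.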
At each iteration, the algorithm performs $Q$ manipulations and the same number of causal queries to the agent, which result in new datapoints $(\hat{\imath}_{l,1}, A(\hat{\imath}_{l,1})), \dots, (\hat{\imath}_{l,Q}, A(\hat{\imath}_{l,Q}))$. It is natural to claim that the manipulator performs well if $A(\hat{\imath}_{l,k}) \approx \hat{c}_{l,k}$ for many $k$, which means the target causal labels agree with the true causal labels. We thus define the manipulation error of the $l$-th iteration, $\mathrm{MErr}_l$, as

$$\mathrm{MErr}_l = \frac{1}{Q} \sum_{k=1}^{Q} \left| A(\hat{\imath}_{l,k}) - \hat{c}_{l,k} \right|. \tag{2}$$

While it is important that our manipulations are accurate, we also want them to be minimal. Another measure of interest is thus the average manipulation distance

$$\mathrm{MDist}_l = \frac{1}{Q} \sum_{k=1}^{Q} d(i_{l,k}, \hat{\imath}_{l,k}). \tag{3}$$

A natural variant of Algorithm 2 is to set niters to a large integer and break the loop when one or both of these performance criteria reach a desired value.
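Both performance measures are simple averages over the $Q$ manipulations of one iteration. A minimal sketch, assuming the oracle labels, target labels and images of the iteration are already collected in sequences (the function names are illustrative):

```python
import numpy as np

def manipulation_error(oracle_labels, target_labels):
    """MErr_l of Eq. (2): mean absolute gap between A(i_hat) and the target label c_hat."""
    a = np.asarray(oracle_labels, dtype=float)
    c = np.asarray(target_labels, dtype=float)
    return float(np.mean(np.abs(a - c)))

def manipulation_distance(originals, manipulated, d):
    """MDist_l of Eq. (3): mean distance d between each original and its manipulated image."""
    return float(np.mean([d(i, j) for i, j in zip(originals, manipulated)]))
```

Either value can serve as the stopping criterion mentioned above, for example by breaking the outer loop once `manipulation_error` falls below a chosen threshold.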

4 EXPERIMENTS

In order to illustrate the concepts presented in this article we perform two causal feature learning experiments. The first experiment, called GRATING, uses observational and causal data generated by the model from Section 1.2. The GRATING experiment confirms that our system can learn the ground truth cause and ignore the spurious correlates of a behavior. The second experiment, MNIST, uses images of hand-written digits (LeCun et al., 1998) to exemplify the use of the manipulator function on slightly more realistic data: in this example, we transform an image into a maximally similar image with another class label. We chose problems that are simple from the computer vision point of view. Our goal is to develop the theory of visual causal feature learning and show that it has feasible algorithmic solutions; we are at this point not engineering advanced computer vision systems.

4.1 THE GRATING EXPERIMENT

In this experiment we generate data using the model of Fig. 1, with two minor differences: $H_1$ and $H_2$ only induce one v-bar or h-bar in the image, and we restrict our observational dataset to images with only about 3% of the pixels filled with random noise (see Fig. 5). Both restrictions increase the clarity of presentation. We use Algorithms 1 and 2 (with minor modifications imposed by the binary nature of the images) to learn the visual cause of behavior $T$. Figure 5 (top) shows the progress of the training process. The first step (not shown in the figure) uses the CCT to learn the causal labels on the observational data. We then train a simple neural network (a fully connected network with one hidden layer of 100 units) on this data. The same network is used on Iteration 1 to create new manipulated exemplars. We then follow Algorithm 2 to train the manipulator iteratively. Fig. 5 (bottom) illustrates the difference between the manipulator on Iteration 1 (which fails almost 40% of the time) and Iteration 20, where the error is about 6%.
Each column shows example manipulations of a particular kind. Columns with green labels indicate successful manipulations, of which there are two kinds: switching the causal variable on ($0 \to 1$, adding the h-bar), or switching it off ($1 \to 0$, removing the h-bar). Red-labeled columns show cases in which the manipulator failed to influence the cause: that is, each red column shows an original image and its manipulated version which the manipulator believes should cause a change in $T$, but which does not induce such a change. The green/red horizontal bars show the percentage of success/error for each manipulation direction. Fig. 5 (bottom, a) shows that after training on the causally-coarsened observational dataset, the manipulator fails about 40% of the time. In Fig. 5 (b), after twenty manipulator learning iterations, only six manipulations out of a hundred are unsuccessful. Furthermore, the causally irrelevant image pixels are also much better preserved than at Iteration 1. The fully-trained manipulator correctly learned to manipulate the presence of the h-bar to cause changes in $T$, and ignores the v-bar that is strongly correlated with the behavior but does not cause it.

Figure 5: Manipulator learning for GRATING. Top. The plots show the progress of our manipulator function learning algorithm over ten iterations of experiments for the GRATING problem. The manipulation error decreases quickly with progressing iterations, whereas the manipulation distance stays close to constant. Bottom. Original and manipulated GRATING images. See text for the details.

4.2 THE MNIST ON MTURK EXPERIMENT

In this experiment we start with the MNIST dataset of handwritten digits. In our terminology, this dataset, like any standard vision dataset, is already causal data: the labels are assigned in an experimental setting, not in nature.
Consider the following binary human behavior: $T = 1$ if a human observer answers affirmatively to the question "Does this image contain the digit 7?", while $T = 0$ if the observer judges that the image does not contain the digit 7. For simplicity we will assume that for any image either $P(T = 1 \mid man(i)) = 0$ or $P(T = 1 \mid man(i)) = 1$. Our task is to learn the manipulator function that will take any image and modify it minimally such that it will become a 7 if it was not before, or will stop resembling a 7 if it did originally. We conduct the manipulator training separately for all the ten MNIST digits using human annotators on Amazon Mechanical Turk. The exact training procedure is described in Appendix 10. Fig. 6 (top) shows training progress. As in Fig. 5, the manipulation error decreases with training. Fig. 6 (bottom) visualizes the manipulator training progress. In the first row we see a randomly chosen MNIST 9 being manipulated to resemble a 0, pushed through successive 0-vs-all manipulators trained at iterations 0, 1, ..., 5 (iteration 1 shows what the neural net takes to be the closest manipulation to change the 9 to a 0 purely on the basis of the non-manipulated data). Further rows perform similar experiments for the other digits. The plots show how successive manipulators progressively remove the original digits' features and add target class features to the image.

Figure 6: Manipulator Learning for MNIST ON MTURK. Top. In contrast to the GRATING experiment, here the manipulation distance grows as the manipulation error decreases. This is because a successful manipulator needs to change significant parts of each image (such as continuous strokes). Bottom. Visualization of manipulator training on randomly selected (not cherry-picked) MNIST digits. See text for the details.

5 DISCUSSION

We provide a link between causal reasoning and neural network models that have recently enjoyed tremendous success in the fields of machine learning and computer vision (LeCun et al., 1998; Russakovsky et al., 2014).
Despite very encouraging results in image classification (Krizhevsky et al., 2012), object detection (Dollar et al., 2012) and fine-grained classification (Branson et al., 2014; Zhang et al., 2014), some researchers have found that visual neural networks can be easily fooled using adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2014). The learning procedure for our manipulator function could be viewed as an attempt to train a classifier that is robust against such examples. The procedure uses causal reasoning to improve on the boundaries of a standard, correlational classifier (Fig. 5 and 6 show the improvement). However, the ultimate purpose of a causal manipulator network is to extract truly causal features from data and automatically perform causal manipulations based on those features.

A second contribution concerns the field of causal discovery. Modern causal discovery algorithms presuppose that the set of causal variables is well-defined and meaningful. What exactly this presupposition entails is unclear, but there are clear counter-examples: $x$ and $2x$ cannot be two distinct causal variables. There are also well understood problems when causal variables are aggregates of other variables (Chu et al., 2003; Spirtes and Scheines, 2004). We provide an account of how causal macro-variables can supervene on micro-variables. This article is an attempt to clarify how one may construct a set of well-defined causal macro-variables that function as basic relata in a causal graphical model. This step strikes us as essential if causal methodology is to be successful in areas where we do not have clearly delineated candidate causes or where causes supervene on micro-variables, such as in climate science, neuroscience, economics and, in our specific case, vision.

Acknowledgements

KC's work was funded by the Qualcomm Innovation Fellowship. KC's and PP's work was supported by the ONR MURI grant N. FE would like to thank Cosma Shalizi for pointers to many relevant results this paper builds on.

6 APPENDIX: PROOF OF THE CAUSAL COARSENING THEOREM

Before we prove the Causal Coarsening Theorem, we prove its less general version in order to split the rather complex proof of the CCT into two parts. This Auxiliary Theorem can be proven using simpler techniques; however, here we deliberately use techniques that transfer directly to the proof of the CCT.

Auxiliary Theorem. Among all the generative models of the form discussed in Fig. 2 (in the main text), the subset of distributions $P(T, H, I)$ for which the causal partition is not a coarsening (proper or improper) of the observational partition has Lebesgue measure zero.

Proof. Our proof is inspired by a proof used by Meek (1995) to prove that almost all distributions compatible with a given causal graph are faithful. The proof strategy is thus first to express the proposition that, for a given distribution, the observational partition does not refine the causal partition as a polynomial equation on the space of all distributions compatible with the model. We then show that this polynomial equation is not trivial, i.e. there is at least one distribution that is not its root. By a simple algebraic lemma, this will prove the theorem.

We extend Meek's proof technique in our usage of Fubini's Theorem for the Lebesgue integral. It allows us to split the polynomial constraint into multiple different constraints along several of the distribution parameters. This allows for additional flexibility in creating useful assumptions (in our proof, the assumption that the datapoints have well-defined causal classes, but the observational class can still vary freely).

Assume that $T$ is binary and $H = (H_1, \dots, H_M)$, $I$ are discrete variables (say $|H_i| = K_i$, $|I| = N$, though $N$ can be very large; we will use the notation $K \triangleq K_1 \times \dots \times K_M$ for simplicity later on). The discreteness assumption is not crucial, but will simplify the reasoning. We can factorize the joint as $P(T, H, I) = P(T \mid H, I)\,P(I \mid H)\,P(H)$.
$P(T \mid H, I)$ can be parametrized by $|H_1| \times \dots \times |H_M| \times |I| = K \times N$ parameters, $P(I \mid H)$ by $(N - 1) \times K$ parameters, and $P(H)$ by another $K$ parameters, all of which are independent. Call the parameters, respectively,

$$\alpha_{h,i} \triangleq P(T = 0 \mid H = h, I = i), \qquad \beta_{i,h} \triangleq P(I = i \mid H = h), \qquad \gamma_h \triangleq P(H = h).$$

We will denote parameter vectors as

$$\alpha = (\alpha_{h_1,i_1}, \dots, \alpha_{h_K,i_N}) \in \mathbb{R}^{K \times N}, \qquad \beta = (\beta_{i_1,h_1}, \dots, \beta_{i_{N-1},h_K}) \in \mathbb{R}^{(N-1) \times K}, \qquad \gamma = (\gamma_{h_1}, \dots, \gamma_{h_K}) \in \mathbb{R}^{K},$$

where the indices are arranged in lexicographical order. This creates a one-to-one correspondence of each possible joint distribution $P(T, H, I)$ with a point $(\alpha, \beta, \gamma) \in \mathcal{P}[\alpha, \beta, \gamma] \subset \mathbb{R}^{K^3 \times N \times (N-1)}$, where $\mathcal{P}[\alpha, \beta, \gamma]$ is the $K^3 \times N \times (N-1)$-dimensional simplex of multinomial distributions.

To proceed with the proof, we first pick any point in the $P(T \mid H, I) \times P(H)$ space: that is, we fix the values of $\alpha$ and $\gamma$. The only free parameters are now $\beta_{i,h}$ for all values of $i, h$; varying these values creates a subset of the space of all the distributions which we will call

$$\mathcal{P}[\beta; \alpha, \gamma] = \{(\alpha, \beta, \gamma) \mid \beta \in [0, 1]^{(N-1) \times K}\}.$$

$\mathcal{P}[\beta; \alpha, \gamma]$ is a subset of $\mathcal{P}[\alpha, \beta, \gamma]$ isometric to the $[0, 1]^{(N-1) \times K}$-dimensional simplex of multinomials. We will use the term $\mathcal{P}[\beta; \alpha, \gamma]$ to refer both to the subset of $\mathcal{P}[\alpha, \beta, \gamma]$ and to the lower-dimensional simplex it is isometric to, remembering that the latter comes equipped with the Lebesgue measure on $\mathbb{R}^{(N-1) \times K}$.

Now we are ready to show that the subset of $\mathcal{P}[\beta; \alpha, \gamma]$ which does not satisfy the Causal Coarsening constraint is of measure zero with respect to the Lebesgue measure. To see this, first note that since $\alpha$ and $\gamma$ are fixed, each image $i$ has a well-defined causal class $C(i) = \sum_h \alpha_{h,i}\gamma_h$. The Causal Coarsening constraint says "For every pair of images $i, j$ such that $P(T \mid i) = P(T \mid j)$ it holds that $C(i) = C(j)$." The subset of $\mathcal{P}[\beta; \alpha, \gamma]$ of all distributions that do not satisfy the constraint consists of the $P(T, H, I)$ for which, for some $i, j$, it holds that $P(T = 0 \mid i) = P(T = 0 \mid j)$ and $C(i) \neq C(j)$. Take any pair $i, j$ for which $C(i) \neq C(j)$ (if such a pair does not exist, then the Causal Coarsening constraint holds for all the distributions in $\mathcal{P}[\beta; \alpha, \gamma]$).
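We can write

$$P(T = 0 \mid i) = \sum_h P(T = 0 \mid h, i)\,P(h \mid i) = \frac{1}{P(i)} \sum_h P(T = 0 \mid h, i)\,P(i \mid h)\,P(h).$$

Since the same equation applies to $P(T = 0 \mid j)$, the constraint $P(T \mid i) = P(T \mid j)$ can be rewritten

$$\frac{1}{P(i)} \sum_h P(T = 0 \mid h, i)\,P(i \mid h)\,P(h) = \frac{1}{P(j)} \sum_h P(T = 0 \mid h, j)\,P(j \mid h)\,P(h)$$

$$P(j) \sum_h P(T = 0 \mid h, i)\,P(i \mid h)\,P(h) - P(i) \sum_h P(T = 0 \mid h, j)\,P(j \mid h)\,P(h) = 0,$$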

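which we can rewrite in terms of the independent parameters (after defining $\alpha_{0,h,i} = \alpha_{h,i}$ and $\alpha_{1,h,i} = 1 - \alpha_{h,i}$) and further simplify as

$$\sum_{t \in \{0,1\}} \sum_h \alpha_{t,h,j}\gamma_h\beta_{j,h} \sum_h \alpha_{0,h,i}\gamma_h\beta_{i,h} - \sum_{t \in \{0,1\}} \sum_h \alpha_{t,h,i}\gamma_h\beta_{i,h} \sum_h \alpha_{0,h,j}\gamma_h\beta_{j,h} = 0$$

$$\Big(\sum_h \alpha_{1,h,j}\gamma_h\beta_{j,h}\Big)\Big(\sum_h \alpha_{0,h,i}\gamma_h\beta_{i,h}\Big) - \Big(\sum_h \alpha_{1,h,i}\gamma_h\beta_{i,h}\Big)\Big(\sum_h \alpha_{0,h,j}\gamma_h\beta_{j,h}\Big) = 0$$

$$\Big(\sum_h (1 - \alpha_{h,j})\gamma_h\beta_{j,h}\Big)\Big(\sum_h \alpha_{h,i}\gamma_h\beta_{i,h}\Big) - \Big(\sum_h (1 - \alpha_{h,i})\gamma_h\beta_{i,h}\Big)\Big(\sum_h \alpha_{h,j}\gamma_h\beta_{j,h}\Big) = 0$$

$$\Big(\sum_h \gamma_h\beta_{j,h}\Big)\Big(\sum_h \alpha_{h,i}\gamma_h\beta_{i,h}\Big) - \Big(\sum_h \gamma_h\beta_{i,h}\Big)\Big(\sum_h \alpha_{h,j}\gamma_h\beta_{j,h}\Big) = 0, \tag{4}$$

which is a polynomial constraint on $\mathcal{P}[\beta; \alpha, \gamma]$ (note that to keep the notation manageable, we have omitted the dependent term $1 - \sum_h \gamma_h$ from the equations). By a simple algebraic lemma (proven by Okamoto, 1973), if the above constraint is not trivial (that is, if there exists $\beta$ for which the constraint does not hold), the subset of $\mathcal{P}[\beta; \alpha, \gamma]$ on which it holds is measure zero. To see that Eq. (4) does not always hold, note that if for any $h^*$ we set $\beta_{i,h^*} = 1$ (and thus $\beta_{i,h} = 0$ for any $h \neq h^*$) and $\beta_{j,h^*} = 1$, the equation reduces to

$$(\gamma_{h^*})^2\,(\alpha_{h^*,i} - \alpha_{h^*,j}) = 0.$$

Thus if Eq. (4) was trivially true, we would have $\alpha_{h^*,i} = \alpha_{h^*,j}$ or $\gamma_{h^*} = 0$ for all $h^*$. However, this implies $C(i) = C(j)$, which contradicts our assumption.

We have now shown that the subset of $\mathcal{P}[\beta; \alpha, \gamma]$ which consists of distributions for which $P(T \mid i) = P(T \mid j)$ (even though $C(i) \neq C(j)$) is Lebesgue measure zero. Since there are only finitely many pairs of images $i, j$ for which $C(i) \neq C(j)$, the subset of $\mathcal{P}[\beta; \alpha, \gamma]$ of distributions which violate the Causal Coarsening constraint is also Lebesgue measure zero.

The remainder of the proof is a direct application of Fubini's theorem. For each $\alpha, \gamma$, call the (measure zero) subset of $\mathcal{P}[\beta; \alpha, \gamma]$ that violates the Causal Coarsening constraint $z[\alpha, \gamma]$. Let $Z = \bigcup_{\alpha,\gamma} z[\alpha, \gamma] \subset \mathcal{P}[\alpha, \beta, \gamma]$ be the set of all the joint distributions which violate the Causal Coarsening constraint. We want to prove that $\mu(Z) = 0$, where $\mu$ is the Lebesgue measure. To show this, we will use the indicator function

$$\hat{z}(\alpha, \beta, \gamma) = \begin{cases} 1 & \text{if } \beta \in z[\alpha, \gamma], \\ 0 & \text{otherwise.} \end{cases}$$

By the basic properties of positive measures we have

$$\mu(Z) = \int_{\mathcal{P}[\alpha,\beta,\gamma]} \hat{z}\, d\mu.$$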
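It is a standard application of Fubini's Theorem for the Lebesgue integral to show that the integral in question equals zero. For simplicity of notation, let

$$\mathcal{A} = \mathbb{R}^{K \times N}, \qquad \mathcal{B} = \mathbb{R}^{N \times K}, \qquad \mathcal{G} = \mathbb{R}^{K}.$$

We have

$$\begin{aligned} \int_{\mathcal{P}[\alpha,\beta,\gamma]} \hat{z}\, d\mu &= \int_{\mathcal{A} \times \mathcal{B} \times \mathcal{G}} \hat{z}(\alpha, \beta, \gamma)\, d(\alpha, \beta, \gamma) \\ &= \int_{\mathcal{A} \times \mathcal{G}} \int_{\mathcal{B}} \hat{z}(\alpha, \beta, \gamma)\, d(\beta)\, d(\alpha, \gamma) \\ &= \int_{\mathcal{A} \times \mathcal{G}} \mu(z[\alpha, \gamma])\, d(\alpha, \gamma) \qquad (5) \\ &= \int_{\mathcal{A} \times \mathcal{G}} 0\, d(\alpha, \gamma) \\ &= 0. \end{aligned}$$

Equation (5) follows as $\hat{z}$ restricted to $\mathcal{P}[\beta; \alpha, \gamma]$ is the indicator function of $z[\alpha, \gamma]$. This completes the proof that $Z$, the set of joint distributions over $T$, $H$ and $I$ that violate the Causal Coarsening constraint, is measure zero.

We are now ready to prove the main theorem.

Theorem (Causal Coarsening Theorem). Among all the generative models of the form discussed in Fig. 2 (in the main text) that have distributions $P(T, H, I)$ that induce some given observational partition $\Pi_o$, almost all induce a causal partition $\Pi_c$ that is a coarsening of $\Pi_o$.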
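The measure-zero claim of the Auxiliary Theorem can be probed numerically. The Python sketch below is an illustration rather than part of the proof: it draws random parameters $(\alpha, \beta, \gamma)$, computes the causal class $C(i) = \sum_h \alpha_{h,i}\gamma_h$ and the observational probability $P(T = 0 \mid i)$ for every image, and counts how often two images share an observational probability while differing in causal class. The sizes `K` and `N`, the tolerance, the sampling distributions and all function names are assumptions made for the example.

```python
import numpy as np


def causal_and_observational(alpha, beta, gamma):
    """alpha[h, i] = P(T=0 | h, i), beta[i, h] = P(i | h), gamma[h] = P(h)."""
    causal = gamma @ alpha                                   # C(i)
    joint_t0 = np.einsum("hi,ih,h->i", alpha, beta, gamma)   # P(T=0, i)
    p_image = beta @ gamma                                   # P(i)
    return causal, joint_t0 / p_image                        # C(i), P(T=0 | i)


def violates_coarsening(causal, observational, tol=1e-9):
    """True if some pair i, j has equal P(T=0 | .) but different C(.)."""
    same_obs = np.abs(observational[:, None] - observational[None, :]) < tol
    diff_causal = np.abs(causal[:, None] - causal[None, :]) > tol
    return bool(np.any(same_obs & diff_causal))


rng = np.random.default_rng(0)
K, N = 3, 5          # number of joint values of H and number of images
violations = 0
for _ in range(1000):
    alpha = rng.uniform(size=(K, N))
    beta = rng.dirichlet(np.ones(N), size=K).T   # each column sums to one
    gamma = rng.dirichlet(np.ones(K))
    causal, observational = causal_and_observational(alpha, beta, gamma)
    violations += violates_coarsening(causal, observational)
```

Randomly drawn parameters are expected to give pairwise distinct observational probabilities, in which case the constraint holds vacuously. Violations require parameters tuned to satisfy a polynomial equality of the kind given in Eq. (4).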

Proof. Any variables that appear in this proof without definition are defined in the proof of the Auxiliary Theorem. We take the same $\alpha, \beta, \gamma$ parametrization of distributions. Fixing an observational partition means fixing a set of observational constraints (OCs)

$$\begin{aligned} P(T \mid i^1_1) = \dots &= P(T \mid i^1_{N_1}), \\ &\;\;\vdots \\ P(T \mid i^L_1) = \dots &= P(T \mid i^L_{N_L}), \end{aligned}$$

where $1 \le L \le N$ is the number of observational classes and $N_l$ is the number of images in class $l$. Since $P(T, H, I) = P(H \mid T, I)\,P(T \mid I)\,P(I)$, $P(T \mid i)$ is an independent parameter in the unrestricted $P(T, H, I)$, and the OCs reduce the number of independent parameters of the joint by $\sum_{l=1}^{L} (N_l - 1)$. We want to express this parameter-space reduction in terms of the $\alpha$, $\beta$ and $\gamma$ parameterization and then apply the proof of the Auxiliary Theorem. To do this, for each observational class $l$, choose a representative image $\hat{\imath}_l$ such that $P(T \mid i^l_m) = P(T \mid \hat{\imath}_l)$ for all $m \in 1, \dots, N_l$. Then for each $i^l_m \neq \hat{\imath}_l$ it holds that

$$P(T, i^l_m) = P(T \mid \hat{\imath}_l)\,P(i^l_m) \quad\text{or}\quad \sum_h P(T, h, i^l_m) = P(T \mid \hat{\imath}_l) \sum_h P(h, i^l_m).$$

Picking an arbitrary $h_0$, we can separate the left-hand side as

$$P(T, h_0, i^l_m) = P(T \mid \hat{\imath}_l) \sum_h P(h, i^l_m) - \sum_{h \neq h_0} P(T, h, i^l_m).$$

Finally, taking $T = 0$ (the value for which $\alpha$ is defined), this equation can be rewritten in terms of $\alpha$, $\beta$ and $\gamma$ as

$$\alpha_{h_0,i^l_m}\,\beta_{i^l_m,h_0}\,\gamma_{h_0} = P(T = 0 \mid \hat{\imath}_l) \sum_h \beta_{i^l_m,h}\gamma_h - \sum_{h \neq h_0} \alpha_{h,i^l_m}\,\beta_{i^l_m,h}\,\gamma_h,$$

or

$$\alpha_{h_0,i^l_m} = \frac{P(T = 0 \mid \hat{\imath}_l) \sum_h \beta_{i^l_m,h}\gamma_h - \sum_{h \neq h_0} \alpha_{h,i^l_m}\,\beta_{i^l_m,h}\,\gamma_h}{\beta_{i^l_m,h_0}\,\gamma_{h_0}}$$

for any $i^l_m \neq \hat{\imath}_l$. There are precisely $\sum_{l=1}^{L} (N_l - 1)$ such equations, altogether equivalent to the observational constraints. Thus we can express any $P(T, H, I)$ distribution that is consistent with a given observational partition in terms of the full range of $\beta$ and $\gamma$ parameters, and a restricted number of independent $\alpha$ parameters. The rest of the proof now follows similarly to the proof of the Auxiliary Theorem and shows that, within this restricted parameter space, the set of parameters for which the (fixed) observational partition is not a refinement of the causal partition has measure zero.

7 APPENDIX: CCT EXAMPLES AND COUNTER-EXAMPLES

In Fig. 7 we provide examples of three distributions over binary variables $H$, $T$ and three-valued $I$.
The first model induces a causal partition that is a proper coarsening of the observational partition, and thus agrees with the CCT. The second model induces an observational partition that is a proper coarsening of the causal partition. The CCT implies that this is a measure-zero case and that, after fixing the observational partition, we had to carefully tweak the parameters to align the causal partition as it is. The third model induces causal and observational partitions that are incompatible, that is, neither is a coarsening of the other. This is also a measure-zero case. We provide a Tetrad (http://…) file that contains these three models at http://vision.caltech.edu/~kchalupk/code.html. It can be used to verify our observational and causal partition computations.

8 APPENDIX: PROOF OF THE COMPLETE MACRO-VARIABLE DESCRIPTION THEOREM

Theorem (Complete Macro-variable Description). The following two statements hold for $C$ and $S$ as defined in the main text:

1. $P(T \mid I) = P(T \mid C, S)$.
2. Any other variable $X$ such that $P(T \mid I) = P(T \mid X)$ has Shannon entropy $H(X) \ge H(C, S)$.

Proof. The first part follows by construction of $S$. For the second part, note that by the CCT there is a bijective correspondence between the pairs of values $(c, s)$ and the observational probabilities $P(T \mid I)$. Call this correspondence $f$, that is, $f(c, s) = P(T \mid c, s)$ and $f^{-1}(p) = (c, s)$ such that $P(T \mid c, s) = p$. Further, define $g$ as the function on $X$ with $g\colon x \mapsto P(T \mid x)$. Since $P(T \mid X) = P(T \mid I)$, we have $(c, s) = f^{-1}(g(x))$. That is, the value of $C$ and $S$ is a function of the value of $X$, and thus the entropy of $C$ and $S$ is no larger than the entropy of $X$.

9 APPENDIX: PREDICTIVE NON-CAUSAL INFORMATION IN CAUSAL VARIABLE C

In some cases $C$ retains predictive information that is not causal. Consider the following example: we have a causal graph consisting of three variables $\{I, T, H\}$ where the causal relations are $I \to T$ and $I \leftarrow H \to T$. All three variables are binary and we have a positive distribution over
