arXiv: v2 [stat.ML] 4 Jun 2015
Visual Causal Feature Learning

Krzysztof Chalupka
Computation and Neural Systems
California Institute of Technology
Pasadena, CA, USA

Pietro Perona
Electrical Engineering
California Institute of Technology
Pasadena, CA, USA

Frederick Eberhardt
Humanities and Social Sciences
California Institute of Technology
Pasadena, CA, USA

Abstract

We provide a rigorous definition of the visual cause of a behavior that is broadly applicable to the visually driven behavior in humans, animals, neurons, robots and other perceiving systems. Our framework generalizes standard accounts of causal learning to settings in which the causal variables need to be constructed from micro-variables. We prove the Causal Coarsening Theorem, which allows us to gain causal knowledge from observational data with minimal experimental effort. The theorem provides a connection to standard inference techniques in machine learning that identify features of an image that correlate with, but may not cause, the target behavior. Finally, we propose an active learning scheme to learn a manipulator function that performs optimal manipulations on the image to automatically identify the visual cause of a target behavior. We illustrate our inference and learning algorithms in experiments based on both synthetic and real data.

1 INTRODUCTION

Visual perception is an important trigger of human and animal behavior. The visual cause of a behavior can be easy to define, say, when a traffic light turns green, or quite subtle: apparently it is the increased symmetry of features that leads people to judge faces more attractive than others (Grammer and Thornhill, 1994). Significant scientific and economic effort is focused on visual causes in advertising, entertainment, communication, design, medicine, robotics and the study of human and animal cognition. Visual causes profoundly influence our daily activity, yet our understanding of what constitutes a visual cause lacks a theoretical basis.
In practice, it is well-known that images are composed of millions of variables (the pixels) but it is functions of the pixels (often called "features") that have meaning, rather than the pixels themselves. We present a theoretical framework and inference algorithms for visual causes in images. A visual cause is defined (more formally below) as a function (or feature) of raw image pixels that has a causal effect on the target behavior of a perceiving system of interest. We present three advances:

- We provide a definition of the visual cause of a target behavior as a macro-variable that is constructed from the micro-variables (pixels) that make up the image space. The visual cause is distinguished from other macro-variables in that it contains all the causal information about the target behavior that is available in the image. We place the visual cause within the standard framework of causal graphical models (Spirtes et al., 2000; Pearl, 2009), thereby contributing to an account of how to construct causal variables.
- We prove the Causal Coarsening Theorem (CCT), which shows how observational data can be used to learn the visual cause with minimal experimental effort. It connects the present results to standard classification tasks in machine learning.
- We describe a method to learn the manipulator function, which automatically performs perceptually optimal manipulations on the visual causes.

We illustrate our ideas using synthetic and real-data experiments. Python code that implements our algorithms, as well as reproduces some of the experimental results, is available online at http://vision.caltech.edu/~kchalupk/code.html.

We chose to develop the theory within the context of visual causes as this setting makes the definitions most intuitive and is itself of significant practical interest. However, the framework and results can be equally well applied to extract causal information from any aggregate of micro-variables on which manipulations are possible.
Examples include auditory, olfactory and other sensory stimuli; high-dimensional neural recordings; market data in finance; consumer data in marketing. There, causal feature learning is of both theoretical ("What is the cause?") and practical ("Can we automatically manipulate it?") importance.
1.1 PREVIOUS WORK

Our framework extends the theory of causal graphical models (Spirtes et al., 2000; Pearl, 2009) to a setting in which the input data consists of raw pixel (or other micro-variable) data. In contrast to the standard setting, in which the macro-variables in the statistical dataset already specify the candidate causal relata, the causal variables in our setting have to be constructed from the micro-variables they supervene on, before any causal relations can be established. We emphasize the difference between our method of causal feature learning and methods for causal feature selection (Guyon et al., 2007; Pellet and Elisseeff, 2008). The latter choose the best (under some causal criterion) features from a restricted set of plausible macro-variable candidates. In contrast, our framework efficiently searches the whole space of all the possible macro-variables that can be constructed from an image.

Our approach derives its theoretical underpinnings from computational mechanics (Shalizi and Crutchfield, 2001; Shalizi, 2001), but supports a more explicitly causal interpretation by incorporating the possibility of confounding and interventions. Since we allow for unmeasured common causes of the features in the image and the target behavior, we have to distinguish between the plain conditional probability distribution of the target behavior ($T$) given the (observed) image ($I$) and the distribution of the target behavior given that the observed image was manipulated (i.e. $P(T \mid I)$ vs. $P(T \mid \mathrm{do}(I))$). Hoel et al. (2013), who develop a similar model to investigate the relationship between causal micro- and macro-variables, avoid this distinction by assuming that all their data was generated from what in our setting would be the manipulated distribution $P(T \mid \mathrm{do}(I))$. We take the distinction between interventional and observational distributions to be one of the key features of a causal analysis.
The extant literature on causal learning from image or video data does not generally consider the aggregation from pixel variables into causal macro-variables, but instead starts from annotated or pre-defined features of the image (see e.g. Fire and Zhu (2013a,b)).

1.2 CAUSAL FEATURE LEARNING: AN EXAMPLE

Fig. 1 presents a paradigmatic case study in visual causal feature learning, which we will use as a running example. The contents of an image $I$ are caused by external, non-visual binary hidden variables $H_1$ and $H_2$ such that if $H_1$ is on, $I$ contains a vertical bar (v-bar¹) at a random position, and if $H_2$ is on, $I$ contains a horizontal bar (h-bar) at a random position. A target behavior $T \in \{0, 1\}$ is caused by $H_1$ and $I$, such that $T = 1$ is more likely whenever $H_1 = 1$ and whenever the image contains an h-bar.

¹ We take a v-bar (h-bar) to consist of a complete column (row) of black pixels.

We deliberately constructed this example such that the visual cause is clearly identifiable: manipulating the presence of an h-bar in the image will influence the distribution of $T$. Thus, we can call the following function $C\colon \mathcal{I} \to \{0, 1\}$ the causal feature of $I$ or the visual cause of $T$:

$$C(I) = \begin{cases} 1 & \text{if } I \text{ contains an h-bar} \\ 0 & \text{otherwise.} \end{cases}$$

The presence of a v-bar, on the other hand, is not a causal feature. Manipulating the presence of a v-bar in the image has no effect on $H_1$ or $T$. Still, the presence of a v-bar is as strongly correlated with the value of $T$ (via the common cause $H_1$) as the presence of an h-bar is. We will call the following function $S\colon \mathcal{I} \to \{0, 1\}$ the spurious correlate of $T$ in $I$:

$$S(I) = \begin{cases} 1 & \text{if } I \text{ contains a v-bar} \\ 0 & \text{otherwise.} \end{cases}$$

Both the presence of h-bars and the presence of v-bars are good individual (and even better joint) predictors of the target variable, but only one of them is a cause. Identifying the visual cause from the image thus requires the ability to distinguish among the correlates of the target variables those that are actually causal, even if the non-causal correlates are (possibly more strongly) correlated with the target.
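The two feature functions of the running example can be written down directly. The following is a minimal sketch, assuming that an image is a 2-D NumPy array in which black pixels are encoded as 1 (an encoding chosen here for illustration; the text does not fix one). The helper names `has_h_bar` and `has_v_bar` are hypothetical.

```python
import numpy as np

def has_h_bar(image: np.ndarray) -> bool:
    """True if at least one complete row consists of black (1) pixels."""
    return bool(np.any(np.all(image == 1, axis=1)))

def has_v_bar(image: np.ndarray) -> bool:
    """True if at least one complete column consists of black (1) pixels."""
    return bool(np.any(np.all(image == 1, axis=0)))

def C(image: np.ndarray) -> int:
    """Visual cause of the running example: presence of an h-bar."""
    return int(has_h_bar(image))

def S(image: np.ndarray) -> int:
    """Spurious correlate of the running example: presence of a v-bar."""
    return int(has_v_bar(image))

# A 5x5 image with a single horizontal bar in row 2.
h_only = np.zeros((5, 5), dtype=int)
h_only[2, :] = 1

# A 5x5 image with a single vertical bar in column 4.
v_only = np.zeros((5, 5), dtype=int)
v_only[:, 4] = 1

print(C(h_only), S(h_only))
print(C(v_only), S(v_only))
```

Both functions are deterministic features of the pixels; nothing in the pixel data alone reveals which of them is causal.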
[Figure 1 graphic: graphical model with hidden variables $H_1$ and $H_2$, image $I$ and behavior $T$. $P(H_1 = 0) = P(H_2 = 0) = 0.5$; each $P(I \mid H_1, H_2)$ is uniform over the corresponding class of images; $P(T = 0 \mid I, H_1)$ takes the values 0, 0.33, 0.66 and 1.]

Figure 1: Our case study generative model. Two binary hidden (non-visual) variables $H_1$ and $H_2$ toss unbiased coins. The content of the image $I$ depends on these variables as follows. If $H_1 = H_2 = 0$, $I$ is chosen uniformly at random from all the images containing no v-bars and no h-bars. If $H_1 = 0$ and $H_2 = 1$, $I$ is chosen uniformly at random from all images containing at least one h-bar but no v-bars. If $H_1 = 1$ and $H_2 = 0$, $I$ is chosen uniformly at random from all the images containing at least one v-bar but no h-bars. Finally, if $H_1 = H_2 = 1$, $I$ is chosen from images containing at least one v-bar and at least one h-bar. The distribution of the binary behavior $T$ depends only on the presence of an h-bar in $I$ and the value of $H_1$. In observational studies, $H_1 = 1$ iff $I$ contains a v-bar. However, a manipulation of any specific image $I = i$ that introduces a v-bar (without changing $H_1$) will in general not change the probability of $T$ occurring. Thus, $T$ does not depend causally on the presence of v-bars in $I$.

While the values of $S$ and $C$ in our example stand in a bijective correspondence to the values of $H_1$ and $H_2$, respectively, this is only to keep the illustration simple. In general, the visual cause and the spurious correlate can be probabilistic functions of any number of (not necessarily independent) hidden variables, and can share the same hidden causes.

2 A THEORY OF VISUAL CAUSAL FEATURES

In our example the identification of the visual cause with the presence of an h-bar is intuitively obvious, as the model is constructed to have an easily describable visual cause. But the example does not provide a theoretical account of what it takes to be a visual cause in the general case when we do not know what the causally relevant pixel configurations are. In this section, we provide a general account of how the visual cause is related to the pixel data.

2.1 VISUAL CAUSES AS MACRO-VARIABLES

A visual cause is a high-level random variable that is a function (or feature) of the image, which in turn is defined by the random micro-variables that determine the pixel values. The functional relation between the image and the visual cause is, in general, surjective, though in principle it could be bijective. While we are interested in identifying the visual causes of a target behavior, the functional relation between the image pixels and the visual cause should not itself be interpreted as causal. Pixels do not cause the features of an image, they constitute them, just as the atoms of a table constitute the table (and its features). The difference between the causal and the constitutive relation is that the former requires the possibility of independent manipulation (at least to some extent), whereas by definition one cannot manipulate the visual cause without manipulating the image pixels. The probability distribution over the visual cause is induced by the probability distribution over the pixels in the image and the functional mapping from the image to the visual cause.
But since a visual cause stands in a constitutive relation with the image, we cannot without further explanation describe interventions on the visual cause in terms of the standard do-operation (Pearl, 2009). Our goal will be to define a macro-variable $C$, which contains all the causal information available in an image about a given behavior $T$, and define its manipulation. To make the problem approachable, we introduce two (natural) assumptions about the causal relation between the image and the behavior: (i) the value of the target behavior $T$ is determined subsequently to the image in time, and (ii) the variable $T$ is in no way represented in the image. These assumptions exclude the possibility that $T$ is a cause of features in the image or that $T$ can be seen as causing itself.

2.2 GENERATIVE MODELS: FROM MICRO- TO MACRO-VARIABLES

Let $T \in \{0, 1\}$ represent a target behavior.² Let $\mathcal{I}$ be a discrete space of all the images that can influence the target behavior (in our experiments in Section 4, $\mathcal{I}$ is the space of n-dimensional black-and-white images). We use the following generative model to describe the relation between the images and the target behavior: An image is generated by a finite set of unobserved discrete variables $H_1, \ldots, H_m$ (we write $H$ for short). The target behavior is then determined by the image and possibly a subset of variables $H_c \subseteq H$ that are confounders of the image and the target behavior:

$$P(T, I) = \sum_{H} P(T \mid I, H)\, P(I \mid H)\, P(H) = \sum_{H} P(T \mid I, H_c)\, P(I \mid H)\, P(H). \tag{1}$$

Independent noise that may contribute to the target behavior is marginalized and omitted for the sake of simplicity in the above equation. The noise term incorporates any hidden variables which influence the behavior but stand in no causal relation to the image. Such variables are not directly relevant to the problem. Fig. 2 shows this generative model.
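As an illustration of Eq. (1), the sum over the hidden variables can be evaluated exactly for the running example of Fig. 1. The sketch below works at the level of image classes (v-bar present or not, h-bar present or not) rather than pixels, which is enough here because $P(I \mid H)$ is uniform within a class and $T$ depends on the image only through the presence of an h-bar. The assignment of the four values 0, 0.33, 0.66 and 1 to the combinations of h-bar and $H_1$ is an assumption made for this sketch; it is chosen to agree with the statement that $T = 1$ is more likely when $H_1 = 1$ and when an h-bar is present.

```python
from itertools import product

# Priors of the two hidden coins, as given in Fig. 1.
P_H1 = {0: 0.5, 1: 0.5}
P_H2 = {0: 0.5, 1: 0.5}

# ASSUMPTION (illustrative assignment): P(T = 0 | h-bar present?, H1),
# keyed by (h_bar, h1). In this example the only confounder is H_c = {H1}.
P_T0 = {(1, 1): 0.0, (1, 0): 0.33, (0, 1): 0.66, (0, 0): 1.0}

def joint_T_I():
    """Evaluate Eq. (1) over image classes (v_bar, h_bar)."""
    joint = {}
    for h1, h2 in product((0, 1), repeat=2):
        p_h = P_H1[h1] * P_H2[h2]
        # H1 = 1 produces a v-bar and H2 = 1 produces an h-bar, so
        # P(class | H) is 1 for this class and 0 for all others.
        image_class = (h1, h2)
        p_t0 = P_T0[(h2, h1)]
        for t, p_t in ((0, p_t0), (1, 1.0 - p_t0)):
            key = (t, image_class)
            joint[key] = joint.get(key, 0.0) + p_t * p_h
    return joint

joint = joint_T_I()
classes = sorted({cls for _, cls in joint})

# Observational conditional P(T = 0 | I) for each image class.
p_t0_given_i = {
    cls: joint[(0, cls)] / (joint[(0, cls)] + joint[(1, cls)])
    for cls in classes
}

for cls in classes:
    print("v_bar=%d h_bar=%d  P(T=0|I)=%.2f" % (cls[0], cls[1], p_t0_given_i[cls]))
```

Under this assignment the four image classes receive four different observational probabilities, so each of them forms its own observational class.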
[Figure 2 graphic: graphical model in which the hidden variables $H = (H_1, \ldots, H_N)$ cause the image $I$, the image $I$ causes $T$, and the confounders $H_c = (H_2, H_N)$ also cause $T$.]

Figure 2: A general model of visual causation. In our model each image $I$ is caused by a number of hidden non-visual variables $H_i$, which need not be independent. The image itself is the only observed cause of a target behavior $T$. In addition, a (not necessarily proper) subset of the hidden variables can be a cause of the target behavior. These confounders create visual spurious correlates of the behavior in $I$.

² An extension of the framework to non-binary, discrete $T$ is easy but complicates the notation significantly. An extension to the continuous case is beyond the scope of this article.

Under this model, we can define an observational partition of the space of images $\mathcal{I}$ that groups images into classes that have the same conditional probability $P(T \mid I)$:

Definition 1 (Observational Partition, Observational Class). The observational partition $\Pi_o(T, \mathcal{I})$ of the set $\mathcal{I}$ w.r.t. behavior $T$ is the partition induced by the equivalence relation $\sim$ such that $i \sim j$ if and only if $P(T \mid I = i) = P(T \mid I = j)$. We will denote it as $\Pi_o$ when the context is clear. A cell of an observational partition is called an observational class.

In standard classification tasks in machine learning, the observational partition is associated with class labels. In our case, two images that belong to the same cell of the observational partition assign equal predictive probability to the target behavior. Thus, knowing the observational class of an image allows us to predict the value of $T$. However, the predictive probability assigned to an image does not tell us the causal effect of the image on $T$. For example, a barometer is widely taken to be an excellent predictor of the weather. But changing the barometer needle does not cause an improvement of the weather. It is not a (visual or otherwise) cause of the weather. In contrast, seeing a particular barometer reading may well be a visual cause of whether we pack an umbrella.

Our notion of a visual cause depends on the ability to manipulate the image.

Definition 2 (Visual Manipulation). A visual manipulation is the operation $\mathrm{man}(I = i)$ that changes (the pixels of) the image to image $i \in \mathcal{I}$, while not affecting any other variables (such as $H$ or $T$). That is, the manipulated probability distribution of the generative model in Eq. (1) is given by

$$P(T \mid \mathrm{man}(I = i)) = \sum_{H_c} P(T \mid I = i, H_c)\, P(H_c).$$

The manipulation changes the values of the image pixels, but does not change the underlying "world", represented in our model by the $H_i$ that generated the image. Formally, the manipulation is similar to the do-operator for standard causal models. However, we here reserve the do-operation for interventions on causal macro-variables, such as the visual cause of $T$. We discuss the distinction in more detail below.

We can now define the causal partition of the image space (with respect to the target behavior $T$) as:

Definition 3 (Causal Partition, Causal Class). The causal partition $\Pi_c(T, \mathcal{I})$ of the set $\mathcal{I}$ w.r.t. behavior $T$ is the partition induced by the equivalence relation $\sim$ defined on $\mathcal{I}$ such that $i \sim j$ if and only if $P(T \mid \mathrm{man}(I = i)) = P(T \mid \mathrm{man}(I = j))$ for $i, j \in \mathcal{I}$. When the image space and the target behavior are clear from the context, we will indicate the causal partition by $\Pi_c$. A cell of a causal partition is called a causal class.

The underlying idea is that images are considered causally equivalent with respect to $T$ if they have the same causal effect on $T$. Given the causal partition of the image space, we can now define the visual cause of $T$:

Definition 4 (Visual Cause). The visual cause $C$ of a target behavior $T$ is a random variable whose value stands in a bijective relation to the causal class of $I$.

The visual cause is thus a function over $\mathcal{I}$, whose values correspond to the post-manipulation distributions $C(i) = P(T \mid \mathrm{man}(I = i))$. We will write $C(i) = c$ to indicate that the causal class of image $i \in \mathcal{I}$ is $c$, or in other words, that in image $i$, the visual cause $C$ takes value $c$. Knowing $C$ allows us to predict the effects of a visual manipulation $P(T \mid \mathrm{man}(I = i))$, as long as we have estimated $P(T \mid \mathrm{man}(I = i_k))$ for one representative $i_k$ of each causal class $k$.

2.3 THE CAUSAL COARSENING THEOREM

Our main theorem relates the causal and observational partitions for a given $\mathcal{I}$ and $T$. It turns out that in general the causal partition is a coarsening of the observational partition. That is, the causal partition aligns with the observational partition, but the observational partition may subdivide some of the causal classes.

Theorem 5 (Causal Coarsening). Among all the generative distributions of the form shown in Fig. 2 which induce a given observational partition $\Pi_o$, almost all induce a causal partition $\Pi_c$ that is a coarsening of $\Pi_o$.

Throughout this article, we use "almost all" to mean "all except for a subset of Lebesgue measure zero". Fig. 3 illustrates the relation between the causal and the observational partition implied by the theorem. We note that the measure-zero subset where $\Pi_c$ does not coarsen $\Pi_o$ can indeed be non-empty.
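The relation between the two partitions stated in Theorem 5 can be checked by hand for the running example. The sketch below applies Definition 2 to obtain the manipulated probabilities and then compares the induced partitions. As before, it works with image classes instead of pixels, and the assignment of the values 0, 0.33, 0.66 and 1 to the combinations of h-bar and $H_1$ is an assumption of this sketch; the helper names are hypothetical.

```python
# Prior of the confounder H1, as given in Fig. 1.
P_H1 = {0: 0.5, 1: 0.5}

# ASSUMPTION (illustrative assignment): P(T = 0 | h_bar, H1), keyed by (h_bar, h1).
P_T0 = {(1, 1): 0.0, (1, 0): 0.33, (0, 1): 0.66, (0, 0): 1.0}

# Image classes encoded as (v_bar, h_bar).
IMAGE_CLASSES = [(0, 0), (0, 1), (1, 0), (1, 1)]

def p_obs(image_class):
    """P(T = 0 | I): observationally, H1 = 1 iff the image has a v-bar."""
    v_bar, h_bar = image_class
    return P_T0[(h_bar, v_bar)]

def p_man(image_class):
    """P(T = 0 | man(I = i)) = sum over H_c of P(T = 0 | I = i, H_c) P(H_c)."""
    _, h_bar = image_class
    return sum(P_T0[(h_bar, h1)] * P_H1[h1] for h1 in P_H1)

def partition(items, prob):
    """Group items into cells of equal probability (rounded to avoid float noise)."""
    cells = {}
    for item in items:
        cells.setdefault(round(prob(item), 6), set()).add(item)
    return list(cells.values())

def is_coarsening(coarse, fine):
    """True if every cell of `fine` is contained in some cell of `coarse`."""
    return all(any(f <= c for c in coarse) for f in fine)

obs_partition = partition(IMAGE_CLASSES, p_obs)
causal_partition = partition(IMAGE_CLASSES, p_man)

print(len(obs_partition), "observational classes")
print(len(causal_partition), "causal classes")
print("causal coarsens observational:", is_coarsening(causal_partition, obs_partition))
```

With these numbers the manipulated probabilities come out as 0.165 and 0.83, matching the rounded values 0.17 and 0.83 reported for the causal classes in Fig. 3, and the two causal classes separate images with an h-bar from images without one.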
Counter-examples in which $\Pi_c$ does not coarsen $\Pi_o$ are provided in Appendix 7. We prove the CCT in Appendix 6 using a technique that extends that of Meek (1995): We show that (1) restricting the space of all the possible $P(T, H, I)$ to only the distributions compatible with a fixed observational partition puts a linear constraint on the distribution space; (2) requiring that the CCT be false puts a non-trivial polynomial constraint on this subspace, and finally, (3) it follows that the theorem holds for almost all distributions that agree with the given observational partition. The proof strategy indicates a close connection between the CCT and the faithfulness assumption (Spirtes et al., 2000).

Two points are worth noting here: First, the CCT is interesting inasmuch as the visual causes of a behavior do not contain all the information in the image that predict the behavior. Such information, though not itself a cause of the behavior, can be informative about the state of other non-visual causes of the target behavior. Second, the CCT allows us to take any classification problem in which the data is divided into observational classes, and assume that the causal labels do not change within each observational class. This will help us develop efficient causal inference algorithms in Section 3.

[Figure 3 graphic: observational probabilities $P(T = 0 \mid I) = 0$, $0.33$, $0.66$ and $1$ for the four observational classes; causal probabilities $P(T = 0 \mid \mathrm{do}(\cdot)) = 0.17$ and $0.83$ for the two causal classes.]

Figure 3: The Causal Coarsening Theorem. The observational probabilities of $T$ given $I$ (gray frame) induce an observational partition on the space of all the images (left, observational partition in gray). The causal probabilities (red frame) induce a causal partition, indicated on the left in red. The CCT allows us to expect that the causal partition is a coarsening of the observational partition. The observational and causal probabilities correspond to the generative model shown in Fig. 1.

2.4 VISUAL CAUSES IN A CAUSAL MODEL CONSISTING OF MACRO-VARIABLES

We can now simplify our generative model by omitting all the information in $I$ unrelated to behavior $T$. Assume that the observational partition $\Pi_o$ refines the causal partition $\Pi_c$. Each of the causal classes $c_1, \ldots, c_K$ delineates a region in the image space $\mathcal{I}$ such that all the images belonging to that region induce the same $P(T \mid \mathrm{man}(I))$. Each of those regions (say, the $k$-th one) can be further partitioned into sub-regions $s^k_1, \ldots, s^k_{M_k}$ such that all the images in the $m$-th sub-region of the $k$-th causal region induce the same observational probability $P(T \mid I)$. By assumption, the observational partition has a finite number of classes, and we can arbitrarily order the observational classes within each causal class. Once such an ordering is fixed, we can assign an integer $m \in \{1, 2, \ldots, M_k\}$ to each image $i$ belonging to the $k$-th causal class such that $i$ belongs to the $m$-th observational class among the $M_k$ observational classes contained in $c_k$.
By construction, this integer explains all the variation of the observational class within a given causal class. This suggests the following definition:

Definition 6 (Spurious Correlate). The spurious correlate $S$ is a discrete random variable whose value differentiates between the observational classes contained in any causal class.

The spurious correlate is a well-defined function on $\mathcal{I}$, whose value ranges between 1 and $\max_k M_k$. Like $C$, the spurious correlate $S$ is a macro-variable constructed from the pixels that make up the image. $C$ and $S$ together contain all and only the visual information in $I$ relevant to $T$, but only $C$ contains the causal information:

Theorem 7 (Complete Macro-variable Description). The following two statements hold for $C$ and $S$ as defined above:

1. $P(T \mid I) = P(T \mid C, S)$.
2. Any other variable $X$ such that $P(T \mid I) = P(T \mid X)$ has Shannon entropy $H(X) \geq H(C, S)$.

We prove the theorem in Appendix 8. It guarantees that $C$ and $S$ constitute the smallest-entropy macro-variables that encompass all the information about the relationship between $T$ and $I$. Fig. 4 shows the relationship between $C$, $S$ and $T$, the image space $\mathcal{I}$ and the observational and causal partitions schematically. $C$ is now a cause of $T$, $S$ correlates with $T$ due to the unobserved common causes $H_c$, and any information irrelevant to $T$ is pushed into the independent noise variables (commonly not shown in graphical representations of structural equation models).³

Figure 4: A macro-variable model of visual causation. Using our theory of visual causation we can aggregate the information present in visual micro-variables (image pixels) into the visual cause $C$ and spurious correlate $S$. According to Theorem 7, $C$ and $S$ contain all the information about $T$ available in $I$.

The macro-variable model lends itself to the standard treatment of causal graphical models described in Pearl (2009). We can define interventions on the causal variables $\{C, S, T\}$ using the standard do-operation.
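The construction of the spurious correlate can be made concrete with a small sketch. It assumes that each image already carries an observational class label and a causal class label (here the probabilities themselves serve as labels), and it fixes the arbitrary ordering of observational classes within a causal class by sorting the labels. The function name `spurious_correlate` and the four example images are hypothetical.

```python
def spurious_correlate(obs_class, causal_class):
    """Assign to each image the 1-based index m of its observational class
    among the observational classes contained in its causal class.

    obs_class, causal_class: dicts mapping image -> class label.
    """
    # Collect the observational classes that occur inside each causal class.
    contained = {}
    for image, c in causal_class.items():
        contained.setdefault(c, set()).add(obs_class[image])
    # Fix an (arbitrary) ordering within each causal class; sorting is one choice.
    index = {
        c: {o: m for m, o in enumerate(sorted(labels), start=1)}
        for c, labels in contained.items()
    }
    return {image: index[causal_class[image]][obs_class[image]] for image in obs_class}

# Running example, with one representative image per image class.
obs_class = {"none": 1.0, "h": 0.33, "v": 0.66, "vh": 0.0}       # P(T=0 | I)
causal_class = {"none": 0.83, "h": 0.17, "v": 0.83, "vh": 0.17}  # P(T=0 | man(I))

S = spurious_correlate(obs_class, causal_class)
print(S)

# Statement 1 of Theorem 7 in this example: the pair (C, S) identifies the
# observational class, so P(T | I) is a function of (C, S).
pairs = {(causal_class[i], S[i]): obs_class[i] for i in obs_class}
print(len(pairs) == len(set(obs_class.values())))
```

In this example the resulting index distinguishes images with a v-bar from images without one, which is the spurious correlate identified informally in Section 1.2.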
The do-operator only sets the value of the intervened variable to the desired value, making it independent of its causes, but it does not (directly) affect the other variables in the system or the relationships between them (see the modularity assumption in Pearl (2009)). However, unlike the standard case where causal variables are separated in location (e.g. smoking and lung cancer), the causal variables in an image may involve the same pixels: $C$ may be the average brightness of the image, whereas $S$ may indicate the presence or absence of particular shapes in the image. An intervention on a causal variable using the do-operator thus requires that the underlying manipulation of the image respects the state of the other causal variables:

Definition 8 (Causal Intervention on Macro-variables). Given the set of macro-variables $\{C, S\}$ that take on values $\{c, s\}$ for an image $i \in \mathcal{I}$, an intervention $\mathrm{do}(C = c')$ on the macro-variable $C$ is given by the manipulation of the image $\mathrm{man}(I = i')$ such that $C(i') = c'$ and $S(i') = s$. The intervention $\mathrm{do}(S = s')$ is defined analogously as the change of the underlying image that keeps the value of $C$ constant.

In some cases it can be impossible to manipulate $C$ to a desired value without changing $S$. We do not take this to be a problem special to our case. In fact, in the standard macro-variable setting of causal analysis we would expect interventions to be much more restricted by physical constraints than we are with our interventions in the image space.

³ We note that $C$ may retain predictive information about $T$ that is not causal, i.e. it is not the case that all spurious correlations can be accounted for in $S$. See Appendix 9 for an example.

3 CAUSAL FEATURE LEARNING: INFERENCE ALGORITHMS

Given the theoretical specification of the concepts of interest in the previous section, we can now develop algorithms to learn $C$, the visual cause of a behavior. In addition, knowledge of $C$ will allow us to specify a manipulator function: a function that, given any image, can return a maximally similar image with the desired causal effect.

Definition 9 (Manipulator Function). Let $C$ be the causal variable of $T$ and $d$ a metric on $\mathcal{I}$. The manipulator function of $C$ is a function $M_C\colon \mathcal{I} \times \mathcal{C} \to \mathcal{I}$ such that $M_C(i, k) = \arg\min_{\hat{\imath} \in C^{-1}(k)} d(i, \hat{\imath})$ for any $i \in \mathcal{I}$, $k \in \mathcal{C}$. In case $d(i, \cdot)$ has multiple minima, we group them together into one equivalence class and leave the choice of the representative to the manipulator function.

The manipulator searches for an image closest to $i$ among all the images with the desired causal effect $k$. The meaning of "closest" depends on the metric $d$ and is discussed further in Section 3.2 below. Note that the manipulator function can find candidates for the image manipulation underlying the desired causal manipulation $\mathrm{do}(C = c)$, but it does not check whether other variables in the system (in particular, the spurious correlate) remain in fact unchanged. Using the closest possible image with desired causal effect is a heuristic approach to fulfilling that requirement.

Algorithm 1: Causal Predictor Training
input:
  $D_{obs} = \{(i_1, p_1 = p(T \mid i_1)), \ldots, (i_N, p_N = p(T \mid i_N))\}$: observational data
  $P = \{P_1, \ldots, P_M\}$: the set of observational classes (so that $\forall k$, $p_k \in P$, $1 \leq k \leq N$)
  Train: a neural net training algorithm
output:
  $C\colon \mathcal{I} \to [0, 1]$: the causal variable
1. Pick $\{i_{k_1}, \ldots, i_{k_M}\} \subset \{i_1, \ldots, i_N\}$ s.t. $p_{k_m} = P_m$;
2. Estimate $\hat{C}_m \leftarrow P(T \mid \mathrm{man}(I = i_{k_m}))$ for each $m$;
3. For all $k$ let $\hat{C}(i_k) \leftarrow \hat{C}_m$ if $p_k = P_m$;
4. $D_{csl} \leftarrow \{(i_1, \hat{C}(i_1)), \ldots, (i_N, \hat{C}(i_N))\}$;
5. $C \leftarrow \mathrm{Train}(D_{csl})$;

There are several reasons why we might want such a manipulator function:

- If our goal is to perform causal manipulations on images, the manipulator function offers an automated solution.
- A manipulator that uses a given $C$ and produces images with the desired causal effect provides strong evidence that $C$ is indeed the visual cause of the behavior.
- Using the manipulator function we can enrich our dataset with new datapoints, in hope of achieving better generalization on both the causal and predictive learning tasks.

The problem of visual causal feature learning can now be posed as follows: Given an image space $\mathcal{I}$ and a metric $d$, learn $C$, the visual cause of $T$, and the manipulator $M_C$.
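Definition 9 can be illustrated by exhaustive search when the image space is tiny. The sketch below is not a practical algorithm; it assumes a finite list of candidate images, a known causal variable (the h-bar detector of the running example) and the metric induced by the $L_2$ norm. The names `manipulator`, `has_h_bar` and `l2` are hypothetical.

```python
from itertools import product
import numpy as np

def manipulator(i, k, candidates, C, d):
    """Brute-force M_C(i, k): the candidate closest to i among those with C = k.

    Ties are broken by the order of `candidates`, which plays the role of
    choosing a representative of the equivalence class of minima.
    """
    feasible = [j for j in candidates if C(j) == k]
    if not feasible:
        raise ValueError("no candidate image has the requested causal class")
    return min(feasible, key=lambda j: d(i, j))

def has_h_bar(image):
    """Causal variable of the running example: 1 if some row is fully black."""
    return int(np.any(np.all(image == 1, axis=1)))

def l2(a, b):
    """Metric induced by the L2 norm."""
    return float(np.linalg.norm(a - b))

# All 2^9 binary images of size 3x3.
all_images = [np.array(bits).reshape(3, 3) for bits in product((0, 1), repeat=9)]

blank = np.zeros((3, 3), dtype=int)
with_bar = manipulator(blank, 1, all_images, has_h_bar, l2)
print(with_bar)
```

Starting from a blank image, the closest image containing an h-bar differs from it in exactly one full row, so three pixels are changed.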
3.1 CAUSAL EFFECT PREDICTION

A standard machine learning approach to learning the relation between $I$ and $T$ would be to take an observational dataset $D_{obs} = \{(i_k, P(T \mid i_k))\}_{k=1,\ldots,N}$ and learn a predictor $f$ whose training performance guarantees a low test error (so that $f(i^*) \approx P(T \mid i^*)$ for a test image $i^*$). In causal feature learning, low test error on observational data is insufficient; it is entirely possible that $D_{obs}$ contains spurious information useful in predicting test labels which is nevertheless not causal. That is, the prediction may be highly accurate for observational data, but completely inaccurate for a prediction of the effect of a manipulation of the image (recall the barometer example). However, we can use the CCT to obtain a causal dataset from the observational data, and then train a predictor on that dataset. Algorithm 1 uses this strategy to learn a function $C$ that, presented with any image $i \in \mathcal{I}$, returns $C(i) \approx P(T \mid \mathrm{man}(I = i))$. We use a fixed neural network architecture to learn $C$, but any differentiable hypothesis class could be substituted instead. Differentiability of $C$ is necessary in Section 3.2 in order to learn the manipulator function.
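The strategy of Algorithm 1 can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the observational probabilities are taken as exact class labels, `estimate_effect` stands for whatever experiment estimates $P(T \mid \mathrm{man}(I = i))$ for one image, and `train` is any user-supplied learner. All function names are hypothetical.

```python
def build_causal_dataset(images, obs_probs, estimate_effect):
    """Steps 1-4 of Algorithm 1: one experiment per observational class."""
    # Step 1: pick one representative image for each observational class.
    representative = {}
    for image, p in zip(images, obs_probs):
        representative.setdefault(p, image)
    # Step 2: estimate the causal effect of each representative.
    class_effect = {p: estimate_effect(image) for p, image in representative.items()}
    # Steps 3-4: propagate the causal label to every image of the same class.
    return [(image, class_effect[p]) for image, p in zip(images, obs_probs)]

def causal_predictor_training(images, obs_probs, estimate_effect, train):
    """Step 5: train a predictor on the causal dataset."""
    return train(build_causal_dataset(images, obs_probs, estimate_effect))

# Toy usage: images are stand-in tuples (v_bar, h_bar, identifier), and the
# "experiment" is a fake oracle returning P(T = 0 | man(I = i)).
experiment_log = []

def fake_experiment(image):
    experiment_log.append(image)
    return 0.17 if image[1] else 0.83

images = [(0, 0, "a"), (0, 1, "b"), (1, 0, "c"), (1, 1, "d"), (0, 1, "e"), (1, 1, "f")]
obs_probs = [1.0, 0.33, 0.66, 0.0, 0.33, 0.0]

d_csl = build_causal_dataset(images, obs_probs, fake_experiment)
print(len(experiment_log), "experiments for", len(images), "images")
```

The point of the sketch is the experiment count: the oracle is queried once per observational class, not once per image.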
In Step 1, Algorithm 1 picks a representative member of each observational class. The CCT tells us that the causal partition coarsens the observational one. That is, in principle (ignoring sampling issues) it is sufficient to estimate $\hat{C}_m = P(T \mid \mathrm{man}(I = i_{k_m}))$ for just one image in an observational class $m$ in order to know that $P(T \mid \mathrm{man}(I = i)) = \hat{C}_m$ for any other $i$ in the same observational class. The choice of the experimental method of estimating the causal class in Step 2 is left to the user and depends on the behaving agent and the behavior in question. If, for example, $T$ represents whether the spiking rate of a recorded neuron is above a fixed threshold, estimating $P(T \mid \mathrm{man}(I = i))$ could consist of recording the neuron's response to $i$ in a laboratory setting multiple times, and then calculating the probability of spiking from the finite sample. The causal dataset created in Step 4 consists of the observational inputs and their causal classes. The causal dataset is acquired through $O(N)$ experiments, where $N$ is the number of observational classes. The final step of the algorithm trains a neural network that predicts the causal labels on unseen images. The choice of the method of training is again left to the user.

3.2 CAUSAL FEATURE MANIPULATION

Once we have learned $C$ we can use the causal neural network to create synthetic examples of images as similar as possible to the originals, but with a different causal label. The meaning of "as similar as possible" depends on the image metric $d$ (see Definition 9). The choice of $d$ is task-specific and crucial to the quality of the manipulations. In our experiments, we use a metric induced by an $L_2$ norm. Alternatives include other $L_p$-induced metrics, distances in implicit feature spaces induced by image kernels (Harchaoui and Bach, 2007; Grauman and Darrell, 2007; Bosch et al., 2007; Vishwanathan, 2010) and distances in learned representation spaces (Bengio et al., 2013).
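For concreteness, a metric induced by an $L_p$ norm can be written as follows. This is a generic sketch for images stored as NumPy arrays; the function name `lp_distance` is hypothetical, and $p \geq 1$ is assumed so that the result is a metric.

```python
import numpy as np

def lp_distance(a, b, p=2.0):
    """Distance induced by the L_p norm between two images of equal shape."""
    if p < 1:
        raise ValueError("p must be at least 1 for the result to be a metric")
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)).ravel()
    return float(np.sum(diff ** p) ** (1.0 / p))

x = np.zeros((4, 4))
y = np.zeros((4, 4))
y[1, :] = 1.0  # y differs from x in one full row of four pixels

print(lp_distance(x, y, p=1))  # number of differing pixels for binary images
print(lp_distance(x, y, p=2))  # square root of that number for binary images
```

For binary images the $L_2$ distance is the square root of the number of differing pixels, so minimizing it is equivalent to minimizing the number of pixels changed.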
Algorithm 2 proposes one way to learn the manipulator function using a simple manipulation procedure that approximates the requirements of Definition 9 up to local minima. The algorithm, inspired by the active learning techniques of uncertainty sampling (Lewis and Gale, 1994) and density weighing (Settles and Craven, 2008), starts off by training a causal neural network in Step 2. If only observational data is available, this can be achieved using Algorithm 1. Next, it randomly chooses a set of images to be manipulated, and their target post-manipulation causal labels. The loop that starts in Step 5 then takes each of those images and searches for the image that, among the images with the same desired causal class, is closest to the original image. Note that the causal class boundaries are defined by the current causal neural net $C$. Since $C$ is in general a highly nonlinear function and it can be hard to find its inverse sets, we use an approximate solution. The algorithm thus finds the minimum of a weighted sum of $|C(j) - \hat{c}_{l,k}|$ (the difference of the output image $j$'s label and the desired label $\hat{c}_{l,k}$) and $d(i_{l,k}, j)$ (the distance of the output image $j$ from the original image $i_{l,k}$).

Algorithm 2: Manipulator Function Learning
input:
  $d\colon \mathcal{I} \times \mathcal{I} \to \mathbb{R}_+$: a metric on the image space
  $D_{csl} = \{(i_1, c_1), \ldots, (i_N, c_N)\}$: causal data
  $\mathcal{C} = \{C_1, \ldots, C_M\}$: the set of causal classes (so that $\forall i$, $c_i \in \mathcal{C}$)
  Train: a neural net training algorithm
  nIters: number of experiment iterations
  $Q$: number of queries per iteration
  $\alpha$: manipulation tuning parameter
  $A\colon \mathcal{I} \to \mathcal{C}$: an oracle for $P(T \mid \mathrm{do}(I))$
output:
  $M_C\colon \mathcal{I} \times \mathcal{C} \to \mathcal{I}$: the manipulator function
1. for $l \leftarrow 1$ to nIters do
2.   $C \leftarrow \mathrm{Train}(D_{csl})$;
3.   Choose manipulation starting points $\{i_{l,1}, \ldots, i_{l,Q}\}$ at random from $D_{csl}$;
4.   Choose manipulation targets $\{\hat{c}_{l,1}, \ldots, \hat{c}_{l,Q}\}$ such that $\hat{c}_{l,k} \neq c_{l,k}$;
5.   for $k \leftarrow 1$ to $Q$ do
6.     $\hat{\imath}_{l,k} \leftarrow \arg\min_{j \in \mathcal{I}} (1 - \alpha)\, |C(j) - \hat{c}_{l,k}| + \alpha\, d(j, i_{l,k})$;
7.   end
8.   $D_{csl} \leftarrow D_{csl} \cup \{(\hat{\imath}_{l,1}, A(\hat{\imath}_{l,1})), \ldots, (\hat{\imath}_{l,Q}, A(\hat{\imath}_{l,Q}))\}$;
9. end
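The minimization in Step 6 of Algorithm 2 can be approximated in many ways. The sketch below uses a greedy pixel-flipping local search on binary images, which is an assumption of this illustration and not necessarily the procedure used in the experiments; the soft causal predictor `soft_h_bar` (the fill fraction of the fullest row) is likewise a hypothetical stand-in for a trained network. Like the procedure described above, it only finds a local minimum.

```python
import numpy as np

def manipulate(i, c_target, C, alpha):
    """Greedy local search for
        argmin_j (1 - alpha) * |C(j) - c_target| + alpha * d(j, i)
    over binary images j, with d induced by the L2 norm.

    ASSUMPTION: single-pixel flips are accepted whenever they lower the
    objective; the search stops when no flip helps.
    """
    i = np.asarray(i, dtype=int)

    def objective(j):
        return (1.0 - alpha) * abs(C(j) - c_target) + alpha * float(np.linalg.norm(j - i))

    j = i.copy()
    best = objective(j)
    improved = True
    while improved:
        improved = False
        for idx in np.ndindex(j.shape):
            j[idx] = 1 - j[idx]           # try flipping one pixel
            value = objective(j)
            if value < best - 1e-12:
                best = value              # keep the flip
                improved = True
            else:
                j[idx] = 1 - j[idx]       # undo the flip
    return j

def soft_h_bar(image):
    """Hypothetical smooth causal predictor: fill fraction of the fullest row."""
    return float(np.max(np.mean(image, axis=1)))

start = np.zeros((3, 3), dtype=int)
result = manipulate(start, c_target=1.0, C=soft_h_bar, alpha=0.1)
print(result)
```

A small `alpha` favors reaching the target label, while a large `alpha` favors staying close to the original image; with `alpha` close to 1 the search returns the original image unchanged.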
At each iteration, Algorithm 2 performs $Q$ manipulations and the same number of causal queries to the agent, which result in new datapoints $(\hat{\imath}_{l,1}, A(\hat{\imath}_{l,1})), \ldots, (\hat{\imath}_{l,Q}, A(\hat{\imath}_{l,Q}))$. It is natural to claim that the manipulator performs well if $A(\hat{\imath}_{l,k}) \approx \hat{c}_{l,k}$ for many $k$, which means the target causal labels agree with the true causal labels. We thus define the manipulation error of the $l$-th iteration $\mathrm{MErr}_l$ as

$$\mathrm{MErr}_l = \frac{1}{Q} \sum_{k=1}^{Q} \left| A(\hat{\imath}_{l,k}) - \hat{c}_{l,k} \right|. \tag{2}$$

While it is important that our manipulations are accurate, we also want them to be minimal. Another measure of interest is thus the average manipulation distance

$$\mathrm{MDist}_l = \frac{1}{Q} \sum_{k=1}^{Q} d(i_{l,k}, \hat{\imath}_{l,k}). \tag{3}$$

A natural variant of Algorithm 2 is to set nIters to a large integer and break the loop when one or both of these performance criteria reaches a desired value.
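Equations (2) and (3) translate directly into code. The sketch below assumes that the oracle answers and the target labels are given as numeric sequences of equal length, and that images are NumPy arrays; the function names are hypothetical.

```python
import numpy as np

def manipulation_error(oracle_labels, target_labels):
    """MErr_l of Eq. (2): mean absolute gap between true and target causal labels."""
    a = np.asarray(oracle_labels, dtype=float)
    c = np.asarray(target_labels, dtype=float)
    return float(np.mean(np.abs(a - c)))

def manipulation_distance(originals, manipulated, d):
    """MDist_l of Eq. (3): mean distance between originals and their manipulations."""
    return float(np.mean([d(i, j) for i, j in zip(originals, manipulated)]))

def l2(a, b):
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

# Q = 4 manipulations with binary causal labels: two of them hit the target.
oracle = [1, 0, 1, 1]
targets = [1, 1, 1, 0]
print(manipulation_error(oracle, targets))

originals = [np.zeros((2, 2)), np.zeros((2, 2))]
manipulated = [np.array([[1.0, 1.0], [0.0, 0.0]]), np.zeros((2, 2))]
print(manipulation_distance(originals, manipulated, l2))
```

With binary causal labels, $\mathrm{MErr}_l$ is simply the fraction of manipulations whose true causal label differs from the intended one.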
4 EXPERIMENTS

In order to illustrate the concepts presented in this article we perform two causal feature learning experiments. The first experiment, called GRATING, uses observational and causal data generated by the model from Section 1.2. The GRATING experiment confirms that our system can learn the ground truth cause and ignore the spurious correlates of a behavior. The second experiment, MNIST, uses images of hand-written digits (LeCun et al., 1998) to exemplify the use of the manipulator function on slightly more realistic data: in this example, we transform an image into a maximally similar image with another class label. We chose problems that are simple from the computer vision point of view. Our goal is to develop the theory of visual causal feature learning and show that it has feasible algorithmic solutions; we are at this point not engineering advanced computer vision systems.

4.1 THE GRATING EXPERIMENT

In this experiment we generate data using the model of Fig. 1, with two minor differences: $H_1$ and $H_2$ only induce one v-bar or h-bar in the image, and we restrict our observational dataset to images with only about 3% of the pixels filled with random noise (see Fig. 5). Both restrictions increase the clarity of presentation. We use Algorithms 1 and 2 (with minor modifications imposed by the binary nature of the images) to learn the visual cause of behavior $T$. Figure 5 (top) shows the progress of the training process. The first step (not shown in the figure) uses the CCT to learn the causal labels on the observational data. We then train a simple neural network (a fully connected network with one hidden layer of 100 units) on this data. The same network is used on Iteration 1 to create new manipulated exemplars. We then follow Algorithm 2 to train the manipulator iteratively. Fig. 5 (bottom) illustrates the difference between the manipulator on Iteration 1 (which fails almost 40% of the time) and Iteration 20, where the error is about 6%.
Each column shows example manipulations of a particular kind. Columns with green labels indicate successful manipulations, of which there are two kinds: switching the causal variable on (0 → 1, adding the h-bar), or switching it off (1 → 0, removing the h-bar). Red-labeled columns show cases in which the manipulator failed to influence the cause: that is, each red column shows an original image and its manipulated version which the manipulator believes should cause a change in $T$, but which does not induce such a change. The red/green horizontal bars show the percentage of success/error for each manipulation direction. Fig. 5 (bottom, a) shows that after training on the causally-coarsened observational dataset, the manipulator fails about 40% of the time. In Fig. 5 (b), after twenty manipulator learning iterations, only six manipulations out of a hundred are unsuccessful. Furthermore, the causally irrelevant image pixels are also much better preserved than at Iteration 1. The fully-trained manipulator correctly learned to manipulate the presence of the h-bar to cause changes in $T$, and ignores the v-bar that is strongly correlated with the behavior but does not cause it.

Figure 5: Manipulator learning for GRATING. Top. The plots show the progress of our manipulator function learning algorithm (MErr and MDist against Iteration) over ten iterations of experiments for the GRATING problem. The manipulation error decreases quickly with progressing iterations, whereas the manipulation distance stays close to constant. Bottom. Original and manipulated GRATING images: (a) Iteration 1, (b) Iteration 20. See text for the details.

4.2 THE MNIST ON MTURK EXPERIMENT

In this experiment we start with the MNIST dataset of handwritten digits. In our terminology, this, as well as any standard vision dataset, is already causal data: the labels are assigned in an experimental setting, not in nature.
Consider the following binary human behavior: $T = 1$ if a human observer answers affirmatively to the question "Does this image contain the digit 7?", while $T = 0$ if the observer judges that the image does not contain the digit 7. For simplicity we will assume that for any image $i$ either $P(T = 1 \mid man(i)) = 0$ or $P(T = 1 \mid man(i)) = 1$. Our task is to learn the manipulator function that will take any image and modify it minimally such that it will become a 7 if it was not before, or will stop resembling a 7 if it did originally. We conduct the manipulator training separately for all ten MNIST digits using human annotators on Amazon Mechanical Turk. The exact training procedure is described in Appendix 10. Fig. 6 (top) shows training progress. As in Fig. 5, the manipulation error decreases with training. Fig. 6 (bottom) visualizes the manipulator training progress. In the first row we see a randomly chosen MNIST 9 being manipulated to resemble a 0, pushed through successive 0-vs-all manipulators trained at iterations 0, 1, ..., 5 (iteration 1 shows what the neural net takes to be the closest manipulation to change the 9 to a 0 purely on the basis of the non-manipulated data). Further rows perform similar experiments for the other digits. The plots show how successive manipulators progressively remove the original digit's features and add target class features to the image.

Figure 6: Manipulator Learning for MNIST ON MTURK. Top. Plots of MErr and MDist against Iteration. In contrast to the GRATING experiment, here the manipulation distance grows as the manipulation error decreases. This is because a successful manipulator needs to change significant parts of each image (such as continuous strokes). Bottom. Visualization of manipulator training on randomly selected (not cherry-picked) MNIST digits, arranged by Starting Digit, Iteration and Target Class. See text for the details.

5 DISCUSSION

We provide a link between causal reasoning and neural network models that have recently enjoyed tremendous success in the fields of machine learning and computer vision (LeCun et al., 1998; Russakovsky et al., 2014).
Despite very encouraging results in image classification (Krizhevsky et al., 2012), object detection (Dollar et al., 2012) and fine-grained classification (Branson et al., 2014; Zhang et al., 2014), some researchers have found that visual neural networks can be easily fooled using adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2014). The learning procedure for our manipulator function could be viewed as an attempt to train a classifier that is robust against such examples. The procedure uses causal reasoning to improve on the boundaries of a standard, correlational classifier (Fig. 5 and 6 show the improvement). However, the ultimate purpose of a causal manipulator network is to extract truly causal features from data and automatically perform causal manipulations based on those features.

A second contribution concerns the field of causal discovery. Modern causal discovery algorithms presuppose that the set of causal variables is well-defined and meaningful. What exactly this presupposition entails is unclear, but there are clear counter-examples: $x$ and $2x$ cannot be two distinct causal variables. There are also well understood problems when causal variables are aggregates of other variables (Chu et al., 2003; Spirtes and Scheines, 2004). We provide an account of how causal macro-variables can supervene on micro-variables. This article is an attempt to clarify how one may construct a set of well-defined causal macro-variables that function as basic relata in a causal graphical model. This step strikes us as essential if causal methodology is to be successful in areas where we do not have clearly delineated candidate causes or where causes supervene on micro-variables, such as in climate science, neuroscience, economics and, in our specific case, vision.

Acknowledgements

KC's work was funded by the Qualcomm Innovation Fellowship. KC's and PP's work was supported by the ONR MURI grant N... FE would like to thank Cosma Shalizi for pointers to many relevant results this paper builds on.
6 APPENDIX: PROOF OF THE CAUSAL COARSENING THEOREM

Before we prove the Causal Coarsening Theorem, we prove a less general version of it in order to split the rather complex proof of the CCT into two parts. This Auxiliary Theorem can be proven using simpler techniques; however, here we deliberately use techniques that transfer directly to the proof of the CCT.

Auxiliary Theorem. Among all the generative models of the form discussed in Fig. 2 (in the main text), the subset of distributions $P(T, H, I)$ for which the causal partition is not a coarsening (proper or improper) of the observational partition has Lebesgue measure zero.

Proof. Our proof is inspired by a proof used by Meek (1995) to prove that almost all distributions compatible with a given causal graph are faithful. The proof strategy is thus first to express the proposition that, for a given distribution, the observational partition does not refine the causal partition as a polynomial equation on the space of all distributions compatible with the model. We then show that this polynomial equation is not trivial, i.e. there is at least one distribution that is not its root. By a simple algebraic lemma, this will prove the theorem. We extend Meek's proof technique in our usage of Fubini's Theorem for the Lebesgue integral. It allows us to split the polynomial constraint into multiple different constraints along several of the distribution parameters. This allows for additional flexibility in creating useful assumptions (in our proof, the assumption that the datapoints have well-defined causal classes, but the observational class can still vary freely).

Assume that $T$ is binary and $H = (H_1, \dots, H_M)$, $I$ are discrete variables (say $|H_i| = K_i$, $|I| = N$, though $N$ can be very large; we will use the notation $K \triangleq K_1 \times \dots \times K_M$ for simplicity later on). The discreteness assumption is not crucial, but will simplify the reasoning. We can factorize the joint as $P(T, H, I) = P(T \mid H, I)\, P(I \mid H)\, P(H)$.
$P(T \mid H, I)$ can be parametrized by $|H_1| \times \dots \times |H_M| \times |I| = K \times N$ parameters, $P(I \mid H)$ by $(N - 1) \times K$ parameters, and $P(H)$ by another $K$ parameters, all of which are independent. Call the parameters, respectively,
$$\alpha_{h,i} \triangleq P(T = 0 \mid H = h, I = i), \qquad \beta_{i,h} \triangleq P(I = i \mid H = h), \qquad \gamma_h \triangleq P(H = h).$$
We will denote parameter vectors as
$$\alpha = (\alpha_{h_1,i_1}, \dots, \alpha_{h_K,i_N}) \in \mathbb{R}^{K \times N},$$
$$\beta = (\beta_{i_1,h_1}, \dots, \beta_{i_{N-1},h_K}) \in \mathbb{R}^{(N-1) \times K},$$
$$\gamma = (\gamma_{h_1}, \dots, \gamma_{h_K}) \in \mathbb{R}^{K},$$
where the indices are arranged in lexicographical order. This creates a one-to-one correspondence of each possible joint distribution $P(T, H, I)$ with a point $(\alpha, \beta, \gamma) \in \mathcal{P}[\alpha, \beta, \gamma] \subset \mathbb{R}^{K^3 \times N \times (N-1)}$, where $\mathcal{P}[\alpha, \beta, \gamma]$ is the $K^3 \times N \times (N-1)$-dimensional simplex of multinomial distributions.

To proceed with the proof, we first pick any point in the $P(T \mid H, I) \times P(H)$ space: that is, we fix the values of $\alpha$ and $\gamma$. The only free parameters are now $\beta_{i,h}$ for all values of $i, h$; varying these values creates a subset of the space of all the distributions which we will call
$$\mathcal{P}[\beta; \alpha, \gamma] = \{(\alpha, \beta, \gamma) \mid \beta \in [0, 1]^{(N-1) \times K}\}.$$
$\mathcal{P}[\beta; \alpha, \gamma]$ is a subset of $\mathcal{P}[\alpha, \beta, \gamma]$ isometric to the $[0, 1]^{(N-1) \times K}$-dimensional simplex of multinomials. We will use the term $\mathcal{P}[\beta; \alpha, \gamma]$ to refer both to the subset of $\mathcal{P}[\alpha, \beta, \gamma]$ and to the lower-dimensional simplex it is isometric to, remembering that the latter comes equipped with the Lebesgue measure on $\mathbb{R}^{(N-1) \times K}$.

Now we are ready to show that the subset of $\mathcal{P}[\beta; \alpha, \gamma]$ which does not satisfy the Causal Coarsening constraint is of measure zero with respect to the Lebesgue measure. To see this, first note that since $\alpha$ and $\gamma$ are fixed, each image $i$ has a well-defined causal class $C(i) = \sum_h \alpha_{h,i} \gamma_h$. The Causal Coarsening constraint says "For every pair of images $i, j$ such that $P(T \mid i) = P(T \mid j)$ it holds that $C(i) = C(j)$." The subset of $\mathcal{P}[\beta; \alpha, \gamma]$ of all distributions that do not satisfy the constraint consists of the $P(T, H, I)$ for which, for some $i, j$, it holds that $P(T = 0 \mid i) = P(T = 0 \mid j)$ and $C(i) \neq C(j)$. Take any pair $i, j$ for which $C(i) \neq C(j)$ (if such a pair does not exist, then the Causal Coarsening constraint holds for all the distributions in $\mathcal{P}[\beta; \alpha, \gamma]$).
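The parametrization above can be explored numerically. The sketch below, a minimal illustration rather than part of the proof, stores $\alpha$ as a $K \times N$ array and $\beta$ as an $N \times K$ array, computes the causal class $C(i) = \sum_h \alpha_{h,i}\gamma_h$, and obtains the observational probability from Bayes' rule as $P(T = 0 \mid i) = \sum_h \alpha_{h,i}\beta_{i,h}\gamma_h / \sum_h \beta_{i,h}\gamma_h$. All function names, the tolerance, and the hand-tuned parameter values are assumptions made for this example.

```python
import numpy as np

def causal_class(alpha, gamma):
    """C(i) = sum_h alpha[h, i] * gamma[h]; alpha has shape (K, N)."""
    return gamma @ alpha

def observational_prob(alpha, beta, gamma):
    """P(T=0 | i) via Bayes' rule; beta has shape (N, K) with beta[i, h] = P(i | h)."""
    joint_ih = beta * gamma                      # P(i, h), shape (N, K)
    return (alpha.T * joint_ih).sum(axis=1) / joint_ih.sum(axis=1)

def causal_coarsens_observational(causal, observational, tol=1e-9):
    """True if equal observational values always imply equal causal classes."""
    n = len(causal)
    return all(
        abs(causal[i] - causal[j]) <= tol
        for i in range(n)
        for j in range(n)
        if abs(observational[i] - observational[j]) <= tol
    )

# Generic case: parameters drawn at random (K = 2 hidden states, N = 3 images).
rng = np.random.default_rng(0)
K, N = 2, 3
alpha = rng.uniform(size=(K, N))
beta = rng.dirichlet(np.ones(N), size=K).T       # each column sums to one over i
gamma = rng.dirichlet(np.ones(K))
generic_ok = causal_coarsens_observational(
    causal_class(alpha, gamma), observational_prob(alpha, beta, gamma))

# Hand-tuned case: beta chosen so that P(T=0 | i) coincides for both images
# although their causal classes differ.
alpha_t = np.array([[0.2, 0.4], [0.8, 0.4]])
beta_t = np.array([[0.5, 0.25], [0.5, 0.75]])
gamma_t = np.array([0.5, 0.5])
tuned_ok = causal_coarsens_observational(
    causal_class(alpha_t, gamma_t), observational_prob(alpha_t, beta_t, gamma_t))
```

In the hand-tuned case both images have observational probability 0.4 while their causal classes are 0.5 and 0.4, so the constraint fails; this is the kind of carefully aligned $\beta$ that the proof shows to be a measure-zero event.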
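We can write
$$P(T = 0 \mid i) = \sum_h P(T = 0 \mid h, i)\, P(h \mid i) = \frac{1}{P(i)} \sum_h P(T = 0 \mid h, i)\, P(i \mid h)\, P(h).$$
Since the same equation applies to $P(T = 0 \mid j)$, the constraint $P(T \mid i) = P(T \mid j)$ can be rewritten
$$\frac{1}{P(i)} \sum_h P(T = 0 \mid h, i)\, P(i \mid h)\, P(h) = \frac{1}{P(j)} \sum_h P(T = 0 \mid h, j)\, P(j \mid h)\, P(h)$$
$$\iff\; P(j) \sum_h P(T = 0 \mid h, i)\, P(i \mid h)\, P(h) - P(i) \sum_h P(T = 0 \mid h, j)\, P(j \mid h)\, P(h) = 0,$$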
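which we can rewrite in terms of the independent parameters (after defining $\alpha_{0,h,i} = \alpha_{h,i}$ and $\alpha_{1,h,i} = 1 - \alpha_{h,i}$) and further simplify as
$$\Big(\sum_{t \in \{0,1\}} \sum_h \alpha_{t,h,j}\, \gamma_h \beta_{j,h}\Big) \Big(\sum_h \alpha_{0,h,i}\, \gamma_h \beta_{i,h}\Big) - \Big(\sum_{t \in \{0,1\}} \sum_h \alpha_{t,h,i}\, \gamma_h \beta_{i,h}\Big) \Big(\sum_h \alpha_{0,h,j}\, \gamma_h \beta_{j,h}\Big) = 0$$
$$\iff\; \Big(\sum_h \alpha_{1,h,j}\, \gamma_h \beta_{j,h}\Big) \Big(\sum_h \alpha_{0,h,i}\, \gamma_h \beta_{i,h}\Big) - \Big(\sum_h \alpha_{1,h,i}\, \gamma_h \beta_{i,h}\Big) \Big(\sum_h \alpha_{0,h,j}\, \gamma_h \beta_{j,h}\Big) = 0$$
$$\iff\; \Big(\sum_h (1 - \alpha_{h,j})\, \gamma_h \beta_{j,h}\Big) \Big(\sum_h \alpha_{h,i}\, \gamma_h \beta_{i,h}\Big) - \Big(\sum_h (1 - \alpha_{h,i})\, \gamma_h \beta_{i,h}\Big) \Big(\sum_h \alpha_{h,j}\, \gamma_h \beta_{j,h}\Big) = 0$$
$$\iff\; \Big(\sum_h \gamma_h \beta_{j,h}\Big) \Big(\sum_h \alpha_{h,i}\, \gamma_h \beta_{i,h}\Big) - \Big(\sum_h \gamma_h \beta_{i,h}\Big) \Big(\sum_h \alpha_{h,j}\, \gamma_h \beta_{j,h}\Big) = 0, \tag{4}$$
which is a polynomial constraint on $\mathcal{P}[\beta; \alpha, \gamma]$ (note that to keep the notation manageable, we have omitted the dependent term $1 - \sum_h \gamma_h$ from the equations). The second line follows because the $t = 0$ terms of the two products are identical and cancel. By a simple algebraic lemma (proven by Okamoto, 1973), if the above constraint is not trivial (that is, if there exists $\beta$ for which the constraint does not hold), the subset of $\mathcal{P}[\beta; \alpha, \gamma]$ on which it holds is measure zero.

To see that Eq. (4) does not always hold, note that if for any $h^*$ we set $\beta_{i,h^*} = 1$ (and thus $\beta_{i,h} = 0$ for any $h \neq h^*$) and $\beta_{j,h^*} = 1$, the equation reduces to
$$(\gamma_{h^*})^2\, (\alpha_{h^*,i} - \alpha_{h^*,j}) = 0.$$
Thus if Eq. (4) was trivially true, we would have $\alpha_{h^*,i} = \alpha_{h^*,j}$ or $\gamma_{h^*} = 0$ for all $h^*$. However, this implies $C(i) = C(j)$, which contradicts our assumption.

We have now shown that the subset of $\mathcal{P}[\beta; \alpha, \gamma]$ which consists of distributions for which $P(T \mid i) = P(T \mid j)$ (even though $C(i) \neq C(j)$) is Lebesgue measure zero. Since there are only finitely many pairs of images $i, j$ for which $C(i) \neq C(j)$, the subset of $\mathcal{P}[\beta; \alpha, \gamma]$ of distributions which violate the Causal Coarsening constraint is also Lebesgue measure zero.

The remainder of the proof is a direct application of Fubini's theorem. For each $\alpha, \gamma$, call the (measure zero) subset of $\mathcal{P}[\beta; \alpha, \gamma]$ that violates the Causal Coarsening constraint $z[\alpha, \gamma]$. Let $Z = \bigcup_{\alpha,\gamma} z[\alpha, \gamma] \subset \mathcal{P}[\alpha, \beta, \gamma]$ be the set of all the joint distributions which violate the Causal Coarsening constraint. We want to prove that $\mu(Z) = 0$, where $\mu$ is the Lebesgue measure. To show this, we will use the indicator function
$$\hat{z}(\alpha, \beta, \gamma) = \begin{cases} 1 & \text{if } \beta \in z[\alpha, \gamma], \\ 0 & \text{otherwise.} \end{cases}$$
By the basic properties of positive measures we have
$$\mu(Z) = \int_{\mathcal{P}[\alpha, \beta, \gamma]} \hat{z}\, d\mu.$$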
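It is a standard application of Fubini's Theorem for the Lebesgue integral to show that the integral in question equals zero. For simplicity of notation, let
$$\mathcal{A} = \mathbb{R}^{K \times N}, \qquad \mathcal{B} = \mathbb{R}^{(N-1) \times K}, \qquad \mathcal{G} = \mathbb{R}^{K}.$$
We have
$$\int_{\mathcal{P}[\alpha, \beta, \gamma]} \hat{z}\, d\mu = \int_{\mathcal{A} \times \mathcal{B} \times \mathcal{G}} \hat{z}(\alpha, \beta, \gamma)\, d(\alpha, \beta, \gamma)$$
$$= \int_{\mathcal{A} \times \mathcal{G}} \int_{\mathcal{B}} \hat{z}(\alpha, \beta, \gamma)\, d(\beta)\, d(\alpha, \gamma)$$
$$= \int_{\mathcal{A} \times \mathcal{G}} \mu(z[\alpha, \gamma])\, d(\alpha, \gamma) \tag{5}$$
$$= \int_{\mathcal{A} \times \mathcal{G}} 0\, d(\alpha, \gamma) = 0.$$
Equation (5) follows as $\hat{z}$ restricted to $\mathcal{P}[\beta; \alpha, \gamma]$ is the indicator function of $z[\alpha, \gamma]$. This completes the proof that $Z$, the set of joint distributions over $T$, $H$ and $I$ that violate the Causal Coarsening constraint, is measure zero.

We are now ready to prove the main theorem.

Theorem (Causal Coarsening Theorem). Among all the generative models of the form discussed in Fig. 2 (in the main text) that have distributions $P(T, H, I)$ that induce some given observational partition $\Pi_o$, almost all induce a causal partition $\Pi_c$ that is a coarsening of $\Pi_o$.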
Proof. Any variables that appear in this proof without definition are defined in the proof of the Auxiliary Theorem. We take the same $\alpha, \beta, \gamma$ parametrization of distributions. Fixing an observational partition means fixing a set of observational constraints (OCs)
$$P(T \mid i^1_1) = \dots = P(T \mid i^1_{N_1}),$$
$$\vdots$$
$$P(T \mid i^L_1) = \dots = P(T \mid i^L_{N_L}),$$
where $1 \leq L \leq N$ is the number of observational classes and $N_l$ is the number of images in class $l$. Since $P(T, H, I) = P(H \mid T, I)\, P(T \mid I)\, P(I)$, $P(T \mid i)$ is an independent parameter in the unrestricted $P(T, H, I)$, and the OCs reduce the number of independent parameters of the joint by $\sum_{l=1}^{L} (N_l - 1)$. We want to express this parameter-space reduction in terms of the $\alpha$, $\beta$ and $\gamma$ parametrization and then apply the proof of the Auxiliary Theorem. To do this, for each observational class $l$, choose a representative image $\hat{i}_l$ such that $P(T \mid i^l_m) = P(T \mid \hat{i}_l)$ for all $m \in 1 \dots N_l$. Then for each $i^l_m \neq \hat{i}_l$ it holds that
$$P(T, i^l_m) = P(T \mid \hat{i}_l)\, P(i^l_m)$$
or
$$\sum_h P(T, h, i^l_m) = P(T \mid \hat{i}_l) \sum_h P(h, i^l_m).$$
Picking an arbitrary $h_0$, we can separate the left-hand side as
$$P(T, h_0, i^l_m) = P(T \mid \hat{i}_l) \sum_h P(h, i^l_m) - \sum_{h \neq h_0} P(T, h, i^l_m).$$
Finally, taking $T = 0$ so that $P(T = 0, h, i) = \alpha_{h,i}\, \beta_{i,h}\, \gamma_h$, this equation can be rewritten in terms of $\alpha$, $\beta$ and $\gamma$ as
$$\alpha_{h_0, i^l_m}\, \beta_{i^l_m, h_0}\, \gamma_{h_0} = P(T \mid \hat{i}_l) \sum_h \beta_{i^l_m, h}\, \gamma_h - \sum_{h \neq h_0} \alpha_{h, i^l_m}\, \beta_{i^l_m, h}\, \gamma_h,$$
or
$$\alpha_{h_0, i^l_m} = \frac{P(T \mid \hat{i}_l) \sum_h \beta_{i^l_m, h}\, \gamma_h - \sum_{h \neq h_0} \alpha_{h, i^l_m}\, \beta_{i^l_m, h}\, \gamma_h}{\beta_{i^l_m, h_0}\, \gamma_{h_0}}$$
for any $i^l_m \neq \hat{i}_l$. There are precisely $\sum_{l=1}^{L} (N_l - 1)$ such equations, altogether equivalent to the observational constraints. Thus we can express any $P(T, H, I)$ distribution that is consistent with a given observational partition in terms of the full range of $\beta$ and $\gamma$ parameters, and a restricted number of independent $\alpha$ parameters. The rest of the proof now follows similarly to the proof of the Auxiliary Theorem and shows that, within this restricted parameter space, the set of parameters for which the (fixed) observational partition is not a refinement of the causal partition is measure zero.

7 APPENDIX: CCT EXAMPLES AND COUNTER-EXAMPLES

In Fig. 7 we provide examples of three distributions over binary variables $H$, $T$ and three-valued $I$.
The first model induces a causal partition that is a proper coarsening of the observational partition, and thus agrees with the CCT. The second model induces an observational partition that is a proper coarsening of the causal partition: the CCT implies that this is a measure-zero case and that, after fixing the observational partition, we had to carefully tweak the parameters to align the causal partition as it is. The third model induces causal and observational partitions that are incompatible; that is, neither is a coarsening of the other. This is also a measure-zero case. We provide a Tetrad (http://...) file that contains these three models at http://vision.caltech.edu/~kchalupk/code.html. It can be used to verify our observational and causal partition computations.

8 APPENDIX: PROOF OF THE COMPLETE MACRO-VARIABLE DESCRIPTION THEOREM

Theorem (Complete Macro-variable Description). The following two statements hold for $C$ and $S$ as defined in the main text:
1. $P(T \mid I) = P(T \mid C, S)$.
2. Any other variable $X$ such that $P(T \mid I) = P(T \mid X)$ has Shannon entropy $H(X) \geq H(C, S)$.

Proof. The first part follows by construction of $S$. For the second part, note that by the CCT there is a bijective correspondence between the pairs of values $(c, s)$ and the observational probabilities $P(T \mid I)$. Call this correspondence $f$, that is, $f(c, s) = P(T \mid c, s)$ and $f^{-1}(p) = (c, s \text{ s.t. } P(T \mid c, s) = p)$. Further, define $g$ as the function on $X$ with $g: x \mapsto P(T \mid x)$. But since $P(T \mid X) = P(T \mid I)$, we have $(c, s) = f^{-1}(g(x))$. That is, the value of $C$ and $S$ is a function of the value of $X$, and thus the entropy of $C$ and $S$ is no larger than the entropy of $X$.

9 APPENDIX: PREDICTIVE NON-CAUSAL INFORMATION IN CAUSAL VARIABLE C

In some cases $C$ retains predictive information that is not causal. Consider the following example: we have a causal graph consisting of three variables $\{I, T, H\}$ where the causal relations are $I \to T$ and $I \leftarrow H \to T$. All three variables are binary and we have a positive distribution over
More information232 Calculus and Structures
3 Calculus and Structures CHAPTER 17 JUSTIFICATION OF THE AREA AND SLOPE METHODS FOR EVALUATING BEAMS Calculus and Structures 33 Copyrigt Capter 17 JUSTIFICATION OF THE AREA AND SLOPE METHODS 17.1 THE
More informationProbabilistic Graphical Models Homework 1: Due January 29, 2014 at 4 pm
Probabilistic Grapical Models 10-708 Homework 1: Due January 29, 2014 at 4 pm Directions. Tis omework assignment covers te material presented in Lectures 1-3. You must complete all four problems to obtain
More informationMaterial for Difference Quotient
Material for Difference Quotient Prepared by Stepanie Quintal, graduate student and Marvin Stick, professor Dept. of Matematical Sciences, UMass Lowell Summer 05 Preface Te following difference quotient
More informationChapter 5 FINITE DIFFERENCE METHOD (FDM)
MEE7 Computer Modeling Tecniques in Engineering Capter 5 FINITE DIFFERENCE METHOD (FDM) 5. Introduction to FDM Te finite difference tecniques are based upon approximations wic permit replacing differential
More informationHOMEWORK HELP 2 FOR MATH 151
HOMEWORK HELP 2 FOR MATH 151 Here we go; te second round of omework elp. If tere are oters you would like to see, let me know! 2.4, 43 and 44 At wat points are te functions f(x) and g(x) = xf(x)continuous,
More informationLecture 15. Interpolation II. 2 Piecewise polynomial interpolation Hermite splines
Lecture 5 Interpolation II Introduction In te previous lecture we focused primarily on polynomial interpolation of a set of n points. A difficulty we observed is tat wen n is large, our polynomial as to
More informationAdaptive Neural Filters with Fixed Weights
Adaptive Neural Filters wit Fixed Weigts James T. Lo and Justin Nave Department of Matematics and Statistics University of Maryland Baltimore County Baltimore, MD 150, U.S.A. e-mail: jameslo@umbc.edu Abstract
More informationThe Laws of Thermodynamics
1 Te Laws of Termodynamics CLICKER QUESTIONS Question J.01 Description: Relating termodynamic processes to PV curves: isobar. Question A quantity of ideal gas undergoes a termodynamic process. Wic curve
More information5.1 We will begin this section with the definition of a rational expression. We
Basic Properties and Reducing to Lowest Terms 5.1 We will begin tis section wit te definition of a rational epression. We will ten state te two basic properties associated wit rational epressions and go
More informationContinuity and Differentiability Worksheet
Continuity and Differentiability Workseet (Be sure tat you can also do te grapical eercises from te tet- Tese were not included below! Typical problems are like problems -3, p. 6; -3, p. 7; 33-34, p. 7;
More informationch (for some fixed positive number c) reaching c
GSTF Journal of Matematics Statistics and Operations Researc (JMSOR) Vol. No. September 05 DOI 0.60/s4086-05-000-z Nonlinear Piecewise-defined Difference Equations wit Reciprocal and Cubic Terms Ramadan
More informationOptimal parameters for a hierarchical grid data structure for contact detection in arbitrarily polydisperse particle systems
Comp. Part. Mec. 04) :357 37 DOI 0.007/s4057-04-000-9 Optimal parameters for a ierarcical grid data structure for contact detection in arbitrarily polydisperse particle systems Dinant Krijgsman Vitaliy
More informationLIMITATIONS OF EULER S METHOD FOR NUMERICAL INTEGRATION
LIMITATIONS OF EULER S METHOD FOR NUMERICAL INTEGRATION LAURA EVANS.. Introduction Not all differential equations can be explicitly solved for y. Tis can be problematic if we need to know te value of y
More informationTHE IDEA OF DIFFERENTIABILITY FOR FUNCTIONS OF SEVERAL VARIABLES Math 225
THE IDEA OF DIFFERENTIABILITY FOR FUNCTIONS OF SEVERAL VARIABLES Mat 225 As we ave seen, te definition of derivative for a Mat 111 function g : R R and for acurveγ : R E n are te same, except for interpretation:
More informationA Reconsideration of Matter Waves
A Reconsideration of Matter Waves by Roger Ellman Abstract Matter waves were discovered in te early 20t century from teir wavelengt, predicted by DeBroglie, Planck's constant divided by te particle's momentum,
More informationSection 2: The Derivative Definition of the Derivative
Capter 2 Te Derivative Applied Calculus 80 Section 2: Te Derivative Definition of te Derivative Suppose we drop a tomato from te top of a 00 foot building and time its fall. Time (sec) Heigt (ft) 0.0 00
More informationDerivatives of trigonometric functions
Derivatives of trigonometric functions 2 October 207 Introuction Toay we will ten iscuss te erivates of te si stanar trigonometric functions. Of tese, te most important are sine an cosine; te erivatives
More informationNotes on wavefunctions II: momentum wavefunctions
Notes on wavefunctions II: momentum wavefunctions and uncertainty Te state of a particle at any time is described by a wavefunction ψ(x). Tese wavefunction must cange wit time, since we know tat particles
More informationNear-Optimal conversion of Hardness into Pseudo-Randomness
Near-Optimal conversion of Hardness into Pseudo-Randomness Russell Impagliazzo Computer Science and Engineering UC, San Diego 9500 Gilman Drive La Jolla, CA 92093-0114 russell@cs.ucsd.edu Ronen Saltiel
More information1 Calculus. 1.1 Gradients and the Derivative. Q f(x+h) f(x)
Calculus. Gradients and te Derivative Q f(x+) δy P T δx R f(x) 0 x x+ Let P (x, f(x)) and Q(x+, f(x+)) denote two points on te curve of te function y = f(x) and let R denote te point of intersection of
More information3.4 Worksheet: Proof of the Chain Rule NAME
Mat 1170 3.4 Workseet: Proof of te Cain Rule NAME Te Cain Rule So far we are able to differentiate all types of functions. For example: polynomials, rational, root, and trigonometric functions. We are
More informationOn the Identifiability of the Post-Nonlinear Causal Model
UAI 9 ZHANG & HYVARINEN 647 On te Identifiability of te Post-Nonlinear Causal Model Kun Zang Dept. of Computer Science and HIIT University of Helsinki Finland Aapo Hyvärinen Dept. of Computer Science,
More information7.1 Using Antiderivatives to find Area
7.1 Using Antiderivatives to find Area Introduction finding te area under te grap of a nonnegative, continuous function f In tis section a formula is obtained for finding te area of te region bounded between
More informationMAT 145. Type of Calculator Used TI-89 Titanium 100 points Score 100 possible points
MAT 15 Test #2 Name Solution Guide Type of Calculator Used TI-89 Titanium 100 points Score 100 possible points Use te grap of a function sown ere as you respond to questions 1 to 8. 1. lim f (x) 0 2. lim
More information1watt=1W=1kg m 2 /s 3
Appendix A Matematics Appendix A.1 Units To measure a pysical quantity, you need a standard. Eac pysical quantity as certain units. A unit is just a standard we use to compare, e.g. a ruler. In tis laboratory
More informationGeneric maximum nullity of a graph
Generic maximum nullity of a grap Leslie Hogben Bryan Sader Marc 5, 2008 Abstract For a grap G of order n, te maximum nullity of G is defined to be te largest possible nullity over all real symmetric n
More informationRecall from our discussion of continuity in lecture a function is continuous at a point x = a if and only if
Computational Aspects of its. Keeping te simple simple. Recall by elementary functions we mean :Polynomials (including linear and quadratic equations) Eponentials Logaritms Trig Functions Rational Functions
More information1 Introduction Radiative corrections can ave a significant impact on te predicted values of Higgs masses and couplings. Te radiative corrections invol
RADCOR-2000-001 November 15, 2000 Radiative Corrections to Pysics Beyond te Standard Model Clint Eastwood 1 Department of Radiative Pysics California State University Monterey Bay, Seaside, CA 93955 USA
More informationConvergence and Descent Properties for a Class of Multilevel Optimization Algorithms
Convergence and Descent Properties for a Class of Multilevel Optimization Algoritms Stepen G. Nas April 28, 2010 Abstract I present a multilevel optimization approac (termed MG/Opt) for te solution of
More informationDerivatives of Exponentials
mat 0 more on derivatives: day 0 Derivatives of Eponentials Recall tat DEFINITION... An eponential function as te form f () =a, were te base is a real number a > 0. Te domain of an eponential function
More informationCS522 - Partial Di erential Equations
CS5 - Partial Di erential Equations Tibor Jánosi April 5, 5 Numerical Di erentiation In principle, di erentiation is a simple operation. Indeed, given a function speci ed as a closed-form formula, its
More informationTechnology-Independent Design of Neurocomputers: The Universal Field Computer 1
Tecnology-Independent Design of Neurocomputers: Te Universal Field Computer 1 Abstract Bruce J. MacLennan Computer Science Department Naval Postgraduate Scool Monterey, CA 9393 We argue tat AI is moving
More informationMathematics 105 Calculus I. Exam 1. February 13, Solution Guide
Matematics 05 Calculus I Exam February, 009 Your Name: Solution Guide Tere are 6 total problems in tis exam. On eac problem, you must sow all your work, or oterwise torougly explain your conclusions. Tere
More informationDomination Problems in Nowhere-Dense Classes of Graphs
LIPIcs Leibniz International Proceedings in Informatics Domination Problems in Nowere-Dense Classes of Graps Anuj Dawar 1, Stepan Kreutzer 2 1 University of Cambridge Computer Lab, U.K. anuj.dawar@cl.cam.ac.uk
More information0.1 Differentiation Rules
0.1 Differentiation Rules From our previous work we ve seen tat it can be quite a task to calculate te erivative of an arbitrary function. Just working wit a secon-orer polynomial tings get pretty complicate
More informationProblem Solving. Problem Solving Process
Problem Solving One of te primary tasks for engineers is often solving problems. It is wat tey are, or sould be, good at. Solving engineering problems requires more tan just learning new terms, ideas and
More informationBounds on the Moments for an Ensemble of Random Decision Trees
Noname manuscript No. (will be inserted by te editor) Bounds on te Moments for an Ensemble of Random Decision Trees Amit Durandar Received: Sep. 17, 2013 / Revised: Mar. 04, 2014 / Accepted: Jun. 30, 2014
More informationRECOGNITION of online handwriting aims at finding the
SUBMITTED ON SEPTEMBER 2017 1 A General Framework for te Recognition of Online Handwritten Grapics Frank Julca-Aguilar, Harold Moucère, Cristian Viard-Gaudin, and Nina S. T. Hirata arxiv:1709.06389v1 [cs.cv]
More information7 Semiparametric Methods and Partially Linear Regression
7 Semiparametric Metods and Partially Linear Regression 7. Overview A model is called semiparametric if it is described by and were is nite-dimensional (e.g. parametric) and is in nite-dimensional (nonparametric).
More informationInvestigating Euler s Method and Differential Equations to Approximate π. Lindsay Crowl August 2, 2001
Investigating Euler s Metod and Differential Equations to Approximate π Lindsa Crowl August 2, 2001 Tis researc paper focuses on finding a more efficient and accurate wa to approximate π. Suppose tat x
More informationDedicated to the 70th birthday of Professor Lin Qun
Journal of Computational Matematics, Vol.4, No.3, 6, 4 44. ACCELERATION METHODS OF NONLINEAR ITERATION FOR NONLINEAR PARABOLIC EQUATIONS Guang-wei Yuan Xu-deng Hang Laboratory of Computational Pysics,
More informationQuantum Numbers and Rules
OpenStax-CNX module: m42614 1 Quantum Numbers and Rules OpenStax College Tis work is produced by OpenStax-CNX and licensed under te Creative Commons Attribution License 3.0 Abstract Dene quantum number.
More informationMA455 Manifolds Solutions 1 May 2008
MA455 Manifolds Solutions 1 May 2008 1. (i) Given real numbers a < b, find a diffeomorpism (a, b) R. Solution: For example first map (a, b) to (0, π/2) and ten map (0, π/2) diffeomorpically to R using
More information