Revisiting Uncertainty in Graph Cut Solutions


Revisiting Uncertainty in Graph Cut Solutions

Daniel Tarlow
Dept. of Computer Science, University of Toronto
dtarlow@cs.toronto.edu

Ryan P. Adams
School of Engineering and Applied Sciences, Harvard University

Abstract

Graph cuts is a popular algorithm for finding the MAP assignment of many large-scale graphical models that are common in computer vision. While graph cuts is powerful, it does not provide information about the marginal probabilities associated with the solution it finds. To assess uncertainty, we are forced to fall back on less efficient and inexact inference algorithms such as loopy belief propagation, or use less principled surrogate representations of uncertainty such as the min-marginal approach of Kohli & Torr [8]. In this work, we give new justification for using min-marginals to compute the uncertainty in conditional random fields, framing the min-marginal outputs as exact marginals under a specially-chosen generative probabilistic model. We leverage this view to learn properly calibrated marginal probabilities as the result of straightforward maximization of the training likelihood, showing that the necessary subgradients can be computed efficiently using dynamic graph cut operations. We also show how this approach can be extended to compute multi-label marginal distributions, where again dynamic graph cuts enable efficient marginal inference and maximum likelihood learning. We demonstrate empirically that after proper training, uncertainties based on min-marginals provide better-calibrated probabilities than baselines, and that these distributions can be exploited in a decision-theoretic way for improved segmentation in low-level vision.

1. Introduction

Queries on random fields can be broadly classified into two types: queries for an optimum (finding a mode), and queries for a sum or integral (marginalization). In the first case, one might ask for the most likely joint configuration of the entire field. In the second class, one might ask for the marginal probability of a single variable taking some assignment. At first glance, these two types of queries may appear computationally similar; indeed, on a tree-structured graphical model they take the same amount of time. However, for some model classes there is a large discrepancy between the computational complexities of these queries. For example, when a graphical model is constrained to have binary variables and submodular interactions, the mode can be found in polynomial time using the graph cuts algorithm, while marginalization is #P-complete [7].

In computer vision, this discrepancy has contributed to a proliferation of optimization procedures centered around the graph cuts algorithm. Graph cuts are used both as a stand-alone procedure and as a subroutine for algorithms such as alpha expansion [2], the min-marginal uncertainty of [8], the message passing of [5], and the Quadratic Pseudo-Boolean Optimization algorithm [9]. Particularly given the efficient, freely available implementation of [1], graph cuts could be considered one of the most practical and powerful algorithms for inference in graphical models that is available to the computer vision practitioner. Despite the successes of graph cuts, the algorithm is of limited applicability to queries of the second broad type.
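To make the two query types concrete, the following toy sketch (not part of the paper; the energy values are made up) contrasts a mode query with a marginal query on a tiny binary chain by brute-force enumeration. Graph cuts answers the first query in polynomial time for submodular energies, while the second requires the full sum over states.

```python
import itertools
import numpy as np

# Toy energy on a 4-variable binary chain: made-up unary costs plus a
# submodular smoothness term that penalizes disagreeing neighbours.
unary = np.array([[0.0, 1.2], [0.4, 0.3], [0.9, 0.1], [0.2, 0.8]])  # unary[d, label]
smoothness = 0.5

def energy(y):
    e = sum(unary[d, y[d]] for d in range(len(y)))
    e += smoothness * sum(y[d] != y[d + 1] for d in range(len(y) - 1))
    return e

states = list(itertools.product([0, 1], repeat=4))

# Mode query: what graph cuts answers in polynomial time for submodular energies.
map_assignment = min(states, key=energy)

# Marginal query: a sum over all 2^D states, #P-hard in general.
Z = sum(np.exp(-energy(y)) for y in states)
p_y0_equals_1 = sum(np.exp(-energy(y)) for y in states if y[0] == 1) / Z

print("MAP assignment:", map_assignment)
print("p(y_0 = 1) =", p_y0_equals_1)
```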
When viewed as a method for approximating the marginal probability of a variable in a graphical model, we show in the supplementary material that the min-marginal uncertainty of [8] can be off by a factor that is exponentially large in the number of variables in the model, and we show empirically that learning using this method as approximate inference can lead to poorly calibrated estimates of marginal probabilities.

Marginal probabilities are important in many applications, including interactive segmentation, active learning, and multilabel image segmentation. They are perhaps even more important in low-level vision tasks, as random field models are often only the first component of a larger computer vision system. In this respect, it is desirable to be able to provide higher-level modules with properly calibrated probabilities, so that informed decisions can be made in a well-founded decision-theoretic framework.

To our knowledge, the only method for using graph cuts to produce probabilistic marginals is based on the work of Kohli & Torr (KT) [8]. In this paper, we hope to provide additional insight into the practice of using graph cuts to construct probabilistic models, by framing the method of KT as exact marginal inference in a model that we will elaborate on in later sections. Practically, our goal in this work is to revisit the question of how graph cuts can be used to produce proper uncertainty in random field models.

Perhaps surprisingly, we will leave the test-time inference procedure of KT unchanged, and instead develop a new training procedure that directly considers the question of how to set parameters so that the method of KT produces well-calibrated test-time marginal probabilities. We will show that with this new training procedure, graph cuts can be made to produce very good measures of uncertainty. We then show how this same concept enables us to generalize the binary graph cuts model to multi-label data.

We make several contributions. We develop theoretical underpinnings for the inference procedure of KT, showing that there is a generative probabilistic model for which their inference procedure produces exact probabilistic marginals. We show how to efficiently train this new generative model under the maximum likelihood objective, and develop an algorithm for efficiently computing subgradients using dynamic graph cuts. We develop a new model of multilabel data, where exact marginals and subgradients can be computed efficiently using dynamic graph cuts. We show empirically that our approach produces better measures of uncertainty than the method of KT and loopy belief propagation-based learning. Finally, we show that our properly calibrated marginal probabilities can be used in a decision-theoretic framework to approximately optimize test performance on the intersection-over-union (IoU) loss function, and we show empirically that this improves test performance.

2. Background

Our task is to produce a distribution over a D-dimensional space Y = {1, ..., K}^D in which each component takes one of K discrete values. In particular, this distribution should be conditioned upon a feature vector x, which takes values in X. This is known as a conditional random field (CRF) model. Our training data are N feature/label pairs, D = {x^(n), y^(n)}_{n=1}^N, with y^(n) in Y. We will proceed by constructing a model p(y | x, w) parameterized by weights w. As is typical, we will assume that the y^(n) are independent of each other, given the x^(n) and w. The classical formulation of the CRF likelihood function in this setting is to construct an energy function E(y; x, w) and use the Gibbs distribution:

p(y | x, w) = exp{-E(y; x, w)} / Z(x, w),    (1)
Z(x, w) = Σ_{y ∈ Y} exp{-E(y; x, w)}.    (2)

One natural way to formulate the problem of learning an appropriate w from the data is to maximize the log likelihood of the training data,

L(w; D) = (1/N) Σ_{n=1}^N log p(y^(n) | x^(n), w).    (3)

Optimization of Eq. 3 is often difficult due to the fact that the gradients require computing expectations which are sums over an exponentially large set of states. Various approximation schemes (e.g., [12, 3]) have been developed to attempt to grapple with this difficulty.

Given parameters, we then need to perform inference. Even restricted to the case of graph-structured submodular interactions over binary variables, computing exact probabilistic marginals is intractable due to the difficulty of computing the partition function [7]; however, MAP inference can be performed exactly in low-order polynomial time using the graph cuts algorithm, which reduces the problem to the computation of maximum flow in a network [6].

In addition to the solution to the MAP inference problem, we will also make use of quantities known as min-marginals. Whereas the value of the MAP solution is min_{ŷ ∈ Y} E(ŷ; x, w), min-marginals are defined as the value of a constrained minimization problem where a single variable y_d is clamped to take on label k, then all other variables are minimized out:

Φ_d(k) = min_{ŷ ∈ Y, ŷ_d = k} E(ŷ; x, w).    (4)
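As a concrete reference point, here is a brute-force sketch of Eq. (4) that computes every min-marginal by enumeration on a tiny made-up energy. The paper obtains the same quantities far more efficiently with dynamic graph cuts [8]; the toy energy and array layout here are illustrative assumptions only.

```python
import itertools
import numpy as np

def min_marginals(energy, D, K):
    """Brute-force Eq. (4): Phi[d, k] = min over assignments with y_d = k of energy(y).

    Enumeration is for illustration only; for binary submodular energies the same
    table is obtained far more efficiently with dynamic graph cuts [8].
    """
    Phi = np.full((D, K), np.inf)
    for y in itertools.product(range(K), repeat=D):
        e = energy(y)
        for d in range(D):
            Phi[d, y[d]] = min(Phi[d, y[d]], e)
    return Phi

# Tiny made-up 3-pixel binary example.
unary = np.array([[0.0, 0.7], [0.5, 0.2], [0.6, 0.1]])
toy_energy = lambda y: (sum(unary[d, y[d]] for d in range(3))
                        + 0.3 * sum(y[d] != y[d + 1] for d in range(2)))
print(min_marginals(toy_energy, D=3, K=2))
```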
This constrained minimization problem can also be solved efficiently using graph cuts, and the set of all min-marginals {Φ_d(k)}_{d=1:D, k=1:K} can be computed in only slightly more time than is required to solve a single graph cuts problem, by using the dynamic graph cuts approach of [8]. Kohli and Torr [8] further suggest that min-marginals can be used to produce a measure of uncertainty q by taking a softmax over the negative min-marginals:

q_d(k) = exp{-Φ_d(k)} / Σ_{k'} exp{-Φ_d(k')}.

Given these marginals, they further suggest that a CRF model can be trained by replacing the exact marginals needed for the gradient with these approximate marginals. We will evaluate this learning method in the experiments section. Finally, we will make use of assignments that we will term argmin-marginals,

η_d(k) = argmin_{ŷ ∈ Y, ŷ_d = k} E(ŷ; x, w),    (5)

which simply replaces the min in min-marginals with argmin. These also can be computed efficiently using dynamic graph cuts.
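The softmax over negative min-marginals above is straightforward to transcribe. The sketch below assumes the min-marginals have already been collected into a D x K array, which is an assumed layout rather than anything specified in the paper.

```python
import numpy as np

def min_marginal_uncertainty(Phi):
    """Per-variable label distributions q_d(k) from min-marginals (Kohli & Torr [8]).

    Phi is a (D, K) array with Phi[d, k] = Phi_d(k); returns a (D, K) array of
    probabilities, each row proportional to exp(-Phi_d(k)).
    """
    neg = -Phi
    neg = neg - neg.max(axis=1, keepdims=True)   # stabilize the softmax
    q = np.exp(neg)
    return q / q.sum(axis=1, keepdims=True)
```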

Limitations of Kohli-Torr. In the supplementary material, we discuss the worst-case behavior of KT, showing that even for pairwise graphical models, the KT-estimated marginal can differ from the Gibbs distribution marginal by a factor that has exponential dependence on the number of variables in the model.

3. Our Model

In this paper, we avoid the problem of approximating CRF marginals, and in fact avoid the problem of a complicated partition function altogether. We do this by defining the following generative model:

Φ_d(k; x, w) = min_{ŷ ∈ Y, ŷ_d = k} E(ŷ; x, w),
p(y_d = k | {Φ_d(k'; x, w)}_{k'=1}^K) = exp{-Φ_d(k; x, w)} / Σ_{k'=1}^K exp{-Φ_d(k'; x, w)}.

We are here denoting the d-th component of y and ŷ by y_d and ŷ_d, respectively. We then interpret the min-marginals as providing a fully-factorized distribution on y given x. In contrast to Gibbs-based energy models, this procedure is truly generative: we compute the min-marginals, and this gives rise to local distributions over labels. The likelihood for w is then given by

p({y^(n)}_{n=1}^N | w, {x^(n)}_{n=1}^N) = Π_{n=1}^N Π_{d=1}^D Π_{k=1}^K q_{ndk}^{δ(y_d^(n), k)},    (6)

where q_{ndk} = exp{-Φ_d^(n)(k; x, w)} / Σ_{k'} exp{-Φ_d^(n)(k'; x, w)}, and δ(·, ·) is the Kronecker delta function. This likelihood makes the nature of the model clear: we are parameterizing a large set of multinomial distributions with x and w. It simply happens that the parameters of these multinomials are the result of a set of constrained energy minima. Importantly, we can compute the q's, and thus compute these marginals efficiently, when E(y; x, w) is a binary submodular energy function, using the approach of Kohli & Torr.

For the binary model we use in much of this paper, we will assume that the weights w parameterize the energy via a sum of weighted unary and pairwise potentials:

E(y; x, w) = Σ_{f ∈ U} w_f ψ_f(y; x) + Σ_{f ∈ P} w_f ψ_f(y; x),    (7)

where U and P are the sets of unary and pairwise features, respectively. The potentials are sums over all local configurations: ψ_f(y; x) = Σ_d ψ_{f,d}(y; x) for f ∈ U and ψ_f(y; x) = Σ_{(d,d')} ψ_{f,dd'}(y; x) for f ∈ P; the local configurations have the form

ψ_{f,d}(y; x) = α_{f,d}(x) if y_d = 1, and 0 otherwise;    (8)
ψ_{f,dd'}(y; x) = β_{f,dd'}(x) if y_d ≠ y_{d'}, and 0 otherwise.    (9)

Here, α_{f,d}(x) (or β_{f,dd'}(x)) is the result at location d (or edge (d, d')) of running a predefined filter f on input x.
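A minimal sketch of the linear parameterization in Eqs. (7)-(9), assuming the filter responses have been precomputed into arrays (the names and shapes below are assumptions): each unary weight scales a per-pixel response, and each pairwise weight scales a per-edge response that is paid only when the two endpoints disagree.

```python
import numpy as np

def assemble_potentials(alpha, beta, w_unary, w_pair):
    """Fold the weighted filter responses of Eqs. (7)-(9) into per-pixel and per-edge costs.

    alpha:   (F_u, D) unary filter responses alpha_{f,d}(x)
    beta:    (F_p, E) pairwise filter responses beta_{f,dd'}(x), one column per edge
    w_unary: (F_u,) unary weights; w_pair: (F_p,) pairwise weights
    Returns theta_unary[d], the cost added when y_d = 1, and theta_pair[e],
    the cost added when the endpoints of edge e take different labels.
    """
    return w_unary @ alpha, w_pair @ beta

def binary_energy(y, edges, theta_unary, theta_pair):
    """E(y; x, w) for a 0/1 labelling y (array of length D); edges is a list of (d, d') pairs."""
    disagree = sum(t for (i, j), t in zip(edges, theta_pair) if y[i] != y[j])
    return float(theta_unary @ y) + disagree
```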
4. Maximum Likelihood Learning

As our goal is to produce well-calibrated conditional probabilities for test data, the natural training objective is to maximize the (possibly penalized) likelihood. That is, given a set of observations D = {x^(n), y^(n)}_{n=1}^N, we wish to find the MLE (or MAP estimate) of the parameters w. In this section, we show that subgradients of this objective can be computed efficiently for any model where we have efficient procedures for computing min-marginals.

In reality, images may be of different sizes. To remove the bias that larger images have a larger effect on the learning than smaller images, we rescale likelihoods and instead sum the average log likelihood of each instance. Note that if all images are of the same size, optimizing this objective is equivalent to optimizing the earlier objective Eq. 6. The objective for the n-th data instance can then be written as

L^(n)(w) = -(1/D^(n)) Σ_{d=1}^{D^(n)} [ Φ_d(y_d^(n); w, x^(n)) + log Σ_{k=1}^K exp{-Φ_d(k; w, x^(n))} ].    (10)

We are interested in the partial derivative with respect to one parameter, say w_f. Dropping superscripts n to reduce notational clutter,

∂L(w)/∂w_f = (1/D) Σ_{d=1}^D Σ_{k=1}^K [∂L(w)/∂Φ_d(k; w, x)] [∂Φ_d(k; w, x)/∂w_f].    (11)

The first term is a standard softmax derivative:

∂L(w)/∂Φ_d(k) = -δ(y_d, k) + exp{-Φ_d(k; w, x)} / Σ_{k'} exp{-Φ_d(k'; w, x)}.    (12)

To compute the second term, first expand the definition of Φ_d(k; w, x), then compute a subgradient:

∂Φ_d(k; w, x)/∂w_f = ∂/∂w_f [ min_{ŷ ∈ Y, ŷ_d = k} Σ_f w_f ψ_f(ŷ; x) ] = ψ_f(η_d(k); x),    (13)

where recall that η_d(k) = argmin_{ŷ ∈ Y, ŷ_d = k} E(ŷ; x, w) is the argmin-marginal for y_d = k. The total subgradient for one instance is then

∂L(w)/∂w_f = (1/D) Σ_{d=1}^D Σ_{k=1}^K ψ_f(η_d(k); x) [q_dk - δ(y_d, k)].    (14)

Using these gradients we can train the model to optimize the likelihood of the training data.
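Given the argmin-marginals and the model marginals, Eq. (14) reduces to a weighted feature sum. The following sketch assumes the feature values at the argmin-marginals have been precomputed into an array; the array names and shapes are assumptions for illustration.

```python
import numpy as np

def loglik_subgradient(psi_at_argmin, q, y_true):
    """Subgradient of the average log likelihood for one instance, Eq. (14).

    psi_at_argmin[f, d, k] = psi_f(eta_d(k); x), the feature value at the argmin-marginal
    q[d, k]                = softmaxed negative min-marginals (the model's marginals)
    y_true[d]              = observed label of pixel d
    Returns a vector with one entry per feature f.
    """
    D, K = q.shape
    delta = np.zeros_like(q)
    delta[np.arange(D), y_true] = 1.0
    return np.einsum('fdk,dk->f', psi_at_argmin, q - delta) / D
```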

5. Faster Computation of Subgradients

The subgradients in Eq. 14 naively take O(D^2) time to compute, which can be expensive for large images. In this section, we show how to significantly reduce this time by (a) leveraging the locality of changes within the dynamic graph cuts procedure used to compute min-marginals; (b) reordering the computation of min-marginals; and (c) distributing computation across many CPUs. The result of (a) is that computation of gradients is only a constant factor slower than computing min-marginals; (b) speeds up the computation of min-marginals and thus subgradients; and (c) allows us to easily scale to large data sets, assuming we have access to a large cluster of machines.

(a) Locality of Changes in Argmin-Marginals. The maxflow algorithm of [1] caches search trees from iteration to iteration. The only nodes that can change are ones that are orphaned (that is, their connection to the root of the search tree is severed) after an edge capacity modification or subsequent path augmentations. This list of potentially changed nodes can be stored during the graph cuts procedure (this option is available in the code of Kolmogorov [1]), and it is typically much smaller than D. So in the inner loop, we look only at potentially changed nodes. This modification makes the subgradient computations equivalent in computational cost to computing min-marginals, up to a constant factor, because the subgradient computation only considers nodes that are processed in the min-marginal computation. In Section 7, we compare the time taken using our method to the time taken using only min-marginals and confirm that this holds empirically.

(b) Ordering Min-marginal Computations. Computing min-marginals requires solving D + 1 graph cuts problems. The cost is greatly reduced by using dynamic graph cuts, but we have found experimentally that the order of problems can make a large difference in the time it takes to compute min-marginals. The strategy we use is as follows: first, compute the MAP; next, compute min-marginals for variables that take on value 0 in the MAP assignment, iterating over the variables in scanline ordering; finally, compute min-marginals for variables that take on value 1 in the MAP assignment, iterating over the variables in scanline ordering. The intuition for this order is that dynamic graph cuts is more efficient when the initial solution is closer to the final solution. If after clamping a variable y_d = 0, the neighboring variable y_{d'} is also clamped to 0, solutions will tend to be more similar than if y_{d'} = 1. This effect tends to increase as pairwise potentials become stronger.
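A small sketch of the schedule in (b), under the additional observation (not stated explicitly above, but implied by the D + 1 problem count) that the MAP solve already supplies each variable's min-marginal at its MAP label, so every remaining problem clamps one variable to the label it does not take in the MAP.

```python
def clamping_schedule(map_labels):
    """Order of the D extra graph cut problems used to collect all min-marginals (Sec. 5(b)).

    Variables whose MAP label is 0 are processed first, then those whose MAP label is 1,
    each group in scanline (row-major) order, which keeps consecutive dynamic graph cut
    problems similar to each other.  map_labels is a flat 0/1 MAP assignment in scanline
    order; returns a list of (pixel, clamp_label) pairs.
    """
    zeros_first = [d for d, lab in enumerate(map_labels) if lab == 0]
    ones_after = [d for d, lab in enumerate(map_labels) if lab == 1]
    return [(d, 1 - map_labels[d]) for d in zeros_first + ones_after]
```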
(c) Distributed Computation. Gradients can be computed for each image in parallel, enabling distribution of the learning algorithm over multiple cores. In our implementation, we use C++ to build a distributed learning system in which one master process communicates with the workers via RPC or MPI. The master sends the workers a current setting of weights, and each worker returns a vector of gradients. The master accumulates the gradients, updates weights, then sends out a new request. This process repeats until termination. This parallelization resulted in an almost linear speedup with the number of cores.

6. Tractable Multilabel Model

In the multilabel setting, MAP inference becomes NP-hard in most cases [2], so we cannot compute exact min-marginals Φ_d(k; x, θ); thus, it appears that the model presented above cannot be applied. Notice, however, that there is no requirement in our generative model that the Φ_d(k; x, θ) values correspond to exact min-marginals. We require only that they be a deterministic function of parameters, that they be efficiently computable, and that we can compute subgradients of them with respect to model parameters. In this section, we replace the intractable multilabel min-marginal calculations with a tractable surrogate.

For multilabel models, as is typical, we let there be a separate set of weights for each feature f and class k, defining, e.g., the unary potential for pixel d taking on label k as θ_d(k) = Σ_f w_f^k ψ_{f,d}(k; x). In this section, to represent a multilabel assignment for pixel d, we will use K binary variables, y_d^1, ..., y_d^K. We then define separate energy functions for each k ∈ {1, ..., K}:

E^k(y; x, θ) = Σ_{d ∈ V} θ_d^k(y_d^k) + Σ_{(d,d') ∈ E} θ_{dd'}^k(y_d^k, y_{d'}^k),    (15)

where θ_d^k(0) = 0, θ_d^k(1) = θ_d(k), and θ_{dd'}^k(y_d^k, y_{d'}^k) are pairwise potentials with different parameters per k. We can then define separate min-marginals Φ_d^k(y_d^k) = min_{ŷ: ŷ_d^k = y_d^k} E^k(ŷ; x, θ). These can be computed exactly using a graph cuts min-marginals computation for each k. Finally, we define multilabel surrogate min-marginals to be Φ_d(k) = Φ_d^k(1) - Φ_d^k(0), and then let

q_dk = exp{-Φ_d(k)} / Σ_{k̂} exp{-Φ_d(k̂)}

be the multilabel probability of pixel d taking label k. These surrogate min-marginals then have the properties that we desire: they are deterministic, efficiently computable, and we can (sub)differentiate through them. They do not correspond to min-marginals for a CRF model, but we can think of them as coming from some other generative process where exact maximum likelihood learning and marginal inference are tractable via graph cuts.

We have seen in the previous section how to derive subgradients of Φ_d^k. To derive subgradients for the multilabel model, we simply need to observe that Φ_d(k) = Φ_d^k(1) - Φ_d^k(0). Focusing on a single instance,

∂L(w)/∂w_f^k = (1/D) Σ_{d,k'} [∂L/∂Φ_d(k')] ( ∂Φ_d^{k'}(1)/∂w_f^k - ∂Φ_d^{k'}(0)/∂w_f^k ).    (16)

The first term is a standard softmax derivative, just as before.
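A sketch of the surrogate construction, assuming the K per-class binary min-marginal tables have already been computed (one graph cuts min-marginals computation per class); the input layout below is an assumption.

```python
import numpy as np

def multilabel_surrogate_marginals(phi_per_class):
    """Surrogate multilabel marginals q_dk from K one-vs-rest binary min-marginal tables.

    phi_per_class[k] is a (D, 2) array holding Phi_d^k(0) and Phi_d^k(1) for the k-th
    binary energy E^k.  The surrogate Phi_d(k) = Phi_d^k(1) - Phi_d^k(0) is softmaxed
    across k for every pixel d, as in Section 6.
    """
    phi = np.stack([p[:, 1] - p[:, 0] for p in phi_per_class], axis=1)  # (D, K)
    neg = -phi
    neg = neg - neg.max(axis=1, keepdims=True)
    q = np.exp(neg)
    return q / q.sum(axis=1, keepdims=True)
```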

For both unary and pairwise features f,

∂Φ_d^{k'}(y_d^{k'})/∂w_f^k = 1_{k=k'} ψ_f(η_d^{k'}(y_d^{k'}); x),    (17)

where η_d^{k'}(y_d^{k'}) = argmin_{ŷ: ŷ_d^{k'} = y_d^{k'}} E^{k'}(ŷ; x, θ). These subgradients can also be computed efficiently using the methods described in Section 5.

7. Experiments

Experimentally, we apply our models to image segmentation tasks and investigate three main questions. The first is how well our method can optimize the maximum likelihood objective. We compare against the learning method suggested by Kohli & Torr (KT), and against a logistic regression baseline. Second, we look at the generalization capabilities of our models, both the binary and multilabel variants. Our main evaluation measure is the probability assigned to held-out test examples, but we also look at hard predictive performance, measured in terms of test accuracy and area under the ROC curve. Third, we investigate the suitability of the marginals for driving decision-theoretic predictions in terms of expected loss.

We use 84 unary and 4 pairwise features. The unary features are simple color-based and texture-based filters, run on patches surrounding the pixel. One pairwise feature is uniformly set to 1, while the others are based on thresholded responses of the Pb boundary detector [10]. We emphasize that these features include only low-level cues. For our experiments, we use a subset of the PASCAL VOC image segmentation data. We build binary datasets by considering only images containing a given object class (e.g., airplane); the task is then to label the given object pixels as figure and all other pixels as ground. We build multilabel datasets by taking a subset of classes and only considering images that have at least one of the selected classes present. Images are scaled so the minimum dimension is 100 pixels. We focused on the Aeroplane, Car, Cow, and Dog classes, but expect results to be representative of the case where unary information is fairly weak, due to the simplicity of our input features. We believe this to be a common and important case to consider for low-level vision systems.

7.1. Evaluation of Binary Model Optimization

Recall that the test-time procedures for our method and KT are identical. Consequently, we can compare the effectiveness of training the model described in Section 3 using softmaxed negative min-marginals as approximate gradients (as in [8]) versus exact subgradients (our method). We also consider a baseline with no pairwise potentials. The likelihood evaluations of these models are focused on the case where the goal at test time is to produce a pixel-wise measure of uncertainty, as would be appropriate in, e.g., interactive image segmentation, multiscale segmentation, and the decision-theoretic prediction setting of Section 7.2.

Figure 1. Comparison of training negative log likelihoods achieved by our method (y-axis) versus (a) logistic regression and (b) the KT method (x-axis). There is one marker for each of 30 images, which were optimized independently. In all cases, we achieve better training likelihoods than the alternative methods.

Single Image Datasets. For the first experiment, we considered 30 data sets, each with a single aeroplane instance. We optimized the logistic regression model to convergence using gradient ascent, then we initialized the other two methods with the result, initially setting all pairwise weights to zero. We then ran gradient-based optimization using the (sub)gradients computed by the two methods and recorded the best objective achieved. For KT, we followed [8] and used a fixed step size that was tuned by hand but left fixed across experiments.
For our method, we used a dynamic step size decay schedule, which we found in practice to outperform various static decay schedules: we maintain a quantity f_best^(t) = min_{t' ∈ {1,...,t}} f(θ_{t'}), where f is the negative average log likelihood objective function. We then use λ f_best^(t) as an estimate of the optimal value f* at iteration t and perform a Polyak-like update, setting the step size φ_t = (f(θ_t) - λ f_best^(t)) / ||g||^2, where g is the subgradient. We chose λ = 0.95 and left it fixed across experiments. (We also experimented with dynamic step size decay schedules for the KT gradients, but we could not get them to outperform the fixed update schedule.)

Results are shown in Fig. 1. While KT always produces better likelihoods than a unary-only (logistic regression) model, its gradients are not directly optimizing this quantity (and indeed, it is unclear that there is any quantity being exactly optimized with the KT approach). When the correct gradients are used (our method), we achieve much better training likelihoods.

Full Datasets. Next, we focused on the comparison to KT and experimented with larger data sets. For each class, we constructed a training set with 48 images, and parallelized the optimizations over 17 CPUs (1 master, 16 workers). In Fig. 2, we show the best training objective achieved as a function of wall-clock time. An iteration of KT is faster than an iteration of our method (due to the fact that we need to compute argmin-marginals in addition to min-marginals), but within 1000 seconds our method overtakes KT, and then always leads to better training likelihoods.
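For reference, the Polyak-like step-size rule described above can be written as a one-line function; the surrounding optimization loop is not specified in the paper, so only the step-size computation is shown.

```python
import numpy as np

def polyak_like_step(f_current, f_best, subgrad, lam=0.95):
    """Step size phi_t = (f(theta_t) - lam * f_best) / ||g||^2 from Section 7.1.

    f_current: negative average log likelihood at the current weights
    f_best:    smallest objective value seen so far, f_best^(t)
    subgrad:   current subgradient vector g
    lam * f_best plays the role of an estimate of the optimal value f*.
    """
    return (f_current - lam * f_best) / float(np.dot(subgrad, subgrad))
```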

Figure 2. f_best versus time for learning on the full training sets.

Figure 3. Results for binary models on the Aeroplane, Car, Cow, and Dog classes: log likelihood, pixel accuracy, and AUC for each method, reported in Train (Test) format.

7.2. Evaluation of Binary Model

We then compare the results of our optimization to that of KT in terms of performance as a model of segmentation data. We constructed a test set with the remaining images (roughly 50 per class) not used for training. We observed that KT tended towards a set of weights that were different from the weights that achieved the best performance under the maximum likelihood objective. To give a better representation of the behavior, we report results for the KT model at two points: first, the weights that achieve the best training likelihood objective; second, the weights at the point to which it seemed to converge after running for longer.

Train and Test Performance. In the first set of evaluations, we report training and test performance according to three measures: average pixel likelihood, 0-1 pixel accuracy, and area under the ROC curve (AUC). Our approach is consistently best on the likelihood and AUC measures, which are the ones where a good measure of uncertainty is required, and it is competitive on pixel accuracy in all cases. In the Car data, all methods experience some overfitting, but otherwise training performance was indicative of test performance, showing that better optimization of the maximum likelihood objective led to models with better test performance. Quantitative results are shown in Fig. 3, and illustrative qualitative results are shown in Fig. 4.

Figure 4. Estimated marginal probabilities for test examples. On many examples (e.g., left), the three methods behave similarly. When they behave differently (middle and right), the baselines often become over- or under-confident, while our method is better able to produce well-calibrated probabilities.

Decision-Theoretic Predictions for IoU Score. Given properly calibrated probabilities, we can make predictions that seek to maximize expected score on the test set. Here, we take this approach and seek to optimize the intersection-over-union (IoU) score that is commonly used to evaluate image segmentations. Given a true labeling y', the score is defined as

IoU(y, y') = (1/K) Σ_{k=1}^K [ Σ_d 1{y_d = k ∧ y'_d = k} / Σ_d 1{y_d = k ∨ y'_d = k} ].

In the case of a binary model, K = 2 and the classes are foreground and background. Given our predictive distribution Q(y) = Π_{d=1}^D Π_{k=1}^K q_{dk}^{δ(y_d, k)}, the expected score for making prediction ŷ is e(ŷ) = Σ_{y'} Q(y') IoU(ŷ, y'). Since IoU does not decompose, even evaluating e(ŷ) requires a sum over exponentially many joint configurations. Instead, we define a smoothed surrogate expected score that is tractable to evaluate given a prediction ŷ:

ê(ŷ) = (1/K) Σ_{k=1}^K E_{Q(y')}[ Σ_d 1{y'_d = k ∧ ŷ_d = k} ] / E_{Q(y')}[ Σ_d 1{y'_d = k ∨ ŷ_d = k} ].
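One way to evaluate the surrogate ê(ŷ): under the fully factorized Q, the expected per-class intersection and union counts are sums of per-pixel probabilities, so each class contributes a ratio of two linear quantities. The sketch below assumes the marginals are stored as a D x K array, which is an assumed layout.

```python
import numpy as np

def surrogate_expected_iou(q, y_pred):
    """Smoothed surrogate expected IoU of a prediction under the factorized Q (Section 7.2).

    q[d, k] are the per-pixel marginals; y_pred is a length-D integer array with the
    candidate labelling.  Under a fully factorized Q the expected intersection and union
    counts for each class are closed-form sums, so the surrogate averages
    E[intersection] / E[union] over classes.
    """
    D, K = q.shape
    per_class = []
    for k in range(K):
        in_pred = (y_pred == k).astype(float)
        exp_inter = np.sum(q[:, k] * in_pred)                      # E[sum_d 1{y'_d=k and yhat_d=k}]
        exp_union = np.sum(q[:, k] + in_pred - q[:, k] * in_pred)  # E[sum_d 1{y'_d=k or  yhat_d=k}]
        per_class.append(exp_inter / exp_union if exp_union > 0 else 0.0)
    return float(np.mean(per_class))
```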

Our strategy is to initialize the prediction ŷ at the mode of Q, then to greedily hill climb in terms of ê(ŷ) until we reach a local maximum of expected score. At each step, we iterate through classes k, proposing to flip pixel d to label k, where d is the pixel that has the largest probability q_dk amongst pixels not currently labeled k. When we cycle through all k but do not make a flip, we terminate. This yields our prediction ŷ, which we evaluate under IoU(ŷ, y'). Quantitative results for this approach are shown in Fig. 5. Interestingly, even though the KT-trained model gives higher IoU scores for the initial mode prediction, our method surpasses it in all cases after running the expected score optimization. Because the KT-trained model does not produce well-calibrated probabilities, the expected loss optimization either provides little win or hurts its predictions.

Figure 5. Test results for maximizing the surrogate expected IoU score. "Before" corresponds to predicting the mode of Q; "After" is the prediction from our expected score maximization routine.

In Fig. 6, we illustrate the trajectories that the expected score optimizer takes as it performs the local ascent. Each line is for a different image, and the left-most endpoint of the line corresponds to the initialization of the optimizer. As the line moves right, the expected score increases, and ideally the true score will also increase, which would correspond to the line moving upwards. In Fig. 7, we show the change in predictions from before and after running the optimizer for three images, under our predictive distributions.

Figure 6. Trajectories as the local optimizer moves from the mode prediction (left-most point of each line segment) to the prediction that locally maximizes the surrogate expected score (right-most point). The surrogate expected score is on the x-axis, and the true score (which uses the ground truth to compute) is on the y-axis. Panels (a) and (b) show the two compared models on dog test data.

Figure 7. Expected loss optimization results on test images.

Statistical Significance. For all of the experiments in this section, we ran a bootstrap experiment, where we resampled instances with replacement, and computed the mean of each evaluation measure on each resampled set of instances. We repeated the resampling procedure 1000 times and computed the standard deviations across the resampled datasets. Tables with these error bars appear in the Supplementary Material.

7.3. Evaluation of Multilabel Model

Finally, we ran experiments on the multilabel model, and compared it to learning a CRF with loopy belief propagation (LBP) for approximate inference. We used the publicly available libDAI implementation of LBP [11], setting damping to 0.3 and using a maximum of 100 iterations. We constructed a dataset of 5 classes (the four from previously plus background), and chose 80 images evenly from the 4 foreground classes. We similarly constructed a test set of a separate 80 images. To optimize, we parallelized computation across 41 CPUs (1 master, 40 slaves), and let each algorithm run for 12 hours (nearly 500 hours of CPU time). The models produced similar test performance: LBP gave an average test IoU score of 17.2, while ours produced a score of ...; after the expected score optimization, LBP performance increased to 17.3, while ours increased to .... However, the most striking difference between the approaches was the speed and reliability of the inference routines.

While LBP was consistently more than two orders of magnitude slower, its performance got even worse as learning progressed, and we later had problems with non-convergence. Conversely, the graph cuts based inference was uniformly fast and reliable. Quantitative results are shown in Fig. 8.

                            Time (sec)    % Not Conv.
  Loopy BP (first iter.)    11.3 ± 1.7    0%
  Graph cuts (first iter.)  0.14 ± .02
  Loopy BP (last iter.)     59.1 ±        %
  Graph cuts (last iter.)   0.16 ± .02

Figure 8. Time taken for a single inference call on multilabel models, reported early and late in learning. As pairwise potentials get stronger, loopy BP gets slower and less reliable; the graph cuts inference is uniformly reliable and two orders of magnitude faster.

8. Discussion and Related Work

Our approach is a deviation from the standard strategy of defining an intractable model, then devising efficient but approximate inference routines. Instead, we take an efficient inference routine and treat it as a model. Specifically, we asked: for what model is the method of [8] an exact marginal inference routine? The answer is the model that we presented in Section 3. The power of this approach is that we can efficiently compute exact gradients of this model under the maximum likelihood objective (Section 4), so we are directly training the graph cuts inference to produce well-calibrated marginal probabilities at test time. In Section 6, we showed how to extend the ideas to multilabel problems where MAP inference is NP-hard. The key idea is to build a model composed of tractable subcomponent modules, which are as expressive as possible while still admitting efficient exact inference. We showed experimentally that this approach gives strong empirical performance.

If we look at a high level and consider works that define models around efficient computational procedures, there is some related work. [13] defines a generative probabilistic model that includes a discrete optimization procedure as the final step. [4] defines probability models around a fixed number of iterations of belief propagation. Neither of these involves a min-marginal computation, and thus the specifics are quite different, but the general spirits are similar.

At a broader level, we are addressing a low-level vision problem in this work. While low-level vision has received considerable attention in computer vision, there has not been a strong emphasis on producing properly calibrated probabilistic outputs. Our approach maintains the computational efficiency of previous surrogate measures of uncertainty, but it does so within a proper probabilistic framework. We believe this direction to be of importance going forward when building large probabilistic vision systems. There are also direct applications to multiscale image labeling, interactive image segmentation, and active learning.

Finally, our formulation is quite general, and applies to any model that can be assembled from components where min-marginals can be computed efficiently. Our multilabel model is one example of how to assemble graph cuts components. A similar approach may also be attractive in other structured output domains, such as those with bipartite matching and shortest path structures, where min-marginals can be computed efficiently [5] but where exact marginal inference in the standard CRF formulation is NP-hard.

References

[1] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE TPAMI, 26.
[2] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. In ICCV.
[3] M. A. Carreira-Perpinan and G. E. Hinton. On contrastive divergence learning. In International Conference on Artificial Intelligence and Statistics (AISTATS).
[4] J. Domke. Parameter learning with truncated message-passing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[5] J. Duchi, D. Tarlow, G. Elidan, and D. Koller. Using combinatorial optimization within max-product belief propagation. In NIPS.
[6] D. Greig, B. Porteous, and A. Seheult. Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society, 51:271-279.
[7] M. Jerrum and A. Sinclair. Polynomial-time approximation algorithms for the Ising model. SIAM Journal on Computing, 22.
[8] P. Kohli and P. H. S. Torr. Measuring uncertainty in graph cut solutions. Computer Vision and Image Understanding, 112(1):30-38.
[9] V. Kolmogorov and C. Rother. Minimizing nonsubmodular functions with graph cuts: a review. PAMI, 29(7).
[10] D. Martin, C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using brightness and texture. In NIPS.
[11] J. M. Mooij. libDAI: A free and open source C++ library for discrete approximate inference in graphical models. JMLR, 11.
[12] I. Murray and Z. Ghahramani. Bayesian learning in undirected graphical models: Approximate MCMC algorithms. In Uncertainty in Artificial Intelligence (UAI).
[13] G. Papandreou and A. Yuille. Perturb-and-MAP random fields: Using discrete optimization to learn and sample from energy models. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).


More information

arxiv: v2 [cs.ds] 11 May 2016

arxiv: v2 [cs.ds] 11 May 2016 Optimizing Star-Convex Functions Jasper C.H. Lee Paul Valiant arxiv:5.04466v2 [cs.ds] May 206 Department of Computer Science Brown University {jasperchlee,paul_valiant}@brown.eu May 3, 206 Abstract We

More information

TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE

TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS Yannick DEVILLE Université Paul Sabatier Laboratoire Acoustique, Métrologie, Instrumentation Bât. 3RB2, 8 Route e Narbonne,

More information

Lecture 2 Lagrangian formulation of classical mechanics Mechanics

Lecture 2 Lagrangian formulation of classical mechanics Mechanics Lecture Lagrangian formulation of classical mechanics 70.00 Mechanics Principle of stationary action MATH-GA To specify a motion uniquely in classical mechanics, it suffices to give, at some time t 0,

More information

arxiv:hep-th/ v1 3 Feb 1993

arxiv:hep-th/ v1 3 Feb 1993 NBI-HE-9-89 PAR LPTHE 9-49 FTUAM 9-44 November 99 Matrix moel calculations beyon the spherical limit arxiv:hep-th/93004v 3 Feb 993 J. Ambjørn The Niels Bohr Institute Blegamsvej 7, DK-00 Copenhagen Ø,

More information

TIME-DELAY ESTIMATION USING FARROW-BASED FRACTIONAL-DELAY FIR FILTERS: FILTER APPROXIMATION VS. ESTIMATION ERRORS

TIME-DELAY ESTIMATION USING FARROW-BASED FRACTIONAL-DELAY FIR FILTERS: FILTER APPROXIMATION VS. ESTIMATION ERRORS TIME-DEAY ESTIMATION USING FARROW-BASED FRACTIONA-DEAY FIR FITERS: FITER APPROXIMATION VS. ESTIMATION ERRORS Mattias Olsson, Håkan Johansson, an Per öwenborg Div. of Electronic Systems, Dept. of Electrical

More information

Equilibrium in Queues Under Unknown Service Times and Service Value

Equilibrium in Queues Under Unknown Service Times and Service Value University of Pennsylvania ScholarlyCommons Finance Papers Wharton Faculty Research 1-2014 Equilibrium in Queues Uner Unknown Service Times an Service Value Laurens Debo Senthil K. Veeraraghavan University

More information

arxiv: v5 [cs.lg] 28 Mar 2017

arxiv: v5 [cs.lg] 28 Mar 2017 Equilibrium Propagation: Briging the Gap Between Energy-Base Moels an Backpropagation Benjamin Scellier an Yoshua Bengio * Université e Montréal, Montreal Institute for Learning Algorithms March 3, 217

More information

State observers and recursive filters in classical feedback control theory

State observers and recursive filters in classical feedback control theory State observers an recursive filters in classical feeback control theory State-feeback control example: secon-orer system Consier the riven secon-orer system q q q u x q x q x x x x Here u coul represent

More information

Calculus of Variations

Calculus of Variations 16.323 Lecture 5 Calculus of Variations Calculus of Variations Most books cover this material well, but Kirk Chapter 4 oes a particularly nice job. x(t) x* x*+ αδx (1) x*- αδx (1) αδx (1) αδx (1) t f t

More information

Generalizing Kronecker Graphs in order to Model Searchable Networks

Generalizing Kronecker Graphs in order to Model Searchable Networks Generalizing Kronecker Graphs in orer to Moel Searchable Networks Elizabeth Boine, Babak Hassibi, Aam Wierman California Institute of Technology Pasaena, CA 925 Email: {eaboine, hassibi, aamw}@caltecheu

More information

Lecture 2: Correlated Topic Model

Lecture 2: Correlated Topic Model Probabilistic Moels for Unsupervise Learning Spring 203 Lecture 2: Correlate Topic Moel Inference for Correlate Topic Moel Yuan Yuan First of all, let us make some claims about the parameters an variables

More information

Optimal Signal Detection for False Track Discrimination

Optimal Signal Detection for False Track Discrimination Optimal Signal Detection for False Track Discrimination Thomas Hanselmann Darko Mušicki Dept. of Electrical an Electronic Eng. Dept. of Electrical an Electronic Eng. The University of Melbourne The University

More information

TAYLOR S POLYNOMIAL APPROXIMATION FOR FUNCTIONS

TAYLOR S POLYNOMIAL APPROXIMATION FOR FUNCTIONS MISN-0-4 TAYLOR S POLYNOMIAL APPROXIMATION FOR FUNCTIONS f(x ± ) = f(x) ± f ' (x) + f '' (x) 2 ±... 1! 2! = 1.000 ± 0.100 + 0.005 ±... TAYLOR S POLYNOMIAL APPROXIMATION FOR FUNCTIONS by Peter Signell 1.

More information

ON THE OPTIMALITY SYSTEM FOR A 1 D EULER FLOW PROBLEM

ON THE OPTIMALITY SYSTEM FOR A 1 D EULER FLOW PROBLEM ON THE OPTIMALITY SYSTEM FOR A D EULER FLOW PROBLEM Eugene M. Cliff Matthias Heinkenschloss y Ajit R. Shenoy z Interisciplinary Center for Applie Mathematics Virginia Tech Blacksburg, Virginia 46 Abstract

More information

Gaussian processes with monotonicity information

Gaussian processes with monotonicity information Gaussian processes with monotonicity information Anonymous Author Anonymous Author Unknown Institution Unknown Institution Abstract A metho for using monotonicity information in multivariate Gaussian process

More information

Role of parameters in the stochastic dynamics of a stick-slip oscillator

Role of parameters in the stochastic dynamics of a stick-slip oscillator Proceeing Series of the Brazilian Society of Applie an Computational Mathematics, v. 6, n. 1, 218. Trabalho apresentao no XXXVII CNMAC, S.J. os Campos - SP, 217. Proceeing Series of the Brazilian Society

More information

Chapter 4. Electrostatics of Macroscopic Media

Chapter 4. Electrostatics of Macroscopic Media Chapter 4. Electrostatics of Macroscopic Meia 4.1 Multipole Expansion Approximate potentials at large istances 3 x' x' (x') x x' x x Fig 4.1 We consier the potential in the far-fiel region (see Fig. 4.1

More information

A Unified Approach for Learning the Parameters of Sum-Product Networks

A Unified Approach for Learning the Parameters of Sum-Product Networks A Unifie Approach for Learning the Parameters of Sum-Prouct Networks Han Zhao Machine Learning Dept. Carnegie Mellon University han.zhao@cs.cmu.eu Pascal Poupart School of Computer Science University of

More information

Construction of the Electronic Radial Wave Functions and Probability Distributions of Hydrogen-like Systems

Construction of the Electronic Radial Wave Functions and Probability Distributions of Hydrogen-like Systems Construction of the Electronic Raial Wave Functions an Probability Distributions of Hyrogen-like Systems Thomas S. Kuntzleman, Department of Chemistry Spring Arbor University, Spring Arbor MI 498 tkuntzle@arbor.eu

More information

Qubit channels that achieve capacity with two states

Qubit channels that achieve capacity with two states Qubit channels that achieve capacity with two states Dominic W. Berry Department of Physics, The University of Queenslan, Brisbane, Queenslan 4072, Australia Receive 22 December 2004; publishe 22 March

More information

Optimization of a point-mass walking model using direct collocation and sequential quadratic programming

Optimization of a point-mass walking model using direct collocation and sequential quadratic programming Optimization of a point-mass walking moel using irect collocation an sequential quaratic programming Chris Dembia June 5, 5 Telescoping actuator y Stance leg Point-mass boy m (x,y) Swing leg x Leg uring

More information

Chapter 9 Method of Weighted Residuals

Chapter 9 Method of Weighted Residuals Chapter 9 Metho of Weighte Resiuals 9- Introuction Metho of Weighte Resiuals (MWR) is an approimate technique for solving bounary value problems. It utilizes a trial functions satisfying the prescribe

More information

This module is part of the. Memobust Handbook. on Methodology of Modern Business Statistics

This module is part of the. Memobust Handbook. on Methodology of Modern Business Statistics This moule is part of the Memobust Hanbook on Methoology of Moern Business Statistics 26 March 2014 Metho: Balance Sampling for Multi-Way Stratification Contents General section... 3 1. Summary... 3 2.

More information

d dx But have you ever seen a derivation of these results? We ll prove the first result below. cos h 1

d dx But have you ever seen a derivation of these results? We ll prove the first result below. cos h 1 Lecture 5 Some ifferentiation rules Trigonometric functions (Relevant section from Stewart, Seventh Eition: Section 3.3) You all know that sin = cos cos = sin. () But have you ever seen a erivation of

More information

Calculus in the AP Physics C Course The Derivative

Calculus in the AP Physics C Course The Derivative Limits an Derivatives Calculus in the AP Physics C Course The Derivative In physics, the ieas of the rate change of a quantity (along with the slope of a tangent line) an the area uner a curve are essential.

More information

Optimal Variable-Structure Control Tracking of Spacecraft Maneuvers

Optimal Variable-Structure Control Tracking of Spacecraft Maneuvers Optimal Variable-Structure Control racking of Spacecraft Maneuvers John L. Crassiis 1 Srinivas R. Vaali F. Lanis Markley 3 Introuction In recent years, much effort has been evote to the close-loop esign

More information

Separation of Variables

Separation of Variables Physics 342 Lecture 1 Separation of Variables Lecture 1 Physics 342 Quantum Mechanics I Monay, January 25th, 2010 There are three basic mathematical tools we nee, an then we can begin working on the physical

More information

State estimation for predictive maintenance using Kalman filter

State estimation for predictive maintenance using Kalman filter Reliability Engineering an System Safety 66 (1999) 29 39 www.elsevier.com/locate/ress State estimation for preictive maintenance using Kalman filter S.K. Yang, T.S. Liu* Department of Mechanical Engineering,

More information

On Characterizing the Delay-Performance of Wireless Scheduling Algorithms

On Characterizing the Delay-Performance of Wireless Scheduling Algorithms On Characterizing the Delay-Performance of Wireless Scheuling Algorithms Xiaojun Lin Center for Wireless Systems an Applications School of Electrical an Computer Engineering, Purue University West Lafayette,

More information

Lyapunov Functions. V. J. Venkataramanan and Xiaojun Lin. Center for Wireless Systems and Applications. School of Electrical and Computer Engineering,

Lyapunov Functions. V. J. Venkataramanan and Xiaojun Lin. Center for Wireless Systems and Applications. School of Electrical and Computer Engineering, On the Queue-Overflow Probability of Wireless Systems : A New Approach Combining Large Deviations with Lyapunov Functions V. J. Venkataramanan an Xiaojun Lin Center for Wireless Systems an Applications

More information