Decision Tree Fields


1 Sebastian Nowozin, Carsten Rother, Shai Bagon, Toby Sharp, Bangpeng Yao, Pushmeet Kohli. Barcelona, 8th November 2011

2 Introduction. Random Fields in Computer Vision. Markov Random Fields (MRF): (Kindermann and Snell, 1980), (Li, 1995), (Blake, Kohli, Rother, 2011). Conditional Random Fields (CRF): (Lafferty, McCallum, Pereira, 2001). Structured prediction of multiple dependent variables.

3 Introduction. CRFs: How do we use them? Factor graph notation (Kschischang, Frey, Loeliger, 1998): x is the observed image; y_i and y_j are the dependent variables at pixels i and j; E(y_i, x) is a unary factor and E(y_i, y_j, x) a pairwise factor.

4 Introduction. CRFs: How do we use them? Unary energy E(y_i, x): obtained from a machine learning classifier (SVM, Boosting, Random Forests, etc.).

5 Introduction. CRFs: How do we use them? Pairwise energy E(y_i, y_j, x): generalized Potts (image-independent), or contrast-sensitive smoothing (e.g. GrabCut, TextonBoost): E(y_i, y_j, x) = \exp(-\alpha \|x_i - x_j\|^2).
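As a concrete sketch of the contrast-sensitive term (the α value and the RGB pixel vectors below are illustrative, not from the slides):

```python
import numpy as np

def contrast_sensitive_energy(x_i, x_j, alpha=0.5):
    """Contrast-sensitive pairwise weight exp(-alpha * ||x_i - x_j||^2):
    large for similar-looking neighbors (strong smoothing), small across
    an image edge (weak smoothing)."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return np.exp(-alpha * diff.dot(diff))

# Similar RGB neighbors -> weight near 1; dissimilar -> weight near 0.
print(contrast_sensitive_energy([0.2, 0.2, 0.2], [0.21, 0.20, 0.20]))
print(contrast_sensitive_energy([0.2, 0.2, 0.2], [0.9, 0.9, 0.9]))
```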

6 Introduction. CRFs: How do we use them? What about pairwise energies E(y_i, y_k, x) between more distant pixels?

7 Introduction. CRFs: How do we use them? What about higher-order energies E(y_i, y_j, y_k, x)?

8 Introduction. Decision Trees in Computer Vision. Random Forests (Breiman, MLJ 2001): non-parametric, infinite model capacity; fast inference and training, parallelizable; (Shotton et al., 2008, 2011), (Saffari et al., 2009), (Gall and Lempitsky, 2009), etc. But: no structured prediction.

9 Introduction. Contributions: 1. Learn image-dependent interactions. 2. Combine random fields and decision trees. 3. Efficient training. 4. Superior empirical performance.

10 Decision Tree Classifiers

12 Decision Trees for Image Labeling

13 Decision Trees for Image Labeling (cont). Apply the decision tree to each pixel (i, j) of the image I independently.

15 Decision Tree Field (DTF) Example. [Figure: factor graph in which every factor, unary and pairwise, is associated with a decision tree.]

16 Unary Factor Example. x: the entire observed image; y_i: the prediction at pixel i, with y_i ∈ {1, 2, 3, 4}; E(y_i, x): the energy function connecting them.


21 Unary Factor Example. The pixel is routed through the decision tree by split tests on x, and the unary energy sums the weights stored at the visited nodes: E(y_i, x) = \sum_{q \in \mathrm{Path}(x)} w(q, y_i).
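A minimal sketch of this path-sum (hypothetical node layout and split functions, not the authors' code): each node q visited on the way from the root to a leaf contributes its stored weight w(q, y_i).

```python
class Node:
    """Decision-tree node: one weight per label, plus an optional split."""
    def __init__(self, weights, split=None, left=None, right=None):
        self.weights = weights      # dict: label -> w(q, label)
        self.split = split          # split(x, i) -> bool; None at a leaf
        self.left, self.right = left, right

def unary_energy(root, x, i, label):
    """E(y_i, x): sum of w(q, y_i) over all nodes q on pixel i's path."""
    energy, node = 0.0, root
    while node is not None:
        energy += node.weights[label]
        if node.split is None:
            break
        node = node.left if node.split(x, i) else node.right
    return energy

# Tiny illustrative tree: the root tests the pixel's intensity.
leaf_l = Node({"fg": 0.2, "bg": -0.1})
leaf_r = Node({"fg": -0.3, "bg": 0.4})
root = Node({"fg": 0.0, "bg": 0.0},
            split=lambda x, i: x[i] > 0.5, left=leaf_l, right=leaf_r)
print(unary_energy(root, x=[0.7, 0.1], i=0, label="fg"))  # 0.0 + 0.2
```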

22 Pairwise Factor Example. A pairwise factor E(y_i, y_j, x) connects the labels of neighboring pixels i and j and, like the unary factor, conditions on the image x.

27 Pairwise Factor Example. The pairwise energy is again the sum of the weights along the path through its own decision tree: E(y_i, y_j, x) = \sum_{q \in \mathrm{Path}(x)} w(q, y_i, y_j).

28 Full DTF Model

E(y, x, w) = \sum_{F \in \mathcal{F}} E_{t_F}(y_F, x_F, w_{t_F}),

p(y \mid x, w) = \frac{1}{Z(x, w)} \exp(-E(y, x, w)), \quad Z(x, w) = \sum_{y \in \mathcal{Y}} \exp(-E(y, x, w)).

x: image; y: predicted labels, one for each pixel; w: weights/energies, to be learned from data. How is this different from other models? What about learning and inference?
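On real images Z(x, w) is intractable, which is what motivates the pseudo-likelihood training below; on a toy model, though, Z can be computed by brute force, which makes the Gibbs form concrete. A sketch with a hypothetical 3-pixel chain, 2 labels, and arbitrary energy tables:

```python
import itertools
import numpy as np

# Hypothetical 3-pixel chain, 2 labels; arbitrary illustrative tables.
unary = np.array([[0.1, 1.2], [0.8, 0.3], [0.5, 0.5]])  # unary[i, y_i]
pairwise = np.array([[0.0, 1.0], [1.0, 0.0]])           # pairwise[y_i, y_j]

def energy(y):
    """E(y, x, w): sum of unary and chain pairwise energies."""
    e = sum(unary[i, y[i]] for i in range(3))
    return e + sum(pairwise[y[i], y[i + 1]] for i in range(2))

# Partition function by exhaustive enumeration (2^3 labelings).
states = list(itertools.product([0, 1], repeat=3))
Z = sum(np.exp(-energy(y)) for y in states)

def p(y):
    """Gibbs distribution p(y | x, w) = exp(-E(y, x, w)) / Z(x, w)."""
    return np.exp(-energy(y)) / Z

assert abs(sum(p(y) for y in states) - 1.0) < 1e-9
```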

30 Related Work. Relationship to other models: the DTF generalizes both random forests (with learned weights) and Markov random fields. [Figure: example DTF instantiation with 2 trees.]

32 Related Work. CPT-Trees (1995): Conditional Probability Table trees (Glesner, Koller, 1995), (Boutilier et al., 1996); a decision tree over the states of random variables; limited to Bayesian networks.

33 Related Work. Dependency Networks: learn p(x_i | x_{V\{i}}) (Heckerman et al., 2000); a decision tree over the states of random variables; inference requires simulation (pseudo-Gibbs sampling of a Markov chain). Random Forest Random Fields (Payet and Todorovic, 2010): a decision tree determines the sampler; inference by Swendsen-Wang / Metropolis MCMC.

34 Learning DTFs. Given iid data {(x, y)_i}_{i=1,...,N}, we need to learn: the structure of the factor graph; the tree structures, defined by split functions; and the weight parameters in the decision nodes. For now, assume the structure and the trees are given.

36 Learning: Training. Maximum Likelihood Estimation, given ground truth y: w^* = \mathrm{argmax}_w \log p(y \mid x, w). Intractable: the likelihood (and its gradient) requires the partition function Z(x, w).

40 Learning: Training. Maximum Pseudo-Likelihood Estimation (Besag, 1974), given ground truth y:

w^* = \mathrm{argmax}_w \frac{1}{|V|} \sum_{i \in V} \log p(y_i \mid y_{V \setminus \{i\}}, x, w),

where V is the set of pixels in all images. (Details in the paper.)
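Each conditional in this sum only involves the factors that touch pixel i, which is what makes the pseudo-likelihood tractable. A minimal sketch of one such conditional, assuming illustrative data structures (a per-pixel unary energy table, one shared symmetric pairwise table, and adjacency lists), not the authors' implementation:

```python
import numpy as np

def local_conditional(i, y, labels, unary, pairwise, neighbors):
    """p(y_i | y_{V\\{i}}, x, w): only the factors touching pixel i
    matter, so the conditional is a softmax over the local energies."""
    energies = np.array([
        unary[i][lab] + sum(pairwise[lab][y[j]] for j in neighbors[i])
        for lab in labels
    ])
    p = np.exp(-(energies - energies.min()))  # shift for stability
    return p / p.sum()
```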

41 Learning: Efficient Training by Subsampling. A single training image already contains on the order of 1M pixels, i.e. 1M terms in the objective.

43 Learning: Efficient Training by Subsampling. We subsample our training set to train a structured model.

44 Learning: Training Algorithm. 1. Fix the factor graph structure. 2. For each factor type: learn a classification tree. 3. Jointly optimize the convex pseudo-likelihood objective in w (see the sketch below).
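Step 3 is a smooth convex problem, so any quasi-Newton solver applies. A sketch using SciPy's L-BFGS-B; the objective below is a stand-in, where a real DTF would return the negative log-pseudolikelihood and its gradient:

```python
import numpy as np
from scipy.optimize import minimize

def objective(w):
    """Stand-in for the negative log-pseudolikelihood: returns the
    objective value and its gradient in w (here a simple quadratic)."""
    return 0.5 * w.dot(w), w

w0 = np.zeros(10)  # step 2: all tree weights initialized to zero
result = minimize(objective, w0, method="L-BFGS-B", jac=True)
w_star = result.x
```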

45 Learning: Test-time Inference in DTFs. 1. Energy minimization (MAP), e.g. TRW-S. 2. Maximum Posterior Marginal (MPM), e.g. Gibbs sampling.
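For MPM, a Gibbs sampler suffices: resample each pixel from its local conditional, count the visited labels, and return the per-pixel majority. A sketch reusing local_conditional from above (labels assumed to be integers 0..L-1; not the authors' implementation):

```python
import numpy as np

def gibbs_mpm(y_init, n_labels, unary, pairwise, neighbors,
              sweeps=200, seed=0):
    """MPM by Gibbs sampling: resample each pixel from its local
    conditional, count visited labels, return the per-pixel majority."""
    rng = np.random.default_rng(seed)
    y = list(y_init)
    labels = list(range(n_labels))
    counts = np.zeros((len(y), n_labels))
    for _ in range(sweeps):
        for i in range(len(y)):
            p = local_conditional(i, y, labels, unary, pairwise, neighbors)
            y[i] = int(rng.choice(n_labels, p=p))
            counts[i, y[i]] += 1
    return counts.argmax(axis=1)
```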

46 Experiments. Experiment: Inpainting.

50 Experiments. Experiment: Inpainting. Training set (300 images).

51 Experiments. Experiment: Inpainting. Test set (100 images, disjoint characters).

52 Experiments. Instances: densely connected, 64 neighbors per pixel; each instance has 10k variables and 300k factors, so the energy is hard to minimize.

53 Experiments. Instances (cont). [Plot: associativity of the learned pairwise potentials as a function of the X and Y offset.]

54 Experiments. Experiment: Body-part Recognition (Shotton et al., CVPR 2011). 1500 training images, 150 test images.

55 Experiments. Experiment: Body-part Recognition (cont). Training set.

56 Experiments. Experiment: Body-part Recognition, Results.

57 Experiments. Experiment: Body-part Recognition, Results.

Model | Measure | Shotton et al. | unary | +1 | +1,20 | +1,5,20
MRF | avg-acc | | | | |
MRF | runtime | 6h34 | * | * | * | (30h)*
MRF | weights | - | 6.3M | 6.2M | 6.2M | 6.3M
DTF | avg-acc | | | | |
DTF | runtime | - | - | * | * | (40h)*
DTF | weights | | | M | 7.8M | 8.8M

61 Experiments. DTF Summary: a non-parametric CRF model for discrete image labeling tasks. Non-parametric: the model class can scale with the training set size. Scalable: can make use of large training sets. Conditional interactions: richer models without latent variables.

62 Experiments. Code will be made available after the CVPR deadline!

63 Experiments. Thank you!

64 Appendix. DTF: Linearity. E_{t_F}(y_F, x_F, w_{t_F}) can be written as a function linear in w_{t_F}:

E_{t_F}(y_F, x_F, w_{t_F}) = \sum_{q \in \mathrm{Tree}(t_F)} \sum_{z \in \mathcal{Y}_F} w_{t_F}(q, z) \, B_{t_F}(q, z; y_F, x_F),

where B_{t_F}(q, z; y_F, x_F) = 1 if q \in \mathrm{Path}(x_F) and z = y_F, and 0 otherwise. [Figure: decision tree with nodes n_0, ..., n_4 and split functions s_0, s_1.] The overall energy function is therefore linear in w, and the (pseudo-)likelihood function is log-concave. Here: not necessarily a unique maximizer.

65 Appendix. Learning the Decision Trees. How do we learn the decision trees? In an ideal world we would learn the entire model jointly; here we learn the decision trees using the common information gain criterion. Pairwise and order-k factors: treat as a classification problem over the L × L (in general L^k) joint label configurations. Although the trees are trained independently, overcounting is avoided by optimizing the weights jointly. Training summary: 1. For each factor type, train a decision tree using information gain. 2. Initialize the tree weights to zero. 3. Maximize the pseudolikelihood (using L-BFGS).

67 Appendix. Experiment: Body-part Recognition, Results. Figure: Test recognition results, MRF (top) and DTF (bottom).

68 Appendix. Experiment: Body-part Recognition, Results.

Model | Measure | Shotton et al. | unary | +1 | +1,20 | +1,5,20
MRF | avg-acc | | | | |
MRF | runtime | 6h34 | * | * | * | (30h)*
MRF | weights | - | 6.3M | 6.2M | 6.2M | 6.3M
DTF | avg-acc | | | | |
DTF | runtime | - | - | * | * | (40h)*
DTF | weights | | | M | 7.8M | 8.8M

Table: Body-part recognition results: mean per-class accuracy, training time on a single 8-core machine, and number of model parameters.

69 Appendix. Experiment: Body-part Recognition, Results. Figure: Learned horizontal interactions. Left: the mean silhouette reaching each of the 32 leaf nodes in the learned tree; one leaf (marked red) with its corresponding effective weight matrix, visualizing the most attractive (blue) and most repulsive (red) weights. Right: the label-label interactions superimposed on test images; (a) the pattern matches, (b) no match, so the interaction is inactive.

72 Appendix. Experiment: Snakes. The simplest task with conditional label-label structure. A snake consists of 10 labels from head (black) to tail (white). The image contains perfect instructions: red = go up, yellow = go right, green = go down, blue = go left. Myopic decisions are impossible (weak local evidence). Training: 200 small images. Testing: 100 small images. Features: relative pixel color tests. [Figure: input image and its labeling.]

74 Appendix. Experiment: Snakes, Results. Table: Test set accuracies (overall, tail, and mid) for the snake data set, comparing RF, Unary, MRF, and DTF.

75 Appendix. Experiment: Snakes, Results. Figure: Predictions on a novel test instance.

76 Appendix. Experiment: Snakes, Conclusion. Strong pairwise interactions help when the local evidence is weak; the pairwise interactions are strong because they condition on the image; and 200 training images are enough.

77 Appendix. Factor types in DTFs. Every factor type has: one scope, the relative set of variables it acts on; a decision tree, with split functions; and weight parameters in each node. The energy is the sum along the path of traversed nodes:

E_{t_F}(y_F, x_F, w_{t_F}) = \sum_{q \in \mathrm{Path}(x_F)} w_{t_F}(q, y_F)

[Figure: decision tree with nodes n_0, ..., n_4 and split functions s_0, s_1.]

78 Appendix. Efficient Training. Minimize in w the regularized negative log-pseudolikelihood

\ell_{npl}(w) = \frac{1}{|V|} \sum_{i \in V} \ell_i(w) - \frac{1}{|V|} \sum_t \log p_t(w_t),

with \ell_i(w) = -\log p(y_i \mid y_{V \setminus \{i\}}, x, w), where V is the set of pixels in all images.

80 Appendix. Efficient Training. The data term is an expectation over a uniformly drawn pixel index:

\ell_{npl}(w) = \mathbb{E}_{i \sim U(V)}[\ell_i(w)] - \frac{1}{|V|} \sum_t \log p_t(w_t)

81 Appendix. Efficient Training. This is a composite objective: an expectation plus a simple function. Approximate the expectation deterministically on a subset V' \subseteq V:

\ell_{npl}(w) \approx \frac{1}{|V'|} \sum_{i \in V'} \ell_i(w) - \frac{1}{|V|} \sum_t \log p_t(w_t).

MPLE enables subsampling at the variable level.
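A sketch of the resulting subsampled objective (loss_i and log_prior are hypothetical callables standing in for \ell_i(w) and the per-type prior term):

```python
import numpy as np

def subsampled_npl(w, loss_i, log_prior, n_pixels, rng, m=10_000):
    """Approximate l_npl(w): average the per-pixel losses l_i(w) over a
    random subset V' of m pixels; the prior term stays exact."""
    subset = rng.choice(n_pixels, size=min(m, n_pixels), replace=False)
    data_term = float(np.mean([loss_i(w, i) for i in subset]))
    return data_term - log_prior(w) / n_pixels
```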
