Discriminative Learning of Sum-Product Networks. Robert Gens Pedro Domingos

Size: px

Start display at page:

Download "Discriminative Learning of Sum-Product Networks. Robert Gens Pedro Domingos"

Anis Barker
5 years ago
Views:

1 Discriminative Learning of Sum-Product Networks Robert Gens Pedro Domingos

2 X1 X1 X1 X1 X2 X2 X2 X2 X3 X3 X3 X3 X4 X4 X4 X4 X5 X5 X5 X5 X6 X6 X6 X6

3 Distributions X 1 X 1 X 1 X 1 X 2 X 2 X 2 X 2 X 3 X 3 X 3 X 3 X 4 X 4 X 4 X 4 X 5 X 5 X 5 X 5 X 6 X 6 X 6 X 6

4 Features X 1 X 1 X 1 X 1 X 2 X 2 X 2 X 2 X 3 X 3 X 3 X 3 X 4 X 4 X 4 X 4 X 5 X 5 X 5 X 5 X 6 X 6 X 6 X 6

5 Features X 1 X 1 X 1 X 1 X 2 X 2 X 2 X 2 X 3 X 3 X 3 X 3 X 4 X 4 X 4 X 4 X 5 X 5 X 5 X 5 X 6 X 6 X 6 X 6

6 Mixtures X 1 X 1 X 1 X 1 X 2 X 2 X 2 X 2 X 3 X 3 X 3 X 3 X 4 X 4 X 4 X 4 X 5 X 5 X 5 X 5 X 6 X 6 X 6 X 6

7 Parts X 1 X 1 X 1 X 1 X 2 X 2 X 2 X 2 X 3 X 3 X 3 X 3 X 4 X 4 X 4 X 4 X 5 X 5 X 5 X 5 X 6 X 6 X 6 X 6

8 Parts X 1 X 1 X 1 X 1 X 2 X 2 X 2 X 2 X 3 X 3 X 3 X 3 X 4 X 4 X 4 X 4 X 5 X 5 X 5 X 5 X 6 X 6 X 6 X 6

9 Pooling X 1 X 1 X 1 X 1 X 2 X 2 X 2 X 2 X 3 X 3 X 3 X 3 X 4 X 4 X 4 X 4 X 5 X 5 X 5 X 5 X 6 X 6 X 6 X 6

10 !! Z???! w max x + + f (X) 1 f (X) 2 + X Y max max max max Y 1 Y 1 Y 2 Y 2 Motivation SPN Discriminative Experiments Review Training

11 !! Z???! w max x + + f (X) 1 f (X) 2 + X Y max max max max Y 1 Y 1 Y 2 Y 2 Motivation SPN Discriminative Experiments Review Training

12 Graphical Models!? Z???! SPNs perform fast, exact inference on high treewidth models

13 Deep Architectures!?!?! Z????!!? SPNs have full probabilistic semantics and tractable inference over many layers

14 Discriminative Learning f(x) f(x) f(x) f(x) f(x) f(x) f(x) f(x)!? Z(X)???! f(x) f(x) f(x) f(x) f(x) f(x) f(x) f(x) SPNs combine features with fast, exact inference over high treewidth models NIPS 12

15 !! Z???! w max x + + f (X) 1 f (X) 2 + X Y max max max max Y 1 Y 1 Y 2 Y 2 Motivation SPN Discriminative Experiments Review Training

16 X1 X1 X1 X1 X2 X2 X2 X2 X3 X3 X3 X3 X4 X4 X4 X4 X5 X5 X5 X5 X6 X6 X6 X6

17 A Univariate Distribution Is an SPN. X... Multinomial Gaussian Poisson

18 A Product of SPNs over Disjoint Variables Is an SPN. X Y

19 A Weighted Sum of SPNs over the Same Variables Is an SPN. w 1 w 2 Sums out a mixture variable X Y X Y

20 X1 X1 X1 X1 X2 X2 X2 X2 X3 X3 X3 X3 X4 X4 X4 X4 X5 X5 X5 X5 X6 X6 X6 X6

21 All Marginals Are Computable in Linear Time X Y

22 All Marginals Are Computable in Linear Time w 1 w 2 X Y X Y

23 All Marginals Are Computable in Linear Time X Y X Y

24 All Marginals Are Computable in Linear Time? X Y X Y

25 All Marginals Are Computable in Linear Time? X Y X Y Evidence Evidence

26 All Marginals Are Computable in Linear Time? X Y X Y Evidence Marginalize Evidence Marginalize

27 All Marginals Are Computable in Linear Time = X Y X Y Evidence Marginalize Evidence Marginalize

28 All MAP States Are Computable in Linear Time? max X Y X Y

29 All MAP States Are Computable in Linear Time? max X Y X Y Evidence Evidence

30 All MAP States Are Computable in Linear Time? max X Y X Y Evidence Mode Evidence Mode

31 All MAP States Are Computable in Linear Time? max X Y X Y Evidence Mode Evidence Mode

32 All MAP States Are Computable in Linear Time = 0.12 max X Y X Y Evidence Mode Evidence Mode

33 All MAP States Are Computable in Linear Time = 0.12 max X Y X Y Evidence Mode Evidence Mode

34 Special Cases of SPNs Junction trees Hierarchical mixture models Non-recursive probabilistic context-free grammars Models with context-specific independence Models with determinism Other high-treewidth models

35 Compactly Representable Probability Distributions Graphical Models Sum-Product Networks Existing Tractable Models

36 Learning SPNs Update Soft Inference (Marginals) Hard Inference (MAP States) Gen. EM Gen. Gradient Disc. Gradient Poon & Domingos, UAI 2011

37 !! Z???! w max x + + f (X) 1 f (X) 2 + X Y max max max max Y 1 Y 1 Y 2 Y 2 Motivation SPN Discriminative Experiments Review Training

38 Discriminative SPNs Y Query H Hidden X Evidence Treat as constants

39 Discriminative SPNs H Hidden Y 1 Y 1 Y 2 Y 2 Y Query X Evidence

40 Discriminative SPNs H Hidden Y 1 Y 1 Y 2 Y 2 Y Query f (X) 1 f (X) 2 f(x) Features (non-negative) X Evidence

41 Discriminative SPNs H Hidden Y 1 Y 1 Y 2 Y 2 Y Query f (X) 1 f (X) 2 f(x) Features (non-negative) X Evidence

42 Discriminative SPNs H Hidden Greater variety than 0.1 generative SPNs Y 1 Y 1 Y 2 Y 2 Y Query f (X) 1 f (X) 2 f(x) Features (non-negative) X Evidence

43 Discriminative Training Tractable! Correct label Best guess

44 SPN Backpropagation f (X) f (X) Y 1 Y 1 Y 2 Y 2

45 SPN Backpropagation f (X) f (X) Y 1 Y 1 Y 2 Y 2

46 SPN Backpropagation f (X) f (X) Y 1 Y 1 Y 2 Y 2

47 SPN Backpropagation X) f (X)

48 SPN Backpropagation

49 SPN Backpropagation For each child j: w n n

50 SPN Backpropagation For each + n Y k2ch(n)\{j} S k 0.9 f (X) 1?

51 SPN Backpropagation For each + n Y k2ch(n)\{j} S k 0.9 f (X)

52 Problem with Backpropagation Gradient diffusion

53 Hard Inference Overcomes Gradient Diffusion Soft Inference (Marginals) Hard Inference (MAP States)

54 Reasons to Use Hard Inference To overcome gradient diffusion When goal is to predict most probable structure For speed or tractability

55 Hard Gradient Correct label Best guess

56 Hard Gradient -

57 Hard Gradient -

58 Hard Gradient max max f 1(X) f (X) - f (X) f (X) 2 max max max max max max max max 1 2 Y Y Y Y Y Y Y Y

59 Hard Gradient max max f (X) 1 - f (X) 1 max max max max Y 1 Y 2 Y 1 Y 2

60 Hard Gradient max f (X) 1 max max max max f (X) 2 Y Y Y Y

61 Hard Gradient max f (X) 1 f (X) 2 # w/ correct label # w/ model guess max max max max Y Y Y Y

62 Learning SPNs: Summary S k w i S i Update Gen. EM Soft Inference (Marginals) w i / k Hard Inference (MAP States) w i = c i Gen. Gradient Disc. Gradient w i = ( w i true label k S i k S exp. label S w i = c k ) w i = w i ( true c i w i test c i )

63 Learning SPNs: Summary S k w i S i Update Gen. EM Soft Inference (Marginals) w i / k Hard Inference (MAP States) w i = c i Gen. Gradient Disc. Gradient w i = ( w i true label k S i k S exp. label S w i = c k ) w i = w i ( true c i w i test c i )

64 Learning SPNs: Summary S k w i S i Update Gen. EM Soft Inference (Marginals) w i / k Hard Inference (MAP States) w i = c i Gen. Gradient Disc. Gradient w i = ( w i true label k S i k S exp. label S w i = c k ) w i = w i ( true c i w i test c i )

65 !! Z???! w max x + + f (X) 1 f (X) 2 + X Y max max max max Y 1 Y 1 Y 2 Y 2 Motivation SPN Discriminative Experiments Review Training

66 Image Classification CIFAR-10 32x32px 50k train 10k test STL-10 96x96px 5k train } 10 folds 8k test 100k unlabeled

67 Feature Extraction Coates et al., AISTATS x6 K-means Triangle encoding K K Max-pooling 32x32 27x27xK GxGxK

68 SPN Architecture Classes + Parts x Mixture + Location + e ~x ij ~f cpt WxWxK GxGxK

69 CIFAR-10 Results 84% SPN 80% Pooling Accuracy 76% 72% SVM Autoenc. 68% RBM 64% Dictionary Size (K) 4x4xK

70 CIFAR-10 Results 20x fewer 84% than SVM 80% SPN Pooling Accuracy 76% 72% SVM SPN models spatial 68% structure among features 64% Dictionary Size (K)

71 CIFAR-10 Results 7x7x400 84% SPN Accuracy 83% 82% 81% Best published Learned Pooling 3-Layer Learned RF 80% 79% SVM 0k 38k 75k 113k 150k #Features

72 STL-10 results 1-layer Vector Quantization 54.9% 1-layer Sparse Coding 59.0% 3-layer Learned Receptive Field Discriminative SPN 60.1% 62.3% 52% 55% 58% 61% 64% without unlabeled data

73 Future Work Max-margin SPNs Learning SPN structure Applying discriminative SPNs to structured prediction Approximate inference using SPNs

74 Poster W15 Summary Discriminative SPNs combine the advantages of Tractable inference Deep architectures Discriminative learning Hard gradient combats diffusion in deep models Discriminative SPNs outperform SVMs and deep models on image classification benchmarks

Sum-Product Networks: A New Deep Architecture

Sum-Product Networks: A New Deep Architecture Pedro Domingos Dept. Computer Science & Eng. University of Washington Joint work with Hoifung Poon 1 Graphical Models: Challenges Bayesian Network Markov Network