Information Theory and Feature Selection (Joint Informativeness and Tractability)


1 Information Theory and Feature Selection (Joint Informativeness and Tractability) Leonidas Lefakis Zalando Research Labs 1 / 66

2 Dimensionality Reduction: Feature Construction
Construction: $X_1, \dots, X_D \;\longmapsto\; f_1(X_1, \dots, X_D), \dots, f_k(X_1, \dots, X_D)$
Examples: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Autoencoders (Neural Networks)

4 Dimensionality Reduction: Feature Selection
Selection: $X_1, \dots, X_D \;\longmapsto\; X_{s_1}, \dots, X_{s_k}$
Approaches:
- Wrappers: features are selected relative to the performance of a specific predictor. Example: RFE-SVM.
- Embedded methods: features are selected internally while optimizing the predictor. Example: decision trees.
- Filters: features are assessed by a goodness-of-fit function $\Phi$ that is classifier agnostic. Example: correlation.

8 Information Theory
Entropy: $H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$
Joint entropy: $H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y)$
Conditional entropy: $H(Y \mid X) = \sum_{x \in \mathcal{X}} p(x)\, H(Y \mid X = x)$

9 Information Theory
Relative entropy (Kullback-Leibler divergence): $D(p \| q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}$
Mutual information: $I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} = D_{KL}\big( p(x, y) \,\|\, p(x)\,p(y) \big)$

10 Mutual Information
$I(X; Y) = H(Y) - H(Y \mid X)$: the reduction in uncertainty about $Y$ once $X$ is known.
$X \perp Y \;\Rightarrow\; I(X; Y) = 0$
$Y = f(X) \;\Rightarrow\; I(X; Y) = H(Y)$
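
To make these definitions concrete, here is a minimal numpy sketch (the joint table p_xy is a made-up example) that computes H(X), H(Y), H(Y | X) and I(X; Y) from a discrete joint distribution and checks the identity I(X; Y) = H(Y) - H(Y | X).

import numpy as np

# Hypothetical joint distribution p(x, y): rows index x, columns index y.
p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.25],
                 [0.10, 0.20]])
assert np.isclose(p_xy.sum(), 1.0)

def entropy(p):
    """Shannon entropy (in bits) of a probability array, ignoring zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x = p_xy.sum(axis=1)                      # marginal p(x)
p_y = p_xy.sum(axis=0)                      # marginal p(y)

H_X, H_Y = entropy(p_x), entropy(p_y)
H_XY = entropy(p_xy)                        # joint entropy H(X, Y)
H_Y_given_X = H_XY - H_X                    # chain rule: H(Y|X) = H(X,Y) - H(X)

# Mutual information, directly as D_KL( p(x,y) || p(x) p(y) ).
I_XY = np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y)))

assert np.isclose(I_XY, H_Y - H_Y_given_X)  # I(X;Y) = H(Y) - H(Y|X)
print(H_X, H_Y, H_Y_given_X, I_XY)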

11 Feature Selection for Classification
We will look at feature selection in the context of classification. Given features $\{X_1, \dots, X_D\} \in \mathbb{R}$ and a class variable $Y$, we wish to select a subset $S$ of size $K \ll D$ such that a predictor $f : \mathbb{R}^K \to \mathcal{Y}$ trained on this projected subspace generalizes well.

12 Entropy & Prediction
Given $(X, Y) \in \mathbb{R}^D \times \{1, \dots, C\}$ and a predictor $f(X) = \hat{Y}$, define the error variable
$E = \begin{cases} 0 & \text{if } \hat{Y} = Y \\ 1 & \text{if } \hat{Y} \neq Y \end{cases}$

13 Entropy & Prediction
Using the chain rule $H(A, B) = H(A) + H(B \mid A)$ (1) twice:
$H(E, Y \mid \hat{Y}) \stackrel{(1)}{=} H(Y \mid \hat{Y}) + \underbrace{H(E \mid Y, \hat{Y})}_{= 0} \stackrel{(1)}{=} \underbrace{H(E \mid \hat{Y})}_{\le H(E)} + H(Y \mid E, \hat{Y})$

14 Entropy & Prediction
$H(Y \mid \hat{Y}) = H(E, Y \mid \hat{Y}) \le H(E) + H(Y \mid E, \hat{Y}) \le 1 + P_e \log(|\mathcal{Y}| - 1)$
since $H(E) = H(\mathcal{B}(1, P_e)) \le H(\mathcal{B}(1, \tfrac{1}{2})) = 1$ and
$H(Y \mid E, \hat{Y}) = (1 - P_e) \underbrace{H(Y \mid E = 0, \hat{Y})}_{= 0} + P_e \underbrace{H(Y \mid E = 1, \hat{Y})}_{\le \log(|\mathcal{Y}| - 1)}$

15 Fano's Inequality
$H(Y \mid \hat{Y}) \le 1 + P_e \log(|\mathcal{Y}| - 1)$
Using $I(A; B) = H(A) - H(A \mid B)$ (2):
$H(Y) - I(Y; \hat{Y}) \le 1 + P_e \log(|\mathcal{Y}| - 1) \;\Longrightarrow\; P_e \ge \frac{H(Y) - I(Y; \hat{Y}) - 1}{\log(|\mathcal{Y}| - 1)}$

17 Data Processing Inequality
Since $Y \to X \to \hat{Y}$ is a Markov chain, the data processing inequality $Y \to X \to Z \Rightarrow I(Y; X) \ge I(Y; Z)$ (3) gives $I(Y; \hat{Y}) \le I(X; Y)$, hence
$P_e \ge \frac{H(Y) - I(X; Y) - 1}{\log(|\mathcal{Y}| - 1)}$

19 Objective
$S^* = \underset{S,\, |S| = K}{\mathrm{argmax}}\; I(X_{S(1)}, X_{S(2)}, \dots, X_{S(K)}; Y)$
This combinatorial problem is NP-hard.

21 Feature Selection: Naive
$S^* = \underset{S,\, |S| = K}{\mathrm{argmax}} \sum_{k=1}^{K} I(X_{S(k)}; Y)$
Considers the relevance of each variable individually; does not consider redundancy or joint informativeness.
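
A minimal sketch of this naive criterion, assuming numpy and scikit-learn are available: rank each feature by an estimated I(X_d; Y) and keep the top K, ignoring any redundancy among the chosen features. The synthetic data is only there to show that a near-duplicate of an informative feature is ranked just as high.

import numpy as np
from sklearn.feature_selection import mutual_info_classif

def naive_mi_selection(X, y, K):
    """Select the K features with the largest individual (estimated) I(X_d; Y)."""
    mi = mutual_info_classif(X, y)               # one MI estimate per column of X
    return np.argsort(mi)[::-1][:K]

# Toy usage with synthetic data.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
X = rng.normal(size=(500, 20))
X[:, 0] += y                                     # make feature 0 informative
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=500)  # redundant copy: still ranked high
print(naive_mi_selection(X, y, K=3))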

22 Feature Selection: Two Popular Solutions
mRMR: "Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy", H. Peng et al.
CMIM: "Fast Binary Feature Selection with Conditional Mutual Information", F. Fleuret

23 mRMR
Relevance: $D(S, Y) = \frac{1}{|S|} \sum_{X_d \in S} I(X_d; Y)$
Redundancy: $R(S) = \frac{1}{|S|^2} \sum_{X_d \in S} \sum_{X_l \in S} I(X_d; X_l)$
mRMR criterion: $\Phi(S, Y) = D(S, Y) - R(S)$

24 Forward Selection
$S_0 \leftarrow \emptyset$
for $k = 1 \dots K$ do
  $z^* \leftarrow 0$
  for $X_j \in F \setminus S_{k-1}$ do
    $S' \leftarrow S_{k-1} \cup X_j$
    $z \leftarrow \Phi(S', Y)$
    if $z > z^*$ then $z^* \leftarrow z$; $S^* \leftarrow S'$ end if
  end for
  $S_k \leftarrow S^*$
end for
return $S_K$
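
The forward-selection template above, written as a small Python sketch with the mRMR score $\Phi(S, Y) = D(S, Y) - R(S)$ plugged in; the MI estimates come from scikit-learn and simple binning, which is an assumption rather than the estimator used by Peng et al.

import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def mrmr_forward_selection(X, y, K, n_bins=10):
    """Greedy forward selection maximizing the mRMR score D(S, Y) - R(S)."""
    D = X.shape[1]
    relevance = mutual_info_classif(X, y)                  # I(X_d; Y)
    # Pairwise I(X_d; X_l), estimated on binned features.
    Xb = np.stack([np.digitize(X[:, d], np.histogram_bin_edges(X[:, d], n_bins))
                   for d in range(D)], axis=1)
    redundancy = np.array([[mutual_info_score(Xb[:, d], Xb[:, l])
                            for l in range(D)] for d in range(D)])
    selected = []
    for _ in range(K):
        best_j, best_score = None, -np.inf
        for j in range(D):
            if j in selected:
                continue
            S = selected + [j]
            rel = relevance[S].mean()                      # D(S, Y)
            red = redundancy[np.ix_(S, S)].mean()          # R(S)
            if rel - red > best_score:
                best_j, best_score = j, rel - red
        selected.append(best_j)
    return selected

This sketch precomputes the full D x D redundancy matrix, which dominates the cost; practical mRMR implementations usually estimate I(X_d; X_l) lazily, only for the already selected l.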

25 mRMR: How do we calculate $I(\cdot\,; \cdot)$?
Options: discretize; approximate the distributions with Parzen windows; use a parametric model.
Parzen estimate: $\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - x^{(i)}, h)$ with the Gaussian window
$\delta(d, h) = \frac{1}{(2\pi)^{D/2}\, h^{D}\, |\Sigma|^{1/2}} \exp\!\left( -\frac{d^T \Sigma^{-1} d}{2 h^2} \right)$
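
A small sketch of a Gaussian Parzen-window density estimate of the form above, with an isotropic window (Σ = I) and a hand-picked bandwidth h; it is checked against the known value of a standard normal density at the origin.

import numpy as np

def parzen_density(x, samples, h):
    """Gaussian Parzen-window estimate p_hat(x) = (1/N) sum_i delta(x - x_i, h)."""
    samples = np.atleast_2d(samples)
    N, D = samples.shape
    diff = samples - x                          # (N, D)
    norm = (2 * np.pi) ** (D / 2) * h ** D      # isotropic Sigma = I
    return np.mean(np.exp(-np.sum(diff ** 2, axis=1) / (2 * h ** 2)) / norm)

# Sanity check against the true standard 2D normal density at the origin, 1/(2*pi).
rng = np.random.default_rng(0)
samples = rng.normal(size=(20000, 2))
print(parzen_density(np.zeros(2), samples, h=0.2), 1 / (2 * np.pi))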

26 CMIM (Binary Features): Markov Blanket
[Figure: Markov blanket of a variable A (Wikimedia Commons)]

27 CMIM (Binary Features): Markov Blanket
For a set M of variables that forms a Markov blanket of $X_i$ we have
$p(F \setminus \{X_i, M\}, Y \mid X_i, M) = p(F \setminus \{X_i, M\}, Y \mid M)$
$\Leftrightarrow\; I(F \setminus \{X_i, M\}, Y;\, X_i \mid M) = 0$
For feature selection this is too strong; it suffices that $I(Y; X_i \mid M) = 0$.

29 CMIM (Binary Features)
$I(X_1, \dots, X_K; Y) = H(Y) - \underbrace{H(Y \mid X_1, \dots, X_K)}_{\text{intractable}}$
Instead work with conditional mutual information, which only requires distributions over triplets of variables:
$I(X_i; Y \mid X_j) = H(Y \mid X_j) - H(Y \mid X_i, X_j)$
CMIM selection:
$S_1 \leftarrow \underset{d}{\mathrm{argmax}}\; \hat{I}(X_d; Y)$
$S_k \leftarrow \underset{d}{\mathrm{argmax}}\; \big\{ \min_{l < k} \hat{I}(Y; X_d \mid X_{S_l}) \big\}$

32 Fast CMIM
$S_k \leftarrow \underset{d}{\mathrm{argmax}}\; \underbrace{\min_{l < k} \hat{I}(Y; X_d \mid X_{S_l})}_{\text{can only decrease}}$
[Figure: lazy evaluation of the partial scores]
For each candidate $n$, keep a partial score $ps[n] = \min_{l < m(n)} \hat{I}(Y; X_n \mid X_{S_l})$ together with the index $m(n)$ up to which it has been evaluated. Since the score can only decrease as more conditionals are taken into account, a candidate whose partial score already falls below the current best need not be refined further.
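
A sketch of CMIM for binary (or small-alphabet) features with this lazy-evaluation trick: each candidate keeps a partial score ps[d] and the count m[d] of selected features it already accounts for, and is refined only while it can still beat the current best. The plug-in estimators and names are illustrative, not Fleuret's reference implementation.

import numpy as np

def _entropy(counts):
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cond_mi(y, xd, xs=None):
    """Plug-in estimate of I(Y; X_d) or, if xs is given, I(Y; X_d | X_s).
    y must hold non-negative integer class labels."""
    groups = [np.ones(len(y), dtype=bool)] if xs is None else [xs == v for v in np.unique(xs)]
    mi = 0.0
    for g in groups:
        w = g.mean()
        hy = _entropy(np.bincount(y[g]))
        hy_given_xd = sum((xd[g] == v).mean() * _entropy(np.bincount(y[g][xd[g] == v]))
                          for v in np.unique(xd[g]))
        mi += w * (hy - hy_given_xd)
    return mi

def cmim(X, y, K):
    """CMIM forward selection with lazily updated partial scores."""
    D = X.shape[1]
    ps = np.array([cond_mi(y, X[:, d]) for d in range(D)])  # start with I(X_d; Y)
    m = np.zeros(D, dtype=int)       # number of selected features ps[d] accounts for
    selected = []
    for _ in range(K):
        best = -1
        for d in range(D):
            if d in selected:
                continue
            # Refine ps[d] only while it can still beat the current best candidate.
            while m[d] < len(selected) and (best < 0 or ps[d] > ps[best]):
                ps[d] = min(ps[d], cond_mi(y, X[:, d], X[:, selected[m[d]]]))
                m[d] += 1
            if best < 0 or ps[d] > ps[best]:
                best = d
        selected.append(best)
    return selected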

34 Objective
$S^* = \underset{S,\, |S| = K}{\mathrm{argmax}}\; I(X_{S(1)}, X_{S(2)}, \dots, X_{S(K)}; Y)$
CMIM, mRMR (and others) compromise joint informativeness for tractability.

35 Problem
$\{X_1, \dots, X_D\} \in \mathbb{R}$, $Y \in \{1, \dots, C\}$
$S^* = \underset{S,\, |S| = K}{\mathrm{argmax}}\; I(X_{S(1)}, X_{S(2)}, \dots, X_{S(K)}; Y)$
Calculate $I(X; Y)$ using a joint law over the selected features.

36 Information Theory (continuous variables)
Differential entropy: $h(X) = -\int_{\mathcal{X}} p(x) \log p(x)\, dx$
Relative entropy (Kullback-Leibler divergence): $D(p \| q) = \int_{\mathcal{X}} p(x) \log \frac{p(x)}{q(x)}\, dx$
Mutual information: $I(X; Y) = \int_{\mathcal{X}} \int_{\mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}\, dx\, dy$

37 Addressing Intractability
$I(X; Y) = H(X) - H(X \mid Y) = \underbrace{H(X)}_{\text{intractable}} - \sum_{y \in \mathcal{Y}} P(Y = y) \underbrace{H(X \mid Y = y)}_{\text{intractable}}$
Use a parametric model $p_{X \mid Y} = \mathcal{N}(\mu_y, \Sigma_y)$. The class-conditional term then becomes tractable:
$H(X \mid Y = y) = \frac{1}{2} \log |\Sigma_y| + \frac{n}{2} (\log 2\pi + 1)$

41 Maximum Entropy Distribution
Given $E(x) = \mu$ and $E\big((x - \mu)(x - \mu)^T\big) = \Sigma$, the multivariate normal $\mathcal{N}(\mu, \Sigma)$ is the maximum entropy distribution.

43 Addressing Intractability
Under this model, $H(X) = H\big( \sum_y \pi_y\, \mathcal{N}(\mu_y, \Sigma_y) \big)$, the entropy of a mixture of Gaussians, which has no analytical solution. We therefore upper-bound or approximate $H\big( \sum_y \pi_y\, \mathcal{N}(\mu_y, \Sigma_y) \big)$.
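
A quick numerical check of the closed-form entropy used here, assuming numpy and scipy are available: the analytic $H(X \mid Y = y) = \frac{1}{2}\log|\Sigma_y| + \frac{n}{2}(\log 2\pi + 1)$ (in nats) is compared against a Monte Carlo estimate of $-E[\log p(X)]$ for a made-up class-conditional Gaussian.

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))
Sigma_y = A @ A.T + n * np.eye(n)          # a made-up class-conditional covariance
mu_y = rng.normal(size=n)

# Closed form (in nats): H = 0.5 * log|Sigma| + (n/2) * (log(2*pi) + 1)
H_closed = 0.5 * np.linalg.slogdet(Sigma_y)[1] + 0.5 * n * (np.log(2 * np.pi) + 1)

# Monte Carlo estimate of -E[log p(X)]
dist = multivariate_normal(mean=mu_y, cov=Sigma_y)
samples = dist.rvs(size=200000, random_state=0)
H_mc = -np.mean(dist.logpdf(samples))

print(H_closed, H_mc)   # the two should agree to a few decimal places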

46 A Normal Upper Bound
$p_X = \sum_y \pi_y\, p_{X \mid Y = y}$
Let $p^* = \mathcal{N}(\mu^*, \Sigma^*)$ match the first two moments of $p_X$. By the maximum-entropy property, $H(p_X) \le H(p^*)$, hence
$I(X; Y) \le H(p^*) - \sum_y P_Y(y)\, H(X \mid Y = y)$
We also always have $I(X; Y) \le H(Y) = -\sum_y p_y \log p_y$. Combining the two:
$I(X; Y) \le \min\Big( H(p^*),\; \sum_y P_Y(y) \big( H(X \mid Y = y) - \log P_Y(y) \big) \Big) - \sum_y P_Y(y)\, H(X \mid Y = y)$

52 [Figure: entropy of the mixture $f$ and of the moment-matched Gaussian $f^*$ for two disjoint Gaussian components, as a function of the mean difference]

53 An Approximation
$p_X = \sum_y \pi_y\, p_{X \mid Y = y}$
Under the mild assumption that $\forall y,\; H(p^*) > H(p_{X \mid Y = y})$, we can use the approximation
$\tilde{I}(X; Y) = \sum_y \min\big( H(p^*),\, H(X \mid Y = y) - \log p_y \big)\, p_y \;-\; \sum_y H(X \mid Y = y)\, p_y$
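
A sketch of $\tilde{I}(X; Y)$ computed from data for a fixed feature subset under the class-conditional Gaussian model: fit $(\mu_y, \Sigma_y)$ per class, moment-match a single Gaussian $p^*$ to the mixture, and combine the class entropies exactly as above (entropies in nats; function and variable names are mine, not the paper's).

import numpy as np

def gaussian_entropy(Sigma):
    """Differential entropy of N(mu, Sigma) in nats."""
    n = Sigma.shape[0]
    return 0.5 * np.linalg.slogdet(Sigma)[1] + 0.5 * n * (np.log(2 * np.pi) + 1)

def I_tilde(X, y):
    """Approximate I(X; Y) under a class-conditional Gaussian model."""
    classes, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    mus = np.array([X[y == c].mean(axis=0) for c in classes])
    Sigmas = [np.cov(X[y == c], rowvar=False) for c in classes]
    # Moment-matched Gaussian p*: mean and covariance of the full mixture.
    mu_star = p @ mus
    Sigma_star = sum(p_c * (S + np.outer(m - mu_star, m - mu_star))
                     for p_c, m, S in zip(p, mus, Sigmas))
    H_star = gaussian_entropy(Sigma_star)
    H_cond = np.array([gaussian_entropy(S) for S in Sigmas])
    return np.sum(np.minimum(H_star, H_cond - np.log(p)) * p) - np.sum(H_cond * p)

# Toy usage: two well-separated Gaussian classes in 3 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, size=(1000, 3)),
               rng.normal(loc=3.0, size=(1000, 3))])
y = np.array([0] * 1000 + [1] * 1000)
print(I_tilde(X, y), np.log(2))   # the approximation is capped near H(Y) = log 2

Plugging I_tilde in as the score $\Phi$ of the earlier forward-selection template gives the greedy selection described next.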

54 Feature Selection Criterion
Mutual information approximation:
$S^* = \underset{S,\, |S| = K}{\mathrm{argmax}}\; \tilde{I}(X_{S(1)}, X_{S(2)}, \dots, X_{S(K)}; Y)$

55 Forward Selection
$S_0 \leftarrow \emptyset$
for $k = 1 \dots K$ do
  $z^* \leftarrow 0$
  for $X_j \in F \setminus S_{k-1}$ do
    $S' \leftarrow S_{k-1} \cup X_j$
    $z \leftarrow \hat{I}(S'; Y)$
    if $z > z^*$ then $z^* \leftarrow z$; $S^* \leftarrow S'$ end if
  end for
  $S_k \leftarrow S^*$
end for
return $S_K$
with the score
$\hat{I}(S'; Y) = \sum_y \min\big( \log|\Sigma^*|,\; \log|\Sigma_y| - \log p_y \big)\, p_y \;-\; \sum_y \log|\Sigma_y|\, p_y$

56 Complexity
At iteration k, for every candidate $X_j \in F \setminus S_{k-1}$ we need the determinant $|\Sigma_{S_{k-1} \cup X_j}|$. Computed from scratch, each determinant costs $O(k^3)$, so the greedy forward selection has overall complexity $O(|\mathcal{Y}|\, |F|\, K^4)$. However...

60 Complexity
$\Sigma_{S_{k-1} \cup X_j} = \begin{bmatrix} \Sigma_{S_{k-1}} & \Sigma_{j, S_{k-1}} \\ \Sigma_{j, S_{k-1}}^T & \sigma_j^2 \end{bmatrix}$
We can exploit the matrix determinant lemma (twice),
$|\Sigma + u v^T| = \big( 1 + v^T \Sigma^{-1} u \big)\, |\Sigma|$,
to compute each determinant in $O(n^2)$.
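
A numerical check of this update on a made-up covariance matrix: the end result of applying the lemma twice is the Schur-complement identity $|\Sigma_{S \cup j}| = |\Sigma_S|\,(\sigma_j^2 - \Sigma_{j,S}^T \Sigma_S^{-1} \Sigma_{j,S})$, which costs $O(k^2)$ per candidate once $\Sigma_S$ has been factorized.

import numpy as np

rng = np.random.default_rng(1)
k, D = 6, 12
A = rng.normal(size=(D, D))
Sigma = A @ A.T / D + np.eye(D)            # a made-up full covariance matrix

S = list(range(k))                         # indices already selected
j = 7                                      # candidate feature
Sigma_S = Sigma[np.ix_(S, S)]
b = Sigma[S, j]                            # covariances between X_j and the set S
sigma2_j = Sigma[j, j]

# Determinant of the bordered covariance from scratch: O(k^3).
det_scratch = np.linalg.det(Sigma[np.ix_(S + [j], S + [j])])

# Update: |Sigma_{S u j}| = |Sigma_S| * (sigma_j^2 - b^T Sigma_S^{-1} b).
schur = sigma2_j - b @ np.linalg.solve(Sigma_S, b)
det_update = np.linalg.det(Sigma_S) * schur

print(det_scratch, det_update)             # should match up to rounding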

62 Forward Selection
With these determinant updates the overall complexity becomes $O(|\mathcal{Y}|\, |F|\, K^3)$. Faster?

64 Faster?
$I(X, Z; Y) = I(Z; Y) + I(X; Y \mid Z)$, so
$I(S_k'; Y) = I(X_j; Y \mid S_{k-1}) + \underbrace{I(S_{k-1}; Y)}_{\text{common to all candidates}}$
$\underset{X_j \in F \setminus S_{k-1}}{\mathrm{argmax}}\; I(X_j; Y \mid S_{k-1}) = \underset{X_j \in F \setminus S_{k-1}}{\mathrm{argmax}}\; \big( H(X_j \mid S_{k-1}) - H(X_j \mid Y, S_{k-1}) \big)$

66 Conditional Entropy
$H(X_j \mid S_{k-1}) = \int_{\mathbb{R}^{|S_{k-1}|}} H(X_j \mid S_{k-1} = s)\, \mu_{S_{k-1}}(s)\, ds = \frac{1}{2} \log \sigma^2_{j \mid S_{k-1}} + \frac{1}{2} (\log 2\pi + 1)$
$\sigma^2_{j \mid S_{k-1}} = \sigma^2_j - \Sigma_{j, S_{k-1}}^T \Sigma_{S_{k-1}}^{-1} \Sigma_{j, S_{k-1}}$
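
A small check of this formula, again under the Gaussian model: H(X_j | S) computed from the conditional variance $\sigma^2_{j \mid S}$ agrees with the difference of joint Gaussian entropies $h(X_j, S) - h(S)$.

import numpy as np

def gaussian_entropy(Sigma):
    Sigma = np.atleast_2d(Sigma)
    n = Sigma.shape[0]
    return 0.5 * np.linalg.slogdet(Sigma)[1] + 0.5 * n * (np.log(2 * np.pi) + 1)

rng = np.random.default_rng(2)
A = rng.normal(size=(8, 8))
Sigma = A @ A.T / 8 + np.eye(8)

S, j = [0, 2, 5], 6
Sigma_S = Sigma[np.ix_(S, S)]
b = Sigma[S, j]

# sigma^2_{j|S} = sigma_j^2 - b^T Sigma_S^{-1} b
sigma2_cond = Sigma[j, j] - b @ np.linalg.solve(Sigma_S, b)
H_cond = 0.5 * np.log(sigma2_cond) + 0.5 * (np.log(2 * np.pi) + 1)

# Same quantity as a difference of joint Gaussian entropies.
H_diff = gaussian_entropy(Sigma[np.ix_(S + [j], S + [j])]) - gaussian_entropy(Sigma_S)

print(H_cond, H_diff)                      # should agree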

67 Updating $\sigma^2_{j \mid S_{k-1}}$
$\sigma^2_{j \mid S_{k-1}} = \sigma^2_j - \Sigma_{j, S_{k-1}}^T \Sigma_{S_{k-1}}^{-1} \Sigma_{j, S_{k-1}}$
Assume $X_i$ was chosen at iteration $k - 1$. Then
$\Sigma_{S_{k-1}} = \begin{bmatrix} \Sigma_{S_{k-2}} & \Sigma_{i, S_{k-2}} \\ \Sigma_{i, S_{k-2}}^T & \sigma_i^2 \end{bmatrix} = \begin{bmatrix} \Sigma_{S_{k-2}} & 0 \\ 0^T & \sigma_i^2 \end{bmatrix} + e_{n+1}\, \tilde{\Sigma}_{i, S_{k-2}}^T + \tilde{\Sigma}_{i, S_{k-2}}\, e_{n+1}^T$
where $\tilde{\Sigma}_{i, S_{k-2}}$ is $\Sigma_{i, S_{k-2}}$ padded with a trailing zero, i.e. two rank-one updates of a block-diagonal matrix.

68 Updating $\Sigma_{S_{k-1}}^{-1}$
From the Sherman-Morrison formula,
$(\Sigma + u v^T)^{-1} = \Sigma^{-1} - \frac{\Sigma^{-1} u v^T \Sigma^{-1}}{1 + v^T \Sigma^{-1} u}$
Applying it to the two rank-one terms above gives, with $u = \Sigma_{S_{n-2}}^{-1} \Sigma_{i, S_{n-2}}$ and $\beta \sigma_i^2 = \sigma_i^2 - \Sigma_{i, S_{n-2}}^T u$,
$\Sigma_{S_{n-1}}^{-1} = \begin{bmatrix} \Sigma_{S_{n-2}}^{-1} & -\frac{1}{\beta \sigma_i^2} u \\ -\frac{1}{\beta \sigma_i^2} u^T & \frac{1}{\beta \sigma_i^2} \end{bmatrix} + \frac{1}{\beta \sigma_i^2} \begin{bmatrix} u \\ 0 \end{bmatrix} \begin{bmatrix} u^T & 0 \end{bmatrix}$
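
A sketch of the inverse update when a feature X_i is appended to the selected set, written in the block form that the two Sherman-Morrison applications produce and checked against a from-scratch inverse; the names u and beta_sigma2 follow the notation above.

import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(9, 9))
Sigma = A @ A.T / 9 + np.eye(9)

S = [0, 1, 3, 4]                           # previously selected features
i = 6                                      # feature chosen at this iteration
Sigma_S_inv = np.linalg.inv(Sigma[np.ix_(S, S)])
b = Sigma[S, i]                            # Sigma_{i, S}
sigma2_i = Sigma[i, i]

# Block-inverse / rank-one style update of the inverse.
u = Sigma_S_inv @ b
beta_sigma2 = sigma2_i - b @ u             # the Schur complement (= beta * sigma_i^2)
k = len(S)
Sigma_new_inv = np.zeros((k + 1, k + 1))
Sigma_new_inv[:k, :k] = Sigma_S_inv + np.outer(u, u) / beta_sigma2
Sigma_new_inv[:k, k] = -u / beta_sigma2
Sigma_new_inv[k, :k] = -u / beta_sigma2
Sigma_new_inv[k, k] = 1.0 / beta_sigma2

# Check against inverting Sigma_{S u i} from scratch.
Sigma_new = Sigma[np.ix_(S + [i], S + [i])]
print(np.allclose(Sigma_new_inv, np.linalg.inv(Sigma_new)))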

70 Updating $\sigma^2_{j \mid S_{n-1}}$
Expanding $\sigma^2_{j \mid S_{n-1}} = \sigma^2_j - \Sigma_{j, S_{n-1}}^T \Sigma_{S_{n-1}}^{-1} \Sigma_{j, S_{n-1}}$ with this block form of $\Sigma_{S_{n-1}}^{-1}$, the quadratic term decomposes into $\underbrace{\Sigma_{j, S_{n-2}}^T \Sigma_{S_{n-2}}^{-1} \Sigma_{j, S_{n-2}}}_{\text{previous round}}$ plus correction terms involving only $\sigma_{ji}$, $\beta \sigma_i^2$, and the inner product $u^T \Sigma_{j, S_{n-2}}$. Reusing the previous-round value, the corrections cost $O(n)$ per candidate, so every conditional variance is refreshed in $O(n)$ instead of $O(n^2)$.

71 Faster!
With these updates the overall complexity drops to $O(|\mathcal{Y}|\, |F|\, K^2)$. Even faster?

73 Even Faster?
The main bottleneck is the $O(k)$ update $\sigma^2_{j \mid S_{k-1}} = \sigma^2_j - \Sigma_{j, S_{k-1}}^T \Sigma_{S_{k-1}}^{-1} \Sigma_{j, S_{k-1}}$, performed for every candidate. Idea: skip non-promising features.

75 Forward Selection
Idea: compute a cheap $O(1)$ score $c$ that upper-bounds the candidate's true score; if $c \le z^*$ the expensive evaluation can be skipped.
...
for $X_j \in F \setminus S_{k-1}$ do
  $S' \leftarrow S_{k-1} \cup X_j$
  compute the cheap bound $c$
  if $c > z^*$ then
    $z \leftarrow \hat{I}(S'; Y)$
    if $z > z^*$ then $z^* \leftarrow z$; $S^* \leftarrow S'$ end if
  end if
end for
...

77 An O(1) Bound
$\sigma^2_{j \mid S_{k-1}} = \sigma^2_j - \Sigma_{j, S_{k-1}}^T \Sigma_{S_{k-1}}^{-1} \Sigma_{j, S_{k-1}}$
With the eigendecomposition $\Sigma_{S_{k-1}} = U \Lambda U^T$, so that $\Sigma_{S_{k-1}}^{-1} = U \Lambda^{-1} U^T$:
$\frac{\Sigma_{j, S_{k-1}}^T \Sigma_{j, S_{k-1}}}{\max_i \lambda_i} \;\le\; \Sigma_{j, S_{k-1}}^T \Sigma_{S_{k-1}}^{-1} \Sigma_{j, S_{k-1}} \;\le\; \frac{\Sigma_{j, S_{k-1}}^T \Sigma_{j, S_{k-1}}}{\min_i \lambda_i}$
and both ends of this interval can be evaluated in $O(1)$.
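
A quick check of this bound on a made-up covariance: the quadratic form is indeed sandwiched between $\|\Sigma_{j,S}\|^2 / \lambda_{\max}$ and $\|\Sigma_{j,S}\|^2 / \lambda_{\min}$, and $\|\Sigma_{j,S}\|^2$ itself can be maintained in O(1) per round by adding one squared covariance.

import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(10, 10))
Sigma = A @ A.T / 10 + np.eye(10)

S = [0, 1, 2, 5, 8]
j = 9
Sigma_S = Sigma[np.ix_(S, S)]
b = Sigma[S, j]

quad = b @ np.linalg.solve(Sigma_S, b)     # b^T Sigma_S^{-1} b
lams = np.linalg.eigvalsh(Sigma_S)         # eigenvalues of Sigma_S
lo = b @ b / lams.max()                    # ||b||^2 / lambda_max
hi = b @ b / lams.min()                    # ||b||^2 / lambda_min

print(lo <= quad <= hi, lo, quad, hi)

The interval yields cheap optimistic and pessimistic values for $\sigma^2_{j \mid S_{k-1}}$, which is what the skipping test above needs.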

82 EigenSystem Update: Problem
Given $U^n, \Lambda^n$ such that $\Sigma^n = U^n \Lambda^n U^{nT}$, find $U^{n+1}, \Lambda^{n+1}$ such that
$\Sigma^{n+1} = \begin{bmatrix} \Sigma^n & v \\ v^T & 1 \end{bmatrix} = U^{n+1} \Lambda^{n+1} U^{(n+1)T}$
(assume $\Sigma^{n+1} \in S_{++}^{n+1}$).

84 EigenSystem Update
With $U^n = [u_1^n\; u_2^n\; \dots\; u_n^n]$ and $\Lambda^n = \mathrm{diag}(\lambda_1^n, \dots, \lambda_n^n)$, the padded matrix $\tilde{\Sigma} = \begin{bmatrix} \Sigma^n & 0 \\ 0^T & 1 \end{bmatrix}$ satisfies
$\tilde{\Sigma} \begin{bmatrix} u_i^n \\ 0 \end{bmatrix} = \lambda_i^n \begin{bmatrix} u_i^n \\ 0 \end{bmatrix}$ and $\tilde{\Sigma}\, e_{n+1} = e_{n+1}$,
so $\tilde{\Sigma} = \tilde{U} \tilde{\Lambda} \tilde{U}^T$ with $\tilde{U} = [\tilde{u}_1^n \dots \tilde{u}_n^n\; e_{n+1}]$, $\tilde{u}_i^n = \begin{bmatrix} u_i^n \\ 0 \end{bmatrix}$, and $\tilde{\Lambda} = \mathrm{diag}(\lambda_1^n, \dots, \lambda_n^n, 1)$.

87 EigenSystem Update
$\Sigma^{n+1} = \tilde{\Sigma} + e_{n+1} \begin{bmatrix} v \\ 0 \end{bmatrix}^T + \begin{bmatrix} v \\ 0 \end{bmatrix} e_{n+1}^T$, so the new eigenpairs solve
$\Big( \tilde{\Sigma} + e_{n+1} \begin{bmatrix} v \\ 0 \end{bmatrix}^T + \begin{bmatrix} v \\ 0 \end{bmatrix} e_{n+1}^T \Big)\, u^{n+1} = \lambda^{n+1}\, u^{n+1}$

88 EigenSystem Update
Multiplying by $\tilde{U}^T$, using $\tilde{U} \tilde{U}^T = I$, $\tilde{U}^T \tilde{\Sigma} \tilde{U} = \tilde{\Lambda}$ (4), $\tilde{U}^T e_{n+1} = e_{n+1}$ and setting $q = \tilde{U}^T \begin{bmatrix} v \\ 0 \end{bmatrix}$:
$\big( \tilde{\Lambda} + e_{n+1} q^T + q\, e_{n+1}^T \big)\, \tilde{U}^T u^{n+1} = \lambda^{n+1}\, \tilde{U}^T u^{n+1}$

89 EigenSystem Update
Write $\Sigma' = \tilde{\Lambda} + e_{n+1} q^T + q\, e_{n+1}^T$. Then $\Sigma^{n+1}$ and $\Sigma'$ share eigenvalues, and $U^{n+1} = \tilde{U} U'$, where $U'$ holds the eigenvectors of $\Sigma'$.

90 EigenSystem Update
The characteristic polynomial of $\Sigma'$ is
$|\Sigma' - \lambda I| = \prod_{j} (\lambda_j - \lambda) - \sum_{i < n+1} q_i^2 \prod_{j \ne i,\, j < n+1} (\lambda_j - \lambda)$
Dividing by $\prod_{j < n+1} (\lambda_j - \lambda)$ gives the secular function
$f(\lambda) = \lambda_{n+1} - \lambda + \sum_i \frac{q_i^2}{\lambda - \lambda_i}$
whose roots are the eigenvalues of $\Sigma^{n+1}$. For every $i$,
$\lim_{\lambda \to \lambda_i,\, \lambda > \lambda_i} f(\lambda) = +\infty, \qquad \lim_{\lambda \to \lambda_i,\, \lambda < \lambda_i} f(\lambda) = -\infty, \qquad \frac{\partial f(\lambda)}{\partial \lambda} = -1 - \sum_i \frac{q_i^2}{(\lambda_i - \lambda)^2} < 0$
so $f$ is strictly monotone between consecutive poles and each such interval contains exactly one new eigenvalue, which can be found by a safeguarded Newton or bisection search.
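
A sketch of the eigenvalue update via this secular function, assuming scipy is available: each new eigenvalue is bracketed either between two consecutive old eigenvalues or just outside their range, located with a bracketed root search, and compared against a full eigendecomposition. The matrix size, offsets, and tolerances are arbitrary choices.

import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(5)
n = 6
B = rng.normal(size=(n, n))
Sigma_n = B @ B.T / n + np.eye(n)              # current covariance Sigma^n
v = 0.3 * rng.normal(size=n)                   # covariances with the new feature
Sigma_np1 = np.block([[Sigma_n, v[:, None]],   # bordered matrix Sigma^{n+1}
                      [v[None, :], np.ones((1, 1))]])

lam, U = np.linalg.eigh(Sigma_n)               # old eigensystem (lam ascending)
q = U.T @ v                                    # q = U^T v (the zero last entry is dropped)
c = 1.0                                        # the appended diagonal entry

def f(x):
    # Secular function: its roots are the eigenvalues of Sigma^{n+1}.
    return c - x + np.sum(q ** 2 / (x - lam))

# One root strictly between consecutive poles, plus one below lam[0] and one above lam[-1].
eps = 1e-12
spread = np.abs(c) + np.abs(lam).max() + np.sum(q ** 2) + 1.0
edges = np.concatenate(([lam[0] - spread], lam, [lam[-1] + spread]))
new_eigs = [brentq(f, edges[i] + eps, edges[i + 1] - eps) for i in range(n + 1)]

print(np.allclose(np.sort(new_eigs), np.linalg.eigvalsh(Sigma_np1), atol=1e-6))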


93 [Figure: CPU time in seconds versus matrix size, comparing a from-scratch LAPACK eigendecomposition with the proposed update; caption: Comparison between scratch and update]

94 [Figure: $\max_i |\lambda_i^{\text{update}} - \lambda_i^{\text{Lapack}}| / \lambda_i^{\text{Lapack}}$ versus matrix size; caption: Numerical stability (eigenvalues)]

95 Set to go... Now that all the machinery is in place, how did we do?

96 Nice Results
[Table: results on CIFAR, STL, and INRIA comparing SVMLin, Fisher, FCBF, MRMR, SBMLR, t-test, InfoGain, Spectral Clustering, ReliefF, CFS, CMTF, BAHSIC, GC.E, GC.MI, GKL.E, and GKL.MI]
GC.MI was the fastest of the more complex algorithms.

97 Influence of Sample Size
We use the empirical estimate $\hat{\Sigma}_N = \frac{1}{N} P^T P$. For (sub-)Gaussian data, if $N \ge C (t/\epsilon)^2 d$ then $\|\hat{\Sigma}_N - \Sigma\| \le \epsilon$ with probability at least $1 - 2 e^{-c t^2}$.
However, the faster implementations use $\hat{\Sigma}_N^{-1}$.

99 Influence of Sample Size
The error this induces in $\hat{\Sigma}_N^{-1}$ is governed by the condition number $\kappa(\Sigma) = \|\Sigma\|\, \|\Sigma^{-1}\|$. [Figure: conditioning of the estimated covariance for d = 2048 and various values of N]

100 [Figure: test accuracy versus number of samples per class, for 25 and 50 selected features; caption: Effect of sample size on performance when using the Gaussian approximation on the CIFAR dataset]

101 Jointly Informative Feature Selection Made Tractable by Gaussian Modeling, L. Lefakis and F. Fleuret, Journal of Machine Learning Research, 17 (2016), 1-39.

102 The End Thank You! 66 / 66
