Information Theory and Feature Selection (Joint Informativeness and Tractability)
Information Theory and Feature Selection (Joint Informativeness and Tractability)
Leonidas Lefakis, Zalando Research Labs
Dimensionality Reduction: Feature Construction

Construction: X_1, ..., X_D → f_1(X_1, ..., X_D), ..., f_k(X_1, ..., X_D)

Examples: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Autoencoders (Neural Networks).
Dimensionality Reduction: Feature Selection

Selection: X_1, ..., X_D → X_{s_1}, ..., X_{s_k}

Approaches:
- Wrappers: features are selected relative to the performance of a specific predictor. Example: RFE-SVM.
- Embedded methods: features are selected internally while optimizing the predictor. Example: decision trees.
- Filters: features are assessed with a goodness-of-fit function Φ that is classifier agnostic. Example: correlation.
Information Theory

Entropy: H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)

Joint entropy: H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y)

Conditional entropy: H(Y | X) = \sum_{x \in \mathcal{X}} p(x) H(Y | X = x)
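As a quick numerical sketch of these definitions (not from the slides; the joint distribution below is made up), the entropies can be computed from a joint table and checked against the chain rule H(X, Y) = H(X) + H(Y | X):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector (0 log 0 := 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Hypothetical joint distribution p(x, y) over two binary variables.
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])

H_XY = entropy(pxy.ravel())        # joint entropy H(X, Y)
H_X = entropy(pxy.sum(axis=1))     # marginal entropy H(X)
H_Y_given_X = H_XY - H_X           # chain rule: H(Y | X) = H(X, Y) - H(X)
print(round(H_XY, 4), round(H_X, 4), round(H_Y_given_X, 4))
```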
Information Theory

Relative entropy (Kullback-Leibler divergence): D(p \| q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}

Mutual information: I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) p(y)} = D_{KL}(p(x, y) \| p(x) p(y))
Mutual Information

I(X; Y) = D_{KL}(p(x, y) \| p(x) p(y)) = H(Y) - H(Y | X): the reduction in uncertainty about Y once X is known.

X ⊥ Y ⇒ I(X; Y) = 0; Y = f(X) ⇒ I(X; Y) = H(Y).
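A small illustration of these identities (not from the slides): mutual information computed directly from a joint table. An independent pair gives I = 0, and a deterministic Y = X gives I = H(Y).

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) ), in bits."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log2(pxy[mask] / (px * py)[mask])))

# Independent variables -> I(X;Y) = 0.
print(mutual_information(np.outer([0.5, 0.5], [0.3, 0.7])))        # -> 0.0
# Deterministic Y = X -> I(X;Y) = H(Y) = 1 bit.
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))      # -> 1.0
```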
Feature Selection for Classification

We will look at feature selection in the context of classification. Given features {X_1, ..., X_D} ⊆ R and a class variable Y, we wish to select a subset S of size K << D such that a predictor f: R^K → Y trained on this projected subspace generalizes well.
Entropy & Prediction

Given X → Y with X ∈ R^D, Y ∈ {1, ..., C}, and a predictor f(X) = Ŷ, define the error variable

E = 0 if Ŷ = Y, and E = 1 if Ŷ ≠ Y.
Entropy & Prediction

Using the chain rule H(A, B) = H(A) + H(B | A):

H(E, Y | Ŷ) = H(Y | Ŷ) + H(E | Y, Ŷ) = H(Y | Ŷ)   (E is determined by Y and Ŷ, so H(E | Y, Ŷ) = 0)
H(E, Y | Ŷ) = H(E | Ŷ) + H(Y | E, Ŷ) ≤ H(E) + H(Y | E, Ŷ)
Entropy & Prediction

H(Y | Ŷ) = H(E, Y | Ŷ) ≤ H(E) + H(Y | E, Ŷ) ≤ 1 + P_e \log(|Y| - 1), because

H(E) = H(B(1, P_e)) ≤ H(B(1, 1/2)) = 1
H(Y | E, Ŷ) = (1 - P_e) H(Y | E = 0, Ŷ) + P_e H(Y | E = 1, Ŷ), with H(Y | E = 0, Ŷ) = 0 and H(Y | E = 1, Ŷ) ≤ \log(|Y| - 1).
Fano's Inequality & the Data Processing Inequality

H(Y | Ŷ) ≤ 1 + P_e \log(|Y| - 1)

Using I(A; B) = H(A) - H(A | B):

H(Y) - I(Y; Ŷ) ≤ 1 + P_e \log(|Y| - 1)  ⇒  P_e ≥ \frac{H(Y) - I(Y; Ŷ) - 1}{\log(|Y| - 1)}

By the data processing inequality, the Markov chain Y → X → Ŷ implies I(Y; Ŷ) ≤ I(Y; X), hence

P_e ≥ \frac{H(Y) - 1 - I(X; Y)}{\log(|Y| - 1)}
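As a quick numerical illustration of the resulting bound (the numbers below are made up), the Fano lower bound on the error can be evaluated directly:

```python
import numpy as np

# Fano lower bound on the error: P_e >= (H(Y) - I(X;Y) - 1) / log2(|Y| - 1).
def fano_error_lower_bound(H_Y, I_XY, n_classes):
    if n_classes <= 2:
        raise ValueError("log(|Y|-1) vanishes for binary Y; bound needs |Y| > 2")
    return max(0.0, (H_Y - I_XY - 1.0) / np.log2(n_classes - 1))

# Hypothetical 10-class problem, uniform prior: H(Y) = log2(10) bits,
# and selected features carrying I(X;Y) = 1.5 bits.
print(round(fano_error_lower_bound(np.log2(10), 1.5, 10), 4))
```

The larger the mutual information I(X; Y) retained by the selected features, the smaller this unavoidable error floor, which is the motivation for the maximal-MI objective that follows.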
Objective

S* = \argmax_{S, |S| = K} I(X_{S(1)}, X_{S(2)}, ..., X_{S(K)}; Y)

This combinatorial problem is NP-hard.
Feature Selection: Naive Approach

S* = \argmax_{S, |S| = K} \sum_{k=1}^{K} I(X_{S(k)}; Y)

This considers the relevance of each variable individually; it considers neither redundancy nor joint informativeness.
Feature Selection: Two Popular Solutions

mRMR: "Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy", H. Peng et al.
CMIM: "Fast Binary Feature Selection with Conditional Mutual Information", F. Fleuret
mRMR

Relevance: D(S, Y) = \frac{1}{|S|} \sum_{X_d \in S} I(X_d; Y)
Redundancy: R(S) = \frac{1}{|S|^2} \sum_{X_d \in S} \sum_{X_l \in S} I(X_d; X_l)
mRMR criterion: Φ(S, Y) = D(S, Y) - R(S)
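A sketch of the criterion on made-up MI estimates (all numbers below are hypothetical): a relevant-but-redundant pair scores lower than a diverse pair.

```python
import numpy as np

def mrmr_score(S, y_mi, pair_mi):
    """Phi(S, Y) = relevance D(S, Y) - redundancy R(S) for an index set S.
    y_mi[d] = I(X_d; Y); pair_mi[d, l] = I(X_d; X_l) (precomputed estimates)."""
    S = list(S)
    D = y_mi[S].mean()                       # mean relevance over S
    R = pair_mi[np.ix_(S, S)].mean()         # mean pairwise redundancy over S x S
    return D - R

# Hypothetical precomputed MI estimates for 3 features:
# features 0 and 1 are both relevant but highly redundant with each other.
y_mi = np.array([0.9, 0.8, 0.3])
pair_mi = np.array([[1.0, 0.7, 0.1],
                    [0.7, 1.0, 0.1],
                    [0.1, 0.1, 1.0]])
print(round(mrmr_score([0, 1], y_mi, pair_mi), 4),   # redundant pair
      round(mrmr_score([0, 2], y_mi, pair_mi), 4))   # diverse pair scores higher
```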
Forward Selection

S_0 = ∅
for k = 1 ... K do
    z* = 0
    for X_j ∈ F \ S_{k-1} do
        S' = S_{k-1} ∪ {X_j}
        z = Φ(S', Y)
        if z > z* then
            z* = z; S* = S'
        end if
    end for
    S_k = S*
end for
return S_K
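The greedy loop above can be sketched generically in Python; the toy score function is made up for illustration (in practice one plugs in Φ, or the approximations Ĩ introduced later):

```python
def forward_selection(features, K, score):
    """Greedy forward selection: at each step add the candidate feature that
    maximizes score(S + [x]) over the remaining pool."""
    S = []
    remaining = set(features)
    for _ in range(K):
        best, best_z = None, float("-inf")
        for x in remaining:
            z = score(S + [x])
            if z > best_z:
                best, best_z = x, z
        S.append(best)
        remaining.remove(best)
    return S

# Toy score (hypothetical): prefer sets with small total index.
print(forward_selection([3, 1, 4, 2], K=2, score=lambda S: -sum(S)))  # -> [1, 2]
```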
mRMR: How Do We Estimate I(·;·)?

Options: discretize; approximate the distributions with Parzen windows; or fit a parametric model.

Parzen estimate: p̂(x) = \frac{1}{N} \sum_{i=1}^{N} δ(x - x^{(i)}, h), with Gaussian kernel δ(dx, h) ∝ \exp\left(-\frac{dx^T Σ^{-1} dx}{2 h^2}\right).
CMIM (Binary Features): Markov Blanket

[Figure: the Markov blanket of a variable A (source: Wikimedia Commons).]
CMIM (Binary Features): Markov Blanket

For a set M of variables that forms a Markov blanket of X_i we have

p(F \ {X_i, M}, Y | X_i, M) = p(F \ {X_i, M}, Y | M), i.e. I(F \ {X_i, M}, Y; X_i | M) = 0.

For feature selection this condition is too strong; it suffices that I(Y; X_i | M) = 0.
CMIM (Binary Features)

I(X_1, ..., X_K; Y) = H(Y) - H(Y | X_1, ..., X_K), where the conditional term is intractable.

Instead, condition on one feature at a time: I(X_i; Y | X_j) = H(Y | X_j) - H(Y | X_i, X_j), which only requires distributions over triplets of variables.

Selection rule:
S_1 = \argmax_d Î(X_d; Y)
S_k = \argmax_d \min_{l < k} Î(Y; X_d | X_{S_l})
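The selection rule can be sketched on a made-up table of (conditional) MI estimates; the matrix `cmi` and its values are hypothetical stand-ins for the estimates Î:

```python
import numpy as np

def cmim(cmi, K):
    """CMIM forward selection sketch. cmi[d, l] = estimated I(Y; X_d | X_l)
    for d != l, and cmi[d, d] = I(Y; X_d). A candidate's score is its *worst
    case* conditional MI against every feature already picked (min over S)."""
    D = cmi.shape[0]
    S = [int(np.argmax(np.diag(cmi)))]          # S_1 = argmax_d I(X_d; Y)
    while len(S) < K:
        scores = np.array([min(cmi[d, l] for l in S) if d not in S else -np.inf
                           for d in range(D)])
        S.append(int(np.argmax(scores)))
    return S

# Hypothetical estimates: feature 1 is informative but redundant given 0,
# so the weaker-but-complementary feature 2 is picked second.
cmi = np.array([[0.9, 0.2, 0.5],
                [0.1, 0.8, 0.6],
                [0.4, 0.4, 0.5]])
print(cmim(cmi, K=2))  # -> [0, 2]
```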
Fast CMIM

S_k = \argmax_d \min_{l < k} Î(Y; X_d | X_{S_l}) — the inner minimum can only decrease as S grows.

Keep for each candidate n a partial score ps[n] = \min_{l < m(n)} Î(Y; X_n | X_{S_l}), where m(n) counts the selected features it has been checked against; since ps[n] can only decrease, a candidate only needs further updates while its partial score still exceeds the current best.
Objective

S* = \argmax_{S, |S| = K} I(X_{S(1)}, X_{S(2)}, ..., X_{S(K)}; Y)

CMIM, mRMR (and others) compromise joint informativeness for tractability.
Problem

Given {X_1, ..., X_D} ⊆ R and Y ∈ {1, ..., C}:

S* = \argmax_{S, |S| = K} I(X_{S(1)}, X_{S(2)}, ..., X_{S(K)}; Y)

Calculate I(X; Y) using a joint law.
Information Theory (Continuous Variables)

Differential entropy: h(X) = -\int_{\mathcal{X}} p(x) \log p(x) \, dx

Relative entropy (Kullback-Leibler divergence): D(p \| q) = \int_{\mathcal{X}} p(x) \log \frac{p(x)}{q(x)} \, dx

Mutual information: I(X; Y) = \int_{\mathcal{X}} \int_{\mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) p(y)} \, dx \, dy
Addressing Intractability

I(X; Y) = H(X) - H(X | Y) = H(X) - \sum_{y} P(Y = y) H(X | Y = y); both terms are intractable in general.

Parametric model: taking p_{X|Y=y} = N(μ_y, Σ_y) makes the class-conditional term tractable:

H(X | Y = y) = \frac{1}{2} \log |Σ_y| + \frac{n}{2} (\log 2π + 1)
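The closed form above, as a short sketch in Python (natural log, so entropies are in nats; the 1-D case reduces to the familiar 0.5·log(2πe·σ²)):

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy (nats) of N(mu, Sigma):
    H = 0.5 * log|Sigma| + (n/2) * (log(2*pi) + 1)."""
    cov = np.atleast_2d(cov)
    n = cov.shape[0]
    sign, logdet = np.linalg.slogdet(cov)    # stable log-determinant
    assert sign > 0, "covariance must be positive definite"
    return 0.5 * logdet + 0.5 * n * (np.log(2 * np.pi) + 1)

# 1-D sanity check: H(N(0, s^2)) = 0.5 * log(2 * pi * e * s^2).
print(round(gaussian_entropy(np.array([[4.0]])), 4))
```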
Maximum Entropy Distribution

Given E(x) = μ and E((x - μ)(x - μ)^T) = Σ, the multivariate normal x ~ N(μ, Σ) is the maximum entropy distribution.
Addressing Intractability

With the class-conditional terms tractable, the remaining term is

H(X) = H\left(\sum_y π_y N(μ_y, Σ_y)\right),

the entropy of a mixture of Gaussians, which has no analytical solution. We therefore either upper bound or approximate it.
A Normal Upper Bound

p_X = \sum_y π_y p_{X|Y=y}. Let p* ~ N(μ*, Σ*) be the Gaussian with the mixture's mean and covariance; by the maximum entropy property, H(p_X) ≤ H(p*).

Hence I(X; Y) ≤ H(p*) - \sum_y P_y H(X | Y = y).

Also, I(X; Y) ≤ H(Y) = -\sum_y p_y \log p_y. Combining the two:

I(X; Y) ≤ \min\left(H(p*), \sum_y P_y (H(X | Y = y) - \log P_y)\right) - \sum_y P_y H(X | Y = y)
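A sketch of this bound under the Gaussian model (the moment matching for μ*, Σ* is the standard mixture identity; the two-class example is made up). When the classes coincide the mixture is a single Gaussian and the bound is zero, as it should be:

```python
import numpy as np

def mi_upper_bound(pis, mus, covs):
    """Upper bound on I(X;Y): H(p*) - sum_y pi_y H(X|Y=y), where p* is the
    moment-matched Gaussian of the mixture, further capped by H(Y). In nats."""
    pis = np.asarray(pis, dtype=float)
    mus = [np.asarray(m, dtype=float) for m in mus]
    covs = [np.atleast_2d(c) for c in covs]
    n = mus[0].shape[0]

    def H(cov):  # Gaussian differential entropy
        return 0.5 * np.linalg.slogdet(cov)[1] + 0.5 * n * (np.log(2 * np.pi) + 1)

    mu_star = sum(p * m for p, m in zip(pis, mus))
    # Sigma* = sum_y pi_y (Sigma_y + (mu_y - mu*)(mu_y - mu*)^T)
    sigma_star = sum(p * (c + np.outer(m - mu_star, m - mu_star))
                     for p, m, c in zip(pis, mus, covs))
    bound = H(sigma_star) - sum(p * H(c) for p, c in zip(pis, covs))
    H_Y = -np.sum(pis * np.log(pis))
    return min(bound, H_Y)

# Two identical classes -> mixture is a single Gaussian -> bound is 0.
print(round(mi_upper_bound([0.5, 0.5], [np.zeros(1), np.zeros(1)],
                           [np.eye(1), np.eye(1)]), 6))  # -> 0.0
```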
[Figure: entropy of the mixture f versus its Gaussian upper bound f* as the mean difference between the components grows.]
An Approximation

p_X = \sum_y π_y p_{X|Y=y}. Under mild assumptions (∀y, H(p*) > H(p_{X|Y=y})) we can use the approximation

Ĩ(X; Y) = \sum_y \min(H(p*), H(X | Y = y) - \log p_y) \, p_y - \sum_y H(X | Y = y) \, p_y
Feature Selection Criterion

Mutual information approximation: S* = \argmax_{S, |S| = K} Ĩ(X_{S(1)}, X_{S(2)}, ..., X_{S(K)}; Y)
Forward Selection

Same greedy loop as before, with z ← Ĩ(S'; Y) as the score, where (up to shared constants, by the Gaussian entropy formula)

Ĩ(S'; Y) ∝ \sum_y \min(\log |Σ*|, \log |Σ_y| - \log p_y) \, p_y - \sum_y \log |Σ_y| \, p_y
Complexity

At iteration k, for each candidate j ∈ F \ S_{k-1} we need the determinant |Σ_{S_{k-1} ∪ {X_j}}|. Computing each determinant from scratch costs O(k^3).
Forward Selection

With the same greedy loop, the overall complexity is O(|Y| |F| K^4). However...
Complexity

Σ_{S_{k-1} ∪ {X_j}} =
[ Σ_{S_{k-1}}         Σ_{j,S_{k-1}} ]
[ Σ_{j,S_{k-1}}^T     σ_j^2         ]

We can exploit the matrix determinant lemma (twice),

|Σ + u v^T| = (1 + v^T Σ^{-1} u) |Σ|,

to compute each determinant in O(k^2).
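A quick numerical check of the O(k²) determinant update. Applying the determinant lemma twice to the bordered matrix is equivalent to the block-determinant (Schur complement) identity used below; the covariance is randomly generated for illustration:

```python
import numpy as np

# |Sigma_{S + j}| = |Sigma_S| * (sigma_j^2 - b^T Sigma_S^{-1} b),  b = Sigma_{j,S}.
# Given Sigma_S^{-1} (or a solve), this is O(k^2) per candidate instead of O(k^3).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
full = A @ A.T + 5 * np.eye(5)       # hypothetical SPD covariance of S + {j}

S = full[:4, :4]                     # covariance of the current set S
b = full[:4, 4]                      # cross-covariances Sigma_{j,S}
sigma2 = full[4, 4]                  # variance of the candidate feature
det_update = np.linalg.det(S) * (sigma2 - b @ np.linalg.solve(S, b))
print(np.isclose(det_update, np.linalg.det(full)))  # -> True
```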
Forward Selection

With the determinant updates, the overall complexity drops to O(|Y| |F| K^3). Faster?
Faster?

I(X, Z; Y) = I(Z; Y) + I(X; Y | Z), so

I(S'_k; Y) = I(X_j; Y | S_{k-1}) + I(S_{k-1}; Y),

where the second term is common to all candidates. Hence

\argmax_{X_j ∈ F \setminus S_{k-1}} I(X_j; Y | S_{k-1}) = \argmax_{X_j ∈ F \setminus S_{k-1}} \left( H(X_j | S_{k-1}) - H(X_j | Y, S_{k-1}) \right)
Conditional Entropy

H(X_j | S_{k-1}) = \int_{R^{|S_{k-1}|}} H(X_j | S_{k-1} = s) \, μ_{S_{k-1}}(s) \, ds = \frac{1}{2} \log σ^2_{j|S_{k-1}} + \frac{1}{2} (\log 2π + 1)

with the conditional variance σ^2_{j|S_{k-1}} = σ_j^2 - Σ_{j,S_{k-1}}^T Σ_{S_{k-1}}^{-1} Σ_{j,S_{k-1}}.
Updating σ^2_{j|S_{k-1}}

σ^2_{j|S_{k-1}} = σ_j^2 - Σ_{j,S_{k-1}}^T Σ_{S_{k-1}}^{-1} Σ_{j,S_{k-1}}

Assume X_i was chosen at iteration k-1. Then

Σ_{S_{k-1}} =
[ Σ_{S_{k-2}}         Σ_{i,S_{k-2}} ]
[ Σ_{i,S_{k-2}}^T     σ_i^2         ]
= [ Σ_{S_{k-2}}  0 ; 0^T  σ_i^2 ] + e_{n+1} Σ̃_{i,S_{k-2}}^T + Σ̃_{i,S_{k-2}} e_{n+1}^T,

i.e. two rank-one corrections of a block-diagonal matrix (Σ̃_{i,S_{k-2}} denotes Σ_{i,S_{k-2}} padded with a final zero).
Updating Σ^{-1}_{S_{k-1}}

From the Sherman-Morrison formula,

(Σ + u v^T)^{-1} = Σ^{-1} - \frac{Σ^{-1} u v^T Σ^{-1}}{1 + v^T Σ^{-1} u}.

Writing b = Σ_{i,S_{n-2}}, u = Σ^{-1}_{S_{n-2}} b and β σ_i^2 = σ_i^2 - b^T u (the Schur complement), the updated inverse has the block form

Σ^{-1}_{S_{n-1}} =
[ Σ^{-1}_{S_{n-2}}         -u / (β σ_i^2) ]
[ -u^T / (β σ_i^2)          1 / (β σ_i^2) ]
+ \frac{1}{β σ_i^2} [u; 0] [u^T, 0].

Plugging this into σ^2_{j|S_{n-1}} = σ_j^2 - Σ_{j,S_{n-1}}^T Σ^{-1}_{S_{n-1}} Σ_{j,S_{n-1}} and reusing the previous round's quadratic form Σ_{j,S_{n-2}}^T Σ^{-1}_{S_{n-2}} Σ_{j,S_{n-2}} (already computed), the remaining corrections are inner products with u and cost only O(n) per candidate.
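A quick check of the Sherman-Morrison identity that underlies these O(n) updates; the matrix and vectors below are a made-up, deterministic example:

```python
import numpy as np

# Sherman-Morrison: (A + u v^T)^{-1} = A^{-1} - (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u)
A = np.diag([2.0, 3.0, 4.0, 5.0])   # hypothetical SPD matrix
u = np.ones(4)
v = np.ones(4)

Ainv = np.linalg.inv(A)
denom = 1.0 + v @ Ainv @ u          # here 1 + 1/2 + 1/3 + 1/4 + 1/5 (nonzero)
updated = Ainv - (Ainv @ np.outer(u, v) @ Ainv) / denom
print(np.allclose(updated, np.linalg.inv(A + np.outer(u, v))))  # -> True
```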
Faster!

With the O(n) variance updates, the overall complexity becomes O(|Y| |F| K^2). Even faster?
Even Faster?

The main bottleneck is the O(k) update of σ^2_{j|S_{k-1}} = σ_j^2 - Σ_{j,S_{k-1}}^T Σ_{S_{k-1}}^{-1} Σ_{j,S_{k-1}} for every candidate. Idea: skip non-promising features.
Forward Selection with a Cheap Bound

Compute a cheap O(1) score c that upper bounds the true score: if c ≤ z*, the candidate cannot beat the current best and Ĩ(S'; Y) need not be evaluated.

...
z* = 0
for X_j ∈ F \ S_{k-1} do
    S' = S_{k-1} ∪ {X_j}
    c ← cheap O(1) upper bound
    if c > z* then
        z = Ĩ(S'; Y)
        if z > z* then
            z* = z; S* = S'
        end if
    end if
end for
...
An O(1) Bound

σ^2_{j|S_{k-1}} = σ_j^2 - Σ_{j,S_{k-1}}^T Σ_{S_{k-1}}^{-1} Σ_{j,S_{k-1}}

Writing the eigendecomposition Σ_{S_{k-1}}^{-1} = U Λ^{-1} U^T, the quadratic form is bounded by

\frac{\|Σ_{j,S_{k-1}}\|^2}{\max_i λ_i} ≤ Σ_{j,S_{k-1}}^T Σ_{S_{k-1}}^{-1} Σ_{j,S_{k-1}} ≤ \frac{\|Σ_{j,S_{k-1}}\|^2}{\min_i λ_i},

and both bounds are O(1) given the extreme eigenvalues and the cached norm.
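A numerical check of these spectrum-based bounds on a made-up 2×2 covariance:

```python
import numpy as np

# For SPD Sigma with eigenvalues lambda_min..lambda_max:
#   ||b||^2 / lambda_max  <=  b^T Sigma^{-1} b  <=  ||b||^2 / lambda_min
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
b = np.array([1.0, -1.0])

lams = np.linalg.eigvalsh(Sigma)           # ascending eigenvalues
q = b @ np.linalg.solve(Sigma, b)          # exact quadratic form, O(k^3)
lo = (b @ b) / lams[-1]                    # O(1) lower bound
hi = (b @ b) / lams[0]                     # O(1) upper bound
print(lo <= q <= hi)  # -> True
```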
Eigensystem Update Problem

Given U^n, Λ^n such that Σ^n = U^{nT} Λ^n U^n, find U^{n+1}, Λ^{n+1} with

Σ^{n+1} = [ Σ^n  v ; v^T  1 ] = U^{(n+1)T} Λ^{n+1} U^{n+1}

(assume Σ^{n+1}, the covariance of S_{n+1}, is positive definite).
Eigensystem Update

U^n = [u_1^n, u_2^n, ..., u_n^n], Λ^n = diag(λ_1^n, ..., λ_n^n).

Embedding in dimension n+1, Σ̃ = [ Σ^n  0 ; 0^T  1 ] has eigenvectors ũ_i^n = [u_i^n; 0] with eigenvalues λ_i^n, plus e_{n+1} = [0; 1] with eigenvalue 1; write Σ̃ = Ũ Λ̃ Ũ^T with Ũ = [ũ_1^n, ..., ũ_n^n, e_{n+1}].

The target matrix is the rank-two correction

Σ^{n+1} = Σ̃ + e_{n+1} [v; 0]^T + [v; 0] e_{n+1}^T,

and we must solve Σ^{n+1} u^{n+1} = λ^{n+1} u^{n+1}.
Eigensystem Update

Multiplying the eigenvalue equation by Ũ^T and using Ũ Ũ^T = I and Ũ^T Σ̃ Ũ = Λ̃:

(Λ̃ + e_{n+1} q^T + q e_{n+1}^T) \, Ũ^T u^{n+1} = λ^{n+1} \, Ũ^T u^{n+1}, with q = Ũ^T [v; 0].

So Σ' = Λ̃ + e_{n+1} q^T + q e_{n+1}^T and Σ^{n+1} share eigenvalues, and U^{n+1} = Ũ U', where U' are the eigenvectors of Σ'.
Eigensystem Update: The Secular Equation

|Σ' - λI| = \prod_j (λ_j - λ) - \sum_{i < n+1} q_i^2 \prod_{j ≠ i,\, j < n+1} (λ_j - λ),

so the new eigenvalues are the roots of

f(λ) = λ_{n+1} - λ - \sum_i \frac{q_i^2}{λ_i - λ}.

For each i, \lim_{λ → λ_i^+} f(λ) = +∞ and \lim_{λ → λ_i^-} f(λ) = -∞, and

\frac{∂f}{∂λ} = -1 - \sum_i \frac{q_i^2}{(λ_i - λ)^2} < 0,

so f is strictly decreasing between consecutive poles and each interval contains exactly one root (the new eigenvalues interlace the old ones).
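The root-finding step can be sketched as follows; the diagonal, border vector, and corner value below are made up, and the signs follow the reconstruction above (the transcription lost them). Each eigenvalue is bracketed between consecutive poles and found by bisection, then compared against a dense eigensolver:

```python
import numpy as np

# Eigenvalues of the arrowhead matrix [[D, q], [q^T, alpha]] are the roots of
#   f(lam) = alpha - lam - sum_i q_i^2 / (d_i - lam)   (secular equation),
# with exactly one root per interval between consecutive poles d_i.
d = np.array([1.0, 2.0, 3.0])        # old eigenvalues (poles)
q = np.array([0.3, 0.4, 0.5])        # projected border vector
alpha = 4.0                          # new diagonal entry

def f(lam):
    return alpha - lam - np.sum(q**2 / (d - lam))

def bisect(lo, hi, iters=200):
    # f decreases from +inf to -inf on each interval, so bisection converges.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

pad = np.linalg.norm(q) + 1e-9       # all eigenvalues lie within +- ||q|| of diag
brackets = [(d[0] - pad, d[0] - 1e-9), (d[0] + 1e-9, d[1] - 1e-9),
            (d[1] + 1e-9, d[2] - 1e-9), (d[2] + 1e-9, max(alpha, d[2]) + pad)]
roots = np.array([bisect(lo, hi) for lo, hi in brackets])

M = np.block([[np.diag(d), q[:, None]], [q[None, :], np.array([[alpha]])]])
print(np.allclose(np.sort(roots), np.linalg.eigvalsh(M), atol=1e-6))  # -> True
```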
[Figure: CPU time in seconds vs. matrix size, comparing a full LAPACK eigendecomposition from scratch against the incremental update.]
[Figure: numerical stability of the eigenvalues — max |λ_update - λ_Lapack| / λ_Lapack vs. matrix size.]
Set to Go... Now that all the machinery is in place, how did we do?
Nice Results

Evaluated on CIFAR, STL, and INRIA against SVMLin, Fisher, FCBF, MRMR, SBMLR, t-test, InfoGain, Spectral Clustering, ReliefF, CFS, CMTF, and BAHSIC, alongside the proposed variants GC.E, GC.MI, GKL.E, and GKL.MI. GC.MI was the fastest of the more complex algorithms.
Influence of Sample Size

We use the estimate Σ̂_N = \frac{1}{N} P^T P. For (sub-)Gaussian data, if N ≥ C (t/ε)^2 d then \|Σ̂_N - Σ\| ≤ ε with probability at least 1 - 2 e^{-c t^2}.

However, the faster implementations use Σ̂_N^{-1}.
Influence of Sample Size

[Figure: sensitivity of the inverse to estimation error — the condition number κ(Σ) governs how the error of Σ̂_N^{-1} grows; shown for d = 2048 and various values of N.]
[Figure: effect of sample size on performance when using the Gaussian approximation on the CIFAR dataset — test accuracy vs. number of samples per class, for 25 and 50 selected features.]
"Jointly Informative Feature Selection Made Tractable by Gaussian Modeling", L. Lefakis and F. Fleuret, Journal of Machine Learning Research, 2016.
The End. Thank You!
More informationLecture 1a: Basic Concepts and Recaps
Lecture 1a: Basic Concepts and Recaps Cédric Archambeau Centre for Computational Statistics and Machine Learning Department of Computer Science University College London c.archambeau@cs.ucl.ac.uk Advanced
More informationECE598: Information-theoretic methods in high-dimensional statistics Spring 2016
ECE598: Information-theoretic methods in high-dimensional statistics Spring 06 Lecture : Mutual Information Method Lecturer: Yihong Wu Scribe: Jaeho Lee, Mar, 06 Ed. Mar 9 Quick review: Assouad s lemma
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Kernel Density Estimation, Factor Analysis Mark Schmidt University of British Columbia Winter 2017 Admin Assignment 2: 2 late days to hand it in tonight. Assignment 3: Due Feburary
More informationExpectation Maximization
Expectation Maximization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr 1 /
More informationL11: Pattern recognition principles
L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction
More informationLecture 1: Introduction, Entropy and ML estimation
0-704: Information Processing and Learning Spring 202 Lecture : Introduction, Entropy and ML estimation Lecturer: Aarti Singh Scribes: Min Xu Disclaimer: These notes have not been subjected to the usual
More informationHigh Dimensional Discriminant Analysis
High Dimensional Discriminant Analysis Charles Bouveyron 1,2, Stéphane Girard 1, and Cordelia Schmid 2 1 LMC IMAG, BP 53, Université Grenoble 1, 38041 Grenoble cedex 9 France (e-mail: charles.bouveyron@imag.fr,
More informationIntroduction to Statistical Learning Theory
Introduction to Statistical Learning Theory In the last unit we looked at regularization - adding a w 2 penalty. We add a bias - we prefer classifiers with low norm. How to incorporate more complicated
More informationPrincipal Component Analysis
Principal Component Analysis Introduction Consider a zero mean random vector R n with autocorrelation matri R = E( T ). R has eigenvectors q(1),,q(n) and associated eigenvalues λ(1) λ(n). Let Q = [ q(1)
More informationLecture 13 : Variational Inference: Mean Field Approximation
10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1
More informationMachine Learning, Midterm Exam
10-601 Machine Learning, Midterm Exam Instructors: Tom Mitchell, Ziv Bar-Joseph Wednesday 12 th December, 2012 There are 9 questions, for a total of 100 points. This exam has 20 pages, make sure you have
More informationVariational Inference. Sargur Srihari
Variational Inference Sargur srihari@cedar.buffalo.edu 1 Plan of discussion We first describe inference with PGMs and the intractability of exact inference Then give a taxonomy of inference algorithms
More informationLearning with Noisy Labels. Kate Niehaus Reading group 11-Feb-2014
Learning with Noisy Labels Kate Niehaus Reading group 11-Feb-2014 Outline Motivations Generative model approach: Lawrence, N. & Scho lkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of
More informationNaïve Bayes classification
Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss
More informationInformation Theory in Computer Vision and Pattern Recognition
Francisco Escolano Pablo Suau Boyan Bonev Information Theory in Computer Vision and Pattern Recognition Foreword by Alan Yuille ~ Springer Contents 1 Introduction...............................................
More informationCOM336: Neural Computing
COM336: Neural Computing http://www.dcs.shef.ac.uk/ sjr/com336/ Lecture 2: Density Estimation Steve Renals Department of Computer Science University of Sheffield Sheffield S1 4DP UK email: s.renals@dcs.shef.ac.uk
More informationThe binary entropy function
ECE 7680 Lecture 2 Definitions and Basic Facts Objective: To learn a bunch of definitions about entropy and information measures that will be useful through the quarter, and to present some simple but
More informationBioinformatics: Biology X
Bud Mishra Room 1002, 715 Broadway, Courant Institute, NYU, New York, USA Model Building/Checking, Reverse Engineering, Causality Outline 1 Bayesian Interpretation of Probabilities 2 Where (or of what)
More informationGoodness of Fit Test and Test of Independence by Entropy
Journal of Mathematical Extension Vol. 3, No. 2 (2009), 43-59 Goodness of Fit Test and Test of Independence by Entropy M. Sharifdoost Islamic Azad University Science & Research Branch, Tehran N. Nematollahi
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 9: Variational Inference Relaxations Volkan Cevher, Matthias Seeger Ecole Polytechnique Fédérale de Lausanne 24/10/2011 (EPFL) Graphical Models 24/10/2011 1 / 15
More information2018 CS420, Machine Learning, Lecture 5. Tree Models. Weinan Zhang Shanghai Jiao Tong University
2018 CS420, Machine Learning, Lecture 5 Tree Models Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/cs420/index.html ML Task: Function Approximation Problem setting
More informationCausal Modeling with Generative Neural Networks
Causal Modeling with Generative Neural Networks Michele Sebag TAO, CNRS INRIA LRI Université Paris-Sud Joint work: D. Kalainathan, O. Goudet, I. Guyon, M. Hajaiej, A. Decelle, C. Furtlehner https://arxiv.org/abs/1709.05321
More informationMidterm exam CS 189/289, Fall 2015
Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points
More informationEnergy-Based Generative Adversarial Network
Energy-Based Generative Adversarial Network Energy-Based Generative Adversarial Network J. Zhao, M. Mathieu and Y. LeCun Learning to Draw Samples: With Application to Amoritized MLE for Generalized Adversarial
More informationMaterial presented. Direct Models for Classification. Agenda. Classification. Classification (2) Classification by machines 6/16/2010.
Material presented Direct Models for Classification SCARF JHU Summer School June 18, 2010 Patrick Nguyen (panguyen@microsoft.com) What is classification? What is a linear classifier? What are Direct Models?
More informationEECS 750. Hypothesis Testing with Communication Constraints
EECS 750 Hypothesis Testing with Communication Constraints Name: Dinesh Krithivasan Abstract In this report, we study a modification of the classical statistical problem of bivariate hypothesis testing.
More informationMachine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity
More informationWeek 3: The EM algorithm
Week 3: The EM algorithm Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London Term 1, Autumn 2005 Mixtures of Gaussians Data: Y = {y 1... y N } Latent
More informationPart 1: Expectation Propagation
Chalmers Machine Learning Summer School Approximate message passing and biomedicine Part 1: Expectation Propagation Tom Heskes Machine Learning Group, Institute for Computing and Information Sciences Radboud
More informationCourse 495: Advanced Statistical Machine Learning/Pattern Recognition
Course 495: Advanced Statistical Machine Learning/Pattern Recognition Goal (Lecture): To present Probabilistic Principal Component Analysis (PPCA) using both Maximum Likelihood (ML) and Expectation Maximization
More informationDay 5: Generative models, structured classification
Day 5: Generative models, structured classification Introduction to Machine Learning Summer School June 18, 2018 - June 29, 2018, Chicago Instructor: Suriya Gunasekar, TTI Chicago 22 June 2018 Linear regression
More informationBayesian Machine Learning - Lecture 7
Bayesian Machine Learning - Lecture 7 Guido Sanguinetti Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh gsanguin@inf.ed.ac.uk March 4, 2015 Today s lecture 1
More informationInformation geometry for bivariate distribution control
Information geometry for bivariate distribution control C.T.J.Dodson + Hong Wang Mathematics + Control Systems Centre, University of Manchester Institute of Science and Technology Optimal control of stochastic
More informationComputational Systems Biology: Biology X
Bud Mishra Room 1002, 715 Broadway, Courant Institute, NYU, New York, USA L#8:(November-08-2010) Cancer and Signals Outline 1 Bayesian Interpretation of Probabilities Information Theory Outline Bayesian
More informationGaussian Mixture Models
Gaussian Mixture Models David Rosenberg, Brett Bernstein New York University April 26, 2017 David Rosenberg, Brett Bernstein (New York University) DS-GA 1003 April 26, 2017 1 / 42 Intro Question Intro
More informationBayesian Inference Course, WTCN, UCL, March 2013
Bayesian Course, WTCN, UCL, March 2013 Shannon (1948) asked how much information is received when we observe a specific value of the variable x? If an unlikely event occurs then one would expect the information
More informationMachine Learning Srihari. Information Theory. Sargur N. Srihari
Information Theory Sargur N. Srihari 1 Topics 1. Entropy as an Information Measure 1. Discrete variable definition Relationship to Code Length 2. Continuous Variable Differential Entropy 2. Maximum Entropy
More informationDecision trees COMS 4771
Decision trees COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random pairs (i.e., labeled examples).
More informationLogistic Regression. Sargur N. Srihari. University at Buffalo, State University of New York USA
Logistic Regression Sargur N. University at Buffalo, State University of New York USA Topics in Linear Classification using Probabilistic Discriminative Models Generative vs Discriminative 1. Fixed basis
More informationMachine Learning for Data Science (CS4786) Lecture 12
Machine Learning for Data Science (CS4786) Lecture 12 Gaussian Mixture Models Course Webpage : http://www.cs.cornell.edu/courses/cs4786/2016fa/ Back to K-means Single link is sensitive to outliners We
More information