Surrogate loss functions, divergences and decentralized detection
|
|
- Rodger Davis
- 6 years ago
- Views:
Transcription
1 Surrogate loss functions, divergences and decentralized detection XuanLong Nguyen Department of Electrical Engineering and Computer Science U.C. Berkeley Advisors: Michael Jordan & Martin Wainwright 1
2 Talk outline nonparametric decentralized detection algorithm use of surrogate loss functions and marginalized kernels use of convex analysis study of surrogate loss functions and divergence functionals correspondence of losses and divergences M-estimator of divergences (e.g., Kullback-Leibler divergence) 2
3 Decentralized decision-making problem learning both classifier and experiment covariate vector X and hypothesis (label) Y = ±1 we do not have access directly to X in order to determine Y learn jointly the mapping (Q, γ) X Q Z γ Y 3
4 Decentralized decision-making problem learning both classifier and experiment covariate vector X and hypothesis (label) Y = ±1 we do not have access directly to X in order to determine Y learn jointly the mapping (Q, γ) roles of experiment Q: X Q Z γ Y due to data collection constraints (e.g., decentralization) data transmission constraints choice of variates (feature selection) dimensionality reduction scheme 3-a
5 A decentralized detection system... Sensors Communication channel Fusion center Decision rule Decentralized setting: Communication constraints between sensors and fusion center (e.g., bit constraints) Goal: Design decision rules for sensors and fusion center Criterion: Minimize probability of incorrect detection 4
6 Concrete example wireless sensor network Light source sensors Set-up: wireless network of tiny sensor motes, each equiped with light/ humidity/ temperature sensing capabilities measurement of signal strength ([0 1024] in magnitude, or 10 bits) Goal: is there a forest fire in a certain region? 5
7 Related work Classical work on classification/detection: completely centralized no consideration of communication-theoretic infrastructure 6
8 Related work Classical work on classification/detection: completely centralized no consideration of communication-theoretic infrastructure Decentralized detection in signal processing (e.g., Tsitsiklis, 1993) joint distribution assumed to be known locally-optimal rules under conditional independence assumptions (i.e., Naive Bayes) 6-a
9 Overview of our approach Treat as a nonparametric estimation (learning) problem under constraints from a distributed system Use kernel methods and convex surrogate loss functions tools from convex optimization to derive an efficient algorithm 7
10 Problem set-up Y {±1} X 1 X 2 X 3... X S γ 1 γ 2 γ 3... γ S Z 1 Z 2 Z 3... Z S X {1,...,M} S Z {1,...,L} S ; L M γ(z 1,...,Z S ) Problem: Given training data (x i, y i ) n i=1, find the decision rules (γ 1,...,γ s ; γ) so as to minimize the detection error probability: P(Y γ(z 1,...,Z s )). 8
11 Kernel methods for classification Classification: Learn γ(z) that predicts label y K(z, z ) is a symmetric positive semidefinite kernel function feature space H in which K acts as an inner product, i.e., K(z, z ) = Ψ(z), Ψ(z ) Kernel-based algorithm finds linear function in H, i.e. γ(z) = w, Ψ(z) = n α i K(z i, z) i=1 Advantages: kernel function classes are sufficiently rich for many applications optimizing over kernel function classes is computionally efficient 9
12 Convex surrogate loss function φ to 0-1 loss Zero one Hinge Logistic Exponential Surrogate loss Margin value minimizing (regularized) empirical φ-risk Êφ(Y γ(z)): n φ(y i γ(z i )) + λ 2 γ 2, min γ H i=1 (z i, y i ) n i=1 are training data in Z {±1} φ is a convex loss function (surrogate to non-convex 0-1 loss) 10
13 Stochastic decision rules at each sensor Y X 1 Z 1 X 2 Z 2 X 3 Z 3... X S Z S γ(z) Q(Z X) K z (z, z ) = Ψ(z), Ψ(z ) γ(z) = w, Ψ(z) Approximate deterministic sensor decisions by stochastic rules Q(Z X) Sensors do not communicate directly = factorization: Q(Z X) = S t=1 Qt (Z t X t ) Q = Q t, The overall decision rule is represented by γ(z) = w, Ψ(z) 11
14 High-level strategy: Joint optimization Minimize over (Q, γ) an empirical version of Eφ(Y γ(z)) Joint minimization: fix Q, optimize over γ: A simple convex problem fix γ, perform a gradient update for Q, sensor by sensor 12
15 Approximating empirical φ-risk The regularized empirical φ-risk Êφ(Y γ(z)) has the form: G 0 = z n φ(y i γ(z))q(z x i ) + λ 2 w 2 i=1 Challenge: even evaluating G 0 at a single point is intractable Requires summing over L S possible values for z Idea: approximate G 0 by another objective function G G is φ-risk of a marginalized feature space G 0 G for deterministic Q 13
16 Marginalizing over feature space z 2 γ(z) = w, Ψ(z) f Q (x) = w, Ψ Q (x) x 2 Q(z x) Quantized: Z space z 1 Original: X space x 1 Stochastic decision rule Q(z x): maps between X and Z induces marginalized feature map Ψ Q from base map Ψ (or marginalized kernel K Q from base kernel K) 14
17 Marginalized feature space {Ψ Q (x)} 15
18 Marginalized feature space {Ψ Q (x)} Define a new feature space Ψ Q (x) and a linear function over Ψ Q (x): Ψ Q (x) = z Q(z x)ψ(z) = Marginalization over z f Q (x) = w, Ψ Q (x) 15-a
19 Marginalized feature space {Ψ Q (x)} Define a new feature space Ψ Q (x) and a linear function over Ψ Q (x): Ψ Q (x) = z Q(z x)ψ(z) = Marginalization over z f Q (x) = w, Ψ Q (x) The alternative objective function G is the φ-risk for f Q : G = n φ(y i f Q (x i )) + λ 2 w 2 i=1 15-b
20 Marginalized feature space {Ψ Q (x)} Define a new feature space Ψ Q (x) and a linear function over Ψ Q (x): Ψ Q (x) = z Q(z x)ψ(z) = Marginalization over z f Q (x) = w, Ψ Q (x) The alternative objective function G is the φ-risk for f Q : G = n φ(y i f Q (x i )) + λ 2 w 2 i=1 Ψ Q (x) induces a marginalized kernel over X: K Q (x, x ) := Ψ Q (x), Ψ Q (x ) = z,z Q(z x)q(z x ) K z (z, z ) Marginalization taken over message z conditioned on sensor signal x 15-c
21 Marginalized kernels Have been used to derive kernel functions from generative models (e.g. Tsuda, 2002) Marginalized kernel K Q (x, x ) is defined as: K Q (x, x ) := z,z Q(z x)q(z x ) }{{} Factorized distributions K z (z, z ), }{{} Base kernel If K z (z, z ) is decomposed into smaller components of z and z, then K Q (x, x ) can be computed efficiently (in polynomial-time) 16
22 Centralized and decentralized function Centralized decision function obtained by minimizing φ-risk: f Q (x) = w, Ψ Q (x) f Q has direct access to sensor signal x 17
23 Centralized and decentralized function Centralized decision function obtained by minimizing φ-risk: f Q (x) = w, Ψ Q (x) f Q has direct access to sensor signal x Optimal w also define decentralized decision function: γ(z) = w, Ψ(z) γ has access only to quantized version z 17-a
24 Centralized and decentralized function Centralized decision function obtained by minimizing φ-risk: f Q (x) = w, Ψ Q (x) f Q has direct access to sensor signal x Optimal w also define decentralized decision function: γ(z) = w, Ψ(z) γ has access only to quantized version z Decentralized γ behaves on average like the centralized f Q : f Q (x) = E[γ(Z) x] 17-b
25 Optimization algorithm Goal: Solve the problem: inf G(w; Q) := ( φ y i w, w;q i z ) Q(z x i )Ψ(z) + λ 2 w 2 Finding optimal weight vector: G is convex in w with Q fixed solve dual problem (quadratic convex program) to obtain optimal w(q) Finding optimal decision rules: G is convex in Q t with w and all other {Q r, r t} fixed efficient computation of subgradient for G at optimal (w(q), Q) Overall: Efficient joint minimization by blockwise coordinate descent 18
26 Simulated sensor networks Y X 1 X 2 X 3 Y X 1 X 2 X 3 X 1 X 2 X 3 Y X 4 X X 10 X 10 X 4 X 5 X 6 X 7 X 8 X 9 Naive Bayes net Chain-structured network Spatially-dependent network 19
27 Kernel Quantization vs. Decentralized LRT 0.3 Naive Bayes networks 0.5 Chain structured (1st order) 0.4 Decentralized LRT Decentralized LRT Kernel quantization (KQ) Kernel quantization (KQ) 0.5 Chain structured (2nd order) 0.5 Fully connected networks Decentralized LRT Decentralized LRT Kernel quantization (KQ) Kernel quantization (KQ) 20
28 Wireless network with tiny Berkeley motes Light source sensors 5 5 = 25 tiny sensor motes, each equipped with a light receiver Light signal strength requires 10-bit ([0 1024] in magnitude) Perform classification with respect to different regions Each problem has 25 training positions, 81 test positions (Data collection courtesy Bruno Sinopoli) 21
29 Classification with Mica sensor motes centralized SVM (10 bit sig) centralized NB classifier decentralized KQ (1 bit) decentralized KQ (2 bit) 0.5 Test error Classification problem instances 22
30 Outline nonparametric decentralized detection algorithm use of surrogate loss functions and marginalized kernels study of surrogate loss functions and divergence functionals correspondence of losses and divergences M-estimator of divergences (e.g., Kullback-Leibler divergence) 23
31 Consistency question recall that our decentralized algorithm essentially solves min γ,q Êφ(Y, γ(z)) does this also imply optimality in the sense of 0-1 loss? P(Y γ(z)) 24
32 Consistency question recall that our decentralized algorithm essentially solves min γ,q Êφ(Y, γ(z)) does this also imply optimality in the sense of 0-1 loss? P(Y γ(z)) answers: hinge loss yields consistent estimates all losses corresponding to variational distance yield consistency and we can identify all of them exponential loss, logistic loss do not the gist lies in the correspondence between loss functions and divergence functionals 24-a
33 Divergence between two distributions The f-divergence between two densities µ and π is given by ( ) µ(z) I f (µ, π) := π(z)f dν. π(z) where f : [0, + ) R {+ } is a continuous convex function z 25
34 Divergence between two distributions The f-divergence between two densities µ and π is given by ( ) µ(z) I f (µ, π) := π(z)f dν. π(z) where f : [0, + ) R {+ } is a continuous convex function Kullback-Leibler divergence: f(u) = u log u. z I f (µ, π) = z µ(z)log µ(z) π(z). variational distance: f(u) = u 1. I f (µ, π) := z µ(z) π(z). Hellinger distance: f(u) = 1 2 ( u 1) 2. I f (µ, π) := ( z Z µ(z) π(z)) a
35 Surrogate loss and f-divergence Map Q induces measures on Z: µ(z) := P(Y = 1, z); π(z) := P(Y = 1, z) Theorem: Fixing Q, the optimal risk for each φ loss is an f-divergence for some convex f, and vice versa: R φ (Q) = I f (µ, π), where R φ (Q) := min γ Eφ(Y, γ(z)) φ 1 φ 2 φ 3 f 1 f 2 f 3 Class of loss functions Class of f-divergences 26
36 Unrolling divergences by convex duality Legendre-Fenchel convex duality: f(u) = sup v R uv f (v), where f is the convex conjugate of f ( ) µ I f (µ, π) = πf dν π 27
37 Unrolling divergences by convex duality Legendre-Fenchel convex duality: f(u) = sup v R uv f (v), where f is the convex conjugate of f ( ) µ I f (µ, π) = πf dν π = πsup(γµ/π f (γ)) dν γ 27-a
38 Unrolling divergences by convex duality Legendre-Fenchel convex duality: f(u) = sup v R uv f (v), where f is the convex conjugate of f ( ) µ I f (µ, π) = πf dν π = πsup(γµ/π f (γ)) dν γ = sup γµ f (γ)π dν γ 27-b
39 Unrolling divergences by convex duality Legendre-Fenchel convex duality: f(u) = sup v R uv f (v), where f is the convex conjugate of f ( ) µ I f (µ, π) = πf dν π = πsup(γµ/π f (γ)) dν γ = sup γµ f (γ)π dν γ = inf f (γ)π γµ dν γ 27-c
40 Unrolling divergences by convex duality Legendre-Fenchel convex duality: f(u) = sup v R uv f (v), where f is the convex conjugate of f ( ) µ I f (µ, π) = πf dν π = πsup(γµ/π f (γ)) dν γ = sup γµ f (γ)π dν γ = inf f (γ)π γµ dν γ The last quantity can be viewed as a risk functional with respect to loss functions f (γ) and γ 27-d
41 Examples 0-1 loss: R bayes (Q) = z Z µ(z) π(z) variational distance hinge loss: R hinge (Q) = 2R bayes (Q) variational distance exponential loss: R exp (Q) = 1 z Z (µ(z) 1/2 π(z) 1/2 ) 2 Hellinger distance logistic loss: R log (Q) = log 2 KL(µ µ+π 2 ) KL(π µ+π 2 ) capacitory dis. distance 28
42 Examples Equivalent surrogage losses corresponding to Hellinger distance (left) and variational distance (right) 4 3 g = exp(u 1) g = u g = u g = e u 1 g = u g = u 2 φ loss 2 φ loss margin margin value the part after 0 is a fixed map of the part before 0! 29
43 Comparison of loss functions φ 1 φ 2 φ3 f 1 f 2 f 3 Class of loss functions Class of f-divergences two loss functions φ 1 and φ 2, corresponding to f-divergence induced by f 1 and f 2 φ 1 and φ 2 are universally equivalent, denoted by φ 1 U φ2 (or, equivalently) f 1 U f2 if for any P(X, Y ) and quantization rules Q A, Q B, there holds: R φ1 (Q A ) R φ1 (Q B ) R φ2 (Q A ) R φ2 (Q B ). 30
44 An equivalence theorem Theorem: φ 1 U φ2 (or, equivalently) f 1 U f2 if and only if f 1 (u) = cf 2 (u) + au + b for constants a, b R and c > 0 in particular, surrogate losses universally equivalent to 0 1 loss are those whose induced f divergence has the form: f(u) = c min{u, 1} + au + b 31
45 Empirical risk minimization procedure let φ be a convex surrogate equivalent to 0 1 loss (C n, D n ) is a sequence of increasing function classes for(γ, Q) our procedure learns: (C 1, D 1 ) (C 2, D 2 )... (Γ, Q) (γn, Q n) := argmin (γ,q) (Cn,D n )Êφ(Y γ(z)) let R bayes := inf (γ,q) (Γ,Q) P(Y γ(z)) optimal Bayes error our procedure is consistent if R bayes (γ n, Q n) R bayes 0 32
46 Consistency result Theorem: If n=1(c n, D n ) is dense in the space of pairs of classifier and quantizer (γ, Q) (Γ, Q) sequence (C n, D n ) increases in size sufficiently slowly then our procedure is consistent, i.e., lim n R bayes(γ n, Q n) R bayes = 0 in probability. proof exploits the equivalence of φ loss and 0 1 loss decomposition of φ risk into approximation error and estimation error 33
47 Outline nonparametric decentralized detection algorithm use of surrogate loss functions and marginalized kernels study of surrogate loss functions and divergence functionals correspondence of losses and divergences M-estimator of divergences (e.g., Kullback-Leibler divergence) 34
48 Estimating divergence and likelihood ratio given i.i.d {x 1,...,x n } Q, {y 1,...,y n } P want to estimate two quantities KL divergence functional D K (P, Q) = p 0 log p 0 q 0 dµ likelihood ratio function g 0 (.) = p 0 (.)/q 0 (.) 35
49 Variational characterization recall the correspondence: min γ Eφ(Y, γ(z)) = I f (µ, π) f-divergence can be estimated by minimizing over some associated φ-risk functional 36
50 Variational characterization recall the correspondence: min γ Eφ(Y, γ(z)) = I f (µ, π) f-divergence can be estimated by minimizing over some associated φ-risk functional for the Kullback-Leibler divergence: D K (P, Q) = sup log g dp g>0 gdq + 1. furthermore, the supremum is attained at g = p 0 /q a
51 M-estimation procedure let G be a function class of X R + dp n and dq n denote the expectation under empirical measures P n and Q n, respectively our estimator has the following form: ˆD K = sup log g dp n g G gdq n + 1. supremum is attained at ĝ n, which estimates the likelihood ratio p 0 /q 0 37
52 Convex empirical risk with penalty in practice, control the size of the function class G by using penalty let I(g) be a measure of complexity for g decompose G as follows: G = 1 M G M, where G M is restricted to g for which I(g) M. the estimation procedure involves solving: ĝ n = argmin g G gdq n log g dp n + λ n 2 I2 (g). 38
53 Convergence analysis for KL divergence estimation, we study ˆD K D K (P, Q) for the likelihood ratio estimation, we use Hellinger distance h 2 Q(g, g 0 ) := 1 (g 1/2 g 1/2 0 ) 2 dq. 2 39
54 Assumptions for convergence analysis true likelihood ratio g 0 is bounded from below by some positive constant: g 0 η 0 > 0. Note: we don t assume that G is bounded away from 0 (not yet)! uniform norm of G M is Lipchitz with respect to the penalty measure I(g): for any M 1: sup g cm. g G M on the entropy of G: For some 0 < γ < 2, H B δ (G M, L 2 (Q)) = O(M/δ) γ. 40
55 Convergence rates when λ n vanishes sufficiently slowly: then under P: λ 1 n = O P (n 2/(2+γ) )(1 + I(g 0 )), h Q (g 0, ĝ n ) = O P (λ 1/2 n )(1 + I(g 0 )) I(ĝ n ) = O P (1 + I(g 0 )). 41
56 Convergence rates when λ n vanishes sufficiently slowly: then under P: λ 1 n = O P (n 2/(2+γ) )(1 + I(g 0 )), h Q (g 0, ĝ n ) = O P (λ 1/2 n )(1 + I(g 0 )) I(ĝ n ) = O P (1 + I(g 0 )). if G is bounded away from 0: D K ˆD K = O P (λ 1/2 n )(1 + I(g 0 )). 41-a
57 {x i } Q, {y j } P G is RKHS function class G is a RKHS with Mercer kernel k(x, x ) = Φ(x), Φ(x ) I(g) = g H ĝ n = argmin g G gdq n log g dp n + λ n 2 g 2 H 42
58 {x i } Q, {y j } P G is RKHS function class G is a RKHS with Mercer kernel k(x, x ) = Φ(x), Φ(x ) I(g) = g H ĝ n = argmin g G gdq n log g dp n + λ n 2 g 2 H Convex dual formulation: α := argmax 1 n log α j 1 n 2λ n j=1 n α j Φ(y j ) 1 n j=1 n Φ(x i ) 2 i=1 ˆD K (P, Q) := 1 n n log α j log n j=1 42-a
59 log G is RKHS function class {x i } Q, {y j } P log G is a RKHS with Mercer kernel k(x, x ) = Φ(x), Φ(x ) I(g) = log g H ĝ n = argmin g G gdq n log g dp n + λ n 2 log g 2 H Convex dual formulation: α := argmax 1 n ( α i log α i +α i log n n e i=1 ) 1 2λ n n α i Φ(x i ) 1 n i=1 n Φ(y j ) 2 j=1 ˆD K (P, Q) := 1 + n α i log α i + α i log n e i=1 43
60 Estimate of KL(Beta(1,2),Unif[0,1]) 0.8 Estimate of KL(1/2 N t (0,1)+ 1/2 N t (1,1),Unif[ 5,5]) M1, σ =.1, λ = 1/n M2, σ =.1, λ =.1/n WKV, s = n 1/2 WKV, s = n 1/ M1, σ =.1, λ = 1/n M2, σ = 1, λ =.1/n WKV, s = n 1/3 WKV, s = n 1/2 WKV, s = n 2/ Estimate of KL(N t (0,1),N t (4,2)) Estimate of KL(N t (4,2),N t (0,1)) M1, σ =.1, λ = 1/n M2, σ =.1, λ =.1/n WKV, s = n 1/3 WKV, s = n 1/2 WKV, s = n 2/ M1, σ = 1, λ =.1/n M2, σ = 1, λ =.1/n WKV, s = n 1/4 WKV, s = n 1/3 WKV, s = n 1/
61 1.5 Estimate of KL(N t (0,I 2 ),N t (1,I 2 )) 1.5 Estimate of KL(N t (0,I 2 ),Unif[ 3,3] 2 ) M1, σ =.5, λ =.1/n M2, σ =.5, λ =.1/n WKV, n 1/3 WKV, n 1/ M1, σ =.5, λ =.1/n M2, σ =.5, λ =.1/n WKV, n 1/3 WKV, n 1/ Estimate of KL(N t (0,I 3 ),N t (1,I 3 )) M1, σ = 1, λ =.1/n M2, σ = 1, λ =.1/n WKV, n 1/2 0 WKV, n 1/ Estimate of KL(N t (0,I 3 ),Unif[ 3,3] 3 ) M1 σ = 1, λ =.1/n 1/2 M2, σ = 1, λ =.1/n M2, σ = 1, λ =.1/n 2/3 WKV, n 1/3 WKV, n 1/
62 Conclusion nonparametric decentralized detection algorithm use of surrogate loss functions and marginalized kernels study of surrogate loss functions and divergence functionals correspondence of losses and divergences M-estimator of divergences (e.g., Kullback-Leibler divergence) 46
Decentralized decision making with spatially distributed data
Decentralized decision making with spatially distributed data XuanLong Nguyen Department of Statistics University of Michigan Acknowledgement: Michael Jordan, Martin Wainwright, Ram Rajagopal, Pravin Varaiya
More informationAre You a Bayesian or a Frequentist?
Are You a Bayesian or a Frequentist? Michael I. Jordan Department of EECS Department of Statistics University of California, Berkeley http://www.cs.berkeley.edu/ jordan 1 Statistical Inference Bayesian
More informationDivergences, surrogate loss functions and experimental design
Divergences, surrogate loss functions and experimental design XuanLong Nguyen University of California Berkeley, CA 94720 xuanlong@cs.berkeley.edu Martin J. Wainwright University of California Berkeley,
More informationOn Information Divergence Measures, Surrogate Loss Functions and Decentralized Hypothesis Testing
On Information Divergence Measures, Surrogate Loss Functions and Decentralized Hypothesis Testing XuanLong Nguyen Martin J. Wainwright Michael I. Jordan Electrical Engineering & Computer Science Department
More informationNonparametric Decentralized Detection using Kernel Methods
Nonparametric Decentralized Detection using Kernel Methods XuanLong Nguyen xuanlong@cs.berkeley.edu Martin J. Wainwright, wainwrig@eecs.berkeley.edu Michael I. Jordan, jordan@cs.berkeley.edu Electrical
More informationDecentralized Detection and Classification using Kernel Methods
Decentralized Detection and Classification using Kernel Methods XuanLong Nguyen Computer Science Division University of California, Berkeley xuanlong@cs.berkeley.edu Martin J. Wainwright Electrical Engineering
More informationOn surrogate loss functions and f-divergences
On surrogate loss functions and f-divergences XuanLong Nguyen, Martin J. Wainwright, xuanlong.nguyen@stat.duke.edu wainwrig@stat.berkeley.edu Michael I. Jordan, jordan@stat.berkeley.edu Department of Statistical
More informationADECENTRALIZED detection system typically involves a
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL 53, NO 11, NOVEMBER 2005 4053 Nonparametric Decentralized Detection Using Kernel Methods XuanLong Nguyen, Martin J Wainwright, Member, IEEE, and Michael I Jordan,
More informationEstimating divergence functionals and the likelihood ratio by convex risk minimization
Estimating divergence functionals and the likelihood ratio by convex risk minimization XuanLong Nguyen Dept. of Statistical Science Duke University xuanlong.nguyen@stat.duke.edu Martin J. Wainwright Dept.
More informationOn divergences, surrogate loss functions, and decentralized detection
On divergences, surrogate loss functions, and decentralized detection XuanLong Nguyen Computer Science Division University of California, Berkeley xuanlong@eecs.berkeley.edu Martin J. Wainwright Statistics
More informationDecentralized Detection and Classification using Kernel Methods
Decentralied Detection and Classification using Kernel Methods XuanLong Nguyen XUANLONG@CS.BERKELEY.EDU Martin J. Wainwright WAINWRIG@EECS.BERKELEY.EDU Michael I. Jordan JORDAN@CS.BERKELEY.EDU Computer
More informationON SURROGATE LOSS FUNCTIONS AND f -DIVERGENCES 1
The Annals of Statistics 2009, Vol. 37, No. 2, 876 904 DOI: 10.1214/08-AOS595 Institute of Mathematical Statistics, 2009 ON SURROGATE LOSS FUNCTIONS AND f -DIVERGENCES 1 BY XUANLONG NGUYEN, MARTIN J. WAINWRIGHT
More informationStatistical Properties of Large Margin Classifiers
Statistical Properties of Large Margin Classifiers Peter Bartlett Division of Computer Science and Department of Statistics UC Berkeley Joint work with Mike Jordan, Jon McAuliffe, Ambuj Tewari. slides
More informationOn distance measures, surrogate loss functions, and distributed detection
On distance measures, surrogate loss functions, and distributed detection XuanLong Nguyen Computer Science Division University of California, Berkeley xuanlong@cs.berkeley.edu Martin J. Wainwright EECS
More informationAdaBoost and other Large Margin Classifiers: Convexity in Classification
AdaBoost and other Large Margin Classifiers: Convexity in Classification Peter Bartlett Division of Computer Science and Department of Statistics UC Berkeley Joint work with Mikhail Traskin. slides at
More informationNonparametric Decentralized Detection and Sparse Sensor Selection via Weighted Kernel
IEEE TRASACTIOS O SIGAL PROCESSIG, VOL. X, O. X, X X onparametric Decentralized Detection and Sparse Sensor Selection via Weighted Kernel Weiguang Wang, Yingbin Liang, Member, IEEE, Eric P. Xing, Senior
More informationLogistic Regression. Machine Learning Fall 2018
Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes
More informationLecture 10 February 23
EECS 281B / STAT 241B: Advanced Topics in Statistical LearningSpring 2009 Lecture 10 February 23 Lecturer: Martin Wainwright Scribe: Dave Golland Note: These lecture notes are still rough, and have only
More informationSTATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION
STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION Tong Zhang The Annals of Statistics, 2004 Outline Motivation Approximation error under convex risk minimization
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and
More informationMachine Learning Practice Page 2 of 2 10/28/13
Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationIntroduction to Logistic Regression and Support Vector Machine
Introduction to Logistic Regression and Support Vector Machine guest lecturer: Ming-Wei Chang CS 446 Fall, 2009 () / 25 Fall, 2009 / 25 Before we start () 2 / 25 Fall, 2009 2 / 25 Before we start Feel
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationIndirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina
Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection
More informationApproximation Theoretical Questions for SVMs
Ingo Steinwart LA-UR 07-7056 October 20, 2007 Statistical Learning Theory: an Overview Support Vector Machines Informal Description of the Learning Goal X space of input samples Y space of labels, usually
More informationSupport Vector Machines
Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized
More informationAccelerated Training of Max-Margin Markov Networks with Kernels
Accelerated Training of Max-Margin Markov Networks with Kernels Xinhua Zhang University of Alberta Alberta Innovates Centre for Machine Learning (AICML) Joint work with Ankan Saha (Univ. of Chicago) and
More informationDiscrete Markov Random Fields the Inference story. Pradeep Ravikumar
Discrete Markov Random Fields the Inference story Pradeep Ravikumar Graphical Models, The History How to model stochastic processes of the world? I want to model the world, and I like graphs... 2 Mid to
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationSVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels
SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels Karl Stratos June 21, 2018 1 / 33 Tangent: Some Loose Ends in Logistic Regression Polynomial feature expansion in logistic
More informationParameter learning in CRF s
Parameter learning in CRF s June 01, 2009 Structured output learning We ish to learn a discriminant (or compatability) function: F : X Y R (1) here X is the space of inputs and Y is the space of outputs.
More informationComputing regularization paths for learning multiple kernels
Computing regularization paths for learning multiple kernels Francis Bach Romain Thibaux Michael Jordan Computer Science, UC Berkeley December, 24 Code available at www.cs.berkeley.edu/~fbach Computing
More informationClassification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative
More informationLecture 9: PGM Learning
13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and
More informationMachine Learning for NLP
Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationMark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.
CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.
More informationLearning with Rejection
Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,
More information13: Variational inference II
10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational
More informationCIS 520: Machine Learning Oct 09, Kernel Methods
CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed
More informationLecture 35: December The fundamental statistical distances
36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose
More informationCMU-Q Lecture 24:
CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationBayesian Support Vector Machines for Feature Ranking and Selection
Bayesian Support Vector Machines for Feature Ranking and Selection written by Chu, Keerthi, Ong, Ghahramani Patrick Pletscher pat@student.ethz.ch ETH Zurich, Switzerland 12th January 2006 Overview 1 Introduction
More informationMachine Learning Basics Lecture 7: Multiclass Classification. Princeton University COS 495 Instructor: Yingyu Liang
Machine Learning Basics Lecture 7: Multiclass Classification Princeton University COS 495 Instructor: Yingyu Liang Example: image classification indoor Indoor outdoor Example: image classification (multiclass)
More informationLearning Binary Classifiers for Multi-Class Problem
Research Memorandum No. 1010 September 28, 2006 Learning Binary Classifiers for Multi-Class Problem Shiro Ikeda The Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku, Tokyo, 106-8569,
More informationMaster 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique
Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 Exam policy: This exam allows two one-page, two-sided cheat sheets (i.e. 4 sides); No other materials. Time: 2 hours. Be sure to write
More informationStanford Statistics 311/Electrical Engineering 377
I. Bayes risk in classification problems a. Recall definition (1.2.3) of f-divergence between two distributions P and Q as ( ) p(x) D f (P Q) : q(x)f dx, q(x) where f : R + R is a convex function satisfying
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More information9 Classification. 9.1 Linear Classifiers
9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive
More information10-701/ Machine Learning - Midterm Exam, Fall 2010
10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam
More informationA Magiv CV Theory for Large-Margin Classifiers
A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector
More informationCS Lecture 19. Exponential Families & Expectation Propagation
CS 6347 Lecture 19 Exponential Families & Expectation Propagation Discrete State Spaces We have been focusing on the case of MRFs over discrete state spaces Probability distributions over discrete spaces
More informationOslo Class 2 Tikhonov regularization and kernels
RegML2017@SIMULA Oslo Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT May 3, 2017 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Linear Classifiers. Blaine Nelson, Tobias Scheffer
Universität Potsdam Institut für Informatik Lehrstuhl Linear Classifiers Blaine Nelson, Tobias Scheffer Contents Classification Problem Bayesian Classifier Decision Linear Classifiers, MAP Models Logistic
More informationOutline. Supervised Learning. Hong Chang. Institute of Computing Technology, Chinese Academy of Sciences. Machine Learning Methods (Fall 2012)
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Linear Models for Regression Linear Regression Probabilistic Interpretation
More informationTopics we covered. Machine Learning. Statistics. Optimization. Systems! Basics of probability Tail bounds Density Estimation Exponential Families
Midterm Review Topics we covered Machine Learning Optimization Basics of optimization Convexity Unconstrained: GD, SGD Constrained: Lagrange, KKT Duality Linear Methods Perceptrons Support Vector Machines
More informationChapter 9. Support Vector Machine. Yongdai Kim Seoul National University
Chapter 9. Support Vector Machine Yongdai Kim Seoul National University 1. Introduction Support Vector Machine (SVM) is a classification method developed by Vapnik (1996). It is thought that SVM improved
More informationLecture 18: Multiclass Support Vector Machines
Fall, 2017 Outlines Overview of Multiclass Learning Traditional Methods for Multiclass Problems One-vs-rest approaches Pairwise approaches Recent development for Multiclass Problems Simultaneous Classification
More informationOutline. Basic concepts: SVM and kernels SVM primal/dual problems. Chih-Jen Lin (National Taiwan Univ.) 1 / 22
Outline Basic concepts: SVM and kernels SVM primal/dual problems Chih-Jen Lin (National Taiwan Univ.) 1 / 22 Outline Basic concepts: SVM and kernels Basic concepts: SVM and kernels SVM primal/dual problems
More informationWarm up: risk prediction with logistic regression
Warm up: risk prediction with logistic regression Boss gives you a bunch of data on loans defaulting or not: {(x i,y i )} n i= x i 2 R d, y i 2 {, } You model the data as: P (Y = y x, w) = + exp( yw T
More informationSupport Vector Machine (continued)
Support Vector Machine continued) Overlapping class distribution: In practice the class-conditional distributions may overlap, so that the training data points are no longer linearly separable. We need
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationPosterior Regularization
Posterior Regularization 1 Introduction One of the key challenges in probabilistic structured learning, is the intractability of the posterior distribution, for fast inference. There are numerous methods
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning
More informationBayesian Regularization
Bayesian Regularization Aad van der Vaart Vrije Universiteit Amsterdam International Congress of Mathematicians Hyderabad, August 2010 Contents Introduction Abstract result Gaussian process priors Co-authors
More informationSupport Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Linear classifier Which classifier? x 2 x 1 2 Linear classifier Margin concept x 2
More informationCSCI-567: Machine Learning (Spring 2019)
CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March
More informationA Study of Relative Efficiency and Robustness of Classification Methods
A Study of Relative Efficiency and Robustness of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang April 28, 2011 Department of Statistics
More informationDIVERGENCES (or pseudodistances) based on likelihood
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 56, NO 11, NOVEMBER 2010 5847 Estimating Divergence Functionals the Likelihood Ratio by Convex Risk Minimization XuanLong Nguyen, Martin J Wainwright, Michael
More informationRegML 2018 Class 2 Tikhonov regularization and kernels
RegML 2018 Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT June 17, 2018 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n = (x i,
More informationCS229 Supplemental Lecture notes
CS229 Supplemental Lecture notes John Duchi Binary classification In binary classification problems, the target y can take on at only two values. In this set of notes, we show how to model this problem
More informationSupport Vector Machines and Kernel Methods
2018 CS420 Machine Learning, Lecture 3 Hangout from Prof. Andrew Ng. http://cs229.stanford.edu/notes/cs229-notes3.pdf Support Vector Machines and Kernel Methods Weinan Zhang Shanghai Jiao Tong University
More informationConvexity, Detection, and Generalized f-divergences
Convexity, Detection, and Generalized f-divergences Khashayar Khosravi Feng Ruan John Duchi June 5, 015 1 Introduction The goal of classification problem is to learn a discriminant function for classification
More informationAdaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Vicente L. Malave February 23, 2011 Outline Notation minimize a number of functions φ
More informationDEPARTMENT OF COMPUTER SCIENCE Autumn Semester MACHINE LEARNING AND ADAPTIVE INTELLIGENCE
Data Provided: None DEPARTMENT OF COMPUTER SCIENCE Autumn Semester 203 204 MACHINE LEARNING AND ADAPTIVE INTELLIGENCE 2 hours Answer THREE of the four questions. All questions carry equal weight. Figures
More informationIntroduction to Support Vector Machines
Introduction to Support Vector Machines Shivani Agarwal Support Vector Machines (SVMs) Algorithm for learning linear classifiers Motivated by idea of maximizing margin Efficient extension to non-linear
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters
More informationBayes rule and Bayes error. Donglin Zeng, Department of Biostatistics, University of North Carolina
Bayes rule and Bayes error Definition If f minimizes E[L(Y, f (X))], then f is called a Bayes rule (associated with the loss function L(y, f )) and the resulting prediction error rate, E[L(Y, f (X))],
More informationSupport Vector Machines for Classification: A Statistical Portrait
Support Vector Machines for Classification: A Statistical Portrait Yoonkyung Lee Department of Statistics The Ohio State University May 27, 2011 The Spring Conference of Korean Statistical Society KAIST,
More informationAn Introduction to Expectation-Maximization
An Introduction to Expectation-Maximization Dahua Lin Abstract This notes reviews the basics about the Expectation-Maximization EM) algorithm, a popular approach to perform model estimation of the generative
More informationNonparametric Bayesian Methods (Gaussian Processes)
[70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent
More informationStatistical Machine Learning Hilary Term 2018
Statistical Machine Learning Hilary Term 2018 Pier Francesco Palamara Department of Statistics University of Oxford Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/sml18.html
More informationAlgorithms for Predicting Structured Data
1 / 70 Algorithms for Predicting Structured Data Thomas Gärtner / Shankar Vembu Fraunhofer IAIS / UIUC ECML PKDD 2010 Structured Prediction 2 / 70 Predicting multiple outputs with complex internal structure
More informationDoes Modeling Lead to More Accurate Classification?
Does Modeling Lead to More Accurate Classification? A Comparison of the Efficiency of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang
More informationMidterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 7301: Advanced Machine Learning Vibhav Gogate The University of Texas at Dallas Supervised Learning Issues in supervised learning What makes learning hard Point Estimation: MLE vs Bayesian
More informationBayesian Learning (II)
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP
More informationU Logo Use Guidelines
Information Theory Lecture 3: Applications to Machine Learning U Logo Use Guidelines Mark Reid logo is a contemporary n of our heritage. presents our name, d and our motto: arn the nature of things. authenticity
More informationProbabilistic Graphical Models & Applications
Probabilistic Graphical Models & Applications Learning of Graphical Models Bjoern Andres and Bernt Schiele Max Planck Institute for Informatics The slides of today s lecture are authored by and shown with
More informationKernel Machines. Pradeep Ravikumar Co-instructor: Manuela Veloso. Machine Learning
Kernel Machines Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 SVM linearly separable case n training points (x 1,, x n ) d features x j is a d-dimensional vector Primal problem:
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 9: Variational Inference Relaxations Volkan Cevher, Matthias Seeger Ecole Polytechnique Fédérale de Lausanne 24/10/2011 (EPFL) Graphical Models 24/10/2011 1 / 15
More information18.9 SUPPORT VECTOR MACHINES
744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the
More informationInformation Geometry
2015 Workshop on High-Dimensional Statistical Analysis Dec.11 (Friday) ~15 (Tuesday) Humanities and Social Sciences Center, Academia Sinica, Taiwan Information Geometry and Spontaneous Data Learning Shinto
More information