Surrogate loss functions, divergences and decentralized detection

1 Surrogate loss functions, divergences and decentralized detection. XuanLong Nguyen, Department of Electrical Engineering and Computer Science, U.C. Berkeley. Advisors: Michael Jordan & Martin Wainwright

2 Talk outline. Nonparametric decentralized detection algorithm: use of surrogate loss functions and marginalized kernels; use of convex analysis. Study of surrogate loss functions and divergence functionals: correspondence of losses and divergences; M-estimator of divergences (e.g., Kullback-Leibler divergence).

4 Decentralized decision-making problem: learning both the classifier and the experiment. Covariate vector X and hypothesis (label) Y = ±1; we do not have direct access to X in order to determine Y; learn jointly the mapping (Q, γ): X → Z via Q, then Z → Y via γ. Roles of the experiment Q: data collection constraints (e.g., decentralization), data transmission constraints, choice of variates (feature selection), dimensionality reduction scheme.

5 A decentralized detection system: sensors → communication channel → fusion center → decision rule. Decentralized setting: communication constraints between sensors and fusion center (e.g., bit constraints). Goal: design decision rules for the sensors and the fusion center. Criterion: minimize the probability of incorrect detection.

6 Concrete example: wireless sensor network. (Figure: light source and sensors.) Set-up: a wireless network of tiny sensor motes, each equipped with light/humidity/temperature sensing capabilities; measurement of signal strength ([0, 1024] in magnitude, or 10 bits). Goal: is there a forest fire in a certain region?

8 Related work. Classical work on classification/detection: completely centralized; no consideration of the communication-theoretic infrastructure. Decentralized detection in signal processing (e.g., Tsitsiklis, 1993): joint distribution assumed to be known; locally optimal rules under conditional independence assumptions (i.e., naive Bayes).

9 Overview of our approach. Treat this as a nonparametric estimation (learning) problem under constraints from a distributed system. Use kernel methods and convex surrogate loss functions, and tools from convex optimization to derive an efficient algorithm.

10 Problem set-up. Label Y ∈ {±1}; sensor observations X = (X_1,...,X_S) ∈ {1,...,M}^S; local rules γ_1,...,γ_S produce messages Z = (Z_1,...,Z_S) ∈ {1,...,L}^S with L ≪ M; fusion rule γ(Z_1,...,Z_S).
Problem: given training data (x_i, y_i), i = 1,...,n, find the decision rules (γ_1,...,γ_S; γ) so as to minimize the detection error probability P(Y ≠ γ(Z_1,...,Z_S)).

11 Kernel methods for classification. Classification: learn γ(z) that predicts the label y. K(z, z′) is a symmetric positive semidefinite kernel function; there is a feature space H in which K acts as an inner product, i.e., K(z, z′) = ⟨Ψ(z), Ψ(z′)⟩. A kernel-based algorithm finds a linear function in H, i.e.,
γ(z) = ⟨w, Ψ(z)⟩ = Σ_{i=1}^n α_i K(z_i, z).
Advantages: kernel function classes are sufficiently rich for many applications; optimizing over kernel function classes is computationally efficient.
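
As an illustration only (not part of the original slides), the following minimal Python sketch evaluates such a kernel decision function; the Gaussian kernel, the training points and the coefficients α_i are hypothetical stand-ins for whatever a kernel-based learner would produce.

```python
import numpy as np

def rbf_kernel(z, z_prime, sigma=1.0):
    """Symmetric PSD kernel K(z, z') = exp(-||z - z'||^2 / (2 sigma^2))."""
    diff = np.asarray(z, dtype=float) - np.asarray(z_prime, dtype=float)
    return float(np.exp(-diff @ diff / (2.0 * sigma ** 2)))

def gamma(z, train_z, alpha, sigma=1.0):
    """gamma(z) = <w, Psi(z)> = sum_i alpha_i K(z_i, z), a linear function in the RKHS H."""
    return sum(a * rbf_kernel(zi, z, sigma) for a, zi in zip(alpha, train_z))

# toy usage: three "training" points with made-up coefficients alpha_i
train_z = [np.array([0.0, 1.0]), np.array([1.0, 0.0]), np.array([1.0, 1.0])]
alpha = [0.5, -0.3, 0.8]   # in practice obtained from training (e.g., an SVM dual)
print(np.sign(gamma(np.array([0.7, 0.9]), train_z, alpha)))   # predicted label
```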

12 Convex surrogate loss functions φ to the 0-1 loss. (Figure: zero-one, hinge, logistic and exponential losses plotted against the margin value.)
Minimizing the (regularized) empirical φ-risk Êφ(Y γ(Z)):
min_{γ ∈ H} Σ_{i=1}^n φ(y_i γ(z_i)) + (λ/2) ‖γ‖²,
where (z_i, y_i), i = 1,...,n, are training data in Z × {±1} and φ is a convex loss function (a surrogate to the non-convex 0-1 loss).
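
For concreteness (my own sketch, not from the slides), the losses in the figure can be written as functions of the margin m = yγ(z); the normalization of the logistic loss below is one common convention.

```python
import numpy as np

# Convex surrogates phi(m) of the 0-1 loss, as functions of the margin m = y * gamma(z).
zero_one    = lambda m: (m <= 0).astype(float)
hinge       = lambda m: np.maximum(0.0, 1.0 - m)
logistic    = lambda m: np.log1p(np.exp(-m))       # one common normalization
exponential = lambda m: np.exp(-m)

margins = np.linspace(-2.0, 2.0, 9)
for name, phi in [("zero-one", zero_one), ("hinge", hinge),
                  ("logistic", logistic), ("exponential", exponential)]:
    print(f"{name:12s}", np.round(phi(margins), 3))
```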

13 Stochastic decision rules at each sensor. (Diagram: label Y, sensor signals X_1,...,X_S mapped to messages Z_1,...,Z_S by Q(Z|X), fused by γ(z); base kernel K_z(z, z′) = ⟨Ψ(z), Ψ(z′)⟩ and decision rule γ(z) = ⟨w, Ψ(z)⟩.)
Approximate deterministic sensor decisions by stochastic rules Q(Z|X). Sensors do not communicate directly, which implies the factorization Q(Z|X) = Π_{t=1}^S Q^t(Z_t|X_t). The overall decision rule is represented by γ(z) = ⟨w, Ψ(z)⟩.

14 High-level strategy: joint optimization. Minimize over (Q, γ) an empirical version of Eφ(Y γ(Z)). Joint minimization: fix Q and optimize over γ (a simple convex problem); fix γ and perform a gradient update for Q, sensor by sensor.

15 Approximating the empirical φ-risk. The regularized empirical φ-risk Êφ(Y γ(Z)) has the form
G_0 = Σ_{i=1}^n Σ_z φ(y_i γ(z)) Q(z|x_i) + (λ/2) ‖w‖².
Challenge: even evaluating G_0 at a single point is intractable, since it requires summing over the L^S possible values of z. Idea: approximate G_0 by another objective function G, where G is the φ-risk of a marginalized feature space, and G_0 = G for deterministic Q.

16 Marginalizing over the feature space. (Figure: the original X space and the quantized Z space, with γ(z) = ⟨w, Ψ(z)⟩ on Z and f_Q(x) = ⟨w, Ψ_Q(x)⟩ on X, linked by Q(z|x).)
The stochastic decision rule Q(z|x) maps between X and Z, and induces a marginalized feature map Ψ_Q from the base map Ψ (or a marginalized kernel K_Q from the base kernel K).

20 Marginalized feature space {Ψ_Q(x)}. Define a new feature map Ψ_Q(x) and a linear function over it:
Ψ_Q(x) = Σ_z Q(z|x) Ψ(z)   (marginalization over z),
f_Q(x) = ⟨w, Ψ_Q(x)⟩.
The alternative objective function G is the φ-risk for f_Q:
G = Σ_{i=1}^n φ(y_i f_Q(x_i)) + (λ/2) ‖w‖².
Ψ_Q(x) induces a marginalized kernel over X:
K_Q(x, x′) := ⟨Ψ_Q(x), Ψ_Q(x′)⟩ = Σ_{z,z′} Q(z|x) Q(z′|x′) K_z(z, z′),
where the marginalization is taken over the message z conditioned on the sensor signal x.

21 Marginalized kernels have been used to derive kernel functions from generative models (e.g., Tsuda, 2002). The marginalized kernel K_Q(x, x′) is defined as
K_Q(x, x′) := Σ_{z,z′} Q(z|x) Q(z′|x′) K_z(z, z′),
with Q(z|x) Q(z′|x′) the factorized distributions and K_z(z, z′) the base kernel. If K_z(z, z′) is decomposed into smaller components of z and z′, then K_Q(x, x′) can be computed efficiently (in polynomial time).
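
As a sketch of why the decomposition helps (my own illustration, with hypothetical names): if the base kernel factorizes over sensors as K_z(z, z′) = Π_t k_t(z_t, z_t′) and the rule factorizes as Q(z|x) = Π_t Q_t(z_t|x_t), then K_Q(x, x′) = Π_t Σ_{z_t,z_t′} Q_t(z_t|x_t) Q_t(z_t′|x_t′) k_t(z_t, z_t′), which costs O(S L²) instead of a sum over all L^S message configurations.

```python
import numpy as np

def marginalized_kernel(Qx, Qxp, base_k):
    """
    K_Q(x, x') = prod_t sum_{z_t, z_t'} Q_t(z_t|x_t) Q_t(z_t'|x_t') k_t(z_t, z_t').

    Qx, Qxp : lists of length S; Qx[t] is the length-L vector Q_t(.|x_t).
    base_k  : list of length S; base_k[t] is the L x L Gram matrix k_t(z_t, z_t').
    Cost is O(S * L^2) rather than a sum over all L^S message configurations.
    """
    value = 1.0
    for qt, qtp, kt in zip(Qx, Qxp, base_k):
        value *= qt @ kt @ qtp
    return value

# toy usage: S = 3 sensors, L = 2 possible messages each, identity component kernels
rng = np.random.default_rng(0)
L, S = 2, 3
base_k = [np.eye(L) for _ in range(S)]                  # k_t(z, z') = 1{z = z'}
Qx  = [rng.dirichlet(np.ones(L)) for _ in range(S)]     # Q_t(.|x_t)
Qxp = [rng.dirichlet(np.ones(L)) for _ in range(S)]     # Q_t(.|x_t')
print(marginalized_kernel(Qx, Qxp, base_k))
```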

24 Centralized and decentralized decision functions. The centralized decision function is obtained by minimizing the φ-risk: f_Q(x) = ⟨w, Ψ_Q(x)⟩; f_Q has direct access to the sensor signal x. The optimal w also defines the decentralized decision function γ(z) = ⟨w, Ψ(z)⟩; γ has access only to the quantized version z. The decentralized γ behaves on average like the centralized f_Q: f_Q(x) = E[γ(Z) | x].

25 Optimization algorithm. Goal: solve the problem
inf_{w; Q} G(w; Q) := Σ_i φ( y_i ⟨w, Σ_z Q(z|x_i) Ψ(z)⟩ ) + (λ/2) ‖w‖².
Finding the optimal weight vector: G is convex in w with Q fixed; solve the dual problem (a convex quadratic program) to obtain the optimal w(Q). Finding the optimal decision rules: G is convex in Q^t with w and all other {Q^r, r ≠ t} fixed; the subgradient of G at (w(Q), Q) can be computed efficiently. Overall: efficient joint minimization by blockwise coordinate descent.
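
A schematic sketch of the blockwise coordinate descent loop (my own illustration, not the authors' implementation; solve_w_given_Q and grad_wrt_Qt stand in for the problem-specific dual QP solver and subgradient computation):

```python
import numpy as np

def normalize_rows(Q_t, eps=1e-12):
    """Crude way to keep each conditional Q_t(.|x) a probability vector after a gradient step."""
    Q_t = np.clip(Q_t, eps, None)
    return Q_t / Q_t.sum(axis=1, keepdims=True)

def block_coordinate_descent(solve_w_given_Q, grad_wrt_Qt, Q_init, n_iters=50, step=0.1):
    """
    Schematic alternating minimization of G(w; Q):
      (i)  Q fixed: G is convex in w, so w(Q) is found exactly (e.g., by solving the dual QP);
      (ii) w fixed: update one sensor's rule Q_t at a time by a (sub)gradient step,
           keeping every row of Q_t a distribution over the L messages.
    solve_w_given_Q and grad_wrt_Qt are problem-specific callables (placeholders here).
    """
    Q = [q.copy() for q in Q_init]            # Q[t] has shape (M, L): row m is Q_t(.|x_t = m)
    w = None
    for _ in range(n_iters):
        w = solve_w_given_Q(Q)                # inner convex problem in w
        for t in range(len(Q)):               # sensor-by-sensor gradient update
            Q[t] = normalize_rows(Q[t] - step * grad_wrt_Qt(w, Q, t))
    return w, Q
```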

26 Simulated sensor networks. (Figure: three network structures over Y and X_1,...,X_10: a naive Bayes net, a chain-structured network, and a spatially-dependent network.)

27 Kernel quantization (KQ) vs. decentralized LRT. (Figure: four panels comparing the test error of the decentralized likelihood-ratio test against the kernel quantization method on naive Bayes networks, chain-structured networks (1st and 2nd order), and fully connected networks.)

28 Wireless network with tiny Berkeley motes. (Figure: light source and sensors.) 5 × 5 = 25 tiny sensor motes, each equipped with a light receiver; light signal strength requires 10 bits ([0, 1024] in magnitude). Perform classification with respect to different regions; each problem has 25 training positions and 81 test positions. (Data collection courtesy of Bruno Sinopoli.)

29 Classification with Mica sensor motes. (Figure: test error across classification problem instances for a centralized SVM (10-bit signal), a centralized naive Bayes classifier, and decentralized KQ with 1-bit and 2-bit messages.)

30 Outline. Nonparametric decentralized detection algorithm: use of surrogate loss functions and marginalized kernels. Study of surrogate loss functions and divergence functionals: correspondence of losses and divergences; M-estimator of divergences (e.g., Kullback-Leibler divergence).

32 Consistency question. Recall that our decentralized algorithm essentially solves min_{γ,Q} Êφ(Y γ(Z)). Does this also imply optimality in the sense of the 0-1 loss, P(Y ≠ γ(Z))? Answers: the hinge loss yields consistent estimates; all losses corresponding to the variational distance yield consistency, and we can identify all of them; the exponential and logistic losses do not. The gist lies in the correspondence between loss functions and divergence functionals.

34 Divergence between two distributions. The f-divergence between two densities μ and π is given by
I_f(μ, π) := ∫ π(z) f(μ(z)/π(z)) dν,
where f : [0, +∞) → R ∪ {+∞} is a continuous convex function.
Kullback-Leibler divergence: f(u) = u log u, giving I_f(μ, π) = Σ_z μ(z) log(μ(z)/π(z)).
Variational distance: f(u) = |u − 1|, giving I_f(μ, π) = Σ_z |μ(z) − π(z)|.
Hellinger distance: f(u) = ½(√u − 1)², giving I_f(μ, π) = ½ Σ_{z∈Z} (√μ(z) − √π(z))².
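
A small numerical sketch (my own illustration) of these three divergences computed directly from the definition I_f(μ, π) = Σ_z π(z) f(μ(z)/π(z)) for a toy pair of discrete distributions:

```python
import numpy as np

def f_divergence(mu, pi, f):
    """I_f(mu, pi) = sum_z pi(z) * f(mu(z) / pi(z)) for discrete distributions."""
    mu, pi = np.asarray(mu, float), np.asarray(pi, float)
    return float(np.sum(pi * f(mu / pi)))

kl          = lambda u: u * np.log(u)            # Kullback-Leibler
variational = lambda u: np.abs(u - 1.0)          # variational (L1) distance
hellinger   = lambda u: 0.5 * (np.sqrt(u) - 1.0) ** 2

mu = np.array([0.2, 0.5, 0.3])
pi = np.array([0.3, 0.3, 0.4])
for name, f in [("KL", kl), ("variational", variational), ("Hellinger", hellinger)]:
    print(name, round(f_divergence(mu, pi, f), 4))
```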

35 Surrogate loss and f-divergence. The map Q induces measures on Z: μ(z) := P(Y = 1, z) and π(z) := P(Y = −1, z).
Theorem: fixing Q, the optimal risk for each φ-loss is an f-divergence for some convex f, and vice versa:
R_φ(Q) = −I_f(μ, π), where R_φ(Q) := min_γ Eφ(Y, γ(Z)).
(Figure: correspondence φ_1, φ_2, φ_3 ↔ f_1, f_2, f_3 between the class of loss functions and the class of f-divergences.)

40 Unrolling divergences by convex duality. Legendre-Fenchel convex duality: f(u) = sup_{v ∈ R} (uv − f*(v)), where f* is the convex conjugate of f. Then
I_f(μ, π) = ∫ π f(μ/π) dν
= ∫ π sup_γ (γ μ/π − f*(γ)) dν
= sup_γ ∫ (γ μ − f*(γ) π) dν
= − inf_γ ∫ (f*(γ) π − γ μ) dν.
The last quantity can be viewed as a risk functional with respect to the loss functions f*(γ) and −γ.

41 Examples.
0-1 loss: R_bayes(Q) = Σ_{z∈Z} min{μ(z), π(z)}  (variational distance).
Hinge loss: R_hinge(Q) = 2 R_bayes(Q)  (variational distance).
Exponential loss: R_exp(Q) = 1 − Σ_{z∈Z} (μ(z)^{1/2} − π(z)^{1/2})²  (Hellinger distance).
Logistic loss: R_log(Q) = log 2 − KL(μ ‖ (μ+π)/2) − KL(π ‖ (μ+π)/2)  (capacitory discrimination distance).
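
A quick sanity check (my own illustration, toy numbers): minimize the conditional φ-risk μ(z)φ(α) + π(z)φ(−α) numerically at each z and compare the summed optimum with the closed forms above for the hinge and exponential losses.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# toy joint measures mu(z) = P(Y=1, z), pi(z) = P(Y=-1, z); together they sum to 1
mu = np.array([0.10, 0.25, 0.15])
pi = np.array([0.20, 0.05, 0.25])

def optimal_risk(phi):
    """R_phi(Q) = sum_z min_alpha [ mu(z) phi(alpha) + pi(z) phi(-alpha) ]."""
    return sum(minimize_scalar(lambda a: m * phi(a) + p * phi(-a),
                               bounds=(-20, 20), method="bounded").fun
               for m, p in zip(mu, pi))

hinge = lambda a: max(0.0, 1.0 - a)
expo  = lambda a: np.exp(-a)

print(optimal_risk(hinge), 2 * np.minimum(mu, pi).sum())                  # hinge: 2 x Bayes risk
print(optimal_risk(expo), 1 - ((np.sqrt(mu) - np.sqrt(pi)) ** 2).sum())   # exp: Hellinger form
```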

42 Examples. (Figure: equivalent surrogate losses corresponding to the Hellinger distance (left) and the variational distance (right), plotted as φ-loss against the margin value.) The part of each loss after 0 is a fixed map of the part before 0.

43 Comparison of loss functions. (Figure: correspondence between the class of loss functions φ_1, φ_2, φ_3 and the class of f-divergences f_1, f_2, f_3.) Consider two loss functions φ_1 and φ_2, corresponding to the f-divergences induced by f_1 and f_2. φ_1 and φ_2 are universally equivalent, denoted φ_1 ≈ φ_2 (or, equivalently, f_1 ≈ f_2), if for any P(X, Y) and quantization rules Q_A, Q_B, there holds
R_{φ_1}(Q_A) ≤ R_{φ_1}(Q_B)  ⟺  R_{φ_2}(Q_A) ≤ R_{φ_2}(Q_B).

44 An equivalence theorem. Theorem: φ_1 ≈ φ_2 (or, equivalently, f_1 ≈ f_2) if and only if f_1(u) = c f_2(u) + a u + b for constants a, b ∈ R and c > 0. In particular, the surrogate losses universally equivalent to the 0-1 loss are those whose induced f-divergence has the form f(u) = c min{u, 1} + a u + b.

45 Empirical risk minimization procedure. Let φ be a convex surrogate equivalent to the 0-1 loss, and let (C_n, D_n) be a sequence of increasing function classes for (γ, Q), with (C_1, D_1) ⊆ (C_2, D_2) ⊆ ... ⊆ (Γ, Q). Our procedure learns
(γ*_n, Q*_n) := argmin_{(γ,Q) ∈ (C_n, D_n)} Êφ(Y γ(Z)).
Let R_bayes := inf_{(γ,Q) ∈ (Γ, Q)} P(Y ≠ γ(Z)) denote the optimal Bayes error. Our procedure is consistent if R_bayes(γ*_n, Q*_n) − R_bayes → 0.

46 Consistency result. Theorem: if ∪_{n=1}^∞ (C_n, D_n) is dense in the space of pairs of classifier and quantizer (γ, Q) ∈ (Γ, Q), and the sequence (C_n, D_n) increases in size sufficiently slowly, then our procedure is consistent, i.e.,
lim_n R_bayes(γ*_n, Q*_n) − R_bayes = 0 in probability.
The proof exploits the equivalence of the φ-loss and the 0-1 loss, and a decomposition of the φ-risk into approximation error and estimation error.

47 Outline. Nonparametric decentralized detection algorithm: use of surrogate loss functions and marginalized kernels. Study of surrogate loss functions and divergence functionals: correspondence of losses and divergences; M-estimator of divergences (e.g., Kullback-Leibler divergence).

48 Estimating the divergence and the likelihood ratio. Given i.i.d. samples {x_1,...,x_n} ~ Q and {y_1,...,y_n} ~ P, we want to estimate two quantities: the KL divergence functional D_K(P, Q) = ∫ p_0 log(p_0/q_0) dμ, and the likelihood ratio function g_0(·) = p_0(·)/q_0(·).

50 Variational characterization. Recall the correspondence min_γ Eφ(Y, γ(Z)) = −I_f(μ, π): an f-divergence can be estimated by minimizing over some associated φ-risk functional. For the Kullback-Leibler divergence,
D_K(P, Q) = sup_{g>0} ∫ log g dP − ∫ g dQ + 1.
Furthermore, the supremum is attained at g = p_0/q_0.

51 M-estimation procedure. Let G be a class of functions X → R_+, and let ∫ · dP_n and ∫ · dQ_n denote expectations under the empirical measures P_n and Q_n, respectively. Our estimator has the form
D̂_K = sup_{g ∈ G} ∫ log g dP_n − ∫ g dQ_n + 1.
The supremum is attained at ĝ_n, which estimates the likelihood ratio p_0/q_0.
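
A rough illustration of this M-estimation idea (not the authors' kernel-based procedure): maximize the empirical objective over a simple log-linear class g_θ(u) = exp(θᵀφ(u)), using samples from two Gaussians whose true KL divergence (0.5) is known.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=500)      # x_i ~ Q = N(0, 1)
y = rng.normal(1.0, 1.0, size=500)      # y_j ~ P = N(1, 1); true KL(P, Q) = 0.5

def features(u):
    """Feature map for the log-linear ratio model log g_theta(u) = theta . phi(u)."""
    return np.column_stack([np.ones_like(u), u, u ** 2])

def neg_objective(theta):
    """Negative of (1/n) sum_j log g(y_j) - (1/n) sum_i g(x_i) + 1."""
    log_g_on_p = features(y) @ theta
    g_on_q = np.exp(np.clip(features(x) @ theta, -30.0, 30.0))  # clip guards against overflow
    return -(log_g_on_p.mean() - g_on_q.mean() + 1.0)

theta_hat = minimize(neg_objective, x0=np.zeros(3), method="L-BFGS-B").x
print("estimated KL(P, Q):", -neg_objective(theta_hat))        # lands near 0.5
```

Here the true log-ratio, u − 1/2, lies in the span of the chosen features, so the plug-in estimate approaches the true divergence as the sample size grows.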

52 Convex empirical risk with a penalty. In practice, control the size of the function class G by using a penalty. Let I(g) be a measure of complexity for g, and decompose G as G = ∪_{M ≥ 1} G_M, where G_M is restricted to those g for which I(g) ≤ M. The estimation procedure involves solving
ĝ_n = argmin_{g ∈ G} ∫ g dQ_n − ∫ log g dP_n + (λ_n/2) I²(g).

53 Convergence analysis. For KL divergence estimation we study D̂_K − D_K(P, Q); for likelihood ratio estimation we use the Hellinger distance
h²_Q(g, g_0) := ½ ∫ (g^{1/2} − g_0^{1/2})² dQ.

54 Assumptions for the convergence analysis. The true likelihood ratio g_0 is bounded from below by some positive constant: g_0 ≥ η_0 > 0 (note: we don't assume that G is bounded away from 0, not yet!). The uniform norm on G_M is Lipschitz with respect to the penalty measure I(g): for any M ≥ 1, sup_{g ∈ G_M} ‖g‖_∞ ≤ cM. On the entropy of G: for some 0 < γ < 2, H^B_δ(G_M, L_2(Q)) = O(M/δ)^γ.

56 Convergence rates. When λ_n vanishes sufficiently slowly, λ_n^{-1} = O_P(n^{2/(2+γ)})(1 + I(g_0)), then under P:
h_Q(g_0, ĝ_n) = O_P(λ_n^{1/2})(1 + I(g_0)),
I(ĝ_n) = O_P(1 + I(g_0)).
If G is bounded away from 0:
|D_K − D̂_K| = O_P(λ_n^{1/2})(1 + I(g_0)).

58 G is an RKHS function class. {x_i} ~ Q, {y_j} ~ P; G is an RKHS with Mercer kernel k(x, x′) = ⟨Φ(x), Φ(x′)⟩ and I(g) = ‖g‖_H:
ĝ_n = argmin_{g ∈ G} ∫ g dQ_n − ∫ log g dP_n + (λ_n/2) ‖g‖²_H.
Convex dual formulation:
α := argmax_α (1/n) Σ_{j=1}^n log α_j − (1/(2λ_n)) ‖ (1/n) Σ_{j=1}^n α_j Φ(y_j) − (1/n) Σ_{i=1}^n Φ(x_i) ‖²,
D̂_K(P, Q) := (1/n) Σ_{j=1}^n log α_j − log n.

59 log G is an RKHS function class. {x_i} ~ Q, {y_j} ~ P; log G is an RKHS with Mercer kernel k(x, x′) = ⟨Φ(x), Φ(x′)⟩ and I(g) = ‖log g‖_H:
ĝ_n = argmin_{g ∈ G} ∫ g dQ_n − ∫ log g dP_n + (λ_n/2) ‖log g‖²_H.
Convex dual formulation:
α := argmax_α − Σ_{i=1}^n ( α_i log α_i + α_i log(n/e) ) − (1/(2λ_n)) ‖ Σ_{i=1}^n α_i Φ(x_i) − (1/n) Σ_{j=1}^n Φ(y_j) ‖²,
D̂_K(P, Q) := 1 + Σ_{i=1}^n ( α_i log α_i + α_i log(n/e) ).

60 (Figure: estimates of KL(Beta(1,2), Unif[0,1]), KL(½N_t(0,1) + ½N_t(1,1), Unif[-5,5]), KL(N_t(0,1), N_t(4,2)) and KL(N_t(4,2), N_t(0,1)) versus sample size, comparing estimators M1 and M2 (various kernel widths σ and regularization λ) against the WKV estimator with several partition sizes s.)

61 (Figure: estimates of KL(N_t(0,I_2), N_t(1,I_2)), KL(N_t(0,I_2), Unif[-3,3]²), KL(N_t(0,I_3), N_t(1,I_3)) and KL(N_t(0,I_3), Unif[-3,3]³) versus sample size, again comparing M1 and M2 against the WKV estimator.)

62 Conclusion. Nonparametric decentralized detection algorithm: use of surrogate loss functions and marginalized kernels. Study of surrogate loss functions and divergence functionals: correspondence of losses and divergences; M-estimator of divergences (e.g., Kullback-Leibler divergence).
