A Strongly Quasiconvex PAC-Bayesian Bound
1 A Strongly Quasiconvex PAC-Bayesian Bound
Yevgeny Seldin
NIPS-2017 Workshop on (Almost) 50 Shades of Bayesian Learning: PAC-Bayesian trends and insights
Based on joint work with Niklas Thiemann, Christian Igel, and Olivier Wintenberger (ALT 2017)
2-3 Quick Summary
Two major ways to convexify classification with the 0-1 loss:
- Convexify the loss
- Work in the space of distributions over H (PAC-Bayes)
We propose:
- A relaxation of the PAC-Bayes-kl bound (Seeger, 2002) and an alternating minimization procedure
- Sufficient conditions for strong quasiconvexity of the bound, which guarantee convergence to the global minimum
- A construction of a hypothesis space tailored for the bound
In our experiments, rigorous minimization of the bound was competitive with cross-validation in tuning the trade-off between complexity and empirical performance.
4 Outline
- A Very Quick Recap of PAC-Bayesian Analysis
- A Strongly Quasiconvex PAC-Bayesian Bound
- Construction of a Hypothesis Set
- Experiments
5-6 Randomized Classifiers
Let ρ be a distribution over H. At each round of the game:
1. Pick h ∈ H according to ρ(h)
2. Observe x
3. Return h(x)
Expected loss of ρ: E_{h~ρ}[L(h)] = E_ρ[L(h)]
Empirical loss of ρ on a sample S: E_{h~ρ}[L̂(h,S)] = E_ρ[L̂(h,S)]
7-8 Approximation-Estimation Perspective (Bias-Variance)
Randomized classification: avoid selection when it is not necessary.
If L̂(h,S) ≈ L̂(h′,S) and π(h) ≈ π(h′), take ρ(h) ≈ ρ(h′).
Reduced variance at the same bias level.
9 Kullback-Leibler (KL) Divergence = Relative Entropy
KL divergence: let ρ and π be two distributions over H, then
KL(ρ‖π) = E_ρ[ln(ρ(h)/π(h))].
Binary kl divergence: for two Bernoulli random variables with biases p and q,
kl(p‖q) = KL([p, 1−p] ‖ [q, 1−q]).
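As a concrete illustration (not part of the slides), a minimal Python sketch of these two quantities for a finite H; the function names are ours:

```python
import numpy as np

def kl_divergence(rho, pi):
    """KL(rho || pi) for discrete distributions given as probability vectors over H."""
    rho, pi = np.asarray(rho, dtype=float), np.asarray(pi, dtype=float)
    support = rho > 0  # terms with rho(h) = 0 contribute 0 by convention
    return float(np.sum(rho[support] * np.log(rho[support] / pi[support])))

def binary_kl(p, q):
    """kl(p || q) between Bernoulli random variables with biases p and q."""
    return kl_divergence([p, 1.0 - p], [q, 1.0 - q])
```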
10-11 PAC-Bayes-kl Inequality
Theorem (Seeger, 2002). For any prior π over H and any δ ∈ (0,1), with probability greater than 1 − δ over a random draw of a sample S, for all distributions ρ over H simultaneously:
kl(E_ρ[L̂(h,S)] ‖ E_ρ[L(h)]) ≤ (KL(ρ‖π) + ln(2√n/δ)) / n.
Challenge: the bound is not convex in ρ.
Common heuristic: replace it with a parametrized trade-off β n E_ρ[L̂(h,S)] + KL(ρ‖π) and tune β by cross-validation.
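To turn the kl form into an explicit upper bound on E_ρ[L(h)], one can invert the binary kl numerically. A hedged sketch of this standard step (our own illustration, reusing binary_kl from above):

```python
def kl_inverse_upper(p_hat, eps, tol=1e-9):
    """Largest q >= p_hat with kl(p_hat || q) <= eps, found by bisection."""
    lo, hi = p_hat, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_kl(p_hat, mid) <= eps:
            lo = mid
        else:
            hi = mid
    return lo

# E_rho[L(h)] <= kl_inverse_upper(E_rho[hat L(h, S)], (KL(rho || pi) + ln(2 sqrt(n) / delta)) / n)
```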
12 Outline
- A Very Quick Recap of PAC-Bayesian Analysis
- A Strongly Quasiconvex PAC-Bayesian Bound
- Construction of a Hypothesis Set
- Experiments
13-14 Relaxation of PAC-Bayes-kl
Based on a refined Pinsker's inequality.
Theorem (PAC-Bayes-λ Inequality). For any prior π and any δ ∈ (0,1), with probability greater than 1 − δ, for all ρ and λ ∈ (0,2) simultaneously:
E_ρ[L(h)] ≤ E_ρ[L̂(h,S)] / (1 − λ/2) + (KL(ρ‖π) + ln(2√n/δ)) / (λ(1 − λ/2)n).
For the optimal λ this leads to
E_ρ[L(h)] ≤ E_ρ[L̂(h,S)] + √(2 E_ρ[L̂(h,S)] (KL(ρ‖π) + ln(2√n/δ)) / n) + 2 (KL(ρ‖π) + ln(2√n/δ)) / n,
giving a fast convergence rate when the empirical loss is small.
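A hedged numerical sketch of the PAC-Bayes-λ bound for a finite hypothesis set (names are illustrative; kl_divergence is from the earlier sketch):

```python
import numpy as np

def pac_bayes_lambda_bound(rho, pi, emp_losses, n, delta, lam):
    """Evaluate the right-hand side of the PAC-Bayes-lambda bound."""
    assert 0.0 < lam < 2.0
    emp = float(np.dot(rho, emp_losses))  # E_rho[hat L(h, S)]
    complexity = kl_divergence(rho, pi) + np.log(2.0 * np.sqrt(n) / delta)
    return emp / (1.0 - lam / 2.0) + complexity / (lam * (1.0 - lam / 2.0) * n)
```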
15-18 Alternating Minimization of PAC-Bayes-λ
F(ρ,λ) = E_ρ[L̂(h,S)] / (1 − λ/2) + (KL(ρ‖π) + ln(2√n/δ)) / (λ(1 − λ/2)n)
For a fixed λ the bound is convex in ρ and minimized by
ρ_λ(h) = π(h) e^{−λn L̂(h,S)} / E_π[e^{−λn L̂(h,S)}].
For a fixed ρ the bound is convex in λ and minimized by
λ = 2 / (√(2n E_ρ[L̂(h,S)] / (KL(ρ‖π) + ln(2√n/δ)) + 1) + 1).
F(ρ,λ) is not necessarily jointly convex in ρ and λ.
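A hedged sketch of the alternating minimization over a finite H (our own illustration; emp_losses holds L̂(h,S) per hypothesis, and kl_divergence / pac_bayes_lambda_bound come from the sketches above):

```python
import numpy as np

def optimal_rho(pi, emp_losses, n, lam):
    """rho_lambda(h) proportional to pi(h) * exp(-lam * n * hat L(h, S)), computed in log space."""
    log_w = np.log(pi) - lam * n * np.asarray(emp_losses)
    log_w -= log_w.max()  # shift for numerical stability
    w = np.exp(log_w)
    return w / w.sum()

def optimal_lambda(rho, pi, emp_losses, n, delta):
    """Closed-form minimizer of the bound in lambda for a fixed rho."""
    emp = float(np.dot(rho, emp_losses))
    comp = kl_divergence(rho, pi) + np.log(2.0 * np.sqrt(n) / delta)
    return 2.0 / (np.sqrt(2.0 * n * emp / comp + 1.0) + 1.0)

def alternating_minimization(pi, emp_losses, n, delta, lam=1.0, tol=1e-8, max_iter=1000):
    """Alternate the two closed-form updates until the bound stops decreasing."""
    bound = np.inf
    for _ in range(max_iter):
        rho = optimal_rho(pi, emp_losses, n, lam)
        lam = optimal_lambda(rho, pi, emp_losses, n, delta)
        new_bound = pac_bayes_lambda_bound(rho, pi, emp_losses, n, delta, lam)
        converged = bound - new_bound < tol
        bound = new_bound
        if converged:
            break
    return rho, lam, bound
```

Each update has a closed form, so every iteration is a single pass over the m empirical losses.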
19-21 Simplification 1
F(ρ,λ) = E_ρ[L̂(h,S)] / (1 − λ/2) + (KL(ρ‖π) + ln(2√n/δ)) / (λ(1 − λ/2)n)
ρ_λ(h) = π(h) e^{−λn L̂(h,S)} / E_π[e^{−λn L̂(h,S)}]
Plugging ρ_λ into F gives
F(λ) = F(ρ_λ, λ) = E_{ρ_λ}[L̂(h,S)] / (1 − λ/2) + (KL(ρ_λ‖π) + ln(2√n/δ)) / (λ(1 − λ/2)n),
a one-dimensional function.
22-24 Simplification 2
F(λ) = F(ρ_λ, λ) = E_{ρ_λ}[L̂(h,S)] / (1 − λ/2) + (KL(ρ_λ‖π) + ln(2√n/δ)) / (λ(1 − λ/2)n)
ρ_λ(h) = π(h) e^{−λn L̂(h,S)} / E_π[e^{−λn L̂(h,S)}]
KL(ρ_λ‖π) = E_{ρ_λ}[ln(ρ_λ(h)/π(h))] = E_{ρ_λ}[ln(e^{−nλL̂(h,S)} / E_π[e^{−nλL̂(h,S)}])] = −nλ E_{ρ_λ}[L̂(h,S)] − ln E_π[e^{−nλL̂(h,S)}]
Substituting gives
F(λ) = (−ln E_π[e^{−nλL̂(h,S)}] + ln(2√n/δ)) / (nλ(1 − λ/2)).
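A hedged sketch of F(λ) as a one-dimensional function, using a log-sum-exp for ln E_π[e^{−nλL̂(h,S)}] (illustrative, not from the slides):

```python
import numpy as np
from scipy.special import logsumexp

def F(lam, pi, emp_losses, n, delta):
    """F(lambda) after both simplifications, for a finite hypothesis set."""
    log_mgf = logsumexp(np.log(pi) - n * lam * np.asarray(emp_losses))  # ln E_pi[e^{-n lam hat L}]
    return (-log_mgf + np.log(2.0 * np.sqrt(n) / delta)) / (n * lam * (1.0 - lam / 2.0))
```

Plotting F over a grid of λ in (0, 2) is an easy way to inspect its (quasi)convexity on a given problem.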
25 Strong Quasiconvexity - Sufficient Condition
Theorem (Strong Quasiconvexity). If at least one of the two conditions
2 KL(ρ_λ‖π) + ln(4n/δ²) > λ²n² Var_{ρ_λ}[L̂(h,S)]
or
E_{ρ_λ}[L̂(h,S)] > (1 − λ) n Var_{ρ_λ}[L̂(h,S)]
is satisfied for all λ ∈ [√(ln(2√n/δ)/n), 1], then F(λ) is strongly quasiconvex for λ ∈ (0, 1] and alternating minimization converges to the global minimum of F.
26 Proof Highlights
F(λ) = (−ln E_π[e^{−nλL̂(h,S)}] + ln(2√n/δ)) / (nλ(1 − λ/2))
Show that the second derivative of F(λ) is positive at all stationary points, using
−(1/n) d/dλ ln E_π[e^{−nλL̂(h,S)}] = E_{ρ_λ}[L̂(h,S)]
(1/n) d²/dλ² ln E_π[e^{−nλL̂(h,S)}] = n Var_{ρ_λ}[L̂(h,S)].
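These two identities can be checked numerically; a hedged finite-difference sketch on a toy hypothesis set (our own illustration, all values arbitrary):

```python
import numpy as np
from scipy.special import logsumexp

def log_mgf(lam, pi, losses, n):
    """ln E_pi[exp(-n * lam * hat L(h, S))]."""
    return logsumexp(np.log(pi) - n * lam * np.asarray(losses))

pi = np.full(5, 0.2)                               # uniform prior over 5 hypotheses
losses = np.array([0.10, 0.15, 0.22, 0.30, 0.38])  # toy empirical losses
n, lam, h = 50, 0.5, 1e-4

rho = np.exp(np.log(pi) - n * lam * losses - log_mgf(lam, pi, losses, n))
mean = rho @ losses
var = rho @ (losses - mean) ** 2

d1 = (log_mgf(lam + h, pi, losses, n) - log_mgf(lam - h, pi, losses, n)) / (2 * h)
d2 = (log_mgf(lam + h, pi, losses, n) - 2 * log_mgf(lam, pi, losses, n)
      + log_mgf(lam - h, pi, losses, n)) / h ** 2

print(np.isclose(-d1 / n, mean))                 # -(1/n) d/dlam ln E_pi[...] = E_{rho_lam}[hat L]
print(np.isclose(d2 / n, n * var, rtol=1e-3))    # (1/n) d^2/dlam^2 ln E_pi[...] = n Var_{rho_lam}[hat L]
```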
27-28 Weak Separation - Sufficient Condition for Strong Quasiconvexity
Theorem (Weak Separation). Let H be finite with |H| = m, let π be uniform, and let h* be an empirical loss minimizer. Define separation thresholds a < b, explicit quantities of order √(ln(mn/δ)/n). If the number of hypotheses for which L̂(h,S) ∈ (L̂(h*,S) + a, L̂(h*,S) + b) is sufficiently small (the theorem gives an explicit threshold), then F(λ) is strongly quasiconvex and alternating minimization converges to the global minimum.
29 Proof Highlights
By the Strong Quasiconvexity Theorem, if Var_{ρ_λ}[L̂(h,S)] is small, then F(λ) is strongly quasiconvex.
Let Δ_h = L̂(h,S) − L̂(h*,S). Then
Var_{ρ_λ}[L̂(h,S)] ≤ E_{ρ_λ}[Δ_h²] = Σ_h ρ_λ(h) Δ_h² = (Σ_h Δ_h² e^{−nλΔ_h}) / (Σ_h e^{−nλΔ_h}).
30-32 Breaking the Quasiconvexity
It is possible to break the quasiconvexity, but one has to work hard for it.
For example, taking n = 200, δ = 0.25, a very large m, Δ_h = 0.1, and a uniform π breaks it.
In all our experiments F(λ) was convex, even when the weak separation sufficient condition was violated, so it might be possible to relax the sufficient condition further.
33 Outline
- A Very Quick Recap of PAC-Bayesian Analysis
- A Strongly Quasiconvex PAC-Bayesian Bound
- Construction of a Hypothesis Set
- Experiments
34-36 Challenge
Computation of the normalization of ρ_λ can be prohibitively expensive:
ρ_λ(h) = π(h) e^{−λn L̂(h,S)} / E_π[e^{−λn L̂(h,S)}] = π(h) e^{−λn L̂(h,S)} / Σ_{h′} π(h′) e^{−λn L̂(h′,S)}
Parametrization of ρ may break the convexity.
Solution: work with a finite H. We need a powerful finite H.
37-39 Construction of a Finite Sample-Dependent H
- Select m = |H| subsamples of r points each.
- Train a model h on its r points and validate it on the remaining n − r points.
- Validation loss: L̂^val(h).
Adapted bound:
E_ρ[L(h)] ≤ E_ρ[L̂^val(h,S)] / (1 − λ/2) + (KL(ρ‖π) + ln(2√(n − r + 1)/δ)) / ((n − r)λ(1 − λ/2))
Special case: k-fold cross-validation. Most computational advantage is achieved by inverse cross-validation (train on a single small fold, validate on the remaining folds).
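A hedged sketch of this construction with scikit-learn SVMs (hypothetical helper names; the subsampling scheme and kernel settings are illustrative, not the authors' exact setup):

```python
import numpy as np
from sklearn.svm import SVC

def build_hypothesis_set(X, y, m, r, seed=0):
    """Train m weak SVMs on random subsamples of size r; validate each on the held-out points."""
    rng = np.random.default_rng(seed)
    n = len(y)
    hypotheses, val_losses = [], []
    for _ in range(m):
        idx = rng.choice(n, size=r, replace=False)     # assumes each subsample contains both classes
        mask = np.ones(n, dtype=bool)
        mask[idx] = False
        h = SVC(kernel="rbf").fit(X[idx], y[idx])
        val_losses.append(np.mean(h.predict(X[mask]) != y[mask]))  # 0-1 validation loss
        hypotheses.append(h)
    return hypotheses, np.array(val_losses)

# The adapted bound is then minimized with n - r in place of n, e.g. (d is the feature dimension):
# hs, losses = build_hypothesis_set(X, y, m=len(y), r=d + 1)
# rho, lam, bound = alternating_minimization(np.full(len(hs), 1 / len(hs)), losses,
#                                            n=len(y) - (d + 1), delta=0.05)
```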
40 Outline
- A Very Quick Recap of PAC-Bayesian Analysis
- A Strongly Quasiconvex PAC-Bayesian Bound
- Construction of a Hypothesis Set
- Experiments
41-42 Experiments
We compare:
- Kernel SVM trained by cross-validation
- ρ-weighting of multiple weak SVMs, each trained on d + 1 samples
* More precisely, we apply ρ-weighted majority-vote aggregation
MV_ρ(x) = sign(Σ_h ρ(h) h(x)),
but in our case there was no significant difference between L(MV_ρ) and E_ρ[L(h)].
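A hedged sketch of the ρ-weighted majority vote over the hypothesis set built above (labels assumed to be ±1; names are illustrative):

```python
import numpy as np

def rho_weighted_majority_vote(hypotheses, rho, X):
    """MV_rho(x) = sign(sum_h rho(h) * h(x)) for labels in {-1, +1}."""
    votes = np.stack([h.predict(X) for h in hypotheses])  # shape (m, n_points)
    return np.sign(rho @ votes)                           # ties (value 0) can be broken arbitrarily
```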
43-45 Rough Runtime Comparison
k-fold cross-validation of kernel SVMs:
k·n^{2+} (training) + V·k·n^{2+} (validation),
where n^{2+} denotes the superquadratic cost of training a kernel SVM on n points.
PAC-Bayesian aggregation of kernel SVMs, for r = d + 1 and m = n:
m·r^{2+} (training) + m·r·n (validation) + A (aggregation), where m·r·n ≈ d·n² and A is the cost of the aggregation step.
Computational speed-up!
46 Experiments: results on four datasets: (a) Ionosphere, n = 200, r = d + 1 = 35; (b) Waveform, n = 2000, r = d + 1 = 41; (c) Breast cancer, n = 340, r = d + 1 = 11; (d) AvsB, n = 1000, r = d + 1 = 17.
47-48 Summary
We proposed:
- A relaxation of the PAC-Bayes-kl bound (Seeger, 2002)
- An alternating minimization procedure
- Sufficient conditions for strong quasiconvexity, which guarantee convergence to the global minimum
- A construction of H
In our experiments, rigorous minimization of the bound was competitive with cross-validation in tuning the trade-off between complexity and empirical performance.
Rigorous minimization of a theoretical bound competitive with cross-validation!
49-52 What's next?
Improved sufficient conditions:
- In practice the bound was strongly convex even when the weak separation sufficient condition was violated; relax the sufficient condition.
- We dropped some terms when going from the Strong Quasiconvexity Theorem (re-shown on slide 50) to the Weak Separation Condition.
Improved analysis of the weighted majority vote:
- Combine the results with improved analyses of the weighted majority vote (the "C-bound"):
  Lacasse, Laviolette, Marchand, Germain, and Usunier, NIPS 2007;
  Laviolette, Marchand, and Roy, ICML 2011;
  Germain, Lacasse, Laviolette, Marchand, and Roy, JMLR 2015.