A talk on Oracle inequalities and regularization, by Sara van de Geer
1 A talk on Oracle inequalities and regularization, by Sara van de Geer.
Workshop "Regularization in Statistics", Banff International Research Station, September 6-11, 2003.
2 Aim: to compare the $\ell_1$ penalty with other penalties.
- Explain the bias-variance paradigm
- Consider penalized least squares in general
- Study the $\ell_1$ penalty in robust regression
- Study classification problems, $\ell_1$ penalties, and an $\ell_{1/2}$ penalty
3 1. Regression model.
$\epsilon_1, \dots, \epsilon_n$ independent $\mathcal{N}(0, \sigma^2)$, $x_i \in \mathcal{X}$, $i = 1, \dots, n$.
$f_0 : [0,1] \to \mathbb{R}$ unknown function.
$$Y_i = f_0(x_i) + \epsilon_i, \quad i = 1, \dots, n.$$
4 An example. $x_i = i/n$, $i = 1, \dots, n$. Estimator:
$$\hat f_n = \arg\min_f \Big\{ \frac{1}{n} \sum_{i=1}^n |Y_i - f(x_i)|^2 + \lambda^2 \int_0^1 |f'(x)|^2\, dx \Big\}.$$
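A minimal numerical sketch of this criterion (not part of the original slides; the grid discretization, noise level, and choice of $\lambda$ are assumptions for illustration). Replacing the derivative by a first-order difference turns the criterion into a ridge-type quadratic solved by a linear system:

```python
import numpy as np

def smoothing_estimator(y, lam):
    """Minimize (1/n) sum (y_i - f_i)^2 + lam^2 * (1/n) sum ((Df)_i)^2,
    a discrete analogue of the roughness-penalized least squares criterion."""
    n = len(y)
    # Difference matrix D: (Df)_i = n * (f_{i+1} - f_i) approximates f'(x_i).
    D = n * (np.eye(n - 1, n, k=1) - np.eye(n - 1, n))
    # Setting the gradient to zero gives (I + lam^2 D^T D) f = y.
    return np.linalg.solve(np.eye(n) + lam**2 * (D.T @ D), y)

rng = np.random.default_rng(0)
n = 200
x = np.arange(1, n + 1) / n
f0 = np.sin(2 * np.pi * x)                 # hypothetical true function
y = f0 + 0.1 * rng.standard_normal(n)      # noisy observations
fhat = smoothing_estimator(y, lam=0.05)
print(np.mean((fhat - f0) ** 2))           # empirical squared error
```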
5 [Figure: true $f$, and the same signal with noise added, noise level = 0.01.]
6 [Figure: denoised signals. Left: lambda = 0.2. Right: lambda = 0.1, fit = 3.4322e-04.]
7 [Figure: denoised signals. Left: lambda = 0.1. Right: lambda = 0.05, error = 7.8683e-05.]
8 Intermezzo: Continuous version:
$$\hat f = \arg\min \Big\{ \int_0^1 |y(x) - f(x)|^2\, dx + \lambda^2 \int_0^1 |f'(x)|^2\, dx \Big\}.$$
Lemma 1.1. Solution:
$$\hat f(x) = \frac{C}{\lambda} \cosh\Big(\frac{x}{\lambda}\Big) + \frac{1}{\lambda} \int_0^x y(u) \sinh\Big(\frac{u - x}{\lambda}\Big)\, du,$$
where
$$C = \Big\{ Y(1) - \frac{1}{\lambda} \int_0^1 Y(u) \sinh\Big(\frac{1 - u}{\lambda}\Big)\, du \Big\} \Big/ \sinh\Big(\frac{1}{\lambda}\Big),$$
with
$$Y(x) = \int_0^x y(u)\, du.$$
9 Choice of the regularization parameter $\lambda$?
10 Mimic the oracle: choose $\lambda$ as if the unknown $f_0$ were known. Our choice will be
$$\hat f_n = \arg\min_f \min_\lambda \Big\{ \frac{1}{n} \sum_{i=1}^n |Y_i - f(x_i)|^2 + \lambda^2 \int_0^1 |f'(x)|^2\, dx + \frac{c}{n\lambda} \Big\}.$$
See later.
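One way to make this concrete (an illustration, not from the slides; the constant $c$ and the grid of candidate values are assumptions): for each $\lambda$ on a grid, solve the inner penalized least squares problem, add the complexity term $c/(n\lambda)$, and keep the minimizer. A sketch reusing `smoothing_estimator` from the previous example:

```python
import numpy as np

def oracle_lambda(y, lams, c=1.0):
    """Pick lambda minimizing penalized fit + c/(n*lambda), as on slide 10."""
    n = len(y)
    D = n * (np.eye(n - 1, n, k=1) - np.eye(n - 1, n))   # discrete derivative
    best = None
    for lam in lams:
        f = smoothing_estimator(y, lam)                  # from the previous sketch
        crit = (np.mean((y - f) ** 2)
                + lam**2 * np.sum((D @ f) ** 2) / n      # approximates integral |f'|^2
                + c / (n * lam))
        if best is None or crit < best[0]:
            best = (crit, lam, f)
    return best[1], best[2]

# lam_hat, f_hat = oracle_lambda(y, np.logspace(-3, 0, 30))
```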
11 2. The sequence space model.
$$Y_j = \theta_j + \epsilon_j, \quad j = 1, \dots, n,$$
with $\epsilon_1, \dots, \epsilon_n$ independent $\mathcal{N}(0, \sigma^2/n)$. Define
$$\|\vartheta\|_n^2 = \sum_{j=1}^n \vartheta_j^2, \quad \vartheta \in \mathbb{R}^n.$$
13 Let $J \subset \{1, \dots, n\}$. Define
$$\hat\theta_j(J) = \begin{cases} Y_j & \text{if } j \in J \\ 0 & \text{if } j \notin J. \end{cases}$$
Then
$$E\|\hat\theta(J) - \theta\|_n^2 = \sum_{j \notin J} \theta_j^2 + |J| \frac{\sigma^2}{n} = \text{bias}^2 + \text{variance} = \text{approximation error} + \text{estimation error}.$$
The oracle $\theta^{\text{oracle}}$ uses the optimal tradeoff between bias$^2$ and variance:
$$\|\theta^{\text{oracle}} - \theta\|_n^2 = \min_{J \subset \{1, \dots, n\}} \Big\{ \sum_{j \notin J} \theta_j^2 + |J| \frac{\sigma^2}{n} \Big\}.$$
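The minimum over all $2^n$ subsets is easy to compute: coordinate $j$ belongs to the optimal $J$ exactly when keeping it (cost $\sigma^2/n$) is cheaper than dropping it (cost $\theta_j^2$). A small sketch (illustration only, with an arbitrary $\theta$):

```python
import numpy as np

def oracle_risk(theta, sigma, n):
    """Oracle risk min_J { sum_{j not in J} theta_j^2 + |J| sigma^2 / n }.
    Coordinate j is kept iff theta_j^2 > sigma^2 / n."""
    keep = theta**2 > sigma**2 / n
    return np.sum(theta[~keep] ** 2) + keep.sum() * sigma**2 / n

theta = np.array([2.0, 0.5, 0.1, 0.01])
print(oracle_risk(theta, sigma=1.0, n=100))   # bias^2 + variance at the optimal J
```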
14 3. Hard and soft thresholding.
Let $\lambda = \sqrt{2\sigma^2 \log n / n}$ be the threshold (= regularization parameter).
Hard thresholding:
$$\hat\theta_j(\text{hard}) = \begin{cases} Y_j & \text{if } |Y_j| > \lambda \\ 0 & \text{if } |Y_j| \le \lambda, \end{cases} \quad j = 1, \dots, n.$$
Thus $\hat\theta(\text{hard})$ minimizes
$$\sum_{j=1}^n (Y_j - \vartheta_j)^2 + \lambda^2 \#\{j : \vartheta_j \ne 0\} = \sum_{j=1}^n (Y_j - \vartheta_j)^2 + \text{pen}(\vartheta),$$
where
$$\text{pen}(\vartheta) = \lambda^2 \#\{j : \vartheta_j \ne 0\} = 2 \log n \, |J(\vartheta)| \frac{\sigma^2}{n}.$$
Here $J(\vartheta) = \{j : \vartheta_j \ne 0\}$. So the penalty is, up to log-terms, equal to the variance.
15 Soft thresholding:
$$\hat\theta_j(\text{soft}) = \begin{cases} Y_j - \lambda & \text{if } Y_j > \lambda \\ Y_j + \lambda & \text{if } Y_j < -\lambda \\ 0 & \text{if } |Y_j| \le \lambda, \end{cases} \quad j = 1, \dots, n.$$
Thus $\hat\theta(\text{soft})$ minimizes
$$\sum_{j=1}^n (Y_j - \vartheta_j)^2 + 2\lambda \sum_{j=1}^n |\vartheta_j| = \sum_{j=1}^n (Y_j - \vartheta_j)^2 + \text{pen}(\vartheta),$$
where
$$\text{pen}(\vartheta) = 2\lambda \sum_{j=1}^n |\vartheta_j| = 2\sqrt{\frac{2\sigma^2 \log n}{n}} \sum_{j=1}^n |\vartheta_j|.$$
This penalty is generally much larger than the variance!
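Both characterizations can be checked numerically, since the criteria separate across coordinates. A sketch (mine, not from the slides): hard thresholding minimizes the $\ell_0$-penalized criterion, soft thresholding the $\ell_1$-penalized one.

```python
import numpy as np

def hard_threshold(y, lam):
    return np.where(np.abs(y) > lam, y, 0.0)

def soft_threshold(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# Coordinatewise check: each rule minimizes its penalized criterion over a fine grid.
rng = np.random.default_rng(1)
y, lam = rng.standard_normal(5), 0.8
grid = np.linspace(-5, 5, 20001)          # includes 0 exactly
for j in range(5):
    l0 = grid[np.argmin((y[j] - grid) ** 2 + lam**2 * (grid != 0))]
    l1 = grid[np.argmin((y[j] - grid) ** 2 + 2 * lam * np.abs(grid))]
    assert abs(l0 - hard_threshold(y[j:j+1], lam)[0]) < 1e-3
    assert abs(l1 - soft_threshold(y[j:j+1], lam)[0]) < 1e-3
```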
16 The estimators $\hat\theta(\text{hard})$ and $\hat\theta(\text{soft})$ have similar oracle properties.
Lemma. Let $\hat\theta \in \{\hat\theta(\text{hard}), \hat\theta(\text{soft})\}$. We have
$$E\|\hat\theta - \theta\|_n^2 \le c \min_\vartheta \{\text{blue}(\vartheta) + \text{red}(\vartheta)\},$$
where
$$\text{blue}(\vartheta) = \|\vartheta - \theta\|_n^2 = \text{approximation error} = \text{bias}^2,$$
and
$$\text{red}(\vartheta) = \log n \, |J(\vartheta)| \frac{\sigma^2}{n} = \text{estimation error} \ (\approx \text{variance}).$$
17 4. Discretization.
$Y_i = f_0(x_i) + \epsilon_i$, $i = 1, \dots, n$, with $\epsilon_1, \dots, \epsilon_n$ independent $\mathcal{N}(0,1)$, and $x_i \in \mathcal{X}$, $Y_i \in \mathbb{R}$, $i = 1, \dots, n$.
Let $F$ be a finite collection of functions, and
$$\hat f = \arg\min_{f \in F} \frac{1}{n} \sum_{i=1}^n |Y_i - f(x_i)|^2.$$
Define
$$\|f^* - f_0\|_n = \min_{f \in F} \|f - f_0\|_n.$$
Lemma 4.1. We have
$$E\|\hat f - f_0\|_n^2 \le 2\|f^* - f_0\|_n^2 + c\, \frac{\log |F|}{n} + \frac{c}{n} = \text{approximation error} + \text{estimation error}.$$
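A direct toy illustration (mine; the finite family of sinusoids and all constants are assumptions): least squares over a finite collection $F$ is just an argmin over candidates.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = np.arange(1, n + 1) / n
f0 = np.sin(2 * np.pi * x)
y = f0 + rng.standard_normal(n)              # sigma = 1 as on the slide

# Finite collection F: sinusoids at a few amplitudes and frequencies.
F = [lambda x, a=a, k=k: a * np.sin(2 * np.pi * k * x)
     for a in (0.5, 1.0, 2.0) for k in (1, 2, 3)]
fhat = min(F, key=lambda f: np.mean((y - f(x)) ** 2))
print(np.mean((fhat(x) - f0) ** 2))          # ||fhat - f0||_n^2
```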
18 Let $\{F_m\}_{m \in M}$ be a collection of increasing nested finite models, and let $F = \cup_{m \in M} F_m$. Define
$$\text{pen}(f) = \min_{m:\, f \in F_m} c\, \frac{\log |F_m|}{n}.$$
Let
$$\hat f = \arg\min_{f \in F} \Big\{ \frac{1}{n} \sum_{i=1}^n |Y_i - f(x_i)|^2 + \text{pen}(f) \Big\},$$
and
$$f^* = \arg\min_{f \in F} \big\{ \|f - f_0\|_n^2 + \text{pen}(f) \big\} = \arg\min_{f \in F} \{\text{blue}(f) + \text{red}(f)\}.$$
Lemma 4.2. We have
$$E[\|\hat f - f_0\|_n^2 + \text{pen}(\hat f)] \le 2[\text{blue}(f^*) + \text{red}(f^*)] + \frac{c}{n}.$$
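Continuing the toy example above (again an illustration with arbitrary constants): with nested families $F_1 \subset F_2 \subset \dots$, the penalized estimator charges each candidate the complexity of the smallest model containing it.

```python
import numpy as np

def select(y, x, models, c=2.0):
    """Penalized least squares over nested models (slide 18).
    `models` is a list of lists of functions, models[0] ⊂ models[1] ⊂ ...
    Iterating in increasing order, the first model containing f gives its
    minimal penalty; re-visits in larger models carry a larger penalty and
    never win, so the loop computes pen(f) = min_{m: f in F_m} c log|F_m|/n."""
    n = len(y)
    best = None
    for Fm in models:
        pen = c * np.log(len(Fm)) / n
        for f in Fm:
            crit = np.mean((y - f(x)) ** 2) + pen
            if best is None or crit < best[0]:
                best = (crit, f)
    return best[1]

# e.g. with the sinusoid family F from the previous sketch:
# fhat = select(y, x, [F[:3], F[:6], F])
```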
19 5. General penalties.
$Y_i = f_0(x_i) + \epsilon_i$, $i = 1, \dots, n$, with $\epsilon_1, \dots, \epsilon_n$ independent $\mathcal{N}(0,1)$, and $x_i \in \mathcal{X}$, $Y_i \in \mathbb{R}$, $i = 1, \dots, n$. Let
$$\hat f = \arg\min_{f \in F} \Big\{ \frac{1}{n} \sum_{i=1}^n |Y_i - f(x_i)|^2 + \text{pen}(f) \Big\},$$
and let the oracle be
$$f^* = \arg\min_{f \in F} \big\{ \|f - f_0\|_n^2 + \text{pen}(f) \big\}.$$
Definition 5.1. The $\delta$-entropy $H(\delta, F, Q_n)$ is the logarithm of the minimum number of balls with radius $\delta$ (in the $L_2(Q_n)$ metric) necessary to cover $F$.
20 Lemma 5.2. For $n\delta_n^2 \ge c$ satisfying
$$\int_0^{\delta_n} H^{1/2}\big(u, \{f : \|f - f^*\|_n^2 + \text{pen}(f) \le \delta_n^2\}, Q_n\big)\, du \le \sqrt{n}\, \delta_n^2,$$
we have
$$E[\|\hat f - f_0\|_n^2 + \text{pen}(\hat f)] \le 2[\|f^* - f_0\|_n^2 + \text{pen}(f^*) + \delta_n^2] + \frac{c}{n}.$$
21 Example: Sobolev penalties
$$I_s^2(f) = \int_0^1 |f^{(s)}(x)|^2\, dx.$$
a) Penalty on $I_s$ with $s$ fixed: $\text{pen}(f) = \lambda^2 I_s^2(f)$. We find
$$E[\|\hat f - f_0\|_n^2 + \lambda^2 I_s^2(\hat f)] \le 2[\|f^* - f_0\|_n^2 + \lambda^2 I_s^2(f^*)] + \frac{c}{n\lambda^{1/s}} + \frac{c}{n}.$$
b) Penalty on $I_s$ with $s$ fixed, and on $\lambda$:
$$\text{pen}(f) = \inf_{0 < \lambda < \infty} \Big\{ \lambda^2 I_s^2(f) + \frac{c}{n\lambda^{1/s}} \Big\} = \text{red}(f).$$
Then
$$E\|\hat f - f_0\|_n^2 \le 2\big[\|f^* - f_0\|_n^2 + c\, n^{-\frac{2s}{2s+1}} I_s^{\frac{2}{2s+1}}(f^*)\big] + \frac{c \log n}{n} = 2[\text{blue}(f^*) + \text{red}(f^*)] + \frac{c \log n}{n}.$$
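The trade-off in b) can be made concrete (a sketch of mine; constants are arbitrary): minimizing $\lambda^2 I^2 + c/(n\lambda^{1/s})$ over $\lambda$ yields, up to constants, $n^{-2s/(2s+1)} I^{2/(2s+1)}$, the familiar smoothing-spline rate.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def red(I, n, s, c=1.0):
    """inf over lambda of lambda^2 I^2 + c / (n * lambda^(1/s)),
    minimized over log(lambda) for numerical stability."""
    obj = lambda loglam: np.exp(2 * loglam) * I**2 + c / (n * np.exp(loglam / s))
    return minimize_scalar(obj, bounds=(-20, 20), method="bounded").fun

n, s, I = 10_000, 2, 3.0
print(red(I, n, s))
print(n ** (-2 * s / (2 * s + 1)) * I ** (2 / (2 * s + 1)))  # same order of magnitude
```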
22 c) Penalty on $I_s$ and on $s$:
$$\text{pen}(f) = \min_{1 \le s \le s_{\max}} \Big\{ \lambda^2 I_s^2(f) + \frac{c_0 s_{\max}^3}{n\lambda^{1/s}} \Big\}.$$
d) Penalty on $I_s$, on $s$, and on $\lambda$:
$$\text{pen}(f) = \inf_{0 < \lambda < \infty} \min_{1 \le s \le s_{\max}} \Big\{ \lambda^2 I_s^2(f) + \frac{c_0 s_{\max}^3}{n\lambda^{1/s}} \Big\}.$$
23 6. Robust regression.
Let $Y_i$ depend on some covariable $x_i$, $i = 1, \dots, n$. Assume $Y_1, \dots, Y_n$ are independent. Let $\gamma : \mathbb{R} \to \mathbb{R}$ be a convex loss function satisfying the Lipschitz condition
$$|\gamma(\tilde y) - \gamma(y)| \le |\tilde y - y|, \quad \tilde y, y \in \mathbb{R}.$$
Consider
$$\hat f_n = \arg\min_f \Big\{ \frac{1}{n} \sum_{i=1}^n \gamma(Y_i - f(x_i)) + \text{pen}(f) \Big\}.$$
Examples:
- Least absolute deviations: $\gamma(y) = |y|$.
- Quantile loss: $\gamma(y) = \tau |y|\, l\{y < 0\} + (1 - \tau)|y|\, l\{y > 0\}$, $y \in \mathbb{R}$, where $0 < \tau < 1$ is fixed.
- The Huber loss function.
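For concreteness, a sketch of the three losses just listed (not part of the slides; the tuning values are arbitrary):

```python
import numpy as np

def lad(y):                        # least absolute deviations, 1-Lipschitz
    return np.abs(y)

def quantile(y, tau=0.3):          # check loss; Lipschitz with constant max(tau, 1-tau) <= 1
    return tau * np.abs(y) * (y < 0) + (1 - tau) * np.abs(y) * (y > 0)

def huber(y, k=1.345):             # quadratic near 0, linear in the tails;
    a = np.abs(y)                  # Lipschitz constant k, so rescale by 1/k
    return np.where(a <= k, 0.5 * y**2, k * (a - 0.5 * k))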
24 The true regression function: $f_0 = \arg\min \Gamma(f)$, where
$$\Gamma(f) = \frac{1}{n} \sum_{i=1}^n E\gamma(Y_i - f(x_i)).$$
25 6.1. Standard identifiability condition.
Suppose that for some (unknown) $\sigma > 0$,
$$\Gamma(f) - \Gamma(f_0) \ge \|f - f_0\|^2 / \sigma.$$
Then one can prove results similar to those for the penalized least squares estimator.
6.2. More general margin condition.
Suppose that for some unknown $\kappa \ge 1$ and $\sigma > 0$,
$$\Gamma(f) - \Gamma(f_0) \ge \|f - f_0\|^{2\kappa} / \sigma.$$
The estimation error then depends on the unknown $\kappa$!
26 6.4. Oracle.
Let
$$f = \sum_{j=1}^n \vartheta_j \psi_j, \qquad J_f = \{j : \vartheta_j \ne 0\},$$
and
$$f^* = \arg\min_f \{\text{blue}(f) + \text{red}(f)\},$$
where
$$\text{blue}(f) = \Gamma(f) - \Gamma(f_0), \qquad \text{red}(f) = \Big[ \frac{\log n}{n} |J_f| \Big]^{\frac{\kappa}{2\kappa - 1}}.$$
27 6.5. $\ell_1$ penalty.
$$\hat f_n = \arg\min_f \Big\{ \frac{1}{n} \sum_{i=1}^n \gamma(Y_i - f(x_i)) + \text{pen}(f) \Big\},$$
with
$$\text{pen}(f) = c \sqrt{\frac{\log n}{n}} \sum_{j=1}^n |\vartheta_j|.$$
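A minimal numerical sketch of this estimator (my illustration; the basis, the constant $c$, and the subgradient solver are assumptions, not from the talk), for least absolute deviations with an $\ell_1$ penalty on the coefficients:

```python
import numpy as np

def l1_lad(Psi, y, lam, steps=5000, lr=0.01):
    """Minimize (1/n) sum |y_i - (Psi theta)_i| + lam * ||theta||_1
    by subgradient descent (simple, not the fastest choice)."""
    n, p = Psi.shape
    theta = np.zeros(p)
    for t in range(1, steps + 1):
        r = y - Psi @ theta
        grad = -(Psi.T @ np.sign(r)) / n + lam * np.sign(theta)
        theta -= lr / np.sqrt(t) * grad        # diminishing step size
    return theta

rng = np.random.default_rng(3)
n, p = 200, 20
Psi = rng.standard_normal((n, p))              # basis functions psi_j evaluated at x_i
theta0 = np.zeros(p); theta0[:3] = [2.0, -1.0, 0.5]
y = Psi @ theta0 + rng.standard_normal(n)
lam = 2.0 * np.sqrt(np.log(n) / n)             # c * sqrt(log n / n), c = 2 arbitrary
print(np.round(l1_lad(Psi, y, lam)[:5], 2))    # first coefficients roughly recovered
```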
28 Oracle inequality. One has
$$E(\Gamma(\hat f_n) - \Gamma(f_0)) \le C \{\text{blue}(f^*) + \text{red}(f^*)\} = C \Big\{ \Gamma(f^*) - \Gamma(f_0) + \Big[ \frac{\log n}{n} |J_{f^*}| \Big]^{\frac{\kappa}{2\kappa - 1}} \Big\}.$$
In other words, $\hat f_n$ adapts to the smoothness of $f_0$ as well as to $\kappa$.
29 Typical example. Suppose $\|f^* - f_0\| \asymp |J_{f^*}|^{-s}$. Then
$$\Gamma(\hat f_n) - \Gamma(f_0) \sim \Big( \frac{\log n}{n} \Big)^{\frac{2\kappa s}{4\kappa s - 2s + 1}}.$$
30 [Figure: simulation results comparing the FY3PQ and IRLS algorithms (SMSE, MAD, and computation times; FY3PQ: 3.56 s).]
31 Conclusion: the advantage of the $\ell_1$ penalty over (nonrandom) penalties based on bias-variance considerations is that it is not only adaptive to the smoothness, but also adaptive to the margin.
32 The classification problem.
1. Introduction.
$Y \in \{0, 1\}$ binary response variable, $X \in \mathcal{X}$ covariable. Aim: predict $Y$ given $X$.
Examples:
- Recognition of speech or handwriting
- Classifying an object in an image
- Classification of gene expression levels
- Etc.
Training set: $n$ i.i.d. copies $(X_i, Y_i)_{i=1}^n$ of $(X, Y)$.
33 [Figure: learning data with training errors, and testing data with testing errors.]
34 2. Bayes classifier.
When using $G$ as classifier (with $l_G$ the indicator function of $G$), the PREDICTION ERROR is
$$R(G) = P(Y \ne l_G(X)).$$
BAYES RULE: $G_0 = \arg\min_G R(G)$, where the minimum is over all sets $G \subset \mathcal{X}$. Thus
$$G_0 = \{x : \eta(x) \ge 1/2\},$$
where $\eta(x)$ is the regression of $Y$ on $X = x$:
$$\eta(x) = P(Y = 1 \mid X = x).$$
So the Bayes rule predicts the most likely label.
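A toy illustration (mine, with an arbitrary $\eta$): the Bayes rule thresholds the regression function at $1/2$, and its error matches $E[\min(\eta, 1-\eta)]$.

```python
import numpy as np

rng = np.random.default_rng(4)
eta = lambda x: np.sin(3 * x) ** 2                   # hypothetical P(Y=1 | X=x)

X = rng.uniform(0, 1, 100_000)
Y = (rng.uniform(size=X.size) < eta(X)).astype(int)  # labels drawn from eta

bayes = (eta(X) >= 0.5).astype(int)                  # G_0 = {x : eta(x) >= 1/2}
print("Bayes error:", np.mean(Y != bayes))
print("theory:     ", np.mean(np.minimum(eta(X), 1 - eta(X))))
```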
35 [Figure: $\eta$ and the level $1/2$; the Bayes rule $G_0$ is the indicated union of intervals where $\eta \ge 1/2$.]
36 3. Empirical risk minimization.
Let $\mathcal{G}$ be a collection of sets. The EMPIRICAL RISK MINIMIZER is
$$\hat G_n = \arg\min_{G \in \mathcal{G}} R_n(G),$$
where
$$R_n(G) = \frac{1}{n} \sum_{i=1}^n |Y_i - l_G(X_i)|$$
is the EMPIRICAL RISK.
38 5. Mammen & Tsybakov margin condition.
Let
$$G \triangle G_0 = (G \setminus G_0) \cup (G_0 \setminus G)$$
be the symmetric difference between the two sets.
[Figure: the symmetric difference $G \triangle G_0$ of $G$ and $G_0$.]
39 The EXCESS RISK (= approximation error) at $G$ is $R(G) - R(G_0)$.
Margin condition (Mammen and Tsybakov (1999), Tsybakov (2003)). For some (unknown) constants $\sigma > 0$ and $\kappa \ge 1$,
$$R(G) - R(G_0) \ge \sigma Q^\kappa(G \triangle G_0)$$
for all sets $G \subset \mathcal{X}$, where $Q$ denotes the distribution of $X$.
40 6. Boundary fragments.
We assume $\mathcal{X} = [0,1]^{d+1}$ and write $X = (S, T)$, with $S \in [0,1]^d$, $T \in [0,1]$. For a function $f : [0,1]^d \to [0,1]$, we define the boundary fragment
$$G_f = \{x = (s, t) : t \le f(s)\}.$$
41 [Figure: the symmetric difference $G_f \triangle G_{\tilde f}$ for the boundary fragments $G_f$ and $G_{\tilde f}$ formed by the subgraphs of $f$ and $\tilde f$.]
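A small sketch (my addition, under the assumption that $X$ is uniform on $[0,1]^{d+1}$): the measure of the symmetric difference of two boundary fragments is the $L_1$ distance between the edge functions, which a Monte Carlo check illustrates.

```python
import numpy as np

rng = np.random.default_rng(5)
f  = lambda s: 0.5 + 0.3 * np.sin(2 * np.pi * s)        # hypothetical edge functions
ft = lambda s: 0.5 + 0.1 * np.cos(2 * np.pi * s)

S, T = rng.uniform(size=(2, 1_000_000))                 # X = (S, T) uniform, d = 1
in_f, in_ft = T <= f(S), T <= ft(S)
print("Q(G_f symdiff G_ft):", np.mean(in_f ^ in_ft))    # symmetric difference mass
print("integral |f - ft|:  ", np.mean(np.abs(f(S) - ft(S))))  # agrees in the limit
```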
42 7. Oracle.
Define the oracle
$$G^* = \arg\min_{G \in \mathcal{G}} [\text{red}(G) + \text{blue}(G)],$$
where
$$\text{red}(G) = \text{estimation error}, \qquad \text{blue}(G) = R(G) - R(G_0) = \text{approximation error}.$$
Thus $G^*$ gives the best trade-off between estimation error and approximation error.
43 8. Oracle inequality.
Let $f_\vartheta = \sum_j \vartheta_j \psi_j$, and
$$\mathcal{G} = \{\text{boundary fragments } G_{f_\vartheta} : \vartheta \in \mathbb{R}^n\}.$$
We take a square root penalty (or $\ell_{1/2}$ penalty)
$$\text{pen}(G_{f_\vartheta}) = \lambda_n \sum_l 2^{dl/2} \Big( \sum_j |\vartheta_{j,l}| \Big)^{1/2}.$$
Theorem (Tsybakov and van de Geer (2003)). Let
$$\hat G_n = \arg\min_{G \in \mathcal{G}} \{R_n(G) + \text{pen}(G)\},$$
where
$$\lambda_n = c \sqrt{\frac{\log^4 n}{n}},$$
and where $c$ is a (large enough) universal constant. Then
$$P\big( R(\hat G_n) - R(G_0) \ge c_{\sigma,\kappa} [\text{red}(G^*) + \text{blue}(G^*)] \big) \le \exp[-c_q \log^4 n].$$
44 Conclusion. The estimator with the square root penalty adapts, up to log-factors, to the smoothness as well as to the margin... but we cannot compute it!
45 9. Surrogate loss functions.
We now code the label as $Y \in \{\pm 1\}$. Let
$$f_\vartheta(x) = \sum_{j=1}^N \vartheta_j \psi_j(x).$$
Introduce a margin $0 \le \lambda < 1$. We call $(X, Y)$ well classified by $f_\vartheta$ if
$$Y f_\vartheta(X) \ge \lambda \|\vartheta\|_1.$$
Here $\|\vartheta\|_1 = \sum_{j=1}^N |\vartheta_j|$.
46 Define
$$R_n(f_\vartheta, \lambda) = \frac{\#\{i : Y_i f_\vartheta(X_i) < \lambda \|\vartheta\|_1\}}{n}.$$
Then by Chebyshev's inequality, for any non-negative, increasing function $\phi$,
$$R_n(f_\vartheta, \lambda) \le \frac{L_n(f_\vartheta)}{\phi(1 - \lambda \|\vartheta\|_1)},$$
where
$$L_n(f) = \frac{1}{n} \sum_{i=1}^n \phi(1 - Y_i f(X_i)).$$
This leads to minimizing the penalized loss function $L_n(f_\vartheta)\, \text{pen}(\vartheta)$, where
$$\text{pen}(\vartheta) = 1 / \phi(1 - \lambda \|\vartheta\|_1).$$
47 Example: support vector machine loss.
Adaptive estimation: minimize
$$\frac{1}{n} \sum_{i=1}^n (1 - Y_i f_\vartheta(X_i))_+ + \lambda \|\vartheta\|_1,$$
with $\lambda = c\sqrt{\log n / n}$.
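A minimal sketch of this $\ell_1$-penalized hinge loss minimization (my illustration; the data, the constant $c$, and the subgradient solver are assumptions, not from the talk):

```python
import numpy as np

def l1_svm(Psi, y, lam, steps=5000, lr=0.1):
    """Minimize (1/n) sum (1 - y_i f_theta(x_i))_+ + lam * ||theta||_1
    by subgradient descent, with f_theta = Psi @ theta."""
    n, p = Psi.shape
    theta = np.zeros(p)
    for t in range(1, steps + 1):
        margin = y * (Psi @ theta)
        active = (margin < 1).astype(float)          # where the hinge is sloped
        grad = -(Psi.T @ (active * y)) / n + lam * np.sign(theta)
        theta -= lr / np.sqrt(t) * grad
    return theta

rng = np.random.default_rng(6)
n, p = 300, 10
Psi = rng.standard_normal((n, p))                    # basis psi_j evaluated at X_i
y = np.sign(Psi[:, 0] - 0.5 * Psi[:, 1] + 0.3 * rng.standard_normal(n))
theta_hat = l1_svm(Psi, y, lam=2.0 * np.sqrt(np.log(n) / n))
print("training error:", np.mean(np.sign(Psi @ theta_hat) != y))
```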
48 Conclusion. Adaptation using empirical risk minimization leads to $\ell_{1/2}$ penalties, and is computationally very hard. In combination with surrogate loss functions, $\ell_1$ penalties are natural, and also computationally simple. But the resulting oracle inequalities generally do not yield fast optimal rates for the excess risk.
49 Some references
D.L. Donoho and I.M. Johnstone (1996). New minimax theorems, thresholding and adaptation. Bernoulli.
L. Birgé and P. Massart (1997). From model selection to adaptive estimation. In: Festschrift for Lucien Le Cam: Research Papers in Probab. and Statist. (Eds. D. Pollard, E. Torgersen and G. Yang), 55-87, Springer, New York.
G. Lugosi and A. Nobel (1999). Adaptive model selection using empirical complexities. Ann. Statist.
E. Mammen and A.B. Tsybakov (1999). Smooth discriminant analysis. Ann. Statist.
S. van de Geer (2001). Least squares estimation with complexity penalties. Mathematical Methods of Statistics.
S. van de Geer (2002). M-estimation using penalties or sieves. Journal of Statistical Planning and Inference.
50 J.-M. Loubes and S. van de Geer (2002). Adaptive estimation in regression, using soft thresholding type penalties. Statistica Neerlandica.
A.B. Tsybakov (2003). Optimal aggregation of classifiers in statistical learning. To appear in Ann. Statist.
A.B. Tsybakov and S.A. van de Geer (2003). Square root penalty: adaptation to the margin in classification and in edge estimation. Prépublication PMA-820, Lab. de Probab. et Modèles Aléatoires, Université Paris VII (submitted).