Model Selection and Geometry
Transcription
1 Model Selection and Geometry
Pascal Massart, Université Paris-Sud, Orsay
Leipzig, February
2 Purpose of the talk
- Concentration of measure plays a fundamental role in the theory of model selection.
- Model selection may be useful for geometric inference.
- Focus on the Gaussian framework: convenient to provide the main ideas and intuitions.
3 Concentration of measure
4 Isoperimetry
The Gaussian isoperimetric Theorem: let $P$ be the standard Gaussian measure on $\mathbb{R}^N$. Given $\theta \in (0,1)$, among the sets $A$ with $P(A) = \theta$, the quantity $P\{d(\cdot, A) < \varepsilon\}$ is minimal for a half-space $\mathbb{R}^{N-1} \times (-\infty, \varepsilon_\theta]$ with $1 - Q(\varepsilon_\theta) = \theta$.
($Q$: standard Gaussian tail function, $d$: Euclidean distance)
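5 Gaussian concentration inequality
If $\theta \ge 1/2$, then $\varepsilon_\theta \ge 0$ and therefore
$P\{d(\cdot, A) \ge \varepsilon\} \le Q(\varepsilon + \varepsilon_\theta) \le Q(\varepsilon) \le e^{-\varepsilon^2/2}$.
Given some $L$-Lipschitz function $f$, applying this inequality with $\varepsilon = \sqrt{2x}$ to the set $A = \{f \le Mf\}$, where $Mf$ is a median of $f$, leads to
$P\{f - Mf \ge L\sqrt{2x}\} \le e^{-x}$.
The same bound holds (this is not obvious) with $Ef$ instead of $Mf$.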
6 Chi-square distribution
If we apply this inequality to the Euclidean norm itself, i.e. $f : z \mapsto \|z\|$, then $L = 1$, and since $Ef \le \sqrt{Ef^2} = \sqrt{N}$ we get
$P\{f \ge \sqrt{N} + \sqrt{2x}\} \le e^{-x}$.
Of course $f^2$ follows a chi-square distribution $\chi^2(N)$.
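The chi-square bound can be checked numerically. The following is a minimal Monte Carlo sketch, assuming the inequality reads $P\{\|Z\| \ge \sqrt{N} + \sqrt{2x}\} \le e^{-x}$ for a standard Gaussian vector $Z$ in $\mathbb{R}^N$; the function name and the default sample sizes are illustrative choices, not part of the talk.

```python
import math
import random


def norm_tail_frequency(n_dim, x, n_trials=20000, seed=0):
    """Empirical frequency of the event ||Z|| >= sqrt(N) + sqrt(2x)."""
    rng = random.Random(seed)
    threshold = math.sqrt(n_dim) + math.sqrt(2.0 * x)
    hits = 0
    for _ in range(n_trials):
        # Squared Euclidean norm of a standard Gaussian vector in R^N.
        sq_norm = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n_dim))
        if math.sqrt(sq_norm) >= threshold:
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    for x in (0.5, 1.0, 2.0):
        freq = norm_tail_frequency(n_dim=10, x=x)
        print(f"x={x}: empirical frequency {freq:.4f}, bound {math.exp(-x):.4f}")
```

The empirical frequencies are typically far below $e^{-x}$: the concentration bound is dimension-free, not tight for a given $N$.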
7 The Gaussian concentration inequality also applies to suprema of Gaussian processes.
Let $\{G(t), t \in T\}$ be some centered Gaussian process. Assume that $T$ is equipped with the covariance pseudo-metric $d$ defined by $d^2(u,t) = \mathrm{Var}(G(t) - G(u))$. If $(T,d)$ is separable and if $\{G(t), t \in T\}$ is a.s. continuous on $(T,d)$, setting $Z = \sup_{t \in T} G(t)$, one has
$P\{Z \ge E[Z] + \sigma\sqrt{2x}\} \le e^{-x}$, where $\sigma^2 = \sup_{t \in T} \mathrm{Var}(G(t))$.
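As a toy illustration of the bound for suprema, one can take $T$ finite and $G(t)$ independent standard normal variables, so that $\sigma = 1$. The sketch below makes exactly these assumptions and replaces $E[Z]$ by a Monte Carlo average; the function name is hypothetical.

```python
import math
import random


def sup_tail_frequency(n_points, x, n_trials=5000, seed=0):
    """Empirical frequency of Z >= E[Z] + sqrt(2x), Z = max of n_points N(0,1)."""
    rng = random.Random(seed)
    maxima = [
        max(rng.gauss(0.0, 1.0) for _ in range(n_points))
        for _ in range(n_trials)
    ]
    mean_z = sum(maxima) / n_trials           # Monte Carlo estimate of E[Z]
    threshold = mean_z + math.sqrt(2.0 * x)   # sigma = 1 in this toy case
    exceed = sum(1 for z in maxima if z >= threshold)
    return exceed / n_trials


if __name__ == "__main__":
    for x in (0.5, 1.0, 2.0):
        freq = sup_tail_frequency(n_points=20, x=x)
        print(f"x={x}: empirical frequency {freq:.4f}, bound {math.exp(-x):.4f}")
```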
8 Outside the Gaussian world
- Concentration of measure for product measures has been studied for years by M. Talagrand in a series of remarkable papers.
- Talagrand's concentration inequality for empirical processes plays a fundamental role in statistics in general and for model selection in particular.
- Gaussian life is easier!
9 Model Selection
10 What are we talking about?
Statistical inference
One observes $X$ (random vector, random process) with unknown distribution $P$.
Purpose: take a decision about some quantity $\theta$ related to $P$. Predict with a level of confidence.
What to do: design a genuine estimation procedure $\hat\theta(X)$ of $\theta$ and get some idea of how far it is from the target.
The role of probability theory
Problem: the exact distribution of $\hat\theta$ is generally unknown.
Solution: make some approximation or evaluation based on Probability Theory.
11 Asymptotic Theory
Typically, $X = (X_1, \ldots, X_n)$, $n \to \infty$, and $X_1, \ldots, X_n$ are independent. Asymptotic theory in statistics uses limit theorems (Central Limit Theorems, Large Deviation Principles) as approximation tools.
Example: when $P$ belongs to some given model (which does not depend on $n$), behavior of maximum likelihood estimators.
Model selection
Designing a genuine estimator $\hat\theta$ requires some prior knowledge on $P$. Choosing a proper model for $P$ is a major problem for the statistician.
Aim of model selection: construct data-driven criteria to select a model among a given list.
12 Asymptotic approach to model selection
- The idea of using some penalized empirical criterion goes back to the seminal works of Akaike ('70).
- Akaike's celebrated criterion (AIC) suggests penalizing the log-likelihood by the number of parameters of the parametric model.
- This criterion is based on some asymptotic approximation that essentially relies on Wilks' Theorem.
13 Wilks' Theorem: under some proper regularity conditions, the log-likelihood $L_n(\theta)$ based on $n$ i.i.d. observations with distribution belonging to a parametric model with $D$ parameters obeys the following weak convergence result:
$2\left(L_n(\hat\theta) - L_n(\theta_0)\right) \to \chi^2(D)$,
where $\hat\theta$ denotes the MLE and $\theta_0$ is the true value of the parameter.
14 Non-asymptotic Theory
In many situations, it is useful to make the size of the models tend to infinity or to make the list of models depend on $n$. In these situations, classical asymptotic analysis breaks down and one needs to introduce an alternative approach that we call non-asymptotic.
We still like large values of $n$! But the size of the models, as well as the size of the list of models, should be allowed to be large too.
This approach is based on Concentration Inequalities.
15 This approach has been fruitfully used in several works. Among others: Baraud ('00) and ('03) for least squares in the regression framework, Castellan ('03) for log-spline density estimation, Patricia Reynaud ('03) for Poisson processes, etc.
16 Gaussian Model Selection
17 The Gaussian framework
We consider the generalized linear Gaussian model. This means that, given some separable Hilbert space $H$, one observes
$Y_\varepsilon(t) = \langle s, t \rangle + \varepsilon W(t), \quad t \in H$,
where $W$ is some centered Gaussian isonormal process, i.e. $W$ maps $H$ isometrically onto some Gaussian subspace of $L^2(\Omega)$.
We have in mind that writing the level of noise as $\varepsilon = 1/\sqrt{n}$ allows an easy comparison with other frameworks involving $n$ observations.
18 This framework is convenient to cover both the infinite dimensional white noise model and the finite dimensional linear model, for which $H = \mathbb{R}^n$ and $W(t) = \langle \xi, t \rangle$, where $\xi$ is a standard Gaussian random vector.
Consider some collection $(S_m)_{m \in \mathcal{M}}$ of (typically closed and convex) subsets of $H$. We also consider the least squares criterion
$L_\varepsilon(t) = \|t\|^2 - 2Y_\varepsilon(t)$
and, for each $m$, some $\hat s_m$ minimizing $L_\varepsilon$ over $S_m$. The quality of model $S_m$ is measured by the quadratic risk $E\|s - \hat s_m\|^2$.
19 Does the risk reflect the model choice paradigm?
Let us compute it in the simplest situation where $S_m$ is a linear model with dimension $D_m$. Then
$E\|s - \hat s_m\|^2 = \varepsilon^2 D_m + d^2(s, S_m)$.
The best (oracle) model according to this criterion should make a trade-off between its size and its quality of approximation to the truth. The aim is now to mimic it.
Our purpose is to analyze the penalized LSE $\hat s_{\hat m}$, where
$\hat m = \arg\min_{m \in \mathcal{M}} \{ L_\varepsilon(\hat s_m) + \mathrm{pen}(m) \}$.
20 Selecting linear models
Let us begin with model selection among linear models and review some results obtained in joint works with Lucien Birgé (JEMS '01 and PTRF '07).
Each model $S_m$ is assumed to be linear with dimension $D_m$ and represented by the least squares estimator $\hat s_m$ on $S_m$. Then
$E\|s - \hat s_m\|^2 = \varepsilon^2 D_m + d^2(s, S_m)$ (variance + bias).
Oracle: ideal model achieving $\inf_{m \in \mathcal{M}} E\|s - \hat s_m\|^2$.
Aim: mimic the oracle by estimating the risk.
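To make the variance/bias trade-off concrete, here is a small sketch in the sequence model with nested models, assuming the risk reads $\varepsilon^2 D + d^2(s, S_D)$ with $S_D$ spanned by the first $D$ vectors of an orthonormal basis, so that $d^2(s, S_D) = \sum_{j > D} \beta_j^2$. The coefficient list `beta` and the function name are hypothetical; the oracle needs the unknown `beta`, which is why it can only be mimicked.

```python
def oracle_dimension(beta, eps):
    """Return (D*, risk*) minimizing eps^2 * D + sum_{j > D} beta_j^2 over D."""
    tail = sum(b * b for b in beta)      # bias term for D = 0
    best_dim, best_risk = 0, tail
    for dim, b in enumerate(beta, start=1):
        tail -= b * b                    # bias shrinks when coordinate `dim` is added
        risk = eps ** 2 * dim + tail     # variance + bias
        if risk < best_risk:
            best_dim, best_risk = dim, risk
    return best_dim, best_risk


if __name__ == "__main__":
    beta = [3.0, 2.0, 0.1, 0.05]
    print(oracle_dimension(beta, eps=1.0))
```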
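21 Mallows' heuristics
Let $s_m$ denote the orthogonal projection of $s$ on $S_m$. An «ideal» model should minimize the quadratic risk
$\|s - s_m\|^2 + \varepsilon^2 D_m = \|s\|^2 - \|s_m\|^2 + \varepsilon^2 D_m$ (Pythagoras)
or equivalently $-\|s_m\|^2 + \varepsilon^2 D_m$.
Substituting to $\|s_m\|^2$ its natural unbiased estimator $\|\hat s_m\|^2 - \varepsilon^2 D_m$ leads to Mallows' criterion
$-\|\hat s_m\|^2 + 2\varepsilon^2 D_m = L_\varepsilon(\hat s_m) + 2\varepsilon^2 D_m$.
Issue: look at the way $\|\hat s_m\|^2 - \|s_m\|^2$ concentrates around $\varepsilon^2 D_m$, uniformly w.r.t. $m \in \mathcal{M}$.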
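Mallows' criterion can be sketched for nested models in the sequence model. The code below assumes noisy coefficients $\hat\beta_j$ in an orthonormal basis, so that $\|\hat s_D\|^2 = \sum_{j \le D} \hat\beta_j^2$ for the model spanned by the first $D$ basis vectors, and minimizes $-\|\hat s_D\|^2 + c\,\varepsilon^2 D$ with $c = 2$ by default. The function name and the `factor` parameter are illustrative.

```python
def mallows_select(beta_hat, eps, factor=2.0):
    """Dimension D minimizing -sum_{j<=D} beta_hat_j^2 + factor * eps^2 * D."""
    best_dim, best_crit = 0, 0.0         # D = 0: empty model, criterion 0
    cum_sq = 0.0
    for dim, b in enumerate(beta_hat, start=1):
        cum_sq += b * b                  # squared norm of the projection estimator
        crit = -cum_sq + factor * eps ** 2 * dim
        if crit < best_crit:
            best_dim, best_crit = dim, crit
    return best_dim


if __name__ == "__main__":
    print(mallows_select([3.0, 2.0, 0.1, 0.05], eps=1.0))
```

Raising `factor` above 2 penalizes dimension more heavily; taking it close to 1 illustrates the under-penalization regime discussed later in the talk.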
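22 Theorem (Birgé, Massart '01)
Let $(x_m)_{m \in \mathcal{M}}$ be a family of nonnegative weights such that $\sum_{m \in \mathcal{M}} e^{-x_m} = \Sigma < \infty$. Let $K > 1$. Assume that
$\mathrm{pen}(m) \ge K \varepsilon^2 \left(\sqrt{D_m} + \sqrt{2x_m}\right)^2$
and take $\hat m = \arg\min_{m \in \mathcal{M}} \{ -\|\hat s_m\|^2 + \mathrm{pen}(m) \}$. Then
$E\|\hat s_{\hat m} - s\|^2 \le C(K) \left[ \inf_{m \in \mathcal{M}} \{ d^2(s, S_m) + \mathrm{pen}(m) \} + \Sigma \varepsilon^2 \right]$.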
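23 Choice of the weights
A typical choice is to let the weights depend on $m$ only through the dimension: $x_m = x(D_m)$. Then if one defines
$x(D) = \alpha D + \log \#\{m \in \mathcal{M} : D_m = D\}$,
the weights actually appear as the price to pay for redundancy (i.e. many models with the same dimension). The penalty becomes $\mathrm{pen}(m) = \mathrm{pen}(D_m)$ and the upper bound (up to constant) writes
$\inf_D \left\{ \inf_{m : D_m = D} d^2(s, S_m) + \mathrm{pen}(D) \right\}$.
The approximation property of $\bigcup_{D_m = D} S_m$ is essential.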
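The dimension-dependent weights translate into a penalty that is easy to evaluate. The sketch below assumes the penalty is taken equal to its lower bound, $K\varepsilon^2(\sqrt{D} + \sqrt{2x(D)})^2$, with $x(D) = \alpha D + \log \#\{m : D_m = D\}$; the function names and argument order are illustrative, and the values of $K$ and $\alpha$ are left to the caller.

```python
import math


def dimension_weight(dim, n_models_with_dim, alpha):
    """x(D) = alpha * D + log(number of models of dimension D)."""
    return alpha * dim + math.log(n_models_with_dim)


def penalty(dim, n_models_with_dim, eps, K, alpha):
    """K * eps^2 * (sqrt(D) + sqrt(2 x(D)))^2, the smallest admissible penalty."""
    x = dimension_weight(dim, n_models_with_dim, alpha)
    return K * eps ** 2 * (math.sqrt(dim) + math.sqrt(2.0 * x)) ** 2


if __name__ == "__main__":
    # One model per dimension versus many models per dimension.
    print(penalty(dim=4, n_models_with_dim=1, eps=1.0, K=1.1, alpha=0.5))
    print(penalty(dim=4, n_models_with_dim=1000, eps=1.0, K=1.1, alpha=0.5))
```

The second call shows the price of redundancy: the penalty grows with the logarithm of the number of models sharing the same dimension.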
24 Comparison with the oracle
Since $E\|\hat s_m - s\|^2 = d^2(s, S_m) + \varepsilon^2 D_m$, if the weights are such that $x_m \le L D_m$ for some constant $L$, then the upper bound provided by the Theorem becomes $\inf_{m \in \mathcal{M}} E\|\hat s_m - s\|^2$ (up to some multiplicative constant).
Conclusion: the selected estimator performs (almost) as well as an oracle.
25 Variable Selection
Let $\{\varphi_j, j \le N\}$ be a family of elements in $H$. Let $\mathcal{M}$ be some collection of subsets of $\{1, \ldots, N\}$ and define $S_m = \langle \varphi_j, j \in m \rangle$, $m \in \mathcal{M}$.
a. Ordered variable selection
$\mathcal{M}$ is the collection of subsets of the form $\{1, \ldots, D\}$ with $D \le N$ (possibly $N = \infty$). Then one can take $\mathrm{pen}(m) = K' \varepsilon^2 |m|$.
Comparable to the oracle (linearly independent case):
$E\|s - \hat s_{\hat m}\|^2 \le C(K') \inf_{m \in \mathcal{M}} E\|s - \hat s_m\|^2$ (sharp).
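26 b. Complete variable selection
Let $\mathcal{M}$ be the collection of all subsets of $\{1, \ldots, N\}$. One can take $x_m = |m| \ln N$ and
$\Sigma = \sum_{m \in \mathcal{M}} e^{-x_m} = \sum_{D \le N} C_N^D e^{-D \ln N} \le e$,
which leads to $\mathrm{pen}(m) = K \varepsilon^2 |m| \left(1 + \sqrt{2 \ln N}\right)^2$ with $K > 1$.
In the orthonormal case $\hat s_{\hat m}$ can be made explicit. It is simply hard thresholding (Donoho, Johnstone, Kerkyacharian, Picard):
$\hat s_{\hat m} = \sum_{j=1}^N \hat\beta_j 1_{|\hat\beta_j| \ge T} \varphi_j$, with $\hat\beta_j = Y_\varepsilon(\varphi_j)$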
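The bound on the sum of weights follows from the binomial formula, $\sum_{D \le N} C_N^D N^{-D} = (1 + 1/N)^N \le e$. A short numerical check, assuming $x_m = |m| \ln N$ and one term per subset of $\{1, \ldots, N\}$ (the function name is illustrative):

```python
import math


def weight_sum(n):
    """Sum over all subsets m of {1..n} of exp(-|m| * ln n), grouped by size."""
    return sum(math.comb(n, d) * math.exp(-d * math.log(n)) for d in range(n + 1))


if __name__ == "__main__":
    for n in (1, 2, 10, 100):
        print(n, weight_sum(n), (1.0 + 1.0 / n) ** n, math.e)
```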
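27 with $T = \sqrt{K}\,\varepsilon \left(1 + \sqrt{2 \ln N}\right)$. Then
$E\|s - \hat s_{\hat m}\|^2 \le C'(K) \inf_{D \le N} \left\{ \inf_{|m| = D} \|s - s_m\|^2 + \varepsilon^2 D \log N \right\}$ (sharp).
Role of the basis.
Refinement: one can choose $x_m$ of order $D \ln(N/D)$ whenever $|m| = D$, which leads to an optimal risk bound (in the minimax sense).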
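In the orthonormal case the selected estimator reduces to coordinatewise hard thresholding. The sketch below assumes the threshold is $T = \sqrt{K}\,\varepsilon(1 + \sqrt{2\ln N})$, which is what the penalty $K\varepsilon^2|m|(1 + \sqrt{2\ln N})^2$ gives when each coordinate is kept exactly when $\hat\beta_j^2$ exceeds its per-variable cost; the function name and the default $K = 1.1$ are illustrative.

```python
import math


def hard_threshold(beta_hat, eps, K=1.1):
    """Keep beta_hat_j when |beta_hat_j| >= T, set it to 0 otherwise."""
    n = len(beta_hat)
    # Assumed reading of the threshold: T = sqrt(K) * eps * (1 + sqrt(2 ln N)).
    threshold = math.sqrt(K) * eps * (1.0 + math.sqrt(2.0 * math.log(n)))
    kept = [b if abs(b) >= threshold else 0.0 for b in beta_hat]
    return kept, threshold


if __name__ == "__main__":
    coeffs, t = hard_threshold([5.0, -0.3, 0.2, -4.0], eps=1.0)
    print(coeffs, t)
```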
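28 Link with nonparametric adaptation
Let $\{\varphi_j, j \ge 1\}$ be the Fourier basis with its natural ordering; then $\|s - s_D\|^2 \le \kappa \|s\|_\alpha^2 D^{-2\alpha}$.
Up to constant, $E_s\|s - \hat s_{\hat m}\|^2$ is bounded by $R^{2/(2\alpha+1)} \varepsilon^{4\alpha/(2\alpha+1)}$ uniformly w.r.t. $s$ satisfying $\|s\|_\alpha \le R$.
Optimal in the minimax sense since
$\inf_{\hat s} \sup_{\|s\|_\alpha \le R} E_s\|s - \hat s\|^2 \ge \kappa' R^{2/(2\alpha+1)} \varepsilon^{4\alpha/(2\alpha+1)}$.
Conclusion: minimax, up to constant, over all the Sobolev balls $\{\|s\|_\alpha \le R\}$ simultaneously.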
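The rate can be recovered by balancing the two terms of the oracle bound. The following worked computation is a sketch under the assumption that the approximation error over the Sobolev ball is at most $\kappa R^2 D^{-2\alpha}$; constants are ignored throughout.

```latex
\begin{align*}
\inf_{D}\left\{ \kappa R^{2} D^{-2\alpha} + \varepsilon^{2} D \right\}
  &\quad\text{is attained (up to constants) when}\quad
  R^{2} D^{-2\alpha} \asymp \varepsilon^{2} D, \\
D^{\ast} &\asymp \left(\frac{R}{\varepsilon}\right)^{2/(2\alpha+1)}, \\
\varepsilon^{2} D^{\ast}
  &\asymp \varepsilon^{2 - 2/(2\alpha+1)}\, R^{2/(2\alpha+1)}
  = R^{2/(2\alpha+1)}\, \varepsilon^{4\alpha/(2\alpha+1)}.
\end{align*}
```

The balancing dimension $D^{\ast}$ depends on $\alpha$ and $R$, which the statistician does not know; the point of the oracle inequality is that the penalized estimator reaches this rate without that knowledge.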
29 Conclusions
Mallows' criterion can underpenalize. Condition $K > 1$ is sharp.
What penalty should be recommended? One can try to optimize the oracle inequality. The result is that twice the minimal value is a good choice (Birgé, Massart (2007)).
Practical use
One does not know the level of noise, but one can retain from the theory that
«optimal» penalty = 2 × «minimal» penalty,
which leads to a data-driven penalty.
30 Back to Geometry
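31 Theorem (M. 2007)
Assume that for every $m \in \mathcal{M}$ there exists some function $\phi_m$ such that $x \mapsto \phi_m(x)/x$ is nonincreasing on $(0, \infty)$ and
$2E\left[ \sup_{t \in S_m} \frac{W(t) - W(u)}{\|t - u\|^2 + x^2} \right] \le x^{-2} \phi_m(x)$
for any positive $x$ and any point $u$ in $S_m$. Let us define $D_m$ such that $\phi_m\left(\varepsilon \sqrt{D_m}\right) = \varepsilon D_m$ and consider some family of weights $(x_m)_{m \in \mathcal{M}}$ such that $\sum_{m \in \mathcal{M}} e^{-x_m} = \Sigma < \infty$. Let $K$ be some constant with $K > 1$ and take
$\mathrm{pen}(m) \ge K \varepsilon^2 \left(\sqrt{D_m} + \sqrt{2x_m}\right)^2$.
Then the penalized LSE $\hat s_{\hat m}$ satisfies a risk bound of the same form as in the linear model selection Theorem.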
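32 Comments
If $S_m$ is finite dimensional with dimension $D'_m$, we can take $D_m = D'_m$, which shows that the above Theorem strictly implies the linear model selection Theorem (Birgé, Massart ('01)).
Indeed, since $2x\|t - u\| \le x^2 + \|t - u\|^2$,
$\sup_{t \in S_m} \frac{W(t) - W(u)}{\|t - u\|^2 + x^2} \le (2x)^{-1} \sup_{t \in S_m} \frac{W(t)}{\|t\|}$,
and if $(\psi_j)_{1 \le j \le D'_m}$ denotes some orthonormal basis of $S_m$,
$\sup_{t \in S_m} \left[\frac{W(t)}{\|t\|}\right]^2 = \sum_{j=1}^{D'_m} W^2(\psi_j)$,
therefore
$2E\left[ \sup_{t \in S_m} \frac{W(t) - W(u)}{\|t - u\|^2 + x^2} \right] \le x^{-1} \sqrt{D'_m}$.
So we may take $\phi_m(x) = x\sqrt{D'_m}$, which gives $D_m = D'_m$.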
33 More generally, the function $\phi_m$ should be understood as a modulus of continuity of $W$ over model $S_m$. One can indeed use a peeling device to ensure that if, for every $u \in S_m$ and every $\sigma > 0$,
$E\left[ \sup_{t \in S_m, \|t - u\| \le \sigma} (W(t) - W(u)) \right] \le \varphi_m(\sigma)$,
where $\varphi_m$ is a nondecreasing and concave function with $\varphi_m(0) = 0$, then one can take $\phi_m(x) = 5\varphi_m(x)$.
One possibility is to use covering numbers:
$\varphi_m(\sigma) = \kappa \int_0^\sigma \sqrt{\log N(\delta, S_m)}\, d\delta$.
34 Simplicial approximation
Here is an illustration of what we are promoting with Frédéric Chazal and our students. C. Caillerie and B. Michel (2011) have applied the Gaussian model selection Theorem above to simplicial approximation.
Consider that one observes $x_i = s_i + \sigma \xi_i$ with $1 \le i \le n$. The unobserved deterministic points $s_i$ belong to $\mathbb{R}^p$ and the variables $\xi_i$ are i.i.d. standard normal random vectors. One has in mind that the points $s_i$ are sampled on some geometrical object that one wants to learn.
35 They consider some collection of $k$-homogeneous simplicial complexes $\{C_m, m \in \mathcal{M}\}$ in $\mathbb{R}^p$. They use covering numbers to define a proper penalty term and derive a risk bound on the selected estimator, $E\|s - \hat s_{\hat m}\|^2$, where $s$ simply denotes the vector of $\mathbb{R}^{np}$ «built» from the $s_i$'s.
As expected (at least if the collection is not too rich), the penalty recommended by the Theorem is found to be proportional to $\log |C_m|_k$
36 where $|C_m|_k$ measures the size of the simplicial complex. More precisely,
$|C_m|_k = \left( \sum_{\omega \in C_m} \Delta_\omega^k \right)^{1/k}$,
where the sum is extended to the set of simplices with maximal dimension $k$ and $\Delta_\omega$ denotes the diameter of the smallest Euclidean ball containing the simplex $\omega$.
The bad news is that the penalty involves nasty constants (and the level of noise, which we do not know in practice). They use a heuristic to solve this problem.
37 They use the penalized criterion
$\|X - \hat s_m\|^2 + \hat\beta \log |C_m|_k$,
where $\hat\beta$ is estimated from the data by using the following phase transition, which is empirically observed: when $\beta$ is too small ($\beta < \hat\beta$), the criterion $\|X - \hat s_m\|^2 + \beta \log |C_m|_k$ chooses a simplicial complex with a size which is close to the maximal one in the collection.
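One generic way to locate such a phase transition is to scan a grid of values of $\beta$, record the size of the model selected for each value, and take the point where the selected size drops the most. The sketch below is only that generic recipe, not the procedure of Caillerie and Michel: the inputs `risks` (empirical residuals per model) and `complexities` (log-sizes per model), the grid, and both function names are hypothetical, and how the final constant is derived from the located jump is left open here.

```python
def select_model(risks, complexities, beta):
    """Index of the model minimizing risk + beta * complexity."""
    return min(range(len(risks)), key=lambda m: risks[m] + beta * complexities[m])


def estimate_beta_by_jump(risks, complexities, beta_grid):
    """Grid value of beta right after the largest drop in selected complexity."""
    grid = sorted(beta_grid)
    selected = [complexities[select_model(risks, complexities, b)] for b in grid]
    best_jump, beta_hat = 0.0, grid[-1]
    for i in range(1, len(grid)):
        jump = selected[i - 1] - selected[i]   # how much the selected size shrinks
        if jump > best_jump:
            best_jump, beta_hat = jump, grid[i]
    return beta_hat


if __name__ == "__main__":
    complexities = [1, 2, 3, 4, 5, 6, 7, 8]
    risks = [30.0, 10.0, 9.7, 9.4, 9.1, 8.8, 8.5, 8.2]
    grid = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55]
    print(estimate_beta_by_jump(risks, complexities, grid))
```

In the toy data the residual decreases by 0.3 per unit of complexity beyond the second model, so the selected model stays at the largest size for $\beta < 0.3$ and collapses to the second model afterwards.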
38 Many remaining issues
Among which: escaping from the square loss, building an adaptive estimation theory.
Thanks for your attention!
More informationThe deterministic Lasso
The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality
More informationLecture 6: Discrete Choice: Qualitative Response
Lecture 6: Instructor: Department of Economics Stanford University 2011 Types of Discrete Choice Models Univariate Models Binary: Linear; Probit; Logit; Arctan, etc. Multinomial: Logit; Nested Logit; GEV;
More informationMinimax Estimation of a nonlinear functional on a structured high-dimensional model
Minimax Estimation of a nonlinear functional on a structured high-dimensional model Eric Tchetgen Tchetgen Professor of Biostatistics and Epidemiologic Methods, Harvard U. (Minimax ) 1 / 38 Outline Heuristics
More informationAkaike Information Criterion
Akaike Information Criterion Shuhua Hu Center for Research in Scientific Computation North Carolina State University Raleigh, NC February 7, 2012-1- background Background Model statistical model: Y j =
More informationReconstruction from Anisotropic Random Measurements
Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013
More informationEstimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk
Ann Inst Stat Math (0) 64:359 37 DOI 0.007/s0463-00-036-3 Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk Paul Vos Qiang Wu Received: 3 June 009 / Revised:
More informationTesting Restrictions and Comparing Models
Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by
More informationMFM Practitioner Module: Risk & Asset Allocation. John Dodson. February 18, 2015
MFM Practitioner Module: Risk & Asset Allocation February 18, 2015 No introduction to portfolio optimization would be complete without acknowledging the significant contribution of the Markowitz mean-variance
More informationOn the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models
On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models Thomas Kneib Department of Mathematics Carl von Ossietzky University Oldenburg Sonja Greven Department of
More informationComposite Hypotheses and Generalized Likelihood Ratio Tests
Composite Hypotheses and Generalized Likelihood Ratio Tests Rebecca Willett, 06 In many real world problems, it is difficult to precisely specify probability distributions. Our models for data may involve
More informationECE521 lecture 4: 19 January Optimization, MLE, regularization
ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity
More informationsparse and low-rank tensor recovery Cubic-Sketching
Sparse and Low-Ran Tensor Recovery via Cubic-Setching Guang Cheng Department of Statistics Purdue University www.science.purdue.edu/bigdata CCAM@Purdue Math Oct. 27, 2017 Joint wor with Botao Hao and Anru
More informationMachine Learning And Applications: Supervised Learning-SVM
Machine Learning And Applications: Supervised Learning-SVM Raphaël Bournhonesque École Normale Supérieure de Lyon, Lyon, France raphael.bournhonesque@ens-lyon.fr 1 Supervised vs unsupervised learning Machine
More informationSome theoretical properties of GANs. Gérard Biau Toulouse, September 2018
Some theoretical properties of GANs Gérard Biau Toulouse, September 2018 Coauthors Benoît Cadre (ENS Rennes) Maxime Sangnier (Sorbonne University) Ugo Tanielian (Sorbonne University & Criteo) 1 video Source:
More informationRegression: Lecture 2
Regression: Lecture 2 Niels Richard Hansen April 26, 2012 Contents 1 Linear regression and least squares estimation 1 1.1 Distributional results................................ 3 2 Non-linear effects and
More informationModel Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model
Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population
More information3.0.1 Multivariate version and tensor product of experiments
ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 3: Minimax risk of GLM and four extensions Lecturer: Yihong Wu Scribe: Ashok Vardhan, Jan 28, 2016 [Ed. Mar 24]
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationBIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation
BIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation Yujin Chung November 29th, 2016 Fall 2016 Yujin Chung Lec13: MLE Fall 2016 1/24 Previous Parametric tests Mean comparisons (normality assumption)
More informationECO Class 6 Nonparametric Econometrics
ECO 523 - Class 6 Nonparametric Econometrics Carolina Caetano Contents 1 Nonparametric instrumental variable regression 1 2 Nonparametric Estimation of Average Treatment Effects 3 2.1 Asymptotic results................................
More informationAn overview of a nonparametric estimation method for Lévy processes
An overview of a nonparametric estimation method for Lévy processes Enrique Figueroa-López Department of Mathematics Purdue University 150 N. University Street, West Lafayette, IN 47906 jfiguero@math.purdue.edu
More informationParameter Estimation
Parameter Estimation Consider a sample of observations on a random variable Y. his generates random variables: (y 1, y 2,, y ). A random sample is a sample (y 1, y 2,, y ) where the random variables y
More informationAdaptive Wavelet Estimation: A Block Thresholding and Oracle Inequality Approach
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1999 Adaptive Wavelet Estimation: A Block Thresholding and Oracle Inequality Approach T. Tony Cai University of Pennsylvania
More information