Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo
1 Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo. Han Liu (1,2), John Lafferty (2,3), Larry Wasserman (1,2). 1 Statistics Department, 2 Machine Learning Department, 3 Computer Science Department, Carnegie Mellon University. July 1st, 2006.
2 Motivation and research background. Rodeo is a general strategy for nonparametric inference. It has been successfully applied to sparse nonparametric regression problems in high dimensions by Lafferty & Wasserman. Our goal: adapt the rodeo framework to nonparametric density estimation, so that we have a unified framework for both density estimation and regression that is computationally efficient and theoretically sound.
3 Outline. 1. Background: nonparametric density estimation in high dimensions; sparsity assumptions for density estimation. 2. The main idea: the local rodeo algorithm for the kernel density estimator. 3. The asymptotic running time and minimax risk. 4. The global density rodeo and the reverse density rodeo; using other distributions as irrelevant dimensions. 5. Empirical results on both synthetic and real-world datasets.
4 Problem statement. Problem: estimate the joint density of a continuous d-dimensional random vector $X = (X_1, X_2, \dots, X_d) \sim F$, $d \geq 3$, where $F$ is an unknown distribution with density function $f(x)$. This problem is intrinsically hard, since high dimensionality causes both computational and theoretical problems.
5 Previous work. From a frequentist perspective: kernel density estimation and the local likelihood method; the projection pursuit method; log-spline models and the penalized likelihood method. From a Bayesian perspective: mixtures of normals with Dirichlet process priors. Difficulties of current approaches: some methods only work well for low-dimensional problems; some heuristics lack theoretical guarantees; more importantly, they suffer from the curse of dimensionality.
6 The curse of dimensionality. Characterizing the curse: in a Sobolev space of order $k$, minimax theory shows that the best convergence rate for the mean squared error is $R_{\mathrm{opt}} = O(n^{-2k/(2k+d)})$, which is impractically slow when the dimension $d$ is large. Combating the curse with sparsity assumptions: if the high-dimensional data has a low-dimensional structure or satisfies a sparsity condition, we expect that methods can be developed to combat the curse of dimensionality. This motivates the development of the rodeo framework.
7 Rodeo for nonparametric regression (I). Rodeo (regularization of derivative expectation operator) is a general strategy for nonparametric inference, which has been used for nonparametric regression. For a regression problem $Y_i = m(X_i) + \epsilon_i$, $i = 1, \dots, n$, where $X_i = (X_{i1}, \dots, X_{id}) \in \mathbb{R}^d$ is a d-dimensional covariate: if $m$ lies in a d-dimensional Sobolev space of order 2, the best convergence rate for the risk is $R = O(n^{-4/(4+d)})$, which shows the curse of dimensionality in the regression setting.
8 Rodeo for nonparametric regression (II). Assume the true function depends on only $r$ covariates ($r \ll d$): $m(x) = m(x_1, \dots, x_r)$. For any $\epsilon > 0$, the rodeo can simultaneously perform bandwidth selection and (implicitly) variable selection to achieve the better minimax convergence rate $R_{\mathrm{rodeo}} = O(n^{-4/(4+r)+\epsilon})$, as if the $r$ relevant variables had been explicitly isolated in advance. Rodeo beats the curse of dimensionality in this sense. We aim to apply the same idea to density estimation problems.
9 Sparse density estimation. For many applications, the true density function can be characterized by some low-dimensional structure. Sparsity assumption for density estimation: let $h_{jj}(x)$ be the second partial derivative of $h$ with respect to the $j$-th variable; there exists some $r \ll d$ such that $f(x) = g(x_1, \dots, x_r)\,h(x)$, where $h_{jj}(x) = 0$ for $j = 1, \dots, d$. Here $x_R = \{x_1, \dots, x_r\}$ are the relevant dimensions. This condition imposes that $h(\cdot)$ belongs to a family of very smooth functions (e.g. the uniform distribution). $h(\cdot)$ can be generalized to any parametric distribution!
10 Generalized sparse density estimation. We can generalize $h(\cdot)$ to other distributions (e.g. Gaussian). General sparsity assumption: assume $h(\cdot)$ is any distribution (e.g. Gaussian) that we are not interested in, and $f(x) = g(x_1, \dots, x_r)\,h(x)$ with $r \ll d$. Thus the density $f(\cdot)$ factors into two parts: the relevant components $g(\cdot)$ and the irrelevant components $h(\cdot)$, where $x_R = \{x_1, \dots, x_r\}$ are the relevant dimensions. Under this framework, we can hope to achieve the better minimax rate $R_{\mathrm{rodeo}} = O(n^{-4/(4+r)})$.
11 Related work. Recent work that addresses this problem: minimum volume sets (Scott & Nowak, JMLR 2006); non-Gaussian component analysis (Blanchard et al., JMLR 2006); the log-ANOVA model (Lin & Jeon, Statistica Sinica 2006). Advantages of our approach: rodeo can utilize well-established nonparametric estimators; it gives a unified framework for different kinds of problems; it is easy to implement and amenable to theoretical analysis.
12 Density rodeo: the main idea. The key intuition: if a dimension is irrelevant, then changing the smoothing parameter of that dimension should result in only a small change in the whole estimator. Basically, rodeo is just a regularization strategy: use a kernel density estimator and start with large bandwidths; calculate the gradient of the estimator with respect to the bandwidths; sequentially decrease the bandwidths in a greedy way, and freeze this decay process with a thresholding strategy to achieve a sparse solution.
13 Density rodeo: the main idea. Fix a point $x$ and let $\hat{f}_H(x)$ denote an estimator of $f(x)$ based on the smoothing parameter matrix $H = \mathrm{diag}(h_1, \dots, h_d)$. Let $M(h) = \mathbb{E}(\hat{f}_h(x))$ denote the mean of $\hat{f}_h(x)$; therefore $f(x) = M(0) = \mathbb{E}(\hat{f}_0(x))$. Let $\mathcal{P} = \{h(t) : 0 \le t \le 1\}$ be a smooth path through the set of smoothing parameters with $h(0) = 0$ and $h(1) = 1$. Then
$$f(x) = M(1) - (M(1) - M(0)) = M(1) - \int_0^1 \frac{dM(h(s))}{ds}\,ds = M(1) - \int_0^1 \langle D(h(s)), \dot{h}(s) \rangle\,ds,$$
where $D(h) = \nabla M(h) = (\partial M/\partial h_1, \dots, \partial M/\partial h_d)^T$ is the gradient of $M(h)$ and $\dot{h}(s) = dh(s)/ds$ is the derivative of $h(s)$ along the path.
14 Density rodeo: the main idea. A biased, low-variance estimator of $M(1)$ is $\hat{f}_1(x)$. An unbiased estimator of $D(h)$ is
$$Z(h) = \left( \frac{\partial \hat{f}_H(x)}{\partial h_1}, \dots, \frac{\partial \hat{f}_H(x)}{\partial h_d} \right)^T.$$
This naive estimator has poor risk due to the large variance of $Z(h)$ for small bandwidths. However, the sparsity assumption on $f$ suggests that there should be paths for which $D(h)$ is also sparse. Along such a path, $Z(h)$ can be replaced with an estimator $\hat{D}(h)$ that makes use of the sparsity assumption. The estimate of $f(x)$ then becomes
$$\tilde{f}_H(x) = \hat{f}_1(x) - \int_0^1 \langle \hat{D}(s), \dot{h}(s) \rangle\,ds.$$
15 Kernel density estimator. Let $\hat{f}_H(x)$ be the kernel density estimator of $f(x)$ with bandwidth matrix $H$, where $K$ is a standard symmetric kernel, i.e. $\int K(u)\,du = 1$ and $\int u K(u)\,du = 0$, and $K_H(\cdot) = \frac{1}{\det(H)} K(H^{-1}\cdot)$ is the kernel with bandwidth matrix $H = \mathrm{diag}(h_1, \dots, h_d)$. Assuming $K$ is a product kernel,
$$\hat{f}_H(x) = \frac{1}{n \det(H)} \sum_{i=1}^n K(H^{-1}(x - X_i)) = \frac{1}{n} \sum_{i=1}^n \prod_{j=1}^d \frac{1}{h_j} K\!\left( \frac{x_j - X_{ij}}{h_j} \right).$$
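As a concrete sketch of this product-kernel estimator (Gaussian kernel, NumPy; illustrative code, not the authors' implementation):

```python
import numpy as np

def product_kernel_kde(x, X, h):
    """Product-kernel density estimate at point x with per-dimension
    bandwidths h:  f_hat(x) = (1/n) sum_i prod_j (1/h_j) K((x_j - X_ij)/h_j),
    using a standard Gaussian kernel K."""
    X = np.asarray(X, dtype=float)   # (n, d) sample
    x = np.asarray(x, dtype=float)   # (d,) evaluation point
    h = np.asarray(h, dtype=float)   # (d,) bandwidths
    u = (x - X) / h                                  # (n, d) scaled differences
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)     # Gaussian kernel per dim
    return np.mean(np.prod(K / h, axis=1))

# Example: 2-dimensional standard normal data, evaluated at the origin
# (the true density there is 1/(2*pi) ~ 0.159).
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 2))
fhat = product_kernel_kde([0.0, 0.0], X, [0.3, 0.3])
```

The per-dimension bandwidth vector `h` is exactly the quantity the rodeo tunes one coordinate at a time.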
16 The rodeo statistics for the kernel density estimator. The density estimation rodeo is based on the statistic
$$Z_j = \frac{\partial \hat{f}_H(x)}{\partial h_j} = \frac{1}{n} \sum_{i=1}^n \frac{\partial}{\partial h_j} \left[ \prod_{k=1}^d \frac{1}{h_k} K\!\left( \frac{x_k - X_{ik}}{h_k} \right) \right] = \frac{1}{n} \sum_{i=1}^n Z_{ji}.$$
For the conditional variance term,
$$s_j^2 = \mathrm{Var}(Z_j \mid X_1, \dots, X_n) = \mathrm{Var}\!\left( \frac{1}{n} \sum_{i=1}^n Z_{ji} \right) = \frac{1}{n} \mathrm{Var}(Z_{j1}),$$
where the sample variance of the $Z_{ji}$ is used to estimate $\mathrm{Var}(Z_{j1})$.
17 Density rodeo algorithms. Density rodeo, hard-thresholding version:
1. Select a parameter $0 < \beta < 1$ and an initial bandwidth $h_0 = c_0 / \log\log n$ for some constant $c_0$. Let $c_n$ be a sequence with $c_n = O(1)$.
2. Initialize the bandwidths and activate all dimensions: (a) $h_j = h_0$, $j = 1, 2, \dots, d$; (b) $\mathcal{A} = \{1, 2, \dots, d\}$.
3. While $\mathcal{A}$ is nonempty, do for each $j \in \mathcal{A}$: (a) compute the derivative and variance, $Z_j$ and $s_j$; (b) compute the threshold $\lambda_j = s_j \sqrt{2 \log(n c_n)}$; (c) if $|Z_j| > \lambda_j$, set $h_j \leftarrow \beta h_j$; otherwise remove $j$ from $\mathcal{A}$.
4. Output the bandwidths $h$ and the estimator $\tilde{f}(x) = \hat{f}_H(x)$.
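The loop above can be sketched in NumPy. For a Gaussian product kernel the derivative terms have the closed form $Z_{ji} = f_i \,(u_{ij}^2 - 1)/h_j$, where $f_i$ is the $i$-th kernel product and $u_{ij} = (x_j - X_{ij})/h_j$; the constants ($c_0$, the choice $c_n = 1$) are illustrative, not the paper's tuning:

```python
import numpy as np

def local_rodeo(x, X, beta=0.9, c0=1.0, max_steps=500):
    """Hard-thresholding local density rodeo at point x (sketch).

    Gaussian product kernel; shrinks h_j while |Z_j| exceeds the
    threshold lambda_j = s_j * sqrt(2 log n), then freezes dimension j.
    """
    X = np.asarray(X, float)
    x = np.asarray(x, float)
    n, d = X.shape
    h = np.full(d, c0 / np.log(np.log(n)))        # initial bandwidths (A3)
    active = list(range(d))
    for _ in range(max_steps):                    # guard against runaway loops
        if not active:
            break
        u = (x - X) / h                           # (n, d) scaled differences
        f_i = np.prod(np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h), axis=1)
        for j in list(active):
            Zji = f_i * (u[:, j]**2 - 1.0) / h[j] # exact d/dh_j of each term
            Zj = Zji.mean()
            sj = Zji.std(ddof=1) / np.sqrt(n)     # estimated sd of the mean
            lam = sj * np.sqrt(2.0 * np.log(n))   # threshold with c_n = 1
            if abs(Zj) > lam:
                h[j] *= beta                      # shrink this bandwidth
            else:
                active.remove(j)                  # freeze this dimension
    return h

# Example: a sharp relevant dimension and a uniform irrelevant one.
rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(0.5, 0.05, 2000), rng.uniform(0.0, 1.0, 2000)])
h = local_rodeo([0.5, 0.5], X)
# typically the relevant dimension h[0] shrinks far more than h[1]
```

Note that with a large initial bandwidth a bounded irrelevant dimension can still shrink a few steps from boundary effects before freezing.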
18 The purpose of the analysis. The analysis characterizes the asymptotic behavior of: the selected bandwidths; the asymptotic running time (efficiency); the convergence rate of the risk (accuracy). To make the theoretical results more realistic, a key aspect of our analysis is that we allow the dimension $d$ to increase with the sample size $n$. For this, we need assumptions on the unknown density function that take the increasing dimension into account.
19 Assumptions.
(A1) Kernel assumption: $\int u u^T K(u)\,du = v_2 I_d$ with $v_2 < \infty$, and $\int K^2(u)\,du = R(K) < \infty$.
(A2) Dimension assumption: $d \log d = O(\log n)$.
(A3) Initial bandwidth assumption: $h_j^{(0)} = c_0 / \log\log n$ for $j = 1, \dots, d$. Combined with A2, this implies $\lim_{n \to \infty} n \prod_{j=1}^d h_j^{(0)} = \infty$.
(A4) Sparsity assumption: $f(x) = g(x_1, \dots, x_r)\,h(x)$ where $h_{jj}(x) = 0$, with $r = O(1)$.
(A5) Hessian assumption: $\int \mathrm{tr}(H_R(u)^T H_R(u))\,du < \infty$, and $f_{jj}(x) \neq 0$ for $j = 1, 2, \dots, r$, where $H_R(u)$ is the Hessian over the relevant dimensions.
20 Derivatives for relevant and irrelevant dimensions. Key Lemma: under assumptions A1–A5, suppose $x$ is interior to the support of $f$, and $K$ is a product kernel with bandwidth matrix $H^{(s)} = \mathrm{diag}(h_1^{(s)}, \dots, h_d^{(s)})$. Then
$$\mu_j^{(s)} = \frac{\partial}{\partial h_j^{(s)}} \mathbb{E}\big[\hat{f}_{H^{(s)}}(x) - f(x)\big] = o_P(h_j^{(s)}) \quad \text{for all } j \in R^c,$$
while for $j \in R$,
$$\mu_j^{(s)} = \frac{\partial}{\partial h_j^{(s)}} \mathbb{E}\big[\hat{f}_{H^{(s)}}(x) - f(x)\big] = h_j^{(s)} v_2 f_{jj}(x) + o_P(h_j^{(s)}).$$
Thus, for any integer $s > 0$ and $h^{(s)} = h^{(0)} \beta^s$, each $j > r$ satisfies $\mu_j^{(s)} = o_P(h_j^{(s)}) = o_P(h_j^{(0)})$.
21 Main theorem. Suppose $r = O(1)$ and (A1)–(A5) hold. In addition, suppose $A_{\min} = \min_{j \le r} |f_{jj}(x)| = \Omega(1)$ and $A_{\max} = \max_{j \le r} |f_{jj}(x)| = \tilde{O}(1)$. Then, for every $\epsilon > 0$, the number of iterations $T_n$ until the rodeo stops satisfies
$$P\!\left( \frac{1}{4+r} \log_{1/\beta}(n^{1-\epsilon} a_n) \le T_n \le \frac{1}{4+r} \log_{1/\beta}(n^{1+\epsilon} b_n) \right) \to 1,$$
where $a_n = \Omega(1)$ and $b_n = \tilde{O}(1)$. Moreover, the algorithm outputs bandwidths $h$ that satisfy $P(h_j = h_j^{(0)} \text{ for all } j > r) \to 1$, and
$$P\!\left( h_j^{(0)} (n b_n)^{-1/(4+r)-\epsilon} \le h_j \le h_j^{(0)} (n a_n)^{-1/(4+r)+\epsilon} \text{ for all } j \le r \right) \to 1,$$
with $h_j^{(0)}$ defined as in A3.
22 Convergence rate of the risk. Theorem 2: under the conditions of the main theorem, the risk $R_h$ of the density rodeo estimator satisfies $R_h = \tilde{O}_P(n^{-4/(4+r)+\epsilon})$ for every $\epsilon > 0$. Here we write $Y_n = \tilde{O}_P(a_n)$ to mean $Y_n = O_P(b_n a_n)$ where $b_n$ is logarithmic in $n$. As noted earlier, we write $a_n = \Omega(b_n)$ if $\liminf_n |a_n / b_n| > 0$; similarly, $a_n = \tilde{\Omega}(b_n)$ if $a_n = \Omega(b_n c_n)$ where $c_n$ is logarithmic in $n$.
23 Possible extensions. The density rodeo algorithm can be extended in many ways: a soft-thresholding version; a global version; a reverse version; a bootstrapping version; a rodeo algorithm for the local linear density estimator; a greedier version; using other distributions as the irrelevant dimensions.
24 Global density rodeo. The idea is to average the test statistics over multiple evaluation points $x_1, \dots, x_m$ sampled from the empirical distribution. To avoid the cancellation problem, the statistic is squared. Let $Z_j(x_k)$ be the derivative at the $k$-th evaluation point with respect to the bandwidth $h_j$. We define the test statistic
$$T_j = \frac{1}{m} \sum_{k=1}^m Z_j^2(x_k), \qquad j = 1, \dots, d,$$
with
$$s_j^2 = \mathrm{Var}(T_j) = \frac{1}{m}\left( 2\,\mathrm{tr}(C_j^2) + 4\,\hat{\mu}_j^T C_j \hat{\mu}_j \right),$$
where $\hat{\mu}_j = \frac{1}{m} \sum_{k=1}^m Z_j(x_k)$. The threshold is $\lambda_j = s_j^2 + 2 s_j \sqrt{\log(n c_n)}$.
25 The bootstrapping version. Instead of deriving the variance expression explicitly, the bootstrap can be used to evaluate the variance of $Z_j$. Bootstrap method for computing $s_j^2$:
1. For $b = 1, \dots, B$: draw a sample $X_1^*, \dots, X_n^*$ of size $n$ with replacement, and compute the estimate $Z_j^{(b)}$ from it.
2. Compute the bootstrapped variance $s_j^2 = \frac{1}{B} \sum_{b=1}^B \big(Z_j^{(b)} - \bar{Z}_j\big)^2$, where $\bar{Z}_j = \frac{1}{B} \sum_{b=1}^B Z_j^{(b)}$.
3. Output the resulting $s_j^2$.
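A generic sketch of this bootstrap in NumPy; `stat` is a stand-in for the map from a sample to $Z_j$ at the current bandwidths (the sample-mean example is just a sanity check, since its true variance $\sigma^2/n$ is known):

```python
import numpy as np

def bootstrap_variance(stat, X, B=500, rng=None):
    """Estimate Var(stat(X)) by the nonparametric bootstrap:
    resample the data with replacement B times, recompute the
    statistic, and take the variance of the replicates."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X)
    n = len(X)
    reps = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)   # sample of size n, with replacement
        reps[b] = stat(X[idx])
    return reps.var()                      # (1/B) sum_b (Z*_b - Zbar)^2

# Sanity check with the sample mean: true variance is sigma^2 / n = 1/400.
rng = np.random.default_rng(2)
X = rng.standard_normal(400)
v = bootstrap_variance(np.mean, X, B=2000, rng=3)
```

In the rodeo, `stat` would recompute $Z_j$ on each resampled dataset at the bandwidths of the current step.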
26 The reverse and greedier versions. Reverse density rodeo: instead of using a sequence of decreasing bandwidths, we can begin from a very small bandwidth and use a sequence of increasing bandwidths to estimate the optimal value. Greedier density rodeo: instead of decaying all the active bandwidths, only the bandwidth associated with the largest $Z_j / \lambda_j$ is decayed. Hybrid density rodeo: the different variations can be combined arbitrarily.
27 Using other distributions as irrelevant dimensions. We can use a general parametric distribution $h(x)$ for the irrelevant dimensions. The key trick is a new semiparametric density estimator:
$$\hat{f}_H(x) = \hat{h}(x)\, \frac{\sum_{i=1}^n K_H(X_i - x)}{n \int K_H(u - x)\, \hat{h}(u)\, du},$$
where $\hat{h}(x)$ is a parametric density estimate at the point $x$. The motivation for this estimator comes from the local likelihood method (Loader, 1996). In the one-dimensional case, starting from a large bandwidth, if the true density is $h(x)$, the algorithm tends to freeze the bandwidth decay process immediately.
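As a one-dimensional sketch (my construction, not code from the paper): with a Gaussian kernel and a Gaussian parametric fit $\hat{h} = N(\hat{\mu}, \hat{\sigma}^2)$, the denominator $\int K_h(u-x)\,\hat{h}(u)\,du$ is a Gaussian convolution with the closed form $N(x; \hat{\mu}, \hat{\sigma}^2 + h^2)$, so no numerical integration is needed:

```python
import numpy as np

def gauss(x, mu, var):
    """Normal density N(x; mu, var)."""
    return np.exp(-0.5 * (x - mu)**2 / var) / np.sqrt(2 * np.pi * var)

def semiparametric_kde(x, X, h):
    """f_hat(x) = h_par(x) * (1/n) sum_i K_h(X_i - x) / int K_h(u-x) h_par(u) du,
    with h_par a Gaussian fitted by maximum likelihood; for a Gaussian
    kernel the denominator is the N(mu, sigma^2 + h^2) density at x."""
    X = np.asarray(X, float)
    mu, var = X.mean(), X.var()                 # ML fit of the Gaussian part
    h_par_x = gauss(x, mu, var)                 # parametric density at x
    kern = gauss(x, X, h**2).mean()             # (1/n) sum_i K_h(X_i - x)
    denom = gauss(x, mu, var + h**2)            # closed-form integral
    return h_par_x * kern / denom

rng = np.random.default_rng(4)
X = rng.standard_normal(5000)
f0 = semiparametric_kde(0.0, X, h=0.5)   # true density at 0 is ~0.3989
```

When the data really do come from the fitted parametric family, the ratio `kern / denom` stays near 1 for any bandwidth, which is exactly why the rodeo freezes the bandwidth decay immediately in that case.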
28 Experimental design. The density rodeo algorithms were applied to both synthetic and real data, including one-dimensional, two-dimensional, high-dimensional and very high-dimensional examples, measuring the distance between the estimated and true density functions. The Hellinger distance is used:
$$D(\hat{f}, f)^2 = \int \left( \sqrt{\hat{f}(x)} - \sqrt{f(x)} \right)^2 dx = 2 - 2 \int \sqrt{\frac{\hat{f}(x)}{f(x)}}\, f(x)\, dx.$$
Given $m$ evaluation points $X_i$ drawn from $f$, the Hellinger distance can be computed numerically by Monte Carlo integration:
$$D(\hat{f}, f)^2 \approx 2 - \frac{2}{m} \sum_{i=1}^m \sqrt{\frac{\hat{f}_H(X_i)}{f(X_i)}}.$$
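A sketch of this Monte Carlo computation. For two unit-variance Gaussians the integral $\int \sqrt{\hat{f} f}\,dx$ equals $\exp(-(\mu_1 - \mu_2)^2/8)$, giving a closed form to check against (my worked example, not from the slides):

```python
import numpy as np

def gauss_pdf(x, mu):
    """Unit-variance normal density N(x; mu, 1)."""
    return np.exp(-0.5 * (x - mu)**2) / np.sqrt(2 * np.pi)

def hellinger_sq_mc(fhat, f, sampler, m=200000):
    """D^2 = 2 - (2/m) sum_i sqrt(fhat(X_i)/f(X_i)),  X_i ~ f."""
    X = sampler(m)
    return 2.0 - 2.0 * np.mean(np.sqrt(fhat(X) / f(X)))

rng = np.random.default_rng(5)
d2 = hellinger_sq_mc(lambda x: gauss_pdf(x, 1.0),   # "estimate": N(1, 1)
                     lambda x: gauss_pdf(x, 0.0),   # true density: N(0, 1)
                     lambda m: rng.standard_normal(m))
# closed form: 2 - 2*exp(-1/8) ~ 0.2350
```

In the experiments, `fhat` would be the fitted rodeo density and `f` the known simulation density.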
29 One-dimensional example: strongly skewed density. We simulated 200 samples from the skewed distribution; the boxplot of the Hellinger distance is based on 100 simulations. This density is chosen to resemble the lognormal distribution and is distributed as
$$X \sim \sum_{i=0}^{7} \frac{1}{8}\, N\!\left( 3\left\{ \left(\tfrac{2}{3}\right)^i - 1 \right\},\ \left(\tfrac{2}{3}\right)^{2i} \right).$$
The true density is plotted.
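A sampler for this mixture can be sketched as follows (illustrative code; the component means and variances follow the formula above):

```python
import numpy as np

def sample_strongly_skewed(n, rng=None):
    """Draw n samples from sum_{i=0}^{7} (1/8) N(3((2/3)^i - 1), (2/3)^{2i})."""
    rng = np.random.default_rng(rng)
    i = rng.integers(0, 8, size=n)          # uniform component index
    mu = 3.0 * ((2.0 / 3.0)**i - 1.0)       # component means, 0 down to ~-2.82
    sd = (2.0 / 3.0)**i                     # component standard deviations
    return rng.normal(mu, sd)

X = sample_strongly_skewed(50000, rng=6)
# the mixture mean works out to (3/8)(sum_i (2/3)^i - 8) ~ -1.919
```

Repeating such draws with n = 200 and refitting reproduces the simulation setup described above.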
30 Strongly skewed density: result. [Figure: the true density overlaid with fits from unbiased cross-validation, the local kernel density rodeo, and the global kernel density rodeo, together with the estimated bandwidths.]
31 Two-dimensional example: combined Beta and uniform. A combined Beta distribution with a uniform distribution as the irrelevant dimension. We simulate a 2-dimensional dataset with n = 500 points. The two dimensions are independently generated as
$$X_1 \sim \tfrac{2}{3}\,\mathrm{Beta}(1,2) + \tfrac{1}{3}\,\mathrm{Beta}(10,10), \qquad X_2 \sim \mathrm{Uniform}(0,1).$$
[Figure: the true density of the relevant dimension.]
32 Combined Beta and uniform: result. The first plot is the rodeo result; the second plot is the result fitted by the built-in function kde2d (MASS package in R). [Figure: perspective plots over the relevant and irrelevant dimensions.]
33 Combined Beta and uniform: marginal distributions. Numerically integrated marginal distributions based on the perspective plots of the two estimators (not normalized). [Figure: true marginals, rodeo estimates, and kde2d estimates for the relevant and irrelevant dimensions.]
34 Two-dimensional example: geyser data. Geyser data: a version of the eruptions data from the Old Faithful geyser in Yellowstone National Park, Wyoming (Azzalini and Bowman, 1990), consisting of continuous measurements from August 1 to August 15. There are n = 299 samples and two variables: duration = the eruption time in minutes; waiting = the waiting time to the next eruption.
35 Geyser data: result. The first plot is the rodeo result; the second plot is the result fitted by the built-in function kde2d (MASS package in R). [Figure: perspective plots of the two estimates.]
36 Geyser data: contour plot. The first plot is the contour plot fitted by the built-in function kde2d (MASS package in R); the second is fitted by the rodeo algorithm. [Figure: contour plots of the two estimates.]
37 High-dimensional example. 30-dimensional example: we generate a 30-dimensional synthetic dataset with r = 5 relevant dimensions (n = 100, 30 trials). The relevant dimensions are generated as $X_i \sim N(0.5, (0.02i)^2)$ for $i = 1, \dots, 5$, while the irrelevant dimensions are generated as $X_i \sim \mathrm{Uniform}(0,1)$ for $i = 6, \dots, 30$. The test point is $x = (\tfrac{1}{2}, \dots, \tfrac{1}{2})$.
38 30-dimensional example: result. [Figure: the rodeo path for the 30-dimensional synthetic dataset and the boxplot of the selected bandwidths over 30 trials.]
39 Very high-dimensional example: image processing. The algorithm was run on 2200 grayscale images of 1s and 2s, each with 256 = 16 × 16 pixels and some unknown background noise; thus this is a 256-dimensional density estimation problem. A test point and the output bandwidth plot are shown here.
40 Image processing example: evolution plot. The output bandwidth plots are sampled at rodeo steps 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100. This visualizes the evolution of the bandwidths and can be viewed as a dynamic feature selection process: the earlier a dimension's bandwidth decays, the more informative it is.
41 An example using Gaussian as irrelevant dimensions. We apply the semiparametric rodeo algorithm to both 15-dimensional and 20-dimensional synthetic datasets with r = 5 relevant dimensions (n = 1000). When using Gaussian distributions as the irrelevant dimensions, the relevant dimensions are generated as $X_i \sim \mathrm{Uniform}(0,1)$ for $i = 1, \dots, 5$, while the irrelevant dimensions are generated as $X_i \sim N(0.5, (0.05i)^2)$ for $i = 6, \dots, d$. The test point is $x = (\tfrac{1}{2}, \dots, \tfrac{1}{2})$.
42 Using Gaussian as irrelevant: result. [Figure: the rodeo path for the 15-dimensional synthetic data (left) and for the 20-dimensional data (right).]
43 Using Gaussian as irrelevant: one-dimensional case (I). 1000 one-dimensional data points with $x_i \sim \mathrm{Uniform}(0,1)$. [Figure: the true density, the rodeo fit with its estimated bandwidths, and the unbiased cross-validation fit (N = 1000).]
44 Using Gaussian as irrelevant: one-dimensional case (II). 1000 one-dimensional data points with $x_i \sim N(0,1)$. [Figure: the Gaussian fitted by rodeo, the estimated bandwidths, and the correction factor at the test point.]
45 Summary. This work adapts the general rodeo framework to density estimation problems. The sparsity assumption is crucial to guarantee the success of the density rodeo algorithm. The density rodeo algorithm is efficient on high-dimensional problems, both theoretically and empirically. The rodeo framework can utilize currently available density estimators, and the implementation is simple. Future work: develop rodeo algorithms for other kinds of problems, e.g. classification.
More information27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling
10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel
More informationIntegrated Likelihood Estimation in Semiparametric Regression Models. Thomas A. Severini Department of Statistics Northwestern University
Integrated Likelihood Estimation in Semiparametric Regression Models Thomas A. Severini Department of Statistics Northwestern University Joint work with Heping He, University of York Introduction Let Y
More informationCOMS 4771 Introduction to Machine Learning. James McInerney Adapted from slides by Nakul Verma
COMS 4771 Introduction to Machine Learning James McInerney Adapted from slides by Nakul Verma Announcements HW1: Please submit as a group Watch out for zero variance features (Q5) HW2 will be released
More informationVariational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures
17th Europ. Conf. on Machine Learning, Berlin, Germany, 2006. Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures Shipeng Yu 1,2, Kai Yu 2, Volker Tresp 2, and Hans-Peter
More informationIntroduction to Gaussian Processes
Introduction to Gaussian Processes Iain Murray murray@cs.toronto.edu CSC255, Introduction to Machine Learning, Fall 28 Dept. Computer Science, University of Toronto The problem Learn scalar function of
More informationApproximation. Inderjit S. Dhillon Dept of Computer Science UT Austin. SAMSI Massive Datasets Opening Workshop Raleigh, North Carolina.
Using Quadratic Approximation Inderjit S. Dhillon Dept of Computer Science UT Austin SAMSI Massive Datasets Opening Workshop Raleigh, North Carolina Sept 12, 2012 Joint work with C. Hsieh, M. Sustik and
More informationA Magiv CV Theory for Large-Margin Classifiers
A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector
More informationSemi-Parametric Importance Sampling for Rare-event probability Estimation
Semi-Parametric Importance Sampling for Rare-event probability Estimation Z. I. Botev and P. L Ecuyer IMACS Seminar 2011 Borovets, Bulgaria Semi-Parametric Importance Sampling for Rare-event probability
More informationICML Scalable Bayesian Inference on Point processes. with Gaussian Processes. Yves-Laurent Kom Samo & Stephen Roberts
ICML 2015 Scalable Nonparametric Bayesian Inference on Point Processes with Gaussian Processes Machine Learning Research Group and Oxford-Man Institute University of Oxford July 8, 2015 Point Processes
More informationComputational and Statistical Aspects of Statistical Machine Learning. John Lafferty Department of Statistics Retreat Gleacher Center
Computational and Statistical Aspects of Statistical Machine Learning John Lafferty Department of Statistics Retreat Gleacher Center Outline Modern nonparametric inference for high dimensional data Nonparametric
More information41903: Introduction to Nonparametrics
41903: Notes 5 Introduction Nonparametrics fundamentally about fitting flexible models: want model that is flexible enough to accommodate important patterns but not so flexible it overspecializes to specific
More informationCOMS 4771 Regression. Nakul Verma
COMS 4771 Regression Nakul Verma Last time Support Vector Machines Maximum Margin formulation Constrained Optimization Lagrange Duality Theory Convex Optimization SVM dual and Interpretation How get the
More information11. Learning graphical models
Learning graphical models 11-1 11. Learning graphical models Maximum likelihood Parameter learning Structural learning Learning partially observed graphical models Learning graphical models 11-2 statistical
More information12 - Nonparametric Density Estimation
ST 697 Fall 2017 1/49 12 - Nonparametric Density Estimation ST 697 Fall 2017 University of Alabama Density Review ST 697 Fall 2017 2/49 Continuous Random Variables ST 697 Fall 2017 3/49 1.0 0.8 F(x) 0.6
More informationMinimum Hellinger Distance Estimation in a. Semiparametric Mixture Model
Minimum Hellinger Distance Estimation in a Semiparametric Mixture Model Sijia Xiang 1, Weixin Yao 1, and Jingjing Wu 2 1 Department of Statistics, Kansas State University, Manhattan, Kansas, USA 66506-0802.
More informationBias-Variance Tradeoff
What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff
More informationIs the test error unbiased for these programs? 2017 Kevin Jamieson
Is the test error unbiased for these programs? 2017 Kevin Jamieson 1 Is the test error unbiased for this program? 2017 Kevin Jamieson 2 Simple Variable Selection LASSO: Sparse Regression Machine Learning
More informationEcon 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines
Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the
More informationGWAS V: Gaussian processes
GWAS V: Gaussian processes Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS V: Gaussian processes Summer 2011
More informationDreem Challenge report (team Bussanati)
Wavelet course, MVA 04-05 Simon Bussy, simon.bussy@gmail.com Antoine Recanati, arecanat@ens-cachan.fr Dreem Challenge report (team Bussanati) Description and specifics of the challenge We worked on the
More information36. Multisample U-statistics and jointly distributed U-statistics Lehmann 6.1
36. Multisample U-statistics jointly distributed U-statistics Lehmann 6.1 In this topic, we generalize the idea of U-statistics in two different directions. First, we consider single U-statistics for situations
More informationECE276A: Sensing & Estimation in Robotics Lecture 10: Gaussian Mixture and Particle Filtering
ECE276A: Sensing & Estimation in Robotics Lecture 10: Gaussian Mixture and Particle Filtering Lecturer: Nikolay Atanasov: natanasov@ucsd.edu Teaching Assistants: Siwei Guo: s9guo@eng.ucsd.edu Anwesan Pal:
More informationStructured Statistical Learning with Support Vector Machine for Feature Selection and Prediction
Structured Statistical Learning with Support Vector Machine for Feature Selection and Prediction Yoonkyung Lee Department of Statistics The Ohio State University http://www.stat.ohio-state.edu/ yklee Predictive
More informationSpatially Smoothed Kernel Density Estimation via Generalized Empirical Likelihood
Spatially Smoothed Kernel Density Estimation via Generalized Empirical Likelihood Kuangyu Wen & Ximing Wu Texas A&M University Info-Metrics Institute Conference: Recent Innovations in Info-Metrics October
More informationComputer Vision Group Prof. Daniel Cremers. 6. Mixture Models and Expectation-Maximization
Prof. Daniel Cremers 6. Mixture Models and Expectation-Maximization Motivation Often the introduction of latent (unobserved) random variables into a model can help to express complex (marginal) distributions
More information1.1 Basis of Statistical Decision Theory
ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 1: Introduction Lecturer: Yihong Wu Scribe: AmirEmad Ghassami, Jan 21, 2016 [Ed. Jan 31] Outline: Introduction of
More informationOptimal Estimation of a Nonsmooth Functional
Optimal Estimation of a Nonsmooth Functional T. Tony Cai Department of Statistics The Wharton School University of Pennsylvania http://stat.wharton.upenn.edu/ tcai Joint work with Mark Low 1 Question Suppose
More informationMark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.
CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.
More informationUncertainty quantification and visualization for functional random variables
Uncertainty quantification and visualization for functional random variables MascotNum Workshop 2014 S. Nanty 1,3 C. Helbert 2 A. Marrel 1 N. Pérot 1 C. Prieur 3 1 CEA, DEN/DER/SESI/LSMR, F-13108, Saint-Paul-lez-Durance,
More informationEfficient Variational Inference in Large-Scale Bayesian Compressed Sensing
Efficient Variational Inference in Large-Scale Bayesian Compressed Sensing George Papandreou and Alan Yuille Department of Statistics University of California, Los Angeles ICCV Workshop on Information
More informationQuantitative Economics for the Evaluation of the European Policy. Dipartimento di Economia e Management
Quantitative Economics for the Evaluation of the European Policy Dipartimento di Economia e Management Irene Brunetti 1 Davide Fiaschi 2 Angela Parenti 3 9 ottobre 2015 1 ireneb@ec.unipi.it. 2 davide.fiaschi@unipi.it.
More informationIntroduction. Chapter 1
Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics
More informationIntegrated Non-Factorized Variational Inference
Integrated Non-Factorized Variational Inference Shaobo Han, Xuejun Liao and Lawrence Carin Duke University February 27, 2014 S. Han et al. Integrated Non-Factorized Variational Inference February 27, 2014
More informationBiostatistics Advanced Methods in Biostatistics IV
Biostatistics 140.754 Advanced Methods in Biostatistics IV Jeffrey Leek Assistant Professor Department of Biostatistics jleek@jhsph.edu Lecture 12 1 / 36 Tip + Paper Tip: As a statistician the results
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationFrontier estimation based on extreme risk measures
Frontier estimation based on extreme risk measures by Jonathan EL METHNI in collaboration with Ste phane GIRARD & Laurent GARDES CMStatistics 2016 University of Seville December 2016 1 Risk measures 2
More informationTutorial on Approximate Bayesian Computation
Tutorial on Approximate Bayesian Computation Michael Gutmann https://sites.google.com/site/michaelgutmann University of Helsinki Aalto University Helsinki Institute for Information Technology 16 May 2016
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric
More informationVariable Selection in High Dimensional Convex Regression
Sensing and Analysis of High-Dimensional Data UCL-Duke Workshop September 4 & 5, 2014 Variable Selection in High Dimensional Convex Regression John Lafferty Department of Statistics & Department of Computer
More informationPENALIZED LIKELIHOOD PARAMETER ESTIMATION FOR ADDITIVE HAZARD MODELS WITH INTERVAL CENSORED DATA
PENALIZED LIKELIHOOD PARAMETER ESTIMATION FOR ADDITIVE HAZARD MODELS WITH INTERVAL CENSORED DATA Kasun Rathnayake ; A/Prof Jun Ma Department of Statistics Faculty of Science and Engineering Macquarie University
More informationReproducing Kernel Hilbert Spaces
9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we
More informationVariable selection for model-based clustering
Variable selection for model-based clustering Matthieu Marbac (Ensai - Crest) Joint works with: M. Sedki (Univ. Paris-sud) and V. Vandewalle (Univ. Lille 2) The problem Objective: Estimation of a partition
More informationSpatial Process Estimates as Smoothers: A Review
Spatial Process Estimates as Smoothers: A Review Soutir Bandyopadhyay 1 Basic Model The observational model considered here has the form Y i = f(x i ) + ɛ i, for 1 i n. (1.1) where Y i is the observed
More informationRelevance Vector Machines
LUT February 21, 2011 Support Vector Machines Model / Regression Marginal Likelihood Regression Relevance vector machines Exercise Support Vector Machines The relevance vector machine (RVM) is a bayesian
More informationStatistical Inference
Statistical Inference Liu Yang Florida State University October 27, 2016 Liu Yang, Libo Wang (Florida State University) Statistical Inference October 27, 2016 1 / 27 Outline The Bayesian Lasso Trevor Park
More informationA Novel Nonparametric Density Estimator
A Novel Nonparametric Density Estimator Z. I. Botev The University of Queensland Australia Abstract We present a novel nonparametric density estimator and a new data-driven bandwidth selection method with
More informationBayesian estimation of the discrepancy with misspecified parametric models
Bayesian estimation of the discrepancy with misspecified parametric models Pierpaolo De Blasi University of Torino & Collegio Carlo Alberto Bayesian Nonparametrics workshop ICERM, 17-21 September 2012
More informationStructure Learning: the good, the bad, the ugly
Readings: K&F: 15.1, 15.2, 15.3, 15.4, 15.5 Structure Learning: the good, the bad, the ugly Graphical Models 10708 Carlos Guestrin Carnegie Mellon University September 29 th, 2006 1 Understanding the uniform
More informationSparse Additive machine
Sparse Additive machine Tuo Zhao Han Liu Department of Biostatistics and Computer Science, Johns Hopkins University Abstract We develop a high dimensional nonparametric classification method named sparse
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning
More information