Nonparametric Regression with Correlated Errors

Size: px
Start display at page:

Download "Nonparametric Regression with Correlated Errors"

Transcription

1 Nonparametric Regression with Correlated Errors Jean Opsomer Iowa State University Yuedong Wang University of California, Santa Barbara Yuhong Yang Iowa State University February 27, 2001 Abstract Nonparametric regression techniques are often sensitive to the presence of correlation in the errors The practical consequences of this sensitivity are explained, including the breakdown of several popular data-driven smoothing parameter selection methods We review the existing literature in kernel regression, smoothing splines and wavelet regression under correlation, both for short-range and long-range dependence Extensions to random design, higher dimensional models and adaptive estimation are discussed 1 Introduction Nonparametric regression is a rapidly growing and exciting branch of statistics, both because of recent theoretical developments and more widespread use of fast and inexpensive computers In nonparametric regression problems, the researcher is most often interested in estimating the mean function E(Y X) =f(x) from a set of observations (X 1,Y 1 ),,(X n,y n ), where the X i can be either univariate or multivariate Many competing methods are currently available, including kernel-based methods, regression splines, smoothing splines and wavelet and Fourier series expansions The bulk of the literature in these areas has focused on the case in which an unknown mean function is masked by a certain amount of white noise, and the goal of the regression is to remove the white noise and uncover the function More recently, a number of authors have begun to look at the situation where the noise is no longer white, and instead contains a certain amount of structure in the form of correlation The focus of this article is to look at the problem of estimating the mean function f( ) in the presence of correlation, not that of estimating the 1

2 correlation function itself In this context, our goals are: (1) to explain some of the difficulties associated with the presence of correlation in nonparametric regression, (2) to provide an overview of the nonparametric regression literature that deals with the correlated errors case, and (3) to discuss some new developments in this area Much of the literature in nonparametric regression relies on asymptotic arguments to clarify the probabilitistic behavior of the proposed methods The same approach will be used here, but we attempt to provide intuition into the results as well In this article, we will be looking at the following statistical model: Y i = f(x i )+ε i, (1) where f( ) is an unknown, smooth function, and the error vector ε =(ε 1,,ε n ) has variance-covariance matrix Var(ε) =σ 2 C Researchers in different areas of nonparametric regression have addressed different versions of this problem For instance, the X i are either assumed to be random, or fixed within some domain (to avoid confusion, we will write lower-case x i when the covariates are fixed, and upper-case X i when these are random variables) The specification of the correlation can also vary significantly: the correlation matrix C is considered completely known in some of these areas, known up to a finite number of parameters, or only assumed to be stationary but otherwise left completely unspecified in other areas Another issue concerns whether the errors are assumed to be short-range dependent, where the correlation decreases rapidly as the distance between two observations increases, or long-range dependent (short-range/long-range dependency will be defined more exactly in Section 31) When discussing the various methods proposed in the smoothing literature, we will point out the major differences in assumptions between these areas Section 2 explains the practical difficulties associated with estimating f( ) under model (1) In Section 3, we review the existing literature on this topic in several areas of nonparametric regression Section 4 describes some extensions of existing results as well as new developments This last section is more technical than the previous ones, and nonspecialists might want to skip it on first reading 2 Problems with Correlation A number of problems, some quite fundamental, occur when nonparametric regression is attempted in the presence of correlated errors Indeed, in the most general 2

3 setting where no parametric shape is assumed for the mean nor the correlation function, the model is essentially unidentifiable, so that it is theoretically impossible to estimate either function separately In most practical applications, however, the researcher has some idea of what represents a reasonable type of fit to the data at hand, and he will use that expectation to decide what is an acceptable or unacceptable function estimate For all non-parametric regression techniques, the shape and smoothness of the estimated function depends to a large extent on the specific value chosen for a smoothing parameter, defined differently for each technique In order to avoid having to select a value for the smoothing parameter by trial and error, several types of data-driven selection methods have been developed to assist researchers in this task However, the presence of correlation between the errors, if ignored, causes the commonly used automatic tuning parameter selection methods, such as cross-validation or plug-in, to break down This breakdown of automated methods, as well as a possible solution to it, are illustrated by a simple simulated example in Figure 1 For 200 equally spaced observations and a low order polynomial mean function (f(x) =1 6x+36x 2 53x 3 +22x 5 ), four progressively more correlated sets of errors were generated from the same vector of independent noise, and added to the mean function The errors are normally distributed with variance σ 2 =05 and correlation following an AR(1) process (autocorrelation of order 1), corr(ε i,ε j ) = exp( α x i x j ) Figure 1 shows four local linear regression fits for these datasets For each dataset, two bandwidth selection methods were used: cross-validation (CV) and a correlation-corrected method called CDPI, further discussed in Section 41 Table 1 summarizes the bandwidths selected for the four datasets under both methods Correlation level Autocorrelation CV CDPI Independent α = α = α = Table 1: Summary of bandwidth selection for simulated data in Figure 1 Autocorrelation refers to the correlation between adjacent observations Table 1 and Figure 1 clearly show that as the correlation increases, the bandwidth selected by cross-validation becomes smaller and smaller and the fits become progessively more undersmoothed The bandwidths selected by CDPI, a method that 3

4 3 Uncorrelated 3 alpha = alpha =200 3 alpha = Figure 1: Simulated data with four levels of AR(1) correlation, fitted with local linear regression; ( ) represents fit obtained with bandwidth selected by cross-validation, (- ) fit obtained with bandwidth selected by CPDI accounts for the presence of correlation, are much more stable and result in virtually the same fit for all four cases This type of undersmoothing behavior in the presence of correlated errors has been observed with most commonly used automated bandwidth selection methods At its most conceptual level, it is caused by the fact that the bandwidth selection method perceives all the structure in the data to be due to the mean function, and attempts to incorporate that information into its estimate of the trend When the data are uncorrelated, this perception is valid, but it breaks down in the presence of correlation Unlike in this simulated example, in practice it is very often not known what portions of the behavior of the observations should be attributed to 4

5 signal or to noise for a given dataset The choice of bandwith selection approach should therefore be dictated by an understanding of the nature of the data (a) (b) y ACF x Lag (c) (d) y ACF x Lag Figure 2: Simulation 1 (independent data): (a) spline fit using smoothing parameter value of 001; (b) autocorrelation function of residuals Simulation 2 (autoregressive): (c) spline fit using GCV smoothing parameter selection; (d) autocorrelation function of residuals The previous example showed that correlation can cause the data-driven bandwidth selection methods to break down Selecting the bandwidth visually or by trial and error can also be misleading, however Indeed, even if data are independent, a wrong choice of the smoothing parameter can induce spurious serial correlation in the residuals Conversely, a wrong choice of the smoothing parameter can lead to an estimated correlation that does not reflect the true correlation in the random error Two simple simulations using smoothing splines illustrate these facts (see Figure 2) In the first simulation, 100 observations are generated from the model 5

6 Y i = sin(2πi/100) + ε i,i =1,, 100, where ε i s are independent and identically distributed normal random variables with mean zero and standard deviation 1 The S-Plus function smoothspline is used to fit the data with smoothing parameter set at 001 (Figure 2(a)) In the second simulation, 100 observations are generated according to model (1) with mean zero and errors following a first-order autoregressive process (AR(1)) with autocorrelation 05 and standard deviation 1 The S-Plus function smoothspline is again used to fit the data with smoothing parameter selected by the generalized cross-validation (GCV) method (Figure 2(c)) The estimated autocorrelation function (ACF) for the first plot looks autoregressive (Figure 2(b)), while that for the second plot appears independent (Figure 2(d)) In both cases, this conclusion about the error structure is erroneous and the mean function is incorrectly estimated As mentioned above, the researcher will often have some idea on what represents an appropriate cut-off between the short-term behavior induced by the correlation and the long-term behavior of primary interest Establishing that cut-off for a specific dataset can be done by trying out different values for the smoothing parameter and picking the one that results in an appropriate fit In fact, the presence of correlation in the residuals can sometimes be used to assist in the visual selection of an appropriate bandwidth, when the data are independent Consider the examples from Figure 2 again In 2(a)-(b), the positive serial correlation was induced by the oversmoothed fit A smaller bandwidth will result in a better fit to the data, and remove the correlation in the residuals Figure 3(a)-(b) show the fit produced by the GCVselected bandwidth and the corresponding autocorrelation function of the residuals Because the data are truly independent here, the GCV smoothing parameter selection method works correctly In 2(c)-(d), this situation is reversed If the pattern found by GCV in Figure 2(c) is not an acceptable fit (depending on the application), a larger smoothing parameter value has to be set by hand, or using one of the smoothing parameter selection methods that account for correlation, as will be presented below Figure 3(c)-(d) display the fit and the autocorrelation function for the smoothing parameter value selected by the Extended GML method (see Section 42) with an assumed AR(1) error process The residuals from this new fit are now correlated If the researcher prefers this new fit, he should assume that the model errors were also correlated It should be noted that in such situations, standard residual diagnostic tests for nonparametric goodness-of-fit, such as those in Hart (1997), will fail If the error structure is correctly modeled, statistical inference is still possible, however For 6

7 (a) (b) y ACF x Lag (c) (d) y ACF x Lag Figure 3: Simulation 1 again (independent data): (a) Observations (circles) and spline fit (line); (b) Autocorrelation function of residuals Simulation 2 (autoregressive): (c) Observations (circles) and spline fit (line); (d) Autocorrelation function of residuals instance, Figure 3(c) shows a 95% Bayesian confidence interval (see Wang (1998b)), indicating that the slight upward trend is most likely not significant, the correct conclusion in this case 7

8 3 Results to Date 31 Kernel-based Methods In this section, we consider kernel regression estimation of the mean function for data assumed to follow model (1), with E(ε i )=0, Var(ε i )=σ 2, Corr(ε i,ε j )=ρ n (X i X j ), (2) and σ 2 unknown and ρ n ( ) an unknown, stationary correlation function The dependence of ρ on n is indicated by the subscript, because consistency properties of the estimators will depend on the behavior of the correlation function as n increases In this section, we only consider univariate fixed x i = x i in a bounded interval [a, b], and for simplicity, we let [a, b] =[0, 1] The researchers who have studied the properties of kernel-based estimators of the function f( ) have focused on the time series case, in which the design points are fixed and equally spaced x i i/n, so that, Y i = f( i n )+ε i, Corr(ε i,ε j )=ρ n ( i n j n ) We consider the simplest situation, in which the correlation function is taken to be ρ n (t/n) =ρ(t) for all n, but otherwise ρ( ) is left unspecified Note that this implies that the correlation between two fixed locations decreases as n, because a fixed value for t = i j corresponds to a decreasing distance between observations We also assume that the errors are short-range dependent The error process is said to be short-range dependent if for some c>0 and γ>1, the spectral density H(ω) = σ2 k= ρ( k )e iω of the errors satisfies 2π H(ω) cω (1 γ) as ω 0 (see, eg, Cox (1984)) In that case, ρ(j) is of order j γ (see, eg, Adenstedt (1974)) When the correlation decreases at order j γ for some dependency index 0 <γ 1, the errors are said to have a long-range dependence Long-range dependence substantially increases the difficulty in estimating the mean function, and will be discussed separately in Section 33 The function f( ) can be fitted by kernel regression or local polynomial regression Following the literature in this area, we discuss estimation by kernel regression, and for simplicity, we consider the Priestley-Chao kernel estimator (Priestley and Chao (1972)) The estimator of f( ) atapointx [0, 1] is defined as ˆf(x) =s T x;hy = 1 n ( ) xi x K Y i nh h 8 i=1

9 for some kernel function K and bandwidth h, where T denotes transpose, Y = ( (Y 1,,Y n ) T and s x;h = 1 nh K( x 1 x ),,K( xn x h h )) T The Mean Squared Error of ˆf(x) is MSE( ˆf(x); h) =E(s T x;hy f(x)) 2 =(s T x;he(y ) f(x)) 2 + σ 2 s T x;hcs x;h, (3) where C is the unknown correlation matrix of Y Before we can study the asymptotic behavior of ˆf(x), a number of assumptions on the statistical model and the components of the estimator are needed: (ASI) The kernel K is compactly supported, bounded and continuous We assume that K(u)du =1, uk(u)du = 0 and u 2 K(u)du 0, (ASII) the 2nd derivative function, f ( ), of f( ) is bounded and continuous, and (ASIII) as n, h 0 and nh The first term of the MSE in (3) represents the squared bias of ˆf(x), and does not depend on the dependency structure of the errors Under the assumptions (ASI) (ASIII), the bias can be asymptotically approximated by s T x;he(y ) f(x) =h 2 µ 2(K) f (x)+o(h 2 ), 2 with µ r (G) = u r G(u)du for any function G( ) The effect of the correlation structure on the variance part of the MSE is potentially severe, however If (RI) lim nk=1 n ρ(k) <, so that R = k=1 ρ(k) exists, 1 (RII) lim nk=1 n k ρ(k) =0, n the variance component of the MSE can be approximated asymptotically by σ 2 s T x;hcs x:h = 1 nh µ 0(K 2 )σ 2 (1+2R)+o( 1 nh ), (Altman (1990)) The assumptions (RI) and (RII), common in time series analysis, ensure that observations sufficiently far apart are essentially uncorrelated When the observations are uncorrelated, R = 0 so that this result reduces to the one usually reported for kernel regression with independent errors Note also that 1 2π σ2 (1+2R) = H(0), the spectral density at ω = 0 This fact will be useful in developing bandwidth selection methods that incorporate the effect of the correlation (see Section 41) 9

10 Let AMSE denote the asymptotic approximation to the MSE in (3), which is minimized for AMSE( ˆf(x); h) =h 4 µ 2(K) 2 4 f (x) nh µ 0(K 2 )σ 2 (1+2R), (4) ( µ0 (K 2 ) σ 2 ) 1/5 (1 + 2R) h opt =, nµ 2 (K) 2 f (x) 2 the asymptotically optimal bandwidth for estimating f(x) The effect of the correlation sum R on this optimal bandwidth is easily seen Note first that the optimal rate for the bandwidth, h opt n 1/5, is the same as that for the independent errors case The exact value of h opt depends on R, however If R>0, implying that the error correlation is positive, then the variance of ˆf(x) will be larger than in the corresponding uncorrelated case The AMSE is therefore minimized by a value for the bandwidth h that is larger than in the uncorrelated case Conversely, if R < 0, the AMSE-optimal bandwidth is smaller than in the uncorrelated case, but in practice, positive correlation is much more often encountered Positive correlation has the additional perverse effect of making automated bandwidth selection methods pick smaller bandwidths, as illustrated in Figure 1 This behavior is explained in Altman (1990) and Hart (1991) for cross-validation and briefly reviewed here As a global measure of goodness-of-fit for ˆf( ), we consider the Mean Average Squared Error (MASE) MASE(h) = 1 n n i=1 ( i E ˆf( n ) f( i )2 n ), (5) which is equal to (3) averaged over the observed locations Let ˆf ( i) denote the kernel regression estimate computed on the dataset with the ith observation removed The cross-validation criterion for choosing the bandwidth is CV(h) = 1 n ( n ˆf ( i) ( i i=1 n ) Y i) 2, (6) so that E(CV(h)) MASE(h)+σ 2 2 n Cov( n ˆf ( i) ( i i=1 n ),ε i) The latter covariance term is in addition to the correlation effect already included in the MASE It can be shown that asymptotically, Cov( ˆf (i) ( i ),ε n i) T c /nh for some constant T c For severe correlation, this additional term dominates CV(h) and leads 10

11 to a fit that nearly interpolates the data, as shown in Figure 1 This results holds not only for cross-validation but also for related measures of fit such as generalized cross-validation (GCV) and Mallows criterion (Chiu (1989)) One approach to solve this problem is to model the correlation parametrically, and several such methods were proposed independently by Chiu (1989), Altman (1990) and Hart (1991) While their specific implementations varied, each author chose to estimate the correlation function parametrically and to use this estimate to adjust the bandwidth selection criterion The estimation of the correlation function is of course complicated by the fact that the errors in (1) are unobserved Chiu (1989) attempts to bypass that problem by estimating the correlation function in the frequency domain while down-weighting the low frequency periodogram components Hart (1991) attempts to remove most of the trend by differencing, and then also estimates the correlation function in the frequency domain In contrast to the previous authors, Altman (1990) proposes performing an initial pilot regression to estimate the mean function and calculate residuals, and then fits a low-order autoregressive process to these residuals Hart (1994) describes a further refinement to this parametrically modelled correlation approach He introduces time series crossvalidation as a new goodness-of-fit criterion that can be jointly minimized over the set of parameters for the correlation function (a pth-order autoregressive process in this case) and the bandwidth parameter These approaches appear to work well in practice Even when the parametric part of the model is misspecified, they provide a significant improvement over the fits computed under the assumption of independent errors, as the simulation experiments in Altman (1990) and Hart (1991) show However, when performing nonparametric regression, it is sometimes desirable to completely avoid such parametric assumptions Several methods have been proposed that pursue completely nonparametric approaches Chu and Marron (1991) propose two new cross-validation-based criteria that estimate the MASE-optimal bandwidth without specifying the correlation function In modified cross-validation (MCV), the kernel regression values ˆf ( i) in (6) are computed by leaving out the 2l+1 observations i l, i l+1,,i+l 1,i+l surrounding the ith observation Because the correlation is assumed to be short-range, proper choice of l will greatly decrease the effect of the terms Cov( ˆf ( i) ( i ),ε n i) in the CV criterion In partitioned cross-validation (PCV), the observations are partitioned into g subgroups by taking every gth observations Within each subgroup, the observations are further apart and hence are assumed less correlated Cross-validation is performed for each subgroup, and the bandwidth estimate for all the observations is a simple function of the average of the subgroup- 11

12 optimal bandwidths The drawback of both MCV and PCV is that the values of l and g need to be selected with some care Herrmann et al (1992) also propose a fully nonparametric method for estimating the MASE-optimal bandwidth, but replace the CV-based criteria by a plug-in approach This type of bandwidth selection has been shown to have a number of theoretical and practical advantages over CV (Härdle et al (1988), (1992)) Plug-in bandwidth selection is performed by estimating the unknown quantities in the AMSE (4), replacing them by estimators (hence the name plug-in ), and minimizing the resulting estimated AMSE with respect to the bandwidth h The estimation of the bias component B(x; h) 2 is completely analogous to that in the uncorrelated case The variance component σ 2 (1 + 2R) is estimated by a summation over second order differences of lagged residuals More recently, Hall et al (1995) extended the results of Chu and Marron (1991) in a number of useful directions Their theoretical results apply to kernel regression as well as local linear regression They also explicitly consider the long-range dependence case, where assumptions (RI) and (RII) are no longer required They discuss bandwidth selection through MCV and compare it with a bootstrap-based approach which estimates the MASE in (5) directly through resampling of blocks of residuals from a pilot smooth As was the case for Chu and Marron (1991), both approaches are fully nonparametric but require the choice of other tuning parameters In Section 41, we will introduce a new type of fully nonparametric plug-in method that is applicable to both one- and two-dimensional covariates X i following either a fixed or a random design 32 Polynomial Splines In this section, we consider model (1) with fixed design points x i in X =[0, 1], and as usually done in the smoothing spline literature, we assume that f is a function with certain smoothness properties More precisely, assume f belongs to the Sobolev space W2 m = {f : f (v) absolutely continuous, v =0,,m 1, (f (m) (x)) 2 dx < } 0 (7) For a given variance-covariance matrix C, the smoothing spline estimate ˆf is the minimizer of the following penalized weighted least-square objective function { 1 1 } n (Y f)t C 1 (Y f)+λ (f (m) (x)) 2 dx, (8) min f W m

13 where f =(f(x 1 ),,f(x n )) T, and λ is the smoothing parameter controlling the trade-off between the goodness-of-fit measured by weighted least-squares and the roughness of the estimate measured by 1 0 (f (m) (x)) 2 dx Unlike in the previous section, smoothing spline research to date always assumes that C is parametrically specified The sample size n is kept fixed when studying the statistical properties of the smoothing splines estimator ˆf under correlated errors, so that only finite-sample properties have been studied Hence, the effect of short-range or long-range dependence of the errors on the asymptotic behavior of the estimator has not been explicitly considered, and remains an open research question We discuss the main finite-sample properties of ˆf under correlation in this section and in Section 42 Let φ ν (x) = x ν 1 /(ν 1)!, ν =1,,m, R(x, z) = 1 0 (x u) m 1 + (z u) m 1 + du/((m 1)!) 2, where (x) + = x for x 0 and (x) + = 0 otherwise Denote T n m = {φ ν (x i )} n m i=1ν=1 and Σ n n = {R(x i,x j )} n n i=1j=1 Let T =(Q 1 Q 2 )(R T 0 T ) T be the QR decomposition of T Kimeldorf and Wahba (1971) showed that the solution to (8) is m n ˆf(x) = d ν φ ν (x)+ c i R(x i,x), ν=1 i=1 where c =(c 1,,c n ) T and d =(d 1,,d m ) T are solutions to (Σ + nλc)c + Td = Y, T T c = 0 (9) At the design points, ˆf =(ˆf(x 1 ),, ˆf(x n )) T = AY, where A = I nλcq 2 (Q T 2 (Σ + nλc)q 2 ) 1 Q T 2 is the hat matrix This hat matrix is often used to define degrees of freedom and to construct confidence intervals (Wahba (1990)) Note that A may be asymmetric, in contrast to the independent error case So far the smoothing parameter λ has been fixed Good choices of λ are crucial to the performance of spline estimates (Wahba (1990)) Much research has been devoted to developing data-driven methods for selecting λ when observations are 13

14 independent Several methods have been proposed and among them the CV (crossvalidation), GCV (generalized cross-validation), GML (generalized maximum likelihood) and UBR (unbiased risk) methods are popular choices (Wahba (1990)) The CV and GCV methods are well known for their optimal properties (Wahba (1990)) The GML is very stable and efficient for small to moderate sample sizes When the dispersion parameter is known, eg for binary and Poisson data, the UBR method works better All these methods tend to underestimate smoothing parameters when data are correlated, for the reasons discussed in Section 2 In kernel regression, the correlation was often assumed to be short-range but otherwise unspecified, and bandwidth selection adjustments were proposed based on that assumption In the smoothing spline literature, several authors have considered specific parametric assumptions on the correlation function Diggle and Hutchinson (1989) assumed the random errors are generated by an autoregressive process In the following we discuss their method in a more general setting where we assume C is determined by a set of parameters α Denote τ = (λ, α) Define a =tra as the effective degrees of freedom taken up in estimating f Replacing the function f by its smoothing spline estimate ˆf, Diggle and Hutchinson (1989) proposed to estimate all parameters using the penalized profile likelihood (we use the negative log likelihood here) min τ,σ 2 { (Y ˆf) T C 1 (Y ˆf)/σ 2 +ln C /2+n ln σ 2 + φ(n, a) }, (10) where φ is a penalty function that is increasing in a It is easy to see that ˆσ 2 = (Y ˆf) T C 1 (Y ˆf)/n, reducing (10) to min τ { n ln(y ˆf) T C 1 (Y ˆf)+ln C /2+φ(n, a) } Two forms of penalty have been compared in Diggle and Hutchinson (1989): φ(n, a) = 2n ln(1 a/n), φ(n, a) = a ln n, that are analogs of AIC (Akaike information criterion) and BIC (Bayesian information criterion) When observations are independent, the first penalty gives a method approximating the GCV solution and the second penalty gives a new method which does not reduce to any existing methods Simulation results in Diggle and Hutchinson (1989) suggest that the second penalty function works better than the first However, Diggle and Hutchinson (1989) commented that using the second penalty gives 14

15 results which are significantly inferior to those obtained by GCV when C is known (including independent data as the special case C = I) More research is necessary to find properties of this method Diggle and Hutchinson (1989) have developed an efficient O(n) smoothing parameter selection algorithm in the special case of AR(1) error structures For independent observations, Wahba (1978) showed that a polynomial spline of degree 2m 1 can be obtained by signal extraction Denote W (x) as a zero-mean Wiener process Suppose that f is generated by the stochastic differential equation with initial conditions d m f(x)/dx m =(nλ) 1/2 σdw(x)/dx (11) z 0 =(f(0),f (1) (0),,f (m 1) (0)) T N(0,aI m ) (12) Let ˆf(x; a) = E(f(x) Y,a) represent the signal extraction estimate of f Then lim a ˆf(x; a) equals the smoothing spline estimate Kohn, Ansley and Wong et al (1992) used this signal extraction approach to derive a method for spline smoothing with autoregressive moving average errors Assuming that observations are equally spaced (x i = i/n), Kohn et al (1992) considered model (1) with signal f generated by the stochastic model (11) and (12) and the errors ε i generated by a discrete time stationary ARMA(p,q) model ε i = φ 1 ε i φ p ε i p + e i + ψ 1 e i ψ q e i q, (13) iid where e i N(0,σ 2 ) and are independent of the Wiener process W (x) in (11) Denote z i =(f(x i ),f (1) (x i ),,f (m 1) (x i )) T The stochastic model (11) and (12) can be written in a state space form as z i = F i z i 1 + u i, i =1,,n, where F i is an upper triangular m m matrix having ones on the diagonal and (j, k)th element (x i x i 1 ) k j /(k j)! for k>j The perturbation u i N(0,σ 2 U i /nλ), where U i is a m m matrix with (j, k)th element being (x i x i 1 ) 2m i k+1 /[(2m i j + 1)!(m k)!(m j)!] For the ARMA(p,q) model (13), let m = max(p, q + 1) and φ 1 G = φ m 1 I m 1 φ m 0 T 15

16 Consider the following state space model w i = Gw i 1 + v i, i =1,,n, (14) where w i is a m vector, and v i =(e i,ψ 1 e i,,ψ m 1e i ) T Substituting repeatedly from the bottom row of the system, it is easy to see that the first element in w i is identical to the ARMA model defined in (13) Therefore the ARMA(p,q) model can be represented in a state space form (14) Combining the two state space representations for the signal and the random error, the original model (1) can be represented by the following state space model Y i = h T x i, x i = H i x i 1 + a i, where x i = ( zi w i ) ( ui, a i = v i ), H i = ( F i 0 0 G ) h is a m+m vector with 1 in the first and the (m+1)st positions and zeros elsewhere Due to this state space representation, filtering and smoothing algorithms can be used to calculate the estimate of the function Kohn et al (1992) also derived algorithms to calculate GML and GCV estimates of all parameters τ =(λ, φ 1,,φ p,ψ 1,,ψ q ) and σ 2 33 Long-range Dependence In Section 2, we have seen that even under an AR(1) correlation structure, a simple type of short-range dependence, familiar nonparametric procedures intended for uncorrelated errors behave rather poorly Several methods have been proposed in Sections 31 and 32 to handle the dependences When the correlation ρ(t) = Corr(ε i,ε i+t ) decreases more slowly in t, regression estimation becomes even harder In this section, we review the theoretical results on the effects of long-range dependent stationary Gaussian errors Estimation under long-range dependence has attracted more and more attention in recent years In many scientific research fields, such as astronomy, physics, geoscience, hydrology and signal processing, the observational errors sometimes reveal long-range dependence Künsch et al (1993) wrote: Perhaps most unbelievable to many is the observation that high-quality measurement series from astronomy, 16

17 physics, chemistry, generally regarded as prototypes of iid observations, are not independent but long-range correlated Minimax risks have been widely considered for evaluating performance of an estimator of a function assumed to be in a target class Let F be a nonparametric (infinite-dimensional) class of functions on [0, 1] d, where d is the number of predictors, and as before, let C denote the covariance matrix of the errors Let u v = ( (u v) 2 dx) 1/2 be the L 2 distance between two functions u and v The minimax risk for estimating the regression function f under the squared L 2 loss is R(F; C; n) = min max E f ˆf 2, ˆf f F where ˆf denotes any estimator based on (X i,y i ) n i=1 and the expectation is taken with respect to the true regression function f The minimax risk measures how well one can estimate f uniformly over the function class F Due to the difficulty in evaluating R(F; C; n) in general, its convergence rate to zero as a function of the sample size n is often considered An estimator with risk converging at the minimax risk rate uniformly over F is said to be minimax-rate optimal For short-range dependence, it has been shown that the minimax risk rate remains unchanged compared to the case with independent errors (see, eg, Bierens (1983), Collomb and Härdle (1986), Johnstone and Silverman (1997), Wang (1996), Yang (1997)) However, fundamental differences show up when the errors become long-range dependent For that case, a number of results have been obtained for parametric estimation of f (ie, f is assumed to have a known parametric form) and also for the estimation of dependence parameters For a review of the results, see Beran (1992) (1994) We focus here on nonparametric estimation of f For long-range dependent errors, results on minimax rates of convergence are obtained for univariate regression with equally spaced (fixed) design in Hall and Hart (1990), Wang (1996), and Johnstone and Silverman (1997) The model being considered is again model (1), with x i = i/n, Corr(ε i,ε j ) c i j γ for some 0 <γ<1, and f is assumed to be smooth with the α-th derivative bounded for some α 1 (the results of Wang (1996) and Johnstone and Silverman (1997) are in more general forms) The minimax rate of convergence under squared L 2 loss for estimating f is shown to be of order n 2αγ/(2α+γ) (15) In contrast, the rate of convergence under independent or short-range dependent 17

18 errors is n 2α/(2α+1) This shows the damaging effect of long-range dependence on the convergence rate Suppose now that X i,i 1 are random variables, specified to be iid independent of the ε i s We consider the case where the dependence between the errors depends only on the orders of observations (the correlation has nothing to do with the values of the X i s) Given F, the distribution of X i, 1 i n, and a general dependence among the errors (not necessarily stationary), Yang (1997) shows that the minimax risk rate for estimating f is determined by the maximum of two rates: the rate of convergence for the class F under iid errors and the rate of convergence for the estimation of the mean value of the regression function, µ = Ef(X), under the dependence model The first rate is determined by the largeness of the target class F and the second rate is determined by severity of the dependence among the errors As a consequence, the minimax rate may well remain unchanged if the dependence is not severe enough relative to the largeness of the target function class A similar result was obtained independently by Efromovich (1997) in a univariate regression setting It is also shown in Yang (1997) that dependence among the errors, as long as its form is known, generally does not hurt prediction of the next response When Yang s result is applied to the class of regression functions with the α-th derivative bounded, one has the minimax rate of convergence n min(2α/(2α+1),γ) (16) For a given long-range dependence index γ, the rate of convergence gets damaged only when α is relatively large, ie, α>γ/(2(1 γ)) Note that the rate of convergence in (16) is always faster compared to the rate given in (15) 34 Wavelet estimation In this subsection, we review wavelet methods for regression on [0,1] under dependence focusing on the long-range dependence case Orthogonal series expansion is a commonly used method for function estimation Compared with trigonometrics or Legendre polynomials, orthogonal wavelet bases have been shown to have desirable local properies that lead to optimal performance in statistical estimation for a rich collection of function classes In the wavelet expansion of a function, the coefficients are organized in different levels called multi-resolutions The coefficients are estimated based on orthogonal wavelet transformation Then one needs to decide which estimated coefficients are above the noise level and thus need 18

19 to be kept in the wavelet expansion This is usually done using a thresholding rule based on statistical considerations Nason (1996) reported that methods intended for uncorrelated errors do not work well for correlated data Johnstone and Silverman (1997) point out that for independent errors, one can use the same threshold for all the coefficients while for dependent errors, the variances of the empirical wavelet coefficients depend on the level but are the same within each level Accordingly, level-dependent thresholdings are proposed Their procedure is briefly described as follows Consider the regression model (1) with n =2 J for some integer J Let W be a periodic discrete wavelet transform operator (for examples of wavelet bases and fast O(n) algorithms, see, eg, Donoho and Johnstone (1998)) Let w j,k =(WY ) j,k, j =0,, J 1, k=1,, 2 j be the wavelet transform of the data Y =(Y 1,, Y n ) T Let Z = Wε be the wavelet transform of the errors Let λ j be the threshold to be applied to the estimated coefficients at level j Then define ˆθ =(ˆθ j,k ),j=0,, J 1, k =1,, 2 j by ˆθ j,k = η(w j,k, ˆσ j λ j ), where η is a threshold function and ˆσ j is an estimate of the standard deviation of w j,k The final estimator of the regression function is ˆf = W T ˆθ, where W T is the inverse transform of W Earlier work of Donoho and Johnstone (see, eg, (1998)) suggest soft (S) or hard (H) thresholding as follows: η S (w j,k, ˆσ j λ j ) = sgn(w j,k )( w j,k ˆσ j λ j ) + η H (w j,k, ˆσ j λ j )=w j,k I { wj,k ˆσ j λ j } A suggested choice of ˆσ j is ˆσ j 2 = MAD{w j,k,k =1,, 2 j }/06745, where MAD means the median absolute deviation and the constant is derived to work for the Gaussian errors To choose the λ j s, Johnstone and Silverman (1997) suggest a method based on Stein unbiased risk estimation (SURE) as described below For a given m-dimensional vector v =(v 1,, v m )(m will be taken to be 2 j at level j) and ˆσ, define a function m {( Û(t) =ˆσm + v 2 k t 2) } 2ˆσ 2 I { vk t} k=1 19

20 Then define ˆt(v) = arg 0 t ˆσ 2 log m min{û(t)} By comparing s 2 m = m 1 m k=1 v 2 k 1 with a threshold β m, let t(v) ={ 2 log m s 2 m β m ˆt(v) s 2 m >β m Now choose the level-dependent threshold by { 0 j L λ j = t(w j /ˆσ j ) L +1 j J 1, where L is an integer specified by an user as the primary resolution level, below which signal clearly dominates over noise From Johnstone and Silverman (1997), the regression estimator ˆf produced by the above procedure converges at the minimax rate of convergence simultaneously over a rich collection of classes of smooth functions (Besov) without the need to know the long-range dependence index nor how smooth the regression function is The adaptation is therefore over both the regression function classes and over the dependence parameter as well (see next subsection for a review of adaptive estimation for nonparametric regression) 35 Adaptive Estimation Many nonparametric procedures have tuning parameters, eg, the bandwidth h for local polynomial and the smoothing parameter λ for the smoothing spline, as considered earlier The optimal choices of such tuning parameters in general depend on certain characteristics of the unknown regression function and therefore are unknown to us Various methods (eg, AIC, BIC, cross-validation and other related model selection criteria) have been proposed to give a choice of a tuning parameter automatically based on the data so that the final estimator performs as well as (or nearly as well as) the estimator based on the optimal choice This is the task of adaptive estimation In this subsection, we briefly review the ideas of adaptive estimation in the context of nonparametric regression This will provide a background for some of the results in next section Basically there are two types of results on adaptive estimation, namely adaptive estimation with respect to a collection of estimation procedures (procedure-wise adaptation), and adaptive estimation with respect to a collection of target classes (target-oriented adaptation) For the first case, one is given a collection of procedures and the goal of adaptation is to have a final procedure performing close to the 20

21 best one in the collection For example, the collection might be a kernel procedure with all valid choices of the bandwidth h Another collection may be a list of wavelet estimators based on different choices of wavelet bases In general, one may have completely different estimation procedures in the collection for greater flexibility This is desirable in applications where it is rather unclear beforehand which procedures are appropriate For instance, for the case of high-dimensional regression estimation, one faces the so-called curse of dimensionality in the sense that the traditional function estimation methods (such as histogram and series expansion) would have exponentially many parameters in d to be estimated This cannot be done accurately based on a moderate sample size For this situation, a solution is to try different parsimonious models Because one does not know which parsimonious characterization works best for the underlying unknown regression function, adaptation capability is desired For the second type of adaptation, one is given a collection of target function classes, ie, the true unknown function is assumed to be in one of the classes (without knowing which one it is) A goal of adaptation is to have an estimator with the capability to perform optimally simultaneously for the target classes, ie, the estimator automatically converges optimally at the minimax rate of the class that contains the true function The two types of adaptations are closely related If each procedure in a collection is designed optimally for a specific function class in a collection, then the procedurewise adaptation implies the target-oriented adaptation In this sense the procedurewise adaptation is more general than the target-oriented adaptation In applications, one may encounter a mixture of both types of adaptation at the same time On one hand, you have several plausible procedures that you wish to try, and on the other hand, you may have a few plausible characteristics (eg, monotonicity, additivity) of the regression function you want to explore For this situation, you may derive optimal (or at least reasonably good) estimators for each characteristic respectively and then add them to the original collection of procedures Then adaptation with respect to the collection of procedures is the desired property A large number of results have been obtained on adaptive estimation with independent errors (see Yang (2000) for references) More recently, for nonparametric regression with iid Gaussian errors, Yang (2000) shows that under very mild conditions, for any collection of uniformly bounded function classes, minimax-rate adaptive estimators exist More generally, for any given collection of regression procedures, a single adaptive procedure can be constructed by combining them The 21

22 new procedure pays only a small price for adaptation, ie, the risk of the combined procedure is bounded above by the risk of each procedure plus a small penalty term, that is asymptotically negligible for nonparametric estimation For nonparametric regression with dependence, adaptation with respect to both the unknown characteristics of the regression function and with respect to the unknown dependence is of interest A success in this direction is the wavelet estimator based on Stein unbiased risk estimation proposed by Johnstone and Silverman (1997), as discussed in Section 34 4 New Developments In the remainder of the article, we describe several new development areas related to smoothing in the presence of correlation As mentioned at the beginning of the article, this discussion will be at a somewhat higher technical level than the previous material 41 Kernel Regression Extensions Opsomer (1997) introduces recent research that extends existing methodological results for kernel-based regression estimators under short-range dependence in several directions The approach is fully nonparametric, uses local linear regression and implements a plug-in bandwidth estimator The range of applications is extended to include random design, univariate and bivariate observations, and additive models We will review some of the main findings here Full details and proofs are available in Opsomer (1995) The method discussed below was used in Figure 1 to correct for the presence of correlation in the simulated example We again assume that the data are generated by model (1), where X i are random and can be either scalars or bivariate vectors Hence, the (X 1,Y 1 ),,(X n,y n ) are a set of random vectors in IR d+1, with d = 1 or 2 The model errors ε i are assumed to have moment properties (2) In this general setting, we refer to model (1) as the general (G) model We also consider two special cases: a model with univariate X i as in Section 31, referred to as simple model (S1) : Y i = f(x i )+ε i, and a bivariate model in which f( ) is assumed to be additive, ie additive model (A2) : Y i = µ + f 1 (X 1i )+f 2 (X 2i )+ε i 22

23 We define the expected correlation function c n (x) =ne(ρ n (X i x)), (17) and let X =[X 1,,X n ] T represent the n d matrix of covariates Also, when d = 2, let X [k] =(X k1,,x kn ) T represent the kth column of X, k =1, 2 Let X = [0, 1] d represent the support of X i and g its density function, with g k the marginal density corresponding to X ki for k =1, 2 As in Section 31, let K represent a univariate kernel function In order to simplify notation for model G, we restrict our attention to (tensor) product kernels K(u 1 ) K(u d ), with corresponding bandwidth matrix H = diag{h 1,,h d } For model A2, we need some additional notation Let T 12 represent the n n matrix whose ijth element is [T 12] ij = g(x 1i,X 2j ) g 1 (X 1i )g 2 (X 2j ) 1 n, and let t T i, v j represent the ith row and jth column of (I T 12) 1, respectively Let f 1 = d 2 f 1 (X 11 ) dx 2 1 d 2 f 1 (X 1n ) dx 2 1, E(f 1 (X 1i ) X [2] )= E(f 1 (X 1i ) X 21 ) E(f 1 (X 1i ) X 2n ) and analogously for f 2 and E(f 2 (X 2i ) X [1] ) The local linear estimator of f( ) atapointx for model G (and, with the obvious changes, model S1), is defined as ˆf(x) =s T xy, with the vector s T x defined as, s T x = e T 1 (X T xw xxx) 1 X T xw x, (18) with e T 1 a row vector with 1 in its 1st position and 0 s elsewhere, the weight matrix W x = 1 diag{k ( H 1 (X 1 x) ),,K ( H 1 (X n x) ) } and H 1 (X 1 x) T Xx = 1 (X n x) T For model A2, the estimator of f( ) at any location x can also be written as a linear smoother, but the expression is much more complicated and rarely directly used to compute the estimators We assume here that the conditions guaranteeing convergence of the backfitting algorithm, further described in Opsomer and Ruppert 23

24 (1997), are met These conditions do not depend on the correlation structure of the errors Assumption (ASI) from Section 31 is maintained, but the remaining assumptions on the statistical model and the estimator are replaced by: (ASII ) the density g is compactly supported, bounded and continuous, and g(x) > 0 for all x X, (ASIII ) the 2nd derivative(s) of f( ) are bounded and continuous, and (ASIV ) as n, H 0 and n H In addition, we assume that the correlation function ρ n is an element of a sequence {ρ n } with the following properties: (RI ) ρ n is differentiable, n ρ n (t x) dt = O(1), nρ n (t x) 2 dt = o(1) for all x, (RII ) ξ>0: ρ n (t) I ( H 1 t >ξ) dt = o( ρ n (t) dt) The properties require the effect of ρ n to be short-range (relative to the bandwidth), but allow its functional form to be otherwise unspecified These properties are generalizations of assumptions (RI) and (RII) in Section 31 to the random, multivariate design As noted in Section 31, the conditional bias of ˆf(X i ) is not affected by the presence of correlation in the errors We therefore refer to Ruppert and Wand (1994) for the asymptotic bias of local polynomial estimators for models G and S1, and to Opsomer and Ruppert (1997) for the estimator of additive model A2 We construct asymptotic approximations to the conditional variance of ˆf(X i ) and to the conditional Mean Average Squared Error (MASE) of ˆf, defined in (5) for the fixed design case Theorem 41 The conditional variance of ˆf(X i ) for models G and S1 is Var( ˆf(X i ) X) =σ 2 1 µ 0 (K 2 ) d n H g(x i ) (1 + c 1 n(x i )) + o p ( n H ) For model A2, Var( ˆf(X i ) X) = ( σ 2 µ 0 (K 2 g1 (X 1i ) 1 (1 + E(c n (X i X 1i ))) ) + g 2(X 2i ) 1 ) (1 + E(c n (X i X 2i ))) nh 1 nh 2 +o p ( ) nh 1 nh 2 24

25 We let 1/2 R n = n ρ n (t) dt 1/2 and define IC n = σ 2 (1 + R n ), the Integrated Covariance Function We also define the 2nd derivative regression functionals θ 22 (k, l) =E with k, l =1,,d for models G and S1, and θ 22 (1, 1) = 1 n θ 22 (2, 2) = 1 n θ 22 (1, 2) = 1 n n i=1 n i=1 n i=1 ( ) f(xi ) f(x i ) X 1 X 2 ( t T i f 1 v T i E(f 1 (X 1i ) X [2] ) ) 2 ( v T i f 2 t T i E(f 2 (X 2i ) X [1] ) ) 2 ( t T i f 1 v T i E(f 1 (X 1i ) X [2] ) )( v T i f 2 t T i E(m 2(X 2i ) X [1] ) ) for model A2 Because of assumptions (ASI) and (ASII ) (ASIV ), these quantities are well-defined and bounded, as long as the additive model has a unique solution Theorem 42 The conditional Mean Average Squared Error of ˆf for model G (S1) is MASE(H X) = ( ) 2 µ2 (K) d d h 2 2 kh 2 l θ 22 (k, l) k=1 l=1 + 1 d n H µ 0(K 2 ) d IC n + o p ( h 4 k + 1 k=1 n H ) For model A2, ( ) 2 µ2 (K) d d MASE(H X) = h 2 2 kh 2 l θ 22 (k, l)+ k=1 l=1 ( R(K)σ 2 1+E(g1 (X 1i ) 1 c n (X i )) + 1+E(g 2(X 2i ) 1 ) c n (X i )) nh 1 nh 1 +o p (h h ) nh 1 nh 2 25

Model Selection and Geometry

Model Selection and Geometry Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model

More information

If we want to analyze experimental or simulated data we might encounter the following tasks:

If we want to analyze experimental or simulated data we might encounter the following tasks: Chapter 1 Introduction If we want to analyze experimental or simulated data we might encounter the following tasks: Characterization of the source of the signal and diagnosis Studying dependencies Prediction

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 6: Model complexity scores (v3) Ramesh Johari ramesh.johari@stanford.edu Fall 2015 1 / 34 Estimating prediction error 2 / 34 Estimating prediction error We saw how we can estimate

More information

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures FE661 - Statistical Methods for Financial Engineering 9. Model Selection Jitkomut Songsiri statistical models overview of model selection information criteria goodness-of-fit measures 9-1 Statistical models

More information

Spatial Process Estimates as Smoothers: A Review

Spatial Process Estimates as Smoothers: A Review Spatial Process Estimates as Smoothers: A Review Soutir Bandyopadhyay 1 Basic Model The observational model considered here has the form Y i = f(x i ) + ɛ i, for 1 i n. (1.1) where Y i is the observed

More information

On prediction and density estimation Peter McCullagh University of Chicago December 2004

On prediction and density estimation Peter McCullagh University of Chicago December 2004 On prediction and density estimation Peter McCullagh University of Chicago December 2004 Summary Having observed the initial segment of a random sequence, subsequent values may be predicted by calculating

More information

Least Squares Model Averaging. Bruce E. Hansen University of Wisconsin. January 2006 Revised: August 2006

Least Squares Model Averaging. Bruce E. Hansen University of Wisconsin. January 2006 Revised: August 2006 Least Squares Model Averaging Bruce E. Hansen University of Wisconsin January 2006 Revised: August 2006 Introduction This paper developes a model averaging estimator for linear regression. Model averaging

More information

Econ 423 Lecture Notes: Additional Topics in Time Series 1

Econ 423 Lecture Notes: Additional Topics in Time Series 1 Econ 423 Lecture Notes: Additional Topics in Time Series 1 John C. Chao April 25, 2017 1 These notes are based in large part on Chapter 16 of Stock and Watson (2011). They are for instructional purposes

More information

Time Series. Anthony Davison. c

Time Series. Anthony Davison. c Series Anthony Davison c 2008 http://stat.epfl.ch Periodogram 76 Motivation............................................................ 77 Lutenizing hormone data..................................................

More information

Time Series and Forecasting Lecture 4 NonLinear Time Series
